About a month ago Version 2.0 of the Schema.org vocabulary hit the streets.
This update includes loads of tweaks, additions and fixes that can be found in the release information. The automotive folks have got new vocabulary for describing Cars, including useful properties such as numberOfAirbags, fuelEfficiency, and knownVehicleDamages. The new property mainEntityOfPage (and its inverse, mainEntity) provides the ability to tell the search engine crawlers which thing a web page is really about. With the new type ScreeningEvent to support movie/video screenings, a gtin12 property for Product, and more besides, there is much useful stuff in there.
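For illustration, here is roughly how mainEntityOfPage might appear in JSON-LD markup on a page about a single book (the title and URL here are invented for the example):

```json
{
  "@context": "http://schema.org",
  "@type": "Book",
  "name": "An Example Title",
  "mainEntityOfPage": "http://example.org/books/an-example-title"
}
```

The page itself could make the inverse assertion with mainEntity, pointing back at the Book, so a crawler knows the page is primarily about that one thing.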
But does this warrant the version number clicking over from 1.xx to 2.0?
These new types and properties are only the tip of the 2.0 iceberg. There is a heck of a lot of other stuff going on in this release apart from these additions. Some of it is in the vocabulary itself; some of it is in the potential, documentation, supporting software, and organisational processes around it.
Sticking with the vocabulary for the moment, there has been a bit of cleanup around property names. As the vocabulary has grown organically since its release in 2011, inconsistencies and conflicts between different proposals have crept in. So part of the 2.0 effort has included some rationalisation. For instance the Code type is being superseded by SoftwareSourceCode – the term code has many different meanings, many of which have nothing to do with software; surface has been superseded by artworkSurface, and area is being superseded by serviceArea, for similar reasons. Check out the release information for full details. If you are using any of the superseded terms there is no need to panic, as the original terms are still valid, but with updated descriptions to indicate that they have been superseded. However, you are encouraged to move towards the updated terminology as convenient.

The question of what is in which version brings me to an enhancement to the supporting documentation. Starting with Version 2.0, a snapshot view of the full vocabulary will be published for each release – here is http://schema.org/version/2.0. So if you want to refer to a term at a particular version, you now can.
How often is Schema.org being used? – a question often asked. A new feature has been introduced to give you some indication. Check out the description of one of the newly introduced properties, mainEntityOfPage, and you will see the following: ‘Usage: Fewer than 10 domains’. Unsurprisingly for a newly introduced property, there is virtually no usage of it yet. If you look at the description for the type this term is used with, CreativeWork, you will see ‘Usage: Between 250,000 and 500,000 domains’. Not a direct answer to the question, but a good and useful indication of the popularity of a particular term across the web.
More significant still is the introduction of functionality, on the Schema.org site, to host extensions to the core vocabulary. The motivation for this new approach to extending is explained thus:
Schema.org provides a core, basic vocabulary for describing the kind of entities the most common web applications need. There is often a need for more specialized and/or deeper vocabularies, that build upon the core. The extension mechanisms facilitate the creation of such additional vocabularies.
With most extensions, we expect that some small frequently used set of terms will be in core schema.org, with a long tail of more specialized terms in the extension.
As yet there are no extensions published. However, there are some on the way.
As Chair of the Schema Bib Extend W3C Community Group, I have been closely involved with a proposal by the group for an initial bibliographic extension (bib.schema.org) to Schema.org. The proposal includes new Types for Chapter, Collection, Agent, Atlas, Newspaper & Thesis, CreativeWork properties to describe the relationship between translations, plus types & properties to describe comics. I am also following the proposal’s progress through the system – a bit of a learning exercise for everyone. Hopefully I can share the news in the not too distant future that bib will be one of the first released extensions.
W3C Community Group for Schema.org
A subtle change has also taken place in the way the vocabulary, its proposals, extensions and direction can be followed and contributed to. The creation of the Schema.org Community Group has now provided an open forum for this.
So is 2.0 a bit of a milestone? Yes, taking all things together, I believe it is. I get the feeling that Schema.org is maturing into the kind of vocabulary, supported by a professional community, that will give confidence to those using it, and to those recommending that others should.
I am pleased to share with you a small but significant step on the Linked Data journey for WorldCat and the exposure of data from OCLC.
Content-negotiation has been implemented for the publication of Linked Data for WorldCat resources.
For those immersed in the publication and consumption of Linked Data, there is little more to say. However I suspect there are a significant number of folks reading this who are wondering what the heck I am going on about. It is a little bit techie but I will try to keep it as simple as possible.
Back last year, a linked data representation of each (of the 290+ million) WorldCat resources was embedded in its web page on the WorldCat site. For full details check out that announcement, but in summary:
All resource pages include Linked Data
Human visible under a Linked Data tab at the bottom of the page
That same data is now available in several machine-readable RDF serialisations. RDF is RDF, but depending on your use it is easier to consume as RDFa, or XML, or JSON, or Turtle, or as triples.
In many Linked Data presentations, including some of mine, you will hear a line like “As I click on the link in a web browser we see an HTML representation. However, if I was a machine I would be getting XML or another format back.” Content-negotiation is the mechanism in the http protocol that makes that happen.
Let me take you through some simple steps to make this visible for those that are interested.
Starting with a resource in WorldCat: http://www.worldcat.org/oclc/41266045. Clicking that link will take you to the page for Harry Potter and the prisoner of Azkaban. As we did not indicate otherwise, the content-negotiation defaulted to returning the html web page.
To specify that we want RDF/XML we would use http://www.worldcat.org/oclc/41266045.rdf (depending on your browser this may not display anything, but it should allow you to download the result to view in your favourite editor).
This allows you to manually specify the serialisation format you require. You can also do it from within a program by telling the http protocol, via the Accept header, the format that you would accept from accessing the URI. This means that you do not have to write code to add the relevant suffix to each URI that you access. You can replicate the effect by using curl, a command line http client tool.
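To make that concrete (a sketch of my own, not an official OCLC recipe): the curl invocation is along the lines of `curl -L -H "Accept: application/rdf+xml" http://www.worldcat.org/oclc/41266045`, and the same request can be built in Python from the standard library alone:

```python
from urllib.request import Request, urlopen

# Build the resource URI by prefixing an OCLC number
oclc_number = "41266045"
uri = "http://www.worldcat.org/oclc/" + oclc_number

# The Accept header asks the server for RDF/XML instead of the
# default HTML page - this is content-negotiation at work
req = Request(uri, headers={"Accept": "application/rdf+xml"})

print(req.full_url)              # the URI we will dereference
print(req.get_header("Accept"))  # the serialisation we asked for

# To actually fetch the data (needs a network connection):
# rdf_xml = urlopen(req).read().decode("utf-8")
```

Swapping the Accept value for one of the other serialisations mentioned above (Turtle, JSON, triples) should, in principle, return the data in that form instead.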
If you embed links to WorldCat resources in your linked data, the standard tools used to navigate around your data should now be able to automatically follow those links into and around WorldCat data. If you have the URI for a WorldCat resource, which you can create by prefixing an oclc number with ‘http://www.worldcat.org/oclc/’, you can use it in a program, browser plug-in, smartphone/facebook app to pull data back, in a format that you prefer, to work with or display.
Go have a play, I would love to hear how people use this.
As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series. That one – Entification, the following one – Hubs of Authority, and this one together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Beacons of Availability
As I indicated in the first of this series, there are descriptions of a broader collection of entities, than just books, articles and other creative works, locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.
As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority, to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography, and WorldCat) is creating Library Linked Data openly available on the Web.
Why do we catalogue? is a question I often ask, with an obvious answer: so that people can find our stuff. How does this entification, sharing of authorities, and creation of a web of library linked data help us in that goal? In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. That sounds like a great benefit and mission statement for the libraries of the world, but unfortunately it is not one that will, on its own, nudge the needle on making library resources more discoverable for the vast majority of those that could benefit from them.
I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google. A similar weight of opinion can be found complaining about how bad Google, and the other search engines, are at representing library resources. You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources. Yet I am willing to bet that here again we have another 80-20 equation, or worse, for how few of the users that libraries want to reach even know those specialist Google services exist. A bit of a sorry state of affairs, when the major source of searching for our target audience is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!
Library linked data can help solve both the problem of better description and findability of library resources in the major search engines, and the problem of identifying where a user can gain access to a resource – to loan, download, view via a suitable licence, or purchase, etc.
Before a search engine can lead a user to a suitable resource, it needs to identify that the resource exists, in any form, and hold a description for display in search results that will sufficiently inform the user of that fact. Library search interfaces are inherently poor sources of such information, with web crawlers having to infer, from often difficult to differentiate text, what the page might be about. This is not a problem isolated to library interfaces. In response, the major search engines have cooperated to introduce a generic vocabulary for embedding structured information into web pages, so that they can be informed in detail what the page references. This vocabulary is Schema.org – I have previously posted about its success and significance.
With a few enhancements in the way it can describe bibliographic resources (currently being discussed by the Schema Bib Extend W3C Community Group), Schema.org is an ideal way for libraries to publish information about our resources and associated entities in a format the search engines can consume and understand. Using URIs for authorities in that data – identifying the author in question by his/her VIAF identifier, for instance – gives the engines the ability to identify resources from many libraries associated with the same person. With this greatly enriched, more structured view of library resources, linked to authoritative hubs, the likes of Google will over time stand a far better chance of presenting potential library users with useful, informative results. I am pleased to say that OCLC has been at the forefront of demonstrating this approach by publishing Schema.org modelled linked data in the default WorldCat.org interface.
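A sketch of what such markup looks like, in JSON-LD (the VIAF number below is a placeholder, not the author’s real identifier):

```json
{
  "@context": "http://schema.org",
  "@type": "Book",
  "name": "Harry Potter and the Prisoner of Azkaban",
  "author": {
    "@type": "Person",
    "@id": "http://viaf.org/viaf/00000000",
    "name": "J. K. Rowling"
  }
}
```

Because the author node carries a shared URI rather than just a text string, a crawler can connect this record with any other library’s description of works by the same person.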
For this approach to be most effective, many of the major libraries, consortia, etc. will need to publish metadata as linked data, in a form that the search engines can consume whilst (following linked data principles) linking to each other when they identify that they are describing the same resource. Many instances of [in data terms] the same thing being published on the web will naturally raise its visibility in results listings.
An individual site (even a WorldCat) has difficulty in being identified above the noise of retail and other sites. We are aware of the PageRank-style algorithms used by the search engines to identify and boost the reputation of individual sites and pages by the numbers of links between them. If not an identical process, it is clear that similar rules will apply for structured data linking. If twenty sites publish their own linked data about the same thing, the search engines will take note of each of them. If each of those sites asserts that their resource is the same resource as those of a few of their partner sites (building a web of connections between instances of the same thing), I expect that the engines will take exponentially more notice.
Page ranking does not depend on all pages having to link to all others. Like many things on the web, hubs of authority and aggregation will naturally emerge with major libraries, local, national, and global consortia doing most of the inter-linking, providing interdependent hubs of reputation for others to connect with.
Having identified a resource that may satisfy a potential library user’s need, the next even more difficult problem is to direct that user to somewhere that they can gain access to it – loan, download, view via an appropriate licence, or purchase, etc.
WorldCat.org, and other hubs, with linked data enhanced to provide holdings information, may well provide target links via which a user can gain access to, in addition to just a description of, a resource. However, those few sites, no matter how big or well recognised they are, are just a few sites shouting in the wilderness of the ever increasing web. Any librarian in any individual library can quite rightly ask how to help Google, and the others, point users at the most appropriate copy in his/her library.
We have all experienced the scenario of searching for a car rental company and receiving a link to one within walking distance as the first result – or finding the on-campus branch at the top of a list of results in response to a search for banks. We know the search engines are good at location-based searching, whether geographical or by interest, so why can they not do it for library resources? To achieve this a library needs to become an integral part of a Web of Library Data: publishing structured linked data about the resources they have available for the search engines to find; linking their resources, in that data, to the reputable hubs of bibliographic data that will emerge, so the engines know it is another reference to the same thing; and going beyond basic bibliographic description to encompass the structured data used by the commercial world to identify availability.
So who is going to do all this then – will every library need to employ a linked data expert? I certainly hope not.
One would expect the leaders in this field – national libraries, OCLC, consortia, etc. – to continue to lead the way, in the process establishing the core of this library web of data: the hubs. Building on that framework, the rest of the web can be established with the help of the products and services of service providers and system suppliers. Those concerned about these things should already be starting to think about how they can be helped, not only to publish linked data in a form that the search engines can consume, but also to get their resources linked via those hubs to the wider web.
By lighting a linked data beacon on top of their web presence, a library will announce to the world the availability of their resources. One beacon is not enough, though. A web of beacons (the web of library data) will alert the search engines to the mass of those resources in all libraries, so that they can lead users via that web to the appropriately located individual resource.
This won’t happen overnight, but we are certainly in for some interesting times ahead.
As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series. That one – Entification, and the next in the series – Beacons of Availability, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Hubs of Authority
Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years. The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world, either in national libraries or cooperative organisations. Two from personal experience come to mind: BLCMP, started in Birmingham, UK in 1969, eventually evolved into the leading Semantic Web organisation Talis; and 1967 saw the creation of OCLC in Dublin, Ohio. Both, in their own ways, have had a significant impact on the worlds of libraries, metadata, and the web (and me!).
One of the obvious impacts of inter-library cooperation over the years has been the authorities – those sources of authoritative names for key elements of bibliographic records. A large number of national libraries have such lists of agreed formats for author and organisational names. The Library of Congress has, in addition to its name authorities, lists for subjects, classifications, languages, countries, etc. Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly, across the web in general, as a source of identifiers for people & organisations.
These authority files play a major role in the efficient cataloguing of material today, either by being part of the workflow in a cataloguing interface, or often just using the wonders of Windows ^C & ^V keystroke sequences to transfer agreed format text strings from authority sites into Marc record fields.
It is telling that the default [librarian] description of these things is a file – an echo back to the days when they were just that, a file containing a list of names. Almost despite their initial purpose, authorities are gaining a wider purpose, as a source of names for, and growing descriptions of, the entities that the library world is aware of. Many authority file hosting organisations have followed the natural path, in this emerging world of Linked Data, of providing persistent URIs for each concept and publishing their information as RDF.
These Linked Data enabled sources of information are developing an importance in their own right, as a natural place to link to when asserting the thing, person, or concept you are identifying in your data. Sir Tim Berners-Lee’s fourth principle of Linked Data tells us to “Include links to other URIs, so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF, surfacing hundreds of thousands of relevant links between the two. A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.
Libraries and librarians have a great brand image, something that attaches itself to the data and services they publish on the web. Respected and trusted are a couple of words that naturally associate with bibliographic authority data emanating from the library community. This data, starting to add value to the wider web, comes from those Marc records I spoke about last time. Yet it does not, as yet, lead those navigating the web of data to those resources so carefully catalogued. In this case, instead of cataloguing so people can find stuff, we could be considered to be enriching the web with hubs of authority derived from, but not connected to, the resources that brought them into being.
So where next? One obvious move, already starting to take place, is to use the identifiers (URIs) for these authoritative names to assert, within our data, facts such as who a work is by and what it is about. Check out data from the British National Bibliography, or the linked data hidden in the tab at the bottom of a WorldCat display – you will see VIAF, LCSH and other URIs asserting connections with known resources. In this way, processes no longer need to infer from the characters on a page that they are connected with a person or a subject. It is a fundamental part of the data.
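The shape of those assertions, sketched in Turtle (the VIAF and LCSH identifiers here are placeholders, not real ones):

```turtle
@prefix schema: <http://schema.org/> .

<http://www.worldcat.org/oclc/41266045>
    a schema:Book ;
    schema:name "Harry Potter and the Prisoner of Azkaban" ;
    # the author is asserted as a VIAF URI, not a text string
    schema:author <http://viaf.org/viaf/00000000> ;
    # the subject is asserted as an LCSH URI
    schema:about <http://id.loc.gov/authorities/subjects/sh00000000> .
```

Any process consuming this data can follow those URIs to the authority hubs, rather than guessing from spellings of names.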
With that large amount of rich [linked] data, and the association of the library brand, it is hardly surprising that these datasets are moving beyond mere nodes on the web of data. They are evolving into Hubs of Authority, building a framework on which libraries, and the rest of the web, can hang descriptions of, and signposts to, our resources. A framework that has uses and benefits beyond the boundaries of bibliographic data. By not keeping those hubs ‘library only’, we enable the wider web to build pathways to the library curated resources people need to support their research, learning, discovery and entertainment.
As is often the way, you start a post without realising that it is part of a series of posts – as with this one. This, and the following two posts in the series – Hubs of Authority, and Beacons of Availability – together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Entification – a bit of an ugly word, but in my day to day existence one I am hearing more and more. What an exciting life I lead…
What is it, and why should I care, you may be asking.
I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and also as a potential stimulator of significant efficiencies in the creation and management of information about those resources. Taking those benefits as being accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.
That phrase ‘getting library data into a linked data form’ hides a multitude of issues. There are some obvious steps, such as holding and/or outputting the data in RDF, providing resources with permanent URIs, etc. However, deriving useful library linked data from a source such as a Marc record requires far more than giving it a URI and encoding what you know, unchanged, as RDF triples.
Marc is a record based format. For each book catalogued, a record is created. The mantra driven into future cataloguers at library school has been, and I believe often still is, catalogue the item in your hand. Everything discoverable about the item in their hand is transferred on to that [now virtual] catalogue card stored in their library system. In that record we get obvious bookish information such as title, size, format, number of pages, ISBN, etc. We also get information about the author (name, birth/death dates, etc.), publisher (location, name, etc.), classification scheme identifiers, subjects, genres, notes, holdings information, etc., etc., etc. A vast amount of information about, and related to, that book in a single record. A significant achievement – assembling all this information for the vast majority of books in the vast majority of the libraries of the world. In this world of electronic resources, the pattern is being repeated for articles, journals, eBooks, audiobooks, etc.
Why do we catalogue? A question I often ask, with an obvious answer: so that people can find our stuff. Replicating the polished drawers of catalogue cards of old, ordered by author name or subject, indexes are applied to the strings stored in those records – indexes acting as search access points to a library’s collection.
A spin-off of capturing information about library books/articles/etc. in record attributes is that we are also building up information about authors, publishers, subjects and classifications. So, for instance, a subject index will contain a list of all the names of the subjects addressed by an individual library’s collection. To apply some consistency between libraries, authorities – authoritative sets of names, subject headings, etc. – have emerged so that spellings and name formats can be shared in a controlled way between libraries and cataloguers.
So where does entification come in? Well, much of the information about authors, subjects, publishers, and the like is locked up in those records. A record could be taken as describing an entity – the book. However, the other entities in the library universe are described only as attributes of the book/article/text. I can attest to the vast computing power and intellectual effort that goes into efforts at OCLC to mine these attributes from records to derive descriptions of the entities they represent – the people, places, organisations, subjects, etc. that the resources are by, about, or related to in some way.
Once the entities are identified, and a model is produced & populated from the records, we can start to work with a true multi-dimensional view of our domain. A major step forward from the somewhat singular view that we have been working with over previous decades. With such a model it should be possible to identify and work with new relationships, such as publishers and their authors, subjects and collections, works and their available formats.
We are in a state of change in the library world which entification of our data will help us get to grips with. As you can imagine as these new approaches crystallise, they are leading to all sorts of discussions around what are the major entities we need to concern ourselves with; how do we model them; how do we populate that model from source [record] data; how do we do it without compromising the rich resources we are working with; and how do we continue to provide and improve the services relied upon at the moment, whilst change happens. Challenging times – bring on the entification!
I cannot really get away with making a statement like “Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them” and then not follow it up.
So here, for those that are interested, is a step-by-step description of what I did to follow my own encouragement to load up the triples and start playing.
Choose a triplestore. I followed my own advice and chose 4Store. The main reasons for this choice were that it is open source, yet comes from an environment where it was the base platform for a successful commercial business – so it should work. Also, in my years rattling around the semantic web world, 4Store has always been one of those tools that seemed to be on everyone’s recommendation list.
Looking at some of the blurb – 4store is optimised to run on shared-nothing clusters of up to 32 nodes, linked with gigabit Ethernet, at times holding and running queries over databases of 15GT, supporting a Web application used by thousands of people – you may think it might be a bit of overkill for a tool to play with at home, but hey, if it works, does that matter!
Operating system. Unsurprisingly for a server product, 4Store was developed to run on Unix-like systems. I had three options: I could resurrect that old Linux-loaded PC in the corner, fire up an Amazon Web Services image with 4Store built in (such as the one built for the Billion Triple Challenge), or I could use the application download for my Mac.
As I only needed it for personal playing, I took the path of least resistance and chose the Mac application, the Mac in question being a fairly modern MacBook Air. The following instructions are therefore Mac-oriented, but should not be too difficult to replicate on your OS of choice.
Download and install. I downloaded the 15Mb latest version of the application from the download server: http://4store.org/download/macosx/. As with most Mac applications, it was just a matter of opening up the downloaded 4store-1.1.5.dmg file and dragging the 4Store icon into my Applications folder. (Time-saving tip: whilst you are doing the next step, you can be downloading the 1Gb WorldCat data file in the background, from here.)
Setup and load. Clicking on the 4Store application opens up a terminal window to give you command line access to controlling your triple store. Following the simple but effective documentation, I needed to create a dataset, which I called WorldCatMillion:
$ 4s-backend-setup WorldCatMillion
Next start the database:
$ 4s-backend WorldCatMillion
Then I needed to load the triples from the WorldCat Most Highly Held data set, with a command along these lines (substitute the name of the ntriples file you actually downloaded):
$ 4s-import WorldCatMillion --format ntriples WorldCatMostHighlyHeld.nt
This step takes a while – over an hour on my system. The command looks a bit complicated, but all it is doing is telling the import process to load the file – which I had downloaded and unzipped (automatically on the Mac – you may have to use gunzip on another system), and which is formatted as ntriples – into my WorldCatMillion dataset.
Access via a web browser. I chose Firefox, as it seems to handle unformatted XML better than most. 4Store comes with a very simple SPARQL interface at http://localhost:8000/test/ (served by the 4s-httpd process – if it is not already running, something like $ 4s-httpd -p 8000 WorldCatMillion should start it). This comes already populated with a sample query; just press execute and you should get back the data that you got with the command line 4s-query. The server sends it back in an XML format, which your browser may save to disk for you to view – tweaking the browser settings to automatically open these files will make life easier.
Some simple SPARQL queries. Try these and see what you get:
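For example, queries of this shape should work against the schema.org-modelled WorldCat data (the property name in the second query assumes the schema.org vocabulary – adjust to whatever predicates you actually find in the store):

```sparql
# Any ten triples - a quick sanity check that the load worked
SELECT * WHERE { ?s ?p ?o } LIMIT 10

# The names of ten resources, using the schema.org vocabulary
PREFIX schema: <http://schema.org/>
SELECT ?resource ?name
WHERE { ?resource schema:name ?name }
LIMIT 10
```

Run one query at a time, either in the browser interface or via 4s-query on the command line.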
Why is it that those sceptical about a new technology resort, within a very few sentences, to the “show me the Killer App” line? As if the appearance of a gold-star, bloggerati-approved example of something useful implemented in said technology is going to change their mind.
Remember the Web – what was the Web’s Killer App?
Step back a little further – what was the Killer App for HTML?
Answers in the comments section for both of the above.
I relate this to a blog post by Senior Program Officer for OCLC Research, Roy Tennant. I’m sure Roy won’t mind me picking on him as a handy current example of a much wider trend. In his post, in which he asserts that Microdata, Not RDF, Will Power the Semantic Web (and which elicits an interesting comment stream), he says:
Twelve years ago I basically called the Resource Description Framework (RDF) “dead on arrival”. That was perhaps too harsh of an assessment, but I had my reasons and since then I haven’t had a lot of motivation to regret those words. Clearly there is a great deal more data available in RDF-encoded form now than there was then.
But I’m still waiting for the killer app. Or really any app at all. Show me something that solves a problem or fulfills a need I have that requires RDF to function. Go ahead. I’ll wait.
Oh, you’ve got nothing? Well then, keep reading.
I could go off on a rant about something that was supposedly dead a decade ago still having trouble lying down; or how comparing Microdata and RDF as vehicles for describing and relating things is like comparing text files to XML; or even that the value is not in the [Microdata or RDF] encoding, it is in the linking of things, concepts, and relationships – but that is for another time.
So to the search for Killer Apps.
Was VisiCalc a Killer App for the PC – yes, it probably was. Was Windows a Killer App – more difficult to answer. Is/was Windows an App, or something more general in technology-wave terms that would beget its own Killer Apps? What about Hypertext? Ignoring all the pre-computer-era work on its principles, I would contend that its first Killer App was the Windows Help System – WinHelp. Although, with a bit of assistance from the HTML concept of the href, it was somewhat eclipsed by the Web.
The further up the breakfast-pancake-like stack of technologies, standards, infrastructures, and ecosystems we evolve away from the bit of silicon at the core of what we still belittle with the simple name of computer, the more vague and pointless is our search for a Killer Use, and therefore App.
For a short period in history you could have considered the killer application of the internal combustion engine to be the automobile, but it wasn’t long before those two became intimately linked into a single entity with more applications than you could shake a stick at – all and none of which could be considered to be the killer.
Back to my domain, data. As I have postulated previously, I believe we are nearing a point where data – its use, our access to it, and the attention and action large players are giving to it – is going to fundamentally change the way we do things. Change that will come about not from a radical change in what we do, but from a gradual adoption of several techniques and technologies atop what we already have. As I have also said before, Linked Data (and its data format RDF) will be a success when we stop talking about it, as it becomes just a tool in the bag. A tool that is used when and where appropriate, and mundane enough not to warrant a Linked Data Powered sticker on the side.
Take the apps being built on the Kasabi [Linked Data Powered] platform. Will the users of these apps be aware of the Linked Data (and RDF) under the hood? No. Will they benefit from it? Yes – the aggregation and linking capabilities should deliver a better experience. Are any of them Killers? I doubt it, but they are nonetheless worthwhile.
So stop searching for that Killer App; you may be fruitlessly hunting for a long time. When someone invents a pocket-sized fusion reactor or the teleport, they might be back in vogue.
Prospector picture from ToOliver2 on Flickr; Pancakes picture from eyeliam on Flickr.
When moving to a new place we bring loads of baggage and stuff from our old house that we feel will be necessary in our new abode. One poke of the head into your attic will clearly demonstrate how much of the necessary stuff was not so necessary after all. Yet you will also find things lurking around the house that may have survived several moves. The moral here is that, although some things are core to what we are and do, it is difficult to predict what we will need in a new situation.
This is also how it is when we move to doing things in a different way – describing our assets in RDF for instance.
I have watched many flounder when they first try to get their head around describing the things they already know in this new Linked Data format, RDF. Just like moving house, we initially grasp for the familiar and that might not always be helpful.
So. You understand your data, you are comfortable with XML, you recognise some familiar vocabularies that someone has published in RDF/XML, this recreation in RDF sounds simple-ish, you take a look at some other data in RDF/XML to see what it should look like, and… your brain freezes as you try to work out how you nest one vocabulary within another and how your schema should look.
This is where stepping back from the XML is a good idea. XML is only one encoding/transmission format for RDF – a way of encoding RDF so that it can be transmitted from one machine process to another. XML is ugly. XML is hierarchical, and therefore introduces lots of compromises when imparting the graph nature of an RDF model. XML brings with it a [hierarchical] way of thinking about data that constrains your decisions when creating a model for your resources.
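To make the point concrete, here is one and the same statement in two serializations: a single n-triples line and its much noisier RDF/XML equivalent. The subject URI and name are invented for the example; the little Python sketch below parses the XML back out to show that the underlying triple is identical either way:

```python
import xml.etree.ElementTree as ET

# One triple, two serializations. The URIs and the name are invented examples.
ntriple = ('<http://example.org/person/1> '
           '<http://xmlns.com/foaf/0.1/name> "Jane Doe" .')

rdf_xml = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://example.org/person/1">
    <foaf:name>Jane Doe</foaf:name>
  </rdf:Description>
</rdf:RDF>"""

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Pull the triple back out of the XML: same subject, predicate, object.
desc = ET.fromstring(rdf_xml).find(RDF + "Description")
subject = desc.get(RDF + "about")
prop = desc[0]  # the single property element
predicate = prop.tag.replace("{", "").replace("}", "")  # namespace + local name
triple = (subject, predicate, prop.text)

print(triple)
# The n-triples line is just the same three terms, space-separated:
assert ntriple == f'<{subject}> <{predicate}> "{prop.text}" .'
```

The graph model is the thing; the angle brackets are incidental, which is why stepping back from the XML costs you nothing.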
I suggest you not only step back from the XML, but initially step back from the computer as well.
Get out your white/blackboard or flip-chart & pen and start drawing some ellipses, rectangles, and arrows. You know your domain, so go ahead and draw a picture of it. Draw an ellipse for each significant thing in your domain – each type of object, concept, event, etc. Draw a rectangle for each literal (string, date, number). Beware of strings that are really ‘things’ – things that have other attributes, including the string as a name attribute. Draw arrows to show the relationships between your things, and between things and their attributes. Label the arrows to define those relationships – don’t worry about vocabularies yet; use simple labels such as name, manufacturer, creator, publication event, discount, owner, etc. Create identifiers for your things in the form of URIs, so that they will be unique and accessible when you eventually publish them. What you end up with should look a bit like this slide from my recent presentation at the Semantic Tech and Business Conference in Berlin – the full deck is on SlideShare. Note how I have also represented the resources of the model in a machine-readable form of RDF – yes, you can now turn to the computer again.
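As a hypothetical sketch of where that whiteboard exercise ends up (the URIs and labels below are invented, not taken from the slide), the ellipses, rectangles, and arrows translate directly into subject–arrow–object triples:

```python
# Hypothetical whiteboard model rendered as triples: ellipses become
# URI-identified things, rectangles become literals, and arrows become
# the labelled relationships between them.

thing = "http://example.org/id/widget/1"
maker = "http://example.org/id/org/acme"

model = [
    (thing, "name", "Widget One"),     # ellipse -> rectangle (literal)
    (thing, "manufacturer", maker),    # ellipse -> ellipse (thing)
    (maker, "name", "Acme Ltd"),       # ellipse -> rectangle (literal)
]

# Things are URIs; literals are plain strings -- the same test you apply
# at the whiteboard ("is this string really a thing in disguise?").
things = {s for s, _, _ in model} | {o for _, _, o in model if o.startswith("http://")}
literals = [o for _, _, o in model if not o.startswith("http://")]

print(sorted(things))
print(literals)
```

Note that the manufacturer is a thing with its own name, not a bare string hanging off the widget – exactly the 'strings that are really things' trap described above.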
Once back at the computer you can now start referring to generally used vocabularies such as foaf, dcterms, etc. to identify generally recognised labels for relationships – an obvious candidate being foaf:name in this example. Move on to more domain specific vocabularies as you go. When you find a relationship not catered for elsewhere you may have to consider publishing your own to supplement the rest of the world.
Once you have your model, it is often a simple bit of scripting to take data from your current form (CSV, database, XML record) and produce some simple files of triples in n-triples format. Then use a useful tool like Raptor to transform it into good old ugly XML for transfer. Better still, take your n-triples files and load them into a storage/publishing platform like Kasabi. This was the route the British Library took when they published the British National Bibliography as RDF.
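As an illustration of how small that scripting step can be – the column names, example.org URI pattern, and choice of dcterms:title here are all invented for the sketch, not prescribed – converting a CSV of identifiers and titles into n-triples might look like this:

```python
import csv
import io

# Hypothetical input: a two-column CSV of identifiers and titles.
csv_data = "id,title\nb1,A Tale of Two Cities\nb2,Bleak House\n"

def row_to_ntriples(row):
    """Emit one n-triples line for a row: invented URI pattern, dcterms:title."""
    subject = f"<http://example.org/id/book/{row['id']}>"
    # Literal objects need quoting; real data would also need fuller escaping.
    title = row["title"].replace('"', '\\"')
    return f'{subject} <http://purl.org/dc/terms/title> "{title}" .'

lines = [row_to_ntriples(r) for r in csv.DictReader(io.StringIO(csv_data))]
print("\n".join(lines))
```

A few more columns mean a few more lines per row; the shape of the script does not change, which is why n-triples is such a forgiving intermediate format.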
Here, for instance, is a picture of the village next to where I live, discovered in Ookaboo and associated with the village as a topic:
But there is more.
Ookaboo have released an RDF dump of the metadata behind the images, concept mappings, and links to concepts in Freebase and DBpedia for topics such as places, people, and organism classifications.
This is not a one-off exercise, [Ookaboo] intend to use an automated process to make regular releases of the Ookaboo dump in the future.
For the SPARQLy inclined, they also provide an overview of the structure, namespaces, and properties used in the RDF plus a SPARQL cookbook of example queries.
This looks to be a great resource, and when merged with other data sets, potentially capable of adding significant benefit.
We require the following attribution:
Papers, books, and other works that incorporate Ookaboo data or report results from Ookaboo data must cite Ookaboo and Ontology2.
HTML pages that incorporate images from Ookaboo must include a hyperlink to the page describing the image that is linked with the ookaboo:ookabooPage property.
Data products derived from Ookaboo must make it possible to maintain the provenance and attribution chain for images. In the case of an RDF dump, it is sufficient to provide a connection to Ookaboo identifiers and documentation that refers users to the Ookaboo RDF dump. SPARQL endpoints must contain attribution information, which can be done by importing selected records from the Ookaboo dump.
No problem in principle, but in practice some may find the share-alike elements of the last item a bit difficult to comply with once you start building applications on layers upon layers of data APIs. Commercial players especially may shy away from using Ookaboo because of the copyleft ramifications. For the data itself, I would have thought CC-BY would have been sufficient.
Maybe Paul Houle of Ontology2, who are behind Ookaboo, would like to share his reasoning behind this.
My message here is that we need to be creating data, not records, and that we need to create the data first, then build records with it for those applications where records are needed.
Karen was addressing the result of looking at the move of bibliographic metadata into Linked Data (and by definition RDF) from the “I’ve got all these records that I think I want to preserve” end of the telescope. She points out there are many translation efforts focused upon representing library record formats such as ISBD, FRBR, RDA, MODS, and MARC in RDF, none of which have been endorsed by a standards body and all of which are mutually incompatible information silos. With an enlightening explanation of how there are more ways to describe the Place of Publication in library land than you can shake a stick at, she identifies that these are conditioned by context. A context that travels with the translations from a library record format into an RDF representation thereof.
This is the antithesis of the linked data concept, where data sets from diverse sources share metadata elements. It is this re-use of elements that creates the “link” in linked data. To achieve this, metadata elements need to be unconstrained by a particular context.
It is interesting to note that this is the first use of the words Linked Data in the post, a significant way into her text. I believe this is symptomatic of the audience Karen is trying to enlighten and convince. For decades they have been focused on tightly controlled record formats and rules – so much so that the business model for a new format appears to be based upon marketing the documentation required to understand and apply it. It is hardly surprising, therefore, that they hook on to RDF as a ‘standard’ and then try to represent and enforce their context upon it. Yes, they end up with valid RDF (that is not difficult, as you can represent almost anything in RDF), but is it useful Linked Data? More often than not – no.
I find it helps, when visualising how information should be captured as Linked Data, to approach it from the point of view of a non-domain expert. With that hat on, the place of publication is easy – e.g. London = the place where the publisher that published the item was located. A library domain expert would explain how, under certain circumstances such as a printing error or a special context, it is not that simple. The non-domain expert would also assert in my example that the London in place of publication is the same London as the subject of the book, the same London referenced in the text, the same London that the author resided in, the same London in the title, the same London in which you can find a library to loan a copy from, and the same London that has http://dbpedia.org/resource/London as its DBpedia URI. I leave you to imagine the response from your favourite librarian…
So how do we square this circle? By, as Karen says, creating data, not records. We can create some data describing London as a place, with its geographic location and the strings of characters in different languages that are accepted as its name. We could then create some more about a book, and even more about a publication event (a date, a link to a publisher, and a link to the place we just described). If later on we, or someone else, want to point to [link] our London piece of data as describing the place the book is about, that would be equally valid, and not detrimental to the relationships we had created between it and any other data entities we described earlier.
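Sketched as triples, with invented example.org URIs standing in for real identifiers such as http://dbpedia.org/resource/London, that "data, not records" shape looks something like this; the point to notice is that London is described once and merely pointed at from wherever it is needed:

```python
# Hypothetical URIs; in practice you would mint real ones, or link to
# existing identifiers such as http://dbpedia.org/resource/London.
LONDON = "http://example.org/id/place/london"
BOOK = "http://example.org/id/book/123"
EVENT = "http://example.org/id/event/pub-456"
PUBLISHER = "http://example.org/id/org/somepress"

data = [
    # Some data describing London as a place...
    (LONDON, "rdf:type", "Place"),
    (LONDON, "name", "London"),
    (LONDON, "lat", "51.507"),
    (LONDON, "long", "-0.128"),
    # ...some more about a book...
    (BOOK, "rdf:type", "Book"),
    (BOOK, "title", "A History of London"),
    # ...and about a publication event linking the two.
    (EVENT, "rdf:type", "PublicationEvent"),
    (EVENT, "date", "1923"),
    (EVENT, "publisher", PUBLISHER),
    (EVENT, "location", LONDON),
    (BOOK, "publication", EVENT),
    # Later, we (or someone else) reuse the same London resource as a subject:
    (BOOK, "subject", LONDON),
]

# Adding the "subject" link touched nothing else: London is now pointed
# at twice but still described exactly once.
links_to_london = [(s, p) for s, p, o in data if o == LONDON]
print(links_to_london)
```

Contrast this with a record, where the place of publication and the subject would be two unrelated strings, each frozen inside its own field.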
Although we cannot un-invent them, I see these translations to RDF as unhelpful. What is needed are entity extraction and linking processes to help create the entity descriptions we need from the information captured in the individual record silos that we have built up over the decades.
In her comments, Karen points to the data model (pdf) used by the British Library when publishing the British National Bibliography as Linked Data. A project, with which I was associated, that took the approach I describe, and it is well worth a look. When you do, remember that it is a first pass at modelling their data, which will be built upon later by the BL and others. If you do not see your favourite FRBR or other relationships represented, why not comment on how they could be added to the model alongside the relationships already represented? That is one of the major benefits of using RDF and Linked Data: you can mix different ways of expressing relationships in a single model.