OCLC WorldCat Linked Data Release – Significant In Many Ways

Typical! Since joining OCLC as Technology Evangelist, I have been preparing myself to be one of the first to blog about the release of linked data describing the hundreds of millions of bibliographic items in WorldCat.org. So where am I when the press release hits the net? 35,000 feet above the North Atlantic heading for LAX, that’s where – life just isn’t fair.

By the time I am checked in to my Anaheim hotel, ready for the ALA Conference, this will be old news. Nevertheless it is significant news, significant in many ways.

OCLC have been at the leading edge of publishing bibliographic resources as linked data for several years. At dewey.info they have been publishing the top levels of the Dewey classifications as linked data since 2009. As announced yesterday, this has now been increased to encompass 32,000 terms, such as this one for the transits of Venus. Also around for a few years is VIAF (the Virtual International Authority File), where you will find URIs published for authors, such as this well-known chap. These two were more recently joined by FAST (Faceted Application of Subject Terminology), providing usefully applicable identifiers for Library of Congress Subject Headings and combinations thereof.

Despite this leading position in the sphere of linked bibliographic data, OCLC has attracted some criticism over the years for not biting the bullet and applying it to all the records in WorldCat.org as well.  As today’s announcement now demonstrates, they have taken their linked data enthusiasm to the heart of their rich, publicly available, bibliographic resources – publishing linked data descriptions for the hundreds of millions of items in WorldCat.

Let me dissect the announcement a bit….

First significant bit of news – WorldCat.org is now publishing linked data for hundreds of millions of bibliographic items – that’s a heck of a lot of linked data by anyone’s measure. It is by far the largest linked bibliographic resource on the web. It is also linked data describing things that librarians in tens of thousands of libraries all over the globe have been carefully cataloguing for decades, so that the rest of us can find out about them. Just the sort of authoritative resources that will help stitch the emerging web of data together.

Second significant bit of news – the core vocabulary used to describe these bibliographic assets comes from schema.org.  Schema.org is the initiative backed by Google, Yahoo!, Microsoft, and Yandex, to provide a generic high-level vocabulary/ontology to help mark up structured data in web pages so that those organisations can recognise the things being described and improve the services they can offer around them.  A couple of examples being Rich Snippet results and inclusion in the Google Knowledge Graph.

As I reported a couple of weeks back, from the Semantic Tech & Business Conference, some 7-10% of indexed web pages already contain schema.org markup, in either microdata or RDFa form. It may at first seem odd for a library organisation to use a generic web vocabulary to mark up its data – but just think who the consumers of this data are, and which vocabularies they are most likely to recognise. Just for starters, embedding schema.org data in WorldCat.org pages immediately makes them understandable by the search engines, vastly increasing the findability of these items.

Third significant bit of news – the linked data is published both in human readable form and in machine readable RDFa on the standard WorldCat.org detail pages. You don’t need to go to a special version or interface to get at it; it is part of the normal interface. As you can see from the screenshot of a WorldCat.org item above, there is now a Linked Data section near the bottom of the page. Click and open up that section to see the linked data in human readable form. You will see the structured data that the search engines and other systems will get from parsing the RDFa encoded data within the HTML that creates the page in your browser. Not very pretty to human eyes I know, but just the kind of structured data that systems love.
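
To give a feel for what "parsing the RDFa" involves, here is a minimal Python sketch using only the standard library. The HTML fragment, the example.org URIs, and the names PropertyCollector and extract_properties are all made up for illustration; the fragment is not copied from a real WorldCat.org page. It is also a deliberate simplification: a real RDFa processor handles prefixes, nested subjects, typeof, and much more.

```python
from html.parser import HTMLParser

# Hypothetical fragment for illustration only; NOT real WorldCat.org markup.
SAMPLE_HTML = """
<div vocab="http://schema.org/" typeof="Book" about="http://example.org/item/1">
  <span property="name">Harry Potter and the Deathly Hallows</span>
  <span property="datePublished">2007</span>
  <a property="author" href="http://example.org/person/1">J. K. Rowling</a>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect (property, value) pairs from elements with a `property` attribute.

    Simplification: assumes elements carrying `property` contain no child
    elements, and ignores subjects, types, and prefix declarations.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property whose text content is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # An explicit literal value wins over the element text.
            self.pairs.append((prop, attrs["content"]))
        elif "href" in attrs:
            # The value is a link to another resource.
            self.pairs.append((prop, attrs["href"]))
        else:
            # Otherwise the value is the text inside the element.
            self._current = prop
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is not None:
            self.pairs.append((self._current, "".join(self._text).strip()))
            self._current = None


def extract_properties(html):
    """Return the (property, value) pairs found in an HTML string."""
    collector = PropertyCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


if __name__ == "__main__":
    for prop, value in extract_properties(SAMPLE_HTML):
        print(f"{prop}: {value}")
```

The point is simply that the data sits in ordinary HTML attributes, so any system that can read the page can pick it up. For anything beyond a quick experiment, a proper RDFa library would be the better choice.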

Fourth significant bit of news – OCLC are proposing to cooperate with the library and wider web communities to extend Schema.org, making it even more capable for describing library resources. With the help of the W3C, Schema.org is working with several industry sectors to extend the vocabulary to be more capable in their domains – news and e-commerce being a couple of already accepted examples. OCLC is playing its part in doing this for the library sector.

Take a closer look at the markup on WorldCat.org and you will see attributes from a library vocabulary. Attributes such as library:holdingsCount and library:oclcnum. This library vocabulary is OCLC’s conversation starter, with which we want to kick off discussions with interested parties, from the library and other sectors, about proposing a basic extension to schema.org for library data. What better way of testing out such a vocabulary – mark up several million records with it, publish them and see what the world makes of them.
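
Names like library:holdingsCount are prefixed names: a short prefix standing in for a namespace URI, followed by a local name. As a rough Python sketch of how a consumer might expand them and separate the library terms from the schema.org ones, consider the following. The library namespace URI below is a placeholder of my own, not the real one, the schema:name entry is just an illustrative companion term, and the function names are hypothetical.

```python
# Prefix map. The "library" URI is a PLACEHOLDER for illustration,
# not the namespace actually used on WorldCat.org.
PREFIXES = {
    "schema": "http://schema.org/",
    "library": "http://example.org/library/",
}


def expand_name(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'library:oclcnum' to a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return prefixes[prefix] + local


def group_by_prefix(names):
    """Group prefixed names by prefix, preserving their input order."""
    groups = {}
    for name in names:
        prefix = name.partition(":")[0]
        groups.setdefault(prefix, []).append(name)
    return groups


if __name__ == "__main__":
    names = ["schema:name", "library:holdingsCount", "library:oclcnum"]
    for prefix, members in group_by_prefix(names).items():
        print(prefix, [expand_name(m) for m in members])
```

Keeping the extension terms under their own prefix is what makes this kind of separation trivial: a consumer that only understands schema.org can ignore the library terms, while a library-aware one can pick them out.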

Fifth significant bit of news – the WorldCat.org linked data is published under an Open Data Commons (ODC-BY) license, so it will be openly usable by many for many purposes.

Sixth significant bit of news – this release is an experimental release. This is the start, not the end, of a process. We know we have not got this right yet. There are more steps to take around how we publish this data in ways other than RDFa markup embedded in page HTML – not everyone can, or will want to, parse pages to get the data. There are obvious areas for discussion around the use of schema.org and the proposed library extension to it. There are areas for discussion about the application of the ODC-BY license and the attribution requirements it asks for. Over the coming months OCLC wants to constructively engage with all who are interested in this process. It is only with the help of the library and wider web communities that we can get it right. In that way we can ensure that WorldCat linked data is beneficial for the OCLC membership and libraries in general, and a great resource on the emerging web of data.

For more information about this release, check out the background to linked data at OCLC, join the conversation on the OCLC Developer Network, or email data@oclc.org.

As you can probably tell, I am fairly excited about this announcement. This, and future stuff like it, is among my reasons for joining OCLC. I can’t wait to see how this evolves and develops over the coming months. I am also looking forward to engaging in the discussions it triggers.

