
You may remember my frustration a couple of months ago at being in the air when OCLC announced the addition of Schema.org marked-up Linked Data to all resources in WorldCat.org. Those of you who attended the OCLC Linked Data Round Table at IFLA 2012 in Helsinki yesterday will know that I got my own back on the folks who publish the press releases at OCLC by announcing the next WorldCat step along the Linked Data road whilst they were still in bed.
The Round Table was an excellent, very interactive session with Neil Wilson from the British Library, Emmanuelle Bermes from Centre Pompidou, and Martin Malmsten of the National Library of Sweden, which I will cover elsewhere. For now, you will find my presentation, Library Linked Data Progress, on my SlideShare site.
After we experimentally added RDFa embedded linked data, using Schema.org markup and some proposed Library extensions, to WorldCat pages, one of the questions I was asked most often was: where can I get my hands on some of this raw data?
We are taking the application of linked data to WorldCat one step at a time so that we can learn from how people use and comment on it. So at that time, if you wanted to see the raw data, the only way was to use a tool [such as the W3C RDFa 1.1 Distiller] to parse the data out of the pages, just as the search engines do.
So I am really pleased to announce that you can now download a significant chunk of that data as RDF triples. Especially in experimental form, providing the whole lot as a download would have been a bit of a challenge, even just in disk space and bandwidth terms. So which chunk to choose was a question. We could have chosen a random selection, but decided instead to pick the most popular resources in WorldCat, in terms of holdings, an interesting selection in its own right.
To make the cut, a resource had to be held by more than 250 libraries. It turns out that almost 1.2 million fall into this category, so a sizeable chunk indeed. To get your hands on this data, download the 1 GB gzipped file. It is in RDF N-Triples form, so you can take a look at the raw data in the file itself. Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them.
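If you just want a quick look at the file before setting up a triplestore, a minimal Python sketch along the following lines could stream the gzipped N-Triples and tally which predicates appear. The function names and the simplified line pattern are illustrative assumptions, not part of any OCLC tooling, and the pattern is not a full N-Triples parser.

```python
import gzip
import re
from collections import Counter

# Simplified N-Triples line pattern (an assumption, not a complete grammar):
# subject (IRI or blank node), predicate (IRI), then the raw object text
# up to the terminating " ." of the statement.
TRIPLE_RE = re.compile(r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.*?)\s*\.\s*$')


def iter_triples(lines):
    """Yield (subject, predicate, object) strings for each parsable line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match:
            yield match.groups()


def count_predicates(lines):
    """Count how often each predicate occurs in an iterable of lines."""
    return Counter(predicate for _, predicate, _ in iter_triples(lines))


def count_predicates_in_file(path):
    """Stream a gzipped N-Triples file from disk without unpacking it first."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        return count_predicates(handle)
```

Calling `count_predicates_in_file()` on the downloaded file and printing `most_common(20)` from the result gives a rough feel for the vocabulary in use. Because the file is read line by line, memory use stays small even with around 80 million triples.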
Another area of questions around the publication of WorldCat linked data has been licensing. Both the RDFa embedded data and the download data are published as open data under the Open Data Commons Attribution License (ODC-BY), with reference to the community norms put forward by the members of the OCLC cooperative who built WorldCat. The theme of many of the questions has been along the lines of “I understand what the license says, but what does this mean for attribution in practice?”
To help clarify how you might attribute ODC-BY licensed WorldCat, and other OCLC linked data, we have produced attribution guidelines. You can find these at http://www.oclc.org/data/attribution.html. They address several scenarios, from documents containing WorldCat derived information to referencing WorldCat URIs in your linked data triples, suggesting possible ways to attribute the OCLC WorldCat source of the data. As guidelines, they obviously cannot cover every possible situation which may require attribution, but hopefully they will cover most and can be adapted to other similar ones.
As I say in the press release, posted after my announcement, we are really interested to see what people will do with this data. So let us know, and if you have any comments on any aspect of its markup, schema.org extensions, publishing, or on our attribution guidelines, drop us a line at data@oclc.org.
[...] Richard Wallis has written an article about the latest updates to WorldCat.org. He writes, “After we experimentally added RDFa embedded linked data, using Schema.org markup and some proposed Library extensions, to WorldCat pages, one of the questions I was asked most often was: where can I get my hands on some of this raw data? We are taking the application of linked data to WorldCat one step at a time so that we can learn from how people use and comment on it. So at that time, if you wanted to see the raw data, the only way was to use a tool [such as the W3C RDFa 1.1 Distiller] to parse the data out of the pages, just as the search engines do.” [...]
[...] Get Yourself a Linked Data Piece of WorldCat to Play With by Richard Wallis. [...]
[...] Reposted by 远洋过客 on 书社会, 2012-8-18. Original blog post: Data Liberate: Get Yourself a Linked Data Piece of WorldCat to Play With / By Richard Wallis on August 12, 2012. OCLC official website news: OCLC provides downloadable linked data [...]
[...] data file for the 1 million most widely held works in WorldCat, 14 August 2012 Via Data Liberate: Get Yourself a Linked Data Piece of WorldCat to Play With / By Richard Wallis on August 12, [...]
[...] and practice some SPARQL on them” and then not following it up. I made it in my previous post Get Yourself a Linked Data Piece of WorldCat to Play With in which I was highlighting the release of a download file containing RDF descriptions of the 1.2 [...]
[...] of these resources (i.e. the 1.2M bibliographic resources held by at least 250 libraries) are also available as a RDF dataset that one can easily load into a triplestore such as [...]
[...] a reflection of this. This is also exemplified by the use of Schema.org in the RDF N-Triples dump file OCLC has published of a sub-set of WorldCat data. So moving on. You have your resources already [...]
[...] from within OCLC’s WorldCat.org, both as RDFa embedded in WorldCat detail pages, and in a download file containing the 1.2 million most highly held [...]
Thanks for this write-up! I took your advice, downloaded http://purl.oclc.org/dataset/WorldCat/datadumps/WorldCatMostHighlyHeld-2012-05-15.nt.gz to my MacBook Pro, installed 4store, and then loaded the data. After 70 minutes of churning to load the data (mostly because of I/O-bound access to the disk), I was able to do a simple query. My SPARQL chops are weak, but I did manage to figure out the incantation to sort by the holdingsCount predicate; I want to list the 1000 most held items. The problem is that the holdingsCount was loaded as a string and not an integer! So I’m looking at http://answers.semanticweb.com/questions/321/correct-handling-of-numbers-in-rdf for possible solutions. Stay tuned!
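The string-versus-integer problem described above can also be sidestepped outside the triplestore. The following is a minimal Python sketch, assuming the holdings statements carry a plain quoted number as their object and a predicate IRI ending in `holdingsCount`. The function name `top_held` and the pattern are hypothetical, and the exact predicate namespace is deliberately not assumed. Within SPARQL itself, a cast such as `ORDER BY DESC(xsd:integer(?count))` is the standard way to force numeric ordering, provided the store supports it.

```python
import heapq
import re

# Assumption: holdings triples look like
#   <subject> <...holdingsCount> "1234" .
# with or without a datatype suffix after the closing quote.
HOLDINGS_RE = re.compile(r'^(<[^>]*>)\s+<[^>]*holdingsCount>\s+"(\d+)"')


def top_held(lines, n=1000):
    """Return the n (count, subject) pairs with the largest holdings counts.

    Counts are converted to int, so "1000" ranks above "250" and "9",
    unlike a plain string sort.
    """
    pairs = (
        (int(match.group(2)), match.group(1))
        for match in map(HOLDINGS_RE.match, lines)
        if match
    )
    return heapq.nlargest(n, pairs)
```

Passing an open `gzip.open(path, 'rt', encoding='utf-8')` handle as `lines` would scan the download file in one pass. `heapq.nlargest` keeps only `n` candidates in memory at a time, which matters for a file of this size.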