You may remember my frustration a couple of months ago at being in the air when OCLC announced the addition of Schema.org marked up Linked Data to all resources in WorldCat.org. Those of you who attended the OCLC Linked Data Round Table at IFLA 2012 in Helsinki yesterday will know that I got my own back on the folks who publish the press releases at OCLC, by announcing the next WorldCat step along the Linked Data road whilst they were still in bed.
The Round Table was an excellent, very interactive session with Neil Wilson from the British Library, Emmanuelle Bermes from Centre Pompidou, and Martin Malmsten of the National Library of Sweden, which I will cover elsewhere. For now, you will find my presentation Library Linked Data Progress on my SlideShare site.
After we experimentally added RDFa embedded linked data, using Schema.org markup and some proposed Library extensions, to WorldCat pages, one of the questions I was asked most often was: where can I get my hands on some of this raw data?
We are taking the application of linked data to WorldCat one step at a time so that we can learn from how people use and comment on it. So at that time, if you wanted to see the raw data, the only way was to use a tool [such as the W3C RDFa 1.1 Distiller] to parse the data out of the pages, just as the search engines do.
So I am really pleased to announce that you can now download a significant chunk of that data as RDF triples. Especially in experimental form, providing the whole lot as a download would have been a bit of a challenge, even just in disk space and bandwidth terms. So which chunk to choose was a question. We could have chosen a random selection, but decided instead to pick the most popular, in terms of holdings, resources in WorldCat – an interesting selection in its own right.
To make the cut, a resource had to be held by more than 250 libraries. It turns out that almost 1.2 million fall into this category, so a sizeable chunk indeed. To get your hands on this data, download the 1GB gzipped file. It is in RDF N-Triples form, so you can take a look at the raw data in the file itself. Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them.
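If you just want a first feel for what is in the file before setting up a triplestore, something along these lines might help. It is only a minimal sketch using the Python standard library: it streams a gzipped N-Triples file and counts how often each predicate appears. The regular expression is a deliberately naive reading of the N-Triples line format rather than a full parser, and the function names are my own invention for illustration.

```python
import gzip
import re
from collections import Counter

# Naive N-Triples line pattern (an assumption, not a full parser):
# subject, then a predicate IRI in angle brackets, then the object up to the final " ."
TRIPLE = re.compile(r'^(\S+)\s+(<[^>]*>)\s+(.*?)\s*\.\s*$')


def count_predicates(lines):
    """Count predicate IRIs across an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith('#'):
            continue
        match = TRIPLE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def count_predicates_in_dump(path):
    """Stream a gzipped N-Triples file without unpacking it to disk first."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        return count_predicates(handle)
```

Calling count_predicates_in_dump with the path to wherever you saved the download, then most_common(20) on the result, should list the twenty most used properties. For anything beyond a quick look, a proper RDF library or the triplestore route is the better choice.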
Another area of questioning around the publication of WorldCat linked data has been licensing. Both the RDFa embedded data and the download data are published as open data under the Open Data Commons Attribution License (ODC-BY), with reference to the community norms put forward by the members of the OCLC cooperative who built WorldCat. The theme of many of the questions has been along the lines of “I understand what the license says, but what does this mean for attribution in practice?”
To clarify how you might attribute ODC-BY licensed WorldCat, and other OCLC linked data, we have produced attribution guidelines that address some of the uncertainties in this area. You can find these at http://www.oclc.org/data/attribution.html. They address several scenarios, from documents containing WorldCat derived information to referencing WorldCat URIs in your linked data triples, suggesting possible ways to attribute the OCLC WorldCat source of the data. As guidelines, they obviously cannot cover every possible situation which may require attribution, but hopefully they will cover most and can be adapted to other similar ones.
As I say in the press release, posted after my announcement, we are really interested to see what people will do with this data. So let us know, and if you have any comments on any aspect of its markup, schema.org extensions, publishing, or on our attribution guidelines, drop us a line at email@example.com.
Here, for instance, is a picture of the village next to where I live, discovered in Ookaboo associated with the village as a topic:
But there is more.
Ookaboo have released an RDF dump of the metadata behind the images, concept mappings and links to concepts in Freebase and Dbpedia for topics such as places, people and organism classifications.
This is not a one-off exercise; [Ookaboo] intend to use an automated process to make regular releases of the Ookaboo dump in the future.
For the SPARQLy inclined, they also provide an overview of the structure, namespaces, and properties used in the RDF plus a SPARQL cookbook of example queries.
This looks to be a great resource, and when merged with other data sets, potentially capable of adding significant benefit.
We require the following attribution:
Papers, books, and other works that incorporate Ookaboo data or report results from Ookaboo data must cite Ookaboo and Ontology2.
HTML pages that incorporate images from Ookaboo must include a hyperlink to the page describing the image that is linked with the ookaboo:ookabooPage property.
Data products derived from Ookaboo must make it possible to maintain the provenance and attribution chain for images. In the case of an RDF dump, it is sufficient to provide a connection to Ookaboo identifiers and documentation that refers users to the Ookaboo RDF dump. SPARQL endpoints must contain attribution information, which can be done by importing selected records from the Ookaboo dump.
No problem in principle, but in practice some may find the share-alike elements of the last item a bit difficult to comply with once they start building applications on layers upon layers of data APIs. Commercial players especially may shy away from using Ookaboo because of the copyleft ramifications. For the data itself, I would have thought CC-BY would have been sufficient.
Maybe Paul Houle, of Ontology2 who are behind Ookaboo, would like to share his reasoning behind this.
The Linked Data movement was kicked off in mid-2006 when Tim Berners-Lee published his now famous Linked Data Design Issues document. Many had been promoting the approach of using W3C Semantic Web standards to achieve the effect and benefits, but it was his document and the use of the term Linked Data that crystallised it, gave it focus, and a label.
In 2010 Tim updated his document to include the Linked Open Data 5 Star Scheme to “encourage people — especially government data owners — along the road to good linked data”. The key message was to Open Data. You may have the best RDF encoded and modelled data on the planet, but if it is not associated with an open license, you don’t get even a single star. That emphasis on government data owners is unsurprising as he was at the time, and still is, working with the UK and other governments as they come to terms with the transparency thing.
Once you have cleared the hurdle of being openly licensed (more of this later), your data climbs the steps of Linked Open Data stardom based on how available and therefore useful it is. So:
★ Available on the web (whatever format) but with an open licence, to be Open Data
★★ Available as machine-readable structured data (e.g. excel instead of image scan of a table)
★★★ as (2) plus non-proprietary format (e.g. CSV instead of excel)
★★★★ All the above plus, Use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
★★★★★ All the above, plus: Link your data to other people’s data to provide context
By usefulness I mean how low is the barrier to people using your data for their purposes. The usefulness of 1 star data does not spread much beyond looking at it on a web page. 3 Star data can at least be downloaded, and programmatically worked with to deliver analysis or for specific applications, using non-proprietary tools. Whereas 5 star data is consumable in a standard form, RDF, and contains links to other (4 or 5 star) data out on the web in the same standard consumable form. It is at the 5 star level that the real benefits of Linked Open Data kick in, and why the scheme encourages publishers to strive for the highest rating.
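Because each star builds on the ones before it, the scheme reads naturally as a cumulative checklist. The following is a minimal sketch of that reading in Python. The Dataset fields and the function name are hypothetical, made up here for illustration, and the scheme itself defines no such API.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset (field names are assumptions)."""
    on_web: bool
    open_licence: bool
    machine_readable: bool        # e.g. a spreadsheet rather than a scanned image
    non_proprietary_format: bool  # e.g. CSV rather than Excel
    uses_w3c_standards: bool      # RDF and SPARQL, URIs to identify things
    links_to_other_data: bool


def five_star_rating(dataset):
    """Return 0 to 5 stars, treating each star as requiring all the earlier ones."""
    # Without an open licence on the web, there is not even a single star.
    if not (dataset.on_web and dataset.open_licence):
        return 0
    stars = 1
    for criterion in (
        dataset.machine_readable,
        dataset.non_proprietary_format,
        dataset.uses_w3c_standards,
        dataset.links_to_other_data,
    ):
        if not criterion:
            break  # later stars do not count once an earlier one is missed
        stars += 1
    return stars
```

Note how beautifully modelled RDF with no open licence scores zero here, which is exactly the point about Open Data made above.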
Tim’s scheme is not the only open data star rating scheme in town. There is another one that emerged from the LOD-LAM Summit in San Francisco last summer – fortunately it is complementary and does not compete with his. The draft 4 star classification-scheme for linked open cultural metadata approaches the usefulness issue from a licensing point of view. If you cannot use someone’s data because of onerous licensing conditions, it is obviously not useful to you.
permission to use the metadata is contingent on providing attribution in a way specified by the provider
metadata can only be combined with data that allows re-distributions under the terms of this license
So when you are addressing opening up your data, you should be asking yourself how useful it will be to those that want to consume and use it. Obviously you would expect me to encourage you to publish your data as ★★★★★ – ★★★★ to make it as technically useful with as few licensing constraints as possible. Many just focus on Tim’s stars. However, if you put yourself in the place of an app or application developer, a one LOD-LAM star dataset is almost unusable whilst still complying with the licence.
So think before you open – put yourself in the consumers’ shoes – publish your data with the stars.
One final thought: when you do publish your data, tell your potential viewers, consumers, and users in very simple terms what you are publishing and under what terms, as the UK Government does through data.gov.uk using the Open Government Licence, which I believe is a ★★★.
Building on the success of the Open Government Licence, The National Archives has extended the scope of its licensing policy, encouraging and enabling even easier re-use of a wider range of public sector information.
The UK Government Licensing Framework (UKGLF), the policy and legal framework for the re-use of public sector information, now offers a growing portfolio of licences and guidance to meet the diverse needs and requirements of both public sector information providers and re-user communities.
So the [data publisher’s] thought process should be to try to publish under the OGL and then, only if ownership/licensing/cost of production provide an overwhelming case to be more restrictive, utilise these extensions and/or guidance.
My concern, having listened to many questions at conferences from what I would characterise as government conservative traditionalists, is that many will start at the charge-for/non-commercial use end of this licensing spectrum because of the fear/danger of opening up data too openly. I do hope my concerns are unfounded and that the use of these extensions will be the exception, with the OGL being the de facto licence of choice for all public sector data.
This post was also published on the Talis Consulting Blog