OCLC Declare OCLC Control Numbers Public Domain

Little things mean a lot.  Little things that are misunderstood often mean a lot more.

Take the OCLC Control Number, often known as the OCN, for instance.

Every time an OCLC bibliographic record is created in WorldCat it is given a unique number from a sequential set – a process that has already taken place over a billion times.  The individual number is represented in the record it is associated with.  Over time these numbers have become a useful part of the processing not only of OCLC and its member libraries but, as a unique identifier proliferated across the library domain, of partners, publishers and many others.

Like anything that has been around for many years, assumptions and even myths have grown up around the purpose and status of this little string of digits.  Many stem from a period when there was concern, voiced by several people including me at the time, about the potentially over-restrictive reuse policy for records created by OCLC and its member libraries.  Some assumed that the way to tell whether a bibliographic record was an OCLC record was to see if it contained an OCN.  The effect was that some people and organisations invested effort in creating processes to remove OCNs from their records – processes that I believe, in a few cases, are still in place.

So in the current and future climate of open sharing of data, where for instance WorldCat Linked Data is published under an open data license, such assumptions and practices are an anomaly.

I signalled that OCLC were looking at this in my session (Linked Data Progress) at IFLA in Singapore a few weeks ago. I am now pleased to say that the wording I was hinting at has now appeared on the relevant pages of the OCLC web site:

Use of the OCLC Control Number (OCN)
OCLC considers the OCLC Control Number (OCN) to be an important data element, separate from the rest of the data included in bibliographic records. The OCN identifies the record, but is not part of the record itself. It is used in a variety of human and machine-readable processes, both on its own and in subsequent manipulations of catalog data. OCLC makes no copyright claims in individual bibliographic elements nor does it make any intellectual property claims to the OCLC Control Number. Therefore, the OCN can be treated as if it is in the public domain and can be included in any data exposure mechanism or activity as public domain data. OCLC, in fact, encourages these uses as they provide the opportunity for libraries to make useful connections between different bibliographic systems and services, as well as to information in other domains.

The announcement of this confirmation/clarification of the status of OCNs was made yesterday by my colleague Jim Michalko on the Hanging Together blog.

When discussing this with a few people, one question often came up: why just declare OCNs as public domain, why not license them as such? The following answer from the OCLC website, I believe, explains why:

The OCN is an individual bibliographic element, and OCLC doesn’t make any copyright claims either way on specific data elements. The OCN can be used by other institutions in ways that, at an aggregate level, may have varying copyright assertions. Making a positive, specific claim that the OCN is in the public domain might interfere with the copyrights of others in those situations.

As I said, this is a little thing, but if it clears up some misunderstandings and consequential anomalies, it will contribute to the usefulness of OCNs and ease the path towards a more open and shared data environment.

Get Yourself a Linked Data Piece of WorldCat to Play With

You may remember my frustration, a couple of months ago, at being in the air when OCLC announced the addition of Schema.org marked up Linked Data to all resources in WorldCat.org.   Those of you who attended the OCLC Linked Data Round Table at IFLA 2012 in Helsinki yesterday will know that I got my own back on the folks who publish the press releases at OCLC, by announcing the next WorldCat step along the Linked Data road whilst they were still in bed.

The Round Table was an excellent, very interactive session with Neil Wilson from the British Library, Emmanuelle Bermes from Centre Pompidou, and Martin Malmsten of the National Library of Sweden, which I will cover elsewhere.  For now, you will find my presentation Library Linked Data Progress on my SlideShare site.

After we experimentally added RDFa embedded linked data, using Schema.org markup and some proposed Library extensions, to WorldCat pages, one of the questions I was asked most often was: where can I get my hands on some of this raw data?

We are taking the application of linked data to WorldCat one step at a time so that we can learn from how people use and comment on it.  So at that time, if you wanted to see the raw data, the only way was to use a tool [such as the W3C RDFa 1.1 Distiller] to parse the data out of the pages, just as the search engines do.

So I am really pleased to announce that you can now download a significant chunk of that data as RDF triples.   Especially in experimental form, providing the whole lot as a download would have been a bit of a challenge, even just in disk space and bandwidth terms.  So which chunk to choose was a question.  We could have chosen a random selection, but decided instead to pick the most popular resources in WorldCat, in terms of holdings – an interesting selection in its own right.

To make the cut, a resource had to be held by more than 250 libraries.  It turns out that almost 1.2 million fall into this category, so a sizeable chunk indeed.   To get your hands on this data, download the 1Gb gzipped file. It is in RDF N-Triples form, so you can take a look at the raw data in the file itself.  Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them.
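
If you just want to poke at a slice of the data before setting up a full triplestore, a few lines of Python with rdflib will do. This is a minimal sketch, assuming you have cut a manageable sample from the full N-Triples file (rdflib holds everything in memory, so don't feed it all 80 million triples); the file name and the choice of schema.org property queried are illustrative only.

```python
from rdflib import Graph

# Load a local sample cut from the WorldCat most-widely-held download
# (the file name is a placeholder for wherever you saved your sample).
g = Graph()
g.parse("worldcat-sample.nt", format="nt")
print(f"Loaded {len(g)} triples")

# A first SPARQL query: list a handful of resources and their schema.org names.
query = """
PREFIX schema: <http://schema.org/>
SELECT ?item ?name
WHERE {
  ?item schema:name ?name .
}
LIMIT 10
"""
for item, name in g.query(query):
    print(item, "-", name)
```

The same sort of query, scaled up, is what you would run against a proper triplestore such as 4Store once the full dump is loaded.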

Another area of question around the publication of WorldCat linked data has been about licensing.   Both the RDFa embedded data and the download are published as open data under the Open Data Commons Attribution License (ODC-BY), with reference to the community norms put forward by the members of the OCLC cooperative who built WorldCat.  The theme of many of the questions has been along the lines of “I understand what the license says, but what does this mean for attribution in practice?”

To help clarify how you might attribute ODC-BY licensed WorldCat, and other OCLC linked data, we have produced attribution guidelines addressing some of the uncertainties in this area.  You can find these at http://www.oclc.org/data/attribution.html.  They address several scenarios, from documents containing WorldCat derived information to referencing WorldCat URIs in your linked data triples, suggesting possible ways to attribute the OCLC WorldCat source of the data.   As guidelines, they obviously cannot cover every possible situation which may require attribution, but hopefully they will cover most and can be adapted to other similar ones.

As I say in the press release, posted after my announcement, we are really interested to see what people will do with this data.  So let us know, and if you have any comments on any aspect of its markup, schema.org extensions, publishing, or on our attribution guidelines, drop us a line at data@oclc.org.

OCLC WorldCat Linked Data Release – Significant In Many Ways

Typical!  Since joining OCLC as Technology Evangelist, I have been preparing myself to be one of the first to blog about the release of linked data describing the hundreds of millions of bibliographic items in WorldCat.org. So where am I when the press release hits the net?  35,000 feet above the North Atlantic heading for LAX, that’s where – life just isn’t fair.

By the time I am checked in to my Anaheim hotel, ready for the ALA Conference, this will be old news.  Nevertheless it is significant news, significant in many ways.

OCLC have been at the leading edge of publishing bibliographic resources as linked data for several years.  At dewey.info they have been publishing the top levels of the Dewey classifications as linked data since 2009.  As announced yesterday, this has now been increased to encompass 32,000 terms, such as this one for the transits of Venus.  Also around for a few years is VIAF (the Virtual International Authority File) where you will find URIs published for authors, such as this well known chap.  These two were more recently joined by FAST (Faceted Application of Subject Terminology), providing usefully applicable identifiers for Library of Congress Subject Headings and combinations thereof.

Despite this leading position in the sphere of linked bibliographic data, OCLC has attracted some criticism over the years for not biting the bullet and applying it to all the records in WorldCat.org as well.  As today’s announcement now demonstrates, they have taken their linked data enthusiasm to the heart of their rich, publicly available, bibliographic resources – publishing linked data descriptions for the hundreds of millions of items in WorldCat.

Let me dissect the announcement a bit….

[Screenshot: Harry Potter and the Deathly Hallows (Book, 2007) on WorldCat.org]  First significant bit of news – WorldCat.org is now publishing linked data for hundreds of millions of bibliographic items – that’s a heck of a lot of linked data by anyone’s measure, and by far the largest linked bibliographic resource on the web. It is also linked data describing things that, for decades, librarians in tens of thousands of libraries all over the globe have been carefully cataloguing so that the rest of us can find out about them.  Just the sort of authoritative resources that will help stitch the emerging web of data together.

Second significant bit of news – the core vocabulary used to describe these bibliographic assets comes from schema.org.  Schema.org is the initiative backed by Google, Yahoo!, Microsoft, and Yandex, to provide a generic high-level vocabulary/ontology to help mark up structured data in web pages so that those organisations can recognise the things being described and improve the services they can offer around them.  A couple of examples being Rich Snippet results and inclusion in the Google Knowledge Graph.

As I reported a couple of weeks back, from the Semantic Tech & Business Conference, some 7-10% of indexed web pages already contain schema.org markup, as microdata or RDFa.   It may at first seem odd for a library organisation to use a generic web vocabulary to mark up its data – but just think who the consumers of this data are, and which vocabularies they are most likely to recognise.  Just for starters, embedding schema.org data in WorldCat.org pages immediately makes them understandable by the search engines, vastly increasing the findability of these items.

Third significant bit of news – the linked data is published both in human readable form and in machine readable RDFa on the standard WorldCat.org detail pages.  You don’t need to go to a special version or interface to get at it; it is part of the normal interface. As you can see from the screenshot of a WorldCat.org item above, there is now a Linked Data section near the bottom of the page. Click to open up that section and see the linked data in human readable form.  You will see the structured data that the search engines and other systems will get from parsing the RDFa encoded data, within the html that creates the page in your browser.  Not very pretty to human eyes I know, but just the kind of structured data that systems love.

Fourth significant bit of news – OCLC are proposing to cooperate with the library and wider web communities to extend Schema.org, making it even more capable for describing library resources.  With the help of the W3C, Schema.org is working with several industry sectors to extend the vocabulary to be more capable in their domains – news and e-commerce being a couple of already accepted examples.  OCLC is playing its part in doing this for the library sector.

Take a closer look at the markup on WorldCat.org and you will see attributes from a library vocabulary – attributes such as library:holdingsCount and library:oclcnum.  This library vocabulary is OCLC’s conversation starter, with which we want to kick off discussions with interested parties, from the library and other sectors, about proposing a basic extension to schema.org for library data.  What better way of testing out such a vocabulary – mark up several million records with it, publish them and see what the world makes of them.
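
To make that conversation starter a little more concrete, here is a rough sketch, in Python with rdflib, of the kind of triples the RDFa markup on a WorldCat.org item page boils down to. The item URI, the property values, and the namespace URI used for the library: prefix are all placeholders of my own choosing for illustration, not definitive statements of the WorldCat data.

```python
from rdflib import Graph, Literal, Namespace, URIRef

SCHEMA = Namespace("http://schema.org/")
# Assumed namespace for the proposed library extension vocabulary.
LIB = Namespace("http://purl.org/library/")

g = Graph()
g.bind("schema", SCHEMA)
g.bind("library", LIB)

# A hypothetical WorldCat item with placeholder values.
book = URIRef("http://www.worldcat.org/oclc/0000000000")
g.add((book, SCHEMA["name"], Literal("Harry Potter and the Deathly Hallows")))
g.add((book, SCHEMA["author"], URIRef("http://viaf.org/viaf/116796842/")))
g.add((book, LIB["oclcnum"], Literal("0000000000")))
g.add((book, LIB["holdingsCount"], Literal(250)))

print(g.serialize(format="turtle"))
```

The core bibliographic properties come from schema.org, while the handful of library-specific attributes sit in the proposed extension – which is exactly the discussion OCLC wants to have.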

Fifth significant bit of news – the WorldCat.org linked data is published under an Open Data Commons Attribution (ODC-BY) license, so it will be openly usable by many for many purposes.

Sixth significant bit of news – this is an experimental release.  This is the start, not the end, of a process.  We know we have not got this right yet.  There are more steps to take around publishing this data in ways in addition to RDFa markup embedded in page html – not everyone can, or will want to, parse pages to get at the data.  There are obvious areas for discussion around the use of schema.org and the proposed library extension to it.  There are areas for discussion about the application of the ODC-BY license and the attribution requirements it asks for.  Over the coming months OCLC wants to constructively engage with all who are interested in this process.  It is only with the help of the library and wider web communities that we can get it right.  In that way we can ensure that WorldCat linked data will be beneficial for the OCLC membership, libraries in general, and a great resource on the emerging web of data.

For more information about this release, check out the background to linked data at OCLC, join the conversation on the OCLC Developer Network, or email data@oclc.org.

As you can probably tell I am fairly excited about this announcement.  This, and future stuff like it, are behind some of my reasons for joining OCLC.  I can’t wait to see how this evolves and develops over the coming months.  I am also looking forward to engaging in the discussions it triggers.

Libraries Through the Linked Data Telescope

For an interested few associated with libraries and data, like myself, Linked Data has been a topic of interest and evangelism for several years.  For instance, I gave this presentation at IFLA 2010.

However, Linked Data and Linked Open Data have now arrived on the library agenda.  Last summer, it was great to play a small part in the release of the British National Bibliography as Linked Data by the British Library – openly available via Talis and their Kasabi Platform.  Late last year the Library of Congress announced that Linked Data and RDF was on their roadmap, soon followed by the report and plan from Stanford University with Linked Data at its core.  More recently still, Europeana have opened up access to a large amount of cultural heritage data, including library data.

Even more recently I note that OCLC, at their EMEA Regional Council Meeting in Birmingham this week, see Linked Data as an important topic on the library agenda.

The consequence of this rise in interest in library Linked Data is that the community is now exploring and debating how to migrate library records from formats such as MARC into this new RDF.  In my opinion there is a great danger here of getting bogged down in the detail of how to represent every scintilla of information from a library record in every linked data view that might represent the thing that record describes.  This is hardly surprising, as most engaged in the debate come from an experience where, if something was not preserved on a physical or virtual record card, it would be lost forever.   By concentrating on record/format transformation I believe that they are using a Linked Data telescope to view their problem, but are not necessarily looking through the correct end of that telescope.

Let me explain what I mean by this.  There is a massive duplication of information in library catalogues.  For example, every library record describing a copy of a book about a certain boy wizard will contain one or more variations of the string of characters “Rowling, J. K.”.  To us humans it is fairly easy to infer that all of them represent the same person, as described by each cataloguer.  For a computer, they are just strings of characters.

OCLC host the Virtual International Authority File (VIAF) project which draws together these strings of characters and produces a global identifier for each author.  Associated with that author they collect the local language representations of their name.

One simple step down the Linked Data road would be to replace those strings of characters in those records with the relevant VIAF permalink, or URI – http://viaf.org/viaf/116796842/.  One result of this would be that your system could follow that link and return an authoritative naming of that person, with the added benefit of it being available in several languages.  A secondary, and more powerful, result is that any process scanning such records can identify exactly which [VIAF identified] person is the creator, regardless of the local language or formatting practices.

Why stop at the point of only identifying creators with globally unique identifiers?   Why not use an identifier to represent the combined concept of a text, authored by a person, published by an organisation, in the form of a book – each of those elements having their own unique identifiers.  If you enabled such a system on the Linked Data web, what would a local library catalogue need to contain?  Probably only a local identifier of some sort, with links to local information such as supplier, price, date of purchase, license conditions, physical location, etc., plus a link to the global description provided by a respected source such as Open Library, Library of Congress, British Library, OCLC etc.  A very different view of what might constitute a record in a local library.
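
As a thought experiment, here is what such a slimmed-down local record might look like alongside the global description it points to, sketched with rdflib. Every URI, property and value here is illustrative – the local: namespace and its properties are invented for the example – but the shape is the point: the work is described once, globally, and the local data shrinks to local facts plus a link.

```python
from rdflib import Graph, Literal, Namespace, URIRef

DCT = Namespace("http://purl.org/dc/terms/")
# A hypothetical namespace for purely local library data.
LOCAL = Namespace("http://library.example.org/terms/")

g = Graph()
g.bind("dcterms", DCT)
g.bind("local", LOCAL)

# The global description, maintained by a respected source (placeholder URI).
work = URIRef("http://www.worldcat.org/oclc/0000000000")
g.add((work, DCT["title"], Literal("Harry Potter and the Deathly Hallows")))
# The creator is the VIAF URI, not a locally formatted string such as "Rowling, J. K."
g.add((work, DCT["creator"], URIRef("http://viaf.org/viaf/116796842/")))

# The local 'record' shrinks to local facts plus a link to the global concept.
copy = URIRef("http://library.example.org/items/12345")
g.add((copy, LOCAL["holdingOf"], work))
g.add((copy, LOCAL["shelfLocation"], Literal("Main library, second floor")))
g.add((copy, LOCAL["dateOfPurchase"], Literal("2011-09-01")))

print(g.serialize(format="turtle"))
```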

So far I have looked at this from the library point of view. What about the view from the rest of the world?

I contend that most people wishing to reference the books and journal articles curated and provided by libraries would be happiest if they could refer to a global identifier that represents the concept of a particular work.  Such consumers would only need a small sub-set of the data assembled by a library for basic display and indexing purposes – title, author.   The next question may be: where is there a locally available copy of this book or article that I can access?  In the model I describe, where these global identifiers are linked to local information such as loan status, the lookup would be a simple process compared with a current contrived search against inferred strings of characters.

Currently Google and other search engines have great difficulty in managing the massive number of library catalogue pages that will match a search for a book title.  As referred to previously, Google are assembling a graph of related things.  In this context the thing is the concept of the book or article, not the thousands of library catalogue pages describing the same thing.

Pulling these thoughts together, and looking down the Linked Data telescope from the non-library end, I envisage a layered approach to accessing library data.

  • A simple global identifier, or interlinked identifiers from several respected sources, that represents the concept of a particular thing (book, article, etc.)
  • A simple set of high-level description information for each thing – links to author, title, etc., associated with the identifier.   This level of information would be sufficient for many uses on the web and could contain only publicly available information.
  • For those wishing for more in-depth bibliographic information, those unique identifiers, either directly or via SameAs links, could link you to more of the rich resources catalogued by libraries around the world, which may or may not be behind slightly less open licensing or commercial constraints.
  • Finally library holding/access information would be available, separate from the constraints of the bibliographic information, but indexed by those global identifiers.

To get us to such a state will require a couple of changes in the way libraries do things.

Firstly, the rich data collated in current library records should be used to populate a Linked Data model of the things those records describe – not just reproducing the records we have in another format. An approach I expanded upon in a previous post, Create Data Not Records.

Secondly, as such a change would be a massive undertaking, libraries will need to work together to do this.  The centralised library data holders have a great opportunity to drive this forward.  A few years ago, the distributed hosted-on-site landscape of library management systems would have prevented such a change happening.  However, with library system software-as-a-service becoming an increasingly viable option for many, it is not the libraries that would have to change, just the suppliers of the systems they use.

Monkey picture from fPat on Flickr

More Linked Open Data under a More Open License from German National Library

The German National Library (DNB) has launched a Linked Data version of the German National Bibliography.

The bibliographic data of the DNB’s main collection (apart from the printed music and the collection of the Deutsches Exilarchiv) and the serials (magazines, newspapers and series of the German Union Catalogue of serials (ZDB)) have been converted.  Henceforth the RDF/XML-representation of the records are available at the DNB portal. This is an experimental service that will be continually expanded and improved.

This is a welcome extension to their Linked Data Service, previously delivering authority data.  Documentation on their data and modelling is available, however the English version has yet to be updated to reflect this latest release.

Links to RDF/XML versions of individual records are available directly from the portal user interface, with the usual Linked Data content negotiation techniques available to obtain HTML or RDF/XML as required.
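
Content negotiation here simply means asking the same URI for different representations via the HTTP Accept header. A minimal Python sketch using the requests library; the record URI below is a placeholder, and the actual URI patterns are described in the DNB documentation.

```python
import requests

# Placeholder identifier – substitute a real record URI taken from the DNB portal.
record_uri = "https://d-nb.info/record-id-goes-here"

# Ask the URI for its RDF/XML representation...
rdf = requests.get(record_uri, headers={"Accept": "application/rdf+xml"})
print(rdf.status_code, rdf.headers.get("Content-Type"))

# ...and for its HTML representation.
html = requests.get(record_uri, headers={"Accept": "text/html"})
print(html.status_code, html.headers.get("Content-Type"))
```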

This is a welcome addition to the landscape of linked open bibliographic data, joining others such as the British Library.

Also to be welcomed is their move to CC0 licensing, removing barriers, real or assumed, to the reuse of this data.

I predict that this will be the first of many more such announcements this year from national and other large libraries opening up their metadata resources as Linked Open Data.  The next challenge will be to identify the synergies between these individual approaches to modelling bibliographic data and balance the often competing needs of the libraries themselves and potential consumers of their data who very often do not speak ‘library’.

Somehow [without engaging in the traditional global library cooperation treacle-like processes that take a decade to publish a document] we need to draw together a consistent approach to modelling and publishing Linked Open Bibliographic Data for the benefit of all – not just the libraries.  With input from the DNB, British Library, Library of Congress, European National Libraries, Stanford, and others such as Schema.org, W3C, Open Knowledge Foundation etc., we could possibly get a consensus on an initial approach.  Aiming for a standard would be both too restrictive, and based on experience, too large a bite of the elephant at this early stage.

Ookaboo Release 1,000,000 Free Images For 500,000 Topics + RDF Too

Ookaboo, “free pictures of everything on earth”, have released nearly a million public domain and Creative Commons licensed stock images mapped with precision to concepts, instead of just words.

Because it uses concepts instead of words, Ookaboo is much more accurate than other sources of free stock photos. You’ll find free pictures quickly, because you will only be seeing relevant images.

Unlike some other free pictures sites, images in Ookaboo are public domain or creative commons and can be used freely for blogs, web sites, schoolwork, publications, and other creative projects.


Here, for instance, is a picture of the village next to where I live, discovered in Ookaboo associated with the village as a topic:

Picture of Fladbury thanks to Jennifer Luther Thomas and Ookaboo!

But there is more.

Ookaboo have released an RDF dump of the metadata behind the images, the concept mappings, and links to concepts in Freebase and DBpedia for topics such as places, people and organism classifications.

This is not a one-off exercise; [Ookaboo] intend to use an automated process to make regular releases of the Ookaboo dump in the future.

For the SPARQLy inclined, they also provide an overview of the structure, namespaces, and properties used in the RDF plus a SPARQL cookbook of example queries.

This looks to be a great resource and, when merged with other data sets, potentially capable of adding significant benefit.

I only have one concern, around the licensing.  Not the licensing of the images themselves – the relevant licensing is identified clearly in the data – but the licensing of the RDF dump itself as CC-BY-SA.   In the terms of use they indicate:

We require the following attribution:

  • Papers, books, and other works that incorporate Ookaboo data or report results from Ookaboo data must cite Ookaboo and Ontology2.
  • HTML pages that incorporate images from Ookaboo must include a hyperlink to the page describing the image that is linked with the ookaboo:ookabooPage property.
  • Data products derived from Ookaboo must make it possible to maintain the provenance and attribution chain for images. In the case an RDF dump, it is sufficient to provide a connection to Ookaboo identifiers and documentation that refers users to the Ookaboo RDF dump. SPARQL endpoints must contain attribution information, which can be done by importing selected records from the Ookaboo dump.

No problem in principle, but in practice some may find the share-alike elements of the last item a bit difficult to comply with, once you start building applications on layers upon layers of data APIs.  Commercial players especially may shy away from using Ookaboo because of the copyleft ramifications. For the data itself, I would have thought CC-BY would have been sufficient.

Maybe Paul Houle, of Ontology2 who are behind Ookaboo, would like to share his reasoning behind this.

Will This Flood of Open Data Wash Past Us?

@ePSIplatform features fairly prominently in the stream of tweets that waft across my desktop every day – it comes from the European Public Sector Information (PSI) Platform (Europe’s one-stop shop on PSI re-use), working to stimulate and promote PSI re-use and open data initiatives.

In amongst the useful pointers to news, comment, and documents, I have been recently conscious of an increasing flow of tweets like these:

[Screenshots of @ePSIplatform tweets announcing new open data releases]

This is good news.  More and more city, local and national governments and public bodies are releasing data as open data.  Of course the reference to open here is in relation to the licensing of these data, but how open in access are they?  It is not that easy to find out.

To be truly open and broadly useful, data has to be both licensed openly, with few or no use constraints, and have as few technical barriers to consuming it as possible.  In many cases there will be enough enthusiasts for a particular source with the motivation to take data in whatever form and pick their way through it to get the value they need.  These enthusiasts provide great blogging fodder and examples for presentations, but do not represent the significant value that should, and is predicted to, flow from the open data and transparency agenda spreading through governments across the globe.

The five star data rating scheme from Sir Tim Berners-Lee is a simple way to describe the problem and encourage publishers to strive to achieve a 5 star Linked Open Data rating, while not discouraging openly publishing in any form in the first place.  Check out my earlier post What Is Your Data’s Star Rating(s)? where I dig in to both types of openness a bit further.

Policy makers and data openness enthusiasts who are behind this burgeoning flood of announcements [as a broad generality] get the licensing issues – use CC0 or copy the UK’s OGL.  However, what concerns me is that they tend to shy away from promoting the removal of the technical barriers that could stifle the broad adoption, and consequential flow of economic benefit, that they predict.

We could look back in a few years to this time of missed opportunity and say it was obvious that the initiatives would fail because we didn’t make it easy for those that could have delivered the value.  We let the flood of enthusiastic initiatives wash past us without grabbing the opportunities to establish easy, consistent and repeatable ways to release and build upon the value in data for all, not just an enthusiastic few. We need to get this right if open data is going to fuel the next revolution.

Some are thinking in the same way.  CKAN, for instance, have delivered an extension to calculate the [technical] openness of datasets as listed on the Dataset Openness page of the Data Hub.  Great idea, but I would suggest that most data publishers will never find their way to such a listing.  Where are the stars on the individual data set pages?  Where are the star rating badges of approval that publishers can put on their sites to show off?

We have made great strides so far in promoting the opening of public and other sector information; the ePSIplatform stream is testament to that.  Somehow we need to capitalise on this great start and do a better job of marketing the benefits of technically opening up your data.  5 Star badge of approval anyone?

Stream photo from jjjj56cp on Flickr

What Is Your Data’s Star Rating(s)?

The Linked Data movement was kicked off in mid 2006 when Tim Berners-Lee published his now famous Linked Data Design Issues document.  Many had been promoting the approach of using W3C Semantic Web standards to achieve the effect and benefits, but it was his document and the use of the term Linked Data that crystallised it, gave it focus, and a label.

In 2010 Tim updated his document to include the Linked Open Data 5 Star Scheme to “encourage people — especially government data owners — along the road to good linked data”. The key message was to open up your data.  You may have the best RDF encoded and modelled data on the planet, but if it is not associated with an open license, you don’t get even a single star.  That emphasis on government data owners is unsurprising, as he was at the time, and still is, working with the UK and other governments as they come to terms with the transparency thing.

Once you have cleared the hurdle of being openly licensed (more of this later), your data climbs the steps of Linked Open Data stardom based on how available and therefore useful it is. So:

★ Available on the web (whatever format) but with an open licence, to be Open Data
★★ Available as machine-readable structured data (e.g. Excel instead of an image scan of a table)
★★★ As (2) plus non-proprietary format (e.g. CSV instead of Excel)
★★★★ All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
★★★★★ All the above, plus: link your data to other people’s data to provide context

By usefulness I mean how low the barrier is to people using your data for their purposes.  The usefulness of 1 star data does not spread much beyond looking at it on a web page.  3 star data can at least be downloaded and programmatically worked with, to deliver analysis or for specific applications, using non-proprietary tools.  5 star data, by contrast, is consumable in a standard form, RDF, and contains links to other (4 or 5 star) data out on the web in the same standard consumable form.  It is at the 5 star level that the real benefits of Linked Open Data kick in, which is why the scheme encourages publishers to strive for the highest rating.

Tim’s scheme is not the only open data star rating scheme in town.  There is another that emerged from the LOD-LAM Summit in San Francisco last summer – fortunately it is complementary and does not compete with his.  The draft 4 star classification-scheme for linked open cultural metadata approaches the usefulness issue from a licensing point of view.  If you cannot use someone’s data because of onerous licensing conditions, it is obviously not useful to you.

★★★★ Public Domain (CC0 / ODC PDDL / Public Domain Mark)

  • metadata can be used by anyone for any purpose
  • permission to use the metadata is not contingent on anything
  • metadata can be combined with any other metadata set (including closed metadata sets)
★★★ Attribution License (CC-BY / ODC-BY) when the licensor considers linkbacks to meet the attribution requirement

  • metadata can be used by anyone for any purpose
  • permission to use the metadata is contingent on providing attribution by linkback to the data source
  • metadata can be combined with any other metadata set, including closed metadata sets, as long as the attribution link is retained
★★ Attribution License (CC-BY / ODC-BY) with another form of attribution

  • metadata can be used by anyone for any purpose
  • permission to use the metadata is contingent on providing attribution in a way specified by the provider
  • metadata can be combined with any other metadata set (including closed metadata sets)
★ Attribution Share-Alike License (CC-BY-SA/ODC-ODbL)

  • metadata can be used by anyone for any purpose
  • permission to use the metadata is contingent on providing attribution in a way specified by the provider
  • metadata can only be combined with data that allows re-distributions under the terms of this license

So when you are addressing opening up your data, you should be asking yourself how useful it will be to those that want to consume and use it.  Obviously you would expect me to encourage you to publish your data as ★★★★★ ★★★★ (five technical stars plus four licensing stars) to make it as technically useful as possible, with as few licensing constraints as possible.  Many just focus on Tim’s stars; however, if you put yourself in the place of an app or application developer, a one LOD-LAM star dataset is almost unusable whilst still complying with the licence.

So think before you open – put yourself in the consumers’ shoes – publish your data with the stars.

One final thought: when you do publish your data, tell your potential viewers, consumers, and users in very simple terms what you are publishing and under what terms, as the UK Government does through data.gov.uk using the Open Government Licence, which I believe is a ★★★.

Will Europe’s National Libraries Open Data In An Open Way?

A significant step towards open bibliographic data was made in Copenhagen this week at the 25th anniversary meeting of the Conference of European National Librarians (CENL) hosted by the Royal Library of Denmark. From the CENL announcement:

…the Conference of European National Librarians (CENL), has voted overwhelmingly to support the open licensing of their data. What does that mean in practice? It means that the datasets describing all the millions of books and texts ever published in Europe – the title, author, date, imprint, place of publication and so on, which exists in the vast library catalogues of Europe – will become increasingly accessible for anybody to re-use for whatever purpose they want. The first outcome of the open licence agreement is that the metadata provided by national libraries to Europeana.eu, Europe’s digital library, museum and archive, via the CENL service The European Library, will have a Creative Commons Universal Public Domain Dedication, or CC0 licence. This metadata relates to millions of digitised texts and images coming into Europeana from initiatives that include Google’s mass digitisations of books in the national libraries of the Netherlands and Austria. ….it will mean that vast quantities of trustworthy data are available for Linked Open Data developments

There is much to be welcomed here. Firstly, that the vote was overwhelming.  Secondly, that the open license chosen to release this data under is Creative Commons CC0, thus enabling reuse for any purpose.

You cannot expect such a vote to cover all the detail, but the phrase ‘trustworthy data are available for Linked Open Data developments’ does give rise to some possible concerns for me.  My concern is not that this implies the data will need to be published as Linked Data – that too should be welcomed.  My concern comes from some of the library focused Linked Data conversations, presentations and initiatives I have experienced over the last few months and years.  Many in the library community who have worked with Linked Data lean towards using Linked Data techniques to reproduce the very fine detailed structure and terminology of their bibliographic records as a representation of those records in RDF (the Linked Data data format).  Two examples of this come to mind:

  1. The recent release of an RDF representation of the MARC21 elements and vocabularies by MMA – possibly of internal use only to someone transforming a library’s MARC record collection to identify concepts and entities to then describe as linked data.  Mind-numbingly impenetrable for anyone who is not a librarian looking for useful data.
  2. The Europeana Data Model (EDM).  An impressive and elegant Linked Data RDF representation of the internal record structure and process concerns of Europeana.  However, it is again not modelled in a way that makes it easy for those outside the [Europeana] library community to engage with, understand, and extract meaning from.

The fundamental issue I have with the first of these, and other examples, is that their authors have approached this from the direction of wishing to encode their vast collections of bibliographic records as Linked Data.  They would have ended up with a result more open [to the wider world] if they had used the contents of their records as a rich resource from which to build descriptions of the resources they hold.  That way you end up with descriptions of things (books, authors, places, publishers, events, etc.) as against descriptions of records created by libraries.

Fortunately there is an excellent example of a national library publishing Linked Data which describes the things they hold.  The British Library have published descriptions of 2.6 million items they hold in the form of the British National Bibliography.  I urge those within Europeana and the European national libraries community who will be involved in this opening up initiative to take a close look at the evolving data model that the BL have shared, to kick-start the conversation on the most appropriate [Linked Data] techniques to apply to bibliographic data.  For more detail see this Overview of the British Library Data Model.

This opening up of data is a great opportunity for trusted, librarian-curated data to become a core part of the growing web of data – one that should not be missed.  We must be aware of previous missed opportunities, such as the way MARCXML just slavishly recreated an old structure in a new format.   Otherwise we could end up with what could be characterised, in web integration terms, as a significant open data white elephant.

Nevertheless I am optimistic.  With examples such as the British Library BNB backing up this enthusiastic move to open up a vast collection of metadata, in a useful way that will stimulate Linked Data development, I have some confidence in a good outcome.

Disclosure: Bibliographic domain experts from the British Library worked with Linked Data experts from the Talis team in the evolution of the BNB data model – something that could be extended and/or repeated with other national and international library organisations.

This post was also published on the Talis Consulting Blog

Will Government Open Licence Extensions be a haven for the timid?

The National Archives announced today that UK government licensing policy has been extended to make more public sector information available:

Building on the success of the Open Government Licence, The National Archives has extended the scope of its licensing policy, encouraging and enabling even easier re-use of a wider range of public sector information.

The UK Government Licensing Framework (UKGLF), the policy and legal framework for the re-use of public sector information, now offers a growing portfolio of licences and guidance to meet the diverse needs and requirements of both public sector information providers and re-user communities.

On the surface this move is to be welcomed.  It provides, amongst other things, licensing choices and guidance for re-using information free of charge for non-commercial purposes – the Non-Commercial Government Licence – plus guidance on licensing where charges apply and for the licensing of software and source code.

All this is available from the UK Government Licensing Framework area of the National Archives site, along with FAQs and other useful supporting information, including machine readable licenses.

As the press release says, the extensions build on the success of the Open Government Licence (OGL) and are designed to cover what the OGL cannot.

So the [data publishers] thought process should be to try to publish under the OGL and then, only if ownership/licensing/cost of production provide an overwhelming case to be more restrictive, utilise these extensions and/or guidance.

My concern, having listened to many questions at conferences from what I would characterise as government conservative traditionalists, is that many will start at the charge-for/non-commercial use end of this licensing spectrum because of the fear/danger of opening up data too openly.  I do hope my concerns are unfounded and that the use of these extensions will be the exception, with the OGL being the de facto licence of choice for all public sector data.

This post was also published on the Talis Consulting Blog