A couple of months back I spoke about the preview release of Works data from WorldCat.org. Today OCLC published a press release announcing the official release of 197 million descriptions of bibliographic Works.
A Work is a high-level description of a resource, containing information such as author, name, description, and subjects, common to all editions of the work. The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary. In the case of a WorldCat Work description, it also contains [Linked Data] links to the individual, OCLC-numbered editions already shared from WorldCat.org.
These links (URIs) lead, where available, to authoritative sources for people, subjects, etc. Where not available, placeholder URIs have been created to capture information not yet available or identified in such authoritative hubs. As you would expect from a linked data hub, the works are available in common RDF serializations – Turtle, RDF/XML, N-Triples, JSON-LD – using the Schema.org vocabulary, under an open data license.
The obvious question is “how do I get a work id for the items in my catalogue?”. The simplest way is to use the already released linked data from WorldCat.org. If you have an OCLC Number (e.g. 817185721) you can create the URI for that particular manifestation by prefixing it with ‘http://worldcat.org/oclc/’ thus: http://worldcat.org/oclc/817185721
In the linked data that is returned, either on screen in the Linked Data section, or in the RDF in your desired serialization, you will find the following triple which provides the URI of the work for this manifestation:
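As a sketch of how that lookup can be automated, a few lines of Python build the manifestation URI from an OCN. The exampleOfWork predicate and the work URI shown in the comments are my assumptions about the shape of that triple, so check them against the data you actually get back:

```python
# Sketch: build a WorldCat manifestation URI from an OCLC Number (OCN).
WORLDCAT_OCLC_PREFIX = "http://worldcat.org/oclc/"

def oclc_uri(ocn):
    """Return the WorldCat manifestation URI for an OCLC number."""
    return WORLDCAT_OCLC_PREFIX + str(ocn).strip()

uri = oclc_uri("817185721")
print(uri)  # http://worldcat.org/oclc/817185721

# Dereferencing that URI with an RDF Accept header should yield a triple
# along the lines of (illustrative, not guaranteed to be the exact form):
#   <http://worldcat.org/oclc/817185721> schema:exampleOfWork <work-URI> .
```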
To quote Neil Wilson, Head of Metadata Services at the British Library:
With this release of WorldCat Works, OCLC is creating a significant, practical contribution to the wider community discussion on how to migrate from traditional institutional library catalogues to popular web resources and services using linked library data. This release provides the information community with a valuable opportunity to assess how the benefits of a works-based approach could impact a new generation of library services.
This is a major first step in a journey to provide linked data views of the entities within WorldCat. I am looking forward to other WorldCat entities, such as people, places, and events, following. Apart from being a major release of linked data, this capability is the result of applying [Big] Data mining and analysis techniques that have been the focus of research and development for several years. These efforts are demonstrating that there is much more to library linked data than the mechanical, record-at-a-time conversion of MARC records into an RDF representation.
You may find it helpful, in understanding the potential exposed by the release of Works, to review some of the questions and answers that were raised after the preview release.
Personally I am really looking forward to hearing about the uses that are made of this data.
The Art & Architecture Thesaurus is a reference of over 250,000 terms on art and architectural history, styles, and techniques. I’m sure this will become an indispensable authoritative hub of terms in the Web of Data, assisting those describing their resources and placing them in context in that Web.
This is the first step in an 18-month process to release four vocabularies – the others being The Getty Thesaurus of Geographic Names (TGN)®, The Union List of Artist Names®, and The Cultural Objects Name Authority (CONA)®.
A great step from Getty. I look forward to the others appearing over the months and seeing how rapidly their use is made across the web.
Little things mean a lot. Little things that are misunderstood often mean a lot more.
Take the OCLC Control Number, often known as the OCN, for instance.
Every time an OCLC bibliographic record is created in WorldCat it is given a unique number from a sequential set – a process that has already taken place over a billion times. The individual number can be found represented in the record it is associated with. Over time these numbers have become a useful part of the processing of not only OCLC and its member libraries but, as a unique identifier proliferated across the library domain, by partners, publishers and many others.
Like anything that has been around for many years, assumptions and even myths have grown around the purpose and status of this little string of digits. Many stem from a period when there was concern, voiced by several including me at the time, about the potentially over-restrictive reuse policy for records created by OCLC and its member libraries. It became assumed by some that the way to tell if a bibliographic record was an OCLC record was to see if it contained an OCN. The effect was that some people and organisations invested effort in creating processes to remove OCNs from their records – processes that, I believe, are in a few cases still in place.
I signalled that OCLC were looking at this, in my session (Linked Data Progress), at IFLA in Singapore a few weeks ago. I am now pleased to say that the wording I was hinting at has now appeared on the relevant pages of the OCLC web site:
Use of the OCLC Control Number (OCN) OCLC considers the OCLC Control Number (OCN) to be an important data element, separate from the rest of the data included in bibliographic records. The OCN identifies the record, but is not part of the record itself. It is used in a variety of human and machine-readable processes, both on its own and in subsequent manipulations of catalog data. OCLC makes no copyright claims in individual bibliographic elements nor does it make any intellectual property claims to the OCLC Control Number. Therefore, the OCN can be treated as if it is in the public domain and can be included in any data exposure mechanism or activity as public domain data. OCLC, in fact, encourages these uses as they provide the opportunity for libraries to make useful connections between different bibliographic systems and services, as well as to information in other domains.
The announcement of this confirmation/clarification of the status of OCNs was made yesterday by my colleague Jim Michalko on the Hanging Together blog.
When discussing this with a few people, one question often came up – why just declare OCNs as public domain, why not license them as such? The following answer from the OCLC website, I believe, explains why:
The OCN is an individual bibliographic element, and OCLC doesn’t make any copyright claims either way on specific data elements. The OCN can be used by other institutions in ways that, at an aggregate level, may have varying copyright assertions. Making a positive, specific claim that the OCN is in the public domain might interfere with the copyrights of others in those situations.
As I said, this is a little thing, but if it clears up some misunderstandings and consequential anomalies, it will contribute to the usefulness of OCNs and ease the path towards a more open and shared data environment.
Back in September I formed a W3C Group – Schema Bib Extend. To quote an old friend of mine “Why did you go and do that then?”
Well, as I have mentioned before, Schema.org has become a bit of a success story for structured data on the web. I would have no hesitation in recommending it as a starting point for anyone, in any sector, wanting to share structured data on the web. This is what OCLC did in the initial exercise to publish the 270+ million resources in WorldCat.org as Linked Data.
The need to extend the Schema.org vocabulary became clear when using it to mark up the bibliographic resources in WorldCat. The Book type defined in Schema.org, along with other types derived from CreativeWork, contains many of the properties you need to describe bibliographic resources, but it lacks some of the more detailed ones, such as holdings count and carrier type, that we wanted to represent. It was also clear that it would need more extension if we wanted to go further and define the relationships between such things as works, expressions, manifestations, and items – to talk FRBR for a moment.
The organisations behind Schema.org (Google, Bing, Yahoo, Yandex) invite proposals for extension of the vocabulary via the W3C public-vocabs mailing list. OCLC could have taken that route directly, but at best I suggest it would have only partially served the needs of the broad spread of organisations and people who could benefit from enriched description of bibliographic resources on the web.
So that is why I formed a W3C Community Group to build a consensus on extending the Schema.org vocabulary for these types of resources. I wanted to not only represent the needs, opinions, and experience of OCLC, but also the wider library sector of libraries, librarians, system suppliers and others. Any generally applicable vocabulary [most importantly recognised by the major search engines] would also provide benefit for the wider bibliographic publishing, retailing, and other interested sectors.
Four months, and four conference calls (supported by OCLC – thank you), later we are a group of 55 members with a fairly active mailing list. We are making progress towards shaping up some recommendations, having invested much time in discussing our objectives and the issues of describing detailed bibliographic information (often currently to be found in MARC, ONIX, or other industry-specific standards) in a generic web-wide vocabulary. We are not trying to build a replacement for MARC, or to turn Schema.org into a standard with which you could operate a library community.
Applying Schema.org markup to your bibliographic data is aimed at announcing its presence, and the resources it describes, to the web and linking them into the web of data. I would expect to see it applied as complementary markup to other RDF-based standards, such as BIBFRAME, as they emerge. Although Schema.org started with Microdata and, latterly [and increasingly], RDFa, the vocabulary is equally applicable serialised in any of the RDF formats (N-Triples, Turtle, RDF/XML, JSON-LD) for processing and data exchange purposes.
My hope over the next few months is that we will agree and propose some extensions to schema.org (that will get accepted), especially in the areas of work/manifestation relationships, representation of identifiers other than ISBN, defining content/carrier, journal articles, and a few others that may arise. Something that has become clear in our conversations is that we also have a role as a group in providing examples of how [extended] Schema.org markup should be applied to bibliographic data.
I would characterise the stage we are at as moving from talking about it to doing something about it. I am looking forward to the next few months with enthusiasm.
If you want to join in, you will find us over at http://www.w3.org/community/schemabibex/ (where, amongst other things, you will find recordings and chat transcripts from the meetings so far on the Wiki). If you or your group want to know more about Schema.org and its relevance to libraries and the broader bibliographic world, drop me a line or, if I can fit it in with my travels to conferences such as ALA, I could be persuaded to stand up and talk about it.
You may remember my frustration a couple of months ago, at being in the air when OCLC announced the addition of Schema.org marked up Linked Data to all resources in WorldCat.org. Those of you who attended the OCLC Linked Data Round Table at IFLA 2012 in Helsinki yesterday, will know that I got my own back on the folks who publish the press releases at OCLC, by announcing the next WorldCat step along the Linked Data road whilst they were still in bed.
The Round Table was an excellent, very interactive session with Neil Wilson from the British Library, Emmanuelle Bermes from Centre Pompidou, and Martin Malmsten of the National Library of Sweden, which I will cover elsewhere. For now, you will find my presentation Library Linked Data Progress on my SlideShare site.
After we experimentally added RDFa embedded linked data, using Schema.org markup and some proposed Library extensions, to WorldCat pages, one of the questions I was most often asked was “where can I get my hands on some of this raw data?”
We are taking the application of linked data to WorldCat one step at a time so that we can learn from how people use and comment on it. So at that time, if you wanted to see the raw data, the only way was to use a tool [such as the W3C RDFa 1.1 Distiller] to parse the data out of the pages, just as the search engines do.
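Purely as an illustration of what such tools do, here is a toy Python sketch that pulls property/content pairs out of RDFa-style markup. The sample HTML is invented, not actual WorldCat markup, and a real RDFa 1.1 processor does a great deal more (prefix handling, chaining, datatypes):

```python
from html.parser import HTMLParser

# Toy illustration only: collects property/content pairs from markup.
# A conformant RDFa 1.1 processor (such as the W3C Distiller) is far
# more sophisticated than this.
class PropertyCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []
        self._pending = None  # property awaiting its text content

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "property" in a:
            if "content" in a:
                # value carried in a content attribute (e.g. <meta>)
                self.found.append((a["property"], a["content"]))
            else:
                # value is the element's text content
                self._pending = a["property"]

    def handle_data(self, data):
        if self._pending and data.strip():
            self.found.append((self._pending, data.strip()))
            self._pending = None

# Invented sample markup, loosely in the style of schema.org RDFa
html = '''<div vocab="http://schema.org/" typeof="Book">
  <span property="name">Example Title</span>
  <meta property="datePublished" content="2012"/>
</div>'''

p = PropertyCollector()
p.feed(html)
print(p.found)  # [('name', 'Example Title'), ('datePublished', '2012')]
```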
So I am really pleased to announce that you can now download a significant chunk of that data as RDF triples. Especially in experimental form, providing the whole lot as a download would have been a bit of a challenge, even just in disk-space and bandwidth terms. So which chunk to choose was a question. We could have chosen a random selection, but decided instead to pick the most popular resources in WorldCat, in terms of holdings – an interesting selection in its own right.
To make the cut, a resource had to be held by more than 250 libraries. It turns out that almost 1.2 million fall into this category, so a sizeable chunk indeed. To get your hands on this data, download the 1 GB gzipped file. It is in RDF N-Triples form, so you can take a look at the raw data in the file itself. Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples, and practice some SPARQL on them.
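As a sketch of working with a download like this, the following Python streams a gzipped N-Triples file line by line rather than reading it whole. The sample triples are invented stand-ins for the real data, and the naive split assumes simple, well-formed single-line triples:

```python
import gzip
import os
import tempfile
from collections import Counter

# Invented sample triples in the style of the WorldCat download.
# The real file is tens of millions of lines, hence streaming.
sample = b"""\
<http://worldcat.org/oclc/1> <http://schema.org/name> "Example one" .
<http://worldcat.org/oclc/2> <http://schema.org/name> "Example two" .
<http://worldcat.org/oclc/2> <http://schema.org/datePublished> "2012" .
"""

path = os.path.join(tempfile.mkdtemp(), "worldcat-sample.nt.gz")
with gzip.open(path, "wb") as f:
    f.write(sample)

# Count triples per predicate, one line at a time. A real N-Triples
# parser should also handle escapes, comments and blank lines robustly.
predicates = Counter()
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        predicates[line.split()[1]] += 1

print(predicates.most_common())
```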
Another area of question around the publication of WorldCat linked data has been licensing. Both the RDFa embedded data and the download are published as open data under the Open Data Commons Attribution License (ODC-BY), with reference to the community norms put forward by the members of the OCLC cooperative who built WorldCat. The theme of many of the questions has been along the lines of “I understand what the license says, but what does this mean for attribution in practice?”
To help clarify how you might attribute ODC-BY licensed WorldCat, and other OCLC linked data, we have produced attribution guidelines addressing some of the uncertainties in this area. You can find these at http://www.oclc.org/data/attribution.html. They address several scenarios, from documents containing WorldCat-derived information to referencing WorldCat URIs in your linked data triples, suggesting possible ways to attribute the OCLC WorldCat source of the data. As guidelines, they obviously cannot cover every possible situation which may require attribution, but hopefully they will cover most and be adaptable to other similar ones.
As I say in the press release, posted after my announcement, we are really interested to see what people will do with this data. So let us know, and if you have any comments on any aspect of its markup, schema.org extensions, publishing, or on our attribution guidelines, drop us a line at firstname.lastname@example.org.
Typical! Since joining OCLC as Technology Evangelist, I have been preparing myself to be one of the first to blog about the release of linked data describing the hundreds of millions of bibliographic items in WorldCat.org. So where am I when the press release hits the net? 35,000 feet above the North Atlantic heading for LAX, that’s where – life just isn’t fair.
By the time I am checked in to my Anaheim hotel, ready for the ALA Conference, this will be old news. Nevertheless it is significant news, significant in many ways.
OCLC have been at the leading edge of publishing bibliographic resources as linked data for several years. At dewey.info they have been publishing the top levels of the Dewey classifications as linked data since 2009. As announced yesterday, this has now been increased to encompass 32,000 terms, such as this one for the transits of Venus. Also around for a few years is VIAF (the Virtual International Authority File), where you will find URIs published for authors, such as this well known chap. These two were more recently joined by FAST (Faceted Application of Subject Terminology), providing usefully applicable identifiers for Library of Congress Subject Headings and combinations thereof.
Despite this leading position in the sphere of linked bibliographic data, OCLC has attracted some criticism over the years for not biting the bullet and applying it to all the records in WorldCat.org as well. As today’s announcement now demonstrates, they have taken their linked data enthusiasm to the heart of their rich, publicly available, bibliographic resources – publishing linked data descriptions for the hundreds of millions of items in WorldCat.
Let me dissect the announcement a bit….
First significant bit of news – WorldCat.org is now publishing linked data for hundreds of millions of bibliographic items. That’s a heck of a lot of linked data by anyone’s measure – by far the largest linked bibliographic resource on the web. Also, it is linked data describing things that, for decades, librarians in tens of thousands of libraries all over the globe have been carefully cataloguing so that the rest of us can find out about them. Just the sort of authoritative resource that will help stitch the emerging web of data together.
Second significant bit of news – the core vocabulary used to describe these bibliographic assets comes from schema.org. Schema.org is the initiative backed by Google, Yahoo!, Microsoft, and Yandex, to provide a generic high-level vocabulary/ontology to help mark up structured data in web pages so that those organisations can recognise the things being described and improve the services they can offer around them. A couple of examples being Rich Snippet results and inclusion in the Google Knowledge Graph.
As I reported a couple of weeks back from the Semantic Tech & Business Conference, some 7–10% of indexed web pages already contain schema.org markup, in microdata or RDFa. It may at first seem odd for a library organisation to use a generic web vocabulary to mark up its data – but just think who the consumers of this data are, and which vocabularies they are most likely to recognise. Just for starters, embedding schema.org data in WorldCat.org pages immediately makes them understandable by the search engines, vastly increasing the findability of these items.
Third significant bit of news – the linked data is published both in human-readable form and in machine-readable RDFa on the standard WorldCat.org detail pages. You don’t need to go to a special version or interface to get at it; it is part of the normal interface. As you can see from the screenshot of a WorldCat.org item above, there is now a Linked Data section near the bottom of the page. Click to open up that section and see the linked data in human-readable form. You will see the structured data that the search engines and other systems will get from parsing the RDFa encoded data, within the html that creates the page in your browser. Not very pretty to human eyes, I know, but just the kind of structured data that systems love.
Fourth significant bit of news – OCLC are proposing to cooperate with the library and wider web communities to extend Schema.org, making it even more capable of describing library resources. With the help of the W3C, Schema.org is working with several industry sectors to extend the vocabulary to be more capable in their domains – news and e-commerce being a couple of already accepted examples. OCLC is playing its part in doing this for the library sector.
Take a closer look at the markup on WorldCat.org and you will see attributes from a library vocabulary – attributes such as library:holdingsCount and library:oclcnum. This library vocabulary is OCLC’s conversation starter, with which we want to kick off discussions with interested parties, from the library and other sectors, about proposing a basic extension to schema.org for library data. What better way of testing out such a vocabulary than to mark up several million records with it, publish them, and see what the world makes of them.
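To make that concrete, here is a hedged sketch of how such a mixed description might look when serialised as JSON-LD. The library: property names are the ones visible in the WorldCat markup, but the namespace URI, title, and holdings figure below are illustrative placeholders, not actual WorldCat data:

```python
import json

# Sketch of a bibliographic description mixing schema.org properties
# with the "library" extension properties seen in WorldCat markup.
# The library namespace URI, name and holdingsCount are placeholders.
description = {
    "@context": {
        "@vocab": "http://schema.org/",
        "library": "http://purl.org/library/",  # assumed namespace
    },
    "@type": "Book",
    "@id": "http://worldcat.org/oclc/817185721",
    "name": "Example Title",
    "library:oclcnum": "817185721",
    "library:holdingsCount": 250,
}

print(json.dumps(description, indent=2))
```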
Fifth significant bit of news – the WorldCat.org linked data is published under an Open Data Commons (ODC-BY) license, so it will be openly usable by many for many purposes.
Sixth significant bit of news – this release is an experimental release. This is the start, not the end, of a process. We know we have not got this right yet. There are more steps to take around how we publish this data in ways in addition to RDFa markup embedded in page html – not everyone can, or will want to, parse pages to get the data. There are obvious areas for discussion around the use of schema.org and the proposed library extension to it. There are areas for discussion about the application of the ODC-BY license and the attribution requirements it asks for. Over the coming months OCLC wants to engage constructively with all who are interested in this process. It is only with the help of the library and wider web communities that we can get it right. In that way we can ensure that WorldCat linked data will be beneficial for the OCLC membership, libraries in general, and a great resource on the emerging web of data.
As you can probably tell I am fairly excited about this announcement. This, and future stuff like it, are behind some of my reasons for joining OCLC. I can’t wait to see how this evolves and develops over the coming months. I am also looking forward to engaging in the discussions it triggers.
Europeana recently launched an excellent short animation explaining what Linked Open Data is and why it’s a good thing, both for users and for data providers. They did this in support of the release of a large amount of Linked Open Data describing cultural heritage assets held in Libraries, Museums, Galleries and other institutions across Europe.
Europeana, as an aggregator and proxy for data supplied by other institutions, is in a difficult position. They not only want to publish this information for the benefit of Europe and the wider world, they also need to maintain the provenance and relationships between the submissions of data from their partner organisations. I believe that the Europeana Data Model (EDM) is the result of the second of these two priorities taking precedence, their proxy role being reflected in the structure of the data. The effect is that a potential consumer of their data, who is not versed in Europeana and their challenges, will need to understand their model before being able to identify that the Cartographer: Ryther, Augustus created the Cittie of London 31.
Fortunately as their technical overview indicates, this is a pilot and the team at Europeana are open to suggestion, particularly on the issue of providing information at the item level in the data model:
Depending on the feedback received during this pilot, we may change this and duplicate all the descriptive metadata at the level of the item URI. Such an option is costly in terms of data verbosity, but it would enable easier access to metadata, for data consumers less concerned about provenance.
In the interests of this data becoming useful, valuable, and easily consumable for those outside of the Europeana partner grouping, I encourage you to lobby them to take a hit on the duplication of some data.
Europeana have launched a video. An excellent short (03:42) video animation explaining what Linked Open Data is and why it’s a good thing, both for users and for data providers.
Its purpose is to support their publication of a Linked Open Data representation of their cultural heritage descriptive assets and aggregation of descriptions from their contributing organisations. I will cover the ramifications and benefits of this elsewhere.
For now, check out the video and drop it into your favourites, ready to send to anyone who asks the “what is Linked Data?” question.
The German National Library (DNB) has launched a Linked Data version of the German National Bibliography.
The bibliographic data of the DNB’s main collection (apart from the printed music and the collection of the Deutsches Exilarchiv) and the serials (magazines, newspapers and series of the German Union Catalogue of Serials (ZDB)) have been converted. Henceforth the RDF/XML representations of the records are available at the DNB portal. This is an experimental service that will be continually expanded and improved.
This is a welcome extension to their Linked Data Service, previously delivering authority data. Documentation on their data and modelling is available, however the English version has yet to be updated to reflect this latest release.
Links to RDF-XML versions of individual records are available directly from the portal user interface, with the usual Linked Data content negotiation techniques available to obtain HTML or RDF-XML as required.
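For those unfamiliar with content negotiation: you ask for the RDF serialisation by sending an appropriate Accept header when dereferencing the record URI. A minimal Python sketch, using a placeholder record URI rather than a real DNB identifier:

```python
from urllib.request import Request

# Content negotiation sketch: request RDF/XML rather than HTML by
# setting the Accept header. The record URI below is a placeholder,
# not a real DNB record identifier.
record_uri = "https://d-nb.info/record-id-here"

req = Request(record_uri, headers={"Accept": "application/rdf+xml"})
print(req.get_header("Accept"))  # application/rdf+xml

# To actually fetch the record (network access required):
#   from urllib.request import urlopen
#   rdf_xml = urlopen(req).read()
```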
This is a welcome addition to the landscape of linked open bibliographic data, joining others such as the British Library.
Also to be welcomed is their move to CC0 licensing removing barriers, real or assumed, to the reuse of this data.
I predict that this will be the first of many more such announcements this year from national and other large libraries opening up their metadata resources as Linked Open Data. The next challenge will be to identify the synergies between these individual approaches to modelling bibliographic data and balance the often competing needs of the libraries themselves and potential consumers of their data who very often do not speak ‘library’.
Somehow [without engaging in the traditional global library cooperation treacle-like processes that take a decade to publish a document] we need to draw together a consistent approach to modelling and publishing Linked Open Bibliographic Data for the benefit of all – not just the libraries. With input from the DNB, British Library, Library of Congress, European National Libraries, Stanford, and others such as Schema.org, W3C, Open Knowledge Foundation etc., we could possibly get a consensus on an initial approach. Aiming for a standard would be both too restrictive, and based on experience, too large a bite of the elephant at this early stage.
Here, for instance, is a picture of the village next to where I live, discovered in Ookaboo and associated with the village as a topic:
But there is more.
Ookaboo have released an RDF dump of the metadata behind the images, concept mappings and links to concepts in Freebase and Dbpedia for topics such as places, people and organism classifications.
This is not a one-off exercise: [Ookaboo] intend to use an automated process to make regular releases of the Ookaboo dump in the future.
For the SPARQLy inclined, they also provide an overview of the structure, namespaces, and properties used in the RDF plus a SPARQL cookbook of example queries.
This looks to be a great resource, and when merged with other data sets, potentially capable of adding significant benefit.
We require the following attribution:
Papers, books, and other works that incorporate Ookaboo data or report results from Ookaboo data must cite Ookaboo and Ontology2.
HTML pages that incorporate images from Ookaboo must include a hyperlink to the page describing the image that is linked with the ookaboo:ookabooPage property.
Data products derived from Ookaboo must make it possible to maintain the provenance and attribution chain for images. In the case of an RDF dump, it is sufficient to provide a connection to Ookaboo identifiers and documentation that refers users to the Ookaboo RDF dump. SPARQL endpoints must contain attribution information, which can be done by importing selected records from the Ookaboo dump.
No problem in principle, but in practice some may find the share-alike elements of the last item a bit difficult to comply with once you start building applications on layers upon layers of data APIs. Commercial players especially may shy away from using Ookaboo because of the copyleft ramifications. For the data itself, I would have thought CC-BY would have been sufficient.
Maybe Paul Houle, of Ontology2 who are behind Ookaboo, would like to share his reasoning behind this.