As is often the way, you start a post without realising that it is part of a series of posts – as with this one. This, and the following two posts in the series – Hubs of Authority, and Beacons of Availability – together map out a journey that I believe the library community is undertaking as it evolves from a record-based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with in my role as Technology Evangelist with OCLC, and as chair of the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC products/services or recommendations from the W3C Group.
Entification – a bit of an ugly word, but one I am hearing more and more in my day-to-day existence. What an exciting life I lead…
What is it, and why should I care, you may be asking.
I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and as a potential stimulator of significant efficiencies in the creation and management of information about those resources. Taking those benefits as accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.
That phrase ‘getting library data into a linked data form’ hides a multitude of issues. There are some obvious steps, such as holding and/or outputting the data in RDF, providing resources with permanent URIs, etc. However, deriving useful library linked data from a source, such as a Marc record, requires far more than giving it a URI and encoding what you know, unchanged, as RDF triples.
Marc is a record-based format. For each book catalogued, a record is created. The mantra driven into future cataloguers at library school has been, and I believe often still is, catalogue the item in your hand. Everything discoverable about the item in their hand is transferred onto that [now virtual] catalogue card stored in their library system. In that record we get obvious bookish information such as title, size, format, number of pages, ISBN, etc. We also get information about the author (name, birth/death dates, etc.), publisher (location, name, etc.), classification scheme identifiers, subjects, genres, notes, holding information, etc., etc., etc. A vast amount of information about, and related to, that book in a single record. A significant achievement – assembling all this information for the vast majority of books in the vast majority of the libraries of the world. In this world of electronic resources, it is a pattern being repeated for articles, journals, eBooks, audiobooks, etc.
Why do we catalogue? A question I often ask, with an obvious answer – so that people can find our stuff. Replicating the polished drawers of catalogue cards of old, ordered by author name or subject, indexes are applied to the strings stored in those records. Indexes acting as search access points to a library’s collection.
A spin-off of capturing information about library books/articles/etc. in record attributes is that we are also building up information about authors, publishers, subjects, and classifications. So for instance a subject index will contain a list of all the names of the subjects addressed by an individual library collection. To apply some consistency between libraries, authorities – authoritative sets of names, subject headings, etc. – have emerged, so that spellings and name formats could be shared in a controlled way between libraries and cataloguers.
So where does entification come in? Well, much of the information about authors, subjects, publishers, and the like is locked up in those records. A record could be taken as describing an entity, the book. However, the other entities in the library universe are described only as attributes of the book/article/text. I can attest to the vast computing power and intellectual effort that goes into efforts at OCLC to mine these attributes from records to derive descriptions of the entities they represent – the people, places, organisations, subjects, etc. that the resources are by, about, or related to in some way.
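To make that concrete, here is a minimal sketch of the difference, using made-up URIs and a couple of well-known vocabularies. Before entification, the author exists only as a text string on the book’s record; after it, the author is a thing in its own right that can be described once and linked to from everywhere:

# Before: the author is just a text attribute of the book
<http://example.org/book/123> <http://purl.org/dc/terms/creator> "Dickens, Charles, 1812-1870." .

# After: the author is an identified entity, linkable and describable
<http://example.org/book/123> <http://purl.org/dc/terms/creator> <http://example.org/person/charles-dickens> .
<http://example.org/person/charles-dickens> <http://xmlns.com/foaf/0.1/name> "Charles Dickens" .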
Once the entities are identified, and a model is produced and populated from the records, we can start to work with a true multi-dimensional view of our domain. A major step forward from the somewhat singular view that we have been working with over previous decades. With such a model it should be possible to identify and work with new relationships, such as publishers and their authors, subjects and collections, works and their available formats.
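For example, a question that is almost impossible to ask of a pile of Marc records – which authors has a given publisher published? – becomes a simple query once publishers and authors are entities. A sketch in SPARQL, assuming an illustrative model using Schema.org properties and made-up URIs:

PREFIX schema: <http://schema.org/>
# All the authors whose books a given (illustrative) publisher has published
SELECT DISTINCT ?authorName WHERE {
  ?book schema:publisher <http://example.org/org/acme-press> ;
        schema:author ?author .
  ?author schema:name ?authorName .
}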
We are in a state of change in the library world, which entification of our data will help us get to grips with. As you can imagine, as these new approaches crystallise they are leading to all sorts of discussions around what the major entities are that we need to concern ourselves with; how we model them; how we populate that model from source [record] data; how we do it without compromising the rich resources we are working with; and how we continue to provide and improve the services relied upon at the moment, whilst change happens. Challenging times – bring on the entification!
Back in September I formed a W3C Group – Schema Bib Extend. To quote an old friend of mine: “Why did you go and do that then?”
Well, as I have mentioned before, Schema.org has become a bit of a success story for structured data on the web. I would have no hesitation in recommending it as a starting point for anyone, in any sector, wanting to share structured data on the web. This is what OCLC did in the initial exercise to publish the 270+ million resources in WorldCat.org as Linked Data.
The need to extend the Schema.org vocabulary became clear when using it to mark up the bibliographic resources in WorldCat. The Book type defined in Schema.org, along with other types derived from CreativeWork, contains many of the properties you need to describe bibliographic resources, but lacks some of the more detailed ones, such as holdings count and carrier type, that we wanted to represent. It was also clear that it would need more extension if we wanted to go further and define the relationships between such things as works, expressions, manifestations, and items – to talk FRBR for a moment.
The organisations behind Schema.org (Google, Bing, Yahoo, Yandex) invite proposals for extension of the vocabulary via the W3C public-vocabs mailing list. OCLC could have taken that route directly, but at best I suggest it would have only partially served the needs of the broad spread of organisations and people who could benefit from enriched description of bibliographic resources on the web.
So that is why I formed a W3C Community Group to build a consensus on extending the Schema.org vocabulary for these types of resources. I wanted to represent not only the needs, opinions, and experience of OCLC, but also those of the wider library sector of libraries, librarians, system suppliers, and others. Any generally applicable vocabulary [most importantly, one recognised by the major search engines] would also benefit the wider bibliographic publishing, retailing, and other interested sectors.
Four months, and four conference calls (supported by OCLC – thank you), later, we are a group of 55 members with a fairly active mailing list. We are making progress towards shaping up some recommendations, having invested much time in discussing our objectives and the issues of describing detailed bibliographic information (currently often to be found in Marc, Onix, or other industry-specific standards) in a generic web-wide vocabulary. We are not trying to build a replacement for Marc, or turn Schema.org into a standard that you could operate a library community with.
Applying Schema.org markup to your bibliographic data is aimed at announcing its presence, and the resources it describes, to the web, and linking them into the web of data. I would expect to see it being applied as complementary markup to other RDF-based standards, such as BIBFRAME as it emerges. Although Schema.org started with Microdata and, latterly [and increasingly], RDFa, the vocabulary is equally applicable serialised in any of the RDF formats (N-Triples, Turtle, RDF/XML, JSON) for processing and data exchange purposes.
My hope over the next few months is that we will agree and propose some extensions to schema.org (that will get accepted), especially in the areas of work/manifestation relationships, representations of identifiers other than ISBN, defining content/carrier, journal articles, and a few others that may arise. Something that has become clear in our conversations is that we also have a role as a group in providing examples of how [extended] Schema.org markup should be applied to bibliographic data.
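To give a flavour of the sort of example I mean – even before any extensions, a basic book description in Schema.org terms can be sketched like this (all URIs and values illustrative, serialised here as Turtle for readability):

@prefix schema: <http://schema.org/> .

# A book described with existing, unextended Schema.org terms
<http://example.org/book/974> a schema:Book ;
    schema:name "Great Expectations" ;
    schema:author <http://example.org/person/charles-dickens> ;
    schema:inLanguage "en" ;
    schema:bookFormat schema:Paperback .

<http://example.org/person/charles-dickens> a schema:Person ;
    schema:name "Charles Dickens" .

It is the properties this leaves you reaching for – the work this book is a manifestation of, its non-ISBN identifiers, its content and carrier types – that the group’s extension proposals are aimed at.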
I would characterise the stage we are at as moving from the ‘talking about it’ stage to the ‘doing something about it’ stage. I am looking forward to the next few months with enthusiasm.
If you want to join in, you will find us over at http://www.w3.org/community/schemabibex/ (where, amongst other things, you will find on the wiki recordings and chat transcripts from the meetings so far). If you or your group want to know more about Schema.org and its relevance to libraries and the broader bibliographic world, drop me a line or, if I can fit it in with my travels to conferences such as ALA, I could be persuaded to stand up and talk about it.
I cannot really get away with making a statement like “Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them” and then not follow it up.
So here, for those that are interested, is a step-by-step description of what I did to follow my own encouragement to load up the triples and start playing.
Choose a triplestore. I followed my own advice and chose 4Store. The main reasons for this choice were that it is open source, yet comes from an environment where it was the base platform for a successful commercial business, so it should work. Also, in my years rattling around the semantic web world, 4Store has always been one of those tools that seemed to be on everyone’s recommendation list.
Looking at some of the blurb – 4store is optimised to run on shared-nothing clusters of up to 32 nodes, linked with gigabit Ethernet, at times holding and running queries over databases of 15GT, supporting a Web application used by thousands of people – you may think it might be a bit of overkill for a tool to play with at home, but hey, if it works, does that matter!
Operating system. Unsurprisingly for a server product, 4Store was developed to run on Unix-like systems. I had three options. I could resurrect that old Linux-loaded PC in the corner, fire up an Amazon Web Services image with 4Store built in (such as the one built for the Billion Triple Challenge), or use the application download for my Mac.
As I only needed it for personal playing, I took the path of least resistance and chose the Mac application. The Mac in question being a fairly modern MacBook Air. The following instructions are therefore Mac-oriented, but should not be too difficult to replicate on your OS of choice.
Download and install. I downloaded the latest version of the application (a 15Mb download) from the download server: http://4store.org/download/macosx/. As with most Mac applications, it was just a matter of opening up the downloaded 4store-1.1.5.dmg file and dragging the 4Store icon into my Applications folder. (Time-saving tip: whilst you are doing the next step, you can be downloading the 1Gb WorldCat data file in the background, from here.)
Setup and load. Clicking on the 4Store application opens up a terminal window to give you command-line access to control your triplestore. Following the simple but effective documentation, I needed to create a dataset, which I called WorldCatMillion:
$ 4s-backend-setup WorldCatMillion
Next start the database:
$ 4s-backend WorldCatMillion
Then I needed to load the triples from the WorldCat Most Highly Held data set. This step takes a while – over an hour on my system.
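The import command looks something like this (substitute whatever name the unzipped download has on your system – WorldCatMostHighlyHeld.nt here is my assumption):

$ 4s-import WorldCatMillion --format ntriples WorldCatMostHighlyHeld.nt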
This single command line, which may have wrapped on to more than one line in your browser, looks a bit complicated, but all it is doing is telling the import process to import the file, which I had downloaded and unzipped (automatically on the Mac – you may have to use gunzip on another system), and which is formatted as ntriples, into my WorldCatMillion dataset.
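Before a browser can talk to the store you also need to start 4store’s SPARQL HTTP server against the dataset (I am assuming a default install here; the -p flag sets the port to match the URL in the next step):

$ 4s-httpd -p 8000 WorldCatMillion

You can sanity-check the load from the terminal with 4s-query, which should print back a handful of triples:

$ 4s-query WorldCatMillion 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'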
Access via a web browser. I chose Firefox, as it seems to handle unformatted XML better than most. 4Store comes with a very simple SPARQL interface: http://localhost:8000/test/. This comes already populated with a sample query; just press execute and you should get back the data that you got with the command-line 4s-query. The server sends it back in an XML format, which your browser may save to disk for you to view – tweaking the browser settings to automatically open these files will make life easier.
Some simple SPARQL queries. Try these and see what you get:
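These are sketches based on the Schema.org terms, and the library extension vocabulary, I would expect to find in the data – adjust the property names to match what you actually see as you browse it (I am assuming http://purl.org/library/ as the extension namespace):

# List a few books and their titles
PREFIX schema: <http://schema.org/>
SELECT ?book ?title WHERE {
  ?book a schema:Book ;
        schema:name ?title .
} LIMIT 10

# Find books with an author whose name matches "Dickens"
PREFIX schema: <http://schema.org/>
SELECT ?title ?authorName WHERE {
  ?book a schema:Book ;
        schema:name ?title ;
        schema:author ?author .
  ?author schema:name ?authorName .
  FILTER regex(?authorName, "Dickens")
} LIMIT 10

# The most widely held resources in the set
PREFIX library: <http://purl.org/library/>
SELECT ?book ?count WHERE {
  ?book library:holdingsCount ?count .
} ORDER BY DESC(?count) LIMIT 10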
You may remember my frustration a couple of months ago at being in the air when OCLC announced the addition of Schema.org marked-up Linked Data to all resources in WorldCat.org. Those of you who attended the OCLC Linked Data Round Table at IFLA 2012 in Helsinki yesterday will know that I got my own back on the folks who publish the press releases at OCLC, by announcing the next WorldCat step along the Linked Data road whilst they were still in bed.
The Round Table was an excellent, very interactive session with Neil Wilson from the British Library, Emmanuelle Bermes from Centre Pompidou, and Martin Malmsten of the National Library of Sweden, which I will cover elsewhere. For now, you will find my presentation, Library Linked Data Progress, on my SlideShare site.
After we experimentally added RDFa-embedded linked data, using Schema.org markup and some proposed library extensions, to WorldCat pages, one of the questions I was most often asked was: where can I get my hands on some of this raw data?
We are taking the application of linked data to WorldCat one step at a time, so that we can learn from how people use and comment on it. So at that time, if you wanted to see the raw data, the only way was to use a tool [such as the W3C RDFa 1.1 Distiller] to parse the data out of the pages, just as the search engines do.
So I am really pleased to announce that you can now download a significant chunk of that data as RDF triples. Especially in experimental form, providing the whole lot as a download would have been a bit of a challenge, even just in disk space and bandwidth terms. So which chunk to choose was a question. We could have chosen a random selection, but decided instead to pick the most popular, in terms of holdings, resources in WorldCat – an interesting selection in its own right.
To make the cut, a resource had to be held by more than 250 libraries. It turns out that almost 1.2 million fall into this category, so a sizeable chunk indeed. To get your hands on this data, download the 1Gb gzipped file. It is in RDF n-triples form, so you can take a look at the raw data in the file itself. Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them.
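If you just want a quick peek before committing to a triplestore, something like this at the terminal will show you the first few triples (I am assuming the download is named along these lines):

$ gunzip -c WorldCatMostHighlyHeld.nt.gz | head -5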
Another area of questions around the publication of WorldCat linked data has been licensing. Both the RDFa-embedded data and the download are published as open data under the Open Data Commons Attribution License (ODC-BY), with reference to the community norms put forward by the members of the OCLC cooperative who built WorldCat. The theme of many of the questions has been along the lines of “I understand what the license says, but what does this mean for attribution in practice?”
To help clarify how you might attribute ODC-BY licensed WorldCat, and other OCLC linked data, we have produced attribution guidelines to address some of the uncertainties in this area. You can find these at http://www.oclc.org/data/attribution.html. They address several scenarios, from documents containing WorldCat-derived information to referencing WorldCat URIs in your linked data triples, suggesting possible ways to attribute the OCLC WorldCat source of the data. As guidelines, they obviously cannot cover every possible situation which may require attribution, but hopefully they will cover most and can be adapted to other similar ones.
As I say in the press release, posted after my announcement, we are really interested to see what people will do with this data. So let us know, and if you have any comments on any aspect of its markup, schema.org extensions, publishing, or on our attribution guidelines, drop us a line at firstname.lastname@example.org.
Typical! Since joining OCLC as Technology Evangelist, I have been preparing myself to be one of the first to blog about the release of linked data describing the hundreds of millions of bibliographic items in WorldCat.org. So where am I when the press release hits the net? 35,000 feet above the North Atlantic heading for LAX, that’s where – life just isn’t fair.
By the time I am checked in to my Anaheim hotel, ready for the ALA Conference, this will be old news. Nevertheless it is significant news, significant in many ways.
OCLC have been at the leading edge of publishing bibliographic resources as linked data for several years. At dewey.info they have been publishing the top levels of the Dewey classifications as linked data since 2009. As announced yesterday, this has now been increased to encompass 32,000 terms, such as this one for the transits of Venus. Also around for a few years is VIAF (the Virtual International Authority File), where you will find URIs published for authors, such as this well-known chap. These two were more recently joined by FAST (Faceted Application of Subject Terminology), providing usefully applicable identifiers for Library of Congress Subject Headings and combinations thereof.
Despite this leading position in the sphere of linked bibliographic data, OCLC has attracted some criticism over the years for not biting the bullet and applying it to all the records in WorldCat.org as well. As today’s announcement now demonstrates, they have taken their linked data enthusiasm to the heart of their rich, publicly available, bibliographic resources – publishing linked data descriptions for the hundreds of millions of items in WorldCat.
Let me dissect the announcement a bit….
First significant bit of news – WorldCat.org is now publishing linked data for hundreds of millions of bibliographic items; that’s a heck of a lot of linked data by anyone’s measure. By far the largest linked bibliographic resource on the web. Also, it is linked data describing things that, for decades, librarians in tens of thousands of libraries all over the globe have been carefully cataloguing so that the rest of us can find out about them. Just the sort of authoritative resources that will help stitch the emerging web of data together.
Second significant bit of news – the core vocabulary used to describe these bibliographic assets comes from schema.org. Schema.org is the initiative backed by Google, Yahoo!, Microsoft, and Yandex, to provide a generic high-level vocabulary/ontology to help mark up structured data in web pages so that those organisations can recognise the things being described and improve the services they can offer around them. A couple of examples being Rich Snippet results and inclusion in the Google Knowledge Graph.
As I reported a couple of weeks back from the Semantic Tech & Business Conference, some 7-10% of indexed web pages already contain schema.org markup, as microdata or RDFa. It may at first seem odd for a library organisation to use a generic web vocabulary to mark up its data – but just think who the consumers of this data are, and what vocabularies they are most likely to recognise. Just for starters, embedding schema.org data in WorldCat.org pages immediately makes them understandable by the search engines, vastly increasing the findability of these items.
Third significant bit of news – the linked data is published both in human-readable form and in machine-readable RDFa on the standard WorldCat.org detail pages. You don’t need to go to a special version or interface to get at it; it is part of the normal interface. On a WorldCat.org item page there is now a Linked Data section near the bottom. Click and open up that section to see the linked data in human-readable form. Within the HTML that creates the page in your browser, you will see the structured data that the search engines and other systems will get from parsing the RDFa-encoded data. Not very pretty to human eyes, I know, but just the kind of structured data that systems love.
Fourth significant bit of news – OCLC is proposing to cooperate with the library and wider web communities to extend Schema.org, making it even more capable of describing library resources. With the help of the W3C, Schema.org is working with several industry sectors to extend the vocabulary to be more capable in their domains – news and e-commerce being a couple of already accepted examples. OCLC is playing its part in doing this for the library sector.
Take a closer look at the markup on WorldCat.org and you will see attributes from a library vocabulary – attributes such as library:holdingsCount and library:oclcnum. This library vocabulary is OCLC’s conversation starter, with which we want to kick off discussions with interested parties, from the library and other sectors, about proposing a basic extension to schema.org for library data. What better way of testing out such a vocabulary than to mark up several million records with it, publish them, and see what the world makes of them.
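Stripped out of its RDFa and re-serialised as Turtle, the markup for an item looks something like this sketch (the OCLC number and values are illustrative, and I am assuming http://purl.org/library/ as the extension namespace):

@prefix schema: <http://schema.org/> .
@prefix library: <http://purl.org/library/> .

# An item description mixing core Schema.org terms with the library extension
<http://www.worldcat.org/oclc/0000000> a schema:Book ;
    schema:name "An example title" ;
    library:oclcnum "0000000" ;
    library:holdingsCount 1500 .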
Fifth significant bit of news – the WorldCat.org linked data is published under an Open Data Commons (ODC-BY) license, so it will be openly usable by many for many purposes.
Sixth significant bit of news – this release is an experimental release. This is the start, not the end, of a process. We know we have not got this right yet. There are more steps to take around how we publish this data in ways in addition to RDFa markup embedded in page HTML – not everyone can, or will want to, parse pages to get the data. There are obvious areas for discussion around the use of schema.org and the proposed library extension to it. There are areas for discussion about the application of the ODC-BY license and the attribution requirements it asks for. Over the coming months OCLC wants to constructively engage with all that are interested in this process. It is only with the help of the library and wider web communities that we can get it right. In that way we can ensure that WorldCat linked data will be beneficial for the OCLC membership, libraries in general, and a great resource on the emerging web of data.
As you can probably tell I am fairly excited about this announcement. This, and future stuff like it, are behind some of my reasons for joining OCLC. I can’t wait to see how this evolves and develops over the coming months. I am also looking forward to engaging in the discussions it triggers.
I have already had some feedback on this move from several people who, almost without exception, have told me that they think it is a good move for both OCLC and myself. Which is good, as I agree with them 😉
I have also had several questions about it, mostly beginning with the words why or what. I thought I would share my answers here to give some background.
Why a library organisation? – I thought you were trying to move away from libraries.
I have been associated with the library sector since joining BLCMP in 1990 to help them build a new library management system which they christened Talis. As Talis, the company named after the library system, evolved and started to look at new Web influenced technologies to open up possibilities for managing and publishing library data, they and I naturally gravitated towards Semantic Web technologies and their pragmatic use in a way that became known as Linked Data.
Even though the Talis Group transferred their library division to Capita last year, that natural connection between library data and linked data principles meant that the association remained for me, despite my having no direct connection with the development of the systems that run libraries. Obvious examples of this were the Linked Data and Libraries events I ran in London with Talis, and the work with the British Library to model and publish the British National Bibliography. So even if I wanted to get away from libraries, I believe it would be a fruitless quest – I think I am stuck with them!
Why OCLC? – Didn’t you spend a lot of time criticising them?
I can own up to several blog posts a few years back where I either criticised them for not being as open as I thought they could be, or questioned their business model at the time. However, I have always respected their mission to promote libraries and their evolution. In my time chairing and hosting the Library 2.0 Gang, and in individual podcasts, I hope that I demonstrated the fairness I always aspire towards, whilst not shying away from the difficult questions. I have watched OCLC, and the library community they are part of, evolve over many years towards a position and vision that encompasses many of the information-sharing principles and ambitions I hold. In the very short amount of time I have already spent talking with my new colleagues, it is clear that they are motivated towards making best use of data for the benefit of their members, libraries in general, and the people they serve – which is all of us.
Oh and yes, they have a great deal of data which has huge potential on the Linked Data Web and it will be great to be a part of realising at least some of that potential.
What about Data Liberate? – Are you going to continue with that?
I set up Data Liberate with a dual purpose. Firstly, to promote myself as a consultant to help people and organisations realise the value in their data. Secondly, to provide a forum and focus for sharing, commenting upon, and discussing issues, ideas, events, and initiatives relevant to Open, Linked, Enterprise, and Big data. Obviously the first of these is now not relevant, but I do intend to maintain Data Liberate to fulfil that second purpose. I may not be posting quite as often, but I do intend to highlight and comment upon things of relevance in the broad landscape of data issues, regardless of whether they are library-focussed or not.
What are you going to be doing at OCLC?
My title is Technology Evangelist, and there is a great deal of evangelism needed – promoting, explaining, and demystifying the benefits of Linked Data to libraries and librarians. This stuff is very new to a large proportion of the library sector, and unsurprisingly there is some scepticism about it. It would be easy to view announcements from organisations such as the British Library, Library of Congress, Europeana, Stanford University, OCLC, and many, many more as a general acceptance of a Linked Data library vision. Far from it. I am certain that a large proportion of librarians are not aware of the potential benefits of Linked Data for their world, or even why they should be aware. So you will find me on stage at an increasing number of OCLC and wider library sector events, doing my bit to spread the word.
Like all technologies and techniques, Linked Data does not sit in isolation, and there are obvious connections with the OCLC WorldShare Platform, which is providing shared web-based services for managing libraries and their data. I will also be applying some of my time to evangelising the benefits of this approach.
Aside from evangelising, I will be working with people. Working with the teams within OCLC as they coordinate and consolidate their approach to applying Linked Data principles across the organisation. Working with them as they evolve the way OCLC will publish data to libraries and the wider world. Working with libraries to gain their feedback. Working with the Linked Data and Semantic Web community for feedback on publishing that data in a way that serves not only the library community, but everyone across the emerging Web of Data. So you will continue to find me on stage at events such as the Semantic Tech and Business Conference, doing my bit to spread the word, as well as engaging directly with the community.
Why libraries? – Aren’t they a bit of a Linked Data niche?
I believe that there are two basic sorts of data being published on the [Linked Data] web – backbone data, and non-backbone data whose value is greatly increased by linking to the backbone.
By backbone data I mean things like: DBpedia, with its identifier for most every ‘thing’; government data, with authoritative identifiers for laws, departments, schools, etc.; mapping organisations, such as Ordnance Survey, with authoritative identifiers for postcodes, etc. By linking your dataset’s concepts to these backbone sources, you immediately increase its usefulness and ability to link and merge with other data linked in the same way. I believe that the descriptions of our heritage and achievements, both scientific and artistic, held by organisations such as our national, academic, and public libraries, are a massive resource that has the opportunity to form a very significant vertebra in that backbone.
Hopefully some of the above will help in understanding the background and motivations behind this new and exciting phase of my career. These opinions and ambitions for the evolution of data on the web, and in the enterprise, are all obviously mine, so do not read into them any future policy decisions or directions for my new employer. Suffice to say I will not be leaving them at the door. Neither will I cast off my approach of pragmatically solving problems in the real world by evolving towards a solution, recognising that the definition of the ideal changes over time and with circumstance.