Published in Consuming Data, Development, Libraries, Linked Data, OCLC, schema.org, Web, WorldCat
Tagged: Linked Data, RDF, RDFa, WorldCat
Content-negotiation has been implemented for the publication of Linked Data for WorldCat resources.
For those immersed in the publication and consumption of Linked Data, there is little more to say. However I suspect there are a significant number of folks reading this who are wondering what the heck I am going on about. It is a little bit techie but I will try to keep it as simple as possible.
Back last year, a linked data representation of each of the 290+ million WorldCat resources was embedded in its web page on the WorldCat site. For full details check out that announcement but in summary:
- All resource pages include Linked Data
- Human visible under a Linked Data tab at the bottom of the page
- Embedded as RDFa within the page html
- Described using the Schema.org vocabulary
- Released under an ODC-BY open data license
That is all still valid – so what’s new from now?
That same data is now available in several machine readable RDF serialisations. RDF is RDF, but depending on your use it is easier to consume as RDFa, or XML, or JSON, or Turtle, or as triples.
In many Linked Data presentations, including some of mine, you will hear the line “As I clicked on the link in a web browser we are seeing a html representation. However if I was a machine I would be getting XML or another format back.” Content negotiation is the mechanism in the http protocol that makes that happen.
Let me take you through some simple steps to make this visible for those that are interested.
Starting with a resource in WorldCat: http://www.worldcat.org/oclc/41266045. Clicking that link will take you to the page for Harry Potter and the prisoner of Azkaban. As we did not indicate otherwise, the content-negotiation defaulted to returning the html web page.
To specify that we want RDF/XML we would use http://www.worldcat.org/oclc/41266045.rdf (depending on your browser this may not display anything, but instead allow you to download the result to view in your favourite editor).
This allows you to manually specify the serialisation format you require. You can also do it from within a program by specifying, to the http protocol, the format that you would accept from accessing the URI. This means that you do not have to write code to add the relevant suffix to each URI that you access. You can replicate the effect by using curl, a command line http client tool:
curl -L -H "Accept: application/rdf+xml" http://www.worldcat.org/oclc/41266045
curl -L -H "Accept: application/ld+json" http://www.worldcat.org/oclc/41266045
curl -L -H "Accept: text/turtle" http://www.worldcat.org/oclc/41266045
curl -L -H "Accept: text/plain" http://www.worldcat.org/oclc/41266045
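The same requests can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names `build_request` and `fetch` and the short format keys are made up for this example, and the service may respond differently today than it did when this was written.

```python
import urllib.request

BASE = "http://www.worldcat.org/oclc/"

# MIME types taken from the curl examples above; the short keys are
# invented labels for this sketch.
FORMATS = {
    "rdfxml": "application/rdf+xml",
    "jsonld": "application/ld+json",
    "turtle": "text/turtle",
    "triples": "text/plain",
}


def build_request(oclc_number, fmt):
    """Prepare a GET request that asks for one serialisation via the Accept header."""
    url = BASE + str(oclc_number)
    return urllib.request.Request(url, headers={"Accept": FORMATS[fmt]})


def fetch(oclc_number, fmt, timeout=30):
    """Send the request and return the body as text.

    urlopen follows redirects by default, much like curl -L.
    """
    request = build_request(oclc_number, fmt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch(41266045, "turtle")` would then correspond to the third curl command. Note that the URI itself never changes; only the Accept header does.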
So, how can I use it? However you like.
If you embed links to WorldCat resources in your linked data, the standard tools used to navigate around your data should now be able to automatically follow those links into and around WorldCat data. If you have the URI for a WorldCat resource, which you can create by prefixing an oclc number with ‘http://www.worldcat.org/oclc/’, you can use it in a program, browser plug-in, smartphone/facebook app to pull data back, in a format that you prefer, to work with or display.
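As a small sketch of that URI construction, the following Python helper prefixes an OCLC number and optionally appends the `.rdf` suffix mentioned earlier. The function name `worldcat_uri` is hypothetical, and the check assumes a purely numeric OCLC number.

```python
WORLDCAT_PREFIX = "http://www.worldcat.org/oclc/"


def worldcat_uri(oclc_number, suffix=""):
    """Build a WorldCat resource URI from an OCLC number.

    Assumption for this sketch: the number consists of ASCII digits only.
    """
    number = str(oclc_number).strip()
    if not (number.isascii() and number.isdigit()):
        raise ValueError("expected a numeric OCLC number, got %r" % (oclc_number,))
    return WORLDCAT_PREFIX + number + suffix
```

For example, `worldcat_uri(41266045)` gives the URI used throughout this post, and `worldcat_uri(41266045, ".rdf")` gives the RDF/XML variant.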
Go have a play, I would love to hear how people use this.
Published in Consuming Data, Data Publishing, Development, Linked Data
Tagged: APIs, Linked Data, Ordnance Survey
Show me an example of the effective publishing of Linked Data – That, or a variation of it, must be the request I receive more than most when talking to those considering making their own resources available as Linked Data, either in their enterprise, or on the wider web.
There are some obvious candidates. The BBC for instance, makes significant use of Linked Data within its enterprise. They built their fantastic Olympics 2012 online coverage on an infrastructure with Linked Data at its core. Unfortunately, apart from a few exceptions such as Wildlife and Programmes, we only see the results in a powerful web presence. The published data is only visible within their enterprise.
DBpedia is another excellent candidate. From about 2007 it has been a clear demonstration of Tim Berners-Lee’s principles of using URIs as identifiers and providing information, including links to other things, in RDF – it is just there at the end of the DBpedia URIs. But for some reason developers don’t seem to see it as a compelling example. Maybe it is influenced by the Wikipedia effect – interesting but built by open data geeks, so not to be taken seriously.
A third example, which I want to focus on here, is Ordnance Survey. Not generally known much beyond the geographical patch they cover, Ordnance Survey is the official mapping agency for Great Britain. Formally a government agency, they are best known for their incredibly detailed and accurate maps that are the standard accessory for anyone doing anything in the British countryside. A little less known is that they also publish information about post-code areas, parish/town/city/county boundaries, parliamentary constituency areas, and even European regions in Britain. As you can imagine, these do not all neatly intersect, which makes the data about them a great case for a graph based data model and hence for publishing as Linked Data. Which is what they did a couple of years ago.
The reason I want to focus on their efforts now, is that they have recently beta released a new API suite, which I will come to in a moment. But first I must emphasise something that is often missed.
Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’. With only standard [http] web protocols, you can get the data for an entity in their dataset by just doing a http GET request on the identifier (e.g. for my local village: http://data.ordnancesurvey.co.uk/id/7000000000002929). What you get back is some nicely formatted html for your web browser, and with content negotiation you can get the same thing as RDF/XML, JSON or Turtle. As it is Linked Data, what you get back also includes links to other data, enabling you to navigate your way around their data from entity to entity.
An excellent demonstration of the basic power and benefit of Linked Data. So why is this often missed? Maybe it is because there is nothing to learn, no API documentation required, you can see and use it by just entering a URI into your web browser – too simple to be interesting perhaps.
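To make the idea of navigating from entity to entity a little more concrete, here is a minimal Python sketch that pulls the outgoing links for one entity out of a triples-style response. The sample data and its example.org URIs are invented for illustration and are not Ordnance Survey data; a real application would normally use an RDF library rather than a regular expression.

```python
import re

# Matches simple N-Triples statements whose object is a URI: <s> <p> <o> .
TRIPLE_WITH_URI_OBJECT = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.$")


def outgoing_links(ntriples_text, subject):
    """Return (predicate, object) pairs for URI-valued statements about subject."""
    links = []
    for line in ntriples_text.splitlines():
        match = TRIPLE_WITH_URI_OBJECT.match(line.strip())
        if match and match.group(1) == subject:
            links.append((match.group(2), match.group(3)))
    return links


# Invented sample data, standing in for what a server might return.
SAMPLE = """\
<http://example.org/id/village> <http://example.org/ontology/within> <http://example.org/id/district> .
<http://example.org/id/village> <http://www.w3.org/2000/01/rdf-schema#label> "Example village" .
<http://example.org/id/district> <http://example.org/ontology/within> <http://example.org/id/county> .
"""

for predicate, target in outgoing_links(SAMPLE, "http://example.org/id/village"):
    print(predicate, "->", target)
```

Each object URI found this way is itself an identifier that can be fetched with another http GET, which is all that "following your nose" around Linked Data amounts to.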
To get at the data in more interesting and complex ways you need the API set thoughtfully provided by those that understand the data and some of the most common uses for it, Ordnance Survey.
The API set, now in beta, in my opinion is a most excellent example of how to build, document, and provide access to Linked Data assets in this way.
Firstly the APIs are applied as a standard to four available data sets – three individual, and one combining all three data sets. Nice that you can work with an individually focussed set or get data from all in a consolidated graph.
There are four APIs:
- Lookup – a simple way to extract an RDF description of a single resource, using its URI.
- Search – for running keyword searches over a dataset.
- Sparql – a fully-compliant SPARQL 1.1 endpoint.
- Reconciliation – a simple web service that supports linking of datasets to the Ordnance Survey Linked Data.
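As a rough sketch of what calling a SPARQL endpoint like this involves, the Python below prepares a query request following the standard SPARQL 1.1 Protocol (a GET with a `query` parameter). The endpoint address is a hypothetical placeholder, not the real Ordnance Survey URL, and the query is a generic one that makes no assumptions about their data model.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "http://example.org/datasets/os-linked-data/apis/sparql"

# A generic query: any five resources that have an rdfs:label.
QUERY = """
SELECT ?s ?label WHERE {
  ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 5
"""


def sparql_request(endpoint, query):
    """Prepare a SPARQL 1.1 Protocol query request (GET with a 'query' parameter)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


request = sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; that step is left out here because the endpoint above is only a placeholder.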
Each API is available to play with on a web page complete with examples and pop-up help hints. It is very easy and quick to get your head around the capabilities of the individual APIs, the use of parameters, and returned formats without having to read documentation or cut a single line of code.
For a quick intro there is even a page with them all on for you to try. When you do get around to cutting code, the documentation for each API is also well presented in simple and understandable form. They even include details of the available output formats and expected http response codes.
Finally a few general comments.
Firstly the look, feel, and performance of the site reflects that this is a robust serious professional service and fills you with confidence about building your application on its APIs. Developers of services and APIs, even for internal use, often underestimate the value of presenting and documenting their offering in a professional way. How often have you come across API documentation that makes the first web page look modern, and wondered whether it was worth investing the time in even looking at it? Also a site with a snappy response ups your confidence that your application will perform well when using their service.
Secondly the range of APIs, all cleanly and individually satisfying specific general needs. So for instance you can usefully use Search and Lookup without having any understanding of RDF or SPARQL – the power of SPARQL being there only if you understand and need it.
The additional features – CORS Support and Response Caching – (detailed on the API documentation pages) also demonstrate that this service has been built with the issues of the data consumer in mind. Providing the tools for consumers to take advantage of web caching in their application will greatly enhance response and performance. The CORS Support enables the creation of in browser applications that draw data from many sites – one of the oft promoted benefits of linked data, but sometimes a little tricky to implement ‘in browser’.
I can see this site and its associated APIs greatly enhancing the reputation of Ordnance Survey; underpinning the development of many apps and applications; and becoming an ideal source for many people to go ‘to try out’, when writing their first API consuming application code.
Well done to the team behind its production.
Published in Development, Libraries, Linked Data, OCLC, Semantic Tech & Business, Semantic Web
Tagged: Libraries, Linked Data, OCLC, SemTechBiz
Help spotlight library innovation and send a library linked data practitioner to the SemTechBiz conference in San Francisco, June 2-5
Update from organisers:
We are pleased to announce that Kevin Ford, from the Network Development and MARC Standards Office at the Library of Congress, was selected for the SemanticWeb.com Spotlight on Innovation for his work with the Bibliographic Framework Initiative (BIBFRAME) and his continuing work on the Library of Congress’s Linked Data Service (id.loc.gov). In addition to being an active contributor, Kevin is responsible for the BIBFRAME website; has devised tools to view MARC records and the resulting BIBFRAME resources side-by-side; authored the first transformation code for MARC data to BIBFRAME resources; and is project manager for The Library of Congress’ Linked Data Service. Kevin also writes and presents frequently to promote BIBFRAME, ID.LOC.GOV, and educate fellow librarians on the possibilities of linked data.
Without exception, each nominee represented great work and demonstrated the power of Linked Data in library systems, making it a difficult task for the committee, and sparking some interesting discussions about future such spotlight programs.
Congratulations, Kevin, and thanks to all the other great library linked data projects nominated!
OCLC and LITA are working to promote library participation at the upcoming Semantic Technology & Business Conference (SemTechBiz). Libraries are doing important work with Linked Data. SemanticWeb.com wants to spotlight innovation in libraries, and send one library presenter to the SemTechBiz conference expenses paid.
SemTechBiz brings together today’s industry thought leaders and practitioners to explore the challenges and opportunities jointly impacting both business leaders and technologists. Conference sessions include technical talks and case studies that highlight semantic technology applications in action. The program includes tutorials and over 130 sessions and demonstrations as well as a hackathon, start-up competition, exhibit floor, and networking opportunities. Amongst the great selection of speakers you will find yours truly!
If you know of someone who has done great work demonstrating the benefit of linked data for libraries, nominate them for this June 2-5 conference in San Francisco. This “library spotlight” opportunity will provide one sponsored presenter with a spot on the conference program, paid travel & lodging costs to get to the conference, plus a full conference pass.
Nominations for the Spotlight are being accepted through May 10th. Any significant practical work should have been accomplished prior to March 31st 2013; the project can be ongoing. Self-nominations will be accepted.
Even if you do not nominate anyone, the Semantic Technology and Business Conference is well worth experiencing. As supporters of the SemanticWeb.com Library Spotlight, OCLC and LITA members will get a 50% discount on a conference pass – use discount code “OCLC” or “LITA” when registering. (Non-members can still get a 20% discount for this great conference by quoting code “FCLC”.)
For more details check out the OCLC Innovation Series page.
Thank you for all the nominations we received for the first Semantic Web.com Spotlight on Innovation in Libraries.
Published in Data Publishing, Google, Libraries, Linked Data, schema.org, Web
Tagged: Data, Entities, Libraries, Linked Data, RDF
As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series. That one – Entification, the following one – Hubs of Authority and this, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Beacons of Availability
As I indicated in the first of this series, there are descriptions of a broader collection of entities, than just books, articles and other creative works, locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.
As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority, to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography, and WorldCat) is creating Library Linked Data openly available on the Web.
“Why do we catalogue?” is a question I often ask, with an obvious answer – so that people can find our stuff. How does this entification, sharing of authorities, and creation of a web of library linked data help us in that goal? In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. Sounds like a great benefit and mission statement for libraries of the world but unfortunately not one that will nudge the needle on making library resources more discoverable for the vast majority of those that can benefit from them.
I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google. A similar weight of opinion can be found that complains how bad Google, and the other search engines, are at representing library resources. You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources. Yet I am willing to bet that again we have another 80-20 equation or worse about how few, of the users that libraries want to reach, even know those specialist Google services exist. A bit of a sorry state of affairs when the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!
Library linked data helps solve both problems: better description and better findability of library resources in the major search engines. Plus it can help with the problem of identifying where a user can gain access to that resource to loan, download, view via a suitable license, or purchase, etc.
Before a search engine can lead a user to a suitable resource, it needs to identify that the resource exists, in any form, and hold a description for display in search results that will sufficiently inform a user. Library search interfaces are inherently poor sources of such information, with web crawlers having to infer, from often difficult to differentiate text, what the page might be about. This is not a problem isolated to library interfaces. In response, the major search engines have cooperated to introduce a generic vocabulary for embedding structured information into web pages so that they can be informed in detail what the page references. This vocabulary is Schema.org – I have previously posted about its success and significance.
With a few enhancements in the way it can describe bibliographic resources (currently being discussed by the Schema Bib Extend W3C Community Group) Schema.org is an ideal way for libraries to publish information about our resources and associated entities in a format the search engines can consume and understand. Using URIs for authorities in that data, for instance identifying the author in question by his/her VIAF identifier, gives the search engines the ability to identify resources from many libraries associated with the same person. With this greatly enriched, more structured view of library resources, linked to authoritative hubs, the likes of Google over time will stand a far better chance of presenting potential library users with useful informative results. I am pleased to say that OCLC have been at the forefront of demonstrating this approach by publishing Schema.org modelled linked data in the default WorldCat.org interface.
For this approach to be most effective, many of the major libraries, consortia, etc. will need to publish metadata as linked data, in a form that the search engines can consume whilst (following linked data principles) linking to each other when they identify that they are describing the same resource. Many instances of [in data terms] the same thing being published on the web will naturally raise its visibility in results listings.
An individual site (even a WorldCat) has difficulty in being identified above the noise of retail and other sites. We are aware of the Page Rank algorithms used by the search engines to identify and boost the reputation of individual sites and pages by the numbers of links between them. If not an identical process, it is clear that similar rules will apply for structured data linking. If twenty sites publish their own linked data about the same thing, the search engines will take note of each of them. If each of those sites asserts that its resource is the same resource as a few of its partner sites (building a web of connection between instances of the same thing), I expect that the engines will take exponentially more notice.
Page ranking does not depend on all pages having to link to all others. Like many things on the web, hubs of authority and aggregation will naturally emerge with major libraries, local, national, and global consortia doing most of the inter-linking, providing interdependent hubs of reputation for others to connect with.
Having identified a resource that may satisfy a potential library user’s need, the next even more difficult problem is to direct that user to somewhere that they can gain access to it – loan, download, view via an appropriate licence, or purchase, etc.
WorldCat.org, and other hubs, with linked data enhanced to provide holdings information, may well provide a target via which a user may gain access to, in addition to just getting a description of, a resource. However, those few sites, no matter how big or well recognised they are, are just a few sites shouting in the wilderness of the ever increasing web. Any librarian in any individual library can quite rightly ask how to help Google, and the others, to point users at the most appropriate copy in his/her library.
We have all experienced the scenario of searching for a car rental company, to receive a link to one within walking distance as first result – or finding the on-campus branch at the top of a list of results in response to a search for banks. We know the search engines are good at location-based searching, whether by geography or by interest, so why can they not do it for library resources? To achieve this a library needs to become an integral part of a Web of Library Data: publishing structured linked data about the resources they have available for the search engines to find; in that data, linking their resources to the reputable hubs of bibliographic data that will emerge, so the engines know it is another reference to the same thing; and going beyond basic bibliographic description to encompass structured data used by the commercial world to identify availability.
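A rough sketch of what those three ingredients might look like together, written as Python that emits Schema.org JSON-LD. This is one possible modelling rather than an agreed standard: the library name, the local item URI, and the choice of `Offer` with `availability` to express that a copy can be obtained are all assumptions made for illustration. The hub link reuses the WorldCat URI from an earlier post.

```python
import json


def describe_holding(item_uri, title, hub_uri, library_name, in_stock):
    """Describe a library's copy: a description, a link to a hub, and availability."""
    if in_stock:
        availability = "http://schema.org/InStock"
    else:
        availability = "http://schema.org/OutOfStock"
    return {
        "@context": "http://schema.org",
        "@type": "Book",
        "@id": item_uri,
        "name": title,
        # Link to the hub's identifier for the same thing.
        "sameAs": hub_uri,
        # Availability expressed with the vocabulary the commercial world uses.
        "offers": {
            "@type": "Offer",
            "availability": availability,
            "seller": {"@type": "Library", "name": library_name},
        },
    }


holding = describe_holding(
    item_uri="http://library.example.org/item/123",  # hypothetical local URI
    title="Harry Potter and the prisoner of Azkaban",
    hub_uri="http://www.worldcat.org/oclc/41266045",
    library_name="Example Town Library",  # hypothetical library
    in_stock=True,
)
print(json.dumps(holding, indent=2))
```

Embedded in the library's web page for the item, a description of this kind gives a crawler the description, the link to the hub, and the availability in one place.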
So who is going to do all this then – will every library need to employ a linked data expert? I certainly hope not.
One would expect the leaders in this field, national libraries, OCLC, consortia etc to continue to lead the way, in the process establishing the core of this library web of data – the hubs. Building on that framework the rest of the web can be established with the help of the products, and services of service providers and system suppliers. Those concerned about these things should already be starting to think about how they can be helped not only to publish linked data in a form that the search engines can consume, but also how their resources can become linked via those hubs to the wider web.
By lighting a linked data beacon on top of their web presence, a library will announce to the world the availability of their resources. One beacon is not enough. A web of beacons (the web of library data) will alert the search engines to the mass of those resources in all libraries, then they can lead users via that web to the appropriately located individual resource in particular.
This won’t happen overnight, but we are certainly in for some interesting times ahead.
Beacons picture from wallpapersfor.me
Published in Consuming Data, Data Publishing, Development, Libraries, Linked Data, OCLC, Web
Tagged: Data, Entities, Libraries, Linked Data, RDF, Records
As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series. That one – Entification, and the next in the series – Beacons of Availability, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Hubs of Authority
Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years. The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world either in national libraries or cooperative organisations. Two from personal experience come to mind: BLCMP, started in Birmingham, UK in 1969, eventually evolved into the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC. Both, in their own way, have had significant impact on the worlds of libraries, metadata, and the web (and me!).
One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records. A large number of national libraries have such lists of agreed formats for author and organisational names. The Library of Congress has, in addition to its name authorities, subjects, classifications, languages, countries etc. Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations.
These authority files play a major role in the efficient cataloguing of material today, either by being part of the workflow in a cataloguing interface, or often just using the wonders of Windows ^C & ^V keystroke sequences to transfer agreed format text strings from authority sites into Marc record fields.
It is telling that the default [librarian] description of these things is a file – an echo back to the days when they were just that, a file containing a list of names. Almost despite their initial purpose, authorities are gaining a wider purpose. As a source of names for, and growing descriptions of, the entities that the library world is aware of. Many authority file hosting organisations have followed the natural path, in this emerging world of Linked Data, to provide persistent URIs for each concept plus publishing their information as RDF.
These Linked Data enabled sources of information are developing importance in their own right, as a natural place to link to when asserting the thing, person, or concept you are identifying in your data. Sir Tim Berners-Lee’s fourth principle of Linked Data tells us to “Include links to other URIs. so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative, source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other. A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.
Libraries and librarians have a great brand image, something that attaches itself to the data and services they publish on the web. Respected and trusted are a couple of words that naturally associate with bibliographic authority data emanating from the library community. This data, starting to add value to the wider web, comes from those Marc records I spoke about last time. Yet it does not, as yet, lead those navigating the web of data to those resources so carefully catalogued. In this case, instead of cataloguing so people can find stuff, we could be considered to be enriching the web with hubs of authority derived from, but not connected to, the resources that brought them into being.
So where next? One obvious move, that is already starting to take place, is to use the identifiers (URIs) for these authoritative names to assert within our data, facts such as who a work is by and what it is about. Check out data from the British National Bibliography or the linked data hidden in the tab at the bottom of a WorldCat display – you will see VIAF, LCSH and other URIs asserting connection with known resources. In this way, processes no longer need to infer from the characters on a page that they are connected with a person or a subject. It is a fundamental part of the data.
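As an illustration of asserting such facts with identifiers rather than strings, the Python sketch below builds a small Schema.org description in JSON-LD. The two authority URIs are deliberate placeholders, not real VIAF or LCSH identifiers, and the function name `describe_book` is made up for this example.

```python
import json

# Placeholder identifiers: substitute real VIAF and LCSH URIs for real data.
AUTHOR_URI = "http://viaf.org/viaf/PLACEHOLDER"
SUBJECT_URI = "http://id.loc.gov/authorities/subjects/PLACEHOLDER"


def describe_book(title, author_name, author_uri, subject_label, subject_uri):
    """Describe a book whose author and subject are identified by URIs."""
    return {
        "@context": "http://schema.org",
        "@type": "Book",
        "name": title,
        # The @id is the assertion; the name is only there for human readers.
        "author": {"@type": "Person", "@id": author_uri, "name": author_name},
        "about": {"@type": "Thing", "@id": subject_uri, "name": subject_label},
    }


book = describe_book(
    "An example title",
    "An Example Author",
    AUTHOR_URI,
    "An example subject",
    SUBJECT_URI,
)
print(json.dumps(book, indent=2))
```

A consumer of this data does not need to guess whether two name strings refer to the same person; it only needs to compare the `@id` values.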
With that large amount of rich [linked] data, and the association of the library brand, it is hardly surprising that these datasets are moving beyond mere nodes on the web of data. They are evolving into Hubs of Authority, building a framework on which libraries, and the rest of the web, can hang descriptions of, and signposts to, our resources. A framework that has uses and benefits beyond the boundaries of bibliographic data. By not keeping those hubs ‘library only’, we enable the wider web to build pathways to the library curated resources people need to support their research, learning, discovery and entertainment.
Image by the trial on Flickr
Published in Development, Libraries, OCLC, Uncategorized
Tagged: Data, Entities, Libraries, RDF, Records
As is often the way, you start a post without realising that it is part of a series of posts – as with this one. This, and the following two posts in the series – Hubs of Authority, and Beacons of Availability – together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Entification – a bit of an ugly word, but in my day to day existence one I am hearing more and more. What an exciting life I lead…
What is it, and why should I care, you may be asking.
I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and also as a potential stimulator of significant efficiencies in the creation and management of information about those resources. Taking those benefits as being accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.
That phrase ‘getting library data into a linked data form’ hides multitude of issues. There are some obvious steps such as holding and/or outputting the data in RDF, providing resources with permanent URIs, etc. However, deriving useful library linked data from a source, such as a Marc record, requires far more than giving it a URI and encoding what you know, unchanged, as RDF triples.
Marc is a record based format. For each book catalogued, a record created. The mantra driven in to future cataloguers at library school has been, and I believe often still is, catalogue the item in your hand. Everything discoverable about that item in their hand is transferred on to that [now virtual] catalogue card stored in their library system. In that record we get obvious bookish information such as title, size, format, number of pages, isbn, etc. We also get information about the author (name, birth/death dates etc.), publisher (location, name etc.), classification scheme identifiers, subjects, genres, notes, holding information, etc., etc., etc. A vast amount of information about, and related to, that book in a single record. A significant achievement – assembling all this information for the vast majority of books in the vast majority of the libraries of the world. In this world of electronic resources a pattern that is being repeated for articles, journals, eBooks, audiobooks, etc.
Why do we catalogue? A question I often ask with an obvious answer – so that people can find our stuff. Replicating the polished draws of catalogue cards of old, ordered by author name or subject, indexes are applied to the strings stored in those records . Indexes acting as search access points to a library’s collection.
A spin-off of capturing information in record attributes, about library books/articles/etc., is that we are also building up information about authors, publishers, subjects and classifications. So, for instance, a subject index will contain a list of all the names of the subjects addressed by an individual library collection. To apply some consistency between libraries, authorities – authoritative sets of names, subject headings, etc. – have emerged so that spellings and name formats could be shared in a controlled way between libraries and cataloguers.
So where does entification come in? Well, much of the information about authors, subjects, publishers, and the like is locked up in those records. A record could be taken as describing an entity, the book. However the other entities in the library universe are described only as attributes of the book/article/text. I can attest to the vast computing power and intellectual effort that goes into efforts at OCLC to mine these attributes from records to derive descriptions of the entities they represent – the people, places, organisations, subjects, etc. that the resources are by, about, or related to in some way.
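For those who like to see an idea in code, here is a minimal Python sketch of what entification means at its very simplest. Everything in it is an assumption for illustration: the flattened records, the field names and the sample values are invented, and matching entities on identical strings is a stand-in for the far harder disambiguation work described above.

```python
from collections import defaultdict

# Two invented, heavily simplified catalogue records. Real Marc records are
# far richer; these field names and values are hypothetical.
records = [
    {"id": "rec1", "title": "Example Book One", "author": "Author, A. N.",
     "publisher": "Example Press", "subjects": ["Wizards", "Schools"]},
    {"id": "rec2", "title": "Example Book Two", "author": "Author, A. N.",
     "publisher": "Example Press", "subjects": ["Wizards"]},
]


def entify(records):
    """Lift record attributes into entities keyed by (type, name).

    Each entity maps to the set of record ids it is connected to, so the
    author, publisher and subjects stop being mere strings inside a book record.
    """
    entities = defaultdict(set)
    for rec in records:
        entities[("Person", rec["author"])].add(rec["id"])
        entities[("Organization", rec["publisher"])].add(rec["id"])
        for subject in rec["subjects"]:
            entities[("Subject", subject)].add(rec["id"])
    return entities


entities = entify(records)
for (kind, name), books in sorted(entities.items()):
    print(kind, name, sorted(books))
```

Even in this toy form the author and publisher each become a single thing connected to two books, rather than a string repeated in two records.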
Once the entities are identified, and a model is produced & populated from the records, we can start to work with a true multi-dimensional view of our domain. A major step forward from the somewhat singular view that we have been working with over previous decades. With such a model it should be possible to identify and work with new relationships, such as publishers and their authors, subjects and collections, works and their available formats.
We are in a state of change in the library world, which entification of our data will help us get to grips with. As you can imagine, as these new approaches crystallise, they are leading to all sorts of discussions around what are the major entities we need to concern ourselves with; how do we model them; how do we populate that model from source [record] data; how do we do it without compromising the rich resources we are working with; and how do we continue to provide and improve the services relied upon at the moment, whilst change happens. Challenging times – bring on the entification!
Russian doll image by smcgee on Flickr
Published in Data Publishing, Libraries, Linked Data, OCLC, Open Data, schema.org
Tagged: Libraries, Linked Data, OCLC, Open Data, schema.org
Back in September I formed a W3C Group – Schema Bib Extend. To quote an old friend of mine “Why did you go and do that then?”
Well, as I have mentioned before, Schema.org has become a bit of a success story for structured data on the web. I would have no hesitation in recommending it as a starting point for anyone, in any sector, wanting to share structured data on the web. This is what OCLC did in the initial exercise to publish the 270+ million resources in WorldCat.org as Linked Data.
At the same time, I believe that summer 2012 was a bit of a watershed for Linked Data in the library world. Over the preceding few years we have had various national libraries publishing linked data (British Library, Bibliothèque nationale de France, Deutsche Nationalbibliothek, National Library of Sweden, to name just a few). We have had linked data published versions of authority files such as LCSH, RAMEAU, National Diet Library, plus OCLC hosted services such as VIAF, FAST, and Dewey. These plus many other initiatives have led me to conclude that we are moving to the next stage – for instance the British Library and Deutsche Nationalbibliothek are starting to cross-link their data, and the Library of Congress BIBFRAME initiative is starting to expose some of its [very linked data] thinking.
Of course the other major initiative was the publication of Linked Data, using Schema.org, from within OCLC’s WorldCat.org, both as RDFa embedded in WorldCat detail pages, and in a download file containing the 1.2 million most highly held works.
The need to extend the Schema.org vocabulary became clear when using it to mark up the bibliographic resources in WorldCat. The Book type defined in Schema.org, along with other types derived from CreativeWork, contains many of the properties you need to describe bibliographic resources, but lacks some of the more detailed ones, such as holdings count and carrier type, that we wanted to represent. It was also clear that it would need more extension if we wanted to go further to define the relationships between such things as works, expressions, manifestations, and items – to talk FRBR for a moment.
The organisations behind Schema.org (Google, Bing, Yahoo, Yandex) invite proposals for extension of the vocabulary via the W3C public-vocabs mailing list. OCLC could have taken that route directly, but at best I suggest it would have only partially served the needs of the broad spread of organisations and people who could benefit from enriched description of bibliographic resources on the web.
So that is why I formed a W3C Community Group to build a consensus on extending the Schema.org vocabulary for these types of resources. I wanted to not only represent the needs, opinions, and experience of OCLC, but also the wider library sector of libraries, librarians, system suppliers and others. Any generally applicable vocabulary [most importantly recognised by the major search engines] would also provide benefit for the wider bibliographic publishing, retailing, and other interested sectors.
Four months, and four conference calls (supported by OCLC – thank you), later we are a group of 55 members with a fairly active mailing list. We are making progress towards shaping up some recommendations, having invested much time in discussing our objectives and the issues of describing detailed bibliographic information (currently often found in Marc, Onix, or other industry specific standards) in a generic web-wide vocabulary. We are not trying to build a replacement for Marc, or turn Schema.org into a standard that you could operate a library community with.
Applying Schema.org markup to your bibliographic data is aimed at announcing its presence, and the resources it describes, to the web and linking them into the web of data. I would expect to see it being applied as complementary markup to other RDF based standards such as BIBFRAME as it emerges. Although Schema.org started with Microdata and, latterly [and increasingly] RDFa, the vocabulary is equally applicable serialised in any of the RDF formats (N-Triples, Turtle, RDF/XML, JSON) for processing and data exchange purposes.
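To show that the vocabulary is not tied to web page markup, here is a small Python sketch that writes a Schema.org description out as N-Triples using nothing but the standard library. It is an illustration under stated assumptions, not how WorldCat produces its data: the example.org URIs and the names are invented, and the hand-rolled escaping covers only the simplest cases, where a real project would use a proper RDF library.

```python
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value):
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


# Hypothetical identifiers, invented for this sketch.
book = "http://example.org/book/1"
author = "http://example.org/person/1"

triples = [
    (iri(book), iri(RDF_TYPE), iri(SCHEMA + "Book")),
    (iri(book), iri(SCHEMA + "name"), literal("An Example Book")),
    (iri(book), iri(SCHEMA + "author"), iri(author)),
    (iri(author), iri(RDF_TYPE), iri(SCHEMA + "Person")),
    (iri(author), iri(SCHEMA + "name"), literal("A. N. Author")),
]


def to_ntriples(triples):
    """Serialise one triple per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(to_ntriples(triples))
```

The same five statements could equally be written as Turtle, RDF/XML or RDFa; only the syntax changes, not the Schema.org terms.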
My hope over the next few months is that we will agree and propose some extensions to Schema.org (that will get accepted) especially in the areas of work/manifestation relationships, representations of identifiers other than ISBN, defining content/carrier, journal articles, and a few others that may arise. Something that has become clear in our conversations is that we also have a role as a group in providing examples of how [extended] Schema.org markup should be applied to bibliographic data.
I would characterise the stage we are at as moving from talking about it to doing something about it. I am looking forward to the next few months with enthusiasm.
If you want to join in, you will find us over at http://www.w3.org/community/schemabibex/ (where, amongst other things, you will find recordings and chat transcripts from the meetings so far on the Wiki). If you or your group want to know more about Schema.org and its relevance to libraries and the broader bibliographic world, drop me a line or, if I can fit it in with my travels to conferences such as ALA, I could be persuaded to stand up and talk about it.
I have been banging on about Schema.org for a while. For those that have been lurking under a structured data rock for the last year, it is an initiative of cooperation between Google, Bing, Yahoo!, and Yandex to establish a vocabulary for embedding structured data in web pages to describe ‘things’ on the web. Apart from the simple significance of having those four names in the same sentence as the word cooperation, this initiative is starting to have some impact. As I reported back in June, the search engines are already seeing some 7%-10% of pages they crawl containing Schema.org markup. Like it or not, it is clear that Schema.org is rapidly becoming a de facto way of marking up your data if you want it to be shared on the web and have it recognised by the major search engines.
It is no coincidence, then, that at OCLC we chose Schema.org as the way to expose linked data in WorldCat. If you haven’t seen it, just search for any item at worldcat.org, scroll to the bottom of the page and open up the Linked Data tab and there you will see the [not very pretty, but hey, it’s really designed for systems not humans] Schema.org marked up linked data for the item, with links out to other data sources such as VIAF, LCSH, FAST, and Dewey.
As with everything new it was not perfect from the start. We discovered some limitations in the vocabulary as my colleagues attempted to describe WorldCat resources, leading to the creation of a Library vocabulary (as a potential extension to Schema.org) to help encode some of the stuff that Schema couldn’t. Fortunately, those at Schema.org are open to extension proposals and, with the help of the W3C, run a Group [WebSchemas] to propose and discuss them. Proposals that have already been accepted include those from news and ecommerce groups.
Things have moved on, and I have launched another W3C Community Group – Schema Bib Extend – to attempt to build a consensus, across a wide group of those concerned about things bibliographic, around proposing extensions to the Schema.org vocabulary, addressing its capability for describing these types of resources – books, journals, articles, theses, etc., etc. in all forms and formats.
My personal hope is that the resulting proposals, if and when adopted by Schema.org, will enable libraries, publishers, interest groups, universities, retailers, OCLC, and others to not only publish data about their resources in a way that the search engines can understand, but also have a lightweight way to interconnect them to each other and to authoritative identifiers for place, name, subject, etc., that will help us begin to form a distributed web of bibliographic data. A bit of a grand ambition for a fairly simple vocabulary you may think, but things worth having are worth reaching for.
So focusing back on the short term for the moment. Extending Schema.org to better describe bib resources could have significant benefits anyway. What is in library catalogues, and other bibliographic sources, is mostly hidden to search engines – OPAC pages are almost impossible to scrape intuitively, the data formats used are only understood by the library and publisher worlds, and even if the search engines ascertain the work a library is describing, there is little way to identify that it is, or is not, the same as one in another library. It is no accident that Google Book Search came into being utilising special data ingest processes and search techniques to help. Unfortunately there is a significant part of the population unaware of its existence and few who use it as part of their general search activities. By marking up your resources in their terms, your data should appear in the main search indexes and you may even get a better results listing (courtesy of Google Rich Snippets).
OK, that’s the pitch for Schema.org (and getting together to extend it a little in the bibliographic direction) over. Now on to the point of this post – the mindset we should adopt when approaching the generic, high level, coarse-grained, broad but shallow, simplistic [choose your own phrase] Schema.org vocabulary to describe the rich and [already] richly described resources we find in libraries. Although all my examples will be library/bibliographic ones, I believe that much of what I describe here will be of use and relevance to those in other industries with rich and established ways to describe their data and resources.
Initially let me get a few simple things out of the way. Firstly, the Schema.org vocabulary is not designed to, and will never, replace any rich industry specific vocabularies or ontologies. Its prime benefits are that it is light-weight (understandable by non-experts) and cross-sectoral (data from many domains can be merged and mixed) and, oh yes, becoming broadly adopted. Secondly, nobody is advocating that anyone starts to use it instead of their currently used standards – either mix it with your domain specific standards and/or use it as a ‘publicly understandable’ publishing format for web pages and the like. Finally, although initially conceived as a web page markup (Microdata) format, the Schema.org vocabulary is equally applicable as a Linked Data vocabulary that can be used in the creation of RDF data. The increasing use of, and reference to, RDFa in Schema.org is a reflection of this. This is also exemplified by the use of Schema.org in the RDF N-Triples dump file OCLC has published of a sub-set of WorldCat data.
So moving on. You have your resources already being described, following established practice, in domain specific format(s) and you want to attempt to describe them using the Schema.org vocabulary. In the library/publishing community we have more such standards than you can shake a stick at – MARC (of several mostly incompatible flavours), MODS, METS, ONIX, ISBD, RDA, to name just some. Each has its enthusiasts and proponents, and many would be a great starting point for a process that might go something like this:
Working my way through all the elements of the [insert your favourite here] standard let me find an equivalent in Schema that I can map my data to.
This can become a bit of an involved operation. Take something as simple as the author of a book for instance. Bibliographic standards have concepts such as main author, corporate, creator, contributor, etc. Schema>Book only has the simple property ‘author’. How can I reflect the rich nuances and detail in my [library] format, in this simplistic Schema.org vocabulary? Simple answer – you can’t, so don’t try. The question you have to ask yourself at this point is: by adding all this detail will I confuse potential consumers of this data, or will the Googles of this world just want to know the people and organisations connected with [linked to] this book in a creative (text) way? Taking this approach of looking at the problem from the data/domain expert’s end of the telescope means that you have to go through a similar process for each and every element in your data format/vocabulary/standard. An approach that will most probably lead to a long list of things missing from, and recommendations for, Schema.org that they (the group, not the vocabulary) would be unlikely to accept.
Let me propose an alternative approach by turning the telescope around and viewing the data, that you care about and want to publish, from the non-expert consumer’s point of view. Using my book example again it might go like this:
Schema has a Book class (great!), so let me step through its properties and identify where in [insert your favourite standard here] I could get that from.
So for example, the ‘author’ property of Schema’s Book class comes from it being a sub-class of the generic CreativeWork class where it is defined as being a Person or Organization – The author of this content. You can now look into your own vocabulary or standard to find the elements which would contain author-ish data to map to Schema.
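Turning the telescope around can be sketched in a few lines of Python. This is only an illustration under assumptions of my own choosing: the mapping table is a hypothetical, simplified choice of Marc tags rather than a recommended crosswalk, the record is invented, and a record is reduced to a plain dictionary of tag to values. The point is the direction of travel: the loop walks the Schema.org properties and asks the source for data, so anything in the source that is not in the mapping is simply never visited.

```python
# Schema.org Book properties we want to fill, each mapped to the source
# fields that might hold that data (a hypothetical, simplified mapping).
SCHEMA_TO_SOURCE = {
    "name": ["245"],           # title statement
    "author": ["100", "700"],  # main and added personal name entries
    "isbn": ["020"],
    "about": ["650"],          # topical subject headings
}


def describe_book(record):
    """Build a Schema.org-style description by walking Schema properties,
    not by walking every field of the source record."""
    description = {"@type": "Book"}
    for prop, tags in SCHEMA_TO_SOURCE.items():
        values = [value for tag in tags for value in record.get(tag, [])]
        if values:
            description[prop] = values
    return description


# Invented record: tag -> list of values. Field 300 is present in the source
# but is not in the mapping above, so it never reaches the output.
record = {
    "245": ["An Example Book"],
    "100": ["Author, A. N."],
    "700": ["Writer, Co"],
    "300": ["xii, 345 p. ; 24 cm"],
}

print(describe_book(record))
```

Both the main and the added name entries land in the one ‘author’ property, which is exactly the loss of nuance that the non-expert consumer is unlikely to miss.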
Hang on a moment though! The Book>author property is defined as being an instance of (or link to) the Person or Organization classes. This means that when we start to publish our data in this form, it is not a matter of just extracting the text string of the author’s name from our data; we need to provide a link to a description of that author (preferably also in Schema.org format). WorldCat data does this by providing author links to VIAF – a pattern repeated with other properties such as ‘about’ (with links to Dewey and LCSH).
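As a rough illustration of the difference between a name string and a link, the Python sketch below builds a JSON description in which the author is a Person node carrying an identifier. The assumptions are mine: the JSON-LD style keys (@context, @type, @id) are just one convenient way to show the shape, the lookup table stands in for whatever matching process finds the identifier, and the example.org URI is a placeholder, not a real VIAF identifier.

```python
import json

# Hypothetical lookup from a name string in our data to an identifier URI.
# The URI is a placeholder, not a real VIAF identifier.
AUTHOR_URIS = {"Author, A. N.": "http://example.org/viaf/PLACEHOLDER"}


def author_node(name):
    """Describe an author as a Person entity, linked by URI when one is known."""
    node = {"@type": "Person", "name": name}
    uri = AUTHOR_URIS.get(name)
    if uri:
        node["@id"] = uri
    return node


book = {
    "@context": "http://schema.org",
    "@type": "Book",
    "name": "An Example Book",
    "author": author_node("Author, A. N."),
}

print(json.dumps(book, indent=2))
```

Where no identifier is known the node still says the author is a Person with a name, which is less useful than a link but more than a bare string.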
Taking this approach limits you to only thinking about the things Schema [currently] concerns itself with – a much simpler process.
If that was all there was to it, there would be no need for the Schema Bib Extend Group. As we found at OCLC with WorldCat, some gaps remain in the result, making it unsatisfactory in some areas as a description for even a non-expert. Obvious candidates [for a Book] include a holding statement, and how to describe the type of book (ebook, talking book, etc.) and the format it is in (paper/hard back, large print, CD, Cassette, MP3, etc.). However, approaching it from this direction encourages you to look first across other areas of the Schema.org vocabulary and other extension proposals for solutions. GoodRelations, soon to be merged into Schema, offers some promising potential answers for holdings (describing them as Offers to hire/lease). A proposal from the Radio/TV community includes a PublicationEvent.
Finally, it is only the gaps, or anomalies, apparent at a Schema.org level that should turn into proposals for extension. How they would map to elements of standards from our own domain would be down to us [as with what is already in Schema.org] to establish, sharing consensus-driven good practice and lots, and lots, of examples.
We, especially in the library community, have invested much time and effort over many decades in describing [cataloguing] our resources so that people can discover and benefit from them. Long gone are the days when the way to find things was to visit the library and flick through drawers full of catalogue cards. Libraries were quick to take advantage of the web, putting up their WebOPACs so that you could ‘search from home’. However, study after study has shown that people are now not visiting the library online either. The de facto [and often only] start point is now a search engine – increasingly as represented by a generic search prompt on your phone or tablet device.
This evolution in searching practice would be fine [from a library point of view] if library resources were identified and described to the search engines such that they could easily consume and understand them – so far they haven’t been. Schema.org is a way to do that and, to be realistic, at the moment it is the only show in town that fits that particular bill. We realised decades, if not centuries, ago that for people to find our things we need to describe them, but the best descriptions in the world are about as much use as a chocolate teapot if they are not in places where those people are looking.
If you want to know more about bibliographic extension proposals to Schema.org, or help in creating them, join us at Schema Bib Extend.
And remember – when you are thinking about relating your favourite standard to Schema.org, check which end of the telescope you are using before you start.