Forming Consensus on Schema.org for Libraries and More

Back in September I formed a W3C Group – Schema Bib Extend.  To quote an old friend of mine: “Why did you go and do that then?”

Well, as I have mentioned before, Schema.org has become a bit of a success story for structured data on the web.  I would have no hesitation in recommending it as a starting point for anyone, in any sector, wanting to share structured data on the web.  This is what OCLC did in the initial exercise to publish the 270+ million resources in WorldCat.org as Linked Data.

At the same time, I believe that summer 2012 was a bit of a watershed for Linked Data in the library world.  Over the preceding few years we have had various national libraries publishing linked data (British Library, Bibliothèque nationale de France, Deutsche Nationalbibliothek, National Library of Sweden, to name just a few).  We have had linked data versions of authority files published, such as LCSH, RAMEAU, and those of the National Diet Library, plus OCLC hosted services such as VIAF, FAST, and Dewey.  These, plus many other initiatives, have led me to conclude that we are moving to the next stage – for instance the British Library and Deutsche Nationalbibliothek are starting to cross-link their data, and the Library of Congress BIBFRAME initiative is starting to expose some of its [very linked data] thinking.

Of course the other major initiative was the publication of Linked Data, using Schema.org, from within OCLC’s WorldCat.org, both as RDFa embedded in WorldCat detail pages, and in a download file containing the 1.2 million most highly held works.

The need to extend the Schema.org vocabulary became clear when using it to mark up the bibliographic resources in WorldCat. The Book type defined in Schema.org, along with other types derived from CreativeWork, contains many of the properties you need to describe bibliographic resources, but lacks some of the more detailed ones, such as holdings count and carrier type, that we wanted to represent. It was also clear that it would need more extension if we wanted to go further and define the relationships between such things as works, expressions, manifestations, and items – to talk FRBR for a moment.

The organisations behind Schema.org (Google, Bing, Yahoo, Yandex) invite proposals for extension of the vocabulary via the W3C public-vocabs mailing list.  OCLC could have taken that route directly, but at best I suggest it would have only partially served the needs of the broad spread of organisations and people who could benefit from enriched description of bibliographic resources on the web.

So that is why I formed a W3C Community Group to build a consensus on extending the Schema.org vocabulary for these types of resources.  I wanted to not only represent the needs, opinions, and experience of OCLC, but also the wider library sector of libraries, librarians, system suppliers and others.  Any generally applicable vocabulary [most importantly recognised by the major search engines] would also provide benefit for the wider bibliographic publishing, retailing, and other interested sectors.

Four months, and four conference calls (supported by OCLC – thank you), later we are a group of 55 members with a fairly active mailing list. We are making progress towards shaping up some recommendations, having invested much time in discussing our objectives and the issues of describing detailed bibliographic information (often currently to be found in MARC, ONIX, or other industry specific standards) in a generic web-wide vocabulary.  We are not trying to build a replacement for MARC, or to turn Schema.org into a standard that you could operate a library community with.

Applying Schema.org markup to your bibliographic data is aimed at announcing its presence, and the resources it describes, to the web and linking them into the web of data. I would expect to see it being applied as complementary markup to other RDF based standards such as BIBFRAME as it emerges.  Although Schema.org started with Microdata and, latterly [and increasingly] RDFa, the vocabulary is equally applicable serialised in any of the RDF formats (N-Triples, Turtle, RDF/XML, JSON) for processing and data exchange purposes.
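
To make that a little more concrete, here is a minimal sketch of what a Schema.org description of a book could look like serialised as Turtle – the URIs and values below are invented purely for illustration, not taken from WorldCat:

  @prefix schema: <http://schema.org/> .

  # An invented book and author, described with core Schema.org properties
  <http://example.org/book/1> a schema:Book ;
      schema:name "An Example Book" ;
      schema:author <http://example.org/person/1> ;
      schema:datePublished "2007" ;
      schema:inLanguage "en" .

  <http://example.org/person/1> a schema:Person ;
      schema:name "A. N. Author" .

Exactly the same statements could be embedded in a page as RDFa or Microdata, or exchanged as N-Triples – the vocabulary is independent of the serialisation.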

My hope over the next few months is that we will agree and propose some extensions to Schema.org (that will get accepted), especially in the areas of work/manifestation relationships, representation of identifiers other than ISBN, defining content/carrier, journal articles, and a few others that may arise.  Something that has become clear in our conversations is that we also have a role as a group in providing examples of how [extended] Schema.org markup should be applied to bibliographic data.

I would characterise the stage we are at as moving from talking about it to doing something about it.  I am looking forward to the next few months with enthusiasm.

If you want to join in, you will find us over at http://www.w3.org/community/schemabibex/ (where, amongst other things, you will find recordings and chat transcripts from the meetings so far on the wiki).  If you or your group want to know more about Schema.org and its relevance to libraries and the broader bibliographic world, drop me a line or, if I can fit it in with my travels to conferences such as ALA, I could be persuaded to stand up and talk about it.

The Correct End Of Your Telescope – Viewing Schema.org Adoption

I have been banging on about Schema.org for a while.  For those that have been lurking under a structured data rock for the last year, it is an initiative of cooperation between Google, Bing, Yahoo!, and Yandex to establish a vocabulary for embedding structured data in web pages to describe ‘things’ on the web.  Apart from the simple significance of having those four names in the same sentence as the word cooperation, this initiative is starting to have some impact.  As I reported back in June, the search engines are already seeing some 7%-10% of pages they crawl containing Schema.org markup.  Like it or not, it is clear that Schema.org is rapidly becoming a de facto way of marking up your data if you want it to be shared on the web and have it recognised by the major search engines.

It is no coincidence, then, that at OCLC we chose Schema.org as the way to expose linked data in WorldCat.  If you haven’t seen it, just search for any item at worldcat.org, scroll to the bottom of the page and open up the Linked Data tab and there you will see the [not very pretty, but hey, it’s really designed for systems not humans] Schema.org marked up linked data for the item, with links out to other data sources such as VIAF, LCSH, FAST, and Dewey.

As with everything new it was not perfect from the start.  We discovered some limitations in the vocabulary as my colleagues attempted to describe WorldCat resources, leading to the creation of a Library vocabulary (as a potential extension to Schema.org) to help encode some of the things that Schema couldn’t.  Fortunately, those at Schema.org are open to extension proposals and, with the help of the W3C, run a group [WebSchemas] to propose and discuss them.  Proposals that have already been accepted include those from the news and e-commerce groups.

Things have moved on and I have launched another W3C Community Group – Schema Bib Extend – to attempt to build a consensus, across a wide group of those concerned about things bibliographic, around proposing extensions to the Schema.org vocabulary, addressing its capability for describing these types of resources – books, journals, articles, theses, etc. – in all forms and formats.

My personal hope is that the resulting proposals, if and when adopted by Schema.org, will enable libraries, publishers, interest groups, universities, retailers, OCLC, and others not only to publish data about their resources in a way that the search engines can understand, but also to have a lightweight way to interconnect them to each other and to authoritative identifiers for place, name, subject, etc., that will help us begin to form a distributed web of bibliographic data.   A bit of a grand ambition for a fairly simple vocabulary you may think, but things worth having are worth reaching for.

So, focusing back on the short term for the moment, extending Schema.org to better describe bibliographic resources could have significant benefits anyway. What is in library catalogues, and other bibliographic sources, is mostly hidden to search engines – OPAC pages are almost impossible to scrape intuitively, the data formats used are only understood by the library and publisher worlds, and even if they ascertain the work a library is describing, there is little way to identify that it is, or is not, the same as one in another library.  It is no accident that Google Book Search came into being, utilising special data ingest processes and search techniques to help. Unfortunately there is a significant part of the population unaware of its existence and few who use it as part of their general search activities.  By marking up your resources in their terms, your data should appear in the main search indexes and you may even get a better results listing (courtesy of Google Rich Snippets).

OK, that’s the pitch for Schema.org (and getting together to extend it a little in the bibliographic direction) over.  Now on to the point of this post – the mindset we should adopt when approaching the generic, high-level, coarse-grained, broad but shallow, simplistic [choose your own phrase] Schema.org vocabulary to describe the rich and [already] richly described resources we find in libraries.  Although all my examples will be library/bibliographic ones, I believe that much of what I describe here will be of use and relevance to those in other industries with rich and established ways to describe their data and resources.

Initially let me get a few simple things out of the way.  Firstly, the Schema.org vocabulary is not designed to, and will never, replace any rich industry specific vocabularies or ontologies.  Its prime benefits are that it is lightweight (understandable by non-experts), cross-sectoral (data from many domains can be merged and mixed) and, oh yes, becoming broadly adopted.  Secondly, nobody is advocating that anyone starts to use it instead of their currently used standards – either mix it with your domain specific standards and/or use it as a ‘publicly understandable’ publishing format for web pages and the like.  Finally, although initially conceived as a web page markup (Microdata) format, the Schema.org vocabulary is equally applicable as a Linked Data vocabulary that can be used in the creation of RDF data.  The increasing use of, and reference to, RDFa in Schema.org is a reflection of this.  This is also exemplified by the use of Schema.org in the RDF N-Triples dump file OCLC has published of a sub-set of WorldCat data.

So moving on. You have your resources already being described, following established practice, in domain specific format(s) and you want to attempt to describe them using the Schema.org vocabulary.  In the library/publishing community we have more such standards than you can shake a stick at – MARC (of several mostly incompatible flavours), MODS, METS, ONIX, ISBD, RDA, to name just some. Each has its enthusiasts and proponents, and any of them could be the starting point for a process that might go something like this:

Working my way through all the elements of the [insert your favourite here] standard, let me find an equivalent in Schema that I can map my data to.

This can become a bit of an involved operation.  Take something as simple as the author of a book for instance.  Bibliographic standards have concepts such as main author, corporate author, creator, contributor, etc.  Schema>Book only has the simple property ‘author’.  How can I reflect the rich nuances and detail in my [library] format in this simplistic Schema.org vocabulary?  Simple answer – you can’t, so don’t try.  The question you have to ask yourself at this point is: by adding all this detail will I confuse potential consumers of this data, or will the Googles of this world just want to know the people and organisations connected with [linked to] this book in a creative (text) way?  Taking this approach of looking at the problem from the data/domain expert’s end of the telescope means that you have to go through a similar process for each and every element in your data format/vocabulary/standard.  An approach that will most probably lead to a long list of things missing from, and recommendations for, Schema.org that they (the group, not the vocabulary) would be unlikely to accept.

Let me propose an alternative approach by turning the telescope around and viewing the data that you care about and want to publish from the non-expert consumer’s point of view.  Using my book example again it might go like this:

Schema has a Book class (great!) – let me step through its properties and identify where in [insert your favourite standard here] I could get that from.

So for example, the ‘author’ property of Schema’s Book class comes from it being a sub-class of the generic CreativeWork class, where it is defined as being a Person or Organization – “The author of this content”.  You can now look into your own vocabulary or standard to find the elements which would contain author-ish data to map to Schema.

Hang on a moment though!  The Book>author property is defined as being an instance of (or link to) the Person or Organization classes.  This means that when we start to publish our data in this form, it is not a matter of just extracting the text string of the author’s name from our data; we need to provide a link to a description of that author (preferably also in Schema.org format).  WorldCat data does this by providing author links to VIAF – a pattern repeated with other properties such as ‘about’ (with links to Dewey and LCSH).
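
To illustrate the difference – with placeholder URIs, not real WorldCat or VIAF identifiers – compare an author given as a bare text string with an author given as a link to an authority description:

  @prefix schema: <http://schema.org/> .

  # A bare text string: a consumer only gets some characters to match on
  <http://example.org/book/1> schema:author "Rowling, J. K." .

  # A link to a described person: a consumer can follow it to an authority
  # such as VIAF and find out more (identifier invented for illustration)
  <http://example.org/book/1> schema:author <http://viaf.org/viaf/000000000> .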

Taking this approach limits you to only thinking about the things Schema [currently] concerns itself with – a much simpler process. 

If that was all there was to it, there would be no need for the Schema Bib Extend Group. As we found at OCLC with WorldCat, some gaps become apparent in the result, making it unsatisfactory in some areas as a description for even a non-expert.  Obvious candidates [for a Book] include a holdings statement, and how to describe the type of book (ebook, talking book, etc.) and the format it is in (paperback/hardback, large print, CD, cassette, MP3, etc.).  However, approaching it from this direction encourages you first to look across other areas of the Schema.org vocabulary, and other extension proposals, for solutions.  GoodRelations, soon to be merged into Schema, offers some promising potential answers for holdings (describing them as Offers to hire/lease). A proposal from the Radio/TV community includes a PublicationEvent.
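
Purely as a sketch – nothing here has been agreed by the group, and the exact properties may well change – a holding modelled along those Offer lines might look something like this:

  @prefix schema: <http://schema.org/> .

  # An invented book with one holding, loosely modelled as an Offer
  <http://example.org/book/1> a schema:Book ;
      schema:name "An Example Book" ;
      schema:offers [
          a schema:Offer ;
          schema:seller <http://example.org/library/anytown> ;
          schema:description "Available for loan from Anytown Public Library"
      ] .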

Finally, it is only the gaps, or anomalies, apparent at a Schema.org level that should turn into proposals for extension.  How they would map to elements of standards from our own domain would be down to us [as with what is already in Schema.org] to establish and share as consensus-driven good practice and lots, and lots, of examples.

We, especially in the library community, have invested much time and effort over many decades in describing [cataloguing] our resources so that people can discover and benefit from them.  Long gone are the days when the way to find things was to visit the library and flick through drawers full of catalogue cards.   Libraries were quick to take advantage of the web, putting up their WebOPACs so that you could ‘search from home’.  However, study after study has shown that people are now not visiting the library online either. The de facto [and often only] start point is now a search engine – increasingly as represented by a generic search prompt on your phone or tablet device.

This evolution in searching practice would be fine [from a library point of view] if library resources were identified and described to the search engines such that they could easily consume and understand them – so far they haven’t been.  Schema.org is a way to do that, and to be realistic, at the moment it is the only show in town that fits that particular bill.  We realised decades, if not centuries, ago that for people to find our things we need to describe them, but the best descriptions in the world are about as much use as a chocolate teapot if they are not in the places where those people are looking.

If you want to know more about bibliographic extension proposals to Schema.org, or help in creating them, join us at Schema Bib Extend.

And remember – when you are thinking about relating your favourite standard to Schema.org, check which end of the telescope you are using before you start.

Putting WorldCat Data Into A Triple Store

I cannot really get away with making a statement like “Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them” and then not follow it up.

I made it in my previous post Get Yourself a Linked Data Piece of WorldCat to Play With in which I was highlighting the release of a download file containing RDF descriptions of the 1.2 million most highly held resources in WorldCat.org – to make the cut, a resource had to be held by more than 250 libraries.

So here for those that are interested is a step by step description of what I did to follow my own encouragement to load up the triples and start playing.

Step 1
Choose a triplestore.  I followed my own advice and chose 4Store.  The main reasons for this choice were that it is open source yet comes from an environment where it was the base platform for a successful commercial business, so it should work.  Also, in my years rattling around the semantic web world, 4Store has always been one of those tools that seemed to be on everyone’s recommendation list.

Looking at some of the blurb – 4store is optimised to run on shared-nothing clusters of up to 32 nodes, linked with gigabit Ethernet, at times holding and running queries over databases of 15GT, supporting a Web application used by thousands of people – you may think it might be a bit of overkill for a tool to play with at home, but hey, if it works does that matter!

Step 2
Operating system.  Unsurprisingly for a server product, 4Store was developed to run on Unix-like systems.  I had three options: I could resurrect that old Linux-loaded PC in the corner, fire up an Amazon Web Services image with 4Store built in (such as the one built for the Billion Triple Challenge), or I could use the application download for my Mac.

As I only needed it for personal playing, I took the path of least resistance and went for the Mac application.   The Mac in question being a fairly modern MacBook Air.  The following instructions are therefore Mac oriented, but should not be too difficult to replicate on your OS of choice.

Step 3
Download and install.   I downloaded the latest version of the application (a 15Mb download) from the download server: http://4store.org/download/macosx/.  As with most Mac applications, it was just a matter of opening up the downloaded 4store-1.1.5.dmg file and dragging the 4Store icon into my Applications folder.  (Time-saving tip: whilst you are doing the next step you can be downloading the 1Gb WorldCat data file in the background, from here.)

Step 4
Setup and load.  Clicking on the 4Store application opens up a terminal window to give you command line access to control your triple store.  Following the simple but effective documentation, I needed to create a dataset, which I called WorldCatMillion:

  $ 4s-backend-setup WorldCatMillion

Next start the database:

  $ 4s-backend WorldCatMillion

Then I need to load the triples from the WorldCat Most Highly Held data set.  This step takes a while – over an hour on my system.

  $ 4s-import WorldCatMillion --format ntriples /Users/walllisr/Downloads/WorldCatMostHighlyHeld-2012-05-15.nt

This single command line, which may have wrapped on to more than one line in your browser, looks a bit complicated, but all it is doing is telling the import process to load the file I had downloaded and unzipped (automatically on the Mac – you may have to use gunzip on another system), which is formatted as ntriples, into my WorldCatMillion dataset.

Now to start the http server to access it:

  $ 4s-httpd -p 8000 WorldCatMillion

A quick test to see if it all worked:

  $ 4s-query WorldCatMillion 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'

This should output some XML encoded triples.

Step 5
Access via a web browser.  I chose Firefox, as it seems to handle unformatted XML better than most.  4Store comes with a very simple SPARQL interface: http://localhost:8000/test/.  This comes already populated with a sample query; just press execute and you should get back the data that you got with the command line 4s-query.  The server sends it back in an XML format, which your browser may save to disk for you to view – tweaking the browser settings to automatically open these files will make life easier.

Step 6
Some simple SPARQL queries.  Try these and see what you get:

Describe a resource:

  DESCRIBE <http://www.worldcat.org/oclc/46843162>

Select all the genres used:

  SELECT DISTINCT ?o WHERE {
    ?s <http://schema.org/genre> ?o .
  } LIMIT 100 OFFSET 0

Select 100 resources with a genre triple, outputting the resource URI and its genre. (By adjusting the OFFSET value, you can page through all the results):

  SELECT ?s ?o WHERE {
    ?s <http://schema.org/genre> ?o .
  } LIMIT 100 OFFSET 0

OK, that’s a start – now I need to play a bit to brush up on my SPARQL!

Get Yourself a Linked Data Piece of WorldCat to Play With

You may remember my frustration a couple of months ago, at being in the air when OCLC announced the addition of Schema.org marked up Linked Data to all resources in WorldCat.org.   Those of you who attended the OCLC Linked Data Round Table at IFLA 2012 in Helsinki yesterday will know that I got my own back on the folks who publish the press releases at OCLC, by announcing the next WorldCat step along the Linked Data road whilst they were still in bed.

The Round Table was an excellent, very interactive session with Neil Wilson from the British Library, Emmanuelle Bermes from the Centre Pompidou, and Martin Malmsten of the National Library of Sweden, which I will cover elsewhere.  For now, you will find my presentation Library Linked Data Progress on my SlideShare site.

After we experimentally added RDFa embedded linked data, using Schema.org markup and some proposed Library extensions, to WorldCat pages, one of the questions I was most often asked was: where can I get my hands on some of this raw data?

We are taking the application of linked data to WorldCat one step at a time so that we can learn from how people use and comment on it.  So at that time, if you wanted to see the raw data, the only way was to use a tool [such as the W3C RDFa 1.1 Distiller] to parse the data out of the pages, just as the search engines do.

So I am really pleased to announce that you can now download a significant chunk of that data as RDF triples.   Especially in experimental form, providing the whole lot as a download would have been a bit of a challenge, even just in disk space and bandwidth terms.  So which chunk to choose was a question.  We could have chosen a random selection, but decided instead to pick the most popular, in terms of holdings, resources in WorldCat – an interesting selection in its own right.

To make the cut, a resource had to be held by more than 250 libraries.  It turns out that almost 1.2 million fall into this category, so a sizeable chunk indeed.   To get your hands on this data, download the 1Gb gzipped file. It is in RDF N-Triples form, so you can take a look at the raw data in the file itself.  Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them.
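
To give you a feel for what is inside before you download, the file contains one triple per line; a couple of made-up lines in the same shape (invented OCLC and VIAF numbers, illustrative title) would look like this:

  <http://www.worldcat.org/oclc/999999999> <http://schema.org/name> "An Example Title" .
  <http://www.worldcat.org/oclc/999999999> <http://schema.org/author> <http://viaf.org/viaf/999999999> .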

Another area of questions around the publication of WorldCat linked data has been licensing.   Both the RDFa embedded data and the download are published as open data under the Open Data Commons Attribution License (ODC-BY), with reference to the community norms put forward by the members of the OCLC cooperative who built WorldCat.  The theme of many of the questions has been along the lines of “I understand what the license says, but what does this mean for attribution in practice?”

To help clarify how you might attribute ODC-BY licensed WorldCat, and other OCLC linked data, we have produced attribution guidelines to address some of the uncertainties in this area.  You can find these at http://www.oclc.org/data/attribution.html.  They address several scenarios, from documents containing WorldCat derived information to referencing WorldCat URIs in your linked data triples, suggesting possible ways to attribute the OCLC WorldCat source of the data.   As guidelines, they obviously cannot cover every possible situation which may require attribution, but hopefully they will cover most and be adaptable to other similar ones.

As I say in the press release, posted after my announcement, we are really interested to see what people will do with this data.  So let us know, and if you have any comments on any aspect of its markup, schema.org extensions, publishing, or on our attribution guidelines, drop us a line at data@oclc.org.

OCLC WorldCat Linked Data Release – Significant In Many Ways

Typical!  Since joining OCLC as Technology Evangelist, I have been preparing myself to be one of the first to blog about the release of linked data describing the hundreds of millions of bibliographic items in WorldCat.org. So where am I when the press release hits the net?  35,000 feet above the North Atlantic heading for LAX, that’s where – life just isn’t fair.

By the time I am checked in to my Anaheim hotel, ready for the ALA Conference, this will be old news.  Nevertheless it is significant news, significant in many ways.

OCLC have been at the leading edge of publishing bibliographic resources as linked data for several years.  At dewey.info they have been publishing the top levels of the Dewey classifications as linked data since 2009.  As announced yesterday, this has now been increased to encompass 32,000 terms, such as this one for the transits of Venus.  Also around for a few years is VIAF (the Virtual International Authority File), where you will find URIs published for authors, such as this well-known chap.  These two were more recently joined by FAST (Faceted Application of Subject Terminology), providing usefully applicable identifiers for Library of Congress Subject Headings and combinations thereof.

Despite this leading position in the sphere of linked bibliographic data, OCLC has attracted some criticism over the years for not biting the bullet and applying it to all the records in WorldCat.org as well.  As today’s announcement now demonstrates, they have taken their linked data enthusiasm to the heart of their rich, publicly available, bibliographic resources – publishing linked data descriptions for the hundreds of millions of items in WorldCat.

Let me dissect the announcement a bit….

First significant bit of news – WorldCat.org is now publishing linked data for hundreds of millions of bibliographic items – that’s a heck of a lot of linked data by anyone’s measure. By far the largest linked bibliographic resource on the web. Also it is linked data describing things that, for decades, librarians in tens of thousands of libraries all over the globe have been carefully cataloguing so that the rest of us can find out about them.  Just the sort of authoritative resources that will help stitch the emerging web of data together.

Second significant bit of news – the core vocabulary used to describe these bibliographic assets comes from schema.org.  Schema.org is the initiative backed by Google, Yahoo!, Microsoft, and Yandex, to provide a generic high-level vocabulary/ontology to help mark up structured data in web pages so that those organisations can recognise the things being described and improve the services they can offer around them.  A couple of examples being Rich Snippet results and inclusion in the Google Knowledge Graph.

As I reported a couple of weeks back from the Semantic Tech & Business Conference, some 7-10% of indexed web pages already contain Schema.org markup, as Microdata or RDFa.   It may at first seem odd for a library organisation to use a generic web vocabulary to mark up its data – but just think who the consumers of this data are, and what vocabularies they are most likely to recognise.  Just for starters, embedding Schema.org data in WorldCat.org pages immediately makes them understandable by the search engines, vastly increasing the findability of these items.

Third significant bit of news – the linked data is published both in human readable form and in machine readable RDFa on the standard WorldCat.org detail pages.  You don’t need to go to a special version or interface to get at it; it is part of the normal interface. As you can see from the screenshot of a WorldCat.org item above, there is now a Linked Data section near the bottom of the page. Click and open up that section to see the linked data in human readable form.  You will see the structured data that the search engines and other systems will get from parsing the RDFa encoded data, within the html that creates the page in your browser.  Not very pretty to human eyes I know, but just the kind of structured data that systems love.

Fourth significant bit of news – OCLC are proposing to cooperate with the library and wider web communities to extend Schema.org, making it even more capable for describing library resources.  With the help of the W3C, Schema.org is working with several industry sectors to extend the vocabulary to be more capable in their domains – news and e-commerce being a couple of already accepted examples.  OCLC is playing its part in doing this for the library sector.

Take a closer look at the markup on WorldCat.org and you will see attributes from a library vocabulary – attributes such as library:holdingsCount and library:oclcnum.  This library vocabulary is OCLC’s conversation starter, with which we want to kick off discussions with interested parties, from the library and other sectors, about proposing a basic extension to Schema.org for library data.  What better way of testing out such a vocabulary – mark up several million records with it, publish them and see what the world makes of them.
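
Re-expressed as Turtle rather than RDFa, the pattern is roughly this – a sketch with invented values, and with the library: namespace given as I recall it appearing in the page markup, so do check the live pages for the authoritative URI:

  @prefix schema: <http://schema.org/> .
  # Namespace below is illustrative - take the actual one from the WorldCat RDFa
  @prefix library: <http://purl.org/library/> .

  <http://www.worldcat.org/oclc/999999999> a schema:Book ;
      schema:name "An Example Title" ;
      library:oclcnum "999999999" ;
      library:holdingsCount "250" .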

Fifth significant bit of news – the WorldCat.org linked data is published under an Open Data Commons (ODC-BY) license, so it will be openly usable by many for many purposes.

Sixth significant bit of news – this release is an experimental release.  This is the start, not the end, of a process.  We know we have not got this right yet.  There are more steps to take around how we publish this data in ways in addition to RDFa markup embedded in page html – not everyone can, or will want to, parse pages to get the data.  There are obvious areas for discussion around the use of Schema.org and the proposed library extension to it.  There are areas for discussion about the application of the ODC-BY license and the attribution requirements it asks for.  Over the coming months OCLC wants to constructively engage with all who are interested in this process.  It is only with the help of the library and wider web communities that we can get it right.  In that way we can ensure that WorldCat linked data will be beneficial for the OCLC membership, libraries in general, and a great resource on the emerging web of data.

For more information about this release, check out the background to linked data at OCLC, join the conversation on the OCLC Developer Network, or email data@oclc.org.

As you can probably tell I am fairly excited about this announcement.  This, and future stuff like it, are behind some of my reasons for joining OCLC.  I can’t wait to see how this evolves and develops over the coming months.  I am also looking forward to engaging in the discussions it triggers.

Schema.org Consensus at SemTechBiz

Day three of the Semantic Tech & Business Conference in San Francisco brought us a panel to discuss Schema.org, populated by an impressive array of names and organisations:

Ivan Herman, World Wide Web Consortium
Alexander Shubin, Yandex
Dan Brickley, Schema.org at Google
Evan Sandhaus, New York Times Company
Jeffrey W. Preston, Disney Interactive Media Group
Peter Mika, Yahoo!
R.V. Guha, Google
Steve Macbeth, Microsoft

This well attended panel started with a bit of a crisis – the stage in the room was not large enough to seat all of the participants, causing a quick call out for bar seats and much microphone passing.  It was somewhat reflective of the crisis of concern about the announcement of Schema.org immediately prior to last year’s event, which precipitated the hurried arrangement of a birds-of-a-feather session to settle fears and disquiet in the semantic community.

Asking a fellow audience member what they thought of this session, they replied that there wasn’t much new said.  I think that is a symptom of good things happening around the initiative.  He was right in saying that there was nothing substantive said, but there were some interesting pieces that came out of what the participants had to say.  Guha indicated that Google were already seeing 7-10% of pages crawled containing Schema.org mark-up – surprising growth in such a short time.  Steve Macbeth confirmed that Microsoft were also seeing around 7%.

Another unexpected but interesting insight from Microsoft was that they are looking to use Schema.org mark-up as a way to pass data between applications in Windows 8.  All the search engine folks played it close to their chests when asked what they were actually using the structured data captured from Schema.org mark-up for – lots of talk about projects around better search algorithms and indexing.  Guha indicated that the Schema.org data was not siloed inside Google.  As with any other data, it was used across the organisation, including within the Google Knowledge Graph functionality.

Jeffrey Preston responded to a question about the tangible benefits of applying Schema.org mark-up by describing how kids searching for games on the Disney site were being directed more accurately to the game itself, as against pages that referenced it.  Evan Sandhaus described how it enabled a far easier integration with a vendor, who could access their article data without having to work with a specific API.  Guha spoke about a Veterans’ job search site created with the Department of Defence, which could constrain its search to sites that included Schema.org mark-up identifying jobs as appropriate for Veterans.

In questions from the floor, the panel explained the best way of introducing schema extensions, using IPTC rNews as an example – get industry consensus to provide a well-formed proposal and then be prepared to be flexible.   All done via the W3C hosted Public Vocabs list.

All good progress in only a year!

Richard Wallis is Technology Evangelist at OCLC and Founder of Data Liberate

Surfacing at Semtech San Francisco

So where have I been?   I announce that I am now working as a Technology Evangelist for the library behemoth OCLC, and then promptly disappear.  The only excuse I have for deserting my followers is that I have been kind of busy getting my feet under the OCLC table, getting to know my new colleagues, the initiatives and projects they are engaged with, the longer term ambitions of the organisation, and of course the more mundane issues of getting my head around the IT, video conferencing, and expense claim procedures.

It was therefore great to find myself in San Francisco once again for the Semantic Tech & Business Conference (#SemTechBiz) for what promises to be a great program this year.  Apart from meeting old and new friends amongst those interested in the potential and benefits of the Semantic Web and Linked Data, I am hoping for a further step forward in the general understanding of how this potential can be realised to address real world challenges and opportunities.

As Paul Miller reported, the opening session was before an audience of 75% first time visitors.  Just like the cityscape vista presented to those attending the speakers’ reception yesterday on the 45th floor of the conference hotel, I hope these new visitors get a stunningly clear view of the landscape around them.

Of course I am doing my bit to help on this front by trying to cut through some of the more technical geek-speak. Tuesday 8:00am will find me in Imperial Room B presenting The Simple Power of the Link – a 30 minute introduction to Linked Data, its benefits and potential, without the need to get your head around the more esoteric concepts of Linked Data such as triple stores, inference, ontology management, etc.  I would not only recommend this session as an introduction for those new to the topic, but also for those well versed in the technology, as a reminder that we sometimes miss the simple benefits when trying to promote our baby.

For those interested in the importance of these techniques and technologies to the world of Libraries, Archives and Museums, I would also recommend a panel that I am moderating on Wednesday at 3:30pm in Imperial B – Linked Data for Libraries Archives and Museums.  I will be joined by LOD-LAM community driver Jon Voss, Stanford Linked Data Workshop Report co-author Jerry Persons, and Sung Hyuk Kim from the National Library of Korea.  As moderator I will not only let the four of us make short presentations about what is happening in our worlds, but will also insist that at least half the time is kept for questions from the floor, so bring them along!

I am not only surfacing at Semtech, I am beginning to see, at last, the technologies being discussed surfacing as mainstream.  We in the Semantic Web/Linked Data world are very good at frightening off those new to it.  However, driven by pragmatism in search of a business model, and initiatives such as Schema.org, it is starting to become mainstream by default.  One very small example being Yahoo!’s Peter Mika telling us, in the Semantic Search workshop, that RDFa is the predominant format for embedding structured data within web pages.

Looking forward to a great week, and soon more time to get back to blogging!

Richard Wallis Joins OCLC

You may have noticed this press release from OCLC today: Richard Wallis joins OCLC staff as Technology Evangelist.

I have already had some feedback on this move from several people who, almost without exception, have told me that they think it is a good move for both OCLC and myself. Which is good, as I agree with them 😉

I have also had several questions about it, mostly beginning with the words why or what.  I thought I would share my answers here to give some background.

Why a library organisation? – I thought you were trying to move away from libraries.
I have been associated with the library sector since joining BLCMP in 1990 to help them build a new library management system, which they christened Talis.  As Talis, the company named after the library system, evolved and started to look at new Web-influenced technologies to open up possibilities for managing and publishing library data, they and I naturally gravitated towards Semantic Web technologies and their pragmatic use in a way that became known as Linked Data.

Even though the Talis Group transferred their library division to Capita last year, that natural connection between library data and linked data principles meant that the association remained for me, despite having no direct connection with the development of the systems to run libraries.  Obvious examples of this were the Linked Data and Libraries events I ran in London with Talis and the work with the British Library to model and publish the British National Bibliography.  So even if I wanted to get away from libraries I believe it would be a fruitless quest – I think I am stuck with them!

Why OCLC? – Didn’t you spend a lot of time criticising them?
I can own up to several blog posts a few years back where I either criticised them for not being as open as I thought they could be, or questioned their business model at the time.  However I have always respected their mission to promote libraries and their evolution.   In my time chairing and hosting the Library 2.0 Gang, and in individual podcasts, I hope that I demonstrated a fairness that I always aspire towards, whilst not shying away from the difficult questions.   I have watched OCLC, and the library community they are part of, evolve over many years towards a position and vision that encompasses many of the information sharing principles and ambitions I hold.   In the very short amount of time I have already spent talking with my new colleagues it is clear that they are motivated towards making best use of data for the benefit of their members, libraries in general, and the people they serve – which is all of us.

Oh and yes, they have a great deal of data which has huge potential on the Linked Data Web and it will be great to be a part of realising at least some of that potential.

What about Data Liberate? – Are you going to continue with that?
I set up Data Liberate with a dual purpose.  Firstly, to promote myself as a consultant to help people and organisations realise the value in their data.  Secondly, to provide a forum and focus for sharing, commenting upon, and discussing issues, ideas, events, and initiatives relevant to Open, Linked, Enterprise, and Big Data.  Obviously the first of these is now not relevant, but I do intend to maintain Data Liberate to fulfil that second purpose.  I may not be posting quite as often, but I do intend to highlight and comment upon things of relevance in the broad landscape of data issues, regardless of whether they are library focussed or not.

What are you going to be doing at OCLC?
My title is Technology Evangelist, and there is a great deal of evangelism needed – promoting, explaining, and demystifying the benefits of Linked Data to libraries and librarians.  This stuff is very new to a large proportion of the library sector, and unsurprisingly there is some scepticism about it.  It would be easy to view announcements from organisations such as the British Library, Library of Congress, Europeana, Stanford University, OCLC, and many many more, as a general acceptance of a Linked Data library vision.  Far from it.  I am certain that a large proportion of librarians are not aware of the potential benefits of Linked Data for their world, or even why they should be aware.   So you will find me on stage at an increasing number of OCLC and wider library sector events, doing my bit to spread the word.

Like all technologies and techniques, Linked Data does not sit in isolation, and there are obvious connections with the OCLC WorldShare Platform, which is providing shared web based services for managing libraries and their data.  I will also be spending some time evangelising the benefits of this approach.

Aside from evangelising I will be working with people.  Working with the teams within OCLC as they coordinate and consolidate their approach to applying Linked Data principles across the organisation.  Working with them as they evolve the way OCLC will publish data to libraries and the wider world.  Working with libraries to gain their feedback.  Working with the Linked Data and Semantic Web community to gain feedback on how to publish that data in a way that not only serves the library community, but also everyone across the emerging Web of Data.  So you will continue to find me on stage at events such as the Semantic Tech and Business Conference, doing my bit to spread the word, as well as engaging directly with the community.

Why libraries? – Aren’t they a bit of a Linked Data niche?
I believe that there are two basic sorts of data being published on the [Linked Data] web – backbone data, and non-backbone data whose value is greatly increased by linking to the backbone.

By backbone data I mean things like: DBpedia, with its identifier for most every ‘thing’; government data, with authoritative identifiers for laws, departments, schools, etc.; and mapping organisations, such as Ordnance Survey, with authoritative identifiers for post codes, etc.  By linking your dataset’s concepts to these backbone sources, you immediately increase its usefulness and ability to link and merge with other data linked in the same way.  I believe that the descriptions of our heritage and achievements, both scientific and artistic, held by organisations such as our national, academic, and public libraries are a massive resource that has the opportunity to form a very significant vertebra in that backbone.

Hopefully some of the above will help in understanding the background and motivations behind this new and exciting phase of my career.  These opinions and ambitions for the evolution of data on the web, and in the enterprise, are all obviously mine, so do not read into them any future policy decisions or directions for my new employer.  Suffice to say I will not be leaving them at the door. Neither will I cast off my approach of pragmatically solving problems in the real world by evolving towards a solution, recognising that the definition of the ideal changes over time and with circumstance.

Who Will Be Mostly Right – Wikidata, Schema.org?

Two, on the surface, totally unconnected posts – yet the same message.  Well that’s how they seem to me anyway.

Post 1 – The Problem With Wikidata from Mark Graham writing in the Atlantic.

When I reported the announcement of Wikidata by Denny Vrandecic at the Semantic Tech & Business Conference in Berlin in February, I was impressed with the ambition to bring together all the facts from all the different language versions of Wikipedia in a central Wikidata instance with a single page per entity. These single pages will draw together all references to the entities and engage with a sustainable community to manage this machine-readable resource.   This data would then be used to populate the info-boxes of all versions of Wikipedia, in addition to being an open resource of structured data for all.

In his post Mark raises concerns that this approach could result in the loss of the diversity of opinion currently found in the diverse Wikipedias:

It is important that different communities are able to create and reproduce different truths and worldviews. And while certain truths are universal (Tokyo is described as a capital city in every language version that includes an article about Japan), others are more messy and unclear (e.g. should the population of Israel include occupied and contested territories?).

He also highlights issues about the unevenness or bias of contributors to Wikipedia:

We know that Wikipedia is a highly uneven platform. We know that not only is there not a lot of content created from the developing world, but there also isn’t a lot of content created about the developing world. And we also, even within the developed world, a majority of edits are still made by a small core of (largely young, white, male, and well-educated) people. For instance, there are more edits that originate in Hong Kong than all of Africa combined; and there are many times more edits to the English-language article about child birth by men than women.

A simplistic view of what Wikidata is attempting to do could be a majority-rules filter on what is correct data, where low-volume opinions are drowned out by that majority.  If Wikidata is successful in its aims, it will not only become the single source for info-box data in all versions of Wikipedia, but it will take over the mantle currently held by DBpedia as the de facto link-to place for identifiers and associated data on the Web of Data and the wider Web.

I share some of his concerns, but also draw comfort from some of the things Denny said in Berlin – “WikiData will not define the truth, it will collect the references to the data….  WikiData created articles on a topic will point to the relevant Wikipedia articles in all languages.”  They obviously intend to capture facts described in different languages; the question is whether they will also preserve the local differences in assertion.  In a world where we still cannot totally agree on the height of our tallest mountain, we must be able to take account of and report differences of opinion.

Post 2 – Danbri has moved on – should we follow? by a former colleague Phil Archer.

The Danbri in question is Dan Brickley, one of the original architects of the Semantic Web, now working for Google on Schema.org.  Dan presented at an excellent Semantic Web Meetup, which I attended at the BBC Academy a couple of weeks back.  This was a great event.  I recommend investing the time to watch the videos of Dan and all the other speakers.

Phil picked out a section of Dan’s presentation for comment:

In the RDF community, in the Semantic Web community, we’re kind of polite, possibly too polite, and we always try to re-use each other’s stuff. So each schema maybe has 20 or 30 terms, and… schema.org has been criticised as maybe a bit rude, because it does a lot more it’s got 300 classes, 300 properties but that makes things radically simpler for people deploying it. And that’s frankly what we care about right now, getting the stuff out there. But we also care about having attachment points to other things…

Then reflecting on current practice in Linked Data he went on to postulate:

… best practice for the RDF community…  …i.e. look at existing vocabularies, particularly ones that are already widely used and stable, and re-use as much as you can. Dublin Core, FOAF – you know the ones to use.

Except schema.org doesn’t.

schema.org has its own term for name, family name and given name which I chose not to use at least partly out of long term loyalty to Dan. But should that affect me? Or you? Is it time to put emotional attachments aside and move on from some of the old vocabularies and at least consider putting more effort into creating a single big vocabulary that covers most things with specialised vocabularies to handle the long tail?

As the question in the title of his post implies, should we move on and start adopting, where applicable, terms from the large and growing Schema.org vocabulary when modelling and publishing our data?  Or should we stick with the current collection of terms from suitable smaller vocabularies?

One of the common issues when people first get to grips with creating Linked Data is which terms from which vocabularies to use for their data, and where to find out.  I have watched the frown skip across several people’s faces when you first tell them that foaf:name is a good attribute to use for a person’s name in a data set that has nothing to do with friends or friends of friends. It is very similar to the one they give you when you suggest that it may also be good for something that isn’t even a person.

As Schema.org grows and, enticed by the obvious SEO benefits in the form of Rich Snippets, becomes rapidly adopted by a community far greater than the Semantic Web and Linked Data communities, why would you not default to using terms in their vocabulary?   Another former colleague, David Wood, tweeted ‘No’ in answer to Phil’s question – I think in retrospect this may seem a King Canute style proclamation.  If my predictions are correct, it won’t be too long before we are up to our ears in structured data on the web, most of it marked up using terms to be found at schema.org.

You may think that I am advocating the death, and replacement by Schema.org, of all the vocabularies, well known and obscure, in use today – far from it.   When modelling your [Linked] data, start by using terms that have been used before, then build on terms more specific to your domain, and finally you may have to create your own vocabulary/ontology.  What I am saying is that as Schema.org becomes established, its growing collection of 300+ terms will become the obvious start point in that process.

OK, a couple of interesting posts, but where is the similar message and connection?  I see it as democracy of opinion.  Not the democracy of the modern western political system, where we have a stand-up shouting match every few years followed by a fairly stable period where the rules are enforced by one view.  More the traditional, possibly romanticised, view of democracy where the majority leads the way but without disregarding the opinions of the few.  Was it the French Enlightenment philosopher Voltaire who said “I may hate your views, but I am willing to lay down my life for your right to express them”?  A bit extreme when discussing data and ontologies, but the spirit is right.

Once the majority of general data on the web becomes marked up with Schema.org, it would be short-sighted to ignore the gravitational force it will exert in the web of data if you want your data to be linked to and found.  However, it will be incumbent on those behind Schema.org to maintain their ambition to deliver easy linking to more specialised vocabularies via their extension points.  This way the ‘how’ of data publishing should become simpler, more widespread, and extensible.   On the ‘what’ side of the [structured] data publishing equation, the Wikidata team has an equal responsibility not only to publish the majority definition of facts, but also to clearly reflect the views of minorities – not a simple balancing act, as often those with the more extreme views have the loudest voices.


Semantic Search, Discovery, and Serendipity

So I need to hang up some tools in my shed.  I need some bent hook things – I think.  Off to the hardware store, where I search for the fixings section.  Following the signs hanging from the roof, my search is soon directed to a rack covered in lots of individual packets and I spot the thing I am looking for, but what’s this – they come in lots of different sizes.  After a bit of localised searching I grab the size I need, but wait – in the next rack there are some specialised tool hanging devices.  Square hooks, long hooks, double-prong hooks, spring clips, an amazing choice!  Pleased with what I discovered and selected, I’m soon heading down the aisle when my attention is drawn to a display of shelving with hidden brackets – just the thing for under the TV in the lounge.  I grab one of those and head for the checkout before my credit card regrets me discovering anything else.

We all know the library ‘browse’ experience.  Head for a particular book, and come away with a different one on the same topic that just happened to be on a nearby shelf, or even a totally different one that you ‘found’ on the recently returned books shelf.

An ambition for the web is to reflect and assist what we humans do in the real world.  Search has only brought us part of the way. By identifying key words in web page text, and links between those pages, it makes a reasonable stab at identifying things that might be related to the keywords we enter.

As I commented recently, Semantic Search messages coming from Google indicate that they are taking significant steps towards that ambition.   By harvesting Schema.org described metadata, embedded in html by webmasters enticed by Rich Snippets, and building on the 12 million entity descriptions in Freebase, they are amassing the fuel for a better search engine.  A search engine [that] will better match search queries with a database containing hundreds of millions of “entities” – people, places and things.

How much closer will this better, semantic, search get to being able to replicate online the scenario I shared at the start of this post?  It should do a better job of relating our keywords to the things that would be of interest, not just the pages about them.  Having a better understanding of entities should help with the Paris Hilton problem, or at least help us navigate around such issues.  That better understanding of entities, and related entities, should enable the return of related relevant results that did not contain our keywords.

But surely there is more to it than that.  Yes there is, but it is not search – it is discovery.  As in my scenario above, humans do not only search for things.  We search to get ourselves to a start point for discovery.  I searched for an item in the fixings section of the hardware store, or a book in the library; I then inspected related items on the rack and the shelf to discover if there was anything more appropriate for my needs nearby.  By understanding things and the [semantic] relationships between them, systems could help us with that discovery phase. It is the search engine’s job to expose those relationships, but the prime benefit will emerge when the source web sites start doing it too.

Take what is still one of my favourite sites – BBC wildlife.  Take a look at the Lion page, found by searching for lions in Google. Scroll down a bit and you will see listed the lion’s habitats and behaviours.  These are all things or concepts related to the lion.  Follow the link to the flooded grassland habitat, where you will find lists of the flora and fauna that you will find there, including the aardvark, which is nocturnal.  Such follow-your-nose navigation around the site supports the discovery method of finding things that I describe.  In such an environment serendipity is only a few clicks away.

There are two sides to the finding stuff coin – Search and Discovery.  Humans naturally do both; systems and the web are only just starting to move beyond search only.  This move is being enabled by the constantly growing body of data describing things and their relationships – Linked Data – a growth stimulated by initiatives such as Schema.org, and by Google providing quick-return incentives, such as Rich Snippets and SEO goodness, for folks to publish structured data for reasons other than a futuristic Semantic Web.