I spend a significant amount of time working with Google folks, especially Dan Brickley, and others on the supporting software, vocabulary contents, and application of Schema.org. So it is with great pleasure, and a certain amount of relief, that I share the announcement of the release of version 3.1.
That announcement lists several improvements, enhancements and additions to the vocabulary that appeared in versions 3.0 & 3.1. These include:
Health Terms – A significant reorganisation of the extensive collection of medical/health terms, that were introduced back in 2012, into the ‘health-lifesci’ extension, which now contains 99 Types, 179 Properties and 149 Enumeration values.
Hotels and Accommodation – Substantial new vocabulary for describing hotels and accommodation has been added, and documented.
Pending Extension – Version 3.0 introduced a special extension called “pending”, which provides a place for newly proposed schema.org terms to be documented, tested and revised. The expectation is that this area will be updated with proposals relatively frequently, in between formal Schema.org releases.
How We Work – A HowWeWork document has been added to the site. This comprehensive document details the many aspects of the operation of the community, the site, the vocabulary etc. – a useful way in for everyone from casual users through to those who want to immerse themselves in the vocabulary, its use and its development.
For fuller details on what is in 3.1 and other releases, check out the Releases document.
Often working in the depths of the vocabulary, and the site that supports it, I get up close to improvements that are not obvious on the surface, but which some [of those that immerse themselves] may find interesting. I would like to share a few:
Snappy Performance – The Schema.org site, a Python app hosted on the Google App Engine, is shall we say a very popular site. Over the last 3-4 releases I have been working on taking full advantage of multi-threaded, multi-instance, memcache, and shared datastore capabilities. Add in page caching improvements plus an implementation of ETags, and we can see improved site performance which can be best described as snappiness. The only downsides are that, to see a new version update, you sometimes have to hard reload your browser page, and that I have learnt far more about these technologies than I ever thought I would need!
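For those curious about the ETag part, here is a minimal sketch of the general technique in Python. It is not the code that runs the Schema.org site; the function names `make_etag` and `respond` are made up for illustration, and it ignores weak validators and the `*` wildcard.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Build a strong ETag: a quoted hash of the response body."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes, if_none_match: Optional[str] = None) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET of `body`.

    `if_none_match` is the value of the request's If-None-Match header,
    if the browser sent one (hypothetical plumbing, for illustration only).
    """
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None:
        # The header may carry several comma-separated tags.
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        if etag in client_tags:
            # The browser's cached copy is still current: send no body.
            return 304, headers, b""
    return 200, headers, body
```

The browser stores the ETag with its cached page and sends it back on the next request; an unchanged page then costs a tiny 304 response rather than a full page transfer.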
Data Downloads – We are often asked for a copy of the latest version of the vocabulary so that people can examine it, develop from it, build tools on it, or whatever takes their fancy. This has been partially possible in the past, but now we have introduced (on a developers page we hope to expand with other useful stuff in the future – suggestions welcome) a download area for vocabulary definition files. From here you can download, in your favourite format (Triples, Quads, JSON-LD, Turtle), files containing the core vocabulary, individual extensions, or the whole vocabulary. (Tip: The page displays the link to the file that will always return the latest version.)
Data Model Documentation – Version 3.1 introduced updated contents to the Data Model documentation page, especially in the area of conformance. I know from working with colleagues and clients that it is sometimes difficult to get your head around Schema.org’s use of Multi-Typed Entities (MTEs) and the ability to use a Text, or a URL, or a Role for any property value. It is good to now have somewhere to point people when they question such things.
Markdown – This is a great addition for those enhancing, developing and proposing updates to the vocabulary. The rdfs:comment sections of term definitions are now passed through a Markdown processor. This means that any formatting or links to be embedded in a term description do not have to be escaped with horrible coding such as &amp; and &gt; etc. So for example a link can be input as [The Link](http://example.com/mypage) and italic text would be input as *italic*. The processor also supports WikiLinks style links, which enable direct linking to a page within the site, so [[CreativeWork]] will result in the user being taken directly to the CreativeWork page via a correctly formatted link. This makes the correct formatting of type descriptions a much nicer experience, as it does my debugging of the definition files.
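To give a feel for what such a processor does with those three constructs, here is a toy sketch in Python. It is an illustration only, not the Markdown processor used by the site: the function name `render_comment`, the `base` parameter and the regular expressions are my own assumptions, and it does no HTML escaping.

```python
import re

# [[Term]] : WikiLinks style link to a term page within the site
WIKILINK = re.compile(r"\[\[([A-Za-z0-9_]+)\]\]")
# [label](url) : ordinary Markdown link
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")
# *text* : italic
ITALIC = re.compile(r"\*([^*]+)\*")


def render_comment(text: str, base: str = "/") -> str:
    """Convert a small subset of Markdown in a term description to HTML."""
    # WikiLinks are handled first so [[Term]] is never mistaken for a Markdown link.
    text = WIKILINK.sub(
        lambda m: f'<a href="{base}{m.group(1)}">{m.group(1)}</a>', text
    )
    text = MD_LINK.sub(r'<a href="\2">\1</a>', text)
    text = ITALIC.sub(r"<em>\1</em>", text)
    return text


print(render_comment("A [[CreativeWork]] with *italic* text and [The Link](http://example.com/mypage)."))
```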
I could go on, but won’t – If you are new to Schema.org, or very familiar, I suggest you take a look.
Marketing Hype! I hear you thinking – well at least I didn’t use the tired old ‘Next Generation’ label.
Let me explain what this fundamental component is, within what I am seeing potentially as a New Web, and what I mean by New Web.
This fundamental component I am talking about you might be surprised to learn is a vocabulary – Schema.org. But let me first set the context by explaining my thoughts on this New Web.
Having once been considered an expert on Web 2.0 (I hasten to add by others, not myself) I know how dangerous it can be to attach labels to things. It tends to spawn screenfuls of passionate opinions on the relevance of the name, date of the revolution, and over-detailed analysis of isolated parts of what is a general movement. I know I am on dangerous ground here!
To my mind something is new when it feels different. The Internet felt different when the Web (aka HTTP + HTML + browsers) arrived. The Web felt different (Web 2.0?) when it became more immersive (write as well as read) and visually we stopped trying to emulate in a graphical style what we saw on character terminals. Oh, and yes we started to round our corners.
There have been many times over the last few years when it felt new – when it suddenly arrived in our pockets (the mobile web) – when the inner thoughts, and eating habits, of more friends than you ever remember meeting became of apparent headline importance (the social web) – when [the contents of] the web broke out of the boundaries of the browser and appeared embedded in every app, TV show, and voice activated device.
The feeling different phase I think we are going through at the moment, like previous times, is building on what went before. It is exemplified by information [data] breaking out of the boundaries of our web sites and appearing where it is useful for the user.
We are seeing the tip of this iceberg in the search engine Knowledge Panels, answer boxes, and rich snippets. The effect of this is that often your potential user can get what they need without having to find and visit your site – answering questions such as what is the customer service phone number for an organisation; is the local branch open at the moment; give me driving directions to it; what is available and on offer. Increasingly these interactions can occur without the user even being aware they are using the web – “Siri! Where is my nearest library?” A great way to build relationships with your customers. However, it is a new and interesting challenge for those trying to measure the impact of your web site.
So, what is fundamental to this New Web?
There are several things – HTTP, the light-weight protocol designed to transfer text, links and latterly data, across an internet previously used to specific protocols for specific purposes – HTML, that open, standard, easily copied light-weight extensible generic format for describing web pages that all browsers can understand – Microdata, RDFa, JSON, JSON-LD – open standards for easily embedding data into HTML – RDF, an open data format for describing things of any sort, in the form of triples, using shared vocabularies. Building upon those is Schema.org – an open, [de facto] standard, generic vocabulary for describing things in most areas of interest.
Why is one vocabulary fundamental when there are so many others to choose from? Check out the 500+ referenced on the Linked Open Vocabularies (LOV) site. Schema.org however differs from most of the others in a few key areas:
Size and scope – its current 642 Types and 992 Properties make it significantly larger, covering far more domains of interest, than most others. This means that if you are looking to describe something, you are highly likely to find enough to at least start. Despite its size, it is still far from capable of describing everything on, or off, the planet.
Evolution – it is under continuous evolutionary development and extension, driven and guided by an open community under the wing of the W3C and accessible in a GitHub repository.
Flexibility – from the beginning Schema.org was designed to be used in a choice of your favourite serialisation – Microdata, RDFa, JSON-LD, with the flexibility of allowing values to default to text if you have not got a URI available.
Consumers – The major search engines Google, Bing, Yahoo!, and Yandex, not only back the open initiative behind Schema.org but actively search out Schema.org markup to add to their Knowledge Graphs when crawling your sites.
Guidance – If you search out guidance on supplying structured data to those major search engines, you are soon supplied with recommendations and examples for using Schema.org, such as this from Google. They even supply testing tools for you to validate your markup.
With this support and adoption, the Schema.org initiative has become self-fulfilling. If your objective is to share or market structured data about your site, organisation, resources, and/or products with the wider world, it would be difficult to come up with a good reason not to use Schema.org.
Is it a fully ontologically correct semantic web vocabulary? Although you can see many semantic web and linked data principles within it, no it is not. That is not its objective. It is a pragmatic compromise between such things, and the general needs of webmasters with ambitions to have their resources become an authoritative part of the global knowledge graphs, that are emerging as key to the future of the development of search engines and the web they inhabit.
Note that I question if Schema.org is a fundamental component, of what I am feeling is a New Web. It is not the fundamental component, but one of many that over time will become just the way we do things.
Schema.org is basically a simple vocabulary for describing stuff, on the web. Embed it in your HTML and the search engines will pick it up as they crawl, and add it to their structured data knowledge graphs. They even give you three formats to choose from (Microdata, RDFa, and JSON-LD) when doing the embedding. I’m assuming, for this post, that the benefits of being part of the Knowledge Graphs that underpin so called Semantic Search, and hopefully triggering some Rich Snippet enhanced results display as a side benefit, are self evident.
The vocabulary itself is comparatively easy to apply once you get your head around it — find the appropriate Type (Person, CreativeWork, Place, Organization, etc.) for the thing you are describing, check out the properties in the documentation and code up the ones you have values for. Ideally provide a URI (URL in Schema.org) for a property that references another thing, but if you don’t have one a simple string will do.
There are a few strangenesses that hit you when you first delve into using the vocabulary. For example, there is no problem in describing something that is of multiple types: a LocalBusiness is both an Organization and a Place. This post is about another unusual, but very useful, aspect of the vocabulary: the Role type.
At first look at the documentation, Role looks like a very simple type with a handful of properties. On closer inspection, however, it doesn’t seem to fit in with the rest of the vocabulary. That is because it is capable of fitting almost anywhere. Anywhere there is a relationship between one type and another, that is. It is a special case type that allows a relationship, say between a Person and an Organization, to be given extra attributes. Some might term this as a form of annotation.
So what need is this satisfying you may ask. It must be a significant need to cause the creation of a special case in the vocabulary. Let me walk through a case, that is used in a Schema.org Blog post, to explain a need scenario and how Role satisfies that need.
Starting With American Football
Say you are describing members of an American Football Team. Firstly you would describe the team using the SportsOrganization type, giving it a name, sport, etc. Using RDFa:
So we now have Chucker Roberts described as an athlete on the Touchline Gods team. The obvious question then is how do we describe the position he plays in the team. We could have extended the SportsOrganization type with a property for every position, but scaling that across every position for every team sport type would have soon ended up with far more properties than would have been sensible, and beyond the maintenance scope of a generic vocabulary such as Schema.org.
This is where Role comes in handy. Regardless of the range defined for any property in Schema.org, it is acceptable to provide a Role as a value. The convention then is to use a property with the same name as the one the Role is a value for, to remake the connection to the referenced thing (in this case the Person). In simple terms we have just inserted a Role type between the original two descriptions.
This indirection has not added much you might initially think, but Role has some properties of its own (startDate, endDate, roleName) that can help us qualify the relationship between the SportsOrganization and the athlete (Person). For the field of organizations there is a subtype of Role (OrganizationRole) which allows the relationship to be qualified slightly more.
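The pattern can be sketched as a small Python helper that builds JSON-LD. Treat it as an illustration under assumptions: the helper name `with_role` is made up, the `athlete` property is my reading of the description above, and the position and dates are invented example values.

```python
import json


def with_role(subject, prop, target, role_type="Role", **qualifiers):
    """Return a copy of `subject` in which `prop` points at `target` via a Role.

    The Role repeats the same property name to reconnect to the target,
    and carries the qualifiers (roleName, startDate, endDate, ...).
    """
    role = {"@type": role_type, **qualifiers, prop: target}
    described = dict(subject)  # shallow copy, the original is left untouched
    described[prop] = role
    return described


team = {
    "@context": "http://schema.org",
    "@type": "SportsOrganization",
    "name": "Touchline Gods",
}
player = {"@type": "Person", "name": "Chucker Roberts"}

# roleName and the dates below are illustrative values only.
doc = with_role(
    team, "athlete", player,
    role_type="OrganizationRole",
    roleName="Quarterback", startDate="1974", endDate="1980",
)
print(json.dumps(doc, indent=2))
```

Without the Role, the team's `athlete` value would simply be the Person; with it, the same property appears twice, once on the team and once on the Role.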
So far I have just been stepping through the example provided in the Schema.org blog post on this. Let’s take a look at an example from another domain – the one I spend my life immersed in – libraries.
There are many relationships between creative works that libraries curate and describe (books, articles, theses, manuscripts, etc.) and people & organisations that are not covered adequately by the properties available (author, illustrator, contributor, publisher, character, etc.) in CreativeWork and its subtypes. By using Role, in the same way as in the sports example above, we have the flexibility to describe what is needed.
Take a book (How to be Orange: an alternative Dutch assimilation course) authored by Gregory Scott Shapiro, that has a preface written by Floor de Goede. As there is no writerOfPreface property we can use, the best we could do is to put Floor de Goede in as a contributor. However, by using Role we can qualify the contribution that he made to be that of the writer of preface.
<span property="roleName" src="http://id.loc.gov/vocabulary/relators/wpr">Writer of preface</span>
<span property="contributor" src="http://viaf.org/viaf/283191359">Floor de Goede</span>
You will note in this example I have made use of URLs to external resources – VIAF for defining the Persons and the Library of Congress relator codes – instead of defining them myself as strings. I have also linked the book to its Work definition so that someone exploring the data can discover other editions of the same work.
Do I always use Role?
In the above example I relate a book to two people, the author and the writer of preface. I could have linked to the author via another role with the roleName being ‘Author’ or <http://id.loc.gov/vocabulary/relators/aut>. Although possible, it is not a recommended approach. Wherever possible use the properties defined for a type. This is what data consumers such as search engines are going to be initially looking for.
One last example
To demonstrate the flexibility of using the Role type here is the markup that shows a small diversion in my early career:
@prefix schema:<http://schema.org/> .
This demonstrates the ability of Role to be used to provide added information about most relationships between entities, in this case the employee relationship. Often Role itself is sufficient, with the ability for the vocabulary to be extended with subtypes of Role to provide further use-case specific properties added.
Whenever possible use URLs for roleName
In the above example, it is exceedingly unlikely that there is a citeable definition on the web that I could link to for the roleName. So it is perfectly acceptable to just use the string “Keyboards Roadie”. However, to help the search engines understand unambiguously what role you are describing, it is always better to use a URL. If you can’t find one, for example in the Library of Congress Relator Codes, or in Wikidata, consider creating one yourself in Wikipedia or Wikidata for others to share. Another spin-off benefit of using URIs (URLs) is that they are language independent: regardless of the language of the labels in the data, the URI always means the same thing. Sources like Wikidata often have names and descriptions for things defined in multiple languages, which can be useful in itself.
Final advice
This very flexible mechanism has many potential uses when describing your resources in Schema.org. There is always a danger in overusing useful techniques such as this. Be sure that there is not already a way within Schema.org, or one worth proposing to those that look after the vocabulary, before using it.
Good luck in your role in describing your resources and the relationships between them using Schema.org
It is one thing to have a vision, regular readers of this blog will know I have them all the time; it is quite another to see it starting to form through the mist into a reality. Several times in the recent past I have spoken of some of the building blocks for bibliographic data to play a prominent part in the Web of Data. The Web of Data that is starting to take shape and drive benefits for everyone. Benefits that for many are hiding in plain sight on the results pages of search engines. In those informational panels with links to people’s parents, universities, and movies, or maps showing the location of mountains, and retail outlets; incongruously named Knowledge Graphs.
OK, you may say, we’ve heard all that before, so what is new now?
As always it is a couple of seemingly unconnected events that throw things into focus.
Event 1: An article by David Weinberger in the DigitalShift section of Library Journal entitled Let The Future Go. An excellent article telling libraries that they should not be so parochially focused in their own domain whilst looking to how they are going to serve their users’ needs in the future. Get our data out there, everywhere, so it can find its way to those users, wherever they are. Making it accessible to all. David references three main ways to provide this access:
APIs – to allow systems to directly access our library system data and functionality
Linked Data – can help us open up the future of libraries. By making clouds of linked data available, people can pull together data from across domains
The Library Graph – an ambitious project libraries could choose to undertake as a group that would jump-start the web presence of what libraries know: a library graph. A graph, such as Facebook’s Social Graph and Google’s Knowledge Graph, associates entities (“nodes”) with other entities
(I am fortunate to be a part of an organisation, OCLC, making significant progress on making all three of these a reality – the first one is already baked into the core of OCLC products and services)
It is the 3rd of those, however, that triggered recognition for me. Personally, I believe that we should not be focusing on a specific ‘Library Graph’ but more on the ‘Library Corner of a Giant Global Graph’ – if graphs can have corners that is. Libraries have rich specialised resources and have specific needs and processes that may need special attention to enable opening up of our data. However, when opened up in context of a graph, it should be part of the same graph that we all navigate in search of information whoever and wherever we are.
ZBW contributes to WorldCat, and has 1.2 million oclc numbers attached to it’s bibliographic records. So it seemed interesting, how many of these editions link to works and furthermore to other editions of the very same work.
The post is interesting from a couple of points of view. Firstly the simple steps they took to get at the data, really well demonstrated by the command-line calls used to access the data – get OCLCNum data from WorldCat.org in JSON format – extract the schema:exampleOfWork link to the Work – get the Work data from WorldCat, also in JSON – parse out the links to other editions of the work and compare with their own data. Command-line calls that were no doubt embedded in simple scripts.
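Those steps can be sketched in a few lines of Python. This is not ZBW's script: the helper names are mine, the example.org identifiers and the shape of the JSON-LD (an `@graph` list with un-prefixed or `schema:` prefixed keys) are assumptions, and the fetching of the documents is left out so that the sketch stays self-contained.

```python
def _as_ids(value):
    """Normalise a JSON-LD property value to a list of identifier strings."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    return [item["@id"] if isinstance(item, dict) else item for item in items]


def linked_ids(doc, subject_id, prop):
    """Return the identifiers that `prop` points to on the node `subject_id`."""
    nodes = doc.get("@graph", [doc])
    for node in nodes:
        if node.get("@id") == subject_id:
            return _as_ids(node.get(prop) or node.get("schema:" + prop))
    return []


# Stand-ins for the JSON returned for an edition and for its Work (made-up data).
edition_id = "http://example.org/oclc/111"
edition_doc = {"@graph": [
    {"@id": edition_id, "exampleOfWork": {"@id": "http://example.org/work/1"}},
]}
work_doc = {"@graph": [
    {"@id": "http://example.org/work/1",
     "workExample": ["http://example.org/oclc/111",
                     "http://example.org/oclc/222",
                     "http://example.org/oclc/333"]},
]}
local_holdings = {"http://example.org/oclc/111", "http://example.org/oclc/333"}

# Edition -> Work -> all editions -> those also held locally.
work_id = linked_ids(edition_doc, edition_id, "exampleOfWork")[0]
editions = linked_ids(work_doc, work_id, "workExample")
other_held = sorted((set(editions) & local_holdings) - {edition_id})
print(other_held)
```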
Secondly, there was the implicit way that the corpus of WorldCat Work entity descriptions, and their canonical identifying URIs, is used as an authoritative hub for Works and their editions. A concept that is not new in the library world; we have been doing this sort of thing with names and person identities via other authoritative hubs, such as VIAF, for ages. What is new here is that it is a hub for Works and their relationships, and the bidirectional nature of those relationships – work to edition, edition to work – in the beginnings of a library graph linked to other hubs for subjects, people, etc.
The ZBW Labs experiment is interesting in its own way – simple approach, enlightening results. What is more interesting for me is that it demonstrates a baby step towards the way the Library corner of that Global Web of Data will not only naturally form (as we expose and share data in this way – linked entity descriptions), but naturally fit into future library workflows with all sorts of consequential benefits.
The experiment is exactly the type of initiative that we hoped to stimulate by releasing the Works data. Using it for things we never envisaged, delivering unexpected value to our community. I can’t wait to hear about other initiatives like this that we can all learn from.
So who is going to be doing this kind of thing – describing entities and sharing them to establish these hubs (nodes) that will form the graph? Some are already there, in the traditional authority file hubs: The Library of Congress LC Linked Data Service for authorities and vocabularies (id.loc.gov), VIAF, ISNI, FAST, Getty vocabularies, etc.
As previously mentioned Work is only the first of several entity descriptions that are being developed in OCLC for exposure and sharing. When others, such as Person, Place, etc., emerge we will have a foundation of part of a library graph – a graph that can and will be used, and added to, across the library domain and then on into the rest of the Global Web of Data. An important authoritative corner, of a corner, of the Giant Global Graph.
As I said at the start these are baby steps towards a vision that is forming out of the mist. I hope you and others can see it too.
A couple of months back I spoke about the preview release of Works data from WorldCat.org. Today OCLC published a press release announcing the official release of 197 million descriptions of bibliographic Works.
A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work. The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary. In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, OCLC numbered, editions already shared from WorldCat.org.
These links (URIs) lead, where available, to authoritative sources for people, subjects, etc. When not available, placeholder URIs have been created to capture information not yet available or identified in such authoritative hubs. As you would expect from a linked data hub the works are available in common RDF serializations – Turtle, RDF/XML, N-Triples, JSON-LD – using the Schema.org vocabulary – under an open data license.
The obvious question is “how do I get a work id for the items in my catalogue?”. The simplest way is to use the already released linked data from WorldCat.org. If you have an OCLC Number (eg. 817185721) you can create the URI for that particular manifestation by prefixing it with ‘http://worldcat.org/oclc/’ thus: http://worldcat.org/oclc/817185721
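As a trivial sketch of that step in Python (the function name `oclc_uri` is made up for illustration; only the prefix comes from the text above):

```python
def oclc_uri(oclc_number):
    """Build the WorldCat linked data URI for an OCLC Number."""
    number = str(oclc_number).strip()
    if not number.isdigit():
        # Reject anything that is not a plain run of digits.
        raise ValueError(f"not an OCLC Number: {oclc_number!r}")
    return "http://worldcat.org/oclc/" + number


print(oclc_uri(817185721))
```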
In the linked data that is returned, either on screen in the Linked Data section, or in the RDF in your desired serialization, you will find the following triple which provides the URI of the work for this manifestation:
To quote Neil Wilson, Head of Metadata Services at the British Library:
With this release of WorldCat Works, OCLC is creating a significant, practical contribution to the wider community discussion on how to migrate from traditional institutional library catalogues to popular web resources and services using linked library data. This release provides the information community with a valuable opportunity to assess how the benefits of a works-based approach could impact a new generation of library services.
This is a major first step in a journey to provide linked data views of the entities within WorldCat. Looking forward to other WorldCat entities such as people, places, and events. Apart from being a major release of linked data, this capability is the result of applying [Big] Data mining and analysis techniques that have been the focus of research and development for several years. These efforts are demonstrating that there is much more to library linked data than the mechanical, record at a time, conversion of MARC records into an RDF representation.
You may find it helpful, in understanding the potential exposed by the release of Works, to review some of the questions and answers that were raised after the preview release.
Personally I am really looking forward to hearing about the uses that are made of this data.
Instead of keeping the answers within individual email threads, I thought they may be of interest to a wider audience:
Q I don’t see anything that describes the criteria for “workness.” The “workness” definition is more the result of several interdependent algorithmic decision processes than a simple set of criteria. To a certain extent publishing the results as linked data was the easy (huh!) bit. The efforts to produce these definitions and their relationships are the ongoing results of a research process, by OCLC Research, that has been in motion for several years, to investigate and benefit from FRBR. You can find more detail behind this research here: http://www.oclc.org/research/activities/frbr.html?urlm=159763
Q Defining what a “work” is has proven next to impossible in the commercial world, how will this be more successful? Very true: for often commercial and/or political reasons, previous initiatives in this direction have not been very successful. OCLC make no broader claim to the definition of a WorldCat Work, other than it is the result of applying the results of the FRBR and associated algorithms, developed by OCLC Research, to the vast collection of bibliographic data contributed, maintained, and shared by the OCLC member libraries and partners.
Q Will there be links to individual ISBN/ISNI records?
ISBN – ISBNs are attributes of manifestation [in FRBR terms] entities, and as such can be found in the already released WorldCat Linked Data. As each work is linked to its related manifestation entities [by schema:workExample] they are therefore already linked to ISBNs.
ISNI – ISNI is an identifier for a person and as such an ISNI URI is a candidate for use in linking Works to other entity types. VIAF URIs are another for Person/Organisation entities which, as we have the data, we will be using. No final decisions have been made as to which URIs we use and as to using multiple URIs for the same relationship. Whether we use ISNI, VIAF, & DBpedia URIs for the same person, or just use one and rely on interconnection between the authoritative hubs, is a question still to be concluded.
Q Can you say more about how the stable identifiers will be managed as the grouping of records that create a work change? You correctly identify the issue of maintaining identifiers as work groups split & merge. This is one of the tasks the development team are currently working on as they move towards full release of this data over the coming weeks. As I indicated in my blog post, there is a significant data refresh due and from that point onwards any changes will be handled correctly.
Q Is there a bulk download available? No there is no bulk download available. This is a deliberate decision for several reasons.
Firstly this is Linked Data – its main benefits accrue from its canonical persistent identifiers and the relationships it maintains between other identified entities within a stable, yet changing, web of data. WorldCat.org is a live data set actively maintained and updated by the thousands of member libraries, data partners, and OCLC staff and processes. I would discourage reliance on local storage of this data, as it will rapidly evolve and become out of synchronisation with the source. The whole point and value of persistent identifiers, which you would reference locally, is that they will always dereference to the current version of the data.
Q Where should bugs be reported? Today, you can either use the comment link from the Linked Data Explorer or report them to firstname.lastname@example.org. We will be building on this as we move towards full release.
Q There appears to be something funky with the way non-existent IDs are handled. You have spotted a defect! – The result of access to a non-established URI should be no triples returned with that URI as subject. How this is represented will differ between serialisations. Also you would expect to receive an HTTP status of 404.
Q It’s wonderful to see that the data is being licensed ODC-BY, but maybe assertions to that effect should be there in the data as well? The next release of data will be linked to a VoID document providing information, including licensing, for the dataset.
Q How might WorldCat Works intersect with the BIBFRAME model? – these work descriptions could be very useful as a bf:hasAuthority for a bf:Work. The OCLC team monitor, participate in, and take account of many discussions – BIBFRAME, Schema.org, SchemaBibEx, WikiData, etc. – where there are some obvious synergies in objectives, and differences in approach and/or levels of detail for different audiences. The potential for interconnection of datasets using sameAs, and other authoritative relationships such as you describe is significant. As the WorldCat data matures and other datasets are published, one would expect initiatives from many in starting to interlink bibliographic resources from many sources.
Q Will your team be making use of ISTC? Again it is still early for decisions in this area. However we would not expect to store the ISTC code as a property of Work. ISTC is one of many work based data sets, from national libraries and others, that it would be interesting to investigate processes for identifying sameAs relationships between.
The answer to the above question stimulated a follow-on question based upon the fact that ISTC Codes are allocated on a language basis. In FRBR terms language of publication is associated with the Expression, not the Work level description. As such therefore you would not expect to find ISTC on a ‘Work’ – My response to this was:
Note that the Works published from WorldCat.org are defined as instances of schema:CreativeWork.
What you say may well be correct for FRBR, but the WorldCat data may not adhere strictly to the FRBR rules and levels. I say ‘may not’ as we are still working on the modelling behind this and a language specific Work may become just an example of a more general Work – there again it may become more Expression-like. There is a balance to be struck between FRBR rules and a wider, non-library, understanding.
Q Which triplestore are you using? We are not using a triplestore. Already, in this early stage of the journey to publish linked data about the resources within WorldCat, the descriptions of hundreds of millions of entities have been published. There is obvious potential for this to grow to many billions. The initial objective is to reliably publish this data in ways that it is easily consumed, linked to, and available in the de facto linked data serialisations. To achieve this we have put in place a simple very scalable, flexible infrastructure currently based upon Apache Tomcat serving up individual RDF descriptions stored in Apache HBase (built on top of Apache Hadoop HDFS). No doubt future use cases will emerge, which will build upon this basic yet very valuable publishing of data, that will require additional tools, techniques, and technologies to become part of that infrastructure over time. I know the development team are looking forward to the challenges that the quantity, variety, and always changing nature of data within WorldCat will provide for some of the traditional [for smaller data sets] answers to such needs.
As an aside, you may be interested to know that significant use is made of the map/reduce capabilities of Apache Hadoop in the processing of data extracted from bibliographic records, the identification of entities within that data, and the creation of the RDF descriptions. I think it is safe to say that the creation and publication of this data would not have been feasible without Hadoop being part of the OCLC architecture.
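For readers unfamiliar with the map/reduce pattern, here is a toy sketch in plain Python. It is not OCLC's algorithm and does not use Hadoop. The records are made up, and the clustering rule in `work_key` (normalised author plus normalised title) is purely an assumption chosen to show the shape of the technique: a map step emits a key for each record, and a reduce step groups records sharing a key.

```python
from collections import defaultdict

# Hypothetical, made-up records: (record id, author, title).
# These are NOT real WorldCat data.
RECORDS = [
    ("rec-1", "Example, Ann", "A Sample Title"),
    ("rec-2", "Example, Ann", "A sample title."),
    ("rec-3", "Other, Bob", "Another Book"),
]


def normalise(text):
    """Lower-case the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def work_key(author, title):
    """Assumed clustering rule, for illustration only."""
    return normalise(author) + "/" + normalise(title)


def map_phase(records):
    """Map step: emit one (work key, record id) pair per record."""
    for rec_id, author, title in records:
        yield work_key(author, title), rec_id


def reduce_phase(pairs):
    """Reduce step: collect the record ids that share a work key."""
    groups = defaultdict(list)
    for key, rec_id in pairs:
        groups[key].append(rec_id)
    return dict(groups)


works = reduce_phase(map_phase(RECORDS))
for key, rec_ids in works.items():
    print(key, rec_ids)
```

In a real Hadoop job the map and reduce steps would run in parallel across many machines, which is what makes processing at this scale feasible.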
Hopefully this background will help those interested in the process. When we move from preview to a fuller release I expect to see associated documentation and background information appear.
I have just been sharing a platform, at the OCLC EMEA Regional Council Meeting in Cape Town, South Africa, with my colleague Ted Fons. A great setting for a great couple of days of the OCLC EMEA membership and others sharing thoughts, practices, collaborative ideas and innovations.
Ted and I presented our continuing insight into The Power of Shared Data, and the evolving data strategy for the bibliographic data behind WorldCat. If you want to see a previous view of these themes you can check out some recordings we made late last year on YouTube, from Ted – The Power of Shared Data – and me – What the Web Wants.
Today, demonstrating on-going progress towards implementing the strategy, I had the pleasure to preview two upcoming significant announcements on the WorldCat data front:
The release of 194 Million Open Linked Data Bibliographic Work descriptions
The WorldCat Linked Data Explorer interface
A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work. The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary. In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, oclc numbered, editions already shared in WorldCat. Let’s take a look at one – try this: http://worldcat.org/entity/work/id/12477503
You will see, displayed in the new WorldCat Linked Data Explorer, a html view of the data describing ‘Zen and the art of motorcycle maintenance’. Click on the ‘Open All’ button to view everything. Anyone used to viewing bibliographic data will see that this is a very different view of things. It is mostly URIs, the only visible strings being the name or description elements. This is not designed as an end-user interface; it is designed as a data exploration tool. This is highlighted by the links at the top to alternative RDF serialisations of the data – Turtle, N-Triples, JSON-LD, RDF/XML.
‘Why is this a preview?’ and ‘Can I usefully use the data now?’ are a couple of obvious questions for you to ask at this time.
This is the first production release of WorldCat infrastructure delivering linked data. It is the first step in what will be an evolutionary, and revolutionary, journey to provide interconnected linked data views of the rich entities (works, people, organisations, concepts, places, events) captured in the vast shared collection of bibliographic records that makes up WorldCat. Mining those 311+ million records is not a simple task, even to just identify works. It takes time, and a significant amount of [Big Data] computing resources. One of the key steps in this process is to identify where there exist connections between works and authoritative data hubs, such as VIAF, FAST, LCSH, etc. In this preview release, it is some of those connections that are not yet in place.
What you see in their place at the moment is a link to what can be described as a local authority. Each of these has what the data geeks call a hash-URI as its identifier. http://experiment.worldcat.org/entity/work/data/12477503#Person/pirsig_robert for example is such an identifier, constructed from the work URI and the person name. Over the next few weeks, where the information is available, you would expect to see this link replaced by a connection to VIAF, such as this: http://viaf.org/viaf/78757182.
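To make the hash-URI idea concrete, here is a minimal Python sketch that rebuilds the example identifier above from its parts. The function name `local_authority_uri` and the name-to-slug rule (lower-case, with runs of non-alphanumeric characters turned into underscores) are assumptions made for illustration. They happen to reproduce this one example and are not a description of how WorldCat generates its identifiers.

```python
import re
from urllib.parse import urldefrag


def local_authority_uri(work_data_uri, entity_type, name):
    """Build a hash-URI from a work data URI, an entity type and a name.

    The slug rule here is an assumption for illustration only.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{work_data_uri}#{entity_type}/{slug}"


uri = local_authority_uri(
    "http://experiment.worldcat.org/entity/work/data/12477503",
    "Person",
    "Pirsig, Robert",
)
print(uri)

# The part after '#' is a fragment. A client strips it off before making the
# http request, so the entity is described within the work's own document.
document_uri, fragment = urldefrag(uri)
print(document_uri)
print(fragment)
```

The point of the fragment is that one document can carry identifiers for several local entities, which is why these links can later be swapped for connections to a hub such as VIAF without changing the work URI.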
So, can I use the data? – Yes, the data is live, and most importantly the work URIs are persistent. It is also available under an open data license (ODC-BY).
In a very few weeks, once the next update to the WorldCat linked data has been processed, you will find that links to works will be embedded in the already published linked data. For example you will find the following in the data for OCLC number 53474380:
What is next on the agenda? As described, within a few weeks, we expect to enhance the linking within the descriptions and provide links from the oclc numbered manifestations. From then on, both WorldCat and others will start to use WorldCat Work URIs, and their descriptions, as a core, stable foundation on which to build out a web of relationships between entities in the library domain. It is that web of data that will stimulate the sharing of data and innovation in the design of applications and interfaces consuming the data over coming months and years.
As I said on the program today, we are looking for feedback on these releases.
We as a community are embarking on a new journey with shared, linked data at its heart. Its success will be based upon how that data is exposed, used, and the intrinsic quality of that data. Experience shows that a new view of data often exposes previously unseen issues, and it is just that sort of feedback we are looking for. So any feedback on any aspect of this will be more than welcome.
I am excitedly looking forward to being able to comment further as this journey progresses.
The Art & Architecture Thesaurus is a reference of over 250,000 terms on art and architectural history, styles, and techniques. I’m sure this will become an indispensable authoritative hub of terms in the Web of Data to assist those describing their resources and placing them in context in that Web.
This is the first step in an 18 month process to release four vocabularies – the others being The Getty Thesaurus of Geographic Names (TGN)®, The Union List of Artist Names®, and The Cultural Objects Name Authority (CONA)®.
A great step from Getty. I look forward to the others appearing over the months and seeing how rapidly they are taken up across the web.
I am pleased to share with you a small but significant step on the Linked Data journey for WorldCat and the exposure of data from OCLC.
Content-negotiation has been implemented for the publication of Linked Data for WorldCat resources.
For those immersed in the publication and consumption of Linked Data, there is little more to say. However I suspect there are a significant number of folks reading this who are wondering what the heck I am going on about. It is a little bit techie but I will try to keep it as simple as possible.
Back last year, a linked data representation of each (of the 290+ million) WorldCat resources was embedded in its web page on the WorldCat site. For full details check out that announcement but in summary:
All resource pages include Linked Data
Human visible under a Linked Data tab at the bottom of the page
That same data is now available in several machine readable RDF serialisations. RDF is RDF, but depending on your use it is easier to consume as RDFa, or XML, or JSON, or Turtle, or as triples.
In many Linked Data presentations, including some of mine, you will hear the line “As I clicked on the link in a web browser, we are seeing a html representation. However if I was a machine I would be getting XML or another format back.” Content negotiation is the mechanism in the http protocol that makes that happen.
Let me take you through some simple steps to make this visible for those that are interested.
Starting with a resource in WorldCat: http://www.worldcat.org/oclc/41266045. Clicking that link will take you to the page for Harry Potter and the prisoner of Azkaban. As we did not indicate otherwise, the content-negotiation defaulted to returning the html web page.
To specify that we want RDF/XML we would request http://www.worldcat.org/oclc/41266045.rdf (depending on your browser this may not display anything, but may instead allow you to download the result to view in your favourite editor).
This allows you to manually specify the serialisation format you require. You can also do it from within a program by specifying, in the Accept header of the http request, the format that you would accept when accessing the URI. This means that you do not have to write code to add the relevant suffix to each URI that you access. You can replicate the effect by using curl, a command line http client tool:
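The same negotiation can be sketched from within a program. The Python below, using only the standard library, builds a request whose Accept header asks for RDF/XML, and leaves the actual network call as a comment. The helper name `negotiated_request` is made up for this illustration. `application/rdf+xml` is the registered media type for RDF/XML, but which media types the WorldCat service honours is an assumption you would need to check against the service itself.

```python
from urllib.request import Request

RESOURCE_URI = "http://www.worldcat.org/oclc/41266045"


def negotiated_request(uri, media_type):
    """Build an http request that asks for a specific serialisation."""
    return Request(uri, headers={"Accept": media_type})


# Ask for RDF/XML using the same URI a browser would use for the html page.
req = negotiated_request(RESOURCE_URI, "application/rdf+xml")
print(req.full_url)
print(req.get_header("Accept"))

# To actually fetch the data (needs network access):
#   from urllib.request import urlopen
#   with urlopen(req) as response:
#       data = response.read()
```

Note that the URI itself is unchanged. Only the header differs, which is what lets generic linked data tools follow links without knowing anything about suffix conventions.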
If you embed links to WorldCat resources in your linked data, the standard tools used to navigate around your data should now be able to automatically follow those links into and around WorldCat data. If you have the URI for a WorldCat resource, which you can create by prefixing an oclc number with ‘http://www.worldcat.org/oclc/’, you can use it in a program, browser plug-in, smartphone/facebook app to pull data back, in a format that you prefer, to work with or display.
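Creating the URI from an oclc number is a one-line job, shown here as a small Python sketch. The function names are hypothetical, and the digit check is an added safety assumption, not a WorldCat rule. The `.rdf` suffix is the one described earlier for RDF/XML.

```python
WORLDCAT_PREFIX = "http://www.worldcat.org/oclc/"


def worldcat_uri(oclc_number):
    """Prefix an oclc number to form the URI of the WorldCat resource."""
    digits = str(oclc_number).strip()
    if not digits.isdigit():
        # Assumption for this sketch: oclc numbers are purely numeric.
        raise ValueError(f"not a numeric oclc number: {oclc_number!r}")
    return WORLDCAT_PREFIX + digits


def rdf_xml_uri(oclc_number):
    """Return the suffixed form that asks for RDF/XML explicitly."""
    return worldcat_uri(oclc_number) + ".rdf"


print(worldcat_uri(41266045))
print(rdf_xml_uri("41266045"))
```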
Go have a play, I would love to hear how people use this.
Show me an example of the effective publishing of Linked Data – That, or a variation of it, must be the request I receive more than most when talking to those considering making their own resources available as Linked Data, either in their enterprise, or on the wider web.
There are some obvious candidates. The BBC for instance, makes significant use of Linked Data within its enterprise. They built their fantastic Olympics 2012 online coverage on an infrastructure with Linked Data at its core. Unfortunately, apart from a few exceptions such as Wildlife and Programmes, we only see the results in a powerful web presence. The published data is only visible within their enterprise.
DBpedia is another excellent candidate. From about 2007 it has been a clear demonstration of Tim Berners-Lee’s principles of using URIs as identifiers and providing information, including links to other things, in RDF – it is just there at the end of the DBpedia URIs. But for some reason developers don’t seem to see it as a compelling example. Maybe it is influenced by the Wikipedia effect – interesting but built by open data geeks, so not to be taken seriously.
A third example, which I want to focus on here, is Ordnance Survey. Not generally known much beyond the geographical patch they cover, Ordnance Survey is the official mapping agency for Great Britain. Formally a government agency, they are best known for their incredibly detailed and accurate maps that are the standard accessory for anyone doing anything in the British countryside. A little less known is that they also publish information about post-code areas, parish/town/city/county boundaries, parliamentary constituency areas, and even European regions in Britain. As you can imagine, these do not all neatly intersect, which makes the data about them a great case for a graph based data model and hence for publishing as Linked Data. Which is what they did a couple of years ago.
The reason I want to focus on their efforts now, is that they have recently beta released a new API suite, which I will come to in a moment. But first I must emphasise something that is often missed.
Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’. With only standard [http] web protocols, you can get the data for an entity in their dataset by just doing a http GET request on the identifier. (eg. For my local village: http://data.ordnancesurvey.co.uk/id/7000000000002929). What you get back is some nicely formatted html for your web browser, and with content negotiation you can get the same thing as RDF/XML, JSON or Turtle. As it is Linked Data, what you get back also includes links to other data, enabling you to navigate your way around their data from entity to entity.
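The ‘navigate from entity to entity’ idea can be sketched in a few lines of Python. This is an illustration only. The `fetch_triples` argument stands in for whatever code does the http GET and RDF parsing (for example an RDF library), and the tiny in-memory graph with its example.org URIs is entirely made up so that the sketch runs without network access.

```python
from collections import deque


def is_uri(value):
    """Treat http(s) strings as links; anything else as a literal value."""
    return value.startswith("http://") or value.startswith("https://")


def navigate(start_uri, fetch_triples, max_entities=10):
    """Follow links breadth-first from one entity to the entities it cites.

    fetch_triples(uri) must return (subject, predicate, object) tuples
    describing that entity.
    """
    seen = {start_uri}
    queue = deque([start_uri])
    visited = []
    while queue and len(visited) < max_entities:
        current = queue.popleft()
        visited.append(current)
        for subject, _predicate, obj in fetch_triples(current):
            if subject == current and is_uri(obj) and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return visited


# Hypothetical data standing in for real responses.
VILLAGE = "http://example.org/id/village"
DISTRICT = "http://example.org/id/district"
COUNTY = "http://example.org/id/county"
WITHIN = "http://example.org/ontology/within"
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

FAKE_GRAPH = {
    VILLAGE: [(VILLAGE, WITHIN, DISTRICT), (VILLAGE, LABEL, "Example Village")],
    DISTRICT: [(DISTRICT, WITHIN, COUNTY)],
    COUNTY: [],
}

path = navigate(VILLAGE, lambda uri: FAKE_GRAPH.get(uri, []))
print(path)
```

Swapping the lambda for a function that really fetches and parses each URI is all it would take to walk live data, which is the point: no dataset-specific API is involved.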
An excellent demonstration of the basic power and benefit of Linked Data. So why is this often missed? Maybe it is because there is nothing to learn, no API documentation required, you can see and use it by just entering a URI into your web browser – too simple to be interesting perhaps.
To get at the data in more interesting and complex ways you need the API set thoughtfully provided by those that understand the data and some of the most common uses for it, Ordnance Survey.
The API set, now in beta, in my opinion is a most excellent example of how to build, document, and provide access to Linked Data assets in this way.
Firstly the APIs are applied as a standard to four available data sets – three individual, and one combining all three data sets. Nice that you can work with an individually focussed set or get data from all in a consolidated graph.
There are four APIs:
SPARQL – for running SPARQL queries over a dataset.
Lookup – a simple way to extract an RDF description of a single resource, using its URI.
Search – for running keyword searches over a dataset.
Reconciliation – a simple web service that supports linking of datasets to the Ordnance Survey Linked Data.
Each API is available to play with on a web page complete with examples and pop-up help hints. It is very easy and quick to get your head around the capabilities of the individual APIs, the use of parameters, and returned formats without having to read documentation or cut a single line of code.
For a quick intro there is even a page with them all on for you to try. When you do get around to cutting code, the documentation for each API is also well presented in simple and understandable form. They even include details of the available output formats and expected http response codes.
Finally a few general comments.
Firstly the look, feel, and performance of the site reflects that this is a robust serious professional service and fills you with confidence about building your application on its APIs. Developers of services and APIs, even for internal use, often underestimate the value of presenting and documenting their offering in a professional way. How often have you come across API documentation that makes the first web page look modern, and wondered about investing the time in even looking at it? Also a site with a snappy response ups your confidence that your application will perform well when using their service.
Secondly, there is the range of APIs, each cleanly and individually satisfying a specific general need. So for instance you can usefully use Search and Lookup without having any understanding of RDF or SPARQL – the power of SPARQL being there only if you understand and need it.
The additional features – CORS Support and Response Caching – (detailed on the API documentation pages) also demonstrate that this service has been built with the issues of the data consumer in mind. Providing the tools for consumers to take advantage of web caching in their application will greatly enhance response and performance. The CORS Support enables the creation of in browser applications that draw data from many sites – one of the oft promoted benefits of linked data, but sometimes a little tricky to implement ‘in browser’.
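As an illustration of what taking advantage of web caching can mean for a consuming application, here is a minimal Python sketch of a client-side cache built on http validators (ETag and If-None-Match). Whether this particular service issues ETags is an assumption, as its documentation is not quoted here. The class name and the example.org URI are made up, and the transport is left out so that the sketch stays self-contained.

```python
class EtagCache:
    """Remember response bodies and revalidate them with If-None-Match."""

    def __init__(self):
        self._entries = {}  # uri -> (etag, body)

    def conditional_headers(self, uri):
        """Headers to send with a GET so the server can answer 304."""
        if uri in self._entries:
            return {"If-None-Match": self._entries[uri][0]}
        return {}

    def handle_response(self, uri, status, etag, body):
        """Return the body to use, updating the cache as needed."""
        if status == 304 and uri in self._entries:
            # Not modified: reuse the stored copy, nothing was transferred.
            return self._entries[uri][1]
        if status == 200 and etag:
            self._entries[uri] = (etag, body)
        return body


cache = EtagCache()
uri = "http://example.org/api/lookup"  # hypothetical endpoint

# First request: nothing cached, so no conditional header is sent.
first_headers = cache.conditional_headers(uri)
cache.handle_response(uri, 200, '"v1"', "<rdf>first</rdf>")

# Second request: the stored ETag is offered, and a 304 reuses the cached body.
second_headers = cache.conditional_headers(uri)
body = cache.handle_response(uri, 304, None, "")
print(first_headers, second_headers, body)
```

A 304 response carries no body, so an application that polls the same entities repeatedly saves both bandwidth and parsing time.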
I can see this site and its associated APIs greatly enhancing the reputation of Ordnance Survey; underpinning the development of many apps and applications; and becoming an ideal source for many people to go ‘to try out’, when writing their first API consuming application code.