I find myself in New York for the day on my way back from the excellent Smart Data 2015 Conference in San Jose. It’s a long story about red-eye flights and significant weekend savings which I won’t bore you with, but it did result in some great chill-out time in Central Park to reflect on the week.
In its long auspicious history the SemTech, Semantic Tech & Business, and now Smart Data Conference has always attracted a good cross section of the best and brightest in Semantic Web, Linked Data, Web, and associated worlds. This year was no different for me in my new role as an independent working with OCLC and at Google.
I was there on behalf of OCLC to review significant developments with Schema.org in general – now with 640 Types (Classes) & 988 properties – used on over 10 Million web sites. Plus the pioneering efforts OCLC are engaged with, publishing Schema.org data in volume from WorldCat.org and via APIs in their products. Check out my slides:
By mining the 300+ million records in WorldCat to identify, describe, and publish approx. 200 million Work entity descriptions, and [soon to be shared] 90+ million Person entity descriptions, this pioneering continues.
These are not only significant steps forward for the bibliographic sector, but a great example of a pattern to be followed by most sectors:
Identify the entities in your data
Describe them well using Schema.org
Publish embedded in html
Work with, do not try to replace, the domain specific vocabularies – Bibframe in the library world
Work with the community to extend and enhance Schema.org to enable better representation of your resources
If Schema.org is still not broad enough for you, build an extension to it that solves your problems whilst still maintaining the significant benefits of sharing using Schema.org – in the library world’s case this was BiblioGraph.net
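As a rough illustration of the first three steps in the pattern above, the sketch below describes one entity with Schema.org terms and wraps it as JSON-LD in a script element that could be placed in a page's html. The function name `embed_jsonld` and the sample book values are illustrative assumptions; they are not taken from WorldCat or from any OCLC tooling.

```python
import json


def embed_jsonld(description: dict) -> str:
    """Wrap a Schema.org description as a JSON-LD <script> element for an HTML page."""
    # Escape "</" so a stray "</script>" inside a value cannot end the element early.
    payload = json.dumps(description, indent=2, ensure_ascii=False).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Hypothetical entity description; the values are made up for illustration.
book = {
    "@context": "http://schema.org",
    "@type": "Book",
    "name": "An Example Title",
    "author": {"@type": "Person", "name": "A. N. Author"},
}

snippet = embed_jsonld(book)
print(snippet)
```

JSON-LD is used here only because it is easy to generate from a dictionary; Microdata or RDFa attributes woven into the visible html would serve the same purpose.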
Extending Schema.org
Through OCLC and now Google I have been working with and around Schema.org since 2012. The presentation at Smart Data arrived at an opportune time to introduce and share some major developments with the vocabulary and the communities that surround it.
On a personal note the launch of these extensions, bib.schema.org in particular, is the culmination of a bit of a journey that started a couple of years ago with the forming of the Schema Bib Extend W3C Community Group (SchemaBibEx), which had great success in proposing additions and changes to the core vocabulary.
A journey that then took in the formation of the BiblioGraph.net extension vocabulary, which demonstrated both how to build a domain focused vocabulary on top of Schema.org and how the open source software that powers the Schema.org site could be forked for such an effort. These two laid the groundwork for defining how hosted and external extensions will operate, and for SchemaBibEx to be one of the first groups to propose a hosted extension.
Finally this last month working at Google with Dan Brickley on Schema.org has been a bit of a blur as I brushed up my Python skills to turn the potential in version 2.0 into the reality of fully integrated and operational extensions in version 2.1. And to get it all done in time to talk about at Smart Data was the icing on the cake.
Of course things are not stopping there. On the not too distant horizon are:
The final acceptance of bib.schema.org & auto.schema.org – currently they are in final review.
SchemaBibEx can now follow up this initial version of bib.schema.org with items from its backlog.
New extension proposals are already in the works such as: health.schema.org, archives.schema.org, fibo.schema.org.
More work on the software to improve the navigation and helpfulness of the site for those looking to understand and adopt Schema.org and/or the extensions.
The checking of the capability for the software to host external extensions without too much effort.
And of course the continuing list of proposals and fixes for the core vocabulary and the site itself.
I believe we are on the cusp of a significant step forward for Schema.org as it becomes ubiquitous across the web; more organisations, encouraged by extensions, prepare to publish their data; and the SEO community recognise proof of it actually working – but more of that in the next post.
The Culture Grid closed to ‘new accessions’ (ie. new collections of metadata) on the 30th April
The existing index and API will continue to operate in order to ensure legacy support
Museums, galleries, libraries and archives wishing to contribute material to Europeana can still do so via the ‘dark aggregator’, which the Collections Trust will continue to fund
Interested parties are invited to investigate using the Europeana Connection Kit to automate the batch-submission of records into Europeana
The reasons he gave for the ending of this aggregation service are enlightening for all engaged with or thinking about data aggregation in the library, museum, and archives sectors.
Throughout its history, the Culture Grid has been tough going. Looking back over the past 7 years, I think there are 3 primary and connected reasons for this:
The value proposition for aggregation doesn’t stack up in terms that appeal to museums, libraries and archives. The investment of time and effort required to participate in platforms like the Culture Grid isn’t matched by an equal return on that investment in terms of profile, audience, visits or political benefit. Why would you spend 4 days tidying up your collections information so that you can give it to someone else to put on their website? Where’s the kudos, increased visitor numbers or financial return?
Museum data (and to a lesser extent library and archive data) is non-standard, largely unstructured and dependent on complex relations. In the 7 years of running the Culture Grid, we have yet to find a single museum whose data conforms to its own published standard, with the result that every single data source has required a minimum of 3-5 days and frequently much longer to prepare for aggregation. This has been particularly salutary in that it comes after 17 years of the SPECTRUM standard providing, in theory at least, a rich common data standard for museums;
Metadata is incidental. After many years of pump-priming applications which seek to make use of museum metadata it is increasingly clear that metadata is the salt and pepper on the table, not the main meal. It serves a variety of use cases, but none of them is ‘proper’ as a cultural experience in its own right. The most ‘real’ value proposition for metadata is in powering additional services like related search & context-rich browsing.
The first two of these issues represent a fundamental challenge for anyone aiming to promote aggregation. Countering them requires a huge upfront investment in user support and promotion, quality control, training and standards development.
The 3rd is the killer though – countering these investment challenges would be possible if doing so were to lead directly to rich end-user experiences. But they don’t. Instead, you have to spend a huge amount of time, effort and money to deliver something which the vast majority of users essentially regard as background texture.
As an old friend of mine would depressingly say – Makes you feel like packing up your tent and going home!
Interestingly, earlier in the post Nick gives us an insight into the purpose of Culture Grid:
.… we created the Culture Grid with the aim of opening up digital collections for discovery and use ….
That basic purpose is still very valid for both physical and digital collections of all types. The what [helping people find, discover, view and use cultural resources] is as valid as it has ever been. It is the how [aggregating metadata and building shared discovery interfaces and landing pages for it] that has been too difficult to justify continuing in Culture Grid’s case.
In my recent presentations to library audiences I have been asking a simple question “Why do we catalogue?” Sometimes immediately, sometimes after some embarrassed shuffling of feet, I inevitably get the answer “So we can find stuff!“. In libraries, archives, and museums helping people find the stuff we have is core to what we do – all the other things we do are a little pointless if people can’t find, or even be aware of, what we have.
If you are hoping your resources will be found they have to be referenced where people are looking. Where are they looking?
It is exceedingly likely they are not looking in your aggregated discovery interface, or your local library, archive or museum interface either. Take a look at this chart detailing the discovery starting point for college students and others. Starting in a search engine is up in the high eighty percents, with things like library web sites and other targeted sources only just making it over the 1% hurdle to get on the chart. We have known about this for some time – the chart comes from an OCLC Report ‘College Students’ Perceptions of Libraries and Information Resources‘ published in 2005. I would love to see a similar report from recent times, it would have to include elements such as Siri, Cortana, and other discovery tools built-in to our mobile devices which of course are powered by the search engines. Makes me wonder how few cultural heritage specific sources would actually make that 1% cut today.
Our potential users are in the search engines in one way or another; however, in the vast majority of cases our [cultural heritage] resources are not there for them to discover.
Culture Grid, I would suggest, is probably not the only organisation, with an ‘aggregate for discovery’ reason for their existence, that may be struggling to stay relevant, or even in existence.
You may well ask about OCLC, with its iconic WorldCat.org discovery interface. It is a bit simplistic to say that its 320 million plus bibliographic records are in WorldCat only for people to search and discover through the worldcat.org user interface. Those records also underpin many of the services, such as cooperative cataloguing, record supply, inter library loan, and general library back office tasks, that OCLC members and partners benefit from. Also for many years WorldCat has been at the heart of syndication partnerships supplying data to prominent organisations, including Google, that help them reference resources within WorldCat.org which in turn, via find in a library capability, lead to clicks onwards to individual libraries. [Declaration: OCLC is the company name on my current salary check] Nevertheless, even though WorldCat has a broad spectrum of objectives, it is not totally immune from the influences that are troubling the likes of Culture Grid. In fact they are one of the web trends that have been driving the Linked Data and Schema.org efforts from the WorldCat team, but more of that later.
How do we get our resources visible in the search engines then? By telling the search engines what we [individual organisations] have. We do that by sharing a relevant view of our metadata about our resources, not necessarily all of it, in a form that the search engines can easily consume. Basically this means sharing data embedded in your web pages, marked up using the Schema.org vocabulary. To see how this works, we need look no further than the rest of the web – commerce, news, entertainment etc. There are already millions of organisations, measured by domains, that share structured data in their web pages using the Schema.org vocabulary with the search engines. This data is being used to direct users with more confidence directly to a site, and is contributing to the global web of data.
There used to be a time when people in the commercial world complained of always being directed to shopping [aggregation] sites instead of directly to where they could buy the TV or washing machine they were looking for. Today you are far more likely to be given some options in the search engine that link you directly to the retailer. I believe this is symptomatic of the disintermediation of the aggregators by individual syndication of metadata from those retailers.
Can these lessons be carried through to the cultural heritage sector – of course they can. This is where there might be a bit of light at the end of the tunnel for those behind the aggregations such as Culture Grid. Not for the continuation as an aggregation/discovery site, but as a facilitator for the individual contributors. This stuff, when you first get into it, is not simple and many organisations do not have the time and resources to understand how to share Schema.org data about their resources with the web. The technology itself is comparatively simple, in web terms, it is the transition and implementation that many may need help with.
Schema.org is not the perfect solution to describing resources, it is not designed to be. It is there to describe them sufficiently to be found on the web. Nevertheless it is also being evolved by community groups to enhance its capabilities. Through my work with the Schema Bib Extend W3C Community Group, enhancements to Schema.org that enable better description of bibliographic resources have been successfully proposed and adopted. This work is continuing towards a bibliographic extension – bib.schema.org. There is obvious potential for other communities to help evolve and extend Schema to better represent their particular resources – archives for example. I would be happy to talk with others who want insights into how they may do this for their benefit.
Schema.org is not a replacement for our rich common data standards such as MARC for libraries, and SPECTRUM for museums as Nick describes. Those serve purposes beyond sharing information with the wider world, and should continue to be used for those purposes whilst relevant. However we can not expect the rest of the world to get its head around our internal vocabularies and formats in order to point people at our resources. It needs to be a compromise. We can continue to use what is relevant in our own sectors whilst sharing Schema.org data so that our resources can be discovered and then explored further.
So to return to the question I posed – Is There Still a Case for Cultural Heritage Data Aggregation? – If the aggregation is purely for the purpose of supporting discovery, I think the answer is a simple no. If it has broader purpose, such as for WorldCat, it is not as clear cut.
I do believe nevertheless that many of the people behind the aggregations are in the ideal place to help facilitate the eventual goal of making cultural heritage resources easily discoverable. That will take some creative thinking, adoption of ‘web’ techniques, technologies and approaches to provide facilitation services, and a review of what their real goals are [which may not include running a search interface]. I believe we are moving into an era where shared authoritative sources of easily consumable data could make our resources more visible than we previously could have hoped.
Are there any black clouds on this hopeful horizon? Yes there is one. In the shape of traditional cultural heritage technology conservatism. The tendency to assume that our vocabulary or ontology is the only way to describe our resources, coupled with a reticence to be seen to engage with the commercial discovery world, could still hold back the potential.
As an individual library, archive, or museum scratching your head about how to get your resources visible in Google and not having the in-house ability to react; try talking within the communities around and behind the aggregation services you already know. They all should be learning and a problem shared is more easily solved. None of this is rocket science, but trying something new is often better as a group.
Schema.org is basically a simple vocabulary for describing stuff, on the web. Embed it in your html and the search engines will pick it up as they crawl, and add it to their structured data knowledge graphs. They even give you three formats to choose from (Microdata, RDFa, and JSON-LD) when doing the embedding. I’m assuming, for this post, that the benefits of being part of the Knowledge Graphs that underpin so called Semantic Search, and hopefully triggering some Rich Snippet enhanced results display as a side benefit, are self evident.
The vocabulary itself is comparatively easy to apply once you get your head around it — find the appropriate Type (Person, CreativeWork, Place, Organization, etc.) for the thing you are describing, check out the properties in the documentation and code up the ones you have values for. Ideally provide a URI (URL in Schema.org) for a property that references another thing, but if you don’t have one a simple string will do.
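The "URI if you have one, string if you don't" guidance can be sketched in a few lines. The helper `ref` below is a hypothetical convenience, not part of any Schema.org tooling, and the names and the example.org URI are placeholders chosen for illustration.

```python
from typing import Optional, Union


def ref(name: str, uri: Optional[str] = None) -> Union[str, dict]:
    """Return a property value: a linked node when a URI is known, else a plain string."""
    if uri:
        return {"@id": uri, "name": name}
    return name


description = {
    "@context": "http://schema.org",
    "@type": "CreativeWork",
    "name": "An Example Work",
    # URI known (hypothetical here): reference the thing itself.
    "contributor": ref("A. Contributor", "http://example.org/person/1"),
    # No URI to hand: a simple string will do.
    "author": ref("A. N. Author"),
}
```

Either form is acceptable to consumers, but the linked form lets them connect the value to whatever else is known about that thing.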
There are a few strangenesses that hit you when you first delve into using the vocabulary. For example, there is no problem in describing something that is of multiple types: a LocalBusiness is both an Organization and a Place. This post is about another unusual, but very useful, aspect of the vocabulary: the Role type.
At first look at the documentation, Role looks like a very simple type with a handful of properties. On closer inspection, however, it doesn’t seem to fit in with the rest of the vocabulary. That is because it is capable of fitting almost anywhere. Anywhere there is a relationship between one type and another, that is. It is a special case type that allows a relationship, say between a Person and an Organization, to be given extra attributes. Some might term this as a form of annotation.
So what need is this satisfying you may ask. It must be a significant need to cause the creation of a special case in the vocabulary. Let me walk through a case, that is used in a Schema.org Blog post, to explain a need scenario and how Role satisfies that need.
Starting With American Football
Say you are describing members of an American Football Team. Firstly you would describe the team using the SportsOrganization type, giving it a name, sport, etc. Using RDFa:
So we now have Chucker Roberts described as an athlete on the Touchline Gods team. The obvious question then is how do we describe the position he plays in the team. We could have extended the SportsOrganization type with a property for every position, but scaling that across every position for every team sport type would have soon ended up with far more properties than would have been sensible, and beyond the maintenance scope of a generic vocabulary such as Schema.org.
This is where Role comes in handy. Regardless of the range defined for any property in Schema.org, it is acceptable to provide a Role as a value. The convention then is to use a property with the same name as the one the Role is a value for, to remake the connection to the referenced thing (in this case the Person). In simple terms we have just inserted a Role type between the original two descriptions.
This indirection has not added much you might initially think, but Role has some properties of its own (startDate, endDate, roleName) that can help us qualify the relationship between the SportsOrganization and the athlete (Person). For the field of organizations there is a subtype of Role (OrganizationRole) which allows the relationship to be qualified slightly more.
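To make the indirection concrete, here is a minimal sketch that builds the description as JSON-LD rather than RDFa. The helper `via_role`, the position "Quarterback" and the dates are assumptions added for illustration; the team name, player name and the `athlete` property follow the wording of the example above.

```python
import json


def via_role(prop: str, target: dict, role_type: str = "Role", **qualifiers) -> dict:
    """Insert a Role between a subject and a target.

    The Role repeats the property name (prop) to point on to the target,
    and carries the qualifiers (roleName, startDate, endDate, ...) itself.
    """
    role = {"@type": role_type, prop: target}
    role.update(qualifiers)
    return role


player = {"@type": "Person", "name": "Chucker Roberts"}

team = {
    "@context": "http://schema.org",
    "@type": "SportsOrganization",
    "name": "Touchline Gods",
    # Without Role this would simply be: "athlete": player
    "athlete": via_role(
        "athlete",
        player,
        role_type="OrganizationRole",
        roleName="Quarterback",  # assumed position, for illustration only
        startDate="2011",        # assumed dates
        endDate="2014",
    ),
}

print(json.dumps(team, indent=2))
```

Note how the `athlete` key appears twice: once from the team to the Role, and once from the Role on to the Person.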
So far I have just been stepping through the example provided in the Schema.org blog post on this. Let’s take a look at an example from another domain – the one I spend my life immersed in – libraries.
There are many relationships between creative works that libraries curate and describe (books, articles, theses, manuscripts, etc.) and people & organisations that are not covered adequately by the properties available (author, illustrator, contributor, publisher, character, etc.) in CreativeWork and its subtypes. By using Role, in the same way as in the sports example above, we have the flexibility to describe what is needed.
Take a book (How to be Orange: an alternative Dutch assimilation course) authored by Gregory Scott Shapiro, that has a preface written by Floor de Goede. As there is no writerOfPreface property we can use, the best we could do is to put Floor de Goede in as a contributor. However by using Role we can qualify the contribution role that he played to be that of the writer of preface.
<span property="roleName" src="http://id.loc.gov/vocabulary/relators/wpr">Writer of preface</span>
<span property="contributor" src="http://viaf.org/viaf/283191359">Floor de Goede</span>
You will note in this example I have made use of URLs to external resources – VIAF for defining the Persons and the Library of Congress relator codes – instead of defining them myself as strings. I have also linked the book to its Work definition so that someone exploring the data can discover other editions of the same work.
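For readers who prefer JSON-LD, the same book description might be sketched as below. The relator and VIAF URLs are the ones in the markup fragments above; the overall shape is an assumption rather than a copy of the original markup, the author is given as a plain name because no identifier for him is quoted here, and the Work link is left out for the same reason.

```python
import json

book = {
    "@context": "http://schema.org",
    "@type": "Book",
    "name": "How to be Orange: an alternative Dutch assimilation course",
    # A defined property exists for authors, so it is used directly (no Role).
    "author": {"@type": "Person", "name": "Gregory Scott Shapiro"},
    # No writerOfPreface property exists, so the contribution is qualified by a Role.
    "contributor": {
        "@type": "Role",
        "roleName": "http://id.loc.gov/vocabulary/relators/wpr",  # Writer of preface
        "contributor": {
            "@type": "Person",
            "@id": "http://viaf.org/viaf/283191359",
            "name": "Floor de Goede",
        },
    },
}

print(json.dumps(book, indent=2))
```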
Do I always use Role? In the above example I relate a book to two people, the author and the writer of preface. I could have linked to the author via another role with the roleName being ‘Author’ or <http://id.loc.gov/vocabulary/relators/aut>. Although possible, it is not a recommended approach. Wherever possible use the properties defined for a type. This is what data consumers such as search engines are going to be initially looking for.
One last example
To demonstrate the flexibility of using the Role type here is the markup that shows a small diversion in my early career:
@prefix schema:<http://schema.org/> .
This demonstrates the ability of Role to be used to provide added information about most relationships between entities, in this case the employee relationship. Often Role itself is sufficient, and the vocabulary can be extended with subtypes of Role to provide further use-case specific properties.
Whenever possible use URLs for roleName
In the above example, it is exceedingly unlikely that there is a citeable definition on the web that I could link to for the roleName. So it is perfectly acceptable to just use the string “Keyboards Roadie”. However to help the search engines understand unambiguously what role you are describing, it is always better to use a URL. If you can’t find one, for example in the Library of Congress Relator Codes, or in Wikidata, consider creating one yourself in Wikipedia or Wikidata for others to share. Another spin-off benefit of using URIs (URLs) is that they are language independent: regardless of the language of the labels in the data, the URI always means the same thing. Sources like Wikidata often have names and descriptions for things defined in multiple languages, which can be useful in itself.
Final advice
This very flexible mechanism has many potential uses when describing your resources in Schema.org. There is always a danger in over using useful techniques such as this. Be sure that there is not already a way within Schema, or one worth proposing to those that look after the vocabulary, before using it.
Good luck in your role in describing your resources and the relationships between them using Schema.org
It is one thing to have a vision, regular readers of this blog will know I have them all the time; it is yet another to see it starting to form through the mist into a reality. Several times in the recent past I have spoken of some of the building blocks for bibliographic data to play a prominent part in the Web of Data. The Web of Data that is starting to take shape and drive benefits for everyone. Benefits that for many are hiding in plain sight on the results pages of search engines. In those informational panels with links to people’s parents, universities, and movies, or maps showing the location of mountains, and retail outlets; incongruously named Knowledge Graphs.
OK, you may say, we’ve heard all that before, so what is new now?
As always it is a couple of seemingly unconnected events that throw things into focus.
Event 1: An article by David Weinberger in the DigitalShift section of Library Journal entitled Let The Future Go. An excellent article telling libraries that they should not be so parochially focused in their own domain whilst looking to how they are going to serve their users’ needs in the future. Get our data out there, everywhere, so it can find its way to those users, wherever they are. Making it accessible to all. David references three main ways to provide this access:
APIs – to allow systems to directly access our library system data and functionality
Linked Data – can help us open up the future of libraries. By making clouds of linked data available, people can pull together data from across domains
The Library Graph – an ambitious project libraries could choose to undertake as a group that would jump-start the web presence of what libraries know: a library graph. A graph, such as Facebook’s Social Graph and Google’s Knowledge Graph, associates entities (“nodes”) with other entities
(I am fortunate to be a part of an organisation, OCLC, making significant progress on making all three of these a reality – the first one is already baked into the core of OCLC products and services)
It is the 3rd of those, however, that triggered recognition for me. Personally, I believe that we should not be focusing on a specific ‘Library Graph’ but more on the ‘Library Corner of a Giant Global Graph’ – if graphs can have corners that is. Libraries have rich specialised resources and have specific needs and processes that may need special attention to enable opening up of our data. However, when opened up in context of a graph, it should be part of the same graph that we all navigate in search of information whoever and wherever we are.
ZBW contributes to WorldCat, and has 1.2 million oclc numbers attached to it’s bibliographic records. So it seemed interesting, how many of these editions link to works and furthermore to other editions of the very same work.
The post is interesting from a couple of points of view. Firstly the simple steps they took to get at the data, really well demonstrated by the command-line calls used to access the data – get OCLCNum data from WorldCat.org in JSON format – extract the schema:exampleOfWork link to the Work – get the Work data from WorldCat, also in JSON – parse out the links to other editions of the work and compare with their own data. Command-line calls that were no doubt embedded in simple scripts.
Secondly, there was the implicit way that the corpus of WorldCat Work entity descriptions, and their canonical identifying URIs, is used as an authoritative hub for Works and their editions. A concept that is not new in the library world, we have been doing this sort of thing with names and person identities via other authoritative hubs, such as VIAF, for ages. What is new here is that it is a hub for Works and their relationships, and the bidirectional nature of those relationships – work to edition, edition to work – in the beginnings of a library graph linked to other hubs for subjects, people, etc.
The ZBW Labs experiment is interesting in its own way – simple approach, enlightening results. What is more interesting for me is that it demonstrates a baby step towards the way the Library corner of that Global Web of Data will not only naturally form (as we expose and share data in this way – linked entity descriptions), but naturally fit in to future library workflows with all sorts of consequential benefits.
The experiment is exactly the type of initiative that we hoped to stimulate by releasing the Works data. Using it for things we never envisaged, delivering unexpected value to our community. I can’t wait to hear about other initiatives like this that we can all learn from.
So who is going to be doing this kind of thing – describing entities and sharing them to establish these hubs (nodes) that will form the graph. Some are already there, in the traditional authority file hubs: The Library of Congress LC Linked Data Service for authorities and vocabularies (id.loc.gov), VIAF, ISNI, FAST, Getty vocabularies, etc.
As previously mentioned Work is only the first of several entity descriptions that are being developed in OCLC for exposure and sharing. When others, such as Person, Place, etc., emerge we will have a foundation of part of a library graph – a graph that can and will be used, and added to, across the library domain and then on into the rest of the Global Web of Data. An important authoritative corner, of a corner, of the Giant Global Graph.
As I said at the start these are baby steps towards a vision that is forming out of the mist. I hope you and others can see it too.
Regular readers of this blog may well know I am an enthusiast for Schema.org – the generic vocabulary for describing things on the web as structured data, backed by the major search engines Google, Bing, Yahoo! & Yandex. When I first got my head around it back in 2011 I soon realised its potential for making bibliographic resources, especially those within libraries, a heck of a lot more discoverable. To be frank library resources did not, and still don’t, exactly leap into view when searching the web – a bit of a problem when most people start searching for things with Google et al – and do not look elsewhere.
Schema.org as a generic vocabulary to describe most stuff, easily embedded in your web pages, has been a great success. As was reported by Google’s R.V. Guha, at the recent Semantic Technology and Business Conference in San Jose, a sample of 12B pages showed approximately 21% containing Schema.org markup. Right from the beginning, however, I had concerns about its applicability to the bibliographic world – great start with the Book type, but there were gaps in the coverage for such things as journal issues & volumes, multi-volume works, citations, and the relationship between a work and its editions. Discovering others shared my combination of enthusiasm and concerns, I formed a W3C Community Group – Schema Bib Extend – to propose some bibliographic focused extensions to Schema.org. Which brings me to the events behind this post…
The SchemaBibEx group have had several proposals accepted over the last couple of years, such as making the [commercial] Offer more appropriate for describing loanable materials, and broadening of the citation property. Several other significant proposals were brought together in a package which I take great pleasure in reporting was included in the latest v1.9 release of Schema.org. For many in our group these latest proposals were a long time coming after their initial proposal. Although frustrating, the delays were symptomatic of a very healthy process.
Although the number of new types and properties is small, their addition to Schema opens up potential for much better description of periodicals and creative work relationships. To introduce the background to this, SchemaBibEx member Dan Scott and I were invited to jointly post on the Schema.org Blog.
So, another step forward for Schema.org. I believe it is more than just a step, however, for those wishing to make bibliographic resources more visible on the Web. There has been some criticism that Schema.org has been too simplistic to be able to represent some of the relationships and subtleties from our world. Criticism that was not unfounded. Now with these enhancements, many of these criticisms are answered. There is more to do, but the major objective of the group that proposed them has been achieved – to lay the broad foundation for the description of bibliographic, and creative work, resources in sufficient detail for them to be understood by the search engines to become part of their knowledge graphs. Of course that is not the final end we are seeking. The reason we share data is so that folks are guided to our resources – by sharing, using the well understood vocabulary, Schema.org.
Examples of a conceptual creative work being related to its editions, using exampleOfWork and workExample, have been available for some time. In anticipation of their appearance in Schema, they were introduced into the OCLC WorldCat release of 194 million Work descriptions (for example: http://worldcat.org/entity/work/id/1363251773) with the inverse relationship being asserted in an updated version of the basic WorldCat linked data that has been available since 2012.
A couple of months back I spoke about the preview release of Works data from WorldCat.org. Today OCLC published a press release announcing the official release of 197 million descriptions of bibliographic Works.
A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work. The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary. In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, OCLC numbered, editions already shared from WorldCat.org.
These links (URIs) lead, where available, to authoritative sources for people, subjects, etc. When not available, placeholder URIs have been created to capture information not yet available or identified in such authoritative hubs. As you would expect from a linked data hub the works are available in common RDF serializations – Turtle, RDF/XML, N-Triples, JSON-LD – using the Schema.org vocabulary – under an open data license.
The obvious question is “how do I get a work id for the items in my catalogue?”. The simplest way is to use the already released linked data from WorldCat.org. If you have an OCLC Number (eg. 817185721) you can create the URI for that particular manifestation by prefixing it with ‘http://worldcat.org/oclc/’ thus: http://worldcat.org/oclc/817185721
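For anyone wanting to do this for a whole catalogue rather than by hand, the prefixing step can be sketched in a few lines of Python. The function name is my own invention for illustration, and the check that the number is purely digits is an assumption about how your OCLC Numbers are stored locally.

```python
def oclc_number_to_uri(oclc_number):
    """Build the WorldCat linked data URI for an OCLC-numbered manifestation."""
    digits = str(oclc_number).strip()
    # Assumption: the number is held locally as plain digits with no prefix.
    if not digits.isdigit():
        raise ValueError(f"not an OCLC number: {oclc_number!r}")
    return "http://worldcat.org/oclc/" + digits


print(oclc_number_to_uri(817185721))  # http://worldcat.org/oclc/817185721
```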
In the linked data that is returned, either on screen in the Linked Data section, or in the RDF in your desired serialization, you will find the following triple which provides the URI of the work for this manifestation:
To quote Neil Wilson, Head of Metadata Services at the British Library:
With this release of WorldCat Works, OCLC is creating a significant, practical contribution to the wider community discussion on how to migrate from traditional institutional library catalogues to popular web resources and services using linked library data. This release provides the information community with a valuable opportunity to assess how the benefits of a works-based approach could impact a new generation of library services.
This is a major first step in a journey to provide linked data views of the entities within WorldCat. Looking forward to other WorldCat entities such as people, places, and events. Apart from being a major release of linked data, this capability is the result of applying [Big] Data mining and analysis techniques that have been the focus of research and development for several years. These efforts are demonstrating that there is much more to library linked data than the mechanical, record at a time, conversion of MARC records into an RDF representation.
You may find it helpful, in understanding the potential exposed by the release of Works, to review some of the questions and answers that were raised after the preview release.
Personally I am really looking forward to hearing about the uses that are made of this data.
One of the most difficult challenges in my evangelism of the benefits of using Schema.org for sharing data about resources via the web is that it is difficult to ‘show’ what is going on.
The scenario goes something like this…..
“Using the Schema.org vocabulary, you embed data about your resources in the HTML that makes up the page using either microdata or RDFa….”
At about this time you usually display a slide showing html code with embedded RDFa. It may look pretty but the chances of more than a few of the audience being able to pick out the schema:Book or sameAs or rdf:type elements from the plethora of angle brackets and quotes swimming before their eyes are fairly remote.
Having asked them to take a leap of faith that the gobbledegook you have just presented them with is not only simple to produce but also invisible to users viewing their pages – “but not to Google, which harvests that meaningful structured data from within your pages” – you ask them to take another leap [of faith].
You ask them to take on trust that Google is actually understanding, indexing and using that structured data. At this point you start searching for suitable screen shots of Google Knowledge Graph to sit behind you whilst you hypothesise about the latest incarnation of their all-powerful search algorithm, and how they imply that they use the Schema.org data to drive so-called Semantic Search.
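To make the scenario above a little more concrete, here is a minimal sketch in Python of what ‘harvesting’ embedded data might look like. The page fragment is made up for illustration, it uses microdata rather than RDFa, and the parser deliberately ignores nesting and most of the microdata rules. It is in no way how Google, or any real extractor, does the job.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: a book described with Schema.org microdata.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Book">
  <h1 itemprop="name">Zen and the art of motorcycle maintenance</h1>
  <span itemprop="author">Robert Pirsig</span>
</div>
"""


class MicrodataSketch(HTMLParser):
    """Collects itemtype values and flat itemprop text; ignores nesting."""

    def __init__(self):
        super().__init__()
        self.types = []
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.types.append(attrs["itemtype"])
        if "itemprop" in attrs:
            # Remember which property the next piece of text belongs to.
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._current_prop and text:
            self.properties[self._current_prop] = text
            self._current_prop = None

    def handle_endtag(self, tag):
        self._current_prop = None


parser = MicrodataSketch()
parser.feed(SAMPLE_HTML)
print(parser.types)
print(parser.properties)
```

The user viewing the page sees only a title and a name; the type and property names travel invisibly in the attributes.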
I enjoy a challenge, but I also like to find a better way sometimes.
When OCLC first released Linked Data in WorldCat they very helpfully addressed the first of these issues by adding a visual display of the Linked Data to the bottom of each page. This made my job far easier!
But it has a couple of downsides. Firstly it is not the prettiest of displays and is only really of use to those interested in ‘seeing’ Linked Data. Secondly, I believe it creates an impression to some that, if you want Google to grab structured data about resources, you need to display a chunk of gobbledegook on your pages.
That simple way to easily show someone the data embedded in a page is a great aid to understanding for those new to the concept. But that is not all. This excellent little extension has a couple of extra tricks up its sleeve.
It includes a visualisation of the [Linked Data] graph of relationships – the structure of the data. Clicking on any of the nodes of the display causes the value of the subject, predicate, or object it represents to be displayed below the image and the relevant row(s) in the list of triples to be highlighted. As well as all this, there is a ‘Show Turtle’ button, which does just as you would expect, opening up a window in which it has translated the triples into Turtle – Turtle being (after a bit of practice) the more human friendly way of viewing or creating RDF.
Green Turtle is a useful little tool which I would recommend to visualise microdata and RDFa, be it using the Schema.org vocabulary or not. I am already using it on WorldCat in preference to scrolling to the bottom of the page to click the Linked Data tab.
Custom Searches that know about Schema!
Google have recently enhanced the functionality of their Custom Search Engine (CSE) to enable searching by Schema.org Types. Try out this example CSE which only returns results from WorldCat.org which have been described in their structured data as being of type schema:Book.
A simple yet powerful demonstration that not only are Google harvesting the Schema.org Linked Data from WorldCat, but they are also understanding it and are visibly using it to drive functionality.
Instead of keeping the answers within individual email threads, I thought they may be of interest to a wider audience:
Q I don’t see anything that describes the criteria for “workness.” The “workness” definition is more the result of several interdependent algorithmic decision processes than a simple set of criteria. To a certain extent publishing the results as linked data was the easy (huh!) bit. The efforts to produce these definitions and their relationships are the ongoing results of a research process, by OCLC Research, that has been in motion for several years, to investigate and benefit from FRBR. You can find more detail behind this research here: http://www.oclc.org/research/activities/frbr.html?urlm=159763
Q Defining what a “work” is has proven next to impossible in the commercial world, how will this be more successful? Very true; for often commercial and/or political reasons, previous initiatives in this direction have not been very successful. OCLC make no broader claim to the definition of a WorldCat Work, other than it is the result of applying the results of the FRBR and associated algorithms, developed by OCLC Research, to the vast collection of bibliographic data contributed, maintained, and shared by the OCLC member libraries and partners.
Q Will there be links to individual ISBN/ISNI records?
ISBN – ISBNs are attributes of manifestation [in FRBR terms] entities, and as such can be found in the already released WorldCat Linked Data. As each work is linked to its related manifestation entities [by schema:workExample] they are therefore already linked to ISBNs.
ISNI – ISNI is an identifier for a person and as such an ISNI URI is a candidate for use in linking Works to other entity types. VIAF URIs are another candidate for Person/Organisation entities which, as we have the data, we will be using. No final decisions have been made as to which URIs we use and as to using multiple URIs for the same relationship. Whether we use ISNI, VIAF, and DBpedia URIs for the same person, or just use one and rely on interconnection between the authoritative hubs, is a question still to be concluded.
Q Can you say more about how the stable identifiers will be managed as the grouping of records that create a work change? You correctly identify the issue of maintaining identifiers as work groups split & merge. This is one of the tasks the development team are currently working on as they move towards full release of this data over the coming weeks. As I indicated in my blog post, there is a significant data refresh due and from that point onwards any changes will be handled correctly.
Q Is there a bulk download available? No, there is no bulk download available. This is a deliberate decision for several reasons.
Firstly this is Linked Data – its main benefits accrue from its canonical persistent identifiers and the relationships it maintains between other identified entities within a stable, yet changing, web of data. WorldCat.org is a live data set actively maintained and updated by the thousands of member libraries, data partners, and OCLC staff and processes. I would discourage reliance on local storage of this data, as it will rapidly evolve and become out of synchronisation with the source. The whole point and value of persistent identifiers, which you would reference locally, is that they will always dereference to the current version of the data.
Q Where should bugs be reported? Today, you can either use the comment link from the Linked Data Explorer or report them to firstname.lastname@example.org. We will be building on this as we move towards full release.
Q There appears to be something funky with the way non-existent IDs are handled. You have spotted a defect! – The result of access to a non-established URI should be no triples returned with that URI as subject. How this is represented will differ between serialisations. Also you would expect to receive an HTTP status of 404.
Q It’s wonderful to see that the data is being licensed ODC-BY, but maybe assertions to that effect should be there in the data as well? The next release of data will be linked to a VoID document providing information, including licensing, for the dataset.
Q How might WorldCat Works intersect with the BIBFRAME model? – these work descriptions could be very useful as a bf:hasAuthority for a bf:Work. The OCLC team monitor, participate in, and take account of many discussions – BIBFRAME, Schema.org, SchemaBibEx, WikiData, etc. – where there are some obvious synergies in objectives, and differences in approach and/or levels of detail for different audiences. The potential for interconnection of datasets using sameAs, and other authoritative relationships such as you describe, is significant. As the WorldCat data matures and other datasets are published, one would expect initiatives from many parties to start interlinking bibliographic resources from many sources.
Q Will your team be making use of ISTC? Again it is still early for decisions in this area. However we would not expect to store the ISTC code as a property of Work. ISTC is one of many work based data sets, from national libraries and others, that it would be interesting to investigate processes for identifying sameAs relationships between.
The answer to the above question stimulated a follow-on question based upon the fact that ISTC Codes are allocated on a language basis. In FRBR terms language of publication is associated with the Expression, not the Work level description. As such therefore you would not expect to find ISTC on a ‘Work’ – My response to this was:
Note that the Works published from WorldCat.org are defined as instances of schema:CreativeWork.
What you say may well be correct for FRBR, but the WorldCat data may not adhere strictly to the FRBR rules and levels. I say ‘may not’ as we are still working on the modelling behind this, and a language specific Work may become just an example of a more general Work – there again it may become more Expression-like. There is a balance to be struck between FRBR rules and a wider, non-library, understanding.
Q Which triplestore are you using? We are not using a triplestore. Already, in this early stage of the journey to publish linked data about the resources within WorldCat, the descriptions of hundreds of millions of entities have been published. There is obvious potential for this to grow to many billions. The initial objective is to reliably publish this data in ways that it is easily consumed, linked to, and available in the de facto linked data serialisations. To achieve this we have put in place a simple, very scalable, flexible infrastructure currently based upon Apache Tomcat serving up individual RDF descriptions stored in Apache HBase (built on top of Apache Hadoop HDFS). No doubt future use cases will emerge, which will build upon this basic yet very valuable publishing of data, that will require additional tools, techniques, and technologies to become part of that infrastructure over time. I know the development team are looking forward to the challenges that the quantity, variety, and always changing nature of data within WorldCat will provide for some of the traditional [for smaller data sets] answers to such needs.
As an aside, you may be interested to know that significant use is made of the map/reduce capabilities of Apache Hadoop in the processing of data extracted from bibliographic records, the identification of entities within that data, and the creation of the RDF descriptions. I think it is safe to say that the creation and publication of this data would not have been feasible without Hadoop being part of the OCLC architecture.
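For those unfamiliar with the map/reduce pattern mentioned above, here is a toy sketch in plain Python, with no Hadoop involved. The grouping key (normalised author plus title) is purely my stand-in for illustration; as noted earlier, the real “workness” decisions come from several interdependent algorithms, not a simple key. The records and their numbers are made up.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (work_key, oclc_number) for one bibliographic record."""
    # Assumed key: lower-cased author and title. A stand-in, not OCLC's method.
    key = (record["author"].strip().lower(), record["title"].strip().lower())
    return key, record["oclc_number"]


def reduce_by_work(pairs):
    """Reduce step: gather the record numbers that share a work key."""
    works = defaultdict(list)
    for key, oclc_number in pairs:
        works[key].append(oclc_number)
    return dict(works)


# Made-up records for illustration only; the numbers are placeholders.
records = [
    {"oclc_number": 1, "author": "Pirsig, Robert",
     "title": "Zen and the Art of Motorcycle Maintenance"},
    {"oclc_number": 2, "author": "pirsig, robert ",
     "title": "Zen and the art of motorcycle maintenance"},
    {"oclc_number": 3, "author": "Example, Author", "title": "Another Title"},
]

works = reduce_by_work(map_record(r) for r in records)
for key, members in works.items():
    print(key, members)
```

The attraction of the pattern is that the map step looks at one record at a time, so it can be spread across as many machines as the data demands.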
Hopefully this background will help those interested in the process. When we move from preview to a fuller release I expect to see associated documentation and background information appear.
I have just been sharing a platform, at the OCLC EMEA Regional Council Meeting in Cape Town, South Africa, with my colleague Ted Fons. A great setting for a great couple of days of the OCLC EMEA membership and others sharing thoughts, practices, collaborative ideas and innovations.
Ted and I presented our continuing insight into The Power of Shared Data, and the evolving data strategy for the bibliographic data behind WorldCat. If you want to see a previous view of these themes you can check out some recordings we made late last year on YouTube, from Ted – The Power of Shared Data – and me – What the Web Wants.
Today, demonstrating on-going progress towards implementing the strategy, I had the pleasure to preview two upcoming significant announcements on the WorldCat data front:
The release of 194 Million Open Linked Data Bibliographic Work descriptions
The WorldCat Linked Data Explorer interface
A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work. The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary. In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, OCLC numbered, editions already shared in WorldCat. Let’s take a look at one – try this: http://worldcat.org/entity/work/id/12477503
You will see, displayed in the new WorldCat Linked Data Explorer, an HTML view of the data describing ‘Zen and the art of motorcycle maintenance’. Click on the ‘Open All’ button to view everything. Anyone used to viewing bibliographic data will see that this is a very different view of things. It is mostly URIs, the only visible strings being the name or description elements. This is not designed as an end-user interface, it is designed as a data exploration tool. This is highlighted by the links at the top to alternative RDF serialisations of the data – Turtle, N-Triples, JSON-LD, RDF/XML.
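If you would rather ask for one of those serialisations from a script than click a link, the usual Linked Data approach is content negotiation via the HTTP Accept header. The sketch below only prepares a request and does not send it. The media types are the standard registered ones, but that the service honours an Accept header in this way is my assumption, and the function name is hypothetical.

```python
import urllib.request

# Standard media types for the serialisations named above.
RDF_MEDIA_TYPES = {
    "turtle": "text/turtle",
    "n-triples": "application/n-triples",
    "json-ld": "application/ld+json",
    "rdf/xml": "application/rdf+xml",
}


def build_rdf_request(uri, serialisation="turtle"):
    """Prepare (but do not send) a GET request asking for one RDF serialisation."""
    accept = RDF_MEDIA_TYPES[serialisation]
    return urllib.request.Request(uri, headers={"Accept": accept})


request = build_rdf_request("http://worldcat.org/entity/work/id/12477503", "json-ld")
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would actually fetch the data.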
Why is this a preview? Can I usefully use the data now? These are a couple of obvious questions for you to ask at this time.
This is the first production release of WorldCat infrastructure delivering linked data. It is the first step in what will be an evolutionary, and revolutionary, journey to provide interconnected linked data views of the rich entities (works, people, organisations, concepts, places, events) captured in the vast shared collection of bibliographic records that makes up WorldCat. Mining those 311+ million records is not a simple task, even to just identify works. It takes time, and a significant amount of [Big Data] computing resources. One of the key steps in this process is to identify where there exist connections between works and authoritative data hubs, such as VIAF, FAST, LCSH, etc. In this preview release, it is some of those connections that are not yet in place.
What you see in their place at the moment is a link to what can be described as a local authority. These are identified by what the data geeks call a hash-URI. http://experiment.worldcat.org/entity/work/data/12477503#Person/pirsig_robert for example is such an identifier, constructed from the work URI and the person name. Over the next few weeks, where the information is available, you would expect to see this link replaced by a connection to VIAF, such as this: http://viaf.org/viaf/78757182.
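As a rough illustration of how such a hash-URI hangs together, the Python sketch below rebuilds the example above from its parts and then splits it again. The slug rule (lower-cased surname, underscore, forename) is only my guess from that single example, not a documented OCLC rule, and the function name is hypothetical.

```python
from urllib.parse import urldefrag


def placeholder_person_uri(work_id, surname, forename):
    """Build a hash-URI for a person not yet linked to an authority hub."""
    # Assumed slug rule: lower-cased "surname_forename", spaces become underscores.
    slug = f"{surname}_{forename}".strip().lower().replace(" ", "_")
    return f"http://experiment.worldcat.org/entity/work/data/{work_id}#Person/{slug}"


uri = placeholder_person_uri(12477503, "Pirsig", "Robert")
print(uri)

# The part after '#' is a fragment: it names something within the document
# that the part before the '#' retrieves.
document_url, fragment = urldefrag(uri)
print(document_url)
print(fragment)
```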
So, can I use the data? – Yes, the data is live, and most importantly the work URIs are persistent. It is also available under an open data license (ODC-BY).
In a very few weeks, once the next update to the WorldCat linked data has been processed, you will find that links to works will be embedded in the already published linked data. For example you will find the following in the data for OCLC number 53474380:
What is next on the agenda? As described, within a few weeks, we expect to enhance the linking within the descriptions and provide links from the OCLC numbered manifestations. From then on, both WorldCat and others will start to use WorldCat Work URIs, and their descriptions, as a core stable foundation to build out a web of relationships between entities in the library domain. It is that web of data that will stimulate the sharing of data and innovation in the design of applications and interfaces consuming the data over coming months and years.
As I said on the program today, we are looking for feedback on these releases.
We as a community are embarking on a new journey with shared, linked data at its heart. Its success will be based upon how that data is exposed, used, and the intrinsic quality of that data. Experience shows that a new view of data often exposes previously unseen issues; it is just that sort of feedback we are looking for. So any feedback on any aspect of this will be more than welcome.
I am excitedly looking forward to being able to comment further as this journey progresses.
Little things mean a lot. Little things that are misunderstood often mean a lot more.
Take the OCLC Control Number, often known as the OCN, for instance.
Every time an OCLC bibliographic record is created in WorldCat it is given a unique number from a sequential set – a process that has already taken place over a billion times. The individual number can be found represented in the record it is associated with. Over time these numbers have become a useful part of the processing carried out not only by OCLC and its member libraries but, as a unique identifier that has proliferated across the library domain, also by partners, publishers and many others.
Like anything that has been around for many years, assumptions and even myths have grown around the purpose and status of this little string of digits. Many stem from a period when there was concern, being voiced by several including me at the time, about the potentially over restrictive reuse policy for records created by OCLC and its member libraries. It became assumed by some, that the way to tell if a bibliographic record was an OCLC record was to see if it contained an OCN. The effect was that some people and organisations invested effort in creating processes to remove OCNs from their records. Processes that I believe, in a few cases, are still in place.
I signalled that OCLC were looking at this, in my session (Linked Data Progress), at IFLA in Singapore a few weeks ago. I am pleased to say that the wording I was hinting at has now appeared on the relevant pages of the OCLC web site:
Use of the OCLC Control Number (OCN)
OCLC considers the OCLC Control Number (OCN) to be an important data element, separate from the rest of the data included in bibliographic records. The OCN identifies the record, but is not part of the record itself. It is used in a variety of human and machine-readable processes, both on its own and in subsequent manipulations of catalog data. OCLC makes no copyright claims in individual bibliographic elements nor does it make any intellectual property claims to the OCLC Control Number. Therefore, the OCN can be treated as if it is in the public domain and can be included in any data exposure mechanism or activity as public domain data. OCLC, in fact, encourages these uses as they provide the opportunity for libraries to make useful connections between different bibliographic systems and services, as well as to information in other domains.
The announcement of this confirmation/clarification of the status of OCNs was made yesterday by my colleague Jim Michalko on the Hanging Together blog.
When discussing this with a few people, one question often came up – Why just declare OCNs as public domain, why not license them as such? The following answer from the OCLC website, I believe explains why:
The OCN is an individual bibliographic element, and OCLC doesn’t make any copyright claims either way on specific data elements. The OCN can be used by other institutions in ways that, at an aggregate level, may have varying copyright assertions. Making a positive, specific claim that the OCN is in the public domain might interfere with the copyrights of others in those situations.
As I said, this is a little thing, but if it clears up some misunderstandings and consequential anomalies, it will contribute to the usefulness of OCNs and ease the path towards a more open and shared data environment.