A Step for Schema.org – A Leap for Bib Data on the Web







Regular readers of this blog may well know I am an enthusiast for Schema.org – the generic vocabulary for describing things on the web as structured data, backed by the major search engines Google, Bing, Yahoo! & Yandex.  When I first got my head around it back in 2011 I soon realised its potential for making bibliographic resources, especially those within libraries, a heck of a lot more discoverable.  To be frank, library resources did not, and still don’t, exactly leap into view when searching the web – a bit of a problem when most people start searching for things with Google et al – and do not look elsewhere.

Schema.org as a generic vocabulary to describe most stuff, easily embedded in your web pages, has been a great success.  As was reported by Google’s R.V. Guha, at the recent Semantic Technology and Business Conference in San Jose, a sample of 12B pages showed approximately 21% containing Schema.org markup.  Right from the beginning, however, I had concerns about its applicability to the bibliographic world – great start with the Book type, but there were gaps in the coverage for such things as journal issues & volumes, multi-volume works, citations, and the relationship between a work and its editions.  Discovering others shared my combination of enthusiasm and concerns, I formed a W3C Community Group – Schema Bib Extend – to propose some bibliographic focused extensions to Schema.org. Which brings me to the events behind this post…

The SchemaBibEx group have had several proposals accepted over the last couple of years, such as making the [commercial] Offer more appropriate for describing loanable materials, and broadening of the citation property. Several other significant proposals were brought together in a package which I take great pleasure in reporting was included in the latest v1.9 release of Schema.org.  For many in our group these latest proposals were a long time coming after their initial proposal.  Although frustrating, the delays were symptomatic of a very healthy process.

Our proposals to add hasPart, isPartOf, exampleOfWork, and workExample to the CreativeWork Type will be available to many, as CreativeWork is the superclass to many types in many areas. Our proposals for issueNumber on PublicationIssue and volumeNumber on PublicationVolume are very similar to others in the vocabulary, such as seasonNumber and episodeNumber in TV & Radio.  Under Dan Brickley’s careful organisation, tweaks and adjustments were made across a few areas resulting in a consistent style across parts of the vocabulary underpinned by CreativeWork.
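To give a feel for how these properties fit together, here is a minimal Python sketch that builds three JSON-LD descriptions using them.  The example.org identifiers, names, and numbers are placeholders invented for illustration, not real records; PublicationVolume and Periodical are the type names as they appear in Schema.org.

```python
import json

# All identifiers, names and numbers below are hypothetical placeholders.
WORK_ID = "http://example.org/work/1"
EDITION_ID = "http://example.org/edition/1"

# A conceptual work pointing at one of its editions.
work = {
    "@context": "http://schema.org",
    "@type": "CreativeWork",
    "@id": WORK_ID,
    "name": "An Example Work",
    "workExample": {"@id": EDITION_ID},
}

# The edition pointing back at the work (the inverse relationship).
edition = {
    "@context": "http://schema.org",
    "@type": "Book",
    "@id": EDITION_ID,
    "name": "An Example Work (second edition)",
    "exampleOfWork": {"@id": WORK_ID},
}

# A journal issue nested inside a volume, nested inside a periodical.
issue = {
    "@context": "http://schema.org",
    "@type": "PublicationIssue",
    "issueNumber": "3",
    "isPartOf": {
        "@type": "PublicationVolume",
        "volumeNumber": "12",
        "isPartOf": {"@type": "Periodical", "name": "An Example Journal"},
    },
}


def to_jsonld(description):
    """Serialise a description as an indented JSON-LD string."""
    return json.dumps(description, indent=2)


print(to_jsonld(issue))
```

The resulting strings could be embedded in a page inside a script element of type application/ld+json; the same relationships can equally be expressed in microdata or RDFa.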

Although the number of new types and properties is small, their addition to Schema opens up potential for much better description of periodicals and creative work relationships. To introduce the background to this, SchemaBibEx member Dan Scott and I were invited to jointly post on the Schema.org Blog.

So, another step forward for Schema.org.   I believe it is more than just a step, however, for those wishing to make bibliographic resources more visible on the Web.  There has been some criticism that Schema.org has been too simplistic to be able to represent some of the relationships and subtleties from our world.  Criticism that was not unfounded.  Now with these enhancements, many of these criticisms are answered. There is more to do, but the major objective of the group that proposed them has been achieved – to lay the broad foundation for the description of bibliographic, and creative work, resources in sufficient detail for them to be understood by the search engines to become part of their knowledge graphs. Of course that is not the final end we are seeking.  The reason we share data is so that folks are guided to our resources – by sharing, using the well understood vocabulary, Schema.org.

Examples of a conceptual creative work being related to its editions, using exampleOfWork and workExample, have been available for some time.  In anticipation of their appearance in Schema, they were introduced into the OCLC WorldCat release of 194 million Work descriptions (for example: http://worldcat.org/entity/work/id/1363251773) with the inverse relationship being asserted in an updated version of the basic WorldCat linked data that has been available since 2012.

WorldCat Works – 197 Million Nuggets of Linked Data

They’re released!

A couple of months back I spoke about the preview release of Works data from WorldCat.org.  Today OCLC published a press release announcing the official release of 197 million descriptions of bibliographic Works.

A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work.  The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary.  In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, OCLC numbered, editions already shared from WorldCat.org.

They look a little different to the kind of metadata we are used to in the library world.  Check out this example <http://worldcat.org/entity/work/id/1151002411> and you will see that, apart from name and description strings, it is mostly links.  It is linked data after all.

These links (URIs) lead, where available, to authoritative sources for people, subjects, etc.  When not available, placeholder URIs have been created to capture information not yet available or identified in such authoritative hubs.  As you would expect from a linked data hub the works are available in common RDF serializations – Turtle, RDF/XML, N-Triples, JSON-LD – using the Schema.org vocabulary – under an open data license.

The obvious question is “how do I get a work id for the items in my catalogue?”.  The simplest way is to use the already released linked data from WorldCat.org. If you have an OCLC Number (eg. 817185721) you can create the URI for that particular manifestation by prefixing it with ‘http://worldcat.org/oclc/’ thus: http://worldcat.org/oclc/817185721

In the linked data that is returned, either on screen in the Linked Data section, or in the RDF in your desired serialization, you will find the following triple which provides the URI of the work for this manifestation:

<http://worldcat.org/oclc/817185721> exampleOfWork <http://worldcat.org/entity/work/id/1151002411>
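The two steps above (prefix the OCLC Number, then look for the exampleOfWork triple) can be sketched in a few lines of Python.  This is only an illustration: the function names are my own, and the regular expression handles the simplified one-line triple form shown above rather than being a full RDF parser.  For real data a proper RDF library would be a better choice.

```python
import re

OCLC_PREFIX = "http://worldcat.org/oclc/"

# Matches a simplified "<subject> predicate <object>" line; not a full parser.
TRIPLE = re.compile(r"<([^>]+)>\s+(\S+)\s+<([^>]+)>")


def manifestation_uri(oclc_number):
    """Build the manifestation URI by prefixing the OCLC Number."""
    number = str(oclc_number).strip()
    if not number.isdigit():
        raise ValueError(f"not an OCLC Number: {oclc_number!r}")
    return OCLC_PREFIX + number


def work_uri_for(manifestation, triples_text):
    """Return the object of the first exampleOfWork triple whose subject
    is the given manifestation URI, or None if there is no such triple."""
    for line in triples_text.splitlines():
        match = TRIPLE.search(line)
        if not match:
            continue
        subject, predicate, obj = match.groups()
        # Accept both the bare name and a full <http://schema.org/...> URI.
        if subject == manifestation and predicate.strip("<>").endswith("exampleOfWork"):
            return obj
    return None


sample = ("<http://worldcat.org/oclc/817185721> exampleOfWork "
          "<http://worldcat.org/entity/work/id/1151002411>")
print(work_uri_for(manifestation_uri(817185721), sample))
```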

To quote Neil Wilson, Head of Metadata Services at the British Library:

With this release of WorldCat Works, OCLC is creating a significant, practical contribution to the wider community discussion on how to migrate from traditional institutional library catalogues to popular web resources and services using linked library data.  This release provides the information community with a valuable opportunity to assess how the benefits of a works-based approach could impact a new generation of library services.

This is a major first step in a journey to provide linked data views of the entities within WorldCat.  I am looking forward to other WorldCat entities such as people, places, and events.  Apart from being a major release of linked data, this capability is the result of applying [Big] Data mining and analysis techniques that have been the focus of research and development for several years.  These efforts are demonstrating that there is much more to library linked data than the mechanical, record at a time, conversion of MARC records into an RDF representation.

You may find it helpful, in understanding the potential exposed by the release of Works, to review some of the questions and answers that were raised after the preview release.

Personally I am really looking forward to hearing about the uses that are made of this data.

Visualising Schema.org

One of the most difficult challenges in my evangelism of the benefits of using Schema.org for sharing data about resources via the web is that it is difficult to ‘show’ what is going on.

The scenario goes something like this…..

“Using the Schema.org vocabulary, you embed data about your resources in the HTML that makes up the page using either microdata or RDFa….”

At about this time you usually display a slide showing HTML code with embedded RDFa.  It may look pretty but the chances of more than a few of the audience being able to pick the schema:Book or sameAs or rdf:type elements out of the plethora of angle brackets and quotes swimming before their eyes are fairly remote.

Having asked them to take a leap of faith that the gobbledegook you have just presented them with is not only simple to produce but also invisible to users viewing their pages –  “but not to Google, which harvests that meaningful structured data from within your pages” – you ask them to take another leap [of faith].

You ask them to take on trust that Google is actually understanding, indexing and using that structured data.  At this point you start searching for suitable screen shots of Google Knowledge Graph to sit behind you whilst you hypothesise about the latest incarnation of their all-powerful search algorithm, and how they imply that they use the Schema.org data to drive so-called Semantic Search.

I enjoy a challenge, but I also like to find a better way sometimes.

When OCLC first released Linked Data in WorldCat they very helpfully addressed the first of these issues by adding a visual display of the Linked Data to the bottom of each page.   This made my job far easier!

But it has a couple of downsides.  Firstly it is not the prettiest of displays and is only really of use to those interested in ‘seeing’ Linked Data.  Secondly, I believe it creates an impression to some that, if you want Google to grab structured data about resources, you need to display a chunk of gobbledegook on your pages.

Let the Green Turtle show the way!
Whilst looking for a better answer I discovered Green Turtle – a JavaScript library for working with RDFa, most usefully packaged in an extension for the Chrome browser.  Load this into your copy of Chrome and it will sit quietly in the background checking for RDFa (and microdata if you turn on the option) in the pages you are viewing.  When it finds some, a green turtle icon appears in the address bar.  Clicking on that turtle opens up a new tab to show you a list of the data, in the form of triples, that it identified within the page.

That simple way to show someone the data embedded in a page is a great aid to understanding for those new to the concept.  But that is not all.  This excellent little extension has a couple of extra tricks up its sleeve.

It includes a visualisation of the [Linked Data] graph of relationships – the structure of the data.  Clicking on any of the nodes of the display causes the value of the subject, predicate, or object it represents to be displayed below the image and the relevant row(s) in the list of triples to be highlighted.  As well as all this, there is a ‘Show Turtle’ button, which does just as you would expect, opening up a window in which it has translated the triples into Turtle – Turtle being (after a bit of practice) the more human friendly way of viewing or creating RDF.
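To show why Turtle is easier on the eye than a flat list of triples, here is a small Python sketch that groups triples by subject and writes them out in Turtle's abbreviated form.  It is not how Green Turtle does it, just an illustration of the idea; the function name and the example.org data are made up, and terms are passed through unchanged, so they must already be valid Turtle terms.

```python
def to_turtle(triples, prefixes):
    """Render (subject, predicate, object) triples as simple Turtle.

    Each term must already be a valid Turtle term: <uri>, a prefixed
    name such as schema:Book, the keyword a, or a quoted literal.
    """
    lines = [f"@prefix {prefix}: <{namespace}> ." for prefix, namespace in prefixes.items()]

    # Group predicate/object pairs under their subject, keeping input order.
    grouped = {}
    for subject, predicate, obj in triples:
        grouped.setdefault(subject, []).append((predicate, obj))

    for subject, pairs in grouped.items():
        lines.append("")
        lines.append(subject)
        body = [f"    {predicate} {obj}" for predicate, obj in pairs]
        # Pairs sharing a subject are separated by ';' and closed by '.'.
        lines.append(" ;\n".join(body) + " .")
    return "\n".join(lines)


# Hypothetical data for illustration only.
example_triples = [
    ("<http://example.org/book/1>", "a", "schema:Book"),
    ("<http://example.org/book/1>", "schema:name", '"An Example Book"'),
]
print(to_turtle(example_triples, {"schema": "http://schema.org/"}))
```

The subject appears once, followed by its properties, which is much of what makes Turtle readable compared with one full triple per row.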

Green Turtle is a useful little tool which I would recommend to visualise microdata and RDFa, be it using the Schema.org vocabulary or not.  I am already using it on WorldCat in preference to scrolling to the bottom of the page to click the Linked Data tab.

Custom Searches that know about Schema!
Google have recently enhanced the functionality of their Custom Search Engine (CSE) to enable searching by Schema.org Types.  Try out this example CSE which only returns results from WorldCat.org which have been described in their structured data as being of type schema:Book.

A simple yet powerful demonstration that not only are Google harvesting the Schema.org Linked Data from WorldCat, but they are also understanding it and are visibly using it to drive functionality.

WorldCat Works Linked Data – Some Answers To Early Questions

Since announcing the preview release of 194 Million Open Linked Data Bibliographic Work descriptions from OCLC’s WorldCat, last week at the excellent OCLC EMEA Regional Council event in Cape Town, my in-box and Twitter stream have been a little busy with questions about what the team at OCLC are doing.

Instead of keeping the answers within individual email threads, I thought they may be of interest to a wider audience:

Q  I don’t see anything that describes the criteria for “workness.”
“Workness” definition is more the result of several interdependent algorithmic decision processes than a simple set of criteria.  To a certain extent publishing the results as linked data was the easy (huh!) bit.  The efforts to produce these definitions and their relationships are the ongoing results of a research process, by OCLC Research, that has been in motion for several years, to investigate and benefit from FRBR.  You can find more detail behind this research here: http://www.oclc.org/research/activities/frbr.html?urlm=159763

Q Defining what a “work” is has proven next to impossible in the commercial world, how will this be more successful?
Very true.  For often commercial and/or political reasons, previous initiatives in this direction have not been very successful.  OCLC make no broader claim to the definition of a WorldCat Work, other than it is the result of applying the results of the FRBR and associated algorithms, developed by OCLC Research, to the vast collection of bibliographic data contributed, maintained, and shared by the OCLC member libraries and partners.

Q  Will there be links to individual ISBN/ISNI records?

  • ISBN – ISBNs are attributes of manifestation [in FRBR terms] entities, and as such can be found in the already released WorldCat Linked Data.  As each work is linked to its related manifestation entities [by schema:workExample] they are therefore already linked to ISBNs.
  • ISNI – ISNI is an identifier for a person and as such an ISNI URI is a candidate for use in linking Works to other entity types.  VIAF URIs are another for Person/Organisation entities which, as we have the data, we will be using.  No final decisions have been made as to which URIs we use and as to using multiple URIs for the same relationship.  Whether we use ISNI, VIAF, & DBpedia URIs for the same person, or just use one and rely on interconnection between the authoritative hubs, is a question still to be concluded.

Q  Can you say more about how the stable identifiers will be managed as the groupings of records that create a work change?
You correctly identify the issue of maintaining identifiers as work groups split & merge.  This is one of the tasks the development team are currently working on as they move towards full release of this data over the coming weeks.  As I indicated in my blog post, there is a significant data refresh due and from that point onwards any changes will be handled correctly.

Q  Is there a bulk download available?
No, there is no bulk download available.  This is a deliberate decision for several reasons.
Firstly this is Linked Data – its main benefits accrue from its canonical persistent identifiers and the relationships it maintains between other identified entities within a stable, yet changing, web of data.  WorldCat.org is a live data set actively maintained and updated by the thousands of member libraries, data partners, and OCLC staff and processes. I would discourage reliance on local storage of this data, as it will rapidly evolve and become out of synchronisation with the source.  The whole point and value of persistent identifiers, which you would reference locally, is that they will always dereference to the current version of the data.

Q  Where should bugs be reported?
Today, you can either use the comment link from the Linked Data Explorer or report them to data@oclc.org.  We will be building on this as we move towards full release.

Q  There appears to be something funky with the way non-existent IDs are handled.
You have spotted a defect!  – The result of access to a non-established URI should be no triples returned with that URI as subject.  How this is represented will differ between serialisations. Also you would expect to receive an HTTP status of 404.

Q  It’s wonderful to see that the data is being licensed ODC-BY, but maybe assertions to that effect should be there in the data as well?
The next release of data will be linked to a VoID document providing information, including licensing, for the dataset.

Q  How might WorldCat Works intersect with the BIBFRAME model? – these work descriptions could be very useful as a bf:hasAuthority for a bf:Work.
The OCLC team monitor, participate in, and take account of many discussions – BIBFRAME, Schema.org, SchemaBibEx, WikiData, etc. – where there are some obvious synergies in objectives, and differences in approach and/or levels of detail for different audiences. The potential for interconnection of datasets using sameAs, and other authoritative relationships such as you describe, is significant.  As the WorldCat data matures and other datasets are published, one would expect many to start initiatives to interlink bibliographic resources from many sources.

Q  Will your team be making use of ISTC?
Again it is still early for decisions in this area.  However we would not expect to store the ISTC code as a property of Work.  ISTC is one of many work based data sets, from national libraries and others, that it would be interesting to investigate processes for identifying sameAs relationships between.

The answer to the above question stimulated a follow-on question based upon the fact that ISTC Codes are allocated on a language basis.  In FRBR terms language of publication is associated with the Expression, not the Work level description. As such you would not expect to find ISTC on a ‘Work’ –  My response to this was:

Note that the Works published from WorldCat.org are defined as instances of schema:CreativeWork.

What you say may well be correct for FRBR, but the WorldCat data may not adhere strictly to the FRBR rules and levels.  I say ‘may not’ as we are still working on the modelling behind this and a language specific Work may become just an example of a more general Work – there again it may become more Expression-like.  There is a balance to be struck between FRBR rules and a wider, non-library, understanding.

Q   Which triplestore are you using?
We are not using a triplestore. Already, in this early stage of the journey to publish linked data about the resources within WorldCat, the descriptions of hundreds of millions of entities have been published.  There is obvious potential for this to grow to many billions.  The initial objective is to reliably publish this data in ways that it is easily consumed, linked to, and available in the de facto linked data serialisations.  To achieve this we have put in place a simple, very scalable, flexible infrastructure currently based upon Apache Tomcat serving up individual RDF descriptions stored in Apache HBase (built on top of Apache Hadoop HDFS).  No doubt future use cases will emerge, which will build upon this basic yet very valuable publishing of data, that will require additional tools, techniques, and technologies to become part of that infrastructure over time.  I know the development team are looking forward to the challenges that the quantity, variety, and always changing nature of data within WorldCat will provide for some of the traditional [for smaller data sets] answers to such needs.

As an aside, you may be interested to know that significant use is made of the map/reduce capabilities of Apache Hadoop in the processing of data extracted from bibliographic records, the identification of entities within that data, and the creation of the RDF descriptions.  I think it is safe to say that the creation and publication of this data would not have been feasible without Hadoop being part of the OCLC architecture.

 

Hopefully this background will help those interested in the process.  When we move from preview to a fuller release I expect to see associated documentation and background information appear.

OCLC Preview 194 Million Open Bibliographic Work Descriptions












I have just been sharing a platform, at the OCLC EMEA Regional Council Meeting in Cape Town, South Africa, with my colleague Ted Fons.  A great setting for a great couple of days of the OCLC EMEA membership and others sharing thoughts, practices, collaborative ideas and innovations.

Ted and I presented our continuing insight into The Power of Shared Data, and the evolving data strategy for the bibliographic data behind WorldCat. If you want to see a previous view of these themes you can check out some recordings we made late last year on YouTube, from Ted – The Power of Shared Data – and me – What the Web Wants.

Today, demonstrating on-going progress towards implementing the strategy, I had the pleasure to preview two upcoming significant announcements on the WorldCat data front:

  1. The release of 194 Million Open Linked Data Bibliographic Work descriptions
  2. The WorldCat Linked Data Explorer interface

WorldCat Works

A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work.  The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary.  In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, OCLC numbered, editions already shared in WorldCat.   Let’s take a look at one – try this: http://worldcat.org/entity/work/id/12477503

You will see, displayed in the new WorldCat Linked Data Explorer, an HTML view of the data describing ‘Zen and the art of motorcycle maintenance’. Click on the ‘Open All’ button to view everything.  Anyone used to viewing bibliographic data will see that this is a very different view of things. It is mostly URIs, the only visible strings being the name or description elements.  This is not designed as an end-user interface; it is designed as a data exploration tool.  This is highlighted by the links at the top to alternative RDF serialisations of the data – Turtle, N-Triples, JSON-LD, RDF/XML.

The vocabulary used to describe the data is based upon Schema.org, and enhancements to it recommended and proposed by the Schema Bib Extend W3C Community Group, which I have the pleasure to chair.

Why is this a preview?  Can I usefully use the data now?  These are a couple of obvious questions for you to ask at this time.

This is the first production release of WorldCat infrastructure delivering linked data.  It is the first step in what will be an evolutionary, and revolutionary, journey to provide interconnected linked data views of the rich entities (works, people, organisations, concepts, places, events) captured in the vast shared collection of bibliographic records that makes up WorldCat.  Mining those 311+ million records is not a simple task, even to just identify works. It takes time, and a significant amount of [Big Data] computing resources.  One of the key steps in this process is to identify where connections exist between works and authoritative data hubs, such as VIAF, FAST, LCSH, etc.  In this preview release, it is some of those connections that are not yet in place.

What you see in their place at the moment is a link to, what can be described as, a local authority.  These are exemplified by what the data geeks call a hash-URI as its identifier. http://experiment.worldcat.org/entity/work/data/12477503#Person/pirsig_robert for example is such an identifier, constructed from the work URI and the person name.  Over the next few weeks, where the information is available, you would expect to see this link replaced by a connection to VIAF, such as this: http://viaf.org/viaf/78757182.

So, can I use the data? – Yes, the data is live, and most importantly the work URIs are persistent. It is also available under an open data license (ODC-BY).

How do I get a work id for my resources? – Today, there is one way.  If you use the OCLC xISBN, xOCLCNum web services you will find as part of the data returned a work id (eg. owi="owi12477503"). By stripping off the ‘owi’ you can easily create the relevant work URI: http://worldcat.org/entity/work/id/12477503
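As a quick illustration, stripping the ‘owi’ prefix and building the work URI might look like this in Python.  The function name is my own; the only inputs assumed are the owi value format and the URI prefix quoted above, and no call to the xISBN or xOCLCNum services is made.

```python
WORK_PREFIX = "http://worldcat.org/entity/work/id/"


def work_uri_from_owi(owi):
    """Turn an owi value such as 'owi12477503' into a work URI."""
    value = owi.strip()
    # Drop the leading 'owi' marker if it is present.
    if value.lower().startswith("owi"):
        value = value[3:]
    if not value.isdigit():
        raise ValueError(f"unexpected owi value: {owi!r}")
    return WORK_PREFIX + value


print(work_uri_from_owi("owi12477503"))
```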

In a very few weeks, once the next update to the WorldCat linked data has been processed, you will find that links to works will be embedded in the already published linked data.  For example you will find the following in the data for OCLC number 53474380:

What is next on the agenda? As described, within a few weeks, we expect to enhance the linking within the descriptions and provide links from the OCLC numbered manifestations.  From then on, both WorldCat and others will start to use WorldCat Work URIs, and their descriptions, as a core stable foundation to build out a web of relationships between entities in the library domain.  It is that web of data that will stimulate the sharing of data and innovation in the design of applications and interfaces consuming the data over coming months and years.

As I said on the program today, we are looking for feedback on these releases.

We as a community are embarking on a new journey with shared, linked data at its heart. Its success will be based upon how that data is exposed, used, and the intrinsic quality of that data.  Experience shows that a new view of data often exposes previously unseen issues; it is just that sort of feedback we are looking for.  So any feedback on any aspect of this will be more than welcome.

I am excitedly looking forward to being able to comment further as this journey progresses.

Update:  I have posted answers to some interesting questions raised by this release.

Getty Release AAT Vocabulary as Linked Open Data

The Getty Research Institute has announced the release of the Art & Architecture Thesaurus (AAT)® as Linked Open Data. The data set is available for download at vocab.getty.edu under an Open Data Commons Attribution License (ODC BY 1.0).

The Art & Architecture Thesaurus is a reference of over 250,000 terms on art and architectural history, styles, and techniques.  I’m sure this will become an indispensable authoritative hub of terms in the Web of Data to assist those describing their resources and placing them in context in that Web.

This is the first step in an 18-month process to release four vocabularies – the others being The Getty Thesaurus of Geographic Names (TGN)®, The Union List of Artist Names®, and The Cultural Objects Name Authority (CONA)®.

A great step from Getty.  I look forward to the others appearing over the months and seeing how rapidly they are taken up across the web.

OCLC Declare OCLC Control Numbers Public Domain

Little things mean a lot.  Little things that are misunderstood often mean a lot more.

Take the OCLC Control Number, often known as the OCN, for instance.

Every time an OCLC bibliographic record is created in WorldCat it is given a unique number from a sequential set – a process that has already taken place over a billion times.  The individual number can be found represented in the record it is associated with.  Over time these numbers have become a useful part of the processing of not only OCLC and its member libraries but, as a unique identifier proliferated across the library domain, by partners, publishers and many others.

Like anything that has been around for many years, assumptions and even myths have grown around the purpose and status of this little string of digits.  Many stem from a period when there was concern, being voiced by several including me at the time, about the potentially over restrictive reuse policy for records created by OCLC and its member libraries.  It became assumed by some, that the way to tell if a bibliographic record was an OCLC record was to see if it contained an OCN.  The effect was that some people and organisations invested effort in creating processes to remove OCNs from their records.  Processes that I believe, in a few cases, are still in place.

So in the current and future climate of open sharing of data, where for instance WorldCat Linked Data is published under an open data license, such assumptions and practices are an anomaly.

I signalled that OCLC were looking at this, in my session (Linked Data Progress), at IFLA in Singapore a few weeks ago. I am now pleased to say that the wording I was hinting at has now appeared on the relevant pages of the OCLC web site:

Use of the OCLC Control Number (OCN)
OCLC considers the OCLC Control Number (OCN) to be an important data element, separate from the rest of the data included in bibliographic records. The OCN identifies the record, but is not part of the record itself. It is used in a variety of human and machine-readable processes, both on its own and in subsequent manipulations of catalog data. OCLC makes no copyright claims in individual bibliographic elements nor does it make any intellectual property claims to the OCLC Control Number. Therefore, the OCN can be treated as if it is in the public domain and can be included in any data exposure mechanism or activity as public domain data. OCLC, in fact, encourages these uses as they provide the opportunity for libraries to make useful connections between different bibliographic systems and services, as well as to information in other domains.

The announcement of this confirmation/clarification of the status of OCNs was made yesterday by my colleague Jim Michalko on the Hanging Together blog.

When discussing this with a few people, one question often came up – Why just declare OCNs as public domain, why not license them as such? The following answer from the OCLC website, I believe explains why:

The OCN is an individual bibliographic element, and OCLC doesn’t make any copyright claims either way on specific data elements. The OCN can be used by other institutions in ways that, at an aggregate level, may have varying copyright assertions. Making a positive, specific claim that the OCN is in the public domain might interfere with the copyrights of others in those situations.

As I said, this is a little thing, but if it clears up some misunderstandings and consequential anomalies, it will contribute to the usefulness of OCNs and ease the path towards a more open and shared data environment.

Putting Linked Data on the Map












Show me an example of the effective publishing of Linked Data – That, or a variation of it, must be the request I receive more than most when talking to those considering making their own resources available as Linked Data, either in their enterprise, or on the wider web.

There are some obvious candidates. The BBC, for instance, makes significant use of Linked Data within its enterprise.  They built their fantastic Olympics 2012 online coverage on an infrastructure with Linked Data at its core.  Unfortunately, apart from a few exceptions such as Wildlife and Programmes, we only see the results in a powerful web presence; the published data itself is only visible within their enterprise.

DBpedia is another excellent candidate.  From about 2007 it has been a clear demonstration of Tim Berners-Lee’s principles of using URIs as identifiers and providing information, including links to other things, in RDF – it is just there at the end of the DBpedia URIs.  But for some reason developers don’t seem to see it as a compelling example.  Maybe it is influenced by the Wikipedia effect – interesting but built by open data geeks, so not to be taken seriously.

A third example, which I want to focus on here, is Ordnance Survey.  Not generally known much beyond the geographical patch they cover, Ordnance Survey is the official mapping agency for Great Britain. Formally a government agency, they are best known for their incredibly detailed and accurate maps that are the standard accessory for anyone doing anything in the British countryside.  A little less known is that they also publish information about post-code areas, parish/town/city/county boundaries, parliamentary constituency areas, and even European regions in Britain. As you can imagine, these do not all intersect neatly, which makes the data about them a great case for a graph based data model and hence for publishing as Linked Data.  Which is what they did a couple of years ago.

The reason I want to focus on their efforts now, is that they have recently beta released a new API suite, which I will come to in a moment.  But first I must emphasise something that is often missed.

Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’.  With only standard [http] web protocols, you can get the data for an entity in their dataset by just doing an HTTP GET request on the identifier (e.g. for my local village: http://data.ordnancesurvey.co.uk/id/7000000000002929). What you get back is some nicely formatted HTML for your web browser, and with content negotiation you can get the same thing as RDF/XML, JSON or Turtle.   As it is Linked Data, what you get back also includes links to other data, enabling you to navigate your way around their data from entity to entity.
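As a rough illustration of the content negotiation described above, here is a minimal Python sketch using only the standard library. The entity URI is the one quoted in this post; the media types are the standard ones for each format, but exactly which of them the Ordnance Survey server honours is an assumption on my part, and the helper names are invented for this example.

```python
from urllib.request import Request, urlopen

# The entity URI quoted in the post.
ENTITY_URI = "http://data.ordnancesurvey.co.uk/id/7000000000002929"

# Standard media types; which ones the server actually supports is an assumption.
ACCEPT_TYPES = {
    "html": "text/html",
    "rdfxml": "application/rdf+xml",
    "json": "application/json",
    "turtle": "text/turtle",
}


def build_request(uri: str, fmt: str) -> Request:
    """Build a plain HTTP GET whose Accept header asks for one representation."""
    return Request(uri, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(uri: str, fmt: str = "turtle") -> str:
    """Dereference the URI and return the response body as text (needs network access)."""
    with urlopen(build_request(uri, fmt), timeout=10) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds and prints the request; call fetch() to actually go to the network.
    request = build_request(ENTITY_URI, "turtle")
    print(request.get_method(), request.full_url, request.get_header("Accept"))
```

The same URI is used for every format; only the Accept header changes, which is the whole point of the 'just there' approach.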

An excellent demonstration of the basic power and benefit of Linked Data.  So why is this often missed?  Maybe it is because there is nothing to learn, no API documentation required, you can see and use it by just entering a URI into your web browser – too simple to be interesting perhaps.

To get at the data in more interesting and complex ways you need the API set thoughtfully provided by those that understand the data and some of the most common uses for it, Ordnance Survey.

The API set, now in beta, in my opinion is a most excellent example of how to build, document, and provide access to Linked Data assets in this way.

Firstly the APIs are applied as a standard to four available data sets – three individual, and one combining all three data sets.  It is nice that you can work with an individually focussed set or get data from all in a consolidated graph.

There are four APIs:

  • Lookup – a simple way to extract an RDF description of a single resource, using its URI.
  • Search –  for running keyword searches over a dataset.
  • Sparql –  a fully-compliant SPARQL 1.1 endpoint.
  • Reconciliation – a simple web service that supports linking of datasets to the Ordnance Survey Linked Data.
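To give a flavour of the SPARQL option in the list above, the following Python sketch builds a query URL in the way the SPARQL 1.1 Protocol describes for GET requests (the query text travels in a `query` parameter). The endpoint address is a hypothetical placeholder, not the real Ordnance Survey one, and the function names are my own.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, used purely for illustration.
SPARQL_ENDPOINT = "http://example.org/sparql"


def describe_query(entity_uri: str, limit: int = 10) -> str:
    """Return a SPARQL query listing properties and values of one resource."""
    return (
        "SELECT ?property ?value "
        f"WHERE {{ <{entity_uri}> ?property ?value }} "
        f"LIMIT {limit}"
    )


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode the query as the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    query = describe_query("http://data.ordnancesurvey.co.uk/id/7000000000002929")
    print(sparql_get_url(SPARQL_ENDPOINT, query))
```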

Each API is available to play with on a web page complete with examples and pop-up help hints.  It is very easy and quick to get your head around the capabilities of the individual APIs, the use of parameters, and returned formats without having to read documentation or cut a single line of code.

For a quick intro there is even a page with them all on for you to try. When you do get around to cutting code, the documentation for each API is also well presented in simple and understandable form.  They even include details of the available output formats and expected http response codes.

Finally a few general comments.

Firstly the look, feel, and performance of the site reflects that this is a robust, serious, professional service and fills you with confidence about building your application on its APIs.  Developers of services and APIs, even for internal use, often underestimate the value of presenting and documenting their offering in a professional way.  How often have you come across API documentation that makes the first web page look modern, and wondered about investing the time in even looking at it?  Also a site with a snappy response ups your confidence that your application will perform well when using their service.

Secondly the range of APIs, all cleanly and individually satisfying specific general needs.  So for instance you can usefully use Search and Lookup without having any understanding of RDF or SPARQL – the power of SPARQL being there only if you understand and need it.

The usual collection of output formats is available. As a bit of a JavaScript kiddie I would have liked to see JSONP there too, but that is not a show stopper.  The provision of the reconciliation API, supporting tools such as Open Refine, opens up access to a far broader [non-coding] development community.

The additional features – CORS Support and Response Caching – (detailed on the API documentation pages) also demonstrate that this service has been built with the issues of the data consumer in mind.  Providing the tools for consumers to take advantage of web caching in their application will greatly enhance response and performance.  The CORS Support enables the creation of in browser applications that draw data from many sites – one of the oft promoted benefits of linked data, but sometimes a little tricky to implement ‘in browser’.

I can see this site and its associated APIs greatly enhancing the reputation of Ordnance Survey; underpinning the development of many apps and applications; and becoming an ideal source for many people to go ‘to try out’, when writing their first API consuming application code.

Well done to the team behind its production.

From Records to a Web of Library Data – Pt3 Beacons of Availability

As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series.  That one – Entification, the following one – Hubs of Authority and this, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek.  Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.

Beacons of Availability

As I indicated in the first of this series, there are descriptions of a broader collection of entities than just books, articles and other creative works, locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.

As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority, to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography, and WorldCat) is creating Library Linked Data openly available on the Web.

“Why do we catalogue?” is a question I often ask, with an obvious answer – so that people can find our stuff.  How does this entification, sharing of authorities, and creation of a web of library linked data help us in that goal?  In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. Sounds like a great benefit and mission statement for libraries of the world but unfortunately not one that will nudge the needle on making library resources more discoverable for the vast majority of those that can benefit from them.

I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google.  A similar weight of opinion can be found that complains how bad Google, and the other search engines, are at representing library resources.  You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources.  Yet I am willing to bet that again we have another 80-20 equation or worse about how few, of the users that libraries want to reach, even know those specialist Google services exist.  A bit of a sorry state of affairs when the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!

Library linked data helps solve both the problem of better description and the problem of findability of library resources in the major search engines.  Plus it can help with the problem of identifying where a user can gain access to that resource to loan, download, view via a suitable license, or purchase, etc.

Findability
Before a search engine can lead a user to a suitable resource, it needs to identify that the resource exists, in any form, and hold a description for display in search results that will sufficiently inform a user as such. Library search interfaces are inherently poor sources of such information, with web crawlers having to infer, from often difficult to differentiate text, what the page might be about.  This is not a problem isolated to library interfaces.  In response, the major search engines have cooperated to introduce a generic vocabulary for embedding structured information into web pages so that they can be informed in detail what the page references.  This vocabulary is Schema.org – I have previously posted about its success and significance.

With a few enhancements in the way it can describe bibliographic resources (currently being discussed by the Schema Bib Extend W3C Community Group) Schema.org is an ideal way for libraries to publish information about our resources and associated entities in a format the search engines can consume and understand.   By using URIs for authorities in that data – identifying the author in question, for instance, by his/her VIAF identifier – we give the search engines the ability to identify resources from many libraries associated with the same person.  With this greatly enriched, more structured, linked to authoritative hubs, view of library resources, the likes of Google over time will stand a far better chance of presenting potential library users with useful informative results.  I am pleased to say that OCLC have been at the forefront of demonstrating this approach by publishing Schema.org modelled linked data in the default WorldCat.org interface.
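To make that a little more concrete, here is a minimal Python sketch of the kind of Schema.org description being discussed: a Book whose author is tied to an authority URI via sameAs. JSON-LD is used here only because it is compact to show; which embedding syntax any given site uses is not something this post specifies. The title, name, and VIAF URI are made-up placeholders, not real records, and the function names are my own.

```python
import json


def book_markup(title: str, author_name: str, author_viaf_uri: str) -> dict:
    """Describe a book with Schema.org terms, identifying the author by URI."""
    return {
        "@context": "http://schema.org",
        "@type": "Book",
        "name": title,
        "author": {
            "@type": "Person",
            "name": author_name,
            # The authority URI says *which* person this is, not just their name.
            "sameAs": author_viaf_uri,
        },
    }


def to_json_ld(markup: dict) -> str:
    """Serialise the description as JSON-LD text ready to embed in a page."""
    return json.dumps(markup, indent=2)


if __name__ == "__main__":
    # Placeholder values only; the VIAF URI below is not a real identifier.
    example = book_markup(
        "An Example Title", "Example, Author", "http://viaf.org/viaf/0000000"
    )
    print(to_json_ld(example))
```

Two libraries publishing descriptions with the same author URI are, in data terms, talking about the same person, however differently each formats the name.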

For this approach to be most effective, many of the major libraries, consortia, etc. will need to publish metadata as linked data, in a form that the search engines can consume whilst (following linked data principles) linking to each other when they identify that they are describing the same resource. Many instances of [in data terms] the same thing being published on the web will naturally raise its visibility in results listings.

Visibility
An individual site (even a WorldCat) has difficulty in being identified above the noise of retail and other sites.  We are aware of the Page Rank algorithms used by the search engines to identify and boost the reputation of individual sites and pages by the numbers of links between them.   If not an identical process, it is clear that similar rules will apply for structured data linking.  If twenty sites publish their own linked data about the same thing, the search engines will take note of each of them.  If each of those sites assert that their resource is the same resource as a few of their partner sites (building a web of connection between instances of the same thing), I expect that the engines will take exponentially more notice.

Page ranking does not depend on all pages having to link to all others.  Like many things on the web, hubs of authority and aggregation will naturally emerge with major libraries, local, national, and global consortia doing most of the inter-linking, providing interdependent hubs of reputation for others to connect with.

Availability
Having identified a resource that may satisfy a potential library user’s need, the next even more difficult problem is to direct that user to somewhere that they can gain access to it – loan, download, view via an appropriate licence, or purchase, etc.

WorldCat.org, and other hubs, with linked data enhanced to provide holdings information, may well provide a target to link to, via which a user may gain access to a resource in addition to just getting a description of it.  However, those few sites, no matter how big or well recognised they are, are just a few sites shouting in the wilderness of the ever increasing web.  Any librarian in any individual library can quite rightly ask how to help Google, and the others, to point users at the most appropriate copy in his/her library.

We have all experienced the scenario of searching for a car rental company, to receive a link to one within walking distance as first result – or finding the on-campus branch at the top of a list of results in response to a search for banks.  We know the search engines are good at location based searching, either geographical or interest, so why can they not do it for library resources?   To achieve this a library needs to become an integral part of a Web of Library Data: publishing structured linked data about the resources they have available for the search engines to find; in that data, linking their resources to the reputable hubs of bibliographic data that will emerge, so the engines know it is another reference to the same thing; and going beyond basic bibliographic description to encompass structured data used by the commercial world to identify availability.
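As a sketch of what those three ingredients might look like together, the Python below builds a Schema.org Offer for a loanable copy: a description of the item, a sameAs link to a hub, and the commercial-style availability terms. The properties used (itemOffered, availability, businessFunction, offeredBy) are existing Schema.org terms, and LeaseOut is a GoodRelations business function; choosing them for a library loan is my assumption rather than an agreed pattern. The hub URI and library details are placeholders.

```python
import json


def loanable_copy(hub_uri: str, library_name: str, library_url: str) -> dict:
    """Describe one library's available copy as a Schema.org Offer."""
    return {
        "@context": "http://schema.org",
        "@type": "Offer",
        # The item is linked to a hub so engines know it is the same thing.
        "itemOffered": {"@type": "Book", "sameAs": hub_uri},
        # Availability vocabulary borrowed from the commercial world.
        "availability": "http://schema.org/InStock",
        # GoodRelations term for lending rather than selling (an assumption here).
        "businessFunction": "http://purl.org/goodrelations/v1#LeaseOut",
        "offeredBy": {"@type": "Library", "name": library_name, "url": library_url},
    }


if __name__ == "__main__":
    # Placeholder hub URI and library; neither refers to a real record.
    offer = loanable_copy(
        "http://www.worldcat.org/oclc/0000000",
        "Example Town Library",
        "http://library.example.org/",
    )
    print(json.dumps(offer, indent=2))
```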

So who is going to do all this then – will every library need to employ a linked data expert?   I certainly hope not.

One would expect the leaders in this field, national libraries, OCLC, consortia etc to continue to lead the way, in the process establishing the core of this library web of data – the hubs.  Building on that framework the rest of the web can be established with the help of the products, and services of service providers and system suppliers.  Those concerned about these things should already be starting to think about how they can be helped not only to publish linked data in a form that the search engines can consume, but also how their resources can become linked via those hubs to the wider web.

By lighting a linked data beacon on top of their web presence, a library will announce to the world the availability of their resources.  One beacon is not enough.  A web of beacons (the web of library data) will alert the search engines to the mass of those resources in all libraries, then they can lead users via that web to the appropriately located individual resource in particular.

This won’t happen overnight, but we are certainly in for some interesting times ahead.

Beacons picture from wallpapersfor.me

From Records to a Web of Library Data – Pt2 Hubs of Authority

As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series.  That one – Entification, and the next in the series – Beacons of Availability, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek.  Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.

Hubs of Authority

Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years.  The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world either in national libraries or cooperative organisations.  Two from personal experience come to mind:  BLCMP, started in Birmingham, UK in 1969, eventually evolved into the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC.  Both, in their own way, have had significant impact on the worlds of libraries, metadata, and the web (and me!).

One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records.  A large number of national libraries have such lists of agreed formats for author and organisational names.  The Library of Congress has, in addition to its name authorities, subjects, classifications, languages, countries etc.  Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations.

These authority files play a major role in the efficient cataloguing of material today, either by being part of the workflow in a cataloguing interface, or often just using the wonders of Windows ^C & ^V keystroke sequences to transfer agreed format text strings from authority sites into Marc record fields.

It is telling that the default [librarian] description of these things is a file – an echo back to the days when they were just that, a file containing a list of names.  Almost despite their initial purpose, authorities are gaining a wider purpose.  As a source of names for, and growing descriptions of, the entities that the library world is aware of.  Many authority file hosting organisations have followed the natural path, in this emerging world of Linked Data, to provide persistent URIs for each concept plus publishing their information as RDF.

These Linked Data enabled sources of information are developing importance in their own right, as a natural place to link to, when asserting the thing, person, or concept you are identifying in your data.  Sir Tim Berners-Lee’s fourth principle of Linked Data tells us to “Include links to other URIs, so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other.  A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.

Libraries and librarians have a great brand image, something that attaches itself to the data and services they publish on the web.  Respected and trusted are a couple of words that naturally associate with bibliographic authority data emanating from the library community.  This data, starting to add value to the wider web, comes from those Marc records I spoke about last time.  Yet it does not, as yet, lead those navigating the web of data to those resources so carefully catalogued.  In this case, instead of cataloguing so people can find stuff, we could be considered to be enriching the web with hubs of authority derived from, but not connected to, the resources that brought them into being.

So where next?  One obvious move, that is already starting to take place, is to use the identifiers (URIs) for these authoritative names to assert within our data, facts such as who a work is by and what it is about.  Check out data from the British National Bibliography or the linked data hidden in the tab at the bottom of a WorldCat display – you will see VIAF, LCSH and other URIs asserting connection with known resources.  In this way, processes no longer need to infer from the characters on a page that they are connected with a person or a subject.  It is a fundamental part of the data.
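A small Python sketch may help show why this matters. The three records below are invented for illustration, and the VIAF-style URIs are placeholders, not real identifiers. Grouping by name string splits one person in two and merges two different people; grouping by URI does neither.

```python
from collections import defaultdict

# Invented records; the URIs are placeholders, not real VIAF identifiers.
RECORDS = [
    {"title": "First Example Work", "creator_name": "Example, Jane",
     "creator_uri": "http://viaf.org/viaf/0000001"},
    {"title": "Second Example Work", "creator_name": "Jane Example",
     "creator_uri": "http://viaf.org/viaf/0000001"},
    {"title": "Third Example Work", "creator_name": "Example, Jane",
     "creator_uri": "http://viaf.org/viaf/0000002"},
]


def group_titles(records: list, key: str) -> dict:
    """Group work titles by the value found under the given record key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["title"])
    return dict(groups)


if __name__ == "__main__":
    # Matching on characters: same-named strangers merged, variant spellings split.
    print(group_titles(RECORDS, "creator_name"))
    # Matching on identifiers: each person's works stay together.
    print(group_titles(RECORDS, "creator_uri"))
```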

With that large amount of rich [linked] data, and the association of the library brand, it is hardly surprising that these datasets are moving beyond mere nodes on the web of data.  They are evolving in to Hubs of Authority, building a framework on which libraries and the rest of the web, can hang descriptions of, and signposts to, our resources.  A framework that has uses and benefits beyond the boundaries of bibliographic data.  By not keeping those hubs ‘library only’, we enable the wider web to build pathways to the library curated resources people need to support their research, learning, discovery and entertainment.

Image by the trial on Flickr