WorldCat Works Linked Data – Some Answers To Early Questions

Since announcing the preview release of 194 Million Open Linked Data Bibliographic Work descriptions from OCLC’s WorldCat last week, at the excellent OCLC EMEA Regional Council event in Cape Town, my in-box and Twitter stream have been a little busy with questions about what the team at OCLC are doing.

Instead of keeping the answers within individual email threads, I thought they may be of interest to a wider audience:

Q  I don’t see anything that describes the criteria for “workness.”
“Workness” definition is more the result of several interdependent algorithmic decision processes than a simple set of criteria.  To a certain extent publishing the results as linked data was the easy (huh!) bit.  The efforts to produce these definitions and their relationships are the ongoing results of a research process, by OCLC Research, that has been in motion for several years, to investigate and benefit from FRBR.  You can find more detail behind this research here: http://www.oclc.org/research/activities/frbr.html?urlm=159763

Q Defining what a “work” is has proven next to impossible in the commercial world, how will this be more successful?
Very true: often for commercial and/or political reasons, previous initiatives in this direction have not been very successful.  OCLC make no broader claim to the definition of a WorldCat Work, other than that it is the result of applying the FRBR and associated algorithms, developed by OCLC Research, to the vast collection of bibliographic data contributed, maintained, and shared by the OCLC member libraries and partners.

Q  Will there be links to individual ISBN/ISNI records?

  • ISBN – ISBNs are attributes of manifestation [in FRBR terms] entities, and as such can be found in the already released WorldCat Linked Data.  As each work is linked to its related manifestation entities [by schema:workExample] they are therefore already linked to ISBNs.
  • ISNI – ISNI is an identifier for a person, and as such an ISNI URI is a candidate for use in linking Works to other entity types.  VIAF URIs are another candidate for Person/Organisation entities which, as we have the data, we will be using.  No final decisions have been made as to which URIs we use, or whether to use multiple URIs for the same relationship.  Whether we use ISNI, VIAF, and DBpedia URIs for the same person, or just use one and rely on interconnection between the authoritative hubs, is a question still to be concluded.
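
The ISBN answer above can be sketched as a two-hop traversal: from a work to its manifestations via schema:workExample, then to each manifestation's ISBN. This is only an illustration. The identifiers and ISBN values are placeholders rather than real WorldCat data, and the use of a schema:isbn property on the manifestation is an assumption.

```python
# Illustrative triples only: the identifiers and ISBN values below are
# placeholders, not real WorldCat data. schema:isbn is an assumed property.
SCHEMA = "http://schema.org/"

triples = [
    ("work:1", SCHEMA + "workExample", "manifestation:1"),
    ("work:1", SCHEMA + "workExample", "manifestation:2"),
    ("manifestation:1", SCHEMA + "isbn", "ISBN-PLACEHOLDER-A"),
    ("manifestation:2", SCHEMA + "isbn", "ISBN-PLACEHOLDER-B"),
]


def isbns_for_work(work, triples):
    """Follow schema:workExample links from a work, then collect schema:isbn values."""
    manifestations = {o for s, p, o in triples
                      if s == work and p == SCHEMA + "workExample"}
    return sorted(o for s, p, o in triples
                  if s in manifestations and p == SCHEMA + "isbn")


print(isbns_for_work("work:1", triples))
```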

Q  Can you say more about how the stable identifiers will be managed as the grouping of records that create a work changes?
You correctly identify the issue of maintaining identifiers as work groups split & merge.  This is one of the tasks the development team are currently working on as they move towards full release of this data over the coming weeks.  As I indicated in my blog post, there is a significant data refresh due and from that point onwards any changes will be handled correctly.

Q  Is there a bulk download available?
No, there is no bulk download available.  This is a deliberate decision for several reasons.
Firstly this is Linked Data – its main benefits accrue from its canonical persistent identifiers and the relationships it maintains between other identified entities within a stable, yet changing, web of data.  WorldCat.org is a live data set actively maintained and updated by the thousands of member libraries, data partners, and OCLC staff and processes. I would discourage reliance on local storage of this data, as it will rapidly evolve and become out of synchronisation with the source.  The whole point and value of persistent identifiers, which you would reference locally, is that they will always dereference to the current version of the data.

Q  Where should bugs be reported?
Today, you can either use the comment link from the Linked Data Explorer or report them to data@oclc.org.  We will be building on this as we move towards full release.

Q  There appears to be something funky with the way non-existent IDs are handled.
You have spotted a defect!  The result of accessing a non-established URI should be that no triples are returned with that URI as subject.  How this is represented will differ between serialisations. You would also expect to receive an HTTP status of 404.

Q  It’s wonderful to see that the data is being licensed ODC-BY, but maybe assertions to that effect should be there in the data as well?
The next release of data will be linked to a VoID document providing information, including licensing, for the dataset.

Q  How might WorldCat Works intersect with the BIBFRAME model? – these work descriptions could be very useful as a bf:hasAuthority for a bf:Work.
The OCLC team monitor, participate in, and take account of many discussions – BIBFRAME, Schema.org, SchemaBibEx, WikiData, etc. – where there are some obvious synergies in objectives, and differences in approach and/or levels of detail for different audiences. The potential for interconnection of datasets using sameAs, and other authoritative relationships such as you describe, is significant.  As the WorldCat data matures and other datasets are published, one would expect many to start interlinking bibliographic resources from many sources.

Q  Will your team be making use of ISTC?
Again, it is still early for decisions in this area.  However, we would not expect to store the ISTC code as a property of Work.  ISTC is one of many work-based data sets, from national libraries and others, between which it would be interesting to investigate processes for identifying sameAs relationships.

The answer to the above question stimulated a follow-on question based upon the fact that ISTC Codes are allocated on a language basis.  In FRBR terms language of publication is associated with the Expression, not the Work level description. As such, you would not expect to find ISTC on a ‘Work’.  My response to this was:

Note that the Works published from WorldCat.org are defined as instances of schema:CreativeWork.

What you say may well be correct for FRBR, but the WorldCat data may not adhere strictly to the FRBR rules and levels.  I say ‘may not’ as we are still working on the modelling behind this, and a language specific Work may become just an example of a more general Work – there again it may become more Expression-like.  There is a balance to be struck between FRBR rules and a wider, non-library, understanding.

Q   Which triplestore are you using?
We are not using a triplestore. Already, in this early stage of the journey to publish linked data about the resources within WorldCat, the descriptions of hundreds of millions of entities have been published.  There is obvious potential for this to grow to many billions.  The initial objective is to reliably publish this data in ways that make it easy to consume and link to, and available in the de facto linked data serialisations.  To achieve this we have put in place a simple, very scalable, flexible infrastructure, currently based upon Apache Tomcat serving up individual RDF descriptions stored in Apache HBase (built on top of Apache Hadoop HDFS).  No doubt future use cases will emerge, building upon this basic yet very valuable publishing of data, that will require additional tools, techniques, and technologies to become part of that infrastructure over time.  I know the development team are looking forward to the challenges that the quantity, variety, and always changing nature of data within WorldCat will provide for some of the traditional [for smaller data sets] answers to such needs.

As an aside, you may be interested to know that significant use is made of the map/reduce capabilities of Apache Hadoop in the processing of data extracted from bibliographic records, the identification of entities within that data, and the creation of the RDF descriptions.  I think it is safe to say that the creation and publication of this data would not have been feasible without Hadoop being part of the OCLC architecture.

 

Hopefully this background will help those interested in the process.  When we move from preview to a fuller release I expect to see associated documentation and background information appear.

OCLC Preview 194 Million Open Bibliographic Work Descriptions






I have just been sharing a platform, at the OCLC EMEA Regional Council Meeting in Cape Town, South Africa, with my colleague Ted Fons.  A great setting for a great couple of days of the OCLC EMEA membership and others sharing thoughts, practices, collaborative ideas and innovations.

Ted and I presented our continuing insight into The Power of Shared Data, and the evolving data strategy for the bibliographic data behind WorldCat. If you want to see a previous view of these themes you can check out some recordings we made late last year on YouTube, from Ted – The Power of Shared Data – and me – What the Web Wants.

Today, demonstrating on-going progress towards implementing the strategy, I had the pleasure to preview two upcoming significant announcements on the WorldCat data front:

  1. The release of 194 Million Open Linked Data Bibliographic Work descriptions
  2. The WorldCat Linked Data Explorer interface

WorldCat Works

A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work.  The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary.  In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, oclc numbered, editions already shared in WorldCat.   Let’s take a look at one – try this: http://worldcat.org/entity/work/id/12477503

You will see, displayed in the new WorldCat Linked Data Explorer, an HTML view of the data describing ‘Zen and the art of motorcycle maintenance’. Click on the ‘Open All’ button to view everything.  Anyone used to viewing bibliographic data will see that this is a very different view of things. It is mostly URIs, the only visible strings being the name or description elements.  This is not designed as an end-user interface; it is designed as a data exploration tool.  This is highlighted by the links at the top to alternative RDF serialisations of the data – Turtle, N-Triples, JSON-LD, RDF/XML.
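
For anyone wanting one of those serialisations from a script rather than from the links on the page, a common linked data approach is HTTP content negotiation. The sketch below only builds the request and does not send it. It assumes the server honours the Accept header, which is not something confirmed here. The media types themselves are the standard registered ones, and `rdf_request` is a name made up for illustration.

```python
from urllib.request import Request

# Standard media types for the four serialisations named above.
MEDIA_TYPES = {
    "turtle": "text/turtle",
    "ntriples": "application/n-triples",
    "jsonld": "application/ld+json",
    "rdfxml": "application/rdf+xml",
}


def rdf_request(uri, serialisation):
    """Build (but do not send) an HTTP request asking for one RDF serialisation.

    Assumption: the server performs content negotiation on the Accept header.
    """
    return Request(uri, headers={"Accept": MEDIA_TYPES[serialisation]})


req = rdf_request("http://worldcat.org/entity/work/id/12477503", "turtle")
print(req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual fetch.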

The vocabulary used to describe the data is based upon Schema.org, and enhancements to it recommended and proposed by the Schema Bib Extend W3C Community Group, which I have the pleasure to chair.

Why is this a preview? Can I usefully use the data now? These are a couple of obvious questions for you to ask at this time.

This is the first production release of WorldCat infrastructure delivering linked data.  It is the first step in what will be an evolutionary, and revolutionary, journey to provide interconnected linked data views of the rich entities (works, people, organisations, concepts, places, events) captured in the vast shared collection of bibliographic records that makes up WorldCat.  Mining those 311+ million records is not a simple task, even just to identify works. It takes time, and a significant amount of [Big Data] computing resources.  One of the key steps in this process is to identify where connections exist between works and authoritative data hubs, such as VIAF, FAST, LCSH, etc.  In this preview release, some of those connections are not yet in place.

What you see in their place at the moment is a link to what can be described as a local authority.  Each of these has what the data geeks call a hash-URI as its identifier. http://experiment.worldcat.org/entity/work/data/12477503#Person/pirsig_robert, for example, is such an identifier, constructed from the work URI and the person name.  Over the next few weeks, where the information is available, you would expect to see this link replaced by a connection to VIAF, such as this: http://viaf.org/viaf/78757182.
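
To make the hash-URI idea concrete, here is a small sketch using Python's standard library to split the example identifier into its two parts: the document a client would request, and the fragment naming one entity described within that document.

```python
from urllib.parse import urldefrag

local_authority = ("http://experiment.worldcat.org/entity/work/data/"
                   "12477503#Person/pirsig_robert")

# Everything before '#' is the document a client actually requests;
# the fragment identifies one entity described inside that document.
document_uri, fragment = urldefrag(local_authority)
print(document_uri)
print(fragment)
```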

So, can I use the data? – Yes, the data is live, and most importantly the work URIs are persistent. It is also available under an open data license (ODC-BY).

How do I get a work id for my resources? – Today, there is one way.  If you use the OCLC xISBN or xOCLCNum web services you will find, as part of the data returned, a work id (e.g. owi="owi12477503"). By stripping off the ‘owi’ you can easily create the relevant work URI: http://worldcat.org/entity/work/id/12477503
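
As a minimal sketch of that last step, the function below turns a work id of the form shown above into a work URI. The function name is made up for illustration, and the check that the remainder is purely numeric is an assumption based on the single example given.

```python
WORK_URI_PREFIX = "http://worldcat.org/entity/work/id/"


def work_uri_from_owi(owi):
    """Turn a work id such as 'owi12477503' into a WorldCat work URI.

    Assumption: the part after the 'owi' prefix is always numeric.
    """
    if not owi.startswith("owi") or not owi[3:].isdigit():
        raise ValueError(f"unexpected work id: {owi!r}")
    return WORK_URI_PREFIX + owi[3:]


print(work_uri_from_owi("owi12477503"))
```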

In a very few weeks, once the next update to the WorldCat linked data has been processed, you will find that links to works will be embedded in the already published linked data.  For example you will find the following in the data for OCLC number 53474380:

What is next on the agenda? As described, within a few weeks, we expect to enhance the linking within the descriptions and provide links from the OCLC-numbered manifestations.  From then on, both WorldCat and others will start to use WorldCat Work URIs, and their descriptions, as a core, stable foundation on which to build out a web of relationships between entities in the library domain.  It is that web of data that will stimulate the sharing of data and innovation in the design of applications and interfaces consuming the data over coming months and years.

As I said on the program today, we are looking for feedback on these releases.

We as a community are embarking on a new journey with shared, linked data at its heart. Its success will be based upon how that data is exposed and used, and on the intrinsic quality of that data.  Experience shows that a new view of data often exposes previously unseen issues; it is just that sort of feedback we are looking for.  So any feedback on any aspect of this will be more than welcome.

I am excitedly looking forward to being able to comment further as this journey progresses.

Update:  I have posted answers to some interesting questions raised by this release.

OCLC Declare OCLC Control Numbers Public Domain

Little things mean a lot.  Little things that are misunderstood often mean a lot more.

Take the OCLC Control Number, often known as the OCN, for instance.

Every time an OCLC bibliographic record is created in WorldCat it is given a unique number from a sequential set – a process that has already taken place over a billion times.  The individual number can be found represented in the record it is associated with.  Over time these numbers have become a useful part of the processing of not only OCLC and its member libraries but, as a unique identifier proliferated across the library domain, by partners, publishers and many others.

Like anything that has been around for many years, assumptions and even myths have grown around the purpose and status of this little string of digits.  Many stem from a period when there was concern, being voiced by several including me at the time, about the potentially over restrictive reuse policy for records created by OCLC and its member libraries.  It became assumed by some, that the way to tell if a bibliographic record was an OCLC record was to see if it contained an OCN.  The effect was that some people and organisations invested effort in creating processes to remove OCNs from their records.  Processes that I believe, in a few cases, are still in place.

So in the current and future climate of open sharing of data, where for instance WorldCat Linked Data is published under an open data license, such assumptions and practices are an anomaly.

I signalled that OCLC were looking at this in my session (Linked Data Progress) at IFLA in Singapore a few weeks ago. I am pleased to say that the wording I was hinting at has now appeared on the relevant pages of the OCLC web site:

Use of the OCLC Control Number (OCN)
OCLC considers the OCLC Control Number (OCN) to be an important data element, separate from the rest of the data included in bibliographic records. The OCN identifies the record, but is not part of the record itself. It is used in a variety of human and machine-readable processes, both on its own and in subsequent manipulations of catalog data. OCLC makes no copyright claims in individual bibliographic elements nor does it make any intellectual property claims to the OCLC Control Number. Therefore, the OCN can be treated as if it is in the public domain and can be included in any data exposure mechanism or activity as public domain data. OCLC, in fact, encourages these uses as they provide the opportunity for libraries to make useful connections between different bibliographic systems and services, as well as to information in other domains.

The announcement of this confirmation/clarification of the status of OCNs was made yesterday by my colleague Jim Michalko on the Hanging Together blog.

When discussing this with a few people, one question often came up – Why just declare OCNs as public domain, why not license them as such? The following answer from the OCLC website, I believe, explains why:

The OCN is an individual bibliographic element, and OCLC doesn’t make any copyright claims either way on specific data elements. The OCN can be used by other institutions in ways that, at an aggregate level, may have varying copyright assertions. Making a positive, specific claim that the OCN is in the public domain might interfere with the copyrights of others in those situations.

As I said, this is a little thing, but if it clears up some misunderstandings and consequential anomalies, it will contribute to the usefulness of OCNs and ease the path towards a more open and shared data environment.

SemanticWeb.com Spotlight on Library Innovation






Help spotlight library innovation and send a library linked data practitioner to the SemTechBiz conference in San Francisco, June 2-5







Update from organisers:
We are pleased to announce that Kevin Ford, from the Network Development and MARC Standards Office at the Library of Congress, was selected for the SemanticWeb.com Spotlight on Innovation for his work with the Bibliographic Framework Initiative (BIBFRAME) and his continuing work on the Library of Congress’s Linked Data Service (id.loc.gov). In addition to being an active contributor, Kevin is responsible for the BIBFRAME website; has devised tools to view MARC records and the resulting BIBFRAME resources side-by-side; authored the first transformation code for MARC data to BIBFRAME resources; and is project manager for the Library of Congress’s Linked Data Service. Kevin also writes and presents frequently to promote BIBFRAME and ID.LOC.GOV, and to educate fellow librarians on the possibilities of linked data.

Without exception, each nominee represented great work and demonstrated the power of Linked Data in library systems, making it a difficult task for the committee, and sparking some interesting discussions about future such spotlight programs.

Congratulations, Kevin, and thanks to all the other great library linked data projects nominated!

 

OCLC and LITA are working to promote library participation at the upcoming Semantic Technology & Business Conference (SemTechBiz). Libraries are doing important work with Linked Data.  SemanticWeb.com wants to spotlight innovation in libraries, and send one library presenter to the SemTechBiz conference, expenses paid.

SemTechBiz brings together today’s industry thought leaders and practitioners to explore the challenges and opportunities jointly impacting both business leaders and technologists. Conference sessions include technical talks and case studies that highlight semantic technology applications in action. The program includes tutorials and over 130 sessions and demonstrations as well as a hackathon, start-up competition, exhibit floor, and networking opportunities.  Amongst the great selection of speakers you will find yours truly!

If you know of someone who has done great work demonstrating the benefit of linked data for libraries, nominate them for this June 2-5 conference in San Francisco. This “library spotlight” opportunity will provide one sponsored presenter with a spot on the conference program, paid travel & lodging costs to get to the conference, plus a full conference pass.

Nominations for the Spotlight are being accepted through May 10th.  Any significant practical work should have been accomplished prior to March 31st 2013 (the project can be ongoing).  Self-nominations will be accepted.

Even if you do not nominate anyone, the Semantic Technology and Business Conference is well worth experiencing.  As supporters of the SemanticWeb.com Library Spotlight OCLC and LITA members will get a 50% discount on a conference pass – use discount code “OCLC” or “LITA” when registering.  (Non members can still get a 20% discount for this great conference by quoting code “FCLC”)

For more details check out the OCLC Innovation Series page.

Thank you for all the nominations we received for the first SemanticWeb.com Spotlight on Innovation in Libraries.

 

Forming Consensus on Schema.org for Libraries and More

Back in September I formed a W3C Group – Schema Bib Extend.  To quote an old friend of mine: “Why did you go and do that then?”

Well, as I have mentioned before, Schema.org has become a bit of a success story for structured data on the web.  I would have no hesitation in recommending it as a starting point for anyone, in any sector, wanting to share structured data on the web.  This is what OCLC did in the initial exercise to publish the 270+ million resources in WorldCat.org as Linked Data.

At the same time, I believe that summer 2012 was a bit of a watershed for Linked Data in the library world.  Over the preceding few years we have had various national libraries publishing linked data (British Library, Bibliothèque nationale de France, Deutsche Nationalbibliothek, National Library of Sweden, to name just a few).  We have had linked data versions published of authority files such as LCSH, RAMEAU, National Diet Library, plus OCLC hosted services such as VIAF, FAST, and Dewey.  These plus many other initiatives have led me to conclude that we are moving to the next stage – for instance the British Library and Deutsche Nationalbibliothek are starting to cross-link their data, and the Library of Congress BIBFRAME initiative is starting to expose some of its [very linked data] thinking.

Of course the other major initiative was the publication of Linked Data, using Schema.org, from within OCLC’s WorldCat.org, both as RDFa embedded in WorldCat detail pages, and in a download file containing the 1.2 million most highly held works.

The need to extend the Schema.org vocabulary became clear when using it to mark up the bibliographic resources in WorldCat. The Book type defined in Schema.org, along with other types derived from CreativeWork, contain many of the properties you need to describe bibliographic resources, but are lacking in some of the more detailed ones we wanted to represent, such as holdings count and carrier type. It was also clear that it would need more extension if we wanted to go further to define the relationships between such things as works, expressions, manifestations, and items – to talk FRBR for a moment.

The organisations behind Schema.org (Google, Bing, Yahoo, Yandex) invite proposals for extension of the vocabulary via the W3C public-vocabs mailing list.  OCLC could have taken that route directly, but at best I suggest it would have only partially served the needs of the broad spread of organisations and people who could benefit from enriched description of bibliographic resources on the web.

So that is why I formed a W3C Community Group to build a consensus on extending the Schema.org vocabulary for these types of resources.  I wanted to not only represent the needs, opinions, and experience of OCLC, but also the wider library sector of libraries, librarians, system suppliers and others.  Any generally applicable vocabulary [most importantly recognised by the major search engines] would also provide benefit for the wider bibliographic publishing, retailing, and other interested sectors.

Four months, and four conference calls (supported by OCLC – thank you), later we are a group of 55 members with a fairly active mailing list. We are making progress towards shaping up some recommendations, having invested much time in discussing our objectives and the issues of describing detailed bibliographic information (often to be currently found in Marc, Onix, or other industry specific standards) in a generic web-wide vocabulary.  We are not trying to build a replacement for Marc, or turn Schema.org into a standard that you could operate a library community with.

Applying Schema.org markup to your bibliographic data is aimed at announcing its presence, and the resources it describes, to the web and linking them into the web of data. I would expect to see it being applied as complementary markup to other RDF based standards such as BIBFRAME as it emerges.  Although Schema.org started with Microdata and, latterly [and increasingly], RDFa, the vocabulary is equally applicable serialised in any of the RDF formats (N-Triples, Turtle, RDF/XML, JSON) for processing and data exchange purposes.

My hope over the next few months is that we will agree and propose some extensions to Schema.org (that will get accepted), especially in the areas of work/manifestation relationships, representations of identifiers other than ISBN, defining content/carrier, journal articles, and a few others that may arise.  Something that has become clear in our conversations is that we also have a role as a group in providing examples of how [extended] Schema.org markup should be applied to bibliographic data.

I would characterise the stage we are at as moving from the talking-about-it stage to the doing-something-about-it stage.  I am looking forward to the next few months with enthusiasm.

If you want to join in, you will find us over at http://www.w3.org/community/schemabibex/ (where, amongst other things, you will find recordings and chat transcripts from the meetings so far on the Wiki).  If you or your group want to know more about Schema.org and its relevance to libraries and the broader bibliographic world, drop me a line or, if I can fit it in with my travels to conferences such as ALA, I could be persuaded to stand up and talk about it.

Putting WorldCat Data Into A Triple Store

I cannot really get away with making a statement like “Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them” and then not following it up.

I made it in my previous post Get Yourself a Linked Data Piece of WorldCat to Play With in which I was highlighting the release of a download file containing RDF descriptions of the 1.2 million most highly held resources in WorldCat.org – to make the cut, a resource had to be held by more than 250 libraries.

So here for those that are interested is a step by step description of what I did to follow my own encouragement to load up the triples and start playing.

Step 1
Choose a triplestore.  I followed my own advice and chose 4Store.  The main reasons for this choice were that it is open source yet comes from an environment where it was the base platform for a successful commercial business, so it should work.  Also in my years rattling around the semantic web world, 4Store has always been one of those tools that seemed to be on everyone’s recommendation list.

Looking at some of the blurb – 4store is optimised to run on shared–nothing clusters of up to 32 nodes, linked with gigabit Ethernet – at times holding and running queries over databases of 15GT, supporting a Web application used by thousands of people – you may think it might be a bit of overkill for a tool to play with at home, but hey, if it works does that matter!

Step 2
Operating system.  Unsurprisingly for a server product, 4Store was developed to run on Unix-like systems.  I had three options.  I could resurrect that old Linux loaded pc in the corner, fire up an Amazon Web Service image with 4Store built in (such as the one built for the Billion Triple Challenge), or I could use the application download for my Mac.

As I was only needing it for personal playing, I went for the path of least resistance and went for the Mac application.   The Mac in question being a fairly modern MacBook Air.  The following instructions are therefore Mac oriented, but should not be too difficult to replicate on your OS of choice.

Step 3
Download and install.   I downloaded the 15MB, latest version of the application from the download server: http://4store.org/download/macosx/.  As with most Mac applications, it was just a matter of opening up the downloaded 4store-1.1.5.dmg file and dragging the 4Store icon into my applications folder.  (Time saving tip, whilst you are doing the next step you can be downloading the 1GB WorldCat data file in the background, from here)

Step 4
Setup and load.  Clicking on the 4Store application opens up a terminal window to give you command line access to controlling your triple store.  Following the simple but effective documentation, I needed to create a dataset, which I called WorldCatMillion:

  $ 4s-backend-setup WorldCatMillion

Next start the database:

  $ 4s-backend WorldCatMillion

Then I need to load the triples from the WorldCat Most Highly Held data set.  This step takes a while – over an hour on my system.

  $ 4s-import WorldCatMillion --format ntriples /Users/walllisr/Downloads/WorldCatMostHighlyHeld-2012-05-15.nt

This single command line, which may have wrapped on to more than one line in your browser, looks a bit complicated but all it is doing is telling the import process to import the file, which I had downloaded and unzipped (automatically on the Mac – you may have to use gunzip on another system), and which is formatted as ntriples, into my WorldCatMillion dataset.
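
If gunzip is not to hand on your system, the same unzipping step can be sketched with Python's standard library. The function name and the file names in the usage note are assumptions for illustration. In particular, the name of the compressed download is not given here.

```python
import gzip
import shutil


def gunzip(src_path, dest_path):
    """Decompress a .gz file to dest_path without loading it all into memory."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        # copyfileobj streams in chunks, which matters for a file of this size.
        shutil.copyfileobj(src, dest)
```

Assuming the download were saved as WorldCatMostHighlyHeld-2012-05-15.nt.gz, calling `gunzip("WorldCatMostHighlyHeld-2012-05-15.nt.gz", "WorldCatMostHighlyHeld-2012-05-15.nt")` would produce the file that 4s-import expects.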

Now to start the http server to access it:

  $ 4s-httpd -p 8000 WorldCatMillion

A quick test to see if it all worked:

  $ 4s-query WorldCatMillion 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'

This should output some XML-encoded triples.

Step 5
Access via a web browser.  I chose Firefox, as it seems to handle unformatted XML better than most.  4Store comes with a very simple SPARQL interface: http://localhost:8000/test/  This comes already populated with a sample query, just press execute and you should get the data back that you got with the command line 4s-query.  The server sends it back in an XML format, which your browser may save to disk for you to view – tweaking the browser settings to automatically open these files will make life easier.
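
If you would rather read that XML in a script than in the browser, the sketch below parses a result document into rows. It assumes the server replies in the standard W3C SPARQL Query Results XML Format. The sample document is hand-written for illustration, not output captured from 4Store, and its genre value is a placeholder.

```python
import xml.etree.ElementTree as ET

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# A hand-written sample in the W3C SPARQL Query Results XML Format;
# the values are illustrative, not output captured from 4Store.
SAMPLE = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="s"/><variable name="o"/></head>
  <results>
    <result>
      <binding name="s"><uri>http://www.worldcat.org/oclc/46843162</uri></binding>
      <binding name="o"><literal>Example genre</literal></binding>
    </result>
  </results>
</sparql>"""


def parse_results(xml_text):
    """Return each result row as a dict of variable name -> value."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall("sr:results/sr:result", NS):
        row = {}
        for binding in result.findall("sr:binding", NS):
            value = binding[0]  # the <uri>, <literal> or <bnode> child
            row[binding.get("name")] = value.text
        rows.append(row)
    return rows


for row in parse_results(SAMPLE):
    print(row["s"], row["o"])
```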

Step 6
Some simple SPARQL queries.  Try these and see what you get:

Describe a resource:

  DESCRIBE <http://www.worldcat.org/oclc/46843162>

Select all the genre used:

  SELECT DISTINCT ?o WHERE {
    ?s <http://schema.org/genre> ?o .
  } LIMIT 100 OFFSET 0

Select 100 resources, with a genre triple, outputting the resource URI and its genre. (By adjusting the OFFSET value, you can page through all the results):

  SELECT ?s ?o WHERE {
    ?s <http://schema.org/genre> ?o .
  } LIMIT 100 OFFSET 0
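
The paging idea can be sketched as a small generator that produces one query per page, stepping OFFSET by the page size. The function and template names are made up for illustration, and nothing here talks to the triplestore. It only builds the query strings.

```python
QUERY_TEMPLATE = """SELECT ?s ?o WHERE {{
  ?s <http://schema.org/genre> ?o .
}} LIMIT {limit} OFFSET {offset}"""


def paged_queries(pages, page_size=100):
    """Yield one query string per page, stepping OFFSET by the page size."""
    for page in range(pages):
        yield QUERY_TEMPLATE.format(limit=page_size, offset=page * page_size)


for query in paged_queries(3):
    # Show just the final line of each query, where the paging happens.
    print(query.splitlines()[-1])
```

One caveat: SPARQL does not guarantee a stable result order without an ORDER BY clause, so pages produced this way may overlap or miss rows on some stores.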

OK, there is a start. Now I need to play a bit to brush up on my SPARQL!

Get Yourself a Linked Data Piece of WorldCat to Play With

You may remember my frustration a couple of months ago at being in the air when OCLC announced the addition of Schema.org marked up Linked Data to all resources in WorldCat.org.  Those of you who attended the OCLC Linked Data Round Table at IFLA 2012 in Helsinki yesterday will know that I got my own back on the folks who publish the press releases at OCLC, by announcing the next WorldCat step along the Linked Data road whilst they were still in bed.

The Round Table was an excellent, very interactive session with Neil Wilson from the British Library, Emmanuelle Bermes from Centre Pompidou, and Martin Malmsten of the National Library of Sweden, which I will cover elsewhere.  For now, you will find my presentation Library Linked Data Progress on my SlideShare site.

After we experimentally added RDFa embedded linked data, using Schema.org markup and some proposed Library extensions, to WorldCat pages, one of the questions I was asked most often was: where can I get my hands on some of this raw data?

We are taking the application of linked data to WorldCat one step at a time so that we can learn from how people use and comment on it.  So at that time, if you wanted to see the raw data, the only way was to use a tool [such as the W3C RDFa 1.1 Distiller] to parse the data out of the pages, just as the search engines do.

So I am really pleased to announce that you can now download a significant chunk of that data as RDF triples.  Especially in experimental form, providing the whole lot as a download would have been a bit of a challenge, even just in disk space and bandwidth terms.  So which chunk to choose was a question.  We could have chosen a random selection, but decided instead to pick the most popular, in terms of holdings, resources in WorldCat – an interesting selection in its own right.

To make the cut, a resource had to be held by more than 250 libraries.  It turns out that almost 1.2 million fall into this category, so a sizeable chunk indeed.  To get your hands on this data, download the 1GB gzipped file.  It is in RDF N-Triples form, so you can take a look at the raw data in the file itself.  Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them.
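
For a first look at the raw data without installing a triplestore, a few lines of Python will do.  The sketch below is only an illustration under stated assumptions: the file name in the usage note is hypothetical, and the line splitting relies on the fact that in N-Triples the subject and predicate never contain spaces, so it is not a full N-Triples parser.  It counts how often each predicate appears, which gives a quick feel for what the data set contains.

```python
import gzip
from collections import Counter


def count_predicates(lines):
    """Count how many triples use each predicate.

    Each N-Triples line has the form: subject predicate object .
    Only the first two whitespace-separated fields are inspected,
    so spaces inside a literal object do not matter.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1].strip("<>")] += 1
    return counts


def count_predicates_in_file(path):
    """Stream a gzipped N-Triples file without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_predicates(handle)
```

Something like count_predicates_in_file("WorldCatMostHighlyHeld.nt.gz").most_common(20) (adjust the name to whatever your download is called) would then list the twenty most used predicates.  Expect it to take a while on a file of this size.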

Another area of questioning around the publication of WorldCat linked data has been licensing.  Both the RDFa embedded data and the download are published as open data under the Open Data Commons Attribution License (ODC-BY), with reference to the community norms put forward by the members of the OCLC cooperative who built WorldCat.  The theme of many of the questions has been along the lines of “I understand what the license says, but what does this mean for attribution in practice?”

To help clarify how you might attribute ODC-BY licensed WorldCat, and other OCLC linked data, we have produced attribution guidelines.  You can find these at http://www.oclc.org/data/attribution.html.  They address several scenarios, from documents containing WorldCat derived information to referencing WorldCat URIs in your linked data triples, suggesting possible ways to attribute the OCLC WorldCat source of the data.  As guidelines, they obviously cannot cover every possible situation which may require attribution, but hopefully they will cover most and can be adapted to other similar ones.

As I say in the press release, posted after my announcement, we are really interested to see what people will do with this data.  So let us know, and if you have any comments on any aspect of its markup, schema.org extensions, publishing, or on our attribution guidelines, drop us a line at data@oclc.org.

OCLC WorldCat Linked Data Release – Significant In Many Ways

Typical!  Since joining OCLC as Technology Evangelist, I have been preparing myself to be one of the first to blog about the release of linked data describing the hundreds of millions of bibliographic items in WorldCat.org. So where am I when the press release hits the net?  35,000 feet above the North Atlantic heading for LAX, that’s where – life just isn’t fair.

By the time I am checked in to my Anaheim hotel, ready for the ALA Conference, this will be old news.  Nevertheless it is significant news, significant in many ways.

OCLC have been at the leading edge of publishing bibliographic resources as linked data for several years.  At dewey.info they have been publishing the top levels of the Dewey classifications as linked data since 2009.  As announced yesterday, this has now been increased to encompass 32,000 terms, such as this one for the transits of Venus.  Also around for a few years is VIAF (the Virtual International Authority File) where you will find URIs published for authors, such as this well known chap.  These two were more recently joined by FAST (Faceted Application of Subject Terminology), providing usefully applicable identifiers for Library of Congress Subject Headings and combinations thereof.

Despite this leading position in the sphere of linked bibliographic data, OCLC has attracted some criticism over the years for not biting the bullet and applying it to all the records in WorldCat.org as well.  As today’s announcement now demonstrates, they have taken their linked data enthusiasm to the heart of their rich, publicly available, bibliographic resources – publishing linked data descriptions for the hundreds of millions of items in WorldCat.

Let me dissect the announcement a bit….

First significant bit of news – WorldCat.org is now publishing linked data for hundreds of millions of bibliographic items – that’s a heck of a lot of linked data by anyone’s measure. By far the largest linked bibliographic resource on the web. It is also linked data describing things that, for decades, librarians in tens of thousands of libraries all over the globe have been carefully cataloguing so that the rest of us can find out about them.  Just the sort of authoritative resources that will help stitch the emerging web of data together.

Second significant bit of news – the core vocabulary used to describe these bibliographic assets comes from schema.org.  Schema.org is the initiative backed by Google, Yahoo!, Microsoft, and Yandex, to provide a generic high-level vocabulary/ontology to help mark up structured data in web pages so that those organisations can recognise the things being described and improve the services they can offer around them.  A couple of examples being Rich Snippet results and inclusion in the Google Knowledge Graph.

As I reported a couple of weeks back, from the Semantic Tech & Business Conference, some 7-10% of indexed web pages already contain schema.org markup, as microdata or RDFa.  It may at first seem odd for a library organisation to use a generic web vocabulary to mark up its data – but just think who the consumers of this data are, and what vocabularies are they most likely to recognise?  Just for starters, embedding schema.org data in WorldCat.org pages immediately makes them understandable by the search engines, vastly increasing the findability of these items.

Third significant bit of news – the linked data is published both in human readable form and in machine readable RDFa on the standard WorldCat.org detail pages.  You don’t need to go to a special version or interface to get at it; it is part of the normal interface. As you can see from the screenshot of a WorldCat.org item above, there is now a Linked Data section near the bottom of the page. Click and open up that section to see the linked data in human readable form.  You will see the structured data that the search engines and other systems will get from parsing the RDFa encoded data within the HTML that creates the page in your browser.  Not very pretty to human eyes I know, but just the kind of structured data that systems love.

Fourth significant bit of news – OCLC are proposing to cooperate with the library and wider web communities to extend Schema.org, making it even more capable for describing library resources.  With the help of the W3C, Schema.org is working with several industry sectors to extend the vocabulary to be more capable in their domains – news and e-commerce being a couple of already accepted examples.  OCLC is playing its part in doing this for the library sector.

Take a closer look at the markup on WorldCat.org and you will see attributes from a library vocabulary, such as library:holdingsCount and library:oclcnum.  This library vocabulary is OCLC’s conversation starter, with which we want to kick off discussions with interested parties, from the library and other sectors, about proposing a basic extension to schema.org for library data.  What better way of testing out such a vocabulary than to mark up several million records with it, publish them, and see what the world makes of them?

Fifth significant bit of news – the WorldCat.org linked data is published under an Open Data Commons (ODC-BY) license, so it will be openly usable by many for many purposes.

Sixth significant bit of news – This release is an experimental release.  This is the start, not the end, of a process.  We know we have not got this right yet.  There are more steps to take around how we publish this data in ways in addition to RDFa markup embedded in page HTML – not everyone can, or will want to, parse pages to get the data.  There are obvious areas for discussion around the use of schema.org and the proposed library extension to it.  There are areas for discussion about the application of the ODC-BY license and the attribution requirements it asks for.  Over the coming months OCLC wants to constructively engage with all who are interested in this process.  It is only with the help of the library and wider web communities that we can get it right.  In that way we can ensure that WorldCat linked data can be beneficial for the OCLC membership, libraries in general, and a great resource on the emerging web of data.

For more information about this release, check out the background to linked data at OCLC, join the conversation on the OCLC Developer Network, or email data@oclc.org.

As you can probably tell I am fairly excited about this announcement.  This, and future stuff like it, are behind some of my reasons for joining OCLC.  I can’t wait to see how this evolves and develops over the coming months.  I am also looking forward to engaging in the discussions it triggers.

Surfacing at Semtech San Francisco

So where have I been?  I announce that I am now working as a Technology Evangelist for the library behemoth OCLC, and then promptly disappear.  The only excuse I have for deserting my followers is that I have been kind of busy getting my feet under the OCLC table, getting to know my new colleagues, the initiatives and projects they are engaged with, the longer term ambitions of the organisation, and of course the more mundane issues of getting my head around the IT, video conferencing, and expense claim procedures.

It was therefore great to find myself in San Francisco once again for the Semantic Tech & Business Conference (#SemTechBiz) for what promises to be a great program this year.  Apart from meeting old and new friends amongst those interested in the potential and benefits of the Semantic Web and Linked Data, I am hoping for a further step forward in the general understanding of how this potential can be realised to address real world challenges and opportunities.

As Paul Miller reported, the opening session contained an audience with 75% first time visitors.  Just like the cityscape vista presented to those attending the speakers reception yesterday on the 45th floor of the conference hotel, I hope these new visitors get a stunningly clear view of the landscape around them.

Of course I am doing my bit to help on this front by trying to cut through some of the more technical geek-speak. Tuesday 8:00am will find me in Imperial Room B presenting The Simple Power of the Link – a 30 minute introduction to Linked Data, its benefits and potential, without the need to get your head around the more esoteric concepts of Linked Data such as triple stores, inference, ontology management etc.  I would not only recommend this session as an introduction for those new to the topic, but also for those well versed in the technology, as a reminder that we sometimes miss the simple benefits when trying to promote our baby.

For those interested in the importance of these techniques and technologies to the world of Libraries, Archives and Museums I would also recommend a panel that I am moderating on Wednesday at 3:30pm in Imperial B – Linked Data for Libraries Archives and Museums.  I will be joined by LOD-LAM community driver Jon Voss, Stanford Linked Data Workshop Report co-author Jerry Persons, and Sung Hyuk Kim from the National Library of Korea.  As moderator I will not only let the four of us make small presentations about what is happening in our worlds, but will also insist that at least half the time is kept for questions from the floor, so bring them along!

I am not only surfacing at Semtech, I am beginning to see, at last, the technologies being discussed surfacing as mainstream.  We in the Semantic Web/Linked Data world are very good at frightening off those new to it.  However, driven by pragmatism in search of a business model and initiatives such as Schema.org, it is starting to become mainstream by default.  One very small example being Yahoo!’s Peter Mika telling us, in the Semantic Search workshop, that RDFa is the predominant format for embedding structured data within web pages.

Looking forward to a great week, and soon more time to get back to blogging!

Richard Wallis Joins OCLC

You may have noticed this press release, Richard Wallis joins OCLC staff as Technology Evangelist, today from OCLC.

I have already had some feedback on this move from several people who, almost without exception, have told me that they think it is a good move for both OCLC and myself. Which is good, as I agree with them 😉

I have also had several questions about it, mostly beginning with the words why or what.  I thought I would share my answers here to give some background.

Why a library organisation? – I thought you were trying to move away from libraries.
I have been associated with the library sector since joining BLCMP in 1990 to help them build a new library management system which they christened Talis.  As Talis, the company named after the library system, evolved and started to look at new Web influenced technologies to open up possibilities for managing and publishing library data, they and I naturally gravitated towards Semantic Web technologies and their pragmatic use in a way that became known as Linked Data.

Even though the Talis Group transferred their library division to Capita last year, that natural connection between library data and linked data principles meant that the association remained for me, despite having no direct connection with the development of the systems to run libraries.  Obvious examples of this were the Linked Data and Libraries events I ran in London with Talis and the work with the British Library to model and publish the British National Bibliography.  So even if I wanted to get away from libraries I believe it would be a fruitless quest, I think I am stuck with them!

Why OCLC? – Didn’t you spend a lot of time criticising them?
I can own up to several blog posts a few years back where I either criticised them for not being as open as I thought they could be, or questioned their business model at the time.  However I have always respected their mission to promote libraries and their evolution.  In my time chairing and hosting the Library 2.0 Gang, and in individual podcasts, I hope that I demonstrated a fairness that I always aspire towards, whilst not shying away from the difficult questions.  I have watched OCLC, and the library community they are part of, evolve over many years towards a position and vision that encompasses many of the information sharing principles and ambitions I hold.  In the very short amount of time I have already spent talking with my new colleagues it is clear that they are motivated towards making best use of data for the benefit of their members, libraries in general, and the people they serve – which is all of us.

Oh and yes, they have a great deal of data which has huge potential on the Linked Data Web and it will be great to be a part of realising at least some of that potential.

What about Data Liberate? – Are you going to continue with that?
I set up Data Liberate with a dual purpose.  Firstly, to promote myself as a consultant to help people and organisations realise the value in their data.  Secondly, to provide a forum and focus for sharing, commenting upon, and discussing issues, ideas, events, and initiatives relevant to Open, Linked, Enterprise, and Big data.  Obviously the first of these is now not relevant, but I do intend to maintain Data Liberate to fulfil that second purpose.  I may not be posting quite as often, but I do intend to highlight and comment upon things of relevance in the broad landscape of data issues, regardless of whether they are library focussed or not.

What are you going to be doing at OCLC?
My title is Technology Evangelist, and there is a great deal of evangelism needed – promoting, explaining, and demystifying the benefits of Linked Data to libraries and librarians.  This stuff is very new to a large proportion of the library sector, and not surprisingly there is some scepticism about it.  It would be easy to view announcements from organisations such as the British Library, Library of Congress, Europeana, Stanford University, OCLC, and many many more, as a general acceptance of a Linked Data library vision.  Far from it.  I am certain that a large proportion of librarians are not aware of the potential benefits of Linked Data for their world, or even why they should be aware.  So you will find me on stage at an increasing number of OCLC and wider library sector events, doing my bit to spread the word.

Like all technologies and techniques, Linked Data does not sit in isolation and there are obvious connections with the OCLC WorldShare Platform, which is providing shared web based services for managing libraries and their data.  I will also be spending some time evangelising the benefits of this approach.

Aside from evangelising I will be working with people.  Working with the teams within OCLC as they coordinate and consolidate their approach to applying Linked Data principles across the organisation.  Working with them as they evolve the way OCLC will publish data to libraries and the wider world.  Working with libraries to gain their feedback.  Working with the Linked Data and Semantic Web community to gain feedback as to the way to publish that data in a way that not only serves the library community, but also serves all across the emerging Web of Data.  So you will continue to find me on stage at events such as the Semantic Tech and Business Conference, doing my bit to spread the word, as well as engaging directly with the community.

Why libraries? – Aren’t they a bit of a Linked Data niche?
I believe that there are two basic sorts of data being published on the [Linked Data] web – backbone data and the non-backbone data the value of which is greatly increased by linking to the backbone.

By backbone data I mean things like: DBpedia with its identifiers for almost every ‘thing’; government data with authoritative identifiers for laws, departments, schools, etc.; mapping organisations, such as Ordnance Survey, with authoritative identifiers for post codes etc.  By linking your dataset’s concepts to these backbone sources, you immediately increase its usefulness and ability to link and merge with other data linked in the same way.  I believe that the descriptions of our heritage and achievements, both scientific and artistic, held by organisations such as our national, academic, and public libraries are a massive resource that has the opportunity to form a very significant vertebra on that backbone.

Hopefully some of the above will help in the understanding of the background and motivations behind this new and exciting phase of my career.  These opinions and ambitions for the evolution of data on the web, and in the enterprise, are all obviously mine, so do not read in to them any future policy decisions or directions for my new employer.  Suffice to say I will not be leaving them at the door. Neither will I cast off my approach to pragmatically solving problems in the real world by evolving towards a solution recognising that the definition of the ideal changes over time and with circumstance.