Marketing Hype! I hear you thinking – well at least I didn’t use the tired old ‘Next Generation’ label.
Let me explain what this fundamental component is, of what I am potentially seeing as a New Web, and what I mean by New Web.
This fundamental component, you might be surprised to learn, is a vocabulary – Schema.org. But let me first set the context by explaining my thoughts on this New Web.
Having once been considered an expert on Web 2.0 (I hasten to add by others, not myself) I know how dangerous it can be to attach labels to things. It tends to spawn screenfuls of passionate opinions on the relevance of the name, the date of the revolution, and over-detailed analysis of isolated parts of what is a general movement. I know I am on dangerous ground here!
To my mind something is new when it feels different. The Internet felt different when the Web (aka HTTP + HTML + browsers) arrived. The Web felt different (Web 2.0?) when it became more immersive (write as well as read) and visually we stopped trying to emulate in a graphical style what we saw on character terminals. Oh, and yes we started to round our corners.
There have been many times over the last few years when it felt new – when it suddenly arrived in our pockets (the mobile web) – when the inner thoughts, and eating habits, of more friends than you ever remember meeting became of apparent headline importance (the social web) – when [the contents of] the web broke out of the boundaries of the browser and appeared embedded in every app, TV show, and voice activated device.
The feeling different phase I think we are going through at the moment, like previous times, is building on what went before. It is exemplified by information [data] breaking out of the boundaries of our web sites and appearing where it is useful for the user.
We are seeing the tip of this iceberg in the search engine Knowledge Panels, answer boxes, and rich snippets. The effect of this is that your potential user can often get what they need without having to find and visit your site – answering questions such as: what is the customer service phone number for an organisation; is the local branch open at the moment; give me driving directions to it; what is available and on offer. Increasingly these interactions can occur without the user even being aware they are using the web – “Siri! Where is my nearest library?” This is a great way to build relationships with your customers, but it is also a new and interesting challenge for those trying to measure the impact of your web site.
So, what is fundamental to this New Web?
There are several things – HTTP, the light-weight protocol designed to transfer text, links and latterly data, across an internet previously used to specific protocols for specific purposes – HTML, that open, standard, easily copied light-weight extensible generic format for describing web pages that all browsers can understand – Microdata, RDFa, JSON, JSON-LD – open standards for easily embedding data into HTML – RDF, an open data format for describing things of any sort, in the form of triples, using shared vocabularies. Building upon those is Schema.org – an open, [de facto] standard, generic vocabulary for describing things in most areas of interest.
Why is one vocabulary fundamental when there are so many others to choose from? Check out the 500+ referenced on the Linked Open Vocabularies (LOV) site. Schema.org however differs from most of the others in a few key areas:
Size and scope – its current 642 Types and 992 Properties make it significantly larger, and covering far more domains of interest, than most others. This means that if you are looking to describe something, you are highly likely to find enough to at least start. Despite its size, it is still far from capable of describing everything on, or off, the planet.
Evolution – it is under continuous evolutionary development and extension, driven and guided by an open community under the wing of the W3C and accessible in a GitHub repository.
Flexibility – from the beginning Schema.org was designed to be used in your choice of serialisation – Microdata, RDFa, JSON-LD – with the flexibility of allowing values to default to text if you have not got a URI available.
Consumers – The major search engines Google, Bing, Yahoo!, and Yandex, not only back the open initiative behind Schema.org but actively search out Schema.org markup to add to their Knowledge Graphs when crawling your sites.
Guidance – If you search out guidance on supplying structured data to those major search engines, you are soon supplied with recommendations and examples for using Schema.org, such as this from Google. They even supply testing tools for you to validate your markup.
With this support and adoption, the Schema.org initiative has become self-fulfilling. If your objective is to share or market structured data about your site, organisation, resources, and/or products with the wider world, it would be difficult to come up with a good reason not to use Schema.org.
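To make the Flexibility point above a little more concrete, here is a minimal sketch in Python of a JSON-LD description where a value is a linked entity when a URI is known and falls back to plain text when it is not. The helper names (value_or_text, describe_book) and the author label are illustrative assumptions, not part of Schema.org or any library.

```python
import json


def value_or_text(label, uri=None):
    """Return a linked entity when a URI is known, otherwise plain text.

    Hypothetical helper: Schema.org lets a property value default to text.
    """
    if uri:
        return {"@type": "Person", "@id": uri, "name": label}
    return label


def describe_book(name, author_label, author_uri=None):
    """Build a small Schema.org Book description as a JSON-LD dictionary."""
    return {
        "@context": "http://schema.org",
        "@type": "Book",
        "name": name,
        "author": value_or_text(author_label, author_uri),
    }


# Illustrative values only.
linked = describe_book(
    "Zen and the art of motorcycle maintenance",
    "Robert Pirsig",
    "http://viaf.org/viaf/78757182",
)
plain = describe_book("Zen and the art of motorcycle maintenance", "Robert Pirsig")

print(json.dumps(linked, indent=2))
print(json.dumps(plain, indent=2))
```

Both outputs use the same vocabulary; the only difference is whether the author is a thing with an identifier or just a string.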
Is it a fully ontologically correct semantic web vocabulary? Although you can see many semantic web and linked data principles within it, no it is not. That is not its objective. It is a pragmatic compromise between such things and the general needs of webmasters with ambitions to have their resources become an authoritative part of the global knowledge graphs that are emerging as key to the future development of search engines and the web they inhabit.
Note that I question if Schema.org is a fundamental component of what I am feeling is a New Web. It is not the fundamental component, but one of many that over time will become just the way we do things.
About a month ago Version 2.0 of the Schema.org vocabulary hit the streets.
This update includes loads of tweaks, additions and fixes that can be found in the release information. The automotive folks have got new vocabulary for describing Cars, including useful properties such as numberOfAirbags, fuelEfficiency, and knownVehicleDamages. The new property mainEntityOfPage (and its inverse, mainEntity) provides the ability to tell the search engine crawlers which thing a web page is really about. With the new type ScreeningEvent to support movie/video screenings, and a gtin12 property for Product, amongst others, there is much useful stuff in there.
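As a rough illustration of how some of these 2.0 terms might be used together, the sketch below builds a Car description in Python and wraps it in the script element used to embed JSON-LD in a page. The car details, the example.com URL, and the as_script_tag helper are all made up for the example.

```python
import json

# Hypothetical car description using terms introduced in Schema.org 2.0.
car = {
    "@context": "http://schema.org",
    "@type": "Car",
    "name": "Example hatchback",
    "numberOfAirbags": 6,
    "knownVehicleDamages": "Small scratch on rear bumper",
    # Tells a crawler which page this Car is the main subject of.
    "mainEntityOfPage": "http://example.com/cars/example-hatchback",
}


def as_script_tag(data):
    """Wrap a JSON-LD dictionary in a script element for embedding in HTML."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(as_script_tag(car))
```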
But does this warrant the version number clicking over from 1.xx to 2.0?
These new types and properties are only the tip of the 2.0 iceberg. There is a heck of a lot of other stuff going on in this release apart from these additions. Some of it is in the vocabulary itself, some of it in the potential, documentation, supporting software, and organisational processes around it.
Sticking with the vocabulary for the moment, there has been a bit of cleanup around property names. As the vocabulary has grown organically since its release in 2011, inconsistencies and conflicts between different proposals have been introduced. So part of the 2.0 effort has included some rationalisation. For instance the Code type is being superseded by SoftwareSourceCode – the term code has many different meanings, many of which have nothing to do with software; surface has been superseded by artworkSurface and area is being superseded by serviceArea, for similar reasons. Check out the release information for full details. If you are using any of the superseded terms there is no need to panic, as the original terms are still valid but with updated descriptions to indicate that they have been superseded. However you are encouraged to move towards the updated terminology as convenient. The question of what is in which version brings me to an enhancement to the supporting documentation. Starting with Version 2.0 there will be published a snapshot view of the full vocabulary – here is http://schema.org/version/2.0. So if you want to refer to a term at a particular version you now can.
How often is Schema being used? – is a question often asked. A new feature has been introduced to give you some indication. Check out the description of one of the newly introduced properties, mainEntityOfPage, and you will see the following: ‘Usage: Fewer than 10 domains‘. Unsurprisingly for a newly introduced property, there is virtually no usage of it yet. If you look at the description for the type this term is used with, CreativeWork, you will see ‘Usage: Between 250,000 and 500,000 domains‘. Not a direct answer to the question, but a good and useful indication of the popularity of a particular term across the web.
This refers to the introduction of the functionality, on the Schema.org site, to host extensions to the core vocabulary. The motivation for this new approach to extending is explained thus:
Schema.org provides a core, basic vocabulary for describing the kind of entities the most common web applications need. There is often a need for more specialized and/or deeper vocabularies, that build upon the core. The extension mechanisms facilitate the creation of such additional vocabularies.
With most extensions, we expect that some small frequently used set of terms will be in core schema.org, with a long tail of more specialized terms in the extension.
As yet there are no extensions published. However, there are some on the way.
As Chair of the Schema Bib Extend W3C Community Group I have been closely involved with a proposal by the group for an initial bibliographic extension (bib.schema.org) to Schema.org. The proposal includes new Types for Chapter, Collection, Agent, Atlas, Newspaper & Thesis, CreativeWork properties to describe the relationship between translations, plus types & properties to describe comics. I am also following the proposal’s progress through the system – a bit of a learning exercise for everyone. Hopefully I can share the news in the not too distant future that bib will be one of the first released extensions.
W3C Community Group for Schema.org
A subtle change in the way the vocabulary, its proposals, extensions and direction can be followed and contributed to has also taken place. The creation of the Schema.org Community Group has now provided an open forum for this.
So is 2.0 a bit of a milestone? Yes taking all things together I believe it is. I get the feeling that Schema.org is maturing into the kind of vocabulary supported by a professional community that will add confidence to those using it and recommending that others should.
One of the greatest challenges in my evangelism of the benefits of using Schema.org for sharing data about resources via the web is that it is difficult to ‘show’ what is going on.
The scenario goes something like this…..
“Using the Schema.org vocabulary, you embed data about your resources in the HTML that makes up the page using either microdata or RDFa….”
At about this time you usually display a slide showing html code with embedded RDFa. It may look pretty but the chances of more than a few of the audience being able to pick out the schema:Book or sameAs or rdf:type elements out of the plethora of angle brackets and quotes swimming before their eyes is fairly remote.
Having asked them to take a leap of faith that the gobbledegook you have just presented them with is not only simple to produce but also invisible to users viewing their pages – “but not to Google, which harvests that meaningful structured data from within your pages” – you ask them to take another leap [of faith].
You ask them to take on trust that Google is actually understanding, indexing and using that structured data. At this point you start searching for suitable screen shots of Google Knowledge Graph to sit behind you whilst you hypothesise about the latest incarnation of their all-powerful search algorithm, and how they imply that they use the Schema.org data to drive so-called Semantic Search.
I enjoy a challenge, but I also like to find a better way sometimes.
When OCLC first released Linked Data in WorldCat they very helpfully addressed the first of these issues by adding a visual display of the Linked Data to the bottom of each page. This made my job far easier!
But it has a couple of downsides. Firstly it is not the prettiest of displays and is only really of use to those interested in ‘seeing’ Linked Data. Secondly, I believe it creates an impression to some that, if you want Google to grab structured data about resources, you need to display a chunk of gobbledegook on your pages.
That simple way to easily show someone the data embedded in a page is a great aid to understanding for those new to the concept. But that is not all. This excellent little extension has a couple of extra tricks up its sleeve.
It includes a visualisation of the [Linked Data] graph of relationships – the structure of the data. Clicking on any of the nodes of the display causes the value of the subject, predicate, or object it represents to be displayed below the image and the relevant row(s) in the list of triples to be highlighted. As well as all this, there is a ‘Show Turtle’ button, which does just as you would expect, opening up a window in which it has translated the triples into Turtle – Turtle being (after a bit of practice) the more human friendly way of viewing or creating RDF.
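For anyone curious what translating triples into Turtle involves, here is a deliberately tiny sketch in Python. It is not how Green Turtle does it; it is a hand-rolled illustration that only copes with full URIs and simple string literals, and the example.org subject URI is made up.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def term(value, is_uri):
    """Format a URI as <...> or a plain string as a quoted, escaped literal."""
    if is_uri:
        return "<" + value + ">"
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return '"' + escaped + '"'


def to_turtle(triples):
    """Group (subject, predicate, object, object_is_uri) tuples by subject."""
    by_subject = {}
    for s, p, o, o_is_uri in triples:
        by_subject.setdefault(s, []).append((p, o, o_is_uri))
    blocks = []
    for s, pairs in by_subject.items():
        parts = []
        for p, o, o_is_uri in pairs:
            # Turtle allows "a" as shorthand for rdf:type.
            predicate = "a" if p == RDF_TYPE else term(p, True)
            parts.append(predicate + " " + term(o, o_is_uri))
        blocks.append(term(s, True) + " " + " ;\n    ".join(parts) + " .")
    return "\n\n".join(blocks)


# Hypothetical triples describing one book.
triples = [
    ("http://example.org/book/1", RDF_TYPE, "http://schema.org/Book", True),
    ("http://example.org/book/1", "http://schema.org/name",
     "Zen and the art of motorcycle maintenance", False),
]

print(to_turtle(triples))
```

The same two triples become one subject followed by a semicolon-separated list of predicate and object pairs, which is much of what makes Turtle easier on the eye.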
Green Turtle is a useful little tool which I would recommend to visualise microdata and RDFa, be it using the Schema.org vocabulary or not. I am already using it on WorldCat in preference to scrolling to the bottom of the page to click the Linked Data tab.
Custom Searches that know about Schema!
Google have recently enhanced the functionality of their Custom Search Engine (CSE) to enable searching by Schema.org Types. Try out this example CSE which only returns results from WorldCat.org which have been described in their structured data as being of type schema:Book.
A simple yet powerful demonstration that not only are Google harvesting the Schema.org Linked Data from WorldCat, but they are also understanding it and are visibly using it to drive functionality.
I have just been sharing a platform, at the OCLC EMEA Regional Council Meeting in Cape Town, South Africa, with my colleague Ted Fons. A great setting for a great couple of days of the OCLC EMEA membership and others sharing thoughts, practices, collaborative ideas and innovations.
Ted and I presented our continuing insight into The Power of Shared Data, and the evolving data strategy for the bibliographic data behind WorldCat. If you want to see a previous view of these themes you can check out some recordings we made late last year on YouTube, from Ted – The Power of Shared Data – and me – What the Web Wants.
Today, demonstrating on-going progress towards implementing the strategy, I had the pleasure to preview two upcoming significant announcements on the WorldCat data front:
The release of 194 Million Open Linked Data Bibliographic Work descriptions
The WorldCat Linked Data Explorer interface
A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work. The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary. In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, oclc numbered, editions already shared in WorldCat. Let’s take a look at one – try this: http://worldcat.org/entity/work/id/12477503
You will see, displayed in the new WorldCat Linked Data Explorer, a html view of the data describing ‘Zen and the art of motorcycle maintenance’. Click on the ‘Open All’ button to view everything. Anyone used to viewing bibliographic data will see that this is a very different view of things. It is mostly URIs, the only visible strings being the name or description elements. This is not designed as an end-user interface, it is designed as a data exploration tool. This is highlighted by the links at the top to alternative RDF serialisations of the data – Turtle, N-Triple, JSON-LD, RDF/XML.
Why is this a preview? Can I usefully use the data now? These are a couple of obvious questions for you to ask at this time.
This is the first production release of WorldCat infrastructure delivering linked data. It is the first step in what will be an evolutionary, and revolutionary, journey to provide interconnected linked data views of the rich entities (works, people, organisations, concepts, places, events) captured in the vast shared collection of bibliographic records that makes up WorldCat. Mining those 311+ million records is not a simple task, even just to identify works. It takes time, and a significant amount of [Big Data] computing resources. One of the key steps in this process is to identify where connections exist between works and authoritative data hubs, such as VIAF, FAST, LCSH, etc. In this preview release, it is some of those connections that are not yet in place.
What you see in their place at the moment is a link to, what can be described as, a local authority. These are exemplified by what the data geeks call a hash-URI as its identifier. http://experiment.worldcat.org/entity/work/data/12477503#Person/pirsig_robert for example is such an identifier, constructed from the work URI and the person name. Over the next few weeks, where the information is available, you would expect to see this link replaced by a connection to VIAF, such as this: http://viaf.org/viaf/78757182.
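To show what a hash-URI is made of, here is a small sketch in Python. The slug rule (lower-case the name and replace anything that is not a letter or digit with an underscore) is purely my assumption for illustration; it happens to reproduce the identifier above, but it is not a statement of how WorldCat builds these URIs.

```python
import re
from urllib.parse import urldefrag


def local_authority_uri(work_data_uri, entity_type, name):
    """Build a hash-URI from a work URI, an entity type and a name.

    The slug rule here is an assumed, illustrative one.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return work_data_uri + "#" + entity_type + "/" + slug


uri = local_authority_uri(
    "http://experiment.worldcat.org/entity/work/data/12477503",
    "Person",
    "Pirsig, Robert",
)
print(uri)

# The part after the '#' is the fragment; a client requests the document
# URI and then looks for the fragment within the data it gets back.
document_uri, fragment = urldefrag(uri)
print(document_uri)
print(fragment)
```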
So, can I use the data? – Yes, the data is live, and most importantly the work URIs are persistent. It is also available under an open data license (ODC-BY).
In a very few weeks, once the next update to the WorldCat linked data has been processed, you will find that links to works will be embedded in the already published linked data. For example you will find the following in the data for OCLC number 53474380:
What is next on the agenda? As described, within a few weeks, we expect to enhance the linking within the descriptions and provide links from the oclc numbered manifestations. From then on, both WorldCat and others will start to use WorldCat Work URIs, and their descriptions, as a core, stable foundation on which to build out a web of relationships between entities in the library domain. It is that web of data that will stimulate the sharing of data and innovation in the design of applications and interfaces consuming the data over coming months and years.
As I said on the program today, we are looking for feedback on these releases.
We as a community are embarking on a new journey with shared, linked data at its heart. Its success will be based upon how that data is exposed, used, and the intrinsic quality of that data. Experience shows that a new view of data often exposes previously unseen issues; it is just that sort of feedback we are looking for. So any feedback on any aspect of this will be more than welcome.
I am excitedly looking forward to being able to comment further as this journey progresses.
I am pleased to share with you a small but significant step on the Linked Data journey for WorldCat and the exposure of data from OCLC.
Content-negotiation has been implemented for the publication of Linked Data for WorldCat resources.
For those immersed in the publication and consumption of Linked Data, there is little more to say. However I suspect there are a significant number of folks reading this who are wondering what the heck I am going on about. It is a little bit techie but I will try to keep it as simple as possible.
Back last year, a linked data representation of each (of the 290+ million) WorldCat resources was embedded in its web page on the WorldCat site. For full details check out that announcement but in summary:
All resource pages include Linked Data
Human visible under a Linked Data tab at the bottom of the page
That same data is now available in several machine readable RDF serialisations. RDF is RDF, but dependent on your use it is easier to consume as RDFa, or XML, or JSON, or Turtle, or as triples.
In many Linked Data presentations, including some of mine, you will hear the line “As I clicked on the link in a web browser we are seeing a html representation. However if I was a machine I would be getting XML or another format back.” Content-negotiation is the mechanism in the http protocol that makes that happen.
Let me take you through some simple steps to make this visible for those that are interested.
Starting with a resource in WorldCat: http://www.worldcat.org/oclc/41266045. Clicking that link will take you to the page for Harry Potter and the prisoner of Azkaban. As we did not indicate otherwise, the content-negotiation defaulted to returning the html web page.
To specify that we want RDF/XML we would specify http://www.worldcat.org/oclc/41266045.rdf (dependent on your browser this may not display anything, but allow you to download the result to view in your favourite editor)
This allows you to manually specify the serialisation format you require. You can also do it from within a program by specifying, to the http protocol, the format that you would accept from accessing the URI. This means that you do not have to write code to add the relevant suffix to each URI that you access. You can replicate the effect by using curl, a command line http client tool:
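The same idea can be sketched from within a program, here in Python using only the standard library. The two approaches are shown side by side: adding a suffix to the URI, and asking for a format through the http Accept header. The media types are the standard registered ones for each serialisation; which of them the WorldCat service actually honours is an assumption on my part, and the function names are just for illustration.

```python
from urllib.request import Request, urlopen

WORLDCAT_PREFIX = "http://www.worldcat.org/oclc/"

# Standard media types; support for each by the service is assumed.
ACCEPT = {
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
    "jsonld": "application/ld+json",
    "ntriples": "application/n-triples",
}


def suffix_url(oclc_number, suffix="rdf"):
    """Manual approach: ask for a serialisation by adding a suffix."""
    return WORLDCAT_PREFIX + str(oclc_number) + "." + suffix


def build_request(oclc_number, fmt):
    """Content-negotiation approach: same URI, format named in Accept."""
    return Request(
        WORLDCAT_PREFIX + str(oclc_number),
        headers={"Accept": ACCEPT[fmt]},
    )


def fetch(oclc_number, fmt, timeout=10):
    """Send the request and return the body as text (needs network access)."""
    with urlopen(build_request(oclc_number, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_request(41266045, "turtle")
print(request.full_url, request.get_header("Accept"))
```

Calling fetch(41266045, "turtle") would then make the actual request; it is left uncalled here so the sketch runs without a network connection.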
If you embed links to WorldCat resources in your linked data, the standard tools used to navigate around your data should now be able to automatically follow those links into and around WorldCat data. If you have the URI for a WorldCat resource, which you can create by prefixing an oclc number with ‘http://www.worldcat.org/oclc/’, you can use it in a program, browser plug-in, smartphone/facebook app to pull data back, in a format that you prefer, to work with or display.
Go have a play, I would love to hear how people use this.
Show me an example of the effective publishing of Linked Data – That, or a variation of it, must be the request I receive more than most when talking to those considering making their own resources available as Linked Data, either in their enterprise, or on the wider web.
There are some obvious candidates. The BBC for instance, makes significant use of Linked Data within its enterprise. They built their fantastic Olympics 2012 online coverage on an infrastructure with Linked Data at its core. Unfortunately, apart from a few exceptions such as Wildlife and Programmes, we only see the results in a powerful web presence. The published data is only visible within their enterprise.
DBpedia is another excellent candidate. From about 2007 it has been a clear demonstration of Tim Berners-Lee’s principles of using URIs as identifiers and providing information, including links to other things, in RDF – it is just there at the end of the DBpedia URIs. But for some reason developers don’t seem to see it as a compelling example. Maybe it is influenced by the Wikipedia effect – interesting but built by open data geeks, so not to be taken seriously.
A third example, which I want to focus on here, is Ordnance Survey. Not generally known much beyond the geographical patch they cover, Ordnance Survey is the official mapping agency for Great Britain. Formally a government agency, they are best known for their incredibly detailed and accurate maps that are the standard accessory for anyone doing anything in the British countryside. A little less known is that they also publish information about post-code areas, parish/town/city/county boundaries, parliamentary constituency areas, and even European regions in Britain. As you can imagine, these all don’t neatly intersect, which makes the data about them a great case for a graph based data model and hence for publishing as Linked Data. Which is what they did a couple of years ago.
The reason I want to focus on their efforts now, is that they have recently beta released a new API suite, which I will come to in a moment. But first I must emphasise something that is often missed.
Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’. With only standard [http] web protocols, you can get the data for an entity in their dataset by just doing a http GET request on the identifier (e.g. for my local village: http://data.ordnancesurvey.co.uk/id/7000000000002929). What you get back is some nicely formatted html for your web browser, and with content negotiation you can get the same thing as RDF/XML, JSON or Turtle. As it is Linked Data, what you get back also includes links to other data, enabling you to navigate your way around their data from entity to entity.
An excellent demonstration of the basic power and benefit of Linked Data. So why is this often missed? Maybe it is because there is nothing to learn, no API documentation required, you can see and use it by just entering a URI into your web browser – too simple to be interesting perhaps.
To get at the data in more interesting and complex ways you need the API set thoughtfully provided by those that understand the data and some of the most common uses for it, Ordnance Survey.
The API set, now in beta, in my opinion is a most excellent example of how to build, document, and provide access to Linked Data assets in this way.
Firstly the APIs are applied as a standard to four available data sets – three individual, and one combining all three data sets. Nice that you can work with an individually focussed set or get data from all in a consolidated graph.
There are four APIs:
SPARQL – an endpoint for running SPARQL queries over a dataset.
Lookup – a simple way to extract an RDF description of a single resource, using its URI.
Search – for running keyword searches over a dataset.
Reconciliation – a simple web service that supports linking of datasets to the Ordnance Survey Linked Data.
Each API is available to play with on a web page complete with examples and pop-up help hints. It is very easy and quick to get your head around the capabilities of the individual APIs, the use of parameters, and returned formats without having to read documentation or cut a single line of code.
For a quick intro there is even a page with them all on for you to try. When you do get around to cutting code, the documentation for each API is also well presented in simple and understandable form. They even include details of the available output formats and expected http response codes.
Finally a few general comments.
Firstly the look, feel, and performance of the site reflects that this is a robust, serious, professional service and fills you with confidence about building your application on its APIs. Developers of services and APIs, even for internal use, often underestimate the value of presenting and documenting their offering in a professional way. How often have you come across API documentation that makes the first web page look modern, and wondered about investing the time in even looking at it? Also a site with a snappy response ups your confidence that your application will perform well when using their service.
Secondly the range of APIs, all cleanly and individually satisfying specific general needs. So for instance you can usefully use Search and Lookup without having any understanding of RDF or SPARQL – the power of SPARQL being there only if you understand and need it.
The additional features – CORS Support and Response Caching – (detailed on the API documentation pages) also demonstrate that this service has been built with the issues of the data consumer in mind. Providing the tools for consumers to take advantage of web caching in their application will greatly enhance response and performance. The CORS Support enables the creation of in browser applications that draw data from many sites – one of the oft promoted benefits of linked data, but sometimes a little tricky to implement ‘in browser’.
I can see this site and its associated APIs greatly enhancing the reputation of Ordnance Survey; underpinning the development of many apps and applications; and becoming an ideal source for many people to go ‘to try out’, when writing their first API consuming application code.
As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series. That one – Entification, and the next in the series – Beacons of Availability, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Hubs of Authority
Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years. The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world either in national libraries or cooperative organisations. Two from personal experience come to mind: BLCMP, started in Birmingham, UK in 1969, eventually evolved into the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC. Both in their own way have had significant impact on the worlds of libraries, metadata, and the web (and me!).
One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records. A large number of national libraries have such lists of agreed formats for author and organisational names. The Library of Congress has, in addition to its name authorities, subjects, classifications, languages, countries etc. Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations.
These authority files play a major role in the efficient cataloguing of material today, either by being part of the workflow in a cataloguing interface, or often just using the wonders of Windows ^C & ^V keystroke sequences to transfer agreed format text strings from authority sites into Marc record fields.
It is telling that the default [librarian] description of these things is a file – an echo back to the days when they were just that, a file containing a list of names. Almost despite their initial purpose, authorities are gaining a wider purpose: as a source of names for, and growing descriptions of, the entities that the library world is aware of. Many authority file hosting organisations have followed the natural path, in this emerging world of Linked Data, to provide persistent URIs for each concept plus publishing their information as RDF.
These Linked Data enabled sources of information are developing importance in their own right, as a natural place to link to when asserting the thing, person, or concept you are identifying in your data. As Sir Tim Berners-Lee’s fourth principle of Linked Data tells us: “Include links to other URIs, so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other. A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.
Libraries and librarians have a great brand image, something that attaches itself to the data and services they publish on the web. Respected and trusted are a couple of words that naturally associate with bibliographic authority data emanating from the library community. This data, starting to add value to the wider web, comes from those Marc records I spoke about last time. Yet it does not, as yet, lead those navigating the web of data to those resources so carefully catalogued. In this case, instead of cataloguing so people can find stuff, we could be considered to be enriching the web with hubs of authority derived from, but not connected to, the resources that brought them into being.
So where next? One obvious move, that is already starting to take place, is to use the identifiers (URIs) for these authoritative names to assert within our data, facts such as who a work is by and what it is about. Check out data from the British National Bibliography or the linked data hidden in the tab at the bottom of a WorldCat display – you will see VIAF, LCSH and other URIs asserting connection with known resources. In this way, processes no longer need to infer from the characters on a page that they are connected with a person or a subject. It is a fundamental part of the data.
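To make the difference concrete, here is a minimal Python sketch comparing a name held as a text string with a name asserted as a URI. Everything in it is an assumption for illustration: the example.org URIs are placeholders standing in for real VIAF-style identifiers, and the helper names are hypothetical. Only the schema.org author property is a real term.

```python
from collections import namedtuple

# A triple is simply (subject, predicate, object).
Triple = namedtuple("Triple", "subject predicate obj")

AUTHOR_PROPERTY = "http://schema.org/author"

# Placeholder identifiers, NOT real records: they stand in for a
# bibliographic item and a VIAF-style authority URI.
BOOK = "http://example.org/book/1"
AUTHOR_URI = "http://example.org/authority/person/42"

# The same fact asserted two ways: as characters on a page, and as a link.
string_based = Triple(BOOK, AUTHOR_PROPERTY, '"Shakespeare, William, 1564-1616"')
uri_based = Triple(BOOK, AUTHOR_PROPERTY, AUTHOR_URI)


def is_uri(value):
    """Crude check: treat anything starting with http(s):// as a URI."""
    return value.startswith(("http://", "https://"))


def works_by(author_uri, triples):
    """Return subjects whose author is exactly the given URI (no string matching)."""
    return [
        t.subject
        for t in triples
        if t.predicate == AUTHOR_PROPERTY and t.obj == author_uri
    ]
```

With the URI form, finding works by an author is an exact comparison; with the string form, a process would have to guess whether two differently formatted names refer to the same person.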
With that large amount of rich [linked] data, and the association of the library brand, it is hardly surprising that these datasets are moving beyond mere nodes on the web of data. They are evolving into Hubs of Authority, building a framework on which libraries, and the rest of the web, can hang descriptions of, and signposts to, our resources. A framework that has uses and benefits beyond the boundaries of bibliographic data. By not keeping those hubs ‘library only’, we enable the wider web to build pathways to the library-curated resources people need to support their research, learning, discovery and entertainment.
I cannot really get away with making a statement like “Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them” and then not follow it up.
So here, for those that are interested, is a step-by-step description of what I did to follow my own encouragement to load up the triples and start playing.
Choose a triplestore. I followed my own advice and chose 4Store. The main reasons for this choice were that it is open source yet comes from an environment where it was the base platform for a successful commercial business, so it should work. Also in my years rattling around the semantic web world, 4Store has always been one of those tools that seemed to be on everyone’s recommendation list.
Looking at some of the blurb – 4store is optimised to run on shared–nothing clusters of up to 32 nodes, linked with gigabit Ethernet – at times holding and running queries over databases of 15GT, supporting a Web application used by thousands of people – you may think it might be a bit of overkill for a tool to play with at home, but hey, if it works does that matter?
Operating system. Unsurprisingly for a server product, 4Store was developed to run on Unix-like systems. I had three options. I could resurrect that old Linux-loaded PC in the corner, fire up an Amazon Web Services image with 4Store built in (such as the one built for the Billion Triple Challenge), or I could use the application download for my Mac.
As I only needed it for personal playing, I took the path of least resistance and went for the Mac application. The Mac in question being a fairly modern MacBook Air. The following instructions are therefore Mac oriented, but should not be too difficult to replicate on your OS of choice.
Download and install. I downloaded the latest version of the application (15MB) from the download server: http://4store.org/download/macosx/. As with most Mac applications, it was just a matter of opening up the downloaded 4store-1.1.5.dmg file and dragging the 4Store icon into my applications folder. (Time-saving tip: whilst you are doing the next step you can be downloading the 1GB WorldCat data file in the background, from here.)
Setup and load. Clicking on the 4Store application opens up a terminal window to give you command line access to controlling your triple store. Following the simple but effective documentation, I needed to create a dataset, which I called WorldCatMillion:
$ 4s-backend-setup WorldCatMillion
Next start the database:
$ 4s-backend WorldCatMillion
Then I needed to load the triples from the WorldCat Most Highly Held data set. This step takes a while – over an hour on my system.
This single command line, which may have wrapped on to more than one line in your browser, looks a bit complicated, but all it is doing is telling the import process to import the file, which I had downloaded and unzipped (automatically on the Mac – you may have to use gunzip on another system) and which is formatted as ntriples, into my WorldCatMillion dataset.
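If you have not met the ntriples format before, each line of the file is one statement: a subject, a predicate and an object, ending with a full stop. The Python sketch below shows roughly how such a line breaks down. It is a deliberately simplified reader of my own, not what the 4Store importer does: blank nodes are not handled and escape sequences inside literals are matched but not decoded. The sample URIs are placeholders.

```python
import re

# One simple N-Triples statement: <subject> <predicate> object .
# The object is either a <URI> or a "literal" with an optional
# @language tag or ^^<datatype> suffix.
TRIPLE_RE = re.compile(
    r'^\s*<([^>]*)>\s+<([^>]*)>\s+'
    r'(<[^>]*>|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
    r'\s*\.\s*$'
)


def parse_ntriple(line):
    """Return (subject, predicate, object) or None if the line does not match.

    Subject and predicate are returned without angle brackets; the object is
    returned as written, so URIs keep their <> and literals keep their quotes.
    """
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


# Placeholder data for illustration only.
SAMPLE_LINE = '<http://example.org/book/1> <http://schema.org/name> "Hamlet"@en .'
```

Calling parse_ntriple(SAMPLE_LINE) gives back the three parts of the statement; a comment or blank line gives None.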
Access via a web browser. I chose Firefox, as it seems to handle unformatted XML better than most. 4Store comes with a very simple SPARQL interface: http://localhost:8000/test/ This comes already populated with a sample query, just press execute and you should get the data back that you got with the command line 4s-query. The server sends it back in an XML format, which your browser may save to disk for you to view – tweaking the browser settings to automatically open these files will make life easier.
Some simple SPARQL queries. Try these and see what you get:
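If you would rather script your queries than paste them into the browser, something like the following Python sketch should be a reasonable starting point. It rests on assumptions you should check against your own setup: that the 4Store HTTP server is listening on port 8000 as above, and that it accepts queries on a /sparql/ path via a query parameter. The sample query is a generic one of my own that simply asks for the first ten triples.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: adjust host, port and path to match your own server.
ENDPOINT = "http://localhost:8000/sparql/"

# A generic starter query: return any ten triples from the store.
SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_query_url(query, endpoint=ENDPOINT):
    """URL-encode the SPARQL text and attach it as the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def run_query(query, endpoint=ENDPOINT):
    """Send the query with an HTTP GET and return the raw response text."""
    with urlopen(build_query_url(query, endpoint)) as response:
        return response.read().decode("utf-8")
```

With the store running, print(run_query(SAMPLE_QUERY)) should show the same kind of XML that the browser interface returns.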
Typical! Since joining OCLC as Technology Evangelist, I have been preparing myself to be one of the first to blog about the release of linked data describing the hundreds of millions of bibliographic items in WorldCat.org. So where am I when the press release hits the net? 35,000 feet above the North Atlantic heading for LAX, that’s where – life just isn’t fair.
By the time I am checked in to my Anaheim hotel, ready for the ALA Conference, this will be old news. Nevertheless it is significant news, significant in many ways.
OCLC have been at the leading edge of publishing bibliographic resources as linked data for several years. At dewey.info they have been publishing the top levels of the Dewey classifications as linked data since 2009. As announced yesterday, this has now been increased to encompass 32,000 terms, such as this one for the transits of Venus. Also around for a few years is VIAF (the Virtual International Authority File) where you will find URIs published for authors, such as this well known chap. These two were more recently joined by FAST (Faceted Application of Subject Terminology), providing usefully applicable identifiers for Library of Congress Subject Headings and combinations thereof.
Despite this leading position in the sphere of linked bibliographic data, OCLC has attracted some criticism over the years for not biting the bullet and applying it to all the records in WorldCat.org as well. As today’s announcement now demonstrates, they have taken their linked data enthusiasm to the heart of their rich, publicly available, bibliographic resources – publishing linked data descriptions for the hundreds of millions of items in WorldCat.
Let me dissect the announcement a bit….
First significant bit of news – WorldCat.org is now publishing linked data for hundreds of millions of bibliographic items – that’s a heck of a lot of linked data by anyone’s measure. By far the largest linked bibliographic resource on the web. It is also linked data describing things that, for decades, librarians in tens of thousands of libraries all over the globe have been carefully cataloguing so that the rest of us can find out about them. Just the sort of authoritative resources that will help stitch the emerging web of data together.
Second significant bit of news – the core vocabulary used to describe these bibliographic assets comes from schema.org. Schema.org is the initiative backed by Google, Yahoo!, Microsoft, and Yandex, to provide a generic high-level vocabulary/ontology to help mark up structured data in web pages so that those organisations can recognise the things being described and improve the services they can offer around them. A couple of examples being Rich Snippet results and inclusion in the Google Knowledge Graph.
As I reported a couple of weeks back, from the Semantic Tech & Business Conference, some 7-10% of indexed web pages already contain schema.org markup, as microdata or RDFa. It may at first seem odd for a library organisation to use a generic web vocabulary to mark up its data – but just think who the consumers of this data are, and what vocabularies are they most likely to recognise? Just for starters, embedding schema.org data in WorldCat.org pages immediately makes them understandable by the search engines, vastly increasing the findability of these items.
Third significant bit of news – the linked data is published both in human readable form and in machine readable RDFa on the standard WorldCat.org detail pages. You don’t need to go to a special version or interface to get at it; it is part of the normal interface. As you can see, from the screenshot of a WorldCat.org item above, there is now a Linked Data section near the bottom of the page. Click and open up that section to see the linked data in human readable form. You will see the structured data that the search engines and other systems will get from parsing the RDFa encoded data, within the html that creates the page in your browser. Not very pretty to human eyes I know, but just the kind of structured data that systems love.
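To give a feel for what parsing RDFa out of a page involves, here is a rough Python sketch using only the standard library. The HTML fragment is made up for illustration and is not copied from a WorldCat.org page, and the collector is nowhere near a full RDFa processor: it only pairs each property attribute with the first piece of text that follows it.

```python
from html.parser import HTMLParser

# Hypothetical fragment in the style of schema.org RDFa markup.
SAMPLE_HTML = """
<div vocab="http://schema.org/" typeof="Book">
  <span property="name">An Example Title</span>
  <span property="inLanguage">en</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect (property, text) pairs from elements carrying a property attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property name waiting for its text

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop is not None:
            self._current = prop

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and text:
            self.pairs.append((self._current, text))
            self._current = None


def extract_properties(html):
    """Return the list of (property, text) pairs found in the given HTML."""
    collector = PropertyCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs
```

Running extract_properties(SAMPLE_HTML) returns the name and language as simple pairs, which is the flavour of structured data a search engine is after, even if a real consumer would use a proper RDFa library.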
Fourth significant bit of news – OCLC are proposing to cooperate with the library and wider web communities to extend Schema.org, making it even more capable for describing library resources. With the help of the W3C, Schema.org is working with several industry sectors to extend the vocabulary to be more capable in their domains – news and e-commerce being a couple of already accepted examples. OCLC is playing its part in doing this for the library sector.
Take a closer look at the markup on WorldCat.org and you will see attributes from a library vocabulary. Attributes such as library:holdingsCount and library:oclcnum. This library vocabulary is OCLC’s conversation starter with which we want to kick off discussions with interested parties, from the library and other sectors, about proposing a basic extension to schema.org for library data. What better way of testing out such a vocabulary – mark up several million records with it, publish them and see what the world makes of them.
Fifth significant bit of news – the WorldCat.org linked data is published under an Open Data Commons (ODC-BY) license, so it will be openly usable by many for many purposes.
Sixth significant bit of news – This release is an experimental release. This is the start, not the end, of a process. We know we have not got this right yet. There are more steps to take around how we publish this data in ways in addition to RDFa markup embedded in page html – not everyone can, or will want to, parse pages to get the data. There are obvious areas for discussion around the use of schema.org and the proposed library extension to it. There are areas for discussion about the application of the ODC-BY license and the attribution requirements it asks for. Over the coming months OCLC wants to constructively engage with all that are interested in this process. It is only with the help of the library and wider web communities that we can get it right. In that way we can ensure that WorldCat linked data can be beneficial for the OCLC membership, libraries in general, and a great resource on the emerging web of data.
As you can probably tell I am fairly excited about this announcement. This, and future stuff like it, are behind some of my reasons for joining OCLC. I can’t wait to see how this evolves and develops over the coming months. I am also looking forward to engaging in the discussions it triggers.
So I need to hang up some tools in my shed. I need some bent hook things – I think. Off to the hardware store in which I search for the fixings section. Following the signs hanging from the roof, my search is soon directed to a rack covered in lots of individual packets and I spot the thing I am looking for, but what’s this – they come in lots of different sizes. After a bit of localised searching I grab the size I need, but wait – in the next rack there are some specialised tool hanging devices. Square hooks, long hooks, double-prong hooks, spring clips, an amazing choice! Pleased with what I discovered and selected I’m soon heading down the aisle when my attention is drawn to a display of shelving with hidden brackets – just the thing for under the TV in the lounge. I grab one of those and head for the checkout before my credit card regrets me discovering anything else.
We all know the library ‘browse’ experience. Head for a particular book, and come away with a different one on the same topic that just happened to be on a nearby shelf, or even a totally different one that you ‘found’ on the recently returned books shelf.
An ambition for the web is to reflect and assist what we humans do in the real world. Search has only brought us part of the way. By identifying key words in web page text, and links between those pages, it makes a reasonable stab at identifying things that might be related to the keywords we enter.
As I commented recently, Semantic Search messages coming from Google indicate that they are taking significant steps towards the ambition. By harvesting Schema.org described metadata embedded in html, by webmasters enticed by Rich Snippets, and building on the 12 million entity descriptions in Freebase they are amassing the fuel for a better search engine. A search engine [that] will better match search queries with a database containing hundreds of millions of “entities”—people, places and things.
How much closer will this better, semantic, search get to being able to replicate online the scenario I shared at the start of this post? It should do a better job of relating our keywords to the things that would be of interest, not just the pages about them. Having a better understanding of entities should help with the Paris Hilton problem, or at least help us navigate around such issues. That better understanding of entities, and related entities, should enable the return of related relevant results that did not contain our keywords.
But surely there is more to it than that. Yes there is, but it is not search – it is discovery. As in my scenario above, humans do not only search for things. We search to get ourselves to a start point for discovery. I searched for an item in the fixings section in the hardware store, or a book in the library; I then inspected related items on the rack and the shelf to discover if there was anything more appropriate for my needs nearby. By understanding things and the [semantic] relationships between them, systems could help us with that discovery phase. It is the search engine’s job to expose those relationships but the prime benefit will emerge when the source web sites start doing it too.
Take what is still one of my favourite sites – BBC wildlife. Take a look at the Lion page, found by searching for lions in Google. Scroll down a bit and you will see listed the lion’s habitats and behaviours. These are all things or concepts related to the lion. Follow the link to the flooded grassland habitat, where you will find lists of flora and fauna that you will find there, including the aardvark which is nocturnal. Such follow-your-nose navigation around the site supports the discovery method of finding things that I describe. In such an environment serendipity is only a few clicks away.
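For the programmers among you, follow-your-nose navigation is essentially a walk over a graph of things and the links between them. The Python sketch below is my own illustration, not how the BBC site works: the graph is hand-made and only loosely modelled on the pages just described, and the function simply reports what you could discover within a given number of clicks of where search dropped you.

```python
from collections import deque

# Hypothetical, hand-made links between things; invented for illustration.
LINKS = {
    "lion": ["flooded grassland", "pack hunting"],
    "flooded grassland": ["aardvark", "lion"],
    "aardvark": ["nocturnal"],
    "pack hunting": [],
    "nocturnal": [],
}


def discover(start, max_clicks, links=LINKS):
    """Breadth-first walk: map each reachable thing to its click distance.

    Only things within max_clicks of the start point are returned, and the
    start point itself is left out.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_clicks:
            continue  # do not follow links beyond the click budget
        for neighbour in links.get(node, []):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    del seen[start]
    return seen
```

Search gets you to "lion"; discover("lion", 2) then lists the habitats, behaviours and fellow inhabitants that are one or two clicks away.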
There are two sides to the finding stuff coin – Search and Discovery. Humans naturally do both; systems and the web are only just starting to move beyond search only. This move is being enabled by the constantly growing data that is describing things and their relationships – Linked Data. A growth stimulated by initiatives such as Schema.org, and by Google providing quick return incentives, such as Rich Snippets & SEO goodness, for folks to publish structured data for reasons other than a futuristic Semantic Web.