OCLC WorldCat Linked Data Release – Significant In Many Ways

Typical!  Since joining OCLC as Technology Evangelist, I have been preparing myself to be one of the first to blog about the release of linked data describing the hundreds of millions of bibliographic items in WorldCat.org. So where am I when the press release hits the net?  35,000 feet above the North Atlantic heading for LAX, that's where – life just isn't fair.

By the time I am checked in to my Anaheim hotel, ready for the ALA Conference, this will be old news.  Nevertheless it is significant news, significant in many ways.

OCLC have been at the leading edge of publishing bibliographic resources as linked data for several years.  At dewey.info they have been publishing the top levels of the Dewey classifications as linked data since 2009.  As announced yesterday, this has now been increased to encompass 32,000 terms, such as this one for the transits of Venus.  Also around for a few years is VIAF (the Virtual International Authority File) where you will find URIs published for authors, such as this well-known chap.  These two were more recently joined by FAST (Faceted Application of Subject Terminology), providing usefully applicable identifiers for Library of Congress Subject Headings and combinations thereof.
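What makes these URIs useful is that they can be dereferenced: ask for RDF and you get data back; ask for HTML and you get a page. A minimal sketch of that, using only the Python standard library – the identifier below is a placeholder, not a real VIAF number:

```python
# Dereference a linked data URI with content negotiation.
from urllib.request import Request, urlopen

uri = "http://viaf.org/viaf/0000000"  # placeholder VIAF identifier

req = Request(uri, headers={"Accept": "application/rdf+xml"})
with urlopen(req) as resp:
    print(resp.headers.get("Content-Type"))           # what the server chose to send
    print(resp.read()[:500].decode("utf-8", "replace"))  # start of the RDF description
```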

Despite this leading position in the sphere of linked bibliographic data, OCLC has attracted some criticism over the years for not biting the bullet and applying it to all the records in WorldCat.org as well.  As today’s announcement now demonstrates, they have taken their linked data enthusiasm to the heart of their rich, publicly available, bibliographic resources – publishing linked data descriptions for the hundreds of millions of items in WorldCat.

Let me dissect the announcement a bit….

First significant bit of news – WorldCat.org is now publishing linked data for hundreds of millions of bibliographic items – that's a heck of a lot of linked data by anyone's measure, and by far the largest linked bibliographic resource on the web. It is also linked data describing things that, for decades, librarians in tens of thousands of libraries all over the globe have been carefully cataloguing so that the rest of us can find out about them.  Just the sort of authoritative resource that will help stitch the emerging web of data together.

Second significant bit of news – the core vocabulary used to describe these bibliographic assets comes from schema.org.  Schema.org is the initiative backed by Google, Yahoo!, Microsoft, and Yandex to provide a generic, high-level vocabulary/ontology for marking up structured data in web pages, so that those organisations can recognise the things being described and improve the services they offer around them.  Rich Snippets results and inclusion in the Google Knowledge Graph are a couple of examples.

As I reported a couple of weeks back from the Semantic Tech & Business Conference, some 7-10% of indexed web pages already contain schema.org markup, as microdata or RDFa.   It may at first seem odd for a library organisation to use a generic web vocabulary to mark up its data – but just think who the consumers of this data are, and which vocabularies they are most likely to recognise.  Just for starters, embedding schema.org data in WorldCat.org pages immediately makes them understandable by the search engines, vastly increasing the findability of these items.

Third significant bit of news – the linked data is published, in both human-readable form and machine-readable RDFa, on the standard WorldCat.org detail pages.  You don't need to go to a special version or interface to get at it; it is part of the normal interface. As you can see from the screenshot of a WorldCat.org item above, there is now a Linked Data section near the bottom of the page. Click to open up that section and see the linked data in human-readable form.  You will also see the structured data that the search engines and other systems will get from parsing the RDFa encoded within the html that creates the page in your browser.  Not very pretty to human eyes I know, but just the kind of structured data that systems love.
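For a feel of what parsing a page to get at that data involves, here is a sketch using the third-party extruct library – one of several generic RDFa extractors, not something WorldCat itself provides – with a placeholder record URL:

```python
# Extract the RDFa statements embedded in a detail page's html.
import urllib.request

import extruct  # third-party: pip install extruct

url = "http://www.worldcat.org/oclc/0000000"  # placeholder record URL
html = urllib.request.urlopen(url).read().decode("utf-8")

# Ask only for the RDFa syntax; the result is a list of JSON-LD-like dicts.
data = extruct.extract(html, base_url=url, syntaxes=["rdfa"])
for item in data["rdfa"]:
    print(item)
```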

Fourth significant bit of news – OCLC are proposing to cooperate with the library and wider web communities to extend Schema.org, making it even more capable for describing library resources.  With the help of the W3C, Schema.org is working with several industry sectors to extend the vocabulary to be more capable in their domains – news and e-commerce being a couple of already accepted examples.  OCLC is playing its part in doing this for the library sector.

Take a closer look at the markup on WorldCat.org and you will see attributes from a library vocabulary – attributes such as library:holdingsCount and library:oclcnum.  This library vocabulary is OCLC's conversation starter, with which we want to kick off discussions with interested parties, from the library and other sectors, about proposing a basic extension to schema.org for library data.  What better way of testing out such a vocabulary – mark up several million records with it, publish them, and see what the world makes of them.
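To see how such an extension sits alongside the core vocabulary, here is a sketch with rdflib of the shape of triples behind that markup – one resource carrying both schema.org and library: properties. The namespace URI for the library vocabulary is my assumption, and the identifier and count are placeholders:

```python
# One resource described with core schema.org terms plus library-extension terms.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")
LIB = Namespace("http://purl.org/library/")  # assumed namespace for the extension

g = Graph()
g.bind("schema", SCHEMA)
g.bind("library", LIB)

book = URIRef("http://www.worldcat.org/oclc/0000000")  # placeholder OCLC URI
g.add((book, RDF.type, SCHEMA.Book))
g.add((book, SCHEMA.name, Literal("Harry Potter and the Deathly Hallows")))
g.add((book, LIB.oclcnum, Literal("0000000")))   # library:oclcnum (placeholder)
g.add((book, LIB.holdingsCount, Literal(1000)))  # library:holdingsCount (illustrative)

print(g.serialize(format="turtle"))
```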

Fifth significant bit of news – the WorldCat.org linked data is published under an Open Data Commons (ODC-BY) license, so it will be openly usable by many for many purposes.

Sixth significant bit of news – this is an experimental release.  It is the start, not the end, of a process.  We know we have not got this right yet.  There are more steps to take around how we publish this data in ways in addition to RDFa markup embedded in page html – not everyone can, or will want to, parse pages to get the data.  There are obvious areas for discussion around the use of schema.org and the proposed library extension to it.  There are areas for discussion about the application of the ODC-BY license and the attribution requirements it asks for.  Over the coming months OCLC wants to constructively engage with all who are interested in this process.  It is only with the help of the library and wider web communities that we can get it right, and in that way ensure that WorldCat linked data is beneficial for the OCLC membership, libraries in general, and a great resource on the emerging web of data.

For more information about this release, check out the background to linked data at OCLC, join the conversation on the OCLC Developer Network, or email data@oclc.org.

As you can probably tell I am fairly excited about this announcement.  This, and future stuff like it, are behind some of my reasons for joining OCLC.  I can’t wait to see how this evolves and develops over the coming months.  I am also looking forward to engaging in the discussions it triggers.

Schema.org Consensus at SemTechBiz

Day three of the Semantic Tech & Business Conference in San Francisco brought us a panel to discuss Schema.org, populated by an impressive array of names and organisations:

Ivan Herman, World Wide Web Consortium
Alexander Shubin, Yandex
Dan Brickley, Schema.org at Google
Evan Sandhaus, New York Times Company
Jeffrey W. Preston, Disney Interactive Media Group
Peter Mika, Yahoo!
R.V. Guha, Google
Steve Macbeth, Microsoft

This well attended panel started with a bit of a crisis – the stage in the room was not large enough to seat all of the participants, causing a quick call-out for bar seats and much microphone passing.  It was somewhat reflective of the concern that greeted the announcement of Schema.org immediately prior to last year's event, which precipitated the hurried arrangement of a birds-of-a-feather session to settle fears and disquiet in the semantic community.

Asking a fellow audience member what they thought of this session, they replied that there wasn't much new said.  I think that is a symptom of good things happening around the initiative.  He was right in saying that nothing substantive was said, but there were some interesting pieces in what the participants had to say.  Guha indicated that Google were already seeing 7-10% of pages crawled containing Schema.org mark-up – surprising growth in such a short time.  Steve Macbeth confirmed that Microsoft were also seeing around 7%.

Another unexpected but interesting insight from Microsoft was that they are looking to use Schema.org mark-up as a way to pass data between applications in Windows 8.  All the search engine folks played it close to the chest when asked what they were actually doing with the structured data they were capturing from Schema.org mark-up – lots of talk about projects around better search algorithms and indexing.  Guha indicated that the Schema.org data was not siloed inside Google; as with any other data it was used across the organisation, including within the Google Knowledge Graph functionality.

Jeffrey Preston responded to a question about the tangible benefits of applying Schema.org mark-up by describing how kids searching for games on the Disney site were being directed more accurately to the game itself, as against pages that merely referenced it.  Evan Sandhaus described how it enabled a far easier integration with a vendor, who could access their article data without having to work with a specific API.  Guha spoke about a Veterans job search site, created with the Department of Defense, which could constrain its search to sites that included Schema.org mark-up identifying jobs as appropriate for Veterans.

In questions from the floor, the panel explained the best way of introducing schema extensions, using IPTC's rNews as an example – get industry consensus to provide a well-formed proposal, and then be prepared to be flexible.   All done via the W3C-hosted Public Vocabs List.

All good progress in only a year!

Richard Wallis is Technology Evangelist at OCLC and Founder of Data Liberate

Surfacing at Semtech San Francisco

So where have I been?   I announce that I am now working as a Technology Evangelist for the library behemoth OCLC, and then promptly disappear.  The only excuse I have for deserting my followers is that I have been kind of busy getting my feet under the OCLC table: getting to know my new colleagues, the initiatives and projects they are engaged with, the longer term ambitions of the organisation, and of course the more mundane issues of getting my head around the IT, video conferencing, and expense claim procedures.

It was therefore great to find myself in San Francisco once again for the Semantic Tech & Business Conference (#SemTechBiz) for what promises to be a great program this year.  Apart from meeting old and new friends amongst those interested in the potential and benefits of the Semantic Web and Linked Data, I am hoping for a further step forward in the general understanding of how this potential can be realised to address real world challenges and opportunities.

As Paul Miller reported, the opening session drew an audience of 75% first-time attendees.  Just like the cityscape vista presented to those attending the speakers' reception yesterday on the 45th floor of the conference hotel, I hope these newcomers get a stunningly clear view of the landscape around them.

Of course I am doing my bit to help on this front by trying to cut through some of the more technical geek-speak. Tuesday 8:00am will find me in Imperial Room B presenting The Simple Power of the Link – a 30 minute introduction to Linked Data, its benefits and potential, without the need to get your head around the more esoteric concepts of Linked Data such as triple stores, inference, ontology management, etc.  I would recommend this session not only as an introduction for those new to the topic, but also for those well versed in the technology, as a reminder that we sometimes miss the simple benefits when trying to promote our baby.

For those interested in the importance of these techniques and technologies to the world of Libraries, Archives and Museums, I would also recommend a panel that I am moderating on Wednesday at 3:30pm in Imperial B – Linked Data for Libraries, Archives and Museums.  I will be joined by LOD-LAM community driver Jon Voss, Stanford Linked Data Workshop Report co-author Jerry Persons, and Sung Hyuk Kim from the National Library of Korea.  As moderator I will not only let the four of us make short presentations about what is happening in our worlds, but will insist that at least half the time is kept for questions from the floor – so bring them along!

I am not only surfacing at Semtech, I am beginning to see, at last, the technologies being discussed surfacing as mainstream.  We in the Semantic Web/Linked Data world are very good at frightening off those new to it.  However, driven by pragmatism in search of a business model, and by initiatives such as Schema.org, it is starting to become mainstream by default.  One very small example being Yahoo!'s Peter Mika telling us, in the Semantic Search workshop, that RDFa is now the predominant format for embedding structured data within web pages.

Looking forward to a great week, and soon more time to get back to blogging!

Who Will Be Mostly Right – Wikidata, Schema.org?

Two, on the surface, totally unconnected posts – yet the same message.  Well that's how they seem to me anyway.

Post 1 – The Problem With Wikidata from Mark Graham, writing in the Atlantic.

When I reported the announcement of Wikidata by Denny Vrandecic at the Semantic Tech & Business Conference in Berlin in February, I was impressed by the ambition to bring together all the facts from all the different language versions of Wikipedia in a central Wikidata instance, with a single page per entity.  These single pages will draw together all references to their entities and engage with a sustainable community to manage this machine-readable resource.  The data would then be used to populate the info-boxes of all versions of Wikipedia, in addition to being an open resource of structured data for all.
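To make "a single page per entity" concrete, here is a sketch against the JSON entity data service Wikidata exposes – the URL pattern, and Q42 as an arbitrary example identifier, are simply illustrative of the idea:

```python
# Fetch one Wikidata entity and show how it aggregates all language versions.
import json
import urllib.request

url = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"
entity = json.load(urllib.request.urlopen(url))["entities"]["Q42"]

print(entity["labels"].get("en", {}).get("value"))  # the label in one language
print(len(entity.get("sitelinks", {})))  # one link per language Wikipedia article
```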

In his post Mark raises concerns that this approach could result in the loss of the diversity of opinion currently found in the diverse Wikipedias:

It is important that different communities are able to create and reproduce different truths and worldviews. And while certain truths are universal (Tokyo is described as a capital city in every language version that includes an article about Japan), others are more messy and unclear (e.g. should the population of Israel include occupied and contested territories?).

He also highlights issues about the unevenness or bias of contributors to Wikipedia:

We know that Wikipedia is a highly uneven platform. We know that not only is there not a lot of content created from the developing world, but there also isn't a lot of content created about the developing world. And we also know that, even within the developed world, a majority of edits are still made by a small core of (largely young, white, male, and well-educated) people. For instance, there are more edits that originate in Hong Kong than all of Africa combined; and there are many times more edits to the English-language article about child birth by men than women.

A simplistic view of what Wikidata is attempting to do could be a majority-rules filter on what is correct data, where low volume opinions are drowned out by that majority.  If Wikidata is successful in its aims, it will not only become the single source for info-box data in all versions of Wikipedia, it will take over the mantle currently held by DBpedia as the de facto link-to place for identifiers and associated data on the Web of Data and the wider Web.

I share some of his concerns, but also draw comfort from some of the things Denny said in Berlin – "WikiData will not define the truth, it will collect the references to the data….  WikiData created articles on a topic will point to the relevant Wikipedia articles in all languages."  They obviously intend to capture facts described in different languages; the question is whether they will also preserve the local differences in assertion.  In a world where we still cannot totally agree on the height of our tallest mountain, we must be able to take account of, and report, differences of opinion.

Post 2 – Danbri has moved on – should we follow? by former colleague Phil Archer.

The Danbri in question is Dan Brickley, one of the original architects of the Semantic Web, now working for Google on Schema.org.  Dan presented at an excellent Semantic Web Meetup, which I attended at the BBC Academy a couple of weeks back.  I recommend investing the time to watch the videos of Dan and all the other speakers.

Phil picked out a section of Dan’s presentation for comment:

In the RDF community, in the Semantic Web community, we're kind of polite, possibly too polite, and we always try to re-use each other's stuff. So each schema maybe has 20 or 30 terms, and… schema.org has been criticised as maybe a bit rude, because it does a lot more – it's got 300 classes, 300 properties – but that makes things radically simpler for people deploying it. And that's frankly what we care about right now, getting the stuff out there. But we also care about having attachment points to other things…

Then reflecting on current practice in Linked Data he went on to postulate:

… best practice for the RDF community…  …i.e. look at existing vocabularies, particularly ones that are already widely used and stable, and re-use as much as you can. Dublin Core, FOAF – you know the ones to use.

Except schema.org doesn’t.

schema.org has its own term for name, family name and given name which I chose not to use at least partly out of long term loyalty to Dan. But should that affect me? Or you? Is it time to put emotional attachments aside and move on from some of the old vocabularies and at least consider putting more effort into creating a single big vocabulary that covers most things with specialised vocabularies to handle the long tail?

As the question in the title of his post implies, should we move on and start adopting, where applicable, terms from the large and extending Schema.org vocabulary when modelling and publishing our data?  Or should we stick with the current collection of terms from suitable smaller vocabularies?

One of the common issues when people first get to grips with creating Linked Data is which terms from which vocabularies to use for their data, and where to find out.  I have watched the frown skip across several people's faces when first told that foaf:name is a good attribute to use for a person's name in a data set that has nothing to do with friends, or friends of friends. It is very similar to the one they give you when you suggest that it may also be good for something that isn't even a person.

As Schema.org grows and, enticed by the obvious SEO benefits in the form of Rich Snippets, becomes rapidly adopted by a community far greater than the Semantic Web and Linked Data communities, why would you not default to using terms from its vocabulary?   Another former colleague, David Wood, tweeted "No" in answer to Phil's question – I think this may in retrospect seem a King Canute-style proclamation.  If my predictions are correct, it won't be too long before we are up to our ears in structured data on the web, most of it marked up using terms to be found at schema.org.

You may think that I am advocating the death of all the vocabularies in use today, well known and obscure, and their replacement by Schema.org – far from it.   When modelling your [Linked] data, start by using terms that have been used before, then build on terms more specific to your domain, and finally you may have to create your own vocabulary/ontology.  What I am saying is that as Schema.org becomes established, its growing collection of 300+ terms will become the obvious start point in that process.
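A sketch of that layering with rdflib – reuse the well-worn term, add the emerging web default, and keep a custom namespace only for the genuinely domain-specific bit. Every name, namespace, and value here is illustrative:

```python
# Describe one person by layering reused, mainstream, and local vocabularies.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF

SCHEMA = Namespace("http://schema.org/")
MINE = Namespace("http://example.org/vocab/")  # hypothetical domain vocabulary

g = Graph()
g.bind("foaf", FOAF)
g.bind("schema", SCHEMA)
g.bind("mine", MINE)

person = URIRef("http://example.org/people/1")
g.add((person, FOAF.name, Literal("Jane Example")))    # the established, reused term
g.add((person, SCHEMA.name, Literal("Jane Example")))  # the emerging web default
g.add((person, MINE.shoeSize, Literal(7)))             # the long-tail, domain-specific term

print(g.serialize(format="turtle"))
```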

OK, a couple of interesting posts, but where is the similar message and connection?  I see it as democracy of opinion.  Not the democracy of the modern western political system, where we have a stand-up shouting match every few years followed by a fairly stable period in which the rules are enforced by one view.  More the traditional, possibly romanticised, view of democracy where the majority leads the way but without disregarding the opinions of the few.  Was it the French Enlightenment philosopher Voltaire who said "I may hate your views, but I am willing to lay down my life for your right to express them"?  A bit extreme when discussing data and ontologies, but the spirit is right.

Once the majority of general data on the web is marked up as schema.org, it would be short-sighted to ignore the gravitational force it will exert on the web of data if you want your data to be linked to and found.  However, it will be incumbent on those behind Schema.org to maintain their ambition to deliver easy linking to more specialised vocabularies via their extension points.  This way the 'how' of data publishing should become simpler, more widespread, and extensible.   On the 'what' side of the [structured] data publishing equation, the Wikidata team has an equal responsibility not only to publish the majority definition of facts, but also to clearly reflect the views of minorities – not a simple balancing act, as often those with the more extreme views have the loudest voices.

Main image via democracy.org.au.

Semantic Search, Discovery, and Serendipity

So I need to hang up some tools in my shed.  I need some bent hook things – I think.  Off to the hardware store, in which I search for the fixings section.  Following the signs hanging from the roof, my search soon directs me to a rack covered in lots of individual packets and I spot the thing I am looking for – but what's this – they come in lots of different sizes.  After a bit of localised searching I grab the size I need, but wait – in the next rack there are some specialised tool-hanging devices.  Square hooks, long hooks, double-prong hooks, spring clips, an amazing choice!  Pleased with what I discovered and selected, I'm soon heading down the aisle when my attention is drawn to a display of shelving with hidden brackets – just the thing for under the TV in the lounge.  I grab one of those and head for the checkout before my credit card regrets my discovering anything else.

We all know the library ‘browse’ experience.  Head for a particular book, and come away with a different one on the same topic that just happened to be on a nearby shelf, or even a totally different one that you ‘found’ on the recently returned books shelf.

An ambition for the web is to reflect and assist what we humans do in the real world.  Search has only brought us part of the way. By identifying key words in web page text, and links between those pages, it makes a reasonable stab at identifying things that might be related to the keywords we enter.

As I commented recently, Semantic Search messages coming from Google indicate that they are taking significant steps towards the ambition.   By harvesting Schema.org-described metadata embedded in html by webmasters enticed by Rich Snippets, and building on the 12 million entity descriptions in Freebase, they are amassing the fuel for a better search engine.  A search engine [that] will better match search queries with a database containing hundreds of millions of "entities" – people, places and things.

How much closer will this better, semantic, search get to being able to replicate online the scenario I shared at the start of this post?  It should do a better job of relating our keywords to the things that would be of interest, not just the pages about them.  Having a better understanding of entities should help with the Paris Hilton problem, or at least help us navigate around such issues.  That better understanding of entities, and related entities, should enable the return of related relevant results that did not contain our keywords.

But surely there is more to it than that.  Yes there is, but it is not search – it is discovery.  As in my scenario above, humans do not only search for things; we search to get ourselves to a start point for discovery.  I searched for an item in the fixings section of the hardware store, or a book in the library; I then inspected related items on the rack and the shelf to discover whether there was anything more appropriate for my needs nearby.  By understanding things and the [semantic] relationships between them, systems could help us with that discovery phase. It is the search engine's job to expose those relationships, but the prime benefit will emerge when the source web sites start doing it too.

Take what is still one of my favourite sites – BBC wildlife.  Take a look at the Lion page, found by searching for lions in Google. Scroll down a bit and you will see listed the lion's habitats and behaviours.  These are all things or concepts related to the lion.  Follow the link to the flooded grassland habitat, where you will find lists of the flora and fauna that live there, including the aardvark, which is nocturnal.  Such follow-your-nose navigation around the site supports the discovery method of finding things that I describe.  In such an environment serendipity is only a few clicks away.
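That follow-your-nose pattern is mechanical enough to sketch in code: dereference a thing's URI, then walk one of the relationships found in its description to a related thing. The URIs below are hypothetical stand-ins, not the BBC's actual identifiers:

```python
# Follow-your-nose discovery: load one description, then hop to a neighbour.
from rdflib import Graph, URIRef

start = URIRef("http://example.org/things/lion")  # hypothetical starting entity

g = Graph()
g.parse(start)  # dereference the URI and load its RDF description

# Every resource this description links out to is a candidate next step.
neighbours = {o for _, _, o in g if isinstance(o, URIRef) and o != start}

# Take one step: dereference a related resource and add it to the graph.
for neighbour in sorted(neighbours)[:1]:
    g.parse(neighbour)

print(f"{len(g)} triples gathered after one hop of discovery")
```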

There are two sides to the finding-stuff coin – Search and Discovery.  Humans naturally do both; systems and the web are only just starting to move beyond search alone.  This move is being enabled by the constantly growing data describing things and their relationships – Linked Data.  A growth stimulated by initiatives such as Schema.org, and by Google providing quick-return incentives, such as Rich Snippets & SEO goodness, for folks to publish structured data for reasons other than a futuristic Semantic Web.

A Data 7th Wave Approaching

Some in the surfing community will tell you that every seventh wave is a big one.  I am getting the feeling, in the world of the Web, that a number seven is up next, and this one is all about data. The last seventh wave was the Web itself.  Because of that, it is a little constraining to talk about this next one only affecting the world of the Web.  This one has the potential to shift some significant rocks around on all our beaches and change the way we all interact and think about the world around us.

Sticking with the seashore metaphor for a short while longer: waves from the technology ocean have the potential to wash into the bays and coves of interest on the coast of human endeavour and rearrange the pebbles on our beaches.  Some do not reach every cove, and/or only have minor impact; however some really big waves reach in everywhere to churn up the sand and rocks, significantly changing the way we do things and ultimately think about the world around us.  The post-Web technology waves have brought smaller yet important influences such as ecommerce, social networking, and streaming.

I believe Data, or more precisely changes in how we create, consume, and interact with data, has the potential to deliver a seventh wave impact.  Enough of the grandiose metaphors and down to business.

Data has been around for centuries, from clay tablets to little cataloguing tags on the ends of scrolls in ancient libraries, and on into the computerised databases that we have been accumulating since the 1960s.  Until very recently these [digital] data have been closed – constrained by the systems that used them, only exposed to the wider world via user interfaces and possibly a task- or product-specific API.  With the advent of many data-associated advances, variously labelled Big Data, Social Networking, Open Data, Cloud Services, Linked Data, Microformats, Microdata, Semantic Web, and Enterprise Data, data is now venturing beyond those closed systems into the wider world.

Well this is nothing new, you might say; these trends have been around for a while – why do they constitute the seventh wave you foretell?

It is precisely because these trends have been around for a while, and are starting to mature and influence each other, that they are building to form something really significant.  Take Open Data, for instance, where governments have been at the forefront – I have reported before on the almost daily announcements of open government data initiatives.  The announcement from the Dutch City of Enschede this week not only talks about their data but also about the open sourcing of the platform they use to manage and publish it, so that others can share in the way they do it.

In the world of libraries, the Ontology Engineering Group (OEG) at the Universidad Politécnica de Madrid is contributing linked bibliographic data to the gathering mass, alongside the British and the Germans, with 2.4 million bibliographic records from the Spanish National Library.  This adds weight to the arguments for a Linked Data future for libraries proposed by the Library of Congress and Stanford University.

I might find some of the activities in the Cloud Computing arena short-sighted and depressing, yet already the concept of housing your data somewhere other than in a local datacenter is becoming accepted in most industries.

Enterprise use of Linked Data by leading organisations such as the BBC, who are underpinning their online Olympics coverage with it, is showing that it is more than a research tool, or the province only of the open data enthusiasts.

Data marketplaces are emerging to provide platforms to share, and possibly monetise, your data.  An example that takes this one step further is Kasabi.com from the leading Semantic Web technology company, Talis.  Kasabi introduces the mixing, merging, and standardised querying of Linked Data into the data publishing concept.  This potentially provides a platform for refining and mixing raw data into new data alloys and products more valuable and useful than their component parts – an approach that should stimulate innovation both in the enterprise and in the data enthusiast community.

The Big Data community is demonstrating that there are solutions to handling the vast volumes of data we are producing, but they require us to move out of the silos of relational databases towards a mixed economy.  Programs need to move, not the data.  NoSQL databases, Hadoop, map/reduce – these are all things that are starting to move out of the labs and the hacker communities into the mainstream.
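To make the map/reduce shape concrete, here is a toy word count in Python – the same pattern Hadoop distributes across machines, though everything here fits in one process, which is precisely what the real thing avoids:

```python
# The map/reduce pattern in miniature: map emits (key, 1) pairs,
# shuffle groups them by key, reduce sums each group.
from collections import defaultdict

docs = ["big data big ideas", "data moves programs not data"]

# Map: each document emits (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the emitted pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
totals = {word: sum(counts) for word, counts in grouped.items()}
print(totals)  # e.g. {'big': 2, 'data': 3, ...}
```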

The Social Networking industry, which produces tons of data, is a rich field for things like sentiment analysis, trend spotting, targeted advertising, and even short term predictions – innovation in this field has been rapid, but I would suggest a little hampered by delivering closed, individual solutions that as yet do not interact with the wider world that could place them in context.

I wrote about Schema.org a while back – an initiative from the search engine big three to encourage the SEO industry to embed simple structured data in their html.  The carrot they are offering for this effort is enhanced display in results listings – Google calls these Rich Snippets.  When first announced, the schema.org folks concentrated on Microdata as the embedding format – something that wouldn't frighten the SEO community horses too much.  However they did [over a background of loud complaining from the Semantic Web / Linked Data enthusiasts that RDFa was the only way] also indicate that RDFa would eventually be supported.  By engaging with SEO folks on terms that they understand, this move from Schema.org had the potential to get far more structured data published on the Web than any TED Talk from Sir Tim Berners-Lee, preaching from people like me, or guidelines from governments could ever do.

The above short list of pebble-stirring waves is both impressive in its breadth and encouraging in its potential, yet none of them are the stuff of a seventh wave.

So what caused me to open up my MacBook and start writing this?  It was a post from Manu Sporny, indicating that Google are not waiting for RDFa 1.1 Lite (the RDFa version that schema.org will support) to be ratified.  They are already harvesting, and using, structured information from web pages that has been encoded using RDFa.  The use of this structured data has resulted in enhanced display on the Google pages, with items such as event date & location information, and recipe preparation timings.

Manu references sites that seem to be running Drupal, the open source CMS software, and specifically a Drupal plug-in for rendering Schema.org data encoded as RDFa.  This approach answers some of the critics of embedding Schema.org data, especially as RDFa, into a site's html, who say it is ugly and difficult to understand.  It is not there for humans to parse or understand and, with modules such as the Drupal one, humans will not need to get their hands dirty down at code level.  Currently Schema.org supports a small but important number of 'things' in its recognised vocabularies.  These, currently supplemented by GoodRelations and Recipes, will hopefully be joined by others to broaden the scope of descriptive opportunities.

So roll the clock forward, not too far, to a landscape where a large number of sites (incentivised by the prospect of listings as enriched as their competitors') are embedding structured data in their pages as normal practice.  By then most, if not all, web site delivery tools should be able to embed the Schema.org RDFa data automatically.  Google and the other web crawling organisations will rapidly build up a global graph of the things on the web, their types, their relationships, and the pages that describe them.  A nifty example of providing a very specific, easily understood benefit in return for a change in the way web sites are delivered, resulting in a global shift in the amount of structured data accessible for the benefit of all.  Google Fellow and SVP Amit Singhal recently gave insight into this Knowledge Graph idea.

The Semantic Web / Linked Data proponents have been trying to convince everyone else of the great good that will follow once we have a web interlinked at the data level with meaning attached to those links.  So far this evangelism has had little success.  However, this shift may give them what they want via an unexpected route.

Once such a web emerges, and most importantly is understood by the commercial world, innovations that will influence the way we interact will naturally follow.  A Google TV, with access to such a rich resource, should have no problem delivering an enhanced viewing experience by following structured links embedded in a programme page to information about the cast, the book of the film, the statistics that underpin the topic, or other programmes from the same production company.  Our next-but-one iPhone could be a personal node in a global data network, providing access to relevant information about our location, activities, social network, and tasks.

These slightly futuristic predictions will only become possible on top of a structured network of data, which I believe is what could very well emerge if you follow through on the signs that Manu is pointing out.  Reinforced by, and combining with, the other developments I reference earlier in this post, I believe we may well have a seventh wave approaching.  Perhaps I should look at the beach again in five years' time to see if I was right.

Wave photo from Nathan Gibbs on Flickr.
Declarations – I am a Kasabi Partner and shareholder in Kasabi parent company Talis.

More Linked Open Data under a More Open License from German National Library

The German National Library (DNB) has launched a Linked Data version of the German National Bibliography.

The bibliographic data of the DNB’s main collection (apart from the printed music and the collection of the Deutsches Exilarchiv) and the serials (magazines, newspapers and series of the German Union Catalogue of serials (ZDB)) have been converted.  Henceforth the RDF/XML-representation of the records are available at the DNB portal. This is an experimental service that will be continually expanded and improved.

This is a welcome extension to their Linked Data Service, which previously delivered authority data.  Documentation on their data and modelling is available; however, the English version has yet to be updated to reflect this latest release.

Links to RDF/XML versions of individual records are available directly from the portal user interface, with the usual Linked Data content negotiation techniques available to obtain HTML or RDF/XML as required.
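A sketch of that content negotiation in Python: the same record URI returns HTML or RDF/XML depending on the Accept header. The identifier below is a placeholder, not a real DNB record:

```python
# One URI, two representations: negotiate HTML or RDF/XML via Accept headers.
from urllib.request import Request, urlopen

uri = "http://d-nb.info/0000000"  # placeholder record URI

for accept in ("text/html", "application/rdf+xml"):
    req = Request(uri, headers={"Accept": accept})
    with urlopen(req) as resp:
        print(accept, "->", resp.headers.get("Content-Type"))
```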

This is a welcome addition to the landscape of linked open bibliographic data, joining others such as the British Library.

Also to be welcomed is their move to CC0 licensing, removing barriers, real or assumed, to the reuse of this data.

I predict that this will be the first of many more such announcements this year from national and other large libraries opening up their metadata resources as Linked Open Data.  The next challenge will be to identify the synergies between these individual approaches to modelling bibliographic data and balance the often competing needs of the libraries themselves and potential consumers of their data who very often do not speak ‘library’.

Somehow [without engaging in the traditional global library cooperation treacle-like processes that take a decade to publish a document] we need to draw together a consistent approach to modelling and publishing Linked Open Bibliographic Data for the benefit of all – not just the libraries.  With input from the DNB, British Library, Library of Congress, European national libraries, Stanford, and others such as Schema.org, the W3C, the Open Knowledge Foundation, etc., we could possibly reach consensus on an initial approach.  Aiming for a standard would be both too restrictive and, based on experience, too large a bite of the elephant at this early stage.

Ambitious Technology Plan Emerges From Stanford Linked Data Workshop

Although there has been a half-year lag between the workshop held at Stanford University at the end of June 2011 and the Stanford Linked Data Workshop Technology Plan [pdf] published on December 31st, the folks behind it obviously have not been twiddling their thumbs.  The 44 pages constitute a significant, well-thought-through proposal.   There is always benefit in shooting high when making a plan – from the introduction:

This is a plan for a multi-national, multi-institutional discovery environment built on Linked Open Data principles. If instantiated at several institutions, [it] will demonstrate to end users the value of the Linked Data approach to recording machine operable facts about the products of teaching, learning, and research.

…The resulting discovery environments will demonstrate the dramatic change that is possible in the academic information resource discovery environment when organizations move beyond closed and rule-bound metadata creation and utilization.

…This model also postulates dramatic changes to the creation, adoption, editing, and maintenance of metadata records for bibliographic holdings as well as scholarly information resources licensed for use in research institutions;

Refreshingly different for an academic report on proposed academic processes, the authors seem to be shying away from the traditional institutionally focused, or unwieldy and elaborate, coordination mechanisms.  Their basic premise is to deliver a Linked Data model that is adopted by schema.org, building on the role that schema.org already plays with the schemas it supports. Such a model would not only be easily referenced by those in the worlds of libraries and academia, but more generally across the data web and by users of other schema.org schemas.  An obvious example that immediately springs to mind would be an academic publisher wishing to intermix globally recognised metadata about their products with equally globally recognised sales offer information.

Moving on beyond the introduction, the report starts by setting some goals:

  • Implement an information ecosystem that exploits Linked Data's ability to record and make discoverable an ongoing, richly detailed history of the intellectual activity embodied in all of a research university's academic endeavors and its use of library resources and programs.
  • Design and implement data models, processes, workflows, applications, and delivery services…
  • Construct an ecosystem based on linked-data principles that draws on the intellectual activity and resources found throughout a research university's programs and its libraries. Use structured, curated representations of these activities and resources to populate a graph of named links.

With a scope of a "… model [that] comprises the pursuits of a research university's faculty and students. Included in that scope are the knowledge and information resources that a research university creates, acquires, and uses in the course of its scholarship, research, and teaching programs." – which includes just about everything such institutions do – they are not playing at this.

For many in the world of libraries and associated domains, Linked Data may seem to be just the latest brand of technological snake-oil.  A brand that not only promises to add value, but to radically disrupt the way they do things.  I obviously agree with that (except the snake-oil bit) but know from experience it is not an easy sell to the sceptical.  The authors of the report approach this difficulty by referencing several examples and initiatives.

One of the core things they reference is close to my heart, having been closely involved with it with former colleagues at Talis Consulting – The British Library data model, which they used to openly publish the British National Bibliography as Linked Data.  They intend to use this model as a starting point for their work.

Doing so will ensure that the resulting model retains the BL's high-level focus and its web-derived, transparent structure for representing facts about people, organizations, places, events, and topics. Such focus represents a marked contrast to efforts based on all-inclusive models that enforce highly structured, deeply detailed, and therefore exceedingly brittle representations of physical and digital objects.

I could go on picking out excellent examples and references from the report, such as LinkSailor, the recent proposal from the Library of Congress to transition to A Bibliographic Framework for the Digital Age, the vote by European libraries to support an open data policy for their bibliographic records, Talis’ Kasabi.com Linked Data powered data marketplace and Linked Data scholarly resource system Talis Aspire, Drupal’s use of RDF & Linked Data techniques, aligning with Schema.org, Google’s Freebase, etc., but I would recommend reading the report yourself as they place these things in context.

Reading it through a couple of times has left me with a couple of strongly held hopes.

Hope 1.  This report gains traction and attracts funding.  Implementation of an exemplar ecosystem for the publishing and linking of intellectual information, such as this, will be a massive boost towards the realisation [both intellectually and operationally] of the benefits of applying Linked Data techniques and technologies in the scholarly and research domains.

Hope 2.  They remain true to the ambition to "retain the BL's high-level focus and its web-derived, transparent structure for representing facts about people, organizations, places, events, and topics".  It would be so easy to fall back into the over-engineered approach to data publication – over-emphasising edge-cases and focussed only on internal domain concerns – that has characterised the bibliographic world for the last few decades.

Linked Data, and the way this report approaches its adoption, has the potential to make the world's information accessible to all who can benefit.  To get us there requires honest evangelism and demonstrations of practical benefits, but mostly being true to your goals for implementing it.  I welcome this report and pass on my hopes for its proposals becoming a reality.