Baby Steps Towards A Library Graph

It is one thing to have a vision, and regular readers of this blog will know I have them all the time; it is yet another to see it starting to form through the mist into a reality. Several times in the recent past I have spoken of some of the building blocks for bibliographic data to play a prominent part in the Web of Data.  The Web of Data that is starting to take shape and drive benefits for everyone.  Benefits that for many are hiding in plain sight on the results pages of search engines, in those informational panels with links to people’s parents, universities, and movies, or maps showing the location of mountains and retail outlets: the incongruously named Knowledge Graphs.

Building blocks such as: Linked Data in; moves to enhance capabilities for bibliographic resource description; recognition that Linked Data has a beneficial place in library data and initiatives to turn that into a reality; the release of Work entity data mined from, and linked to, the huge data set.

OK, you may say, we’ve heard all that before, so what is new now?

As always it is a couple of seemingly unconnected events that throw things into focus.

Event 1:  An article by David Weinberger in the DigitalShift section of Library Journal entitled Let The Future Go.  An excellent article telling libraries that they should not be so parochially focused in their own domain whilst looking to how they are going to serve their users’ needs in the future.  Get our data out there, everywhere, so it can find its way to those users, wherever they are.  Making it accessible to all.  David references three main ways to provide this access:

  1. APIs – to allow systems to directly access our library system data and functionality
  2. Linked Data – can help us open up the future of libraries. By making clouds of linked data available, people can pull together data from across domains
  3. The Library Graph –  an ambitious project libraries could choose to undertake as a group that would jump-start the web presence of what libraries know: a library graph. A graph, such as Facebook’s Social Graph and Google’s Knowledge Graph, associates entities (“nodes”) with other entities

(I am fortunate to be a part of an organisation, OCLC, making significant progress on making all three of these a reality – the first one is already baked into the core of OCLC products and services)

It is the 3rd of those, however, that triggered recognition for me.  Personally, I believe that we should not be focusing on a specific ‘Library Graph’ but more on the ‘Library Corner of a Giant Global Graph’  – if graphs can have corners that is.  Libraries have rich specialised resources and have specific needs and processes that may need special attention to enable opening up of our data.  However, when opened up in context of a graph, it should be part of the same graph that we all navigate in search of information whoever and wherever we are.

Event 2: A posting by ZBW Labs, “Other editions of this work: An experiment with OCLC’s LOD work identifiers”, detailing experiments in using the OCLC WorldCat Works Data.

ZBW contributes to WorldCat, and has 1.2 million oclc numbers attached to it’s bibliographic records. So it seemed interesting, how many of these editions link to works and furthermore to other editions of the very same work.

The post is interesting from a couple of points of view.  Firstly, the simple steps they took to get at the data, really well demonstrated by the command-line calls used to access the data – get OCLCNum data from WorldCat.org in JSON format – extract the schema:exampleOfWork link to the Work – get the Work data from WorldCat, also in JSON – parse out the links to other editions of the work and compare with their own data.  Command-line calls that were no doubt embedded in simple scripts.
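
Those steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not ZBW's actual script: the sample records, the example.org URIs, and the helper names `work_uri_for`, `other_editions`, and `held_locally` are hypothetical, and the record shape is a simplified stand-in for whatever JSON the service really returns. Only the property names `exampleOfWork` (edition to work) and `workExample` (work to edition) come from Schema.org.

```python
# Simplified, hypothetical stand-ins for the JSON returned for one
# edition (OCLC number) and for the Work that edition belongs to.
EDITION_RECORD = {
    "@id": "http://example.org/oclc/111",
    "exampleOfWork": "http://example.org/work/id/999",
}
WORK_RECORD = {
    "@id": "http://example.org/work/id/999",
    "workExample": [
        "http://example.org/oclc/111",
        "http://example.org/oclc/222",
        "http://example.org/oclc/333",
    ],
}


def work_uri_for(edition):
    """Follow the edition-to-work link (schema:exampleOfWork)."""
    return edition.get("exampleOfWork")


def other_editions(work, known_edition_uri):
    """List the work-to-edition links (schema:workExample), minus the starting edition."""
    return [uri for uri in work.get("workExample", []) if uri != known_edition_uri]


def held_locally(edition_uris, local_holdings):
    """Compare the editions found with a library's own set of edition URIs."""
    return sorted(set(edition_uris) & set(local_holdings))


work_uri = work_uri_for(EDITION_RECORD)
siblings = other_editions(WORK_RECORD, EDITION_RECORD["@id"])
matches = held_locally(
    siblings,
    {"http://example.org/oclc/333", "http://example.org/oclc/444"},
)
```

In a real run the two records would be fetched over HTTP rather than hard-coded; the link-following logic would stay much the same.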

Secondly, there was the implicit way that the corpus of WorldCat Work entity descriptions, and their canonical identifying URIs, is used as an authoritative hub for Works and their editions.  A concept that is not new in the library world; we have been doing this sort of thing with names and person identities via other authoritative hubs, such as VIAF, for ages.  What is new here is that it is a hub for Works and their relationships, and the bidirectional nature of those relationships – work to edition, edition to work – in the beginnings of a library graph linked to other hubs for subjects, people, etc.

The ZBW Labs experiment is interesting in its own way – simple approach, enlightening results.  What is more interesting for me is that it demonstrates a baby step towards the way the Library corner of that Global Web of Data will not only naturally form (as we expose and share data in this way – linked entity descriptions), but naturally fit into future library workflows with all sorts of consequential benefits.

The experiment is exactly the type of initiative that we hoped to stimulate by releasing the Works data.  Using it for things we never envisaged, delivering unexpected value to our community.  I can’t wait to hear about other initiatives like this that we can all learn from.

So who is going to be doing this kind of thing – describing entities and sharing them to establish these hubs (nodes) that will form the graph?  Some are already there, in the traditional authority file hubs: The Library of Congress LC Linked Data Service for authorities and vocabularies, VIAF, ISNI, FAST, Getty vocabularies, etc.

As previously mentioned, Work is only the first of several entity descriptions that are being developed in OCLC for exposure and sharing.  When others, such as Person, Place, etc., emerge we will have a foundation of part of a library graph – a graph that can and will be used, and added to, across the library domain and then on into the rest of the Global Web of Data.  An important authoritative corner, of a corner, of the Giant Global Graph.

As I said at the start these are baby steps towards a vision that is forming out of the mist.  I hope you and others can see it too.

(Toddler image: Harumi Ueda)

Google SEO RDFa and Semantic Search

Today’s Wall Street Journal gives us an insight into the makeover underway in the Google search department.

Over the next few months, Google’s search engine will begin spitting out more than a list of blue Web links. It will also present more facts and direct answers to queries at the top of the search-results page.

They are going about this by developing the search engine [that] will better match search queries with a database containing hundreds of millions of “entities”—people, places and things—which the company has quietly amassed in the past two years.

The ‘amassing’ got a kick start in 2010 with the Metaweb acquisition that brought Freebase and its 12 million entities into the Google fold.  This is now continuing with harvesting of HTML-embedded, encoded, structured data that is starting to spread across the web.

The encouragement for webmasters and SEO folks to go to the trouble of inserting this information into their HTML is the prospect of a better result display for their page – Rich Snippets.  A nice trade-off from Google – you embed the information we want/need for a better search and we will give you better results.

The premise of what Google are up to is that it will deliver better search.  Yes, this should be true; however, I would suggest that the major benefit to us mortal Googlers will be better results.  The search engine should appear to have greater intuition as to what we are looking for, but what we also should get is more information about the things that it finds for us.  This is the step-change.  We will be getting, in addition to web page links, information about things – the location, altitude, average temperature or salt content of a lake – whereas today you would only get links to the lake’s visitors centre or a Wikipedia page.

Another example quoted in the article:

…people who search for a particular novelist like Ernest Hemingway could, under the new system, find a list of the author’s books they could browse through and information pages about other related authors or books, according to people familiar with the company’s plans. Presumably Google could suggest books to buy, too.

Many in the library community may note this with scepticism, and as being too simplistic an approach to something that they have been striving towards for many years with only limited success.  I would say that they should be helping the search engine supplier(s) do this right and be part of the process.  There is great danger that, for better or worse, whatever Google does will make the library search interface irrelevant.

As an advocate for linked data, it is great to see the benefits of defining entities and describing the relationships between them being taken seriously.  I’m not sure I buy into the term ‘Semantic Search’ as a name for what will result.  I tend more towards ‘Semantic Discovery’, which is more descriptive of where the semantics kick in – in the relationship between a searched-for thing and its attributes and other entities.  However, I’ve been around far too long to get hung up about labels.

Whilst we are on the topic of labels, I am in danger of stepping into the almost religious debate about the relative merits of microdata and RDFa as the encoding method for embedding this structured data.  Google recognises both, both are ugly for humans to hand code, and webmasters should not have to care.  Once the CMS suppliers get up to speed in supplying the modules to automatically embed this stuff, as per this Drupal module, they won’t have to care.

I welcome this.  Yet it is only a symptom of something much bigger and game-changing, as I postulated last month in A Data 7th Wave is Approaching.

Linked Data a Recipe for Food?

I’ve just watched a short interview with Dr Chris Brewster, of Aston Business School.  Chris is a Semantic Web and Linked Data specialist, but he was attending an event organised by the New Optimists Forum to look at Food & Cities – possible futures for Birmingham in 2050.

What relevance does Linked Data have for a City’s food supply, you may ask?  As Chris put it:

We live in a world where the agri-food supply chain, from producer all the way through to final consumer, is extremely inefficient in the flow of knowledge.  It is very good at delivering food to your table but we don’t know where it comes from, what its history is.  That has great implication in various scenarios, for example when there are food emergencies, E. coli, things of that sort.

My vision is with the application of Semantic Web and Linked Data technologies along the food supply chain, it will make it easier for all actors along there to know more about where their food comes from and where their food goes.  This will also create opportunities for new business models where local food will be more easily integrated into the overall consumption patterns of local communities.

This is a vision that can be applied to many of our traditional supply chains.  Industries have become very efficient at producing, building, and delivering often very complex things to customers and consumers, but only the thing travels along the chain, it is not accompanied by information about the thing, other than what you may find on a label or delivery note.  These supply chains are highly tuned processes that the logisticians will tell you have had most every drop of efficiency squeezed out of them already.  Information about all aspects of and steps within a chain could possibly allow parts of the chain to react, and possibly apply some local agility, feedback, and previously hidden efficiencies.

Another example of a traditional chain that exhibits an, on the surface, poor information supply chain is sustainable wood supply, as covered by the BBC You&Yours radio program today (about 43 minutes in), coincidentally within minutes of me watching Dr Brewster.

The Forest Stewardship Council has had a problem where one of their producers had part of their license revoked but apparently still applied the FSC label on the wood they were shipping.  Some of this wood travelled through the supply chain and was unwittingly displayed on UK retailers’ shelves as certified sustainable wood.  Listening to the FSC representative, it was clear that if an integrated information supply network had been available, the chances of this happening would have been decreased, or at least it would have been identified sooner.

All very well, but why Linked Data?

One of the characteristics of supply chains is that they tend to deal with many differing organisations engaged in many differing processes – cultivation, packing, assembly, manufacture, shipping, distribution, retailing, etc.  Traditionally the computerisation of information flow between differing organisations and their differing systems and procedures has been a difficult nut to crack.  Getting all the players to agree and conform is an almost impossible task.  One of the many benefits of Linked Data is the ability to extract data from disparate sources describing different things and aggregate them together.  Yes, you need some loose coordination between the parties around identification of concepts, etc., but you do not need to enforce a regimented vanilla system everywhere.
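
To illustrate that aggregation point, here is a minimal sketch in Python. Everything in it is hypothetical: the two sources, the example.org identifiers, the predicate names, and the `aggregate` helper are invented for illustration. The only coordination assumed between the parties is that both use the same identifier for the same batch of produce.

```python
from collections import defaultdict

# Hypothetical statements from two independent parties, each published as
# (subject, predicate, object) triples about a shared batch identifier.
GROWER = [
    ("http://example.org/batch/42", "grownBy", "http://example.org/farm/green-acre"),
    ("http://example.org/batch/42", "harvested", "2012-02-01"),
]
SHIPPER = [
    ("http://example.org/batch/42", "shippedBy", "http://example.org/haulier/fastfreight"),
    ("http://example.org/batch/42", "delivered", "2012-02-03"),
]


def aggregate(*sources):
    """Merge triples from any number of sources into one description per subject."""
    merged = defaultdict(dict)
    for source in sources:
        for subject, predicate, obj in source:
            # Keep every value seen for a predicate; sources may overlap.
            merged[subject].setdefault(predicate, set()).add(obj)
    return merged


graph = aggregate(GROWER, SHIPPER)
batch = graph["http://example.org/batch/42"]
```

Neither party had to adopt the other's system or schema; the shared identifier alone is enough to bring the two descriptions together.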

The automotive industry have already hooked in on this to address the problem of disseminating the mass of information around models of cars and their options.  There was a great panel on the 2nd day of the Semantic Tech and Business Conference in Berlin last month.

My takeaway from the panel: Here is an industry that pragmatically picked up a technology that can not only solve its problems but also can enable it to take innovative steps, not only for individual company competitive advantage but also to move the industry forward in its dealings with its supply/value chain and customers.  However, they are also looking more broadly and openly to, for instance, make data publicly available, which will enhance the used car market.

So back to food.  The local food part of Dr Brewster’s new business model vision stems from the fact that it should be easier for a local producer to broadcast availability of their produce to the world.  Similarly, it should be easier for a retailer to tune in to that information in an agile way and not only become aware of the produce but also be linked to information about the supplier.

Food and Linked Data is also something the team at Kasabi have been focussing in on recently.  Because of the Linked Data infrastructure underpinning the Kasabi Data Marketplace, they have been able to produce an aggregate Food dataset initially from BBC and Foodista.

As the dataset is updated, the Data Team will broaden the sources of food data, and increase the data quality for those passionate about food. They’ll be adding resources and improving the links between them to include things like: chefs, diets, seasonality information, and more.

Food aims to answer questions such as:

  • I fancy cooking something with “X”, but I don’t like “Y” what shall I cook?
  • I am pregnant and vegan, what should I prepare for dinner?

Ambitiously, it could also provide data to be used to aid the invention of new recipes based on the co-occurrence of ingredients.

Answering questions like how can I create something new from what I have is one of those difficult to measure yet nevertheless very apparent benefits of using Linked Data techniques and technologies.

It is very easy to imagine the [Linked Data] enhanced food supply chain of Chris’ vision integrated/aggregated with an evolved Kasabi Food dataset answering questions such as “what can I make for dinner which is wheat-free, contains ingredients grown locally that are available from the local major-chain-supermarket which has disabled parking bays?”.

A bit utopian I know, but what differs today from the similar visions that accompanied Tim Berners-Lee’s original Semantic Web descriptions is that folks like those in the automotive industry and at Kasabi are demonstrating bits of it already.

Bee on plate image from Kasabi.
Declaration: I am a shareholder of Kasabi’s parent company, Talis.

A Data 7th Wave Approaching

Some in the surfing community will tell you that every seventh wave is a big one.  I am getting the feeling, in the world of Web, that a number seven is up next and this one is all about data. The last seventh wave was the Web itself.  Because of that, it is a little constraining to talk about this next one only affecting the world of the Web.  This one has the potential to shift some significant rocks around on all our beaches and change the way we all interact and think about the world around us.

Sticking with the seashore metaphor for a short while longer; waves from the technology ocean have the potential to wash into the bays and coves of interest on the coast of human endeavour and rearrange the pebbles on our beaches.  Some do not reach every cove, and/or only have minor impact; however, some really big waves reach in everywhere to churn up the sand and rocks, significantly changing the way we do things and ultimately think about the world around us.  The post-Web technology waves have brought smaller yet important influences such as ecommerce, social networking, and streaming.

I believe Data, or more precisely changes in how we create, consume, and interact with data, has the potential to deliver a seventh wave impact.  Enough of the grandiose metaphors and down to business.

Data has been around for centuries, from clay tablets to little cataloguing tags on the end of scrolls in ancient libraries, and on into computerised databases that we have been accumulating since the 1960s.  Up until very recently these [digital] data have been closed – constrained by the systems that used them, only exposed to the wider world via user interfaces and possibly a task/product specific API.  With the advent of many data associated advances, variously labelled Big Data, Social Networking, Open Data, Cloud Services, Linked Data, Microformats, Microdata, Semantic Web, Enterprise Data, it is now venturing beyond those closed systems into the wider world.

Well this is nothing new, you might say, these trends have been around for a while – why does this constitute the seventh wave of which you foretell?

It is precisely because these trends have been around for a while, and are starting to mature and influence each other, that they are building to form something really significant.  Take Open Data for instance where governments have been at the forefront – I have reported before about the almost daily announcements of open government data initiatives.  The announcement from the Dutch City of Enschede this week not only talks about their data but also about the open sourcing of the platform they use to manage and publish it, so that others can share in the way they do it.

In the world of libraries, the Ontology Engineering Group (OEG) at the Universidad Politécnica de Madrid are providing a contribution of linked bibliographic data to the gathering mass, alongside the British and Germans, with 2.4 million bibliographic records from the Spanish National Library.  This adds weight to the arguments for a Linked Data future for libraries proposed by the Library of Congress and Stanford University.

I might find some of the activities in the Cloud Computing short-sighted and depressing, yet already the concept of housing your data somewhere other than in a local datacenter is becoming accepted in most industries.

Enterprise use of Linked Data by leading organisations such as the BBC, who are underpinning their online Olympics coverage with it, is showing that it is more than a research tool, or the province only of the open data enthusiasts.

Data Marketplaces are emerging to provide platforms to share and possibly monetise your data.  An example that takes this one step further is from the leading Semantic Web technology company, Talis.  Kasabi introduces the data mixing, merging, and standardised querying of Linked Data into the data publishing concept.  This potentially provides a platform for refining and mixing raw data into new data alloys and products more valuable and useful than their component parts.  An approach that should stimulate innovation both in the enterprise and in the data enthusiast community.

The Big Data community is demonstrating that there are solutions to handling the vast volumes of data we are producing that require us to move out of the silos of relational databases towards a mixed economy.  Programs need to move, not the data.  NoSQL databases, Hadoop, map/reduce: these are all things that are starting to move out of the labs and the hacker communities into the mainstream.

The Social Networking industry which produces tons of data is a rich field for things like sentiment analysis, trend spotting, targeted advertising, and even short term predictions – innovation in this field has been rapid but I would suggest a little hampered by delivering closed individual solutions that as yet do not interact with the wider world which could place them in context.

I wrote a while back about an initiative from the search engine big three to encourage the SEO industry to embed simple structured data in their HTML.  The carrot they are offering for this effort is enhanced display in results listings – Google calls these Rich Snippets.  When first announced, the folks behind it concentrated on Microdata as the embedding format – something that wouldn’t frighten the SEO community horses too much.  However they did [over a background of loud complaining from the Semantic Web / Linked Data enthusiasts that RDFa was the only way] also indicate that RDFa would eventually be supported.  By engaging with SEO folks on terms that they understand, this move had the potential to get far more structured data published on the Web than any TED Talk from Sir Tim Berners-Lee, preaching from people like me, or guidelines from governments could ever do.

The above short list of pebble stirring waves is both impressive in its breadth and encouraging in its potential, yet none of them are the stuff of a seventh wave.

So what caused me to open up my MacBook and start writing this?  It was a post from Manu Sporny, indicating that Google were not waiting for RDFa 1.1 Lite (the RDFa version that will be supported) to be ratified.  They are already harvesting, and using, structured information from web pages that has been encoded using RDFa.  The use of this structured data has resulted in enhanced display on the Google pages with items such as event date & location information, and recipe preparation timings.

Manu references sites that seem to be running Drupal, the open source CMS software, and specifically a Drupal plug-in for rendering data encoded as RDFa.  This approach answers some of the critics of embedding data into a site’s HTML, especially as RDF, who say it is ugly and difficult to understand.  It is not there for humans to parse or understand and, with modules such as the Drupal one, humans will not need to get their hands dirty down at code level.  The recognised vocabularies currently support a small but important number of ‘things’.  These, currently supplemented by GoodRelations and Recipes, will hopefully be joined by others to broaden the scope of descriptive opportunities.
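
To make the tooling point concrete, here is a minimal sketch in Python of the kind of work a CMS module could do on an author's behalf. It is not the Drupal module's code: the `recipe_rdfa` function is hypothetical, and the choice of the schema.org vocabulary with its `Recipe` type and `name` and `prepTime` properties is an assumption made for illustration. The `vocab`, `typeof`, and `property` attributes are the ones RDFa Lite defines.

```python
from html import escape


def recipe_rdfa(name, prep_time_iso, prep_time_label):
    """Render a recipe as an HTML fragment carrying RDFa Lite attributes."""
    return (
        '<div vocab="http://schema.org/" typeof="Recipe">\n'
        f'  <h2 property="name">{escape(name)}</h2>\n'
        # The machine-readable duration goes in datetime; the label is for humans.
        f'  <time property="prepTime" datetime="{escape(prep_time_iso)}">'
        f'{escape(prep_time_label)}</time>\n'
        '</div>'
    )


markup = recipe_rdfa("Leek & Potato Soup", "PT15M", "15 minutes")
print(markup)
```

The author only supplies the recipe name and timing; the attribute soup that critics find ugly is generated, not hand coded.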

So roll the clock forward, not too far, to a landscape where a large number of sites (incentivised by the prospect of listings as enriched as their competitors’ results) are embedding structured data in their pages as normal practice.  By then most if not all web site delivery tools should be able to embed the RDF data automatically.  Google and the other web crawling organisations will rapidly build up a global graph of the things on the web, their types, relationships and the pages that describe them.  A nifty example of providing a very specific, easily understood benefit in return for a change in the way web sites are delivered, that results in a global shift in the amount of structured data accessible for the benefit of all.  Google Fellow and SVP Amit Singhal recently gave insight into this Knowledge Graph idea.

The Semantic Web / Linked Data proponents have been trying to convince everyone else of the great good that will follow once we have a web interlinked at the data level with meaning attached to those links.  So far this evangelism has had little success.  However, this shift may give them what they want via an unexpected route.

Once such a web emerges, and most importantly is understood by the commercial world, innovations that will influence the way we interact will naturally follow.  A Google TV, with access to such rich resource, should have no problem delivering an enhanced viewing experience by following structured links embedded in a programme page to information about the cast, the book of the film, the statistics that underpin the topic, or other programmes from the same production company.  Our iPhone version next-but-one, could be a personal node in a global data network, providing access to relevant information about our location, activities, social network, and tasks.

These slightly futuristic predictions will only become possible on top of a structured network of data, which I believe is what could very well emerge if you follow through on the signs that Manu is pointing out.  Reinforced by, and combining with, the other developments I referenced earlier in this post, I believe we may well have a seventh wave approaching.  Perhaps I should look at the beach again in five years’ time to see if I was right.

Wave photo from Nathan Gibbs in Flickr
Declarations – I am a Kasabi Partner and shareholder in Kasabi parent company Talis.

A Kasabi Day at Semtech Berlin

I spent yesterday at the first day of the excellent Semantic Tech and Business Conference 2012 in Berlin.  It was a good day covering a wide range of topics, a great range of speakers and talks, and most encouragingly some really good conversations in the breaks.  I had the pleasure of presenting the opening session, The Simple Power of the Link, which seemed to provide a good grounding introduction to what to some is a fairly complex topic.  My slides are available on Slideshare, and I provided a background article, if you want to check them out.

In my role as guest blogger I created an overview of Day 1 sessions I attended and enjoyed.

Something that struck me was the number of references to the Kasabi Data Marketplace during the day.  Well yes, you might say, you are a Kasabi Partner and Kasabi staff members Knud Möller and Benjamin Nowack gave presentations.  Of course you would be right.  However, I also noticed references to it in other presentations and in general conversations.

For example, keynote speaker and ‘Semantic Fireman’ Bart van Leeuwen shared the fact that there is an open, publicly available version of the Amsterdam Fire Service Data hosted in Kasabi.  The reasoning he gave for doing this was that once he had decided to make his data open, he needed somewhere easy to put it that did not require him to worry about things like infrastructure, servers, and scaling.  Kasabi provides that, plus the SPARQL and API access that enables people to play with his data, which he encouraged people to do.

Other reasons for referencing Kasabi seemed to be twofold.  Firstly, as with Bart, it is an easy cloud-based place to put your data and let it handle access, APIs and loadings that you initially have no idea about.  Secondly, and far less clearly understood, is the idea that the team at Kasabi may have an insight into a possible business model for delivering generic services with Linked Data at the core.

This is not intended to be a sales pitch for Kasabi, the team there can do that very well themselves.  I just found it interesting to note that it seems to be hitting a spot in the Semantic Web / Linked Data consciousness that nothing else quite is at the moment.

Declarations – I am a Kasabi Partner and shareholder in Kasabi parent company Talis.

Cloud Computing Back In The Future

Yesterday found me in the National Hall of London’s Olympia, checking out Cloud Expo Europe. Much, in Linked Data circles, is implied about the mutual benefit of adopting the Cloud and Linked Open Data.  Many of the technology and service providers in the Linked Data / Semantic Web space benefit from the scalability and flexibility of delivering their services from the Cloud as Software and/or Data as a Service (SaaS / DaaS). Many employ raw services from the cloud themselves, helping them accrue and pass on those benefits.

A prime example of this is Kasabi, a Linked Data powered data marketplace built on the latest generation of the Talis SaaS data platform that already provides services for organisations such as Ordnance Survey and the British Library.

I know from experience, the Kasabi operation realises many of the benefits put forward as reasons to reach for the cloud by proponents of the technology – near to zero in-house infrastructure costs, ability to rapidly scale up or down in reaction to demands, availability everywhere on anything, lower costs, etc.

So I was interested to see if the cloud vendors were as data [and the value within it] aware as the Linked Data vendors are cloud aware.

Unfortunately I can only report back a massive sense of disappointment in the lack of vision, and siloed parochial views, from practically everyone I met. The visit to the show took me back a decade or so, to the equivalent events that then extolled the virtues of the latest stack of servers that would grace your datacenter. Same people, same type of sales pitch, but now slightly fewer scantily clad or silly costumed sales people, and an iPad to win on most every stand.

How can a cloud solution salesperson be siloed and parochial in their views, you may ask? Isn’t the cloud all about opening up access to your business processes across locations and devices, taking your data out of your datacenters into hosted services, and saving money whilst gaining flexibility? Yes it is, and if I was running any organisation from a tiny one like Data Liberate to a massive corporation or government I would expect to be shot for not looking to the cloud for any new or refreshed service.

But I would also expect to be severely criticised for not also looking to see what other value could be delivered by capitalising on the distributed nature of the cloud and the Web that delivers it. The basic pitch from many I spoke to boiled down to “let us take what you do, unchanged, and put it in our cloud”. One line I must share with you was “back your data up to our cloud and we can save you all that hassle of mucking about with tapes”.

Perhaps I am being a bit harsh. There is potentially significant ROI that can be gained from moving processes, as is, into the cloud and I would recommend all organisations to consider it. I expected a significant number of exhibitors to be doing exactly what I describe. My disappointment comes from finding not a single one who could see beyond the simple replacement of locally hosted hardware (and staff) with cloud services.

Perhaps I am getting a bit too visionary in my old age.

There was a glimmer of light during the day –  I read Paul Miller’s write up, and scanned the Twitter stream for #cloudcamp, which took place in London the evening before.  Maybe I should have just attended that, which unfortunately I couldn’t.  Then I might be less downbeat about the ‘Cloud’ future just taking us back to the same old implementations of the past, just hosted elsewhere – an opportunity being missed.

If you know different, let me know and raise my mood a bit.

Disclosure: I am a Kasabi Partner and shareholder in Kasabi’s parent company, Talis.
Clouds from picture by Martin Sojka on Flickr

Open Data: Digital Fuel or Raw Material?

I have been reading with interest the recently published discussion paper from Harvard University’s Joan Shorenstein Center on the Press, Politics and Public Policy by former U.S. Chief Information Officer, Vivek Kundra, entitled Digital Fuel of the 21st Century: Innovation through Open Data and the Network Effect [pdf].  Well worth a read to place the current [Digital] Revolution we are somewhere in the middle of, in relation to preceding revolutions and the ages that they begat – the agricultural age, lasting thousands of years – the industrial age, lasting hundreds of years – and the digital revolution/age, which has already had massive impacts on individuals, society, governments and commerce in just a few short decades.

Paraphrasing his introduction: the microprocessor, “the new steam engine” powering the Information Economy, is being fuelled by open data.  Stepping on to dangerous mixed metaphor territory here, but I see him implying that the network effects, both technological and social, are turning that basic open data fuel into a high-octane brew driving massive change to our world.

Vivek goes on to catalogue some of the effects of this digital revolution.  With his background in county, state, and federal US government it is unsurprising that his examples are around the effects of opening up public data, but that does not make them less valid.  He talks about four shifts in power that are emerging and/or need to occur:

  • Fighting government corruption, improving accountability and enhancing government services – open [democratised] data driving the public’s ability to hold the public sector to account, exposing hidden, or unknown, facts and trends.
  • Changing the default setting of government to open, transparent and participatory – changing the attitude of those within government to openly publish their data by default so that it can be used to inform their populations, challenge their actions and services, and stimulate innovation.
  • Create new models of journalism to separate signal from noise to provide meaningful insights – innovative analysis of publicly available data can surface issues and stories that would otherwise be buried in the noise of general government output.
  • Launch multi-billion dollar businesses based upon public sector data – by applying their specific expertise to the analysis, collation, and interpretation of open public data

All good stuff, and a great overview for those looking at this digital revolution as impacted by public open data.  As to what sort of age it will lead to, I think we need to look at a couple of steps further on in the revolution.

The agricultural revolution was based upon the move away from a nomadic existence, the planting and harvesting of crops and the creation of settlements.  The age that follows, I would argue, was based upon the outputs of those efforts enabling the creation of business and the trading of surpluses.  A new layer of commerce emerged, built upon the basic outputs of the revolutionary activities.

The industrial revolution introduced powered machines, replacing manual labour, massively increasing efficiency and productivity.  The age that followed was characterised by manufacturing – a new layer of added value, taking the basic raw materials produced or mined by these machines and combining them into new complex products.

Which brings me to what I would prefer to call the data revolution, where today we are seeing data as a fuel consumed to drive our information steam engines.  I would argue that soon we will recognise that data is not just a fuel but also a raw material.  Data from many sources (public, private and personal) in many forms (open, commercially licensed and closed), will be combined with entrepreneurial innovation and refined to produce new complex products and services. In the same way that whole new industries emerged in the industrial era, I believe we will look back at today and see the foundations of new and future industries.  I published some thoughts on this in a previous post a year or so ago which I believe are still relevant.

Today, unless you want to expend significant effort on understanding individual data sets, it is difficult to deliver an information service or application that depends on more than a couple of data sources.  This is because we are still trying to establish the de facto standards for presenting, communicating and consuming data.  We have mostly succeeded for web pages, with HTML and the gradual demise of pragmatic moment-in-time diversionary solutions such as Flash.  However on the data front, we are still where the automobile industry was before it agreed the order and placement of the foot pedals in a car.

The answer, I believe, will emerge to be the adoption of data packaging and linking techniques and standards – Linked Data.  I say this, not just because I am an evangelist for the benefits of Linked Data, but because it exhibits the same distributed, open and generic features that exemplify what has been successful for the Web.  It also builds upon those Web standards.  Much is talked, and hyped, about Big Data – another moment-in-time term.  Once we start linking, consuming, and building, it will be on a foundation of data that could only be described as big.  What we label Big today will soon appear to be normal.

What of the Semantic Web, I am asked.  I believe the Semantic Web is a slightly out of focus vision of how the Information Age may look when it is established, expressed in the terms only of what we understand today.  So this is what I am predicting will arrive, but I am also predicting that we will eventually call it something else.

Picture of Vivek Kundra from Wikipedia.

OK So Who Noticed the SOPA Blackout

Well I did for a start!  I chose this auspicious day to move the Data Liberate web site from one hosting provider to another.  The reasons why are a whole other messy story, but I did need some help on the WordPress side of things and [quite rightly in my opinion] they had ‘gone dark’ in support of the SOPA protests.  Frustration, but in a good cause.

Looking at the press coverage from my side of the Atlantic, such as from BBC News, it seems that some in Congress have also started to take notice.  The most fuss in general seemed to be around Wikipedia going dark, demonstrating what the world would be like without the free and easy access to information we have become used to.  All in all I believe the campaign has been surprisingly effective on the visible web.

However, what prompted this post was trying to ascertain how effective it was on the Data Web, which almost by definition is the invisible web.  Ahead of the dark day, a move started on the Semantic Web and Linked Open Data mailing lists to replicate what Wikipedia was doing by going dark on DBpedia – the Linked Data version of Wikipedia structured information.  The discussion was based around the fact that SOPA would not discriminate between human readable web pages and machine-to-machine data transfer and linking, therefore we [concerned about the free web] should be concerned.  Of that there was little argument.

The main issue was that systems, consuming data that suddenly goes away, would just fail.  This was countered by the assertion that, regardless of the machines in the data pipeline, there will always be a human at the end.  Responsible systems providers, should be aware of the issue and report the error/reason to their consuming humans.

Some suggested that instead of delivering the expected data, systems [operated by those that are] protesting should provide data explaining the issue.  How many application developers have taken this circumstance into account in their design, I wonder.  If you, as a human, are accessing a SPARQL endpoint and are presented with a ‘dark’ page, you can understand and come back to query tomorrow.  If you are a system getting different types of, or no, data back, you will see an error.
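As a rough illustration of what taking this into account could look like on the consuming side, here is a minimal Python sketch. It is not a description of how any real endpoint behaved on the day: the function name, the use of HTTP 503 as the ‘gone dark’ signal, and the sample messages are all assumptions made for the example. The only standard detail relied upon is the SPARQL JSON results media type.

```python
import json


def interpret_sparql_response(status, content_type, body):
    """Classify a response from a (hypothetical) SPARQL endpoint.

    Returns a (kind, detail) tuple where kind is 'data', 'protest' or 'error',
    so the calling application can tell its human what actually happened.
    """
    # Assumption: a protesting endpoint answers 503 with a text explanation.
    if status == 503:
        return ("protest", body.strip() or "Service unavailable")

    # Any other non-200 status is reported rather than silently dropped.
    if status != 200:
        return ("error", f"Unexpected HTTP status {status}")

    # A 'dark' HTML page served with status 200 is caught here.
    if "application/sparql-results+json" not in content_type:
        return ("error", f"Expected SPARQL results, got {content_type!r}")

    try:
        bindings = json.loads(body)["results"]["bindings"]
    except (ValueError, KeyError, TypeError):
        return ("error", "Response was not valid SPARQL JSON results")
    return ("data", bindings)
```

The point of the sketch is only that the human at the end of the pipeline gets a reason, not a bare failure.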

The question I have is, who using systems that use Linked Data [that went dark] noticed that there was either a problem, or preferably an effect of the protest?

I suspect the answer is very few, but I would like to hear the experiences of others on this.

Déjà vu

The Web has been around for getting on for a couple of decades now, and massive industries have grown up around the magic of making it work for you and your organisation.  Some of it, it has to be said, can be considered snake-oil.  Much of it is the output of some of the best brains on the planet.  Where the Web is placed on the hit parade of technological revolutions to influence mankind is oft disputed, but it is definitely up there with fire, steam, electricity, computing, and of course the wheel.  Similar debates rage, and will continue to rage virtually, around the hit parade of web features that will in retrospect have been most influential – pick your favourites: HTTP, XML, REST, Flash, RSS, SVG, the URL, the href, CSS, RDF – the list is a long one.

I have observed a pattern as each of the successful new enhancements to the web has been introduced, and then generally adopted.  Firstly there is a disconnect between the proponents of the new approach/technology/feature and the rest of us.  The former split their passions between focusing on the detailed application, rules, and syntax of its use, and broadcasting its worth to the world, not quite understanding why the web masses do not ‘get it’ and adopt it immediately.  This phase is then followed by one of post-hype disillusionment from the creators, especially when others start suggesting simplifications to their baby.  Also at this time back-room adoption by those who find it interesting, but are not evangelistic about it, starts to occur.  The real kick for the web comes from those back-room folks who just use this next thing to deliver stuff and solve problems in a better way.  It is the results of their work that the wider world starts to emulate, so that they can keep up with the pack and remain competitive.  Soon this new feature is adopted by the majority, because all the big boys are using it, and it becomes just part of the tool kit.

A great example of this was RSS.  Not a technological leap but a pragmatic mix of current techniques and technologies mixed in with some lateral thinking and a group of people agreeing to do it in ‘this way’ then sharing it with the world.  As you will see from the Wikipedia page on RSS, the syntax wars raged in the early days – I remember it well: 0.9, 0.91, 1.0, 1.1, 2.0, 2.01, etc.  I also remember trying, not always with success, to convince people around me to use it, because it was so simple.  Looking back it is difficult to say exactly when it became mainstream, but this line from Wikipedia gives me a clue: In December 2005, the Microsoft Internet Explorer team and Microsoft Outlook team announced on their blogs that they were adopting the feed icon first used in the Mozilla Firefox browser. In February 2006, Opera Software followed suit.  From then on, the majority of consumers of RSS were not aware of what they were using and it became just one of the web technologies you use to get stuff done.

I am now seeing the pattern starting to repeat itself again, with structured and linked data.  Many, including me, have been evangelising the benefits of web friendly, structured, linked data for some time now – preaching to a crowd that has been slow in growing, but growing it is.   Serious benefit is now being gained by organisations adopting these techniques and technologies, as our selection of case studies demonstrates.  They are getting on with it, often with our help, using it to deliver stuff.  We haven’t hit the mainstream yet.  For instance, the SEO folks still need to get their head around the difference between content and data.

Something is stirring around the edge of the Semantic Web/Linked Data community that has the potential to give structured web enabled data the kick towards mainstream that RSS got when Microsoft adopted the RSS logo and all that came with it.   That something is Schema.org, an initiative backed by the heavyweights of the search engine world, Google, Yahoo, and Bing.  For the SEO and web developer folks, Schema.org offers a simple attractive proposition – embed some structured data in your HTML and, via things like Google’s Rich Snippets, we will give you a value added display in our search results.  Result, happy web developers with their sites getting improved listing display.  Result, lots of structured data starting to be published by people that you would have had an impossible task in convincing that it would be a good idea to publish structured data on the web.

I was at Semtech in San Francisco in June, just after Schema.org was launched and caused a bit of a stir.  They’ve over simplified the standards that we have been working on for years, dumbing down RDF, diluting the capability, with too small a set of attributes, etc., etc.  When you get under the skin of Schema.org, you see that with support for RDFa, and for RDFa 1.1 Lite, they are not that far from the RDF/Linked Data community.  Schema.org should be welcomed as an enabler for getting loads more structured and linked data on the web.  Is their approach now perfect? No.  Will it influence the development of Linked Data? Yes.  Will the introduction be messy? Yes.  Is it about more than just rich snippets?  Oh yes.  Do the webmasters care at the moment? No.

If you want a friendly insight into what Schema.org is about, I suggest a listen to this month’s Semantic Link podcast, with their guest from Google/Schema.org, Ramanathan V. Guha.

Now where have I seen that name before? – Oh yes, back on the Wikipedia RSS page: “The basic idea of restructuring information about websites goes back to as early as 1995, when Ramanathan V. Guha and others in Apple Computer’s Advanced Technology Group developed the Meta Content Framework.”  So it probably isn’t just me who is getting a feeling of Déjà vu.

This post was also published on the Talis Consulting Blog

Web, Semantic Web, SEO, SERP and Linked Data

Like many of my posts, this one comes from the threads of several disparate conversations coming together in my mind, in an almost astrological conjunction.

One thread stems from my recent Should SEO Focus in on Linked Data? post, in which I was concluding that the group, loosely described as the SEO community, could usefully focus in on the benefits of Linked Data in their quest to improve the business of the sites and organisations they support. Following the post I received an email looking for clarification of something I said.

I am interested in understanding better the allusion you make in this paragraph:

One of the major benefits of using RDFa is that it can encode the links to other sources, that is the heart of Linked Data principles and thus describe the relationships between things. It is early days with these technologies & initiatives. The search engine providers are still exploring the best way to exploit structured information embedded in and/or linked to from a page. The question is do you just take RDFa as a new way of embedding information in to a page for the search engines to pick up, or do you delve further in to the technology and see it as public visibility of an even more beneficial infrastructure for your data.

If the immediate use-case for RDFa (microdata, etc.) is search engine optimization, what is the “even more beneficial infrastructure”? If the holy grail is search engine visibility, rank, relevance and rich-results, what is the “even more”?

In reply I offered:

What I was trying to imply is that if you build your web presence on top of a Linked Data described dataset / way of thinking / platform, you get several potential benefits:

  • Follow-your-nose navigation
  • Flexible, easier to maintain page structure
  • Value added data from external sources….
  • … therefore improved [user] value with less onerous cataloguing processes
  • Agile/flexible systems – easy to add/mix in new data
  • Lower cost of enhancement (eg. BBC added dinosaurs to the established Wildlife Finder with minimal effort)
  • In-built APIs [with very little extra effort] to allow others to access / build apps upon / use your data in innovative ways
  • As per the BBC a certain level of default SEO goodness
  • Easy to map, and therefore link, your categorisations to ones the engines do/may use (eg. Google are using MusicBrainz to help folks navigate around – if, say as the BBC do, you link your music categories to those of MusicBrainz you can share in that effect).
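To make the ‘follow-your-nose’ and ‘value added data from external sources’ points a little more concrete, here is a toy Python sketch. Everything in it is hypothetical: the URIs, the property names, the data, and the idea of holding each dataset in a plain dictionary. A real deployment would dereference URIs over HTTP and use an RDF library rather than dictionaries.

```python
# All URIs and data below are made up for illustration.
SAME_AS = "owl:sameAs"

# Your own categorisation, with one link out to an external source
local = {
    "http://example.org/genre/jazz": {
        "label": ["Jazz"],
        SAME_AS: ["http://external.example/genre/jazz"],
    },
}

# What the external source says about its own identifier
external = {
    "http://external.example/genre/jazz": {
        "label": ["Jazz"],
        "artist": ["Example Artist"],
    },
}


def follow_your_nose(uri, datasets, seen=None):
    """Merge every property reachable from uri by following sameAs links."""
    seen = set() if seen is None else seen
    if uri in seen:  # guard against circular links
        return {}
    seen.add(uri)
    merged = {}
    for dataset in datasets:
        for prop, values in dataset.get(uri, {}).items():
            if prop == SAME_AS:
                # Follow the link and fold in whatever is found there
                for target in values:
                    linked = follow_your_nose(target, datasets, seen)
                    for p, vs in linked.items():
                        merged.setdefault(p, set()).update(vs)
            else:
                merged.setdefault(prop, set()).update(values)
    return merged
```

Calling `follow_your_nose("http://example.org/genre/jazz", [local, external])` returns the local label together with the artist data that only the external source holds, which is the ‘mix in new data’ effect with a single link doing the work.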

So what I am saying is that you can ‘just’ take RDFa as a dialect to send your stuff to Google (in which case microdata/microformats could be equally good), but then you will miss out on the potential benefits I describe.

From my point of view there are two holy grails (if that isn’t breaking the analogy 😉):

  1. Get visibility and hence folks to hit your online resources.
  2. Provide the best experience/usefulness/value to them when they do.

Linked Data techniques and technologies have great value for the data owners in the second of those, with the almost spin-off benefit of helping you with the first one.

The next thread was not a particular item but a general vibe, from several bits and pieces I read – that RDFa was confusing and difficult. This theme I detect was coming from those only looking at it from a ‘how do I encode my metadata for Google to grab it for its snippets’ point of view (and there is nothing wrong in that) or those trying to justify a ‘Schema.org is the only show in town’ position. Coming at it from the first of those two points of view, I have some sympathy – those new to RDFa must feel like I do (with my basic understanding of HTML) when I peruse the contents of many a CSS file looking for clues as to the designer’s intention.

However I would make two comments. Firstly, a site surfacing lots of data and hence wanting to encode RDFa amongst the human-readable stuff, will almost certainly be using tools to format the data as it is extracted from an underlying data source – it is those tools that should be evolved to produce the RDFa as a by-product. Secondly, it is the wider benefits of Linked Data, which I’m trying to promote in my posts, that justify people investing time to focus on it. The fact that you may use RDFa to surface that data embedded in HTML, so that search engines can pick it up, is implementation detail – important detail, but missing the point if that is all you focus upon.
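As a small illustration of the first comment, here is a sketch in Python of a page-formatting function that emits RDFa as a by-product of rendering a record for human readers. The function name and the record layout are hypothetical, and the choice of the schema.org `Book` type with its `name` and `author` properties is an assumption made for the example. The `vocab`, `typeof` and `property` attributes are the ones defined by RDFa Lite.

```python
from html import escape


def render_book(record):
    """Render one record as HTML for people, carrying RDFa Lite for machines."""
    # Assumption: schema.org's Book type with its name and author properties.
    return (
        '<div vocab="http://schema.org/" typeof="Book">'
        f'<h2 property="name">{escape(record["title"])}</h2>'
        f'<p>by <span property="author">{escape(record["author"])}</span></p>'
        "</div>"
    )
```

Whoever maintains the underlying data never has to write an attribute by hand; the markup falls out of the template that was going to produce the page anyway.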

Thread number three, is the overhype of the Semantic Web. Someone who I won’t name, but I’m sure won’t mind me quoting, suggested the following as the introduction to a bit of marketing: The Semantic Web is here and creating new opportunities to revamp and build your business.

The Semantic Web is not here yet, and won’t be for some while. However what is here, and is creating opportunities, is Linked Data and the pragmatic application of techniques, technologies and standards that are enabling the evolution towards an eventual Semantic Web.

This hyped approach is a consequence of the stance of some in the Semantic Web community who with fervour have been promoting its coming, in its AI entirety, for several years and fail to understand why all of us [enthusiasts, researchers, governments, commerce and industry] are not implementing all of its facets now. If you have the inclination, you can see some of the arguments playing out now in this thread on a SemWeb email list where Juan Sequeda asks for support for his SXSW panel topic suggestion.

A simple request, that I support, but the thread it created shows that ‘eating the whole elephant’ of the Semantic Web will be too much to introduce it successfully to the broad Web, SEO, SERP community, and that the ‘one mouthful at a time’ approach may have a better chance of success. Also any talk of a ‘killer app’ is futile – we are talking about infrastructure here. What is the killer app feature of the Web? You could say linked, globally distributed, consistently accessed documents; an infrastructure that facilitated the development of several killer businesses and business models. We will see the same when we look back on a web enriched by linked, globally distributed, consistently accessed data.

So what is my astrological conjunction telling me? There is definitely fertile ground to be explored between the Semantic Web and the Web in the area of the pragmatic application of Linked Data techniques and technologies. People in both camps need to open their minds to the motivations and vision of the other. There is potential to be realised, but we are definitely not in silver bullet territory.

As I said in my previous post, I would love to explore this further with folks from the world of SEO & SERP. If you want to talk through what I have described, I encourage you to drop me an email or comment on this post.

This post was also published on the Talis Consulting Blog