I am pleased to share with you a small but significant step on the Linked Data journey for WorldCat and the exposure of data from OCLC.
Content-negotiation has been implemented for the publication of Linked Data for WorldCat resources.
For those immersed in the publication and consumption of Linked Data, there is little more to say. However I suspect there are a significant number of folks reading this who are wondering what the heck I am going on about. It is a little bit techie but I will try to keep it as simple as possible.
Back last year, a linked data representation of each (of the 290+ million) WorldCat resources was embedded in it’s web page on the WorldCat site. For full details check out that announcement but in summary:
All resource pages include Linked Data
Human visible under a Linked Data tab at the bottom of the page
That same data is now available in several machine readable RDF serialisations. RDF is RDF, but dependant on your use it is easier to consume as RDFa, or XML, or JSON, or Turtle, or as triples.
In many Linked Data presentations, including some of mine, you will hear the line “As I clicked on the link a web browser we are seeing a html representation. However if I was a machine I would be getting XML or another format back.” This is the mechanism in the http protocol that makes that happen.
Let me take you through some simple steps to make this visible for those that are interested.
Starting with a resource in WorldCat: http://www.worldcat.org/oclc/41266045. Clicking that link will take you to the page for Harry Potter and the prisoner of Azkaban. As we did not indicate otherwise, the content-negotiation defaulted to returning the html web page.
To specify that we want RDF/XML we would specify http://www.worldcat.org/oclc/41266045.rdf (dependant on your browser this may not display anything, but allow you to download the result to view in your favourite editor)
This allows you to manually specify the serialisation format you require. You can also do it from within a program by specifying, to the http protocol, the format that you would accept from accessing the URI. This means that you do not have to write code to add the relevant suffix to each URI that you access. You can replicate the effect by using curl, a command line http client tool:
If you embed links to WorldCat resources in your linked data, the standard tools used to navigate around your data should now be able to automatically follow those links into and around WorldCat data. If you have the URI for a WorldCat resource, which you can create by prefixing an oclc number with ‘http://www.worldcat.org/oclc/’, you can use it in a program, browser plug-in, smartphone/facebook app to pull data back, in a format that you prefer, to work with or display.
Go have a play, I would love to hear how people use this.
Show me an example of the effective publishing of Linked Data – That, or a variation of it, must be the request I receive more than most when talking to those considering making their own resources available as Linked Data, either in their enterprise, or on the wider web.
There are some obvious candidates. The BBC for instance, makes significant use of Linked Data within its enterprise. They built their fantastic Olympics 2012 online coverage on an infrastructure with Linked Data at its core. Unfortunately, apart from a few exceptions such as Wildlife and Programmes, we only see the results in a powerful web presence. The published data is only visible within their enterprise.
Dbpedia is another excellent candidate. From about 2007 it has been a clear demonstration of Tim Berners-Lee’s principles of using URIs as identifiers and providing information, including links to other things, in RDF – it is just there at the end of the dbpedia URIs. But for some reason developers don’t seem to see it as a compelling example. Maybe it is influenced by the Wikipedia effect – interesting but built by open data geeks, so not to be taken seriously.
A third example, which I want to focus on here, is Ordnance Survey. Not generally known much beyond the geographical patch they cover, Ordnance Survey is the official mapping agency for Great Britain. Formally a government agency, they are best known for their incredibly detailed and accurate maps that are the standard accessory for anyone doing anything in the British countryside. A little less known is that they also publish information about post-code areas, parish/town/city/county boundaries, parliamentary constituency areas, and even European regions in Britain. As you can imagine, these all don’t neatly intersect, which makes the data about them a great case for a graph based data model and hence for publishing as Linked Data. Which is what they did a couple of years ago.
The reason I want to focus on their efforts now, is that they have recently beta released a new API suite, which I will come to in a moment. But first I must emphasise something that is often missed.
Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’. With only standard [http] web protocols, you can get the data for an entity in their dataset by just doing a http GET request on the identifier. (eg. For my local village: http://data.ordnancesurvey.co.uk/id/7000000000002929). What you get back is some nicely formatted html for your web browser, and with content negotiation you can get the same thing as RDF/XML, JSON or turtle. As it is Linked Data, what you get back also includes links to to other data, enabling you to navigate your way around their data from entity to entity.
An excellent demonstration of the basic power and benefit of Linked Data. So why is this often missed? Maybe it is because there is nothing to learn, no API documentation required, you can see and use it by just entering a URI into your web browser – too simple to be interesting perhaps.
To get at the data in more interesting and complex ways you need the API set thoughtfully provided by those that understand the data and some of the most common uses for it, Ordnance Survey.
The API set, now in beta, in my opinion is a most excellent example of how to build, document, and provide access to Linked Data assets in this way.
Firstly the APIs are applied as a standard to four available data sets – three individual, and one combining all three data sets. Nice that you can work with an individually focussed set or get data from all in a consolidated graph.
There are four APIs:
Lookup – a simple way to extract an RDF description of a single resource, using its URI.
Search – for running keyword searches over a dataset.
Reconciliation – a simple web service that supports linking of datasets to the Ordnance Survey Linked Data.
Each API is available to play with on a web page complete with examples and pop-up help hints. It is very easy and quick to get your head around the capabilities of the individual APIs, the use of parameters, and returned formats without having to read documentation or cut a single line of code.
For a quick intro there is even a page with them all on for you to try. When you do get around to cutting code, the documentation for each API is also well presented in simple and understandable form. They even include details of the available output formats and expected http response codes.
Finally a few general comments.
Firstly the look, feel, and performance of the site reflects that this is a robust serious professional service and fills you with confidence about building your application on its APIs. Developers of services and APIs, even for internal use, often underestimate the value of presenting and documenting their offering in a professional way. How often have you come across API documentation that makes the first web page look modern and wonder about investing the time in even looking at it. Also a site with a snappy response ups your confidence that your application will perform well when using their service.
Secondly the range of APIs, all cleanly and individually satisfying specific general needs. So for instance you can usefully use Search and Lookup without having any understanding of RDF or SPARQL – the power of SPARQL being there only if you understand and need it.
The additional features – CORS Support and Response Caching – (detailed on the API documentation pages) also demonstrate that this service has been built with the issues of the data consumer in mind. Providing the tools for consumers to take advantage of web caching in their application will greatly enhance response and performance. The CORS Support enables the creation of in browser applications that draw data from many sites – one of the oft promoted benefits of linked data, but sometimes a little tricky to implement ‘in browser’.
I can see this site and its associated APIs greatly enhancing the reputation of Ordnance Survey; underpinning the development of many apps and applications; and becoming an ideal source for many people to go ‘to try out’, when writing their first API consuming application code.
Help spotlight library innovation and send a library linked data practitioner to the SemTechBiz conference in San Francisco, June 2-5
Update from organisers: We are pleased to announce that Kevin Ford, from the Network Development and MARC Standards Office at the Library of Congress, was selected for the Semantic Web.com Spotlight on Innovation for his work with the Bibliographic Framework Initiative (BIBFRAME) and his continuing work on the Library of Congress’s Linked Data Service (loc.id). In addition to being an active contributor, Kevin is responsible for the BIBFRAME website; has devised tools to view MARC records and the resulting BIBFRAME resources side-by-side; authored the first transformation code for MARC data to BIBFRAME resources; and is project manager for The Library of Congress’ Linked Data Service. Kevin also writes and presents frequently to promote BIBFRAME, ID.LOC.GOV, and educate fellow librarians on the possibilities of linked data.
Without exception, each nominee represented great work and demonstrated the power of Linked Data in library systems, making it a difficult task for the committee, and sparking some interesting discussions about future such spotlight programs.
Congratulations, Kevin, and thanks to all the other great library linked data projects nominated!
OCLC and LITA are working to promote library participation at the upcoming Semantic Technology & Business Conference (SemTechBiz). Libraries are doing important work with Linked Data. SemanticWeb.com wants to spotlight innovation in libraries, and send one library presenter to the SemTechBiz conference expenses paid.
SemTechBiz brings together today’s industry thought leaders and practitioners to explore the challenges and opportunities jointly impacting both business leaders and technologists. Conference sessions include technical talks and case studies that highlight semantic technology applications in action. The program includes tutorials and over 130 sessions and demonstrations as well as a hackathon, start-up competition, exhibit floor, and networking opportunities. Amongst the great selection of speakers you will find yours truly!
If you know of someone who has done great work demonstrating the benefit of linked data for libraries, nominate them for this June 2-5 conference in San Francisco. This “library spotlight” opportunity will provide one sponsored presenter with a spot on the conference program, paid travel & lodging costs to get to the conference, plus a full conference pass.
Nominations for the Spotlight are being accepted through May 10th. Any significant practical work should have been accomplished prior to March 31st 2013 — project can be ongoing. Self-nominations will be accepted
Even if you do not nominate anyone, the Semantic Technology and Business Conference is well worth experiencing. As supporters of the SemanticWeb.com Library Spotlight OCLC and LITA members will get a 50% discount on a conference pass – use discount code “OCLC” or “LITA” when registering. (Non members can still get a 20% discount for this great conference by quoting code “FCLC”)
As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series. That one – Entification, and the next in the series – Beacons of Availability, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Hubs of Authority
Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years. The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world either in national libraries or cooperative organisations. Two from personal experience come to mind, BLCMP started in Birmingham, UK in 1969 eventually evolved in to the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC. Both in their own way having had significant impact on the worlds of libraries, metadata, and the web (and me!).
One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records. A large number of national libraries have such lists of agreed formats for author and organisational names. The Library of Congress has in addition to its name authorities, subjects, classifications, languages, countries etc. Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations..
These authority files play a major role in the efficient cataloguing of material today, either by being part of the workflow in a cataloguing interface, or often just using the wonders of Windows ^C & ^V keystroke sequences to transfer agreed format text strings from authority sites into Marc record fields.
It is telling that the default [librarian] description of these things is a file – an echo back to the days when they were just that, a file containing a list of names. Almost despite their initial purpose, authorities are gaining a wider purpose. As a source of names for, and growing descriptions of, the entities that the library world is aware of. Many authority file hosting organisations have followed the natural path, in this emerging world of Linked Data, to provide persistent URIs for each concept plus publishing their information as RDF.
These, Linked Data enabled, sources of information are developing importance in their own right, as a natural place to link to, when asserting the thing, person, or concept you are identifying in your data. As Sir Tim Berners-Lee’s fourth principle of Linked Data tells us to “Include links to other URIs. so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative, source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other. A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.
Libraries and librarians have a great brand image, something that attaches itself to the data and services they publish on the web. Respected and trusted are a couple of words that naturally associate with bibliographic authority data emanating from the library community. This data, starting to add value to the wider web, comes from those Marc records I spoke about last time. Yet it does not, as yet, lead those navigating the web of data to those resources so carefully catalogued. In this case, instead of cataloguing so people can find stuff, we could be considered to be enriching the web with hubs of authority derived from, but not connected to, the resources that brought them into being.
So where next? One obvious move, that is already starting to take place, is to use the identifiers (URIs) for these authoritative names to assert within our data, facts such as who a work is by and what it is about. Check out data from the British National Bibliography or the linked data hidden in the tab at the bottom of a WorldCat display – you will see VIAF, LCSH and other URIs asserting connection with known resources. In this way, processes no longer need to infer from the characters on a page that they are connected with a person or a subject. It is a fundamental part of the data.
With that large amount of rich [linked] data, and the association of the library brand, it is hardly surprising that these datasets are moving beyond mere nodes on the web of data. They are evolving in to Hubs of Authority, building a framework on which libraries and the rest of the web, can hang descriptions of, and signposts to, our resources. A framework that has uses and benefits beyond the boundaries of bibliographic data. By not keeping those hubs ‘library only’, we enable the wider web to build pathways to the library curated resources people need to support their research, learning, discovery and entertainment.
As is often the way, you start a post without realising that it is part of a series of posts – as with this one. This, and the following two posts in the series – Hubs of Authority, and Beacons of Availability – together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Entification – a bit of an ugly word, but in my day to day existence one I am hearing more and more. What an exciting life I lead…
What is it, and why should I care, you may be asking.
I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and also as a potential stimulator of significant efficiencies in the creation and management of information about those resources. Taking those benefits as being accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.
That phrase ‘getting library data into a linked data form’ hides multitude of issues. There are some obvious steps such as holding and/or outputting the data in RDF, providing resources with permanent URIs, etc. However, deriving useful library linked data from a source, such as a Marc record, requires far more than giving it a URI and encoding what you know, unchanged, as RDF triples.
Marc is a record based format. For each book catalogued, a record created. The mantra driven in to future cataloguers at library school has been, and I believe often still is, catalogue the item in your hand. Everything discoverable about that item in their hand is transferred on to that [now virtual] catalogue card stored in their library system. In that record we get obvious bookish information such as title, size, format, number of pages, isbn, etc. We also get information about the author (name, birth/death dates etc.), publisher (location, name etc.), classification scheme identifiers, subjects, genres, notes, holding information, etc., etc., etc. A vast amount of information about, and related to, that book in a single record. A significant achievement – assembling all this information for the vast majority of books in the vast majority of the libraries of the world. In this world of electronic resources a pattern that is being repeated for articles, journals, eBooks, audiobooks, etc.
Why do we catalogue? A question I often ask with an obvious answer – so that people can find our stuff. Replicating the polished draws of catalogue cards of old, ordered by author name or subject, indexes are applied to the strings stored in those records . Indexes acting as search access points to a library’s collection.
A spin-off of capturing information in record attributes, about library books/articles/etc., is that we are also building up information about authors, publishers subjects and classifications. So for instance a subject index will contain a list of all the names of the subjects addressed by an individual library collection. To apply some consistency between libraries, authorities – authoritative sets of names, subject headings etc., have emerged so that spellings and name formats could be shared in a controlled way between libraries and cataloguers.
So where does entification come in? Well, much of the information about authors subjects, publishers, and the like is locked up in those records. A record could be taken as describing an entity, the book. However the other entities in the library universe are described as only attributes of the book/article/text. I can attest to the vast computing power and intellectual effort that goes into efforts at OCLC to mine these attributes from records to derive descriptions of the entities they represent – the people, places, organisations, subjects, etc. that the resources are by, about, or related to in some way.
Once the entities are identified, and a model is produced & populated from the records, we can start to work with a true multi-dimensional view of our domain. A major step forward from the somewhat singular view that we have been working with over previous decades. With such a model it should be possible to identify and work with new relationships, such as publishers and their authors, subjects and collections, works and their available formats.
We are in a state of change in the library world which entification of our data will help us get to grips with. As you can imagine as these new approaches crystallise, they are leading to all sorts of discussions around what are the major entities we need to concern ourselves with; how do we model them; how do we populate that model from source [record] data; how do we do it without compromising the rich resources we are working with; and how do we continue to provide and improve the services relied upon at the moment, whilst change happens. Challenging times – bring on the entification!
Typical! Since joining OCLC as Technology Evangelist, I have been preparing myself to be one of the first to blog about the release of linked data describing the hundreds of millions of bibliographic items in WorldCat.org. So where am I when the press release hits the net? 35,000 feet above the North Atlantic heading for LAX, that’s where – life just isn’t fair.
By the time I am checked in to my Anahiem hotel, ready for the ALA Conference, this will be old news. Nevertheless it is significant news, significant in many ways.
OCLC have been at the leading edge of publishing bibliographic resources as linked data for several years. At dewey.info they have been publishing the top levels of the Dewey classifications as linked data since 2009. As announced yesterday, this has now been increased to encompass 32,000 terms, such as this one for the transits of Venus. Also around for a few years is VIAF (the Virtual International Authorities File) where you will find URIs published for authors, such as this well known chap. These two were more recently joined by FAST (Faceted Application of Subject Terminology), providing usefully applicable identifiers for Library of Congress Subject Headings and combinations thereof.
Despite this leading position in the sphere of linked bibliographic data, OCLC has attracted some criticism over the years for not biting the bullet and applying it to all the records in WorldCat.org as well. As today’s announcement now demonstrates, they have taken their linked data enthusiasm to the heart of their rich, publicly available, bibliographic resources – publishing linked data descriptions for the hundreds of millions of items in WorldCat.
Let me dissect the announcement a bit….
First significant bit of news – WorldCat.org is now publishing linked data for hundreds of millions of bibliographic items – that’s a heck of a lot of linked data by anyone’s measure. By far the largest linked bibliographic resource on the web. Also it is linked data describing things, that for decades librarians in tens of thousands of libraries all over the globe have been carefully cataloguing so that the rest of us can find out about them. Just the sort of authoritative resources that will help stitch the emerging web of data together.
Second significant bit of news – the core vocabulary used to describe these bibliographic assets comes from schema.org. Schema.org is the initiative backed by Google, Yahoo!, Microsoft, and Yandex, to provide a generic high-level vocabulary/ontology to help mark up structured data in web pages so that those organisations can recognise the things being described and improve the services they can offer around them. A couple of examples being Rich Snippet results and inclusion in the Google Knowledge Graph.
As I reported a couple of weeks back, from the Semantic Tech & Business Conference, some 7-10% of indexed web pages already contain schema.org, microdata or RDFa, markup. It may at first seem odd for a library organisation to use a generic web vocabulary to mark up it’s data – but just think who the consumers of this data are, and what vocabularies are they most likely to recognise? Just for starters, embedding schema.org data in WorldCat.org pages immediately makes them understandable by the search engines, vastly increasing the findability of these items.
Third significant bit of news – the linked data is published both in human readable form and in machine readable RDFa on the standard WorldCat.org detail pages. You don’t need to go to a special version or interface to get at it, it is part of the normal interface. As you can see, from the screenshot of a WordCat.org item above, there is now a Linked Data section near the bottom of the page. Click and open up that section to see the linked data in human readable form. You will see the structured data that the search engines and other systems will get from parsing the RDFa encoded data, within the html that creates the page in your browser. Not very pretty to human eyes I know, but just the kind of structured data that systems love.
Fourth significant bit of news – OCLC are proposing to cooperate with the library and wider web communities to extend Schema.org making it even more capable for describing library resources. With the help of the W3C, Schema.org is working with several industry sectors to extend the vocabulary to be more capable in their domains – news, and e-commerce being a couple of already accepted examples. OCLC is playing it’s part in doing this for the library sector.
Take a closer look at the markup on WorldCat.org and you will see attributes from a library vocabulary. Attributes such as library:holdingsCount and library:oclcnum. This library vocabulary is OCLC’s conversation starter with which we want to kick off discussions with interested parties, from the library and other sectors, about proposing a basic extension to schema.org for library data. What better way of testing out such a vocabulary – markup several million records with it, publish them and see what the world makes of them.
Fifth significant bit of news – the WorldCat.org linked data is published under an Open Data Commons (ODC-BY) license, so it will be openly usable by many for many purposes.
Sixth significant bit of news – This release is an experimental release. This is the start, not the end, of a process. We know we have not got this right yet. There are more steps to take around how we publish this data in ways in addition to RDFa markup embedded in page html – not everyone can, or will want to, parse pages to get the data. There are obvious areas for discussion around the use of schema.org and the proposed library extension to it. There are areas for discussion about the application of the ODC-BY license and attribution requirements it asks for. Over the coming months OCLC wants to constructively engage with all that are interested in this process. It is only with the help of the library and wider web communities that we can get it right. In that way we can assure that WorldCat linked data can be beneficial for the OCLC membership, libraries in general, and a great resource on the emerging web of data.
As you can probably tell I am fairly excited about this announcement. This, and future stuff like it, are behind some of my reasons for joining OCLC. I can’t wait to see how this evolves and develops over the coming months. I am also looking forward to engaging in the discussions it triggers.
Most Semantic Web and Linked Data enthusiasts will tell you that Linked Data is not rocket science, and it is not. They will tell you that RDF is one of the simplest data forms for describing things, and they are right. They will tell you that adopting Linked Data makes merging disparate datasets much easier to do, and it does. They will say that publishing persistent globally addressable URIs (identifiers) for your things and concepts will make it easier for others to reference and share them, it will. They will tell you that it will enable you to add value to your data by linking to and drawing in data from the Linked Open Data Cloud, and they are right on that too. Linked Data technology, they will say, is easy to get hold of either by downloading open source or from the cloud, yup just go ahead and use it. They will make you aware of an ever increasing number of tools to extract your current data and transform it into RDF, no problem there then.
So would I recommend a self-taught do-it-yourself approach to adopting Linked Data? For an enthusiastic individual, maybe. For a company or organisation wanting to get to know and then identify the potential benefits, no I would not. Does this mean I recommend outsourcing all things Linked Data to a third party – definitely not.
Let me explain this apparent contradiction. I believe that anyone having, or could benefit from consuming, significant amounts of data, can realise benefits by adopting Linked Data techniques and technologies. These benefits could be in the form of efficiencies, data enrichment, new insights, SEO benefits, or even business models. Gaining the full effects of these benefits will only come from not only adopting the technologies but also adopting the different way of thinking, often called open-world thinking, that comes from understanding the Linked Data approach in your context. That change of thinking, and the agility it also brings, will only embed in your organisation if you do-it-yourself. However, I do council care in the way you approach gaining this understanding.
A young child wishing to keep up with her friends by migrating from tricycle to bicycle may have a go herself, but may well give up after the third grazed knee. The helpful, if out of breath, dad jogging along behind providing a stabilising hand, helpful guidance, encouragement, and warnings to stay on the side of the road, will result in a far less painful and rewarding experience.
I am aware of computer/business professionals who are not aware of what Linked Data is, or the benefits it could provide. There are others who have looked at it, do not see how it could be better, but do see potential grazed knees if they go down that path. And there yet others who have had a go, but without a steadying hand to guide them, and end up still not getting it.
You want to understand how Linked Data could benefit your organisation? Get some help to relate the benefits to your issues, challenges and opportunities. Don’t go off to a third party and get them to implement something for you. Bring in a steadying hand, encouragement, and guidance to stay on track. Don’t go off and purchase expensive hardware and software to help you explore the benefits of Linked Data. There are plenty of open source stores, or even better just sign up to a cloud based service such as Kasabi. Get your head around what you have, how you are going to publish and link it, and what the usage might be. Then you can size and specify the technology and/or service you need to support it.
So back to my original question – Is Linked Data DIY a good idea? Yes it is. It is the only way to reap the ‘different way of thinking’ benefits that accompany understanding the application of Linked data in your organisation. However, I would not recommend a do-it-yourself introduction to this. Get yourself a steadying hand.
Is that last statement a thinly veiled pitch for my services – of course it is, but that should not dilute my advice to get some help when you start, even if it is not from me.
Picture of girl learning to ride from zsoltika on Flickr.
Source of cartoon unknown.
Some in the surfing community will tell you that every seventh wave is a big one. I am getting the feeling, in the world of Web, that a number seven is up next and this one is all about data. The last seventh wave was the Web itself. Because of that, it is a little constraining to talk about this next one only effecting the world of the Web. This one has the potential to shift some significant rocks around on all our beaches and change the way we all interact and think about the world around us.
Sticking with the seashore metaphor for a short while longer; waves from the technology ocean have the potential to wash into the bays and coves of interest on the coast of human endeavour and rearrange the pebbles on our beaches. Some do not reach every cove, and/or only have minor impact, however some really big waves reach in everywhere to churn up the sand and rocks, significantly changing the way we do things and ultimately think about the word around us. The post Web technology waves have brought smaller yet important influences such as ecommerce, social networking, and streaming.
I believe Data, or more precisely changes in how we create, consume, and interact with data, has the potential to deliver a seventh wave impact. Enough of the grandiose metaphors and down to business.
Data has been around for centuries, from clay tablets to little cataloguing tags on the end of scrolls in ancient libraries, and on into computerised databases that we have been accumulating since the 1960’s. Up until very recently these [digital] data have been closed – constrained by the systems that used them, only exposed to the wider world via user interfaces and possibly a task/product specific API. With the advent of many data associated advances, variously labelled Big Data, Social Networking, Open Data, Cloud Services, Linked Data, Microformats, Microdata, Semantic Web, Enterprise Data, it is now venturing beyond those closed systems into the wider world.
Well this is nothing new, you might say, these trends have been around for a while – why does this constitute the seventh wave of which you foretell?
It is precisely because these trends have been around for a while, and are starting to mature and influence each other, that they are building to form something really significant. Take Open Data for instance where governments have been at the forefront – I have reported before about the almost daily announcements of open government data initiatives. The announcement from the Dutch City of Enschede this week not only talks about their data but also about the open sourcing of the platform they use to manage and publish it, so that others can share in the way they do it.
I might find some of the activities in the Cloud Computing short-sighted and depressing, yet already the concept of housing your data somewhere other than in a local datacenter is becoming accepted in most industries.
Enterprise use of Linked Data by leading organisations such as the BBC who are underpinning their online Olympics coverage with it are showing that it is more that a research tool, or the province only of the open data enthusiasts.
Data Marketplaces are emerging to provide platforms to share and possibly monetise your data. An example that takes this one step further is Kasabi.com from the leading Semantic Web technology company, Talis. Kasabi introduces the data mixing, merging, and standardised querying of Linked Data into to the data publishing concept. This potentially provides a platform for refining and mixing raw data in to new data alloys and products more valuable and useful than their component parts. An approach that should stimulate innovation both in the enterprise and in the data enthusiast community.
The Big Data community is demonstrating that there are solutions, to handling the vast volumes of data we are producing, that require us to move out of the silos of relational databases towards a mixed economy. Programs need to move – not the data, NoSQL databases, Hadoop, map/reduce, these are are all things that are starting to move out of the labs and the hacker communities into the mainstream.
The Social Networking industry which produces tons of data is a rich field for things like sentiment analysis, trend spotting, targeted advertising, and even short term predictions – innovation in this field has been rapid but I would suggest a little hampered by delivering closed individual solutions that as yet do not interact with the wider world which could place them in context.
I wrote about Schema.org a while back. An initiative from the search engine big three to encourage the SEO industry to embed simple structured data in their html. The carrot they are offering for this effort is enhanced display in results listings – Google calls these Rich Snippets. When first announce, the schema.org folks concentrated on Microdata as the embedding format – something that wouldn’t frighten the SEO community horses too much. However they did [over a background of loud complaining from the Semantic Web / Linked Data enthusiasts that RDFa was the only way] also indicate that RDFa would be eventually supported. By engaging with SEO folks on terms that they understand, this move from from Schema.org had the potential to get far more structured data published on the Web than any TED Talk from Sir Tim Berners-Lee, preaching from people like me, or guidelines from governments could ever do.
The above short list of pebble stirring waves is both impressive in it’s breadth and encouraging in it’s potential, yet none of them are the stuff of a seventh wave.
So what caused me to open up my Macbook and start writing this. It was a post from Manu Sporny, indicating that Google were not waiting for RDFa 1.1 Lite (the RDF version that schema.org will support) to be ratified. They are already harvesting, and using, structured information from web pages that has been encoded using RDF. The use of this structured data has resulted in enhanced display on the Google pages with items such as event date & location information,and recipe preparation timings.
Manu references sites that seem to be running Drupal, the open source CMS software, and specifically a Drupal plug-in for rendering Schema.org data encoded as RDFa. This approach answers some of the critics of embedding Schema.org data into a site’s html, especially as RDF, who say it is ugly and difficult to understand. It is not there for humans to parse or understand and, with modules such as the Drupal one, humans will not need to get there hands dirty down at code level. Currently Schema.org supports a small but important number of ‘things’ in it’s recognised vocabularies. These, currently supplemented by GoodRelations and Recipes, will hopefully be joined by others to broaden the scope of descriptive opportunities.
So roll the clock forward, not too far, to a landscape where a large number of sites (incentivised by the prospect of listings as enriched as their competitors results) are embedding structured data in their pages as normal practice. By then most if not all web site delivery tools should be able to embed the Schema.org RDF data automatically. Google and the other web crawling organisations will rapidly build up a global graph of the things on the web, their types, relationships and the pages that describe them. A nifty example of providing a very specific easily understood benefit in return for a change in the way web sites are delivered, that results in a global shift in the amount of structured data accessible for the benefit of all. Google Fellow and SVP Amit Singhal recently gave insight into this Knowledge Graph idea.
The Semantic Web / Linked Data proponents have been trying to convince everyone else of the great good that will follow once we have a web interlinked at the data level with meaning attached to those links. So far this evangelism has had little success. However, this shift may give them what they want via an unexpected route.
Once such a web emerges, and most importantly is understood by the commercial world, innovations that will influence the way we interact will naturally follow. A Google TV, with access to such rich resource, should have no problem delivering an enhanced viewing experience by following structured links embedded in a programme page to information about the cast, the book of the film, the statistics that underpin the topic, or other programmes from the same production company. Our iPhone version next-but-one, could be a personal node in a global data network, providing access to relevant information about our location, activities, social network, and tasks.
These slightly futuristic predictions will only become possible on top of a structured network of data, which I believe is what could very well immerge if you follow through on the signs that Manu is pointing out. Reinforced by, and combining with, the other developments I reference earlier in this post, I believe we may well have a seventh wave approaching. Perhaps I should look at the beach again in five years time to see if I was right.
Wave photo from Nathan Gibbs in Flickr
Declarations – I am a Kasabi Partner and shareholder in Kasabi parent company Talis.
When moving to a new place we bring loads of baggage and stuff from our old house that we feel will be necessary in our new abode. One poke of the head in to your attic will clearly demonstrate how much of the necessary stuff was not so necessary after all. Yet you will also find things lurking around the house that may have survived several moves. The moral here is that, although some things are core to what we are and do, it is difficult to predict what we will need in a new situation.
This is also how it is when we move to doing things in a different way – describing our assets in RDF for instance.
I have watched many flounder when they first try to get their head around describing the things they already know in this new Linked Data format, RDF. Just like moving house, we initially grasp for the familiar and that might not always be helpful.
So. You understand your data, you are comfortable with XML, you recognise some familiar vocabularies that someone has published in RDF-XML, this recreation in RDF sounds simple-ish, you take a look at some other data in RDF-XML to see what it should look like, and…… Your brain freezes as you try to work out how you nest one vocabulary within another and how your schema should look.
This is is where stepping back from the XML is a good idea. XML is only one encoding/transmission format for RDF, it is a way of encoding RDF so that it can be transmitted from one machine process to another. XML is ugly. XML is hierarchical and therefore introduces lots of compromises when imparting the graph nature of an RDF model. XML brings with it a [hierarchical] way of thinking about data that constrains your decisions when creating a mode for your resources.
I suggest you not only step back from XML, you initially step back from the computer as well.
Get out your white/blackboard or flip-chat & pen and start drawing some ellipses, rectangles, and arrows. You know your domain, go ahead draw a picture of it. Draw an ellipse for each significant thing in your domain – each type of object, concept, or event, etc. Draw a rectangle for each literal (string, date, number). Beware of strings that are really ‘things’ – things that have other attributes including the string as a name attribute. Draw arrows to show the relationships between your things and things and their attributes. Label the arrows to define those relationships – don’t initially worry about vocabularies yet, use simple labels such as name, manufacturer, creator, publication event, discount, owner, etc. Create identifiers for your things in the form of URIs, so that they will be unique and accessible when you eventually publish them. What you end up with should look a bit like this slide from my recent presentation at the Semantic Tech and Business Conference in Berlin – the full deck is on SlideShare. Note how how I have also represented resources the model in a machine readable form of RDF – yes you can now turn to the computer again.
Once back at the computer you can now start referring to generally used vocabularies such as foaf, dcterms, etc. to identify generally recognised labels for relationships – an obvious candidate being foaf:name in this example. Move on to more domain specific vocabularies as you go. When you find a relationship not catered for elsewhere you may have to consider publishing your own to supplement the rest of the world.
Once you have your model, it is often a simple bit of scripting to take data from your current form (CSV, database, XML record) and produce some simple files of triples, in n-triples format. Then use a useful tool like Raptor to transform it in to good old ugly XML for transfer. Better still, take your n-triples files and load them in to a storage/publishing platform like Kasabi. This was the route the British Library took when the published the British National Bibliography as RDF.
Data is going to be come more core to our world than we could ever have imagined a few short years ago. Although we have be producing it for decades, data has either been treated as something in the core of a project not to expose to prying eyes, or often as a toxic waste product of business processes. Some of the traditional professions that have emerged, to look after and work with these data, reflect this relationship between us and and our digital assets. In the data warehouse, they archive, preserve, catalogue, and attempt to make sense of vast arrays of data. The data miners, precariously dig through mountains of data as it shifts and settles around them, propping up their expensive burrows with assumptions and inferred relationships, hoping a change in the strata does not cause a logical cave-in and they have to start again.
As I have postulated previously, I believe we are on the edge of a new revolution where data becomes a new raw material that drives the emergence of new industries, analogous to the emergence of manufacturing as a consequence of the industrial revolution. As this new era rolls out, the collection of data wrangling enthusiasts that have done a great job in getting us thus far will not be sufficient to sustain a new industry of extracting, transforming, linking, augmenting, analysing and publishing data.
So this initiative from the OKF & P2PU is very welcome:
The explosive growth in data, especially open data, in recent years has meant that the demand for data skills — for data “wranglers” or “scientists” — has been growing rapidly. Moreover, these skills aren’t just important for banks, supermarkets or the next silicon valley start-up, they are also going to be cruicial in research, in journalism, and in civil society organizations (CSOs).
However, there is currently a significant shortfall of data “wranglers” to satisfy this growing demand, especially in civil society organisations — McKinsey expects a skills shortage in data expertise to reach 50-60% by 2018 in the US alone.
It is welcome, not just because they are doing it but also, because of who they are and the direction they are taking:
The School of Data will adopt the successful peer-to-peer learning model established by P2PU and Mozilla in their ‘School of Webcraft’ partnership. Learners will progress by taking part in ‘learning challenges’ – series of structured, achievable tasks, designed to promote collaborative and project-based learning.
As learners gain skills, their achievements will be rewarded through assessments which lead to badges. Community support and on-demand mentoring will also be available for those who need it.
They are practically approaching real world issues and tasks from the direction of the benefit to society of opening up data. Taking this route will engage with those that have the desire, need and enthusiasm to become either part or full time data wranglers. Hopefully these will establish an ethos that will percolate into commercial organisations, taking an open world view with it. I am not suggesting that commerce should be persuaded to freely and ,openly share all their data but they should learn the techniques of the open data community as the best way to share data under whatever commercial and licensing conditions are appropriate.