In amongst the useful pointers to news, comment, and documents, I have been recently conscious of an increasing flow of tweets like these:
This is good news. More and more city, local and national governments and public bodies are releasing data as open data. Of course, the reference to open here relates to the licensing of these data, but how open in access are they? It is not that easy to find out.
To be truly open and broadly useful, data has to be both licensed openly, with few or no use constraints, and have as few technical barriers to consuming it as possible. In many cases there will be enough enthusiasts for a particular source with the motivation to take data in whatever form and pick their way through it to get the value they need. These enthusiasts provide great blogging fodder and examples for presentations, but they do not represent the significant value that should, and is predicted to, flow from the open data and transparency agenda spreading through governments across the globe.
The five star data rating scheme, from Sir Tim Berners-Lee, is a simple way to describe the problem and to encourage publishers to strive for a 5 star Linked Open Data rating, without discouraging them from openly publishing in any form in the first place. Check out my earlier post What Is Your Data’s Star Rating(s)?, where I dig into both types of openness a bit further.
Policy makers and data openness enthusiasts who are behind this burgeoning flood of announcements [as a broad generality] get the licensing issues – use CC0 or copy the UK’s OGL. What concerns me, however, is that they tend to shy away from promoting the removal of technical barriers that could stifle the broad adoption, and consequential flow of economic benefit, that they predict.
We could look back on this time in a few years as a missed opportunity and say that it was obvious the initiatives would fail, because we didn’t make it easy for those that could have delivered the value. We let the flood of enthusiastic initiatives wash past us without grabbing the opportunities to establish easy, consistent and repeatable ways to release and build upon the value in data for all, not just an enthusiastic few. We need to get this right if open data is going to fuel the next revolution.
Some are thinking in the same way. CKAN, for instance, have delivered an extension to calculate the [technical] openness of datasets, as listed on the Dataset Openness Page of the Data Hub. Great idea, but I would suggest that most data publishers will never find their way to such a listing. Where are the stars on the individual data set pages? Where are the star rating badges of approval that publishers can put on their sites to show off?
We have made great strides so far in promoting the opening of public and other sector information; the ePSIplatform stream is testament to that. Somehow we need to capitalise on this great start and better market the benefits of technically opening up your data. 5 Star badge of approval anyone?
I have been reading with interest the recently published discussion paper from Harvard University’s Joan Shorenstein Center on the Press, Politics and Public Policy by former U.S. Chief Information Officer, Vivek Kundra, entitled Digital Fuel of the 21st Century: Innovation through Open Data and the Network Effect [pdf]. It is well worth a read to place the current [Digital] Revolution, which we are somewhere in the middle of, in relation to preceding revolutions and the ages that they begat: the agricultural age, lasting thousands of years; the industrial age, lasting hundreds of years; and the digital revolution/age, which has already had massive impacts on individuals, society, governments and commerce in just a few short decades.
Paraphrasing his introduction: the microprocessor, “the new steam engine” powering the Information Economy, is being fuelled by open data. Stepping on to dangerous mixed-metaphor territory here, I see him implying that the network effects, both technological and social, are turning that basic open data fuel into a high-octane brew driving massive change to our world.
Vivek goes on to catalogue some of the effects of this digital revolution. With his background in county, state, and federal US government, it is unsurprising that his examples are around the effects of opening up public data, but that does not make them less valid. He talks about four shifts in power that are emerging and/or need to occur:
Fighting government corruption, improving accountability and enhancing government services – open [democratised] data driving the public’s ability to hold the public sector to account, exposing hidden, or unknown, facts and trends.
Changing the default setting of government to open, transparent and participatory – changing the attitude of those within government to openly publish their data by default so that it can be used to inform their populations, challenge their actions and services, and stimulate innovation.
Creating new models of journalism to separate signal from noise and provide meaningful insights – innovative analysis of publicly available data can surface issues and stories that would otherwise be buried in the noise of general government output.
Launching multi-billion dollar businesses based upon public sector data – businesses applying their specific expertise to the analysis, collation, and interpretation of open public data.
All good stuff, and a great overview for those looking at this digital revolution as impacted by public open data. As to what sort of age it will lead to, I think we need to look at a couple of steps further on in the revolution.
The agricultural revolution was based upon the move away from a nomadic existence, the planting and harvesting of crops and the creation of settlements. The age that followed, I would argue, was based upon the outputs of those efforts enabling the creation of business and the trading of surpluses. A new layer of commerce emerged, built upon the basic outputs of the revolutionary activities.
The industrial revolution introduced powered machines, replacing manual labour and massively increasing efficiency and productivity. The age that followed was characterised by manufacturing – a new layer of added value, taking the basic raw materials produced or mined by these machines and combining them into new complex products.
Which brings me to what I would prefer to call the data revolution, where today we are seeing data as a fuel consumed to drive our information steam engines. I would argue that soon we will recognise that data is not just a fuel but also a raw material. Data from many sources (public, private and personal) in many forms (open, commercially licensed and closed) will be combined with entrepreneurial innovation and refined to produce new complex products and services. In the same way that whole new industries emerged in the industrial era, I believe we will look back at today and see the foundations of new and future industries. I published some thoughts on this in a previous post a year or so ago, which I believe are still relevant.
Today, unless you want to expend significant effort on understanding individual data sources, it is difficult to deliver an information service or application that depends on more than a couple of them. This is because we are still trying to establish the de facto standards for presenting, communicating and consuming data. We have mostly succeeded for web pages, with HTML and the gradual demise of pragmatic moment-in-time diversionary solutions such as Flash. However, on the data front, we are still where the automobile industry was before it agreed where, and in what order, to place the foot pedals in a car.
The answer, I believe, will emerge to be the adoption of data packaging and linking techniques and standards – Linked Data. I say this not just because I am an evangelist for the benefits of Linked Data, but because it exhibits the same distributed, open and generic features that exemplify what has been successful for the Web. It also builds upon those Web standards. Much is talked, and hyped, about Big Data – another moment-in-time term. Once we start linking, consuming, and building, it will be on a foundation of data that could only be described as big. What we label Big today will soon appear to be normal.
What of the Semantic Web, I am asked? I believe the Semantic Web is a slightly out-of-focus vision of how the Information Age may look when it is established, expressed only in the terms of what we understand today. So this is what I am predicting will arrive, but I am also predicting that we will eventually call it something else.
The Linked Data movement was kicked off in mid 2006 when Tim Berners-Lee published his now famous Linked Data Design Issues document. Many had been promoting the approach of using W3C Semantic Web standards to achieve the effect and benefits, but it was his document and the use of the term Linked Data that crystallised it, gave it focus, and a label.
In 2010 Tim updated his document to include the Linked Open Data 5 Star Scheme to “encourage people — especially government data owners — along the road to good linked data”. The key message was to Open Data. You may have the best RDF encoded and modelled data on the planet, but if it is not associated with an open license, you don’t get even a single star. That emphasis on government data owners is unsurprising as he was at the time, and still is, working with the UK and other governments as they come to terms with the transparency thing.
Once you have cleared the hurdle of being openly licensed (more of this later), your data climbs the steps of Linked Open Data stardom based on how available and therefore useful it is. So:
★ Available on the web (whatever format) but with an open licence, to be Open Data
★★ Available as machine-readable structured data (e.g. Excel instead of image scan of a table)
★★★ as (2) plus non-proprietary format (e.g. CSV instead of Excel)
★★★★ All the above, plus: Use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
★★★★★ All the above, plus: Link your data to other people’s data to provide context
By usefulness I mean how low the barrier is to people using your data for their purposes. The usefulness of 1 star data does not spread much beyond looking at it on a web page. 3 star data can at least be downloaded and programmatically worked with, using non-proprietary tools, to deliver analysis or to feed specific applications. 5 star data, by contrast, is consumable in a standard form, RDF, and contains links to other (4 or 5 star) data out on the web in the same standard consumable form. It is at the 5 star level that the real benefits of Linked Open Data kick in, which is why the scheme encourages publishers to strive for the highest rating.
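The cumulative nature of the scheme can be sketched in a few lines of Python. This is only an illustration of the rules as described above, not an official rating tool: the `Dataset` class, its field names and the `star_rating` function are all hypothetical, and a real assessment would need to inspect the actual licence and formats.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset (illustrative field names)."""
    on_web: bool = False                  # available on the web, in whatever format
    open_licence: bool = False            # released under an open licence
    machine_readable: bool = False        # structured data, not an image scan of a table
    non_proprietary_format: bool = False  # e.g. CSV instead of Excel
    uses_w3c_standards: bool = False      # RDF and SPARQL used to identify things
    links_to_other_data: bool = False     # links out to other people's data


def star_rating(ds: Dataset) -> int:
    """Return 0 to 5 stars; each star requires every star below it."""
    levels = [
        ds.on_web and ds.open_licence,  # 1 star: on the web AND openly licensed
        ds.machine_readable,            # 2 stars
        ds.non_proprietary_format,      # 3 stars
        ds.uses_w3c_standards,          # 4 stars
        ds.links_to_other_data,         # 5 stars
    ]
    stars = 0
    for met in levels:
        if not met:
            break  # a missed level caps the rating, whatever comes after it
        stars += 1
    return stars


# Openly licensed CSV on the web: three stars.
csv_release = Dataset(on_web=True, open_licence=True,
                      machine_readable=True, non_proprietary_format=True)
print(star_rating(csv_release))
```

Because the first level bundles the open licence with web availability, a dataset with every technical box ticked but no open licence still scores zero in this sketch, which mirrors the point that the best RDF on the planet earns no stars without one.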
Tim’s scheme is not the only open data star rating scheme in town. There is another one that emerged from the LOD-LAM Summit in San Francisco last summer – fortunately it is complementary and does not compete with his. The draft 4 star classification-scheme for linked open cultural metadata approaches the usefulness issue from a licensing point of view. If you cannot use someone’s data because of onerous licensing conditions, it is obviously not useful to you.
permission to use the metadata is contingent on providing attribution in a way specified by the provider
metadata can only be combined with data that allows re-distributions under the terms of this license
So when you are addressing opening up your data, you should be asking yourself how useful it will be to those that want to consume and use it. Obviously you would expect me to encourage you to publish your data as ★★★★★ – ★★★★, to make it as technically useful, with as few licensing constraints, as possible. Many just focus on Tim’s stars; however, if you put yourself in the place of an application developer, a one LOD-LAM star dataset is almost unusable whilst still complying with the licence.
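As a small illustration of quoting both ratings together, here is a sketch that renders the pair in the same notation used above, technical stars first and licensing stars second. The function names are hypothetical, and the two ratings are simply passed in as numbers: the sketch does not attempt to work out either rating from a real dataset or licence.

```python
def stars(count: int, maximum: int) -> str:
    """Render a rating as a run of star characters, validating the range."""
    if not 0 <= count <= maximum:
        raise ValueError(f"rating must be between 0 and {maximum}, got {count}")
    return "★" * count if count else "no stars"


def combined_badge(technical: int, licensing: int) -> str:
    """Technical (5 star scheme) rating first, LOD-LAM (4 star scheme) rating second."""
    return f"{stars(technical, 5)} – {stars(licensing, 4)}"


print(combined_badge(5, 4))
print(combined_badge(5, 1))
```

The second call shows the case to watch for: technically excellent data that carries only a single licensing star.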
So think before you open – put yourself in the consumers’ shoes – publish your data with the stars.
One final thought: when you do publish your data, tell your potential viewers, consumers, and users in very simple terms what you are publishing and under what terms, as the UK Government does through data.gov.uk using the Open Government Licence, which I believe is a ★★★.