Will This Flood of Open Data Wash Past Us?

@ePSIplatform features fairly prominently in the stream of tweets that waft across my desktop every day. It comes from the European Public Sector Information (PSI) Platform – Europe's one-stop shop on PSI re-use – working to stimulate and promote PSI re-use and open data initiatives.

In amongst the useful pointers to news, comment, and documents, I have recently been conscious of an increasing flow of tweets like these:

[Screenshots of four ePSIplatform tweets announcing new open data releases]

This is good news: more and more city, local, and national governments and public bodies are releasing data as open data.  Of course the reference to open here is in relation to the licensing of these data, but how open in access are they?  It is not that easy to find out.

To be truly open and broadly useful, data has to be both licensed openly, with few or no use constraints, and have as few technical barriers to consuming it as possible.  In many cases there will be enough enthusiasts for a particular source with the motivation to take data in whatever form and pick their way through it to get the value they need.  These enthusiasts provide great blogging fodder and examples for presentations, but they do not represent the significant value that should, and is predicted to, flow from the open data and transparency agenda spreading through governments across the globe.

The five star data rating scheme, from Sir Tim Berners-Lee, is a simple way to describe the problem and encourage publishers to strive for a 5 star Linked Open Data rating, without discouraging them from openly publishing in any form in the first place.  Check out my earlier post What Is Your Data’s Star Rating(s)?, where I dig into both types of openness a bit further.

Policy makers and data openness enthusiasts who are behind this burgeoning flood of announcements [as a broad generality] get the licensing issues – use CC0 or copy the UK’s OGL.  However, what concerns me is that they tend to shy away from promoting the removal of the technical barriers that could stifle the broad adoption, and consequential flow of economic benefit, that they predict.

In a few years we could look back at this time of missed opportunity and say it was obvious that the initiatives would fail because we didn’t make it easy for those who could have delivered the value.  We let the flood of enthusiastic initiatives wash past us without grabbing the opportunity to establish easy, consistent and repeatable ways to release and build upon the value in data for all, not just an enthusiastic few. We need to get this right if open data is going to fuel the next revolution.

Some are thinking in the same way.  CKAN, for instance, have delivered an extension to calculate the [technical] openness of datasets as listed on the Dataset Openness Page of the Data Hub.  Great idea, but I would suggest that most data publishers will never find their way to such a listing.  Where are the stars on the individual dataset pages?  Where are the star rating badges of approval that publishers can put on their sites to show off?

We have made great strides so far in promoting the opening of public and other sector information; the ePSIplatform stream is testament to that.  Somehow we need to capitalise on this great start and better market the benefits of technically opening up your data.  5 Star badge of approval, anyone?

Stream photo from jjjj56cp on Flickr

Open Data: Digital Fuel or Raw Material?






I have been reading with interest the recently published discussion paper from Harvard University’s Joan Shorenstein Center on the Press, Politics and Public Policy by former U.S. Chief Information Officer, Vivek Kundra, entitled Digital Fuel of the 21st Century: Innovation through Open Data and the Network Effect [pdf].  It is well worth a read to place the current [Digital] Revolution, which we are somewhere in the middle of, in relation to preceding revolutions and the ages that they begat – the agricultural age, lasting thousands of years; the industrial age, lasting hundreds of years; and the digital revolution/age, which has already had massive impacts on individuals, society, governments and commerce in just a few short decades.

Paraphrasing his introduction: the microprocessor, “the new steam engine” powering the Information Economy, is being fuelled by open data.  Stepping onto dangerous mixed-metaphor territory here, but I see him implying that the network effects, both technological and social, are turning that basic open data fuel into a high-octane brew driving massive change to our world.

Vivek goes on to catalogue some of the effects of this digital revolution.  With his background in county, state, and federal US government it is unsurprising that his examples are around the effects of opening up public data, but that does not make them less valid.  He talks about four shifts in power that are emerging and/or need to occur:

  • Fighting government corruption, improving accountability and enhancing government services – open [democratised] data driving the public’s ability to hold the public sector to account, exposing hidden, or unknown, facts and trends.
  • Changing the default setting of government to open, transparent and participatory – changing the attitude of those within government to openly publish their data by default so that it can be used to inform their populations, challenge their actions and services, and stimulate innovation.
  • Creating new models of journalism to separate signal from noise and provide meaningful insights – innovative analysis of publicly available data can surface issues and stories that would otherwise be buried in the noise of general government output.
  • Launching multi-billion dollar businesses based upon public sector data – by applying their specific expertise to the analysis, collation, and interpretation of open public data.

All good stuff, and a great overview for those looking at this digital revolution as impacted by public open data.  As to what sort of age it will lead to, I think we need to look a couple of steps further on in the revolution.

The agricultural revolution was based upon the move away from a nomadic existence, the planting and harvesting of crops and the creation of settlements.  The age that follows, I would argue, was based upon the outputs of those efforts enabling the creation of business and the trading of surpluses.  A new layer of commerce emerged, built upon the basic outputs of the revolutionary activities.

The industrial revolution introduced powered machines, replacing manual labour and massively increasing efficiency and productivity.  The age that followed was characterised by manufacturing – a new layer of added value, taking the basic raw materials produced or mined by these machines and combining them into new complex products.

Which brings me to what I would prefer to call the data revolution, where today we are seeing data as a fuel consumed to drive our information steam engines.  I would argue that soon we will recognise that data is not just a fuel but also a raw material.  Data from many sources (public, private and personal) in many forms (open, commercially licensed and closed) will be combined with entrepreneurial innovation and refined to produce new complex products and services. In the same way that whole new industries emerged in the industrial era, I believe we will look back at today and see the foundations of new and future industries.  I published some thoughts on this in a previous post a year or so ago which I believe are still relevant.

Today, unless you want to expend significant effort on understanding individual datasets, it is difficult to deliver an information service or application that depends on more than a couple of data sources.  This is because we are still trying to establish the de facto standards for presenting, communicating and consuming data.  We have mostly succeeded for web pages, with HTML and the gradual demise of pragmatic moment-in-time diversionary solutions such as Flash.  However on the data front, we are still where the automobile industry was before agreeing what order and where to place the foot pedals in a car.

The answer, I believe, will emerge to be the adoption of data packaging and linking techniques and standards – Linked Data.  I say this not just because I am an evangelist for the benefits of Linked Data, but because it exhibits the same distributed, open and generic features that exemplify what has been successful for the Web.  It also builds upon those Web standards.  Much is talked, and hyped, about Big Data – another moment-in-time term.  Once we start linking, consuming, and building, it will be on a foundation of data that could only be described as big.  What we label Big today will soon appear to be normal.

What of the Semantic Web, I am asked.  I believe the Semantic Web is a slightly out-of-focus vision of how the Information Age may look when it is established, expressed only in the terms of what we understand today.  So this is what I am predicting will arrive, but I am also predicting that we will eventually call it something else.

Picture of Vivek Kundra from Wikipedia.

OK, So Who Noticed the SOPA Blackout?






Well, I did for a start!  I chose this auspicious day to move the Data Liberate web site from one hosting provider to another.  The reasons why are a whole other messy story, but I did need some help on the WordPress side of things and [quite rightly in my opinion] they had ‘gone dark’ in support of the SOPA protests.  Frustration, but in a good cause.

Looking at the press coverage from my side of the Atlantic, such as from BBC News, it seems that some in Congress have also started to take notice.  The most fuss in general seemed to be around Wikipedia going dark, demonstrating what the world would be like without the free and easy access to information we have become used to.  All in all I believe the campaign has been surprisingly effective on the visible web.

However, what prompted this post was trying to ascertain how effective it was on the Data Web, which almost by definition is the invisible web.  Ahead of the dark day, a move started on the Semantic Web and Linked Open Data mailing lists to replicate what Wikipedia was doing by going dark on DBpedia – the Linked Data version of Wikipedia’s structured information.  The discussion was based around the fact that SOPA would not discriminate between human-readable web pages and machine-to-machine data transfer and linking, therefore we [concerned about the free web] should be concerned.  Of that there was little argument.

The main issue was that systems consuming data that suddenly goes away would just fail.  This was countered by the assertion that, regardless of the machines in the data pipeline, there will always be a human at the end.  Responsible systems providers should be aware of the issue and report the error/reason to their consuming humans.

Some suggested that instead of delivering the expected data, systems [operated by those that are] protesting should provide data explaining the issue.  How many application developers have taken this circumstance into account in their designs, I wonder.  If you, as a human accessing a SPARQL endpoint, are presented with a ‘dark’ page, you can understand and come back to query tomorrow.  If you are a system getting different types of, or no, data back, you will see an error.
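To make that concrete, here is a minimal sketch in Python (the endpoint URL and messages are illustrative, not taken from any real protest) of how a consuming system could surface a blackout to its human users rather than failing silently:

import urllib.error
import urllib.request

# Illustrative endpoint: a trivial SPARQL ASK query against DBpedia
ENDPOINT = "https://dbpedia.org/sparql?query=ASK%20%7B%7D"

try:
    with urllib.request.urlopen(ENDPOINT, timeout=10) as resp:
        data = resp.read()  # normal machine-to-machine consumption
except urllib.error.HTTPError as err:
    # During a blackout a protesting endpoint might return e.g. a 503
    # with an explanatory body - pass the reason on to the human.
    print(f"Data source unavailable (HTTP {err.code}): {err.reason}")
except urllib.error.URLError as err:
    print(f"Data source unreachable: {err.reason}")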

The question I have is: who, using systems that consume Linked Data [that went dark], noticed that there was either a problem or, preferably, an effect of the protest?

I suspect the answer is very few, but I would like to hear the experiences of others on this.

What Is Your Data’s Star Rating(s)?






The Linked Data movement was kicked off in mid 2006 when Tim Berners-Lee published his now famous Linked Data Design Issues document.  Many had been promoting the approach of using W3C Semantic Web standards to achieve the effect and benefits, but it was his document and the use of the term Linked Data that crystallised it, gave it focus, and a label.

In 2010 Tim updated his document to include the Linked Open Data 5 Star Scheme to “encourage people — especially government data owners — along the road to good linked data”. The key message was to Open Data.  You may have the best RDF encoded and modelled data on the planet, but if it is not associated with an open licence, you don’t get even a single star.  That emphasis on government data owners is unsurprising as he was at the time, and still is, working with the UK and other governments as they come to terms with the transparency thing.

Once you have cleared the hurdle of being openly licensed (more of this later), your data climbs the steps of Linked Open Data stardom based on how available and therefore useful it is. So:

★ Available on the web (whatever format) but with an open licence, to be Open Data
★★ Available as machine-readable structured data (e.g. Excel instead of an image scan of a table)
★★★ As (2) plus non-proprietary format (e.g. CSV instead of Excel)
★★★★ All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
★★★★★ All the above, plus: link your data to other people’s data to provide context

By usefulness I mean how low the barrier is to people using your data for their purposes.  The usefulness of 1 star data does not spread much beyond looking at it on a web page.  3 star data can at least be downloaded and programmatically worked with, to deliver analysis or for specific applications, using non-proprietary tools.  Whereas 5 star data is consumable in a standard form, RDF, and contains links to other (4 or 5 star) data out on the web in the same standard consumable form.  It is at the 5 star level that the real benefits of Linked Open Data kick in, and why the scheme encourages publishers to strive for the highest rating.

Tim’s scheme is not the only open data star rating scheme in town.  There is another that emerged from the LOD-LAM Summit in San Francisco last summer – fortunately it is complementary and does not compete with his.  The draft 4 star classification-scheme for linked open cultural metadata approaches the usefulness issue from a licensing point of view.  If you cannot use someone’s data because of onerous licensing conditions, it is obviously not useful to you.

★★★★ Public Domain (CC0 / ODC PDDL / Public Domain Mark)

  • metadata can be used by anyone for any purpose
  • permission to use the metadata is not contingent on anything
  • metadata can be combined with any other metadata set (including closed metadata sets)
★★★ Attribution License (CC-BY / ODC-BY) when the licensor considers linkbacks to meet the attribution requirement

  • metadata can be used by anyone for any purpose
  • permission to use the metadata is contingent on providing attribution by linkback to the data source
  • metadata can be combined with any other metadata set, including closed metadata sets, as long as the attribution link is retained
★★ Attribution License (CC-BY / ODC-BY) with another form of attribution

  • metadata can be used by anyone for any purpose
  • permission to use the metadata is contingent on providing attribution in a way specified by the provider
  • metadata can be combined with any other metadata set (including closed metadata sets)
★ Attribution Share-Alike License (CC-BY-SA / ODC-ODbL)

  • metadata can be used by anyone for any purpose
  • permission to use the metadata is contingent on providing attribution in a way specified by the provider
  • metadata can only be combined with data that allows re-distributions under the terms of this license

So when you are addressing opening up your data, you should be asking yourself how useful it will be to those that want to consume and use it.  Obviously you would expect me to encourage you to publish your data as ★★★★★ on Tim’s scheme and ★★★★ on the LOD-LAM scheme, to make it as technically useful with as few licensing constraints as possible.  Many just focus on Tim’s stars; however, if you put yourself in the place of an app or application developer, a one LOD-LAM star dataset is almost unusable whilst still complying with the licence.

So think before you open – put yourself in the consumers’ shoes – publish your data with the stars.

One final thought: when you do publish your data, tell your potential viewers, consumers, and users in very simple terms what you are publishing and under what terms, as the UK Government does through data.gov.uk using the Open Government Licence, which I believe rates ★★★ on the LOD-LAM scheme.

Will Government Open Licence Extensions be a haven for the timid?

The National Archives announced today UK government licensing policy extended to make more public sector information available:

Building on the success of the Open Government Licence, The National Archives has extended the scope of its licensing policy, encouraging and enabling even easier re-use of a wider range of public sector information.

The UK Government Licensing Framework (UKGLF), the policy and legal framework for the re-use of public sector information, now offers a growing portfolio of licences and guidance to meet the diverse needs and requirements of both public sector information providers and re-user communities.

On the surface this move is to be welcomed, providing, amongst other things, licensing choices and guidance for re-using information free of charge for non-commercial purposes – the Non-Commercial Government Licence – plus guidance on licensing where charges apply and for the licensing of software and source code.

All this is available from the UK Government Licensing Framework area of the National Archives site, along with FAQs and other useful supporting information, including machine readable licenses.

As the press release says, the extensions build on the success of the Open Government Licence (OGL) and are designed to cover what the OGL cannot.

So the [data publisher’s] thought process should be to try to publish under the OGL and then, only if ownership/licensing/cost of production provide an overwhelming case to be more restrictive, utilise these extensions and/or guidance.

My concern, having listened to many questions at conferences from what I would characterise as government conservative traditionalists, is that many will start at the charge-for/non-commercial use end of this licensing spectrum because of the fear/danger of opening up data too openly.  I do hope my concerns are unfounded and that the use of these extensions will be the exception, with the OGL being the de facto licence of choice for all public sector data.

This post was also published on the Talis Consulting Blog

UK Government Commits to More Open Data

A couple of weeks back UK Prime Minister David Cameron announced the broadening of publicly available government data with the publishing of key data on the National Health Service, schools, criminal courts and transport.

The background to the announcement was a celebration of the preceding year of activity in the areas of transparency and open data, with many core government datasets being published – too many to list here, but the 7,200+ listed on data.gov.uk give you an insight.  The political angle to this is undeniable, as Mr Cameron makes clear in his YouTube speech for the announcement: “Information is power because it allows people to hold the powerful to account”.

His “I believe it will also drive economic growth as companies can use this new data to build web sites or apps that allow people to access this information in creative ways” statement also gives an indication of the drivers for the way forward.

To be successful in either of these ambitions, the people and the companies have to have access to information in an easy and reliable way that gives them confidence to build their opinions and their business models upon.  What do we measure that ease and reliability against – is it against the world of audited business practice, where the legal eagles and armchair auditors strive towards perfection, or is it the web world in which a lack of perfection is accepted and good enough is good enough?  I believe that with government data on the web we should still accept that it will not be perfect, but the good-enough hurdle should be set higher than we would expect from the likes of Wikipedia and some other oft-used data sources.

There are two mentions in the words that accompany the announcement that appear to recognise this. Firstly, in the announcement itself on the Number 10 website, this: “All of the new datasets will be published in an open standardised format so they can be freely re-used under the Open Government Licence by third parties.” What ‘open standardised format’ actually means is something we need to delve into, but previous data.gov.uk work towards Linked Data and shared reliable identifiers for things [such as postcodes, schools, and stations] bodes well.   Secondly, in Mr Cameron’s letter to his Cabinet we get a section on improving data quality, including things like plain English descriptions of scope and purpose, introducing unique identifiers to help tracking of interactions with companies, and an action plan for improving quality and comparability of data.

So where are we now?  Some of the new data is not perfect, as this thread on the UK Government Data Developers Google Group shows.  William Waites identifies that the [government] reporting of transactions with the Open Knowledge Foundation does not match the transactions in the OKF’s own books, therefore calling into question how reliable those [government] figures are.  In my opinion, this is an example where we should applaud the release of such new data but, with conversations such as the one William started, help those who are publishing the data to improve the quality, reliability and comparability of their output.  Of course by definition that means the publishers must be prepared and ready to listen – and be listening.

What we shouldn’t do is throw our hands in the air in despair because the first publishing of data by some departments is not up to what we would expect, or decry the move towards shared [URI based] identifiers because they look confusing in a csv file.  Data publishers will get better at it with helpful criticism.  I am also convinced that sharing well known, reliable identifiers for things across disparate government and non-government data will in the medium term have a far greater benefit than most [including enthusiasts for Linked Data like me] can currently envisage.

 This post was also published on the Talis Consulting Blog

Are We Getting A Right to Data?

Friday night – nothing on the TV – I know! I’ll browse through the Protection of Freedoms Bill, currently passing through the UK Parliament. Sad, I know, but interesting.

Let’s scroll back in time a bit to November 19th 2010 and a government press conference introduced by a video from Prime Minister David Cameron.  The headline story was about the publishing of government spending and contract data, but towards the end of this 109 second short he said the following:

… the most exciting is a new right to data. Which will let people request streams of government information and use it for social or commercial purposes.  Take all this together and we really can make this one of the most open, accountable and transparent governments there is.  Let me end by saying this. You are going to have so much information about what we do, how much of your money we spend doing it, and what the outcome is.  So use it, exploit it, hold us to account.  Together we can set a great example of what a modern democracy ought to look like. (my emphasis)

Obviously to realise this Right to Data there needs to be some legislation, which brings me to the Protection of Freedoms Bill.  This is one of those bills which covers all sorts of issues, from rules for destruction of fingerprints and DNA profiles, CCTV camera regulations, detention of terrorist suspects, to freedom of information and data protection.  Zooming in on the bits on the topic of the release and publication of datasets held by public authorities, we find a set of clauses that amend the Freedom of Information Act 2000.

Re-use

After some amendments which allow for datasets and provision in electronic form we get this: “the public authority must, so far as reasonably practicable, provide the information to the applicant in an electronic form which is capable of re-use.”  Unfortunately there is no definition of the term re-use.  It could be argued that a pdf of some tables in a MS Word document could be re-used, whereas I believe the spirit of the legislation should be made more explicit by identifying non-proprietary data formats.  I know this would be a tricky job for the parliamentary draftsmen, as we would not want to restrict it to things, such as XML and csv, that could age and be replaced by something better which then could not be used as it had not been mentioned in the legislation, but I believe that just using the term ‘re-use’ is far too woolly and open to [mis]interpretation.

What is [not] a dataset

This is one of the areas that raises most concern for me. Check out this wording from the Bill: [extract from the Bill’s definition of ‘dataset’].  I am OK with (a) – data collected as part of an authority doing its job – and (c) – don’t change the data you have collected – publishing that raw data is important.  However (b) specifically excludes data that is the product of analysis.  Presumably analysis of collected data is one significant way that an authority measures the outcomes of its efforts.  Understanding that analysis will help in understanding the subsequent decisions and actions they make and take.  I assume that there may be some specific reasons that underpin this blanket exclusion of analysis data.  If there are, they should be identified, instead of generally throttling the output of useful data that would go a long way to helping with Mr Cameron’s stated ambition for us to be able to see “what the outcome is” of the spending of public money.

Release of datasets for re-use

This is a whole new section (11A) to be added to the 2000 Act to cover the release of datasets. It covers ownership, copyright, and/or database right of the information to be published and states that it should be published under “the licence specified by the Secretary of State in a code of practice issued under section 45”. Section 45 basically puts the definition of the licence(s) data should be published under into the hands of the Secretary of State.  As of today the Open Government Licence for public sector information is what is wanted to keep the publishing of information open.  However, what is there to stop a future Secretary of State, who has a less open outlook, from replacing it with far more restrictive licences?  Do we not need some form of presumption of openness attached to the Secretary of State’s powers as part of this change in legislation?

On the topic of presumptions of openness, the wording of this Bill contains phrases such as “unless the authority is satisfied that it is not appropriate for the dataset to be published” and “where reasonably practicable”.  It is clear that many in the public sector are not as enthusiastic about publishing data as the current government, and vague phrases such as these may well be unreasonably used by some to justify throttling the stream of information.   They could easily be used to build in a bureaucratic decision hurdle for each dataset to have to jump, proving its appropriateness and practicality, before publication.  I am sure that it would not be beyond a parliamentary draftsman’s skill to produce wording that means that all will be published, unless a specific objection is raised for an individual dataset, for reasons of excessive effort or data protection.

Up-dated data

Data published by an authority should be published under a scheme; the following applies here: [extract from the Protection of Freedoms Bill (HC Bill 146)]. How should we interpret “any up-dated version held by the authority of such a dataset”? My interpretation is that once a dataset has been published it shall continue to be published as it changes.  The precedent for this is spending data – having published authority spending for January 2011, authorities should be automatically publishing it for February and following months.  But what if, in response to a request, an authority publishes the contents of a spreadsheet used to track the amount of salt applied to roads in its area during winter 2010-11 and then uses a different spreadsheet for the following winter?  Does the output of that new spreadsheet constitute a new dataset, or an up-date to its predecessor?  From the wording in the Bill it is not clear.

Who does it cover?

I probably need a bit of help here from those that understand the public sector better than I do, but I am suspicious that references to the organisations listed in Schedule 1 and “the wider public sector” do not cast the net wide enough to cover some of the data that is relevant to our daily lives but is delivered on behalf of some authorities by third parties.  For example, I am aware that recently a large city was not able to inform citizens of their rubbish collection schedules because that data was considered commercially restricted by their service provider.

 

So in summary, I welcome the commitment to a right to data being realised by streams of government information about what they do, how much of our money is spent doing it, and what the outcomes are.  However, I am sceptical as to how effective the measures in the current Protection of Freedoms Bill will be in delivering them, especially in the light of very recent comments made by the Prime Minister highlighting the “enemies of enterprise” in Whitehall and town halls across the country, attacking what he called the “mad” bureaucracy that holds back entrepreneurs.  Those enemies are just the people who might take the wording of this Bill as ammunition in their cause.

Whilst being concerned about this topic, I have been wondering why few are commenting on it.  Are the majority just taking the press conference statements by David Cameron, and his fellow Ministers, as indications of a battle won, or am I missing something?  I promote Sir Tim Berners-Lee’s 5 Star Data as the steps towards a Web of Linked Data – if we don’t get the publishing of public sector data to at least the 3 star standard (available as machine-readable structured data, in a non-proprietary format), many of the current ambitions may remain just that, ambitions.  That would be a massive missed opportunity.

So are we getting a right to data? – or just some provisions to extend the Freedom of Information Act a bit further in the dataset direction?  I’m not sure.

Personal note: As you may tell from the above, I am no expert on the interpretation of parliamentary legislation, and I have left several unanswered questions hanging in this post.  Any help in clarifying my thinking, confirming or disproving my assumptions, or answering some of those questions, will be gratefully received in comments to this post or your own posted thoughts.

This post was also published on the Nodalities Blog

Linked Spending Data – How and Why Bother Pt3

As is often the way, events have conspired to prevent me from producing this third and final part in this How & Why of Local Government Spending Data as soon as I wanted.  So my apologies to those eagerly awaiting this latest instalment.

To quickly recap: in Part 1 I addressed issues around why pick on spending data as a start point for Linked Data in Local Government, and indeed why go for Linked Data at all.  In Part 2, I used some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area as examples to demonstrate how you can publish spending data as Linked Data, for both human and programmatic consumption.

I am presuming that you are still with me on my basic assumptions – “…publishing this [local government spending] data is a good thing” and “Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing” – and that the technique of using URIs to name things in a globally unique way (which also provides a link to more information) is not giving you mental indigestion.  So, I now want to move on to some of the issues that are causing debate in the community, which come under the headings of ontologies and identifiers.

Ontologies

An ontology, according to Wikipedia, is a formal representation of knowledge as a set of concepts within a domain – an ontology provides a shared vocabulary which can be used to model a domain, that is, the types of objects and/or concepts that exist, and their properties and relations.  So in our quest to publish spending data, what ontology should we use?  The Payments Ontology, with the accompanying guide to its application, is what is needed.  Using it, it becomes possible to describe individual payments, or expenditure lines, and their relationships between the authority (payment:payer), the supplier (payment:payee), the category (payment:expenditureCategory), etc.  The next question is how you identify the things that you are relating together using this ontology.

Let’s take this one step at a time:

  1. Give the expenditure line, or individual payment, an identifier, possibly generated by our accounts system, e.g. 8605670.
  2. Make that identifier unique to our local authority by prefixing it with our internet domain name, e.g. http://spending.lichfielddc.gov.uk/spend/8605670 – note the ‘http://’ prefix.  This enables anyone wanting detail about this item to follow the link to our site to get the information.
  3. Associate a payer with the payment with an RDF statement (or triple) using the Payments Ontology:
    http://spending.lichfielddc.gov.uk/spend/8605670
    payment:payer
    http://statistics.data.gov.uk/id/local-authority/41UD .

    Note I am using an identifier for the payer that is published by statistics.data.gov.uk, so that everyone else will unambiguously understand which authority is responsible for the payment.

  4. Follow the same approach for associating the payee:
    http://spending.lichfielddc.gov.uk/spend/8605670
    payment:payee
    http://spending.lichfielddc.gov.uk/supplier/bristow-sutor .
  5. And then repeat the process for categorisation, payment value, etc.

This immediately throws up a couple of questions, such as: why use a locally defined identifier for the payee – surely there is an identifier I can use that others will recognise, such as a company or VAT number?  There are, but at the moment there are no established sets of URI identifiers for these.  OpenCorporates.com are doing some excellent work in this area, but Companies House, the logical choice for publishing such identifiers, have yet to do so.  Pragmatically it is probably a good idea to have a local identifier anyway and then associate it with another publicly recognised identifier:
http://spending.lichfielddc.gov.uk/supplier/bristow-sutor
owl:sameAs
http://opencorporates.com/companies/uk/01431688 .
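For the programmatically minded, here is a minimal sketch of steps 3 and 4 plus the owl:sameAs association above, using Python’s rdflib library.  Note that the namespace URI I use for the Payments Ontology is my assumption – check the ontology documentation published on data.gov.uk before relying on it.

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL

# Assumed base URI for the Payments Ontology - verify against data.gov.uk
PAYMENT = Namespace("http://reference.data.gov.uk/def/payment#")

g = Graph()
g.bind("payment", PAYMENT)
g.bind("owl", OWL)

spend = URIRef("http://spending.lichfielddc.gov.uk/spend/8605670")
supplier = URIRef("http://spending.lichfielddc.gov.uk/supplier/bristow-sutor")

# Steps 3 and 4: relate the payment to its payer and payee
g.add((spend, PAYMENT.payer,
       URIRef("http://statistics.data.gov.uk/id/local-authority/41UD")))
g.add((spend, PAYMENT.payee, supplier))

# Associate the local supplier identifier with a publicly recognised one
g.add((supplier, OWL.sameAs,
       URIRef("http://opencorporates.com/companies/uk/01431688")))

print(g.serialize(format="turtle"))

Running this prints the same triples described above, serialised as Turtle – a useful sanity check before publishing.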

Identifiers

Because this is all very new and still emerging, we now find ourselves in a bit of a chicken-or-egg situation.   I presume that most authorities have not built a mini spending website, as Lichfield District Council has, to serve up details when someone follows a link like this: http://spending.lichfielddc.gov.uk/spend/8605670

You could still use such an identifier under your authority domain and plan to back it up later with a web service that provides more information.  Or you could let someone else, who takes a copy of your raw data, do it for you, as OpenlyLocal might: http://openlylocal.com/financial_transactions/135/2010/33854 or maybe how the project we are working on with LGID might: http://id.spending.esd.org.uk/Payment/36UF/ds00024616.  In the open flexible world of Linked Data it doesn’t matter too much which domain an identifier is published from, or for that matter how many [related] identifiers are used for the same thing.

It does matter, however, for those looking to the identifying URI for some idea of authority.  As I say above, technically it doesn’t matter whose domain the identifier comes from, but I believe it would be better overall if it came from the authority whose payment it is identifying.  Which puts us back in the chicken-or-egg situation as to resolving the URI to serve up more information.   The joy of Linked Data is that, provided aggregators consider the possibility of being able to identify source authorities’ data accurately when they encode it, it should be possible to automatically retrofit links between URIs at a later date.

In summary, over this series of posts we have seen a technology which, although it has obvious benefits, is still early on the development curve, being applied to a process which is also new and scary for many.  An ideal breeding ground for cries of pain and assertions of ‘it doesn’t work’ or ‘not worth bothering’, yet with the potential to provide a powerful foundation for a future data-rich environment that is open, accessible, and beneficial to authorities, government, citizens, and UK plc.  Yes it is worth bothering, just don’t expect benefits on day, or even month, one.

This post was also published on the Nodalities Blog

 

Linked Spending Data – How and Why Bother Pt2

I started the previous post in this mini-series with an assumption – “…working on the assumption that publishing this [local government spending] data is a good thing”. That post attracted several comments, fortunately none challenging the assumption.   So, learning from that experience, I am going to start with another assumption in this post: publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing.  Those new to this mini-series, check back to the previous post for my reasoning behind the assertion.

In this post I am going to be concentrating more on the How than the Why Bother.

To help with this I am going to use some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area as examples.  Take a look at the spending data part of their site: spending.lichfielddc.gov.uk.   On the surface, navigating your way around the site looking at council spend by type, subject, month, and supplier is the kind of experience a user would expect. Great for a website displaying information about a single council.

However, it is more than a web site.  Inspection of the Download data tab shows that you can get your hands on the source data in csv format.  Here is one line, representing a line of expenditure, from that data:

"http://statistics.data.gov.uk/id/local-authority/41UD","Lichfield District Council","2010-04-06","7747","http://spending.lichfielddc.gov.uk/spend/8605670","120.00","BRISTOW & SUTOR","401","Revenue Collection","Supplies & Services","Bailiff Fees",""

… which represents the data displayed on this human readable page:

[Screenshot: Lichfield District Council spending data – details of payment number 8605670]
Looking through the csv, you can pick out the strings of characters for information such as the date, supplier name, department name etc.  In addition you can pick out a couple of URIs:

In the context of csv, that’s all these URIs are – identifiers.  However, because they are http URIs you can click through to the address to get more information.  If you do that with your web browser you get a human-readable representation of the data.  These sites also provide access to the same data, formatted in RDF, for use by developers.

You can see that data by adding ‘.rdf’ to the end of the address, thus: http://spending.lichfielddc.gov.uk/spend/8605670.rdf and then selecting the ‘view source’ option of your browser for the page of gobbledegook that you get back.

Inspecting the RDF, you will see that most things, except descriptive labels and financial values, are now identified as URIs such as http://spending.lichfielddc.gov.uk/subjective/bailiff-fees and http://spending.lichfielddc.gov.uk/invoice/7747.  Again, if you follow those links you will get a human-readable representation of that resource, and the RDF behind it by adding a ‘.rdf’ suffix.
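As a rough sketch of what a developer might do with this (assuming the Lichfield endpoints are still live and the rdflib Python library is installed), the RDF for a payment can be fetched and inspected in a few lines:

from rdflib import Graph

g = Graph()
# Fetch and parse the RDF representation of a single payment resource
g.parse("http://spending.lichfielddc.gov.uk/spend/8605670.rdf")

# Print every triple - payer, payee, category, and so on
for subject, predicate, obj in g:
    print(subject, predicate, obj)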

The eagle-eyed, inspecting the RDF-XML for Lichfield payment number 8605670, will have noticed a couple of things.  Firstly, a liberal sprinkling of elements with names like payment:expenditureCategory or payment:payment. These come from the Payments Ontology, published on data.gov.uk as the recommended way of encoding spending, and other payment-associated data, in RDF.

Secondly, you may have spotted that there is no date, supplier name, or identifier.  That is because those pieces of information are attributes associated with the invoice – number 7747 in this case.

Zooming out from the data for a moment, and looking at the human-readable form, you will see that most things, like spend type, invoice number, and supplier name, are clickable links which take you through to relevant information about those things – address details and payments for a supplier, all payments for a category, etc.  This intuitive, natural navigation style often comes as a positive consequence of thinking about data as a set of linked resources instead of the traditional rows and columns that we are used to.  Another great example of this effect can be found on a site such as the BBC Wildlife Finder.  That is not to say that you could not have created such a site without even considering Linked Data, of course you could.  However, data modelled as a set of linked resources almost self-describes the ideal navigation paths for a user interface to display it to a human.

The Linked Data practice of modelling data, such as spending data, as a set of linked resources and identifying those resources with URIs [which if looked up will provide information about that resource] is equally applicable to those outside of an individual authority.  By being able to consume that data, whilst understanding the relationships within it and having confidence in the authority and persistence of the identifiers within it, a developer can approach the task of aggregating, comparing, and using that data in their applications more easily.

So, how do I (as a local authority) get my data from its raw flat csv format into RDF with suitable URIs and produce a site like Lichfield’s?  The simple answer is that you may not have to – others may help you do some, if not all, of it.   With help from organisations such as esd-toolkit, OpenlyLocal, and SpotlightOnSpend, and with projects such as the xSpend project we are working on with LGID, many of the conversion [from csv], data formatting, and aggregation processes are being addressed – maybe not as quickly or completely as we would like, but they are.  As to a human-readable web view of your data, you may be able to copy Stuart by taking up the offer of a free Talis Platform Store and then running your own web server with his code, which he hopes to share as open source.  Alternatively it might be worth waiting for others to aggregate your data and provide a way for your citizens to view it.

As easy as that then! – Well, not quite: there are some issues about URI naming and creation, and how you bring the data together, that still need addressing by those engaged in this.  But that is for Part 3….

This post was also published on the Nodalities Blog

Linked Spending Data – How and Why Bother Pt1

National Government instructing the 300+ UK Local Authorities to publish “New items of local government spending over £500 to be published on a council-by-council basis from January 2011” has had the proponents of both open, and closed, data excited over the last few months.  For this mini-series of posts I am working on the assumption that publishing this data is a good thing, because I want to move on and assert that [when publishing] one format/method used to make this data available should be Linked Data.

This immediately brings me to the Why Bother? bit, which itself breaks into two connected questions: why bother publishing any local authority data as Linked Data, and why bother using the unexciting, simplistic spending data as a place to start?

I believe that spending data is a great place to start, both for publishing local government data and for making such data linked.  Someone at national level was quite astute in choosing spending as a starting point.  To comply with the instruction, all an authority has to do is produce a file containing five basic elements for each payment transaction: an id, a date, a category, a payee, and an amount.  At a very basic level it is very easy to measure whether an authority has done that or not.

Guidance from data.gov.uk expands on this a little by mandating the following:

  • Body – the URI that represents (or more properly ‘identifies’ – see below) the local authority at statistics.data.gov.uk, e.g. http://statistics.data.gov.uk/id/local-authority-district/00CN
  • Date – should ideally be the payment date as recorded in the purchase or general ledger
  • Transaction number – to identify the payment within the authority’s system, for future reference
  • Amount – in Sterling, as recorded in the finance system
  • Supplier Details – name and individual authority id for the supplier, plus where possible a Companies House, Charity Registration, or other recognised identifier
  • Expense Area – the part of the authority that spent the amount
  • Service Categorization – depending on the accounts system this may be easy or quite difficult; there are two candidates for categorization – CIPFA’s BVACOP classification and the Proclass procurement classification system

… a little more onerous, possibly around the areas of identifying company numbers and Service Categorization, but not much room for discussion/interpretation.

As to the file formats in which to publish data, the same advice mandates: “The files are to be published in CSV file format”, supplemented by “Authorities may wish to publish the data in additional formats as well as the CSV files (e.g. linked data, XML, or PDFs for casual browsers). There is no reason why they should not do this, but this is not a substitute for the CSV files.”

So fairly clear, and measurable, then. You either have published your required basic elements of data in a CSV format file, or you have not.  Couple this with the political ambitions and drive behind the Government’s Transparency Agenda, and local authorities will have difficulty in not delivering this, although some are being a bit tardy and others seem reluctant to publish in formats other than pdf.

OK, so why bother with applying Linked Data techniques to this [boring] spending data?  Well, precisely because it is simple data it is comparatively easy to do, and because everybody is publishing this data the benefits of linking should soon become apparent.   Linked Data is all about identifying things and concepts, giving them globally addressable identifiers (URIs) and then describing the relationships between them.

For those new to Linked Data, the use of URIs as identifiers often causes confusion.   A URI, such as http://statistics.data.gov.uk/id/local-authority-district/00CN, is a string of characters that is as much an identifier as the payroll number on your pay-check, or a barcode on a can of beans.  It has a couple of attributes that make it different from traditional identifiers.  Firstly, the first part of it is created from the Internet domain name of the organisation that publishes the identifier.  This means that it can be globally unique: theoretically you could have the same payroll number as the barcode number on my can of beans – adding the domain avoids any possibility of confusion.  Secondly, because the domain is prefixed by http:// it gives the publisher the ability to provide information about the thing identified, using well established web technologies.  In this particular example, http://statistics.data.gov.uk/id/local-authority-district/00CN is the identifier for Birmingham City Council; if you click on it [using it as an internet address] data.gov.uk will supply you with information about it – name, location, type of authority, etc.
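As a hedged illustration of that second attribute (assuming the statistics.data.gov.uk service still resolves these URIs), a program can ask for a machine-readable description of the identified thing using ordinary HTTP content negotiation:

import urllib.request

# Dereference the identifier, asking for RDF rather than a web page
uri = "http://statistics.data.gov.uk/id/local-authority-district/00CN"
req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

with urllib.request.urlopen(req) as resp:
    print(resp.read(500))  # first 500 bytes of the RDF description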

Following this approach – creating URI identifiers for suppliers, categories, and individual payments, and defining the relationships between them using the Payments Ontology (more on this when I come on to the How) – leads to a Linked Data representation of the data.  In technical terms, a comparatively easy step using scripts etc.

By publishing Linked Spending Data and loading it into a Linked Data store, as Lichfield DC have done, it becomes possible to query it, to identify things like all payments for a supplier, or all suppliers for a category, etc.
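For instance – a minimal sketch, again assuming the Lichfield data is reachable and hedging on the exact Payments Ontology namespace URI – such a query might look like this with rdflib and SPARQL:

from rdflib import Graph

g = Graph()
g.parse("http://spending.lichfielddc.gov.uk/spend/8605670.rdf")  # load one payment

# Find every payment made to a given supplier
query = """
PREFIX payment: <http://reference.data.gov.uk/def/payment#>
SELECT ?payment WHERE {
    ?payment payment:payee <http://spending.lichfielddc.gov.uk/supplier/bristow-sutor> .
}
"""
for row in g.query(query):
    print(row.payment)

Against a store holding a whole authority’s data, the same query would return every payment to that supplier.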

If you then load data for several authorities into an aggregate store, as we are doing in partnership with LGID, those queries can identify patterns or comparisons across authorities.  Which brings me to….

Why bother publishing any local authority data as Linked Data?  Publishing as Linked Data enables an authority’s data to be meshed with data from other authorities and other sources such as national government.  For example, the data held at statistics.data.gov.uk includes which county an authority is located within.  By using that data as part of a query, it would for instance be possible to identify the total spend, by category, for all authorities in a county such as the West Midlands.

As more authority datasets are published, sharing the same identifiers for authority, category, etc., they will naturally link together, enabling natural navigation of the information between council departments, services, costs, suppliers, etc.  Once this step has been taken and the dust settles a bit, this foundation of linked data should become an open data platform for innovative development and the publishing of other data that will link in with this basic but important financial data.

There are however some more technical issues – URI naming, aggregation, etc. – to be overcome, or at least addressed, in the short term to get us to that foundation.  I will cover these in Part 2 of this series.

This post was also published on the Nodalities Blog