Many start with a large spreadsheet or database that they have never published to anyone before, and are unsurprisingly a little concerned when confronted with fevered cries to publish everything as Linked Open Data – Now!
Relax – make yourself a mug of your favourite hot beverage and approach this rationally.
There have been many presentations, posts, and the like about taking things from Sir Tim Berners-Lee’s now famous ‘raw data now’ chant in his TED talk, to your data becoming a fully fledged member of the Linked Open Data Cloud. I produced a blog post on this a while back entitled The Data Publishing Three-Step as a help to those considering taking this data journey. Because that’s what it is, a journey: a series of stages to go through as you address different aspects of the open data publishing process, each stage building on the previous one in an achievable way.
Whilst you are contemplating what might be the outcome of those stages, you could do no better than to sip your [by now probably only warm] beverage from a 5 Star data mug! Emblazoned on its side are the 5 star ratings for Linked Open Data:
* On the web, open licensed – get your data out there, in any form, for others to use under an open licence, such as the Open Government Licence for Public Sector Information – clear and unambiguous. For many, this is one of the significant steps, because it often involves convincing others that this might be a good idea.
** Machine-readable data – make the data you have just published readable by software. If it was a spreadsheet that you previously published as a nicely formatted pdf – make the Excel file available in addition.
*** Non-proprietary format – publish a csv file as well, so that it can be used in software and applications other than Microsoft’s.
**** RDF standards – start using URIs as identifiers and publishing in RDF format. This step is another that needs a bit more thought as to how you are going to describe your data.
Many of these steps could be taken in one go – you could go directly to 3* data in many cases. Some of the steps could be taken by others on your behalf – converting your 3* data and republishing it as 5*. There are many variations and options we come across when working with organisations to help them to confidently enter the world of Linked Data.
This post was also published on the Talis Platform Consulting Blog
Friday night – nothing on the TV – I know! I’ll browse through the Protection of Freedoms Bill, currently passing through the UK Parliament. Sad I know, but interesting.
Let’s scroll back in time a bit to November 19th 2010 and a government press conference introduced by a video from Prime Minister David Cameron. The headline story was about the publishing of government spending and contract data, but towards the end of this 109 second short he said the following:
… the most exciting is a new right to data. Which will let people request streams of government information and use it for social or commercial purposes. Take all this together and we really can make this one of the most open, accountable and transparent governments there is. Let me end by saying this. You are going to have so much information about what we do, how much of your money we spend doing it, and what the outcome is. So use it, exploit it, hold us to account. Together we can set a great example of what a modern democracy ought to look like. (my emphasis)
After some amendments which allow for datasets and provision in electronic form we get this: “the public authority must, so far as reasonably practicable, provide the information to the applicant in an electronic form which is capable of re-use.” Unfortunately there is no definition of the term re-use. It could be argued that a pdf of some tables in an MS Word document could be re-used, whereas I believe the spirit of the legislation should be made more explicit by identifying non-proprietary data formats. I know this would be a tricky job for the parliamentary draftsmen, as we would not want to restrict it to things such as XML and csv, which could age and be replaced by something better that then could not be used because it had not been mentioned in the legislation. Even so, I believe that just using the term ‘re-use’ is far too woolly and open to [mis]interpretation.
What is [not] a dataset
This is one of the areas that raises most concern for me. Check out this wording from the Bill:
I am OK with (a) – data collected as part of an authority doing its job – and (c) – don’t change the data you have collected – publishing that raw data is important. However, (b) specifically excludes data that is the product of analysis. Presumably analysis of collected data is one significant way that an authority measures the outcomes of its efforts. Understanding that analysis will help us understand the subsequent decisions and actions the authority takes. I assume that there may be some specific reasons that underpin this blanket exclusion of analysis data. If there are, they should be identified, instead of generally throttling the output of useful data that will go a long way to helping with Mr Cameron’s stated ambition for us to be able to see “what the outcome is” of the spending of public money.
Release of datasets for re-use
This is a whole new section (11A) to be added to the 2000 Act to cover the release of datasets. It covers ownership, copyright, and/or database right of the information to be published and states that it should be published under “the licence specified by the Secretary of State in a code of practice issued under section 45”. Section 45 basically puts into the hands of the Secretary of State the definition of the licence(s) data should be published under. As of today the Open Government Licence for public sector information is what is wanted to keep the publishing of information open. However, what is there to stop a future Secretary of State, with a less open outlook, from replacing it with far more restrictive licences? Do we not need some form of presumption of openness attached to the Secretary of State’s powers as part of this change in legislation?
On the topic of presumptions of openness, the wording of this bill contains phrases such as “unless the authority is satisfied that it is not appropriate for the dataset to be published” and “where reasonably practicable”. It is clear that many in the public sector are not as enthusiastic about publishing data as the current government is, and vague phrases such as these may well be used unreasonably by some to justify throttling the stream of information. They could easily be used to build in a bureaucratic decision hurdle for each dataset to jump, proving its appropriateness and practicality, before publication. I am sure that it would not be beyond a parliamentary draftsman’s skill to produce wording that means that all will be published, unless a specific objection is raised for an individual dataset, for reasons of excessive effort or data protection.
Data published by an authority should be published under a scheme, and the following applies here:
How should we interpret “any up-dated version held by the authority of such a dataset”? My interpretation is that once a dataset has been published it shall continue to be published as it changes. The precedent for this is spending data – having published authority spending for January 2011, authorities should be automatically publishing it for February and following months. But what if, in response to a request, an authority publishes the contents of a spreadsheet used to track the amount of salt applied to roads in its area during winter 2010-11 and then uses a different spreadsheet for the following winter? Does the output of that new spreadsheet constitute a new dataset, or an up-date to its predecessor? From the wording in the Bill it is not clear.
Who does it cover?
I probably need a bit of help here from those that understand the public sector better than I do, but I am suspicious that references to the organisations listed in Schedule 1 and “the wider public sector”, do not take the net wide enough to cover some of the data that is relevant to our daily lives but is delivered on behalf of some authorities by third parties. For example I am aware that recently a large city was not able to inform citizens of their rubbish collection schedules because that data was considered as commercially restricted by their service provider.
So in summary, I welcome the commitment to a right to data being realised by streams of government information about what we do, how much of our money is spent doing it, and what the outcomes are. However, I am sceptical as to how effective the measures in the current Protection of Freedoms Bill will be in delivering them, especially in the light of very recent comments made by the Prime Minister highlighting the “enemies of enterprise” in Whitehall and town halls across the country, attacking what he called the “mad” bureaucracy that holds back entrepreneurs. Those enemies are just the people who might take the wording of this bill as ammunition in their cause.
Whilst being concerned about this topic, I have been wondering why few are commenting on it. Are the majority just taking the press conference statements by David Cameron, and his fellow Ministers, as indications of a battle won, or am I missing something? I promote Sir Tim Berners-Lee’s 5 Star Data as the steps towards a Web of Linked Data – if we don’t get the publishing of public sector data to at least 3 star standard (Available as machine-readable structured data – in non-proprietary format), many of the current ambitions may remain just that, ambitions. That would be a massive missed opportunity.
So are we getting a right to data? – or just some provisions to extend the Freedom of Information Act a bit further in the dataset direction? I’m not sure.
Personal note: As you may tell from the above, I am no expert on the interpretation of parliamentary legislation, and I have left several unanswered questions hanging in this post. Any help in clarifying my thinking, confirming or disproving my assumptions, or answering some of those questions, will be gratefully received in comments to this post or your own posted thoughts.
This post was also published on the Nodalities Blog
As is often the way, events have conspired to prevent me from producing this third and final part in this How & Why of Local Government Spending Data as soon as I wanted. So my apologies to those eagerly awaiting this latest instalment.
To quickly recap, in Part 1 I addressed issues around why pick on spending data as a start point for Linked Data in Local Government, and indeed why go for Linked Data at all. In Part 2, I used some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area, as examples to demonstrate how you can publish spending data as Linked Data, for both human and programmatic consumption.
I am presuming that you are still with me on my basic assumptions “…publishing this [local government spending] data is a good thing” and “Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing”, and that the technique of using URIs to name things in a globally unique way (that also provides a link to more information) is not giving you mental indigestion. So, I now want to move on to some of the issues that are causing debate in the community, which come under the headings of ontologies and identifiers.
An ontology, according to Wikipedia, is a formal representation of knowledge as a set of concepts within a domain – an ontology provides a shared vocabulary, which can be used to model a domain – that is, the type of objects and/or concepts that exist, and their properties and relations. So in our quest to publish spending data what ontology should we use? The Payments Ontology, with the accompanying guide to its application, is what is needed. Using it, it becomes possible to describe individual payments, or expenditure lines, and their relationships to the authority (payment:payer), the supplier (payment:payee), the category (payment:expenditureCategory), etc. The next question is how you identify the things that you are relating together using this ontology.
Let’s take this one step at a time:
Give the expenditure line, or individual payment, an identifier, possibly generated by our accounts system, e.g. 8605670.
Make that identifier unique to our local authority by prefixing it with our internet domain name. eg. http://spending.lichfielddc.gov.uk/spend/8605670 – note the prefix of ‘http://’. This enables anyone wanting detail about this item to follow the link to our site to get the information.
Associate a payer with the payment with an RDF statement (or triple) using the Payments Ontology: http://spending.lichfielddc.gov.uk/spend/8605670
Note I am using an identifier for the payer that is published by statistics.data.gov.uk. That is so that everyone else will unambiguously understand which authority is the one responsible for the payment.
Follow the same approach for associating the payee http://spending.lichfielddc.gov.uk/spend/8605670
And then repeat the process for categorisation, payment value etc.
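The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code Lichfield uses: the namespace URI given for the Payments Ontology and the helper names `payment_uri` and `triple` are my own assumptions, while the payment, payer, and supplier URIs are the ones quoted in these posts. The output is in the plain N-Triples style, one statement per line.

```python
# Assumed namespace for the Payments Ontology; check the published
# ontology before relying on this exact URI.
PAYMENT = "http://reference.data.gov.uk/def/payment#"


def payment_uri(domain, local_id):
    """Make an accounts-system id globally unique by prefixing our domain."""
    return f"http://{domain}/spend/{local_id}"


def triple(subject, predicate, obj):
    """Return one N-Triples statement relating two URI-identified things."""
    return f"<{subject}> <{predicate}> <{obj}> ."


spend = payment_uri("spending.lichfielddc.gov.uk", "8605670")

statements = [
    # The payer is identified with a URI published by statistics.data.gov.uk.
    triple(spend, PAYMENT + "payer",
           "http://statistics.data.gov.uk/id/local-authority/41UD"),
    # The payee is identified with a locally minted supplier URI.
    triple(spend, PAYMENT + "payee",
           "http://spending.lichfielddc.gov.uk/supplier/bristow-sutor"),
]

print("\n".join(statements))
```

Category and payment value would follow the same pattern, with a literal value, not a URI, as the object for the amount.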
This immediately throws up a couple of questions, such as why use a locally defined identifier for the payee – surely there is an identifier I can use that others will recognise, such as a company or VAT number! There are, but at the moment there are no established sets of URI identifiers for these. OpenCorporates.com are doing some excellent work in this area, but Companies House, the logical choice for publishing such identifiers, have yet to do so. Pragmatically it is probably a good idea to have a local identifier anyway and then associate it with another publicly recognised identifier: http://spending.lichfielddc.gov.uk/supplier/bristow-sutor
owl:sameAs http://opencorporates.com/companies/uk/01431688 .
Because this is all very new and still emerging, we now find ourselves in a bit of a chicken-or-egg situation. I presume that most authorities have not built a mini spending website, like Lichfield District Council has, to serve up details when someone follows a link like this: http://spending.lichfielddc.gov.uk/spend/8605670
You could still use such an identifier using your authority domain, and plan to back it up later with a web service to provide more information. Or you could let someone else, who takes a copy of your raw data, do it for you as OpenlyLocal might: http://openlylocal.com/financial_transactions/135/2010/33854 or maybe how the project we are working on with LGID might: http://id.spending.esd.org.uk/Payment/36UF/ds00024616. In the open, flexible world of Linked Data it doesn’t matter too much which domain an identifier is published from, or for that matter how many [related] identifiers are used for the same thing.
It does matter however, for those looking to the identifying URI for some idea of authority. As I say above, technically it doesn’t matter whose domain the identifier comes from, but I believe it would be better overall if it came from the authority whose payment it is identifying. Which puts us back in the chicken-or-egg situation as to resolving the URI to serve up more information. The joy of Linked Data is that, provided aggregators consider the possibility of being able to identify source authorities’ data accurately when they encode it, it should be possible to automatically retrofit links between URIs at a later date.
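To make the retrofitting idea a little more concrete, here is a minimal sketch, assuming the links arrive as simple pairs of URIs that have been declared owl:sameAs each other. It groups every identifier that refers to the same thing, however many domains they were minted under. The function name `same_as_groups` and the `aggregator.example` URIs are hypothetical; the Lichfield and OpenCorporates URIs are the pair quoted earlier in this post.

```python
from collections import defaultdict


def same_as_groups(pairs):
    """Group URIs connected, directly or indirectly, by sameAs links."""
    parent = {}

    def find(uri):
        # Follow parent pointers to the representative URI of the group.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # shorten the chain as we go
            uri = parent[uri]
        return uri

    for left, right in pairs:
        parent[find(left)] = find(right)  # merge the two groups

    groups = defaultdict(set)
    for uri in list(parent):
        groups[find(uri)].add(uri)
    return list(groups.values())


LOCAL = "http://spending.lichfielddc.gov.uk/supplier/bristow-sutor"
LINKS = [
    (LOCAL, "http://opencorporates.com/companies/uk/01431688"),
    # Hypothetical aggregator identifiers, added at a later date.
    ("http://aggregator.example/supplier/123", LOCAL),
    ("http://aggregator.example/supplier/999",
     "http://other-council.example/supplier/acme"),
]

for group in same_as_groups(LINKS):
    print(sorted(group))
```

The point of the sketch is that links added later by a third party join up with the authority’s own identifiers without anyone having to republish the original data.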
In summary, over this series of posts we have seen a technology which, although it has obvious benefits, is still early on the development curve, being applied to a process which is also new and scary for many. That is an ideal breeding ground for cries of pain and assertions of ‘it doesn’t work’ or ‘not worth bothering’, yet it has the potential to provide a powerful foundation for a future data-rich environment that is open, accessible, and beneficial to authorities, government, citizens, and UK Plc. Yes it is worth bothering, just don’t expect benefits on day, or even month, one.
This post was also published on the Nodalities Blog
I started the previous post in this mini-series with an assumption – ..working on the assumption that publishing this [local government spending] data is a good thing. That post attracted several comments, fortunately none challenging the assumption. So learning from that experience I am going to start with another assumption in this post. Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing. Those new to this mini-series, check back to the previous post for my reasoning behind the assertion.
In this post I am going to be concentrating more on the How than the Why Bother.
To help with this I am going to use some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area as examples. Take a look at the spending data part of their site: spending.lichfielddc.gov.uk/. On the surface, navigating your way around the site looking at council spend by type, subject, month, and supplier is the kind of experience a user would expect. Great for a website displaying information about a single council.
However, it is more than a web site. Inspection of the Download data tab shows that you can get your hands on the source data in csv format. Here is one line, representing a line of expenditure, from that data:
"http://statistics.data.gov.uk/id/local-authority/41UD","Lichfield District Council","2010-04-06","7747","http://spending.lichfielddc.gov.uk/spend/8605670","120.00","BRISTOW & SUTOR","401","Revenue Collection","Supplies & Services","Bailiff Fees",""
In the context of csv, that’s all these URIs are, identifiers. However because they are http URIs you can click through to the address to get more information. If you do that with your web browser you get a human readable representation of the data. These sites also provide access to the same data, formatted in RDF, for use by developers.
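As a rough illustration of treating those URIs as plain identifiers, the line above can be read with Python’s standard csv module. The column names here are assumptions made up for this sketch (the download’s real header row is not shown in this post), and the last few are deliberately generic because I am guessing at what each column holds.

```python
import csv
import io

# Assumed column names, for illustration only.
COLUMNS = [
    "authority_uri", "authority_name", "date", "invoice_number",
    "payment_uri", "amount", "supplier", "code",
    "label_1", "label_2", "label_3", "notes",
]

LINE = (
    '"http://statistics.data.gov.uk/id/local-authority/41UD",'
    '"Lichfield District Council","2010-04-06","7747",'
    '"http://spending.lichfielddc.gov.uk/spend/8605670","120.00",'
    '"BRISTOW & SUTOR","401","Revenue Collection",'
    '"Supplies & Services","Bailiff Fees",""'
)

# Parse the single line and pair each value with its assumed column name.
values = next(csv.reader(io.StringIO(LINE)))
row = dict(zip(COLUMNS, values))

# Pick out the values that are http URIs, i.e. identifiers you can follow.
uris = {name: value for name, value in row.items()
        if value.startswith("http://")}

print(row["supplier"], row["amount"])
print(uris)
```

Nothing in the csv itself says that the two URI columns are special; it is only because they start with http:// that a person or program can follow them for more information.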
The eagle-eyed, inspecting the RDF-XML for Lichfield payment number 8605670, will have noticed a couple of things. Firstly, a liberal sprinkling of elements with names like payment:expenditureCategory or payment:payment. These come from the Payments Ontology as published on data.gov.uk as the recommended way of encoding spending, and other payment associated data, in RDF.
Secondly, you may have spotted that there is no date, or supplier name or identifier. That is because those pieces of information are attributes associated with a payment – invoice number 7747 in this case.
Zooming out from the data for a moment, and looking at the human readable form, you will see that most things, like spend type, invoice number, supplier name, are clickable links, which take you through to relevant information about those things – address details & payments for a supplier, all payments for a category etc. This intuitive natural navigation style often comes as a positive consequence of thinking about data as a set of linked resources instead of the traditional rows & columns that we are used to. Another great example of this effect can be found on a site such as the BBC Wildlife Finder. That is not to say that you could not have created such a site without even considering Linked Data, of course you could. However, data modelled as a set of linked resources almost self-describes the ideal navigation paths for a user interface to display it to a human.
The Linked Data practice of modelling data, such as spending data, as a set of linked resources and identifying those resources with URIs [which if looked up will provide information about that resource] is equally applicable to those outside of an individual authority. By being able to consume that data, whilst understanding the relationships within it and having confidence in the authority and persistence of the identifiers within it, a developer can approach the task of aggregating, comparing, and using that data in their applications more easily.
So, how do I (as a local authority) get my data from its raw flat csv format into RDF with suitable URIs, and produce a site like Lichfield’s? The simple answer is that you may not have to – others may help you do some, if not all, of it. With help from organisations such as esd-toolkit, OpenlyLocal, and SpotlightOnSpend, and with projects such as the xSpend project we are working on with LGID, many of the conversion [from csv], data formatting, and aggregation processes are being addressed – maybe not as quickly or completely as we would like, but they are. As to a human readable web view of your data, you may be able to copy Stuart by taking up the offer of a free Talis Platform Store and then running your own web server with his code, which he hopes to share as open source. Alternatively it might be worth waiting for others to aggregate your data and provide a way for your citizens to view your data.
As easy as that then! – Well not quite, there are some issues about URI naming and creation, and how you bring the data together that still do need addressing by those engaged in this. But that is for Part 3….
This post was also published on the Nodalities Blog
A colleague sharing their experience of visiting Ironbridge, promoted as “The Birthplace of the Industrial Revolution”, helped clarify some thoughts I have been brewing to help convey where the current Linked Data enthusiasms and initiatives may lead us.
The famous Iron Bridge, opened in 1781, spans the River Severn in Shropshire, England. To quote Wikipedia: “It was the first arch bridge in the world to be made out of cast iron, a material which was previously far too expensive to use for large structures. However, a new blast furnace nearby lowered the cost and so encouraged local engineers and architects to solve a long-standing problem of a crossing over the river.” The raw materials of iron ore and coal had been known for a long time, but it took the building of a nearby furnace, using the innovation of coke as a fuel, to enable the local community to invest in the construction. The outcome not only stimulated the local commercial and administrative economy; the bridge also became a tourist attraction in the 18th century, which it remains today.
All very interesting, but what has this to do with Linked Data and its future?
The impact of Linked Data and the Web of Data it enables, on the way we interact and do business, will be greater than that of the World Wide Web that it builds upon.
When you make statements like that one, you are often asked to justify yourself. As you may know, I like to use analogies to help clarify things and I believe the Industrial Revolution is a good one in the case of the future for Linked Data and associated techniques. I am also very aware that analogies tend to fall apart if you pick at the detail too much, so please bear with me on this one.
Like the Industrial Revolution, Linked Data is building on what went before. Before the Iron Bridge, there were other bridges, roads, and uses of iron. Before Linked Data there was/is the Web – a globally distributed web of linked human-readable web pages, upon which are surfaced words and images for our information, entertainment and commercial desires. Data of course plays its part, often powering the websites that we all consume.
So what is so special about a Web of Data? – The data comes out from behind those websites to be linked with other data across the web, or maybe an intranet. Using the same techniques for linking pages together [the URL], data identifiers are given URIs. This means that a piece of data is given an identifier that is addressable across the web and therefore linkable with other data identified in a similar way.
So where does the Industrial Revolution analogy kick in? Well, once data are identifiable in a globally distributed context, they can be linked, mixed, mashed, and generally used to add value to each other. Your data can become the raw material for someone else’s process – your Wikipedia comment about an animal can become the description on a data-powered BBC page about that species. As with coal, which after some refinement can become coke to be used to add value to the iron smelting process, any published data can be the raw material for value adding/combining processes. The processor utilises their knowledge, skills, and experience to produce an alloy of data, the combination of which is greater than the sum of its parts.
In the same way that some freely available elements, such as the air pumped into that blast furnace, were needed to get the process going, freely and openly available data, such as governments and the media are publishing, are priming the pumps of a data revolution.
Whenever there is value to be added in a process there is both community and commercial opportunity. Once people start using their skill and understanding of a facet of knowledge, to link data from one free, or commercial, source with more free or commercial data they can produce either a saleable result, and/or an enhancement to their own services. The output of one value-add process can then become one of the sources for yet another, and so on.
To finally stretch my analogy just a little further – looking back to those early days in the Severn Valley, it is possible to identify the building-blocks that led to commercial steel production, the age of steam, the automobile industry, and space flight, most of which would have been unthinkable to those early pioneers. Pre-1994, could we have predicted the growth of Google, YouTube, Wikipedia, and Twitter? In 2010 can we identify the building-blocks of a data revolution? – I think maybe we can.
So how will such a revolution, underpinned by Linked Data, change the way we interact and do business, more fundamentally than the Web has? – By creating whole new communities and industries to connect, supply, trade, enhance, distribute, interpret, and build services and applications upon a supporting web of globally available data elements and alloys.
National Government instructing the 300+ UK Local Authorities to publish “New items of local government spending over £500 to be published on a council-by-council basis from January 2011” has had the proponents of both open, and closed, data excited over the last few months. For this mini series of posts I am working on the assumption that publishing this data is a good thing, because I want to move on and assert that [when publishing] one format/method to make this data available should be Linked Data.
This immediately brings me to the Why Bother? bit. This itself breaks into two connected questions – Why bother publishing any local authority data as Linked Data? and Why bother using the unexciting, simplistic spending data as a place to start?
I believe that spending data is a great place to start, both for publishing local government data and for making such data linked. Someone at national level was quite astute choosing spending as a starting point. To comply with the instruction all an authority has to do is produce a file containing five basic elements for each payment transaction: An Id, a date, a category, a payee, and an amount. At a very basic level it is very easy to measure if an authority has done that or not.
Should ideally be the payment date as recorded in purchase or general ledger
To identify within authority’s system, for future reference
In Sterling recorded in finance system
Name and individual authority id for supplier plus where possible Companies House, Charity Registration, or other recognised identifier
The part of the authority that spent the amount
Depending on the accounts system this may be easy or quite difficult. There are two candidates for categorization – CIPFA’s BVACOP classification and the Proclass procurement classification system.
… a little more onerous, possibly around the areas of identifying company numbers and Service Categorization, but not much room for discussion/interpretation.
As to the file formats to publish data, the same advice mandates: The files are to be published in CSV file format – supplemented by – Authorities may wish to publish the data in additional formats as well as the CSV files (e.g. linked data, XML, or PDFs for casual browsers). There is no reason why they should not do this, but this is not a substitute for the CSV files.
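Because the basic requirement is so simple, checking a csv file against it is easy to automate. The following is a minimal sketch, not an official validator: the header names in `REQUIRED` and the sample rows are hypothetical, chosen to mirror the five basic elements listed above (an Id, a date, a category, a payee, and an amount).

```python
import csv
import io

# Hypothetical header names for the five basic elements of a payment.
REQUIRED = ("id", "date", "category", "payee", "amount")


def missing_elements(csv_text):
    """Return {line_number: [names of empty or absent fields]}."""
    problems = {}
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 of the file is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        missing = [name for name in REQUIRED
                   if not (row.get(name) or "").strip()]
        if missing:
            problems[line_number] = missing
    return problems


# Made-up sample: the second payment has no category.
SAMPLE = (
    "id,date,category,payee,amount\n"
    "8605670,2010-04-06,Bailiff Fees,BRISTOW & SUTOR,120.00\n"
    "8605671,2010-04-07,,Example Supplier Ltd,45.50\n"
)

print(missing_elements(SAMPLE))
```

A check like this only measures whether the elements are present; it says nothing about whether the categories or supplier identifiers used are ones that other authorities would recognise.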
OK so why bother with applying Linked Data techniques to this [boring] spending data? Well, precisely because it is simple data, it is comparatively easy to do, and because everybody is publishing this data the benefits of linking should soon become apparent. Linked Data is all about identifying things and concepts, giving them globally addressable identifiers (URIs) and then describing the relationships between them.
For those new to Linked Data, the use of URIs as identifiers often causes confusion. A URI, such as http://statistics.data.gov.uk/id/local-authority-district/00CN, is a string of characters that is as much an identifier as the payroll number on your pay-check, or a barcode on a can of beans. It has a couple of attributes that make it different from traditional identifiers. Firstly, the first part of it is created from the Internet domain name of the organisation that publishes the identifier. This means that it can be globally unique. Theoretically you could have the same payroll number as the barcode number on my can of beans – adding the domain avoids any possibility of confusion. Secondly, because the domain is prefixed by http:// it gives the publisher the ability to provide information about the thing identified, using well established web technologies. In this particular example, http://statistics.data.gov.uk/id/local-authority-district/00CN is the identifier for Birmingham City Council; if you click on it [using it as an internet address] data.gov.uk will supply you with information about it – name, location, type of authority, etc.
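Those two attributes can be seen by pulling a URI apart with Python’s standard library. This is just an illustrative sketch: the function name `describe_identifier` and the two `example` URIs for the payroll number and the can of beans are made up for the purpose.

```python
from urllib.parse import urlparse


def describe_identifier(uri):
    """Split a URI identifier into the parts discussed above."""
    parts = urlparse(uri)
    return {
        # http(s) identifiers can be looked up with ordinary web technology.
        "resolvable": parts.scheme in ("http", "https"),
        # The publisher's domain is what makes the identifier globally unique.
        "publisher_domain": parts.netloc,
        # The last path segment is the publisher's own local identifier.
        "local_id": parts.path.rsplit("/", 1)[-1],
    }


info = describe_identifier(
    "http://statistics.data.gov.uk/id/local-authority-district/00CN")
print(info)

# The same local number under two (hypothetical) domains stays distinct.
payroll = "http://payroll.example.org/id/12345"
beans = "http://beans.example.com/id/12345"
same_number = (describe_identifier(payroll)["local_id"]
               == describe_identifier(beans)["local_id"])
print(same_number, payroll == beans)
```

The local numbers match, but the full URIs do not, which is exactly the confusion that adding the domain avoids.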
Following this approach, creating URI identifiers for suppliers, categories, and individual payments and defining the relationships between them using the Payments Ontology (more on this when I come on to the How) leads to a Linked Data representation of the data. In technical terms a comparatively easy step using scripts etc.
By publishing Linked Spending Data and loading it into a Linked Data store, as Lichfield DC have done, it becomes possible to query it to identify things like all payments for a supplier, or suppliers for a category, etc.
If you then load data for several authorities in to an aggregate store, as we are doing in partnership with LGID, those queries can identify patterns or comparisons across authorities. Which brings me to ….
Why bother publishing any local authority data as Linked Data? Publishing as Linked Data enables an authority’s data to be meshed with data from other authorities and other sources such as national government. For example, the data held at statistics.data.gov.uk includes which county an authority is located within. By using that data as part of a query, it would for instance be possible to identify the total spend, by category, for all authorities in a county such as the West Midlands.
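To show the shape of such a query without a Linked Data store to hand, here is a small in-memory sketch in Python. Everything in it is hypothetical: the authorities, counties, and payment amounts are made up, and a real aggregate store would be queried through its own query language and not by looping over tuples. The logic is the same, though: join payments to the county of the paying authority, then total by category.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical reference data: which county each authority sits in, the
# kind of fact described above as being held at statistics.data.gov.uk.
COUNTY_OF = {
    "authority-a": "County X",
    "authority-b": "County X",
    "authority-c": "County Y",
}

# Hypothetical payments: (paying authority, category, amount in Sterling).
PAYMENTS = [
    ("authority-a", "Bailiff Fees", "120.00"),
    ("authority-b", "Bailiff Fees", "80.50"),
    ("authority-b", "Stationery", "15.25"),
    ("authority-c", "Bailiff Fees", "999.00"),
]


def spend_by_category(county):
    """Total spend per category for every authority in the given county."""
    totals = defaultdict(Decimal)  # missing categories start at Decimal("0")
    for authority, category, amount in PAYMENTS:
        if COUNTY_OF.get(authority) == county:
            totals[category] += Decimal(amount)
    return dict(totals)


print(spend_by_category("County X"))
```

The join only works because both datasets use the same identifier for each authority, which is the whole argument for shared URIs.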
As more authority data sets are published, sharing the same identifiers for authority, category, etc., they will naturally link together, enabling the natural navigation of the information between council departments, services, costs, suppliers, etc. Once this step has been taken and the dust settles a bit, this foundation of linked data should become an open data platform for innovative development and the publishing of other data that will link in with this basic but important financial data.
There are however some more technical issues, URI naming, aggregation, etc., to be overcome or at least addressed in the short term to get us to that foundation. I will cover these in part 2 of this series.
This post was also published on the Nodalities Blog