A theme of my last few years has been enabling the increased discoverability of Cultural Heritage resources by making the metadata about them more open and consumable.
Much of this work has been at the libraries end of the sector, but I have always had an eye on the broader Libraries, Archives, and Museums world, not forgetting Galleries of course.
Two years ago, at the LODLAM Summit 2015, I ran a session to explore whether it would be possible to duplicate in some way the efforts of the Schema Bib Extend W3C Community Group, which proposed and introduced extensions and enhancements to the Schema.org vocabulary to improve its capability for describing bibliographic resources – but this time for archives: physical, digital, and web.
Interest was sufficient for me to set up and chair a new W3C Community Group, Schema Architypes. The main activity of the group has been the creation of, and discussion around, a straw-man proposal for adding new types to the Schema.org vocabulary.
Not least, the discussion has focused on how concepts from the world of archives (collections, fonds, etc.) can be represented by taking advantage of the many modelling patterns and terms already established in that generic vocabulary, and on the few things that would need to be added to expose archive metadata to aid discovery.
So why am I now suggesting that there may be an opportunity for the discovery of archives and their resources?
In web terms for something to be discoverable it, or a description of it, needs to be visible on a web page somewhere. To take advantage of the current structured web data revolution, being driven by the search engines and their knowledge graphs they are building, those pages should contain structured metadata in the form of Schema.org markup.
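As a sketch of what that structured markup could look like: the type and property names below (ArchiveComponent, ArchiveOrganization, holdingArchive) are drawn from the Architypes straw-man proposal and should be read as illustrative assumptions, not settled vocabulary, and the archive and institution named are invented.

```python
import json

# A sketch of Schema.org JSON-LD that an archive's web page could embed.
# "ArchiveComponent", "ArchiveOrganization", and "holdingArchive" come from
# the Schema Architypes straw-man proposal -- illustrative, not final.
markup = {
    "@context": "http://schema.org",
    "@type": "ArchiveComponent",
    "name": "Example fonds",                     # hypothetical archive item
    "holdingArchive": {
        "@type": "ArchiveOrganization",
        "name": "Example County Record Office",  # hypothetical institution
    },
}

# Serialised, this would sit inside a <script type="application/ld+json">
# element in the page that describes the item.
print(json.dumps(markup, indent=2))
```

The point is less the exact type names than the pattern: a page about an archival resource carrying a machine-readable description alongside the human-readable one.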
Initiatives such as ArchivesSpace and its application, and ArcLight, make it clear that many in the world of archives have been focused on web-based management, search, and delivery views of archives and the resources and references they hold and describe. As these mature, it is clear that the need for visibility on the web is starting to be addressed.
So archives are now in a great place to grab the opportunity to use Schema.org to aid discovery of their archives and what they contain. At least with these projects they have the pages on which to embed that structured web data, once a consensus around the proposals from the Schema Architypes Group has formed.
I call out to those involved with the practical application of systems for the management, searching, and delivery of archives to at least take a look at the work of the group and possibly engage on a practical basis, exploring the potential and challenges for implementing Schema.org.
So if you want to understand more about this opportunity, and how you might get involved, either join the W3C Group or contact me directly.
*Image acknowledgement to The Oberlin College Archives
It is great to launch a new venture, and I am looking forward to launching this one – Data Liberate.
Having said that, there is much continuity in this step. Those who know me from the conference circuit, my work with the Talis Group, and more recently with Talis Consulting, will recognise much in the core of what Data Liberate has to offer.
What will I be doing at Data Liberate? The simple answer is much the same, but with a wider audience and client base, and less restricted to specifically Linked Data and Semantic Web techniques and technologies. Extracting value from data, financial or otherwise, is at the core of the next big wave of innovation on the web. The Web has become central to what we all do commercially, corporately, socially, and as individuals. This data-driven next wave of innovation will therefore influence, and potentially benefit, us all.
As with all ‘new waves of innovation’ there is much technological gobbledegook, buzzwords, and marketing hype surrounding it. My driving focus for many years has been to make new stuff understandable to those who can benefit from it – a focus I intend to continue and increase over the coming months and years. This is not just a one-way process. Experience has shown me that those with the technology are often not very adept at promoting its value to others in terms that they can connect with. I therefore intend to spend some of my efforts with these technology providers, helping them get their message over for the benefit of all.
Trawling the blogosphere you will find proponents of Open Data, Big Data, Linked Data, Linked Open Data, Enterprise Data, and Cloud Computing, all trying to convince us that their core technology is the key to the future. As with anything, the future for extracting and benefiting from the value within data is a mixture of most, if not all, of the above. Hence Data Liberate’s focus is on data in all its forms.
Contact me if you want to talk through what Data Liberate will be doing and how we can help you.
The Library of Congress made an announcement earlier this week that has left some usually vocal library pundits speechless.
MARC is Dead! – RDA made irrelevant! – cries that can be heard rattling around the bibliographic blogo-twittersphere. My opinion is that this is an inevitable move based upon serious consideration, and has been building on several initiatives that have been brewing for many months.
Bold though – very bold. I am sure that there are many in the library community, who have invested much of their careers in MARC and its slightly more hip cousin RDA, who are now suffering from vertigo as they feel the floor being pulled from beneath their feet.
The Working Group on the Future of Bibliographic Control, as it examined technology for the future, wrote that the Library community’s data carrier, MARC, is “based on forty-year-old techniques for data management and is out of step with programming styles of today.”
Many of the libraries taking part in the test [of RDA] indicated that they had little confidence RDA changes would yield significant benefits…
And on a more positive note:
The Library of Congress (LC) and its MARC partners are interested in a deliberate change that allows the community to move into the future with a more robust, open, and extensible carrier for our rich bibliographic data….
….The new bibliographic framework project will be focused on the Web environment, Linked Data principles and mechanisms, and the Resource Description Framework (RDF) as a basic data model.
There is still a bit of confusion there between a data carrier and a framework for describing resources. Linked Data is about linking descriptions of things, not necessarily transporting silos of data from place to place. But maybe I quibble a little too much at this early stage.
So now what:
The Library of Congress will be developing a grant application over the next few months to support this initiative. The two-year grant will provide funding for the Library of Congress to organize consultative groups (national and international) and to support development and prototyping activities. Some of the supported activities will be those described above: developing models and scenarios for interaction within the information community, assembling and reviewing ontologies currently used or under development, developing domain ontologies for the description of resources and related data in scope, organizing prototypes and reference implementations.
I know that this is the way that LoC and the library community do things, but I do hope that this doesn’t mean they will disappear into an insular huddle for a couple of years, to re-emerge with something that is almost right yet missing some of the evolution that has gone on around them over that period.
One very relevant example of the success of applying open thinking and approaches to the bibliographic world using Linked Data is the open publishing of the British National Bibliography (BnB). Readers of this blog will know that we at Talis have worked closely with the team at the BL in their ground-breaking work. The data model they produced is an example of one of those things that may induce that feeling of vertigo I mentioned. It doesn’t look much like a MARC record! I can assure the sceptical that, although it may be very different from what you are used to, it is easy to get your head around. (Drop us a line if you want some guidance.)
As Talis host the BnB Linked Data for the BL, I can testify to the success of this work – only launched in mid-July, its use is growing rapidly, receiving just short of 2 million hits in the last month alone.
With the British Library, along with the National Libraries of Canada and Germany, being quoted as partners with the LoC in this initiative, plus their work being referenced as an exemplar in the other reports I mention, I hold out a great hope that things are headed in the right direction.
As comments to some of my previous posts attest, there is concern from some in the community of domain experts that this RDF stuff is too simple and light-weight and will not enable them to capture the rich detail that they need. They are missing a couple of points. Firstly, it is this simplicity that will help non-domain experts to understand, reference, and link to their rich resources. Secondly, RDF is more than capable of describing the rich detail that they require – using several emerging ontologies including the RDA ontology, FRBR, etc. Finally and most importantly, it is not a binary choice between widely comprehended simplicity and domain-specific detailed description. The RDF for a resource can, and probably should, contain both.
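That coexistence of simple and rich description can be sketched in a handful of triples, written here as plain Python tuples. The Dublin Core and Schema.org namespaces are real; the RDA-style property name and the resource URI are illustrative assumptions, used only to make the point.

```python
# Simple, widely-understood statements and rich domain-specific ones can
# live side by side in the same RDF description of one resource.
DC = "http://purl.org/dc/terms/"
SCHEMA = "http://schema.org/"
RDA = "http://rdaregistry.info/Elements/u/"  # RDA-style namespace, illustrative use

book = "http://example.org/resource/book/1"  # hypothetical resource URI

triples = [
    # The simple layer: enough for a non-specialist or a search engine.
    (book, SCHEMA + "name", "A Tale of Two Cities"),
    (book, DC + "creator", "Charles Dickens"),
    # The rich layer: domain detail; the property name here is assumed.
    (book, RDA + "extentOfText", "x, 304 pages"),
]

for s, p, o in triples:
    print(s, p, repr(o))
```

A consumer that only understands the simple layer can ignore the rich one, and vice versa; nothing is lost by publishing both about the same subject URI.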
So Library of Congress, I welcome your announcement and offer a friendly reminder that you not only need to draw expertise from the forward-thinking library community, but also from the wider Linked Data world. I am sure your partners from the British Library will reinforce this message.
This post was also published on the Talis Consulting Blog
As is often the way, events have conspired to prevent me from producing this third and final part of the How & Why of Local Government Spending Data series as soon as I wanted. So my apologies to those eagerly awaiting this latest instalment.
To quickly recap, in Part 1 I addressed issues around why pick on spending data as a start point for Linked Data in Local Government, and indeed why go for Linked Data at all. In Part 2, I used some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area, as examples to demonstrate how you can publish spending data as Linked Data, for both human and programmatic consumption.
I am presuming that you are still with me on my basic assumptions – “…publishing this [local government spending] data is a good thing” and “Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing” – and that the technique of using URIs to name things in a globally unique way (which also provides a link to more information) is not giving you mental indigestion. So I now want to move on to some of the issues causing debate in the community, which come under the headings of ontologies and identifiers.
An ontology, according to Wikipedia, is a formal representation of knowledge as a set of concepts within a domain – an ontology provides a shared vocabulary which can be used to model a domain: that is, the types of objects and/or concepts that exist, and their properties and relations. So in our quest to publish spending data, what ontology should we use? The Payments Ontology, with the accompanying guide to its application, is what is needed. Using it, it becomes possible to describe individual payments, or expenditure lines, and the relationships between the authority (payment:payer), the supplier (payment:payee), the category (payment:expenditureCategory), etc. The next question is how you identify the things that you are relating together using this ontology.
Let’s take this one step at a time:
Give the expenditure line, or individual payment, an identifier, possibly generated by our accounts system, e.g. 8605670.
Make that identifier unique to our local authority by prefixing it with our internet domain name, e.g. http://spending.lichfielddc.gov.uk/spend/8605670 – note the ‘http://’ prefix. This enables anyone wanting detail about this item to follow the link to our site to get the information.
Associate a payer with the payment with an RDF statement (or triple) using the Payments Ontology, taking http://spending.lichfielddc.gov.uk/spend/8605670 as the subject and payment:payer as the predicate.
Note I am using an identifier for the payer that is published by statistics.data.gov.uk. That is so that everyone else will unambiguously understand which authority is the one responsible for the payment.
Follow the same approach for associating the payee with http://spending.lichfielddc.gov.uk/spend/8605670, this time using payment:payee.
And then repeat the process for categorisation, payment value etc.
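Taken together, the steps above yield a small graph, sketched here as plain subject–predicate–object tuples. The payment and payer URIs come from the examples in this series; the Payments Ontology namespace, the payee URI, and the category URI follow patterns seen on the sites mentioned but are assumptions on my part.

```python
# The steps above, assembled into subject-predicate-object tuples.
# Namespace and the payee/category URIs are assumed, not authoritative.
PAYMENT = "http://reference.data.gov.uk/def/payment#"  # assumed namespace

spend = "http://spending.lichfielddc.gov.uk/spend/8605670"
payer = "http://statistics.data.gov.uk/id/local-authority/41UD"
payee = "http://spending.lichfielddc.gov.uk/supplier/bristow-sutor"
category = "http://spending.lichfielddc.gov.uk/category/bailiff-fees"  # assumed

triples = [
    (spend, PAYMENT + "payer", payer),                 # step: associate payer
    (spend, PAYMENT + "payee", payee),                 # step: associate payee
    (spend, PAYMENT + "expenditureCategory", category),  # step: categorise
]

for s, p, o in triples:
    print(s, p, o)
```

In practice this conversion would be a script run over the accounts export, emitting one such cluster of triples per expenditure line.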
This immediately throws up a couple of questions, such as: why use a locally defined identifier for the payee – surely there is an identifier I can use that others will recognise, such as a company or VAT number! There are, but as of the moment there are no established sets of URI identifiers for these. OpenCorporates.com are doing some excellent work in this area, but Companies House, the logical choice for publishing such identifiers, have yet to do so. Pragmatically it is probably a good idea to have a local identifier anyway and then associate it with another publicly recognised identifier:
http://spending.lichfielddc.gov.uk/supplier/bristow-sutor owl:sameAs http://opencorporates.com/companies/uk/01431688 .
Because this is all very new and still emerging, we now find ourselves in a bit of a chicken-or-egg situation. I presume that most authorities have not built a mini spending website, like Lichfield District Council has, to serve up details when someone follows a link like this: http://spending.lichfielddc.gov.uk/spend/8605670
You could still use such an identifier under your authority domain, and plan to back it up later with a web service that provides more information. Or you could let someone else who takes a copy of your raw data do it for you, as OpenlyLocal might: http://openlylocal.com/financial_transactions/135/2010/33854 or maybe as the project we are working on with LGID might: http://id.spending.esd.org.uk/Payment/36UF/ds00024616. In the open, flexible world of Linked Data it doesn’t matter too much which domain an identifier is published from, or for that matter how many [related] identifiers are used for the same thing.
It does matter, however, for those looking to the identifying URI for some idea of authority. As I say above, technically it doesn’t matter whose domain the identifier comes from, but I believe it would be better overall if it came from the authority whose payment it is identifying. Which puts us back in the chicken-or-egg situation as to resolving the URI to serve up more information. The joy of Linked Data is that, provided aggregators consider the possibility of being able to identify source authorities’ data accurately when they encode it, it should be possible to retrofit links between URIs automatically at a later date.
In summary, over this series of posts we have seen a technology which, although it has obvious benefits, is still early on the development curve, being applied to a process which is also new and scary for many. An ideal breeding ground for cries of pain and assertions of ‘it doesn’t work’ or ‘not worth bothering’, yet with the potential to provide a powerful foundation for a future data-rich environment that is open, accessible, and beneficial to authorities, government, citizens, and UK Plc. Yes, it is worth bothering – just don’t expect benefits on day, or even month, one.
This post was also published on the Nodalities Blog
I started the previous post in this mini-series with an assumption – ..working on the assumption that publishing this [local government spending] data is a good thing. That post attracted several comments, fortunately none challenging the assumption. So learning from that experience I am going to start with another assumption in this post. Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing. Those new to this mini-series, check back to the previous post for my reasoning behind the assertion.
In this post I am going to be concentrating more on the How than the Why Bother.
To help with this I am going to use some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area as examples. Take a look at the spending data part of their site: spending.lichfielddc.gov.uk. On the surface, navigating your way around the site looking at council spend by type, subject, month, and supplier is the kind of experience a user would expect. Great for a website displaying information about a single council.
However, it is more than a web site. Inspection of the Download data tab shows that you can get your hands on the source data in csv format. Here is one line, representing a line of expenditure, from that data:
"http://statistics.data.gov.uk/id/local-authority/41UD","Lichfield District Council","2010-04-06","7747","http://spending.lichfielddc.gov.uk/spend/8605670","120.00","BRISTOW & SUTOR","401","Revenue Collection","Supplies & Services","Bailiff Fees",""
In the context of CSV, that is all these URIs are: identifiers. However, because they are http URIs you can click through to the address to get more information. If you do that with your web browser you get a human-readable representation of the data. The site also provides access to the same data, formatted in RDF, for use by developers.
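A consumer can pull those identifiers out of the download with a few lines of standard-library Python; the field order below matches the sample expenditure line quoted above.

```python
import csv
import io

# The sample expenditure line from Lichfield's CSV download, as published.
row_text = ('"http://statistics.data.gov.uk/id/local-authority/41UD",'
            '"Lichfield District Council","2010-04-06","7747",'
            '"http://spending.lichfielddc.gov.uk/spend/8605670","120.00",'
            '"BRISTOW & SUTOR","401","Revenue Collection",'
            '"Supplies & Services","Bailiff Fees",""')

row = next(csv.reader(io.StringIO(row_text)))
authority_uri, authority, date, invoice, spend_uri, amount = row[:6]

# The http URIs double as identifiers and as links to more information.
print(authority, date, amount)  # Lichfield District Council 2010-04-06 120.00
print(spend_uri)
```

The same loop, pointed at the whole file, is all an aggregator needs to harvest every payment URI an authority publishes.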
The eagle-eyed, inspecting the RDF-XML for Lichfield payment number 8605670, will have noticed a couple of things. Firstly, a liberal sprinkling of elements with names like payment:expenditureCategory or payment:payment. These come from the Payments Ontology as published on data.gov.uk as the recommended way of encoding spending, and other payment associated data, in RDF.
Secondly, you may have spotted that there is no date, or supplier name or identifier. That is because those pieces of information are attributes associated with a payment – invoice number 7747 in this case.
Zooming out from the data for a moment, and looking at the human readable form, you will see that most things, like spend type, invoice number, supplier name, are clickable links, which take you through to relevant information about those things – address details & payments for a supplier, all payments for a category etc. This intuitive natural navigation style often comes as a positive consequence of thinking about data as a set of linked resources instead of the traditional rows & columns that we are used to. Another great example of this effect can be found on a site such as the BBC Wildlife Finder. That is not to say that you could not have created such a site without even considering Linked Data, of course you could. However, data modelled as a set of linked resources almost self-describes the ideal navigation paths for a user interface to display it to a human.
The Linked Data practice of modelling data, such as spending data, as a set of linked resources and identifying those resources with URIs [which if looked up will provide information about that resource] is equally applicable to those outside of an individual authority. By being able to consume that data, whilst understanding the relationships within it and having confidence in the authority and persistence of the identifiers within it, a developer can approach the task of aggregating, comparing, and using that data in their applications more easily.
So, how do I (as a local authority) get my data from its raw flat CSV format into RDF with suitable URIs, and produce a site like Lichfield’s? The simple answer is that you may not have to – others may help you do some, if not all, of it. With help from organisations such as esd-toolkit, OpenlyLocal, and SpotlightOnSpend, and with projects such as the xSpend project we are working on with LGID, many of the conversion [from CSV], data formatting, and aggregation processes are being addressed – maybe not as quickly or completely as we would like, but they are. As to a human-readable web view of your data, you may be able to copy Stuart by taking up the offer of a free Talis Platform store and then running your own web server with his code, which he hopes to share as open source. Alternatively it might be worth waiting for others to aggregate your data and provide a way for your citizens to view it.
As easy as that then! – Well not quite, there are some issues about URI naming and creation, and how you bring the data together that still do need addressing by those engaged in this. But that is for Part 3….
This post was also published on the Nodalities Blog
National Government instructing the 300+ UK Local Authorities to publish “New items of local government spending over £500 to be published on a council-by-council basis from January 2011” has had the proponents of both open, and closed, data excited over the last few months. For this mini series of posts I am working on the assumption that publishing this data is a good thing, because I want to move on and assert that [when publishing] one format/method to make this data available should be Linked Data.
This immediately brings me to the Why Bother? bit. This itself breaks into two connected questions – why bother publishing any local authority data as Linked Data? And why bother using the unexciting, simplistic spending data as a place to start?
I believe that spending data is a great place to start, both for publishing local government data and for making such data linked. Someone at national level was quite astute in choosing spending as a starting point. To comply with the instruction, all an authority has to do is produce a file containing five basic elements for each payment transaction: an id, a date, a category, a payee, and an amount. At a very basic level it is very easy to measure whether an authority has done that or not.
Date – should ideally be the payment date as recorded in the purchase or general ledger.
Id – to identify the payment within the authority’s system, for future reference.
Amount – in Sterling, as recorded in the finance system.
Payee – name and individual authority id for the supplier plus, where possible, Companies House, Charity Registration, or other recognised identifier.
Department – the part of the authority that spent the amount.
Category – depending on the accounts system this may be easy or quite difficult. There are two candidates for categorisation – CIPFA’s BVACOP classification and the Proclass procurement classification system.
… a little more onerous, possibly around the areas of identifying company numbers and Service Categorization, but not much room for discussion/interpretation.
As to the file formats to publish data, the same advice mandates: The files are to be published in CSV file format – supplemented by – Authorities may wish to publish the data in additional formats as well as the CSV files (e.g. linked data, XML, or PDFs for casual browsers). There is no reason why they should not do this, but this is not a substitute for the CSV files.
OK, so why bother with applying Linked Data techniques to this [boring] spending data? Well, precisely because it is simple data it is comparatively easy to do, and because everybody is publishing this data the benefits of linking should soon become apparent. Linked Data is all about identifying things and concepts, giving them globally addressable identifiers (URIs), and then describing the relationships between them.
For those new to Linked Data, the use of URIs as identifiers often causes confusion. A URI, such as http://statistics.data.gov.uk/id/local-authority-district/00CN, is a string of characters that is as much an identifier as the payroll number on your pay-check, or a barcode on a can of beans. It has a couple of attributes that make it different from traditional identifiers. Firstly, the first part of it is created from the Internet domain name of the organisation that publishes the identifier. This means that it can be globally unique: theoretically you could have the same payroll number as the barcode number on my can of beans – adding the domain avoids any possibility of confusion. Secondly, because the domain is prefixed by http:// it gives the publisher the ability to provide information about the thing identified, using well-established web technologies. In this particular example, http://statistics.data.gov.uk/id/local-authority-district/00CN is the identifier for Birmingham City Council; if you click on it [using it as an internet address] data.gov.uk will supply you with information about it – name, location, type of authority, etc.
Following this approach, creating URI identifiers for suppliers, categories, and individual payments and defining the relationships between them using the Payments Ontology (more on this when I come on to the How) leads to a Linked Data representation of the data. In technical terms a comparatively easy step using scripts etc.
By publishing Linked Spending Data and loading it into a Linked Data store, as Lichfield DC have done, it becomes possible to query it – to identify things like all payments for a supplier, or suppliers for a category, etc.
If you then load data for several authorities into an aggregate store, as we are doing in partnership with LGID, those queries can identify patterns or comparisons across authorities. Which brings me to…
Why bother publishing any local authority data as Linked Data? Publishing as Linked Data enables an authority’s data to be meshed with data from other authorities and other sources such as national government. For example, the data held at statistics.data.gov.uk includes which county an authority is located within. By using that data as part of a query, it would for instance be possible to identify the total spend, by category, for all authorities in a county such as the West Midlands.
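Against an aggregate store that query would be SPARQL, but its shape can be sketched as a group-by over (authority, category, amount) facts. Every value below is invented purely to illustrate the cross-authority roll-up.

```python
from collections import defaultdict

# Hypothetical aggregated spending facts: (authority URI, category, amount).
# The amounts and category names are made up for illustration only.
facts = [
    ("http://statistics.data.gov.uk/id/local-authority/00CN", "Bailiff Fees", 120.00),
    ("http://statistics.data.gov.uk/id/local-authority/41UD", "Bailiff Fees", 120.00),
    ("http://statistics.data.gov.uk/id/local-authority/41UD", "Legal Fees", 550.00),
]

# Total spend per category across every authority in the store --
# the roll-up a SPARQL GROUP BY would perform over the real data.
totals = defaultdict(float)
for _authority, category, amount in facts:
    totals[category] += amount

print(dict(totals))  # {'Bailiff Fees': 240.0, 'Legal Fees': 550.0}
```

Restricting the facts first to authorities within one county (using the containment data from statistics.data.gov.uk) gives exactly the West Midlands style of comparison described above.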
As more authority data sets are published sharing the same identifiers for authority, category, etc., they will naturally link together, enabling natural navigation of the information between council departments, services, costs, suppliers, etc. Once this step has been taken and the dust settles a bit, this foundation of linked data should become an open data platform for innovative development and the publishing of other data that will link in with this basic but important financial data.
There are however some more technical issues, URI naming, aggregation, etc., to be overcome or at least addressed in the short term to get us to that foundation. I will cover these in part 2 of this series.
This post was also published on the Nodalities Blog