The Culture Grid closed to ‘new accessions’ (i.e. new collections of metadata) on the 30th April
The existing index and API will continue to operate in order to ensure legacy support
Museums, galleries, libraries and archives wishing to contribute material to Europeana can still do so via the ‘dark aggregator’, which the Collections Trust will continue to fund
Interested parties are invited to investigate using the Europeana Connection Kit to automate the batch-submission of records into Europeana
The reasons Nick gave for ending this aggregation service are enlightening for all engaged with, or thinking about, data aggregation in the library, museum, and archives sectors.
Throughout its history, the Culture Grid has been tough going. Looking back over the past 7 years, I think there are 3 primary and connected reasons for this:
The value proposition for aggregation doesn’t stack up in terms that appeal to museums, libraries and archives. The investment of time and effort required to participate in platforms like the Culture Grid isn’t matched by an equal return on that investment in terms of profile, audience, visits or political benefit. Why would you spend 4 days tidying up your collections information so that you can give it to someone else to put on their website? Where’s the kudos, increased visitor numbers or financial return?
Museum data (and to a lesser extent library and archive data) is non-standard, largely unstructured and dependent on complex relations. In the 7 years of running the Culture Grid, we have yet to find a single museum whose data conforms to its own published standard, with the result that every single data source has required a minimum of 3-5 days and frequently much longer to prepare for aggregation. This has been particularly salutary in that it comes after 17 years of the SPECTRUM standard providing, in theory at least, a rich common data standard for museums;
Metadata is incidental. After many years of pump-priming applications which seek to make use of museum metadata it is increasingly clear that metadata is the salt and pepper on the table, not the main meal. It serves a variety of use cases, but none of them is ‘proper’ as a cultural experience in its own right. The most ‘real’ value proposition for metadata is in powering additional services like related search & context-rich browsing.
The first two of these issues represent a fundamental challenge for anyone aiming to promote aggregation. Countering them requires a huge upfront investment in user support and promotion, quality control, training and standards development.
The 3rd is the killer though – countering these investment challenges would be possible if doing so were to lead directly to rich end-user experiences. But it doesn’t. Instead, you have to spend a huge amount of time, effort and money to deliver something which the vast majority of users essentially regard as background texture.
As an old friend of mine would depressingly say – Makes you feel like packing up your tent and going home!
Interestingly, earlier in the post Nick gives us an insight into the purpose of the Culture Grid:
… we created the Culture Grid with the aim of opening up digital collections for discovery and use …
That basic purpose is still very valid for both physical and digital collections of all types. The what [helping people find, discover, view and use cultural resources] is as valid as it has ever been. It is the how [aggregating metadata and building shared discovery interfaces and landing pages for it] that has been too difficult to justify continuing in Culture Grid’s case.
In my recent presentations to library audiences I have been asking a simple question: “Why do we catalogue?” Sometimes immediately, sometimes after some embarrassed shuffling of feet, I inevitably get the answer “So we can find stuff!”. In libraries, archives, and museums, helping people find the stuff we have is core to what we do – all the other things we do are a little pointless if people can’t find, or even be aware of, what we have.
If you are hoping your resources will be found they have to be referenced where people are looking. Where are they looking?
It is exceedingly likely they are not looking in your aggregated discovery interface, or in your local library, archive or museum interface either. Take a look at this chart detailing the discovery starting point for college students and others. Starting in a search engine is up in the high eighty percents, with things like library web sites and other targeted sources only just making it over the 1% hurdle to get on the chart. We have known about this for some time – the chart comes from an OCLC report, ‘College Students’ Perceptions of Libraries and Information Resources’, published in 2005. I would love to see a similar report from recent times; it would have to include elements such as Siri, Cortana, and other discovery tools built in to our mobile devices, which of course are powered by the search engines. It makes me wonder how few cultural heritage specific sources would actually make that 1% cut today.
Our potential users are in the search engines in one way or another; however, in the vast majority of cases our [cultural heritage] resources are not there for them to discover.
Culture Grid, I would suggest, is probably not the only organisation with an ‘aggregate for discovery’ reason for its existence that may be struggling to stay relevant, or even in existence.
You may well ask about OCLC, with its iconic WorldCat.org discovery interface. It is a bit simplistic to say that its 320 million plus bibliographic records are in WorldCat only for people to search and discover through the worldcat.org user interface. Those records also underpin many of the services – cooperative cataloguing, record supply, inter-library loan, general library back-office tasks, etc. – that OCLC members and partners benefit from. Also, for many years WorldCat has been at the heart of syndication partnerships supplying data to prominent organisations, including Google, that help them reference resources within WorldCat.org, which in turn, via find-in-a-library capability, lead to clicks onwards to individual libraries. [Declaration: OCLC is the company name on my current salary check] Nevertheless, even though WorldCat has a broad spectrum of objectives, it is not totally immune from the influences that are troubling the likes of Culture Grid. In fact these are among the web trends that have been driving the Linked Data and Schema.org efforts from the WorldCat team, but more of that later.
How do we get our resources visible in the search engines then? By telling the search engines what we [individual organisations] have. We do that by sharing a relevant view of our metadata about our resources, not necessarily all of it, in a form that the search engines can easily consume. Basically this means sharing data embedded in your web pages, marked up using the Schema.org vocabulary. To see how this works, we need look no further than the rest of the web – commerce, news, entertainment, etc. There are already millions of organisations, measured by domains, that share structured data in their web pages using the Schema.org vocabulary with the search engines. This data is being used to direct users with more confidence directly to a site, and is contributing to the global web of data.
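To make this concrete, here is a minimal sketch (Python, standard library only) of the kind of Schema.org description that could be embedded in a collection item’s page. The item, names, and URL are invented for illustration; real pages would describe real objects, and the markup could equally be hand-authored or template-generated.

```python
import json

# A minimal Schema.org description of a hypothetical museum object.
# All identifiers and values here are invented for illustration.
item = {
    "@context": "https://schema.org",
    "@type": "CreativeWork",
    "name": "Stoneware jug, c. 1780",
    "url": "https://example-museum.org/collection/1234",
    "creator": {"@type": "Organization", "name": "Example Museum"},
    "dateCreated": "1780",
}

# Serialise as a JSON-LD block ready to embed in the item's web page.
json_ld = json.dumps(item, indent=2)
print(json_ld)
```

The resulting JSON would sit in the page inside a `<script type="application/ld+json">` element, where search engine crawlers can pick it up alongside the human-readable content.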
There used to be a time when people in the commercial world complained of always ending up being directed to shopping [aggregation] sites instead of directly to where they could buy the TV or washing machine they were looking for. Today you are far more likely to be given some options in the search engine that link you directly to the retailer. I believe this is symptomatic of the disintermediation of the aggregators by individual syndication of metadata from those retailers.
Can these lessons be carried through to the cultural heritage sector? Of course they can. This is where there might be a bit of light at the end of the tunnel for those behind aggregations such as Culture Grid. Not for continuing as an aggregation/discovery site, but as a facilitator for the individual contributors. This stuff, when you first get into it, is not simple, and many organisations do not have the time and resources to understand how to share Schema.org data about their resources with the web. The technology itself is comparatively simple, in web terms; it is the transition and implementation that many may need help with.
Schema.org is not the perfect solution to describing resources; it is not designed to be. It is there to describe them sufficiently to be found on the web. Nevertheless it is also being evolved by community groups to enhance its capabilities. Through my work with the Schema Bib Extend W3C Community Group, enhancements to Schema.org to enable better description of bibliographic resources have been successfully proposed and adopted. This work is continuing towards a bibliographic extension – bib.schema.org. There is obvious potential for other communities to help evolve and extend Schema to better represent their particular resources – archives for example. I would be happy to talk with others who want insights into how they may do this for their benefit.
Schema.org is not a replacement for our rich common data standards such as MARC for libraries, and SPECTRUM for museums as Nick describes. Those serve purposes beyond sharing information with the wider world, and should continue to be used for those purposes whilst relevant. However we cannot expect the rest of the world to get its head around our internal vocabularies and formats in order to point people at our resources. It needs to be a compromise. We can continue to use what is relevant in our own sectors whilst sharing Schema.org data so that our resources can be discovered and then explored further.
So, to return to the question I posed – Is There Still a Case for Cultural Heritage Data Aggregation? If the aggregation is purely for the purpose of supporting discovery, I think the answer is a simple no. If it has a broader purpose, as WorldCat does, it is not as clear cut.
I do believe nevertheless that many of the people behind the aggregations are in the ideal place to help facilitate the eventual goal of making cultural heritage resources easily discoverable, with some creative thinking, the adoption of ‘web’ techniques, technologies and approaches to provide facilitation services, and a review of what their real goals are [which may not include running a search interface]. I believe we are moving into an era where shared authoritative sources of easily consumable data could make our resources more visible than we previously could have hoped.
Are there any black clouds on this hopeful horizon? Yes, there is one, in the shape of traditional cultural heritage technology conservatism. The tendency to assume that our vocabulary or ontology is the only way to describe our resources, coupled with a reticence to be seen to engage with the commercial discovery world, could still hold back the potential.
If you are an individual library, archive, or museum scratching your head about how to get your resources visible in Google, and you lack the in-house ability to react, try talking within the communities around and behind the aggregation services you already know. They should all be learning, and a problem shared is more easily solved. None of this is rocket science, but trying something new is often better as a group.
Paul is a prolific podcaster, but had yet to venture into the world of the video conversation. This conversation was therefore a bit of an experiment. Take a look below and see what you think. For those who prefer audio only, Paul has helpfully included an mp3 for you to listen to. At the end of this post you will also find a link to a short survey he has posted to ascertain how successful this format is.
Here are a few comments about the process from Paul and a link to the survey:
It’s perhaps unfair to draw too many conclusions from this first attempt, but a few things are immediately apparent. The whole process takes an awful lot longer. The files are larger, so processing and uploading times increase 2-3 fold. Uploading a separate audio file also takes a bit of time. Simply dumping the Skype recording into iMovie worked just fine… but I’ve (so far) not managed to find any way to balance the audio levels. Garageband lets me do this with my audio-only podcasts, but iMovie doesn’t seem to, so Richard’s side of the conversation comes across as quite a bit louder than mine.
In amongst the useful pointers to news, comment, and documents, I have recently been conscious of an increasing flow of tweets like these:
This is good news: more and more city, local, and national governments and public bodies are releasing data as open data. Of course the reference to open here is in relation to the licensing of these data, but how open in access are they? It is not that easy to find out.
To be truly open and broadly useful, data has to be both licensed openly, with few or no use constraints, and have as few technical barriers to consuming it as possible. In many cases there will be enough enthusiasts for a particular source with the motivation to take data in whatever form, and pick their way through it to get the value they need. These enthusiasts provide great blogging fodder and examples for presentations, but they do not represent the significant value that should, and is predicted to, flow from the open data and transparency agenda spreading through governments across the globe.
The five star data rating scheme, from Sir Tim Berners-Lee, is a simple way to describe the problem and encourage publishers to strive to achieve a 5 star Linked Open Data rating, whilst not discouraging openly publishing in any form in the first place. Check out my earlier post What Is Your Data’s Star Rating(s)?, where I dig into both types of openness a bit further.
Policy makers and data openness enthusiasts who are behind this burgeoning flood of announcements [as a broad generality] get the licensing issues – use CC0 or copy the UK’s OGL. However, what concerns me is that they tend to shy away from promoting the removal of the technical barriers that could stifle the broad adoption, and consequential flow of economic benefit, that they predict.
In a few years we could look back on this as a time of missed opportunity and say it was obvious that the initiatives would fail because we didn’t make it easy for those who could have delivered the value. We let the flood of enthusiastic initiatives wash past us without grabbing the opportunities to establish easy, consistent and repeatable ways to release and build upon the value in data for all, not just an enthusiastic few. We need to get this right if open data is going to fuel the next revolution.
Some are thinking in the same way. CKAN, for instance, has delivered an extension to calculate the [technical] openness of datasets as listed on the Dataset Openness page of the Data Hub. A great idea, but I would suggest that most data publishers will never find their way to such a listing. Where are the stars on the individual dataset pages? Where are the star-rating badges of approval that publishers can put on their sites to show off?
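The kind of calculation such an extension performs can be sketched very simply. The format-to-stars mapping below follows the widely published five-star scheme; the function itself is an invented illustration, not CKAN’s actual code.

```python
# Hypothetical sketch of a 5-star technical-openness score for a dataset,
# loosely following Tim Berners-Lee's scheme. Invented for illustration;
# not CKAN's real implementation.

def star_rating(fmt: str, links_to_other_data: bool = False) -> int:
    """Return a 1-5 star technical-openness rating for a published format."""
    fmt = fmt.lower()
    if fmt in {"xls", "xlsx"}:             # structured, but proprietary format
        return 2
    if fmt in {"csv", "xml", "json"}:      # structured, non-proprietary format
        return 3
    if fmt in {"rdf", "ttl", "json-ld"}:   # things identified with URIs
        return 5 if links_to_other_data else 4
    return 1                               # openly published, but opaque (e.g. PDF)

print(star_rating("pdf"))                                # 1
print(star_rating("csv"))                                # 3
print(star_rating("ttl", links_to_other_data=True))      # 5
```

A badge generator for publishers’ own sites would only need to wrap a score like this in an image or snippet of HTML, which is rather the point: the rating should travel with the dataset, not sit on a listing page nobody visits.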
We have made great strides so far in promoting the opening of public and other sector information; the ePSIplatform stream is testament to that. Somehow we need to capitalise on this great start and better market the benefits of technically opening up your data. A 5 Star badge of approval, anyone?
As is often the way, events have conspired to prevent me from producing this third and final part in this How & Why of Local Government Spending Data series as soon as I wanted. So my apologies to those eagerly awaiting this latest instalment.
To quickly recap: in Part 1 I addressed issues around why pick on spending data as a start point for Linked Data in Local Government, and indeed why go for Linked Data at all. In Part 2, I used some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area as examples to demonstrate how you can publish spending data as Linked Data, for both human and programmatic consumption.
I am presuming that you are still with me on my basic assumptions “…publishing this [local government spending] data is a good thing” and “Publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing”, plus that the technique of using URIs to name things in a globally unique way (that also provides a link to more information) is not giving you mental indigestion. So I now want to move on to some of the issues that are causing debate in the community, which come under the headings of ontologies and identifiers.
An ontology, according to Wikipedia, is a formal representation of knowledge as a set of concepts within a domain – an ontology provides a shared vocabulary which can be used to model a domain – that is, the types of objects and/or concepts that exist, and their properties and relations. So in our quest to publish spending data, what ontology should we use? The Payments Ontology, with the accompanying guide to its application, is what is needed. Using it, it becomes possible to describe individual payments, or expenditure lines, and the relationships between the authority (payment:payer), the supplier (payment:payee), the category (payment:expenditureCategory), etc. The next question is how you identify the things that you are relating together using this ontology.
Let’s take this one step at a time:
Give the expenditure line, or individual payment, an identifier, possibly generated by our accounts system, e.g. 8605670.
Make that identifier unique to our local authority by prefixing it with our internet domain name, e.g. http://spending.lichfielddc.gov.uk/spend/8605670 – note the prefix of ‘http://’. This enables anyone wanting detail about this item to follow the link to our site to get the information.
Associate a payer with the payment with an RDF statement (or triple) using the Payments Ontology: http://spending.lichfielddc.gov.uk/spend/8605670 payment:payer http://statistics.data.gov.uk/id/local-authority/41UD
Note I am using an identifier for the payer that is published by statistics.data.gov.uk. That is so that everyone else will unambiguously understand which authority is the one responsible for the payment.
Follow the same approach for associating the payee: http://spending.lichfielddc.gov.uk/spend/8605670 payment:payee http://spending.lichfielddc.gov.uk/supplier/bristow-sutor
And then repeat the process for categorisation, payment value etc.
This immediately throws up a couple of questions, such as: why use a locally defined identifier for the payee – surely there is an identifier I can use that others will recognise, such as a company or VAT number? There are, but as of the moment there are no established sets of URI identifiers for these. OpenCorporates.com is doing some excellent work in this area, but Companies House, the logical choice for publishing such identifiers, has yet to do so. Pragmatically it is probably a good idea to have a local identifier anyway and then associate it with another publicly recognised identifier: http://spending.lichfielddc.gov.uk/supplier/bristow-sutor
owl:sameAs http://opencorporates.com/companies/uk/01431688 .
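The steps above can be sketched in a few lines of code. This is a toy illustration that emits Turtle-style statements with plain string formatting, reusing the Lichfield identifiers quoted in this post; a real pipeline would use a proper RDF library, and the payment: and owl: prefixes are assumed to be declared elsewhere.

```python
# Sketch of the steps above: mint a URI for the expenditure line, then
# relate it to payer and payee using Payments Ontology properties,
# and link the local supplier identifier to a public one.

domain = "http://spending.lichfielddc.gov.uk"
payment_uri = f"<{domain}/spend/8605670>"                   # steps 1 and 2
payer_uri = "<http://statistics.data.gov.uk/id/local-authority/41UD>"
payee_uri = f"<{domain}/supplier/bristow-sutor>"

triples = [
    (payment_uri, "payment:payer", payer_uri),               # step 3
    (payment_uri, "payment:payee", payee_uri),               # step 4
    (payee_uri, "owl:sameAs",
     "<http://opencorporates.com/companies/uk/01431688>"),   # public identifier
]

# Emit Turtle-style statements.
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Repeating the same pattern for category, payment value, and date is then a matter of adding further triples from the same subject URI.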
Because this is all very new and still emerging, we now find ourselves in a bit of a chicken-or-egg situation. I presume that most authorities have not built a mini spending website, as Lichfield District Council has, to serve up details when someone follows a link like this: http://spending.lichfielddc.gov.uk/spend/8605670
You could still use such an identifier under your authority domain, and plan to back it up later with a web service that provides more information. Or you could let someone else, who takes a copy of your raw data, do it for you, as OpenlyLocal might: http://openlylocal.com/financial_transactions/135/2010/33854, or maybe as the project we are working on with LGID might: http://id.spending.esd.org.uk/Payment/36UF/ds00024616. In the open, flexible world of Linked Data it doesn’t matter too much which domain an identifier is published from, or for that matter how many [related] identifiers are used for the same thing.
It does matter, however, for those looking to the identifying URI for some idea of authority. As I say above, technically it doesn’t matter whose domain the identifier comes from, but I believe it would be better overall if it came from the authority whose payment it is identifying. Which puts us back in the chicken-or-egg situation as to resolving the URI to serve up more information. The joy of Linked Data is that, provided aggregators consider the possibility of being able to identify source authorities’ data accurately when they encode it, it should be possible to automatically retrofit links between URIs at a later date.
In summary, over this series of posts we have seen a technology which, although it has obvious benefits, is still early on the development curve, being applied to a process which is also new and scary for many. An ideal breeding ground for cries of pain and assertions of ‘it doesn’t work’ or ‘it’s not worth bothering’, yet with the potential to provide a powerful foundation for a future data-rich environment that is open, accessible, and beneficial to authorities, government, citizens, and UK Plc. Yes, it is worth bothering; just don’t expect benefits on day, or even month, one.
This post was also published on the Nodalities Blog
I started the previous post in this mini-series with an assumption – “…working on the assumption that publishing this [local government spending] data is a good thing”. That post attracted several comments, fortunately none challenging the assumption. So, learning from that experience, I am going to start with another assumption in this post: publishing Local Authority data, such as local spending data, as ‘Linked Data’ is also a good thing. Those new to this mini-series should check back to the previous post for my reasoning behind the assertion.
In this post I am going to be concentrating more on the How than the Why Bother.
To help with this I am going to use some of the excellent work that Stuart Harrison at Lichfield District Council has done in this area as examples. Take a look at the spending data part of their site: spending.lichfielddc.gov.uk/. On the surface, navigating your way around the site looking at council spend by type, subject, month, and supplier is the kind of experience a user would expect. Great for a website displaying information about a single council.
However, it is more than a web site. Inspection of the Download data tab shows that you can get your hands on the source data in CSV format. Here is one line, representing a line of expenditure, from that data:
"http://statistics.data.gov.uk/id/local-authority/41UD","Lichfield District Council","2010-04-06","7747","http://spending.lichfielddc.gov.uk/spend/8605670","120.00","BRISTOW & SUTOR","401","Revenue Collection","Supplies & Services","Bailiff Fees",""
In the context of CSV, that’s all these URIs are: identifiers. However, because they are http URIs you can click through to the address to get more information. If you do that with your web browser you get a human-readable representation of the data. These sites also provide access to the same data, formatted in RDF, for use by developers.
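For anyone wanting to pick such a line apart programmatically before thinking about RDF, Python’s standard csv module handles the quoting directly. The field positions below are simply read off the sample line quoted above.

```python
import csv
import io

# The sample expenditure line from Lichfield's download, as quoted above.
line = ('"http://statistics.data.gov.uk/id/local-authority/41UD",'
        '"Lichfield District Council","2010-04-06","7747",'
        '"http://spending.lichfielddc.gov.uk/spend/8605670","120.00",'
        '"BRISTOW & SUTOR","401","Revenue Collection",'
        '"Supplies & Services","Bailiff Fees",""')

# csv.reader copes with the quoting, including the ampersand in the payee name.
row = next(csv.reader(io.StringIO(line)))

authority_uri, authority, date, invoice, payment_uri, amount, payee = row[:7]
print(payee, amount)   # BRISTOW & SUTOR 120.00
```

The two URI-valued fields are what make this file more than a spreadsheet dump: they are the hooks that the RDF view, discussed next, hangs off.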
The eagle-eyed, inspecting the RDF/XML for Lichfield payment number 8605670, will have noticed a couple of things. Firstly, a liberal sprinkling of elements with names like payment:expenditureCategory or payment:payment. These come from the Payments Ontology, published on data.gov.uk as the recommended way of encoding spending, and other payment-associated data, in RDF.
Secondly, you may have spotted that there is no date, supplier name, or supplier identifier. That is because those pieces of information are attributes associated with a payment – invoice number 7747 in this case.
Zooming out from the data for a moment, and looking at the human-readable form, you will see that most things, like spend type, invoice number, and supplier name, are clickable links, which take you through to relevant information about those things – address details and payments for a supplier, all payments for a category, etc. This intuitive, natural navigation style often comes as a positive consequence of thinking about data as a set of linked resources instead of the traditional rows and columns that we are used to. Another great example of this effect can be found on a site such as the BBC Wildlife Finder. That is not to say that you could not have created such a site without even considering Linked Data; of course you could. However, data modelled as a set of linked resources almost self-describes the ideal navigation paths for a user interface to display it to a human.
The Linked Data practice of modelling data, such as spending data, as a set of linked resources and identifying those resources with URIs [which if looked up will provide information about that resource] is equally applicable to those outside of an individual authority. By being able to consume that data, whilst understanding the relationships within it and having confidence in the authority and persistence of the identifiers within it, a developer can approach the task of aggregating, comparing, and using that data in their applications more easily.
So, how do I (as a local authority) get my data from its raw flat CSV format into RDF with suitable URIs, and produce a site like Lichfield’s? The simple answer is that you may not have to – others may help you do some, if not all, of it. With help from organisations such as esd-toolkit, OpenlyLocal, and SpotlightOnSpend, and with projects such as the xSpend project we are working on with LGID, many of the conversion [from CSV], data formatting, and aggregation processes are being addressed – maybe not as quickly or completely as we would like, but they are. As to a human-readable web view of your data, you may be able to copy Stuart by taking up the offer of a free Talis Platform store and then running your own web server with the code he hopes to share as open source. Alternatively, it might be worth waiting for others to aggregate your data and provide a way for your citizens to view it.
As easy as that then! Well, not quite: there are some issues about URI naming and creation, and how you bring the data together, that still need addressing by those engaged in this. But that is for Part 3….
This post was also published on the Nodalities Blog
National Government instructing the 300+ UK Local Authorities that “New items of local government spending over £500 [are] to be published on a council-by-council basis from January 2011” has had the proponents of both open and closed data excited over the last few months. For this mini-series of posts I am working on the assumption that publishing this data is a good thing, because I want to move on and assert that [when publishing] one format/method used to make this data available should be Linked Data.
This immediately brings me to the Why Bother? bit. This itself breaks into two connected questions: why bother publishing any local authority data as Linked Data, and why bother using the unexciting, simplistic spending data as a place to start?
I believe that spending data is a great place to start, both for publishing local government data and for making such data linked. Someone at national level was quite astute in choosing spending as a starting point. To comply with the instruction, all an authority has to do is produce a file containing five basic elements for each payment transaction: an Id, a date, a category, a payee, and an amount. At a very basic level it is very easy to measure whether an authority has done that or not.
Date – Should ideally be the payment date as recorded in the purchase or general ledger
Id – To identify the payment within the authority’s system, for future reference
Amount – In Sterling, as recorded in the finance system
Payee – Name and individual authority id for the supplier, plus where possible a Companies House, Charity Registration, or other recognised identifier
Department – The part of the authority that spent the amount
Category – Depending on the accounts system this may be easy or quite difficult. There are two candidates for categorisation – CIPFA’s BVACOP classification and the Proclass procurement classification system.
… a little more onerous, possibly around the areas of identifying company numbers and Service Categorization, but not much room for discussion/interpretation.
As to the file formats in which to publish data, the same advice mandates: “The files are to be published in CSV file format” – supplemented by – “Authorities may wish to publish the data in additional formats as well as the CSV files (e.g. linked data, XML, or PDFs for casual browsers). There is no reason why they should not do this, but this is not a substitute for the CSV files.”
OK, so why bother with applying Linked Data techniques to this [boring] spending data? Well, precisely because it is simple data, it is comparatively easy to do, and because everybody is publishing this data the benefits of linking should soon become apparent. Linked Data is all about identifying things and concepts, giving them globally addressable identifiers (URIs) and then describing the relationships between them.
For those new to Linked Data, the use of URIs as identifiers often causes confusion. A URI, such as http://statistics.data.gov.uk/id/local-authority-district/00CN, is a string of characters that is as much an identifier as the payroll number on your pay-check, or a barcode on a can of beans. It has a couple of attributes that make it different from traditional identifiers. Firstly, the first part of it is created from the internet domain name of the organisation that publishes the identifier. This means that it can be globally unique. Theoretically you could have the same payroll number as the barcode number on my can of beans – adding the domain avoids any possibility of confusion. Secondly, because the domain is prefixed by http:// it gives the publisher the ability to provide information about the thing identified, using well established web technologies. In this particular example, http://statistics.data.gov.uk/id/local-authority-district/00CN is the identifier for Birmingham City Council; if you click on it [using it as an internet address] data.gov.uk will supply you with information about it – name, location, type of authority, etc.
Following this approach, creating URI identifiers for suppliers, categories, and individual payments, and defining the relationships between them using the Payments Ontology (more on this when I come on to the How), leads to a Linked Data representation of the data. In technical terms this is a comparatively easy step using scripts etc.
By publishing Linked Spending Data and loading it into a Linked Data store, as Lichfield DC have done, it becomes possible to query it, to identify things like all payments for a supplier, or suppliers for a category, etc.
If you then load data for several authorities into an aggregate store, as we are doing in partnership with LGID, those queries can identify patterns or comparisons across authorities. Which brings me to…
Why bother publishing any local authority data as Linked Data? Publishing as Linked Data enables an authority’s data to be meshed with data from other authorities and other sources, such as national government. For example, the data held at statistics.data.gov.uk includes which county an authority is located within. By using that data as part of a query, it would, for instance, be possible to identify the total spend, by category, for all authorities in a county such as the West Midlands.
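The shape of that cross-authority query can be illustrated without a triple store at all. In the toy sketch below the spend figures and the authority-to-county lookup are invented stand-ins; in a real system the county link would come from statistics.data.gov.uk, and the query would be SPARQL run against an aggregate store.

```python
# Toy illustration of a cross-authority aggregation. The figures are
# invented; only the shape of the query matters.

spend = [   # (authority, category, amount) - stand-ins for real triples
    ("lichfield", "Bailiff Fees", 120.00),
    ("birmingham", "Bailiff Fees", 350.00),
    ("lichfield", "Legal Fees", 80.00),
]

# Stand-in for the authority-to-county data held at statistics.data.gov.uk.
county = {"lichfield": "Staffordshire", "birmingham": "West Midlands"}

def total_by_category(county_name):
    """Total spend per category across all authorities in a county."""
    totals = {}
    for authority, category, amount in spend:
        if county.get(authority) == county_name:
            totals[category] = totals.get(category, 0.0) + amount
    return totals

print(total_by_category("Staffordshire"))
print(total_by_category("West Midlands"))
```

The point is that none of the per-authority data needs reshaping to support the county-level question; the shared identifiers do the joining.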
As more authority datasets are published, sharing the same identifiers for authority, category, etc., they will naturally link together, enabling the natural navigation of the information between council departments, services, costs, suppliers, etc. Once this step has been taken and the dust settles a bit, this foundation of linked data should become an open data platform for innovative development and the publishing of other data that will link in with this basic but important financial data.
There are, however, some more technical issues – URI naming, aggregation, etc. – to be overcome, or at least addressed, in the short term to get us to that foundation. I will cover these in Part 2 of this series.
This post was also published on the Nodalities Blog