Back in September I formed a W3C Group – Schema Bib Extend. To quote an old friend of mine “Why did you go and do that then?”
Well, as I have mentioned before Schema.org has become a bit of a success story for structured data on the web. I would have no hesitation in recommending it as a starting point for anyone, in any sector, wanting to share structured data on the web. This is what OCLC did in the initial exercise to publish the 270+ million resources in WorldCat.org as Linked Data.
The need to extend the Schema.org vocabulary became clear when using it to mark up the bibliographic resources in WorldCat. The Book type defined in Schema.org, along with other types derived from CreativeWork, contains many of the properties you need to describe bibliographic resources, but lacks some of the more detailed ones, such as holdings count and carrier type, that we wanted to represent. It was also clear that it would need more extension if we wanted to go further and define the relationships between such things as works, expressions, manifestations, and items – to talk FRBR for a moment.
The organisations behind Schema.org (Google, Bing, Yahoo, Yandex) invite proposals for extension of the vocabulary via the W3C public-vocabs mailing list. OCLC could have taken that route directly, but at best I suggest it would have only partially served the needs of the broad spread of organisations and people who could benefit from enriched description of bibliographic resources on the web.
So that is why I formed a W3C Community Group to build a consensus on extending the Schema.org vocabulary for these types of resources. I wanted to not only represent the needs, opinions, and experience of OCLC, but also the wider library sector of libraries, librarians, system suppliers and others. Any generally applicable vocabulary [most importantly recognised by the major search engines] would also provide benefit for the wider bibliographic publishing, retailing, and other interested sectors.
Four months, and four conference calls (supported by OCLC – thank you), later we are a group of 55 members with a fairly active mailing list. We are making progress towards shaping up some recommendations, having invested much time in discussing our objectives and the issues of describing detailed bibliographic information (often currently to be found in MARC, ONIX, or other industry-specific standards) in a generic web-wide vocabulary. We are not trying to build a replacement for MARC, or turn Schema.org into a standard that you could operate a library community with.
Applying Schema.org markup to your bibliographic data is aimed at announcing its presence, and the resources it describes, to the web and linking them into the web of data. I would expect to see it being applied as complementary markup to other RDF-based standards such as BIBFRAME as it emerges. Although Schema.org started with Microdata and, latterly [and increasingly] RDFa, the vocabulary is equally applicable serialised in any of the RDF formats (N-Triples, Turtle, RDF/XML, JSON-LD) for processing and data exchange purposes.
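To make that serialisation point concrete, here is a minimal Python sketch. The work URI, person URI, and title are all invented for illustration; only the schema.org property names are real. It holds a tiny Schema.org Book description as subject-predicate-object triples and emits it in N-Triples form, one of the serialisations mentioned above:

```python
# A tiny Schema.org Book description held as (subject, predicate, object)
# triples. The example.org URIs and the title are invented for illustration.
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = [
    ("http://example.org/book/1", RDF_TYPE, "<" + SCHEMA + "Book>"),
    ("http://example.org/book/1", SCHEMA + "name", '"An Example Book"'),
    ("http://example.org/book/1", SCHEMA + "author", "<http://example.org/person/1>"),
]

def to_ntriples(triples):
    """Serialise triples as N-Triples: one '<s> <p> o .' statement per line."""
    return "\n".join(f"<{s}> <{p}> {o} ." for s, p, o in triples)

print(to_ntriples(triples))
```

Exactly the same three statements could be carried as RDFa attributes in a web page, or as Turtle or JSON-LD in a data exchange; the vocabulary does not change, only the wrapping.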
My hope over the next few months is that we will agree and propose some extensions to Schema.org (that will get accepted), especially in the areas of work/manifestation relationships, representation of identifiers other than ISBN, defining content/carrier, journal articles, and a few others that may arise. Something that has become clear in our conversations is that we also have a role as a group in providing examples of how [extended] Schema.org markup should be applied to bibliographic data.
I would characterise the stage we are at as moving from the talking-about-it stage to the doing-something-about-it stage. I am looking forward to the next few months with enthusiasm.
If you want to join in, you will find us over at http://www.w3.org/community/schemabibex/ (where, amongst other things, you will find on the Wiki recordings and chat transcripts from the meetings so far). If you or your group want to know more about Schema.org and its relevance to libraries and the broader bibliographic world, drop me a line or, if I can fit it in with my travels to conferences such as ALA, I could be persuaded to stand up and talk about it.
You may remember my frustration a couple of months ago at being in the air when OCLC announced the addition of Schema.org marked up Linked Data to all resources in WorldCat.org. Those of you who attended the OCLC Linked Data Round Table at IFLA 2012 in Helsinki yesterday will know that I got my own back on the folks who publish the press releases at OCLC, by announcing the next WorldCat step along the Linked Data road whilst they were still in bed.
The Round Table was an excellent, very interactive session with Neil Wilson from the British Library, Emmanuelle Bermes from the Centre Pompidou, and Martin Malmsten of the National Library of Sweden, which I will cover elsewhere. For now, you will find my presentation, Library Linked Data Progress, on my SlideShare site.
After we experimentally added RDFa-embedded linked data, using Schema.org markup and some proposed library extensions, to WorldCat pages, one of the questions I was asked most often was: where can I get my hands on some of this raw data?
We are taking the application of linked data to WorldCat one step at a time so that we can learn from how people use and comment on it. So, at that time, if you wanted to see the raw data the only way was to use a tool [such as the W3C RDFa 1.1 Distiller] to parse it out of the pages, just as the search engines do.
So I am really pleased to announce that you can now download a significant chunk of that data as RDF triples. Especially in experimental form, providing the whole lot as a download would have been a bit of a challenge, even just in disk space and bandwidth terms. So which chunk to choose was a question. We could have chosen a random selection, but decided instead to pick the most popular resources in WorldCat, in terms of holdings – an interesting selection in its own right.
To make the cut, a resource had to be held by more than 250 libraries. It turns out that almost 1.2 million resources fall into this category, so a sizeable chunk indeed. To get your hands on this data, download the 1 GB gzipped file. It is in RDF N-Triples form, so you can take a look at the raw data in the file itself. Better still, download and install a triplestore [such as 4store], load up the approximately 80 million triples and practice some SPARQL on them.
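Even before reaching for a triplestore, a few lines of Python are enough to get a feel for an N-Triples file. This is a quick-look sketch, not a full N-Triples parser (a proper RDF library would be more robust), and the sample statements stand in for the real download; it simply tallies how often each predicate appears:

```python
from collections import Counter

def tally_predicates(lines):
    """Count N-Triples statements per predicate.

    Each N-Triples statement is '<subject> <predicate> object .'; the
    subject and predicate tokens never contain spaces, so two splits are
    enough to peel them off. Blank lines and comments are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _subject, predicate, _obj = line.split(" ", 2)
        counts[predicate.strip("<>")] += 1
    return counts

# Invented sample statements standing in for lines of the real download:
sample = [
    '<http://example.org/b1> <http://schema.org/name> "Book One" .',
    '<http://example.org/b2> <http://schema.org/name> "Book Two" .',
    '<http://example.org/b1> <http://schema.org/author> <http://example.org/p1> .',
]
print(tally_predicates(sample))
```

Run over the full download (reading the gzipped file line by line rather than into memory), the same loop would give a rough profile of which schema.org properties dominate the data.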
Another area of questioning around the publication of WorldCat linked data has been about licensing. Both the RDFa-embedded data and the download are published as open data under the Open Data Commons Attribution License (ODC-BY), with reference to the community norms put forward by the members of the OCLC cooperative who built WorldCat. The theme of many of the questions has been along the lines of “I understand what the license says, but what does this mean for attribution in practice?”
To help clarify how you might attribute ODC-BY licensed WorldCat, and other OCLC linked data, we have produced attribution guidelines addressing some of the uncertainties in this area. You can find these at http://www.oclc.org/data/attribution.html. They address several scenarios, from documents containing WorldCat-derived information to referencing WorldCat URIs in your linked data triples, suggesting possible ways to attribute the OCLC WorldCat source of the data. As guidelines, they obviously cannot cover every possible situation which may require attribution, but hopefully they will cover most and can be adapted to other similar ones.
As I say in the press release, posted after my announcement, we are really interested to see what people will do with this data. So let us know, and if you have any comments on any aspect of its markup, schema.org extensions, publishing, or on our attribution guidelines, drop us a line at firstname.lastname@example.org.
Typical! Since joining OCLC as Technology Evangelist, I have been preparing myself to be one of the first to blog about the release of linked data describing the hundreds of millions of bibliographic items in WorldCat.org. So where am I when the press release hits the net? 35,000 feet above the North Atlantic heading for LAX, that’s where – life just isn’t fair.
By the time I am checked in to my Anaheim hotel, ready for the ALA Conference, this will be old news. Nevertheless it is significant news, significant in many ways.
OCLC have been at the leading edge of publishing bibliographic resources as linked data for several years. At dewey.info they have been publishing the top levels of the Dewey classifications as linked data since 2009. As announced yesterday, this has now been increased to encompass 32,000 terms, such as this one for the transits of Venus. Also around for a few years is VIAF (the Virtual International Authority File), where you will find URIs published for authors, such as this well known chap. These two were more recently joined by FAST (Faceted Application of Subject Terminology), providing usefully applicable identifiers for Library of Congress Subject Headings and combinations thereof.
Despite this leading position in the sphere of linked bibliographic data, OCLC has attracted some criticism over the years for not biting the bullet and applying it to all the records in WorldCat.org as well. As today’s announcement now demonstrates, they have taken their linked data enthusiasm to the heart of their rich, publicly available, bibliographic resources – publishing linked data descriptions for the hundreds of millions of items in WorldCat.
Let me dissect the announcement a bit….
First significant bit of news – WorldCat.org is now publishing linked data for hundreds of millions of bibliographic items – that’s a heck of a lot of linked data by anyone’s measure, and by far the largest linked bibliographic resource on the web. It is also linked data describing things that, for decades, librarians in tens of thousands of libraries all over the globe have been carefully cataloguing so that the rest of us can find out about them. Just the sort of authoritative resources that will help stitch the emerging web of data together.
Second significant bit of news – the core vocabulary used to describe these bibliographic assets comes from schema.org. Schema.org is the initiative backed by Google, Yahoo!, Microsoft, and Yandex, to provide a generic high-level vocabulary/ontology to help mark up structured data in web pages so that those organisations can recognise the things being described and improve the services they can offer around them. A couple of examples being Rich Snippet results and inclusion in the Google Knowledge Graph.
As I reported a couple of weeks back from the Semantic Tech & Business Conference, some 7-10% of indexed web pages already contain schema.org markup, in Microdata or RDFa form. It may at first seem odd for a library organisation to use a generic web vocabulary to mark up its data – but just think who the consumers of this data are, and which vocabularies they are most likely to recognise. Just for starters, embedding schema.org data in WorldCat.org pages immediately makes them understandable by the search engines, vastly increasing the findability of these items.
Third significant bit of news – the linked data is published both in human readable form and in machine readable RDFa on the standard WorldCat.org detail pages. You don’t need to go to a special version or interface to get at it; it is part of the normal interface. As you can see from the screenshot of a WorldCat.org item above, there is now a Linked Data section near the bottom of the page. Click and open up that section to see the linked data in human readable form. You will see the structured data that the search engines and other systems will get from parsing the RDFa encoded data within the html that creates the page in your browser. Not very pretty to human eyes I know, but just the kind of structured data that systems love.
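To show roughly what a consumer gets from that RDFa, here is a deliberately naive Python sketch using only the standard library. The html fragment is invented and far simpler than a real WorldCat page, and real RDFa processing also tracks `about`, `typeof`, `resource` and prefix mappings, which this ignores; it just collects the property/value pairs that the annotations carry:

```python
from html.parser import HTMLParser

class RDFaPropertyExtractor(HTMLParser):
    """Collect (property, value) pairs from RDFa-annotated html.

    A value comes either from a 'content' attribute on the same element
    or from the element's text. This is a sketch, not an RDFa processor.
    """
    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open_property = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop:
            if "content" in attrs:
                self.pairs.append((prop, attrs["content"]))
            else:
                self._open_property = prop

    def handle_data(self, data):
        if self._open_property and data.strip():
            self.pairs.append((self._open_property, data.strip()))
            self._open_property = None

# Invented fragment, loosely in the style of a WorldCat detail page:
fragment = """
<div about="http://example.org/oclc/12345" typeof="schema:Book">
  <span property="schema:name">An Example Book</span>
  <meta property="library:holdingsCount" content="1234"/>
</div>
"""

parser = RDFaPropertyExtractor()
parser.feed(fragment)
print(parser.pairs)
```

Note that the fragment also carries a `library:` attribute of the kind discussed below; a parser does not need to know the library extension vocabulary in advance to extract it.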
Fourth significant bit of news – OCLC are proposing to cooperate with the library and wider web communities to extend Schema.org, making it even more capable for describing library resources. With the help of the W3C, Schema.org is working with several industry sectors to extend the vocabulary to be more capable in their domains – news and e-commerce being a couple of already accepted examples. OCLC is playing its part in doing this for the library sector.
Take a closer look at the markup on WorldCat.org and you will see attributes from a library vocabulary – attributes such as library:holdingsCount and library:oclcnum. This library vocabulary is OCLC’s conversation starter, with which we want to kick off discussions with interested parties, from the library and other sectors, about proposing a basic extension to schema.org for library data. What better way of testing out such a vocabulary than to mark up several million records with it, publish them, and see what the world makes of them.
Fifth significant bit of news – the WorldCat.org linked data is published under an Open Data Commons Attribution (ODC-BY) license, so it will be openly usable by many for many purposes.
Sixth significant bit of news – this release is an experimental release. This is the start, not the end, of a process. We know we have not got this right yet. There are more steps to take around how we publish this data in ways in addition to RDFa markup embedded in page html – not everyone can, or will want to, parse pages to get the data. There are obvious areas for discussion around the use of schema.org and the proposed library extension to it. There are areas for discussion about the application of the ODC-BY license and the attribution requirements it asks for. Over the coming months OCLC wants to engage constructively with all who are interested in this process. It is only with the help of the library and wider web communities that we can get it right. In that way we can ensure that WorldCat linked data will be beneficial for the OCLC membership, libraries in general, and a great resource on the emerging web of data.
As you can probably tell I am fairly excited about this announcement. This, and future stuff like it, are behind some of my reasons for joining OCLC. I can’t wait to see how this evolves and develops over the coming months. I am also looking forward to engaging in the discussions it triggers.
Even more recently I note that OCLC, at their EMEA Regional Council Meeting in Birmingham this week, see Linked Data as an important topic on the library agenda.
The consequence of this rising interest in library Linked Data is that the community is now exploring and debating how to migrate library records from formats such as MARC into this new RDF form. In my opinion there is a great danger here of getting bogged down in the detail of how to represent every scintilla of information from a library record in every linked data view that might represent the thing that record describes. This is hardly surprising, as most of those engaged in the debate come from an experience where, if something was not preserved on a physical or virtual record card, it would be lost forever. By concentrating on record/format transformation, I believe they are using a Linked Data telescope to view their problem, but are not necessarily looking through the correct end of that telescope.
Let me explain what I mean by this. There is massive duplication of information in library catalogues. For example, every library record describing a copy of a book about a certain boy wizard will contain one or more variations of the string of characters “Rowling, J. K.”. To us humans it is fairly easy to infer that all of them represent the same person, as described by each cataloguer. For a computer, they are just strings of characters.
OCLC host the Virtual International Authority File (VIAF) project which draws together these strings of characters and produces a global identifier for each author. Associated with that author they collect the local language representations of their name.
One simple step down the Linked Data road would be to replace those strings of characters in those records with the relevant VIAF permalink, or URI – http://viaf.org/viaf/116796842/. One result of this would be that your system could follow that link and return an authoritative naming of that person, with the added benefit of it being available in several languages. A secondary, and more powerful, result is that any process scanning such records can identify exactly which [VIAF identified] person is the creator, regardless of the local language or formatting practices.
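The substitution itself is trivial once you have the mapping; the hard part, which is exactly what VIAF provides, is building that mapping. A hypothetical Python sketch (the variant strings and record fields are hand-made for illustration, not the product of a real VIAF lookup) of normalising creator strings to the VIAF URI mentioned above:

```python
# Hand-made mapping from cataloguers' name-string variants to one VIAF URI.
# In reality, assembling these variants is the work of the VIAF project.
VIAF_URI = "http://viaf.org/viaf/116796842/"
VARIANTS = {"Rowling, J. K.", "Rowling, J.K., 1965-", "J. K. Rowling"}

def link_creator(record):
    """Replace a creator name string with its VIAF URI, if we know it."""
    if record.get("creator") in VARIANTS:
        record = dict(record, creator=VIAF_URI)
    return record

records = [
    {"title": "Harry Potter and the Philosopher's Stone", "creator": "Rowling, J. K."},
    {"title": "Some Other Book", "creator": "Unknown, A."},
]
linked = [link_creator(r) for r in records]
print(linked)
```

After the substitution, any process scanning the records can test for one URI instead of guessing across string variants, which is the “more powerful result” described above.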
Why stop at the point of only identifying creators with globally unique identifiers? Why not use an identifier to represent the combined concept of a text, authored by a person, published by an organisation, in the form of a book – each of those elements having its own unique identifier. If you enabled such a system on the Linked Data web, what would a local library catalogue need to contain? Probably only a local identifier of some sort, with links to local information such as supplier, price, date of purchase, licence conditions, physical location, etc., plus a link to the global description provided by a respected source such as Open Library, the Library of Congress, the British Library, OCLC, etc. A very different view of what might constitute a record in a local library.
So far I have looked at this from the library point of view. What about the view from the rest of the world?
I contend that most of those wishing to reference books and journal articles curated and provided by libraries would be happiest if they could refer to a global identifier that represents the concept of a particular work. Such consumers would only need a small sub-set of the data assembled by a library for basic display and indexing purposes – title and author. The next question may be: where is there a locally available copy of this book or article that I can access? In the model I describe, where these global identifiers are linked to local information such as loan status, the lookup would be a simple process compared with a current contrived search against inferred strings of characters.
Currently Google and other search engines have great difficulty in managing the massive number of library catalogue pages that will match a search for a book title. As referred to previously, Google are assembling a graph of related things. In this context the thing is the concept of the book or article, not the thousands of library catalogue pages describing the same thing.
Pulling these thoughts together, and looking down the Linked Data telescope from the non-library end, I envisage a layered approach to accessing library data.
A simple global identifier, or interlinked identifiers from several respected sources, that represents the concept of a particular thing (book, article, etc.)
A simple set of high-level description information for each thing – links to author, title, etc., associated with the identifier. This level of information would be sufficient for many uses on the web and could contain only publicly available information.
For those wishing for more in-depth bibliographic information, those unique identifiers, either directly or via sameAs links, could lead you to more of the rich resources catalogued by libraries around the world, which may or may not be behind slightly less open licensing or commercial constraints.
Finally library holding/access information would be available, separate from the constraints of the bibliographic information, but indexed by those global identifiers.
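The layers above can be sketched as data structures. This is a speculative illustration (all the class names, fields, and URIs are invented, not a proposed standard) of how thin a local record becomes when it leans on a global description and keys its holdings by the shared identifier:

```python
from dataclasses import dataclass, field

@dataclass
class GlobalDescription:
    """Layers one and two: a global identifier plus minimal public description."""
    uri: str            # the globally shared identifier for the work concept
    title: str
    author_uri: str     # e.g. a VIAF URI
    same_as: list = field(default_factory=list)  # links to other respected sources

@dataclass
class LocalHolding:
    """The final layer: purely local facts, keyed by the global identifier."""
    work_uri: str       # link back to the global description
    location: str
    on_loan: bool

work = GlobalDescription(
    uri="http://example.org/work/42",
    title="An Example Work",
    author_uri="http://viaf.org/viaf/116796842/",
)
holding = LocalHolding(work_uri=work.uri, location="Shelf 3B", on_loan=False)

# Holdings lookup becomes a simple join on the shared identifier,
# rather than a contrived search against strings of characters:
holdings_index = {holding.work_uri: [holding]}
print(holdings_index[work.uri][0].location)
```

The richer bibliographic layer, open or otherwise constrained, would hang off the same URI, so a consumer only follows it when the minimal description is not enough.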
To get us to such a state will require a couple of changes in the way libraries do things.
Firstly, the rich data collated in current library records should be used to populate a Linked Data model of the things those records describe – not just to reproduce the records we have in another format. An approach I expanded upon in a previous post, Create Data Not Records.
Secondly, as such a change would be a massive undertaking, libraries will need to work together to do this. The centralised library data holders have a great opportunity to drive this forward. A few years ago, the distributed hosted-on-site landscape of library management systems would have prevented such a change from happening. However, with library system software-as-a-service becoming an increasingly viable option for many, it is not the libraries that would have to change, just the suppliers of the systems they use.
Most Semantic Web and Linked Data enthusiasts will tell you that Linked Data is not rocket science, and it is not. They will tell you that RDF is one of the simplest data forms for describing things, and they are right. They will tell you that adopting Linked Data makes merging disparate datasets much easier to do, and it does. They will say that publishing persistent globally addressable URIs (identifiers) for your things and concepts will make it easier for others to reference and share them, it will. They will tell you that it will enable you to add value to your data by linking to and drawing in data from the Linked Open Data Cloud, and they are right on that too. Linked Data technology, they will say, is easy to get hold of either by downloading open source or from the cloud, yup just go ahead and use it. They will make you aware of an ever increasing number of tools to extract your current data and transform it into RDF, no problem there then.
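The merging claim, at least, is easy to demonstrate in miniature. Because two independently produced datasets name the same thing with the same URI, combining them is a straightforward keyed merge rather than a fuzzy string match (the data here is invented for illustration):

```python
# Two independently produced descriptions of the same thing, keyed by URI.
# Because both use the same identifier, merging needs no string matching.
dataset_a = {"http://example.org/work/42": {"title": "An Example Work"}}
dataset_b = {"http://example.org/work/42": {"subject": "Linked Data"},
             "http://example.org/work/99": {"title": "Another Work"}}

merged = {}
for dataset in (dataset_a, dataset_b):
    for uri, properties in dataset.items():
        # Properties accumulate under the shared identifier.
        merged.setdefault(uri, {}).update(properties)

print(merged["http://example.org/work/42"])
```

In RDF terms this is just unioning two graphs; the work happens up front, in agreeing on the identifiers, which is the point the enthusiasts are making.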
So would I recommend a self-taught do-it-yourself approach to adopting Linked Data? For an enthusiastic individual, maybe. For a company or organisation wanting to get to know and then identify the potential benefits, no I would not. Does this mean I recommend outsourcing all things Linked Data to a third party – definitely not.
Let me explain this apparent contradiction. I believe that anyone who has, or could benefit from consuming, significant amounts of data can realise benefits by adopting Linked Data techniques and technologies. These benefits could be in the form of efficiencies, data enrichment, new insights, SEO benefits, or even new business models. Gaining the full effect of these benefits will come not only from adopting the technologies but also from adopting the different way of thinking, often called open-world thinking, that comes from understanding the Linked Data approach in your context. That change of thinking, and the agility it brings, will only embed in your organisation if you do it yourself. However, I do counsel care in the way you approach gaining this understanding.
A young child wishing to keep up with her friends by migrating from tricycle to bicycle may have a go herself, but may well give up after the third grazed knee. The helpful, if out of breath, dad jogging along behind providing a stabilising hand, helpful guidance, encouragement, and warnings to stay on the side of the road, will result in a far less painful and far more rewarding experience.
I am aware of computer/business professionals who are not aware of what Linked Data is, or of the benefits it could provide. There are others who have looked at it, do not see how it could be better, but do see potential grazed knees if they go down that path. And there are yet others who have had a go, but without a steadying hand to guide them, and end up still not getting it.
You want to understand how Linked Data could benefit your organisation? Get some help to relate the benefits to your issues, challenges and opportunities. Don’t go off to a third party and get them to implement something for you; bring in a steadying hand, encouragement, and guidance to stay on track. Don’t go off and purchase expensive hardware and software to help you explore the benefits of Linked Data: there are plenty of open source triplestores, or, even better, just sign up to a cloud-based service such as Kasabi. Get your head around what you have, how you are going to publish and link it, and what the usage might be. Then you can size and specify the technology and/or service you need to support it.
So back to my original question – is Linked Data DIY a good idea? Yes it is. It is the only way to reap the ‘different way of thinking’ benefits that accompany understanding the application of Linked Data in your organisation. However, I would not recommend a do-it-yourself introduction to this. Get yourself a steadying hand.
Is that last statement a thinly veiled pitch for my services – of course it is, but that should not dilute my advice to get some help when you start, even if it is not from me.
Picture of girl learning to ride from zsoltika on Flickr.
Source of cartoon unknown.
Europeana recently launched an excellent short animation explaining what Linked Open Data is and why it’s a good thing, both for users and for data providers. They did this in support of the release of a large amount of Linked Open Data describing cultural heritage assets held in Libraries, Museums, Galleries and other institutions across Europe.
Europeana, as an aggregator and proxy for data supplied by other institutions, is in a difficult position. They not only want to publish this information for the benefit of Europe and the wider world, they also need to maintain the provenance and relationships between the submissions of data from their partner organisations. I believe that the Europeana Data Model (EDM) is the result of the second of these two priorities taking precedence, their proxy role being reflected in the structure of the data. The effect is that a potential consumer of their data, who is not versed in Europeana and their challenges, will need to understand their model before being able to identify that the Cartographer : Ryther, Augustus created the Cittie of London 31.
Fortunately as their technical overview indicates, this is a pilot and the team at Europeana are open to suggestion, particularly on the issue of providing information at the item level in the data model:
Depending on the feedback received during this pilot, we may change this and duplicate all the descriptive metadata at the level of the item URI. Such an option is costly in terms of data verbosity, but it would enable easier access to metadata, for data consumers less concerned about provenance.
In the interests of this data becoming useful, valuable, and easily consumable for those outside of the Europeana partner grouping, I encourage you to lobby them to take a hit on the duplication of some data.
Some in the surfing community will tell you that every seventh wave is a big one. I am getting the feeling, in the world of the Web, that a number seven is up next, and this one is all about data. The last seventh wave was the Web itself. Because of that, it is a little constraining to talk about this next one only affecting the world of the Web. This one has the potential to shift some significant rocks around on all our beaches and change the way we all interact with and think about the world around us.
Sticking with the seashore metaphor for a short while longer: waves from the technology ocean have the potential to wash into the bays and coves of interest on the coast of human endeavour and rearrange the pebbles on our beaches. Some do not reach every cove, and/or have only minor impact; however, some really big waves reach in everywhere to churn up the sand and rocks, significantly changing the way we do things and ultimately think about the world around us. The post-Web technology waves have brought smaller yet important influences such as ecommerce, social networking, and streaming.
I believe Data, or more precisely changes in how we create, consume, and interact with data, has the potential to deliver a seventh wave impact. Enough of the grandiose metaphors and down to business.
Data has been around for centuries, from clay tablets to little cataloguing tags on the end of scrolls in ancient libraries, and on into the computerised databases that we have been accumulating since the 1960s. Up until very recently these [digital] data have been closed – constrained by the systems that used them, only exposed to the wider world via user interfaces and possibly a task/product-specific API. With the advent of many data-associated advances, variously labelled Big Data, Social Networking, Open Data, Cloud Services, Linked Data, Microformats, Microdata, Semantic Web, Enterprise Data, it is now venturing beyond those closed systems into the wider world.
Well this is nothing new, you might say, these trends have been around for a while – why does this constitute the seventh wave of which you foretell?
It is precisely because these trends have been around for a while, and are starting to mature and influence each other, that they are building to form something really significant. Take Open Data for instance where governments have been at the forefront – I have reported before about the almost daily announcements of open government data initiatives. The announcement from the Dutch City of Enschede this week not only talks about their data but also about the open sourcing of the platform they use to manage and publish it, so that others can share in the way they do it.
I might find some of the activities in the Cloud Computing short-sighted and depressing, yet already the concept of housing your data somewhere other than in a local datacenter is becoming accepted in most industries.
Enterprise use of Linked Data by leading organisations, such as the BBC who are underpinning their online Olympics coverage with it, is showing that it is more than a research tool, or the province only of open data enthusiasts.
Data Marketplaces are emerging to provide platforms to share and possibly monetise your data. An example that takes this one step further is Kasabi.com from the leading Semantic Web technology company, Talis. Kasabi introduces the data mixing, merging, and standardised querying of Linked Data into the data publishing concept. This potentially provides a platform for refining and mixing raw data into new data alloys and products more valuable and useful than their component parts – an approach that should stimulate innovation both in the enterprise and in the data enthusiast community.
The Big Data community is demonstrating that there are solutions to handling the vast volumes of data we are producing, but that they require us to move out of the silos of relational databases towards a mixed economy. Programs need to move – not the data. NoSQL databases, Hadoop, map/reduce: these are all things that are starting to move out of the labs and the hacker communities into the mainstream.
The Social Networking industry, which produces tons of data, is a rich field for things like sentiment analysis, trend spotting, targeted advertising, and even short term predictions. Innovation in this field has been rapid, but I would suggest a little hampered by delivering closed, individual solutions that as yet do not interact with the wider world which could place them in context.
I wrote about Schema.org a while back. An initiative from the search engine big three to encourage the SEO industry to embed simple structured data in their html. The carrot they are offering for this effort is enhanced display in results listings – Google calls these Rich Snippets. When first announced, the Schema.org folks concentrated on Microdata as the embedding format – something that wouldn’t frighten the SEO community horses too much. However they did [over a background of loud complaining from the Semantic Web / Linked Data enthusiasts that RDFa was the only way] also indicate that RDFa would eventually be supported. By engaging with SEO folks on terms that they understand, this move from Schema.org had the potential to get far more structured data published on the Web than any TED Talk from Sir Tim Berners-Lee, preaching from people like me, or guidelines from governments could ever do.
The above short list of pebble-stirring waves is both impressive in its breadth and encouraging in its potential, yet none of them are the stuff of a seventh wave.
So what caused me to open up my MacBook and start writing this? It was a post from Manu Sporny indicating that Google were not waiting for RDFa 1.1 Lite (the RDFa version that Schema.org will support) to be ratified. They are already harvesting, and using, structured information from web pages that has been encoded using RDFa. The use of this structured data has resulted in enhanced display on the Google pages with items such as event date & location information, and recipe preparation timings.
Manu references sites that seem to be running Drupal, the open source CMS software, and specifically a Drupal plug-in for rendering Schema.org data encoded as RDFa. This approach answers some of the critics of embedding Schema.org data into a site’s HTML, especially as RDFa, who say it is ugly and difficult to understand. It is not there for humans to parse or understand and, with modules such as the Drupal one, humans will not need to get their hands dirty down at code level. Currently Schema.org supports a small but important number of ‘things’ in its recognised vocabularies. These, currently supplemented by GoodRelations and Recipes, will hopefully be joined by others to broaden the scope of descriptive opportunities.
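To see why this markup is for machines rather than humans, here is a rough sketch of how a crawler might pull Schema.org properties out of a page marked up with RDFa Lite attributes (`property`, `content`). The sample page and the extractor are both invented for illustration – real crawlers and the Drupal module are far more thorough – but the shape of the extraction is the point.

```python
from html.parser import HTMLParser

# A toy page using Schema.org terms via RDFa Lite attributes,
# roughly as a CMS module might emit them (invented example).
PAGE = """
<div vocab="http://schema.org/" typeof="Recipe">
  <span property="name">Carrot soup</span>
  <span property="prepTime" content="PT30M">30 minutes</span>
</div>
"""

class RDFaLiteExtractor(HTMLParser):
    """Collects (property, value) pairs: the value is the content
    attribute if present, otherwise the element's visible text."""
    def __init__(self):
        super().__init__()
        self._pending = None   # property still awaiting its text value
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "property" in a:
            if "content" in a:
                self.triples.append((a["property"], a["content"]))
            else:
                self._pending = a["property"]

    def handle_data(self, data):
        if self._pending and data.strip():
            self.triples.append((self._pending, data.strip()))
            self._pending = None

extractor = RDFaLiteExtractor()
extractor.feed(PAGE)
print(extractor.triples)
# → [('name', 'Carrot soup'), ('prepTime', 'PT30M')]
```

A human sees "30 minutes"; the machine gets the unambiguous `PT30M` duration – which is exactly the trade the embedded data makes.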
So roll the clock forward, not too far, to a landscape where a large number of sites (incentivised by the prospect of listings as enriched as their competitors’ results) are embedding structured data in their pages as normal practice. By then most if not all web site delivery tools should be able to embed the Schema.org RDFa data automatically. Google and the other web crawling organisations will rapidly build up a global graph of the things on the web, their types, relationships and the pages that describe them. A nifty example of providing a very specific, easily understood benefit in return for a change in the way web sites are delivered, resulting in a global shift in the amount of structured data accessible for the benefit of all. Google Fellow and SVP Amit Singhal recently gave insight into this Knowledge Graph idea.
The Semantic Web / Linked Data proponents have been trying to convince everyone else of the great good that will follow once we have a web interlinked at the data level with meaning attached to those links. So far this evangelism has had little success. However, this shift may give them what they want via an unexpected route.
Once such a web emerges, and most importantly is understood by the commercial world, innovations that will influence the way we interact will naturally follow. A Google TV, with access to such rich resource, should have no problem delivering an enhanced viewing experience by following structured links embedded in a programme page to information about the cast, the book of the film, the statistics that underpin the topic, or other programmes from the same production company. Our iPhone version next-but-one, could be a personal node in a global data network, providing access to relevant information about our location, activities, social network, and tasks.
These slightly futuristic predictions will only become possible on top of a structured network of data, which I believe is what could very well emerge if you follow through on the signs that Manu is pointing out. Reinforced by, and combining with, the other developments I referenced earlier in this post, I believe we may well have a seventh wave approaching. Perhaps I should look at the beach again in five years’ time to see if I was right.
Wave photo from Nathan Gibbs on Flickr
Declarations – I am a Kasabi Partner and shareholder in Kasabi parent company Talis.
Data is going to become more core to our world than we could ever have imagined a few short years ago. Although we have been producing it for decades, data has either been treated as something in the core of a project not to expose to prying eyes, or often as a toxic waste product of business processes. Some of the traditional professions that have emerged to look after and work with these data reflect this relationship between us and our digital assets. In the data warehouse, they archive, preserve, catalogue, and attempt to make sense of vast arrays of data. The data miners precariously dig through mountains of data as it shifts and settles around them, propping up their expensive burrows with assumptions and inferred relationships, hoping a change in the strata does not cause a logical cave-in and they have to start again.
As I have postulated previously, I believe we are on the edge of a new revolution where data becomes a new raw material that drives the emergence of new industries, analogous to the emergence of manufacturing as a consequence of the industrial revolution. As this new era rolls out, the collection of data wrangling enthusiasts that have done a great job in getting us thus far will not be sufficient to sustain a new industry of extracting, transforming, linking, augmenting, analysing and publishing data.
So this initiative from the OKF & P2PU is very welcome:
The explosive growth in data, especially open data, in recent years has meant that the demand for data skills — for data “wranglers” or “scientists” — has been growing rapidly. Moreover, these skills aren’t just important for banks, supermarkets or the next Silicon Valley start-up, they are also going to be crucial in research, in journalism, and in civil society organizations (CSOs).
However, there is currently a significant shortfall of data “wranglers” to satisfy this growing demand, especially in civil society organisations — McKinsey expects a skills shortage in data expertise to reach 50-60% by 2018 in the US alone.
It is welcome, not just because they are doing it but also, because of who they are and the direction they are taking:
The School of Data will adopt the successful peer-to-peer learning model established by P2PU and Mozilla in their ‘School of Webcraft’ partnership. Learners will progress by taking part in ‘learning challenges’ – series of structured, achievable tasks, designed to promote collaborative and project-based learning.
As learners gain skills, their achievements will be rewarded through assessments which lead to badges. Community support and on-demand mentoring will also be available for those who need it.
They are practically approaching real-world issues and tasks from the direction of the benefit to society of opening up data. Taking this route will engage those that have the desire, need and enthusiasm to become either part- or full-time data wranglers. Hopefully these will establish an ethos that will percolate into commercial organisations, taking an open world view with it. I am not suggesting that commerce should be persuaded to freely and openly share all their data, but they should learn the techniques of the open data community as the best way to share data under whatever commercial and licensing conditions are appropriate.
One of the more eagerly awaited presentations at the Semantic Tech & Business Conference in Berlin today was a late addition to the program from Denny Vrandecic. With the prominence of DBpedia in the Linked Open Data Cloud, anything new from Wikipedia with data in it was bound to attract attention, and we were not disappointed.
Denny started by telling us that from March he would be moving to Berlin to work for the Wikimedia Foundation on WikiData.
He then went on to explain that the rich Wikipedia resource may have much of the world’s information but does not have all the answers. There are vast differences in coverage between language versions, for instance. Also it is not good at answering questions such as: what are the 10 largest cities with a female mayor? You get some cities back, but most if not all of them do not have a female mayor. One way of addressing this issue that has proliferated in Wikipedia is lists. The problem with lists is that there are so many of them, in several languages, often duplicated, and then there is the array of lists of lists.
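The female-mayor question shows why this is a data problem rather than a prose problem. Once the facts are machine-readable statements, the answer is a trivial filter and sort – sketched here in Python over invented sample records (none of these cities or figures are real).

```python
# Invented sample records: in a structured-data Wikipedia each of
# these facts would be a statement, not a sentence in an article.
cities = [
    {"name": "Alphaville", "population": 8_000_000, "mayor_gender": "male"},
    {"name": "Betatown",   "population": 5_000_000, "mayor_gender": "female"},
    {"name": "Gammaburg",  "population": 3_000_000, "mayor_gender": "female"},
]

# "The 10 largest cities with a female mayor" becomes
# filter-then-sort once the facts are data.
answer = sorted(
    (c for c in cities if c["mayor_gender"] == "female"),
    key=lambda c: c["population"],
    reverse=True,
)[:10]

print([c["name"] for c in answer])
# → ['Betatown', 'Gammaburg']
```

No hand-maintained list, in any language, is needed; the "list" is just a query result, always as current as the underlying statements.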
We must accept that Wikipedia doesn’t have all the answers – humans can read articles but computers cannot understand their meaning. WikiData will create an article for each topic, pointing to the relevant Wikipedia articles in all languages.
DBpedia has been a great success at extracting information from Wikipedia info-boxes and publishing it as data, but it is not editable. WikiData will turn that model on its head by providing an editable environment for data that will then be used to automatically populate the info-boxes. WikiData will also reference secondary databases – for example, indicating that the CIA World Factbook provides a value for something.
WikiData will not define the truth, it will collect the references to the data.
Denny listed the objectives of the WikiData project to be:
Provide a database of the world’s knowledge that anyone can edit
Collect references and quotes for millions of data items
Engage a sustainable community that collects data from everywhere in a machine-readable way
Increase the quality and lower the maintenance costs of Wikipedia and related projects
Deliver software and community best practices enabling others to engage in projects of data collection and provisioning
WikiData phase 1, due to complete in the summer, will create one WikiData page for each Wikipedia entity, listing its representations in each language. The individual language Wikipedias will then pull their language links from WikiData.
The second phase will include the centralisation of data values for info-boxes, and will then have the Wikipedias populate their info-boxes from WikiData.
The final phase will be to enable inline queries against WikiData to be made from Wikipedias with the results surfaced in several formats.
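A rough sketch of what the second phase implies: one central store of referenced statements, rendered into each language’s info-box. Everything here – the item identifier, the data model, the label mapping – is invented for illustration, not WikiData’s actual design.

```python
# Hypothetical central store: one statement per (item, property),
# carrying both the value and its reference, as described in the talk.
statements = {
    "X1": {  # invented identifier for some city
        "population": {"value": 3_500_000,
                       "reference": "CIA World Factbook"},
    },
}

def infobox_row(item, prop, labels):
    """Render one info-box row for a given language's labels.
    The value and its source live centrally; only labels are local."""
    s = statements[item][prop]
    return f"{labels[prop]}: {s['value']} (source: {s['reference']})"

# A German Wikipedia info-box and an English one draw on the same fact.
print(infobox_row("X1", "population", {"population": "Einwohner"}))
# → Einwohner: 3500000 (source: CIA World Factbook)
```

Fix the value once, centrally, and every language edition’s info-box follows – which is where the promised drop in maintenance cost comes from.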
Denny did not provide a schedule for the second and third phases.
This is all in addition to the ability to provide freely re-usable, machine-readable access to the world’s data.
The beginnings of an interesting project from Wikimedia that could radically influence the data landscape – well worth watching as it progresses.
The German National Library (DNB) has launched a Linked Data version of the German National Bibliography.
The bibliographic data of the DNB’s main collection (apart from the printed music and the collection of the Deutsches Exilarchiv) and the serials (magazines, newspapers and series of the German Union Catalogue of serials (ZDB)) have been converted. Henceforth the RDF/XML-representation of the records are available at the DNB portal. This is an experimental service that will be continually expanded and improved.
This is a welcome extension to their Linked Data Service, previously delivering authority data. Documentation on their data and modelling is available, however the English version has yet to be updated to reflect this latest release.
Links to RDF/XML versions of individual records are available directly from the portal user interface, with the usual Linked Data content negotiation techniques available to obtain HTML or RDF/XML as required.
This is a welcome addition to the landscape of linked open bibliographic data, joining others such as the British Library.
Also to be welcomed is their move to CC0 licensing removing barriers, real or assumed, to the reuse of this data.
I predict that this will be the first of many more such announcements this year from national and other large libraries opening up their metadata resources as Linked Open Data. The next challenge will be to identify the synergies between these individual approaches to modelling bibliographic data and balance the often competing needs of the libraries themselves and potential consumers of their data who very often do not speak ‘library’.
Somehow [without engaging in the traditional global library cooperation treacle-like processes that take a decade to publish a document] we need to draw together a consistent approach to modelling and publishing Linked Open Bibliographic Data for the benefit of all – not just the libraries. With input from the DNB, British Library, Library of Congress, European National Libraries, Stanford, and others such as Schema.org, W3C, Open Knowledge Foundation etc., we could possibly get a consensus on an initial approach. Aiming for a standard would be both too restrictive, and based on experience, too large a bite of the elephant at this early stage.