A Step for Schema.org – A Leap for Bib Data on the Web

Regular readers of this blog may well know I am an enthusiast for Schema.org – the generic vocabulary for describing things on the web as structured data, backed by the major search engines Google, Bing, Yahoo! & Yandex.  When I first got my head around it back in 2011 I soon realised its potential for making bibliographic resources, especially those within libraries, a heck of a lot more discoverable.  To be frank, library resources did not, and still don’t, exactly leap into view when searching the web – a bit of a problem when most people start searching for things with Google et al and do not look elsewhere.

Schema.org as a generic vocabulary to describe most stuff, easily embedded in your web pages, has been a great success.  As was reported by Google’s R.V. Guha at the recent Semantic Technology and Business Conference in San Jose, a sample of 12B pages showed approximately 21% containing Schema.org markup.  Right from the beginning, however, I had concerns about its applicability to the bibliographic world – a great start with the Book type, but there were gaps in the coverage for such things as journal issues & volumes, multi-volume works, citations, and the relationship between a work and its editions.  Discovering others shared my combination of enthusiasm and concerns, I formed a W3C Community Group – Schema Bib Extend – to propose some bibliographic-focused extensions to Schema.org.  Which brings me to the events behind this post…

The SchemaBibEx group have had several proposals accepted over the last couple of years, such as making the [commercial] Offer more appropriate for describing loanable materials, and broadening the citation property. Several other significant proposals were brought together in a package which I take great pleasure in reporting was included in the latest v1.9 release of Schema.org.  For many in our group these latest additions were a long time coming after they were first proposed.  Although frustrating, the delays were symptomatic of a very healthy process.

Our proposals to add hasPart, isPartOf, exampleOfWork, and workExample to the CreativeWork type will benefit many areas, as CreativeWork is the superclass of a large number of types in the vocabulary. Our proposals for issueNumber on PublicationIssue and volumeNumber on PublicationVolume are very similar to others already there, such as seasonNumber and episodeNumber in TV & Radio.  Under Dan Brickley’s careful organisation, tweaks and adjustments were made across a few areas, resulting in a consistent style across the parts of the vocabulary underpinned by CreativeWork.
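
To make the new pieces concrete, here is a minimal sketch in Turtle (all of the example.org URIs and names are invented for illustration) of a journal article placed in its issue, volume, and periodical using the new properties:

    @prefix schema: <http://schema.org/> .

    # An article, its issue, its volume, and the periodical they belong to.
    <http://example.org/article/42> a schema:ScholarlyArticle ;
        schema:name "Entification for Beginners" ;
        schema:isPartOf <http://example.org/issue/7-2> .

    <http://example.org/issue/7-2> a schema:PublicationIssue ;
        schema:issueNumber "2" ;
        schema:isPartOf <http://example.org/volume/7> .

    <http://example.org/volume/7> a schema:PublicationVolume ;
        schema:volumeNumber "7" ;
        schema:isPartOf <http://example.org/periodical/jold> .

    <http://example.org/periodical/jold> a schema:Periodical ;
        schema:name "Journal of Openly Linked Data" .

The same statements could just as easily be embedded in a web page as RDFa or microdata, and hasPart simply asserts the inverse of isPartOf where that direction is more convenient.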

Although the number of new types and properties is small, their addition to Schema.org opens up the potential for much better description of periodicals and of creative work relationships. To introduce the background to this, SchemaBibEx member Dan Scott and I were invited to jointly post on the Schema.org Blog.

So, another step forward for Schema.org.  I believe it is more than just a step, however, for those wishing to make bibliographic resources more visible on the Web.  There has been some criticism that Schema.org is too simplistic to be able to represent some of the relationships and subtleties of our world – criticism that was not unfounded.  Now, with these enhancements, many of those criticisms are answered. There is more to do, but the major objective of the group that proposed them has been achieved – to lay the broad foundation for the description of bibliographic, and creative work, resources in sufficient detail for them to be understood by the search engines and become part of their knowledge graphs. Of course that is not the final end we are seeking.  The reason we share data is so that folks are guided to our resources – by sharing it using the well understood vocabulary of Schema.org.

Examples of a conceptual creative work being related to its editions, using exampleOfWork and workExample, have been available for some time.  In anticipation of their appearance in Schema.org, they were introduced into the OCLC WorldCat release of 194 million Work descriptions (for example: http://worldcat.org/entity/work/id/1363251773), with the inverse relationship being asserted in an updated version of the basic WorldCat linked data that has been available since 2012.
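
The pattern is simple enough to sketch in Turtle.  This is not the actual WorldCat data – the URIs and details below are invented – but it shows how exampleOfWork and workExample pair a conceptual work with one of its editions:

    @prefix schema: <http://schema.org/> .

    <http://example.org/work/123> a schema:CreativeWork ;
        schema:name "A Study in Scarlet" ;
        schema:workExample <http://example.org/edition/456> .

    <http://example.org/edition/456> a schema:Book ;
        schema:bookEdition "Paperback reissue" ;    # invented edition statement
        schema:isbn "9780000000000" ;               # placeholder ISBN
        schema:exampleOfWork <http://example.org/work/123> .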

From Records to a Web of Library Data – Pt3 Beacons of Availability

As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series.  That one – Entification, the following one – Hubs of Authority, and this one together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek.  Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and chair of the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC products/services or recommendations from the W3C Group.

Beacons of Availability

As I indicated in the first of this series, there are descriptions of a broader collection of entities than just books, articles and other creative works locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.

As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography and WorldCat), is creating Library Linked Data openly available on the Web.

Why do we catalogue? is a question I often ask, with an obvious answer – so that people can find our stuff.  How does this entification, sharing of authorities, and creation of a web of library linked data help us towards that goal?  In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. That sounds like a great benefit and mission statement for libraries of the world, but unfortunately not one that will nudge the needle on making library resources more discoverable for the vast majority of those that can benefit from them.

I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google.  A similar weight of opinion can be found complaining how bad Google, and the other search engines, are at representing library resources.  You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources.  Yet I am willing to bet that we have another 80-20 equation, or worse, about how few of the users that libraries want to reach even know those specialist Google services exist.  A bit of a sorry state of affairs when the major source of searching for our target audience is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!

Library linked data helps solve both the problem of better describing library resources so that the major search engines can find them, and the problem of identifying where a user can gain access to a resource – to loan, download, view via a suitable licence, or purchase, etc.

Findability
Before a search engine can lead a user to a suitable resource, it needs to identify that the resource exists, in any form, and hold a description for display in search results that will sufficiently inform the user of that.  Library search interfaces are inherently poor sources of such information, with web crawlers having to infer, from often difficult to differentiate text, what a page might be about.  This is not a problem isolated to library interfaces.  In response, the major search engines have cooperated to introduce a generic vocabulary for embedding structured information into web pages so that they can be informed in detail what the page references.  This vocabulary is Schema.org – I have previously posted about its success and significance.

With a few enhancements in the way it can describe bibliographic resources (currently being discussed by the Schema Bib Extend W3C Community Group), Schema.org is an ideal way for libraries to publish information about our resources and associated entities in a format the search engines can consume and understand.  Using URIs for authorities in that data, identifying the author in question by his/her VIAF identifier for instance, gives them the ability to recognise resources from many libraries associated with the same person.  With this greatly enriched, more structured view of library resources, linked to authoritative hubs, the likes of Google will over time stand a far better chance of presenting potential library users with useful, informative results.  I am pleased to say that OCLC have been at the forefront of demonstrating this approach by publishing Schema.org modelled linked data in the default WorldCat.org interface.
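
As a purely illustrative sketch (the item URI and the VIAF identifier below are placeholders, not real records), the pattern in the published data would look something like this in Turtle:

    @prefix schema: <http://schema.org/> .

    # A book description that names its author with a VIAF URI
    # rather than just a text string.
    <http://library-a.example.org/item/111> a schema:Book ;
        schema:name "Linked Data for Libraries" ;
        schema:author <http://viaf.org/viaf/12345> .    # placeholder VIAF identifier

    <http://viaf.org/viaf/12345> a schema:Person ;
        schema:name "Smith, Jane, 1950-" .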

For this approach to be most effective, many of the major libraries, consortia, etc. will need to publish metadata as linked data, in a form that the search engines can consume whilst (following linked data principles) linking to each other when they identify that they are describing the same resource. Many instances of [in data terms] the same thing being published on the web will naturally raise its visibility in results listings.

Visibility
An individual site (even a WorldCat) has difficulty in being identified above the noise of retail and other sites.  We are aware of the PageRank algorithms used by the search engines to identify and boost the reputation of individual sites and pages by the number of links between them.  If not an identical process, it is clear that similar rules will apply for structured data linking.  If twenty sites publish their own linked data about the same thing, the search engines will take note of each of them.  If each of those sites assert that their resource is the same resource as a few of their partner sites (building a web of connection between instances of the same thing), I expect that the engines will take exponentially more notice.
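
Continuing the invented example from above, the assertion of sameness is a single property, sameAs, repeated by each publisher (all of these URIs are made up):

    @prefix schema: <http://schema.org/> .

    # Library A's description points at Library B's description of the same
    # edition, and at a shared hub record; Library B points back.
    <http://library-a.example.org/item/111> a schema:Book ;
        schema:sameAs <http://library-b.example.org/record/222> ,
                      <http://hub.example.org/bib/333> .

    <http://library-b.example.org/record/222> a schema:Book ;
        schema:sameAs <http://library-a.example.org/item/111> .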

Page ranking does not depend on all pages having to link to all others.  Like many things on the web, hubs of authority and aggregation will naturally emerge with major libraries, local, national, and global consortia doing most of the inter-linking, providing interdependent hubs of reputation for others to connect with.

Availability
Having identified a resource that may satisfy a potential library user’s need, the next even more difficult problem is to direct that user to somewhere that they can gain access to it – loan, download, view via an appropriate licence, or purchase, etc.

WorldCat.org, and other hubs with linked data enhanced to provide holdings information, may well provide a target link via which a user can gain access to a resource, in addition to just getting a description of it.  However, those few sites, no matter how big or well recognised they are, are just a few sites shouting in the wilderness of the ever increasing web.  Any librarian in any individual library can quite rightly ask how to help Google, and the others, to point users at the most appropriate copy in his/her library.

We have all experienced the scenario of searching for a car rental company and receiving a link to one within walking distance as the first result – or finding the on-campus branch at the top of a list of results in response to a search for banks.  We know the search engines are good at location based searching, whether geographical or interest based, so why can they not do it for library resources?  To achieve this a library needs to become an integral part of a Web of Library Data: publishing structured linked data about the resources they have available for the search engines to find; in that data linking their resources to the reputable hubs of bibliographic data that will emerge, so the engines know it is another reference to the same thing; and going beyond basic bibliographic description to encompass the structured data used by the commercial world to identify availability.
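
That last step is less exotic than it may sound.  Schema.org already lets a commercial Offer hang off a CreativeWork, and (with the caveat that this is an invented sketch, continuing the placeholder item above, rather than anyone's production data) loan availability can ride on the same mechanism:

    @prefix schema: <http://schema.org/> .

    <http://library-a.example.org/item/111> a schema:Book ;
        schema:name "Linked Data for Libraries" ;
        schema:offers [
            a schema:Offer ;
            schema:availability schema:InStock ;    # i.e. a copy is on the shelf for loan
            schema:availableAtOrFrom <http://library-a.example.org/branch/main>
        ] .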

So who is going to do all this then – will every library need to employ a linked data expert?   I certainly hope not.

One would expect the leaders in this field (national libraries, OCLC, consortia, etc.) to continue to lead the way, in the process establishing the core of this library web of data – the hubs.  Building on that framework, the rest of the web can be established with the help of the products and services of service providers and system suppliers.  Those concerned about these things should already be starting to think about how they can be helped not only to publish linked data in a form that the search engines can consume, but also how their resources can become linked via those hubs to the wider web.

By lighting a linked data beacon on top of their web presence, a library will announce to the world the availability of their resources.  One beacon is not enough.  A web of beacons (the web of library data) will alert the search engines to the mass of those resources in all libraries, so that they can then lead users via that web to the most appropriately located individual resource.

This won’t happen overnight, but we are certainly in for some interesting times ahead.

Beacons picture from wallpapersfor.me

From Records to a Web of Library Data – Pt2 Hubs of Authority

As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series.  That one – Entification, and the next in the series – Beacons of Availability, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek.  Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and chair of the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC products/services or recommendations from the W3C Group.

Hubs of Authority

Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years.  The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world, either in national libraries or in cooperative organisations.  Two from personal experience come to mind: BLCMP, started in Birmingham, UK in 1969, eventually evolved into the leading Semantic Web organisation Talis; and in 1967 Dublin, Ohio saw the creation of OCLC.  Both, in their own ways, have had significant impact on the worlds of libraries, metadata, and the web (and me!).

One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records.  A large number of national libraries have such lists of agreed formats for author and organisational names.  The Library of Congress has, in addition to its name authorities, lists for subjects, classifications, languages, countries, etc.  Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations.

These authority files play a major role in the efficient cataloguing of material today, either by being part of the workflow in a cataloguing interface, or often just using the wonders of Windows ^C & ^V keystroke sequences to transfer agreed format text strings from authority sites into Marc record fields.

It is telling that the default [librarian] description of these things is a file – an echo back to the days when they were just that, a file containing a list of names.  Almost despite that initial purpose, authorities are gaining a wider one: as a source of names for, and growing descriptions of, the entities that the library world is aware of.  Many authority file hosting organisations have followed the natural path, in this emerging world of Linked Data, of providing persistent URIs for each concept and publishing their information as RDF.

These Linked Data enabled sources of information are developing importance in their own right, as a natural place to link to when asserting the thing, person, or concept you are identifying in your data.  As Sir Tim Berners-Lee’s fourth principle of Linked Data tells us: “Include links to other URIs, so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other.  A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.

Libraries and librarians have a great brand image, something that attaches itself to the data and services they publish on the web.  Respected and trusted are a couple of words that naturally associate with bibliographic authority data emanating from the library community.  This data, starting to add value to the wider web, comes from those Marc records I spoke about last time.  Yet it does not, as yet, lead those navigating the web of data to those resources so carefully catalogued.  In this case, instead of cataloguing so people can find stuff, we could be considered to be enriching the web with hubs of authority derived from, but not connected to, the resources that brought them into being.

So where next?  One obvious move, that is already starting to take place, is to use the identifiers (URIs) for these authoritative names to assert, within our data, facts such as who a work is by and what it is about.  Check out data from the British National Bibliography or the linked data hidden in the tab at the bottom of a WorldCat display – you will see VIAF, LCSH and other URIs asserting connections with known resources.  In this way, processes no longer need to infer from the characters on a page that they are connected with a person or a subject.  It is a fundamental part of the data.
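
The shift is easy to see in a small sketch.  This is the pattern rather than the actual BNB or WorldCat data – the bib URI is invented and the authority identifiers are shown for illustration only (look the real ones up at viaf.org and id.loc.gov):

    @prefix schema: <http://schema.org/> .

    <http://example.org/bib/555> a schema:Book ;
        schema:name "Weaving the Web" ;
        schema:author "Berners-Lee, Tim" ;            # the old string form
        schema:author <http://viaf.org/viaf/12345> ;  # ...and the same person as a URI (placeholder id)
        schema:about <http://id.loc.gov/authorities/subjects/sh00000000> .  # a subject as an LCSH URI (placeholder id)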

With that large amount of rich [linked] data, and the association of the library brand, it is hardly surprising that these datasets are moving beyond mere nodes on the web of data.  They are evolving into Hubs of Authority, building a framework on which libraries, and the rest of the web, can hang descriptions of, and signposts to, our resources.  A framework that has uses and benefits beyond the boundaries of bibliographic data.  By not keeping those hubs ‘library only’, we enable the wider web to build pathways to the library curated resources people need to support their research, learning, discovery and entertainment.

Image by the trial on Flickr

From Records to a Web of Library Data – Pt1 Entification

As is often the way, you start a post without realising that it is part of a series of posts – as with this one.  This, and the following two posts in the series – Hubs of Authority, and Beacons of Availability – together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek.  Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and chair of the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC products/services or recommendations from the W3C Group.

Entification

Entification – a bit of an ugly word, but in my day to day existence one I am hearing more and more. What an exciting life I lead…

What is it, and why should I care, you may be asking.

I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and also as a potential stimulator of significant efficiencies in the creation and management of information about those resources.  Taking those benefits as being accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.

That phrase ‘getting library data into a linked data form’ hides a multitude of issues.  There are some obvious steps such as holding and/or outputting the data in RDF, providing resources with permanent URIs, etc.  However, deriving useful library linked data from a source, such as a Marc record, requires far more than giving it a URI and encoding what you know, unchanged, as RDF triples.

Marc is a record based format.  For each book catalogued, a record is created.  The mantra driven into future cataloguers at library school has been, and I believe often still is, catalogue the item in your hand. Everything discoverable about the item in their hand is transferred onto that [now virtual] catalogue card stored in their library system.  In that record we get obvious bookish information such as title, size, format, number of pages, ISBN, etc.  We also get information about the author (name, birth/death dates etc.), publisher (location, name etc.), classification scheme identifiers, subjects, genres, notes, holding information, etc., etc., etc.  A vast amount of information about, and related to, that book in a single record.  A significant achievement – assembling all this information for the vast majority of books in the vast majority of the libraries of the world.  In this world of electronic resources the pattern is being repeated for articles, journals, eBooks, audiobooks, etc.

Why do we catalogue?  A question I often ask, with an obvious answer – so that people can find our stuff.  Replicating the polished drawers of catalogue cards of old, ordered by author name or subject, indexes are applied to the strings stored in those records.  These indexes act as search access points to a library’s collection.

A spin-off of capturing information about library books/articles/etc. in record attributes is that we are also building up information about authors, publishers, subjects and classifications.  So, for instance, a subject index will contain a list of all the names of the subjects addressed by an individual library collection.  To apply some consistency between libraries, authorities – authoritative sets of names, subject headings, etc. – have emerged so that spellings and name formats could be shared in a controlled way between libraries and cataloguers.

So where does entification come in?  Well, much of the information about authors, subjects, publishers, and the like is locked up in those records.  A record could be taken as describing an entity, the book; however, the other entities in the library universe are described only as attributes of the book/article/text.  I can attest to the vast computing power and intellectual effort that goes into efforts at OCLC to mine these attributes from records to derive descriptions of the entities they represent – the people, places, organisations, subjects, etc. that the resources are by, about, or related to in some way.

Once the entities are identified, and a model is produced & populated from the records, we can start to work with a true multi-dimensional view of our domain.  A major step forward from the somewhat singular view that we have been working with over previous decades.  With such a model it should be possible to identify and work with new relationships, such as publishers and their authors, subjects and collections, works and their available formats.
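
A small, entirely invented sketch (using schema.org terms simply because they are widely understood) shows the difference: the author, publisher and subject stop being strings inside the book's record and become entities in their own right, each with a URI that other descriptions can point to:

    @prefix schema: <http://schema.org/> .

    <http://example.org/work/w1> a schema:CreativeWork ;
        schema:name "A History of Everything" ;
        schema:author <http://example.org/person/p1> ;
        schema:publisher <http://example.org/org/o1> ;
        schema:about <http://example.org/concept/c1> .

    <http://example.org/person/p1> a schema:Person ;
        schema:name "Smith, Jane" .

    <http://example.org/org/o1> a schema:Organization ;
        schema:name "Example University Press" .

    <http://example.org/concept/c1> a schema:Thing ;    # e.g. a subject heading
        schema:name "World history" .

Relationships such as publishers and their authors then become simple traversals across these entities, rather than string matching within records.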

We are in a state of change in the library world, which entification of our data will help us get to grips with.  As you can imagine, as these new approaches crystallise they are leading to all sorts of discussions around what are the major entities we need to concern ourselves with; how do we model them; how do we populate that model from source [record] data; how do we do it without compromising the rich resources we are working with; and how do we continue to provide and improve the services relied upon at the moment, whilst change happens.  Challenging times – bring on the entification!

Russian doll image by smcgee on Flickr

Who Will Be Mostly Right – Wikidata, Schema.org?

Two, on the surface, totally unconnected posts – yet the same message.  Well that’s how they seem to me anyway.

Post 1 – The Problem With Wikidata from Mark Graham, writing in the Atlantic.

When I reported the announcement of Wikidata by Denny Vrandecic at the Semantic Tech & Business Conference in Berlin in February, I was impressed with the ambition to bring together all the facts from all the different language versions of Wikipedia in a central Wikidata instance with a single page per entity. These single pages would draw together all references to the entities and engage with a sustainable community to manage this machine-readable resource.  This data would then be used to populate the info-boxes of all versions of Wikipedia, in addition to being an open resource of structured data for all.

In his post Mark raises concerns that this approach could result in the loss of the diversity of opinion currently found in the diverse Wikipedias:

It is important that different communities are able to create and reproduce different truths and worldviews. And while certain truths are universal (Tokyo is described as a capital city in every language version that includes an article about Japan), others are more messy and unclear (e.g. should the population of Israel include occupied and contested territories?).

He also highlights issues about the unevenness or bias of contributors to Wikipedia:

We know that Wikipedia is a highly uneven platform. We know that not only is there not a lot of content created from the developing world, but there also isn’t a lot of content created about the developing world. And we also know that, even within the developed world, a majority of edits are still made by a small core of (largely young, white, male, and well-educated) people. For instance, there are more edits that originate in Hong Kong than all of Africa combined; and there are many times more edits to the English-language article about child birth by men than women.

A simplistic view of what Wikidata is attempting to do could be a majority-rules filter on what is correct data, where low volume opinions are drowned out by that majority.  If Wikidata is successful in its aims, it will not only become the single source for info-box data in all versions of Wikipedia, but it will take over the mantle currently held by DBpedia as the de facto link-to place for identifiers and associated data on the Web of Data and the wider Web.

I share some of his concerns, but also draw comfort from some of the things Denny said in Berlin – “WikiData will not define the truth, it will collect the references to the data….  WikiData created articles on a topic will point to the relevant Wikipedia articles in all languages.”  They obviously intend to capture facts described in different languages; the question is whether they will also preserve the local differences in assertion.  In a world where we still cannot totally agree on the height of our tallest mountain, we must be able to take account of and report differences of opinion.

Post 2 – Danbri has moved on – should we follow? by a former colleague Phil Archer.

The Danbri in question is Dan Brickley, one of the original architects of the Semantic Web, now working for Google on Schema.org.  Dan presented at an excellent Semantic Web Meetup, which I attended at the BBC Academy a couple of weeks back.  This was a great event.  I recommend investing the time to watch the videos of Dan and all the other speakers.

Phil picked out a section of Dan’s presentation for comment:

In the RDF community, in the Semantic Web community, we’re kind of polite, possibly too polite, and we always try to re-use each other’s stuff. So each schema maybe has 20 or 30 terms, and… schema.org has been criticised as maybe a bit rude, because it does a lot more it’s got 300 classes, 300 properties but that makes things radically simpler for people deploying it. And that’s frankly what we care about right now, getting the stuff out there. But we also care about having attachment points to other things…

Then, reflecting on current practice in Linked Data, Phil went on to postulate:

… best practice for the RDF community…  …i.e. look at existing vocabularies, particularly ones that are already widely used and stable, and re-use as much as you can. Dublin Core, FOAF – you know the ones to use.

Except schema.org doesn’t.

schema.org has its own term for name, family name and given name which I chose not to use at least partly out of long term loyalty to Dan. But should that affect me? Or you? Is it time to put emotional attachments aside and move on from some of the old vocabularies and at least consider putting more effort into creating a single big vocabulary that covers most things with specialised vocabularies to handle the long tail?

As the question in the title of his post implies, should we move on and start adopting, where applicable, terms from the large and extending Schema.org vocabulary when modelling and publishing our data?  Or should we stick with the current collection of terms from suitable smaller vocabularies?

One of the common issues when people first get to grips with creating Linked Data is which terms from which vocabularies to use for their data, and where to find out.  I have watched the frown skip across several people’s faces when you first tell them that foaf:name is a good attribute to use for a person’s name in a data set that has nothing to do with friends or friends of friends. It is very similar to the one they give you when you suggest that it may also be good for something that isn’t even a person.

As Schema.org grows and, enticed by the obvious SEO benefits in the form of Rich Snippets, becomes rapidly adopted by a community far greater than the Semantic Web and Linked Data communities, why would you not default to using terms from its vocabulary?  Another former colleague, David Wood, tweeted ‘No’ in answer to Phil’s question – I think in retrospect this may seem a King Canute style proclamation.  If my predictions are correct, it won’t be too long before we are up to our ears in structured data on the web, most of it marked up using terms to be found at schema.org.

You may think that I am advocating the death, and replacement by Schema.org, of all the vocabularies, well known and obscure, in use today – far from it.  When modelling your [Linked] data, start by using terms that have been used before, then build on terms more specific to your domain, and finally you may have to create your own vocabulary/ontology.  What I am saying is that as Schema.org becomes established, its growing collection of 300+ terms will become the obvious start point in that process.
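
For illustration only (the report URI and the coined term below are invented), that layered process might end up looking like this, with a broad Schema.org term first, an established specialised term next, and a home-grown term only where nothing suitable exists:

    @prefix schema:  <http://schema.org/> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix ex:      <http://example.org/vocab#> .

    <http://example.org/report/2012-03> a schema:CreativeWork ;
        schema:name "Quarterly data report" ;    # broad, widely understood term
        dcterms:issued "2012-03-31" ;            # an existing specialised vocabulary
        ex:reviewCycle "quarterly" .             # a term coined for this domain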

OK, a couple of interesting posts, but where is the similar message and connection?  I see it as democracy of opinion.  Not the democracy of the modern western political system, where we have a stand-up shouting match every few years followed by a fairly stable period where the rules are enforced by one view.  More the traditional, possibly romanticised, view of democracy where the majority leads the way but without disregarding the opinions of the few.  Was it the French Enlightenment philosopher Voltaire who said “I may hate your views, but I am willing to lay down my life for your right to express them”?  A bit extreme when discussing data and ontologies, but the spirit is right.

Once the majority of general data on the web is marked up with schema.org terms, it would be short sighted to ignore the gravitational force it will exert in the web of data if you want your data to be linked to and found.  However, it will be incumbent on those behind Schema.org to maintain their ambition to deliver easy linking to more specialised vocabularies via their extension points.  This way the ‘how’ of data publishing should become simpler, more widespread, and extensible.  On the ‘what’ side of the [structured] data publishing equation, the Wikidata team has an equal responsibility to not only publish the majority definition of facts, but also clearly reflect the views of minorities – not a simple balancing act, as often those with the more extreme views have the loudest voices.

Main image via democracy.org.au.

Semantic Search, Discovery, and Serendipity

So I need to hang up some tools in my shed.  I need some bent hook things – I think.  Off to the hardware store, in which I search for the fixings section.  Following the signs hanging from the roof, my search is soon directed to a rack covered in lots of individual packets and I spot the thing I am looking for, but what’s this – they come in lots of different sizes.  After a bit of localised searching I grab the size I need, but wait – in the next rack there are some specialised tool hanging devices.  Square hooks, long hooks, double-prong hooks, spring clips, an amazing choice!  Pleased with what I discovered and selected, I’m soon heading down the aisle when my attention is drawn to a display of shelving with hidden brackets – just the thing for under the TV in the lounge.  I grab one of those and head for the checkout before my credit card regrets my discovering anything else.

We all know the library ‘browse’ experience.  Head for a particular book, and come away with a different one on the same topic that just happened to be on a nearby shelf, or even a totally different one that you ‘found’ on the recently returned books shelf.

An ambition for the web is to reflect and assist what we humans do in the real world.  Search has only brought us part of the way. By identifying key words in web page text, and links between those pages, it makes a reasonable stab at identifying things that might be related to the keywords we enter.

As I commented recently, Semantic Search messages coming from Google indicate that they are taking significant steps towards that ambition.  By harvesting Schema.org described metadata embedded in html by webmasters enticed by Rich Snippets, and building on the 12 million entity descriptions in Freebase, they are amassing the fuel for a better search engine.  A search engine [that] will better match search queries with a database containing hundreds of millions of “entities”—people, places and things.

How much closer will this better, semantic search get to being able to replicate online the scenario I shared at the start of this post?  It should do a better job of relating our keywords to the things that would be of interest, not just the pages about them.  Having a better understanding of entities should help with the Paris Hilton problem, or at least help us navigate around such issues.  That better understanding of entities, and related entities, should enable the return of related relevant results that did not contain our keywords.

But surely there is more to it than that.  Yes there is, but it is not search – it is discovery.  As in my scenario above, humans do not only search for things.  We search to get ourselves to a start point for discovery.  I searched for an item in the fixings section of the hardware store, or a book in the library; I then inspected related items on the rack and the shelf to discover whether there was anything more appropriate for my needs nearby.  By understanding things and the [semantic] relationships between them, systems could help us with that discovery phase. It is the search engine’s job to expose those relationships, but the prime benefit will emerge when the source web sites start doing it too.

Take what is still one of my favourite sites – BBC wildlife.  Take a look at the Lion page, found by searching for lions in Google. Scroll down a bit and you will see listed the lion’s habitats and behaviours.  These are all things or concepts related to the lion.  Follow the link to the flooded grassland habitat, where you will find lists of the flora and fauna found there, including the aardvark, which is nocturnal.  Such follow-your-nose navigation around the site supports the discovery method of finding things that I describe.  In such an environment serendipity is only a few clicks away.
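
Underneath, that experience only needs a handful of typed relationships between things.  The sketch below is invented (it is not the BBC's actual ontology or data), but it shows the kind of links that let you hop from lion to habitat to aardvark to nocturnal:

    @prefix ex: <http://example.org/wildlife#> .

    # Invented relationships supporting follow-your-nose navigation.
    ex:lion      ex:livesIn     ex:floodedGrassland .
    ex:aardvark  ex:livesIn     ex:floodedGrassland ;
                 ex:adaptation  ex:nocturnal .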

There are two sides to the finding stuff coin – Search and Discovery.  Humans naturally do both; systems and the web are only just starting to move beyond search alone.  This move is being enabled by the constantly growing data that is describing things and their relationships – Linked Data.  A growth stimulated by initiatives such as Schema.org, and by Google providing quick return incentives, such as Rich Snippets & SEO goodness, for folks to publish structured data for reasons other than a futuristic Semantic Web.

Is Linked Data DIY a Good Idea?

Most Semantic Web and Linked Data enthusiasts will tell you that Linked Data is not rocket science, and it is not.  They will tell you that RDF is one of the simplest data forms for describing things, and they are right.  They will tell you that adopting Linked Data makes merging disparate datasets much easier to do, and it does. They will say that publishing persistent globally addressable URIs (identifiers) for your things and concepts will make it easier for others to reference and share them, it will.  They will tell you that it will enable you to add value to your data by linking to and drawing in data from the Linked Open Data Cloud, and they are right on that too.  Linked Data technology, they will say, is easy to get hold of either by downloading open source or from the cloud, yup just go ahead and use it.  They will make you aware of an ever increasing number of tools to extract your current data and transform it into RDF, no problem there then.

So would I recommend a self-taught do-it-yourself approach to adopting Linked Data?  For an enthusiastic individual, maybe.  For a company or organisation wanting to get to know and then identify the potential benefits, no I would not.  Does this mean I recommend outsourcing all things Linked Data to a third party?  Definitely not.

Let me explain this apparent contradiction.  I believe that anyone holding, or who could benefit from consuming, significant amounts of data can realise benefits by adopting Linked Data techniques and technologies.  These benefits could be in the form of efficiencies, data enrichment, new insights, SEO benefits, or even business models.  Gaining the full effect of these benefits will come not only from adopting the technologies but also from adopting the different way of thinking, often called open-world thinking, that comes from understanding the Linked Data approach in your context.  That change of thinking, and the agility it also brings, will only embed in your organisation if you do it yourself.  However, I do counsel care in the way you approach gaining this understanding.

A young child wishing to keep up with her friends by migrating from tricycle to bicycle may have a go herself, but may well give up after the third grazed knee.  The helpful, if out of breath, dad jogging along behind, providing a stabilising hand, helpful guidance, encouragement, and warnings to stay on the side of the road, will result in a far less painful and far more rewarding experience.

I know of computer/business professionals who are not aware of what Linked Data is, or of the benefits it could provide. There are others who have looked at it, do not see how it could be better, but do see potential grazed knees if they go down that path.  And there are yet others who have had a go, but without a steadying hand to guide them, and end up still not getting it.

You want to understand how Linked Data could benefit your organisation?  Get some help to relate the benefits to your issues, challenges and opportunities.  Don’t go off to a third party and get them to implement something for you.  Bring in a steadying hand, encouragement, and guidance to stay on track.  Don’t go off and purchase expensive hardware and software to help you explore the benefits of Linked Data.  There are plenty of open source stores, or even better just sign up to a cloud based service such as Kasabi.  Get your head around what you have, how you are going to publish and link it, and what the usage might be.  Then you can size and specify the technology and/or service you need to support it.

So back to my original question – Is Linked Data DIY a good idea?  Yes it is. It is the only way to reap the ‘different way of thinking’ benefits that accompany understanding the application of Linked data in your organisation.  However, I would not recommend a do-it-yourself introduction to this.  Get yourself a steadying hand.

Is that last statement a thinly veiled pitch for my services?  Of course it is, but that should not dilute my advice to get some help when you start, even if it is not from me.

Picture of girl learning to ride from zsoltika on Flickr.
Source of cartoon unknown.

A Data 7th Wave Approaching

Some in the surfing community will tell you that every seventh wave is a big one.  I am getting the feeling, in the world of the Web, that a number seven is up next and this one is all about data. The last seventh wave was the Web itself.  Because of that, it is a little constraining to talk about this next one only affecting the world of the Web.  This one has the potential to shift some significant rocks around on all our beaches and change the way we all interact with, and think about, the world around us.

Sticking with the seashore metaphor for a short while longer: waves from the technology ocean have the potential to wash into the bays and coves of interest on the coast of human endeavour and rearrange the pebbles on our beaches.  Some do not reach every cove, and/or only have minor impact; however, some really big waves reach in everywhere to churn up the sand and rocks, significantly changing the way we do things and ultimately think about the world around us.  The post-Web technology waves have brought smaller yet important influences such as ecommerce, social networking, and streaming.

I believe Data, or more precisely changes in how we create, consume, and interact with data, has the potential to deliver a seventh wave impact.  Enough of the grandiose metaphors and down to business.

Data has been around for centuries, from clay tablets to little cataloguing tags on the end of scrolls in ancient libraries, and on into the computerised databases that we have been accumulating since the 1960s.  Up until very recently these [digital] data have been closed – constrained by the systems that used them, only exposed to the wider world via user interfaces and possibly a task/product specific API.  With the advent of many data associated advances, variously labelled Big Data, Social Networking, Open Data, Cloud Services, Linked Data, Microformats, Microdata, Semantic Web, Enterprise Data, data is now venturing beyond those closed systems into the wider world.

Well this is nothing new, you might say; these trends have been around for a while – why do they constitute the seventh wave you foretell?

It is precisely because these trends have been around for a while, and are starting to mature and influence each other, that they are building to form something really significant.  Take Open Data for instance where governments have been at the forefront – I have reported before about the almost daily announcements of open government data initiatives.  The announcement from the Dutch City of Enschede this week not only talks about their data but also about the open sourcing of the platform they use to manage and publish it, so that others can share in the way they do it.

In the world of libraries, the Ontology Engineering Group (OEG) at the  Universidad Politécnica de Madrid are providing a contribution of linked bibliographic data to the gathering mass, alongside the British and Germans, with 2.4 Million bibliographic records from the Spanish National Library.  This adds weight to the arguments for a Linked Data future for libraries proposed by the Library of Congress and Stanford University.

I might find some of the activities in the Cloud Computing arena short-sighted and depressing, yet already the concept of housing your data somewhere other than in a local datacenter is becoming accepted in most industries.

Enterprise use of Linked Data by leading organisations such as the BBC, who are underpinning their online Olympics coverage with it, is showing that it is more than a research tool, or the province only of the open data enthusiasts.

Data Marketplaces are emerging to provide platforms to share and possibly monetise your data.  An example that takes this one step further is Kasabi.com from the leading Semantic Web technology company, Talis.  Kasabi introduces the mixing, merging, and standardised querying of Linked Data into the data publishing concept.  This potentially provides a platform for refining and mixing raw data into new data alloys and products more valuable and useful than their component parts.  An approach that should stimulate innovation both in the enterprise and in the data enthusiast community.

The Big Data community is demonstrating that there are solutions to handling the vast volumes of data we are producing, solutions that require us to move out of the silos of relational databases towards a mixed economy.  Programs need to move – not the data. NoSQL databases, Hadoop, map/reduce – these are all things that are starting to move out of the labs and the hacker communities into the mainstream.

The Social Networking industry, which produces tons of data, is a rich field for things like sentiment analysis, trend spotting, targeted advertising, and even short term predictions – innovation in this field has been rapid, but I would suggest a little hampered by delivering closed individual solutions that as yet do not interact with the wider world that could place them in context.

I wrote about Schema.org a while back.  An initiative from the search engine big three to encourage the SEO industry to embed simple structured data in their html.  The carrot they are offering for this effort is enhanced display in results listings – Google calls these Rich Snippets.  When first announced, the schema.org folks concentrated on Microdata as the embedding format – something that wouldn’t frighten the SEO community horses too much.  However they did [over a background of loud complaining from the Semantic Web / Linked Data enthusiasts that RDFa was the only way] also indicate that RDFa would eventually be supported.  By engaging with SEO folks on terms that they understand, this move from Schema.org had the potential to get far more structured data published on the Web than any TED Talk from Sir Tim Berners-Lee, preaching from people like me, or guidelines from governments could ever do.

The above short list of pebble stirring waves is both impressive in its breadth and encouraging in its potential, yet none of them are the stuff of a seventh wave.

So what caused me to open up my MacBook and start writing this?  It was a post from Manu Sporny, indicating that Google were not waiting for RDFa 1.1 Lite (the RDFa version that schema.org will support) to be ratified.  They are already harvesting, and using, structured information from web pages that has been encoded using RDFa.  The use of this structured data has resulted in enhanced display on the Google pages with items such as event date & location information, and recipe preparation timings.
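
To give a feel for it (this is an invented sketch, not Google's internal representation), the statements a crawler can lift out of such markup amount to triples like these, shown here as Turtle rather than as the html attributes they are embedded in:

    @prefix schema: <http://schema.org/> .

    # An event with the date and location details that drive the enhanced display.
    <http://example.org/events/opening-night> a schema:Event ;
        schema:name "Gallery opening night" ;
        schema:startDate "2012-06-21T19:00" ;
        schema:location <http://example.org/places/town-gallery> .

    <http://example.org/places/town-gallery> a schema:Place ;
        schema:name "Town Gallery" .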

Manu references sites that seem to be running Drupal, the open source CMS software, and specifically a Drupal plug-in for rendering Schema.org data encoded as RDFa.  This approach answers some of the critics of embedding Schema.org data into a site’s html, especially as RDFa, who say it is ugly and difficult to understand.  It is not there for humans to parse or understand and, with modules such as the Drupal one, humans will not need to get their hands dirty down at code level.  Currently Schema.org supports a small but important number of ‘things’ in its recognised vocabularies.  These, currently supplemented by GoodRelations and Recipes, will hopefully be joined by others to broaden the scope of descriptive opportunities.

So roll the clock forward, not too far, to a landscape where a large number of sites (incentivised by the prospect of listings as enriched as their competitors’ results) are embedding structured data in their pages as normal practice.  By then most if not all web site delivery tools should be able to embed the Schema.org RDF data automatically.  Google and the other web crawling organisations will rapidly build up a global graph of the things on the web, their types, relationships and the pages that describe them.  A nifty example of providing a very specific, easily understood benefit in return for a change in the way web sites are delivered, resulting in a global shift in the amount of structured data accessible for the benefit of all.  Google Fellow and SVP Amit Singhal recently gave insight into this Knowledge Graph idea.

The Semantic Web / Linked Data proponents have been trying to convince everyone else of the great good that will follow once we have a web interlinked at the data level with meaning attached to those links.  So far this evangelism has had little success.  However, this shift may give them what they want via an unexpected route.

Once such a web emerges, and most importantly is understood by the commercial world, innovations that will influence the way we interact will naturally follow.  A Google TV, with access to such a rich resource, should have no problem delivering an enhanced viewing experience by following structured links embedded in a programme page to information about the cast, the book of the film, the statistics that underpin the topic, or other programmes from the same production company.  Our next-but-one iPhone could be a personal node in a global data network, providing access to relevant information about our location, activities, social network, and tasks.

These slightly futuristic predictions will only become possible on top of a structured network of data, which I believe is what could very well emerge if you follow through on the signs that Manu is pointing out.  Reinforced by, and combining with, the other developments I reference earlier in this post, I believe we may well have a seventh wave approaching.  Perhaps I should look at the beach again in five years’ time to see if I was right.

Wave photo from Nathan Gibbs in Flickr
Declarations – I am a Kasabi Partner and shareholder in Kasabi parent company Talis.

A School for Data Wranglers

The Open Knowledge Foundation, in association with the Peer 2 Peer University, have announced plans to launch a School of Data to increase the skill base of data wranglers and scientists.

So why do we need this?

Data is going to become more core to our world than we could ever have imagined a few short years ago.  Although we have been producing it for decades, data has either been treated as something in the core of a project not to expose to prying eyes, or often as a toxic waste product of business processes.  Some of the traditional professions that have emerged to look after and work with these data reflect this relationship between us and our digital assets. In the data warehouse, they archive, preserve, catalogue, and attempt to make sense of vast arrays of data.  The data miners precariously dig through mountains of data as it shifts and settles around them, propping up their expensive burrows with assumptions and inferred relationships, hoping a change in the strata does not cause a logical cave-in and they have to start again.

As I have postulated previously, I believe we are on the edge of a new revolution where data becomes a new raw material that drives the emergence of new industries, analogous to the emergence of manufacturing as a consequence of the industrial revolution.  As this new era rolls out, the collection of data wrangling enthusiasts that have done a great job in getting us thus far will not be sufficient to sustain a new industry of extracting, transforming, linking, augmenting, analysing and publishing data.

So this initiative from the OKF & P2PU is very welcome:

The explosive growth in data, especially open data, in recent years has meant that the demand for data skills — for data “wranglers” or “scientists” — has been growing rapidly. Moreover, these skills aren’t just important for banks, supermarkets or the next silicon valley start-up, they are also going to be crucial in research, in journalism, and in civil society organizations (CSOs).

However, there is currently a significant shortfall of data “wranglers” to satisfy this growing demand, especially in civil society organisations — McKinsey expects a skills shortage in data expertise to reach 50-60% by 2018 in the US alone.

It is welcome not just because they are doing it, but also because of who they are and the direction they are taking:

The School of Data will adopt the successful peer-to-peer learning model established by P2PU and Mozilla in their ‘School of Webcraft’ partnership. Learners will progress by taking part in ‘learning challenges’ – series of structured, achievable tasks, designed to promote collaborative and project-based learning.

As learners gain skills, their achievements will be rewarded through assessments which lead to badges. Community support and on-demand mentoring will also be available for those who need it.

They are practically approaching real world issues and tasks from the direction of the benefit to society of opening up data.  Taking this route will engage those that have the desire, need and enthusiasm to become either part-time or full-time data wranglers.  Hopefully these will establish an ethos that will percolate into commercial organisations, taking an open world view with it.  I am not suggesting that commerce should be persuaded to freely and openly share all their data, but they should learn the techniques of the open data community as the best way to share data under whatever commercial and licensing conditions are appropriate.