Schema.org 2.0

About a month ago Version 2.0 of the Schema.org vocabulary hit the streets.

This update includes loads of tweaks, additions and fixes that can be found in the release information.  The automotive folks have got new vocabulary for describing Cars, including useful properties such as numberOfAirbags, fuelEfficiency, and knownVehicleDamages. The new property mainEntityOfPage (and its inverse, mainEntity) provides the ability to tell search engine crawlers which thing a web page is really about.  With the new type ScreeningEvent to support movie/video screenings, a gtin12 property for Product, and more besides, there is much useful stuff in there.
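To make these terms a little more concrete, here is a minimal sketch of what JSON-LD markup using some of the new 2.0 vocabulary might look like. The vehicle details and the page URL are invented purely for illustration.

```python
import json

# A sketch of JSON-LD markup using some of the vocabulary added in
# Schema.org 2.0.  The vehicle details below are hypothetical.
car_markup = {
    "@context": "http://schema.org",
    "@type": "Car",
    "name": "Example Family Hatchback",          # invented for illustration
    "numberOfAirbags": "6",
    "knownVehicleDamages": "Minor scratch on rear bumper",
    # mainEntityOfPage ties this description to the page that is
    # primarily about the entity (the URL here is hypothetical)
    "mainEntityOfPage": "http://example.com/cars/hatchback",
}

print(json.dumps(car_markup, indent=2))
```

Embedded in a page (for example as a script block of type application/ld+json), markup of this shape is what crawlers would pick up.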

But does this warrant the version number clicking over from 1.xx to 2.0?

These new types and properties are only the tip of the 2.0 iceberg.  There is a heck of a lot of other stuff going on in this release apart from these additions.  Some of it is in the vocabulary itself; some in the potential, documentation, supporting software, and organisational processes around it.

Sticking with the vocabulary for the moment, there has been a bit of cleanup around property names. As the vocabulary has grown organically since its release in 2011, inconsistencies and conflicts between different proposals have crept in, so part of the 2.0 effort has included some rationalisation.  For instance the Code type is being superseded by SoftwareSourceCode – the term code has many different meanings, most of which have nothing to do with software; surface has been superseded by artworkSurface, and area by serviceArea, for similar reasons. Check out the release information for full details.  If you are using any of the superseded terms there is no need to panic: the original terms are still valid, but their descriptions have been updated to indicate that they have been superseded.  However, you are encouraged to move towards the updated terminology as convenient.

The question of what is in which version brings me to an enhancement to the supporting documentation.  Starting with Version 2.0, a snapshot view of the full vocabulary will be published for each release – here is http://schema.org/version/2.0.  So if you want to refer to a term at a particular version, you now can.
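A data publisher migrating at their own pace could handle these supersessions with a simple lookup, sketched below. The mapping covers only the examples mentioned above; the full list is in the release information.

```python
# A minimal sketch of handling the Schema.org 2.0 supersessions
# discussed above.  Only the three examples from the text are mapped;
# superseded terms remain valid, so unmapped terms pass through.
SUPERSEDED = {
    "Code": "SoftwareSourceCode",
    "surface": "artworkSurface",
    "area": "serviceArea",
}

def preferred_term(term):
    """Return the currently preferred term, or the term itself
    if it has not been superseded."""
    return SUPERSEDED.get(term, term)

print(preferred_term("Code"))   # → SoftwareSourceCode
print(preferred_term("Book"))   # → Book (not superseded)
```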

How often is Schema.org being used? – a question often asked. A new feature has been introduced to give you some indication.  Check out the description of one of the newly introduced properties, mainEntityOfPage, and you will see the following: ‘Usage: Fewer than 10 domains’.  Unsurprisingly for a newly introduced property, there is virtually no usage of it yet.  If you look at the description for the type this term is used with, CreativeWork, you will see ‘Usage: Between 250,000 and 500,000 domains’.  Not a direct answer to the question, but a good and useful indication of the popularity of a particular term across the web.

Extensions
In the release information you will find the following cryptic reference: ‘Fix to #429: Implementation of new extension system.’

This refers to the introduction of the functionality, on the Schema.org site, to host extensions to the core vocabulary.  The motivation for this new approach to extending is explained thus:

Schema.org provides a core, basic vocabulary for describing the kind of entities the most common web applications need. There is often a need for more specialized and/or deeper vocabularies, that build upon the core. The extension mechanisms facilitate the creation of such additional vocabularies.
With most extensions, we expect that some small frequently used set of terms will be in core schema.org, with a long tail of more specialized terms in the extension.

As yet there are no extensions published.  However, there are some on the way.

As Chair of the Schema Bib Extend W3C Community Group I have been closely involved with a proposal by the group for an initial bibliographic extension (bib.schema.org) to Schema.org.  The proposal includes new Types for Chapter, Collection, Agent, Atlas, Newspaper & Thesis; CreativeWork properties to describe the relationship between translations; plus types & properties to describe comics.  I am also following the proposal’s progress through the system – a bit of a learning exercise for everyone.  Hopefully I can share the news in the not too distant future that bib will be one of the first released extensions.

W3C Community Group for Schema.org
A subtle change has also taken place in the way the vocabulary, its proposals, extensions and direction can be followed and contributed to.  The creation of the Schema.org Community Group now provides an open forum for this.

So is 2.0 a bit of a milestone?  Yes, taking all things together, I believe it is. I get the feeling that Schema.org is maturing into the kind of vocabulary, supported by a professional community, that will add confidence for those using it and recommending it to others.

Baby Steps Towards A Library Graph

It is one thing to have a vision – regular readers of this blog will know I have them all the time – it’s yet another to see it starting to form through the mist into a reality. Several times in the recent past I have spoken of some of the building blocks for bibliographic data to play a prominent part in the Web of Data.  The Web of Data that is starting to take shape and drive benefits for everyone.  Benefits that for many are hiding in plain sight on the results pages of search engines: in those informational panels with links to people’s parents, universities, and movies, or maps showing the location of mountains and retail outlets; incongruously named Knowledge Graphs.

Building blocks such as Schema.org; Linked Data in WorldCat.org; moves to enhance Schema.org capabilities for bibliographic resource description; recognition that Linked Data has a beneficial place in library data and initiatives to turn that into a reality; the release of Work entity data mined from, and linked to, the huge WorldCat.org data set.

OK, you may say, we’ve heard all that before, so what is new now?

As always it is a couple of seemingly unconnected events that throw things into focus.

Event 1:  An article by David Weinberger in the DigitalShift section of Library Journal entitled Let The Future Go.  An excellent article telling libraries that they should not be so parochially focused on their own domain whilst looking to how they are going to serve their users’ needs in the future.  Get our data out there, everywhere, so it can find its way to those users, wherever they are, making it accessible to all.  David references three main ways to provide this access:

  1. APIs – to allow systems to directly access our library system data and functionality
  2. Linked Data – can help us open up the future of libraries. By making clouds of linked data available, people can pull together data from across domains
  3. The Library Graph –  an ambitious project libraries could choose to undertake as a group that would jump-start the web presence of what libraries know: a library graph. A graph, such as Facebook’s Social Graph and Google’s Knowledge Graph, associates entities (“nodes”) with other entities

(I am fortunate to be part of an organisation, OCLC, making significant progress on all three of these – the first is already baked into the core of OCLC products and services.)

It is the third of those, however, that triggered recognition for me.  Personally, I believe we should not be focusing on a specific ‘Library Graph’ but more on the ‘Library Corner of a Giant Global Graph’ – if graphs can have corners, that is.  Libraries have rich specialised resources, and specific needs and processes that may need special attention to enable the opening up of our data.  However, when opened up in the context of a graph, it should be part of the same graph that we all navigate in search of information, whoever and wherever we are.

Event 2: A posting by ZBW Labs Other editions of this work: An experiment with OCLC’s LOD work identifiers detailing experiments in using the OCLC WorldCat Works Data.

ZBW contributes to WorldCat, and has 1.2 million oclc numbers attached to it’s bibliographic records. So it seemed interesting, how many of these editions link to works and furthermore to other editions of the very same work.

The post is interesting from a couple of points of view.  Firstly, the simple steps they took to get at the data, really well demonstrated by the command-line calls used to access it: get the OCLC number data from WorldCat.org in JSON format – extract the schema:exampleOfWork link to the Work – get the Work data from WorldCat.org, also in JSON – parse out the links to other editions of the work and compare with their own data.  Command-line calls that were no doubt embedded in simple scripts.
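Those steps can be sketched in a few lines of Python. In practice each document would be fetched from WorldCat.org (for example by requesting http://worldcat.org/oclc/817185721 with a JSON-LD Accept header); here, tiny hand-made stand-ins illustrate the parsing, and their exact shape is an assumption about the real responses rather than a documented format.

```python
# A sketch of the ZBW Labs steps: manifestation → exampleOfWork →
# Work → links to other editions.  The two dicts below are invented
# stand-ins for the JSON-LD documents that would be fetched from
# WorldCat.org; the second OCLC number is hypothetical.
manifestation = {
    "@id": "http://worldcat.org/oclc/817185721",
    "exampleOfWork": "http://worldcat.org/entity/work/id/1151002411",
}

work = {
    "@id": "http://worldcat.org/entity/work/id/1151002411",
    "workExample": [
        "http://worldcat.org/oclc/817185721",
        "http://worldcat.org/oclc/82671871",   # hypothetical other edition
    ],
}

# Step 1: extract the schema:exampleOfWork link from the manifestation
work_uri = manifestation["exampleOfWork"]

# Step 2: from the Work, gather the links to the *other* editions
other_editions = [uri for uri in work["workExample"]
                  if uri != manifestation["@id"]]

print(work_uri)
print(other_editions)
```

The ZBW comparison step would then simply check which of those edition URIs correspond to OCLC numbers in their own records.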

Secondly, the implicit way that the corpus of WorldCat Work entity descriptions, and their canonical identifying URIs, is used as an authoritative hub for Works and their editions.  The concept is not new in the library world; we have been doing this sort of thing with names and person identities via other authoritative hubs, such as VIAF, for ages.  What is new here is that it is a hub for Works and their relationships, and the bidirectional nature of those relationships – work to edition, edition to work – the beginnings of a library graph linked to other hubs for subjects, people, etc.

The ZBW Labs experiment is interesting in its own right – a simple approach, enlightening results.  What is more interesting for me is that it demonstrates a baby step towards the way the Library corner of that Global Web of Data will not only naturally form (as we expose and share data in this way – linked entity descriptions), but naturally fit into future library workflows, with all sorts of consequential benefits.

The experiment is exactly the type of initiative that we hoped to stimulate by releasing the Works data.  Using it for things we never envisaged, delivering unexpected value to our community.  I can’t wait to hear about other initiatives like this that we can all learn from.

So who is going to be doing this kind of thing – describing entities and sharing them to establish the hubs (nodes) that will form the graph?  Some are already there, in the traditional authority file hubs: the Library of Congress LC Linked Data Service for authorities and vocabularies (id.loc.gov), VIAF, ISNI, FAST, the Getty vocabularies, etc.

As previously mentioned Work is only the first of several entity descriptions that are being developed in OCLC for exposure and sharing.  When others, such as Person, Place, etc., emerge we will have a foundation of part of a library graph – a graph that can and will be used, and added to, across the library domain and then on into the rest of the Global Web of Data.  An important authoritative corner, of a corner, of the Giant Global Graph.

As I said at the start these are baby steps towards a vision that is forming out of the mist.  I hope you and others can see it too.

(Toddler image: Harumi Ueda)

WorldCat Works – 197 Million Nuggets of Linked Data

They’re released!

A couple of months back I spoke about the preview release of Works data from WorldCat.org.  Today OCLC published a press release announcing the official release of 197 million descriptions of bibliographic Works.

A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work.  The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary.  In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, OCLC numbered, editions already shared from WorldCat.org.

They look a little different to the kind of metadata we are used to in the library world.  Check out this example <http://worldcat.org/entity/work/id/1151002411> and you will see that, apart from name and description strings, it is mostly links.  It is linked data, after all.

These links (URIs) lead, where available, to authoritative sources for people, subjects, etc.  When not available, placeholder URIs have been created to capture information not yet available or identified in such authoritative hubs.  As you would expect from a linked data hub the works are available in common RDF serializations – Turtle, RDF/XML, N-Triples, JSON-LD – using the Schema.org vocabulary – under an open data license.

The obvious question is “how do I get a work id for the items in my catalogue?”.  The simplest way is to use the already released linked data from WorldCat.org. If you have an OCLC Number (e.g. 817185721) you can create the URI for that particular manifestation by prefixing it with ‘http://worldcat.org/oclc/’ thus: http://worldcat.org/oclc/817185721
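That prefixing step is trivial to script – a one-function sketch:

```python
# Sketch of the URI pattern described above: prefix an OCLC number
# with 'http://worldcat.org/oclc/' to get the manifestation URI.
def oclc_uri(oclc_number):
    return "http://worldcat.org/oclc/%s" % oclc_number

print(oclc_uri(817185721))  # → http://worldcat.org/oclc/817185721
```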

In the linked data that is returned, either on screen in the Linked Data section or in the RDF in your desired serialization, you will find the following triple, which provides the URI of the work for this manifestation:

<http://worldcat.org/oclc/817185721> schema:exampleOfWork <http://worldcat.org/entity/work/id/1151002411> .

To quote Neil Wilson, Head of Metadata Services at the British Library:

With this release of WorldCat Works, OCLC is creating a significant, practical contribution to the wider community discussion on how to migrate from traditional institutional library catalogues to popular web resources and services using linked library data.  This release provides the information community with a valuable opportunity to assess how the benefits of a works-based approach could impact a new generation of library services.

This is a major first step in a journey to provide linked data views of the entities within WorldCat.  I am looking forward to other WorldCat entities, such as people, places, and events.  Apart from being a major release of linked data, this capability is the result of applying [Big] Data mining and analysis techniques that have been the focus of research and development for several years.  These efforts are demonstrating that there is much more to library linked data than the mechanical, record-at-a-time conversion of MARC records into an RDF representation.

You may find it helpful, in understanding the potential exposed by the release of Works, to review some of the questions and answers that were raised after the preview release.

Personally I am really looking forward to hearing about the uses that are made of this data.

SemanticWeb.com Spotlight on Library Innovation

Help spotlight library innovation and send a library linked data practitioner to the SemTechBiz conference in San Francisco, June 2-5.

Update from the organisers:
We are pleased to announce that Kevin Ford, from the Network Development and MARC Standards Office at the Library of Congress, was selected for the SemanticWeb.com Spotlight on Innovation for his work with the Bibliographic Framework Initiative (BIBFRAME) and his continuing work on the Library of Congress’s Linked Data Service (id.loc.gov). In addition to being an active contributor, Kevin is responsible for the BIBFRAME website; has devised tools to view MARC records and the resulting BIBFRAME resources side-by-side; authored the first transformation code from MARC data to BIBFRAME resources; and is project manager for the Library of Congress’s Linked Data Service. Kevin also writes and presents frequently to promote BIBFRAME and ID.LOC.GOV, and to educate fellow librarians on the possibilities of linked data.

Without exception, each nominee represented great work and demonstrated the power of Linked Data in library systems, making it a difficult task for the committee, and sparking some interesting discussions about future such spotlight programs.

Congratulations, Kevin, and thanks to all the other great library linked data projects nominated!


OCLC and LITA are working to promote library participation at the upcoming Semantic Technology & Business Conference (SemTechBiz). Libraries are doing important work with Linked Data.  SemanticWeb.com wants to spotlight innovation in libraries, and send one library presenter to the SemTechBiz conference, expenses paid.

SemTechBiz brings together today’s industry thought leaders and practitioners to explore the challenges and opportunities jointly impacting both business leaders and technologists. Conference sessions include technical talks and case studies that highlight semantic technology applications in action. The program includes tutorials and over 130 sessions and demonstrations as well as a hackathon, start-up competition, exhibit floor, and networking opportunities.  Amongst the great selection of speakers you will find yours truly!

If you know of someone who has done great work demonstrating the benefit of linked data for libraries, nominate them for this June 2-5 conference in San Francisco. This “library spotlight” opportunity will provide one sponsored presenter with a spot on the conference program, paid travel & lodging costs to get to the conference, plus a full conference pass.

Nominations for the Spotlight are being accepted through May 10th.  Any significant practical work should have been accomplished prior to March 31st 2013, though the project can be ongoing.  Self-nominations will be accepted.

Even if you do not nominate anyone, the Semantic Technology and Business Conference is well worth experiencing.  As supporters of the SemanticWeb.com Library Spotlight, OCLC and LITA members will get a 50% discount on a conference pass – use discount code “OCLC” or “LITA” when registering.  (Non-members can still get a 20% discount for this great conference by quoting code “FCLC”.)

For more details check out the OCLC Innovation Series page.

Thank you for all the nominations we received for the first SemanticWeb.com Spotlight on Innovation in Libraries.


Surfacing at Semtech San Francisco

So where have I been?   I announce that I am now working as a Technology Evangelist for the library behemoth OCLC, and then promptly disappear.  The only excuse I have for deserting my followers is that I have been kind of busy getting my feet under the OCLC table: getting to know my new colleagues, the initiatives and projects they are engaged with, the longer term ambitions of the organisation, and of course the more mundane issues of getting my head around the IT, video conferencing, and expense claim procedures.

It was therefore great to find myself in San Francisco once again for the Semantic Tech & Business Conference (#SemTechBiz) for what promises to be a great program this year.  Apart from meeting old and new friends amongst those interested in the potential and benefits of the Semantic Web and Linked Data, I am hoping for a further step forward in the general understanding of how this potential can be realised to address real world challenges and opportunities.

As Paul Miller reported, the opening session had an audience of 75% first-time visitors.  Just like the cityscape vista presented to those attending the speakers’ reception yesterday on the 45th floor of the conference hotel, I hope these new visitors get a stunningly clear view of the landscape around them.

Of course I am doing my bit to help on this front by trying to cut through some of the more technical geek-speak. Tuesday 8:00am will find me in Imperial Room B presenting The Simple Power of the Link – a 30 minute introduction to Linked Data, its benefits and potential, without the need to get your head around the more esoteric concepts such as triple stores, inference, and ontology management.  I would recommend this session not only as an introduction for those new to the topic, but also for those well versed in the technology, as a reminder that we sometimes miss the simple benefits when trying to promote our baby.

For those interested in the importance of these techniques and technologies to the world of Libraries, Archives and Museums, I would also recommend a panel that I am moderating on Wednesday at 3:30pm in Imperial B – Linked Data for Libraries, Archives and Museums.  I will be joined by LOD-LAM community driver Jon Voss, Stanford Linked Data Workshop Report co-author Jerry Persons, and Sung Hyuk Kim from the National Library of Korea.  As moderator I will not only let the four of us make short presentations about what is happening in our worlds, but will insist that at least half the time is reserved for questions from the floor – so bring them along!

I am not only surfacing at SemTech; I am beginning to see, at last, the technologies being discussed surfacing as mainstream.  We in the Semantic Web/Linked Data world are very good at frightening off those new to it.  However, driven by pragmatism in search of a business model, and by initiatives such as Schema.org, it is starting to become mainstream by default.  One very small example: Yahoo!’s Peter Mika telling us, in the Semantic Search workshop, that RDFa is the predominant format for embedding structured data within web pages.

Looking forward to a great week, and soon more time to get back to blogging!

Who Will Be Mostly Right – Wikidata, Schema.org?

Two, on the surface, totally unconnected posts – yet the same message.  Well that’s how they seem to me anyway.

Post 1 – The Problem With Wikidata from Mark Graham, writing in the Atlantic.

When I reported the announcement of Wikidata by Denny Vrandecic at the Semantic Tech & Business Conference in Berlin in February, I was impressed with the ambition to bring together all the facts from all the different language versions of Wikipedia in a central Wikidata instance with a single page per entity.  These single pages will draw together all references to the entities, and engage a sustainable community to manage this machine-readable resource.  This data would then be used to populate the info-boxes of all versions of Wikipedia, in addition to being an open resource of structured data for all.

In his post Mark raises concerns that this approach could result in the loss of the diversity of opinion currently found in the diverse Wikipedias:

It is important that different communities are able to create and reproduce different truths and worldviews. And while certain truths are universal (Tokyo is described as a capital city in every language version that includes an article about Japan), others are more messy and unclear (e.g. should the population of Israel include occupied and contested territories?).

He also highlights issues about the unevenness or bias of contributors to Wikipedia:

We know that Wikipedia is a highly uneven platform. We know that not only is there not a lot of content created from the developing world, but there also isn’t a lot of content created about the developing world. And we also, even within the developed world, a majority of edits are still made by a small core of (largely young, white, male, and well-educated) people. For instance, there are more edits that originate in Hong Kong than all of Africa combined; and there are many times more edits to the English-language article about child birth by men than women.

A simplistic view of what Wikidata is attempting to do would be a majority-rules filter on what is correct data, where low-volume opinions are drowned out by that majority.  If Wikidata is successful in its aims, it will not only become the single source for info-box data in all versions of Wikipedia, but will take over the mantle currently held by DBpedia as the de facto link-to place for identifiers and associated data on the Web of Data and the wider Web.

I share some of his concerns, but also draw comfort from some of the things Denny said in Berlin – “WikiData will not define the truth, it will collect the references to the data….  WikiData created articles on a topic will point to the relevant Wikipedia articles in all languages.”  They obviously intend to capture facts described in different languages; the question is, will they also preserve the local differences in assertion?  In a world where we still cannot totally agree on the height of our tallest mountain, we must be able to take account of, and report, differences of opinion.

Post 2 – Danbri has moved on – should we follow? by a former colleague Phil Archer.

The Danbri in question is Dan Brickley, one of the original architects of the Semantic Web, now working for Google on Schema.org.  Dan presented at an excellent Semantic Web Meetup, which I attended at the BBC Academy a couple of weeks back.  This was a great event; I recommend investing the time to watch the videos of Dan and all the other speakers.

Phil picked out a section of Dan’s presentation for comment:

In the RDF community, in the Semantic Web community, we’re kind of polite, possibly too polite, and we always try to re-use each other’s stuff. So each schema maybe has 20 or 30 terms, and… schema.org has been criticised as maybe a bit rude, because it does a lot more it’s got 300 classes, 300 properties but that makes things radically simpler for people deploying it. And that’s frankly what we care about right now, getting the stuff out there. But we also care about having attachment points to other things…

Then reflecting on current practice in Linked Data he went on to postulate:

… best practice for the RDF community…  …i.e. look at existing vocabularies, particularly ones that are already widely used and stable, and re-use as much as you can. Dublin Core, FOAF – you know the ones to use.

Except schema.org doesn’t.

schema.org has its own term for name, family name and given name which I chose not to use at least partly out of long term loyalty to Dan. But should that affect me? Or you? Is it time to put emotional attachments aside and move on from some of the old vocabularies and at least consider putting more effort into creating a single big vocabulary that covers most things with specialised vocabularies to handle the long tail?

As the question in the title of his post implies, should we move on and start adopting, where applicable, terms from the large and growing Schema.org vocabulary when modelling and publishing our data?  Or should we stick with the current collection of terms from suitable smaller vocabularies?

One of the common issues when people first get to grips with creating Linked Data is which terms, from which vocabularies, to use for their data – and where to find out.  I have watched the frown skip across several people’s faces when first told that foaf:name is a good attribute to use for a person’s name in a data set that has nothing to do with friends, or friends of friends. It is very similar to the one they give you when you suggest that it may also be good for something that isn’t even a person.
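The choice Phil describes can be sketched as a simple correspondence between the familiar small-vocabulary terms and their Schema.org equivalents. The three FOAF pairs below are the ones his post mentions (name, family name, given name); treat anything beyond them as an assumption for illustration.

```python
# Sketch of the vocabulary choice discussed above: the same
# person-name properties expressed with FOAF terms and with their
# Schema.org equivalents (the three pairs Phil mentions).
FOAF_TO_SCHEMA = {
    "foaf:name": "schema:name",
    "foaf:familyName": "schema:familyName",
    "foaf:givenName": "schema:givenName",
}

def to_schema_terms(properties):
    """Re-express a property list using Schema.org terms where a
    mapping is known, leaving unmapped terms untouched."""
    return [FOAF_TO_SCHEMA.get(p, p) for p in properties]

print(to_schema_terms(["foaf:name", "foaf:givenName", "dc:title"]))
```

The point of the sketch is that the move is mechanical once you decide to make it; the hard part is the community question of whether to decide at all.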

As Schema.org grows and, enticed by the obvious SEO benefits in the form of Rich Snippets, becomes rapidly adopted by a community far larger than the Semantic Web and Linked Data communities, why would you not default to using terms from their vocabulary?   Another former colleague, David Wood, Tweeted No in answer to Phil’s question – I think this may in retrospect seem a King Canute style proclamation.  If my predictions are correct, it won’t be too long before we are up to our ears in structured data on the web, most of it marked up using terms to be found at schema.org.

You may think that I am advocating the death of all the vocabularies, well known and obscure, in use today, and their replacement by Schema.org – far from it.   When modelling your [Linked] data, start by using terms that have been used before, then build on terms more specific to your domain, and finally you may have to create your own vocabulary/ontology.  What I am saying is that as Schema.org becomes established, its growing collection of 300+ terms will become the obvious starting point in that process.

OK, a couple of interesting posts, but where is the similar message and connection?  I see it as democracy of opinion.  Not the democracy of the modern western political system, where we have a stand-up shouting match every few years followed by a fairly stable period where the rules are enforced by one view.  More the traditional, possibly romanticised, view of democracy where the majority leads the way but without disregarding the opinions of the few.  Was it the French Enlightenment philosopher Voltaire who said: “I may hate your views, but I am willing to lay down my life for your right to express them”?  A bit extreme when discussing data and ontologies, but the spirit is right.

Once the majority of general data on the web is marked up with schema.org terms, it would be short-sighted to ignore the gravitational force it will exert in the web of data if you want your data to be linked to and found.  However, it will be incumbent on those behind Schema.org to maintain their ambition to deliver easy linking to more specialised vocabularies via their extension points.  This way the ‘how’ of data publishing should become simpler, more widespread, and extensible.   On the ‘what’ side of the [structured] data publishing equation, the Wikidata team has an equal responsibility not only to publish the majority definition of facts, but also to clearly reflect the views of minorities – not a simple balancing act, as often those with the more extreme views have the loudest voices.

Main image via democracy.org.au.

Semantic Search, Discovery, and Serendipity

So I need to hang up some tools in my shed.  I need some bent hook things – I think.  Off to the hardware store, in which I search for the fixings section.  Following the signs hanging from the roof, my search soon directs me to a rack covered in lots of individual packets, and I spot the thing I am looking for – but what’s this?  They come in lots of different sizes.  After a bit of localised searching I grab the size I need, but wait – in the next rack there are some specialised tool-hanging devices.  Square hooks, long hooks, double-prong hooks, spring clips – an amazing choice!  Pleased with what I discovered and selected, I’m soon heading down the aisle when my attention is drawn to a display of shelving with hidden brackets – just the thing for under the TV in the lounge.  I grab one of those and head for the checkout before my credit card regrets my discovering anything else.

We all know the library ‘browse’ experience.  Head for a particular book, and come away with a different one on the same topic that just happened to be on a nearby shelf, or even a totally different one that you ‘found’ on the recently returned books shelf.

An ambition for the web is to reflect and assist what we humans do in the real world.  Search has only brought us part of the way. By identifying key words in web page text, and links between those pages, it makes a reasonable stab at identifying things that might be related to the keywords we enter.

As I commented recently, Semantic Search messages coming from Google indicate that they are taking significant steps towards that ambition.  By harvesting Schema.org-described metadata embedded in html by webmasters enticed by Rich Snippets, and building on the 12 million entity descriptions in Freebase, they are amassing the fuel for a better search engine – one [that] will better match search queries with a database containing hundreds of millions of “entities” – people, places and things.

How much closer will this better, semantic, search get to being able to replicate online the scenario I shared at the start of this post?  It should do a better job of relating our keywords to the things that would be of interest, not just the pages about them.  Having a better understanding of entities should help with the Paris Hilton problem, or at least help us navigate around such issues.  That better understanding of entities, and related entities, should enable the return of relevant related results that did not contain our keywords.

But surely there is more to it than that.  Yes there is, but it is not search – it is discovery.  As in my scenario above, humans do not only search for things.  We search to get ourselves to a start point for discovery.  I searched for an item in the fixings section of the hardware store, or a book in the library; I then inspected related items on the rack and the shelf to discover whether there was anything more appropriate for my needs nearby.  By understanding things and the [semantic] relationships between them, systems could help us with that discovery phase.  It is the search engine’s job to expose those relationships, but the prime benefit will emerge when the source web sites start doing it too.

Take what is still one of my favourite sites – BBC Wildlife.  Have a look at the Lion page, found by searching for lions in Google.  Scroll down a bit and you will see listed the lion’s habitats and behaviours.  These are all things or concepts related to the lion.  Follow the link to the flooded grassland habitat, where you will find lists of the flora and fauna found there, including the aardvark, which is nocturnal.  Such follow-your-nose navigation around the site supports the discovery method of finding things that I describe.  In such an environment, serendipity is only a few clicks away.
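That follow-your-nose pattern is, at heart, a traversal of typed relationships between things.  A minimal Python sketch of the idea, using invented triples loosely modelled on the BBC example above (no RDF library assumed):

```python
# Things and the typed relationships between them, as simple triples.
# The entities echo the BBC Wildlife example; the data is invented.
TRIPLES = {
    ("lion", "livesIn", "flooded grassland"),
    ("lion", "behaviour", "predation"),
    ("aardvark", "livesIn", "flooded grassland"),
    ("aardvark", "behaviour", "nocturnal"),
}

def related(entity):
    """Entities reachable by following a link out and back, e.g. via a shared habitat."""
    linked_things = {o for s, p, o in TRIPLES if s == entity}
    return {s for s, p, o in TRIPLES if o in linked_things and s != entity}

print(related("lion"))  # -> {'aardvark'}
```

From the lion, a system can surface the aardvark without the word ever being searched for – discovery via the relationships rather than the keywords.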

There are two sides to the finding-stuff coin – search and discovery.  Humans naturally do both; systems and the web are only just starting to move beyond search alone.  This move is being enabled by the constantly growing body of data describing things and their relationships – Linked Data.  A growth stimulated by initiatives such as Schema.org, and by Google providing quick-return incentives, such as Rich Snippets and SEO goodness, for folks to publish structured data for reasons other than a futuristic Semantic Web.

Google, SEO, RDFa and Semantic Search

Today’s Wall Street Journal gives us an insight into the makeover underway in the Google search department.

Over the next few months, Google’s search engine will begin spitting out more than a list of blue Web links. It will also present more facts and direct answers to queries at the top of the search-results page.

They are going about this by developing the search engine [that] will better match search queries with a database containing hundreds of millions of “entities”—people, places and things—which the company has quietly amassed in the past two years.

The ‘amassing’ got a kick-start in 2010 with the Metaweb acquisition that brought Freebase and its 12 million entities into the Google fold.  This is now continuing with the harvesting of html-embedded, schema.org-encoded structured data that is starting to spread across the web.

The encouragement for webmasters and SEO folks to go to the trouble of inserting this information into their html is the prospect of a better result display for their page – Rich Snippets.  A nice trade-off from Google – you embed the information we want/need for a better search engine, and we will give you better results.

The premise of what Google are up to is that it will deliver better search.  Yes, this should be true; however, I would suggest that the major benefit to us mortal Googlers will be better results.  The search engine should appear to have greater intuition as to what we are looking for, but what we should also get is more information about the things that it finds for us.  This is the step-change.  We will be getting, in addition to web page links, information about things – the location, altitude, average temperature or salt content of a lake – whereas today you would only get links to the lake’s visitor centre or a Wikipedia page.

Another example quoted in the article:

…people who search for a particular novelist like Ernest Hemingway could, under the new system, find a list of the author’s books they could browse through and information pages about other related authors or books, according to people familiar with the company’s plans. Presumably Google could suggest books to buy, too.

Many in the library community may view this with scepticism, seeing it as too simplistic an approach to something they have been striving towards for many years with only limited success.  I would say that they should be helping the search engine supplier(s) do this right and be part of the process.  There is a great danger that, for better or worse, whatever Google does will make the library search interface irrelevant.

As an advocate for linked data, it is great to see the benefits of defining entities and describing the relationships between them being taken seriously.  I’m not sure I buy into the term ‘Semantic Search’ as a name for what will result.  I tend more towards ‘Semantic Discovery’, which is more descriptive of where the semantics kick in – in the relationships between a searched-for thing, its attributes, and other entities.  However, I’ve been around far too long to get hung up about labels.

Whilst we are on the topic of labels, I am in danger of stepping into the almost religious debate about the relative merits of microdata and RDFa as the method for embedding schema.org markup.  Google recognises both, both are ugly for humans to hand-code, and webmasters should not have to care.  Once the CMS suppliers get up to speed in supplying the modules to automatically embed this stuff, as per this Drupal module, they won’t have to care.
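For illustration only, here is the same invented schema.org Person expressed in each syntax; the attribute names differ, but the facts a crawler extracts are identical:

```python
# The same schema.org description in the two competing embedding
# syntaxes.  The person and values are invented for illustration.

microdata_html = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Librarian</span>
</div>
"""

rdfa_html = """
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span>
  <span property="jobTitle">Librarian</span>
</div>
"""

# Either way, a harvester ends up with the same statements about the
# same thing; only the attribute vocabulary differs.
```

Both snippets say “there is a Person named Jane Doe whose jobTitle is Librarian” – which is exactly why, once tooling generates the markup, the syntax choice becomes a non-issue.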

I welcome this.  Yet it is only a symptom of something much bigger and game-changing, as I postulated last month in A Data 7th Wave is Approaching.

Is Linked Data DIY a Good Idea?

Most Semantic Web and Linked Data enthusiasts will tell you that Linked Data is not rocket science, and it is not.  They will tell you that RDF is one of the simplest data forms for describing things, and they are right.  They will tell you that adopting Linked Data makes merging disparate datasets much easier to do, and it does.  They will say that publishing persistent, globally addressable URIs (identifiers) for your things and concepts will make it easier for others to reference and share them – it will.  They will tell you that it will enable you to add value to your data by linking to, and drawing in data from, the Linked Open Data Cloud, and they are right on that too.  Linked Data technology, they will say, is easy to get hold of, either by downloading open source software or from the cloud – yup, just go ahead and use it.  They will make you aware of an ever-increasing number of tools to extract your current data and transform it into RDF – no problem there then.
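The merging claim is worth unpacking: because every RDF statement is a self-contained triple built on globally unique identifiers, combining two independently produced datasets amounts to a set union.  A library-free Python sketch, with identifiers and values invented for illustration:

```python
# Two disparate datasets describing the same thing, linked by a shared
# global identifier (URI).  Identifiers and values are invented.
library_data = {
    ("http://example.org/book/moby-dick", "title", "Moby Dick"),
    ("http://example.org/book/moby-dick", "author", "Herman Melville"),
}
retailer_data = {
    ("http://example.org/book/moby-dick", "price", "9.99"),
}

# Merging is just a union of triples -- no schema negotiation and no
# column mapping required.
merged = library_data | retailer_data

# All known facts about the book, drawn from both sources at once.
facts = {predicate: obj for subject, predicate, obj in merged}
print(facts["author"], facts["price"])  # -> Herman Melville 9.99
```

The shared URI does the joining; that is the whole trick that makes merging disparate Linked Data datasets easy.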

So would I recommend a self-taught do-it-yourself approach to adopting Linked Data?  For an enthusiastic individual, maybe.  For a company or organisation wanting to get to know and then identify the potential benefits, no I would not.  Does this mean I recommend outsourcing all things Linked Data to a third party – definitely not.

Let me explain this apparent contradiction.  I believe that anyone holding, or able to benefit from consuming, significant amounts of data can realise benefits by adopting Linked Data techniques and technologies.  These benefits could be in the form of efficiencies, data enrichment, new insights, SEO benefits, or even new business models.  Gaining the full effect of these benefits will come not only from adopting the technologies but also from adopting the different way of thinking, often called open-world thinking, that comes from understanding the Linked Data approach in your context.  That change of thinking, and the agility it also brings, will only embed in your organisation if you do it yourself.  However, I do counsel care in the way you approach gaining this understanding.

A young child wishing to keep up with her friends by migrating from tricycle to bicycle may have a go herself, but may well give up after the third grazed knee.  The helpful, if out-of-breath, dad jogging along behind, providing a stabilising hand, helpful guidance, encouragement, and warnings to stay at the side of the road, will result in a far less painful and more rewarding experience.

I am aware of computer/business professionals who are not aware of what Linked Data is, or the benefits it could provide.  There are others who have looked at it, do not see how it could be better, but do see potential grazed knees if they go down that path.  And there are yet others who have had a go, but without a steadying hand to guide them, and ended up still not getting it.

You want to understand how Linked Data could benefit your organisation?  Get some help to relate the benefits to your issues, challenges and opportunities.  Don’t go off to a third party and get them to implement something for you.  Bring in a steadying hand, encouragement, and guidance to stay on track.  Don’t go off and purchase expensive hardware and software to help you explore the benefits of Linked Data.  There are plenty of open source stores, or even better, just sign up to a cloud-based service such as Kasabi.  Get your head around what you have, how you are going to publish and link it, and what the usage might be.  Then you can size and specify the technology and/or service you need to support it.

So back to my original question – is Linked Data DIY a good idea?  Yes it is.  It is the only way to reap the ‘different way of thinking’ benefits that accompany understanding the application of Linked Data in your organisation.  However, I would not recommend a do-it-yourself introduction to it.  Get yourself a steadying hand.

Is that last statement a thinly veiled pitch for my services – of course it is, but that should not dilute my advice to get some help when you start, even if it is not from me.

Picture of girl learning to ride from zsoltika on Flickr.
Source of cartoon unknown.

A Data 7th Wave Approaching












Some in the surfing community will tell you that every seventh wave is a big one.  I am getting the feeling, in the world of the Web, that number seven is up next, and this one is all about data.  The last seventh wave was the Web itself.  Because of that, it is a little constraining to talk about this next one only affecting the world of the Web.  This one has the potential to shift some significant rocks around on all our beaches, and change the way we all interact with and think about the world around us.

Sticking with the seashore metaphor for a short while longer: waves from the technology ocean have the potential to wash into the bays and coves of interest on the coast of human endeavour and rearrange the pebbles on our beaches.  Some do not reach every cove, and/or have only minor impact; however, some really big waves reach in everywhere to churn up the sand and rocks, significantly changing the way we do things and ultimately think about the world around us.  The post-Web technology waves have brought smaller yet important influences such as ecommerce, social networking, and streaming.

I believe Data, or more precisely changes in how we create, consume, and interact with data, has the potential to deliver a seventh wave impact.  Enough of the grandiose metaphors and down to business.

Data has been around for centuries, from clay tablets to little cataloguing tags on the end of scrolls in ancient libraries, and on into the computerised databases that we have been accumulating since the 1960s.  Up until very recently these [digital] data have been closed – constrained by the systems that used them, only exposed to the wider world via user interfaces and possibly a task- or product-specific API.  With the advent of many data-associated advances, variously labelled Big Data, Social Networking, Open Data, Cloud Services, Linked Data, Microformats, Microdata, Semantic Web, and Enterprise Data, data is now venturing beyond those closed systems into the wider world.

Well this is nothing new, you might say, these trends have been around for a while – why does this constitute the seventh wave of which you foretell?

It is precisely because these trends have been around for a while, and are starting to mature and influence each other, that they are building to form something really significant.  Take Open Data for instance where governments have been at the forefront – I have reported before about the almost daily announcements of open government data initiatives.  The announcement from the Dutch City of Enschede this week not only talks about their data but also about the open sourcing of the platform they use to manage and publish it, so that others can share in the way they do it.

In the world of libraries, the Ontology Engineering Group (OEG) at the Universidad Politécnica de Madrid is providing a contribution of linked bibliographic data to the gathering mass, alongside the British and the Germans, with 2.4 million bibliographic records from the Spanish National Library.  This adds weight to the arguments for a Linked Data future for libraries proposed by the Library of Congress and Stanford University.

I might find some of the activity in the Cloud Computing world short-sighted and depressing, yet already the concept of housing your data somewhere other than in a local datacenter is becoming accepted in most industries.

Enterprise use of Linked Data by leading organisations such as the BBC, who are underpinning their online Olympics coverage with it, is showing that it is more than a research tool, or the province only of the open data enthusiasts.

Data Marketplaces are emerging to provide platforms to share, and possibly monetise, your data.  An example that takes this one step further is Kasabi.com from the leading Semantic Web technology company, Talis.  Kasabi introduces the data mixing, merging, and standardised querying of Linked Data into the data publishing concept.  This potentially provides a platform for refining and mixing raw data into new data alloys and products more valuable and useful than their component parts – an approach that should stimulate innovation both in the enterprise and in the data enthusiast community.

The Big Data community is demonstrating that there are solutions to handling the vast volumes of data we are producing, but that they require us to move out of the silos of relational databases towards a mixed economy.  Programs need to move – not the data.  NoSQL databases, Hadoop, map/reduce – these are all things that are starting to move out of the labs and the hacker communities into the mainstream.
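The map/reduce pattern behind much of that tooling is simple enough to sketch in a few lines; frameworks such as Hadoop distribute the same two steps across many machines.  A toy word-count sketch in pure Python (the corpus is invented):

```python
from functools import reduce
from collections import Counter

# A toy corpus; a real cluster would shard millions of records.
records = ["open data", "linked data", "big data"]

# Map: each record independently emits word counts -- the step a
# framework like Hadoop farms out to worker nodes.
mapped = [Counter(record.split()) for record in records]

# Reduce: the partial results are combined into the final answer.
word_counts = reduce(lambda a, b: a + b, mapped, Counter())

print(word_counts["data"])  # -> 3
```

Because each map step touches only its own record, the work can run anywhere the data happens to live – which is exactly the “move the program, not the data” point.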

The Social Networking industry, which produces tons of data, is a rich field for things like sentiment analysis, trend spotting, targeted advertising, and even short-term predictions.  Innovation in this field has been rapid, but I would suggest a little hampered by the delivery of closed, individual solutions that as yet do not interact with the wider world that could place them in context.

I wrote about Schema.org a while back – an initiative from the search engine big three to encourage the SEO industry to embed simple structured data in their html.  The carrot they are offering for this effort is enhanced display in results listings – Google calls these Rich Snippets.  When first announced, the schema.org folks concentrated on Microdata as the embedding format – something that wouldn’t frighten the SEO community horses too much.  However, they did [over a background of loud complaining from the Semantic Web / Linked Data enthusiasts that RDFa was the only way] also indicate that RDFa would eventually be supported.  By engaging with SEO folks on terms that they understand, this move from Schema.org has the potential to get far more structured data published on the Web than any TED Talk from Sir Tim Berners-Lee, preaching from people like me, or guidelines from governments could ever do.

The above short list of pebble-stirring waves is both impressive in its breadth and encouraging in its potential, yet none of them are the stuff of a seventh wave.

So what caused me to open up my MacBook and start writing this?  It was a post from Manu Sporny, indicating that Google are not waiting for RDFa 1.1 Lite (the RDFa version that schema.org will support) to be ratified.  They are already harvesting, and using, structured information from web pages that has been encoded using RDFa.  The use of this structured data has resulted in enhanced display on the Google results pages, with items such as event date and location information, and recipe preparation timings.

Manu references sites that seem to be running Drupal, the open source CMS software, and specifically a Drupal plug-in for rendering Schema.org data encoded as RDFa.  This approach answers some of the critics of embedding Schema.org data in a site’s html, especially as RDFa, who say it is ugly and difficult to understand.  It is not there for humans to parse or understand, and, with modules such as the Drupal one, humans will not need to get their hands dirty down at code level.  Currently Schema.org supports a small but important number of ‘things’ in its recognised vocabularies.  These, currently supplemented by GoodRelations and Recipes, will hopefully be joined by others to broaden the scope of descriptive opportunities.

So roll the clock forward, not too far, to a landscape where a large number of sites (incentivised by the prospect of listings as enriched as their competitors’ results) are embedding structured data in their pages as normal practice.  By then most, if not all, web site delivery tools should be able to embed the Schema.org RDFa data automatically.  Google and the other web-crawling organisations will rapidly build up a global graph of the things on the web, their types, their relationships, and the pages that describe them.  A nifty example of providing a very specific, easily understood benefit in return for a change in the way web sites are delivered, resulting in a global shift in the amount of structured data accessible for the benefit of all.  Google Fellow and SVP Amit Singhal recently gave insight into this Knowledge Graph idea.

The Semantic Web / Linked Data proponents have been trying to convince everyone else of the great good that will follow once we have a web interlinked at the data level with meaning attached to those links.  So far this evangelism has had little success.  However, this shift may give them what they want via an unexpected route.

Once such a web emerges, and most importantly is understood by the commercial world, innovations that will influence the way we interact will naturally follow.  A Google TV, with access to such a rich resource, should have no problem delivering an enhanced viewing experience by following structured links embedded in a programme page to information about the cast, the book of the film, the statistics that underpin the topic, or other programmes from the same production company.  Our next-but-one iPhone could be a personal node in a global data network, providing access to relevant information about our location, activities, social network, and tasks.

These slightly futuristic predictions will only become possible on top of a structured network of data, which I believe is what could very well emerge if you follow through on the signs that Manu is pointing out.  Reinforced by, and combining with, the other developments I referenced earlier in this post, I believe we may well have a seventh wave approaching.  Perhaps I should look at the beach again in five years’ time to see if I was right.

Wave photo from Nathan Gibbs on Flickr.
Declarations – I am a Kasabi Partner and shareholder in Kasabi parent company Talis.