Hidden Gems in the new Schema.org 3.1 Release

I spend a significant amount of time working on the supporting software, vocabulary contents, and application of Schema.org. So it is with great pleasure, and a certain amount of relief, that I share the release of Schema.org 3.1, along with some of the hidden gems you will find in there.

I spend a significant amount of time working with Google folks, especially Dan Brickley, and others on the supporting software, vocabulary contents, and application of Schema.org. So it is with great pleasure, and a certain amount of relief, that I share the announcement of the release of 3.1.

That announcement lists several improvements, enhancements and additions to the vocabulary that appeared in versions 3.0 & 3.1. These include:

  • Health Terms – A significant reorganisation of the extensive collection of medical/health terms, that were introduced back in 2012, into the ‘health-lifesci’ extension, which now contains 99 Types, 179 Properties and 149 Enumeration values.
  • Finance Terms – Following an initiative and work by the Financial Industry Business Ontology (FIBO) project (which I have the pleasure to be part of), in support of the W3C Financial Industry Business Ontology Community Group, several terms have been added to improve the capability for describing things such as banks, bank accounts, financial products such as loans, and monetary amounts.
  • Spatial and Temporal and Datasets – CreativeWork now includes spatialCoverage and temporalCoverage, which I know my cultural heritage colleagues and clients will find very useful. Like many enhancements in the Schema.org community, this work came out of a parallel interest, in which Dataset has received some attention.
  • Hotels and Accommodation – Substantial new vocabulary for describing hotels and accommodation has been added, and documented.
  • Pending Extension – Version 3.0 introduced a special extension called “pending”, which provides a place for newly proposed schema.org terms to be documented, tested and revised. The anticipation is that this area will be updated with proposals relatively frequently, in between formal Schema.org releases.
  • How We Work – A HowWeWork document has been added to the site. This comprehensive document details the many aspects of the operation of the community, the site, the vocabulary etc. – a useful way in for everyone from casual users through to those who want to immerse themselves in the vocabulary, its use and development.

For fuller details on what is in 3.1 and other releases, check out the Releases document.

Hidden Gems

Working, as I often do, in the depths of the vocabulary and the site that supports it, I get up close to improvements that are not obvious on the surface. Here are some that those who immerse themselves in Schema.org may find interesting:

  • Snappy Performance – The Schema.org site, a Python app hosted on the Google App Engine, is, shall we say, a very popular site. Over the last 3-4 releases I have been working on taking full advantage of multi-threaded, multi-instance, memcache, and shared datastore capabilities. Add in page caching improvements plus an implementation of ETags, and we can see improved site performance which can best be described as snappiness. The only downsides are that to see a new version update you sometimes have to hard reload your browser page, and that I have learnt far more about these technologies than I ever thought I would need!
  • Data Downloads – We are often asked for a copy of the latest version of the vocabulary so that people can examine it, develop from it, build tools on it, or whatever takes their fancy. This has been partially possible in the past, but now we have introduced (on a developers page we hope to expand with other useful stuff in the future – suggestions welcome) a download area for vocabulary definition files. From here you can download, in your favourite format (Triples, Quads, JSON-LD, Turtle), files containing the core vocabulary, individual extensions, or the whole vocabulary. (Tip: The page displays the link to the file that will always return the latest version.)
  • Data Model Documentation – Version 3.1 introduced updated contents to the Data Model documentation page, especially in the area of conformance.  I know from working with colleagues and clients, that it is sometimes difficult to get your head around Schema.org’s use of Multi-Typed Entities (MTEs) and the ability to use a Text, or a URL, or Role for any property value.  It is good to now have somewhere to point people when they question such things.
  • Markdown – This is a great addition for those enhancing, developing and proposing updates to the vocabulary. The rdfs:comment section of term definitions is now passed through a Markdown processor. This means that any formatting or links to be embedded in a term description do not have to be escaped with horrible coding such as &amp; and &gt; etc. So, for example, a link can be input as [The Link](http://example.com/mypage) and italic text would be input as *italic*. The processor also supports WikiLinks style links, which enable direct linking to a page within the site, so [[CreativeWork]] will result in the user being taken directly to the CreativeWork page via a correctly formatted link. This makes the correct formatting of type descriptions a much nicer experience, as it does my debugging of the definition files. ;-)
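To make the WikiLinks idea concrete, here is a minimal sketch in Python of how such links could be expanded into ordinary Markdown links before a Markdown processor renders them. This is not the code the Schema.org site runs; the function name expand_wikilinks, the regular expression and the "/" base path are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a term name wrapped in double square brackets.
WIKILINK = re.compile(r"\[\[([A-Za-z][A-Za-z0-9]*)\]\]")


def expand_wikilinks(comment: str, base: str = "/") -> str:
    """Rewrite each [[Term]] as a Markdown link to that term's page."""
    return WIKILINK.sub(lambda m: f"[{m.group(1)}]({base}{m.group(1)})", comment)


comment = "A kind of [[CreativeWork]]; see *also* [The Link](http://example.com/mypage)."
print(expand_wikilinks(comment))
```

Ordinary Markdown links and emphasis pass through untouched, so the result can be handed to whichever Markdown processor is in use.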

I could go on, but won’t. Whether you are new to Schema.org or very familiar with it, I suggest you take a look.

Schema.org 2.0

About a month ago Version 2.0 of the Schema.org vocabulary hit the streets.

This update includes loads of tweaks, additions and fixes that can be found in the release information. The automotive folks have got new vocabulary for describing Cars, including useful properties such as numberOfAirbags, fuelEfficiency, and knownVehicleDamages. The new property mainEntityOfPage (and its inverse, mainEntity) provides the ability to tell the search engine crawlers which thing a web page is really about. With a new type ScreeningEvent to support movie/video screenings, and a gtin12 property for Product, amongst others, there is much useful stuff in there.
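As an illustration of mainEntityOfPage, the following Python sketch builds a small JSON-LD description of an Article that names the web page it is the main entity of. The helper name article_markup, the example URL and the headline are made up for this example; only the Schema.org terms (Article, WebPage, headline, mainEntityOfPage) come from the vocabulary.

```python
import json


def article_markup(page_url: str, headline: str) -> str:
    """Return JSON-LD stating that this Article is the main entity of page_url."""
    data = {
        "@context": "http://schema.org",
        "@type": "Article",
        "headline": headline,
        # Points from the thing being described to the page that is about it.
        "mainEntityOfPage": {"@type": "WebPage", "@id": page_url},
    }
    return json.dumps(data, indent=2)


# Hypothetical page URL, used only for illustration.
print(article_markup("http://example.com/posts/schema-org-2", "Schema.org 2.0"))
```

The output could be embedded in a page inside a script element of type application/ld+json; the inverse property, mainEntity, would be used when the description starts from the WebPage instead.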

But does this warrant the version number clicking over from 1.xx to 2.0?

These new types and properties are only the tip of the 2.0 iceberg. There is a heck of a lot of other stuff going on in this release apart from these additions. Some of it is in the vocabulary itself, some of it in the potential, documentation, supporting software, and organisational processes around it.

Sticking with the vocabulary for the moment, there has been a bit of cleanup around property names. As the vocabulary has grown organically since its release in 2011, inconsistencies and conflicts between different proposals have been introduced. So part of the 2.0 effort has included some rationalisation. For instance, the Code type is being superseded by SoftwareSourceCode, as the term code has many different meanings, many of which have nothing to do with software; surface has been superseded by artworkSurface and area is being superseded by serviceArea, for similar reasons. Check out the release information for full details. If you are using any of the superseded terms there is no need to panic, as the original terms are still valid but with updated descriptions to indicate that they have been superseded. However, you are encouraged to move towards the updated terminology as convenient.

The question of what is in which version brings me to an enhancement to the supporting documentation. Starting with Version 2.0, a snapshot view of the full vocabulary will be published for each release – here is http://schema.org/version/2.0. So if you want to refer to a term at a particular version you now can.

How often is Schema being used? – is a question often asked. A new feature has been introduced to give you some indication. Check out the description of one of the newly introduced properties, mainEntityOfPage, and you will see the following: ‘Usage: Fewer than 10 domains’. Unsurprisingly for a newly introduced property, there is virtually no usage of it yet. If you look at the description for the type this term is used with, CreativeWork, you will see ‘Usage: Between 250,000 and 500,000 domains’. Not a direct answer to the question, but a good and useful indication of the popularity of a particular term across the web.

In the release information you will find the following cryptic reference: ‘Fix to #429: Implementation of new extension system.’

This refers to the introduction of the functionality, on the Schema.org site, to host extensions to the core vocabulary.  The motivation for this new approach to extending is explained thus:

Schema.org provides a core, basic vocabulary for describing the kind of entities the most common web applications need. There is often a need for more specialized and/or deeper vocabularies, that build upon the core. The extension mechanisms facilitate the creation of such additional vocabularies.
With most extensions, we expect that some small frequently used set of terms will be in core schema.org, with a long tail of more specialized terms in the extension.

As yet there are no extensions published.  However, there are some on the way.

As Chair of the Schema Bib Extend W3C Community Group I have been closely involved with a proposal by the group for an initial bibliographic extension (bib.schema.org) to Schema.org. The proposal includes new Types for Chapter, Collection, Agent, Atlas, Newspaper & Thesis, CreativeWork properties to describe the relationship between translations, plus types & properties to describe comics. I am also following the proposal’s progress through the system – a bit of a learning exercise for everyone. Hopefully I can share the news in the not too distant future that bib will be one of the first released extensions.

W3C Community Group for Schema.org
A subtle change in the way the vocabulary, its proposals, extensions and direction can be followed and contributed to has also taken place. The creation of the Schema.org Community Group has now provided an open forum for this.

So is 2.0 a bit of a milestone? Yes, taking all things together, I believe it is. I get the feeling that Schema.org is maturing into the kind of vocabulary, supported by a professional community, that will add confidence to those using it and recommending that others should.

Baby Steps Towards A Library Graph

It is one thing to have a vision (regular readers of this blog will know I have them all the time); it is quite another to see it starting to form through the mist into a reality. Several times in the recent past I have spoken of some of the building blocks for bibliographic data to play a prominent part in the Web of Data. The Web of Data that is starting to take shape and drive benefits for everyone. Benefits that for many are hiding in plain sight on the results pages of search engines, in those informational panels with links to people’s parents, universities, and movies, or maps showing the location of mountains and retail outlets: the incongruously named Knowledge Graphs.

Building blocks such as Schema.org; Linked Data in WorldCat.org; moves to enhance Schema.org capabilities for bibliographic resource description; recognition that Linked Data has a beneficial place in library data and initiatives to turn that into a reality; the release of Work entity data mined from, and linked to, the huge WorldCat.org data set.

OK, you may say, we’ve heard all that before, so what is new now?

As always it is a couple of seemingly unconnected events that throw things into focus.

Event 1: An article by David Weinberger in the DigitalShift section of Library Journal entitled Let The Future Go. An excellent article telling libraries that they should not be so parochially focused on their own domain whilst looking to how they are going to serve their users’ needs in the future. Get our data out there, everywhere, so it can find its way to those users, wherever they are. Making it accessible to all. David references three main ways to provide this access:

  1. APIs – to allow systems to directly access our library system data and functionality
  2. Linked Data – can help us open up the future of libraries. By making clouds of linked data available, people can pull together data from across domains
  3. The Library Graph –  an ambitious project libraries could choose to undertake as a group that would jump-start the web presence of what libraries know: a library graph. A graph, such as Facebook’s Social Graph and Google’s Knowledge Graph, associates entities (“nodes”) with other entities

(I am fortunate to be a part of an organisation, OCLC, making significant progress on making all three of these a reality – the first one is already baked into the core of OCLC products and services)

It is the 3rd of those, however, that triggered recognition for me.  Personally, I believe that we should not be focusing on a specific ‘Library Graph’ but more on the ‘Library Corner of a Giant Global Graph’  – if graphs can have corners that is.  Libraries have rich specialised resources and have specific needs and processes that may need special attention to enable opening up of our data.  However, when opened up in context of a graph, it should be part of the same graph that we all navigate in search of information whoever and wherever we are.

Event 2: A posting by ZBW Labs, Other editions of this work: An experiment with OCLC’s LOD work identifiers, detailing experiments in using the OCLC WorldCat Works Data.

ZBW contributes to WorldCat, and has 1.2 million oclc numbers attached to it’s bibliographic records. So it seemed interesting, how many of these editions link to works and furthermore to other editions of the very same work.

The post is interesting from a couple of points of view. Firstly, the simple steps they took to get at the data, really well demonstrated by the command-line calls used to access it: get the OCLCNum data from WorldCat.org in JSON format; extract the schema:exampleOfWork link to the Work; get the Work data from WorldCat, also in JSON; parse out the links to other editions of the work and compare them with their own data. Command-line calls that were no doubt embedded in simple scripts.
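Those steps can be sketched in a few lines of Python. This is not the ZBW Labs code: the network fetches are left out, and the two documents below are made-up, heavily simplified stand-ins for the JSON that would be returned. The "@graph" layout, the helper names and the EXAMPLE-A/EXAMPLE-B edition URIs are assumptions for illustration; exampleOfWork and its inverse workExample are the Schema.org properties involved.

```python
from typing import List, Optional, Set

EDITION = "http://worldcat.org/oclc/817185721"
WORK = "http://worldcat.org/entity/work/id/1151002411"

# Made-up stand-ins for the edition and Work descriptions (assumed shape).
edition_doc = {"@graph": [{"@id": EDITION, "exampleOfWork": {"@id": WORK}}]}
work_doc = {"@graph": [{"@id": WORK, "workExample": [
    {"@id": EDITION},
    {"@id": "http://worldcat.org/oclc/EXAMPLE-A"},
    {"@id": "http://worldcat.org/oclc/EXAMPLE-B"},
]}]}


def _ids(value) -> List[str]:
    """Normalise a property value (string, node, or list of either) to URIs."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    return [item["@id"] if isinstance(item, dict) else item for item in items]


def find_node(doc: dict, uri: str) -> dict:
    """Return the node with the given @id from the document's @graph, or {}."""
    for node in doc.get("@graph", []):
        if node.get("@id") == uri:
            return node
    return {}


def work_of(doc: dict, edition_uri: str) -> Optional[str]:
    """Follow exampleOfWork from an edition to its Work, if present."""
    works = _ids(find_node(doc, edition_uri).get("exampleOfWork"))
    return works[0] if works else None


def other_editions(doc: dict, work_uri: str, edition_uri: str) -> Set[str]:
    """List the Work's editions (workExample), excluding the starting edition."""
    return set(_ids(find_node(doc, work_uri).get("workExample"))) - {edition_uri}


work = work_of(edition_doc, EDITION)
others = other_editions(work_doc, work, EDITION)
local_holdings = {"http://worldcat.org/oclc/EXAMPLE-B"}  # hypothetical local data
print(work)
print(sorted(others & local_holdings))
```

The final set intersection is the "compare with their own data" step: it shows which other editions of the same work are also held locally.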

Secondly, there was the implicit way that the corpus of WorldCat Work entity descriptions, and their canonical identifying URIs, is used as an authoritative hub for Works and their editions. A concept that is not new in the library world: we have been doing this sort of thing with names and person identities via other authoritative hubs, such as VIAF, for ages. What is new here is that it is a hub for Works and their relationships, and the bidirectional nature of those relationships – work to edition, edition to work – in the beginnings of a library graph linked to other hubs for subjects, people, etc.

The ZBW Labs experiment is interesting in its own way – simple approach, enlightening results. What is more interesting for me is that it demonstrates a baby step towards the way the Library corner of that Global Web of Data will not only naturally form (as we expose and share data in this way – linked entity descriptions), but naturally fit into future library workflows with all sorts of consequential benefits.

The experiment is exactly the type of initiative that we hoped to stimulate by releasing the Works data.  Using it for things we never envisaged, delivering unexpected value to our community.  I can’t wait to hear about other initiatives like this that we can all learn from.

So who is going to be doing this kind of thing – describing entities and sharing them to establish these hubs (nodes) that will form the graph? Some are already there, in the traditional authority file hubs: The Library of Congress LC Linked Data Service for authorities and vocabularies (id.loc.gov), VIAF, ISNI, FAST, Getty vocabularies, etc.

As previously mentioned, Work is only the first of several entity descriptions that are being developed in OCLC for exposure and sharing. When others, such as Person, Place, etc., emerge we will have a foundation of part of a library graph – a graph that can and will be used, and added to, across the library domain and then on into the rest of the Global Web of Data. An important authoritative corner, of a corner, of the Giant Global Graph.

As I said at the start these are baby steps towards a vision that is forming out of the mist.  I hope you and others can see it too.

(Toddler image: Harumi Ueda)

WorldCat Works – 197 Million Nuggets of Linked Data

They’re released!

A couple of months back I spoke about the preview release of Works data from WorldCat.org.  Today OCLC published a press release announcing the official release of 197 million descriptions of bibliographic Works.

A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work.  The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary.  In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, OCLC numbered, editions already shared from WorldCat.org.

They look a little different to the kind of metadata we are used to in the library world.  Check out this example <http://worldcat.org/entity/work/id/1151002411> and you will see that, apart from name and description strings, it is mostly links.  It is linked data after all.

These links (URIs) lead, where available, to authoritative sources for people, subjects, etc.  When not available, placeholder URIs have been created to capture information not yet available or identified in such authoritative hubs.  As you would expect from a linked data hub the works are available in common RDF serializations – Turtle, RDF/XML, N-Triples, JSON-LD – using the Schema.org vocabulary – under an open data license.

The obvious question is “how do I get a work id for the items in my catalogue?”. The simplest way is to use the already released linked data from WorldCat.org. If you have an OCLC Number (e.g. 817185721) you can create the URI for that particular manifestation by prefixing it with ‘http://worldcat.org/oclc/’ thus: http://worldcat.org/oclc/817185721

In the linked data that is returned, either on screen in the Linked Data section, or in the RDF in your desired serialization, you will find the following triple which provides the URI of the work for this manifestation:

<http://worldcat.org/oclc/817185721> exampleOfWork <http://worldcat.org/entity/work/id/1151002411>
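As a rough sketch of that lookup in Python: build the manifestation URI from the OCLC Number, then scan N-Triples text for the exampleOfWork statement. The sample line below is hand-written for illustration and assumes the predicate is serialised as the full http://schema.org/exampleOfWork URI; the helper names are hypothetical, and the simple pattern only handles triples whose object is a URI.

```python
import re
from typing import List

OCLC_BASE = "http://worldcat.org/oclc/"
EXAMPLE_OF_WORK = "http://schema.org/exampleOfWork"  # assumed full predicate URI

# Matches one N-Triples line of the form <subject> <predicate> <object> .
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.$")


def oclc_uri(oclc_number) -> str:
    """Prefix an OCLC Number to form the URI of that manifestation."""
    return f"{OCLC_BASE}{oclc_number}"


def work_uris(ntriples: str, subject: str) -> List[str]:
    """Return the Work URIs that the given subject is an example of."""
    found = []
    for line in ntriples.splitlines():
        match = TRIPLE.match(line.strip())
        if match and match.group(1) == subject and match.group(2) == EXAMPLE_OF_WORK:
            found.append(match.group(3))
    return found


# Hand-written sample line, not a captured response.
sample = (
    "<http://worldcat.org/oclc/817185721> "
    "<http://schema.org/exampleOfWork> "
    "<http://worldcat.org/entity/work/id/1151002411> ."
)
print(work_uris(sample, oclc_uri(817185721)))
```

For anything beyond a quick check, a proper RDF parser would be a safer choice than a regular expression, since real data also contains literals and blank nodes.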

To quote Neil Wilson, Head of Metadata Services at the British Library:

With this release of WorldCat Works, OCLC is creating a significant, practical contribution to the wider community discussion on how to migrate from traditional institutional library catalogues to popular web resources and services using linked library data.  This release provides the information community with a valuable opportunity to assess how the benefits of a works-based approach could impact a new generation of library services.

This is a major first step in a journey to provide linked data views of the entities within WorldCat, and I am looking forward to other WorldCat entities such as people, places, and events. Apart from being a major release of linked data, this capability is the result of applying [Big] Data mining and analysis techniques that have been the focus of research and development for several years. These efforts are demonstrating that there is much more to library linked data than the mechanical, record-at-a-time conversion of MARC records into an RDF representation.

You may find it helpful, in understanding the potential exposed by the release of Works, to review some of the questions and answers that were raised after the preview release.

Personally I am really looking forward to hearing about the uses that are made of this data.

SemanticWeb.com Spotlight on Library Innovation

Help spotlight library innovation and send a library linked data practitioner to the SemTechBiz conference in San Francisco, June 2-5

Update from organisers:
We are pleased to announce that Kevin Ford, from the Network Development and MARC Standards Office at the Library of Congress, was selected for the SemanticWeb.com Spotlight on Innovation for his work with the Bibliographic Framework Initiative (BIBFRAME) and his continuing work on the Library of Congress’s Linked Data Service (id.loc.gov). In addition to being an active contributor, Kevin is responsible for the BIBFRAME website; has devised tools to view MARC records and the resulting BIBFRAME resources side-by-side; authored the first transformation code for MARC data to BIBFRAME resources; and is project manager for The Library of Congress’ Linked Data Service. Kevin also writes and presents frequently to promote BIBFRAME, ID.LOC.GOV, and educate fellow librarians on the possibilities of linked data.

Without exception, each nominee represented great work and demonstrated the power of Linked Data in library systems, making it a difficult task for the committee, and sparking some interesting discussions about future such spotlight programs.

Congratulations, Kevin, and thanks to all the other great library linked data projects nominated!


OCLC and LITA are working to promote library participation at the upcoming Semantic Technology & Business Conference (SemTechBiz). Libraries are doing important work with Linked Data.   SemanticWeb.com wants to spotlight innovation in libraries, and send one library presenter to the SemTechBiz conference expenses paid.

SemTechBiz brings together today’s industry thought leaders and practitioners to explore the challenges and opportunities jointly impacting both business leaders and technologists. Conference sessions include technical talks and case studies that highlight semantic technology applications in action. The program includes tutorials and over 130 sessions and demonstrations as well as a hackathon, start-up competition, exhibit floor, and networking opportunities.  Amongst the great selection of speakers you will find yours truly!

If you know of someone who has done great work demonstrating the benefit of linked data for libraries, nominate them for this June 2-5 conference in San Francisco. This “library spotlight” opportunity will provide one sponsored presenter with a spot on the conference program, paid travel & lodging costs to get to the conference, plus a full conference pass.

Nominations for the Spotlight are being accepted through May 10th. Any significant practical work should have been accomplished prior to March 31st 2013, although the project can be ongoing. Self-nominations will be accepted.

Even if you do not nominate anyone, the Semantic Technology and Business Conference is well worth experiencing. As supporters of the SemanticWeb.com Library Spotlight, OCLC and LITA members will get a 50% discount on a conference pass – use discount code “OCLC” or “LITA” when registering. (Non-members can still get a 20% discount for this great conference by quoting code “FCLC”.)

For more details check out the OCLC Innovation Series page.

Thank you for all the nominations we received for the first SemanticWeb.com Spotlight on Innovation in Libraries.


Surfacing at Semtech San Francisco

So where have I been? I announce that I am now working as a Technology Evangelist for the library behemoth OCLC, and then promptly disappear. The only excuse I have for deserting my followers is that I have been kind of busy getting my feet under the OCLC table, getting to know my new colleagues, the initiatives and projects they are engaged with, the longer term ambitions of the organisation, and of course the more mundane issues of getting my head around the IT, video conferencing, and expense claim procedures.

It was therefore great to find myself in San Francisco once again for the Semantic Tech & Business Conference (#SemTechBiz) for what promises to be a great program this year.  Apart from meeting old and new friends amongst those interested in the potential and benefits of the Semantic Web and Linked Data, I am hoping for a further step forward in the general understanding of how this potential can be realised to address real world challenges and opportunities.

As Paul Miller reported, the opening session contained an audience with 75% first time visitors. Just like the cityscape vista presented to those attending the speakers’ reception yesterday on the 45th floor of the conference hotel, I hope these new visitors get a stunningly clear view of the landscape around them.

Of course I am doing my bit to help on this front by trying to cut through some of the more technical geek-speak. Tuesday 8:00am will find me in Imperial Room B presenting The Simple Power of the Link – a 30 minute introduction to Linked Data, its benefits and potential, without the need to get your head around the more esoteric concepts of Linked Data such as triple stores, inference, ontology management etc. I would recommend this session not only as an introduction for those new to the topic, but also for those well versed in the technology, as a reminder that we sometimes miss the simple benefits when trying to promote our baby.

For those interested in the importance of these techniques and technologies to the world of Libraries, Archives and Museums I would also recommend a panel that I am moderating on Wednesday at 3:30pm in Imperial B – Linked Data for Libraries Archives and Museums. I will be joined by LOD-LAM community driver Jon Voss, Stanford Linked Data Workshop Report co-author Jerry Persons, and Sung Hyuk Kim from the National Library of Korea. As moderator I will not only let the four of us make small presentations about what is happening in our worlds, I will also insist that at least half the time is kept for questions from the floor, so bring them along!

I am not only surfacing at Semtech, I am beginning to see, at last, the technologies being discussed surfacing as mainstream. We in the Semantic Web/Linked world are very good at frightening off those new to it. However, driven by pragmatism in search of a business model and initiatives such as Schema.org, it is starting to become mainstream by default. One very small example being Yahoo!’s Peter Mika telling us, in the Semantic Search workshop, that RDFa is the predominant format for embedding structured data within web pages.

Looking forward to a great week, and soon more time to get back to blogging!

Who Will Be Mostly Right – Wikidata, Schema.org?

Two, on the surface, totally unconnected posts – yet the same message. Well that’s how they seem to me anyway.

Post 1 – The Problem With Wikidata from Mark Graham writing in the Atlantic. Post 2 – Danbri has moved on – should we follow? by a former colleague Phil Archer.

Post 1 – The Problem With Wikidata from Mark Graham writing in the Atlantic.

When I reported the announcement of Wikidata by Denny Vrandecic at the Semantic Tech & Business Conference in Berlin in February,  I was impressed with the ambition to bring together all the facts from all the different language versions of Wikipedia in a central Wikidata instance with a single page per entity. These single pages will draw together all references to the entities and engage with a sustainable community to manage this machine-readable resource.   This data would then be used to populate the info-boxes of all versions of Wikipedia in addition to being an open resource of structured data for all.

In his post Mark raises concerns that this approach could result in the loss of the diversity of opinion currently found in the diverse Wikipedias:

It is important that different communities are able to create and reproduce different truths and worldviews. And while certain truths are universal (Tokyo is described as a capital city in every language version that includes an article about Japan), others are more messy and unclear (e.g. should the population of Israel include occupied and contested territories?).

He also highlights issues about the unevenness or bias of contributors to Wikipedia:

We know that Wikipedia is a highly uneven platform. We know that not only is there not a lot of content created from the developing world, but there also isn’t a lot of content created about the developing world. And we also, even within the developed world, a majority of edits are still made by a small core of (largely young, white, male, and well-educated) people. For instance, there are more edits that originate in Hong Kong than all of Africa combined; and there are many times more edits to the English-language article about child birth by men than women.

A simplistic view of what Wikidata is attempting to do could be a majority-rules filter on what is correct data, where low volume opinions are drowned out by that majority. If Wikidata is successful in its aims, it will not only become the single source for info-box data in all versions of Wikipedia, but it will take over the mantle currently held by DBpedia as the de facto link-to place for identifiers and associated data on the Web of Data and the wider Web.

I share some of his concerns, but also draw comfort from some of the things Denny said in Berlin –  “WikiData will not define the truth, it will collect the references to the data….  WikiData created articles on a topic will point to the relevant Wikipedia articles in all languages.”  They obviously intend to capture facts described in different languages, the question is will they also preserve the local differences in assertion.  In a world where we still can not totally agree on the height of our tallest mountain, we must be able to take account of and report differences of opinion.

Post 2 – Danbri has moved on – should we follow? by a former colleague Phil Archer.

The Danbri in question is Dan Brickley, one of the original architects of the Semantic Web, now working for Google on Schema.org. Dan presented at an excellent Semantic Web Meetup, which I attended at the BBC Academy a couple of weeks back. This was a great event. I recommend investing the time to watch the videos of Dan and all the other speakers.

Phil picked out a section of Dan’s presentation for comment:

In the RDF community, in the Semantic Web community, we’re kind of polite, possibly too polite, and we always try to re-use each other’s stuff. So each schema maybe has 20 or 30 terms, and… schema.org has been criticised as maybe a bit rude, because it does a lot more it’s got 300 classes, 300 properties but that makes things radically simpler for people deploying it. And that’s frankly what we care about right now, getting the stuff out there. But we also care about having attachment points to other things…

Then reflecting on current practice in Linked Data he went on to postulate:

… best practice for the RDF community…  …i.e. look at existing vocabularies, particularly ones that are already widely used and stable, and re-use as much as you can. Dublin Core, FOAF – you know the ones to use.

Except schema.org doesn’t.

schema.org has its own term for name, family name and given name which I chose not to use at least partly out of long term loyalty to Dan. But should that affect me? Or you? Is it time to put emotional attachments aside and move on from some of the old vocabularies and at least consider putting more effort into creating a single big vocabulary that covers most things with specialised vocabularies to handle the long tail?

As the question in the title of his post implies, should we move on and start adopting, where applicable, terms from the large and growing Schema.org vocabulary when modelling and publishing our data? Or should we stick with the current collection of terms from suitable smaller vocabularies?

One of the common issues when people first get to grips with creating Linked Data is working out which terms from which vocabularies to use for their data, and where to find them. I have watched the frown skip across several people’s faces when you first tell them that foaf:name is a good attribute to use for a person’s name in a data set that has nothing to do with friends or friends of friends. It is very similar to the one they give you when you suggest that it may also be good for something that isn’t even a person.

As Schema.org grows and, enticed by the obvious SEO benefits in the form of Rich Snippets, becomes rapidly adopted by a community far greater than the Semantic Web and Linked Data communities, why would you not default to using terms in their vocabulary? Another former colleague, David Wood, tweeted “No” in answer to Phil’s question – I think this in retrospect may seem a King Canute style proclamation. If my predictions are correct, it won’t be too long before we are up to our ears in structured data on the web, most of it marked up using terms to be found at schema.org.

You may think that I am advocating the death, and replacement by Schema.org, of all the vocabularies, well known and obscure, in use today – far from it. When modelling your [Linked] data, start by using terms that have been used before, then build on terms more specific to your domain, and finally you may have to create your own vocabulary/ontology. What I am saying is that as Schema.org becomes established, its growing collection of 300+ terms will become the obvious start point in that process.
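To make that layering concrete, here is a minimal sketch in Python that builds a JSON-LD style description of a person. It assumes schema.org terms as the default, one FOAF term re-used alongside them, and a made-up in-house vocabulary for the long tail. The `ex` namespace, the `staffGrade` property and the function name are hypothetical, invented purely for illustration.

```python
import json

# Namespace prefixes. "schema" and "foaf" are the published vocabularies;
# "ex" is a hypothetical in-house vocabulary used only for this sketch.
CONTEXT = {
    "schema": "http://schema.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ex": "http://example.org/vocab/",
}


def describe_person(identifier, given, family, homepage=None, staff_grade=None):
    """Build a JSON-LD style dict, reaching for widely used terms first."""
    doc = {
        "@context": CONTEXT,
        "@id": identifier,
        "@type": "schema:Person",
        # Step 1: start with terms from the big, widely deployed vocabulary.
        "schema:givenName": given,
        "schema:familyName": family,
        "schema:name": f"{given} {family}",
    }
    if homepage is not None:
        # Step 2: re-use an established specialised vocabulary where it fits.
        doc["foaf:homepage"] = {"@id": homepage}
    if staff_grade is not None:
        # Step 3: only mint your own term when nothing else covers the need.
        doc["ex:staffGrade"] = staff_grade
    return doc


person = describe_person(
    "http://example.org/id/person/1",
    "Ada",
    "Lovelace",
    homepage="http://example.org/ada",
    staff_grade="G7",
)
print(json.dumps(person, indent=2))
```

The point of the sketch is the ordering of the three steps, not the particular properties chosen. A real model would pick its terms after checking what each vocabulary actually defines.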

OK, a couple of interesting posts, but where is the similar message and connection? I see it as democracy of opinion. Not the democracy of the modern western political system, where we have a stand up shouting match every few years followed by a fairly stable period where the rules are enforced by one view. More the traditional, possibly romanticised, view of democracy where the majority leads the way but without disregarding the opinions of the few. Was it the French Enlightenment philosopher Voltaire who said: “I may hate your views, but I am willing to lay down my life for your right to express them” – a bit extreme when discussing data and ontologies, but the spirit is right.

Once the majority of general data on the web becomes marked up as schema.org, it would be short sighted to ignore the gravitational force it will exert in the web of data if you want your data to be linked to and found. However, it will be incumbent on those behind Schema.org to maintain their ambition to deliver easy linking to more specialised vocabularies via their extension points. This way the ‘how’ of data publishing should become simpler, more widespread, and extensible. On the ‘what’ side of the [structured] data publishing equation, the Wikidata team has an equal responsibility to not only publish the majority definition of facts, but also clearly reflect the views of minorities – not a simple balancing act, as often those with the more extreme views have the loudest voices.

Main image via democracy.org.au.

Semantic Search, Discovery, and Serendipity

So I need to hang up some tools in my shed. I need some bent hook things – I think. Off to the hardware store in which I search for the fixings section. Following the signs hanging from the roof, my search is soon directed to a rack covered in lots of individual packets and I spot the thing I am looking for, but what’s this – they come in lots of different sizes. After a bit of localised searching I grab the size I need, but wait – in the next rack there are some specialised tool hanging devices. Square hooks, long hooks, double-prong hooks, spring clips, an amazing choice! Pleased with what I discovered and selected, I’m soon heading down the aisle when my attention is drawn to a display of shelving with hidden brackets – just the thing for under the TV in the lounge. I grab one of those and head for the checkout before my credit card regrets me discovering anything else.

We all know the library ‘browse’ experience.  Head for a particular book, and come away with a different one on the same topic that just happened to be on a nearby shelf, or even a totally different one that you ‘found’ on the recently returned books shelf.

An ambition for the web is to reflect and assist what we humans do in the real world.  Search has only brought us part of the way. By identifying key words in web page text, and links between those pages, it makes a reasonable stab at identifying things that might be related to the keywords we enter.

As I commented recently, Semantic Search messages coming from Google indicate that they are taking significant steps towards the ambition.   By harvesting Schema.org described metadata embedded in html, by webmasters enticed by Rich Snippets, and building on the 12 million entity descriptions in Freebase they are amassing the fuel for a better search engine.  A search engine [that] will better match search queries with a database containing hundreds of millions of “entities”—people, places and things.

How much closer will this better, semantic, search get to being able to replicate online the scenario I shared at the start of this post? It should do a better job of relating our keywords to the things that would be of interest, not just the pages about them. Having a better understanding of entities should help with the Paris Hilton problem, or at least help us navigate around such issues. That better understanding of entities, and related entities, should enable the return of related relevant results that did not contain our keywords.

But surely there is more to it than that. Yes there is, but it is not search – it is discovery. As in my scenario above, humans do not only search for things. We search to get ourselves to a start point for discovery. I searched for an item in the fixings section in the hardware store, or a book in the library. I then inspected related items on the rack and the shelf to discover if there was anything more appropriate for my needs nearby. By understanding things and the [semantic] relationships between them, systems could help us with that discovery phase. It is the search engine’s job to expose those relationships, but the prime benefit will emerge when the source web sites start doing it too.

Take what is still one of my favourite sites – BBC wildlife. Take a look at the Lion page, found by searching for lions in Google. Scroll down a bit and you will see listed the lion’s habitats and behaviours. These are all things or concepts related to the lion. Follow the link to the flooded grassland habitat, where you will find lists of flora and fauna that you will find there, including the aardvark which is nocturnal. Such follow-your-nose navigation around the site supports the discovery method of finding things that I describe. In such an environment serendipity is only a few clicks away.
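That follow-your-nose path can be sketched as a walk over a small graph of related things. The Python below is only an illustration: the `LINKS` table is a hand-made stand-in for the relationships described above (lion, flooded grassland, aardvark, nocturnal) and is not taken from the BBC site's actual data. The `discover` function is a hypothetical helper that lists what is reachable within a given number of clicks.

```python
from collections import deque

# Hand-made stand-in for the relationships a site might expose between things.
LINKS = {
    "Lion": ["Flooded grassland"],
    "Flooded grassland": ["Lion", "Aardvark"],
    "Aardvark": ["Flooded grassland", "Nocturnal"],
    "Nocturnal": ["Aardvark"],
}


def discover(start, max_hops):
    """Return (thing, hops) pairs reachable from start, nearest first."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow links beyond the click budget
        for neighbour in LINKS.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append((neighbour, hops + 1))
                queue.append((neighbour, hops + 1))
    return found


# Search lands us on "Lion"; discovery is what lies a few clicks further on.
for thing, hops in discover("Lion", 3):
    print(f"{hops} click(s) away: {thing}")
```

Search supplies the start point and the links do the rest, which is why a site that exposes its relationships makes serendipity cheap.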

There are two sides to the finding stuff coin – Search and Discovery. Humans naturally do both; systems and the web are only just starting to move beyond search only. This move is being enabled by the constantly growing data that is describing things and their relationships – Linked Data. A growth stimulated by initiatives such as Schema.org, and Google providing quick return incentives, such as Rich Snippets & SEO goodness, for folks to publish structured data for reasons other than a futuristic Semantic Web.

Google SEO RDFa and Semantic Search

Today’s Wall Street Journal gives us an insight into the makeover underway in the Google search department.

Over the next few months, Google’s search engine will begin spitting out more than a list of blue Web links. It will also present more facts and direct answers to queries at the top of the search-results page.

They are going about this by developing the search engine [that] will better match search queries with a database containing hundreds of millions of “entities”—people, places and things—which the company has quietly amassed in the past two years.

The ‘amassing’ got a kick start in 2010 with the Metaweb acquisition that brought Freebase and its 12 million entities into the Google fold. This is now continuing with harvesting of html embedded, schema.org encoded, structured data that is starting to spread across the web.

The encouragement for webmasters and SEO folks to go to the trouble of inserting this information into their html is the prospect of a better result display for their page – Rich Snippets. A nice trade-off from Google – you embed the information we want/need for a better search and we will give you better results.

The premise of what Google are up to is that it will deliver better search. Yes, this should be true; however, I would suggest that the major benefit to us mortal Googlers will be better results. The search engine should appear to have greater intuition as to what we are looking for, but what we also should get is more information about the things that it finds for us. This is the step-change. We will be getting, in addition to web page links, information about things – the location, altitude, average temperature or salt content of a lake – whereas today you would only get links to the lake’s visitors centre or a Wikipedia page.

Another example quoted in the article:

…people who search for a particular novelist like Ernest Hemingway could, under the new system, find a list of the author’s books they could browse through and information pages about other related authors or books, according to people familiar with the company’s plans. Presumably Google could suggest books to buy, too.

Many in the library community may note this with scepticism, seeing it as too simplistic an approach to something that they have been striving towards for many years with only limited success. I would say that they should be helping the search engine supplier(s) do this right and be part of the process. There is great danger that, for better or worse, whatever Google does will make the library search interface irrelevant.

As an advocate for linked data, it is great to see the benefits of defining entities and describing the relationships between them being taken seriously. I’m not sure I buy into the term ‘Semantic Search’ as a name for what will result. I tend more towards ‘Semantic Discovery’, which is more descriptive of where the semantics kick in – in the relationship between a searched for thing and its attributes and other entities. However, I’ve been around far too long to get hung up about labels.

Whilst we are on the topic of labels, I am in danger of stepping into the almost religious debate about the relative merits of microdata and RDFa as the encoding method for embedding schema.org. Google recognises both, both are ugly for humans to hand code, and web masters should not have to care. Once the CMS suppliers get up to speed in supplying the modules to automatically embed this stuff, as per this Drupal module, they won’t have to care.
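To show how little the encoding choice matters once a machine is doing the embedding, here is a small Python sketch that renders the same schema.org description both ways. It uses the standard microdata attributes (`itemscope`, `itemtype`, `itemprop`) and the RDFa Lite attributes (`vocab`, `typeof`, `property`). The two function names are hypothetical helpers written for this illustration; this is not how the Drupal module, or any other CMS module, actually works.

```python
from html import escape

SCHEMA = "http://schema.org/"


def as_microdata(item_type, props):
    """Render simple text properties of one item as HTML microdata."""
    lines = [f'<div itemscope itemtype="{SCHEMA}{escape(item_type)}">']
    for name, value in props.items():
        lines.append(f'  <span itemprop="{escape(name)}">{escape(value)}</span>')
    lines.append("</div>")
    return "\n".join(lines)


def as_rdfa(item_type, props):
    """Render the same properties as RDFa Lite."""
    lines = [f'<div vocab="{SCHEMA}" typeof="{escape(item_type)}">']
    for name, value in props.items():
        lines.append(f'  <span property="{escape(name)}">{escape(value)}</span>')
    lines.append("</div>")
    return "\n".join(lines)


# One description, two encodings: the data does not change, only the attributes.
book = {"name": "The Old Man and the Sea", "author": "Ernest Hemingway"}
print(as_microdata("Book", book))
print(as_rdfa("Book", book))
```

The sketch only handles flat text values; nested items and links would need a little more work in both syntaxes, which is exactly the sort of detail a CMS module should hide.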

I welcome this. Yet it is only a symptom of something much bigger and game-changing, as I postulated last month in A Data 7th Wave is Approaching.

Is Linked Data DIY a Good Idea?

Most Semantic Web and Linked Data enthusiasts will tell you that Linked Data is not rocket science, and it is not. They will tell you that RDF is one of the simplest data forms for describing things, and they are right. They will tell you that adopting Linked Data makes merging disparate datasets much easier to do, and it does. They will say that publishing persistent globally addressable URIs (identifiers) for your things and concepts will make it easier for others to reference and share them, it will. They will tell you that it will enable you to add value to your data by linking to and drawing in data from the Linked Open Data Cloud, and they are right on that too. Linked Data technology, they will say, is easy to get hold of either by downloading open source or from the cloud, yup just go ahead and use it. They will make you aware of an ever increasing number of tools to extract your current data and transform it into RDF, no problem there then.
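The merging claim is easy to illustrate without any special tooling. The Python sketch below models RDF statements as plain (subject, predicate, object) tuples and merges two tiny datasets with a set union. The lake URI and all of the values are made up for the example, and a real project would more likely use an RDF library and a triple store than bare tuples.

```python
SCHEMA = "http://schema.org/"

# A made-up identifier that both datasets happen to use for the same thing.
LAKE = "http://example.org/id/lake/1"

# Dataset A: what a (hypothetical) tourism site says about the lake.
dataset_a = {
    (LAKE, SCHEMA + "name", "Example Lake"),
    (LAKE, SCHEMA + "description", "A lake used only for illustration."),
}

# Dataset B: what a (hypothetical) survey site says about the same lake.
dataset_b = {
    (LAKE, SCHEMA + "name", "Example Lake"),
    (LAKE, SCHEMA + "url", "http://example.org/lakes/1.html"),
}

# Because both sides use the same URI, merging is just a union of statements;
# the duplicated name triple collapses into one.
merged = dataset_a | dataset_b


def describe(graph, subject):
    """Collect everything the graph says about one subject, by predicate."""
    result = {}
    for s, p, o in graph:
        if s == subject:
            result.setdefault(p, []).append(o)
    return {p: sorted(values) for p, values in result.items()}


for predicate, values in sorted(describe(merged, LAKE).items()):
    print(predicate, values)
```

There is no schema alignment step and no join key to negotiate. The shared identifier does the work, which is the practical payoff of those persistent, globally addressable URIs.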

So would I recommend a self-taught do-it-yourself approach to adopting Linked Data?  For an enthusiastic individual, maybe.  For a company or organisation wanting to get to know and then identify the potential benefits, no I would not.  Does this mean I recommend outsourcing all things Linked Data to a third party – definitely not.

Let me explain this apparent contradiction. I believe that anyone who has, or could benefit from consuming, significant amounts of data can realise benefits by adopting Linked Data techniques and technologies. These benefits could be in the form of efficiencies, data enrichment, new insights, SEO benefits, or even business models. Gaining the full effects of these benefits will only come from not only adopting the technologies but also adopting the different way of thinking, often called open-world thinking, that comes from understanding the Linked Data approach in your context. That change of thinking, and the agility it also brings, will only embed in your organisation if you do-it-yourself. However, I do counsel care in the way you approach gaining this understanding.

A young child wishing to keep up with her friends by migrating from tricycle to bicycle may have a go herself, but may well give up after the third grazed knee. The helpful, if out of breath, dad jogging along behind providing a stabilising hand, helpful guidance, encouragement, and warnings to stay on the side of the road, will result in a far less painful and more rewarding experience.

I am aware of computer/business professionals who are not aware of what Linked Data is, or the benefits it could provide. There are others who have looked at it, do not see how it could be better, but do see potential grazed knees if they go down that path. And there are yet others who have had a go, but without a steadying hand to guide them, and end up still not getting it.

You want to understand how Linked Data could benefit your organisation?  Get some help to relate the benefits to your issues, challenges and opportunities.  Don’t go off to a third party and get them to implement something for you.  Bring in a steadying hand, encouragement, and guidance to stay on track.  Don’t go off and purchase expensive hardware and software to help you explore the benefits of Linked Data.  There are plenty of open source stores, or even better just sign up to a cloud based service such as Kasabi.  Get your head around what you have, how you are going to publish and link it, and what the usage might be.  Then you can size and specify the technology and/or service you need to support it.

So back to my original question – Is Linked Data DIY a good idea? Yes it is. It is the only way to reap the ‘different way of thinking’ benefits that accompany understanding the application of Linked Data in your organisation. However, I would not recommend a do-it-yourself introduction to this. Get yourself a steadying hand.

Is that last statement a thinly veiled pitch for my services? Of course it is, but that should not dilute my advice to get some help when you start, even if it is not from me.

Picture of girl learning to ride from zsoltika on Flickr.
Source of cartoon unknown.