Hidden Gems in the new Schema.org 3.1 Release

I spend a significant amount of time working on the supporting software, vocabulary contents, and application of Schema.org. So it is with great pleasure, and a certain amount of relief, I share the release of Schema.org 3.1 and share some hidden gems you find in there.

I spend a significant amount of time working with Google folks, especially Dan Brickley, and others on the supporting software, vocabulary contents, and application of Schema.org.  So it is with great pleasure, and a certain amount of relief, I share the announcement of the release of 3.1.

That announcement lists several improvements, enhancements and additions to the vocabulary that appeared in versions 3.0 & 3.1. These include:

  • Health Terms – A significant reorganisation of the extensive collection of medical/health terms, that were introduced back in 2012, into the ‘health-lifesci’ extension, which now contains 99 Types, 179 Properties and 149 Enumeration values.
  • Finance Terms – Following an initiative and work by Financial Industry Business Ontology (FIBO) project (which I have the pleasure to be part of), in support of the W3C Financial Industry Business Ontology Community Group, several terms to improve the capability for describing things such as banks, bank accounts, financial products such as loans, and monetary amounts.
  • Spatial and Temporal and DatasetsCreativeWork now includes spatialCoverage and temporalCoverage which I know my cultural heritage colleagues and clients will find very useful.  Like many enhancements in the Schema.org community, this work came out of a parallel interest, in which  Dataset has received some attention.
  • Hotels and Accommodation – Substantial new vocabulary for describing hotels and accommodation has been added, and documented.
  • Pending Extension – Introduced in version 3.0 a special extension called “pending“, which provides a place for newly proposed schema.org terms to be documented, tested and revised.  The anticipation being that this area will be updated with proposals relatively frequently, in between formal Schema.org releases.
  • How We Work – A HowWeWork document has been added to the site. This comprehensive document details the many aspects of the operation of the community, the site, the vocabulary etc. – a useful way in for casual users through to those who want immerse themselves in the vocabulary its use and development.

For fuller details on what is in 3.1 and other releases, checkout the Releases document.

Hidden Gems

Often working in the depths of the vocabulary, and the site that supports it, I get up close to improvements that on the surface are not obvious which some [of those that immerse themselves] may find interesting that I would like to share:

  • Snappy Performance – The Schema.org site, a Python app hosted on the Google App Engine, is shall we say a very popular site.  Over the last 3-4 releases I have been working on taking full advantage of muti-threaded, multi-instance, memcache, and shared datastore capabilities. Add in page caching imrovements plus an implementation of Etags, and we can see improved site performance which can be best described as snappiness. The only downsides being, to see a new version update you sometimes have to hard reload your browser page, and I have learnt far more about these technologies than I ever thought I would need!
  • Data Downloads – We are often asked for a copy of the latest version of the vocabulary so that people can examine it, develop form it, build tools on it, or whatever takes their fancy.  This has been partially possible in the past, but now we have introduced (on a developers page we hope to expand with other useful stuff in the future – suggestions welcome) a download area for vocabulary definition files.  From here you can download, in your favourite format (Triples, Quads, JSON-LD, Turtle), files containing the core vocabulary, individual extensions, or the whole vocabulary.  (Tip: The page displays the link to the file that will always return the latest version.)
  • Data Model Documentation – Version 3.1 introduced updated contents to the Data Model documentation page, especially in the area of conformance.  I know from working with colleagues and clients, that it is sometimes difficult to get your head around Schema.org’s use of Multi-Typed Entities (MTEs) and the ability to use a Text, or a URL, or Role for any property value.  It is good to now have somewhere to point people when they question such things.
  • Markdown – This is a great addition for those enhancing, developing and proposing updates to the vocabulary.  The rdfs:comment section of term definitions are now passed through a Markdown processor.  This means that any formatting or links to be embedded in term description do not have to be escaped with horrible coding such as & and > etc.  So for example a link can be input as [The Link](http://example.com/mypage) and italic text would be input as *italic*.  The processor also supports WikiLinks style links, which enables the direct linking to a page within the site so [[CreativeWork]] will result in the user being taken directly to the CreativeWork page via a correctly formatted link.   This makes the correct formatting of type descriptions a much nicer experience, as it does my debugging of the definition files. Winking smile

I could go on, but won’t  – If you are new to Schema.org, or very familiar, I suggest you take a look.

A Fundamental Component of a New Web?

Let me explain what is this fundamental component of what I am seeing potentially as a New Web, and what I mean by New Web.

This fundamental component I am talking about you might be surprised to learn is a vocabulary – Schema.org.

cogsbrowser Marketing Hype! I hear you thinking – well at least I didn’t use the tired old ‘Next Generation’ label.

Let me explain what is this fundamental component of what I am seeing potentially as a New Web, and what I mean by New Web.

This fundamental component I am talking about you might be surprised to learn is a vocabulary – Schema.org.  But let me first set the context by explaining my thoughts on this New Web.

Having once been considered an expert on Web 2.0 (I hasten to add by others, not myself) I know how dangerous it can be to attach labels to things.  It tends to spawn screen full’s of passionate opinions on the relevance of the name, date of the revolution, and over detailed analysis of isolated parts of what is a general movement.   I know I am on dangerous ground here!

To my mind something is new when it feels different.  The Internet felt different when the Web (aka HTTP + HTML + browsers) arrived.  The Web felt different (Web 2.0?) when it became more immersive (write as well as read) and visually we stopped trying to emulate in a graphical style what we saw on character terminals. Oh, and yes we started to round our corners. 

There have been many times over the last few years when it felt new – when it suddenly arrived in our pockets (the mobile web) – when the inner thoughts, and eating habits, of more friends that you ever remember meeting became of apparent headline importance (the social web) – when [the contents of] the web broke out of the boundaries of the browser and appeared embedded in every app, TV show, and voice activated device.

The feeling different phase I think we are going through at the moment, like previous times, is building on what went before.  It is exemplified by information [data] breaking out of the boundaries of our web sites and appearing where it is useful for the user. 

library_of_congress We are seeing the tip of this iceberg in the search engine Knowledge Panels, answer boxes, and rich snippets,ebay-rich-snippet   The effect of this being that often your potential user can get what they need without having to find and visit your site – answering questions such as what is the customer service phone number for an organisation; is the local branch open at the moment; give me driving directions to it; what is available and on offer.  Increasingly these interactions can occur without the user even being aware they are using the web – “Siri! Where is my nearest library?
A great way to build relationships with your customers. However a new and interesting challenge for those trying to measure the impact of your web site.

So, what is fundamental to this New Web

There are several things – HTTP, the light-weight protocol designed to transfer text, links and latterly data, across an internet previously used to specific protocols for specific purposes – HTML, that open, standard, easily copied light-weight extensible generic format for describing web pages that all browsers can understand – Microdata, RDFa, JSON, JSON-LD – open standards for easily embedding data into HTML – RDF, an open data format for describing things of any sort, in the form of triples, using shared vocabularies.  Building upon those is Schema.org – an open, [de facto] standard, generic vocabulary for describing things in most areas of interest. 

icon-LOVWhy is one vocabulary fundamental when there are so many others to choose from? Check out the 500+ referenced on the  Linked Open Vocabularies (LOV) site.  Schema.org however differs from most of the others in a few key areas:Schema.org square 

  • Size and scope – its current 642 Types and 992 Properties is significantly larger and covers far more domains of interest than most others.  This means that if you are looking to describe a something, you are highly likely to to find enough to at least start.  Despite its size, it is yet far from capable of describing everything on, or off, the planet.
  • Adoption – it is estimated to be in use on over 12 million sites.  A sample of 10 billion pages showed over 30% containing Schema.org markup.  Checkout this article for more detail: Schema.org: Evolution of Structured Data on the Web – Big data makes common schemas even more necessary. By Guha, Brickley and Macbeth.
  • Evolution – it is under continuous evolutionary development and extension, driven and guided by an open community under the wing of the W3C and accessible in a GitHub repository.
  • Flexibility – from the beginning Schema.org was designed to be used in a choice of your favourite serialisation – Microdata, RDFa, JSON-LD, with the flexibility of allowing values to default to text if you have not got a URI available.
  • Consumers – The major search engines Google, Bing, Yahoo!, and Yandex, not only back the open initiative behind Schema.org but actively search out Schema.org markup to add to their Knowledge Graphs when crawling your sites.
  • Guidance – If you search out guidance on supplying structured data to those major search engines, you are soon supplied with recommendations and examples for using Schema.org, such as this from Google.  They even supply testing tools for you to validate your markup.

With this support and adoption, the Schema.org initiative has become self-fulfilling.  If your objective is to share or market structured data about your site, organisation, resources, and or products with the wider world; it would be difficult to come up with a good reason not to use Schema.org.

Is it a fully ontologically correct semantic web vocabulary? Although you can see many semantic web and linked data principles within it, no it is not.  That is not its objective. It is a pragmatic compromise between such things, and the general needs of webmasters with ambitions to have their resources become an authoritative part of the global knowledge graphs, that are emerging as key to the future of the development of search engines and the web they inhabit.

Note that I question if Schema.org is a fundamental component, of what I am feeling is a New Web. It is not the fundamental component, but  one of many that over time will become just the way we do things

Evolving Schema.org in Practice Pt2: Working Within the Vocabulary

Having covered the working environment; I now intend to describe some of the important files that make up Schema.org and how you can work with them to create or update, examples and term definitions within your local forked version.

Thing_rdfa In the previous post in this series Pt1: The Bits and Pieces I stepped through the process of obtaining your own fork of the Schema.org GitHub repository; working on it locally; uploading your version for sharing; and proposing those changes to the Schema.org community in the form of a GitHub Pull Request.

Having covered the working environment; in this post I now intend to describe some of the important files that make up Schema.org and how you can work with them to create or update, examples and term definitions within your local forked version in preparation for proposing them in a Pull Request.

files The File Structure
If you inspect the repository you will see a simple directory structure.  At the top level you will find a few files sporting a .py suffix.  These contain the python application code to run the site you see at http://schema.org.  They load the configuration files, build an in-memory version of the vocabulary that are used to build the html pages containing the definitions of the terms, schema listings, examples displays, etc.  They are joined by a file named app.yaml, which contains the configuration used by the Google App Engine to run that code.

At this level there are some directories containing supporting files: docs & templates contain static content for some pages; tests & scripts are used in the building and testing of the site; data contains the files that define the vocabulary, its extensions, and the examples used to demonstrate its use.

The Data Files
The data directory itself contains various files and directories.  schema.rdfa is the most important file, it contains the core definitions for the majority of the vocabulary.  Although, most of the time, you will see schema.rdfa as the only file with a .rdfa suffix in the data directory, the application will look for and load any .rdfa files it finds here.  This is a very useful feature when working on a local version – you can keep your enhancements together only merging them into the main schema.rdfa file when ready to propose them.

Also in the data directory you will find an examples.txt file and several others ending with –examples.txt.  These contain the examples used on the term pages, the application loads all of them.

Amongst the directories in data, there are a couple of important ones.  releases contains snapshots of versions of the vocabulary from version 2.0 onwards.  datafilesThe directory named ext contains the files that define the vocabulary extensions and examples that relate to them.  Currently you will find auto and bib directories within ext, corresponding to the extensions currently supported.  The format within these directories follows the basic pattern of the data directory – one or more .rdfa files containing the term definitions and –examples.txt files containing relevant examples.

Getting to grips with the RDFa

Enough preparation let’s get stuck into some vocabulary!

Take your favourite text/code editing application and open up schema.rdfa. You will notice two things – it is large [well over 12,500 lines!], and it is in the format of a html file.  Schema_html This second attribute makes it easy for non-technical viewing – you can open it with a browser.

Once you get past a bit of CSS formatting information and a brief introduction text, you arrive [about 35 lines down] at the first couple of definitions – for Thing and CreativeWork.

The Anatomy of a Type Definition
Standard RDFa (RDF in attributes) html formatting is used to define each term.  A vocabulary Type is defined as a RDFa marked up <div> element with its attributes contained in marked up <span> elements.

The Thing Type definition:

The attributes of the <div> element indicate that this is the definition of a Type (typeof=”rdfs:Class”) and its canonical identifier (resource=”http://schema.org/Thing”).  The <span> elements filling in the details – it has a label (rdfs:label) of ‘Thing‘ and a descriptive comment (rdfs:comment) of ‘The most generic type of item‘.  There is one formatting addition to the <span> containing the label.  The class=”h” is there to make the labels stand out when viewing in a browser – it has no direct relevance to the structure of the vocabulary.

The CreativeWork Type definition

Inspecting the CreativeWork definition reveals a few other attributes of a Type defined in <span> elements.  The rdfs:subClassOf property, with the associated href on the <a> element, indicates that http://schema.org/CreativeWork is a sub-type of http://schema.org/Thing.

Acknowledge Finally there is the dc:source property and its associated href value.  This has no structural impact on the vocabulary, its purpose is to acknowledge and reference the source of the inspiration for the term.  It is this reference that results in the display of a paragraph under the Acknowledgements section of a term page.

Defining Properties
The properties that can be used with a Type are defined in a very similar way to the Types themselves.

The name Property definition:

The attributes of the <div> element indicate that this is the definition of a Property (typeof=”rdf:Property”) and its canonical identifier (resource=”http://schema.org/name”).  As with Types the <span> elements fill in the details.

name Properties have two specific <span> elements to define the domain and range of a property.  If these concepts are new to you, the concepts are basically simple.  The Type(s) defined as being in the domain of a property are are those for which the property is a valid attribute.  The Type(s) defined as being in the range of a property, are those that expected values for that property.  So inspecting the above name example we can see that name is a valid property of the Thing Type with an expected value type of Text.  Also specific to property definitions is rdfs:subPropertyOf which defies that one property is a sub-property another.  For html/RDFa format reasons this is defined using a link entity thus: <link property=”rdfs:subPropertyOf” href=”http://schema.org/workFeatured” />.

Those used to defining other RDF vocabularies may question the use of  http://schema.org/domainIncludes and http://schema.org/rangeIncludes to define these relationships. This is a pragmatic approach to producing a flexible data model for the web.  For a more in-depth explanation I refer you to the Schema.org Data Model documentation.

Not an exhaustive tutorial in editing the defining RDFa but hopefully enough to get you going!


Making Examplesexamples

One of the most powerful features of the Schema.org documentation is the Examples section on most of the term pages.  These provide mark up examples for most of the terms in the vocabulary, that can be used and built upon by those adding Schema.org data to their web pages.  These examples represent how the html of a page or page section may be marked up.  To set context, the examples are provided in several serialisations – basic html, html plus Microdata, html plus RDFa, and JSON-LD.  As the objective is to aid the understanding of how Schema.org may be used, it is usual to provide simple basic html formatting in the examples.

Examples in File
As described earlier, the source for examples are held in files with a –examples.txt suffix, stored in the data directory or in individual extension directories.

One or more examples per file are defined in a very simplistic format.

An example begins in the file with a line that starts with TYPES:, such as this:

This example has a unique identifier prefixed with a # character, there should be only one of these per example.  These identifiers are intended for future feedback mechanisms and as such are not particularly controlled.  I recommend you crate your own when creating your examples.  Next comes a comma separated list of term names.  Adding a term to this list will result in the example appearing on the page for that term.  This is true for both Types and Properties.

Next comes four sections each preceded by a line containing a single label in the following order: PRE-MARKUP:, MICRODATA:, RDFA:, JSON:.  Each section ends when the next label line, or the end of the file is reached.  The contents of each section of the example is then inserted into the appropriate tabbed area on the term page.  The process that does this is not a sophisticated one, there are no error or syntax checking involved – if you want to insert the text of the Gettysburg Address as your RDFa example, it will let you do it.

I am not going to provide tutorials for html, Microdata, RDFa, or JSON-LD here there are a few of those about.  I will however recommend a tool I use to convert between these formats when creating examples.  RDF Translatorrdftranslator is a simple online tool that will validate and translate between RDFa, Microdata, RDF/XML, N3, N-Triples, and JSON-LD.  A suggestion, to make your examples as informative possible – when converting between formats, especially when converting to JSON-LD, most conversion tools reorder he statements. It is worth investing some time in ensuring that the mark up order in your example is consistent for all serialisations.

Hopefully this post will clear away some of mystery of how Schema.org is structured and managed. If you have proposals in mind to enhance and extend the vocabulary or examples, have a go, see if thy make sense in a version on your own system, suggest them to the community on Github.

In my next post I will look more at extensions, Hosted and External, and how you work with those, including some hints on choosing where to propose changes – in the core vocabulary, in a hosted or an external extension.

Schema.org 2.0

About a month ago Version 2.0 of the Schema.org vocabulary hit the streets. But does this warrant the version number clicking over from 1.xx to 2.0?

schema-org1 About a month ago Version 2.0 of the Schema.org vocabulary hit the streets.

This update includes loads of tweaks, additions and fixes that can be found in the release information.  The automotive folks have got new vocabulary for describing Cars including useful properties such as numberofAirbags, fuelEfficiency, and knownVehicleDamages. New property mainEntityOfPage (and its inverse, mainEntity) provide the ability to tell the search engine crawlers which thing a web page is really about.  With new type ScreeningEvent to support movie/video screenings, and a gtin12 property for Product, amongst others there is much useful stuff in there.

But does this warrant the version number clicking over from 1.xx to 2.0?

These new types and properties are only the tip of the 2.0 iceberg.  There is a heck of a lot of other stuff going on in this release that apart from these additions.  Some of it in the vocabulary itself, some of it in the potential, documentation, supporting software, and organisational processes around it.

Sticking with the vocabulary for the moment, there has been a bit of cleanup around property names. As the vocabulary has grown organically since its release in 2011, inconsistencies and conflicts between different proposals have been introduced.  So part of the 2.0 effort has included some rationalisation.  For instance the Code type is being superseded by SoftwareSourceCode – the term code has many different meanings many of which have nothing to do with software; surface has been superseded by artworkSurface and area is being superseded by serviceArea, for similar reasons. Check out the release information for full details.  If you are using any of the superseded terms there is no need to panic as the original terms are still valid but with updated descriptions to indicate that they have been superseded.  However you are encouraged to moved towards the updated terminology as convenient.  The question of what is in which version brings me to an enhancement to the supporting documentation.  Starting with Version 2.0 there will be published a snapshot view of the full vocabulary – here is http://schema.org/version/2.0.  So if you want to refer to a term at a particular version you now can.

CreativeWork_usage How often is Schema being used? – is a question often asked. A new feature has been introduced to give you some indication.  Checkout the description of one of the newly introduced properties mainEntityOfPage and you will see the following: ‘Usage: Fewer than 10 domains‘.  Unsurprisingly for a newly introduced property, there is virtually no usage of it yet.  If you look at the description for the type this term is used with, CreativeWork, you will see ‘Usage: Between 250,000 and 500,000 domains‘.  Not a direct answer to the question, but a good and useful indication of the popularity of particular term across the web.

In the release information you will find the following cryptic reference: ‘Fix to #429: Implementation of new extension system.’

This refers to the introduction of the functionality, on the Schema.org site, to host extensions to the core vocabulary.  The motivation for this new approach to extending is explained thus:

Schema.org provides a core, basic vocabulary for describing the kind of entities the most common web applications need. There is often a need for more specialized and/or deeper vocabularies, that build upon the core. The extension mechanisms facilitate the creation of such additional vocabularies.
With most extensions, we expect that some small frequently used set of terms will be in core schema.org, with a long tail of more specialized terms in the extension.

As yet there are no extensions published.  However, there are some on the way.

As Chair of the Schema Bib Extend W3C Community Group I have been closely involved with a proposal by the group for an initial bibliographic extension (bib.schema.org) to Schema.org.  The proposal includes new Types for Chapter, Collection, Agent, Atlas, Newspaper & Thesis, CreativeWork properties to describe the relationship between translations, plus types & properties to describe comics.  I am also following the proposal’s progress through the system – a bit of a learning exercise for everyone.  Hopefully I can share the news in the none too distant future that bib will be one of the first released extensions.

W3C Community Group for Schema.org
A subtle change in the way the vocabulary, it’s proposals, extensions and direction can be followed and contributed to has also taken place.  The creation of the Schema.org Community Group has now provided an open forum for this.

So is 2.0 a bit of a milestone?  Yes taking all things together I believe it is. I get the feeling that Schema.org is maturing into the kind of vocabulary supported by a professional community that will add confidence to those using it and recommending that others should.

The role of Role in Schema.org

This post is about an unusual, but very useful, aspect of the Schema.org vocabulary — the Role type.

schema-org1 Schema.org is basically a simple vocabulary for describing stuff, on the web.  Embed it in your html and the search engines will pick it up as they crawl, and add it to their structured data knowledge graphs.  They even give you three formats to chose from — Microdata, RDFa, and JSON-LD — when doing the embedding.  I’m assuming, for this post, that the benefits of being part of the Knowledge Graphs that underpin so called Semantic Search, and hopefully triggering some Rich Snippet enhanced results display as a side benefit, are self evident.

The vocabulary itself is comparatively easy to apply once you get your head around it — find the appropriate Type (Person, CreativeWork, Place, Organization, etc.) for the thing you are describing, check out the properties in the documentation and code up the ones you have values for.  Ideally provide a URI (URL in Schema.org) for a property that references another thing, but if you don’t have one a simple string will do.

There are a few strangenesses, that hit you when you first delve into using the vocabulary.  For example, there is no problem in describing something that is of multiple types — a LocalBussiness is both an Organisation and a Place.  This post is about another unusual, but very useful, aspect of the vocabulary — the Role type.

At first look at the documentation, Role looks like a very simple type with a handful of properties.  On closer inspection, however, it doesn’t seem to fit in with the rest of the vocabulary.  That is because it is capable of fitting almost anywhere.  Anywhere there is a relationship between one type and another, that is.  It is a special case type that allows a relationship, say between a Person and an Organization, to be given extra attributes.  Some might term this as a form of annotation.

So what need is this satisfying you may ask.  It must be a significant need to cause the creation of a special case in the vocabulary.  Let me walk through a case, that is used in a Schema.org Blog post, to explain a need scenario and how Role satisfies that need.

Starting With American Football

Say you are describing members of an American Football Team.  Firstly you would describe the team using the SportsOrganization type, giving it a name, sport, etc. Using RDFa:

Then describe a player using a Person type, providing name, gender, etc.:

Now lets relate them together by adding an athlete relationship to the Person description:


Let’s take a look of the data structure we have created using Turtle – not a html markup syntax but an excellent way to visualise the data structures isolated from the html:

So we now have Chucker Roberts described as an athlete on the Touchline Gods team.  The obvious question then is how do we describe the position he plays in the team.  We could have extended the SportsOrganization type with a property for every position, but scaling that across every position for every team sport type would have soon ended up with far more properties than would have been sensible, and beyond the maintenance scope of a generic vocabulary such as Schema.org.

This is where Role comes in handy.  Regardless of the range defined for any property in Schema.org, it is acceptable to provide a Role as a value.  The convention then is to use a property with the same property name, that the Role is a value for, to then remake the connection to the referenced thing (in this case the Person).  In simple terms we have have just inserted a Role type between the original two descriptions.


This indirection has not added much you might initially think, but Role has some properties of its own (startDate, endDate, roleName) that can help us qualify the relationship between the SportsOrganization and the athlete (Person).  For the field of organizations there is a subtype of Role (OrganizationRole) which allows the relationship to be qualified slightly more.



and in Turtle:

Beyond American Football

So far I have just been stepping through the example provided in the Schema.org blog post on this.  Let’s take a look at an example from another domain – the one I spend my life immersed in – libraries.

There are many relationships between creative works that libraries curate and describe (books, articles, theses, manuscripts, etc.) and people & organisations that are not covered adequately by the properties available (author, illustrator,  contributor, publisher, character, etc.) in CreativeWork and its subtypes.  By using Role, in the same way as in the sports example above,  we have the flexibility to describe what is needed.

Take a book (How to be Orange: an alternative Dutch assimilation course) authored by Gregory Scott Shapiro, that has a preface written by Floor de Goede. As there is no writerOfPreface property we can use, the best we could do is to is to put Floor de Goede in as a contributor.  However by using Role can qualify the contribution role that he played to be that of the writer of preface.


In Turtle:

and RDFa:

You will note in this example I have made use of URLs, to external resources – VIAF for defining the Persons and the Library of Congress relator codes – instead of defining them myself as strings.  I have also linked the book to it’s Work definition so that someone exploring the data can discover other editions of the same work.

Do I always use Role?
In the above example I relate a book to two people, the author and the writer of preface.  I could have linked to the author via another role with the roleName being ‘Author’ or <http://id.loc.gov/vocabulary/relators/aut>.  Although possible, it is not a recommended approach.  Wherever possible use the properties defined for a type.  This is what data consumers such as search engines are going to be initially looking for.

One last example

To demonstrate the flexibility of using the Role type here is the markup that shows a small diversion in my early career:

This demonstrates the ability of Role to be used to provide added information about most relationships between entities, in this case the employee relationship. Often Role itself is sufficient, with the ability for the vocabulary to be extended with subtypes of Role to provide further use-case specific properties added.

Whenever possible use URLs for roleName
In the above example, it is exceedingly unlikely that there is a citeable definition on the web, I could link to for the roleName. So it is perfectly acceptable to just use the string “Keyboards Roadie”.  However to help the search engines understand unambiguously what role you are describing, it is always better to use a URL.  If you can’t find one, for example in the Library of Congress Relater Codes, or in Wikidata, consider creating one yourself in Wikipedia or Wikidata for others to share. Another spin-off benefit for using URIs (URLs) is that they are language independent, regardless of the language of the labels in the data  the URI always means the same thing.  Sources like Wikidata often have names and descriptions for things defined in multiple languages, which can be useful in itself.

Final advice
This very flexible mechanism has many potential uses when describing your resources in Schema.org. There is always a danger in over using useful techniques such as this. Be sure that there is not already a way within Schema, or worth proposing to those that look after the vocabulary, before using it.

Good luck in your role in describing your resources and the relationships between them using Schema.org

Google Sunsets Freebase – Good News For Wikidata?

Google announced yesterday that it is the end of the line for Freebase – will this be good for Wikidata?

fb Google announced yesterday that it is the end of the line for Freebase, and they have “decided to help transfer the data in Freebase to Wikidata, and in mid-2015 we’ll wind down the Freebase service as a standalone project”.wd

As well as retiring access for data creation and reading, they are also retiring API access – not good news for those who have built services on top of them.  The timetable they shared for the move is as follows:

Before the end of March 2015
– We’ll launch a Wikidata import review tool
– We’ll announce a transition plan for the Freebase Search API & Suggest Widget to a Knowledge Graph-based solution

March 31, 2015
– Freebase as a service will become read-only
– The website will no longer accept edits
– We’ll retire the MQL write API

June 30, 2015
– We’ll retire the Freebase website and APIs[3]
– The last Freebase data dump will remain available, but developers should check out the Wikidata dump

The crystal ball gazers could probably have predicted a move such as this when Google employed, the then lead of Wikidata, Denny Vrandečić a couple of years back. However they could have predicted a load of other outcomes too. 😉

In the long term this should be good news for Wikidata, but in the short term they may have a severe case of indigestion as they attempt to consume data that will, in some estimations, treble the size of Wikidata adding about 40 million Freebase facts into its current 12 million.  It won’t be a simple copy job.

Loading Freebase into Wikidata as-is wouldn’t meet the Wikidata community’s guidelines for citation and sourcing of facts — while a significant portion of the facts in Freebase came from Wikipedia itself, those facts were attributed to Wikipedia and not the actual original non-Wikipedia sources. So we’ll be launching a tool for Wikidata community members to match Freebase assertions to potential citations from either Google Search or our Knowledge Vault, so these individual facts can then be properly loaded to Wikidata.

There are obvious murmurings on the community groups about things such as how strict the differing policies for confirming facts are, and how useful the APIs are. There are bound to be some hiccups on this path – more of an arranged marriage than one of love at first sight between the parties.

I have spent many a presentation telling the world how Google have based their Knowledge Graph on the data from Freebase, which they got when acquiring Metaweb in 2010.

So what does this mean for the Knowledge Graph?  I believe it is a symptom of the Knowledge Graph coming of age as a core feature of the Google infrastructure.  They have used Freebase to seed the Knowledge Graph, but now that seed has grow into a young tree fed by the twin sources of Google search logs, and the rich nutrients delivered by Schema.org structured data embedded in millions of pages on the web. Following the analogy, the seed of Freebase, as a standalone project/brand, just doesn’t fit anymore with the core tree of knowledge that Google is creating and building.  No coincidence that  they’ll “announce a transition plan for the Freebase Search API & Suggest Widget to a Knowledge Graph-based solution”.

As for Wikidata, if the marriage of data is successful, it will establish it as the source for open structured data on the web and for facts within Wikipedia.

As the live source for information that will often be broader than the Wikipedia it sprang from, I suspect Wikidata’s rise will spur the eventual demise of that other source of structured data from Wikipedia – DBpedia.   How in the long term will it be able to compete, as a transformation of occasional dumps of Wikipedia, with a live evolving broader source?   Such a demise would be a slow process however – DBpedia has been the de facto link source for such a long time, its URIs are everywhere!

However you see the eventual outcomes for Frebase, Wikidata, and DBpedia, this is big news for structured data on the web.

Baby Steps Towards A Library Graph

image It is one thing to have a vision, regular readers of this blog will know I have them all the time, its yet another to see it starting to form through the mist into a reality. Several times in the recent past I have spoken of the some of the building blocks for bibliographic data to play a prominent part in the Web of Data.  The Web of Data that is starting to take shape and drive benefits for everyone.  Benefits that for many are hiding in plain site on the results pages of search engines. In those informational panels with links to people’s parents, universities, and movies, or maps showing the location of mountains, and retail outlets; incongruously named Knowledge Graphs.

Building blocks such as Schema.org; Linked Data in WorldCat.org; moves to enhance Schema.org capabilities for bibliographic resource description; recognition that Linked Data has a beneficial place in library data and initiatives to turn that into a reality; the release of Work entity data mined from, and linked to, the huge WorldCat.org data set.

OK, you may say, we’ve heard all that before, so what is new now?

As always it is a couple of seemingly unconnected events that throw things into focus.

Event 1:  An article by David Weinberger in the DigitalShift section of Library Journal entitled Let The Future Go.  An excellent article telling libraries that they should not be so parochially focused in their own domain whilst looking to how they are going serve their users’ needs in the future.  Get our data out there, everywhere, so it can find its way to those users, wherever they are.  Making it accessible to all.  David references three main ways to provide this access:

  1. APIs – to allow systems to directly access our library system data and functionality
  2. Linked Datacan help us open up the future of libraries. By making clouds of linked data available, people can pull together data from across domains
  3. The Library Graph –  an ambitious project libraries could choose to undertake as a group that would jump-start the web presence of what libraries know: a library graph. A graph, such as Facebook’s Social Graph and Google’s Knowledge Graph, associates entities (“nodes”) with other entities

(I am fortunate to be a part of an organisation, OCLC, making significant progress on making all three of these a reality – the first one is already baked into the core of OCLC products and services)

It is the 3rd of those, however, that triggered recognition for me.  Personally, I believe that we should not be focusing on a specific ‘Library Graph’ but more on the ‘Library Corner of a Giant Global Graph’  – if graphs can have corners that is.  Libraries have rich specialised resources and have specific needs and processes that may need special attention to enable opening up of our data.  However, when opened up in context of a graph, it should be part of the same graph that we all navigate in search of information whoever and wherever we are.

Event 2: A posting by ZBW Labs Other editions of this work: An experiment with OCLC’s LOD work identifiers detailing experiments in using the OCLC WorldCat Works Data.

ZBW contributes to WorldCat, and has 1.2 million oclc numbers attached to it’s bibliographic records. So it seemed interesting, how many of these editions link to works and furthermore to other editions of the very same work.

The post is interesting from a couple of points of view.  Firstly the simple steps they took to get at the data, really well demonstrated by the command-line calls used to access the data – get OCLCNum data from WorldCat.or in JSON format – extract the schema:exampleOfWork link to the Work – get the Work data from WorldCat, also in JSON – parse out the links to other editions of the work and compare with their own data.  Command-line calls that were no doubt embedded in simple scripts.

Secondly, was the implicit way that the corpus of WorldCat Work entity descriptions, and their canonical identifying URIs, is used as an authoritative hub for Works and their editions.  A concept that is not new in the library world, we have been doing this sort of things with names and person identities via other authoritative hubs, such as VIAF, for ages.  What is new here is that it is a hub for Works and their relationships, and the bidirectional nature of those relationships – work to edition, edition to work – in the beginnings of a library graph linked to other hubs for subjects, people, etc.

The ZBW Labs experiment is interesting in its own way – simple approach enlightening results.  What is more interesting for me, is it demonstrates a baby step towards the way the Library corner of that Global Web of Data will not only naturally form (as we expose and share data in this way – linked entity descriptions), but naturally fit in to future library workflows with all sorts of consequential benefits.

The experiment is exactly the type of initiative that we hoped to stimulate by releasing the Works data.  Using it for things we never envisaged, delivering unexpected value to our community.  I can’t wait to hear about other initiatives like this that we can all learn from.

So who is going to be doing this kind of thing – describing entities and sharing them to establish these hubs (nodes) that will form the graph.  Some are already there, in the traditional authority file hubs: The Library of Congress LC Linked Data Service for authorities and vocabularies (id.loc.gov), VIAF, ISNI, FAST, Getty vocabularies, etc.

As previously mentioned Work is only the first of several entity descriptions that are being developed in OCLC for exposure and sharing.  When others, such as Person, Place, etc., emerge we will have a foundation of part of a library graph – a graph that can and will be used, and added to, across the library domain and then on into the rest of the Global Web of Data.  An important authoritative corner, of a corner, of the Giant Global Graph.

As I said at the start these are baby steps towards a vision that is forming out of the mist.  I hope you and others can see it too.

(Toddler image: Harumi Ueda)

A Step for Schema.org – A Leap for Bib Data on the Web

Several significant bibliographic related proposals were brought together in a package which I take great pleasure in reporting was included in the latest v1.9 release of Schema.org

schema-org1 Regular readers of this blog may well know I am an enthusiast for Schema.org – the generic vocabulary for describing things on the web as structured data, backed by the major search engines Google, Bing, Yahoo! & Yandex.  When I first got my head around it back in 2011 I soon realised it’s potential for making bibliographic resources, especially those within libraries, a heck of a lot more discoverable.  To be frank library resources did not, and still don’t, exactly leap in to view when searching the web – a bit of a problem when most people start searching for things with Google et al – and do not look elsewhere.

Schema.org as a generic vocabulary to describe most stuff, easily embedded in your web pages, has been a great success.  IMG_0655As was reported by Google’s R.V. Guha, at the recent Semantic Technology and Business Conference in San Jose, a sample of 12B pages showed approximately 21% containing Schema.org markup.  Right from the beginning, however, I had concerns about its applicability to the bibliographic world – great start with the Book type, but there were gaps the coverage for such things as journal issues & volumes, multi-volume works, citations, and the relationship between a work and its editions.  Discovering others shared my combination of enthusiasm and concerns, I formed a W3C Community Group – Schema Bib Extend – to propose some bibliographic focused extensions to Schema.org. Which brings me to the events behind this post…

The SchemaBibEx group have had several proposals accepted over the last couple of years, such as making the [commercial] Offer more appropriate for describing loanable materials, and broadening of the citation property. Several other significant proposals were brought together in a package which I take great pleasure in reporting was included in the latest v1.9 release of Schema.org.  For many in our group these latest proposals were a long time coming after their initial proposal.  Although frustrating, the delays were symptomatic of a very healthy process.

Our proposals to add hasPart, isPartOf, exampleOfWork, and workExample to the CreativeWork Type will be available to many, as CreativeWork is the superclass to many types in many areas. Our proposals for issueNumber on PublicationIssue and volumeNumber on PerodicalVolume are very similar to others in the vocabulary, such as seasonNumber and episodeNumber in TV & Radio.  Under Dan Brickley’s careful organisation, tweaks and adjustments were made across a few areas resulting in a consistent style across parts of the vocabulary underpinned by CreativeWork.

Although the number of new types and properties are small, their addition to Schema opens up potential for much better description of periodicals and creative work relationships. To introduce the background to this, SchemaBibEx member Dan Scott and I were invited to jointly post on the Schema.org Blog.

So, another step forward for Schema.org.   I believe that is more than just a step however, for those wishing to make the bibliographic resources more visible on the Web.  There as been some criticism that Schema.org has been too simplistic to be able represent some of the relationships and subtleties from our world.  Criticism that was not unfounded.  Now with these enhancements, much of these criticisms are answered. There is more to do, but the major objective of the group that proposed them has been achieved – to lay the broad foundation for the description of bibliographic, and creative work, resources in sufficient detail for them to be understood by the search engines to become part of their knowledge graphs. Of course that is not the final end we are seeking.  The reason we share data is so that folks are guided to our resources – by sharing, using the well understood vocabulary, Schema.org.

worldcat Examples of a conceptual creative work being related to its editions, using exampleOfWork and workExample, have been available for some time.  In anticipation of their appearance in Schema, they were introduced into the OCLC WorldCat release of 194 million Work descriptions (for example: http://worldcat.org/entity/work/id/1363251773) with the inverse relationship being asserted in an updated version of the basic WorldCat linked data that has been available since 2012.

Visualising Schema.org

One of the most challenging challenges in my evangelism of the benefits of using Schema.org for sharing data about resources via the web is that it is difficult to ‘show’ what is going on.

The scenario goes something like this…..

Using the Schema.org vocabulary, you embed data about your resources in the HTML that makes up the page using either microdata or RDFa….”

At about this time you usually display a slide showing html code with embedded RDFa.  It may look pretty but the chances of more than a few of the audience being able to pick out the schema:Book or sameAs or rdf:type elements out of the plethora example_RDFaof angle brackets and quotes swimming before their eyes is fairly remote.

Having asked them to take a leap of faith that the gobbledegook you have just presented them with, is not only simple to produce but also invisible to users viewing their pages –  “but not to Google, which harvest that meaningful structured data from within your pages” – you ask them to take another leap [of faith].

You ask them to take on trust that Google is actually understanding, indexing and using that structured data.  At this point you start searching for suitable screen shots of Google Knowledge Graph to sit behind you whilst you hypothesise about the latest incarnation of their all-powerful search algorithm, and how they imply that they use the Schema.org data to drive so-called Semantic Search.

I enjoy a challenge, but I also like to find a better way sometimes.   w3

WorldCat_Logo_V_Color When OCLC first released Linked Data in WorldCat they very helpfully addressed the first of these issues by adding a visual display of the Linked Data to the bottom of each page.   This made my job far easier!

But it has a couple of downsides.  Firstly it is not the prettiest of displays and is only really of use to those interested in ‘seeing’ Linked Data.  Secondly, I believe it creates an impression to some that, if you want Google to grab structured data about resources, you need to display a chunk of gobbledegook on your pages.

turtle-32x32 Let the Green Turtle show the way!
Whilst looking for a better answer I discovered Green Turtle – a JavaScript library for working with RDFa and most usefully packaged in an extention for the Chrome browser.  Load this into your copy of Chrome and it will sit quietly in the background checking for RDFa (and microdata if you turn on the option) in the pages you are viewing.  When it finds one,  a green turtle iconturtle-32x32appears in the address bar.  GTtriplesClicking on that turtle opens up a new tab to show you a list of the data, in the form of triples, that it identified within the page.

That simple way to easily show someone the data embedded in a page, is a great aid to understanding for those new to the concept.  But that is not all.  This excellent little extension has a couple of extra tricks up its sleeve.

GTgraph It includes a visualisation of the [Linked Data] graph of relationships – the structure of the data.  Clicking on any of the nodes of the display, causes the value of the subject, predicate, or object it represents to be displayed below the image and the relevant row(s) in the list of triples to be highlighted.  As well as all this, there is a ‘Show Turtle’ button, which does just as you would expect opening up a window in which it has translated the triples into Turtle – Turtle being (after a bit of practise) the more human friendly way of viewing or creating RDF.

Green Turtle is a useful little tool which I would recommend to visualise microdata and RDFa, be it using the Schema.org vocabulary or not.  I am already using it on WorldCat in preference to scrolling to the bottom of the page to click the Linked Data tab.

google Custom Searches that know about Schema!
Google have recently enhanced the functionality of their Custom Search Engine (CSE) to enable searching by Schema.org Types.  Try out this example CSE which only returns results from WorldCat.org which have been described in their structured data as being of type schema:Book.

A simple yet powerful demonstration that not only are Google harvesting the Schema.org Linked Data from WorldCat, but they are also understanding it and are visibly using it to drive functionality.