Hidden Gems in the new Schema.org 3.1 Release

I spend a significant amount of time working with Google folks, especially Dan Brickley, and others on the supporting software, vocabulary contents, and application of Schema.org.  So it is with great pleasure, and a certain amount of relief, I share the announcement of the release of 3.1.

That announcement lists several improvements, enhancements and additions to the vocabulary that appeared in versions 3.0 & 3.1. These include:

  • Health Terms – A significant reorganisation of the extensive collection of medical/health terms, that were introduced back in 2012, into the ‘health-lifesci’ extension, which now contains 99 Types, 179 Properties and 149 Enumeration values.
  • Finance Terms – Following an initiative and work by the Financial Industry Business Ontology (FIBO) project (which I have the pleasure to be part of), in support of the W3C Financial Industry Business Ontology Community Group, several terms have been added to improve the capability for describing things such as banks, bank accounts, financial products such as loans, and monetary amounts.
  • Spatial and Temporal and Datasets – CreativeWork now includes spatialCoverage and temporalCoverage, which I know my cultural heritage colleagues and clients will find very useful.  Like many enhancements in the Schema.org community, this work came out of a parallel interest, in which Dataset has received some attention.
  • Hotels and Accommodation – Substantial new vocabulary for describing hotels and accommodation has been added, and documented.
  • Pending Extension – Version 3.0 introduced a special extension called “pending”, which provides a place for newly proposed schema.org terms to be documented, tested and revised.  The anticipation is that this area will be updated with proposals relatively frequently, in between formal Schema.org releases.
  • How We Work – A HowWeWork document has been added to the site. This comprehensive document details the many aspects of the operation of the community, the site, the vocabulary etc. – a useful way in for casual users through to those who want to immerse themselves in the vocabulary, its use and its development.

For fuller details on what is in 3.1 and other releases, check out the Releases document.

Hidden Gems

Working as I often do in the depths of the vocabulary, and the site that supports it, I get up close to improvements that are not obvious on the surface. Here are some that those who immerse themselves in Schema.org may find interesting:

  • Snappy Performance – The Schema.org site, a Python app hosted on Google App Engine, is, shall we say, a very popular site.  Over the last 3-4 releases I have been working on taking full advantage of its multi-threaded, multi-instance, memcache, and shared datastore capabilities. Add in page caching improvements plus an implementation of ETags, and we can see improved site performance which can best be described as snappiness. The only downsides are that, to see a new version update, you sometimes have to hard reload your browser page, and that I have learnt far more about these technologies than I ever thought I would need!
  • Data Downloads – We are often asked for a copy of the latest version of the vocabulary so that people can examine it, develop from it, build tools on it, or whatever takes their fancy.  This has been partially possible in the past, but now we have introduced (on a developers page we hope to expand with other useful stuff in the future – suggestions welcome) a download area for vocabulary definition files.  From here you can download, in your favourite format (Triples, Quads, JSON-LD, Turtle), files containing the core vocabulary, individual extensions, or the whole vocabulary.  (Tip: The page displays the link to the file that will always return the latest version.)
  • Data Model Documentation – Version 3.1 introduced updated contents to the Data Model documentation page, especially in the area of conformance.  I know from working with colleagues and clients, that it is sometimes difficult to get your head around Schema.org’s use of Multi-Typed Entities (MTEs) and the ability to use a Text, or a URL, or Role for any property value.  It is good to now have somewhere to point people when they question such things.
  • Markdown – This is a great addition for those enhancing, developing and proposing updates to the vocabulary.  The rdfs:comment section of term definitions is now passed through a Markdown processor.  This means that any formatting or links to be embedded in a term description do not have to be escaped with horrible coding such as &amp; and &gt; etc.  So, for example, a link can be input as [The Link](http://example.com/mypage) and italic text would be input as *italic*.  The processor also supports WikiLinks style links, which enable direct linking to a page within the site, so [[CreativeWork]] will result in the user being taken directly to the CreativeWork page via a correctly formatted link.   This makes the correct formatting of type descriptions a much nicer experience, as it does my debugging of the definition files. ;-)
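To make the Markdown point above a little more concrete, here is a minimal Python sketch of the kind of conversion described: a plain link, italic text, and a [[WikiLinks]] style link turned into HTML. It is an illustration only, not the code the Schema.org site uses; the function name render_comment, the regular expressions, and the "/" base path are all assumptions of mine, and a real Markdown processor handles far more syntax than this.

```python
import html
import re

# Hypothetical patterns for illustration; not the Schema.org site's own code.
WIKILINK = re.compile(r"\[\[([A-Za-z0-9]+)\]\]")      # [[CreativeWork]]
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")    # [text](url)
ITALIC = re.compile(r"\*([^*\n]+)\*")                 # *italic*


def render_comment(text: str, base: str = "/") -> str:
    """Convert a tiny subset of Markdown, plus [[WikiLinks]], to HTML."""
    # Escape &, < and > once, so authors never have to write entities by hand.
    out = html.escape(text, quote=False)
    # [[Term]] becomes a link to the term's page under the (assumed) base path.
    out = WIKILINK.sub(
        lambda m: f'<a href="{base}{m.group(1)}">{m.group(1)}</a>', out
    )
    # [label](url) becomes an ordinary hyperlink.
    out = MD_LINK.sub(r'<a href="\2">\1</a>', out)
    # *text* becomes emphasised text.
    out = ITALIC.sub(r"<em>\1</em>", out)
    return out


if __name__ == "__main__":
    print(render_comment("A [[CreativeWork]] with *italic* text & more."))
```

The escaping step runs first so that the angle brackets of the generated links are not themselves escaped.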

I could go on, but won’t.  Whether you are new to Schema.org or very familiar with it, I suggest you take a look.

Visualising Schema.org

One of the greatest challenges in my evangelism of the benefits of using Schema.org for sharing data about resources via the web is that it is difficult to ‘show’ what is going on.

The scenario goes something like this…

“Using the Schema.org vocabulary, you embed data about your resources in the HTML that makes up the page using either microdata or RDFa…”

At about this time you usually display a slide showing HTML code with embedded RDFa.  It may look pretty, but the chances of more than a few of the audience being able to pick the schema:Book or sameAs or rdf:type elements out of the plethora of angle brackets and quotes swimming before their eyes are fairly remote.

Having asked them to take a leap of faith that the gobbledegook you have just presented them with is not only simple to produce but also invisible to users viewing their pages – “but not to Google, which harvests that meaningful structured data from within your pages” – you ask them to take another leap [of faith].

You ask them to take on trust that Google is actually understanding, indexing and using that structured data.  At this point you start searching for suitable screen shots of Google Knowledge Graph to sit behind you whilst you hypothesise about the latest incarnation of their all-powerful search algorithm, and how they imply that they use the Schema.org data to drive so-called Semantic Search.

I enjoy a challenge, but I also like to find a better way sometimes.

When OCLC first released Linked Data in WorldCat they very helpfully addressed the first of these issues by adding a visual display of the Linked Data to the bottom of each page.   This made my job far easier!

But it has a couple of downsides.  Firstly it is not the prettiest of displays and is only really of use to those interested in ‘seeing’ Linked Data.  Secondly, I believe it creates an impression to some that, if you want Google to grab structured data about resources, you need to display a chunk of gobbledegook on your pages.

Let the Green Turtle show the way!
Whilst looking for a better answer I discovered Green Turtle – a JavaScript library for working with RDFa, most usefully packaged in an extension for the Chrome browser.  Load this into your copy of Chrome and it will sit quietly in the background checking for RDFa (and microdata if you turn on the option) in the pages you are viewing.  When it finds some, a green turtle icon appears in the address bar.  Clicking on that turtle opens up a new tab to show you a list of the data, in the form of triples, that it identified within the page.

That simple way of showing someone the data embedded in a page is a great aid to understanding for those new to the concept.  But that is not all.  This excellent little extension has a couple of extra tricks up its sleeve.

It includes a visualisation of the [Linked Data] graph of relationships – the structure of the data.  Clicking on any of the nodes of the display causes the value of the subject, predicate, or object it represents to be displayed below the image and the relevant row(s) in the list of triples to be highlighted.  As well as all this, there is a ‘Show Turtle’ button, which does just as you would expect, opening up a window in which it has translated the triples into Turtle – Turtle being (after a bit of practice) the more human friendly way of viewing or creating RDF.

Green Turtle is a useful little tool which I would recommend to visualise microdata and RDFa, be it using the Schema.org vocabulary or not.  I am already using it on WorldCat in preference to scrolling to the bottom of the page to click the Linked Data tab.
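For anyone who would like to see what ‘a list of triples identified within the page’ means in practice, the following is a rough Python sketch using only the standard library. It is not how Green Turtle works internally, and it covers only a flat fragment of RDFa Lite (vocab, typeof, resource, property, href); the sample markup, the example.org URLs, and the class and function names are all invented for illustration.

```python
from html.parser import HTMLParser

# Invented sample markup: one schema:Book with a name and a sameAs link.
SAMPLE = """
<div vocab="http://schema.org/" typeof="Book" resource="http://example.org/book/1">
  <span property="name">Moby Dick</span>
  <a property="sameAs" href="http://example.org/other/moby-dick">Same book elsewhere</a>
</div>
"""


class RdfaLiteSketch(HTMLParser):
    """Collects (subject, predicate, object) tuples from flat RDFa Lite markup.

    Simplifications (assumptions of this sketch): a single, un-nested
    subject; no prefixes; any untyped subject is given the label '_:b0'.
    """

    def __init__(self):
        super().__init__()
        self.triples = []
        self.vocab = ""
        self.subject = None
        self._pending = None  # predicate waiting for its text content

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "vocab" in a:
            self.vocab = a["vocab"]
        if "typeof" in a:
            # A typed element starts a new subject and yields an rdf:type triple.
            self.subject = a.get("resource", "_:b0")
            self.triples.append((self.subject, "rdf:type", self.vocab + a["typeof"]))
        if "property" in a and self.subject is not None:
            predicate = self.vocab + a["property"]
            if "href" in a:
                # Links take their value from href, not from the link text.
                self.triples.append((self.subject, predicate, a["href"]))
            else:
                self._pending = predicate

    def handle_data(self, data):
        # The first non-blank text after a property element is its value.
        if self._pending and data.strip():
            self.triples.append((self.subject, self._pending, data.strip()))
            self._pending = None


def extract_triples(html_text):
    parser = RdfaLiteSketch()
    parser.feed(html_text)
    parser.close()
    return parser.triples


if __name__ == "__main__":
    for triple in extract_triples(SAMPLE):
        print(triple)
```

Run against the sample, this lists one rdf:type triple for the schema:Book plus one triple each for name and sameAs, which is essentially the table a tool like Green Turtle presents, minus the pretty graph.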

Custom Searches that know about Schema!
Google have recently enhanced the functionality of their Custom Search Engine (CSE) to enable searching by Schema.org Types.  Try out this example CSE which only returns results from WorldCat.org which have been described in their structured data as being of type schema:Book.

A simple yet powerful demonstration that not only are Google harvesting the Schema.org Linked Data from WorldCat, but they are also understanding it and are visibly using it to drive functionality.

Google SEO RDFa and Semantic Search

Today’s Wall Street Journal gives us an insight into the makeover underway in the Google search department.

Over the next few months, Google’s search engine will begin spitting out more than a list of blue Web links. It will also present more facts and direct answers to queries at the top of the search-results page.

They are going about this by developing the search engine [that] will better match search queries with a database containing hundreds of millions of “entities”—people, places and things—which the company has quietly amassed in the past two years.

The ‘amassing’ got a kick start in 2010 with the Metaweb acquisition that brought Freebase and its 12 million entities into the Google fold.  This is now continuing with the harvesting of HTML-embedded, schema.org-encoded structured data that is starting to spread across the web.

The encouragement for webmasters and SEO folks to go to the trouble of inserting this information into their HTML is the prospect of a better result display for their page – Rich Snippets.  A nice trade-off from Google – you embed the information we want/need for a better search and we will give you better results.

The premise of what Google are up to is that it will deliver better search.  Yes, this should be true; however, I would suggest that the major benefit to us mortal Googlers will be better results.  The search engine should appear to have greater intuition as to what we are looking for, but what we should also get is more information about the things that it finds for us.  This is the step-change.  We will be getting, in addition to web page links, information about things – the location, altitude, average temperature or salt content of a lake – whereas today you would only get links to the lake’s visitors centre or a Wikipedia page.

Another example quoted in the article:

…people who search for a particular novelist like Ernest Hemingway could, under the new system, find a list of the author’s books they could browse through and information pages about other related authors or books, according to people familiar with the company’s plans. Presumably Google could suggest books to buy, too.

Many in the library community may note this with scepticism, seeing it as too simplistic an approach to something that they have been striving towards for many years with only limited success.  I would say that they should be helping the search engine supplier(s) do this right and be part of the process.  There is great danger that, for better or worse, whatever Google does will make the library search interface irrelevant.

As an advocate for linked data, it is great to see the benefits of defining entities and describing the relationships between them being taken seriously.   I’m not sure I buy into the term ‘Semantic Search’ as a name for what will result.  I tend more towards ‘Semantic Discovery’, which is more descriptive of where the semantics kick in – in the relationship between a searched-for thing and its attributes and other entities.  However, I’ve been around far too long to get hung up about labels.

Whilst we are on the topic of labels, I am in danger of stepping into the almost religious debate about the relative merits of microdata and RDFa as the encoding method for embedding schema.org data.  Google recognises both, both are ugly for humans to hand code, and webmasters should not have to care.  Once the CMS suppliers get up to speed in supplying the modules to automatically embed this stuff, as per this Drupal module, they won’t have to care.

I welcome this.  Yet it is only a symptom of something much bigger and game-changing, as I postulated last month in A Data 7th Wave is Approaching.