Having covered the working environment, in this post I now intend to describe some of the important files that make up Schema.org and how you can work with them to create or update examples and term definitions within your local forked version, in preparation for proposing them in a Pull Request.
The File Structure
If you inspect the repository you will see a simple directory structure. At the top level you will find a few files sporting a .py suffix. These contain the Python application code to run the site you see at http://schema.org. They load the configuration files and build an in-memory version of the vocabulary that is used to build the html pages containing the definitions of the terms, schema listings, example displays, etc. They are joined by a file named app.yaml, which contains the configuration used by the Google App Engine to run that code.
At this level there are some directories containing supporting files: docs & templates contain static content for some pages; tests & scripts are used in the building and testing of the site; data contains the files that define the vocabulary, its extensions, and the examples used to demonstrate its use.
The Data Files
The data directory itself contains various files and directories. schema.rdfa is the most important file; it contains the core definitions for the majority of the vocabulary. Although, most of the time, you will see schema.rdfa as the only file with a .rdfa suffix in the data directory, the application will look for and load any .rdfa files it finds here. This is a very useful feature when working on a local version – you can keep your enhancements together, only merging them into the main schema.rdfa file when ready to propose them.
Also in the data directory you will find an examples.txt file and several others ending with -examples.txt. These contain the examples used on the term pages; the application loads all of them.
Amongst the directories in data, there are a couple of important ones. releases contains snapshots of versions of the vocabulary from version 2.0 onwards. The directory named ext contains the files that define the vocabulary extensions and the examples that relate to them. Currently you will find auto and bib directories within ext, corresponding to the extensions currently supported. The format within these directories follows the basic pattern of the data directory – one or more .rdfa files containing the term definitions and -examples.txt files containing relevant examples.
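To make the loading convention concrete, here is a minimal Python sketch of how the files in a data directory could be discovered. It is not the code the site itself uses; the function name find_vocabulary_files is hypothetical, and the sketch only looks at the top level of the directory it is given.

```python
from pathlib import Path


def find_vocabulary_files(data_dir):
    """Return (definition_files, example_files) found directly in data_dir.

    Hypothetical helper for illustration; not part of the Schema.org code base.
    """
    root = Path(data_dir)
    # Any file with a .rdfa suffix counts, not just schema.rdfa.
    definitions = sorted(p.name for p in root.glob("*.rdfa"))
    # examples.txt itself plus anything ending in -examples.txt.
    examples = sorted(p.name for p in root.glob("*examples.txt"))
    return definitions, examples
```

Pointing the same function at an extension directory such as data/ext/bib should work in the same way, since those directories follow the same pattern.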
Getting to grips with the RDFa
Enough preparation, let’s get stuck into some vocabulary!
Take your favourite text/code editing application and open up schema.rdfa. You will notice two things – it is large [well over 12,500 lines!], and it is in the format of an html file. This second attribute makes it easy for non-technical viewing – you can open it with a browser.
Once you get past a bit of CSS formatting information and a brief introduction text, you arrive [about 35 lines down] at the first couple of definitions – for Thing and CreativeWork.
The Anatomy of a Type Definition
Standard RDFa (RDF in attributes) html formatting is used to define each term. A vocabulary Type is defined as an RDFa marked up <div> element with its attributes contained in marked up <span> elements.
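<div typeof="rdfs:Class" resource="http://schema.org/Thing">
  <span class="h" property="rdfs:label">Thing</span>
  <span property="rdfs:comment">The most generic type of item.</span>
</div>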
The attributes of the <div> element indicate that this is the definition of a Type (typeof="rdfs:Class") and its canonical identifier (resource="http://schema.org/Thing"). The <span> elements fill in the details – it has a label (rdfs:label) of ‘Thing’ and a descriptive comment (rdfs:comment) of ‘The most generic type of item’. There is one formatting addition to the <span> containing the label. The class="h" is there to make the labels stand out when viewing in a browser – it has no direct relevance to the structure of the vocabulary.
Inspecting the CreativeWork definition reveals a few other attributes of a Type defined in <span> elements. The rdfs:subClassOf property, with the associated href on the <a> element, indicates that http://schema.org/CreativeWork is a sub-type of http://schema.org/Thing.
Finally there is the dc:source property and its associated href value. This has no structural impact on the vocabulary; its purpose is to acknowledge and reference the source of the inspiration for the term. It is this reference that results in the display of a paragraph under the Acknowledgements section of a term page.
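If you want to check your own additions programmatically, a definition of this shape can be read back with nothing more than the Python standard library. The following is a rough sketch only, not how the Schema.org application itself parses the file. The class TermParser, the function parse_terms, and the SAMPLE markup are all hypothetical; in particular the CreativeWork comment text is a placeholder, not the real one.

```python
from html.parser import HTMLParser

# Hand-written sample in the shape described above (placeholder comment text).
SAMPLE = """
<div typeof="rdfs:Class" resource="http://schema.org/CreativeWork">
  <span class="h" property="rdfs:label">CreativeWork</span>
  <span property="rdfs:comment">Placeholder comment.</span>
  <span>Subclass of: <a property="rdfs:subClassOf" href="http://schema.org/Thing">Thing</a></span>
</div>
"""


class TermParser(HTMLParser):
    """Collect one dict per <div typeof=...> definition."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._property = None  # text-valued property of the <span> now open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "typeof" in attrs:
            self.terms.append({"typeof": attrs["typeof"],
                               "resource": attrs.get("resource")})
        elif self.terms and "property" in attrs:
            if "href" in attrs:
                # Link-valued properties (e.g. rdfs:subClassOf) keep their value in href.
                self.terms[-1].setdefault(attrs["property"], []).append(attrs["href"])
            else:
                self._property = attrs["property"]

    def handle_data(self, data):
        if self._property and data.strip():
            self.terms[-1][self._property] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._property = None


def parse_terms(html):
    parser = TermParser()
    parser.feed(html)
    return parser.terms
```

Calling parse_terms(SAMPLE) yields a single dictionary holding the typeof, resource, label, comment, and a list of rdfs:subClassOf targets. The class="h" attribute is ignored, which matches its purely presentational role.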
Defining Properties
The properties that can be used with a Type are defined in a very similar way to the Types themselves.
The attributes of the <div> element indicate that this is the definition of a Property (typeof="rdf:Property") and its canonical identifier (resource="http://schema.org/name"). As with Types the <span> elements fill in the details.
Properties have two specific <span> elements to define the domain and range of a property. If these concepts are new to you, they are basically simple. The Type(s) defined as being in the domain of a property are those for which the property is a valid attribute. The Type(s) defined as being in the range of a property are those that are expected as values for that property. So inspecting the above name example we can see that name is a valid property of the Thing Type with an expected value type of Text. Also specific to property definitions is rdfs:subPropertyOf, which defines that one property is a sub-property of another. For html/RDFa format reasons this is defined using a link element thus: <link property="rdfs:subPropertyOf" href="http://schema.org/workFeatured" />.
Those used to defining other RDF vocabularies may question the use of http://schema.org/domainIncludes and http://schema.org/rangeIncludes to define these relationships. This is a pragmatic approach to producing a flexible data model for the web. For a more in-depth explanation I refer you to the Schema.org Data Model documentation.
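A small Python sketch may help if domain and range are new to you. It models domainIncludes and rangeIncludes as plain dictionaries and assumes that a property listed for a Type is also usable on its sub-types. The dictionaries are hand-written for illustration, and the helper names ancestors and property_applies are hypothetical, not part of the Schema.org software.

```python
# Hypothetical in-memory model, hand-written for illustration only.
SUBCLASS_OF = {"CreativeWork": "Thing", "Book": "CreativeWork"}
DOMAIN_INCLUDES = {"name": {"Thing"}}   # Types the property may be used on
RANGE_INCLUDES = {"name": {"Text"}}     # Types expected as its values


def ancestors(type_name):
    """Yield type_name followed by each of its super-types in turn."""
    while type_name is not None:
        yield type_name
        type_name = SUBCLASS_OF.get(type_name)


def property_applies(prop, type_name):
    """True if prop has type_name, or one of its super-types, in its domain."""
    domain = DOMAIN_INCLUDES.get(prop, set())
    return any(t in domain for t in ancestors(type_name))
```

With this model, property_applies("name", "Book") is true because Book reaches Thing through CreativeWork, while RANGE_INCLUDES["name"] records that the expected value is Text.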
Not an exhaustive tutorial in editing the defining RDFa but hopefully enough to get you going!
One of the most powerful features of the Schema.org documentation is the Examples section on most of the term pages. These provide mark up examples for most of the terms in the vocabulary, that can be used and built upon by those adding Schema.org data to their web pages. These examples represent how the html of a page or page section may be marked up. To set context, the examples are provided in several serialisations – basic html, html plus Microdata, html plus RDFa, and JSON-LD. As the objective is to aid the understanding of how Schema.org may be used, it is usual to provide simple basic html formatting in the examples.
Examples in File
As described earlier, the source for examples is held in files with a -examples.txt suffix, stored in the data directory or in individual extension directories.
One or more examples per file are defined in a very simplistic format.
An example begins in the file with a line that starts with TYPES:, such as this:
This example has a unique identifier prefixed with a # character; there should be only one of these per example. These identifiers are intended for future feedback mechanisms and as such are not particularly controlled. I recommend you create your own when creating your examples. Next comes a comma separated list of term names. Adding a term to this list will result in the example appearing on the page for that term. This is true for both Types and Properties.
Next come four sections, each preceded by a line containing a single label, in the following order: PRE-MARKUP:, MICRODATA:, RDFA:, JSON:. Each section ends when the next label line, or the end of the file, is reached. The contents of each section of the example are then inserted into the appropriate tabbed area on the term page. The process that does this is not a sophisticated one; there is no error or syntax checking involved – if you want to insert the text of the Gettysburg Address as your RDFa example, it will let you do it.
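The format described above is simple enough to read with a few lines of Python. This is a sketch under assumptions rather than the site's own loader: the function name parse_examples is hypothetical, the SAMPLE example is invented, and I am assuming the # identifier is separated from the term list by a space. Like the real process, it does no checking of what is inside each section.

```python
SECTION_LABELS = ("PRE-MARKUP:", "MICRODATA:", "RDFA:", "JSON:")

# Invented example text, for illustration only.
SAMPLE = """TYPES: #demo-1 Person, name

PRE-MARKUP:
<p>Jane Doe</p>

MICRODATA:
<p itemscope itemtype="http://schema.org/Person"><span itemprop="name">Jane Doe</span></p>

RDFA:
<p vocab="http://schema.org/" typeof="Person"><span property="name">Jane Doe</span></p>

JSON:
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Person", "name": "Jane Doe"}
</script>
"""


def parse_examples(text):
    """Split an examples file into a list of dicts, one per TYPES: block."""
    examples = []
    current = None   # example currently being filled in
    section = None   # name of the section currently being collected
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("TYPES:"):
            header = stripped[len("TYPES:"):].strip()
            ident = None
            if header.startswith("#"):
                # Assumption: identifier and term list are separated by a space.
                ident, _, header = header.partition(" ")
            terms = [t.strip() for t in header.split(",") if t.strip()]
            current = {"id": ident, "terms": terms, "sections": {}}
            examples.append(current)
            section = None
        elif stripped in SECTION_LABELS and current is not None:
            section = stripped.rstrip(":")
            current["sections"][section] = []
        elif current is not None and section is not None:
            current["sections"][section].append(line)
    for example in examples:
        example["sections"] = {name: "\n".join(lines).strip()
                               for name, lines in example["sections"].items()}
    return examples
```

Running parse_examples(SAMPLE) gives one example whose terms list decides which term pages it would appear on, and whose four sections map onto the four tabs.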
I am not going to provide tutorials for html, Microdata, RDFa, or JSON-LD here; there are a few of those about. I will however recommend a tool I use to convert between these formats when creating examples. RDF Translator is a simple online tool that will validate and translate between RDFa, Microdata, RDF/XML, N3, N-Triples, and JSON-LD. A suggestion, to make your examples as informative as possible – when converting between formats, especially when converting to JSON-LD, most conversion tools reorder the statements. It is worth investing some time in ensuring that the mark up order in your example is consistent for all serialisations.
Hopefully this post will clear away some of the mystery of how Schema.org is structured and managed. If you have proposals in mind to enhance and extend the vocabulary or examples, have a go: see if they make sense in a version on your own system, then suggest them to the community on Github.
In my next post I will look more at extensions, Hosted and External, and how you work with those, including some hints on choosing where to propose changes – in the core vocabulary, in a hosted or an external extension.
I am often asked by people with ideas for extending or enhancing Schema.org how they go about it. These requests inevitably fall into two categories – either ‘How do I decide upon and organise my new types & properties and relate them to other vocabularies and ontologies’ or ‘now I have my proposals, how do I test, share, and submit them to the Schema.org community?’
I touch on both of these areas in a free webinar I recorded for DCMI/ASIS&T a couple of months ago. It is the second in a two-part series, Schema.org in Two Parts: From Use to Extension. The first part covers the history of Schema.org and the development of extensions. That part is based upon my experiences applying and encouraging the use of Schema.org with bibliographic resources, including the set up and work of the Schema Bib Extend W3C Community Group – bibliographically focused but of interest to anyone looking to extend Schema.org.
To add to those webinars, the focus of this post is on answering the ‘now I have my proposals, how do I test, share, and submit them to the Schema.org community?’ question. In later posts I will move on to how the vocabulary, its examples, and extensions are defined, and how to decide where and how to extend.
What skills do you need
Not many. If you want to add to the vocabulary and/or examples you will naturally need some basic understanding of the vocabulary and the way you navigate around the Schema.org site, viewing examples etc. Beyond that you need to be able to run a few command line instructions on your computer and interact with GitHub. If you are creating examples, you will need to understand how Microdata, RDFa, and JSON-LD mark up are added to html.
I am presuming that you want to do more than tweak a typo, which could be done directly in the GitHub interface, so in this post I step through the practice of working locally, sharing with others, and proposing your efforts via a Github Pull Request.
How do I start
You need to set up the environment on your PC. This needs a local installation of Git, so that you can interact with the Schema.org source, and a local copy of the Google App Engine SDK to run your local copy of the Schema.org site. The following couple of links should help you get these going.
This is a two-step process. Firstly you need your own parallel fork of the Schema.org repository. If you have not yet, create a user account at Github.com. They are free, unless you want to keep your work private.
Create yourself a working area on your PC and via a command line/terminal window place yourself in that directory to run the following git command, with MyAccount being replaced with your Github account name:
This will download and unwrap a copy of the code into a schemaorg subdirectory of your working directory.
Running a Local Version
In the directory where you downloaded the code, run the following command:
This should result in the output at the command line that looks something like this:
The important line is the one telling you module “default” running at: http://localhost:8080. If you drop that web address into your favourite browser you should end up looking at a familiar screen.
Success! You should now be looking at a version that operates exactly like the live version, but is totally contained on your local PC. Note the message on the home page reminding you which version you are viewing.
Running a Shared Public Version
It is common practice to want to share proposed changes with others before applying them to the Schema.org repository in Github. Fortunately there is an easy, free way of running a Google App Engine in the cloud. To do this you will need a Google account, which most of us have. When logged in to your Google account visit this page: https://console.cloud.google.com
From the ‘Select a project’ menu choose Create a project. Give your project a name – choose a name that is globally unique. There is a convention that we use names that start with ‘sdo-’ as an indication that it is a project running a Schema.org instance.
To ready your local code to be uploaded into the public instance you need to make a minor change in a file named app.yaml in the schemaorg directory. Use your favourite text editor to change the line near the top of the file that begins application to have a value that is the same as the project name you have just created. Note that lines beginning with a ‘#’ character are commented out and have no effect on operation. For this post I have created an App Engine project named sdo-blogpost.
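A text editor is all you need for this change, but if you switch between projects often, the same edit can be sketched in Python. This is illustrative only: the function name set_application is hypothetical, it assumes the active line starts with application: at the beginning of the line, and it leaves commented-out lines untouched.

```python
def set_application(app_yaml_text, project_name):
    """Return app_yaml_text with the first active 'application:' line replaced.

    Hypothetical helper; lines starting with '#' are comments and are skipped
    simply because they do not start with 'application:'.
    """
    lines = app_yaml_text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("application:"):
            lines[index] = "application: " + project_name
            break
    else:
        # The loop finished without finding an active line to change.
        raise ValueError("no active 'application:' line found")
    return "\n".join(lines) + "\n"
```

You would read app.yaml into a string, pass it through set_application with your own project name (sdo-blogpost in my case), and write the result back.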
To upload the code run the following command:
appcfg.py update schemaorg/
You should get output that indicates the upload process has happened successfully. Depending on your login state, you may find a browser window appearing to ask you to log in to Google. Make sure at this point you log in as the user that created the project.
To view your new shared instance go to the following address http://sdo-blogpost.appspot.com – modified to take account of your project name http://<project name>.appspot.com.
Working on the Files
I will go into the internal syntax of the controlling files in a later post. However, if you would like a preview, take a look in the data directory, where you will find a large file named schema.rdfa. This contains the specification for the core of the Schema.org vocabulary – for simple tweaks and changes you may find things self-explanatory. Also in that directory you will find several files that end in ‘-examples.txt’. As you might guess, these contain the examples that appear in the Schema.org pages.
Evolving and Sharing
How much you use your personal Github schemaorg repository fork to collaborate with like-minded colleagues, or just use it as a scratch working area for yourself, is up to you. However you choose to organise yourself, you will find the following git commands, which should be run when located in the schemaorg subdirectory, useful:
git status – shows how far your local copy is in step with your repository
git add <filename> – adds file to the ones being tracked against your repository
git commit <filename> – commits (records) a locally changed or added file in your local repository
git commit -a – commits (records) all changed or added files in your local repository
It is recommended to commit as you go, and to run git push when you want those commits uploaded to your Github repository.
The mechanism for requesting a change of any type to Schema.org is to raise a Github Pull Request. Each new release of Schema.org is assembled by the organising team reviewing and hopefully accepting each Pull Request. You can see the current list of requests awaiting acceptance in Github. To stop the comments associated with individual requests getting out of hand, and to make it easier to track progress, the preferred way of working is to raise a Pull Request as a final step in completing work on an Issue.
Raising an Issue first enables discussion to take place around proposals as they take shape. It is not uncommon for a final request to differ greatly from an original idea after interaction in the comment stream.
So I suggest that you raise an Issue in the Schema.org repository for what you are attempting to solve. Try to give it a good explanatory Title, and explain what you intend in the comment. This is where the code in your repository and the appspot.com working version can be very helpful in explaining and exploring the issue.
When ready to request, take yourself to your repository’s home page to create a New Pull request. Providing you do not create a new branch in the code, any new commits you make to your repository will become part of that Pull Request. A very handy feature in the real world where inevitably you want to make minor changes just after you say that you are done!
Look out for the next post in this series – Working Within the Vocabulary – in which I’ll cover working in the different file types that make up Schema.org and its extensions.
I find myself in New York for the day on my way back from the excellent Smart Data 2015 Conference in San Jose. It’s a long story about red-eye flights and significant weekend savings which I won’t bore you with, but it did result in some great chill-out time in Central Park to reflect on the week.
In its long auspicious history the SemTech, Semantic Tech & Business, and now Smart Data Conference has always attracted a good cross section of the best and brightest in Semantic Web, Linked Data, Web, and associated worlds. This year was no different for me in my new role as an independent working with OCLC and at Google.
I was there on behalf of OCLC to review significant developments with Schema.org in general – now with 640 Types (Classes) & 988 properties – used on over 10 Million web sites. Plus the pioneering efforts OCLC are engaged with, publishing Schema.org data in volume from WorldCat.org and via APIs in their products. Check out my slides:
By mining the 300+ million records in WorldCat to identify, describe, and publish approx. 200 million Work entity descriptions, and [soon to be shared] 90+ million Person entity descriptions, this pioneering continues.
These are not only significant steps forward for the bibliographic sector, but a great example of a pattern to be followed by most sectors:
Identify the entities in your data
Describe them well using Schema.org
Publish embedded in html
Work with, do not try to replace, the domain specific vocabularies – Bibframe in the library world
Work with the community to extend and enhance Schema.org to enable better representation of your resources
If Schema.org is still not broad enough for you, build an extension to it that solves your problems whilst still maintaining the significant benefits of sharing using Schema.org – in the library world’s case this was BiblioGraph.net
Extending Schema.org
Through OCLC and now Google I have been working with and around Schema.org since 2012. The presentation at Smart Data arrived at an opportune time to introduce and share some major developments with the vocabulary and the communities that surround it.
On a personal note the launch of these extensions, bib.schema.org in particular, is the culmination of a bit of a journey that started a couple of years ago with the forming of the Schema Bib Extend W3C Community Group (SchemaBibEx), which had great success in proposing additions and changes to the core vocabulary.
A journey that then took in the formation of the BiblioGraph.net extension vocabulary, which demonstrated both how to build a domain focused vocabulary on top of Schema.org and how the open source software that powers the Schema.org site could be forked for such an effort. These two laid the groundwork for defining how hosted and external extensions will operate, and for SchemaBibEx to be one of the first groups to propose a hosted extension.
Finally this last month working at Google with Dan Brickley on Schema.org has been a bit of a blur as I brushed up my Python skills to turn the potential in version 2.0 into the reality of fully integrated and operational extensions in version 2.1. And to get it all done in time to talk about at Smart Data was the icing on the cake.
Of course things are not stopping there. On the not too distant horizon are:
The final acceptance of bib.schema.org & auto.schema.org – currently they are in final review.
SchemaBibEx can now follow up this initial version of bib.schema.org with items from its backlog.
New extension proposals are already in the works such as: health.schema.org, archives.schema.org, fibo.schema.org.
More work on the software to improve the navigation and helpfulness of the site for those looking to understand and adopt Schema.org and/or the extensions.
The checking of the capability for the software to host external extensions without too much effort.
And of course the continuing list of proposals and fixes for the core vocabulary and the site itself.
I believe we are on the cusp of a significant step forward for Schema.org as it becomes ubiquitous across the web; more organisations, encouraged by extensions, prepare to publish their data; and the SEO community recognise proof of it actually working – but more of that in the next post.
The Culture Grid closed to ‘new accessions’ (ie. new collections of metadata) on the 30th April
The existing index and API will continue to operate in order to ensure legacy support
Museums, galleries, libraries and archives wishing to contribute material to Europeana can still do so via the ‘dark aggregator’, which the Collections Trust will continue to fund
Interested parties are invited to investigate using the Europeana Connection Kit to automate the batch-submission of records into Europeana
The reasons he gave for the ending of this aggregation service are enlightening for all engaged with or thinking about data aggregation in the library, museum, and archives sectors.
Throughout its history, the Culture Grid has been tough going. Looking back over the past 7 years, I think there are 3 primary and connected reasons for this:
The value proposition for aggregation doesn’t stack up in terms that appeal to museums, libraries and archives. The investment of time and effort required to participate in platforms like the Culture Grid isn’t matched by an equal return on that investment in terms of profile, audience, visits or political benefit. Why would you spend 4 days tidying up your collections information so that you can give it to someone else to put on their website? Where’s the kudos, increased visitor numbers or financial return?
Museum data (and to a lesser extent library and archive data) is non-standard, largely unstructured and dependent on complex relations. In the 7 years of running the Culture Grid, we have yet to find a single museum whose data conforms to its own published standard, with the result that every single data source has required a minimum of 3-5 days and frequently much longer to prepare for aggregation. This has been particularly salutary in that it comes after 17 years of the SPECTRUM standard providing, in theory at least, a rich common data standard for museums;
Metadata is incidental. After many years of pump-priming applications which seek to make use of museum metadata it is increasingly clear that metadata is the salt and pepper on the table, not the main meal. It serves a variety of use cases, but none of them is ‘proper’ as a cultural experience in its own right. The most ‘real’ value proposition for metadata is in powering additional services like related search & context-rich browsing.
The first of these two issues represent a fundamental challenge for anyone aiming to promote aggregation. Countering them requires a huge upfront investment in user support and promotion, quality control, training and standards development.
The 3rd is the killer though – countering these investment challenges would be possible if doing so were to lead directly to rich end-user experiences. But they don’t. Instead, you have to spend a huge amount of time, effort and money to deliver something which the vast majority of users essentially regard as background texture.
As an old friend of mine would depressingly say – Makes you feel like packing up your tent and going home!
Interestingly, earlier in the post Nick gives us an insight into the purpose of Culture Grid:
.… we created the Culture Grid with the aim of opening up digital collections for discovery and use ….
That basic purpose is still very valid for both physical and digital collections of all types. The what [helping people find, discover, view and use cultural resources] is as valid as it has ever been. It is the how [aggregating metadata and building shared discovery interfaces and landing pages for it] that has been too difficult to justify continuing in Culture Grid’s case.
In my recent presentations to library audiences I have been asking a simple question “Why do we catalogue?” Sometimes immediately, sometimes after some embarrassed shuffling of feet, I inevitably get the answer “So we can find stuff!“. In libraries, archives, and museums helping people find the stuff we have is core to what we do – all the other things we do are a little pointless if people can’t find, or even be aware of, what we have.
If you are hoping your resources will be found they have to be referenced where people are looking. Where are they looking?
It is exceedingly likely they are not looking in your aggregated discovery interface, or your local library, archive or museum interface either. Take a look at this chart detailing the discovery starting point for college students and others. Starting in a search engine is up in the high eighty percents, with things like library web sites and other targeted sources only just making it over the 1% hurdle to get on the chart. We have known about this for some time – the chart comes from an OCLC Report ‘College Students’ Perceptions of Libraries and Information Resources‘ published in 2005. I would love to see a similar report from recent times, it would have to include elements such as Siri, Cortana, and other discovery tools built-in to our mobile devices which of course are powered by the search engines. Makes me wonder how few cultural heritage specific sources would actually make that 1% cut today.
Our potential users are in the search engines in one way or another; however, in the vast majority of cases our [cultural heritage] resources are not there for them to discover.
Culture Grid, I would suggest, is probably not the only organisation, with an ‘aggregate for discovery’ reason for their existence, that may be struggling to stay relevant, or even in existence.
You may well ask about OCLC, with its iconic WorldCat.org discovery interface. It is a bit simplistic to say that its 320 million plus bibliographic records are in WorldCat only for people to search and discover through the worldcat.org user interface. Those records also underpin many of the services, such as cooperative cataloguing, record supply, inter library loan, and general library back office tasks, that OCLC members and partners benefit from. Also for many years WorldCat has been at the heart of syndication partnerships supplying data to prominent organisations, including Google, that help them reference resources within WorldCat.org which in turn, via find in a library capability, lead to clicks onwards to individual libraries. [Declaration: OCLC is the company name on my current salary check] Nevertheless, even though WorldCat has a broad spectrum of objectives, it is not totally immune from the influences that are troubling the likes of Culture Grid. In fact they are among the web trends that have been driving the Linked Data and Schema.org efforts from the WorldCat team, but more of that later.
How do we get our resources visible in the search engines then? By telling the search engines what we [individual organisations] have. We do that by sharing a relevant view of our metadata about our resources, not necessarily all of it, in a form that the search engines can easily consume. Basically this means sharing data embedded in your web pages, marked up using the Schema.org vocabulary. To see how this works, we need look no further than the rest of the web – commerce, news, entertainment etc. There are already millions of organisations, measured by domains, that share structured data in their web pages using the Schema.org vocabulary with the search engines. This data is being used to direct users with more confidence directly to a site, and is contributing to the global web of data.
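For those who have not seen what embedded data looks like in practice, here is a minimal Python sketch that builds a JSON-LD block ready to drop into a page. It shows one possible serialisation only (Microdata and RDFa are equally valid), and the function name jsonld_script and the example values are hypothetical.

```python
import json


def jsonld_script(item_type, **properties):
    """Return an html <script> element carrying a Schema.org JSON-LD description.

    Hypothetical helper for illustration. Property values containing '</'
    would need extra escaping before being placed in a real page.
    """
    data = {"@context": "http://schema.org", "@type": item_type}
    data.update(properties)
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

Something like jsonld_script("Book", name="Example Title") produces a block describing a single Book. A real catalogue page would carry a richer description, drawn from whatever relevant view of your metadata you choose to share.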
There used to be a time when people in the commercial world complained of always ending up being directed to shopping [aggregation] sites instead of directly to where they could buy the TV or washing machine they were looking for. Today you are far more likely to be given some options in the search engine that link you directly to the retailer. I believe this is symptomatic of the disintermediation of the aggregators by individual syndication of metadata from those retailers.
Can these lessons be carried through to the cultural heritage sector – of course they can. This is where there might be a bit of light at the end of the tunnel for those behind the aggregations such as Culture Grid. Not for the continuation as an aggregation/discovery site, but as a facilitator for the individual contributors. This stuff, when you first get into it, is not simple and many organisations do not have the time and resources to understand how to share Schema.org data about their resources with the web. The technology itself is comparatively simple, in web terms, it is the transition and implementation that many may need help with.
Schema.org is not the perfect solution to describing resources; it is not designed to be. It is there to describe them sufficiently to be found on the web. Nevertheless it is also being evolved by community groups to enhance its capabilities. Through my work with the Schema Bib Extend W3C Community Group, enhancements to Schema.org to enable better description of bibliographic resources have been successfully proposed and adopted. This work is continuing towards a bibliographic extension – bib.schema.org. There is obvious potential for other communities to help evolve and extend Schema.org to better represent their particular resources – archives for example. I would be happy to talk with others who want insights into how they may do this for their benefit.
Schema.org is not a replacement for our rich common data standards such as MARC for libraries, and SPECTRUM for museums as Nick describes. Those serve purposes beyond sharing information with the wider world, and should continue to be used for those purposes whilst relevant. However we cannot expect the rest of the world to get its head around our internal vocabularies and formats in order to point people at our resources. It needs to be a compromise. We can continue to use what is relevant in our own sectors whilst sharing Schema.org data so that our resources can be discovered and then explored further.
So to return to the question I posed – Is There Still a Case for Cultural Heritage Data Aggregation? – If the aggregation is purely for the purpose of supporting discovery, I think the answer is a simple no. If it has broader purpose, such as for WorldCat, it is not as clear cut.
I do believe nevertheless that many of the people behind the aggregations are in the ideal place to help facilitate the eventual goal of making cultural heritage resources easily discoverable. That will take some creative thinking, adoption of ‘web’ techniques, technologies and approaches to provide facilitation services, and a review of what their real goals are [which may not include running a search interface]. I believe we are moving into an era where shared authoritative sources of easily consumable data could make our resources more visible than we previously could have hoped.
Are there any black clouds on this hopeful horizon? Yes there is one. In the shape of traditional cultural heritage technology conservatism. The tendency to assume that our vocabulary or ontology is the only way to describe our resources, coupled with a reticence to be seen to engage with the commercial discovery world, could still hold back the potential.
As an individual library, archive, or museum scratching your head about how to get your resources visible in Google and not having the in-house ability to react; try talking within the communities around and behind the aggregation services you already know. They all should be learning and a problem shared is more easily solved. None of this is rocket science, but trying something new is often better as a group.
About a month ago Version 2.0 of the Schema.org vocabulary hit the streets.
This update includes loads of tweaks, additions and fixes that can be found in the release information. The automotive folks have got new vocabulary for describing Cars, including useful properties such as numberOfAirbags, fuelEfficiency, and knownVehicleDamages. The new property mainEntityOfPage (and its inverse, mainEntity) provides the ability to tell the search engine crawlers which thing a web page is really about. With a new type ScreeningEvent to support movie/video screenings, and a gtin12 property for Product, amongst others, there is much useful stuff in there.
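As a rough illustration of the mainEntityOfPage / mainEntity pair, the Python sketch below builds two JSON-LD descriptions of the same relationship, one from each direction. The example.org URLs and the article name are placeholders invented for this sketch; they do not come from the release information.

```python
import json

# Hypothetical identifiers, used only for illustration.
PAGE_URL = "http://example.org/articles/schema-2-0"
ARTICLE_ID = PAGE_URL + "#article"

# From the thing's side: the article states which page it is the main entity of.
article = {
    "@context": "http://schema.org",
    "@type": "Article",
    "@id": ARTICLE_ID,
    "name": "Example article",
    "mainEntityOfPage": {"@type": "WebPage", "@id": PAGE_URL},
}

# From the page's side, using the inverse property.
page = {
    "@context": "http://schema.org",
    "@type": "WebPage",
    "@id": PAGE_URL,
    "mainEntity": {"@type": "Article", "@id": ARTICLE_ID},
}

print(json.dumps(article, indent=2))
print(json.dumps(page, indent=2))
```

Either description on its own should be enough; which one you embed depends on whether your markup is organised around the page or around the thing it describes.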
But does this warrant the version number clicking over from 1.xx to 2.0?
These new types and properties are only the tip of the 2.0 iceberg. There is a heck of a lot of other stuff going on in this release apart from these additions. Some of it is in the vocabulary itself, some of it in the potential, documentation, supporting software, and organisational processes around it.
Sticking with the vocabulary for the moment, there has been a bit of cleanup around property names. As the vocabulary has grown organically since its release in 2011, inconsistencies and conflicts between different proposals have been introduced. So part of the 2.0 effort has included some rationalisation. For instance the Code type is being superseded by SoftwareSourceCode – the term code has many different meanings, many of which have nothing to do with software; surface has been superseded by artworkSurface and area is being superseded by serviceArea, for similar reasons. Check out the release information for full details. If you are using any of the superseded terms there is no need to panic, as the original terms are still valid but with updated descriptions to indicate that they have been superseded. However you are encouraged to move towards the updated terminology as convenient. The question of what is in which version brings me to an enhancement to the supporting documentation. Starting with Version 2.0, a snapshot view of the full vocabulary will be published for each release – here is the one for 2.0: http://schema.org/version/2.0. So if you want to refer to a term at a particular version you now can.
How often is Schema being used? – is a question often asked. A new feature has been introduced to give you some indication. Check out the description of one of the newly introduced properties, mainEntityOfPage, and you will see the following: ‘Usage: Fewer than 10 domains‘. Unsurprisingly for a newly introduced property, there is virtually no usage of it yet. If you look at the description for the type this term is used with, CreativeWork, you will see ‘Usage: Between 250,000 and 500,000 domains‘. Not a direct answer to the question, but a good and useful indication of the popularity of a particular term across the web.
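If you hold your descriptions as JSON-LD, moving to the updated terminology could be as simple as a renaming pass. The Python sketch below is only an illustration: it covers just the three superseded terms named above, the function name is my own invention, and the sample artwork is made up.

```python
# Superseded terms mentioned in this post, mapped to their replacements.
# This is a subset; the release information has the full list.
SUPERSEDED_TYPES = {"Code": "SoftwareSourceCode"}
SUPERSEDED_PROPERTIES = {"surface": "artworkSurface", "area": "serviceArea"}


def modernise(node):
    """Return a copy of a JSON-LD style structure with superseded terms renamed."""
    if isinstance(node, list):
        return [modernise(item) for item in node]
    if not isinstance(node, dict):
        return node
    updated = {}
    for key, value in node.items():
        if key == "@type":
            # @type may be a single name or a list of names.
            if isinstance(value, list):
                value = [SUPERSEDED_TYPES.get(t, t) for t in value]
            else:
                value = SUPERSEDED_TYPES.get(value, value)
            updated[key] = value
        else:
            updated[SUPERSEDED_PROPERTIES.get(key, key)] = modernise(value)
    return updated


# Made-up sample description using a superseded property.
artwork = {"@type": "VisualArtwork", "name": "Example artwork", "surface": "canvas"}
print(modernise(artwork))
```

Because the original terms remain valid, there is no urgency to run anything like this; it simply shows that the change is mechanical.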
This refers to the introduction of the functionality, on the Schema.org site, to host extensions to the core vocabulary. The motivation for this new approach to extending is explained thus:
Schema.org provides a core, basic vocabulary for describing the kind of entities the most common web applications need. There is often a need for more specialized and/or deeper vocabularies, that build upon the core. The extension mechanisms facilitate the creation of such additional vocabularies.
With most extensions, we expect that some small frequently used set of terms will be in core schema.org, with a long tail of more specialized terms in the extension.
As yet there are no extensions published. However, there are some on the way.
As Chair of the Schema Bib Extend W3C Community Group I have been closely involved with a proposal by the group for an initial bibliographic extension (bib.schema.org) to Schema.org. The proposal includes new Types for Chapter, Collection, Agent, Atlas, Newspaper & Thesis, CreativeWork properties to describe the relationship between translations, plus types & properties to describe comics. I am also following the proposal’s progress through the system – a bit of a learning exercise for everyone. Hopefully I can share the news in the not too distant future that bib will be one of the first released extensions.
W3C Community Group for Schema.org
A subtle change in the way the vocabulary, its proposals, extensions and direction can be followed and contributed to has also taken place. The creation of the Schema.org Community Group has now provided an open forum for this.
So is 2.0 a bit of a milestone? Yes, taking all things together, I believe it is. I get the feeling that Schema.org is maturing into the kind of vocabulary supported by a professional community that will add confidence to those using it and recommending that others should.
Schema.org is basically a simple vocabulary for describing stuff on the web. Embed it in your html and the search engines will pick it up as they crawl, and add it to their structured data knowledge graphs. They even give you three formats to choose from – Microdata, RDFa, and JSON-LD – when doing the embedding. I’m assuming, for this post, that the benefits of being part of the Knowledge Graphs that underpin so-called Semantic Search, and hopefully triggering some Rich Snippet enhanced results display as a side benefit, are self evident.
The vocabulary itself is comparatively easy to apply once you get your head around it — find the appropriate Type (Person, CreativeWork, Place, Organization, etc.) for the thing you are describing, check out the properties in the documentation and code up the ones you have values for. Ideally provide a URI (URL in Schema.org) for a property that references another thing, but if you don’t have one a simple string will do.
There are a few strangenesses that hit you when you first delve into using the vocabulary. For example, there is no problem in describing something that is of multiple types – a LocalBusiness is both an Organization and a Place. This post is about another unusual, but very useful, aspect of the vocabulary – the Role type.
At first look at the documentation, Role looks like a very simple type with a handful of properties. On closer inspection, however, it doesn’t seem to fit in with the rest of the vocabulary. That is because it is capable of fitting almost anywhere. Anywhere there is a relationship between one type and another, that is. It is a special case type that allows a relationship, say between a Person and an Organization, to be given extra attributes. Some might term this as a form of annotation.
So what need is this satisfying, you may ask? It must be a significant need to cause the creation of a special case in the vocabulary. Let me walk through a case that is used in a Schema.org blog post to explain a need scenario and how Role satisfies that need.
Starting With American Football
Say you are describing members of an American Football Team. Firstly you would describe the team using the SportsOrganization type, giving it a name, sport, etc. Using RDFa:
So we now have Chucker Roberts described as an athlete on the Touchline Gods team. The obvious question then is how do we describe the position he plays in the team. We could have extended the SportsOrganization type with a property for every position, but scaling that across every position for every team sport type would have soon ended up with far more properties than would have been sensible, and beyond the maintenance scope of a generic vocabulary such as Schema.org.
This is where Role comes in handy. Regardless of the range defined for any property in Schema.org, it is acceptable to provide a Role as a value. The convention then is to use a property with the same name as the one the Role is a value for, to remake the connection to the referenced thing (in this case the Person). In simple terms we have just inserted a Role type between the original two descriptions.
This indirection has not added much you might initially think, but Role has some properties of its own (startDate, endDate, roleName) that can help us qualify the relationship between the SportsOrganization and the athlete (Person). For the field of organizations there is a subtype of Role (OrganizationRole) which allows the relationship to be qualified slightly more.
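To make the shape of this concrete, here is a minimal Python sketch that builds the equivalent JSON-LD, inserting a Role between the team and the player. The team and player names come from the walk-through above, and the athlete property name is inferred from its wording. The helper function, the roleName, and the dates are placeholders of my own for illustration only.

```python
import json


def with_role(subject, prop, target, role_type="Role", **qualifiers):
    """Insert a Role between subject and target on property `prop`.

    The Role becomes the value of `prop` on the subject, and the same
    property name is repeated on the Role to point at the original target.
    """
    role = {"@type": role_type, prop: target}
    role.update(qualifiers)
    linked = dict(subject)  # shallow copy, so the original is left untouched
    linked[prop] = role
    return linked


team = {
    "@context": "http://schema.org",
    "@type": "SportsOrganization",
    "name": "Touchline Gods",
    "sport": "American Football",
}
player = {"@type": "Person", "name": "Chucker Roberts"}

# roleName and the dates below are placeholder values, not real data.
described = with_role(
    team,
    "athlete",
    player,
    role_type="OrganizationRole",
    roleName="Quarterback",
    startDate="2010",
    endDate="2014",
)
print(json.dumps(described, indent=2))
```

Note how athlete appears twice: once on the team pointing at the Role, and once on the Role pointing at the Person.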
So far I have just been stepping through the example provided in the Schema.org blog post on this. Let’s take a look at an example from another domain – the one I spend my life immersed in – libraries.
There are many relationships between creative works that libraries curate and describe (books, articles, theses, manuscripts, etc.) and people & organisations that are not covered adequately by the properties available (author, illustrator, contributor, publisher, character, etc.) in CreativeWork and its subtypes. By using Role, in the same way as in the sports example above, we have the flexibility to describe what is needed.
Take a book (How to be Orange: an alternative Dutch assimilation course) authored by Gregory Scott Shapiro, that has a preface written by Floor de Goede. As there is no writerOfPreface property we can use, the best we could do is to put Floor de Goede in as a contributor. However, by using Role we can qualify the contribution role that he played to be that of the writer of preface.
<span property="roleName" src="http://id.loc.gov/vocabulary/relators/wpr">Writer of preface</span>
<span property="contributor" src="http://viaf.org/viaf/283191359">Floor de Goede</span>
You will note in this example I have made use of URLs to external resources – VIAF for defining the Persons and the Library of Congress relator codes – instead of defining them myself as strings. I have also linked the book to its Work definition so that someone exploring the data can discover other editions of the same work.
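For those who prefer JSON-LD, the same book description might look something like the Python sketch below. It is a sketch under assumptions: the author is given by name only because the VIAF URI is not reproduced here, and the link to the Work is left out for the same reason. The relator and contributor URLs are the ones from the markup above.

```python
import json

book = {
    "@context": "http://schema.org",
    "@type": "Book",
    "name": "How to be Orange: an alternative Dutch assimilation course",
    # The author uses the ordinary property, with no Role needed.
    "author": {"@type": "Person", "name": "Gregory Scott Shapiro"},
    # A Role sits between the book and the contributor to qualify
    # what kind of contribution was made.
    "contributor": {
        "@type": "Role",
        # Library of Congress relator code for "Writer of preface".
        "roleName": "http://id.loc.gov/vocabulary/relators/wpr",
        "contributor": {
            "@type": "Person",
            "@id": "http://viaf.org/viaf/283191359",
            "name": "Floor de Goede",
        },
    },
}

print(json.dumps(book, indent=2))
```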
Do I always use Role? In the above example I relate a book to two people, the author and the writer of preface. I could have linked to the author via another role with the roleName being ‘Author’ or <http://id.loc.gov/vocabulary/relators/aut>. Although possible, it is not a recommended approach. Wherever possible use the properties defined for a type. This is what data consumers such as search engines are going to be initially looking for.
One last example
To demonstrate the flexibility of using the Role type here is the markup that shows a small diversion in my early career:
@prefix schema:<http://schema.org/> .
This demonstrates the ability of Role to provide added information about most relationships between entities, in this case the employee relationship. Often Role itself is sufficient, but the vocabulary can also be extended with subtypes of Role that add further use-case specific properties.
Whenever possible use URLs for roleName
In the above example, it is exceedingly unlikely that there is a citeable definition on the web that I could link to for the roleName, so it is perfectly acceptable to just use the string “Keyboards Roadie”. However, to help the search engines understand unambiguously what role you are describing, it is always better to use a URL. If you can’t find one, for example in the Library of Congress Relator Codes or in Wikidata, consider creating one yourself in Wikipedia or Wikidata for others to share. Another spin-off benefit of using URIs (URLs) is that they are language independent: regardless of the language of the labels in the data, the URI always means the same thing. Sources like Wikidata often have names and descriptions for things defined in multiple languages, which can be useful in itself.
Final advice
This very flexible mechanism has many potential uses when describing your resources in Schema.org. There is always a danger in overusing useful techniques such as this. Before using it, be sure that there is not already a way to describe the relationship within Schema, or one worth proposing to those that look after the vocabulary.
Good luck in your role in describing your resources and the relationships between them using Schema.org.
Google announced yesterday that it is the end of the line for Freebase, and they have “decided to help transfer the data in Freebase to Wikidata, and in mid-2015 we’ll wind down the Freebase service as a standalone project”.
As well as retiring access for data creation and reading, they are also retiring API access – not good news for those who have built services on top of them. The timetable they shared for the move is as follows:
Before the end of March 2015
– We’ll launch a Wikidata import review tool
– We’ll announce a transition plan for the Freebase Search API & Suggest Widget to a Knowledge Graph-based solution
March 31, 2015
– Freebase as a service will become read-only
– The website will no longer accept edits
– We’ll retire the MQL write API
June 30, 2015
– We’ll retire the Freebase website and APIs
– The last Freebase data dump will remain available, but developers should check out the Wikidata dump
The crystal ball gazers could probably have predicted a move such as this when Google employed Denny Vrandečić, the then lead of Wikidata, a couple of years back. However they could have predicted a load of other outcomes too. 😉
In the long term this should be good news for Wikidata, but in the short term they may have a severe case of indigestion as they attempt to consume data that will, in some estimations, treble the size of Wikidata adding about 40 million Freebase facts into its current 12 million. It won’t be a simple copy job.
Loading Freebase into Wikidata as-is wouldn’t meet the Wikidata community’s guidelines for citation and sourcing of facts — while a significant portion of the facts in Freebase came from Wikipedia itself, those facts were attributed to Wikipedia and not the actual original non-Wikipedia sources. So we’ll be launching a tool for Wikidata community members to match Freebase assertions to potential citations from either Google Search or our Knowledge Vault, so these individual facts can then be properly loaded to Wikidata.
There are obvious murmurings on the community groups about things such as how strict the differing policies for confirming facts are, and how useful the APIs are. There are bound to be some hiccups on this path – more of an arranged marriage than one of love at first sight between the parties.
I have spent many a presentation telling the world how Google have based their Knowledge Graph on the data from Freebase, which they got when acquiring Metaweb in 2010.
So what does this mean for the Knowledge Graph? I believe it is a symptom of the Knowledge Graph coming of age as a core feature of the Google infrastructure. They have used Freebase to seed the Knowledge Graph, but now that seed has grown into a young tree fed by the twin sources of Google search logs, and the rich nutrients delivered by Schema.org structured data embedded in millions of pages on the web. Following the analogy, the seed of Freebase, as a standalone project/brand, just doesn’t fit anymore with the core tree of knowledge that Google is creating and building. No coincidence that they’ll “announce a transition plan for the Freebase Search API & Suggest Widget to a Knowledge Graph-based solution”.
As for Wikidata, if the marriage of data is successful, it will establish it as the source for open structured data on the web and for facts within Wikipedia.
As the live source for information that will often be broader than the Wikipedia it sprang from, I suspect Wikidata’s rise will spur the eventual demise of that other source of structured data from Wikipedia – DBpedia. How in the long term will it be able to compete, as a transformation of occasional dumps of Wikipedia, with a live evolving broader source? Such a demise would be a slow process however – DBpedia has been the de facto link source for such a long time, its URIs are everywhere!
However you see the eventual outcomes for Freebase, Wikidata, and DBpedia, this is big news for structured data on the web.
It is one thing to have a vision, and regular readers of this blog will know I have them all the time; it’s yet another to see it starting to form through the mist into a reality. Several times in the recent past I have spoken of some of the building blocks for bibliographic data to play a prominent part in the Web of Data. The Web of Data that is starting to take shape and drive benefits for everyone. Benefits that for many are hiding in plain sight on the results pages of search engines. In those informational panels with links to people’s parents, universities, and movies, or maps showing the location of mountains, and retail outlets; incongruously named Knowledge Graphs.
OK, you may say, we’ve heard all that before, so what is new now?
As always it is a couple of seemingly unconnected events that throw things into focus.
Event 1: An article by David Weinberger in the DigitalShift section of Library Journal entitled Let The Future Go. An excellent article telling libraries that they should not be so parochially focused in their own domain whilst looking to how they are going to serve their users’ needs in the future. Get our data out there, everywhere, so it can find its way to those users, wherever they are. Making it accessible to all. David references three main ways to provide this access:
APIs – to allow systems to directly access our library system data and functionality
Linked Data – can help us open up the future of libraries. By making clouds of linked data available, people can pull together data from across domains
The Library Graph – an ambitious project libraries could choose to undertake as a group that would jump-start the web presence of what libraries know: a library graph. A graph, such as Facebook’s Social Graph and Google’s Knowledge Graph, associates entities (“nodes”) with other entities
(I am fortunate to be a part of an organisation, OCLC, making significant progress on making all three of these a reality – the first one is already baked into the core of OCLC products and services)
It is the 3rd of those, however, that triggered recognition for me. Personally, I believe that we should not be focusing on a specific ‘Library Graph’ but more on the ‘Library Corner of a Giant Global Graph’ – if graphs can have corners that is. Libraries have rich specialised resources and have specific needs and processes that may need special attention to enable opening up of our data. However, when opened up in context of a graph, it should be part of the same graph that we all navigate in search of information whoever and wherever we are.
ZBW contributes to WorldCat, and has 1.2 million oclc numbers attached to it’s bibliographic records. So it seemed interesting, how many of these editions link to works and furthermore to other editions of the very same work.
The post is interesting from a couple of points of view. Firstly the simple steps they took to get at the data, really well demonstrated by the command-line calls used to access the data – get OCLCNum data from WorldCat.org in JSON format – extract the schema:exampleOfWork link to the Work – get the Work data from WorldCat, also in JSON – parse out the links to other editions of the work and compare with their own data. Command-line calls that were no doubt embedded in simple scripts.
Secondly, there was the implicit way that the corpus of WorldCat Work entity descriptions, and their canonical identifying URIs, is used as an authoritative hub for Works and their editions. A concept that is not new in the library world; we have been doing this sort of thing with names and person identities via other authoritative hubs, such as VIAF, for ages. What is new here is that it is a hub for Works and their relationships, and the bidirectional nature of those relationships – work to edition, edition to work – in the beginnings of a library graph linked to other hubs for subjects, people, etc.
The ZBW Labs experiment is interesting in its own way – simple approach, enlightening results. What is more interesting for me is that it demonstrates a baby step towards the way the Library corner of that Global Web of Data will not only naturally form (as we expose and share data in this way – linked entity descriptions), but naturally fit in to future library workflows with all sorts of consequential benefits.
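The parsing half of those steps can be sketched in a few lines of Python. This is not the ZBW Labs code. It assumes the fetched JSON has already been loaded into a dict, that descriptions sit in an @graph list, and that the links appear under un-prefixed exampleOfWork and workExample keys. The function names and the example.org sample data are hypothetical, and the fetching itself is left out.

```python
def as_list(value):
    """Normalise a JSON-LD value that may be absent, single, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def ref_id(value):
    """Return the URI from either a bare string or an {"@id": ...} object."""
    return value["@id"] if isinstance(value, dict) else value


def find_node(doc, node_id):
    """Find the description with the given @id in a document's @graph."""
    for node in as_list(doc.get("@graph")):
        if node.get("@id") == node_id:
            return node
    return None


def work_uri(edition_doc, edition_id):
    """From an edition description, extract the link to its Work."""
    node = find_node(edition_doc, edition_id) or {}
    links = as_list(node.get("exampleOfWork"))
    return ref_id(links[0]) if links else None


def other_editions(work_doc, work_id, known_edition_id):
    """From a Work description, list edition URIs other than the known one."""
    node = find_node(work_doc, work_id) or {}
    editions = [ref_id(v) for v in as_list(node.get("workExample"))]
    return [uri for uri in editions if uri != known_edition_id]


# Hypothetical sample data standing in for the two fetched documents.
edition_id = "http://example.org/oclc/1"
work_id = "http://example.org/work/9"
edition_doc = {"@graph": [{"@id": edition_id, "exampleOfWork": {"@id": work_id}}]}
work_doc = {
    "@graph": [
        {
            "@id": work_id,
            "workExample": [
                edition_id,
                "http://example.org/oclc/2",
                {"@id": "http://example.org/oclc/3"},
            ],
        }
    ]
}

found_work = work_uri(edition_doc, edition_id)
print(found_work)
print(other_editions(work_doc, found_work, edition_id))
```

The resulting list of edition URIs is what you would then compare against the identifiers in your own records.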
The experiment is exactly the type of initiative that we hoped to stimulate by releasing the Works data. Using it for things we never envisaged, delivering unexpected value to our community. I can’t wait to hear about other initiatives like this that we can all learn from.
So who is going to be doing this kind of thing – describing entities and sharing them to establish these hubs (nodes) that will form the graph? Some are already there, in the traditional authority file hubs: The Library of Congress LC Linked Data Service for authorities and vocabularies (id.loc.gov), VIAF, ISNI, FAST, Getty vocabularies, etc.
As previously mentioned Work is only the first of several entity descriptions that are being developed in OCLC for exposure and sharing. When others, such as Person, Place, etc., emerge we will have a foundation of part of a library graph – a graph that can and will be used, and added to, across the library domain and then on into the rest of the Global Web of Data. An important authoritative corner, of a corner, of the Giant Global Graph.
As I said at the start, these are baby steps towards a vision that is forming out of the mist. I hope you and others can see it too.