Part of my work with Google in support of the Schema.org structured web data vocabulary – its extensions, usage, and implementation – is to introduce new functionality and facilities onto the Schema.org site.
I have recently concluded a piece of work to improve access to the underlying definitions of vocabulary terms in various data formats, which is now available for testing and comment. It is available in a prerelease of the next Schema.org version hosted at webschemas.org – the Schema.org test site. It is the kind of functionality that could benefit from as many eyes as possible to check and comment.
Introduced in the current release (3.1) was the ability to download vocabulary definition files for the core vocabulary and its extensions in triples, quads, json-ld and turtle formats. Building upon that, the new version, provisionally numbered 3.2, introduces download files in RDF/XML, plus CSV format for vocabulary Types, Properties and Enumeration values.
There have been requests to make other output formats available for individual terms in addition to the embedded RDFa already available. In response to these requests, we have introduced per term outputs in csv, triples, json-ld, rdf/xml and turtle format.
These can be accessed in two ways. Firstly by adding the appropriate suffix (.csv, .nt, .jsonld, .rdf, .ttl) to the term URI:
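For example, using the term Book on the test site (the same pattern applies on schema.org itself once the version is released):

```text
http://webschemas.org/Book.csv       CSV
http://webschemas.org/Book.nt        Triples
http://webschemas.org/Book.jsonld    JSON-LD
http://webschemas.org/Book.rdf       RDF/XML
http://webschemas.org/Book.ttl       Turtle
```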
They are also available using content negotiation, by supplying the appropriate value for Accept in the HTTP request header (text/csv, text/plain, application/ld+json, application/rdf+xml, application/x-turtle).
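As a sketch of the content-negotiation route, here is how you might request the JSON-LD form of a term definition using Python's standard library. The request is built but, to keep the example self-contained, not sent; uncomment the last line to fetch over the network.

```python
import urllib.request

# Build a request for the Book term, asking for JSON-LD via the Accept header.
req = urllib.request.Request(
    "http://webschemas.org/Book",
    headers={"Accept": "application/ld+json"},
)
print(req.get_header("Accept"))  # confirm the header that would be sent
# data = urllib.request.urlopen(req).read()  # requires network access
```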
So if you are interested in such things, take a look and let us know what you think about the contents, formats, and accuracy of these outputs.
I spend a significant amount of time working with Google folks, especially Dan Brickley, and others on the supporting software, vocabulary contents, and application of Schema.org. So it is with great pleasure, and a certain amount of relief, that I share the announcement of the release of 3.1.
That announcement lists several improvements, enhancements and additions to the vocabulary that appeared in versions 3.0 & 3.1. These include:
Health Terms – A significant reorganisation of the extensive collection of medical/health terms, that were introduced back in 2012, into the ‘health-lifesci’ extension, which now contains 99 Types, 179 Properties and 149 Enumeration values.
Hotels and Accommodation – Substantial new vocabulary for describing hotels and accommodation has been added, and documented.
Pending Extension – Version 3.0 introduced a special extension called “pending“, which provides a place for newly proposed schema.org terms to be documented, tested and revised. The anticipation is that this area will be updated with proposals relatively frequently, in between formal Schema.org releases.
How We Work – A HowWeWork document has been added to the site. This comprehensive document details the many aspects of the operation of the community, the site, the vocabulary etc. – a useful way in for casual users through to those who want to immerse themselves in the vocabulary, its use, and development.
For fuller details on what is in 3.1 and other releases, check out the Releases document.
Often working in the depths of the vocabulary, and the site that supports it, I get up close to improvements that are not obvious on the surface. I would like to share a few that those who immerse themselves in such things may find interesting:
Snappy Performance – The Schema.org site, a Python app hosted on the Google App Engine, is, shall we say, a very popular site. Over the last 3-4 releases I have been working on taking full advantage of multi-threaded, multi-instance, memcache, and shared datastore capabilities. Add in page caching improvements plus an implementation of Etags, and we can see improved site performance which can best be described as snappiness. The only downsides being, to see a new version update you sometimes have to hard reload your browser page, and I have learnt far more about these technologies than I ever thought I would need to!
Data Downloads – We are often asked for a copy of the latest version of the vocabulary so that people can examine it, develop from it, build tools on it, or whatever takes their fancy. This has been partially possible in the past, but now we have introduced (on a developers page we hope to expand with other useful stuff in the future – suggestions welcome) a download area for vocabulary definition files. From here you can download, in your favourite format (Triples, Quads, JSON-LD, Turtle), files containing the core vocabulary, individual extensions, or the whole vocabulary. (Tip: the page displays the link to the file that will always return the latest version.)
Data Model Documentation – Version 3.1 introduced updated contents to the Data Model documentation page, especially in the area of conformance. I know from working with colleagues and clients, that it is sometimes difficult to get your head around Schema.org’s use of Multi-Typed Entities (MTEs) and the ability to use a Text, or a URL, or Role for any property value. It is good to now have somewhere to point people when they question such things.
Markdown – This is a great addition for those enhancing, developing and proposing updates to the vocabulary. The rdfs:comment section of term definitions is now passed through a Markdown processor. This means that any formatting or links to be embedded in a term description do not have to be escaped with horrible coding such as &amp; and &gt; etc. So for example a link can be input as [The Link](http://example.com/mypage) and italic text would be input as *italic*. The processor also supports WikiLinks style links, which enables direct linking to a page within the site: [[CreativeWork]] will result in the user being taken directly to the CreativeWork page via a correctly formatted link. This makes the correct formatting of type descriptions a much nicer experience, as well as my debugging of the definition files.
I could go on, but won’t – If you are new to Schema.org, or very familiar, I suggest you take a look.
Marketing Hype! I hear you thinking – well at least I didn’t use the tired old ‘Next Generation’ label.
Let me explain what I see as a fundamental component of what is potentially a New Web, and what I mean by New Web.
This fundamental component, you might be surprised to learn, is a vocabulary – Schema.org. But let me first set the context by explaining my thoughts on this New Web.
Having once been considered an expert on Web 2.0 (I hasten to add by others, not myself) I know how dangerous it can be to attach labels to things. It tends to spawn screenfuls of passionate opinions on the relevance of the name, the date of the revolution, and over-detailed analysis of isolated parts of what is a general movement. I know I am on dangerous ground here!
To my mind something is new when it feels different. The Internet felt different when the Web (aka HTTP + HTML + browsers) arrived. The Web felt different (Web 2.0?) when it became more immersive (write as well as read) and visually we stopped trying to emulate in a graphical style what we saw on character terminals. Oh, and yes we started to round our corners.
There have been many times over the last few years when it felt new – when it suddenly arrived in our pockets (the mobile web) – when the inner thoughts, and eating habits, of more friends than you ever remember meeting became of apparent headline importance (the social web) – when [the contents of] the web broke out of the boundaries of the browser and appeared embedded in every app, TV show, and voice activated device.
The feeling different phase I think we are going through at the moment, like previous times, is building on what went before. It is exemplified by information [data] breaking out of the boundaries of our web sites and appearing where it is useful for the user.
We are seeing the tip of this iceberg in the search engine Knowledge Panels, answer boxes, and rich snippets. The effect of this is that often your potential user can get what they need without having to find and visit your site – answering questions such as: what is the customer service phone number for an organisation; is the local branch open at the moment; give me driving directions to it; what is available and on offer. Increasingly these interactions can occur without the user even being aware they are using the web – “Siri! Where is my nearest library?“ A great way to build relationships with your customers, but a new and interesting challenge for those trying to measure the impact of your web site.
So, what is fundamental to this New Web?
There are several things – HTTP, the light-weight protocol designed to transfer text, links and latterly data, across an internet previously used to specific protocols for specific purposes – HTML, that open, standard, easily copied light-weight extensible generic format for describing web pages that all browsers can understand – Microdata, RDFa, JSON, JSON-LD – open standards for easily embedding data into HTML – RDF, an open data format for describing things of any sort, in the form of triples, using shared vocabularies. Building upon those is Schema.org – an open, [de facto] standard, generic vocabulary for describing things in most areas of interest.
Why is one vocabulary fundamental when there are so many others to choose from? Check out the 500+ referenced on the Linked Open Vocabularies (LOV) site. Schema.org however differs from most of the others in a few key areas:
Size and scope – its current 642 Types and 992 Properties is significantly larger and covers far more domains of interest than most others. This means that if you are looking to describe something, you are highly likely to find enough to at least start. Despite its size, it is still far from capable of describing everything on, or off, the planet.
Evolution – it is under continuous evolutionary development and extension, driven and guided by an open community under the wing of the W3C and accessible in a GitHub repository.
Flexibility – from the beginning Schema.org was designed to be used in your choice of serialisation – Microdata, RDFa, or JSON-LD – with the flexibility of allowing values to default to text if you do not have a URI available.
Consumers – The major search engines Google, Bing, Yahoo!, and Yandex, not only back the open initiative behind Schema.org but actively search out Schema.org markup to add to their Knowledge Graphs when crawling your sites.
Guidance – If you search out guidance on supplying structured data to those major search engines, you are soon supplied with recommendations and examples for using Schema.org, such as this from Google. They even supply testing tools for you to validate your markup.
With this support and adoption, the Schema.org initiative has become self-fulfilling. If your objective is to share or market structured data about your site, organisation, resources, and/or products with the wider world, it would be difficult to come up with a good reason not to use Schema.org.
Is it a fully ontologically correct semantic web vocabulary? Although you can see many semantic web and linked data principles within it, no it is not. That is not its objective. It is a pragmatic compromise between such things, and the general needs of webmasters with ambitions to have their resources become an authoritative part of the global knowledge graphs, that are emerging as key to the future of the development of search engines and the web they inhabit.
Note that I describe Schema.org as a fundamental component of what I feel is a New Web. It is not the fundamental component, but one of many that over time will become just the way we do things.
Art may be an over-ambitious word for the process I am going to try and describe. However it is not about rules, required patterns, syntaxes, and file formats – the science; it is about general guidelines, emerging styles and practices, and what feels right. So art it is.
OK. You have read the previous posts in this series. You have said to yourself I only wish that I could describe [insert your favourite issue here] in Schema.org. You are now inspired to do something about it, or get together with a community of colleagues to address the usefulness of Schema.org for your area of interest. Then comes the inevitable question…
Where do I focus my efforts – the core vocabulary or a Hosted Extension or an External Extension?
Firstly a bit of background to help answer that question.
The core of the Schema.org vocabulary has evolved since its launch by Google, Bing, and Yahoo! (soon joined by Yandex) in June 2011. By the end of 2015 its term definitions had reached 642 types and 992 properties. They cover many sectors, commercial and not, including sport, media, retail, libraries, local businesses, health, audio, video, TV, movies, reviews, ratings, products, services, offers and actions. Its generic nature has facilitated its spread of adoption across well over 10 million sites. For more background I recommend the December 2015 article Schema.org: Evolution of Structured Data on the Web – Big data makes common schemas even more necessary, by Guha, Brickley and Macbeth.
That generic nature does, however, introduce issues for those in specific sectors wishing to focus in more detail on the entities and relationships specific to their domain whilst still being part of, or closely related to, Schema.org. In the spring of 2015 an Extension Mechanism, consisting of Hosted and External extensions, was introduced to address this.
Reviewed/Hosted Extensions are domain focused extensions hosted on the Schema.org site. They will have been reviewed and discussed by the broad Schema.org community as to style, compatibility with the core vocabulary, and potential adoption. An extension is allocated its own part of the schema.org namespace – auto.schema.org & bib.schema.org being the first two examples.
External Extensions are created and hosted separately from Schema.org, in their own namespace. Although related to and building upon [extending] the Schema.org vocabulary, these extensions are not part of it. I am editor of an early example of such an external extension, BiblioGraph.net, which predates the launch of the extension mechanism. More recently GS1 (The Global Language of Business) have published their External Extension – the GS1 Web Vocabulary at http://gs1.org/voc/.
An example of how gs1.org extends Schema.org can be seen by inspecting the class gs1:WearableProduct, which is a subclass of gs1:Product, which in turn is defined as an exact match to schema:Product. Looking at an example property of gs1:Product, gs1:brand, we can see that it is defined as a subproperty of schema:brand. This demonstrates how Schema.org is foundational to GS1.org.
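In Turtle, that pattern looks something like the following sketch. The prefixes and the choice of owl:equivalentClass to stand in for the vocabulary's "exact match" relation are illustrative; check the published vocabulary at http://gs1.org/voc/ for the exact definitions.

```turtle
@prefix gs1:    <http://gs1.org/voc/> .
@prefix schema: <http://schema.org/> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .

gs1:Product a rdfs:Class ;
    owl:equivalentClass schema:Product .      # "exact match" to the core type

gs1:WearableProduct a rdfs:Class ;
    rdfs:subClassOf gs1:Product .             # more specific than gs1:Product

gs1:brand a rdf:Property ;
    rdfs:subPropertyOf schema:brand .         # refines the core property
```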
Choosing Where to Extend
This initially depends on what and how much you are wanting to extend.
If all you are thinking of is adding the odd property to an existing type, adding another type to the domain and/or range of a property, or improving the description of a type or property, you probably do not need to create an extension. Raise an issue and, after some thought and discussion, go for it – create the relevant code and associated Pull Request for the Schema.org GitHub repository.
More substantial extensions require a bit of thought.
When proposing extensions to the Schema.org vocabulary, the structure described above provides the extender/developer with three options: extend the core; propose a hosted extension; or develop an external extension. Potentially a proposal could result in a combination of all three.
For example a proposal could be for a new Type (class) to be added to the core, with few or no additional properties other than those inherited from its super type. In addition more domain focused properties, or subtypes, for that new type could be proposed as part of a hosted extension, and yet more very domain specific ones only being part of an external extension.
Although not an exact science, there are some basic principles behind such choices. These principles are based upon the broad context and use of Schema.org across the web, the consuming audience for the data that would be marked up; the domain specific knowledge of those that would do the marking up and reviewing the proposal; and the domain specific need for the proposed terms.
Guiding Questions
A decision as to whether a proposed term should be in the core, a hosted extension, or an external extension can be aided by the answers to some basic questions:
Public or not public? Will the data that would be marked up using the term normally be shared on the web? Would you expect to find that information on a publicly accessible web page today? If the answer is not public, there is no point in proposing the term for the core or a hosted extension. It would be defined in an external extension.
General or Specific? Is the level of information to be marked up, or the thing being described, of interest or relevant to non-domain specific consumers? If the answer is general, the term could be a candidate for a core term. For example Train could be considered as a potential new subtype of Vehicle to describe that mode of transport that is relevant for general travel discovery needs. Whereas SteamTrain, and its associated specific properties about driving wheel configuration etc., would be more appropriate to a railway extension.
Popularity? How many sites on the web would potentially be expected to make use of these term(s)? How many webmasters would find them useful? If the answer is lots, you probably have a candidate for the core. If it is only a few hundred, especially if they would all be in a particular focus of interest, it is more likely a candidate for a hosted extension. If it is a small number, it might be more appropriate in an external extension.
Detailed or Technical? Is the information, or the detailed nature of the proposed properties, too technical for general consumption? If yes, the term should be proposed for a hosted or external extension. In the train example above, the fact that a steam train is being referenced could be contained in the text-based description property of a Train type, whereas the type of steam engine configuration could be a defined value for a property in an external extension.
When defining and then proposing enhancements to the core of Schema.org, or for hosted extensions, there is a temptation to take an area of concern, analyse it in detail, and then produce a fully complete proposal. Experience has demonstrated that it is beneficial to gain feedback on the use and adoption of terms before building upon them to extend and add more detailed capability.
Based on that experience, the way to extend Schema.org is in steps that build upon each other in stages – for example, introducing a new subtype with few if any new specific properties. Initial implementers can use textual description properties to qualify its values in this initial form. In later releases more specific properties can be proposed, their need justified by the take-up, visibility, and use of the subtype on sites across the web.
Several screenfuls and a few weeks ago, this started out as a simple post attempting to cover off some of the questions I am often asked about how Schema.org is structured, and how it can be made more appropriate for this project or that domain. Hopefully you find the distillation of my experience, and my personal approach, across these three resulting posts on Evolving Schema.org in Practice enlightening and helpful – especially if you are considering proposing a change, enhancement or extension to Schema.org.
Often that similarity is hidden behind sector-specific views, understanding, and issues in dealing with the wide open web of structured data, where they are just another interested group looking to benefit from shared recognition of common schemas by the major search engine organisations. But that is all part of the joy and challenge I relish when entering a new domain and meeting new interested and motivated people.
Of course enhancing, extending and evolving the Schema.org vocabulary is only one part of the story. Actually applying it for benefit to aid the discovery of your organisation, your resources and the web sites that reference them is the main goal for most.
I get the feeling that there may be another blog post series I should be considering!
Having covered the working environment, in this post I intend to describe some of the important files that make up Schema.org, and how you can work with them to create or update examples and term definitions within your local forked version, in preparation for proposing them in a Pull Request.
The File Structure
If you inspect the repository you will see a simple directory structure. At the top level you will find a few files sporting a .py suffix. These contain the Python application code to run the site you see at http://schema.org. They load the configuration files and build an in-memory version of the vocabulary that is used to build the html pages containing the definitions of the terms, schema listings, examples displays, etc. They are joined by a file named app.yaml, which contains the configuration used by the Google App Engine to run that code.
At this level there are some directories containing supporting files: docs & templates contain static content for some pages; tests & scripts are used in the building and testing of the site; data contains the files that define the vocabulary, its extensions, and the examples used to demonstrate its use.
The Data Files
The data directory itself contains various files and directories. schema.rdfa is the most important file; it contains the core definitions for the majority of the vocabulary. Although, most of the time, you will see schema.rdfa as the only file with a .rdfa suffix in the data directory, the application will look for and load any .rdfa files it finds here. This is a very useful feature when working on a local version – you can keep your enhancements together, only merging them into the main schema.rdfa file when ready to propose them.
Also in the data directory you will find an examples.txt file and several others ending with -examples.txt. These contain the examples used on the term pages; the application loads all of them.
Amongst the directories in data there are a couple of important ones. releases contains snapshots of versions of the vocabulary from version 2.0 onwards. The directory named ext contains the files that define the vocabulary extensions and the examples that relate to them. Currently you will find auto and bib directories within ext, corresponding to the extensions currently supported. The format within these directories follows the basic pattern of the data directory – one or more .rdfa files containing term definitions, and -examples.txt files containing relevant examples.
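In summary, the layout described above looks something like this (simplified; file names other than those already mentioned are omitted):

```text
schemaorg/
├── *.py                 the site application code
├── app.yaml             Google App Engine configuration
├── docs/ templates/     static page content
├── tests/ scripts/      building and testing support
└── data/
    ├── schema.rdfa      core term definitions
    ├── *-examples.txt   examples for term pages
    ├── releases/        vocabulary snapshots (2.0 onwards)
    └── ext/
        ├── auto/        extension .rdfa + -examples.txt files
        └── bib/
```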
Getting to grips with the RDFa
Enough preparation – let’s get stuck into some vocabulary!
Take your favourite text/code editing application and open up schema.rdfa. You will notice two things: it is large [well over 12,500 lines!], and it is in the format of an html file. This second attribute makes it easy for non-technical viewing – you can open it with a browser.
Once you get past a bit of CSS formatting information and a brief introduction text, you arrive [about 35 lines down] at the first couple of definitions – for Thing and CreativeWork.
The Anatomy of a Type Definition
Standard RDFa (RDF in attributes) html formatting is used to define each term. A vocabulary Type is defined as an RDFa marked up <div> element with its attributes contained in marked up <span> elements.
<div typeof="rdfs:Class" resource="http://schema.org/Thing">
  <span class="h" property="rdfs:label">Thing</span>
  <span property="rdfs:comment">The most generic type of item.</span>
</div>
The attributes of the <div> element indicate that this is the definition of a Type (typeof="rdfs:Class") and give its canonical identifier (resource="http://schema.org/Thing"). The <span> elements fill in the details: it has a label (rdfs:label) of ‘Thing‘ and a descriptive comment (rdfs:comment) of ‘The most generic type of item‘. There is one formatting addition to the <span> containing the label: the class="h" is there to make the labels stand out when viewing in a browser – it has no direct relevance to the structure of the vocabulary.
Inspecting the CreativeWork definition reveals a few other attributes of a Type defined in <span> elements. The rdfs:subClassOf property, with the associated href on the <a> element, indicates that http://schema.org/CreativeWork is a sub-type of http://schema.org/Thing.
Finally there is the dc:source property and its associated href value. This has no structural impact on the vocabulary, its purpose is to acknowledge and reference the source of the inspiration for the term. It is this reference that results in the display of a paragraph under the Acknowledgements section of a term page.
Defining Properties
The properties that can be used with a Type are defined in a very similar way to the Types themselves.
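For example, here is a sketch of the definition of the name property (reconstructed; the wording of the comment in schema.rdfa itself may differ):

```html
<div typeof="rdf:Property" resource="http://schema.org/name">
  <span class="h" property="rdfs:label">name</span>
  <span property="rdfs:comment">The name of the item.</span>
  <span>Domain: <a property="http://schema.org/domainIncludes"
                   href="http://schema.org/Thing">Thing</a></span>
  <span>Range: <a property="http://schema.org/rangeIncludes"
                  href="http://schema.org/Text">Text</a></span>
</div>
```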
The attributes of the <div> element indicate that this is the definition of a Property (typeof="rdf:Property") and its canonical identifier (resource="http://schema.org/name"). As with Types, the <span> elements fill in the details.
Properties have two specific <span> elements to define the domain and range of a property. If these are new to you, the concepts are basically simple. The Type(s) defined as being in the domain of a property are those for which the property is a valid attribute. The Type(s) defined as being in the range of a property are those that are expected as values for that property. So inspecting the name example we can see that name is a valid property of the Thing Type, with an expected value type of Text. Also specific to property definitions is rdfs:subPropertyOf, which defines that one property is a sub-property of another. For html/RDFa format reasons this is defined using a link element thus: <link property="rdfs:subPropertyOf" href="http://schema.org/workFeatured" />.
Those used to defining other RDF vocabularies may question the use of http://schema.org/domainIncludes and http://schema.org/rangeIncludes to define these relationships. This is a pragmatic approach to producing a flexible data model for the web. For a more in-depth explanation I refer you to the Schema.org Data Model documentation.
Not an exhaustive tutorial in editing the defining RDFa but hopefully enough to get you going!
One of the most powerful features of the Schema.org documentation is the Examples section on most of the term pages. These provide mark up examples for most of the terms in the vocabulary, that can be used and built upon by those adding Schema.org data to their web pages. These examples represent how the html of a page or page section may be marked up. To set context, the examples are provided in several serialisations – basic html, html plus Microdata, html plus RDFa, and JSON-LD. As the objective is to aid the understanding of how Schema.org may be used, it is usual to provide simple basic html formatting in the examples.
Examples in File
As described earlier, the source for examples is held in files with a -examples.txt suffix, stored in the data directory or in individual extension directories.
One or more examples per file are defined in a very simple format.
An example begins in the file with a line that starts with TYPES:, such as this:
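(A sketch of such a line – the identifier and term names here are illustrative:)

```text
TYPES: #library-1, Library, LocalBusiness
```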
This example has a unique identifier prefixed with a # character; there should be only one of these per example. These identifiers are intended for future feedback mechanisms and as such are not particularly controlled. I recommend you create your own when creating your examples. Next comes a comma separated list of term names. Adding a term to this list will result in the example appearing on the page for that term. This is true for both Types and Properties.
Next come four sections, each preceded by a line containing a single label, in the following order: PRE-MARKUP:, MICRODATA:, RDFA:, JSON:. Each section ends when the next label line, or the end of the file, is reached. The contents of each section of the example are then inserted into the appropriate tabbed area on the term page. The process that does this is not a sophisticated one; there is no error or syntax checking involved – if you want to insert the text of the Gettysburg Address as your RDFa example, it will let you do it.
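Putting that together, the skeleton of a single example in one of these files looks like this (section contents abbreviated, and the identifier and terms illustrative):

```text
TYPES: #library-1, Library, LocalBusiness

PRE-MARKUP:
  ...plain html, before any mark up is added...
MICRODATA:
  ...the same html, marked up with Microdata...
RDFA:
  ...the same html, marked up with RDFa...
JSON:
  ...the equivalent JSON-LD...
```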
I am not going to provide tutorials for html, Microdata, RDFa, or JSON-LD here – there are a few of those about. I will however recommend a tool I use to convert between these formats when creating examples. RDF Translator is a simple online tool that will validate and translate between RDFa, Microdata, RDF/XML, N3, N-Triples, and JSON-LD. A suggestion, to make your examples as informative as possible: when converting between formats, especially when converting to JSON-LD, most conversion tools reorder the statements. It is worth investing some time in ensuring that the mark up order in your examples is consistent across all serialisations.
Hopefully this post will clear away some of the mystery of how Schema.org is structured and managed. If you have proposals in mind to enhance and extend the vocabulary or examples, have a go, see if they make sense in a version on your own system, then suggest them to the community on GitHub.
In my next post I will look more at extensions, Hosted and External, and how you work with those, including some hints on choosing where to propose changes – in the core vocabulary, in a hosted or an external extension.
I am often asked by people with ideas for extending or enhancing Schema.org how they go about it. These requests inevitably fall into two categories – either ‘How do I decide upon and organise my new types & properties and relate them to other vocabularies and ontologies?‘ or ‘Now I have my proposals, how do I test, share, and submit them to the Schema.org community?‘
I touch on both of these areas in a free webinar I recorded for DCMI/ASIS&T a couple of months ago. It is the second in a two-part series, Schema.org in Two Parts: From Use to Extension. The first part covers the history of Schema.org and the development of extensions. That part is based upon my experiences applying and encouraging the use of Schema.org with bibliographic resources, including the set-up and work of the Schema Bib Extend W3C Community Group – bibliographically focused, but of interest to anyone looking to extend Schema.org.
To add to those webinars, the focus of this post is on answering the ‘Now I have my proposals, how do I test, share, and submit them to the Schema.org community?‘ question. In later posts I will move on to how the vocabulary, its examples, and extensions are defined, and how to decide where and how to extend.
What skills do you need
Not many. If you want to add to the vocabulary and/or examples you will naturally need some basic understanding of the vocabulary and the way you navigate around the Schema.org site, viewing examples etc. Beyond that you need to be able to run a few command line instructions on your computer and interact with GitHub. If you are creating examples, you will need to understand how Microdata, RDFa, and JSON-LD mark up are added to html.
I am presuming that you want to do more than tweak a typo, which could be done directly in the GitHub interface, so in this post I step through the practice of working locally, sharing with others, and proposing your efforts via a GitHub Pull Request.
How do I start
You need to set up the environment on your PC. This needs a local installation of Git, so that you can interact with the Schema.org source, and a local copy of the Google App Engine SDK to run your local copy of the Schema.org site. The following couple of links should help you get these going.
This is a two-step process. Firstly you need your own parallel fork of the Schema.org repository. If you do not have one yet, create a user account at GitHub.com. Accounts are free, unless you want to keep your work private.
Create yourself a working area on your PC and, via a command line/terminal window, place yourself in that directory to run the following git command, replacing MyAccount with your GitHub account name:
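The command should look something like this (a sketch: MyAccount is a placeholder, and the exact URL depends on what you named your fork):

```shell
git clone https://github.com/MyAccount/schemaorg.git
```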
This will download and unwrap a copy of the code into a schemaorg subdirectory of your working directory.
Running a Local Version
In the directory where you downloaded the code, run the following command:
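Assuming the App Engine SDK's command-line tools are on your path, the command should be along these lines (run from your working directory, the parent of the schemaorg directory):

```shell
dev_appserver.py schemaorg/
```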
This should result in the output at the command line that looks something like this:
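An illustrative sketch of that output (the exact messages vary with the SDK version):

```
INFO     Starting module "default" running at: http://localhost:8080
INFO     Starting admin server at: http://localhost:8000
```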
The important line is the one telling you that module “default” is running at http://localhost:8080. If you drop that web address into your favourite browser you should end up looking at a familiar screen.
Success! You should now be looking at a version that operates exactly like the live version, but is totally contained on your local PC. Note the message on the home page reminding you which version you are viewing.
Running a Shared Public Version
It is common practice to want to share proposed changes with others before applying them to the Schema.org repository in GitHub. Fortunately there is an easy, free way of running a Google App Engine instance in the cloud. To do this you will need a Google account, which most of us have. When logged in to your Google account, visit this page: https://console.cloud.google.com
From the ‘Select a project’ menu, create a project. Give your project a name – choose a name that is globally unique. There is a convention that we use names starting with ‘sdo-’ as an indication that the project is running a Schema.org instance.
To ready your local code for upload into the public instance you need to make a minor change in a file named app.yaml in the schemaorg directory. Use your favourite text editor to change the line near the top of the file that begins application to have a value that is the same as the project name you have just created. Note that lines beginning with a ‘#’ character are commented out and have no effect on operation. For this post I have created an App Engine project named sdo-blogpost.
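With the sdo-blogpost project name used in this post, the top of app.yaml would start something like this (a sketch of that one line; leave the other settings in your copy of the file as they are):

```yaml
# Lines beginning with '#' are comments and have no effect on operation.
application: sdo-blogpost   # must match your App Engine project name
```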
To upload the code run the following command:
appcfg.py update schemaorg/
You should get output that indicates the upload process has happened successfully. Depending on your login state, you may find a browser window appearing to ask you to log in to Google. Make sure at this point you log in as the user that created the project.
To view your new shared instance, go to http://<project name>.appspot.com – for the project created for this post, that is http://sdo-blogpost.appspot.com.
Working on the Files
I will go into the internal syntax of the controlling files in a later post. However, if you would like a preview, take a look in the data directory, where you will find a large file named schema.rdfa. This contains the specification for the core of the Schema.org vocabulary – for simple tweaks and changes you may find things self-explanatory. Also in that directory you will find several files that end in ‘-examples.txt’. As you might guess, these contain the examples that appear in the Schema.org pages.
Evolving and Sharing
How much you use your personal GitHub schemaorg repository fork to collaborate with like-minded colleagues, or just use it as a scratch working area for yourself, is up to you. However you choose to organise yourself, you will find the following git commands, which should be run when located in the schemaorg subdirectory, useful:
git status – shows how your local copy is in step with your repository
git add <filename> – adds a file to the ones being tracked in your repository
git commit <filename> – commits changes to the named file into your local history
git commit -a – commits all changed or added files
git push – uploads your local commits to your GitHub repository
It is recommended to commit as you go.
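As a hedged sketch of that workflow, run here in a throwaway repository so it is safe to try as-is (the file name and messages are invented; in practice you would run the same add/commit commands inside your schemaorg clone):

```shell
# A throwaway repository to demonstrate the add/commit cycle.
set -e
repo="$(mktemp -d)/schemaorg-demo"
git init -q "$repo"
cd "$repo"
git config user.email "you@example.com"   # identity for the demo commits
git config user.name  "Example User"
echo "a new example" > notes.txt
git status --short                # shows notes.txt as untracked (??)
git add notes.txt                 # start tracking the file
git commit -q -m "Add notes"      # record the change in local history
echo "a correction" >> notes.txt
git commit -q -a -m "Fix notes"   # -a commits all tracked, changed files
```

In a real session you would finish with git push to upload those commits to your GitHub fork.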
The mechanism for requesting a change of any type to Schema.org is to raise a GitHub Pull Request. Each new release of Schema.org is assembled by the organising team reviewing and hopefully accepting each Pull Request. You can see the current list of requests awaiting acceptance on GitHub. To stop the comments associated with individual requests from getting out of hand, and to make it easier to track progress, the preferred way of working is to raise a Pull Request as a final step in completing work on an Issue.
Raising an Issue first enables discussion to take place around proposals as they take shape. It is not uncommon for a final request to differ greatly from an original idea after interaction in the comment stream.
So I suggest that you raise an Issue in the Schema.org repository for what you are attempting to solve. Try to give it a good explanatory Title, and explain what you intend in the comment. This is where the code in your repository and the appspot.com working version can be very helpful in explaining and exploring the issue.
When ready to request, take yourself to your repository’s home page to create a New Pull Request. Provided you do not create a new branch in the code, any new commits you make to your repository will become part of that Pull Request – a very handy feature in the real world, where inevitably you want to make minor changes just after you say that you are done!
Look out for the next post in this series – Working Within the Vocabulary – in which I’ll cover working in the different file types that make up Schema.org and its extensions.
I find myself in New York for the day on my way back from the excellent Smart Data 2015 Conference in San Jose. It’s a long story about red-eye flights and significant weekend savings which I won’t bore you with, but it did result in some great chill-out time in Central Park to reflect on the week.
In its long auspicious history the SemTech, Semantic Tech & Business, and now Smart Data Conference has always attracted a good cross section of the best and brightest in Semantic Web, Linked Data, Web, and associated worlds. This year was no different for me in my new role as an independent working with OCLC and at Google.
I was there on behalf of OCLC to review significant developments with Schema.org in general – now with 640 Types (Classes) & 988 properties, and used on over 10 million web sites – plus the pioneering efforts OCLC is engaged with, publishing Schema.org data in volume from WorldCat.org and via APIs in their products. Check out my slides:
By mining the 300+ million records in WorldCat to identify, describe, and publish approx. 200 million Work entity descriptions, and [soon to be shared] 90+ million Person entity descriptions, this pioneering work continues.
These are not only significant steps forward for the bibliographic sector, but a great example of a pattern to be followed by most sectors:
Identify the entities in your data
Describe them well using Schema.org
Publish embedded in html
Work with, do not try to replace, the domain specific vocabularies – Bibframe in the library world
Work with the community to extend and enhance Schema.org to enable better representation of your resources
If Schema.org is still not broad enough for you, build an extension to it that solves your problems whilst still maintaining the significant benefits of sharing using Schema.org – in the library world’s case this was BiblioGraph.net
Extending Schema.org
Through OCLC, and now Google, I have been working with and around Schema.org since 2012. The presentation at Smart Data arrived at an opportune time to introduce and share some major developments with the vocabulary and the communities that surround it.
On a personal note, the launch of these extensions, bib.schema.org in particular, is the culmination of a bit of a journey that started a couple of years ago with the forming of the Schema Bib Extend W3C Community Group (SchemaBibEx), which had great success in proposing additions and changes to the core vocabulary.
A journey that then took in the formation of the BiblioGraph.net extension vocabulary, which demonstrated both how to build a domain-focused vocabulary on top of Schema.org and how the open source software that powers the Schema.org site could be forked for such an effort. These two laid the groundwork for defining how hosted and external extensions will operate, and for SchemaBibEx to be one of the first groups to propose a hosted extension.
Finally, this last month working at Google with Dan Brickley on Schema.org has been a bit of a blur as I brushed up my Python skills to turn the potential in version 2.0 into the reality of fully integrated and operational extensions in version 2.1. And to get it all done in time to talk about at Smart Data was the icing on the cake.
Of course things are not stopping there. On the not too distant horizon are:
The final acceptance of bib.schema.org & auto.schema.org – currently they are in final review.
SchemaBibEx can now follow up this initial version of bib.schema.org with items from its backlog.
New extension proposals are already in the works such as: health.schema.org, archives.schema.org, fibo.schema.org.
More work on the software to improve the navigation and helpfulness of the site for those looking to understand and adopt Schema.org and/or the extensions.
Checking that the software can host external extensions without too much effort.
And of course the continuing list of proposals and fixes for the core vocabulary and the site itself.
I believe we are on the cusp of a significant step forward for Schema.org as it becomes ubiquitous across the web; more organisations, encouraged by extensions, prepare to publish their data; and the SEO community recognises proof of it actually working – but more of that in the next post.
The Culture Grid closed to ‘new accessions’ (i.e. new collections of metadata) on the 30th April
The existing index and API will continue to operate in order to ensure legacy support
Museums, galleries, libraries and archives wishing to contribute material to Europeana can still do so via the ‘dark aggregator’, which the Collections Trust will continue to fund
Interested parties are invited to investigate using the Europeana Connection Kit to automate the batch-submission of records into Europeana
The reasons he gave for the ending of this aggregation service are enlightening for all engaged with or thinking about data aggregation in the library, museum, and archives sectors.
Throughout its history, the Culture Grid has been tough going. Looking back over the past 7 years, I think there are 3 primary and connected reasons for this:
The value proposition for aggregation doesn’t stack up in terms that appeal to museums, libraries and archives. The investment of time and effort required to participate in platforms like the Culture Grid isn’t matched by an equal return on that investment in terms of profile, audience, visits or political benefit. Why would you spend 4 days tidying up your collections information so that you can give it to someone else to put on their website? Where’s the kudos, increased visitor numbers or financial return?
Museum data (and to a lesser extent library and archive data) is non-standard, largely unstructured and dependent on complex relations. In the 7 years of running the Culture Grid, we have yet to find a single museum whose data conforms to its own published standard, with the result that every single data source has required a minimum of 3-5 days and frequently much longer to prepare for aggregation. This has been particularly salutary in that it comes after 17 years of the SPECTRUM standard providing, in theory at least, a rich common data standard for museums;
Metadata is incidental. After many years of pump-priming applications which seek to make use of museum metadata it is increasingly clear that metadata is the salt and pepper on the table, not the main meal. It serves a variety of use cases, but none of them is ‘proper’ as a cultural experience in its own right. The most ‘real’ value proposition for metadata is in powering additional services like related search & context-rich browsing.
The first two of these issues represent a fundamental challenge for anyone aiming to promote aggregation. Countering them requires a huge upfront investment in user support and promotion, quality control, training and standards development.
The 3rd is the killer though – countering these investment challenges would be possible if doing so were to lead directly to rich end-user experiences. But they don’t. Instead, you have to spend a huge amount of time, effort and money to deliver something which the vast majority of users essentially regard as background texture.
As an old friend of mine would depressingly say – Makes you feel like packing up your tent and going home!
Interestingly, earlier in the post Nick gives us an insight into the purpose of the Culture Grid:
.… we created the Culture Grid with the aim of opening up digital collections for discovery and use ….
That basic purpose is still very valid for both physical and digital collections of all types. The what [helping people find, discover, view and use cultural resources] is as valid as it has ever been. It is the how [aggregating metadata and building shared discovery interfaces and landing pages for it] that has been too difficult to justify continuing in Culture Grid’s case.
In my recent presentations to library audiences I have been asking a simple question: “Why do we catalogue?” Sometimes immediately, sometimes after some embarrassed shuffling of feet, I inevitably get the answer “So we can find stuff!” In libraries, archives, and museums, helping people find the stuff we have is core to what we do – all the other things we do are a little pointless if people can’t find, or even be aware of, what we have.
If you are hoping your resources will be found they have to be referenced where people are looking. Where are they looking?
It is exceedingly likely they are not looking in your aggregated discovery interface, or your local library, archive or museum interface either. Take a look at this chart detailing the discovery starting point for college students and others. Starting in a search engine is up in the high eighty percents, with things like library web sites and other targeted sources only just making it over the 1% hurdle to get on the chart. We have known about this for some time – the chart comes from an OCLC report, ‘College Students’ Perceptions of Libraries and Information Resources’, published in 2005. I would love to see a similar report from recent times; it would have to include elements such as Siri, Cortana, and other discovery tools built in to our mobile devices, which of course are powered by the search engines. It makes me wonder how few cultural heritage specific sources would actually make that 1% cut today.
Our potential users are in the search engines in one way or another, yet in the vast majority of cases our [cultural heritage] resources are not there for them to discover.
The Culture Grid, I would suggest, is probably not the only organisation with an ‘aggregate for discovery’ reason for its existence that may be struggling to stay relevant, or even in existence.
You may well ask about OCLC, with its iconic WorldCat.org discovery interface. It is a bit simplistic to say that its 320 million plus bibliographic records are in WorldCat only for people to search and discover through the worldcat.org user interface. Those records also underpin many of the services – such as cooperative cataloguing, record supply, interlibrary loan, and general library back office tasks – that OCLC members and partners benefit from. Also, for many years WorldCat has been at the heart of syndication partnerships supplying data to prominent organisations, including Google, that help them reference resources within WorldCat.org, which in turn, via a find-in-a-library capability, leads to clicks onwards to individual libraries. [Declaration: OCLC is the company name on my current salary check] Nevertheless, even though WorldCat has a broad spectrum of objectives, it is not totally immune from the influences that are troubling the likes of the Culture Grid. In fact they are among the web trends that have been driving the Linked Data and Schema.org efforts from the WorldCat team, but more of that later.
How do we get our resources visible in the search engines then? By telling the search engines what we [individual organisations] have. We do that by sharing a relevant view of our metadata about our resources, not necessarily all of it, in a form that the search engines can easily consume. Basically this means sharing data embedded in your web pages, marked up using the Schema.org vocabulary. To see how this works, we need look no further than the rest of the web – commerce, news, entertainment etc. There are already millions of organisations, measured by domains, that share structured data in their web pages using the Schema.org vocabulary with the search engines. This data is being used to direct users with more confidence directly to a site, and is contributing to the global web of data.
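As an illustration of what that embedded markup can look like – a hedged sketch, with invented values – a library page describing a book might include a JSON-LD block such as:

```html
<!-- Illustrative only: a Schema.org description of a book, embedded as JSON-LD -->
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Book",
  "name": "A Tale of Two Cities",
  "author": { "@type": "Person", "name": "Charles Dickens" },
  "workExample": {
    "@type": "Book",
    "isbn": "9780141439600",
    "bookFormat": "http://schema.org/Paperback"
  }
}
</script>
```

The same description could equally be expressed as Microdata or RDFa attributes within the page's HTML; the vocabulary is the same, only the serialisation differs.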
There used to be a time when people in the commercial world complained of always ending up being directed to shopping [aggregation] sites instead of directly to where they could buy the TV or washing machine they were looking for. Today you are far more likely to be given some options in the search engine that link you directly to the retailer. I believe this is symptomatic of the disintermediation of the aggregators by individual syndication of metadata from those retailers.
Can these lessons be carried through to the cultural heritage sector? Of course they can. This is where there might be a bit of light at the end of the tunnel for those behind aggregations such as the Culture Grid – not for continuation as an aggregation/discovery site, but as a facilitator for the individual contributors. This stuff, when you first get into it, is not simple, and many organisations do not have the time and resources to understand how to share Schema.org data about their resources with the web. The technology itself is comparatively simple in web terms; it is the transition and implementation that many may need help with.
Schema.org is not the perfect solution to describing resources; it is not designed to be. It is there to describe them sufficiently for them to be found on the web. Nevertheless it is also being evolved by community groups to enhance its capabilities. Through my work with the Schema Bib Extend W3C Community Group, enhancements to Schema.org to enable better description of bibliographic resources have been successfully proposed and adopted. This work is continuing towards a bibliographic extension – bib.schema.org. There is obvious potential for other communities to help evolve and extend Schema.org to better represent their particular resources – archives, for example. I would be happy to talk with others who want insights into how they may do this for their benefit.
Schema.org is not a replacement for our rich common data standards, such as MARC for libraries and SPECTRUM for museums as Nick describes. Those serve purposes beyond sharing information with the wider world, and should continue to be used for those purposes whilst relevant. However, we cannot expect the rest of the world to get its head around our internal vocabularies and formats in order to point people at our resources. It needs to be a compromise. We can continue to use what is relevant in our own sectors whilst sharing Schema.org data so that our resources can be discovered and then explored further.
So to return to the question I posed – Is There Still a Case for Cultural Heritage Data Aggregation? If the aggregation is purely for the purpose of supporting discovery, I think the answer is a simple no. If it has a broader purpose, as for WorldCat, it is not as clear cut.
Nevertheless, I do believe that many of the people behind the aggregations are in the ideal place to help facilitate the eventual goal of making cultural heritage resources easily discoverable – with some creative thinking, the adoption of ‘web’ techniques, technologies and approaches to provide facilitation services, and a review of what their real goals are [which may not include running a search interface]. I believe we are moving into an era where shared authoritative sources of easily consumable data could make our resources more visible than we could previously have hoped.
Are there any black clouds on this hopeful horizon? Yes, there is one, in the shape of traditional cultural heritage technology conservatism. The tendency to assume that our vocabulary or ontology is the only way to describe our resources, coupled with a reticence to be seen to engage with the commercial discovery world, could still hold back the potential.
If you are an individual library, archive, or museum scratching your head about how to get your resources visible in Google, and you do not have the in-house ability to react, try talking within the communities around and behind the aggregation services you already know. They should all be learning, and a problem shared is more easily solved. None of this is rocket science, but trying something new is often better as a group.