Evolving Schema.org in Practice Pt1: The Bits and Pieces

I am often asked by people with ideas for extending or enhancing Schema.org how they go about it.  These requests inevitably fall into two categories – either ‘How do I decide upon and organise my new types & properties and relate them to other vocabularies and ontology‘ or ‘now I have my proposals, how do I test, share, and submit them to the Schema.org community?

I touch on both of theses areas in a free webinar I recorded for DCMI/ASIS&T a couple of months ago.  It is in the second in a two part series Schema.org in Two Parts: From Use to Extension .  The first part covers the history of Schema.org and the development of extensions.  That part is based up on my experiences applying and encouraging the use of Schema.org with bibliographic resources, including the set up and work of the Schema Bib Extend W3C Community Group – bibliographically focused but of interest to anyone looking to extend Schema.org.

To add to those webinars, the focus of this post is in answering the ‘now I have my proposals, how do I test, share, and submit them to the Schema.org community?‘ question.  In later posts I will move onto how the vocabulary its examples and extensions are defined and how to decide where and how to extend.

What skills do you need

Not many.  If you want to add to the vocabulary and/or examples you will naturally need some basic understanding of the vocabulary and the way you navigate around the Schema.org site, viewing examples etc.  Beyond that you need to be able to run a few command line instructions on your computer and interact with GitHub.  If you are creating examples, you will need to understand how Microdata, RDFa, and JSON-LD mark up are added to html.

I am presuming that you want to do more than tweak a typo, which could be done directly in the GitHub interface, so in this post I step through the practice of working locally, sharing with others, and proposing via a Github Pull Request your efforts..

How do I start

Environment
GitHub You need to set up the environment on your PC, this needs a local installation of Git so that you can interact with the Schema.org source and a local copy of the Google App Engine SDK to run your local copy of the Schema.org site.  The following couple of links should help you get these going.

Getting the Source

This is a two-step process.  Firstly you need your own parallel fork of the Schema.org repository.  If you have not yet, create a user account at Github.com.  They are free, unless you want to keep your work private.

Fork2 Logged into Github, go to the Schema.org repository page – https://github.com/schemaorg/schemaorg, and select Fork this will create a schemaorg repository copy under your account.

Create yourself a working area on your PC and via a command line/terminal window place yourself in that directory to run the following git command, with MyAccount being replaced with your Github account name:

This will download and unwrap a copy of the code into a schemaorg subdirectory of your working directory.

Running a Local Version
In the directory where you downloaded the code, run the following command:

This should result in the output at the command line that looks something like this:

deimos_dev_appserver

The important line being the one telling you module “default” running at: http://localhost:8080 If you drop that web address into your favourite browser you should end up looking at a familiar screen.

local-schemaorg Success! You should now be looking at a version that operates exactly like the liver version, but is totally contained on your local PC.  Note the message on the home page reminding you which version you are viewing.

Running a Shared Public Version
It is common practice to want to share proposed changes with others before applying them to the Schema.org repository in Github.  Fortunately there is an easy free way of running a Google App Engine in the cloud.  To do this you will need a Google account which most of us have.  When logged in to your Google account visit this page: https://console.cloud.google.com

From the ‘Select a project‘ menu Create a project..  Give your project a name – choose a name that is globally unique.  There is a convention that we use names that start with ‘sdo-‘ as an indication that it is a project running a Schema.org instance.

app_yaml To ready your local code to be able to be uploaded into the public instance you need to make a minor change in a file named app.yaml in the schemaorg directory.  Use your favourite text editor to change the line near the top of the file that begins application to have a value that is the same as the project name you have just crated.  Note that lines beginning with a ‘#’ character are commented out and have no effect on operation.  For this post I have created an App Engine project named sdo-blogpost.

To upload the code run the following command:

sdo upload You should get output that indicates the upload process has happened successfully. Dependant on your login state, you may find a browser window appearing to ask you to login to Google. Make sure at this point you login as the user that created the project.

appspot To view your new shared instance go to the following address http://sdo-blogpost.appspot.com – modified to take account of your project name http://<project name>.appspot.com.

Working on the Files
I will go into the internal syntax of the controlling files in a later post.  However, if you would like a preview, take a look in the data directory you will find a large file named schema.rdfa.  This contains the specification for core of the Schema.org vocabulary – for simple tweaks and changes you may find things self-explanatory.  Also in that directory you will find several files that end in ‘-examples.txt‘.  As you might guess, these contain the examples that appear in the Schema.org pages.

Evolving and Sharing
How much you use your personal Github schemaorg repositoy fork to collaborate with like minded colleagues, or just use it as a scratch working area for yourself, is up to you.  However you choose to organise yourself, you will find the following git commands, that should be run when located in the schemaorg subdirectory, useful:

  • git status – How your local copy is instep with your repository
  • git add <filename> – adds file to the ones being tracked against your repository
  • git commit <filename> – commits (uploads) local changed or added file to your repository
  • git commit –a – commits (uploads) all changed or added files to your repository

It is recommended to commit as you go.

Requesting Changes
The mechanism for requesting a change of any type to Schema.org is to raise a Github Pull Request.  Each new release of Schema.org is assembled by the organising team reviewing and hopefully accepting each Pull Request. You can see the current list of requests awaiting acceptance in Github.  To stop the comments associated with individual requests getting out of hand, and to make it easier to track progress, the preferred way of working is to raise a Pull Request as a final step in completing work on an Issue.

Raising an Issue first enables discussion to take place around proposals as they take shape.  It is not uncommon for a final request to differ greatly from an original idea after interaction in the comment stream.

So I suggest that you raise an Issue in the Schema.org repository for what you are attempting to solve.  Try to give it a good explanatory Title, and explain what you intend in the comment.   This is where the code in your repository and the appspot.com working version can be very helpful in explaining and exploring the issue.

Pull rquest When ready to request, take yourself to your repository’s home page to create a New Pull request.  Providing you do not create a new branch in the code, any new commits you make to your repository will become part of that Pull Request.  A very handy feature in the real world where inevitably you want to make minor changes just after you say that you are done!

Look out for the next post in this series – Working Within the Vocabulary – in which I’ll cover working in the different file types that make up Schema.org and its extensions.

The role of Role in Schema.org






This post is about an unusual, but very useful, aspect of the Schema.org vocabulary — the Role type.






schema-org1 Schema.org is basically a simple vocabulary for describing stuff, on the web.  Embed it in your html and the search engines will pick it up as they crawl, and add it to their structured data knowledge graphs.  They even give you three formats to chose from — Microdata, RDFa, and JSON-LD — when doing the embedding.  I’m assuming, for this post, that the benefits of being part of the Knowledge Graphs that underpin so called Semantic Search, and hopefully triggering some Rich Snippet enhanced results display as a side benefit, are self evident.

The vocabulary itself is comparatively easy to apply once you get your head around it — find the appropriate Type (Person, CreativeWork, Place, Organization, etc.) for the thing you are describing, check out the properties in the documentation and code up the ones you have values for.  Ideally provide a URI (URL in Schema.org) for a property that references another thing, but if you don’t have one a simple string will do.

There are a few strangenesses, that hit you when you first delve into using the vocabulary.  For example, there is no problem in describing something that is of multiple types — a LocalBussiness is both an Organisation and a Place.  This post is about another unusual, but very useful, aspect of the vocabulary — the Role type.

At first look at the documentation, Role looks like a very simple type with a handful of properties.  On closer inspection, however, it doesn’t seem to fit in with the rest of the vocabulary.  That is because it is capable of fitting almost anywhere.  Anywhere there is a relationship between one type and another, that is.  It is a special case type that allows a relationship, say between a Person and an Organization, to be given extra attributes.  Some might term this as a form of annotation.

So what need is this satisfying you may ask.  It must be a significant need to cause the creation of a special case in the vocabulary.  Let me walk through a case, that is used in a Schema.org Blog post, to explain a need scenario and how Role satisfies that need.

Starting With American Football

Say you are describing members of an American Football Team.  Firstly you would describe the team using the SportsOrganization type, giving it a name, sport, etc. Using RDFa:

Then describe a player using a Person type, providing name, gender, etc.:

Now lets relate them together by adding an athlete relationship to the Person description:

1

Let’s take a look of the data structure we have created using Turtle – not a html markup syntax but an excellent way to visualise the data structures isolated from the html:

So we now have Chucker Roberts described as an athlete on the Touchline Gods team.  The obvious question then is how do we describe the position he plays in the team.  We could have extended the SportsOrganization type with a property for every position, but scaling that across every position for every team sport type would have soon ended up with far more properties than would have been sensible, and beyond the maintenance scope of a generic vocabulary such as Schema.org.

This is where Role comes in handy.  Regardless of the range defined for any property in Schema.org, it is acceptable to provide a Role as a value.  The convention then is to use a property with the same property name, that the Role is a value for, to then remake the connection to the referenced thing (in this case the Person).  In simple terms we have have just inserted a Role type between the original two descriptions.

2

This indirection has not added much you might initially think, but Role has some properties of its own (startDate, endDate, roleName) that can help us qualify the relationship between the SportsOrganization and the athlete (Person).  For the field of organizations there is a subtype of Role (OrganizationRole) which allows the relationship to be qualified slightly more.

3

RDFa:

and in Turtle:

Beyond American Football

So far I have just been stepping through the example provided in the Schema.org blog post on this.  Let’s take a look at an example from another domain – the one I spend my life immersed in – libraries.

There are many relationships between creative works that libraries curate and describe (books, articles, theses, manuscripts, etc.) and people & organisations that are not covered adequately by the properties available (author, illustrator,  contributor, publisher, character, etc.) in CreativeWork and its subtypes.  By using Role, in the same way as in the sports example above,  we have the flexibility to describe what is needed.

Take a book (How to be Orange: an alternative Dutch assimilation course) authored by Gregory Scott Shapiro, that has a preface written by Floor de Goede. As there is no writerOfPreface property we can use, the best we could do is to is to put Floor de Goede in as a contributor.  However by using Role can qualify the contribution role that he played to be that of the writer of preface.

4

In Turtle:

and RDFa:

You will note in this example I have made use of URLs, to external resources – VIAF for defining the Persons and the Library of Congress relator codes – instead of defining them myself as strings.  I have also linked the book to it’s Work definition so that someone exploring the data can discover other editions of the same work.

Do I always use Role?
In the above example I relate a book to two people, the author and the writer of preface.  I could have linked to the author via another role with the roleName being ‘Author’ or <http://id.loc.gov/vocabulary/relators/aut>.  Although possible, it is not a recommended approach.  Wherever possible use the properties defined for a type.  This is what data consumers such as search engines are going to be initially looking for.

One last example

To demonstrate the flexibility of using the Role type here is the markup that shows a small diversion in my early career:

This demonstrates the ability of Role to be used to provide added information about most relationships between entities, in this case the employee relationship. Often Role itself is sufficient, with the ability for the vocabulary to be extended with subtypes of Role to provide further use-case specific properties added.

Whenever possible use URLs for roleName
In the above example, it is exceedingly unlikely that there is a citeable definition on the web, I could link to for the roleName. So it is perfectly acceptable to just use the string “Keyboards Roadie”.  However to help the search engines understand unambiguously what role you are describing, it is always better to use a URL.  If you can’t find one, for example in the Library of Congress Relater Codes, or in Wikidata, consider creating one yourself in Wikipedia or Wikidata for others to share. Another spin-off benefit for using URIs (URLs) is that they are language independent, regardless of the language of the labels in the data  the URI always means the same thing.  Sources like Wikidata often have names and descriptions for things defined in multiple languages, which can be useful in itself.

Final advice
This very flexible mechanism has many potential uses when describing your resources in Schema.org. There is always a danger in over using useful techniques such as this. Be sure that there is not already a way within Schema, or worth proposing to those that look after the vocabulary, before using it.

Good luck in your role in describing your resources and the relationships between them using Schema.org

From Records to a Web of Library Data – Pt1 Entification






The phrase ‘getting library data into a linked data form’ hides multitude of issues. There are some obvious steps such as holding and/or outputting the data in RDF, providing resources with permanent URIs, etc. However, deriving useful library linked data from a source, such as a Marc record, requires far more than giving it a URI and encoding what you know, unchanged, as RDF triples.






As is often the way, you start a post without realising that it is part of a series of posts – as with this one.  This, and the following two posts in the series – Hubs of Authority, and Beacons of Availability – together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek.  Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.

Entification

russian dolls Entification – a bit of an ugly word, but in my day to day existence one I am hearing more and more. What an exciting life I lead…

What is it, and why should I care, you may be asking.

I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and also as a potential stimulator of significant efficiencies in the creation and management of information about those resources.  Taking those benefits as being accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.

That phrase ‘getting library data into a linked data form’ hides multitude of issues.  There are some obvious steps such as holding and/or outputting the data in RDF, providing resources with permanent URIs, etc.  However, deriving useful library linked data from a source, such as a Marc record, requires far more than giving it a URI and encoding what you know, unchanged, as RDF triples.

Marc is a record based format.  For each book catalogued, a record created.  The mantra driven in to future cataloguers at library school has been, and I believe often still is, catalogue the item in your hand. Everything discoverable about that item in their hand is transferred on to that [now virtual] catalogue card stored in their library system.  In that record we get obvious bookish information such as title, size, format, number of pages, isbn, etc.  We also get information about the author (name, birth/death dates etc.), publisher (location, name etc.), classification scheme identifiers, subjects, genres, notes, holding information, etc., etc., etc.  A vast amount of information about, and related to, that book in a single record.  A significant achievement – assembling all this information for the vast majority of books in the vast majority of the libraries of the world.   In this world of electronic resources a pattern that is being repeated for articles, journals, eBooks, audiobooks, etc.

Why do we catalogue?  A question I often ask with an obvious answer – so that people can find our stuff.  Replicating the polished draws of catalogue cards of old, ordered by author name or subject, indexes are applied to the strings stored in those records .  Indexes acting as search access points to a library’s collection.

A spin-off of capturing information in record attributes, about library books/articles/etc., is that we are also building up information about authors, publishers subjects and classifications.   So for instance a subject index will contain a list of all the names of the subjects addressed by an individual library collection.  To apply some consistency between libraries, authorities – authoritative sets of names, subject headings etc., have emerged so that spellings and name formats could be shared in a controlled way between libraries and cataloguers.

So where does entification come in?  Well, much of the information about authors subjects, publishers, and the like is locked up in those records.  A record could be taken as describing an entity, the book. However the other entities in the library universe are described as only attributes of the book/article/text.    I can attest to the vast computing power and intellectual effort that goes into efforts at OCLC to mine these attributes from records to derive descriptions of the entities they represent – the people, places, organisations, subjects, etc. that the resources are by, about, or related to in some way.

Once the entities are identified, and a model is produced & populated from the records, we can start to work with a true multi-dimensional view of our domain.  A major step forward from the somewhat singular view that we have been working with over previous decades.  With such a model it should be possible to identify and work with new relationships, such as publishers and their authors, subjects and collections, works and their available formats.

We are in a state of change in the library world which entification of our data will help us get to grips with.  As you can imagine as these new approaches crystallise, they are leading to all sorts of discussions around what are the major entities we need to concern ourselves with; how do we model them; how do we populate that model from source [record] data; how do we do it without compromising the rich resources we are working with; and how do we continue to provide and improve the services relied upon at the moment, whilst change happens.  Challenging times – bring on the entification!

Russian doll image by smcgee on Flickr

The Correct End Of Your Telescope – Viewing Schema.org Adoption

schema-org1telescope I have been banging on about Schema.org for a while.  For those that have been lurking under a structured data rock for the last year, it is an initiative of cooperation between Google, Bing, Yahoo!, and Yandex to establish a vocabulary for embedding structured data in web pages to describe ‘things’ on the web.  Apart from the simple significance of having those four names in the same sentence as the word cooperation, this initiative is starting to have some impact.  As I reported back in June, the search engines are already seeing some 7%-10% of pages they crawl containing Schema.org markup.  Like it or not, it is clear that Schema.org is rapidly becoming a de facto way of marking up your data if you want it to be shared on the web and have it recognised by the major search engines.

Snapshot It is no coincidence then, at OCLC we chose Schema.org as the way to expose linked data in WorldCat.  If you haven’t seen it, just search for any item at worldcat.org, scroll to the bottom of the page and open up the Linked Data tab and there you will see the [not very pretty, but hay it’s really designed for systems not humans] Schema.org marked up linked data for the item, with links out to other data sources such as VIAF, LCSH, FAST, and Dewey.

As with everything new it was not perfect from the start.  We discovered some limitations in the vocabulary as my colleagues attempted to describe WorldCat resources. Leading to the creation of a Library vocabulary (as a potential extension to Schema.org) to help encode some of the stuff that Schema couldn’t.  Fortunately, those at Schema.org are open to extension proposals and, with the help of the W3C, run a Group [WebSchemas]to propose and discuss them.  Proposals that have already been accepted include those from news and ecommerce groups.

Things have moved on and, I have launched another W3C community Group – Schema Bib Extend to attempt to build a consensus, across a wide group of those concerned about things bibliographic, around proposing extensions to the Schema.org vocabulary.  Addressing it’s capability for describing these types of resources – books, journals, articles, theses, etc., etc. in all forms and formats. 

My personal hope being that the resulting proposals, if and when adopted by Schema.org, will enable libraries, publishers, interest groups, universities, retailers, OCLC, and others to not only publish data about their resources in a way that the search engines can understand, but also have a light weight way to interconnect them to each other and authoritative identifiers for place, name, subject, etc., that will help us begin to form a distributed web of bibliographic data.   A bit of a grand ambition for a fairly simple vocabulary you may think, but things worth having are worth reaching for.  

So focusing back on the short term for the moment. Extending Schema.org to better describe bib resources could have significant benefits anyway. What is in library catalogues, and other bibliographic sources, is mostly hidden to search engines – OPAC pages are almost impossible scrape intuitively, data formats used are only understood by the library and publisher worlds, and even if they ascertain the work a library is describing, there is little way to identify that it is, or is not, the same as one in another library.  It is no accident that Google Book Search came into being utilising special data ingest processes and search techniques to help. Unfortunately there is a significant part of the population unaware of it’s existence and few who use it as part of their general search activities.  By marking up your resources in their terms, your data should appear in the main search indexes and you may even get a better results listing (courtesy of Google Rich Snippets).

OK, that’s the pitch for Schema.org (and getting together to extend it a little in the bibliographic direction) over.  Now on to the point of this post – the mindset we should adopt when approaching the generic, high level, course grained, broad but shallow, simplistic [choose your own phrase] Schema.org vocabulary to describe rich and [already] richly described resources we find in libraries.  Although all my examples will be library/bibliographic ones, I believe that much of what I describe here will be of use and relevance to those in other industries with rich and established ways to describe their data and resources.

Initially let me get a few simple things out of the way.  Firstly, the Schema.org vocabulary is not designed to, and will never, replace any rich industry specific vocabularies or ontologies.  It’s prime benefits are that it is light-weight (understandable by non-experts) and cross-sectoral (data from many domains can be merged and mixed) and, oh yes becoming broadly adopted.  Secondly nobody is advocating that anyone starts to use it instead of their currently used standards – either mix it with your domain specific standards and/or use it as ‘publicly understandable’ publishing format for web pages and the like.  Finally, although initially conceived as a web page markup (Microdata) format, the schema.org vocabulary is equally applicable as Linked Data vocabulary that can be used in the creation of RDF data.  The increasing use and reference to RDFa in Schema.being a reflection of this.  This is also exemplified by the use of Schema.org in the RDF N-Triples dump file OCLC has published of a sub-set of WorldCat data.

So moving on. You have your resources already being described, following established practice, in domain specific format(s) and you want to attempt to describe them using the Schema.org vocabulary.  In the library/publishing community we have more such standards than you can shake a stick at – MARC (of several mostly incompatible flavours), MODS, METS, ONIX, ISBD, RDA, to name just some. Each have their enthusiasts, and proponents, many being a great starting point for a process that might go something like this:

Working my way through all the elements of the [insert your favourite here] standard let me find an equivalent in Schema that I can map my data to.

This can become a bit of an involved operation.  Take something as simple as the author of a book for instance.  Bibliographic standards have concepts such as main author, corporate, creator, contributor, etc.  Schema>Book only has the simple property ‘author’.  How can I reflect the rich nuances and detail in my [library] format, in this simplistic Schema.org vocabulary?  Simple answer – you can’t, so don’t try.  The question you have to ask yourself at this point is: By adding all this detail will I confuse potential consumers of this data, or will the Googles of this world just want to know the people and organisations connected with [linked to] this book in a creative (text) way.  Taking this approach of looking at the problem from the data/domain expert’s end of the telescope means that you have to go through a similar process for each and every element in your data forma/vocabulary/standard.  An approach that will most probably lead to a long list of things missing from and recommendations for Schema.org that they (the group, not the vocabulary) would be unlikely to accept.

Let me propose an alternative approach by turning the telescope around and viewing the data, that you care about and want to publish, from the non-expert consumer’s point of view.  Using my book example again it might go like this:

Schema has a Book class (great!) let me step through it’s properties and identify where in [insert your favourite standard here] I could get that from.

So for example, the ‘author’ property of Schema’s Book class comes from it being a sub-class of the generic CreativeWork class where it is defined as being a Person or Organization – The author of this content.  You can now look into your own vocabulary or standard to find the elements which would contain author-ish data to map to Schema. 

Hang on a moment though!  The Book>author property is defined as being a instance of (or link to) Person or Organization classes.  This means that when we start to publish our data in this form, it is not a matter of just extracting the text string of the author’s name from our data; we need to provide a link to a description of that author (preferably also in Schema.org format).  WorldCat data does this by providing author links to VIAF – a pattern repeated with other properties such as ‘about’ (with links to Dewey and LCSH).

Taking this approach limits you to only thinking about the things Schema [currently] concerns itself with – a much simpler process. 

If that was all there was to it, there would be no need for the Schema Bib Extend Group. As we did at OCLC with WorldCat, some gaps were identified in the result, making it unsatisfactory in some areas in providing a description for even a non-expert.  Obvious candidates [for a Book] include a holding statement, and how to describe the type of book (ebook, talking book, etc.) and the format it is in (paper/hard back, large print, CD, Cassette, MP3, etc.)  However, approaching it from this direction encourages you to firstly look across other areas of the Schema.org vocabulary and other extension proposals for solutions.  GoodRelations, soon to be merged into Schema, offers some promising potential answers for holdings (describing them as Offers to hire/lease). A proposal from the Radio/TV community includes a PublicationEvent.

Finally it is only the gaps, or anomalies, apparent at a Schema.org level that should turn into proposals for extension.  How they would map to elements of standards from our own domain would be down to us [as with what is already in Schema.org] to establish and share consensus driven good practice and lots, and lots, of examples.

We, especially in the library community, have invested much time and effort over many decades in describing [cataloguing] our resources so that people can discover and benefit from them.  Long gone are the days when the way to find things was to visit the library and flick through draws full of catalogue cards.   Libraries were quick to take advantage of the web, putting up their WebOPAC’s so that you could ‘search from home’.  However, study after study has shown that people are now not visiting the library online either. The de facto [and often only] start point is now a search engine – increasingly as represented by a generic search prompt on your phone or tablet device.

This evolution in searching practice would be fine [from a library point of view] if library resources were identified and described to the search engines such that they can easily consume and understand it – so far it hasn’t been.  Schema.org is a way to do that, and to be realistic at the moment is the only show in town that fits that particular bill.  We realised decades, if not centuries ago, that for people to find our things we need to describe them, but the best descriptions in the world are about as much use as a chocolate teapot if they are not in places where those people are looking. 

If you want to know more about bibliographic extension proposals to Schema.org, or help in creating them, join us at Schema Bib Extend.

And remember – when you are thinking about relating your favourite standard to Schema.org, check which end of the telescope you are using before you start.