A Fundamental Component of a New Web?






Marketing Hype! I hear you thinking – well at least I didn’t use the tired old ‘Next Generation’ label.

Let me explain what this fundamental component is, of what I see as potentially a New Web, and what I mean by New Web.

This fundamental component, you might be surprised to learn, is a vocabulary – Schema.org.  But let me first set the context by explaining my thoughts on this New Web.

Having once been considered an expert on Web 2.0 (I hasten to add by others, not myself) I know how dangerous it can be to attach labels to things.  It tends to spawn screenfuls of passionate opinions on the relevance of the name, the date of the revolution, and over-detailed analysis of isolated parts of what is a general movement.   I know I am on dangerous ground here!

To my mind something is new when it feels different.  The Internet felt different when the Web (aka HTTP + HTML + browsers) arrived.  The Web felt different (Web 2.0?) when it became more immersive (write as well as read) and visually we stopped trying to emulate in a graphical style what we saw on character terminals. Oh, and yes we started to round our corners. 

There have been many times over the last few years when it felt new – when it suddenly arrived in our pockets (the mobile web) – when the inner thoughts, and eating habits, of more friends than you ever remember meeting became of apparent headline importance (the social web) – when [the contents of] the web broke out of the boundaries of the browser and appeared embedded in every app, TV show, and voice activated device.

The feeling different phase I think we are going through at the moment, like previous times, is building on what went before.  It is exemplified by information [data] breaking out of the boundaries of our web sites and appearing where it is useful for the user. 

We are seeing the tip of this iceberg in the search engine Knowledge Panels, answer boxes, and rich snippets.  The effect of this is that often your potential user can get what they need without having to find and visit your site – answering questions such as: what is the customer service phone number for an organisation; is the local branch open at the moment; give me driving directions to it; what is available and on offer.  Increasingly these interactions can occur without the user even being aware they are using the web – “Siri! Where is my nearest library?”
A great way to build relationships with your customers.  However, it is a new and interesting challenge for those trying to measure the impact of your web site.

So, what is fundamental to this New Web?

There are several things – HTTP, the light-weight protocol designed to transfer text, links and latterly data, across an internet previously used to specific protocols for specific purposes – HTML, that open, standard, easily copied light-weight extensible generic format for describing web pages that all browsers can understand – Microdata, RDFa, JSON, JSON-LD – open standards for easily embedding data into HTML – RDF, an open data format for describing things of any sort, in the form of triples, using shared vocabularies.  Building upon those is Schema.org – an open, [de facto] standard, generic vocabulary for describing things in most areas of interest. 

Why is one vocabulary fundamental when there are so many others to choose from? Check out the 500+ referenced on the Linked Open Vocabularies (LOV) site.  Schema.org however differs from most of the others in a few key areas:

  • Size and scope – its current 642 Types and 992 Properties is significantly larger and covers far more domains of interest than most others.  This means that if you are looking to describe something, you are highly likely to find enough to at least start.  Despite its size, it is still far from capable of describing everything on, or off, the planet.
  • Adoption – it is estimated to be in use on over 12 million sites.  A sample of 10 billion pages showed over 30% containing Schema.org markup.  Check out this article for more detail: Schema.org: Evolution of Structured Data on the Web – Big data makes common schemas even more necessary. By Guha, Brickley and Macbeth.
  • Evolution – it is under continuous evolutionary development and extension, driven and guided by an open community under the wing of the W3C and accessible in a GitHub repository.
  • Flexibility – from the beginning Schema.org was designed to be used in a choice of your favourite serialisation – Microdata, RDFa, JSON-LD, with the flexibility of allowing values to default to text if you have not got a URI available.
  • Consumers – The major search engines Google, Bing, Yahoo!, and Yandex, not only back the open initiative behind Schema.org but actively search out Schema.org markup to add to their Knowledge Graphs when crawling your sites.
  • Guidance – If you search out guidance on supplying structured data to those major search engines, you are soon supplied with recommendations and examples for using Schema.org, such as this from Google.  They even supply testing tools for you to validate your markup.
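
To make the flexibility point a little more concrete, here is a minimal sketch of producing a Schema.org description in one of those serialisations, JSON-LD.  The helper names (book_jsonld, as_script_tag) are my own inventions for illustration, not part of any library, and the author name and title are example values only.  Note how the author falls back to plain text when no URI is to hand.

```python
import json


def book_jsonld(name, author, author_uri=None, isbn=None):
    """Build a Schema.org Book description as a JSON-LD dictionary.

    Hypothetical helper for illustration only.
    """
    if author_uri:
        # A URI is available: describe the author as a linked Person.
        author_value = {"@type": "Person", "@id": author_uri, "name": author}
    else:
        # No URI available: Schema.org allows the value to default to text.
        author_value = author

    doc = {
        "@context": "http://schema.org",
        "@type": "Book",
        "name": name,
        "author": author_value,
    }
    if isbn:
        doc["isbn"] = isbn
    return doc


def as_script_tag(doc):
    """Wrap a JSON-LD dictionary in the script element used to embed it in HTML."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


example = book_jsonld(
    "Zen and the art of motorcycle maintenance",
    "Robert Pirsig",
    author_uri="http://viaf.org/viaf/78757182",
)
print(as_script_tag(example))
```

The same description could equally be expressed as Microdata or RDFa attributes on the visible HTML; JSON-LD is used here only because it is the easiest to generate from a script.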

With this support and adoption, the Schema.org initiative has become self-fulfilling.  If your objective is to share or market structured data about your site, organisation, resources, and/or products with the wider world, it would be difficult to come up with a good reason not to use Schema.org.

Is it a fully ontologically correct semantic web vocabulary? Although you can see many semantic web and linked data principles within it, no it is not.  That is not its objective. It is a pragmatic compromise between such things, and the general needs of webmasters with ambitions to have their resources become an authoritative part of the global knowledge graphs, that are emerging as key to the future of the development of search engines and the web they inhabit.

Note that I question if Schema.org is a fundamental component of what I am feeling is a New Web.  It is not the fundamental component, but one of many that over time will become just the way we do things.

Schema.org 2.0












About a month ago Version 2.0 of the Schema.org vocabulary hit the streets.

This update includes loads of tweaks, additions and fixes that can be found in the release information.  The automotive folks have got new vocabulary for describing Cars, including useful properties such as numberOfAirbags, fuelEfficiency, and knownVehicleDamages.  The new property mainEntityOfPage (and its inverse, mainEntity) provides the ability to tell the search engine crawlers which thing a web page is really about.  With a new type ScreeningEvent to support movie/video screenings, and a gtin12 property for Product, amongst others, there is much useful stuff in there.

But does this warrant the version number clicking over from 1.xx to 2.0?

These new types and properties are only the tip of the 2.0 iceberg.  There is a heck of a lot of other stuff going on in this release apart from these additions.  Some of it is in the vocabulary itself, some of it in the potential, documentation, supporting software, and organisational processes around it.

Sticking with the vocabulary for the moment, there has been a bit of cleanup around property names. As the vocabulary has grown organically since its release in 2011, inconsistencies and conflicts between different proposals have been introduced.  So part of the 2.0 effort has included some rationalisation.  For instance the Code type is being superseded by SoftwareSourceCode – the term code has many different meanings, many of which have nothing to do with software; surface has been superseded by artworkSurface and area is being superseded by serviceArea, for similar reasons. Check out the release information for full details.  If you are using any of the superseded terms there is no need to panic, as the original terms are still valid but with updated descriptions to indicate that they have been superseded.  However you are encouraged to move towards the updated terminology as convenient.  The question of what is in which version brings me to an enhancement to the supporting documentation.  Starting with Version 2.0, a snapshot view of the full vocabulary will be published for each release – here is the one for 2.0: http://schema.org/version/2.0.  So if you want to refer to a term at a particular version you now can.

How often is Schema being used? – is a question often asked. A new feature has been introduced to give you some indication.  Check out the description of one of the newly introduced properties, mainEntityOfPage, and you will see the following: ‘Usage: Fewer than 10 domains‘.  Unsurprisingly for a newly introduced property, there is virtually no usage of it yet.  If you look at the description for the type this term is used with, CreativeWork, you will see ‘Usage: Between 250,000 and 500,000 domains‘.  Not a direct answer to the question, but a good and useful indication of the popularity of a particular term across the web.

Extensions
In the release information you will find the following cryptic reference: ‘Fix to #429: Implementation of new extension system.’

This refers to the introduction of the functionality, on the Schema.org site, to host extensions to the core vocabulary.  The motivation for this new approach to extending is explained thus:

Schema.org provides a core, basic vocabulary for describing the kind of entities the most common web applications need. There is often a need for more specialized and/or deeper vocabularies, that build upon the core. The extension mechanisms facilitate the creation of such additional vocabularies.
With most extensions, we expect that some small frequently used set of terms will be in core schema.org, with a long tail of more specialized terms in the extension.

As yet there are no extensions published.  However, there are some on the way.

As Chair of the Schema Bib Extend W3C Community Group I have been closely involved with a proposal by the group for an initial bibliographic extension (bib.schema.org) to Schema.org.  The proposal includes new Types for Chapter, Collection, Agent, Atlas, Newspaper & Thesis, CreativeWork properties to describe the relationship between translations, plus types & properties to describe comics.  I am also following the proposal’s progress through the system – a bit of a learning exercise for everyone.  Hopefully I can share the news in the not too distant future that bib will be one of the first released extensions.

W3C Community Group for Schema.org
A subtle change in the way the vocabulary, its proposals, extensions and direction can be followed and contributed to has also taken place.  The creation of the Schema.org Community Group has now provided an open forum for this.

So is 2.0 a bit of a milestone?  Yes, taking all things together, I believe it is. I get the feeling that Schema.org is maturing into the kind of vocabulary, supported by a professional community, that will add confidence to those using it and recommending that others should.

A Step for Schema.org – A Leap for Bib Data on the Web












Regular readers of this blog may well know I am an enthusiast for Schema.org – the generic vocabulary for describing things on the web as structured data, backed by the major search engines Google, Bing, Yahoo! & Yandex.  When I first got my head around it back in 2011 I soon realised its potential for making bibliographic resources, especially those within libraries, a heck of a lot more discoverable.  To be frank, library resources did not, and still don’t, exactly leap into view when searching the web – a bit of a problem when most people start searching for things with Google et al – and do not look elsewhere.

Schema.org as a generic vocabulary to describe most stuff, easily embedded in your web pages, has been a great success.  As was reported by Google’s R.V. Guha, at the recent Semantic Technology and Business Conference in San Jose, a sample of 12B pages showed approximately 21% containing Schema.org markup.  Right from the beginning, however, I had concerns about its applicability to the bibliographic world – great start with the Book type, but there were gaps in the coverage for such things as journal issues & volumes, multi-volume works, citations, and the relationship between a work and its editions.  Discovering others shared my combination of enthusiasm and concerns, I formed a W3C Community Group – Schema Bib Extend – to propose some bibliographic focused extensions to Schema.org. Which brings me to the events behind this post…

The SchemaBibEx group have had several proposals accepted over the last couple of years, such as making the [commercial] Offer more appropriate for describing loanable materials, and broadening of the citation property. Several other significant proposals were brought together in a package which I take great pleasure in reporting was included in the latest v1.9 release of Schema.org.  For many in our group these latest proposals were a long time coming after their initial proposal.  Although frustrating, the delays were symptomatic of a very healthy process.

Our proposals to add hasPart, isPartOf, exampleOfWork, and workExample to the CreativeWork Type will be available to many, as CreativeWork is the superclass to many types in many areas. Our proposals for issueNumber on PublicationIssue and volumeNumber on PublicationVolume are very similar to others in the vocabulary, such as seasonNumber and episodeNumber in TV & Radio.  Under Dan Brickley’s careful organisation, tweaks and adjustments were made across a few areas, resulting in a consistent style across parts of the vocabulary underpinned by CreativeWork.

Although the number of new types and properties is small, their addition to Schema opens up potential for much better description of periodicals and creative work relationships. To introduce the background to this, SchemaBibEx member Dan Scott and I were invited to jointly post on the Schema.org Blog.

So, another step forward for Schema.org.   I believe that it is more than just a step, however, for those wishing to make their bibliographic resources more visible on the Web.  There has been some criticism that Schema.org has been too simplistic to be able to represent some of the relationships and subtleties from our world.  Criticism that was not unfounded.  Now with these enhancements, many of these criticisms are answered. There is more to do, but the major objective of the group that proposed them has been achieved – to lay the broad foundation for the description of bibliographic, and creative work, resources in sufficient detail for them to be understood by the search engines to become part of their knowledge graphs. Of course that is not the final end we are seeking.  The reason we share data is so that folks are guided to our resources – by sharing, using the well understood vocabulary, Schema.org.

Examples of a conceptual creative work being related to its editions, using exampleOfWork and workExample, have been available for some time.  In anticipation of their appearance in Schema, they were introduced into the OCLC WorldCat release of 194 million Work descriptions (for example: http://worldcat.org/entity/work/id/1363251773) with the inverse relationship being asserted in an updated version of the basic WorldCat linked data that has been available since 2012.

WorldCat Works – 197 Million Nuggets of Linked Data

They’re released!

A couple of months back I spoke about the preview release of Works data from WorldCat.org.  Today OCLC published a press release announcing the official release of 197 million descriptions of bibliographic Works.

A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work.  The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary.  In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, OCLC numbered, editions already shared from WorldCat.org.

They look a little different to the kind of metadata we are used to in the library world.  Check out this example <http://worldcat.org/entity/work/id/1151002411> and you will see that, apart from name and description strings, it is mostly links.  It is linked data after all.

These links (URIs) lead, where available, to authoritative sources for people, subjects, etc.  When not available, placeholder URIs have been created to capture information not yet available or identified in such authoritative hubs.  As you would expect from a linked data hub the works are available in common RDF serializations – Turtle, RDF/XML, N-Triples, JSON-LD – using the Schema.org vocabulary – under an open data license.

The obvious question is “how do I get a work id for the items in my catalogue?”.  The simplest way is to use the already released linked data from WorldCat.org. If you have an OCLC Number (eg. 817185721) you can create the URI for that particular manifestation by prefixing it with ‘http://worldcat.org/oclc/’ thus: http://worldcat.org/oclc/817185721

In the linked data that is returned, either on screen in the Linked Data section, or in the RDF in your desired serialization, you will find the following triple which provides the URI of the work for this manifestation:

<http://worldcat.org/oclc/817185721> exampleOfWork <http://worldcat.org/entity/work/id/1151002411>
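
For those who like to see such things in code, here is a rough sketch of the two steps just described: building the manifestation URI from an OCLC Number, then picking the work URI out of the returned triples.  The function names and the simple regular expression are my own assumptions for illustration; the sample line is the triple shown above, and a real RDF parser would be the better choice for anything beyond a demonstration.

```python
import re

OCLC_PREFIX = "http://worldcat.org/oclc/"

# Matches "<subject> predicate <object>", where the predicate may be a
# bare name (as shown above) or a full URI in angle brackets.
TRIPLE = re.compile(r"<([^>]+)>\s+(\S+)\s+<([^>]+)>")


def manifestation_uri(oclc_number):
    """Create the WorldCat URI for a manifestation from its OCLC Number."""
    return OCLC_PREFIX + str(oclc_number)


def work_uri_for(triple_lines, oclc_number):
    """Return the work URI asserted via exampleOfWork, or None if absent."""
    subject = manifestation_uri(oclc_number)
    for line in triple_lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue
        predicate = match.group(2).strip("<>")
        if match.group(1) == subject and predicate.endswith("exampleOfWork"):
            return match.group(3)
    return None


sample = [
    "<http://worldcat.org/oclc/817185721> exampleOfWork "
    "<http://worldcat.org/entity/work/id/1151002411>",
]
print(manifestation_uri(817185721))
print(work_uri_for(sample, 817185721))
```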

To quote Neil Wilson, Head of Metadata Services at the British Library:

With this release of WorldCat Works, OCLC is creating a significant, practical contribution to the wider community discussion on how to migrate from traditional institutional library catalogues to popular web resources and services using linked library data.  This release provides the information community with a valuable opportunity to assess how the benefits of a works-based approach could impact a new generation of library services.

This is a major first step in a journey to provide linked data views of the entities within WorldCat.  I am looking forward to other WorldCat entities such as people, places, and events.  Apart from being a major release of linked data, this capability is the result of applying [Big] Data mining and analysis techniques that have been the focus of research and development for several years.  These efforts are demonstrating that there is much more to library linked data than the mechanical, record at a time, conversion of MARC records into an RDF representation.

You may find it helpful, in understanding the potential exposed by the release of Works, to review some of the questions and answers that were raised after the preview release.

Personally I am really looking forward to hearing about the uses that are made of this data.

Visualising Schema.org

One of the greatest challenges in my evangelism of the benefits of using Schema.org for sharing data about resources via the web is that it is difficult to ‘show’ what is going on.

The scenario goes something like this…..

“Using the Schema.org vocabulary, you embed data about your resources in the HTML that makes up the page using either microdata or RDFa….”

At about this time you usually display a slide showing HTML code with embedded RDFa.  It may look pretty, but the chances of more than a few of the audience being able to pick the schema:Book or sameAs or rdf:type elements out of the plethora of angle brackets and quotes swimming before their eyes are fairly remote.

Having asked them to take a leap of faith that the gobbledegook you have just presented them with is not only simple to produce but also invisible to users viewing their pages –  “but not to Google, which harvests that meaningful structured data from within your pages” – you ask them to take another leap [of faith].

You ask them to take on trust that Google is actually understanding, indexing and using that structured data.  At this point you start searching for suitable screen shots of Google Knowledge Graph to sit behind you whilst you hypothesise about the latest incarnation of their all-powerful search algorithm, and how they imply that they use the Schema.org data to drive so-called Semantic Search.

I enjoy a challenge, but I also like to find a better way sometimes.

When OCLC first released Linked Data in WorldCat they very helpfully addressed the first of these issues by adding a visual display of the Linked Data to the bottom of each page.   This made my job far easier!

But it has a couple of downsides.  Firstly it is not the prettiest of displays and is only really of use to those interested in ‘seeing’ Linked Data.  Secondly, I believe it creates an impression to some that, if you want Google to grab structured data about resources, you need to display a chunk of gobbledegook on your pages.

Let the Green Turtle show the way!
Whilst looking for a better answer I discovered Green Turtle – a JavaScript library for working with RDFa, most usefully packaged in an extension for the Chrome browser.  Load this into your copy of Chrome and it will sit quietly in the background checking for RDFa (and microdata if you turn on the option) in the pages you are viewing.  When it finds some, a green turtle icon appears in the address bar.  Clicking on that turtle opens up a new tab to show you a list of the data, in the form of triples, that it identified within the page.

That simple way to easily show someone the data embedded in a page, is a great aid to understanding for those new to the concept.  But that is not all.  This excellent little extension has a couple of extra tricks up its sleeve.

It includes a visualisation of the [Linked Data] graph of relationships – the structure of the data.  Clicking on any of the nodes of the display causes the value of the subject, predicate, or object it represents to be displayed below the image and the relevant row(s) in the list of triples to be highlighted.  As well as all this, there is a ‘Show Turtle’ button, which does just as you would expect, opening up a window in which it has translated the triples into Turtle – Turtle being (after a bit of practice) the more human friendly way of viewing or creating RDF.
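
To give a feel for what a translation from triples to Turtle involves, here is a much simplified sketch in Python.  It is emphatically not how Green Turtle itself does it (that is a JavaScript library whose internals I am not describing here); the function names are my own, the sample triples are illustrative values, and only the schema: prefix and plain string literals are handled.

```python
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def term(value, is_uri=True):
    """Render one value as a Turtle term (simplified)."""
    if not is_uri:
        # Plain string literal: escape backslashes, quotes and newlines.
        escaped = (
            value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        return '"' + escaped + '"'
    if value.startswith(SCHEMA):
        # Abbreviate Schema.org terms using the schema: prefix.
        return "schema:" + value[len(SCHEMA):]
    return "<" + value + ">"


def to_turtle(triples):
    """Group (subject, predicate, object, object_is_uri) tuples by subject."""
    by_subject = {}
    for s, p, o, o_is_uri in triples:
        by_subject.setdefault(s, []).append((p, o, o_is_uri))

    lines = ["@prefix schema: <http://schema.org/> .", ""]
    for s, pairs in by_subject.items():
        lines.append(term(s))
        for i, (p, o, o_is_uri) in enumerate(pairs):
            # The last statement for a subject ends with "."; others with ";".
            end = " ." if i == len(pairs) - 1 else " ;"
            predicate = "a" if p == RDF_TYPE else term(p)
            lines.append("    " + predicate + " " + term(o, o_is_uri) + end)
        lines.append("")
    return "\n".join(lines)


work = "http://worldcat.org/entity/work/id/1151002411"
triples = [
    (work, RDF_TYPE, SCHEMA + "CreativeWork", True),
    (work, SCHEMA + "name", "The story of my experiments with truth", False),
    (work, SCHEMA + "workExample", "http://worldcat.org/oclc/817185721", True),
]
print(to_turtle(triples))
```

Grouping the statements under a shared subject, and abbreviating the vocabulary with a prefix, is most of what makes Turtle easier on the eye than a flat list of triples.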

Green Turtle is a useful little tool which I would recommend to visualise microdata and RDFa, be it using the Schema.org vocabulary or not.  I am already using it on WorldCat in preference to scrolling to the bottom of the page to click the Linked Data tab.

Custom Searches that know about Schema!
Google have recently enhanced the functionality of their Custom Search Engine (CSE) to enable searching by Schema.org Types.  Try out this example CSE which only returns results from WorldCat.org which have been described in their structured data as being of type schema:Book.

A simple yet powerful demonstration that not only are Google harvesting the Schema.org Linked Data from WorldCat, but they are also understanding it and are visibly using it to drive functionality.

OCLC Preview 194 Million Open Bibliographic Work Descriptions












I have just been sharing a platform, at the OCLC EMEA Regional Council Meeting in Cape Town, South Africa, with my colleague Ted Fons.  A great setting for a great couple of days of the OCLC EMEA membership and others sharing thoughts, practices, collaborative ideas and innovations.

Ted and I presented our continuing insight into The Power of Shared Data, and the evolving data strategy for the bibliographic data behind WorldCat. If you want to see a previous view of these themes you can check out some recordings we made late last year on YouTube, from Ted – The Power of Shared Data – and me – What the Web Wants.

Today, demonstrating on-going progress towards implementing the strategy, I had the pleasure to preview two upcoming significant announcements on the WorldCat data front:

  1. The release of 194 Million Open Linked Data Bibliographic Work descriptions
  2. The WorldCat Linked Data Explorer interface

WorldCat Works

A Work is a high-level description of a resource, containing information such as author, name, descriptions, subjects etc., common to all editions of the work.  The description format is based upon some of the properties defined by the CreativeWork type from the Schema.org vocabulary.  In the case of a WorldCat Work description, it also contains [Linked Data] links to individual, OCLC numbered, editions already shared in WorldCat.   Let’s take a look at one – try this: http://worldcat.org/entity/work/id/12477503

You will see, displayed in the new WorldCat Linked Data Explorer, an HTML view of the data describing ‘Zen and the art of motorcycle maintenance’. Click on the ‘Open All’ button to view everything.  Anyone used to viewing bibliographic data will see that this is a very different view of things. It is mostly URIs, the only visible strings being the name or description elements.  This is not designed as an end-user interface, it is designed as a data exploration tool.  This is highlighted by the links at the top to alternative RDF serialisations of the data – Turtle, N-Triples, JSON-LD, RDF/XML.

The vocabulary used to describe the data is based upon Schema.org, and enhancements to it recommended and proposed by the Schema Bib Extend W3C Community Group, which I have the pleasure to chair.

Why is this a preview? Can I usefully use the data now? These are a couple of obvious questions for you to ask at this time.

This is the first production release of WorldCat infrastructure delivering linked data.  It is the first step in what will be an evolutionary, and revolutionary, journey to provide interconnected linked data views of the rich entities (works, people, organisations, concepts, places, events) captured in the vast shared collection of bibliographic records that makes up WorldCat.  Mining those 311+ million records is not a simple task, even to just identify works. It takes time, and a significant amount of [Big Data] computing resources.  One of the key steps in this process is to identify where connections exist between works and authoritative data hubs, such as VIAF, FAST, LCSH, etc.  In this preview release, it is some of those connections that are not yet in place.

What you see in their place at the moment is a link to, what can be described as, a local authority.  These are exemplified by what the data geeks call a hash-URI as its identifier. http://experiment.worldcat.org/entity/work/data/12477503#Person/pirsig_robert for example is such an identifier, constructed from the work URI and the person name.  Over the next few weeks, where the information is available, you would expect to see this link replaced by a connection to VIAF, such as this: http://viaf.org/viaf/78757182.

So, can I use the data? – Yes, the data is live, and most importantly the work URIs are persistent. It is also available under an open data license (ODC-BY).

How do I get a work id for my resources? – Today, there is one way.  If you use the OCLC xISBN, xOCLCNum web services you will find as part of the data returned a work id (eg. owi="owi12477503"). By stripping off the ‘owi’ you can easily create the relevant work URI: http://worldcat.org/entity/work/id/12477503
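
As a small sketch of that stripping step, assuming you already have the owi value in hand as a string (the function name is mine, and I am not showing the call to the xISBN/xOCLCNum services themselves):

```python
WORK_PREFIX = "http://worldcat.org/entity/work/id/"


def work_uri_from_owi(owi):
    """Turn an owi value such as 'owi12477503' into a WorldCat work URI."""
    owi = owi.strip()
    number = owi[3:]
    # Guard against values that are not 'owi' followed by digits.
    if not owi.startswith("owi") or not number.isdigit():
        raise ValueError("Unexpected owi value: " + repr(owi))
    return WORK_PREFIX + number


print(work_uri_from_owi("owi12477503"))
# prints http://worldcat.org/entity/work/id/12477503
```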

In a very few weeks, once the next update to the WorldCat linked data has been processed, you will find that links to works will be embedded in the already published linked data.  For example you will find the following in the data for OCLC number 53474380:

What is next on the agenda? As described, within a few weeks, we expect to enhance the linking within the descriptions and provide links from the OCLC numbered manifestations.  From then on, both WorldCat and others will start to use WorldCat Work URIs, and their descriptions, as a core stable foundation to build out a web of relationships between entities in the library domain.  It is that web of data that will stimulate the sharing of data and innovation in the design of applications and interfaces consuming the data over coming months and years.

As I said on the program today, we are looking for feedback on these releases.

We as a community are embarking on a new journey with shared, linked data at its heart. Its success will be based upon how that data is exposed and used, and upon the intrinsic quality of that data.  Experience shows that a new view of data often exposes previously unseen issues; it is just that sort of feedback we are looking for.  So any feedback on any aspect of this will be more than welcome.

I am excitedly looking forward to being able to comment further as this journey progresses.

Update:  I have posted answers to some interesting questions raised by this release.

Getty Release AAT Vocabulary as Linked Open Data

The Getty Research Institute has announced the release of the Art & Architecture Thesaurus (AAT)® as Linked Open Data. The data set is available for download at vocab.getty.edu under an Open Data Commons Attribution License (ODC BY 1.0).

The Art & Architecture Thesaurus is a reference of over 250,000 terms on art and architectural history, styles, and techniques.  I’m sure this will become an indispensable authoritative hub of terms in the Web of Data to assist those describing their resources and placing them in context in that Web.

This is the first step in an 18 month process to release four vocabularies – the others being The Getty Thesaurus of Geographic Names (TGN)®, The Union List of Artist Names®, and The Cultural Objects Name Authority (CONA)®.

A great step from Getty.  I look forward to the others appearing over the coming months, and to seeing how rapidly they are put to use across the web.

OCLC Declare OCLC Control Numbers Public Domain

Little things mean a lot.  Little things that are misunderstood often mean a lot more.

Take the OCLC Control Number, often known as the OCN, for instance.

Every time an OCLC bibliographic record is created in WorldCat it is given a unique number from a sequential set – a process that has already taken place over a billion times.  The individual number can be found represented in the record it is associated with.  Over time these numbers have become a useful part of the processing of not only OCLC and its member libraries but, as a unique identifier proliferated across the library domain, by partners, publishers and many others.

Like anything that has been around for many years, assumptions and even myths have grown around the purpose and status of this little string of digits.  Many stem from a period when there was concern, being voiced by several including me at the time, about the potentially over restrictive reuse policy for records created by OCLC and its member libraries.  It became assumed by some, that the way to tell if a bibliographic record was an OCLC record was to see if it contained an OCN.  The effect was that some people and organisations invested effort in creating processes to remove OCNs from their records.  Processes that I believe, in a few cases, are still in place.

So in the current and future climate of open sharing of data, where for instance WorldCat Linked Data is published under an open data license, such assumptions and practices are an anomaly.

I signalled that OCLC were looking at this in my session (Linked Data Progress) at IFLA in Singapore a few weeks ago.  I am pleased to say that the wording I was hinting at has now appeared on the relevant pages of the OCLC web site:

Use of the OCLC Control Number (OCN)
OCLC considers the OCLC Control Number (OCN) to be an important data element, separate from the rest of the data included in bibliographic records. The OCN identifies the record, but is not part of the record itself. It is used in a variety of human and machine-readable processes, both on its own and in subsequent manipulations of catalog data. OCLC makes no copyright claims in individual bibliographic elements nor does it make any intellectual property claims to the OCLC Control Number. Therefore, the OCN can be treated as if it is in the public domain and can be included in any data exposure mechanism or activity as public domain data. OCLC, in fact, encourages these uses as they provide the opportunity for libraries to make useful connections between different bibliographic systems and services, as well as to information in other domains.

The announcement of this confirmation/clarification of the status of OCNs was made yesterday by my colleague Jim Michalko on the Hanging Together blog.

When discussing this with a few people, one question often came up – Why just declare OCNs as public domain, why not license them as such? The following answer from the OCLC website, I believe explains why:

The OCN is an individual bibliographic element, and OCLC doesn’t make any copyright claims either way on specific data elements. The OCN can be used by other institutions in ways that, at an aggregate level, may have varying copyright assertions. Making a positive, specific claim that the OCN is in the public domain might interfere with the copyrights of others in those situations.

As I said, this is a little thing, but if it clears up some misunderstandings and consequential anomalies, it will contribute to the usefulness of OCNs and ease the path towards a more open and shared data environment.
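To make the "useful connections between different bibliographic systems" idea concrete, here is a minimal sketch of an OCN acting as a shared key between two otherwise unrelated data sets.  Everything in it is an assumption for illustration: the record layouts, the field names, and the OCN values themselves are made up and do not come from WorldCat or any OCLC service.

```python
# Hypothetical records from two unrelated systems; every value is made up.
library_catalogue = [
    {"ocn": "100000001", "title": "Example Title A", "shelfmark": "AB.12"},
    {"ocn": "100000002", "title": "Example Title B", "shelfmark": "CD.34"},
]
publisher_feed = [
    {"ocn": "100000002", "price": "9.99"},
    {"ocn": "100000003", "price": "4.50"},
]


def link_by_ocn(left, right):
    """Merge records that share an OCN; records without a match are skipped."""
    # Index the right-hand records by OCN for constant-time lookup.
    right_index = {record["ocn"]: record for record in right}
    linked = []
    for record in left:
        match = right_index.get(record["ocn"])
        if match is not None:
            # Combine the fields of both records into one description.
            linked.append({**record, **match})
    return linked


linked = link_by_ocn(library_catalogue, publisher_feed)
```

The join only works if the OCN is still present on both sides, which is why processes that strip OCNs out of records remove exactly the element that would make this kind of linking possible.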

Forming Consensus on Schema.org for Libraries and More

Back in September I formed a W3C Group – Schema Bib Extend.  To quote an old friend of mine “Why did you go and do that then?”

Well, as I have mentioned before, Schema.org has become a bit of a success story for structured data on the web.  I would have no hesitation in recommending it as a starting point for anyone, in any sector, wanting to share structured data on the web.  This is what OCLC did in the initial exercise to publish the 270+ million resources in WorldCat.org as Linked Data.

At the same time, I believe that summer 2012 was a bit of a watershed for Linked Data in the library world.  Over the preceding few years we have had various national libraries publishing linked data (British Library, Bibliothèque nationale de France, Deutsche Nationalbibliothek, National Library of Sweden, to name just a few).  We have had linked data published versions of authority files such as LCSH, RAMEAU, National Diet Library, plus OCLC hosted services such as VIAF, FAST, and Dewey.  These plus many other initiatives have led me to conclude that we are moving to the next stage – for instance the British Library and Deutsche Nationalbibliothek are starting to cross-link their data, and the Library of Congress BIBFRAME initiative is starting to expose some of its [very linked data] thinking.

Of course the other major initiative was the publication of Linked Data, using Schema.org, from within OCLC’s WorldCat.org, both as RDFa embedded in WorldCat detail pages and in a download file containing the 1.2 million most highly held works.

The need to extend the Schema.org vocabulary became clear when using it to mark up the bibliographic resources in WorldCat.  The Book type defined in Schema.org, along with other types derived from CreativeWork, contains many of the properties you need to describe bibliographic resources, but lacks some of the more detailed ones we wanted to represent, such as holdings count and carrier type.  It was also clear that it would need more extension if we wanted to go further and define the relationships between such things as works, expressions, manifestations, and items – to talk FRBR for a moment.

The organisations behind Schema.org (Google, Bing, Yahoo, Yandex) invite proposals for extension of the vocabulary via the W3C public-vocabs mailing list.  OCLC could have taken that route directly, but at best I suggest it would have only partially served the needs of the broad spread of organisations and people who could benefit from enriched description of bibliographic resources on the web.

So that is why I formed a W3C Community Group to build a consensus on extending the Schema.org vocabulary for these types of resources.  I wanted to not only represent the needs, opinions, and experience of OCLC, but also the wider library sector of libraries, librarians, system suppliers and others.  Any generally applicable vocabulary [most importantly recognised by the major search engines] would also provide benefit for the wider bibliographic publishing, retailing, and other interested sectors.

Four months and four conference calls (supported by OCLC, thank you) later, we are a group of 55 members with a fairly active mailing list.  We are making progress towards shaping up some recommendations, having invested much time in discussing our objectives and the issues of describing detailed bibliographic information (currently often found in MARC, ONIX, or other industry-specific standards) in a generic web-wide vocabulary.  We are not trying to build a replacement for MARC, or turn Schema.org into a standard that you could operate a library community with.

Applying Schema.org markup to your bibliographic data is aimed at announcing its presence, and the resources it describes, to the web and linking them into the web of data.  I would expect to see it being applied as complementary markup to other RDF-based standards such as BIBFRAME as it emerges.  Although Schema.org started with Microdata and, latterly [and increasingly], RDFa, the vocabulary is equally applicable serialised in any of the RDF formats (N-Triples, Turtle, RDF/XML, JSON) for processing and data exchange purposes.
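As a rough illustration of that last point, the sketch below describes one book with a handful of Schema.org terms and writes the same description out in two serialisations.  Several things here are my assumptions rather than anything WorldCat actually publishes: the book URI is a hypothetical example.org address, JSON-LD is used as the JSON form, and the properties are limited to a few simple literals (name, datePublished, inLanguage).

```python
import json

SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical identifier for the described book; not a real WorldCat URI.
BOOK_URI = "http://example.org/book/1"

# A few literal-valued Schema.org properties available on Book/CreativeWork.
book = {
    "name": "Harry Potter and the Deathly Hallows",
    "datePublished": "2007",
    "inLanguage": "en",
}


def to_json_ld(uri, type_name, props):
    """Serialise the description as JSON-LD, using Schema.org as the vocabulary."""
    doc = {"@context": {"@vocab": SCHEMA}, "@id": uri, "@type": type_name}
    doc.update(props)
    return json.dumps(doc, indent=2)


def escape_literal(value):
    """Escape the characters that must not appear raw in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(uri, type_name, props):
    """Serialise the same description as N-Triples, one statement per line."""
    lines = [f"<{uri}> <{RDF_TYPE}> <{SCHEMA}{type_name}> ."]
    for prop, value in props.items():
        literal = escape_literal(value)
        lines.append(f'<{uri}> <{SCHEMA}{prop}> "{literal}" .')
    return "\n".join(lines)


json_ld_text = to_json_ld(BOOK_URI, "Book", book)
ntriples_text = to_ntriples(BOOK_URI, "Book", book)
```

Both outputs carry the same statements about the same resource; only the syntax differs, which is what makes the vocabulary usable for data exchange as well as for markup embedded in web pages.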

My hope over the next few months is that we will agree and propose some extensions to Schema.org (that will get accepted), especially in the areas of work/manifestation relationships, representations of identifiers other than ISBN, defining content/carrier, journal articles, and a few others that may arise.  Something that has become clear in our conversations is that we also have a role as a group in providing examples of how [extended] Schema.org markup should be applied to bibliographic data.

I would characterise the stage we are at as moving from talking about it to doing something about it.  I am looking forward to the next few months with enthusiasm.

If you want to join in, you will find us over at http://www.w3.org/community/schemabibex/ (where, amongst other things, you will find recordings and chat transcripts from the meetings so far on the Wiki).  If you or your group want to know more about Schema.org and its relevance to libraries and the broader bibliographic world, drop me a line; if I can fit it in with my travels to conferences such as ALA, I could be persuaded to stand up and talk about it.