If you would like to meet up at an event contact him.
Published in Data Publishing, Libraries, Licensing, Linked Data, OCLC, Open Data, WorldCat
Tagged: OCLC, OCN, Open Data, WorldCat
Take the OCLC Control Number, often known as the OCN, for instance.
Every time an OCLC bibliographic record is created in WorldCat it is given a unique number from a sequential set – a process that has already taken place over a billion times. The individual number can be found represented in the record it is associated with. Over time these numbers have become a useful part of the processing of not only OCLC and its member libraries but, as a unique identifier proliferated across the library domain, by partners, publishers and many others.
Like anything that has been around for many years, assumptions and even myths have grown around the purpose and status of this little string of digits. Many stem from a period when there was concern, being voiced by several including me at the time, about the potentially over restrictive reuse policy for records created by OCLC and its member libraries. It became assumed by some, that the way to tell if a bibliographic record was an OCLC record was to see if it contained an OCN. The effect was that some people and organisations invested effort in creating processes to remove OCNs from their records. Processes that I believe, in a few cases, are still in place.
I signalled that OCLC were looking at this, in my session (Linked Data Progress), at IFLA in Singapore a few weeks ago. I am now pleased to say that the wording I was hinting at has now appeared on the relevant pages of the OCLC web site:
Use of the OCLC Control Number (OCN)
OCLC considers the OCLC Control Number (OCN) to be an important data element, separate from the rest of the data included in bibliographic records. The OCN identifies the record, but is not part of the record itself. It is used in a variety of human and machine-readable processes, both on its own and in subsequent manipulations of catalog data. OCLC makes no copyright claims in individual bibliographic elements nor does it make any intellectual property claims to the OCLC Control Number. Therefore, the OCN can be treated as if it is in the public domain and can be included in any data exposure mechanism or activity as public domain data. OCLC, in fact, encourages these uses as they provide the opportunity for libraries to make useful connections between different bibliographic systems and services, as well as to information in other domains.
The announcement of this confirmation/clarification of the status of OCNs was made yesterday by my colleague Jim Michalko on the Hanging Together blog.
When discussing this with a few people, one question often came up – Why just declare OCNs as public domain, why not license them as such? The following answer from the OCLC website, I believe explains why:
The OCN is an individual bibliographic element, and OCLC doesn’t make any copyright claims either way on specific data elements. The OCN can be used by other institutions in ways that, at an aggregate level, may have varying copyright assertions. Making a positive, specific claim that the OCN is in the public domain might interfere with the copyrights of others in those situations.
As I said, this is a little thing, but if it clears up some misunderstandings and consequential anomalies, it will contribute the usefulness of OCNs and ease the path towards a more open and shared data environment.
Published in Consuming Data, Development, Libraries, Linked Data, OCLC, schema.org, Web, WorldCat
Tagged: Linked Data, RDF, RDFa, WorldCat
Content-negotiation has been implemented for the publication of Linked Data for WorldCat resources.
For those immersed in the publication and consumption of Linked Data, there is little more to say. However I suspect there are a significant number of folks reading this who are wondering what the heck I am going on about. It is a little bit techie but I will try to keep it as simple as possible.
Back last year, a linked data representation of each (of the 290+ million) WorldCat resources was embedded in it’s web page on the WorldCat site. For full details check out that announcement but in summary:
- All resource pages include Linked Data
- Human visible under a Linked Data tab at the bottom of the page
- Embedded as RDFa within the page html
- Described using the Schema.org vocabulary
- Released under an ODC-BY open data license
That is all still valid – so what’s new from now?
That same data is now available in several machine readable RDF serialisations. RDF is RDF, but dependant on your use it is easier to consume as RDFa, or XML, or JSON, or Turtle, or as triples.
In many Linked Data presentations, including some of mine, you will hear the line “As I clicked on the link a web browser we are seeing a html representation. However if I was a machine I would be getting XML or another format back.” This is the mechanism in the http protocol that makes that happen.
Let me take you through some simple steps to make this visible for those that are interested.
Starting with a resource in WorldCat: http://www.worldcat.org/oclc/41266045. Clicking that link will take you to the page for Harry Potter and the prisoner of Azkaban. As we did not indicate otherwise, the content-negotiation defaulted to returning the html web page.
To specify that we want RDF/XML we would specify http://www.worldcat.org/oclc/41266045.rdf (dependant on your browser this may not display anything, but allow you to download the result to view in your favourite editor)
This allows you to manually specify the serialisation format you require. You can also do it from within a program by specifying, to the http protocol, the format that you would accept from accessing the URI. This means that you do not have to write code to add the relevant suffix to each URI that you access. You can replicate the effect by using curl, a command line http client tool:
curl -L -H “Accept: application/rdf+xml” http://www.worldcat.org/oclc/41266045
curl -L -H “Accept: application/ld+json” http://www.worldcat.org/oclc/41266045
curl -L -H “Accept: text/turtle” http://www.worldcat.org/oclc/41266045
curl -L -H “Accept: text/plain” http://www.worldcat.org/oclc/41266045
So, how can I use it? However you like.
If you embed links to WorldCat resources in your linked data, the standard tools used to navigate around your data should now be able to automatically follow those links into and around WorldCat data. If you have the URI for a WorldCat resource, which you can create by prefixing an oclc number with ‘http://www.worldcat.org/oclc/’, you can use it in a program, browser plug-in, smartphone/facebook app to pull data back, in a format that you prefer, to work with or display.
Go have a play, I would love to hear how people use this.
Published in Consuming Data, Data Publishing, Development, Linked Data
Tagged: APIs, Linked Data, Ordnace Survey
Show me an example of the effective publishing of Linked Data – That, or a variation of it, must be the request I receive more than most when talking to those considering making their own resources available as Linked Data, either in their enterprise, or on the wider web.
There are some obvious candidates. The BBC for instance, makes significant use of Linked Data within its enterprise. They built their fantastic Olympics 2012 online coverage on an infrastructure with Linked Data at its core. Unfortunately, apart from a few exceptions such as Wildlife and Programmes, we only see the results in a powerful web presence. The published data is only visible within their enterprise.
Dbpedia is another excellent candidate. From about 2007 it has been a clear demonstration of Tim Berners-Lee’s principles of using URIs as identifiers and providing information, including links to other things, in RDF – it is just there at the end of the dbpedia URIs. But for some reason developers don’t seem to see it as a compelling example. Maybe it is influenced by the Wikipedia effect – interesting but built by open data geeks, so not to be taken seriously.
A third example, which I want to focus on here, is Ordnance Survey. Not generally known much beyond the geographical patch they cover, Ordnance Survey is the official mapping agency for Great Britain. Formally a government agency, they are best known for their incredibly detailed and accurate maps that are the standard accessory for anyone doing anything in the British countryside. A little less known is that they also publish information about post-code areas, parish/town/city/county boundaries, parliamentary constituency areas, and even European regions in Britain. As you can imagine, these all don’t neatly intersect, which makes the data about them a great case for a graph based data model and hence for publishing as Linked Data. Which is what they did a couple of years ago.
The reason I want to focus on their efforts now, is that they have recently beta released a new API suite, which I will come to in a moment. But first I must emphasise something that is often missed.
Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’. With only standard [http] web protocols, you can get the data for an entity in their dataset by just doing a http GET request on the identifier. (eg. For my local village: http://data.ordnancesurvey.co.uk/id/7000000000002929). What you get back is some nicely formatted html for your web browser, and with content negotiation you can get the same thing as RDF/XML, JSON or turtle. As it is Linked Data, what you get back also includes links to to other data, enabling you to navigate your way around their data from entity to entity.
An excellent demonstration of the basic power and benefit of Linked Data. So why is this often missed? Maybe it is because there is nothing to learn, no API documentation required, you can see and use it by just entering a URI into your web browser – too simple to be interesting perhaps.
To get at the data in more interesting and complex ways you need the API set thoughtfully provided by those that understand the data and some of the most common uses for it, Ordnance Survey.
The API set, now in beta, in my opinion is a most excellent example of how to build, document, and provide access to Linked Data assets in this way.
Firstly the APIs are applied as a standard to four available data sets – three individual, and one combining all three data sets. Nice that you can work with an individually focussed set or get data from all in a consolidated graph.
There are four APIs:
- Lookup – a simple way to extract an RDF description of a single resource, using its URI.
- Search – for running keyword searches over a dataset.
- Sparql – a fully-compliant SPARQL 1.1 endpoint.
- Reconciliation – a simple web service that supports linking of datasets to the Ordnance Survey Linked Data.
Each API is available to play with on a web page complete with examples and pop-up help hints. It is very easy and quick to get your head around the capabilities of the individual APIs, the use of parameters, and returned formats without having to read documentation or cut a single line of code.
For a quick intro there is even a page with them all on for you to try. When you do get around to cutting code, the documentation for each API is also well presented in simple and understandable form. They even include details of the available output formats and expected http response codes.
Finally a few general comments.
Firstly the look, feel, and performance of the site reflects that this is a robust serious professional service and fills you with confidence about building your application on its APIs. Developers of services and APIs, even for internal use, often underestimate the value of presenting and documenting their offering in a professional way. How often have you come across API documentation that makes the first web page look modern and wonder about investing the time in even looking at it. Also a site with a snappy response ups your confidence that your application will perform well when using their service.
Secondly the range of APIs, all cleanly and individually satisfying specific general needs. So for instance you can usefully use Search and Lookup without having any understanding of RDF or SPARQL – the power of SPARQL being there only if you understand and need it.
The additional features – CORS Support and Response Caching – (detailed on the API documentation pages) also demonstrate that this service has been built with the issues of the data consumer in mind. Providing the tools for consumers to take advantage of web caching in their application will greatly enhance response and performance. The CORS Support enables the creation of in browser applications that draw data from many sites – one of the oft promoted benefits of linked data, but sometimes a little tricky to implement ‘in browser’.
I can see this site and its associated APIs greatly enhancing the reputation of Ordnance Survey; underpinning the development of many apps and applications; and becoming an ideal source for many people to go ‘to try out’, when writing their first API consuming application code.
Well done to the team behind its production.
Published in Development, Libraries, Linked Data, OCLC, Semantic Tech & Business, Semantic Web
Tagged: Libraries, Linked Data, OCLC, SemTechBiz
Help spotlight library innovation and send a library linked data practitioner to the SemTechBiz conference in San Francisco, June 2-5
Update from organisers:
We are pleased to announce that Kevin Ford, from the Network Development and MARC Standards Office at the Library of Congress, was selected for the Semantic Web.com Spotlight on Innovation for his work with the Bibliographic Framework Initiative (BIBFRAME) and his continuing work on the Library of Congress’s Linked Data Service (loc.id). In addition to being an active contributor, Kevin is responsible for the BIBFRAME website; has devised tools to view MARC records and the resulting BIBFRAME resources side-by-side; authored the first transformation code for MARC data to BIBFRAME resources; and is project manager for The Library of Congress’ Linked Data Service. Kevin also writes and presents frequently to promote BIBFRAME, ID.LOC.GOV, and educate fellow librarians on the possibilities of linked data.
Without exception, each nominee represented great work and demonstrated the power of Linked Data in library systems, making it a difficult task for the committee, and sparking some interesting discussions about future such spotlight programs.
Congratulations, Kevin, and thanks to all the other great library linked data projects nominated!
OCLC and LITA are working to promote library participation at the upcoming Semantic Technology & Business Conference (SemTechBiz). Libraries are doing important work with Linked Data. SemanticWeb.com wants to spotlight innovation in libraries, and send one library presenter to the SemTechBiz conference expenses paid.
SemTechBiz brings together today’s industry thought leaders and practitioners to explore the challenges and opportunities jointly impacting both business leaders and technologists. Conference sessions include technical talks and case studies that highlight semantic technology applications in action. The program includes tutorials and over 130 sessions and demonstrations as well as a hackathon, start-up competition, exhibit floor, and networking opportunities. Amongst the great selection of speakers you will find yours truly!
If you know of someone who has done great work demonstrating the benefit of linked data for libraries, nominate them for this June 2-5 conference in San Francisco. This “library spotlight” opportunity will provide one sponsored presenter with a spot on the conference program, paid travel & lodging costs to get to the conference, plus a full conference pass.
Nominations for the Spotlight are being accepted through May 10th. Any significant practical work should have been accomplished prior to March 31st 2013 — project can be ongoing. Self-nominations will be accepted
Even if you do not nominate anyone, the Semantic Technology and Business Conference is well worth experiencing. As supporters of the SemanticWeb.com Library Spotlight OCLC and LITA members will get a 50% discount on a conference pass – use discount code “OCLC” or “LITA” when registering. (Non members can still get a 20% discount for this great conference by quoting code “FCLC”)
For more details checkout the OCLC Innovation Series page.
Thank you for all the nominations we received for the first Semantic Web.com Spotlight on Innovation in Libraries.
Published in Data Publishing, Google, Libraries, Linked Data, schema.org, Web
Tagged: Data, Entities, Libraries, Linked Data, RDF
As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series. That one – Entification, the following one – Hubs of Authority and this, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Beacons of Availability
As I indicated in the first of this series, there are descriptions of a broader collection of entities, than just books, articles and other creative works, locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.
As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority, to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography, and WorldCat) is creating Library Linked Data openly available on the Web.
Why do we catalogue? is a question, I often ask, with an obvious answer – so that people can find our stuff. How does this entification, sharing of authorities, and creation of a web of library linked data help us in that goal. In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. Sounds like a great benefit and mission statement for libraries of the world but unfortunately not one that will nudge the needle on making library resources more discoverable for the vast majority of those that can benefit from them.
I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google. A similar weight of opinion can be found that complains how bad Google, and the other search engines, are at representing library resources. You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources. Yet I am willing to bet that again we have another 80-20 equation or worse about how few, of the users that libraries want to reach, even know those specialist Google services exist. A bit of a sorry state of affairs when the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!
Library linked data helps solve both the problem of better description and findability of library resources in the major search engines. Plus it can help with the problem of identifying where a user can gain access to that resource to loan, download, view via a suitable license, or purchase, etc.
Before a search engine can lead a user to a suitable resource, it needs to identify that the resource exists, in any form, and hold a description for display in search results that will be sufficiently inform a user as such. Library search interfaces are inherently poor sources of such information, with web crawlers having to infer, from often difficult to differentiate text, what the page might be about. This is not a problem isolated to library interfaces. In response, the major search engines have cooperated to introduce a generic vocabulary for embedded structured information in to web pages so that they can be informed in detail what the page references. This vocabulary is Schema.org – I have previously posted about its success and significance.
With a few enhancements in the way it can describe bibliographic resources (currently being discussed by the Schema Bib Extend W3C Community Group) Schema.org is an ideal way for libraries to publish information about our resources and associated entities in a format the search engines can consume and understand. By using URIs for authorities in that data to identify, the author in question for instance using his/her VIAF identifier, gives them the ability to identify resources from many libraries associated by the same person. With this greatly enriched, more structured, linked to authoritative hubs, view of library resources, the likes of Google over time will stand a far better chance of presenting potential library users with useful informative results. I am pleased to say that OCLC have been at the forefront of demonstrating this approach by publishing Schema.org modelled linked data in the default WorldCat.org interface.
For this approach to be most effective, many of the major libraries, consortia, etc. will need to publish metadata as linked data, in a form that the search engines can consume whilst (following linked data principles) linking to each other when they identify that they are describing the same resource. Many instances of [in data terms] the same thing being published on the web will naturally raise its visibility in results listings.
An individual site (even a WorldCat) has difficultly in being identified above the noise of retail and other sites. We are aware of the Page Rank algorithms used by the search engines to identify and boost the reputation of individual sites and pages by the numbers of links between them. If not an identical process, it is clear that similar rules will apply for structured data linking. If twenty sites publish their own linked data about the same thing, the search engines will take note of each of them. If each of those sites assert that their resource is the same resource as a few of their partner sites (building a web of connection between instances of the same thing), I expect that the engines will take exponentially more notice.
Page ranking does not depend on all pages having to link to all others. Like many things on the web, hubs of authority and aggregation will naturally emerge with major libraries, local, national, and global consortia doing most of the inter-linking, providing interdependent hubs of reputation for others to connect with.
Having identified a resource that may satisfy a potential library user’s need, the next even more difficult problem is to direct that user to somewhere that they can gain access to it – loan, download, view via an appropriate licence, or purchase, etc.
WorldCat.org, and other hubs, with linked data enhanced to provide holdings information, may well provide a target to link via which a user may access to, in addition to just getting a description of, a resource. However, those few sites, no matter how big or well recognised they are, are just a few sites shouting in the wilderness of the ever increasing web. Any librarian in any individual library can quite rightly ask how to help Google, and the others, to point users at the most appropriate copy in his/her library.
We have all experienced the scenario of searching for a car rental company, to receive a link to one within walking distance as first result – or finding the on-campus branch at the top of a list of results.in response to a search for banks. We know the search engines are good at location, either geographical or interest, based searching so why can they not do it for library resources. To achieve this a library needs to become an integral part of a Web of Library Data, publishing structured linked data about the resources they have available for the search engines to find; in that data linking their resources to the reputable hubs of bibliographic that will emerge, so the engines know it is another reference to the same thing; go beyond basic bibliographic description to encompass structured data used by the commercial world to identify availability.
So who is going to do all this then – will every library need to employ a linked data expert? I certainly hope not.
One would expect the leaders in this field, national libraries, OCLC, consortia etc to continue to lead the way, in the process establishing the core of this library web of data – the hubs. Building on that framework the rest of the web can be established with the help of the products, and services of service providers and system suppliers. Those concerned about these things should already be starting to think about how they can be helped not only to publish linked data in a form that the search engines can consume, but also how their resources can become linked via those hubs to the wider web.
By lighting a linked data beacon on top of their web presence, a library will announce to the world the availability of their resources. One beacon is not enough. A web of beacons (the web of library data) will alert the search engines to the mass of those resources in all libraries, then they can lead users via that web to the appropriately located individual resource in particular.
This won’t happen over night, but we are certainly in for some interesting times ahead.
Beacons picture from wallpapersfor.me
Published in Consuming Data, Data Publishing, Development, Libraries, Linked Data, OCLC, Web
Tagged: Data, Entities, Libraries, Linked Data, RDF, Records
As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series. That one – Entification, and the next in the series – Beacons of Availability, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Hubs of Authority
Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years. The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world either in national libraries or cooperative organisations. Two from personal experience come to mind, BLCMP started in Birmingham, UK in 1969 eventually evolved in to the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC. Both in their own way having had significant impact on the worlds of libraries, metadata, and the web (and me!).
One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records. A large number of national libraries have such lists of agreed formats for author and organisational names. The Library of Congress has in addition to its name authorities, subjects, classifications, languages, countries etc. Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations..
These authority files play a major role in the efficient cataloguing of material today, either by being part of the workflow in a cataloguing interface, or often just using the wonders of Windows ^C & ^V keystroke sequences to transfer agreed format text strings from authority sites into Marc record fields.
It is telling that the default [librarian] description of these things is a file – an echo back to the days when they were just that, a file containing a list of names. Almost despite their initial purpose, authorities are gaining a wider purpose. As a source of names for, and growing descriptions of, the entities that the library world is aware of. Many authority file hosting organisations have followed the natural path, in this emerging world of Linked Data, to provide persistent URIs for each concept plus publishing their information as RDF.
These, Linked Data enabled, sources of information are developing importance in their own right, as a natural place to link to, when asserting the thing, person, or concept you are identifying in your data. As Sir Tim Berners-Lee’s fourth principle of Linked Data tells us to “Include links to other URIs. so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative, source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other. A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.
Libraries and librarians have a great brand image, something that attaches itself to the data and services they publish on the web. Respected and trusted are a couple of words that naturally associate with bibliographic authority data emanating from the library community. This data, starting to add value to the wider web, comes from those Marc records I spoke about last time. Yet it does not, as yet, lead those navigating the web of data to those resources so carefully catalogued. In this case, instead of cataloguing so people can find stuff, we could be considered to be enriching the web with hubs of authority derived from, but not connected to, the resources that brought them into being.
So where next? One obvious move, that is already starting to take place, is to use the identifiers (URIs) for these authoritative names to assert within our data, facts such as who a work is by and what it is about. Check out data from the British National Bibliography or the linked data hidden in the tab at the bottom of a WorldCat display – you will see VIAF, LCSH and other URIs asserting connection with known resources. In this way, processes no longer need to infer from the characters on a page that they are connected with a person or a subject. It is a fundamental part of the data.
With that large amount of rich [linked] data, and the association of the library brand, it is hardly surprising that these datasets are moving beyond mere nodes on the web of data. They are evolving in to Hubs of Authority, building a framework on which libraries and the rest of the web, can hang descriptions of, and signposts to, our resources. A framework that has uses and benefits beyond the boundaries of bibliographic data. By not keeping those hubs ‘library only’, we enable the wider web to build pathways to the library curated resources people need to support their research, learning, discovery and entertainment.
Image by the trial on Flickr
Published in Development, Libraries, OCLC, Uncategorized
Tagged: Data, Entities, Libraries, RDF, Records
As is often the way, you start a post without realising that it is part of a series of posts – as with this one. This, and the following two posts in the series – Hubs of Authority, and Beacons of Availability – together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Entification – a bit of an ugly word, but in my day to day existence one I am hearing more and more. What an exciting life I lead…
What is it, and why should I care, you may be asking.
I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and also as a potential stimulator of significant efficiencies in the creation and management of information about those resources. Taking those benefits as being accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.
That phrase ‘getting library data into a linked data form’ hides multitude of issues. There are some obvious steps such as holding and/or outputting the data in RDF, providing resources with permanent URIs, etc. However, deriving useful library linked data from a source, such as a Marc record, requires far more than giving it a URI and encoding what you know, unchanged, as RDF triples.
Marc is a record based format. For each book catalogued, a record created. The mantra driven in to future cataloguers at library school has been, and I believe often still is, catalogue the item in your hand. Everything discoverable about that item in their hand is transferred on to that [now virtual] catalogue card stored in their library system. In that record we get obvious bookish information such as title, size, format, number of pages, isbn, etc. We also get information about the author (name, birth/death dates etc.), publisher (location, name etc.), classification scheme identifiers, subjects, genres, notes, holding information, etc., etc., etc. A vast amount of information about, and related to, that book in a single record. A significant achievement – assembling all this information for the vast majority of books in the vast majority of the libraries of the world. In this world of electronic resources a pattern that is being repeated for articles, journals, eBooks, audiobooks, etc.
Why do we catalogue? A question I often ask with an obvious answer – so that people can find our stuff. Replicating the polished draws of catalogue cards of old, ordered by author name or subject, indexes are applied to the strings stored in those records . Indexes acting as search access points to a library’s collection.
A spin-off of capturing information in record attributes, about library books/articles/etc., is that we are also building up information about authors, publishers subjects and classifications. So for instance a subject index will contain a list of all the names of the subjects addressed by an individual library collection. To apply some consistency between libraries, authorities – authoritative sets of names, subject headings etc., have emerged so that spellings and name formats could be shared in a controlled way between libraries and cataloguers.
So where does entification come in? Well, much of the information about authors subjects, publishers, and the like is locked up in those records. A record could be taken as describing an entity, the book. However the other entities in the library universe are described as only attributes of the book/article/text. I can attest to the vast computing power and intellectual effort that goes into efforts at OCLC to mine these attributes from records to derive descriptions of the entities they represent – the people, places, organisations, subjects, etc. that the resources are by, about, or related to in some way.
Once the entities are identified, and a model is produced & populated from the records, we can start to work with a true multi-dimensional view of our domain. A major step forward from the somewhat singular view that we have been working with over previous decades. With such a model it should be possible to identify and work with new relationships, such as publishers and their authors, subjects and collections, works and their available formats.
We are in a state of change in the library world which entification of our data will help us get to grips with. As you can imagine as these new approaches crystallise, they are leading to all sorts of discussions around what are the major entities we need to concern ourselves with; how do we model them; how do we populate that model from source [record] data; how do we do it without compromising the rich resources we are working with; and how do we continue to provide and improve the services relied upon at the moment, whilst change happens. Challenging times – bring on the entification!Russian doll image by smcgee on Flickr
Published in Data Publishing, Libraries, Linked Data, OCLC, Open Data, schema.org
Tagged: Libraries, Linked Data, OCLC, Open Data, schema.org
Back in September I formed a W3C Group – Schema Bib Extend. To quote an old friend of mine “Why did you go and do that then?”
Well, as I have mentioned before Schema.org has become a bit of a success story for structured data on the web. I would have no hesitation in recommending it as a starting point for anyone, in any sector, wanting to share structured data on the web. This is what OCLC did in the initial exercise to publish the 270+ million resources in WorldCat.org as Linked Data.
At the same time, I believe that summer 2012 was a bit of a watershed for Linked Data in the library world. Over the preceding few years we have had various national libraries publishing linked data (British Library, Bibliothèque nationale de France, Deutsche National Bibliothek, National Library of Sweden, to name just a few). We have had linked data published versions of authority files such as LCSH, RAMEAU, National Diet Library, plus OCLC hosted services such as VIAF, FAST, and Dewey. These plus many other initiatives have lead me to conclude that we are moving to the next stage – for instance the British Library and Deutsche Nationalbibliothek are starting to cross-link their data, and the Library of Congress BIBFRAME initiative is starting to expose some of its [very linked data] thinking.
Of course the other major initiative was that publication of Linked Data, using Schema.org, from within OCLC’s WorldCat.org, both as RDFa embedded in WorldCat detail pages, and in a download file containing the 1.2 million most highly held works.
The need to extend the Schema.org vocabulary became clear when using it to mark up the bibliographic resources in WorldCat. The Book type defined in Schema.org, along with other types derived from CreativeWork, contain many of the properties you need to describe bibliographic resources, but is lacking in some of the more detailed ones, such as holdings count and carrier type, we wanted to represent. It was also clear that it would need more extension if we wanted to go further to define the relationships between such things as works, expressions, manifestations, and items – to talk FRBR for a moment.
The organisations behind Schema.org (Google, Bing, Yahoo, Yandex) invite proposals for extension of the vocabulary via the W3C public-vocabs mailing list. OCLC could have taken that route directly, but at best I suggest it would have only partially served the needs of the broad spread of organisations and people who could benefit from enriched description of bibliographic resources on the web.
So that is why I formed a W3C Community Group to build a consensus on extending the Schema.org vocabulary for these types of resources. I wanted to not only represent the needs, opinions, and experience of OCLC, but also the wider library sector of libraries, librarians, system suppliers and others. Any generally applicable vocabulary [most importantly recognised by the major search engines] would also provide benefit for the wider bibliographic publishing, retailing, and other interested sectors.
Four months, and four conference calls (supported by OCLC – thank you), later we are a group of 55 members with a fairly active mailing list. We are making progress towards shaping up some recommendations having invested much time in discussing our objectives and the issues of describing detailed bibliographic information (often to be currently found in Marc, Onix, or other industry specific standards) in a generic web-wide vocabulary. We are not trying to build a replacement for Marc, or turn Schema.org into a standard that you could operate a library community with.
Applying Schema.org markup to your bibliographic data is aimed at announcing it’s presence, and the resources it describes, to the web and linking them into the web of data. I would expect to see it being applied as complementary markup to other RDF based standards such as BIBFRAME as it emerges. Although Schema.org started with Microdata and, latterly [and increasingly] RDFa, the vocabulary is equally applicable serialised in any of the RDF formats (N-Triples, Tertle, RDF/XML, JSON) for processing and data exchange purposes.
My hope over the next few months is that we will agree and propose some extensions to schema.org (that will get accepted) especially in the areas of work/manifestation relationships, representations of identifiers other than isbn, defining content/carrier, journal articles, and a few others that may arise. Something that has become clear in our conversations is that we also have a role as a group in providing examples of how [extended] Schema.org markup should be applied to bibliographic data.
I would characterise the stage we are at, as moving from the talking about it to doing something about it stage. I am looking forward to the next few months with enthusiasm.
If you want to join in, you will find us over at http://www.w3.org/community/schemabibex/ (where you will amongst other things on the Wiki find recordings and chat transcripts from the meetings so far). If you or your group want to know more about Schema.org and it’s relevance to libraries and the broader bibliographic world, drop me a line or, if I can fit it in with my travels to conferences such as ALA, could be persuaded to stand up and talk about it.