From Records to a Web of Library Data – Pt2 Hubs of Authority

As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series.  That one – Entification – and the next in the series – Beacons of Availability – together map out a journey that I believe the library community is undertaking as it evolves from a record-based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek.  Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and as Chair of the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC products/services or recommendations from the W3C Group.

Hubs of Authority

Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years.  The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world, either in national libraries or in cooperative organisations.  Two come to mind from personal experience: BLCMP, started in Birmingham, UK in 1969, eventually evolved into the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC.  Both, in their own way, have had significant impact on the worlds of libraries, metadata, and the web (and me!).

One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records.  A large number of national libraries have such lists of agreed formats for author and organisational names.  In addition to its name authorities, the Library of Congress has authorities for subjects, classifications, languages, countries, etc.  Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general, as a source of identifiers for people & organisations.

These authority files play a major role in the efficient cataloguing of material today, either by being part of the workflow in a cataloguing interface or, often, just by using the wonders of Windows ^C & ^V keystroke sequences to transfer agreed format text strings from authority sites into Marc record fields.

It is telling that the default [librarian] description of these things is a file – an echo back to the days when they were just that, a file containing a list of names.  Almost despite their initial purpose, authorities are gaining a wider one, as a source of names for, and growing descriptions of, the entities that the library world is aware of.  In this emerging world of Linked Data, many authority file hosting organisations have followed the natural path of providing persistent URIs for each concept and publishing their information as RDF.

These Linked Data enabled sources of information are developing importance in their own right, as a natural place to link to when asserting the thing, person, or concept you are identifying in your data.  As Sir Tim Berners-Lee’s fourth principle of Linked Data tells us: “Include links to other URIs. so that they can discover more things”.  VIAF in particular is becoming such a trusted, authoritative source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF, surfacing hundreds of thousands of relevant links between them.  A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.

Libraries and librarians have a great brand image, something that attaches itself to the data and services they publish on the web.  Respected and trusted are a couple of words that naturally associate with bibliographic authority data emanating from the library community.  This data, starting to add value to the wider web, comes from those Marc records I spoke about last time.  Yet it does not, as yet, lead those navigating the web of data to those resources so carefully catalogued.  In this case, instead of cataloguing so people can find stuff, we could be considered to be enriching the web with hubs of authority derived from, but not connected to, the resources that brought them into being.

So where next?  One obvious move, which is already starting to take place, is to use the identifiers (URIs) for these authoritative names to assert, within our data, facts such as who a work is by and what it is about.  Check out data from the British National Bibliography or the linked data hidden in the tab at the bottom of a WorldCat display – you will see VIAF, LCSH and other URIs asserting connection with known resources.  In this way, processes no longer need to infer from the characters on a page that they are connected with a person or a subject.  It is a fundamental part of the data.
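To make the difference concrete, here is a minimal sketch in Python that describes the same book twice: once with the author as a plain text string, and once with URIs for the author and subject.  Everything in it is an assumption for illustration: the book and subject URIs use example.org, the VIAF-style URI carries a placeholder number rather than a real identifier, the name string is made up, and the schema.org properties are just one vocabulary that could be used.

```python
# A sketch only: all identifiers below are placeholders, not real records.
SCHEMA = "http://schema.org/"
BOOK = "http://example.org/book/1"                   # hypothetical book URI
PERSON = "http://viaf.org/viaf/0000000"              # placeholder number, VIAF-style pattern
SUBJECT = "http://example.org/subject/linked-data"   # stands in for e.g. an LCSH URI


def uri(value):
    """Format a URI as an N-Triples term."""
    return f"<{value}>"


def literal(value):
    """Format a plain string as an N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialise (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def linked_resources(triples):
    """Return the objects that are URIs, i.e. links a process can follow."""
    return [o for _, _, o in triples if o.startswith("<")]


# The record-style view: the author is only a string of characters.
string_based = [
    (uri(BOOK), uri(SCHEMA + "author"), literal("Smith, John, 1950-")),
]

# The linked data view: the author and subject are identified things.
uri_based = [
    (uri(BOOK), uri(SCHEMA + "author"), uri(PERSON)),
    (uri(BOOK), uri(SCHEMA + "about"), uri(SUBJECT)),
]

if __name__ == "__main__":
    print(to_ntriples(string_based))
    print(to_ntriples(uri_based))
    print(linked_resources(string_based))  # nothing to follow
    print(linked_resources(uri_based))     # two links to follow
```

In the string version a process has nothing to follow and would have to guess who the name refers to; in the URI version the connection to the person and the subject is stated outright.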

With that large amount of rich [linked] data, and the association of the library brand, it is hardly surprising that these datasets are moving beyond mere nodes on the web of data.  They are evolving into Hubs of Authority, building a framework on which libraries, and the rest of the web, can hang descriptions of, and signposts to, our resources: a framework that has uses and benefits beyond the boundaries of bibliographic data.  By not keeping those hubs ‘library only’, we enable the wider web to build pathways to the library curated resources people need to support their research, learning, discovery and entertainment.

Image by the trial on Flickr

From Records to a Web of Library Data – Pt1 Entification






As is often the way, you start a post without realising that it is part of a series of posts – as with this one.  This, and the following two posts in the series – Hubs of Authority, and Beacons of Availability – together map out a journey that I believe the library community is undertaking as it evolves from a record-based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek.  Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and as Chair of the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC products/services or recommendations from the W3C Group.

Entification

Entification – a bit of an ugly word, but in my day to day existence one I am hearing more and more. What an exciting life I lead…

What is it, and why should I care, you may be asking.

I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and also as a potential stimulator of significant efficiencies in the creation and management of information about those resources.  Taking those benefits as being accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.

That phrase ‘getting library data into a linked data form’ hides a multitude of issues.  There are some obvious steps such as holding and/or outputting the data in RDF, providing resources with permanent URIs, etc.  However, deriving useful library linked data from a source, such as a Marc record, requires far more than giving it a URI and encoding what you know, unchanged, as RDF triples.

Marc is a record-based format.  For each book catalogued, a record is created.  The mantra driven into future cataloguers at library school has been, and I believe often still is, catalogue the item in your hand.  Everything discoverable about that item in their hand is transferred on to that [now virtual] catalogue card stored in their library system.  In that record we get obvious bookish information such as title, size, format, number of pages, ISBN, etc.  We also get information about the author (name, birth/death dates etc.), publisher (location, name etc.), classification scheme identifiers, subjects, genres, notes, holding information, etc., etc., etc.  A vast amount of information about, and related to, that book in a single record.  A significant achievement – assembling all this information for the vast majority of books in the vast majority of the libraries of the world.  In this world of electronic resources, it is a pattern that is being repeated for articles, journals, eBooks, audiobooks, etc.

Why do we catalogue?  A question I often ask, with an obvious answer – so that people can find our stuff.  Replicating the polished drawers of catalogue cards of old, ordered by author name or subject, indexes are applied to the strings stored in those records.  These indexes act as search access points to a library’s collection.

A spin-off of capturing information in record attributes, about library books/articles/etc., is that we are also building up information about authors, publishers, subjects and classifications.  So, for instance, a subject index will contain a list of all the names of the subjects addressed by an individual library collection.  To apply some consistency between libraries, authorities – authoritative sets of names, subject headings, etc. – have emerged so that spellings and name formats could be shared in a controlled way between libraries and cataloguers.

So where does entification come in?  Well, much of the information about authors, subjects, publishers, and the like is locked up in those records.  A record could be taken as describing an entity, the book.  However, the other entities in the library universe are described only as attributes of the book/article/text.  I can attest to the vast computing power and intellectual effort that goes into efforts at OCLC to mine these attributes from records to derive descriptions of the entities they represent – the people, places, organisations, subjects, etc. that the resources are by, about, or related to in some way.

Once the entities are identified, and a model is produced & populated from the records, we can start to work with a true multi-dimensional view of our domain.  A major step forward from the somewhat singular view that we have been working with over previous decades.  With such a model it should be possible to identify and work with new relationships, such as publishers and their authors, subjects and collections, works and their available formats.
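As a toy illustration of that step from records to entities, the Python sketch below pulls people and organisations out of a handful of made-up records and then asks a question the records alone do not answer directly: which authors does each publisher have?  The field names, the records, and the local identifiers are all hypothetical, and matching on a lower-cased name string is a deliberate simplification; it is not a description of how OCLC does this, where the real work lies in reconciling variant and ambiguous names.

```python
from collections import defaultdict

# Made-up records; the field names are hypothetical, not Marc tags.
records = [
    {"title": "Book One", "author": "Smith, John", "publisher": "Example Press"},
    {"title": "Book Two", "author": "Smith, John", "publisher": "Example Press"},
    {"title": "Book Three", "author": "Jones, Mary", "publisher": "Example Press"},
    {"title": "Book Four", "author": "Jones, Mary", "publisher": "Sample House"},
]


def entify(records):
    """Turn record attributes into entities with local ids, plus links between them."""
    entities = {}  # (kind, normalised name) -> local identifier
    links = []     # (book id, relation, target id)

    def entity_id(kind, name):
        # Simplification: identical strings (ignoring case) are the same entity.
        key = (kind, name.strip().lower())
        if key not in entities:
            entities[key] = f"{kind}/{len(entities) + 1}"
        return entities[key]

    for record in records:
        book = entity_id("book", record["title"])
        links.append((book, "author", entity_id("person", record["author"])))
        links.append((book, "publisher", entity_id("organisation", record["publisher"])))
    return entities, links


def authors_by_publisher(links):
    """Derive a publisher -> authors relationship that no single record states."""
    authors = defaultdict(set)
    publishers = {}
    for book, relation, target in links:
        if relation == "author":
            authors[book].add(target)
        elif relation == "publisher":
            publishers[book] = target
    result = defaultdict(set)
    for book, publisher in publishers.items():
        result[publisher] |= authors[book]
    return dict(result)


if __name__ == "__main__":
    entities, links = entify(records)
    for publisher, people in authors_by_publisher(links).items():
        print(publisher, sorted(people))
```

Four records become eight entities (four books, two people, two organisations), and the publisher-to-author relationship falls out of the links rather than having to be re-read from strings in each record.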

We are in a state of change in the library world, which entification of our data will help us get to grips with.  As you can imagine, as these new approaches crystallise they are leading to all sorts of discussions: what are the major entities we need to concern ourselves with; how do we model them; how do we populate that model from source [record] data; how do we do it without compromising the rich resources we are working with; and how do we continue to provide and improve the services relied upon at the moment, whilst change happens.  Challenging times – bring on the entification!

Russian doll image by smcgee on Flickr