As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series. That one – Entification, the following one – Hubs of Authority and this, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Beacons of Availability
As I indicated in the first of this series, there are descriptions of a broader collection of entities than just books, articles and other creative works locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.
As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority, to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography, and WorldCat) is creating Library Linked Data openly available on the Web.
Why do we catalogue? is a question I often ask, with an obvious answer – so that people can find our stuff. How does this entification, sharing of authorities, and creation of a web of library linked data help us in that goal? In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. Sounds like a great benefit and mission statement for libraries of the world but unfortunately not one that will nudge the needle on making library resources more discoverable for the vast majority of those that can benefit from them.
I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google. A similar weight of opinion can be found that complains how bad Google, and the other search engines, are at representing library resources. You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources. Yet I am willing to bet that again we have another 80-20 equation, or worse, about how few of the users that libraries want to reach even know those specialist Google services exist. A bit of a sorry state of affairs when the major source of searching for our target audience is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!
Library linked data helps solve both problems: better description and better findability of library resources in the major search engines. It can also help with the problem of identifying where a user can gain access to that resource, whether to loan, download, view via a suitable licence, or purchase.
Before a search engine can lead a user to a suitable resource, it needs to identify that the resource exists, in any form, and hold a description for display in search results that will sufficiently inform a user about it. Library search interfaces are inherently poor sources of such information, with web crawlers having to infer, from often difficult-to-differentiate text, what the page might be about. This is not a problem isolated to library interfaces. In response, the major search engines have cooperated to introduce a generic vocabulary for embedding structured information into web pages so that they can be informed in detail what the page references. This vocabulary is Schema.org – I have previously posted about its success and significance.
With a few enhancements in the way it can describe bibliographic resources (currently being discussed by the Schema Bib Extend W3C Community Group) Schema.org is an ideal way for libraries to publish information about our resources and associated entities in a format the search engines can consume and understand. By using URIs for authorities in that data, for instance identifying the author in question by his/her VIAF identifier, libraries give the search engines the ability to identify resources from many libraries associated with the same person. With this greatly enriched, more structured view of library resources, linked to authoritative hubs, the likes of Google over time will stand a far better chance of presenting potential library users with useful informative results. I am pleased to say that OCLC have been at the forefront of demonstrating this approach by publishing Schema.org modelled linked data in the default WorldCat.org interface.
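To make that a little more concrete, here is a minimal sketch of the kind of Schema.org description a library page might carry, built in Python and serialised as JSON-LD. The URIs, title and author name are placeholders invented for illustration (the VIAF-style URI is not a real identifier), and the exact properties a given library or search engine would use are an assumption on my part.

```python
import json

# Hypothetical identifiers, for illustration only.
AUTHOR_URI = "http://viaf.org/viaf/000000000"          # placeholder, not a real VIAF record
BOOK_URI = "http://library.example.org/resource/123"   # placeholder library URI


def describe_book(book_uri, title, author_name, author_uri):
    """Build a Schema.org description of a book as a JSON-LD dictionary."""
    return {
        "@context": "http://schema.org",
        "@type": "Book",
        "@id": book_uri,
        "name": title,
        "author": {
            "@type": "Person",
            # The authority URI identifies the person, not just a name string.
            "@id": author_uri,
            "name": author_name,
        },
    }


description = describe_book(BOOK_URI, "An Example Title", "Example, Author", AUTHOR_URI)
print(json.dumps(description, indent=2))
```

Any other library describing a work by the same person would carry the same author URI, which is what lets a consumer of the data connect the two descriptions.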
For this approach to be most effective, many of the major libraries, consortia, etc. will need to publish metadata as linked data, in a form that the search engines can consume whilst (following linked data principles) linking to each other when they identify that they are describing the same resource. Many instances of [in data terms] the same thing being published on the web will naturally raise its visibility in results listings.
An individual site (even a WorldCat) has difficulty in being identified above the noise of retail and other sites. We are aware of the Page Rank algorithms used by the search engines to identify and boost the reputation of individual sites and pages by the numbers of links between them. If not an identical process, it is clear that similar rules will apply for structured data linking. If twenty sites publish their own linked data about the same thing, the search engines will take note of each of them. If each of those sites assert that their resource is the same resource as a few of their partner sites (building a web of connection between instances of the same thing), I expect that the engines will take exponentially more notice.
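How a search engine actually weighs such links is not something I can show, but the effect of the assertions themselves can be sketched. The following Python fragment, using made-up URIs and a simple union-find, groups descriptions into clusters wherever sites have asserted that they describe the same thing. It is only an illustration of how a handful of "same as" links turns isolated descriptions into one connected group, not a model of any engine's ranking.

```python
from collections import defaultdict


def same_as_clusters(uris, same_as_links):
    """Group URIs into clusters connected by 'same as' assertions (union-find)."""
    parent = {}

    def find(uri):
        # Register unseen URIs, then walk up to the cluster representative,
        # shortening the path as we go.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for uri in uris:
        find(uri)
    for a, b in same_as_links:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    # Largest clusters first.
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical descriptions of resources published by four different sites.
uris = [
    "http://a.example.org/book/1",
    "http://b.example.org/item/77",
    "http://c.example.org/res/x9",
    "http://d.example.org/book/42",
]
# Sites a, b and c each assert that they describe the same thing as a partner.
links = [
    ("http://a.example.org/book/1", "http://b.example.org/item/77"),
    ("http://b.example.org/item/77", "http://c.example.org/res/x9"),
]

clusters = same_as_clusters(uris, links)
for cluster in clusters:
    print(len(cluster), sorted(cluster))
```

Note that not every site links to every other: two assertions are enough to connect three descriptions, while the fourth site, which links to nobody, remains on its own.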
Page ranking does not depend on all pages having to link to all others. Like many things on the web, hubs of authority and aggregation will naturally emerge with major libraries, local, national, and global consortia doing most of the inter-linking, providing interdependent hubs of reputation for others to connect with.
Having identified a resource that may satisfy a potential library user’s need, the next even more difficult problem is to direct that user to somewhere that they can gain access to it – loan, download, view via an appropriate licence, or purchase, etc.
WorldCat.org, and other hubs, with linked data enhanced to provide holdings information, may well provide a target to link to, via which a user may gain access to, in addition to just getting a description of, a resource. However, those few sites, no matter how big or well recognised they are, are just a few sites shouting in the wilderness of the ever increasing web. Any librarian in any individual library can quite rightly ask how to help Google, and the others, to point users at the most appropriate copy in his/her library.
We have all experienced the scenario of searching for a car rental company and receiving a link to one within walking distance as the first result – or finding the on-campus branch at the top of a list of results in response to a search for banks. We know the search engines are good at location based searching, whether geographical or by interest, so why can they not do it for library resources? To achieve this a library needs to become an integral part of a Web of Library Data: publishing structured linked data about the resources they have available for the search engines to find; in that data, linking their resources to the reputable hubs of bibliographic data that will emerge, so the engines know it is another reference to the same thing; and going beyond basic bibliographic description to encompass structured data used by the commercial world to identify availability.
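The three steps above can be sketched in one small description. This Python fragment builds a JSON-LD dictionary for a local copy of a book, links it to a hub description with sameAs, and borrows the Schema.org Offer pattern used by retailers to say that the copy is available. All URIs and names are hypothetical, and whether Offer is the right way to model a library loan is an assumption for the purposes of the sketch rather than an agreed practice.

```python
import json

# Hypothetical identifiers, for illustration only.
HUB_URI = "http://hub.example.org/work/456"          # description held by a hub
LOCAL_URI = "http://library.example.org/item/123"    # this library's own description


def describe_holding(local_uri, hub_uri, title, library_name, in_stock):
    """Describe a locally held book, linked to a hub, with availability."""
    availability = (
        "http://schema.org/InStock" if in_stock else "http://schema.org/OutOfStock"
    )
    return {
        "@context": "http://schema.org",
        "@type": "Book",
        "@id": local_uri,
        "name": title,
        # Step 2: assert this is another reference to the same thing as the hub's.
        "sameAs": hub_uri,
        # Step 3: availability, expressed the way the commercial world does it.
        "offers": {
            "@type": "Offer",
            "availability": availability,
            "offeredBy": {"@type": "Library", "name": library_name},
        },
    }


holding = describe_holding(
    LOCAL_URI, HUB_URI, "An Example Title", "Example Town Library", in_stock=True
)
print(json.dumps(holding, indent=2))
```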
So who is going to do all this then – will every library need to employ a linked data expert? I certainly hope not.
One would expect the leaders in this field, national libraries, OCLC, consortia etc., to continue to lead the way, in the process establishing the core of this library web of data – the hubs. Building on that framework the rest of the web can be established with the help of the products and services of service providers and system suppliers. Those concerned about these things should already be starting to think about how they can be helped not only to publish linked data in a form that the search engines can consume, but also how their resources can become linked via those hubs to the wider web.
By lighting a linked data beacon on top of their web presence, a library will announce to the world the availability of their resources. One beacon is not enough. A web of beacons (the web of library data) will alert the search engines to the mass of those resources in all libraries, then they can lead users via that web to the appropriately located individual resource in particular.
This won’t happen overnight, but we are certainly in for some interesting times ahead.
As is often the way, you start a post without realising that it is part of a series of posts – as with the first in this series. That one – Entification, and the next in the series – Beacons of Availability, together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Hubs of Authority
Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years. The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world either in national libraries or cooperative organisations. Two from personal experience come to mind: BLCMP, started in Birmingham, UK in 1969, eventually evolved into the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC. Both, in their own way, have had a significant impact on the worlds of libraries, metadata, and the web (and me!).
One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records. A large number of national libraries have such lists of agreed formats for author and organisational names. The Library of Congress has, in addition to its name authorities, subjects, classifications, languages, countries etc. Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations.
These authority files play a major role in the efficient cataloguing of material today, either by being part of the workflow in a cataloguing interface, or often just using the wonders of Windows ^C & ^V keystroke sequences to transfer agreed format text strings from authority sites into Marc record fields.
It is telling that the default [librarian] description of these things is a file – an echo back to the days when they were just that, a file containing a list of names. Almost despite their initial purpose, authorities are gaining a wider purpose as a source of names for, and growing descriptions of, the entities that the library world is aware of. Many authority file hosting organisations have followed the natural path, in this emerging world of Linked Data, to provide persistent URIs for each concept plus publishing their information as RDF.
These Linked Data enabled sources of information are developing importance in their own right, as a natural place to link to when asserting the thing, person, or concept you are identifying in your data. Sir Tim Berners-Lee’s fourth principle of Linked Data tells us to “Include links to other URIs. so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other. A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.
Libraries and librarians have a great brand image, something that attaches itself to the data and services they publish on the web. Respected and trusted are a couple of words that naturally associate with bibliographic authority data emanating from the library community. This data, starting to add value to the wider web, comes from those Marc records I spoke about last time. Yet it does not, as yet, lead those navigating the web of data to those resources so carefully catalogued. In this case, instead of cataloguing so people can find stuff, we could be considered to be enriching the web with hubs of authority derived from, but not connected to, the resources that brought them into being.
So where next? One obvious move, that is already starting to take place, is to use the identifiers (URIs) for these authoritative names to assert within our data, facts such as who a work is by and what it is about. Check out data from the British National Bibliography or the linked data hidden in the tab at the bottom of a WorldCat display – you will see VIAF, LCSH and other URIs asserting connection with known resources. In this way, processes no longer need to infer from the characters on a page that they are connected with a person or a subject. It is a fundamental part of the data.
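A tiny Python sketch shows the difference this makes. The two records below are invented, as is the placeholder URI (it is not a real VIAF identifier): the same author appears under two differently formatted name strings, which a process comparing characters would treat as different people, whereas comparing the asserted URIs needs no inference at all.

```python
# Hypothetical records from two libraries; the URI is a placeholder,
# not a real VIAF identifier.
AUTHOR_URI = "http://viaf.org/viaf/000000000"

library_a = {
    "title": "First Example Work",
    "author_name": "Example, Author, 1900-1980",
    "author_uri": AUTHOR_URI,
}
library_b = {
    "title": "Second Example Work",
    "author_name": "Author Example",
    "author_uri": AUTHOR_URI,
}


def same_author_by_string(rec1, rec2):
    """Infer identity from the characters on the page."""
    return rec1["author_name"] == rec2["author_name"]


def same_author_by_uri(rec1, rec2):
    """Read identity directly from the asserted identifier."""
    return rec1["author_uri"] == rec2["author_uri"]


print(same_author_by_string(library_a, library_b))  # False
print(same_author_by_uri(library_a, library_b))     # True
```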
With that large amount of rich [linked] data, and the association of the library brand, it is hardly surprising that these datasets are moving beyond mere nodes on the web of data. They are evolving into Hubs of Authority, building a framework on which libraries, and the rest of the web, can hang descriptions of, and signposts to, our resources. A framework that has uses and benefits beyond the boundaries of bibliographic data. By not keeping those hubs ‘library only’, we enable the wider web to build pathways to the library curated resources people need to support their research, learning, discovery and entertainment.
As is often the way, you start a post without realising that it is part of a series of posts – as with this one. This, and the following two posts in the series – Hubs of Authority, and Beacons of Availability – together map out a journey that I believe the library community is undertaking as it evolves from a record based system of cataloguing items towards embracing distributed open linked data principles to connect users with the resources they seek. Although grounded in much of the theory and practice I promote and engage with, in my role as Technology Evangelist with OCLC and Chairing the Schema Bib Extend W3C Community Group, the views and predictions are mine and should not be extrapolated to predict either future OCLC product/services or recommendations from the W3C Group.
Entification – a bit of an ugly word, but in my day to day existence one I am hearing more and more. What an exciting life I lead…
What is it, and why should I care, you may be asking.
I spend much of my time convincing people of the benefits of Linked Data to the library domain, both as a way to publish and share our rich resources with the wider world, and also as a potential stimulator of significant efficiencies in the creation and management of information about those resources. Taking those benefits as being accepted, for the purposes of this post, brings me into discussion with those concerned with the process of getting library data into a linked data form.
That phrase ‘getting library data into a linked data form’ hides a multitude of issues. There are some obvious steps such as holding and/or outputting the data in RDF, providing resources with permanent URIs, etc. However, deriving useful library linked data from a source, such as a Marc record, requires far more than giving it a URI and encoding what you know, unchanged, as RDF triples.
Marc is a record based format. For each book catalogued, a record is created. The mantra driven into future cataloguers at library school has been, and I believe often still is, catalogue the item in your hand. Everything discoverable about that item in their hand is transferred on to that [now virtual] catalogue card stored in their library system. In that record we get obvious bookish information such as title, size, format, number of pages, ISBN, etc. We also get information about the author (name, birth/death dates etc.), publisher (location, name etc.), classification scheme identifiers, subjects, genres, notes, holding information, etc., etc., etc. A vast amount of information about, and related to, that book in a single record. A significant achievement – assembling all this information for the vast majority of books in the vast majority of the libraries of the world. In this world of electronic resources it is a pattern that is being repeated for articles, journals, eBooks, audiobooks, etc.
Why do we catalogue? A question I often ask with an obvious answer – so that people can find our stuff. Replicating the polished drawers of catalogue cards of old, ordered by author name or subject, indexes are applied to the strings stored in those records. Indexes acting as search access points to a library’s collection.
A spin-off of capturing information in record attributes, about library books/articles/etc., is that we are also building up information about authors, publishers, subjects and classifications. So for instance a subject index will contain a list of all the names of the subjects addressed by an individual library collection. To apply some consistency between libraries, authorities – authoritative sets of names, subject headings etc. – have emerged so that spellings and name formats could be shared in a controlled way between libraries and cataloguers.
So where does entification come in? Well, much of the information about authors, subjects, publishers, and the like is locked up in those records. A record could be taken as describing an entity, the book. However the other entities in the library universe are described only as attributes of the book/article/text. I can attest to the vast computing power and intellectual effort that goes into efforts at OCLC to mine these attributes from records to derive descriptions of the entities they represent – the people, places, organisations, subjects, etc. that the resources are by, about, or related to in some way.
Once the entities are identified, and a model is produced & populated from the records, we can start to work with a true multi-dimensional view of our domain. A major step forward from the somewhat singular view that we have been working with over previous decades. With such a model it should be possible to identify and work with new relationships, such as publishers and their authors, subjects and collections, works and their available formats.
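As a toy illustration of the idea, and nothing like the scale or subtlety of real entity mining, the following Python sketch takes a few invented, heavily simplified records, pulls out the people, organisations and subjects they mention as entities in their own right, and then answers a question the records alone do not index: which authors has each publisher published? Treating identical name strings as the same entity is a gross simplification assumed here for brevity; the identifiers and field names are likewise made up.

```python
from collections import defaultdict

# Hypothetical, heavily simplified records; real Marc records are far richer.
records = [
    {"title": "Work One", "author": "Example, Ann",
     "publisher": "Example Press", "subjects": ["Libraries"]},
    {"title": "Work Two", "author": "Example, Ann",
     "publisher": "Other House", "subjects": ["Libraries", "Metadata"]},
    {"title": "Work Three", "author": "Sample, Bob",
     "publisher": "Example Press", "subjects": ["Metadata"]},
]


def entify(records):
    """Derive entities and relationships from record attributes."""
    entities = {}          # (kind, name) -> local identifier
    relationships = set()  # (subject id, predicate, object id)

    def entity_id(kind, name):
        # Naive assumption: the same name string means the same entity.
        key = (kind, name)
        if key not in entities:
            entities[key] = f"{kind}/{len(entities) + 1}"
        return entities[key]

    for record in records:
        work = entity_id("work", record["title"])
        relationships.add((work, "author", entity_id("person", record["author"])))
        relationships.add(
            (work, "publisher", entity_id("organisation", record["publisher"]))
        )
        for subject in record["subjects"]:
            relationships.add((work, "about", entity_id("subject", subject)))
    return entities, relationships


def authors_by_publisher(entities, relationships):
    """A relationship no single record holds: publishers and their authors."""
    names = {eid: name for (kind, name), eid in entities.items()}
    author_of = {work: obj for work, pred, obj in relationships if pred == "author"}
    result = defaultdict(set)
    for work, pred, obj in relationships:
        if pred == "publisher":
            result[names[obj]].add(names[author_of[work]])
    return dict(result)


entities, relationships = entify(records)
publisher_authors = authors_by_publisher(entities, relationships)
for publisher, authors in sorted(publisher_authors.items()):
    print(publisher, sorted(authors))
```

Three records yield nine entities here (three works, two people, two organisations, two subjects), each of which could now be described and linked independently of the records it came from.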
We are in a state of change in the library world which entification of our data will help us get to grips with. As you can imagine as these new approaches crystallise, they are leading to all sorts of discussions around what are the major entities we need to concern ourselves with; how do we model them; how do we populate that model from source [record] data; how do we do it without compromising the rich resources we are working with; and how do we continue to provide and improve the services relied upon at the moment, whilst change happens. Challenging times – bring on the entification!