I recently gave a presentation at the 2023 LD4 Conference on Linked Data entitled From Ambition to Go Live: The National Library Board of Singapore’s journey to an operational Linked Data Management & Discovery System, which describes one of the projects that have kept me busy for the last couple of years.
The whole presentation is available to view on YouTube.
In response to requests for more detail, this post focuses on one aspect of that project – the import pipeline that, on a daily basis, automatically ingests MARC data exported from NLB Singapore’s ILS system into the Linked Data Management and Discovery System (LDMS).
The LDMS, supporting infrastructure, and user & management interfaces have been created in a two-year project for NLB by a consortium of vendors: metaphacts GmbH, KewMann, and myself as Data Liberate.
The MARC ingestion pipeline is one of four pipelines that keep the Knowledge Graph, underpinning the LDMS, synchronised with additions, updates, and deletions from the many source systems that NLB curate and host. This enables the system to provide a near-real-time reconciled and consolidated linked data view of the entities described across their disparate source systems.
Below is a simplified diagram showing the technical architecture of the LDMS, implemented in a Singapore Government instance of Amazon Web Services (AWS). Export files from source systems are dropped in a shared location, where they are picked up by import control scripts and passed to the appropriate import pipeline processes; the output from these is then uploaded into a GraphDB Knowledge Graph. The Data Management and Discovery interfaces, running on the metaphactory platform, then interact with that Knowledge Graph.
Having set the scene, let us zoom in on the ILS pipeline and its requirements and implementation.
Input
Individual MARC-XML files, one per MARC record, exported from NLB’s ILS system, are placed in the shared location. They are picked up by scheduled Python scripts and passed on for processing.
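As an illustration of that pickup stage, here is a minimal sketch assuming a simple directory-based drop location and hypothetical paths; the production control scripts work against S3 and keep richer state.

```python
# Minimal sketch of a scheduled pickup script. The paths and the simple
# "processed" log are illustrative assumptions, not the production code.
from pathlib import Path

DROP_LOCATION = Path("/shared/ils-export")               # assumed shared location
PROCESSED_LOG = Path("/shared/ils-export/.processed")    # simple record of handled files

def already_processed() -> set:
    if PROCESSED_LOG.exists():
        return set(PROCESSED_LOG.read_text().splitlines())
    return set()

def pick_up_new_files(process_file) -> None:
    """Find MARC-XML files not yet handled and pass each one on for processing."""
    seen = already_processed()
    with PROCESSED_LOG.open("a") as log:
        for marc_xml in sorted(DROP_LOCATION.glob("*.xml")):
            if marc_xml.name in seen:
                continue
            process_file(marc_xml)        # e.g. the marc2bibframe2 wrapper described below
            log.write(marc_xml.name + "\n")

if __name__ == "__main__":
    # Typically run on a timer (cron locally, or an equivalent scheduled trigger in AWS).
    pick_up_new_files(lambda path: print(f"would process {path}"))
```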
MARC to BIBFRAME
The first step in the pipeline is to produce BIBFRAME RDF equivalents of the individual MARC records. For this, the pipeline makes use of marc2bibframe2, the open-source conversion scripts shared by the Library of Congress (LoC).
marc2bibframe2 was chosen for a few reasons. Firstly, it closely follows the MARC 21 to BIBFRAME 2.0 Conversion Specifications published by LoC, which are recognised across the bibliographic world. Secondly, it is designed to read MARC-XML files in the form output by the ILS source. Thirdly, it is a set of XSLT scripts which could be easily wrapped and controlled in the chosen Python-based environment.
That wrapping consists of a Python script that takes care of handling multiple files, and of caching and reusing the XSLT transform module so that a new one is not created for every individual file transform. The script was constructed so that it could be run against a local file system, for development and debugging, and also deployed for production operations as AWS Lambda processes working against AWS S3 file structures.
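To make that concrete, below is a minimal sketch of the wrapping idea, assuming lxml as the XSLT processor and a local checkout of marc2bibframe2; the stylesheet path and output naming are illustrative rather than the project’s actual configuration.

```python
# Minimal sketch: compile the marc2bibframe2 stylesheet once and reuse it,
# rather than creating a new XSLT transform for every record file.
from pathlib import Path
from lxml import etree

# Assumed location of a local marc2bibframe2 checkout.
MARC2BIBFRAME2_XSL = Path("marc2bibframe2/xsl/marc2bibframe2.xsl")

# Created once at module load time and cached for all subsequent transforms.
_transform = etree.XSLT(etree.parse(str(MARC2BIBFRAME2_XSL)))

def marc_to_bibframe(marc_xml_path: Path, output_dir: Path) -> Path:
    """Transform one MARC-XML record file into a BIBFRAME RDF-XML file."""
    result = _transform(etree.parse(str(marc_xml_path)))
    out_path = output_dir / (marc_xml_path.stem + ".rdf.xml")
    out_path.write_bytes(etree.tostring(result, pretty_print=True))
    return out_path
```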
The output from the process is individual RDF-XML files containing a representation of the entities (Work, Instance, Item, Organization, Person, Subject, etc.) implicitly described in the source MARC data. A mini RDF knowledge graph for each record processed.
BIBFRAME and Schema.org
The next step in the pipeline is to add Schema.org entity types and properties to the BIBFRAME. This satisfies two requirements of the LDMS: that every entity is described, as a minimum, with Schema.org, and that there is Schema.org data to share on the web.
This step was realised by making use of another open-source project – Bibframe2Schema.org*. Somewhat counter-intuitively, the core of this project is a SPARQL script, which might lead you to think you need a full RDF triplestore to use it. Fortunately that is not the case. Taking inspiration from the project’s schemaise.py, a Python script was produced that uses the rdflib Python module to create an in-memory triplestore from the contents of each RDF-XML file and then run the SPARQL query against it.
The SPARQL query uses INSERT commands to add Schema.org triples alongside those loaded into the store from the BIBFRAME RDF-XML files output by the pipeline’s first step. The final BIBFRAME/Schema.org cocktail of triples is output as RDF files for subsequent loading into the Knowledge Graph.
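In outline, the step looks something like the following sketch (using rdflib only; the file locations are placeholders, and this is not the project’s schemaise.py itself).

```python
# Minimal sketch: load one BIBFRAME RDF-XML file into an in-memory graph,
# run the Bibframe2Schema.org SPARQL INSERTs against it, and write out the
# combined BIBFRAME + Schema.org triples.
from rdflib import Graph

def add_schema_org(bibframe_rdfxml: str, conversion_query_path: str, output_path: str) -> None:
    graph = Graph()
    graph.parse(bibframe_rdfxml, format="xml")        # the mini graph from step one

    with open(conversion_query_path) as query_file:
        graph.update(query_file.read())               # INSERTs add Schema.org types and properties

    graph.serialize(destination=output_path, format="xml")
```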
Using Python scripting components shared with the marc2bibframe2 wrapping scripts, this step can either be tested locally or run as AWS Lambda processes, picking up the RDF-XML input files from the S3 location where they were placed by the first step.
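For the Lambda deployment the wiring looks roughly like the sketch below, in which the event shape, bucket layout, and key naming are assumptions made purely for illustration.

```python
# Minimal sketch of the step as an AWS Lambda handler reading from and
# writing to S3. Event shape, bucket layout and key naming are assumed.
import boto3
from rdflib import Graph

s3 = boto3.client("s3")

def enrich(rdf_xml: bytes, conversion_query: str) -> bytes:
    """The in-memory BIBFRAME -> BIBFRAME + Schema.org step, as sketched above."""
    graph = Graph()
    graph.parse(data=rdf_xml, format="xml")
    graph.update(conversion_query)
    return graph.serialize(format="xml", encoding="utf-8")

def lambda_handler(event, context):
    bucket = event["bucket"]                                            # assumed event shape
    query = s3.get_object(Bucket=bucket,
                          Key="config/bibframe2schema.sparql")["Body"].read().decode()
    for key in event["keys"]:
        rdf_xml = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        s3.put_object(
            Bucket=bucket,
            Key=key.replace("bibframe/", "bibframe-schema/", 1),        # illustrative key layout
            Body=enrich(rdf_xml, query),
        )
    return {"processed": len(event["keys"])}
```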
It is never that simple
For the vast majority of needs these two pipeline steps work well together, often producing thousands of RDF files for loading into the Knowledge Graph every day.
However, things are never that simple in significant production-oriented projects. There are invariably site-specific tweaks required to get the results you need from the open-source modules you use. In our case such tweaks included: changing hash URIs to externally referenceable slash URIs; replacing blank nodes with URIs; and correcting a couple of anomalies in the assignment of subject URIs by the LoC script.
Fortunately the schemaise.py script already had the capability to call pre- and post-processing Python modules, which allowed some bespoke logic to be introduced without diverging from the way the open-source modules operate.
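To give a flavour of what such a post-processing hook can look like, here is a minimal sketch (the base URI is a placeholder, and this is not the production module): it rewrites locally minted hash URIs to slash URIs and replaces blank nodes with referenceable URIs.

```python
# Minimal sketch of a post-processing tweak applied to each mini graph before
# output. The base URI is an assumed placeholder, not NLB's real namespace.
from rdflib import Graph, URIRef, BNode

LOCAL_BASE = "https://example.org/resource/"

def fix_node(node):
    """Hash URI -> slash URI for locally minted URIs; blank node -> URI."""
    if isinstance(node, URIRef) and str(node).startswith(LOCAL_BASE) and "#" in str(node):
        return URIRef(str(node).replace("#", "/", 1))
    if isinstance(node, BNode):
        return URIRef(LOCAL_BASE + "bnode/" + str(node))
    return node

def post_process(graph: Graph) -> Graph:
    fixed = Graph()
    for s, p, o in graph:
        fixed.add((fix_node(s), p, fix_node(o)))
    return fixed
```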
And into the graph…
The output from all the pipelines, including the ILS pipeline, is a set of RDF data files ready for loading into the Knowledge Graph using the standard APIs made available by GraphDB. As entity descriptions are loaded into the Knowledge Graph, reconciliation and consolidation processes are triggered which result in a single real-world entity view for users of the system. My next post will provide an insight into how that is achieved.
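As a concrete illustration of that loading step, a single pipeline output file can be pushed to GraphDB through its RDF4J-compatible REST endpoint; in the sketch below the host, repository name, and named graph are placeholders, and the production deployment naturally adds more around it.

```python
# Minimal sketch: POST one RDF-XML output file to a GraphDB repository's
# statements endpoint, optionally into a named graph. Host and repository
# name are assumed placeholders.
import requests

GRAPHDB_URL = "http://localhost:7200"
REPOSITORY = "ldms"

def load_rdf_file(path: str, named_graph: str = "") -> None:
    params = {"context": f"<{named_graph}>"} if named_graph else {}
    with open(path, "rb") as rdf_file:
        response = requests.post(
            f"{GRAPHDB_URL}/repositories/{REPOSITORY}/statements",
            params=params,
            data=rdf_file,
            headers={"Content-Type": "application/rdf+xml"},
        )
    response.raise_for_status()
```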
*Richard Wallis is Chair of the Bibframe2Schema.org W3C Community Group.
See Richard presenting in person on this and other aspects of this project at the following upcoming conferences:
- SWIB 23 Semantic Web in Libraries Conference, September 12th 2023, Berlin, Germany
- BIBFRAME Workshop in Europe, September 19th 2023, Brussels, Belgium