Putting WorldCat Data Into A Triple Store

I can not really get away with making a statement like “Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them” and then not following it up.

I made it in my previous post Get Yourself a Linked Data Piece of WorldCat to Play With in which I was highlighting the release of a download file containing RDF descriptions of the 1.2 million most highly held resources in WorldCat.org – to make the cut, a resource had to be held by more than 250 libraries.

So here for those that are interested is a step by step description of what I did to follow my own encouragement to load up the triples and start playing.

Step 1
Choose a triplestore. I followed my own advise and chose 4Store. The main reasons for this choice were that it is open source yet comes from an environment where it was the base platform for a successful commercial business, so it should work. Also in my years rattling around the semantic web world, 4Store has always been one of those tools that seemed to be on everyone’s recommendation list.

Looking at some of the blurb – 4store is optimised to run on shared–nothing clusters of up to 32 nodes, linked with gigabit Ethernet – at times holding and running queries over databases of 15GT, supporting a Web application used by thousands of people – you may think it might be a bit of overkill for a tool to play with at home, but hay if it works does that matter!

Step 2
Operating system. Unsurprisingly for a server product, 4Store was developed to run on Unix-like systems. I had three options. I could resurrect that old Linux loaded pc in the corner, fire up an Amazon Web Service image with 4Store built in (such as the one built for the Billion Triple Challenge), or I could use the application download for my Mac.

As I was only needing it for personal playing, I went for the path of least resistance and went for the Mac application. The Mac in question being a fairly modern MacBook Air. The following instructions are therefore Mac oriented, but should not be too difficult to replicate on your OS of choice.

Step 3
Download and install. I downloaded the 15Mb, latest version of the application from the download server: http://4store.org/download/macosx/. As with most Mac applications, it was just a matter of opening up the downloaded 4store-1.1.5.dmg file and dragging the 4Store icon into my applications folder. (Time saving tip, whilst you are doing the next step you can be downloading the 1Gb WorldCat data file in the background, from here)

Step 4
Setup and load. Clicking on the 4Store application opens up a terminal window to give you command line access to controlling your triple store. Following the simple but effective documentation, I needed to create a dataset, which I called WorldCatMillion:

$ 4s-backend-setup WorldCatMillion

Next start the database:

$ 4s-backend WorldCatMillion

Then I need to load the triples from the WorldCat Most Highly Held data set. This step takes a while – over an hour on my system.

$ 4s-import WorldCatMillion –format ntriples /Users/walllisr/Downloads/WorldCatMostHighlyHeld-2012-05-15.nt

This single command line, which may have wrapped on to more than one line in your browser, looks a bit complicated but all it is doing is telling the import process to import the file, which I had downloaded and unziped (automatically on the Mac – you may have to use gunzip on another system), which is formatted as ntriples, into my WorldCatMillion dataset.

Now to start the http server to access it:

$ 4s-httpd -p 8000 WorldCatMillion

A quick test to see if it all worked:

$ 4s-query WorldCatMillion ‘SELECT * WHERE { ?s ?p ?o } LIMIT 10’

This should output some XML encoded triples

Step 5
Access via a web browser. I chose Firefox, as it seems to handle unformatted XML better than most. 4Store comes with a very simple SPARQL interface: http://localhost:8000/test/ This comes already populated with a sample query, just press execute and you should get the data back that you got with the command line 4s-query. The server sends it back in an XML format, which your browser may save to disk for you to view – tweaking the browser settings to automatically open these files will make life easier.

Step 6
Some simple SPARQL queries. Try these and see what you get:

Describe a resource:

DESCRIBE <http://www.worldcat.org/oclc/46843162>

Select all the genre used:

SELECT DISTINCT ?o WHERE {
?s <http://schema.org/genre> ?o .
} LIMIT 100 OFFSET 0

Select 100 resources, with a genre triple, outputting the resource URI and it’s genre. (By adjusting the OFFSET value, you can page through all the results):

SELECT ?s, ?o WHERE {
?s <http://schema.org/genre> ?o .
} LIMIT 100 OFFSET 0

Ok there is a start, now I need to play a bit to brush up on my SPARQL!

Structured Data / Schema.org Site Audit Service Launched

Putting WorldCat Data Into A Triple Store

Leave a Reply Cancel reply