
I cannot really get away with making a statement like “Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them” and then not following it up.
I made it in my previous post Get Yourself a Linked Data Piece of WorldCat to Play With in which I was highlighting the release of a download file containing RDF descriptions of the 1.2 million most highly held resources in WorldCat.org – to make the cut, a resource had to be held by more than 250 libraries.
So here, for those who are interested, is a step-by-step description of what I did to follow my own encouragement to load up the triples and start playing.
Step 1
Choose a triplestore. I followed my own advice and chose 4Store. The main reasons for this choice were that it is open source yet comes from an environment where it was the base platform for a successful commercial business, so it should work. Also, in my years rattling around the semantic web world, 4Store has always been one of those tools that seemed to be on everyone’s recommendation list.
Looking at some of the blurb – 4store is optimised to run on shared-nothing clusters of up to 32 nodes, linked with gigabit Ethernet – at times holding and running queries over databases of 15GT, supporting a Web application used by thousands of people – you may think it might be a bit of overkill for a tool to play with at home, but hey, if it works, does that matter?
Step 2
Operating system. Unsurprisingly for a server product, 4Store was developed to run on Unix-like systems. I had three options. I could resurrect that old Linux-loaded PC in the corner, fire up an Amazon Web Service image with 4Store built in (such as the one built for the Billion Triple Challenge), or I could use the application download for my Mac.
As I only needed it for personal playing, I took the path of least resistance and went for the Mac application. The Mac in question is a fairly modern MacBook Air. The following instructions are therefore Mac oriented, but should not be too difficult to replicate on your OS of choice.
Step 3
Download and install. I downloaded the latest version of the application (15MB) from the download server: http://4store.org/download/macosx/. As with most Mac applications, it was just a matter of opening up the downloaded 4store-1.1.5.dmg file and dragging the 4Store icon into my applications folder. (Time-saving tip: whilst you are doing the next step, you can be downloading the 1GB WorldCat data file in the background, from here.)
Step 4
Setup and load. Clicking on the 4Store application opens up a terminal window to give you command line access to controlling your triple store. Following the simple but effective documentation, I needed to create a dataset, which I called WorldCatMillion:
$ 4s-backend-setup WorldCatMillion
Next start the database:
$ 4s-backend WorldCatMillion
Then I need to load the triples from the WorldCat Most Highly Held data set. This step takes a while – over an hour on my system.
$ 4s-import WorldCatMillion --format ntriples /Users/walllisr/Downloads/WorldCatMostHighlyHeld-2012-05-15.nt
This single command line, which may have wrapped on to more than one line in your browser, looks a bit complicated, but all it is doing is telling the import process to import the file, which I had downloaded and unzipped (automatically on the Mac – you may have to use gunzip on another system), which is formatted as ntriples, into my WorldCatMillion dataset.
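For systems where the download is not unpacked automatically, a minimal sketch of the unpack-and-check step follows. The function names and the `.gz` file name in the usage comment are hypothetical. The only assumption about the data is that N-Triples puts one triple per line, so a line count approximates the triple count.

```shell
#!/bin/sh
# Hypothetical helpers for preparing the download before running 4s-import.

# unpack_dump FILE: if FILE ends in .gz, decompress it next to the original
# (keeping the .gz) and print the path of the decompressed file;
# otherwise print FILE unchanged.
unpack_dump() {
    case "$1" in
        *.gz)
            gunzip -c "$1" > "${1%.gz}" || return 1
            printf '%s\n' "${1%.gz}"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}

# count_triples FILE: count lines that are neither blank nor comments.
# N-Triples holds one triple per line, so this approximates the triple count.
count_triples() {
    grep -c -v -E '^[[:space:]]*(#|$)' "$1"
}

# Usage (the file name is an assumption; adjust it to your download):
#   nt=$(unpack_dump "$HOME/Downloads/WorldCatMostHighlyHeld-2012-05-15.nt.gz")
#   count_triples "$nt"
```

A count in the region of 80 million would suggest the file arrived intact before you commit to the hour-long import.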
Now to start the http server to access it:
$ 4s-httpd -p 8000 WorldCatMillion
A quick test to see if it all worked:
$ 4s-query WorldCatMillion 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
This should output some XML-encoded triples.
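Collected into one script, the Step 4 commands might look like the sketch below. It uses only the commands and options from this step, in the same order (import first, then 4s-httpd), written with plain ASCII hyphens. The `run` wrapper, the `DRY_RUN` switch, and the `DATASET` and `NT_FILE` variables are hypothetical conveniences. The script also assumes that 4s-backend and 4s-httpd return to the prompt once started, which the sequence above implies.

```shell
#!/bin/sh
# Sketch: the Step 4 commands as one function. Names other than the 4s-*
# commands and their options are hypothetical.
DATASET="${DATASET:-WorldCatMillion}"
NT_FILE="${NT_FILE:-WorldCatMostHighlyHeld-2012-05-15.nt}"

# run CMD...: print the command, then execute it unless DRY_RUN=1.
run() {
    printf '+ %s\n' "$*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Create the dataset, start the backend, import the triples, and only then
# start the HTTP server. Stops at the first command that fails.
load_worldcat() {
    run 4s-backend-setup "$DATASET" &&
    run 4s-backend "$DATASET" &&
    run 4s-import "$DATASET" --format ntriples "$NT_FILE" &&
    run 4s-httpd -p 8000 "$DATASET"
}

# Usage:
#   DRY_RUN=1 load_worldcat   # show the commands without running them
#   load_worldcat             # run them
```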
Step 5
Access via a web browser. I chose Firefox, as it seems to handle unformatted XML better than most. 4Store comes with a very simple SPARQL interface: http://localhost:8000/test/ This comes already populated with a sample query, just press execute and you should get the data back that you got with the command line 4s-query. The server sends it back in an XML format, which your browser may save to disk for you to view – tweaking the browser settings to automatically open these files will make life easier.
Step 6
Some simple SPARQL queries. Try these and see what you get:
Describe a resource:
DESCRIBE <http://www.worldcat.org/oclc/46843162>
Select the genres used (the first 100 distinct values):
SELECT DISTINCT ?o WHERE {
?s <http://schema.org/genre> ?o .
} LIMIT 100 OFFSET 0
Select 100 resources, with a genre triple, outputting the resource URI and its genre. (By adjusting the OFFSET value, you can page through all the results):
SELECT ?s ?o WHERE {
?s <http://schema.org/genre> ?o .
} LIMIT 100 OFFSET 0
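Paging by hand means editing the OFFSET each time. As a sketch, the query can be generated from a page number instead. `genre_page_query` and `PAGE_SIZE` are hypothetical names, and the query is written with space-separated variables (`?s ?o`), the form the SPARQL grammar specifies.

```shell
#!/bin/sh
# Hypothetical helper: build the genre query for a given zero-based page.
PAGE_SIZE=100

# genre_page_query PAGE: print the query with OFFSET = PAGE * PAGE_SIZE.
genre_page_query() {
    offset=$(( $1 * PAGE_SIZE ))
    printf 'SELECT ?s ?o WHERE { ?s <http://schema.org/genre> ?o . } LIMIT %d OFFSET %d\n' \
        "$PAGE_SIZE" "$offset"
}

# Usage with the command-line client from Step 4:
#   4s-query WorldCatMillion "$(genre_page_query 0)"   # results 1-100
#   4s-query WorldCatMillion "$(genre_page_query 1)"   # results 101-200
```

Without an ORDER BY clause, SPARQL does not guarantee that results come back in the same order from one query to the next, so pages fetched this way may overlap or skip rows. For casual exploration that is usually acceptable.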
OK, that is a start; now I need to play a bit to brush up on my SPARQL!
[...] Putting WorldCat Data Into A Triple Store by Richard Wallis. [...]
[...] followed up his recent announcement that WorldCat data can now be downloaded as RDF triples with an explanation of how to put that data into a triple store. He begins: “Step 1: Choose a triplestore. I followed my own advise and chose 4Store. [...]
A note for Ubuntu users: I ran into a problem when I tried importing the dataset (Ubuntu 12.04). The error message I always got contained IPv6 addresses. After some googling I found that I had to disable IPv6 in /etc/sysctl.conf and reboot the machine. The process is described in this blog entry: http://dannyayers.com/2011/03/26/4store-on-Ubuntu. It worked for me.
Thanks for the instructions Richard. At least with 4store 1.1.4 (and possibly others) it’s important to load the data before starting the 4s-httpd service or else you will see “waiting for lock” errors. Apparently there can only be one client open to 4store at a time, and since 4s-httpd allows you to write data via SPARQL you cannot load data while 4s-httpd is running. More info is available here: https://groups.google.com/group/4store-support/tree/browse_frm/month/2010-11?hide_quotes=no&pli=1
Thanks Ed for the heads up on this.
I have edited the post, starting 4s-httpd after the data load.
[...] Putting Worldcat Data Into a Triple Store (DataLiberate) – A step-by-step guide to using a popular data store for your own purposes… [...]
A little warning: the Mac OS X README file says, “Unfortunately this App will only work on Intel Macs running Snow Leopard (10.6).”
[...] Interestingly, it also publishes Schema.org RDFa markup for all its bibliographic resources, and a large chunk of these resources (i.e. the 1.2M bibliographic resources held by at least 250 libraries) are also available as a RDF dataset that one can easily load into a triplestore such as 4Store. [...]
Thank you! This is a great project. I’m working with colleagues on building out a test RDF library catalog that will link out to this data but also we’re hoping to bootstrap the example work here.
We have a couple of questions about process: a) Does OCLC have one stylesheet they use for transforms out of MARCXML to RDF, or is it a multi-threaded process? Can we take a look at that stylesheet if it exists? b) We noticed several “nodeIDs”. For example: "http://schema.org/" rdf:nodeID="b192f4100000000e5"/>. What does “b192f4100000000e5” mean?
The current process is strictly experimental. From beginning to end, it uses internal data structures and mostly runs as Map/Reduce jobs on a Hadoop cluster. The logic for many (but certainly not all) of the mappings COULD be expressed as XSL Transforms on MARCXML records, but the richer internal formats have been more convenient for doing experimentation. As these mappings firm up, though, it’s certainly true that they should be documented in terms of MARCXML when possible. We’ll work on that as we engage with the community on using/extending Schema.org.
The rdf:nodeID indicates a “blank node”. You’ll see these used in situations where headings aren’t authority controlled. As our matching algorithms improve, these should gradually get replaced with rdf:about and rdf:resource with resolvable http URIs instead.
cool, thanks!
[...] the Richard Wallis advice (see reference) 4Store was considered a good tool useful for our needs. There are a number of reasons we like this [...]
[...] first task, obviously, is choosing the triplestore application. Following Richard Wallis’ advice (see reference) 4Store was considered a good tool, useful for our needs. There are a number of reasons why we like [...]