Over the next few months, Google’s search engine will begin spitting out more than a list of blue Web links. It will also present more facts and direct answers to queries at the top of the search-results page.
They are going about this by developing a search engine that will better match search queries with a database containing hundreds of millions of “entities”—people, places and things—which the company has quietly amassed over the past two years.
The ‘amassing’ got a kick-start in 2010 with the Metaweb acquisition, which brought Freebase and its 12 million entities into the Google fold. It continues now with the harvesting of the HTML-embedded, schema.org-encoded structured data that is starting to spread across the web.
The encouragement for webmasters and SEO folks to go to the trouble of inserting this information into their HTML is the prospect of a better result display for their page – Rich Snippets. A nice trade-off from Google – you embed the information we want/need for a better search, and we will give you better results.
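To make the trade-off concrete, here is a minimal sketch of what that embedded structured data looks like and how a crawler might lift it out. The page fragment, the book details, and the extractor class are all invented for illustration – this is nothing like Google’s actual harvesting pipeline, just the general idea of Microdata `itemprop` attributes carrying machine-readable facts.

```python
from html.parser import HTMLParser

# A hypothetical page fragment marked up with schema.org Microdata,
# as a webmaster might do in the hope of earning a Rich Snippet.
PAGE = """
<div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">The Old Man and the Sea</span>
  by <span itemprop="author">Ernest Hemingway</span>
</div>
"""

class ItempropExtractor(HTMLParser):
    """Collects (itemprop, text) pairs, roughly as a crawler might."""
    def __init__(self):
        super().__init__()
        self._prop = None
        self.items = {}

    def handle_starttag(self, tag, attrs):
        # Remember the itemprop (if any) so the next text node can claim it.
        self._prop = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self._prop and data.strip():
            self.items[self._prop] = data.strip()
            self._prop = None

parser = ItempropExtractor()
parser.feed(PAGE)
print(parser.items)  # {'name': 'The Old Man and the Sea', 'author': 'Ernest Hemingway'}
```

The point is that the same page still renders perfectly well for humans; the extra attributes only matter to machines.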
The premise of what Google are up to is that it will deliver better search. Yes, this should be true; however, I would suggest that the major benefit to us mortal Googlers will be better results. The search engine should appear to have greater intuition as to what we are looking for, but what we should also get is more information about the things that it finds for us. This is the step-change. In addition to web page links, we will be getting information about things – the location, altitude, average temperature or salt content of a lake – whereas today you would only get links to the lake’s visitor centre or a Wikipedia page.
Another example quoted in the article:
…people who search for a particular novelist like Ernest Hemingway could, under the new system, find a list of the author’s books they could browse through and information pages about other related authors or books, according to people familiar with the company’s plans. Presumably Google could suggest books to buy, too.
Many in the library community may note this with scepticism, seeing it as too simplistic an approach to something they have been striving towards for many years with only limited success. I would say that they should be helping the search engine supplier(s) do this right and be part of the process. There is great danger that, for better or worse, whatever Google does will make the library search interface irrelevant.
As an advocate for Linked Data, it is great to see the benefits of defining entities, and describing the relationships between them, being taken seriously. I’m not sure I buy into the term ‘Semantic Search’ as a name for what will result. I tend more towards ‘Semantic Discovery’, which is more descriptive of where the semantics kick in – in the relationships between a searched-for thing, its attributes, and other entities. However, I’ve been around far too long to get hung up about labels.
Whilst we are on the topic of labels, I am in danger of stepping into the almost religious debate about the relative merits of Microdata and RDFa as the encoding method for embedding schema.org. Google recognises both, both are ugly for humans to hand-code, and webmasters should not have to care. Once the CMS suppliers get up to speed in supplying modules to embed this stuff automatically, as per this Drupal module, they won’t have to.
Why is it that those sceptical about a new technology resort, within a very few sentences, to the ‘show me the Killer App’ line? As if the appearance of a gold-star, bloggerati-approved example of something useful implemented in said technology is going to change their mind.
Remember the Web – what was the Web’s Killer App?
Step back a little further – what was the Killer App for HTML?
Answers in the comments section for both of the above.
I relate this to a blog post by Roy Tennant, Senior Program Officer for OCLC Research. I’m sure Roy won’t mind me picking on him as a handy current example of a much wider trend. In his post, in which he asserts that Microdata, Not RDF, Will Power the Semantic Web (and which elicits an interesting comment stream), he says:
Twelve years ago I basically called the Resource Description Framework (RDF) “dead on arrival”. That was perhaps too harsh of an assessment, but I had my reasons and since then I haven’t had a lot of motivation to regret those words. Clearly there is a great deal more data available in RDF-encoded form now than there was then.
But I’m still waiting for the killer app. Or really any app at all. Show me something that solves a problem or fulfills a need I have that requires RDF to function. Go ahead. I’ll wait.
Oh, you’ve got nothing? Well then, keep reading.
I could go off on a rant about something that was supposedly dead a decade ago still having trouble lying down; or about how comparing Microdata and RDF as vehicles for describing and relating things is like comparing text files to XML; or even that the value is not in the [Microdata or RDF] encoding, but in the linking of things, concepts, and relationships – but that is for another time.
So to the search for Killer Apps.
Was VisiCalc a Killer App for the PC – yes, it probably was. Was Windows a Killer App – more difficult to answer. Is/was Windows an app, or something more general in technology-wave terms, that would beget its own Killer Apps? What about hypertext? Ignoring all the pre-computer-era work on its principles, I would contend that its first Killer App was the Windows Help system – WinHelp. Although, with a bit of assistance from the HTML concept of the href, it was somewhat eclipsed by the Web.
The further we evolve up the breakfast-pancake-like stack of technologies, standards, infrastructures, and ecosystems – away from the bit of silicon at the core of what we still belittle with the simple name of computer – the more vague and pointless our search for a Killer Use, and therefore a Killer App, becomes.
For a short period in history you could have considered the killer application of the internal combustion engine to be the automobile, but it wasn’t long before those two became intimately linked into a single entity with more applications than you could shake a stick at – all and none of which could be considered to be the killer.
Back to my domain, data. As I have postulated previously, I believe we are nearing a point where data – its use, our access to it, and the attention and action large players are giving to it – is going to fundamentally change the way we do things. Change that will come about not from a radical change in what we do, but from a gradual adoption of several techniques and technologies atop what we already have. As I have also said before, Linked Data (and its data format RDF) will be a success when we stop talking about it, as it becomes just a tool in the bag. A tool that is used when and where appropriate, and mundane enough not to warrant a Linked Data Powered sticker on the side.
Take the apps being built on the Kasabi [Linked Data powered] platform. Will the users of these apps be aware of the Linked Data (and RDF) under the hood? No. Will they benefit from it? Yes – the aggregation and linking capabilities should deliver a better experience. Are any of them Killers? I doubt it, but they are no less worthwhile for that.
So stop searching for that Killer App; you may be fruitlessly hunting for a long time. When someone invents a pocket-sized fusion reactor or the teleport, they might be back in vogue.
Prospector picture from ToOliver2 on Flickr. Pancakes picture from eyeliam on Flickr.
What relevance does Linked Data have for a City’s food supply you may ask. As Chris put it:
…We live in a world where the agri-food supply chain, from producer all the way through to final consumer, is extremely inefficient in the flow of knowledge. It is very good at delivering food to your table, but we don’t know where it comes from, what its history is. That has great implications in various scenarios, for example when there are food emergencies – E. coli, things of that sort.
My vision is that, with the application of Semantic Web and Linked Data technologies along the food supply chain, it will be easier for all actors along it to know more about where their food comes from and where their food goes. This will also create opportunities for new business models where local food will be more easily integrated into the overall consumption patterns of local communities.
This is a vision that can be applied to many of our traditional supply chains. Industries have become very efficient at producing, building, and delivering often very complex things to customers and consumers, but only the thing travels along the chain; it is not accompanied by information about the thing, other than what you may find on a label or delivery note. These supply chains are highly tuned processes that the logisticians will tell you have had almost every drop of efficiency squeezed out of them already. Yet information about all aspects of, and steps within, a chain could allow parts of the chain to react, applying some local agility, feedback, and previously hidden efficiencies.
Another example of a traditional chain with an, on the surface, poor information supply chain is sustainable wood supply, as covered by the BBC You & Yours radio programme today (about 43 minutes in), coincidentally within minutes of me watching Dr Brewster.
The Forest Stewardship Council has had a problem where one of their producers had part of their licence revoked but apparently still applied the FSC label to the wood they were shipping. Some of this wood travelled through the supply chain and was unwittingly displayed on UK retailers’ shelves as certified sustainable wood. Listening to the FSC representative, it was clear that if an integrated information supply network had been available, the chances of this happening would have been decreased, or it would at least have been identified sooner.
All very well, but why Linked Data?
One of the characteristics of supply chains is that they tend to involve many differing organisations engaged in many differing processes – cultivation, packing, assembly, manufacture, shipping, distribution, retailing, etc. Traditionally, the computerisation of information flow between differing organisations, with their differing systems and procedures, has been a difficult nut to crack. Getting all the players to agree and conform is an almost impossible task. One of the many benefits of Linked Data is the ability to extract data from disparate sources describing different things and aggregate them together. Yes, you need some loose coordination between the parties around the identification of concepts etc., but you do not need to enforce a regimented vanilla system everywhere.
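The aggregation point above can be sketched very simply. Two parties in a hypothetical supply chain describe the same consignment; because both use the same URI to identify it, their statements merge into one description without either party adopting the other’s internal system. All URIs, property names, and data here are invented for illustration.

```python
# Illustrative URI shared by everyone who handles this consignment.
CONSIGNMENT = "http://example.org/consignment/42"

# Each party publishes simple (subject, property, value) statements
# from its own system, in its own vocabulary.
grower_data = [
    (CONSIGNMENT, "producedBy", "Hill Farm"),
    (CONSIGNMENT, "harvested", "2012-03-01"),
]
shipper_data = [
    (CONSIGNMENT, "shippedVia", "Rotterdam"),
    (CONSIGNMENT, "arrived", "2012-03-05"),
]

def merge(*sources):
    """Aggregate statements from disparate sources into one description per thing."""
    things = {}
    for source in sources:
        for subject, prop, value in source:
            things.setdefault(subject, {})[prop] = value
    return things

merged = merge(grower_data, shipper_data)
# A retailer can now see the whole history without any party having
# conformed to a single centralised system.
print(merged[CONSIGNMENT]["producedBy"], merged[CONSIGNMENT]["arrived"])
```

The only coordination required was agreement on the identifier – which is precisely the “loose coordination around identification of concepts” described above.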
The automotive industry has already hooked into this to address the problem of disseminating the mass of information around models of cars and their options. There was a great panel on the second day of the Semantic Tech and Business Conference in Berlin last month:
My takeaway from the panel: here is an industry that pragmatically picked up a technology that can not only solve its problems but also enable it to take innovative steps – not only for individual company competitive advantage, but also to move the industry forward in its dealings with its supply/value chain and customers. However, they are also looking more broadly and openly, for instance to make data publicly available which will enhance the used car market.
So back to food. The local food part of Dr Brewster’s new business model vision stems from the fact that it should be easier for a local producer to broadcast the availability of their produce to the world. Similarly, it should be easier for a retailer to tune in to that information in an agile way, and not only become aware of the produce but also be linked to information about the supplier.
Food and Linked Data is also something the team at Kasabi have been focussing on recently. Because of the Linked Data infrastructure underpinning the Kasabi Data Marketplace, they have been able to produce an aggregated Food dataset, initially from BBC and Foodista sources.
As the dataset is updated, the Data Team will broaden the sources of food data, and increase the data quality for those passionate about food. They’ll be adding resources and improving the links between them to include things like: chefs, diets, seasonality information, and more.
Food aims to answer questions such as:
I fancy cooking something with “X”, but I don’t like “Y” – what shall I cook?
I am pregnant and vegan, what should I prepare for dinner?
Ambitiously, it could also provide data to be used to aid the invention of new recipes based on the co-occurrence of ingredients.
Answering questions like “how can I create something new from what I have?” is one of those difficult-to-measure, yet nevertheless very apparent, benefits of using Linked Data techniques and technologies.
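The first of those example questions is easy to sketch once recipes and ingredients are linked data rather than free text. The toy dataset below is invented (it is not drawn from the actual Kasabi Food dataset); the point is only that “with X but without Y” becomes a trivial query over linked ingredients.

```python
# A toy, invented recipe-to-ingredients dataset.
recipes = {
    "Spaghetti carbonara": {"pasta", "egg", "bacon"},
    "Mushroom omelette": {"egg", "mushroom", "butter"},
    "Bacon sandwich": {"bread", "bacon", "butter"},
}

def suggest(want, avoid):
    """Recipes containing `want` and nothing from the `avoid` set."""
    return sorted(
        name for name, ingredients in recipes.items()
        if want in ingredients and not (avoid & ingredients)
    )

# "I fancy cooking something with egg, but I don't like bacon."
print(suggest("egg", {"bacon"}))  # ['Mushroom omelette']
```

Extend the `avoid` set with everything unsuitable for a pregnant vegan and the second example question falls out of the same query shape.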
It is very easy to imagine the [Linked Data] enhanced food supply chain of Chris’ vision integrated/aggregated with an evolved Kasabi Food dataset answering questions such as “what can I make for dinner which is wheat-free, contains ingredients grown locally that are available from the local major-chain-supermarket which has disabled parking bays?”.
A bit utopian I know, but what differs today from the similar visions that accompanied Tim Berners-Lee’s original Semantic Web descriptions is that folks like those in the automotive industry and at Kasabi are demonstrating bits of it already.
Bee on plate image from Kasabi.
Declaration: I am a shareholder of Kasabi parent company Talis.
Even more recently I note that OCLC, at their EMEA Regional Council Meeting in Birmingham this week, see Linked Data as an important topic on the library agenda.
The consequence of this rise in interest in library Linked Data is that the community is now exploring and debating how to migrate library records from formats such as MARC into this new RDF. In my opinion there is a great danger here of getting bogged down in the detail of how to represent every scintilla of information from a library record in every Linked Data view that might represent the thing that record describes. This is hardly surprising, as most of those engaged in the debate come from an experience where, if something was not preserved on a physical or virtual record card, it would be lost forever. By concentrating on record/format transformation, I believe they are using a Linked Data telescope to view their problem, but are not necessarily looking through the correct end of that telescope.
Let me explain what I mean by this. There is massive duplication of information in library catalogues. For example, every library record describing a copy of a book about a certain boy wizard will contain one or more variations of the string of characters “Rowling, K. J”. To us humans it is fairly easy to infer that all of them represent the same person, as described by each cataloguer. To a computer, they are just strings of characters.
OCLC host the Virtual International Authority File (VIAF) project, which draws together these strings of characters and produces a global identifier for each author. Associated with that author, they collect the local-language representations of their name.
One simple step down the Linked Data road would be to replace those strings of characters in those records with the relevant VIAF permalink, or URI – http://viaf.org/viaf/116796842/. One result of this would be that your system could follow that link and return an authoritative naming of that person, with the added benefit of it being available in several languages. A secondary, and more powerful, result is that any process scanning such records can identify exactly which [VIAF identified] person is the creator, regardless of the local language or formatting practices.
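That substitution step is simple enough to sketch. The VIAF URI below is the real one quoted above; the local name variants, the lookup table, and the sample records are invented for illustration of how differently formatted strings collapse to one identifier.

```python
# The real VIAF permalink referred to above.
ROWLING = "http://viaf.org/viaf/116796842/"

# Hypothetical local cataloguing variants, mapped to one identifier.
authority = {
    "Rowling, K. J": ROWLING,
    "Rowling, J.K.": ROWLING,
    "J. K. Rowling": ROWLING,
}

# Invented records from two different catalogues.
records = [
    {"title": "Harry Potter and the Philosopher's Stone", "creator": "Rowling, K. J"},
    {"title": "Harry Potter et la Chambre des Secrets", "creator": "J. K. Rowling"},
]

# Replace each locally formatted string with the shared URI,
# leaving unrecognised strings untouched.
for record in records:
    record["creator"] = authority.get(record["creator"], record["creator"])

# Any process scanning the records can now see they share one creator,
# regardless of local language or formatting practice.
same_person = len({r["creator"] for r in records}) == 1
print(same_person)  # True
```

A system can still follow the URI to fetch an authoritative, multilingual rendering of the name whenever a human-readable label is needed.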
Why stop at identifying creators with globally unique identifiers? Why not use an identifier to represent the combined concept of a text, authored by a person, published by an organisation, in the form of a book – each of those elements having their own unique identifiers? If you enabled such a system on the Linked Data web, what would a local library catalogue need to contain? Probably only a local identifier of some sort, with links to local information such as supplier, price, date of purchase, licence conditions, physical location, etc., plus a link to the global description provided by a respected source such as Open Library, Library of Congress, the British Library, OCLC, etc. A very different view of what might constitute a record in a local library.
So far I have looked at this from the library point of view. What about the view from the rest of the world?
I contend that most of those wishing to reference books and journal articles, curated and provided by libraries, would be happiest if they could refer to a global identifier that represents the concept of a particular work. Such consumers would only need a small sub-set of the data assembled by a library for basic display and indexing purposes – title and author. The next question may be: where is there a locally available copy of this book or article that I can access? In the model I describe, where these global identifiers are linked to local information such as loan status, the lookup would be a simple process compared with today’s contrived search against inferred strings of characters.
Currently Google and other search engines have great difficulty in managing the massive number of library catalogue pages that will match a search for a book title. As referred to previously, Google are assembling a graph of related things. In this context the thing is the concept of the book or article, not the thousands of library catalogue pages describing the same thing.
Pulling these thoughts together, and looking down the Linked Data telescope from the non-library end, I envisage a layered approach to accessing library data.
A simple global identifier, or interlinked identifiers from several respected sources, that represents the concept of a particular thing (book, article, etc.)
A simple set of high-level description information for each thing – links to author, title, etc., associated with the identifier. This level of information would be sufficient for many uses on the web and could contain only publicly available information.
For those wishing for more in-depth bibliographic information, those unique identifiers, either directly or via SameAs links, could link you to more of the rich resources catalogued by libraries around the world, which may or may not be behind slightly less open licensing or commercial constraints.
Finally library holding/access information would be available, separate from the constraints of the bibliographic information, but indexed by those global identifiers.
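The layered model above can be sketched as a toy lookup: a global identifier keys a thin public description (layer two), while local holdings are indexed separately by the same identifier (the final layer). Every URI, library name, and field here is invented purely to show the shape of the access pattern.

```python
# Invented global identifier for the concept of a work.
WORK = "http://example.org/work/hamlet"

# Layer 2: a thin, public description - enough for display and indexing.
descriptions = {
    WORK: {"title": "Hamlet", "author": "William Shakespeare"},
}

# Final layer: local holdings data, kept separate from the bibliographic
# description but indexed by the same global identifier.
holdings = {
    WORK: [
        {"library": "Anytown Central", "status": "on loan"},
        {"library": "Anytown North", "status": "available"},
    ],
}

def where_can_i_read(work_uri):
    """Resolve a work's identifier to libraries holding an available copy."""
    title = descriptions[work_uri]["title"]
    available = [h["library"] for h in holdings[work_uri] if h["status"] == "available"]
    return title, available

print(where_can_i_read(WORK))  # ('Hamlet', ['Anytown North'])
```

Because every layer hangs off the same identifier, a consumer can stop at whichever depth of information it needs, which is the whole point of the layering.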
To get us to such a state will require a couple of changes in the way libraries do things.
Firstly, the rich data collated in current library records should be used to populate a Linked Data model of the things those records describe – not just to reproduce the records we have in another format. An approach I expanded upon in a previous post, Create Data Not Records.
Secondly, as such a change would be a massive undertaking, libraries will need to work together to do this. The centralised library data holders have a great opportunity to drive this forward. A few years ago, the distributed, hosted-on-site landscape of library management systems would have prevented such a change happening. However, with library system software-as-a-service becoming an increasingly viable option for many, it is not the libraries that would have to change, just the suppliers of the systems they use.
Most Semantic Web and Linked Data enthusiasts will tell you that Linked Data is not rocket science, and it is not. They will tell you that RDF is one of the simplest data forms for describing things, and they are right. They will tell you that adopting Linked Data makes merging disparate datasets much easier to do, and it does. They will say that publishing persistent globally addressable URIs (identifiers) for your things and concepts will make it easier for others to reference and share them, it will. They will tell you that it will enable you to add value to your data by linking to and drawing in data from the Linked Open Data Cloud, and they are right on that too. Linked Data technology, they will say, is easy to get hold of either by downloading open source or from the cloud, yup just go ahead and use it. They will make you aware of an ever increasing number of tools to extract your current data and transform it into RDF, no problem there then.
So would I recommend a self-taught do-it-yourself approach to adopting Linked Data? For an enthusiastic individual, maybe. For a company or organisation wanting to get to know and then identify the potential benefits, no I would not. Does this mean I recommend outsourcing all things Linked Data to a third party – definitely not.
Let me explain this apparent contradiction. I believe that anyone who holds, or could benefit from consuming, significant amounts of data can realise benefits by adopting Linked Data techniques and technologies. These benefits could be in the form of efficiencies, data enrichment, new insights, SEO benefits, or even new business models. The full effect of these benefits will only come from adopting not just the technologies but also the different way of thinking, often called open-world thinking, that comes from understanding the Linked Data approach in your context. That change of thinking, and the agility it brings, will only embed in your organisation if you do it yourself. However, I do counsel care in the way you approach gaining this understanding.
A young child wishing to keep up with her friends by migrating from tricycle to bicycle may have a go herself, but may well give up after the third grazed knee. The helpful, if out of breath, dad jogging along behind – providing a stabilising hand, helpful guidance, encouragement, and warnings to stay at the side of the road – will result in a far less painful and more rewarding experience.
I am aware of computer/business professionals who are not aware of what Linked Data is, or the benefits it could provide. There are others who have looked at it, do not see how it could be better, but do see potential grazed knees if they go down that path. And there are yet others who have had a go, but without a steadying hand to guide them, and have ended up still not getting it.
You want to understand how Linked Data could benefit your organisation? Get some help to relate the benefits to your issues, challenges and opportunities. Don’t go off to a third party and get them to implement something for you; bring in a steadying hand, encouragement, and guidance to stay on track. Don’t go off and purchase expensive hardware and software to help you explore the benefits of Linked Data – there are plenty of open source stores, or, even better, just sign up to a cloud-based service such as Kasabi. Get your head around what you have, how you are going to publish and link it, and what the usage might be. Then you can size and specify the technology and/or service you need to support it.
So back to my original question – is Linked Data DIY a good idea? Yes it is. It is the only way to reap the ‘different way of thinking’ benefits that accompany understanding the application of Linked Data in your organisation. However, I would not recommend a do-it-yourself introduction to it. Get yourself a steadying hand.
Is that last statement a thinly veiled pitch for my services – of course it is, but that should not dilute my advice to get some help when you start, even if it is not from me.
Picture of girl learning to ride from zsoltika on Flickr.
Source of cartoon unknown.
Europeana recently launched an excellent short animation explaining what Linked Open Data is and why it’s a good thing, both for users and for data providers. They did this in support of the release of a large amount of Linked Open Data describing cultural heritage assets held in Libraries, Museums, Galleries and other institutions across Europe.
Europeana, as an aggregator of, and proxy for, data supplied by other institutions, is in a difficult position. They not only want to publish this information for the benefit of Europe and the wider world; they also need to maintain the provenance of, and relationships between, the submissions of data from their partner organisations. I believe that the EDM is the result of the second of these two priorities taking precedence, their proxy role being reflected in the structure of the data. The effect is that a potential consumer of their data, who is not versed in Europeana and their challenges, will need to understand their model before being able to identify that the Cartographer: Ryther, Augustus created the Cittie of London 31.
Fortunately as their technical overview indicates, this is a pilot and the team at Europeana are open to suggestion, particularly on the issue of providing information at the item level in the data model:
Depending on the feedback received during this pilot, we may change this and duplicate all the descriptive metadata at the level of the item URI. Such an option is costly in terms of data verbosity, but it would enable easier access to metadata, for data consumers less concerned about provenance.
In the interests of this data becoming useful, valuable, and easily consumable for those outside of the Europeana partner grouping, I encourage you to lobby them to take a hit on the duplication of some data.
Europeana have launched an excellent short (03:42) video animation explaining what Linked Open Data is and why it’s a good thing, both for users and for data providers.
Its purpose is to support their publication of a Linked Open Data representation of their cultural heritage descriptive assets, an aggregation of descriptions from their contributing organisations. I will cover the ramifications and benefits of this elsewhere.
For now, check out the video and drop it into your favourites, ready to send to anyone who asks the ‘what is Linked Data?’ question.
Some in the surfing community will tell you that every seventh wave is a big one. I am getting the feeling, in the world of the Web, that a number seven is up next, and this one is all about data. The last seventh wave was the Web itself. Because of that, it is a little constraining to talk about this next one only affecting the world of the Web. This one has the potential to shift some significant rocks around on all our beaches and change the way we all interact with, and think about, the world around us.
Sticking with the seashore metaphor for a short while longer: waves from the technology ocean have the potential to wash into the bays and coves of interest on the coast of human endeavour and rearrange the pebbles on our beaches. Some do not reach every cove, and/or only have minor impact; however, some really big waves reach in everywhere to churn up the sand and rocks, significantly changing the way we do things and ultimately think about the world around us. The post-Web technology waves have brought smaller yet important influences such as ecommerce, social networking, and streaming.
I believe Data, or more precisely changes in how we create, consume, and interact with data, has the potential to deliver a seventh wave impact. Enough of the grandiose metaphors and down to business.
Data has been around for centuries, from clay tablets to little cataloguing tags on the ends of scrolls in ancient libraries, and on into the computerised databases we have been accumulating since the 1960s. Until very recently these [digital] data have been closed – constrained by the systems that used them, only exposed to the wider world via user interfaces and possibly a task- or product-specific API. With the advent of many data-associated advances, variously labelled Big Data, Social Networking, Open Data, Cloud Services, Linked Data, Microformats, Microdata, Semantic Web, Enterprise Data, it is now venturing beyond those closed systems into the wider world.
Well this is nothing new, you might say; these trends have been around for a while – why do they constitute the seventh wave you foretell?
It is precisely because these trends have been around for a while, and are starting to mature and influence each other, that they are building into something really significant. Take Open Data, for instance, where governments have been at the forefront – I have reported before on the almost daily announcements of open government data initiatives. The announcement from the Dutch City of Enschede this week not only talks about their data but also about the open-sourcing of the platform they use to manage and publish it, so that others can share in the way they do it.
I might find some of the activities in Cloud Computing short-sighted and depressing, yet already the concept of housing your data somewhere other than a local datacenter is becoming accepted in most industries.
Enterprise use of Linked Data by leading organisations, such as the BBC who are underpinning their online Olympics coverage with it, is showing that it is more than a research tool, or the province only of the open data enthusiasts.
Data Marketplaces are emerging to provide platforms to share, and possibly monetise, your data. An example that takes this one step further is Kasabi.com from the leading Semantic Web technology company, Talis. Kasabi introduces the mixing, merging, and standardised querying of Linked Data into the data publishing concept. This potentially provides a platform for refining and mixing raw data into new data alloys and products more valuable and useful than their component parts – an approach that should stimulate innovation both in the enterprise and in the data enthusiast community.
The Big Data community is demonstrating that there are solutions to handling the vast volumes of data we are producing, solutions that require us to move out of the silos of relational databases towards a mixed economy. Programs need to move, not the data. NoSQL databases, Hadoop, map/reduce – these are all things that are starting to move out of the labs and the hacker communities into the mainstream.
The Social Networking industry, which produces tons of data, is a rich field for things like sentiment analysis, trend spotting, targeted advertising, and even short-term predictions. Innovation in this field has been rapid, but I would suggest a little hampered by delivering closed, individual solutions that do not yet interact with the wider world, which could place them in context.
I wrote about Schema.org a while back – an initiative from the search engine big three to encourage the SEO industry to embed simple structured data in their HTML. The carrot they are offering for this effort is enhanced display in results listings – Google calls these Rich Snippets. When first announced, the schema.org folks concentrated on Microdata as the embedding format – something that wouldn’t frighten the SEO community horses too much. However, they did [over a background of loud complaining from the Semantic Web / Linked Data enthusiasts that RDFa was the only way] also indicate that RDFa would eventually be supported. By engaging with SEO folks on terms that they understand, this move from Schema.org had the potential to get far more structured data published on the Web than any TED Talk from Sir Tim Berners-Lee, preaching from people like me, or guidelines from governments could ever do.
The above short list of pebble-stirring waves is both impressive in its breadth and encouraging in its potential, yet none of them are the stuff of a seventh wave.
So what caused me to open up my MacBook and start writing this? It was a post from Manu Sporny, indicating that Google were not waiting for RDFa 1.1 Lite (the RDFa version that schema.org will support) to be ratified. They are already harvesting, and using, structured information from web pages that has been encoded using RDFa. The use of this structured data has resulted in enhanced display on the Google pages, with items such as event date and location information, and recipe preparation timings.
Manu references sites that seem to be running Drupal, the open source CMS software, and specifically a Drupal plug-in for rendering Schema.org data encoded as RDFa. This approach answers some of the critics of embedding Schema.org data into a site’s html, especially as RDFa, who say it is ugly and difficult to understand. It is not there for humans to parse or understand and, with modules such as the Drupal one, humans will not need to get their hands dirty down at code level. Currently Schema.org supports a small but important number of ‘things’ in its recognised vocabularies. These, currently supplemented by GoodRelations and Recipes, will hopefully be joined by others to broaden the scope of descriptive opportunities.
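To give a feel for what this embedding looks like, here is a minimal, hypothetical fragment marked up with the schema.org vocabulary using RDFa Lite attributes – the recipe, author and wording are all invented for illustration:

```html
<div vocab="http://schema.org/" typeof="Recipe">
  <h2 property="name">Lemon Drizzle Cake</h2>
  <p>By <span property="author">Jane Baker</span></p>
  <p property="description">A simple, sharp lemon sponge.</p>
</div>
```

A crawler that understands RDFa can lift the name, author and description straight out of that markup as triples about the recipe – the raw material for the Rich Snippets mentioned above.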
So roll the clock forward, not too far, to a landscape where a large number of sites (incentivised by the prospect of listings as enriched as their competitors’ results) are embedding structured data in their pages as normal practice. By then most, if not all, web site delivery tools should be able to embed the Schema.org RDFa data automatically. Google and the other web-crawling organisations will rapidly build up a global graph of the things on the web, their types, their relationships, and the pages that describe them. It is a nifty example of providing a very specific, easily understood benefit in return for a change in the way web sites are delivered – one that results in a global shift in the amount of structured data accessible for the benefit of all. Google Fellow and SVP Amit Singhal recently gave insight into this Knowledge Graph idea.
The Semantic Web / Linked Data proponents have been trying to convince everyone else of the great good that will follow once we have a web interlinked at the data level with meaning attached to those links. So far this evangelism has had little success. However, this shift may give them what they want via an unexpected route.
Once such a web emerges, and most importantly is understood by the commercial world, innovations that will influence the way we interact will naturally follow. A Google TV, with access to such a rich resource, should have no problem delivering an enhanced viewing experience by following structured links embedded in a programme page to information about the cast, the book of the film, the statistics that underpin the topic, or other programmes from the same production company. The next-but-one version of our iPhone could be a personal node in a global data network, providing access to relevant information about our location, activities, social network, and tasks.
These slightly futuristic predictions will only become possible on top of a structured network of data, which I believe is what could very well emerge if you follow through on the signs that Manu is pointing out. Reinforced by, and combining with, the other developments I referenced earlier in this post, I believe we may well have a seventh wave approaching. Perhaps I should look at the beach again in five years’ time to see if I was right.
Wave photo from Nathan Gibbs on Flickr
Declarations – I am a Kasabi Partner and shareholder in Kasabi parent company Talis.
When moving to a new place we bring loads of baggage and stuff from our old house that we feel will be necessary in our new abode. One poke of the head into your attic will clearly demonstrate how much of that necessary stuff was not so necessary after all. Yet you will also find things lurking around the house that may have survived several moves. The moral here is that, although some things are core to what we are and do, it is difficult to predict what we will need in a new situation.
This is also how it is when we move to doing things in a different way – describing our assets in RDF for instance.
I have watched many flounder when they first try to get their head around describing the things they already know in this new Linked Data format, RDF. Just like moving house, we initially grasp for the familiar and that might not always be helpful.
So. You understand your data, you are comfortable with XML, you recognise some familiar vocabularies that someone has published in RDF/XML, this recreation in RDF sounds simple-ish, you take a look at some other data in RDF/XML to see what it should look like, and… your brain freezes as you try to work out how you nest one vocabulary within another and how your schema should look.
This is where stepping back from the XML is a good idea. XML is only one encoding/transmission format for RDF – a way of encoding RDF so that it can be transmitted from one machine process to another. XML is ugly. XML is hierarchical and therefore introduces lots of compromises when imparting the graph nature of an RDF model. XML brings with it a [hierarchical] way of thinking about data that constrains your decisions when creating a model for your resources.
I suggest you not only step back from the XML, but initially step back from the computer as well.
Get out your white/blackboard or flip-chart & pen and start drawing some ellipses, rectangles, and arrows. You know your domain, so go ahead and draw a picture of it. Draw an ellipse for each significant thing in your domain – each type of object, concept, event, etc. Draw a rectangle for each literal (string, date, number). Beware of strings that are really ‘things’ – things that have other attributes, including the string as a name attribute. Draw arrows to show the relationships between your things, and between things and their attributes. Label the arrows to define those relationships – don’t worry about vocabularies yet; use simple labels such as name, manufacturer, creator, publication event, discount, owner, etc. Create identifiers for your things in the form of URIs, so that they will be unique and accessible when you eventually publish them. What you end up with should look a bit like this slide from my recent presentation at the Semantic Tech and Business Conference in Berlin – the full deck is on SlideShare. Note how I have also represented the resources in the model in a machine-readable form of RDF – yes, you can now turn to the computer again.
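For example, a couple of the labelled arrows from such a drawing might be written down as plain triples like this – the URIs and the simple property labels are invented purely for illustration:

```
<http://example.org/id/person/1> <http://example.org/vocab/name> "Fred Smith" .
<http://example.org/id/book/42> <http://example.org/vocab/creator> <http://example.org/id/person/1> .
<http://example.org/id/book/42> <http://example.org/vocab/publicationEvent> <http://example.org/id/event/7> .
```

Each line is one arrow from the drawing: ellipse (subject URI), labelled arrow (predicate), then either another ellipse (a URI) or a rectangle (a literal).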
Once back at the computer you can now start referring to generally used vocabularies such as foaf, dcterms, etc. to identify generally recognised labels for relationships – an obvious candidate being foaf:name in this example. Move on to more domain specific vocabularies as you go. When you find a relationship not catered for elsewhere you may have to consider publishing your own to supplement the rest of the world.
Once you have your model, it is often a simple bit of scripting to take data from your current form (CSV, database, XML record) and produce some simple files of triples, in n-triples format. Then use a useful tool like Raptor to transform it into good old ugly XML for transfer. Better still, take your n-triples files and load them into a storage/publishing platform like Kasabi. This was the route the British Library took when they published the British National Bibliography as RDF.
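As a rough illustration of how simple that bit of scripting can be, here is a minimal Python sketch that turns a small CSV into n-triples lines. The example.org URIs, the column names, and the choice of foaf:name are my own invented example – not the British Library’s actual process:

```python
import csv
import io

# foaf:name - an obvious candidate property, as suggested above.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

def csv_to_ntriples(csv_text):
    """Yield one n-triples line per CSV row, minting a subject URI from the id column."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Mint a unique, resolvable-looking URI for each thing (hypothetical pattern).
        subject = f"http://example.org/id/person/{row['id']}"
        # Escape backslashes and quotes as the n-triples grammar requires.
        name = row["name"].replace("\\", "\\\\").replace('"', '\\"')
        yield f'<{subject}> <{FOAF_NAME}> "{name}" .'

sample = "id,name\n1,Ernest Hemingway\n2,Jane Baker\n"
for triple in csv_to_ntriples(sample):
    print(triple)
```

The resulting file of one-triple-per-line statements is exactly what tools like Raptor, or a platform like Kasabi, will happily ingest.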
Data is going to become more core to our world than we could ever have imagined a few short years ago. Although we have been producing it for decades, data has either been treated as something in the core of a project not to expose to prying eyes, or often as a toxic waste product of business processes. Some of the traditional professions that have emerged, to look after and work with these data, reflect this relationship between us and our digital assets. In the data warehouse, they archive, preserve, catalogue, and attempt to make sense of vast arrays of data. The data miners precariously dig through mountains of data as it shifts and settles around them, propping up their expensive burrows with assumptions and inferred relationships, hoping a change in the strata does not cause a logical cave-in and they have to start again.
As I have postulated previously, I believe we are on the edge of a new revolution where data becomes a new raw material that drives the emergence of new industries, analogous to the emergence of manufacturing as a consequence of the industrial revolution. As this new era rolls out, the collection of data wrangling enthusiasts that have done a great job in getting us thus far will not be sufficient to sustain a new industry of extracting, transforming, linking, augmenting, analysing and publishing data.
So this initiative from the OKF & P2PU is very welcome:
The explosive growth in data, especially open data, in recent years has meant that the demand for data skills — for data “wranglers” or “scientists” — has been growing rapidly. Moreover, these skills aren’t just important for banks, supermarkets or the next Silicon Valley start-up, they are also going to be crucial in research, in journalism, and in civil society organizations (CSOs).
However, there is currently a significant shortfall of data “wranglers” to satisfy this growing demand, especially in civil society organisations — McKinsey expects a skills shortage in data expertise to reach 50-60% by 2018 in the US alone.
It is welcome not just because they are doing it, but also because of who they are and the direction they are taking:
The School of Data will adopt the successful peer-to-peer learning model established by P2PU and Mozilla in their ‘School of Webcraft’ partnership. Learners will progress by taking part in ‘learning challenges’ – series of structured, achievable tasks, designed to promote collaborative and project-based learning.
As learners gain skills, their achievements will be rewarded through assessments which lead to badges. Community support and on-demand mentoring will also be available for those who need it.
They are practically approaching real-world issues and tasks from the direction of the benefit to society of opening up data. Taking this route will engage those that have the desire, need and enthusiasm to become either part- or full-time data wranglers. Hopefully they will establish an ethos that will percolate into commercial organisations, taking an open world view with it. I am not suggesting that commerce should be persuaded to freely and openly share all their data, but they should learn the techniques of the open data community as the best way to share data under whatever commercial and licensing conditions are appropriate.