Over the next few months, Google’s search engine will begin spitting out more than a list of blue Web links. It will also present more facts and direct answers to queries at the top of the search-results page.
They are going about this by developing the search engine [that] will better match search queries with a database containing hundreds of millions of “entities”—people, places and things—which the company has quietly amassed in the past two years.
The ‘amassing’ got a kick start in 2010 with the Metaweb acquisition that brought Freebase and its 12 million entities into the Google fold. It is now continuing with the harvesting of html-embedded, schema.org-encoded structured data that is starting to spread across the web.
The encouragement for webmasters and SEO folks to go to the trouble of inserting this information into their html is the prospect of a better result display for their page – Rich Snippets. A nice trade-off from Google – you embed the information we want/need for a better search, and we will give you better results.
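As a concrete illustration (the lake page, property names, and extraction code below are my own hypothetical sketch, not Google's actual pipeline), here is the sort of schema.org microdata a webmaster might embed, and how readily a crawler can lift the facts back out:

```python
from html.parser import HTMLParser

# A hypothetical fragment of schema.org microdata a webmaster might embed.
PAGE = """
<div itemscope itemtype="http://schema.org/LakeBodyOfWater">
  <span itemprop="name">Lake Example</span>
  <span itemprop="description">A salt lake at high altitude.</span>
</div>
"""

class MicrodataExtractor(HTMLParser):
    """Collects itemprop name/value pairs, roughly as a crawler would."""
    def __init__(self):
        super().__init__()
        self._prop = None
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._prop and data.strip():
            self.properties[self._prop] = data.strip()
            self._prop = None

extractor = MicrodataExtractor()
extractor.feed(PAGE)
print(extractor.properties)
# {'name': 'Lake Example', 'description': 'A salt lake at high altitude.'}
```

The webmaster's side of the trade is just those extra attributes; everything the search engine needs sits in plain sight in the markup.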
The premise of what Google are up to is that it will deliver better search. Yes, this should be true, however I would suggest that the major benefit to us mortal Googlers will be better results. The search engine should appear to have greater intuition as to what we are looking for, but what we should also get is more information about the things that it finds for us. This is the step-change. We will be getting, in addition to web page links, information about things – the location, altitude, average temperature or salt content of a lake – whereas today you would only get links to the lake’s visitor centre or a Wikipedia page.
Another example quoted in the article:
…people who search for a particular novelist like Ernest Hemingway could, under the new system, find a list of the author’s books they could browse through and information pages about other related authors or books, according to people familiar with the company’s plans. Presumably Google could suggest books to buy, too.
Many in the library community may note this with scepticism, and as being too simplistic an approach to something that they have been striving towards for many years with only limited success. I would say that they should be helping the search engine supplier(s) do this right and be part of the process. There is great danger that, for better or worse, whatever Google does will make the library search interface irrelevant.
As an advocate for linked data, it is great to see the benefits of defining entities and describing the relationships between them being taken seriously. I’m not sure I buy into the term ‘Semantic Search’ as a name for what will result. I tend more towards ‘Semantic Discovery’, which is more descriptive of where the semantics kick in – in the relationship between a searched-for thing and its attributes and other entities. However, I’ve been around far too long to get hung up about labels.
Whilst we are on the topic of labels, I am in danger of stepping into the almost religious debate about the relative merits of microdata and RDFa as the encoding method for embedding schema.org data. Google recognises both, both are ugly for humans to hand code, and webmasters should not have to care. Once the CMS suppliers get up to speed in supplying the modules to automatically embed this stuff, as per this Drupal module, they won’t have to care.
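To show why webmasters shouldn't have to care (my own toy snippets and extractor, assuming the RDFa 1.1 Lite attribute names), the same fact renders in either syntax, and a consumer gets identical data out of both:

```python
from html.parser import HTMLParser

# The same hypothetical book description in the two competing syntaxes.
MICRODATA = ('<span itemscope itemtype="http://schema.org/Book">'
             '<span itemprop="name">For Whom the Bell Tolls</span></span>')
RDFA = ('<span vocab="http://schema.org/" typeof="Book">'
        '<span property="name">For Whom the Bell Tolls</span></span>')

class PropertyExtractor(HTMLParser):
    """Reads 'itemprop' (microdata) or 'property' (RDFa) pairs alike."""
    def __init__(self):
        super().__init__()
        self._key = None
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self._key = attrs.get("itemprop") or attrs.get("property")

    def handle_data(self, data):
        if self._key and data.strip():
            self.pairs[self._key] = data.strip()
            self._key = None

def extract(html):
    p = PropertyExtractor()
    p.feed(html)
    return p.pairs

# Both encodings yield the same name/value pair.
assert extract(MICRODATA) == extract(RDFA) == {"name": "For Whom the Bell Tolls"}
```

Which attribute spelling carried the data is an implementation detail the moment a module in the CMS is writing it for you.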
Why is it that those sceptical about a new technology resort, within a very few sentences, to the ‘show me the Killer App’ line? As if the appearance of a gold-star, bloggerati-approved example of something useful implemented in said technology is going to change their mind.
Remember the Web – what was the Web’s Killer App?
Step back a little further – what was the Killer App for HTML?
Answers in the comments section for both of the above.
I relate this to a blog post by Senior Program Officer for OCLC Research, Roy Tennant. I’m sure Roy won’t mind me picking on him as a handy current example of a much wider trend. In his post, in which he asserts that Microdata, Not RDF, Will Power the Semantic Web (and which elicits an interesting comment stream), he says:
Twelve years ago I basically called the Resource Description Framework (RDF) “dead on arrival”. That was perhaps too harsh of an assessment, but I had my reasons and since then I haven’t had a lot of motivation to regret those words. Clearly there is a great deal more data available in RDF-encoded form now than there was then.
But I’m still waiting for the killer app. Or really any app at all. Show me something that solves a problem or fulfills a need I have that requires RDF to function. Go ahead. I’ll wait.
Oh, you’ve got nothing? Well then, keep reading.
I could go off on a rant about something that was supposedly dead a decade ago still having trouble lying down; or about how comparing Microdata and RDF as vehicles for describing and relating things is like comparing text files to XML; or even that the value is not in the [Microdata or RDF] encoding, it is in the linking of things, concepts, and relationships – but that is for another time.
So to the search for Killer Apps.
Was VisiCalc a Killer App for the PC – yes, it probably was. Was Windows a Killer App – more difficult to answer. Is/was Windows an App, or something more general in technology-wave terms that would beget its own Killer Apps? What about Hypertext? Ignoring all the pre-computer era work on its principles, I would contend that its first Killer App was the Windows Help System – WinHelp. Although, with a bit of assistance from the html concept of the href, it was somewhat eclipsed by the Web.
The further up the breakfast-pancake-like stack of technologies, standards, infrastructures, and ecosystems we evolve away from the bit of silicon at the core of what we still belittle with the simple name of computer, the more vague and pointless our search for a Killer Use, and therefore App, becomes.
For a short period in history you could have considered the killer application of the internal combustion engine to be the automobile, but it wasn’t long before those two became intimately linked into a single entity with more applications than you could shake a stick at – all and none of which could be considered to be the killer.
Back to my domain, data. As I have postulated previously, I believe we are nearing a point where data – its use, our access to it, and the attention and action large players are giving to it – is going to fundamentally change the way we do things. Change that will come about, not from a radical change in what we do, but from a gradual adoption of several techniques and technologies atop what we already have. As I have also said before, Linked Data (and its data format RDF) will be a success when we stop talking about it, as it becomes just a tool in the bag. A tool that is used when and where appropriate, and mundane enough not to warrant a Linked Data Powered sticker on the side.
Take these apps being built on the Kasabi [Linked Data powered] platform. Will the users of these apps be aware of the Linked Data (and RDF) under the hood? No. Will they benefit from it? Yes, the aggregation and linking capabilities should deliver a better experience. Are any of them Killers? I doubt it, but they are nevertheless no less worthwhile.
So stop searching for that Killer App; you may be fruitlessly hunting for a long time. When someone invents a pocket-sized fusion reactor or the teleport, they might be back in vogue.
Prospector picture from ToOliver2 on Flickr. Pancakes picture from eyeliam on Flickr.
Some in the surfing community will tell you that every seventh wave is a big one. I am getting the feeling, in the world of the Web, that a number seven is up next, and this one is all about data. The last seventh wave was the Web itself. Because of that, it is a little constraining to talk about this next one only affecting the world of the Web. This one has the potential to shift some significant rocks around on all our beaches, and change the way we all interact and think about the world around us.
Sticking with the seashore metaphor for a short while longer; waves from the technology ocean have the potential to wash into the bays and coves of interest on the coast of human endeavour and rearrange the pebbles on our beaches. Some do not reach every cove, and/or have only minor impact, but some really big waves reach in everywhere to churn up the sand and rocks, significantly changing the way we do things and ultimately think about the world around us. The post-Web technology waves have brought smaller yet important influences such as ecommerce, social networking, and streaming.
I believe Data, or more precisely changes in how we create, consume, and interact with data, has the potential to deliver a seventh wave impact. Enough of the grandiose metaphors and down to business.
Data has been around for centuries, from clay tablets to little cataloguing tags on the end of scrolls in ancient libraries, and on into the computerised databases that we have been accumulating since the 1960s. Until very recently these [digital] data have been closed – constrained by the systems that used them, only exposed to the wider world via user interfaces and possibly a task/product-specific API. With the advent of many data-associated advances, variously labelled Big Data, Social Networking, Open Data, Cloud Services, Linked Data, Microformats, Microdata, Semantic Web, Enterprise Data, it is now venturing beyond those closed systems into the wider world.
Well this is nothing new, you might say, these trends have been around for a while – why does this constitute the seventh wave of which you foretell?
It is precisely because these trends have been around for a while, and are starting to mature and influence each other, that they are building to form something really significant. Take Open Data, for instance, where governments have been at the forefront – I have reported before about the almost daily announcements of open government data initiatives. The announcement from the Dutch City of Enschede this week not only talks about their data, but also about the open-sourcing of the platform they use to manage and publish it, so that others can share in the way they do it.
I might find some of the activities in the Cloud Computing short-sighted and depressing, yet already the concept of housing your data somewhere other than in a local datacenter is becoming accepted in most industries.
Enterprise use of Linked Data by leading organisations such as the BBC, who are underpinning their online Olympics coverage with it, is showing that it is more than a research tool, or the province only of the open data enthusiasts.
Data Marketplaces are emerging to provide platforms to share and possibly monetise your data. An example that takes this one step further is Kasabi.com from the leading Semantic Web technology company, Talis. Kasabi introduces the mixing, merging, and standardised querying of Linked Data into the data publishing concept. This potentially provides a platform for refining and mixing raw data into new data alloys and products more valuable and useful than their component parts. An approach that should stimulate innovation both in the enterprise and in the data enthusiast community.
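As a sketch of what that ‘standardised querying’ looks like in practice (the endpoint URL here is hypothetical; the query uses the well-known FOAF name property), a client poses a SPARQL query over plain HTTP:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; real marketplace URLs and authentication will differ.
ENDPOINT = "https://example.org/sparql"

# Find things and their names in a merged dataset.
QUERY = """
SELECT ?thing ?name WHERE {
  ?thing <http://xmlns.com/foaf/0.1/name> ?name .
} LIMIT 10
"""

def sparql_request_url(endpoint, query):
    """Builds the GET form of the SPARQL protocol: ?query=<encoded query>."""
    return endpoint + "?" + urlencode({"query": query})

url = sparql_request_url(ENDPOINT, QUERY)
# An HTTP GET on this URL (with Accept: application/sparql-results+json)
# would return the result bindings; the request itself is left out here
# to keep the sketch offline.
```

The point is that the query language and the wire protocol are standard, so the same client code works against any conformant data platform.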
The Big Data community is demonstrating that there are solutions to handling the vast volumes of data we are producing, solutions that require us to move out of the silos of relational databases towards a mixed economy. Programs need to move – not the data. NoSQL databases, Hadoop, map/reduce – these are all things that are starting to move out of the labs and the hacker communities into the mainstream.
The Social Networking industry, which produces tons of data, is a rich field for things like sentiment analysis, trend spotting, targeted advertising, and even short-term predictions. Innovation in this field has been rapid, but I would suggest a little hampered by the delivery of closed individual solutions that as yet do not interact with the wider world that could place them in context.
I wrote about Schema.org a while back – an initiative from the search engine big three to encourage the SEO industry to embed simple structured data in their html. The carrot they are offering for this effort is enhanced display in results listings – Google calls these Rich Snippets. When first announced, the schema.org folks concentrated on Microdata as the embedding format – something that wouldn’t frighten the SEO community horses too much. However they did [over a background of loud complaining from the Semantic Web / Linked Data enthusiasts that RDFa was the only way] also indicate that RDFa would eventually be supported. By engaging with SEO folks on terms that they understand, this move from Schema.org had the potential to get far more structured data published on the Web than any TED Talk from Sir Tim Berners-Lee, preaching from people like me, or guidelines from governments could ever do.
The above short list of pebble-stirring waves is both impressive in its breadth and encouraging in its potential, yet none of them are the stuff of a seventh wave.
So what caused me to open up my Macbook and start writing this? It was a post from Manu Sporny, indicating that Google are not waiting for RDFa 1.1 Lite (the version of RDFa that schema.org will support) to be ratified. They are already harvesting, and using, structured information from web pages that has been encoded using RDFa. The use of this structured data has resulted in enhanced display on the Google pages, with items such as event date & location information, and recipe preparation timings.
Manu references sites that seem to be running Drupal, the open source CMS software, and specifically a Drupal plug-in for rendering Schema.org data encoded as RDFa. This approach answers some of the critics of embedding Schema.org data into a site’s html, especially as RDFa, who say it is ugly and difficult to understand. It is not there for humans to parse or understand and, with modules such as the Drupal one, humans will not need to get their hands dirty down at code level. Currently Schema.org supports a small but important number of ‘things’ in its recognised vocabularies. These, currently supplemented by GoodRelations and Recipes, will hopefully be joined by others to broaden the scope of descriptive opportunities.
So roll the clock forward, not too far, to a landscape where a large number of sites (incentivised by the prospect of listings as enriched as their competitors’ results) are embedding structured data in their pages as normal practice. By then most, if not all, web site delivery tools should be able to embed the Schema.org RDFa data automatically. Google and the other web crawling organisations will rapidly build up a global graph of the things on the web, their types, relationships, and the pages that describe them. A nifty example of providing a very specific, easily understood benefit in return for a change in the way web sites are delivered – one that results in a global shift in the amount of structured data accessible for the benefit of all. Google Fellow and SVP Amit Singhal recently gave insight into this Knowledge Graph idea.
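A toy rendering of that ‘global graph’ idea (the entities, properties, and page URLs here are all my own invention): each crawled page contributes facts about things, which merge by subject into a graph that also remembers which pages describe which thing:

```python
from collections import defaultdict

# Facts harvested from structured markup across crawled pages
# (page URL, subject entity, property, value) - illustrative only.
harvested = [
    ("http://example.org/hemingway", "Ernest Hemingway", "type", "Author"),
    ("http://example.org/hemingway", "Ernest Hemingway", "wrote",
     "For Whom the Bell Tolls"),
    ("http://example.org/bell-tolls", "For Whom the Bell Tolls", "type", "Book"),
]

# Merge everything said about each thing, wherever it was said.
graph = defaultdict(lambda: defaultdict(set))
pages = defaultdict(set)
for page, subject, prop, value in harvested:
    graph[subject][prop].add(value)
    pages[subject].add(page)  # remember which pages describe each thing

print(sorted(graph["Ernest Hemingway"]["wrote"]))
# ['For Whom the Bell Tolls']
```

From a graph like this, answering “what did this author write, and which pages talk about it?” becomes a lookup rather than a fresh text search.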
The Semantic Web / Linked Data proponents have been trying to convince everyone else of the great good that will follow once we have a web interlinked at the data level with meaning attached to those links. So far this evangelism has had little success. However, this shift may give them what they want via an unexpected route.
Once such a web emerges, and most importantly is understood by the commercial world, innovations that will influence the way we interact will naturally follow. A Google TV, with access to such a rich resource, should have no problem delivering an enhanced viewing experience by following structured links embedded in a programme page to information about the cast, the book of the film, the statistics that underpin the topic, or other programmes from the same production company. Our next-but-one iPhone could be a personal node in a global data network, providing access to relevant information about our location, activities, social network, and tasks.
These slightly futuristic predictions will only become possible on top of a structured network of data, which I believe is what could very well emerge if you follow through on the signs that Manu is pointing out. Reinforced by, and combining with, the other developments I referenced earlier in this post, I believe we may well have a seventh wave approaching. Perhaps I should look at the beach again in five years’ time to see if I was right.
Wave photo from Nathan Gibbs on Flickr.
Declarations – I am a Kasabi Partner and shareholder in Kasabi parent company Talis.
Paul is a prolific podcaster, but had yet to venture into the world of the video conversation. This conversation was therefore a bit of an experiment. Take a look below and see what you think. For those that prefer audio only, Paul has helpfully included an mp3 for you to listen to. At the end of this post you will also find a link to a short survey he has posted to ascertain how successful this format is.
Here are a few comments about the process from Paul and a link to the survey:
It’s perhaps unfair to draw too many conclusions from this first attempt, but a few things are immediately apparent. The whole process takes an awful lot longer. The files are larger, so processing and uploading times increase 2-3 fold. Uploading a separate audio file also takes a bit of time. Simply dumping the Skype recording into iMovie worked just fine… but I’ve (so far) not managed to find any way to balance the audio levels. Garageband lets me do this with my audio-only podcasts, but iMovie doesn’t seem to, so Richard’s side of the conversation comes across as quite a bit louder than mine.
The BBC have been at the forefront of the real application of Linked Data techniques and technologies for some time. It has been great to see them evolve from early experiments by BBC Backstage working with Talis to publish music and programmes data as RDF – to see what would happen.
Their Wildlife Finder that drives the stunning BBC Nature site has been at the centre of many of my presentations promoting Linked Data over the last couple of years. It not only looks great, but it also demonstrates wonderfully the follow-your-nose navigation around a site that naturally occurs if you let the underlying data model show you the way.
The BBC team have been evolving their approach to delivering agile, effective, websites in an efficient way by building on Linked Data foundations sector by sector – wildlife, news, music, World Cup 2010, and now in readiness for London 2012 – the whole sport experience. Since the launch a few days ago, the main comment seems to be that it is ‘very yellow’, which it is. Not much reference to the innovative approach under the hood – as it should be. If you can see the technology, you have got it wrong.
In an interesting post on the launch, Ben Gallop shares some history about the site and background on the new version. With a site which gets around 15 million unique visitors a week, they have a huge online audience to serve. Cait O’Riordan, in a more technical post, talks about the efficiency gains of taking the semantic web technologies approach:
Doing more with less
One of the reasons why we are able to cover such a wide range of sports is that we have invested in technology which allows our journalists to spend more time creating great content and less time managing that content.
In the past when a journalist wrote a story they would have to place that story on every relevant section of the website.
A story about Arsenal playing Manchester United, for example, would have to be placed manually on the home page, the Football page, the premier league page, the Arsenal page and the Manchester United page – a very time consuming and labour intensive process.
Now the journalists tell the system what the story is about and that story is automatically placed on all the relevant parts of the site.
We are using semantic web technologies to do this, an exciting evolution of a project begun with the Vancouver Winter Games and extended with the BBC’s 2010 World Cup website. It will really come into its own during the Olympics this summer.
It is that automatic placement, and linking, of stories that leads to the natural follow-your-nose navigation around the site. If previous incarnations of the BBC using this approach are anything to go by, there will also be SEO benefits as well – as I have discussed previously.
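The tag-then-place workflow described above can be sketched in a few lines (a toy of my own, not the BBC's actual RDF-backed system): journalists tag a story with the things it is about, and each page simply selects the stories about its own thing:

```python
# Journalists tag stories with the entities they are about...
stories = [
    {"headline": "Arsenal beat Manchester United",
     "about": {"Arsenal", "Manchester United", "Premier League", "Football"}},
    {"headline": "Swimming world record falls",
     "about": {"Swimming"}},
]

def stories_for_page(page_entity):
    """...and every page selects its own stories - no manual filing."""
    return [s["headline"] for s in stories if page_entity in s["about"]]

print(stories_for_page("Arsenal"))
# ['Arsenal beat Manchester United']
print(stories_for_page("Football"))
# ['Arsenal beat Manchester United']
```

One tagging action by the journalist, and the story appears everywhere it is relevant; add a new page for a new entity and it populates itself.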
The data model used under the hood of the Sports site is based upon the Sport Ontology openly published by them. Check out the vocabulary diagram to see how they have mapped out and modelled the elements of a sporting event, competition, and associated broadcast elements. A great piece of work from the BBC teams.
In addition to the visual, navigation, and efficiency benefits this launch highlights, it also settles the concern that Linked Data / Semantic Web technologies cannot perform. This site is supporting 15 million unique visitors a week, and will probably be supporting a heck of a lot more during the Olympics. That is real web scale!
Well I did for a start! I chose this auspicious day to move the Data Liberate web site from one hosting provider to another. The reasons why are a whole other messy story, but I did need some help on the WordPress side of things and [quite rightly in my opinion] they had ‘gone dark’ in support of the SOPA protests. Frustration, but in a good cause.
Looking at the press coverage from my side of the Atlantic, such as from BBC News, it seems that some in Congress have also started to take notice. The most fuss in general seemed to be around Wikipedia going dark, demonstrating what the world would be like without the free and easy access to information we have become used to. All in all I believe the campaign has been surprisingly effective on the visible web.
However, what prompted this post was trying to ascertain how effective it was on the Data Web, which almost by definition is the invisible web. Ahead of the dark day, a move started on the Semantic Web and Linked Open Data mailing lists to replicate what Wikipedia was doing by going dark on DBpedia – the Linked Data version of Wikipedia’s structured information. The discussion was based around the fact that SOPA would not discriminate between human-readable web pages and machine-to-machine data transfer and linking, therefore we [concerned about the free web] should be concerned. Of that there was little argument.
The main issue was that systems consuming data that suddenly goes away would just fail. This was countered by the assertion that, regardless of the machines in the data pipeline, there will always be a human at the end. Responsible systems providers should be aware of the issue and report the error/reason to their consuming humans.
Some suggested that instead of delivering the expected data, systems [operated by those that are] protesting should provide data explaining the issue. How many application developers have taken this circumstance into account in their designs, I wonder? If you, as a human accessing a SPARQL endpoint, are presented with a ‘dark’ page, you can understand and come back to query tomorrow. If you are a system getting different types of, or no, data back, you will just see an error.
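A responsible consuming system might handle this along the following lines (the status codes and wording are illustrative only, not any standard behaviour): interpret what came back from the data source, and tell the human at the end of the pipeline what actually happened:

```python
import json

def interpret_response(status, body):
    """Turns a data source's HTTP response into either usable data or a
    human-readable explanation (codes and wording illustrative only)."""
    if status == 200:
        return json.loads(body)["results"]["bindings"]
    if status == 503:
        return "Data source is dark - possibly a deliberate protest; try again later"
    return f"Data source failed with HTTP {status}"

# A normal day: the result bindings come back for onward processing.
ok = interpret_response(200, '{"results": {"bindings": []}}')
# A protest day: the human gets an explanation, not a silent failure.
dark = interpret_response(503, "")
```

The design point is simply that “no data” and “deliberately withheld data” are different conditions, and only the application developer can decide to surface that difference to the user.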
The question I have is, who using systems that use Linked Data [that went dark] noticed that there was either a problem, or preferably an effect of the protest?
I suspect the answer is very few, but I would like to hear the experiences of others on this.