When I reported Denny Vrandecic's announcement of Wikidata at the Semantic Tech & Business Conference in Berlin in February, I was impressed by the ambition to bring together all the facts from all the different language versions of Wikipedia in a central Wikidata instance, with a single page per entity. These single pages will draw together all references to an entity and engage a sustainable community to manage this machine-readable resource. The data would then be used to populate the info-boxes of all versions of Wikipedia, as well as being an open resource of structured data for all.
In his post Mark raises concerns that this approach could result in the loss of the diversity of opinion currently found in the diverse Wikipedias:
It is important that different communities are able to create and reproduce different truths and worldviews. And while certain truths are universal (Tokyo is described as a capital city in every language version that includes an article about Japan), others are more messy and unclear (e.g. should the population of Israel include occupied and contested territories?).
He also highlights issues about the unevenness or bias of contributors to Wikipedia:
We know that Wikipedia is a highly uneven platform. We know that not only is there not a lot of content created from the developing world, but there also isn’t a lot of content created about the developing world. And we also know that, even within the developed world, a majority of edits are still made by a small core of (largely young, white, male, and well-educated) people. For instance, there are more edits that originate in Hong Kong than all of Africa combined; and there are many times more edits to the English-language article about child birth by men than women.
A simplistic view of what Wikidata is attempting to do could be a majority-rules filter on what is correct data, where low-volume opinions are drowned out by that majority. If Wikidata is successful in its aims, it will not only become the single source for info-box data in all versions of Wikipedia, but it will also take over the mantle currently held by DBpedia as the de facto link-to place for identifiers and associated data on the Web of Data and the wider Web.
I share some of his concerns, but also draw comfort from some of the things Denny said in Berlin – “WikiData will not define the truth, it will collect the references to the data…. WikiData created articles on a topic will point to the relevant Wikipedia articles in all languages.” They obviously intend to capture facts described in different languages; the question is: will they also preserve the local differences in assertion? In a world where we still cannot totally agree on the height of our tallest mountain, we must be able to take account of, and report, differences of opinion.
Phil picked out a section of Dan’s presentation for comment:
In the RDF community, in the Semantic Web community, we’re kind of polite, possibly too polite, and we always try to re-use each other’s stuff. So each schema maybe has 20 or 30 terms, and… schema.org has been criticised as maybe a bit rude, because it does a lot more – it’s got 300 classes, 300 properties – but that makes things radically simpler for people deploying it. And that’s frankly what we care about right now, getting the stuff out there. But we also care about having attachment points to other things…
Then reflecting on current practice in Linked Data he went on to postulate:
… best practice for the RDF community… …i.e. look at existing vocabularies, particularly ones that are already widely used and stable, and re-use as much as you can. Dublin Core, FOAF – you know the ones to use.
Except schema.org doesn’t.
schema.org has its own terms for name, family name and given name, which I chose not to use, at least partly out of long-term loyalty to Dan. But should that affect me? Or you? Is it time to put emotional attachments aside and move on from some of the old vocabularies, and at least consider putting more effort into creating a single big vocabulary that covers most things, with specialised vocabularies to handle the long tail?
As the question in the title of his post implies: should we move on and start adopting, where applicable, terms from the large and growing Schema.org vocabulary when modelling and publishing our data? Or should we stick with the current collection of terms from suitable smaller vocabularies?
One of the common issues when people first get to grips with creating Linked Data is which terms from which vocabularies to use for their data, and where to find out. I have watched the frown skip across several people’s faces when you first tell them that foaf:name is a good attribute to use for a person’s name in a data set that has nothing to do with friends, or friends of friends. It is very similar to the one they give you when you suggest that it may also be good for something that isn’t even a person.
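The reason foaf:name travels so well is worth spelling out: a prefixed term is just shorthand for a full URI, and nothing about that URI restricts it to friend-of-a-friend data, or even to people. A minimal sketch in plain Python (the FOAF and Dublin Core namespace URIs are real; the example resources are invented for illustration):

```python
# Common RDF vocabulary namespaces; a prefixed term like foaf:name
# is just an abbreviation for a full URI.
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/terms/"

# Triples as (subject, predicate, object) tuples.
# Nothing about foaf:name restricts it to friends, or even to people.
triples = [
    ("http://example.org/person/1", FOAF + "name", "Ada Lovelace"),
    ("http://example.org/org/acme", FOAF + "name", "Acme Ltd"),  # not a person
    ("http://example.org/doc/42", DC + "title", "A report"),
]

# A consumer that understands foaf:name picks up both resources,
# regardless of what kind of thing each one is.
named = [s for s, p, o in triples if p == FOAF + "name"]
```

The point the frown misses: re-using a well-known term means generic tools already know what it means, whatever your domain.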
As Schema.org grows and, enticed by the obvious SEO benefits in the form of Rich Snippets, becomes rapidly adopted by a community far larger than the Semantic Web and Linked Data communities, why would you not default to using terms from its vocabulary? Another former colleague, David Wood, tweeted ‘No’ in answer to Phil’s question – I think in retrospect this may seem a King Canute-style proclamation. If my predictions are correct, it won’t be too long before we are up to our ears in structured data on the web, most of it marked up using terms to be found at schema.org.
You may think that I am advocating the death of all the vocabularies, well known and obscure, in use today, and their replacement by Schema.org – far from it. When modelling your [Linked] data, start by using terms that have been used before, then build on terms more specific to your domain, and finally you may have to create your own vocabulary/ontology. What I am saying is that as Schema.org becomes established, its growing collection of 300+ terms will become the obvious starting point in that process.
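That layered process can be sketched with plain triples: reach first for widely used terms (schema.org here), and mint your own URIs only for what nothing common covers. A hedged sketch – the ex: namespace and the shelfLocation property below are hypothetical, invented purely to stand in for a domain-specific vocabulary:

```python
SCHEMA = "http://schema.org/"
EX = "http://example.org/vocab/"  # hypothetical local vocabulary

book = "http://example.org/book/moby-dick"
triples = [
    # Step 1: reused, widely understood terms do most of the work.
    (book, SCHEMA + "name", "Moby-Dick"),
    (book, SCHEMA + "author", "Herman Melville"),
    # Step 3: a local term for the one thing no common vocabulary covers.
    (book, EX + "shelfLocation", "Aisle 4, Bay 2"),
]

reused = [t for t in triples if t[1].startswith(SCHEMA)]
local = [t for t in triples if t[1].startswith(EX)]
```

The payoff of keeping the local tail small is that most of your data remains legible to consumers who have never seen your vocabulary.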
OK, a couple of interesting posts, but where is the similar message and connection? I see it as democracy of opinion. Not the democracy of the modern western political system, where we have a stand-up shouting match every few years followed by a fairly stable period where the rules are enforced by one view. More the traditional, possibly romanticised, view of democracy where the majority leads the way but without disregarding the opinions of the few. Was it the French Enlightenment philosopher Voltaire who said: “I may hate your views, but I am willing to lay down my life for your right to express them”? A bit extreme when discussing data and ontologies, but the spirit is right.
Once the majority of general data on the web becomes marked up as schema.org, it would be short-sighted to ignore the gravitational force it will exert on the web of data if you want your data to be linked to and found. However, it will be incumbent on those behind Schema.org to maintain their ambition to deliver easy linking to more specialised vocabularies via their extension points. This way the ‘how’ of data publishing should become simpler, more widespread, and extensible. On the ‘what’ side of the [structured] data publishing equation, the Wikidata team has an equal responsibility not only to publish the majority definition of facts, but also to clearly reflect the views of minorities – not a simple balancing act, as often those with the more extreme views have the loudest voices.
The BBC have been at the forefront of the real application of Linked Data techniques and technologies for some time. It has been great to see them evolve from early experiments by BBC Backstage working with Talis to publish music and programmes data as RDF – to see what would happen.
Their Wildlife Finder that drives the stunning BBC Nature site has been at the centre of many of my presentations promoting Linked Data over the last couple of years. It not only looks great, but it also demonstrates wonderfully the follow-your-nose navigation around a site that naturally occurs if you let the underlying data model show you the way.
The BBC team have been evolving their approach to delivering agile, effective websites in an efficient way by building on Linked Data foundations sector by sector – wildlife, news, music, World Cup 2010, and now, in readiness for London 2012, the whole sport experience. Since the launch a few days ago, the main comment seems to be that it is ‘very yellow’ – which it is. Not much reference to the innovative approach under the hood – as it should be. If you can see the technology, you have got it wrong.
In an interesting post on the launch, Ben Gallop shares some history about the site and background on the new version. With a site which gets around 15 million unique visitors a week, they have a huge online audience to serve. Cait O’Riordan, in a more technical post, talks about the efficiency gains of taking the semantic web technologies approach:
Doing more with less
One of the reasons why we are able to cover such a wide range of sports is that we have invested in technology which allows our journalists to spend more time creating great content and less time managing that content.
In the past when a journalist wrote a story they would have to place that story on every relevant section of the website.
A story about Arsenal playing Manchester United, for example, would have to be placed manually on the home page, the Football page, the premier league page, the Arsenal page and the Manchester United page – a very time consuming and labour intensive process.
Now the journalists tell the system what the story is about and that story is automatically placed on all the relevant parts of the site.
We are using semantic web technologies to do this, an exciting evolution of a project begun with the Vancouver Winter Games and extended with the BBC’s 2010 World Cup website. It will really come into its own during the Olympics this summer.
It is that automatic placement, and linking, of stories that leads to the natural follow-your-nose navigation around the site. If previous incarnations of the BBC using this approach are anything to go by, there will also be SEO benefits as well – as I have discussed previously.
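In heavily simplified form, the mechanism the quote describes is tagging plus query: a journalist states once what a story is about, and every relevant page finds it automatically. A sketch of the idea in Python – the story and concept identifiers are invented for illustration, and the BBC's actual system uses RDF triple stores and their Sport Ontology rather than in-memory sets:

```python
# Each story is tagged once with the concepts it is 'about'.
stories = {
    "story/123": {"team/arsenal", "team/man-utd", "competition/premier-league"},
    "story/124": {"team/arsenal"},
}

def stories_for(concept):
    """A page for a concept finds its stories by query,
    rather than by an editor placing each story manually."""
    return sorted(s for s, tags in stories.items() if concept in tags)

# The Arsenal page, the Man Utd page, and the Premier League page
# each pick up story/123 with no further editorial effort.
```

The same tags that place stories also generate the cross-links between pages, which is where the follow-your-nose navigation comes from.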
The data model used under the hood of the Sports site is based upon the Sport Ontology openly published by them. Check out the vocabulary diagram to see how they have mapped out and modelled the elements of a sporting event, competition, and associated broadcast elements. A great piece of work from the BBC teams.
In addition to the visual, navigation and efficiency benefits this launch highlights, it also settles the concern that Linked Data / Semantic Web technologies cannot perform. This site is supporting 15 million unique visitors a week, and will probably be supporting a heck of a lot more during the Olympics. That is real web scale!
Like many of my posts, this one comes from the threads of several disparate conversations coming together in my mind, in an almost astrological conjunction.
One thread stems from my recent Should SEO Focus in on Linked Data? post, in which I was concluding that the group, loosely described as the SEO community, could usefully focus in on the benefits of Linked Data in their quest to improve the business of the sites and organisations they support. Following the post I received an email looking for clarification of something I said.
I am interested in understanding better the allusion you make in this paragraph:
One of the major benefits of using RDFa is that it can encode the links to other sources that are at the heart of Linked Data principles, and thus describe the relationships between things. It is early days with these technologies & initiatives. The search engine providers are still exploring the best way to exploit structured information embedded in and/or linked to from a page. The question is: do you just take RDFa as a new way of embedding information into a page for the search engines to pick up, or do you delve further into the technology and see it as public visibility of an even more beneficial infrastructure for your data.
If the immediate use-case for RDFa (microdata, etc.) is search engine optimization, what is the “even more beneficial infrastructure”? If the holy grail is search engine visibility, rank, relevance and rich-results, what is the “even more”?
In reply I offered:
What I was trying to imply is that if you build your web presence on top of a Linked Data described dataset / way of thinking / platform, you get several potential benefits:
Flexible, easier-to-maintain page structure
Value-added data from external sources….
… therefore improved [user] value with less onerous cataloguing processes
Agile/flexible systems – easy to add/mix in new data
Lower cost of enhancement (e.g. the BBC added dinosaurs to the established Wildlife Finder with minimal effort)
In-built APIs [with very little extra effort] to allow others to access / build apps upon / use your data in innovative ways
As per the BBC, a certain level of default SEO goodness
Easy to map, and therefore link, your categorisations to ones the engines do/may use (e.g. Google are using MusicBrainz to help folks navigate around – if, say as the BBC do, you link your music categories to those of MusicBrainz, you can share in that effect)
So what I am saying is that you can ‘just’ take RDFa as a dialect to send your stuff to the Google (in which case microdata/microformats could be equally good), but then you will miss out on the potential benefits I describe.
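The MusicBrainz point in the list above is the clearest example of that ‘even more’: publishing an explicit equivalence link lets anyone, search engines included, connect your categorisation to a shared identifier. A minimal sketch using the standard owl:sameAs property – the local artist URI and the MusicBrainz identifier below are both invented for illustration:

```python
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

# A hypothetical local artist category linked to the shared
# MusicBrainz identifier that the engines also use.
triples = [
    ("http://example.org/artist/42", OWL_SAMEAS,
     "http://musicbrainz.org/artist/some-artist-id"),
]

def same_as(uri):
    """Follow owl:sameAs links, so a consumer can connect your
    categorisation to the one the engines use."""
    return [o for s, p, o in triples if s == uri and p == OWL_SAMEAS]
```

One extra triple per category is all the publishing effort; the pay-off arrives when a third party who already understands the shared identifier stumbles across your data.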
From my point of view there are two holy grails (if that isn’t breaking the analogy 😉):
Get visibility and hence folks to hit your online resources.
Provide the best experience/usefulness/value to them when they do.
Linked Data techniques and technologies have great value for the data owners in the second of those, with the almost spin-off benefit of helping you with the first.
The next thread was not a particular item but a general vibe, from several bits and pieces I read, that RDFa is confusing and difficult. This theme, I detect, was coming from those only looking at it from a ‘how do I encode my metadata for Google to grab for its snippets’ point of view (and there is nothing wrong in that), or from those trying to justify a ‘schema.org is the only show in town’ position. Coming at it from the first of those two points of view, I have some sympathy – those new to RDFa must feel like I do (with my basic understanding of HTML) when I peruse the contents of many a CSS file looking for clues as to the designer’s intention.
However I would make two comments. Firstly, a site surfacing lots of data, and hence wanting to encode RDFa amongst the human-readable stuff, will almost certainly be using tools to format the data as it is extracted from an underlying data source – it is those tools that should be evolved to produce the RDFa as a by-product. Secondly, it is the wider benefits of Linked Data, which I’m trying to promote in my posts, that justify people investing time to focus on it. The fact that you may use RDFa to surface that data, embedded in HTML so that search engines can pick it up, is implementation detail – important detail, but missing the point if that is all you focus upon.
Thread number three, is the overhype of the Semantic Web. Someone who I won’t name, but I’m sure won’t mind me quoting, suggested the following as the introduction to a bit of marketing: The Semantic Web is here and creating new opportunities to revamp and build your business.
The Semantic Web is not here yet, and won’t be for some while. However what is here, and is creating opportunities, is Linked Data and the pragmatic application of techniques, technologies and standards that are enabling the evolution towards an eventual Semantic Web.
This hyped approach is a consequence of the stance of some in the Semantic Web community who have been fervently promoting its coming, in its AI entirety, for several years, and fail to understand why all of us [enthusiasts, researchers, governments, commerce and industry] are not implementing all of its facets now. If you have the inclination, you can see some of the arguments playing out in this thread on a SemWeb email list, where Juan Sequeda asks for support for his SXSW panel topic suggestion.
A simple request, which I support, but the thread it created shows that ‘eating the whole elephant’ of the Semantic Web will be too much to introduce it successfully to the broad Web, SEO and SERP community, and the ‘one mouthful at a time’ approach may have a better chance of success. Also, any talk of a ‘killer app’ is futile – we are talking about infrastructure here. What is the killer app feature of the Web? You could say linked, globally distributed, consistently accessed documents: an infrastructure that facilitated the development of several killer businesses and business models. We will see the same when we look back on a web enriched by linked, globally distributed, consistently accessed data.
So what is my astrological conjunction telling me? There is definitely fertile ground to be explored between the Semantic Web and the Web in the area of the pragmatic application of Linked Data techniques and technologies. People in both camps need to open their minds to the motivations and vision of the other. There is potential to be realised, but we are definitely not in silver bullet territory.
As I said in my previous post, I would love to explore this further with folks from the world of SEO & SERP. If you want to talk through what I have described, I encourage you to drop me an email or comment on this post.
This post was also published on the Talis Consulting Blog