Yesterday’s announcement of Google’s new Knowledge Graph feature has already been well covered in a number of different reports. See for example ReadWriteWeb and TechCrunch. If you’ve not yet seen it then the product announcement gives some additional useful background and includes a video demonstration. I thought I’d share some early thoughts of my own on the news.
Firstly, the reporting on what Google are actually doing is a little confused. For example, ReadWriteWeb say that Google “looks at the words of your query and identifies the things in it.” But, from reading this bit of background from Google, it’s clear that they’re actually doing something more interesting and potentially more powerful.
There has been lots of previous work on creating semantic search engines that include natural language parsing to “understand” more about a user’s query and its contents. But what Google are doing is looking at the search results to identify the things that are frequently referenced, and then surface useful summaries of those things from their Knowledge Graph. As a user you can refine the results by identifying which thing you were actually interested in.
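To make the distinction concrete, here is a minimal sketch of the results-driven idea. It is not Google’s implementation: the result records, the entity identifiers and the `rank_entities` and `refine` helpers are all hypothetical, and it assumes some earlier step has already tagged each result with the things it mentions.

```python
from collections import Counter

# Hypothetical data: each search result is tagged with the identifiers of
# the things it mentions. How that tagging is done is out of scope here.
results = [
    {"url": "http://example.org/a", "entities": ["taj_mahal_monument", "agra"]},
    {"url": "http://example.org/b", "entities": ["taj_mahal_monument"]},
    {"url": "http://example.org/c", "entities": ["taj_mahal_musician"]},
]


def rank_entities(results):
    """Return (entity, count) pairs, most frequently referenced first."""
    counts = Counter()
    for result in results:
        # Count each entity once per result so one page cannot dominate.
        counts.update(set(result["entities"]))
    return counts.most_common()


def refine(results, entity_id):
    """Keep only the results that mention the thing the user picked."""
    return [r for r in results if entity_id in r["entities"]]


ranked = rank_entities(results)
refined = refine(results, "taj_mahal_musician")
```

The point of the sketch is that the candidate things come from the results rather than from parsing the query string, and the user’s choice is then used to narrow those results.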
This seems like a more powerful approach as it’s driven from the content and data that’s on the web, not just the snippet of text we’re using in our searches. It’s a much more emergent approach and will enable them to quickly surface information from their Knowledge Graph about whole new classes of things that people are talking about on the web.
And Google have already given us the means to help improve the mapping between the content we publish and the Knowledge Graph: Schema.org.
Schema.org
Schema.org has already been driving improvements in the amount of structured data on the web. Initially the goal seemed to be to drive some smaller refinements to search engine behaviour, e.g. by allowing better rich snippets on search results. But now we can see that there is a bigger vision.
The core schema proposed by Google and others is busily being extended by a community of people who are tailoring it to the needs of their particular domains. There’s also a recent move to provide more linking from schema.org markup to authoritative sources. With increased adoption of embedded metadata, and continual refinements to both the types of things that are described and the available detail about them, we can see that Schema.org is going to help drive improvements to their Knowledge Graph.
Firstly, it’s going to let them feed and grow the Knowledge Graph more directly from their web crawls. They’re already extracting lots of data from pages, but having more structure at source will be much more reliable. Secondly, the Schema.org markup is going to help them more readily identify the entities referenced or described in a page and associate that content with entries in their Knowledge Graph.
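As a rough illustration of why structure at source is easier to work with, the sketch below pulls Schema.org microdata out of a page using only the Python standard library. The `MicrodataSketch` class, the sample page and the person it describes are invented for this example. It handles only flat items with text values, so it is far from a complete microdata parser and says nothing about how Google’s crawler actually works.

```python
from html.parser import HTMLParser

# Hypothetical page fragment marked up with Schema.org microdata.
PAGE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Example</span>
  <span itemprop="jobTitle">Librarian</span>
</div>
"""


class MicrodataSketch(HTMLParser):
    """Collect flat microdata items; nested items are not handled."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._prop = None  # property name waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # A new thing starts here; its type is a Schema.org URL.
            self.items.append({"type": attrs.get("itemtype"), "props": {}})
        elif "itemprop" in attrs and self.items:
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        # Attach the first non-blank text after an itemprop to the latest item.
        if self._prop and data.strip():
            self.items[-1]["props"][self._prop] = data.strip()
            self._prop = None


parser = MicrodataSketch()
parser.feed(PAGE)
items = parser.items
```

With the markup in place, the type of the thing and its properties can be read directly off the page, with no guessing from free text.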
That will also mean that we can predict a more fundamental change in how sites will be prioritised and ranked in search results: how authoritative is their content about a particular person, place, or thing? Good content will remain a must, but clear identification of the “things, not strings” that it references will be vital.
Build Your Own Knowledge Graph
The early reporting on the origins of the Knowledge Graph expresses a lot of interest in the sources of the data. Freebase clearly provides a strong backbone for the effort, but there is a collection of other sources being drawn upon. I wonder how many of the commercial sources that Google have used initially might end up getting supplanted by data drawn directly from web crawls in the future as schema.org markup becomes more ubiquitous?
While Google can clearly still operate at a scale that most organisations can only dream of, the Knowledge Graph is something that is actually within the reach of most organisations already. The approach is a new and exciting addition to their search engine, but the technology and capability aren’t a radical leap forward. You could build a Knowledge Graph of your own.
Google haven’t created a whole new dataset; they’ve collated existing sources. The data already available in the Linked Data cloud, which includes Freebase, is there for anyone to reuse. It is perfectly feasible for an organisation to create its own “knowledge graph” to serve a particular product or domain by selecting from the available sources.
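A toy version of that collation step might look like the following. Everything here is an assumption made for illustration: the `ex:` identifiers, the two sources and the `build_graph` helper are made up, and a real project would more likely load RDF from the chosen datasets into a triple store than hold it in Python dictionaries.

```python
from collections import defaultdict

# Hypothetical triples selected from an open dataset.
open_data = [
    ("ex:london", "ex:type", "ex:City"),
    ("ex:london", "ex:country", "ex:uk"),
]

# Hypothetical triples describing an organisation's own content.
own_content = [
    ("ex:article42", "ex:about", "ex:london"),
    ("ex:london", "ex:type", "ex:City"),  # duplicates collapse on merge
]


def build_graph(*sources):
    """Merge triples from several sources into subject -> predicate -> objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for triples in sources:
        for subject, predicate, obj in triples:
            graph[subject][predicate].add(obj)
    return graph


def subjects_linked_to(graph, target):
    """Find every subject with some property pointing at the target thing."""
    return sorted(
        subject
        for subject, properties in graph.items()
        if any(target in objects for objects in properties.values())
    )


graph = build_graph(open_data, own_content)
linked = subjects_linked_to(graph, "ex:london")
```

Because both sources use the same identifier for the same thing, the merge is a simple union, and the organisation’s own content ends up connected to the open data about that thing.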
Indeed this is what the BBC has been doing for some time now. As they’ve described in numerous talks and interviews: by drawing on data from the web, and using it as their content management system, they’ve been able to create graphs of data to power their own innovative applications.
These product graphs weave together open data sources with the BBC’s own unique content. The fundamental technologies and the scale of operations may differ, but both Google and the BBC are deriving real value from focusing on “things, not strings”.
Graphs aren’t just cool, they’re a necessary component of innovative data-driven products.
Shared Vision
It would be easy to get distracted by comparing what Google have done with the details of the Semantic Web and Linked Data vision, but this would be a mistake. Personally I see this as a massive validation of the overall approach: Google have built a large graph database populated with a rich domain model describing things in the world. To quote from Amit Singhal’s announcement: “It’s the intelligence between these different entities that’s the key”.
The Knowledge Graph, and the product it’s built on, is clearly going to improve as more structured data is added to the web; data that will also be available for anyone else to crawl, mine and process. The basis for network effects is already in place. It’s irrelevant whether the Knowledge Graph itself is published as Open Data or via an API, although that may happen in the future. The underlying technologies themselves are an implementation detail.
What’s important is that the approach, particularly the reliance on the web as a data source and exploiting the value of relationships between things, is Semantic Web through and through.
If you’re interested in exploring these ideas further, especially how your organisation could contribute to or build its own “knowledge graph”, then get in touch.