Last week TechCrunch UK wrote about True Knowledge, a search startup that uses AI/Semantic Web techniques. The post, entitled "VCs price True Knowledge at £20m pre-money. Is this the UK’s Powerset?", stated

The chatter I’m hearing is that True Knowledge is being talked about in hushed tones, as if it might be the Powerset of the UK. To put that in context, Google has tried to buy the Silicon Valley search startup several times, and they have only launched a showcase product, not even a real one. However, although True Knowledge and Powerset are similar, they are different in significant ways, more of which later.
Currently in private beta, True Knowledge says their product is capable of intelligently answering - in plain English - questions posed on any topic. Ask it if Ben Affleck is married and it will come back with "Yes" rather than lots of web pages which may or may not have the answer (don’t ask me!).
Here’s why the difference matters. True Knowledge can infer answers that the system hasn’t seen. Inferences are created by combining different bits of data together. So for instance, without knowing the answer it can work out how tall the Eiffel Tower is by inferring that it is shorter than the Empire State Building but higher than St Paul’s Cathedral.
AI software developer and entrepreneur William Tunstall-Pedoe is the founder of True Knowledge. He previously developed a technology that can not only solve commercially published crossword clues but also explain how the clues work in plain English. See the connection?
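
As a rough illustration of the kind of inference the quoted passage describes, here is a minimal Python sketch that bounds an unknown height by chaining "shorter than" facts. It is not True Knowledge's method, which the post does not describe. The function name, the fact format and the approximate heights in metres are all assumptions made for this example, and the sketch only produces a range rather than an exact answer.

```python
def infer_height_bounds(known_heights, shorter_than, target):
    """Bound the height of `target` using pairwise "a is shorter than b" facts.

    known_heights: dict mapping a name to a known height in metres.
    shorter_than:  list of (a, b) pairs meaning "a is shorter than b".
    Returns (lower, upper); either side is None if nothing bounds it.
    """
    taller_of = {}   # name -> things known to be directly taller than it
    shorter_of = {}  # name -> things known to be directly shorter than it
    for a, b in shorter_than:
        taller_of.setdefault(a, set()).add(b)
        shorter_of.setdefault(b, set()).add(a)

    def reachable(start, edges):
        # Follow the relation transitively (depth-first search).
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    uppers = [known_heights[n] for n in reachable(target, taller_of)
              if n in known_heights]
    lowers = [known_heights[n] for n in reachable(target, shorter_of)
              if n in known_heights]
    return (max(lowers) if lowers else None,
            min(uppers) if uppers else None)


# Approximate, illustrative figures (assumptions for this sketch).
heights = {"Empire State Building": 381.0, "St Paul's Cathedral": 111.0}
facts = [
    ("Eiffel Tower", "Empire State Building"),  # Eiffel Tower is shorter
    ("St Paul's Cathedral", "Eiffel Tower"),    # St Paul's is shorter
]
print(infer_height_bounds(heights, facts, "Eiffel Tower"))  # (111.0, 381.0)
```

The system never stores the Eiffel Tower's height, yet it can still say something useful about it by combining two facts it does have.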

The scenarios described in the TechCrunch write-up should sound familiar to anyone who has spent any time around fans of the Semantic Web. Creating intelligent agents that can interrogate structured data on the Web and infer new knowledge has turned out to be easier said than done because, for the most part, content on the Web isn't organized according to the structure of the data. This is primarily because HTML is a presentational language. Of course, even if information on the Web were structured data (i.e. in idiomatic XML formats), we would still need to build machinery to translate between all of these XML formats.

Finally, in the few areas on the Web where structured data in XML formats is commonplace such as Atom/RSS feeds for blog content, not a lot has been done with this data to fulfill the promise of the Semantic Web.

So if the Semantic Web is such an infeasible utopia, why are more and more search startups using that as the angle from which they will attack Google's dominance of Web search? The answer can be found in Bill Slawski's post from a year ago entitled Finding Customers Through Anti-Commercial Queries where he wrote

Most Queries are Noncommercial

The first step might be to recognize that most queries conducted by people at search engines aren't aimed at buying something. A paper from the WWW 2007 held this spring in Banff, Alberta, Canada, Determining the User Intent of Web Search Engine Queries, provided a breakdown of the types of queries that they were able to classify.

Their research uncovered the following numbers: "80% of Web queries are informational in nature, with about 10% each being navigational and transactional." The research points to the vast majority of searches being conducted for information gathering purposes. One of the indications of "information" queries that they looked for were searches which include terms such as: “ways to,” “how to,” “what is.”
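
To make the idea of spotting "information" queries concrete, here is a toy Python sketch that flags a query as informational when it contains one of the three cue phrases quoted above. This is only a keyword heuristic and not the classifier from the paper, whose actual method is not described here. The function and constant names are made up for this example.

```python
import re

# Cue phrases taken from the quoted passage; everything else is an assumption.
INFORMATIONAL_CUES = ("ways to", "how to", "what is")

# Match any cue as a whole phrase, ignoring case.
_CUE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(cue) for cue in INFORMATIONAL_CUES) + r")\b",
    re.IGNORECASE,
)


def looks_informational(query):
    """Return True if the query contains one of the informational cue phrases."""
    return _CUE_PATTERN.search(query) is not None


for q in ("how to tie a tie", "What is the Semantic Web?", "ipod prices"):
    print(q, "->", looks_informational(q))
```

A heuristic like this will miss plenty of informational queries that don't use these phrases, which is part of why classifying intent from the query text alone is hard.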

Although the bulk of the revenue search engines make is from people performing commercial queries such as "incredible hulk merchandise", "car insurance quotes" or "ipod prices", these are actually a tiny proportion of the kinds of queries people want answered by search engines. The majority of searches are about the five Ws (and one H), namely "who", "what", "where", "when", "why" and "how". Such queries don't really need a list of Web pages as results; they simply require an answer. The search engine that can figure out how to always answer user queries directly on the results page, without making the user click through half a dozen pages to find the answer, will definitely have moved the needle when it comes to the Web search user experience.

This explains why scenarios that one usually associates with AI and Semantic Web evangelists are now being touted by the new generation of "Google-killers". The question is whether knowledge inference techniques will prove to be more effective than traditional search engine techniques at providing the best search results, especially since a lot of the traditional search engines are learning new tricks.

Now Playing: Bob Marley - Waiting In Vain


Thursday, June 26, 2008 3:29:34 PM (GMT Daylight Time, UTC+01:00)
I've seen this movie before. For background, see On Search: Intelligence.
Thursday, June 26, 2008 5:58:48 PM (GMT Daylight Time, UTC+01:00)
if the innovations of these new companies are strictly in the non-revenue-producing queries, then how do these companies expect to make money?
Thursday, June 26, 2008 6:14:33 PM (GMT Daylight Time, UTC+01:00)
For the most part users don't pick one search engine for commercial queries and another for informational queries (although eBay/Amazon could be thought of as commercial search engines for specific niches). So if users find some engine to be better than Google for their informational queries, they will use it for their commercial queries as well.
Friday, June 27, 2008 5:41:53 AM (GMT Daylight Time, UTC+01:00)
Isn't the goal of wikipedia to give those simple answers of "who", "what", "where", "when", "why" and "how" - If it's something that can be answered in a simple fashion by a system that's crawling the web, I would expect it to hit wikipedia by the time there's sufficient consensus to allow a computer to answer it.
Sunday, June 29, 2008 12:56:59 AM (GMT Daylight Time, UTC+01:00)
Hey Dare, you were led down the garden path on this study! The breakdown into navigational, transactional, and informational does not map cleanly to monetizable from the search engine perspective or commercial from the user POV (e.g. probability to bust out the credit card).

Transactions in this sense are downloads, etc. Queries for brands, say "home depot", are buried in the navigational bucket and they monetize well (though not as well anywhere other than Google, due to teleporting site search).

The best stat I'm aware of published on % of queries with commercial intent is the AdCenter Labs (Honghua (Kathy) Dai, Zaiqing Nie, Lee Wang, Lingzhi Zhao, Ji-Rong Wen, and Ying Li) -- see
Monday, June 30, 2008 10:18:20 AM (GMT Daylight Time, UTC+01:00)
I think the idea of using semantic techniques to improve search has always been a good one, take, for example... they try to improve health search by extracting concept relationship maps and using those maps to answer the question of what the best treatment for any disease is.

So if you try to solve a specific problem or have an actual goal for your semantic analysis it might work, but if you just try to do grammar parsing you end up with not much added value for a lot of work.

The "semantic web"... well that's like IPv6... still waiting to be needed.