Mike Lynch has a high profile in the UK as one of our few technology success stories. He is CEO of Autonomy, a company that provides knowledge management solutions for enterprises, government organisations like the NSA, and the search engine Blinkx.
The secret sauce of Autonomy is statistical analysis: a technique based on Bayes' theorem.
Put in layman's terms, Bayes' theorem essentially means that the probability of something happening can be estimated from how often it has happened in the past.
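That idea can be made concrete with a toy relevance classifier. This is a minimal sketch of the general Bayesian principle (a naive Bayes model over past relevance judgements with invented example data), not Autonomy's actual technology:

```python
from collections import Counter

# Toy illustration of the Bayesian idea: estimate how likely a document
# is to be relevant from how often its words appeared in documents
# previously judged relevant. A sketch of the principle only.

def train(docs):
    """docs: list of (words, relevant) pairs from past judgements."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for words, relevant in docs:
        class_counts[relevant] += 1
        word_counts[relevant].update(words)
    return word_counts, class_counts

def p_relevant(words, word_counts, class_counts):
    """Posterior P(relevant | words) via Bayes' theorem, add-one smoothed."""
    total = sum(class_counts.values())
    vocab = len(set(word_counts[True]) | set(word_counts[False]))
    scores = {}
    for cls in (True, False):
        # prior: how often documents have been relevant in the past
        score = class_counts[cls] / total
        n = sum(word_counts[cls].values())
        for w in words:
            # likelihood: how often this word appeared in past docs of this class
            score *= (word_counts[cls][w] + 1) / (n + vocab)
        scores[cls] = score
    return scores[True] / (scores[True] + scores[False])

# invented past judgements for illustration
past = [(["share", "price", "results"], True),
        (["lunch", "menu"], False),
        (["quarterly", "results"], True),
        (["canteen", "menu"], False)]
wc, cc = train(past)
print(p_relevant(["quarterly", "share", "results"], wc, cc) > 0.5)  # True
```

The model has no idea what any of these words mean; it only counts how often they co-occurred with past relevance, which is precisely the strength and the limitation discussed below.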
This technique has its merits: Lynch heads a successful company, and the research arms of Microsoft and Google put a lot of store in it as an approximation technique.
However, with the continued proliferation of data, sorting it into information requires a greater degree of semantic understanding. Let's do a thought experiment: suppose you have an approximation that returns documents based on an input query.
For total repositories of different sizes, where 0.01 per cent of documents are deemed relevant by a Bayesian agent:
- 100,000 – that’s a list of 10 documents
- 1,000,000 – that’s 100 documents
- 20,000,000,000 – roughly the size of Yahoo!'s search index, according to Stephen Taylor, European head of Yahoo!'s new Audience group, speaking at the Blogging4Business conference last Wednesday – that's 2,000,000 documents
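The figures in the list are just 0.01 per cent of each repository size; a few lines of Python confirm the arithmetic:

```python
# 0.01 per cent is 1 in 10,000, so each repository size divides neatly
for size in (100_000, 1_000_000, 20_000_000_000):
    relevant = size // 10_000  # 0.01 per cent of the repository
    print(f"{size:,} documents -> {relevant:,} deemed relevant")
```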
Within an enterprise, Bayes' theorem is perfectly acceptable, but in a more open environment like the internet an additional semantic filter is required to find the content you want amongst those 2,000,000 returned documents. Because that 2,000,000 is an approximation, there may be only a few documents you actually want in there – or none at all.
That semantic filter is proving difficult to provide, which is why Google has stayed out in front in search for so long; any progress being made is baby steps. Crystal Semantics, a small company based in Holyhead, used the brute force of linguists to plot all the phrase clusters (called lexemes) in the English language – though it may not work for technical or company-specific language without a great deal of additional work. Whilst they have met with most of the major players in search, none of them has adopted their technology for search results.
Instead, companies like Yahoo! have fallen back on human computing in the form of tagging. In an interview with Tom Foremski of Silicon Valley Watcher, 'Autonomy CEO says tags don't work', Lynch points out the very human failings of laziness and inconsistency as major pitfalls in the usefulness of tagging.
Lynch does have a point: humans fail. However, they are also a great source of massively parallel processing, and they are particularly good at the kind of ambiguous problems that technology isn't.
The problems that Lynch raises can be seen being addressed in various products:
- A common nomenclature: my MyWeb account suggests existing tags as I start typing when I bookmark a new item, as does Flickr; at the Blogging4Business conference, attendees agreed on a common tag of B4B2007
- Incentivising people: this is less about the hard technology that Lynch gets and more about the community management and social engineering pioneered by the likes of Caterina Fake and Heather Champ at Flickr. In a carefully built community, the constituents see the collective benefits of tags and are willing to contribute. Granted, most people who use these products are consumers of information, but in that respect it probably mirrors the kind of systems that Lynch is more experienced with. A second way of doing this is to make the human interpretation fun; a classic example is Luis von Ahn of Carnegie Mellon University, who created the ESP game. Google has been using this technique to further enhance its own image search via the Google Image Labeler. Now, I realise that this probably won't be possible with items in a finance department or a library of whitepapers, but it does demonstrate the power of social engineering
- Not all information will be tagged: knowledge management system users are constantly looking to sort useful information from useless, which provides a motivation to tag useful content and let the less relevant, untagged content submerge over time
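The tag-suggestion behaviour described in the first bullet – MyWeb and Flickr completing existing tags as you type – boils down to prefix matching against the community's tag list. A minimal sketch, with an invented tag list for illustration:

```python
# Minimal sketch of prefix-based tag suggestion, in the spirit of MyWeb
# and Flickr: nudge users towards tags the community has already used,
# which pushes the folksonomy towards a common nomenclature.
# The tag list below is invented for illustration.

def suggest(prefix, existing_tags, limit=5):
    """Return up to `limit` existing tags starting with what the user typed."""
    prefix = prefix.lower()
    matches = [t for t in sorted(existing_tags) if t.lower().startswith(prefix)]
    return matches[:limit]

community_tags = ["B4B2007", "B4B", "Blogging4Business", "bayes", "blogging"]
print(suggest("b4b", community_tags))  # ['B4B', 'B4B2007']
```

Surfacing existing tags is what turns a pile of personal labels into a shared vocabulary, without forcing anyone into a rigid taxonomy.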
On a related note, information capture is a far bigger issue. Jeff Weiner used to present a thought experiment in which he came up with a ridiculously low percentage of human knowledge being online; I can't remember it exactly, but the gist of it is below.
Let's exclude all the countless recordings, books and video that haven't been digitised and never will be. Now let's make some assumptions:
On average, the 6.5 billion people in the world each know 1,000 distinct things, be it a family cake recipe passed down from a parent or particular domain knowledge: the tricks of the trade my Dad has developed in his 40 years as a mechanical fitter, who the best quality wholesale butcher in Mansfield is, or the ins and outs of the housing market in St Helens.
In reality, 1,000 pieces of distinct knowledge is probably a conservative estimate, even allowing for the number of children in the world's population. That comes to roughly 325 times as much knowledge as is indexed by Yahoo! Search (allowing for spam and discounting repetition in the index, this multiple would rocket higher).
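Under those assumptions, the 325× figure falls out of simple arithmetic:

```python
# Back-of-envelope arithmetic for the thought experiment above
people = 6_500_000_000        # world population
pieces_each = 1_000           # distinct things each person knows, on average
index_size = 20_000_000_000   # Yahoo!'s search index, per Stephen Taylor

human_knowledge = people * pieces_each  # 6.5 trillion pieces of knowledge
print(human_knowledge // index_size)    # 325
```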
Tags aren't part of a rigid taxonomy but a folksonomy, so you can have several tags, some of which may hit the mark. It's the reason why I have found I am more likely to find a picture that suits a blog post on Flickr than on Google Image Search, and why I would look for items on the Blogging4Business conference under similar tags like B4B and Blogging4Business, as well as B4B2007.
In an ironic twist to the Silicon Valley Watcher story, one of the reasons Mr Lynch was doing media outreach was the launch of Autonomy's Virage ACID product; Virage was a video search start-up acquired by Autonomy.