As many have already noted, last week at Web 2.0 Peter Norvig, Google director of search quality, demonstrated word clustering, “named entities,” and machine translation technology to the audience. The translation software was impressive, but somehow lacked zing – “good enough” translation doesn’t seem like much of a revelation anymore. That in itself is an extraordinary achievement – Norvig showed translations from Arabic and Chinese, both of which are very different from English. Google already has translation features built into its engine (from a third party), but this hand-rolled stuff was far more powerful, it seemed to me.
In any case, the demos that really got the audience going (and me, to be honest) were the named entities and the clustering technology. Seeing anything behind the veil of Google’s real research and development is of course a revelation, but seeing something that was so clearly ready for prime time felt rather close to a declaration of where Google is heading, in particular given the recent moves in the personalization and clustering space from Amazon, Ask, Vivisimo, and Yahoo.
“Named entity extraction” is a relatively new project which Norvig said Google had been working on for about six months. As Norvig explained the concept – essentially identifying semantically important concepts and the meaning wrapped around them – I couldn’t help but think of WebFountain and my wish (near the end of the post) that Google would add a bit of IBM’s semantic peanut butter into its PageRank chocolate.
Norvig also showed an entertaining (and live) demo of clustering, which he claimed was the “largest Bayesian database of clusters” extant. Hmmm.
From the eWeek story covering the news:
For example, Norvig said, researchers are looking for ways to break down sentences by looking for a phrase like “such as” and grabbing the names that follow it. The goal is to not only pull out the name but also its clusters, so that a name such as “Java” can be associated both with the computer language and with language in general, Norvig said.
“We want to be able to search and find these [entities] and the relationships between them, rather than you typing in the words specifically,” Norvig said.
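To make the “such as” idea concrete, here is a minimal sketch in Python. It is only an illustration of the pattern described in the eWeek story, not Google’s actual method: the regular expressions, the `extract_entities` function name, and the sample sentences are all assumptions made up for this example. It looks for a category word followed by “such as,” grabs the capitalized names that come next, and records which categories each name was seen with.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only; real text needs far more care.
# "<category> such as <tail>" where the tail runs to the end of the clause.
SUCH_AS = re.compile(r"(?P<category>\w+),? such as (?P<tail>[^.;:!?]+)")
# A name is a run of capitalized words, e.g. "Java" or "New York".
NAME = re.compile(r"[A-Z]\w*(?: [A-Z]\w*)*")
# List items are separated by commas, "and", or "or".
SPLIT = re.compile(r",|\band\b|\bor\b")


def extract_entities(text):
    """Map each name found after 'such as' to the categories it appeared under."""
    found = defaultdict(set)
    for match in SUCH_AS.finditer(text):
        category = match.group("category").lower()
        for piece in SPLIT.split(match.group("tail")):
            piece = piece.strip()
            if not piece:
                continue  # blank fragment, e.g. between ", " and "and"
            name = NAME.match(piece)
            if name is None:
                break  # the list of names has ended
            found[name.group(0)].add(category)
    return dict(found)


# Made-up sample text: "Java" shows up under two different categories.
sample = (
    "He writes in languages such as Java, Python and Ruby. "
    "She visited islands such as Java and Sumatra last year."
)
entities = extract_entities(sample)
for name in sorted(entities):
    print(name, sorted(entities[name]))
```

In this toy input, “Java” ends up associated with more than one category, which is the kind of ambiguity that clustering would then have to sort out.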
This has potentially interesting implications in next-generation ranking methodologies, for one, but combined with clustering, it signals that Google is serious about taking what one might call the UI plunge.
What do I mean by that? Well, of all the major engines, only Google has strictly maintained what might be called the C prompt interface to search: put in yer command, get out yer list of results (Google Local is a departure, but it’s still in beta). Yahoo, Ask, A9 and others have begun to twiddle in pretty significant ways with evolved interfaces which – by employing your search history, your personal data, clustering, and other tricks – deliver more filtered and intentional results (though it is still arguable if they are more relevant). I sense it’s only a matter of time before Google takes this approach as well, and Norvig’s demo certainly points that way. After all, it’s not that often Google decides to give us a glimpse behind the curtain, and coupled with Google Board member John Doerr’s semi-announcement the day before (he told the audience that Google would become “the Google that knows you”) I think the UI plunge might come sooner than we all expect.
If you want to know more about how Google is thinking about clustering, here’s a paper written by a Google team, courtesy of a link from Don Park.
Update: Lazy linking on my part: the clustering paper is about hardware (though it is really interesting…)
6 thoughts on “Google’s Web 2 Demo and the UI Plunge”
IMHO, I think that the ‘The Google Cluster Architecture’ PDF document you link has nothing to do with the technology Peter Norvig explained.
The document explains how the X,000 Google servers are clustered in order to run quickly and help with users’ queries.
And what Peter explained on ‘Web 2.0’ was how Google plans to cluster search results by learning the meaning of the web pages.
You’re right! Sorry about that. I should not rushlink, as I did to that page. My bad.
Your mistake is a good sample of the use of clustering (one word, several meanings). Perhaps you used Google to search “Google+cluster”, and the first result is the document you linked. But it wasn’t the “cluster” you looked for.
It’s good to see some technical hints from Google, although not much to go on.
Machine Translation – destined to be damned with faint praise. Chinese isn’t hard to translate; most of the relative difficulty of the language lies in learning the characters. Japanese and Korean have a much more unique grammatical structure. No idea on Arabic.
Named Entity Extraction (as opposed to unnamed entities? nouns?) – This sounds like Google Sets. IIRC they were scanning text for list-type noun phrases (eg North, South, East and West), and building up associations from them.
Clustering – sounds like they’ve rediscovered data mining. It’s not clear what algorithm they’re using, but I can think of a few that might work with a large inverted index.
There’s a new paper from Google in OSDI 2004
MapReduce: Simplified Data Processing on Large Clusters
I think that Google’s new context translation is a great thing.