This paper “presents a high-level discussion of some problems in information retrieval that are unique to web search engines,” according to its abstract in the ACM library. (A reminder as to what this whole “Search Papers” thing is about: read this.) “The goal is to raise awareness and stimulate research in these areas,” it continues. How might such a lofty aim be backed up? Well, it’s written by two senior employees of Google, Monika R. Henzinger and Craig Silverstein (I’ve met Craig; he was employee #1 after Larry and Sergey, and a nice guy to boot), as well as Rajeev Motwani, a professor at Stanford (Craig was his graduate student).
The paper is dated September 2002, so it does not rank as a missive from the early, more geeky phase of Google’s life, but rather as a more corporate product – the two Google authors knew they bore the weight of “being Google” when they wrote this paper, and it’s worth keeping that in mind when reading through it.
This is particularly clear in the paper’s scope and focus. It lays out six challenges for search engines – and they read like a laundry list of Google’s headaches. The paper then goes on to offer suggested paths for more research on the topics, which I could imagine might read either as genuine or a tiny bit patronizing, depending on who you are. (The paper does not tackle a range of other issues it says are already the subject of abundant research – natural language queries, image/audio search, improving text-based retrieval, language issues, or interface/clustering, for example.)
First among the stated problems is spam – folks who try to game search engine listings for their own commercial gain (this is clearly Google’s biggest problem, dominating a lot of their time). Second and third are content quality and quality evaluation – how to determine the relative value of content on a web page, and how to determine if your algorithms w/r/t same are working. Fourth is something they call “web conventions” – how to create useful search engines given the fact that the web follows loose conventions rather than strict rules. Fifth is the problem of duplicate hosts – two hosts that serve the same content (eliminating these would unclog search results; Google has sometimes been criticized by its competitors for having too many duplicate pages). And sixth is the wonderfully termed “vaguely structured data” – XML is mentioned, but dismissed – the authors instead suggest there is value in understanding conventions of HTML presentation (the way a page looks) and somehow using that to make searches better.
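To make the duplicate-hosts problem concrete, here’s a minimal sketch of my own (not from the paper): flag two hosts as likely “duphosts” when content fingerprints of their pages at shared paths mostly match. The function names, the sampling-by-path idea, and the 0.8 threshold are all my assumptions for illustration, not anything the authors propose.

```python
import hashlib

def fingerprint(page_text: str) -> str:
    # Hash normalized page content to a short fingerprint,
    # ignoring whitespace and case differences.
    normalized = " ".join(page_text.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def likely_duphosts(pages_a: dict, pages_b: dict, threshold: float = 0.8) -> bool:
    """Compare sampled pages from two hosts, keyed by URL path.

    Flags the hosts as likely duplicates when the fraction of shared
    paths whose content fingerprints match exceeds the threshold.
    """
    shared = set(pages_a) & set(pages_b)
    if not shared:
        return False
    matches = sum(fingerprint(pages_a[p]) == fingerprint(pages_b[p]) for p in shared)
    return matches / len(shared) >= threshold

# Toy example: two mirrored hosts serving the same pages.
mirror1 = {"/index.html": "Welcome to Example", "/about.html": "About us"}
mirror2 = {"/index.html": "Welcome  to example", "/about.html": "About us"}
print(likely_duphosts(mirror1, mirror2))  # → True
```

A real crawler would of course sample paths rather than fetch everything, and would use shingling or similar near-duplicate measures rather than exact hashes, but the basic shape – compare content across hosts, collapse the mirrors – is the same.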
So as not to bore the lot of you, I won’t go into detail on each. Suffice to say, this paper was interesting and a worthy read if you are a student of the company and/or the field. I have only now begun to read the more recent public papers from Google scientists, so I can’t compare them as a corpus. A few notes: It’s not clear who this paper was really written for, as there are notes that seem aimed at less technical readers (i.e., one note explains what a crawler is – are there really folks in the research community who are not web savvy?). The paper toots Google’s horn a few times (it says PageRank is not vulnerable to some types of spam), admits where Google has weaknesses, gives props to Jon Kleinberg’s HITS algorithm (upon which some say PageRank is based), and even seems to float some trial balloons to the research community (on how to detect spamming tactics, for example, in section 2.4). I did take issue with some of the editorial assumptions in the “Content Quality” section, but I won’t go into all that here. Drop me a line if you want to discuss. And…if you are a researcher in this field, or know one, I’d be interested in what the academic community thinks of this paper, and any others I post on as well. Thanks!