As I posted earlier, Yahoo’s claim of indexing more than 20 billion items ruffled more than a few feathers across the web, and nowhere more distinctly than at Google. I spent an hour or so on the phone with a group of Google folks, and they shared a lot of information about how they measure index size, how they deal with issues of duplicate URLs and documents, and why they are baffled by Yahoo’s claim.
I am still reporting this story, so a longer post is forthcoming, but an update at the end of the day is worth penning.
First of all, I agreed to review some of the Google information on background, agreeing not to disclose it save with permission. (I agreed to this only if I could tell you all that I did in fact agree to it.) I am still digesting what Google had to say, and the information they sent me, but it did leave a distinct set of questions percolating in my mind, questions that I plan to speak to Yahoo about (Yahoo has agreed to talk as well; we just haven’t had time yet).
In any case, the lead really is this: I asked Google to go on the record with their concerns about Yahoo’s index and whether they believed the news was in fact accurate, and Google agreed. The quote, which I can only attribute at this point to a “Google spokesperson,” is as follows:
“Our scientists are not seeing the increase claimed in the Yahoo! index. The data we have doesn’t support the 19.2 (billion page) claim and we’re confused by that.”
Now, the size of an index is only one part of the equation of what makes a good search engine – relevance, speed, UI, and other factors are also critical. But when it comes to comprehensiveness (size), Google has been king pretty much since day one, save a couple of short lapses, one with FAST in 2002 and another in 2003, as I recall, with Yahoo (briefly). The company has always trumpeted its size on its home page, and Yahoo’s announcement had to come as a slap in the face. Down to the presumptive specificity of the pronouncement on its home page since 2000 – “searching 8,168,684,336 web pages” – Google set the tone for all future “size matters” battles.
I plan a longer post on this, as I said, but there are some tantalizing examples (I will add some in the next post) that one might expect would yield significantly different results between Yahoo and Google, given Yahoo’s massive new size, but don’t. The math, in essence, seems not to be adding up. At least, that is what the Google scientists are saying. But then again, I am not a mathematician, and there are always at least two sides to the story. So stay tuned and we’ll see how this one plays out…
(I must say, this calls for a benchmark/standard for measurement that might make all of this moot…)