Google Announces New Index Size, Shifts Focus from Counting

Goog No Count

Under embargo last week, I spoke to Marissa Mayer about Google search. I do this often, as part of the normal news cycle, but this time was different. After clearing her throat with some 7th birthday news, she dropped the other shoe – Google is now claiming that its index is three times bigger than its competition. “Wow!” I said. “How can you tell?” “Tests we’ve done,” Mayer responded. “But…those are the same tests we’ve been debating since August, right? The same tests Yahoo claims are inconclusive and not to be trusted!”

Yup, that’s right. The index wars are over, at least in terms of raw counting. Google has taken its ball and gone home. The company has decided to take the McDonald’s-like number off its website – “8 billion pages served…” – and instead simply claim to be more comprehensive. “Google is the most comprehensive search engine by far,” Mayer told me. Can she prove that? Not easily. But there you have it.

Problem is, while Google is clearly sincere in making this claim – I don’t doubt they believe it – the company refuses to call out any numbers or walk anyone through how they can prove it (other than a battery of disputed tests that honestly, no single person could reliably execute anyway).

In fact, this announcement, tied to Google’s 7th birthday, is a major exercise in changing the rules of the game. Google has been increasing its index of late, Marissa said, and many out there have noticed it, including many commenters on this and other sites. The company was getting ready to back this claim, that’s for sure. First, it’s clear that this is a response to Yahoo’s earlier announcement on index size. To pretend otherwise is naive. Second, by refusing to count anymore, Google is forcing the debate back to relevance, where, honestly, it really belongs.

I asked Marissa: since Yahoo claims 20+ billion documents, and Google claims to be three times larger, might not folks simply presume that Google has 60 billion documents in its index? The answer goes to the heart of the index debate in the first place: Google does not count the way Yahoo seems to, so the comparison is apples to oranges. Google is counting one way, Yahoo another. So the numbers don’t add up.

I then asked Marissa if Google would be open to having a third party, agreed to by both sides, settle this in some reliable fashion. She said sure, but as she answered, I realized this will never happen. Both sides think they are right, and both sides will never divulge how they go about counting in the first place. So where are we left? Pretty much where we’ve been, only now, it’s all about who you believe. So who’s more comprehensive? Depends who you ask…..

Yahoo sent me a response late tonight. Here it is, in its entirety:

“We congratulate Google on removing the index size number from its homepage and recognizing that it is a meaningless number. As we’ve said in the past, what matters is that consumers find what they are looking for and we invite Google users to compare their results to Yahoo! Search at http://search.yahoo.com.”

Er, sorry Yahoo. I don’t buy that one. Why on earth, then, did you announce that 20 billion number in the first place?

Well, at least this is the end of it. I’m not sure either company came off well in this particular dust up, but it seems to have been fought to some kind of a draw, at least for now.

Update: Eric Schmidt spoke with Markoff for this Times piece, in which he announces that Google will encourage folks to “guess” the size of Google’s index. And the closest person will win something. Maybe. Sheesh.

18 thoughts on “Google Announces New Index Size, Shifts Focus from Counting”

  1. So Google played the size game to its advantage for seven straight years, and now decides to stop playing… why, you might ask? Why would you take a feature you boasted about for years and suddenly remove it?

    To me the answer is obvious: Google can no longer trump the competition. For Google to say “our counting method is different, so we’re no longer going to tell the world what it is” is a) admitting defeat, and b) like saying, “Well, now that you have a bigger house, let’s stop measuring house size, because we use the super-geeky-cool Klingon method and you use the metric system.”

    Or better yet, kind of like a kid’s favorite toy being taken away and them saying, “Well, I did not really like it anyway.” How childish.

    Comprehensiveness matters, and how many documents you index demonstrates it. If it does not matter now, why did it matter a year ago?

    Google is starting to admit defeat on the quality of its search engine. How interesting…

    Even better, for this announcement Google is back to talking to… you guessed it… CNet of all the folks.

    Isn’t CNet the outlet that used Google to dig up information on Google’s CEO Eric Schmidt, and didn’t Eric get so mad at CNet for writing about it that Google said it was going to stop talking to CNET for a year?

    I guess now that Google needs folks to tell their size-does-not-matter story Google is back to speaking to them. How childish is that?

    I do not know which is worse: being arrogant while riding the high horse of not being evil, or being arrogant and then retreating from your high ground.

    Google is starting to really suck as a cultural icon… I really wish Apple did search.

  2. Google is right that raw page-count, well, doesn’t count. And I don’t think, as Thomas says, they’re just saying this because they realized they can’t beat Yahoo at the quantity game. I believe it’s meaningless to count pure pages: on my own server, I could create 1 million pages full of garbage content (or redundant content) and wait for a searchbot to come around. Does that mean I added 1 million pages of value for users? No, all I did was write a script which dynamically generates pages (and you won’t even be able to tell by looking at the URL, because it doesn’t have any GET parameters — the magic of htaccessified, friendly URLs). So when Google says “10 billion pages” and Yahoo says “20 billion pages”, what do those numbers mean if 5 billion of them are automatically generated spam pages? Right, there’s no use for those numbers, because you can artificially boost them.

    The question remains though, if Google throws out copies of redundant content, how do they know which one is original and which one is the copycat?
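
    The trick described above can be sketched in a few lines of Python; everything here (the function names, the hashing scheme) is purely illustrative, not taken from any real site:

    ```python
    # Illustrative sketch: a single tiny script can expose an effectively
    # unlimited number of crawlable "pages" behind clean, parameter-free URLs.
    # All names here are hypothetical.
    import hashlib

    def make_page(path):
        # Derive deterministic pseudo-content from the URL path alone, so
        # every distinct path yields a distinct but stable "page" for a crawler.
        seed = hashlib.md5(path.encode()).hexdigest()
        words = [seed[i:i + 4] for i in range(0, len(seed), 4)]
        # Link to a few further generated paths, so the crawler always finds
        # "new" documents -- none of which carry any real content.
        links = " ".join(
            f'<a href="{path.rstrip("/")}/{w}">{w}</a>' for w in words[:3]
        )
        return f"<html><body><p>{' '.join(words)}</p>{links}</body></html>"

    def wsgi_app(environ, start_response):
        # Served behind any WSGI server (or an Apache rewrite rule), the URL
        # the crawler sees never shows a GET parameter.
        start_response("200 OK", [("Content-Type", "text/html")])
        return [make_page(environ.get("PATH_INFO", "/")).encode()]
    ```

    Every request mints a fresh page with three outbound links to more generated pages, so a naive page count grows without bound while zero value is added.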

  3. This sounds like the often meaningless Intel vs. PowerPC “how fast is my chipset” argument cast anew. The size of the index depends on how you build the index and doesn’t reflect an intrinsic value—it’s how you use the index to deliver value to the user that matters.

    Google may be “right” that index size doesn’t count, but they said their index is bigger while, as they increasingly have lately, pulling the kimono closer when they should be concerned about improving transparency.

  4. I think Andrew Orlowski already explained the “3 times bigger” phenomenon (surprisingly, coming up with the same estimate):

    http://www.theregister.co.uk/2005/08/16/google_yahoo_junk/

    “For the spam friendly gibberish words “carbolization clambers” Google returned 7 pages, all from a dictionary, and Yahoo! returned none. For the words ” alkaloid’s observance”, Google returns 30 pages and Yahoo none. In other words, the methodology is geared not to measure who has the most useful documents, but who has the most spam. To be more precise, in these examples, Google returns a number of copies of a dictionary file. It’s a different frequency of noise.”

  5. It might be more interesting to see how many pages are delivered to users in any form. 8 billion pages sitting in a server room isn’t impressive, but if they manage to deliver all 8 billion to people who want to read them, well then there’s something to talk about.

  6. Google and Yahoo think numbers equal better results. I’d rather hear that results are relevant, or that they have come up with a new system to stop spamming sites in their tracks.

  7. Maybe if we all had stopped talking about the size wars in the first place and only paid attention to the pertinent PR releases these companies were making / or would have made (e.g. “we have improved relevancy 10-fold and here’s the white paper and test methodology, please sanity-check it for us”), then we would all be better off /.

    For me, I wish “everyone” would focus their energies on reducing the non-effective parts of the engines’ algos ~ that of combatting spam etc. The sheer waste of resourceful computing power will always be a thorn in the backside of relevancy advancement.

  8. Google had 8 billion pages, and now claims 3x anyone else – which is presumably 60 billion pages; there really wasn’t much that I couldn’t find before that would be possible to find now (the deep web still exists), so essentially Google added 52 billion useless documents. Not to mention, chances are that if Google didn’t crawl these documents before, they’re probably not SEO-ideal and they won’t be highly positioned for any common query; so even though the documents are there, they still won’t get viewed (who goes past the 3rd SERP?).
    Even if Google lost the indexing war – who cares? Search as we know it is becoming passé (why else would Google be making all of these brand extensions?). It’s time for new technologies to refine the process and organize the world’s information in even more useful ways.

  10. I agree that Kendall Willets is right: Google wants to have as many pages as possible but is beginning to forget about quality. It seems to me that their PageRank idea is too weak for such large quantities of pages.

  12. Speaking of the Google index and analyzing it, I’ve come across a great tool called SERPalyzer. I wouldn’t mind having John Battelle review it. It’s not an ad in any possible way – I just got excited when I discovered it. The small tool lets you analyze Google search results without leaving the Google page. Pretty cool.

  13. It would be valuable if you would sometimes choose between your search results and indicate which web site is most useful, while at the same time indicating whether that web site fell in the top ten results or not.
