More On Yahoo, Google, Index, Size

I had a long chat today with folks from Yahoo about the ongoing “size matters” tempest, and it was once again enlightening. I’m planning a longer post on all this, but the upshot of our conversation was that Yahoo stands by its number, that it agrees with many that size alone does not matter, that any claims that any one company can accurately estimate another’s index are simply not defensible, and that, in the end, the proof will be in the results.

Yahoo also acknowledged that it was certainly aware of the PR angle when it made its announcement, and that given Google’s home page claim regarding index size, it was hardly a new tactic to tout that number.

I think there’s more to this story than meets the eye, in terms of a major, multi-billion dollar tussle for the hearts, minds, and pocketbooks of millions of web users. Sure, the math is hard, and the science even harder, but at the end of the day, I think size matters, a lot. Maybe not so much to the ultimate results one gets – that may well be a case of “it’s not the size of the wand, it’s how you wave it” – but in terms of bragging rights and marketing mojo. Perhaps the ultimate end game of all this will be a deeper cultural awareness of what constitutes good search, but then again, no one ever got rich overestimating the public’s taste for nuance.

BTW, several sources contacted me to remind me of a fact we all know to be true – that Google’s claimed size on its home page – of roughly 8 billion documents – is pretty out of date. Since they put that up, nearly a year ago (scroll to bottom), I’m pretty sure the discoverable web has grown by, oh, at least a few billion pages, and I’m also pretty sure Google knows about those pages. Recall that Google increased its index by roughly a factor of two back then, as a response, one would presume, to Microsoft’s claim to have trumped Google’s number, which had been reported at about 4 billion. I mean, heck, this new post will create a page, and I bet Google (and Yahoo and everyone else) will have found it within a week, if not a day. Blogs alone are adding millions of pages every week.

Would I be surprised if Google announced shortly that its index was magically up to, oh, 22 billion or so? No, I would not. I think if and when that day comes, the timbre of this debate will change. Clearly, such a change would not have occurred overnight.

Heck, if engines are going to do it anyway, I’d love to see the static numbers on home pages (Yahoo sometimes touts its Image Search index size on that service’s home page, by the way), replaced with a counter that is updated constantly. Kind of like that national debt billboard, but for the overall size of the web as discovered by each engine. Why not, at least it’d be more accurate….

17 thoughts on “More On Yahoo, Google, Index, Size”

  1. Actually, I did a quick analysis of the number of referring links in Google’s, Yahoo’s and MSN’s indexes and found that, for a sample of 100 sites, Yahoo had on average 50 referring links to Google’s 1 link. More surprisingly, MSN had 6.5 referring links to Google’s 1 for an average URL in the sample.

    Read my post at http://dumbsearch.blogspot.com . I also included the spreadsheet that contains the ratio analysis.

    Admittedly, this is not a scientific study, although the numbers point to strong evidence that Yahoo’s index is over one order of magnitude greater than Google’s index given that the index size is proportional to number of links in the index. This is a very safe assumption as far as I am concerned.

  2. John, no question in my mind that bigger databases are better. At minimum, bigger databases increase the likelihood of success with Long Tail inquiries. (For short head inquiries, bigger doesn’t necessarily mean better; it’s all about the relevancy algorithm). However, I continue to wonder just how many indexed pages are spam pages. It wouldn’t surprise me if this number was in the billions. If index sizes are growing due to spam pages, then the bigger databases really have no value for either searchers or the search engines. Eric.

  3. What about quantity vs. quality? In the recent Yahoo update of July 20 the Yahoo Blog noted that we should see more of our pages in the index. Does this mean that we will need to weed through a lot of spam and garbage to find the results we really need? What discrimination was used in creating the mega-index? As a consumer I’ve already become frustrated with results and am finding I am not even attempting to search through Yahoo anymore.

  4. John, MSN claimed back in April to have an Index size “north of 5 billion” documents; see – seomoz.org/articles/msn-search-interview.php

    Also, see this document from Jan. of 2005, where Antonio Gulli, a major IR scientist from Italy, has estimated the total size of the WWW – http://www.cs.uiowa.edu/~asignori/web-size/. I would ask him directly about Yahoo!’s claim.

    Thanks for keeping on top of this.

  6. All Yahoo seems to have increased is the number of estimated results they give. Their estimates appear to have grown by 2-3X. The number of results themselves? I still do better at Google.

    Has Yahoo even launched their larger index? I just don’t see it. I could understand if they said it hadn’t been launched yet, but, if they claim it is live, I am just not seeing the effect.

  7. I’m also looking into this and have gotten the same briefings from Google and Yahoo!. I did some tests which I detail on my blog at blogs.forrester.com/charleneli. What I’m really puzzled by are 1) if the index is really that much bigger, why aren’t the results better, and 2) how come when I walk the floor at SES, marketers and agencies report that they haven’t seen much change on Yahoo? I’m doing more research into it, but in one area, I disagree with John: I don’t think Google will come out with a larger index number quickly, because it would require that they significantly change the way they “count” what’s in the index. If they have a beef with the way Yahoo counts, they can’t easily turn around and count differently unless they want to be called hypocrites.

  8. John, I also did a quick study and found some interesting results. Specifically, while Yahoo estimates more results on their first results page, they usually only deliver 30% of that estimate. Also, with respect to unique and even total results, Google seems to give me 50-70% more hits. You can look at my queries and results in my entry at: http://blog.akashjain.org/2005/08/12/is-yahoos-index-really-bigger-methinks-not-really-googles-index-seems-50-larger/
    – definitely interested in anyone else’s thoughts.

  9. I don’t know how many have sat through Peter Norvig’s “size matters” presentation on machine learning algorithms, but his central thesis is that the worst algorithm outperforms the best ones, given a large enough sample size.

    He also said that about 30% of their crawl is duplicates, so I can see where there’s room for debate about net index sizes.

  10. Several researchers (myself included) at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign ran a fairly extensive study about this issue (about 10,000 queries) and found that Google returns more results than Yahoo in almost every single case. We found that Google returned well over 150% more total results and gave more results in about 97% of our queries.

    Our full study and test code is available online at: http://vburton.ncsa.uiuc.edu/indexsize.html

  11. “any claims that any one company can accurately estimate another’s index are simply not defensible”

    Implying they can say whatever they want because nobody can prove them wrong anyway.

    Whatever the size of their index may be (I can’t believe I just wrote that), this story just reeks of PR (as they admit as well). Ultimately, however, people don’t care about “index size”; they just care about better search results.

  12. Thanks for the study, Matt. One problem I noticed is that spam is biased towards random word combinations, so queries like “ensiform teleprompter” return a lot of spam and porn sites. Apparently the web spammers have adopted the same tactic used by email spammers, of loading a page with random terms to dilute the significance of the spam terms.

    Have you made any attempt to separate spam from the results?

  13. I know this is an old post, but I’ve seen all kinds of increases in my search engine saturation for my sites, first from Yahoo right before their big index claim, and now Google looks like it’s ready to fire back. The number of pages available for my sites has tripled in the Google index in less than a few weeks. I’d bet they’re going to do something astronomical to their number soon. In some cases it appears that they are crawling more, and in other cases it looks like they might have loosened their de-duplication rules. Either they ordered even more hardware, or maybe every developer’s workstation is crawling in its idle time now!

    Anyone else hearing bits in this space?

  14. Can’t believe you said it either, but I’m still laughing. I was referring to font size, and I can barely see it. Don’t laugh, I’m serious. Can you give me an answer? Thanks L
