By: James MacAonghus

James MacAonghus — Tue, 16 Aug 2005 21:39:03 +0000

It’s good to have your analytical posts back again, thank you 🙂

By: Seth Finkelstein

Seth Finkelstein — Tue, 16 Aug 2005 18:32:25 +0000

I’ve dug into some of the study’s data and written a quick initial
blog post pointing out two serious flaws. The methodology used does
indeed have a selection bias, towards both:
1) search-engine spam pages, and 2) large wordlists.

Briefly, searching for random words drawn from a large wordlist
creates a tendency to select *large* *wordlists*, and also gibberish
spam pages which happen to contain those words (probably derived from
the same large wordlists). Moreover, this effect applies (to some
extent) to *every* *search* *sample*. In fact, many of the searches
could be repeatedly selecting the *same wordlist file*, or similar
ones. Since either Google had more large wordlists indexed, or Yahoo
eliminated many of them as useless data, this results in an extremely
misleading conclusion about the relative size of their databases.

In effect, a relatively small number of dubious documents is being
sampled over and over, which is not any sort of comprehensive
examination.
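
A rough way to see the effect is a toy simulation. This is only a sketch under made-up assumptions, not the study's actual procedure: the function name, the corpus sizes, and the idea that a "wordlist page" contains every word in the vocabulary are all hypothetical choices for illustration. It draws query words uniformly from a large wordlist and counts which kind of page matches.

```python
import random


def simulate_random_word_sampling(
    num_queries=200,
    vocab_size=50_000,    # size of the big wordlist the queries come from (assumed)
    common_words=5_000,   # words that ordinary pages actually use (assumed)
    ordinary_pages=1_000,
    words_per_page=100,
    wordlist_pages=5,     # pages that are just a dump of the whole wordlist (assumed)
    seed=0,
):
    """Toy model: how often do wordlist dumps match random-word queries?"""
    rng = random.Random(seed)

    # Ordinary pages only use words from the common part of the vocabulary.
    pages = [
        set(rng.sample(range(common_words), words_per_page))
        for _ in range(ordinary_pages)
    ]

    ordinary_hits = 0
    wordlist_hits = 0
    for _ in range(num_queries):
        # Pick the query word uniformly from the *whole* wordlist.
        word = rng.randrange(vocab_size)
        ordinary_hits += sum(1 for page in pages if word in page)
        # A wordlist dump contains every word, so it matches every query.
        wordlist_hits += wordlist_pages

    total_hits = ordinary_hits + wordlist_hits
    return {
        "wordlist_share_of_corpus": wordlist_pages / (ordinary_pages + wordlist_pages),
        "wordlist_share_of_hits": wordlist_hits / total_hits,
    }


if __name__ == "__main__":
    result = simulate_random_word_sampling()
    print(f"wordlist pages as share of corpus: {result['wordlist_share_of_corpus']:.1%}")
    print(f"wordlist pages as share of hits:   {result['wordlist_share_of_hits']:.1%}")
```

With these invented numbers the wordlist dumps are about half a percent of the corpus, yet they should account for most of the matches, because most random words from a huge list never appear on ordinary pages. The same handful of files gets counted again on every query.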

Comments on: Danny: Screw Size
