A fine piece of Jesus-Not-Again writing from Danny. I’m deep in this as well, as those of you who’ve read my previous posts know. More is coming, but I promise I will be as brief as can be. I’m waiting to talk with a couple more folks. Danny notes that he and Gary will also be posting more later in the week. I agree with Danny that relevance is key, but I think it’s nearly impossible to set a standard for relevance: it’s too subjective. I disagree that size is not important. Once we can figure out how to audit and count size, it’s important, as important as UI, speed, or algorithms. It’s also important in a business sense: it’s a number that folks pay attention to, that marketers know works, and that the mainstream press will parrot. Even if you disagree with the tactics, and I do, it’s still important….
Danny: Screw Size
I’ve written a post, "Flaws in NCSA Yahoo/Google study":
I’ve dug into some of the study’s data, and written an initial
quick blog post to point out two bad flaws. The methodology used does
indeed have a selective bias, towards both:
1) search-engine spam pages, and 2) large wordlists.
Briefly, searching for random words drawn from a large
wordlist creates a tendency to select *large* *wordlists*, and
also gibberish spam pages which happen to contain those words (probably
derived from the same large wordlists). Moreover, this effect applies
(to some extent) to *every* *search* *sample*. In fact, many of the
searches could be repeatedly selecting the *same wordlist file*,
or similar. Since either Google had more large wordlists indexed, or
Yahoo eliminated many of them as useless data, this results in an
extremely misleading conclusion about the relative size of their databases.
In effect, a relatively small number of dubious documents is
being repeatedly sampled, and no comprehensive examination of
either index is taking place.
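To make the bias concrete, here is a rough toy simulation, not the study's actual method or data. Every number in it is an assumption picked for illustration: a 20,000-word query wordlist, ordinary pages that use only a 500-word everyday vocabulary, and 20 wordlist files that each contain half of the full list. Engine A is the smaller index but kept the wordlist files; engine B has twice as many ordinary pages and filtered the wordlists out.

```python
import random

VOCAB_SIZE = 20_000   # assumed size of the large wordlist used to pick queries
COMMON_WORDS = 500    # assumed everyday vocabulary that ordinary pages draw from


def make_pages(rng, vocab, n_pages, words_per_page=50):
    """Ordinary pages: each one uses a handful of common words."""
    common = vocab[:COMMON_WORDS]
    return [set(rng.choices(common, k=words_per_page)) for _ in range(n_pages)]


def make_wordlists(rng, vocab, n_files):
    """Wordlist files (or spam built from them): each holds half the vocabulary."""
    return [set(rng.sample(vocab, len(vocab) // 2)) for _ in range(n_files)]


def count_hits(index, queries):
    """Total (query, document) matches, the quantity a random-word sample sees."""
    return sum(1 for q in queries for doc in index if q in doc)


def simulate(seed=0, n_queries=200):
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(VOCAB_SIZE)]

    # Engine A: smaller index, but it kept 20 wordlist files.
    pages_a = make_pages(rng, vocab, 1000)
    wordlists_a = make_wordlists(rng, vocab, 20)
    # Engine B: twice as many ordinary pages, wordlists filtered out.
    index_b = make_pages(rng, vocab, 2000)

    # Queries are random words from the full wordlist, so most are rare words.
    queries = [rng.choice(vocab) for _ in range(n_queries)]

    hits_wordlists = count_hits(wordlists_a, queries)
    hits_a = count_hits(pages_a, queries) + hits_wordlists
    hits_b = count_hits(index_b, queries)
    return {
        "size_a": len(pages_a) + len(wordlists_a),
        "size_b": len(index_b),
        "hits_a": hits_a,
        "hits_b": hits_b,
        "wordlist_share_a": hits_wordlists / hits_a,
    }


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions, engine A collects more hits than engine B despite being about half the size, and most of A's hits come from the same 20 wordlist files being matched over and over. Comparing hit counts would then rank the two indexes the wrong way round.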
It’s good to have your analytical posts back again, thank you 🙂