Duping search engines, even the big-G

A Moldovian blackhat successfully indexed and gained rank (since dropped due to the maelstrom of publicity) for over 5 billion junk pages (example) in just three weeks—duping Google, along with Yahoo and MSN. The junk pages are also covered in AdSense ads, leading Email Battles to speculate that they…

Battelle adds that “5 billion pages is the entire size of the Google index just a year or so ago. The last claim, before they stopped MAKING claims, was 8 billion…think about that.”

While junk results are frequently a problem in Yahoo and MSN, the news here is that Google indexed more of the low quality sites faster. While the attention is warranted, to be fair, a concluding judgement should note that this is also a function of Google generally indexing more pages, faster, as Ana’s Lair writes. See the original, weekend post from Monetize, which kindly provides a how-to guide for future blackhat reference.

(via Melanie)

21 thoughts on “Duping search engines, even the big-G”

MikeM says:

June 20, 2006 at 7:24 pm

How could Google AdSense people not notice all the activity being generated from this particular geographic area. Even if it was not coming from one exact spot someone should have caught it and flagged it as pure junk.
1 to 5 billion pages even over a few weeks time could potentially alter the earnings of Google in Q2. Maybe they need to address this and talk to Wall Street. This is pretty big news.

Reply
SEARCH ENGINES Web says:

June 20, 2006 at 7:29 pm

This presents an opportunity for a vibrant debate.

One extremely important point to note is that it matters less how many pages were indexed – than if ANY of them CAME UP ON THE SERPS…

In other words – storage is so cheap – that the pages will just lie there as useless – unused – bytes – as so-o many Web pages do.

However, if they came up on any SERPs – Therein lies a potential ethical problem.

Especially if the pages were in the TOP 30 SERPs where most people search, because, in theory, they would be taking the place of “valid” web pages.

However, if any of those controversial pages WERE in FACT including any valuable valid information, then the motives of their creators and their tactics would in theory have be LESS HARMFUL.

This is an enourmous conflict that search quality Engineers have to resolve. Do they take the easy way out…. and just NIX EVERYTHING??????

That may be what Google had to do, because it would have been too resource-demanding to analyze all of the billions of pages.

And of course, as sophisticated as Google Algos are they were NOT able to auto-detect something wrong.

BTW:
Look at the amount of pages indexed in Google using This Technique….

digg.com/technology/Google:_25_TRILLION_Pages_Indexed_

Reply
Webmetricsguru says:

June 20, 2006 at 8:51 pm

Hi John, I covered this over the weekend with a post http://www.webmetricsguru.com/2006/06/get_billions_of_pages_indexed.html

Google quickly got rid of the spam domains …. they only existed for a couple of weeks – and when I posted over the weekend it was already being deleated.

Reply
JG says:

June 20, 2006 at 10:18 pm

SE Web wrote: One extremely important point to note is that it matters less how many pages were indexed – than if ANY of them CAME UP ON THE SERPS…

Um, no. That is a bit of a simplistic view. Whether or not any of them came up in the SERPs, they ALL still have the potential to affect (push up or down) other pages in the index. Think about how all those new links can screw with PageRank, for example.

I find it incredibly naive (no offense intended to you, personally, SE Web) that so many people think that just because a page isn’t listed in the top 10 or 50 results it is just dead bits on a disk. ALL the pages in an index exert subtle influence on each other, whether through PageRank, or even through the value of a simple relevance feature such as IDF (inverse document frequency).

For example: Previously, a term which was a very good keyword for your website might have had a very high IDF, which means your page comes up ranked highly when someone querys using that term. Now, the addition of 5 billion web pages might have added millions of instances of that term to Google’s big index, which would have the effect of lowering the IDF for that term ON YOUR PAGE. That term will no longer be very useful for helping people find your page.

I boggle that more people don’t see this.

Reply
Todd Henley says:

June 20, 2006 at 10:22 pm

I was absolutely amazed when I first heard about this. This seems to be getting press because it is such a huge amount of pages, but this may open some eyeballs to all the countless smaller spam sites running adsense and what-not?

Reply
Adam Lasnik says:

June 20, 2006 at 10:44 pm

I’ve long been a lurker / occasional commenter for quite some time here, and I figured I might as well offer a few clarifications on the “5 billion” issue :-).

I work with Matt Cutts and other engineers in the Search Quality Team at Google. And yes, we noticed that lots of subdomains got indexed last week — and sometimes listed in search results — that shouldn’t have been. Compounding the issue, our result count estimates in these contexts was MANY orders of magnitude off. For example, the one site that supposedly had 5.5 billion pages in the index actually had under 1/100,000th of that.

So how did this happen? We pushed some corrupted data with our index. Once we diagnosed the problem, we started rolling the data back and pushed something better… and we’ve been putting in place checks so that this kind of thing doesn’t happen again.

Reply
CPCcurmudgeon says:

June 20, 2006 at 11:47 pm

There has been lots of discussion of spam sites corrupting Google’s
indexes on WebmasterWorld for the past several months.

Reply
numlock says:

June 21, 2006 at 12:39 am

Adam,

Interesting stuff..

Is the subdomain issue a new thing or has it always been possible to trick the index that way?

Or rather, what does google do now to prevent subdomains from polluting the index?

Reply
Eric Giguere says:

June 21, 2006 at 2:48 am

Google’s always treated a subdomain as a separate site from its parent domain. (Which makes perfect sense.) Given that Google indexes the root page of any new site it encounters very quickly (but not the other pages), it was just a matter of time before someone tried to exploit this by converting all the pages of a site as separate subdomains. Really, I’m surprised this hasn’t happened before, though it sounds like it just slipped past some of Google’s defenses due to the changeover to Big Daddy.

Reply
Kate says:

June 21, 2006 at 6:45 am

Matt Cutts is going to come back from vacation to all sorts of stuff. Thanks for the comment Adam.

Reply
Keith Cash says:

June 21, 2006 at 8:35 am

I am surprised that the site is still up and running after this. There should be safe guards for this kind of problem in the future. Google lets this happen, but a new legit website is put in a sand box for a year.

Could this be seen as a criminal act??? It Should.

To get this billion plus pages, the keebler elves must have put in some overtime.

Reply
JG says:

June 21, 2006 at 9:17 am

Adam, you write: Compounding the issue, our result count estimates in these contexts was MANY orders of magnitude off.

So what gives with that? Do you remember Robert Scoble’s “brrreeeport” test a few months ago? He made up a word that didn’t previous exist in Google’s index. Then he asked his blog readers to put the word in an entry on their blog. He wanted to see (1) how quickly it got indexed, and (2) how extensively.

Well, what he found is that, after only a few days, when you searched Google using “brrreeeport” as your query, not only did his blog show up first, but Google’s results said “showing results 1-10 of about 10,000”. A day or two later it was up to 100,000. A week or two after that it was up to 180,000.

I was curious, though, about whether this number was true. So I manually clicked page after page of results.. 1-10, 11-20, 21-30, etc. I finally got to 671-680, and at result #683 in the middle of the page, there were no more results. The blurb up top still read “results 671-680 of about 180,000”.

Google cuts off at 1,000, correct? So showing me 683 results means there really are 683 results. Otherwise Google would have shown me result #684, too.

So why did it say there were “about 180,000” web pages with that word?!

How can you be so off in your estimates? It is one thing to say “results 671-680 of about 700”, or even “of about 1000”. But of “about 180,000”?

Why do you even need to estimate? Don’t you use inverted lists? Can’t you store that number as a simple “long” value at the beginning of the list?

Mind you, this all happened 3+ months ago. So if it has something to do with a broken index, it has been broken for quite some time now. And if I use this example as a rough statistic, I would say that your estimates are 263 times too large. I.e., you said there were 180,000 pages, and there were only 673 pages. That is a factor of 263 times larger than actually existed.

So instead of there being 5.5 billion spam pages from this person in your index, there are probably more like 5.5bil/263 = “about” 20.9 million.

That is still huge.

And again, why do you even estimate in the first place? How hard is it to just look up the length of an inverted list, for a single term query?

Reply
Otis Gospodnetic says:

June 21, 2006 at 3:20 pm

This is bread and butter for Google, at least short term. Why would they want to get rid of their revenue? They got rid of it only when the public noticed and started writing about it. Think about it.

Reply
Anonymous says:

June 21, 2006 at 4:04 pm

Sounds exactly like what these asshats are probably doing…

http://www.morethantraffic.com/

Reply
Ellis says:

June 22, 2006 at 5:24 am

Sounds like another big business trying to keep their stock at inflated price. It will come back on them.

Reply
Skip E says:

June 22, 2006 at 5:29 am

When do the internet users revolt on/about the spam sites corrupting Google’s indexes

It is hard enough trying to use the net for business

Reply
M C says:

June 22, 2006 at 5:35 am

A Moldovian blackhat successfully indexed and gained rank (since dropped due to the maelstrom of publicity) for over 5 billion junk pages (example) in just three weeks—duping Google, along with Yahoo and MSN.

I am surprised that the site is still up and running after this.

I agree with Keith, WHY IS THE WEBSITE STILL UP???
This has to stop and sever punishment needs to happen.
be-heading not that is out.
boil in water, no to hot.
shutdown website, no lets not do that to painful

Reply
MikeM says:

June 22, 2006 at 6:56 am

Skip E. Internet users will not revolt on Google. It’s kind of like Starbuck’s. Everyone drinks those heart clogging blended creamy drinks so they can’t be bad right? Starbuck’s could care less if your colesterol hits the roof. Everyone Google’s so it must be good. People are easy to fool. I would guess the average searcher who stumbles on one of the BadSense sites believes it is a valid website and clicks the links thinking it will serve their needs.
I think Google follows the P.T. Barnum principle flawlessly. If it weren’t for these pesky search bloggers….

Reply
John Bokma says:

June 22, 2006 at 3:10 pm

My site (http://johnbokma.com/ ) has about 1,000 pages, and Google reports ca. 10,000 pages when the spam attack was happening. I wonder if the huge amount of pages caused the site: operator to report results that were quite off.

It seems to have been fixed now (site: reports now 997).

Reply
anon says:

June 23, 2006 at 5:28 am

That’ll be why nobody be half a brain lets their adwords show on adsense.

Reply
Mike Levin says:

June 23, 2006 at 7:53 am

If you’re looking for a way to legitimately target the long tail of search without these automated black hat techniques, check out HitTail. It’s the first site designed to help you zero in on the best writing topics for genuine non-automated targeting. It could be even more effective in the long-run, because everyone can do it, and it won’t get you banned.

Reply

Share this:

21 thoughts on “Duping search engines, even the big-G”

Leave a Reply to numlock Cancel reply