<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Duping search engines, even the big-G</title>
	<atom:link href="http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php/feed" rel="self" type="application/rss+xml" />
	<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=duping_search_engines_even_the_big-g</link>
	<description>Thoughts on the intersection of search, media, technology, and more.</description>
	<lastBuildDate>Tue, 21 May 2013 16:33:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.1</generator>
	<item>
		<title>By: Mike Levin</title>
		<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14799</link>
		<dc:creator>Mike Levin</dc:creator>
		<pubDate>Fri, 23 Jun 2006 14:53:57 +0000</pubDate>
		<guid isPermaLink="false">http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14799</guid>
		<description>&lt;p&gt;If you&#039;re looking for a way to legitimately target the long tail of search without these automated black hat techniques, check out &lt;a href=&quot;http://www.hittail.com&quot; rel=&quot;nofollow&quot;&gt;HitTail&lt;/a&gt;. It&#039;s the first site designed to help you zero in on the best writing topics for genuine non-automated targeting. It could be even more effective in the long-run, because everyone can do it, and it won&#039;t get you banned.&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>If you&#8217;re looking for a way to legitimately target the long tail of search without these automated black hat techniques, check out <a href="http://www.hittail.com" rel="nofollow">HitTail</a>. It&#8217;s the first site designed to help you zero in on the best writing topics for genuine non-automated targeting. It could be even more effective in the long-run, because everyone can do it, and it won&#8217;t get you banned.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: anon</title>
		<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14798</link>
		<dc:creator>anon</dc:creator>
		<pubDate>Fri, 23 Jun 2006 12:28:53 +0000</pubDate>
		<guid isPermaLink="false">http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14798</guid>
		<description>&lt;p&gt;That&#039;ll be why nobody with half a brain lets their AdWords ads show on AdSense.&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>That&#8217;ll be why nobody with half a brain lets their AdWords ads show on AdSense.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John Bokma</title>
		<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14797</link>
		<dc:creator>John Bokma</dc:creator>
		<pubDate>Thu, 22 Jun 2006 22:10:26 +0000</pubDate>
		<guid isPermaLink="false">http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14797</guid>
		<description>&lt;p&gt;My site (http://johnbokma.com/ ) has about 1,000 pages, but Google reported ca. 10,000 pages while the spam attack was happening. I wonder if the huge number of pages caused the site: operator to report results that were quite far off.&lt;/p&gt;

&lt;p&gt;It seems to have been fixed now (site: now reports 997).&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>My site (<a href="http://johnbokma.com/" rel="nofollow">http://johnbokma.com/</a> ) has about 1,000 pages, but Google reported ca. 10,000 pages while the spam attack was happening. I wonder if the huge number of pages caused the site: operator to report results that were quite far off.</p>
<p>It seems to have been fixed now (site: now reports 997).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: MikeM</title>
		<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14796</link>
		<dc:creator>MikeM</dc:creator>
		<pubDate>Thu, 22 Jun 2006 13:56:46 +0000</pubDate>
		<guid isPermaLink="false">http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14796</guid>
		<description>&lt;p&gt;Skip E.  Internet users will not revolt against Google.  It&#039;s kind of like Starbucks.  Everyone drinks those heart-clogging blended creamy drinks, so they can&#039;t be bad, right?  Starbucks couldn&#039;t care less if your cholesterol hits the roof.  Everyone Googles, so it must be good.  People are easy to fool. I would guess the average searcher who stumbles on one of the BadSense sites believes it is a valid website and clicks the links thinking it will serve their needs.&lt;br /&gt;
I think Google follows the P.T. Barnum principle flawlessly.  If it weren&#039;t for these pesky search bloggers....&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Skip E.  Internet users will not revolt against Google.  It&#8217;s kind of like Starbucks.  Everyone drinks those heart-clogging blended creamy drinks, so they can&#8217;t be bad, right?  Starbucks couldn&#8217;t care less if your cholesterol hits the roof.  Everyone Googles, so it must be good.  People are easy to fool. I would guess the average searcher who stumbles on one of the BadSense sites believes it is a valid website and clicks the links thinking it will serve their needs.<br />
I think Google follows the P.T. Barnum principle flawlessly.  If it weren&#8217;t for these pesky search bloggers&#8230;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: M C</title>
		<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14795</link>
		<dc:creator>M C</dc:creator>
		<pubDate>Thu, 22 Jun 2006 12:35:16 +0000</pubDate>
		<guid isPermaLink="false">http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14795</guid>
		<description>&lt;p&gt;A Moldovan blackhat successfully indexed and gained rank (since dropped due to the maelstrom of publicity) for over 5 billion junk pages (example) in just three weeks---duping Google, along with Yahoo and MSN. &lt;/p&gt;

&lt;p&gt;I am surprised that the site is still up and running after this. &lt;/p&gt;

&lt;p&gt;I agree with Keith, WHY IS THE WEBSITE STILL UP???&lt;br /&gt;
This has to stop, and severe punishment needs to happen.&lt;br /&gt;
Beheading? No, that is out.&lt;br /&gt;
Boil in water? No, too hot.&lt;br /&gt;
Shut down the website? No, let&#039;s not do that, too painful.&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>A Moldovan blackhat successfully indexed and gained rank (since dropped due to the maelstrom of publicity) for over 5 billion junk pages (example) in just three weeks&#8212;duping Google, along with Yahoo and MSN. </p>
<p>I am surprised that the site is still up and running after this. </p>
<p>I agree with Keith, WHY IS THE WEBSITE STILL UP???<br />
This has to stop, and severe punishment needs to happen.<br />
Beheading? No, that is out.<br />
Boil in water? No, too hot.<br />
Shut down the website? No, let&#8217;s not do that, too painful.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Skip E</title>
		<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14794</link>
		<dc:creator>Skip E</dc:creator>
		<pubDate>Thu, 22 Jun 2006 12:29:56 +0000</pubDate>
		<guid isPermaLink="false">http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14794</guid>
		<description>&lt;p&gt;When will Internet users revolt against the spam sites corrupting Google&#039;s indexes?&lt;/p&gt;

&lt;p&gt;It is hard enough trying to use the net for business.&lt;br /&gt;
&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>When will Internet users revolt against the spam sites corrupting Google&#8217;s indexes?</p>
<p>It is hard enough trying to use the net for business.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ellis</title>
		<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14793</link>
		<dc:creator>Ellis</dc:creator>
		<pubDate>Thu, 22 Jun 2006 12:24:12 +0000</pubDate>
		<guid isPermaLink="false">http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14793</guid>
		<description>&lt;p&gt;Sounds like another big business trying to keep its stock at an inflated price. It will come back on them.&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Sounds like another big business trying to keep its stock at an inflated price. It will come back on them.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anonymous</title>
		<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14792</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Wed, 21 Jun 2006 23:04:23 +0000</pubDate>
		<guid isPermaLink="false">http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14792</guid>
		<description>&lt;p&gt;Sounds exactly like what these asshats are probably doing...&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.morethantraffic.com/&quot; rel=&quot;nofollow&quot;&gt;http://www.morethantraffic.com/&lt;/a&gt;&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Sounds exactly like what these asshats are probably doing&#8230;</p>
<p><a href="http://www.morethantraffic.com/" rel="nofollow">http://www.morethantraffic.com/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Otis Gospodnetic</title>
		<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14791</link>
		<dc:creator>Otis Gospodnetic</dc:creator>
		<pubDate>Wed, 21 Jun 2006 22:20:48 +0000</pubDate>
		<guid isPermaLink="false">http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14791</guid>
		<description>&lt;p&gt;This is bread and butter for Google, at least short term.  Why would they want to get rid of their revenue?  They got rid of it only when the public noticed and started writing about it.  Think about it.&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>This is bread and butter for Google, at least short term.  Why would they want to get rid of their revenue?  They got rid of it only when the public noticed and started writing about it.  Think about it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: JG</title>
		<link>http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14790</link>
		<dc:creator>JG</dc:creator>
		<pubDate>Wed, 21 Jun 2006 16:17:04 +0000</pubDate>
		<guid isPermaLink="false">http://battellemedia.com/archives/2006/06/duping_search_engines_even_the_big-g.php#comment-14790</guid>
		<description>&lt;p&gt;Adam, you write: &lt;i&gt;Compounding the issue, our result count estimates in these contexts was MANY orders of magnitude off.&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;So what gives with that?  Do you remember Robert Scoble&#039;s &quot;brrreeeport&quot; test a few months ago?  He made up a word that didn&#039;t previously exist in Google&#039;s index.  Then he asked his blog readers to put the word in an entry on their blogs.  He wanted to see (1) how quickly it got indexed, and (2) how extensively.  &lt;/p&gt;

&lt;p&gt;Well, what he found is that, after only a few days, when you searched Google using &quot;brrreeeport&quot; as your query, not only did his blog show up first, but Google&#039;s results said &quot;showing results 1-10 of about 10,000&quot;.  A day or two later it was up to 100,000.  A week or two after that it was up to 180,000.  &lt;/p&gt;

&lt;p&gt;I was curious, though, about whether this number was true.  So I manually clicked page after page of results: 1-10, 11-20, 21-30, etc.  I finally got to 671-680, and at result #683 in the middle of the page, there were no more results.  The blurb up top still read &quot;results 671-680 of about 180,000&quot;.&lt;/p&gt;

&lt;p&gt;Google cuts off at 1,000, correct?  So showing me 683 results means there really are 683 results.  Otherwise Google would have shown me result #684, too.&lt;/p&gt;

&lt;p&gt;So why did it say there were &quot;about 180,000&quot; web pages with that word?!&lt;/p&gt;

&lt;p&gt;How can you be so off in your estimates?  It is one thing to say &quot;results 671-680 of about 700&quot;, or even &quot;of about 1000&quot;.  But of &quot;about 180,000&quot;?  &lt;/p&gt;

&lt;p&gt;Why do you even need to estimate?  Don&#039;t you use inverted lists?  Can&#039;t you store that number as a simple &quot;long&quot; value at the beginning of the list?  &lt;/p&gt;

&lt;p&gt;Mind you, this all happened 3+ months ago.  So if it has something to do with a broken index, it has been broken for quite some time now.  And if I use this example as a rough statistic, I would say that your estimates are 263 times too large.  I.e., you said there were 180,000 pages, and there were only 683 pages.  That is a factor of 263 times larger than actually existed.  &lt;/p&gt;

&lt;p&gt;So instead of there being 5.5 billion spam pages from this person in your index, there are probably more like 5.5bil/263 = &quot;about&quot; 20.9 million.&lt;/p&gt;

&lt;p&gt;That is still huge.  &lt;/p&gt;

&lt;p&gt;And again, why do you even estimate in the first place?  How hard is it to just look up the length of an inverted list, for a single term query?&lt;/p&gt;</description>
		<content:encoded><![CDATA[<p>Adam, you write: <i>Compounding the issue, our result count estimates in these contexts was MANY orders of magnitude off.</i></p>
<p>So what gives with that?  Do you remember Robert Scoble&#8217;s &#8220;brrreeeport&#8221; test a few months ago?  He made up a word that didn&#8217;t previously exist in Google&#8217;s index.  Then he asked his blog readers to put the word in an entry on their blogs.  He wanted to see (1) how quickly it got indexed, and (2) how extensively.  </p>
<p>Well, what he found is that, after only a few days, when you searched Google using &#8220;brrreeeport&#8221; as your query, not only did his blog show up first, but Google&#8217;s results said &#8220;showing results 1-10 of about 10,000&#8221;.  A day or two later it was up to 100,000.  A week or two after that it was up to 180,000.  </p>
<p>I was curious, though, about whether this number was true.  So I manually clicked page after page of results: 1-10, 11-20, 21-30, etc.  I finally got to 671-680, and at result #683 in the middle of the page, there were no more results.  The blurb up top still read &#8220;results 671-680 of about 180,000&#8221;.</p>
<p>Google cuts off at 1,000, correct?  So showing me 683 results means there really are 683 results.  Otherwise Google would have shown me result #684, too.</p>
<p>So why did it say there were &#8220;about 180,000&#8221; web pages with that word?!</p>
<p>How can you be so off in your estimates?  It is one thing to say &#8220;results 671-680 of about 700&#8221;, or even &#8220;of about 1000&#8221;.  But of &#8220;about 180,000&#8221;?  </p>
<p>Why do you even need to estimate?  Don&#8217;t you use inverted lists?  Can&#8217;t you store that number as a simple &#8220;long&#8221; value at the beginning of the list?  </p>
<p>Mind you, this all happened 3+ months ago.  So if it has something to do with a broken index, it has been broken for quite some time now.  And if I use this example as a rough statistic, I would say that your estimates are 263 times too large.  I.e., you said there were 180,000 pages, and there were only 683 pages.  That is a factor of 263 times larger than actually existed.  </p>
<p>So instead of there being 5.5 billion spam pages from this person in your index, there are probably more like 5.5bil/263 = &#8220;about&#8221; 20.9 million.</p>
<p>That is still huge.  </p>
<p>And again, why do you even estimate in the first place?  How hard is it to just look up the length of an inverted list, for a single term query?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
