Scraping Google To See What Happens - John Battelle's Search Blog

Scraping Google To See What Happens

January 11, 2005

Daniel Brandt, Google’s most relentless thorn, has released code which scrapes Google, sans ads. Techdirt covers it here. The Register (also a Google thorn) covers it here. Highlights:

Brandt fully expects Google to throw legal and technical resources at him, but says he welcomes the challenge if only to clarify copyright issues.

Google took people’s free stuff and made a $50 billion business from it, he argues.

“The commercialization of the web became possible only because tens of thousands of noncommercial sites made the web interesting in the first place,” he writes. “All search engines should make a stable, bare-bones, ad-free, easy-to-scrape version of their results available for those who want to set up nonprofit repeaters. Even if it cuts into their ad profits slightly, there’s no easier way to give back some of what they stole from us.”

OK, there are a lot of issues here, and I really must write the book. Really…must…write…aww hell. I’ll say this, in any case: Google hasn’t stolen anything from anyone. Has the company profited from innovation in assembly and the architecture of participation? Hell yes. But that’s OK, after all, those who innovate in assembling data, and those who take the patterns from the aggregate and make sense of them for the individual, well, they deserve the rewards of the marketplace.

But the question of public data as a copyrightable fact is an interesting one. It’s been around the legislative maypole (as noted here) and I don’t have time to get fully smart on it, but it is an interesting dilemma.

Think of the implications for the public domain material in the Google Print/Library project, for instance….

8 thoughts on “Scraping Google To See What Happens”

  1. brian says:

    But, but…well, Google and other search engines, by the strictest definition of copyright law (an affirmative right), could be construed to have cached web pages and images of others without the explicit permission of the owners.

  2. Adam says:

    Mr. Brandt is an idiot, and a rude idiot at that.

    1) Don’t want your sites on the Web spidered by Google? Put up a simple robots.txt file. Voila.

    2) Don’t like Google? Don’t use it. Mr. Brandt’s effort to use Google’s bandwidth, R&D, and processing power while stripping their ads is akin to the selfishness of BugMeNot’s childishly gleeful facilitation of helping people access registration-required news sites while violating the terms of those sites. Same thing: Don’t want to register at the New York Times? Don’t read their articles.

    3) There are many things I don’t like about Google. Their blog is typically fluffy and uninformative, their hiring practices are tiresome and inefficient, and they’ve done a surprisingly poor job of integrating their various services. With that said, though, Google’s done a LOT of stuff right… and most of it quite unselfishly and without being evil. If only the same could be said of Andrew O at The Register (another annoying twit) and Daniel Brandt.

  3. Miles Barr says:

    In the UK in addition to regular copyright law we also have the Database Act:

    This allows (among other things) someone to compile a database of public domain works and have IP rights on that database. This would make what Scroogle does illegal in the UK. Is there something similar in the US?

  4. pb says:

    Looks like we’ll need to go to Microsoft to scrape search results:

    Google used to offer this simple, RESTful API method but ditched it for its current, cumbersome SOAP-based APIs.

  5. Steve Crcker says:

    Perhaps in a legalistic sense Google has not “stolen” anything. But I understand Daniel to be saying, essentially, that Google has profited hugely from the free, volunteer, and idealistic efforts of others, and has given nothing back to those whose efforts have made Google possible. Now this last point is certainly debatable. One could argue that by providing search capability, access is facilitated to many small sites which would otherwise go unnoticed. But wait, it’s not that simple. The growing commercialization of the search industry, led by Google, has created a situation where noise increasingly drowns out information in search results – at least if you are a user searching for information for its own sake – and not for some buying opportunity. Commercialization of the search function virtually ensures that paid results will tend to dominate and crowd out material put on the Web as a volunteer “labor of love” – the very kind of material which caused many of us to gravitate toward the Internet in the first place.

    Just a thought,

    P.S. No relation to the Steve Crocker who helped invent the Internet.

  6. Vitaliy says:

    I like Google and have the toolbar, but please scraping Google. I can’t find spyware.

  7. MetaSearch says:

    How do metasearch engines survive? Don’t they just scrape results?

  8. Judy Salter says:

    Google exists because it scraped everything and everyone without caring if they allow it or not.

    Now they are large and don’t want to be scraped, that just sucks imho.
    But anyway, it is possible to scrape Google. is an open source project that is able to scrape millions of hits without issues. Might be worth adding to the blog.