I'll Scrape Your Back, But Don't You Scrape Mine... | John Battelle's Search Blog

I'll Scrape Your Back, But Don't You Scrape Mine…

By - April 02, 2004

Google is blocking a small web developer from scraping Google News, which itself is a scrape of a bunch of other sites. The developer admits his scrape breaks Google’s TOS. Rich, how do you feel about this over at Topix? Will you let him scrape you?

Developer’s lament here

(Thanks, Beal.)



11 thoughts on “I'll Scrape Your Back, But Don't You Scrape Mine…”

  1. Topix provides an RSS feed for each and every page they generate, so it looks like they encourage it and make it easier as well.

  2. Funny you should mention Topix. I wrote about Topix.net’s hypocrisy a few weeks back.

  3. Rich Skrenta says:

    I’ll email Julian and see if he’s interested in switching his converter to Topix.net… We have lots of folks using our RSS. Greg Linden at Findory.com asked to use our feeds for his personalized search engine and we were happy to help.

    Hmmm. Since Topix.net, Yahoo News and Google News each crawl unique sources that the others don’t have, a really useful service would pull all three and merge the results. :-)

    Re our TOS, Adrian … ours is pretty standard for any internet site. No different than Google’s, Yahoo’s, etc. Search engines like Google and Yahoo, as well as Topix.net, provide value by aggregating content from many diverse sources using sophisticated technology. They help users get to information which otherwise would be difficult or impossible to find. These services are difficult and expensive to build, and are supported via advertising. It’s a standard internet search model. I’m not sure what your problem with that is.

  4. Mike Masnick says:

    Rich and the folks at Topix have been great about allowing people to use their feeds. We have one on Techdirt, and we’re quite happy with it.

    As for the complaints about the ToS, I think that’s missing the point. Topix’s ToS are basically telling people not to take all of Topix’s work. When they spider other sites, they’re only taking headlines/summaries (clearly within fair use) and posting direct links back to the content. That gives added benefit to those sites they scrape.

    What they don’t want is someone to just take Topix’s work wholesale and reuse it for commercial purposes. That goes beyond fair use.

    There’s a big difference between pulling a simple headline and copying an entire page – and that’s why I don’t think Topix’s ToS are hypocritical at all – though, perhaps the language should be clarified. What they do is pull a little bit (within fair use) and use it to give more traffic to that site. What they don’t want is someone taking all of their work in a way that gives them no benefit at all.

    How is that hypocritical?

  5. It’s hypocritical because their entire business is built on spidering others, but then they disallow people from spidering them. It may make sense from a business point of view but there’s definitely an element of “do what we say, not what we do”.

  6. Rich and Mike:

    Without the content of thousands of news sources, Topix would have no reason for being. Topix gets its content by scraping those sites.

    You cannot deny that.

    So it’s particularly cheeky for Topix to disallow scraping of its own site. I’m failing to see how that is *not* hypocritical.

  7. Tim says:

    I used a version of Julian’s script to provide a SARS news feed from Google on my non-profit site SARS Watch Org last year, after getting verbal assurances from someone at Google that they wouldn’t care about a small non-profit site. I also created a SARS news feed using Moreover.com’s service. The instructions for doing so are buried deep in Moreover’s website, but they expressly allow it, for non-profits. But I got the best results using NewsisFree to get RSS feeds from newspapers and specialized medical and science journals I selected, doing a search of SARS on those, then displaying the results.

    I recently put a Topix feed on my local Berkeley news page of my personal site, and have been quite pleased with the results.

    It makes sense that if you are going to be making money from someone else’s content, you should have to pay, but it seems like there are a lot of good alternatives out there for non-profit sites. It does seem like Google should allow the use of headlines generated from their site for non-profits. But as long as they aren’t selling ads on their news pages, I don’t see hypocrisy.

    P.S. A way to get headlines from Google on your site in a way that might not violate the TOS would be to subscribe to Google News alerts, feed the mail to an email-to-RSS service like Mailbucket, and display the resulting RSS file on your site.

    Maybe – people tell me I have a sneaky mind :-)

  8. Rich Skrenta says:

    Adrian:

    You’re alleging that the news aggregation (and by extension, the search engine) business model is simply “spidering” content.

    That’s not true.

    Yes, it takes content to build a search engine, or a news aggregator. But, as we all discovered with search engines that didn’t work very well in the 90’s, providing a quality navigational experience is something users want, and is accepted to be intellectual property in its own right. Our business model supports the development of our service through advertising.

    A couple of things to point out:

    1) Our use of published content is fair use. We always link directly to the story, and our summary is a single sentence from the article. (In fact, if the sentence is too long, we truncate it.) We’ve had requests to link to the print-only versions of stories like Drudge does; we won’t do that. We want publishers to be happy with the traffic we send them.

    2) We have an opt-out policy for our crawl. If a publisher doesn’t want to be included in our crawl, we take them out. So far we’ve had one opt-out request. We’ve had hundreds of requests to be included — from sources including newspapers, TV stations, and magazines.

    Building businesses that drive traffic to content owners is a win/win for users, content owners and aggregators. And as we wouldn’t reprint an entire article from a content source, we don’t allow people to take our aggregated, value-added content, and resell it, without talking to us first.

    It’s not analogous, nor hypocritical for us to make money off of the value we have created.

  9. Hi again, Rich,

    Thanks for taking the time to respond. I agree 100 percent with everything you’ve written. I think Topix.net is brilliant technically and useful content-wise. (I make a point of visiting the site every so often — particularly the page of Beatles news, which I find very useful.)

    You do indeed add value to the content, and you do indeed have a right to make money off of that added value. I have not disputed that.

    What I *am* pointing out, and what you didn’t address in your comment, is your Terms of Service, which is the part I find hypocritical. Particularly the part that says users may not “use any robot, spider, other automated device, or manual process to monitor or copy any content from the Service.” Your service is fundamentally based on scraping other sites, yet you disallow scraping from your own.

    I hope I’ve made myself more clear.

  10. Rich Skrenta says:

    Ah, I see what you mean now… Mike Masnick from TechDirt has suggested that we look at switching to a Creative Commons license, which might be a good idea for us.

    Good point, we’ll look into this.

  11. Cool. I’m glad my point finally came across. Thanks for listening and responding.