Google Shares Some Data - John Battelle's Search Blog

Google Shares Some Data

By - August 06, 2006

No, not the kind that might help you predict earnings, but the kind that might help researchers around the world play with massive sets of word phrases and figure out all kinds of new applications based on the core concept of n-grams (don’t ask me, read this). Massive on the order of trillions, that is. On Friday Google’s research blog announced it would be releasing such a trove. From the blog post:

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.
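For anyone wondering what "publishing the counts" of word sequences involves, here is a rough Python sketch of the idea, not Google's actual pipeline. It counts every run of n consecutive words and keeps only the runs that clear a frequency cutoff, the way the quote describes cutoffs of 40 for five-word sequences and 200 for single words. The function names, the toy thresholds, and the choice to swap rare words for an `<UNK>` placeholder are my assumptions; the announcement only says rare words were discarded.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens in the list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_ngram_table(tokens, n, min_word_count, min_ngram_count, unk="<UNK>"):
    """Return {ngram: count} for n-grams seen at least min_ngram_count times.

    Words seen fewer than min_word_count times are replaced by `unk` first.
    That replacement is an assumption made for this sketch.
    """
    word_counts = Counter(tokens)
    vocab = {w for w, c in word_counts.items() if c >= min_word_count}
    mapped = [w if w in vocab else unk for w in tokens]
    counts = count_ngrams(mapped, n)
    return {gram: c for gram, c in counts.items() if c >= min_ngram_count}


# Toy corpus with toy thresholds; the real release used n=5 with cutoffs 200 and 40.
text = "the cat sat on the mat the cat sat on the rug".split()
table = build_ngram_table(text, n=2, min_word_count=2, min_ngram_count=2)
for gram, count in sorted(table.items()):
    print(" ".join(gram), count)
```

At a trillion words the counting would have to be spread across many machines, but the bookkeeping per sequence is the same.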

It’s good to see Google giving something back to the research community, in particular given the thread about this very topic on Searchblog earlier. But I’m going to guess that this will only whet the appetite of folks in pure R&D who’d love to see even more information shared – more complex patterns across data, for example – the very same information, unfortunately for them, that is the basis for competitive differentiation, and is not likely to be shared anytime soon.

Update: Yow. AOL released actual search data from half a million users, according to this post. Wow….


3 thoughts on “Google Shares Some Data”

  1. JG says:

    This is a highly laudable move. Whatever my personal biases against the efficacy of Google’s “20% time only” research approach, the fact that Google is sharing data like this is fantastic. The black hole that is Google Research just became one shade lighter.

    Of additional note is the fact that AOL also released a large dataset to the research public. Not only is AOL giving you the raw data, but they are also providing classification labels for 20,000 hand-labeled queries. That is huge. Hand-labeling is a labor-intensive task. It is wonderful that they have made this data available, as it has a large effect on the types of research one can do with the data.

    What I also found quite interesting about the AOL data is the fact that it includes 20 million web queries from over 500,000 users over the period of three months.

    Wasn’t that exactly the sort of data that was raising such a big privacy furor a few months ago? Remember, with Yahoo and MSN giving in to the U.S. govt, and Google holding tight (only to give in to China a week later.. but that’s another story).

    Well, now anyone can go in, and get this sort of data (pre-anonymized, of course, but isn’t that also what the government was asking for?) from AOL. Without a government warrant. And use it to do research on interesting patterns and trends and such.

    John, you talk about whetting R&D appetites. Google and AOL have just offered us two savory morsels.

  2. JG says:

    Wait a minute here.. a thought just clicked. These ~20M queries from ~500k users that AOL is releasing.. the reports say that the dates on those logs are March 2006 – May 2006. So.. who was providing the search results for AOL in that time? Was it not Google? Did AOL essentially just release a subset of the Google data, the data Google said earlier it would not release?

    Am I reading that correctly? Someone help me out, here. That is very interesting, if so.

  3. or says:

    Just a slight correction. I don’t think Google’s 20% time is their research. As far as I understand, they have a dedicated research team/department. The 20% time is for engineers to develop their own products. But Google does have a research team that works alongside engineers. It is that research team that released the data; it was not a 20% engineer’s work.