John Battelle's Search Blog

Google Shares Some Data

No, not the kind that might help you predict earnings, but the kind that might help researchers around the world play with massive sets of word phrases and figure out all kinds of new applications based on the core concept of n-grams (don’t ask me, read this). Massive on the order of trillions, that is. On Friday Google’s research blog announced it would be releasing such a trove. From the blog post:

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.
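For the curious, here is roughly what "counts for five-word sequences that appear at least 40 times" means in practice. This is a toy Python sketch, not Google's pipeline: the function name `count_ngrams` and the sample sentence are made up for illustration, it uses two-word sequences with a threshold of 2 so the output stays readable, and it skips the rare-word filtering the post mentions.

```python
from collections import Counter


def count_ngrams(tokens, n, min_count=1):
    """Count every run of n consecutive tokens, keeping only
    those that occur at least min_count times."""
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Hypothetical sample text; a real corpus would be vastly larger.
text = "to be or not to be that is the question to be or not to be"
tokens = text.split()

# Two-word sequences (bigrams) seen at least twice.
bigrams = count_ngrams(tokens, n=2, min_count=2)
for gram, count in sorted(bigrams.items(), key=lambda kv: -kv[1]):
    print(" ".join(gram), count)
```

Swap in `n=5` and `min_count=40` and you have the shape of the released data, minus about a trillion words of input.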

It’s good to see Google giving something back to the research community, in particular given the thread about this very topic on Searchblog earlier. But I’m going to guess that this will only whet the appetite of folks in pure R&D who’d love to see even more information shared – more complex patterns across data, for example – the very same information, unfortunately for them, that is the basis for competitive differentiation, and is not likely to be shared anytime soon.

Update: Yow. AOL released actual search data from half a million users, according to this post. Wow….
