Yahoo is on the ropes, but this project has been showing promise for a long time. Check out the news today:
Yahoo! recently launched what we believe is the world's largest Apache Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.
The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.
Some Webmap size data:
* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
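To make the numbers above concrete, here is a minimal sketch in plain Python (not actual Hadoop code; the page names are made up) of the kind of Map-Reduce computation behind a statistic like "links between pages": a map phase emits a count for every link it sees, and a reduce phase sums the counts per target page.

```python
# Hypothetical sketch (plain Python, not the Hadoop API) of a
# Map-Reduce job that tallies inbound links per page -- the kind of
# statistic the Webmap computes at trillion-link scale.
from collections import defaultdict

def map_phase(pages):
    """Map: for each (page, outlinks) record, emit (target, 1) pairs."""
    for page, outlinks in pages:
        for target in outlinks:
            yield target, 1

def reduce_phase(pairs):
    """Reduce: sum the emitted counts for each target page."""
    counts = defaultdict(int)
    for target, count in pairs:
        counts[target] += count
    return dict(counts)

# Toy crawl: page -> list of pages it links to (illustrative data)
crawl = [
    ("a.com", ["b.com", "c.com"]),
    ("b.com", ["c.com"]),
    ("c.com", ["a.com", "b.com"]),
]
inbound = reduce_phase(map_phase(crawl))
print(inbound)  # {'b.com': 2, 'c.com': 2, 'a.com': 1}
```

The point of the split is that the map phase can run on thousands of machines at once, each over its own slice of the crawl, while the reduce phase only has to merge the per-slice results.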
3 thoughts on “Meanwhile, Hadoop News”
What does this mean to the web searcher? I have seen several blogs write about this topic, but none has explained why this is important. Could you?
To a web searcher, it doesn't mean much other than that the Yahoo index is a little bigger and fresher. The important piece is that Hadoop is now proven stable enough to run at this scale in production. If you have a program whose data doesn't fit on a single computer, you can use Hadoop to scale it out to thousands of computers. For free.
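A rough illustration of the "scale it out" point in plain Python (not the Hadoop API, and the word-count task and inputs are made up): once a program is expressed as a map step and a reduce step, the same two functions work unchanged whether the input is processed on one machine or split across many workers.

```python
# Illustrative sketch of the Map-Reduce scaling idea: the map and
# reduce functions are written once; only the partitioning of the
# input changes as you add workers.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_words(lines):
    """Map step: count words within one partition of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Reduce step: merge the per-partition counts into one result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

lines = ["hadoop scales out", "hadoop runs map reduce", "map reduce scales"]
# Here each "worker" thread handles one partition; a framework like
# Hadoop does the same thing across thousands of machines.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = pool.map(map_words, [[line] for line in lines])
    totals = reduce_counts(partials)
print(totals["hadoop"], totals["map"])  # 2 2
```

The design choice is that neither function knows how many workers exist, which is exactly what lets a cluster framework move the work around freely.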