I must have been under a rock, because I missed the news that Doug Cutting (of Lucene and Nutch fame) is now at Yahoo and working on supporting Hadoop, which is “a software platform that lets one easily write and run applications that process vast amounts of data.”
Tim covers this well, writing:
…why is Yahoo!’s involvement so important? First, it indicates a kind of competitive tipping point in Web 2.0, where a large company that is a strong #2 in a space (search) realizes that open source is a great competitive weapon against their dominant competitor. It’s very much the same reason why IBM got behind Eclipse, as a way of getting competitive advantage against Sun in the Java market. (If you thought they were doing it out of the goodness of their hearts rather than clear-sighted business logic, think again.) If Yahoo! is realizing that open source is an important part of their competitive strategy, you can be sure that other big Web 2.0 companies will follow.
4 thoughts on “Hadoop”
Maybe you’ve been under a rock for quite a long time 🙂
You may also be interested in projects that are building on top of Hadoop. See my blog post “Hadoop gaining momentum” for references to machine learning projects using MapReduce, SQL-like relational semantics on top of Hadoop, etc.
This isn’t an Open Source story, it’s an infrastructure story. Yahoo’s IT infrastructure is made up of lots of different smaller systems. But Google is increasingly moving everything over to a single GFS file system, with MapReduce to run jobs and BigTable running on top, which can store pretty much any kind of data you can think of.
Robin Harris at Storage Mojo believes that if Yahoo moves over to a Google-like infrastructure, they could save 30-40% of their IT costs, and cut as many as 4,000 jobs.
As you know, Hadoop’s file system (HDFS), HBase and Hadoop MapReduce are the Apache clones of GFS, BigTable and Google’s MapReduce respectively.
I’ve heard that Yahoo is already running a 1,000-node Hadoop cluster. So it makes sense that they believe the Apache suite might be able to do for them what Google’s infrastructure has done for Google: give them a single heterogeneous storage system that runs on hundreds of thousands of cheap commodity servers, on top of which products and services can be built quickly and easily across the entire company.
Now, if they were really smart, they would then take this to the next level and create a next-generation infrastructure that would allow them to leapfrog Google instead of simply catching up to them.
That’s what my company is working on and Yahoo would be stupid not to be doing as well.
Btw, speaking of Nutch, the first two massive WebHarvest.gov collections (terabytes of permanently archived government data) can be keyword searched using Nutch. WebHarvest is a project from the Internet Archive and the National Archives.
We also learned last week that Lucene will be used at Wikia.