John Battelle's Search Blog Failure Becomes a Manageable Event

That’s the best line from a story posted today on ZDNet UK. It’s spoken by Urs Hölzle, Google Fellow, who is currently on a tour of Europe recruiting engineers. ZD “snuck into” one of his talks to potential recruits and has an extensive overview of what he said. The piece includes metrics on Google’s infrastructure, but to my eye they seem understated (i.e., it mentions a 4 billion document index, when Google now claims 8 billion, and 30 clusters of up to 2000 computers, when I’ve got sources saying it’s more than twice that). In any case, it’s very interesting reading.

Highlights:

It is one of the largest computing projects on the planet, arguably employing more computers than any other single, fully managed system (we’re not counting distributed computing projects here), some 200 computer science PhDs, and 600 other computer scientists….

Google replicates servers, sets of servers and entire data centres, added Hölzle, and has not had a complete system failure since February 2000. Back then it had a single data centre, and the main switch failed, shutting the search engine down for an hour. Today the company mirrors everything across multiple independent data centres, and the fault tolerance works across sites, “so if we lose a data centre we can continue elsewhere — and it happens more often than you would think. Stuff happens and you have to deal with it.”

A new data centre can be up and running in under three days. “Our data centre now is like an iMac,” said Schulz. “You have two cables, power and data. All you need is a truck to bring the servers in and the whole burning in, operating system install and configuration is automated.”…

If the index size doubles, then the embarrassingly parallel nature of the problem means that Google could double the number of machines and get the same response time so it can grow linearly with traffic. “In reality (from a business point of view) we would like to grow less than linear to keep costs down,” said Hölzle, “but luckily the hardware keeps getting cheaper.”

So every year as the Web gets bigger and requires more hardware to index, search and return Web pages, hardware gets cheaper so it “more or less evens out” to use Hölzle’s words. …
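To make the scaling argument concrete, here is a minimal sketch, assuming the index is split evenly across machines and every machine searches its own slice in parallel. The 4 billion and 8 billion document figures come from the post. The machine counts, the per-machine search speed, and the function names are made up for illustration, and the sketch ignores the cost of merging results.

```python
import math

def docs_per_machine(index_docs: int, machines: int) -> int:
    """Documents each machine searches if the index is split evenly."""
    return math.ceil(index_docs / machines)

def query_seconds(index_docs: int, machines: int,
                  docs_per_second: int = 1_000_000) -> float:
    """Rough query time: machines work in parallel, so the time is
    governed by the size of one slice, not the whole index.
    docs_per_second is a hypothetical per-machine search speed."""
    return docs_per_machine(index_docs, machines) / docs_per_second

# Hypothetical fleet sizes; only the index sizes appear in the post.
before = query_seconds(4_000_000_000, machines=1_000)
after = query_seconds(8_000_000_000, machines=2_000)
same_fleet = query_seconds(8_000_000_000, machines=1_000)

print(before, after, same_fleet)
```

Doubling the index and the fleet together leaves each machine's slice, and so the estimated response time, unchanged. Doubling the index alone doubles it.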

Google wrote its own spell checker, and maintains that nobody knows as many spelling errors as it does. The amount of computing power available at the company means it can afford to begin teaching the system which words are related – for instance “Imperial”, “College” and “London”. It’s a job that takes many CPU years, and which would not have been possible without these thousands of machines. “When you have tons of data and tons of computation you can make things work that don’t work on smaller systems,” said Hölzle. One goal of the company now is to develop a better conceptual understanding of text, to get from the text string to a concept…..

Even three years ago, he said, the Web had much more of a grass roots feeling to it. “We have thought of having a button saying ‘give me less commercial results’,” but the company has shied away from implementing this yet.

Thanks for the tip, KK.

One thought on “Failure Becomes a Manageable Event”

Most developers understand the overwhelming joy of their first application that actually does something useful. Likewise, developers who have had large success (measured by the number of happy users of their application) understand the incredible joy & responsibility of having to build a cluster of machines to support the application’s user base.

However, when you look at a rack of servers sitting in a cold hosting facility in the middle of a huge datacenter, there is no joy. Unless that data center is Google’s. Google inspires the techie within all of us with its awesome power. I think that’s the power of the brand Sergey and Larry have built. I feel as though I have a personal relationship with Google… not the company but with Google, the application(s). I so respect the way they’ve applied their knowledge for good and the way they’ve been able to keep the company as personal as a mid-size company with a multi-billion dollar market cap can be.

When I think about Microsoft, I don’t have that same connection or feeling of intimacy. After all, Windows was snagged out from under Apple and there’s no joy in that. Not that the employees of Microsoft today had anything to do with that but it’s a fact that will forever remain.

I for one am thankful for the power of Google and their ability to pump out incredibly powerful apps as if they were a 3-man development team. That kind of branding is going to be around for a long time.
