John Battelle's Search Blog

Failure Becomes a Manageable Event

That’s the best line from a story posted today on ZDNet UK. It’s spoken by Urs Hölzle, Google Fellow, who is currently on a tour of Europe recruiting engineers. ZD “snuck into” one of his talks to potential recruits and has an extensive overview of what he said. The piece includes metrics on Google’s infrastructure, but to my eye they seem understated (i.e., it mentions a 4 billion document index, when Google now claims 8 billion, and 30 clusters of up to 2,000 computers, when I’ve got sources saying it’s more than twice that). In any case, it’s very interesting reading.

Highlights:

It is one of the largest computing projects on the planet, arguably employing more computers than any other single, fully managed system (we’re not counting distributed computing projects here), some 200 computer science PhDs, and 600 other computer scientists….

Google replicates servers, sets of servers and entire data centres, added Hölzle, and has not had a complete system failure since February 2000. Back then it had a single data centre, and the main switch failed, shutting the search engine down for an hour. Today the company mirrors everything across multiple independent data centres, and the fault tolerance works across sites, “so if we lose a data centre we can continue elsewhere — and it happens more often than you would think. Stuff happens and you have to deal with it.”

A new data centre can be up and running in under three days. “Our data centre now is like an iMac,” said Schulz. “You have two cables, power and data. All you need is a truck to bring the servers in, and the whole burn-in, operating system install and configuration is automated.”…

If the index size doubles, then the embarrassingly parallel nature of the problem means that Google could double the number of machines and get the same response time, so capacity can grow linearly with traffic. “In reality (from a business point of view) we would like to grow less than linear to keep costs down,” said Hölzle, “but luckily the hardware keeps getting cheaper.”
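The arithmetic behind that claim can be sketched with a toy model (my own illustration, not Google's actual serving code): if the index is split into one shard per machine and every machine scans its shard in parallel, doubling both index and fleet leaves per-query latency unchanged.

```python
# Hypothetical back-of-the-envelope model of embarrassingly parallel search:
# the index is split evenly into shards, one shard per machine, and all
# machines scan their shards at the same time. The scan rate is made up.
def response_time(index_size_docs, num_machines, docs_scanned_per_sec=1_000_000):
    """Seconds for one query if every machine scans its shard in parallel."""
    shard_size = index_size_docs / num_machines
    return shard_size / docs_scanned_per_sec

t_before = response_time(4_000_000_000, 2_000)  # 4B-doc index, 2,000 machines
t_after = response_time(8_000_000_000, 4_000)   # both doubled
assert t_before == t_after  # latency is unchanged when both scale together
```

Costs scale with the number of machines, which is why Hölzle says Google would prefer sub-linear growth and counts on hardware getting cheaper.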

So every year as the Web gets bigger and requires more hardware to index, search and return Web pages, hardware gets cheaper so it “more or less evens out” to use Hölzle’s words. …

Google wrote its own spell checker, and maintains that nobody knows as many spelling errors as it does. The amount of computing power available at the company means it can afford to begin teaching the system which words are related – for instance “Imperial”, “College” and “London”. It’s a job that took many CPU years, and which would not have been possible without these thousands of machines. “When you have tons of data and tons of computation you can make things work that don’t work on smaller systems,” said Hölzle. One goal of the company now is to develop a better conceptual understanding of text, to get from the text string to a concept…
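One crude way to learn that words like “Imperial”, “College” and “London” are related is simple co-occurrence counting over a huge corpus – a sketch of the general idea, not Google's actual method, using a tiny made-up corpus:

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for a web-scale corpus (hypothetical documents).
docs = [
    "imperial college london computer science",
    "imperial college london engineering",
    "london weather forecast",
]

# Count how often each word pair appears in the same document; words that
# co-occur frequently are one crude signal of relatedness. At Google's
# scale this counting is what consumes the "many CPU years".
pair_counts = Counter()
for doc in docs:
    words = set(doc.split())
    for a, b in combinations(sorted(words), 2):
        pair_counts[(a, b)] += 1

# ("college", "imperial") co-occurs twice; ("forecast", "london") only once.
assert pair_counts[("college", "imperial")] == 2
assert pair_counts[("forecast", "london")] == 1
```

The point of the quote is that this only starts to work at scale: on a small corpus the counts are too sparse to separate related words from noise.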

Even three years ago, he said, the Web had much more of a grass-roots feeling to it. “We have thought of having a button saying ‘give me less commercial results’,” but the company has so far shied away from implementing it.

Thanks for the tip, KK.
