Author Archive for Joachim Aertebjerg

How many 9s of reliability do you need?

While the scale-up vs. scale-out discussions continue at large datacenters around the globe, the question of reliability often creeps in. Most people involved with server infrastructure agree that reliability is important. The question becomes “how much reliability is adequate?” Servers such as HP Integrity NonStop, with triple redundancy and a special operating system, compete with mainframes for the title of most reliable server solution available. 99.99999% uptime used to be the benchmark for these types of solutions. At the other end of the scale, one can find hardware vendors offering servers built with desktop components and open source operating systems, running the IT backbone in small and medium-sized businesses; obviously with a lot less reliability built in. Between these extremes is where the various clusters of smaller nodes and SMP systems reside and, in my opinion, where a lot of innovation is happening.
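To put the number of 9s in perspective, it helps to translate an uptime percentage into the downtime it allows per year. The short Python sketch below does that arithmetic. The function name and the list of example percentages are illustrative choices of mine, and the calculation assumes a 365.25-day year and counts all downtime equally, planned or not.

```python
# Convert an availability percentage into allowed downtime per year.
# Assumption: a year of 365.25 days; all downtime counts equally.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_seconds_per_year(availability_percent):
    """Return the seconds of downtime per year implied by an uptime percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * SECONDS_PER_YEAR


# Illustrative values, from three 9s up to the seven 9s mentioned above.
for percent in (99.9, 99.99, 99.999, 99.99999):
    seconds = downtime_seconds_per_year(percent)
    print(f"{percent}% uptime allows about {seconds:,.1f} seconds of downtime per year")
```

Each additional 9 cuts the allowed downtime by a factor of ten: three 9s leaves several hours per year, while seven 9s leaves only a few seconds.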

Recently, the Itanium architecture has led the market in terms of reliability. Machine Check Architecture (MCA), reduced impact from soft errors (cosmic radiation), core-level lock step (instructions in parallel execution) and extensive error correction (ECC) on cache, bus & buffers are just some of the reliability terms that Intel and Itanium system vendors will highlight as differentiators.

But the x86 segment is not sitting still. Moore’s law allows x86 processors to carry additional circuitry dedicated to reliability and hardening of the entire server. Other innovation efforts allow virtual SMP solutions with failover if one of the server nodes fails. Judging the overall improvements in reliability and measuring the number of 9s is difficult, but some IT managers think this is the future.
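One reason redundant nodes are attractive can be shown with a textbook calculation. The sketch below estimates the availability of a group of nodes where any single surviving node can carry the load. It rests on strong assumptions that real systems rarely meet: node failures are independent, and the switch to a surviving node is instantaneous and never fails itself. The function name and the 99% figure are hypothetical, chosen only for illustration.

```python
# Idealized availability of n redundant nodes (textbook parallel model).
# Assumptions: failures are independent and failover is perfect and instant.
# Real clusters share power, network, software, and operators, so treat the
# result as an upper bound rather than a prediction.


def redundant_availability(node_availability, node_count):
    """Probability that at least one of node_count nodes is up."""
    all_nodes_down = (1 - node_availability) ** node_count
    return 1 - all_nodes_down


# Hypothetical example: nodes that are each 99% available on their own.
for count in (1, 2, 3):
    combined = redundant_availability(0.99, count)
    print(f"{count} node(s): {combined:.6%} available")
```

Under these assumptions, every extra node multiplies the unavailability by that of a single node, so modest hardware can add up to many 9s on paper. The gap between that paper figure and measured uptime is part of what makes the number of 9s so hard to judge in practice.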

And again, attractive solutions also exist in the middle. Take, for example, HP’s recent business intelligence solution, which is based on a cluster of 2-way Itanium servers. And for those who prefer a UNIX environment, HP’s virtualization software (VSE) can provide the mission-critical reliability that only large scale-up solutions offered in the past.

Bottom line: Innovation around server infrastructure happens at multiple levels, with higher-end architectures such as Itanium continuing to move the goalposts for reliability features. At the same time, it is increasingly difficult to measure how much actual system uptime (and how many 9s of reliability) a given solution provides.

P.S. In my day job I manage Intel’s server business (including Itanium & Xeon product lines) in Europe, Middle-East & Africa.