Posts Tagged ‘Reliability’

Illuminata Video Series Part II: Amidst & Among

The second video in a series of four featuring Jonathan Eunice, co-founder and principal IT advisor for Illuminata, and Joan Jacobs, Alliance president and executive director, can be viewed below. In this episode, Jonathan discusses the reality of diversified datacenters including the management of applications, costs, and risk. See the corresponding slide deck here.

NonStop in India

A recent article from an Indian publication, BusinessLine, described how, for over 30 years, HP NonStop has powered some of India’s most critical installations. The article interviews Santanu Ghose, Country Head, Business Critical Systems and NonStop Servers, HP India. Ghose explains how the key fundamentals of HP NonStop have remained the same and how the platform has evolved on Itanium for improved performance and energy efficiency. Ghose says:

“The HP Integrity NonStop family now starts with “seven 9s” (99.99999 per cent availability). Some key changes, over time, include the standardisation of NonStop based on Intel Itanium architecture; leveraging the power of HP Blade architecture to deliver lower TCO and energy efficiency and collaborating with ISVs (independent software vendors) to create an ecosystem of solutions based on NonStop.”

Read the entire article from BusinessLine here.

$1.79 Trillion Running on Itanium

In a recent article from India Infoline, Anjan Choudhury, Chief Technology Officer of Bombay Stock Exchange, talks about the current economic climate as a good opportunity to upgrade mission-critical IT infrastructure. He says:

“In our business, we need to have a reliable, scalable and fault tolerant system. We use HP Integrity Nonstop servers with Intel® Itanium® processors as our core back-end system. Itanium-based systems have very successfully catered to our business requirements during the last bull-run. We have further upgraded the systems even in the present scenario as we need to be ready for the next wave.”

About: The Bombay Stock Exchange is the world’s number one exchange in terms of the number of listed companies and the world’s 5th in transaction numbers. Its market capitalization as of December 31, 2007 stood at USD 1.79 trillion. An investor can choose from more than 4,700 listed companies.

Read the entire article here.

What does Mission Critical mean?

I have often been asked to define what I consider to be “mission-critical,” since it can mean different things to different people. My Intel colleague in Europe, Joachim Aertebjerg, posed a similar question in a recent post. Is it a system with five 9s of reliability or seven 9s? For some users, five 9s is good enough, while for others seven 9s is still not good enough. (Seven 9s, by the way, works out to just over 3 seconds of downtime per year.)
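
For readers who want to check the arithmetic behind the “9s,” here is a minimal sketch in Python. It assumes a 365-day year and treats “N nines” as an availability of 1 - 10^-N; the function name is purely illustrative and not taken from any vendor tool.

```python
# Convert a number of "nines" of availability into allowed downtime per year.
# Assumption: a 365-day year (31,536,000 seconds), leap years ignored.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Return the yearly downtime budget, in seconds, for `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    unavailability = 10.0 ** (-nines)  # e.g. 7 nines -> 0.0000001
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 5, 7):
        seconds = downtime_seconds_per_year(n)
        print(f"{n} nines: {seconds:.2f} seconds of downtime per year")
```

Under these assumptions, five 9s allows a little over five minutes of downtime per year, while seven 9s allows only about three seconds.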
 
What I usually tell customers is to consider this: a manufacturing company ships out 10 PCs, 30 printers and 1 server from its manufacturing facilities every minute. So what is the cost of every minute of downtime? What about 10 minutes of downtime? I’ll let you do the math. Some may argue that such small interruptions are fine because the company can work overtime to catch up, but what if it is the end of the quarter and quarter-close revenues will be impacted? Or what if the shipment is destined for a very important customer whose future orders depend on this shipment reaching its destination on time?
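
One rough way to do that math is sketched below in Python. The shipping rates come from the example above; the unit values are hypothetical placeholders chosen only for illustration, so substitute your own figures. The sketch counts only the value of shipments not made and ignores knock-on costs such as overtime or lost future orders.

```python
# Units shipped per minute, as given in the example above.
UNITS_PER_MINUTE = {"pc": 10, "printer": 30, "server": 1}

# HYPOTHETICAL unit values in dollars. These are placeholders for
# illustration only and do not come from the original post.
HYPOTHETICAL_UNIT_VALUE = {"pc": 1000.0, "printer": 200.0, "server": 5000.0}


def shipment_value_lost(minutes_down: float,
                        units_per_minute: dict,
                        unit_value: dict) -> float:
    """Value of shipments that do not go out during `minutes_down` minutes."""
    value_per_minute = sum(
        rate * unit_value[product]
        for product, rate in units_per_minute.items()
    )
    return minutes_down * value_per_minute


if __name__ == "__main__":
    for minutes in (1, 10):
        lost = shipment_value_lost(
            minutes, UNITS_PER_MINUTE, HYPOTHETICAL_UNIT_VALUE
        )
        print(f"{minutes} minute(s) of downtime: ${lost:,.0f} not shipped")
```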
 
A more obvious example would be systems used in a stock exchange or other financial services institutions. The cost of downtime here could run into billions of dollars, not to mention other implications such as loss of reputation. I’ll have more to say about this soon.
 
For customers that need this level of mission-critical reliability, Itanium-based systems can deliver.

How many 9s of reliability do you need?

While the scale-up vs. scale-out discussions continue at large datacenters around the globe, the question of reliability often creeps in. Most people involved with server infrastructure agree that reliability is important. The question becomes “how much reliability is adequate?” Servers such as HP Integrity NonStop, with triple redundancy and a special operating system, compete with mainframes to be the most reliable server solution available. 99.99999% uptime used to be the benchmark for these types of solutions. At the other end of the scale, one can find hardware vendors offering servers built with desktop components and open source operating systems running the IT backbone in small and medium-sized businesses, obviously with a lot less reliability built in. Between these extremes is where the various clusters of smaller nodes and SMP systems reside and, in my opinion, where a lot of innovation is happening.

Recently, the Itanium architecture has led the market in terms of reliability. Machine Check Architecture (MCA), reduced impact from soft errors (cosmic radiation), core-level lockstep (paired cores executing the same instructions in parallel so that their results can be compared) and extensive error correction (ECC) on caches, buses and buffers are just some of the reliability features that Intel and Itanium system vendors will highlight as differentiators.

But the x86 segment is not sitting still. Moore’s law allows x86 processors to carry additional circuitry dedicated to reliability and hardening of the entire server. Other innovation efforts allow virtual SMP solutions with failover if one of the server nodes fails. Judging the overall improvements in reliability and measuring the number of 9s is difficult, but some IT managers think this is the future.

And again, attractive solutions also exist in the middle. Take, for example, HP’s recent business intelligence solution, which is based on a cluster of 2-way Itanium servers. And for those who prefer a UNIX environment, HP’s virtualization software (VSE) can provide the mission-critical reliability that only large scale-up solutions did in the past.

Bottom line: Innovation around server infrastructure happens at multiple levels, with higher-end architectures such as Itanium continuing to move the goalposts for reliability features, and it is increasingly difficult to measure how much actual system uptime (and how many 9s of reliability) a given solution provides.

P.S. In my day job I manage Intel’s server business (including Itanium & Xeon product lines) in Europe, Middle-East & Africa.