Friday, August 29, 2008

High Availability for AIX

Why HA at All?

While this question may seem counterintuitive, it’s not as cut and dried as it may appear. Oftentimes, the complexities of configuring a highly available environment aren’t worth the expense or the effort. How do we determine this? First and foremost it’s about the Service Level Agreement (SLA) you have with your customers. If you don’t have an SLA, odds are you don’t even need high availability.

An SLA is an agreement between the business and IT that describes the availability required by the application. Some applications, such as the software that powers ATMs for banks, can’t afford any downtime. In this scenario, high availability might not be enough; the applications may need a fault-tolerant system. Fault-tolerant systems are configured to prevent downtime altogether, which usually involves procuring and configuring redundant clustered systems. High availability isn’t fault tolerance: it typically involves some downtime, usually measured in minutes, while the failover systems kick into high gear.

The next step is to determine whether you need high availability. These types of discussions are usually held in the design phase of a new application or system deployment. If IT does its homework, it can discuss acceptable levels of downtime and answer several key questions. Can the system be down at all? What is the business impact, in dollars, of downtime? Downtime can be either scheduled or unscheduled. If you can’t take a system down, how are you going to apply patches, upgrade technology levels or apply service packs? At the same time, how are you going to recycle databases, upgrade databases or apply application-level patches? When you have a highly available system, you can simply fail over to the backup node (during a mutually agreed upon window), do maintenance work and then fail over the other way when it’s time to patch that system.
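To make "acceptable levels of downtime" concrete, it can help to translate an availability target into a time budget and a rough dollar figure. What follows is only a minimal sketch in Python. The function names, the example targets and the cost-per-hour input are illustrative assumptions, not figures from any real SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the minutes of downtime per year that an availability target allows."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def downtime_cost(availability_pct: float, cost_per_hour: float) -> float:
    """Estimate the yearly cost of using the whole downtime budget.

    cost_per_hour is a hypothetical business estimate supplied by the caller.
    """
    return downtime_budget_minutes(availability_pct) / 60 * cost_per_hour


if __name__ == "__main__":
    # Example targets chosen for illustration only.
    for target in (99.0, 99.9, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% availability allows {minutes:.1f} minutes of downtime per year")
```

A 99.9% target, for example, works out to about 525.6 minutes (a little under nine hours) of downtime a year, which is the kind of number the business and IT can actually negotiate over.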

The net of it all is that the decision whether or not to configure systems for high availability should never be strictly a technical decision. Management needs to sign off on the decision and the business also needs to be ready to pick up the tab for the expense of implementing this type of system. This expense is measured not only in the price of the software, but also in deploying the solution and in educating staff how to maintain it.

IBM’s HACMP

So what’s available for Power Systems customers? Let’s start with IBM’s flagship product HACMP, now called IBM PowerHA Cluster Manager (HACMP). HACMP has been around in some form for the better part of two decades, and I myself have used it in varying capacities for almost a decade. It has definitely come a long way. In fact, starting with HACMP 5.4, it’s available for both the AIX OS and Linux on the Power Systems platform. While there are other clustering solutions available for Linux, HACMP for Linux uses the same interface and configurations as HACMP for AIX, providing a common multi-platform solution that protects your investment as you grow your cluster environments.

IBM also offers an extended distance version, which is commonly referred to as the Disaster Recovery or Business Continuity version of the product. This extends HACMP’s capabilities by replicating critical data—allowing for failover to a remote location. The product is called HACMP/XD.

What’s New with HACMP?

During the days of yore, HACMP was extremely difficult to configure and manage. Through the years it’s gotten much easier to manage. Part of the reason for this is the HACMP Smart Assist programs, which are available for enterprise applications such as DB2, Oracle and WebSphere. These use application-specific knowledge to extend HACMP’s standard auto-discovery features while providing the necessary application monitors and start/stop scripts to streamline configuration of the cluster.

In a sense, HACMP 5.4.1 is a fixpack for HACMP 5.4.0. However, it also brings many enhancements that align the product with new features in AIX 6 and the POWER6 architecture. Enhancements include:

* Enhanced usability for WebSMIT
* Support for AIX 6.1 workload partitions (WPARs)
* NFS V4 support (requires AIX 5.3 TL 7 or AIX 6.1)
* New RPV Status monitor
* Support for PPRC consistency groups

IBM’s HACMP solution is mature, rock solid and has the advantage of being joined at the hip with AIX and the Power Systems platform. With HACMP, one thing you’ll never have to worry about when calling IBM support is someone trying to point the finger of blame at a third-party product.

It should also be noted that there are some exciting third-party tools that also offer high availability in the AIX arena, specifically Vision Solutions’ EchoCluster and EchoStream, and Veritas Cluster Server.

EchoCluster/EchoStream

Vision Solutions offers two products: EchoCluster for AIX and EchoStream for AIX. Vision’s high availability solution combines the two, providing continuous protection that lets you both prevent downtime and recover data from a single point in time. EchoStream offers Continuous Data Protection, which provides the point-in-time recovery. EchoCluster provides the failover capabilities; in many ways, it’s similar to IBM’s HACMP, though on a smaller scale. It allows for both automated and scheduled failover, even at the application level, without disturbing other applications. This is very helpful when you have multiple applications on a single LPAR and need to patch or upgrade the OS to support an application. In this scenario, you could fail over the applications that have a more stringent SLA, patch your servers and still run your applications. I’ve seen the GUI, and in many ways this is the simplest of all the HA solutions I’ve seen that work with AIX. That simplicity can eliminate the special training required to run a complex system like HACMP. This solution isn’t for everyone; it’s marketed to businesses that don’t have complicated clusters. If you have a multi-clustered environment with lots of complexities, you should probably stay with HACMP or evaluate Veritas.

Veritas Cluster Server

Because of its storied history with Sun and Solaris, most people aren’t even aware that Veritas plays in the AIX world. While it doesn’t have the AIX maturity of a product like HACMP, Veritas has had a product that works with AIX for more than five years: Veritas Cluster Server, or VCS. Although it competes with HACMP in the high availability market on the AIX OS, it’s mostly a niche product for companies that have already standardized on Veritas (usually those with large Solaris server farms) and aren’t prepared to make the investment to learn or buy new technology. Owned by Symantec, it’s also the most expensive of the three choices discussed here. One advantage: among administrators I’ve spoken with who have worked with both Veritas and HACMP, the consensus is that it’s much easier to build out and maintain a cluster with Veritas than with HACMP. Though HACMP has gotten much more admin-friendly in recent years, in many ways VCS is more intuitive.

Evaluate Your Options

Competition is always a good thing, especially when you have these kinds of choices. The safest of the three choices presented here is always going to be HACMP because of its maturity and tight integration with the AIX OS and the Power Systems platform. Without a doubt, Vision’s EchoCluster and Veritas Cluster Server are also strong products, each with a client base that proves its worth. While not as well known as HACMP or Veritas, Vision is starting to exert some real muscle as a viable alternative to more complex solutions in the market space. Veritas is certainly a viable solution for those that have large Solaris server farms and/or strong Veritas administrators. When evaluating HA technology, I would evaluate all three before making any choices.


What is HACMP?
Before we explain what HACMP is, we have to define the concept of high availability.


High availability

High availability is one of the components that contributes to providing
continuous service for the application clients, by masking or eliminating both
planned and unplanned system and application downtime. This is achieved through
the elimination of hardware and software single points of failure (SPOFs).

High availability solutions should eliminate SPOFs through appropriate design,
planning, selection of hardware, configuration of software, and a carefully
controlled change management discipline.
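One simple way to think about SPOFs during design is to list each kind of component the application depends on and flag any that has no redundant partner. The following is a minimal sketch in Python under that assumption. The function name and the sample inventory (node names, adapters, power feeds) are hypothetical and not taken from any HACMP tool.

```python
from typing import Dict, List


def find_spofs(inventory: Dict[str, List[str]]) -> List[str]:
    """Return the component roles backed by fewer than two distinct instances."""
    return sorted(
        role for role, instances in inventory.items() if len(set(instances)) < 2
    )


if __name__ == "__main__":
    # Hypothetical inventory for a two-node cluster.
    cluster = {
        "node": ["nodeA", "nodeB"],
        "network adapter": ["en0", "en1"],
        "disk path": ["fscsi0", "fscsi1"],
        "power feed": ["feed1"],
    }
    print(find_spofs(cluster))  # prints ['power feed']
```

A real design review goes well beyond counting components (two adapters on the same switch still share a SPOF), but even a list like this makes the gaps easy to spot.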

Downtime
Downtime is the period during which an application is not available to serve its
clients. Downtime can be classified as:

Planned:
– Hardware upgrades
– Repairs
– Software updates/upgrades
– Backups (offline backups)
– Testing (periodic testing is required for cluster validation)
– Development

Unplanned:
– Administrator errors
– Application failures
– Hardware failures
– Environmental disasters
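The planned/unplanned split above is also a convenient way to keep score. As a rough sketch in Python, assuming outages are recorded as (category, minutes) pairs, the following totals each category and reports the availability actually achieved over a period. The function name and the sample figures are illustrative assumptions.

```python
from typing import Dict, Iterable, Tuple


def summarize_downtime(
    events: Iterable[Tuple[str, float]], period_minutes: float
) -> Dict[str, float]:
    """Total planned and unplanned downtime and compute achieved availability."""
    totals = {"planned": 0.0, "unplanned": 0.0}
    for kind, minutes in events:
        if kind not in totals:
            raise ValueError(f"unknown downtime category: {kind!r}")
        totals[kind] += minutes
    down = totals["planned"] + totals["unplanned"]
    # Availability counts every minute the application could not serve clients.
    totals["availability_pct"] = 100 * (1 - down / period_minutes)
    return totals


if __name__ == "__main__":
    # Hypothetical month: two maintenance windows and one hardware failure.
    month = 30 * 24 * 60
    outages = [("planned", 120), ("unplanned", 30), ("planned", 60)]
    print(summarize_downtime(outages, month))
```

Whether planned downtime counts against the target is something the SLA itself has to spell out; this sketch counts both.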

Short description for HACMP:

The IBM high availability solution for AIX, High Availability Cluster
Multi-Processing (HACMP), is based on the well-proven IBM clustering technology,
and consists of two components:

High availability: The process of ensuring an application is available for use
through the use of duplicated and/or shared resources.

Cluster multi-processing: Multiple applications running on the same nodes
with shared or concurrent access to the data.

A high availability solution based on HACMP provides automated failure
detection, diagnosis, application recovery, and node reintegration. With an
appropriate application, HACMP can also provide concurrent access to the data
for parallel processing applications, thus offering excellent horizontal scalability.
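To illustrate the general idea of failure detection, application recovery and node reintegration, here is a toy simulation in Python. It is not how HACMP is implemented and it uses no HACMP interfaces. The class names, the missed-heartbeat threshold and the log messages are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    healthy: bool = True
    missed_heartbeats: int = 0


@dataclass
class ToyCluster:
    nodes: List[Node]
    owner: Optional[str] = None  # node currently running the application
    max_missed: int = 3          # assumed threshold before declaring failure
    log: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.owner is None:
            self.owner = self.nodes[0].name

    def _get(self, name: str) -> Node:
        return next(n for n in self.nodes if n.name == name)

    def tick(self) -> None:
        """One monitoring interval: count missed heartbeats, fail over if needed."""
        for node in self.nodes:
            node.missed_heartbeats = 0 if node.healthy else node.missed_heartbeats + 1
        current = self._get(self.owner)
        if current.missed_heartbeats >= self.max_missed:
            self._fail_over(current)

    def _fail_over(self, failed: Node) -> None:
        """Move the application to the first healthy node other than the failed one."""
        standby = next(
            (n for n in self.nodes if n.healthy and n.name != failed.name), None
        )
        if standby is None:
            self.log.append(f"{failed.name} down, no healthy standby")
            return
        self.owner = standby.name
        self.log.append(f"failover {failed.name} -> {standby.name}")

    def reintegrate(self, name: str) -> None:
        """Bring a repaired node back as a standby without moving the application."""
        node = self._get(name)
        node.healthy = True
        node.missed_heartbeats = 0
        self.log.append(f"{name} rejoined as standby")
```

In this sketch the application stays on the surviving node after the failed node rejoins. Whether to fall back automatically is a policy choice that a real cluster manager makes configurable.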

A typical HACMP environment is shown in Figure 1-1.
