There are a number of high availability solutions for IBM i-based systems, but they can be broken down into two broad classes: hardware-level and logical. Each has unique qualities.
Reliability has been a hallmark of the AS/400 since its inception. Careful design and integrated hardware and software gave it a reputation as a system that almost never crashed. Nonetheless, reliability does not equal availability. Databases must be backed up and reorganized. Operating systems must be upgraded. And disasters can happen. Any of these incidents can create system unavailability.
For many organizations, reliability is necessary but not sufficient. Depending on the nature of their business, downtime may be unthinkable.
A major factor in the increased sensitivity to downtime is the current dependence on IT. When companies first computerized, they maintained manual systems that they could revert to if necessary. That's often no longer the case. Now, a system stoppage typically creates a business stoppage. For a large company, this may mean that operations worth millions of dollars an hour come to a halt.
Plus, lengthy system outages can create a perception of an inept company in the eyes of customers, suppliers, and industry watchdogs. The resulting profitability and viability impacts may be far greater than the loss of sales attributed directly to a system downtime event.
For some companies, availability isn't an option. It's mandatory. An increasingly rigorous regulatory environment forces many organizations to address their data availability issues. This is particularly true in the financial services and healthcare sectors, as well as in publicly traded companies.
This article reviews some of the unique features of IBM i that influence data and application availability, examines options for increasing availability on IBM i, looks at the role of IBM HA Business Partners, and offers some suggestions for helping you choose the most appropriate HA solution for your organization.
Unique IBM i Platform Technologies
AS/400 has always been a unique platform. Despite the merging of what had been the System i and System p hardware platforms into Power Systems, that uniqueness remains.
This section will review some of the distinctive features of the platform, including the following:
- Single-level store
- Switchable Independent Auxiliary Storage Pools (IASPs)
- OS-level remote and local journaling
Single-level store, also referred to as single-level storage, was a founding technology in AS/400 and probably its most widely discussed unique feature. With a single-level store architecture, IBM i treats all storage—both main memory and attached disks—as a single large memory organized into a two-dimensional view of address spaces.
In many cases, this reduces the number of instructions that the operating system has to execute to access and manage storage. The result is improved application and operating system performance.
It is important to note that neither programmers nor users would suffer if they remained ignorant of the existence of the single-level store because data objects on IBM i are accessed by name, never by address. The single-level store is relevant only much closer to the machine level than any human typically sees.
In IBM i-speak, an Auxiliary Storage Pool (ASP) is simply a software definition of a collection of disk storage units. An IASP is similar, the difference being that an IASP is a collection of storage that you can bring online or take offline independent of the rest of the storage on a system—hence the "I," which stands for "Independent."
IASPs can be either switchable or non-switchable. Switchable IASPs can be easily switched between different IBM i-based systems.
Clearly, all systems that will potentially use a switchable IASP must be able to address that storage. To facilitate this, switchable IASPs make use of the IBM i clustering framework to control the allocation of IASP address ranges. When an IASP is created, the entire assigned range of system addresses is reserved on all systems that can access the IASP.
Journaling is at the heart of most logical HA solutions, but it also plays a part in some hardware-level technologies. When an object is enrolled in the journal, any update to that object is written first to the journal and then to the database itself. While it might seem counterintuitive, the journal is thus more up-to-date than the database itself. As such, the journal is invaluable for HA.
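To make the journal-first ordering concrete, here is a minimal Python sketch. It is a toy model rather than a representation of how IBM i implements journaling: the class, function, and account names are invented for illustration, and the "failure" is simulated by never applying the last entry to the table.

```python
from dataclasses import dataclass, field


@dataclass
class JournaledTable:
    """Toy model: each change is appended to the journal before the table is updated."""
    journal: list = field(default_factory=list)  # ordered (key, value) entries
    rows: dict = field(default_factory=dict)     # the "database" itself

    def log(self, key, value):
        # Step 1: the journal entry is written first.
        self.journal.append((key, value))

    def apply_pending(self):
        # Step 2: the table catches up with the journal (replay is idempotent here).
        for key, value in self.journal:
            self.rows[key] = value


def rebuild_from_journal(journal):
    """Replay journal entries in order to reconstruct the table contents."""
    rows = {}
    for key, value in journal:
        rows[key] = value
    return rows


table = JournaledTable()
table.log("ACCT-1", 100)
table.apply_pending()
table.log("ACCT-1", 250)   # journaled, but the simulated failure hits before the table update

stale = dict(table.rows)                         # what the table alone still shows
recovered = rebuild_from_journal(table.journal)  # what the journal can reconstruct
print("table:", stale, "journal replay:", recovered)
```

In this sketch the table still holds the older value while the journal replay yields the newer one, which is the sense in which the journal is the more current of the two.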
In addition to local journaling, IBM i also offers remote journaling that writes transaction information to both the local journal and to journal receivers running on another system. This second system can be in the same room, or it might be on the other side of the globe. If the remote replica is in another location, it should be distant enough that it will not be affected by any disaster that may befall the primary server.
Remote journaling can operate in either synchronous or asynchronous mode. Synchronous remote journaling first sends transactions to the remote journal receivers. Those transactions are written to the local journals and the local database only after the local system receives an acknowledgement from the remote journal receivers indicating that they have successfully received the transactions. Asynchronous remote journaling, by contrast, does not hold up local processing while waiting for that acknowledgement; journal entries are sent to the remote system shortly after the fact.
The benefits and drawbacks of synchronous and asynchronous journaling are readily apparent.
When using synchronous journaling, because user transactions are not considered to be complete until they have been received by the remote journals, the remote journal is even more up-to-date than the local journals at any point in time. Thus, 100 percent of transaction data can be recovered if a disaster or other failure strikes the local system.
On the other hand, because user transactions are held in abeyance until the remote journal has received them and acknowledged their receipt, network bottlenecks may result in unacceptably slow user response times. For this reason, synchronous journaling is rarely used when great distances separate the primary and remote systems.
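The trade-off can be put into rough numbers with a small Python sketch. It assumes a deliberately simple additive model (user-visible time equals the local write time plus, in synchronous mode, one network round trip), and every figure in it is hypothetical rather than measured on a real system.

```python
def response_time_ms(local_write_ms, round_trip_ms, synchronous):
    """User-visible time for one transaction under a simple additive model."""
    return local_write_ms + (round_trip_ms if synchronous else 0.0)


def entries_at_risk(tx_per_second, send_lag_seconds, synchronous):
    """Journal entries that exist only on the primary if it fails right now."""
    return 0 if synchronous else int(tx_per_second * send_lag_seconds)


# Hypothetical figures, chosen only for illustration.
LOCAL_WRITE_MS = 2.0     # time to complete the local write
ROUND_TRIP_MS = 80.0     # network round trip to a distant site
TX_PER_SECOND = 500      # transaction rate on the primary
SEND_LAG_SECONDS = 0.5   # how far the asynchronous sender trails the primary

for synchronous in (True, False):
    label = "synchronous" if synchronous else "asynchronous"
    print(
        label,
        response_time_ms(LOCAL_WRITE_MS, ROUND_TRIP_MS, synchronous), "ms per transaction,",
        entries_at_risk(TX_PER_SECOND, SEND_LAG_SECONDS, synchronous), "entries at risk",
    )
```

Under these assumptions the synchronous mode pays the full round trip on every transaction but leaves nothing unsent, while the asynchronous mode keeps response times local at the cost of a window of unsent entries.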
IBM i offers a variety of hardware-level HA facilities either inherent in the operating system or as optional components, including the following:
- Switchable I/O pools
- Switchable storage towers
- LUN-level switching
- XSM Geographic Mirror
- PPRC Metro Mirror
- PPRC Global Mirror
The first three of these technologies maintain only a single store of your data that can then be switched between IBM i-based servers. As such, they do not, on their own, offer protection against data destruction.
The technologies discussed below make use of or can make use of IBM Cluster Resource Services to maintain cluster nodes, monitor node availability, and provide switchover/failover functionality.
Switchable I/O Pools
A switchable I/O pool is configured on a disk internal to an IBM i-based system. It can be switched between two or more LPARs within that system. This can be useful when, for example, installing a software upgrade in one partition. When the upgrade is complete, the I/O pool can be switched between the two partitions with little or no operational downtime.
Switchable Storage Towers
Switchable tower technology is similar to switchable I/O pools, but rather than being internal disk units, the storage is contained in a tower that has been assigned to a single IASP. An HSL cable loop connects the storage tower to two systems. The IASP can then be switched between those systems.
This may not be a feasible option in the future. Many organizations are replacing HSL cables with 12X cable technology because 12X offers a bandwidth advantage that can be as high as 50 percent. However, the 12X cable technology is not compatible with switchable tower technology.
LUN-Level Switching
IBM i 7.1 introduced LUN-level switching as an operating system option when using IBM System Storage DS6000 or DS8000. With LUN-level switching, two or more systems or partitions within a cluster can connect to the same storage unit within a SAN. The Logical Unit Numbers (LUNs) that define an IASP can then be switched between the systems as required.
Because this option, like switchable I/O pools and storage towers, facilitates the switching of only a single store of your data, LUN-level switching is typically used in conjunction with a disk mirroring or data replication technology to provide a disaster recovery option.
XSM Geographic Mirror
XSM Geographic Mirror uses functions within IBM i to maintain a copy of an IASP on a second system in a cluster. The IASP can be varied on to either system, but not both simultaneously.
Because XSM Geographic Mirror performs synchronous mirroring, it is not suitable for maintaining mirrors over great distances. Consequently, it does not protect data against a disaster that would affect an entire data center.
PPRC Metro Mirror
PPRC Metro Mirror works at the storage unit level within a SAN. It uses IBM TotalStorage functions to create a copy of an IASP on another SAN attached to a second system within a cluster. When the primary system becomes unavailable or needs to be taken offline for maintenance, the replica IASP can be varied on to the backup system, which will then assume the production role.
Like XSM Geographic Mirror, PPRC Metro Mirror is a synchronous technology that is not suitable for long-distance replication of data.
PPRC Metro Mirror is an IBM proprietary technology that operates only on IBM storage units. EMC has a similar technology called SRDF/S.
PPRC Global Mirror
PPRC Global Mirror is similar to Metro Mirror, but it functions asynchronously. As a result, the distance between production and backup IASPs will not affect user response times.
The lag between transactions being committed on the primary system and their being reflected on the backup IASP puts data at risk. To lessen this risk, PPRC uses multiple communication paths to reduce data latencies, and it adds a point-in-time technology that maintains a consistent copy of the IASP by writing sectors in the same order in which they were written on the primary system.
EMC's version of asynchronous mirroring is called SRDF/A.
Logical HA Technologies
In this context, "logical" is not a synonym for rational. Rather, it means that the technology is based on program logic that operates further above the hardware level than hardware-level technologies do (although still below the user application level). These HA applications copy all data on a production server to storage units on a second server. The HA software then maintains the currency of the backup data by replicating production system changes as they happen.
The two servers may be any of the following:
- Partitions on a single system
- Partitions on separate systems located in the same facility
- Standalone servers located at the same facility
- Servers (partitions on separate systems or standalone servers) considerably distant from each other
When the primary and backup servers are located in the same facility, the solution still provides high availability in the sense that it can prevent or reduce downtime from events such as scheduled maintenance or single-system failure. However, it does not prevent downtime that results from disasters that knock out an entire facility.
In contrast, when the primary and backup servers are geographically remote from each other, at least one of the facilities is likely to survive any disaster. Thus, if the primary server goes down, users can be switched to the backup.
Logical HA products usually provide facilities for automating the switchover process. In addition, they may include functionality that monitors the availability of the primary system from the backup. When the primary system is unavailable, the software may then be able to automatically initiate a switchover, although it is usually referred to as a "failover" in these cases.
The failover facility, if it exists in the product you buy, can usually be turned off if you prefer to retain control over when a switchover is initiated.
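The monitoring logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the behavior of any particular HA product: the heartbeat timeout, the function name, and the three outcomes are all assumptions made for the example.

```python
def failover_decision(seconds_since_heartbeat, timeout_seconds, auto_failover_enabled):
    """Decide what the backup node should do: 'none', 'alert_operator', or 'failover'."""
    if seconds_since_heartbeat <= timeout_seconds:
        return "none"            # primary still appears to be available
    if auto_failover_enabled:
        return "failover"        # the software initiates the role swap itself
    return "alert_operator"      # automatic failover turned off: a person decides


# Hypothetical 30-second timeout.
healthy = failover_decision(5, 30, auto_failover_enabled=True)
automatic = failover_decision(45, 30, auto_failover_enabled=True)
manual = failover_decision(45, 30, auto_failover_enabled=False)
print(healthy, automatic, manual)
```

A real product would typically require more evidence than a single missed heartbeat before acting, since a network glitch between the two systems can look the same as a primary failure from the backup's point of view.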
In the IBM i world, logical HA technologies use journaling—including both user and system journals—to capture changes made on the primary system. Remote journaling is usually, but not always, the transport mechanism for replication to the backup system.
Some vendors include an option for using a proprietary transport mechanism. This allows them to add functionality to the process, although possibly at a slight performance cost.
Regardless of how the solution transmits the changed data between the two systems, the HA software is responsible for moving the data out of the journal (or other data store if remote journaling is not used) on the backup machine and into the appropriate databases or files. This is necessary to ensure that the system remains a hot-standby replica server that is ready to take over operations whenever necessary.
Logical HA technologies typically operate asynchronously, thereby virtually eliminating the impact on user application performance. However, when they employ remote journaling as the change capture and transmission mechanism, you can use synchronous mode if you have zero tolerance for lost data. Synchronous remote journaling is subject to the same user response time issues as synchronous hardware-level disk mirroring.
Because hardware-level HA technologies are tied to the hardware, they are available only from the hardware vendor. Logical HA solutions, on the other hand, are available from both IBM and independent software vendors.
Continuous Data Protection
Traditional HA technologies do not offer complete protection for your data. Most data losses are not of a catastrophic nature. Instead, they result from more common occurrences, such as the accidental deletion of a file or the corruption of one or a few data items.
Traditional HA technologies cannot help in the remediation of these problems as they replicate data changes in real-time or near real-time. Thus, even if the problem is discovered within minutes of it occurring, the accidental deletion or corruption will probably already be reflected on the backup server, meaning that it no longer provides a source for data recovery.
A newer technology, Continuous Data Protection (CDP), can fill this gap. Like the replication process of traditional HA technologies, CDP captures changes applied to the production system and transmits them to a backup server. Unlike with HA technologies, however, those changes are not applied to operational files and databases. Instead, CDP stores information about each update.
When some data is corrupted or mistakenly deleted, the CDP software can use the information it stored about each update to restore the data to its state at a particular point in time.
CDP comes in two flavors. True CDP transmits changes to the backup server as they occur, allowing data to be restored to its state at any point in time. Near CDP accumulates updates and transmits them in a batch only at certain intervals, possibly when a file is closed. With near CDP, data can be restored only to its state at one of the batch boundaries.
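The difference between the two flavors is easiest to see with a small worked example. The Python sketch below uses an invented change log of (timestamp, key, value) entries, where a value of None stands for a deletion; it is an illustration of the idea, not the storage format of any CDP product.

```python
from bisect import bisect_right


def state_at(changes, when):
    """Rebuild the data as of `when` from a time-ordered change log."""
    state = {}
    for timestamp, key, value in changes:
        if timestamp > when:
            break
        if value is None:
            state.pop(key, None)   # None marks a deletion
        else:
            state[key] = value
    return state


def nearest_restore_point(batch_times, when):
    """Near CDP: the latest batch boundary at or before `when`, or None if there is none."""
    index = bisect_right(batch_times, when)
    return batch_times[index - 1] if index else None


# Hypothetical change log: item A is accidentally deleted at time 30.
changes = [(10, "A", 1), (20, "B", 2), (30, "A", None)]

# True CDP can restore to any moment, such as just before the deletion.
true_cdp = state_at(changes, 29)

# Near CDP can restore only to a batch boundary (boundaries assumed at 0, 15, and 45).
boundary = nearest_restore_point([0, 15, 45], 29)
near_cdp = state_at(changes, boundary)

print("true CDP:", true_cdp)
print("near CDP (restored to time", boundary, "):", near_cdp)
```

With these assumed boundaries, the near CDP restore recovers item A but loses the update to item B made at time 20, because the closest usable boundary precedes it.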
CDP does not necessarily require a second IBM i-based system. Instead, it typically stores backup data in a cross-platform file format. Therefore, CDP may make use of IBM i, Windows, Linux, AIX, or a UNIX system as the backup server.
Hardware-level and logical HA technologies each have unique benefits and drawbacks. To choose the one that's best for your organization, you must understand these differences, assess your organization's requirements, and then choose the solution that best serves your needs.
The following discussion briefly reviews some of the more important differences between hardware-level and logical replication.
Inter-System Bandwidth: User Response Time and Data Latency Issues
Because the HA technology must transmit all updates from the primary server to the backup, bandwidth between the two systems may be an issue. When using asynchronous replication, the critical consideration is the volume of data that might be awaiting replication at the time of a primary system failure. In general, the more constrained the bandwidth is and the higher the transaction volume, the greater this data latency will be. Depending on the nature of a failure, all of this data may be irretrievably lost.
This is not a problem with synchronous replication because the remote journal currency is never behind the primary system. However, this introduces another problem. If bandwidths are severely constrained, users may find their systems locked for unacceptable lengths of time while the journaling function waits for an acknowledgment to return from the remote system. In this case, response times increase depending on how overburdened the pipes between the two systems are.
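A back-of-the-envelope calculation shows how quickly an asynchronous backlog can build. The Python sketch below assumes constant rates and uses hypothetical figures; real change rates and link throughput fluctuate, so treat it as an estimate of order of magnitude only.

```python
def backlog_mb(change_rate_mb_s, link_mb_s, burst_seconds):
    """Data awaiting replication after a burst that outpaces the link."""
    return max(0.0, (change_rate_mb_s - link_mb_s) * burst_seconds)


def drain_seconds(backlog, change_rate_mb_s, link_mb_s):
    """Time to clear the backlog once the change rate falls below link capacity."""
    spare = link_mb_s - change_rate_mb_s
    if spare <= 0:
        return float("inf")   # the link never catches up
    return backlog / spare


# Hypothetical peak: changes arrive at 12 MB/s for 10 minutes over a 10 MB/s link.
peak_backlog = backlog_mb(12.0, 10.0, 600)

# Afterward the change rate drops to 4 MB/s, leaving 6 MB/s of spare capacity.
catch_up = drain_seconds(peak_backlog, 4.0, 10.0)

print(peak_backlog, "MB exposed at the end of the peak;", catch_up, "seconds to catch up")
```

Everything in that backlog is data that would be lost if the primary failed at the end of the peak, which is why the worst-case burst, not the average load, should drive bandwidth sizing for asynchronous replication.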
Logical HA technologies typically offer some advantages in this area as they generally send less data across the wires. Logical replication transmits only changed data. In addition, it can be tuned to replicate only those items that are needed for recovery.
In addition, the remote journaling at the heart of logical HA technologies provides features that can reduce the bandwidth requirement even further. For example, it can bundle journal entries and send them in more efficient batches when a backlog forms. Remote journaling can also make use of IBM i Data Port Services to press multiple communication lines into play if necessary. And because remote journaling operates at the operating system level, it has priority access to system resources.
Hardware-level HA, on the other hand, typically transmits blocks of data even if much of that block has not changed. For example, XSM Geographic Mirror transmits copies of whole memory pages from the source to the target. Each page is 4KB or, for some OS processes running on newer versions of IBM i, 64KB. XSM Geographic Mirror uses IBM i Data Port Services to transmit data over up to four communication lines simultaneously.
Technically, PPRC Metro Mirror and PPRC Global Mirror copy sectors, not pages, of data that change on the primary SAN. However, in practical terms, that amounts to the same thing. On Power Systems for IBM i, each memory page that is written to disk results in all sectors being rewritten. The bandwidth available between IBM TotalStorage SAN units depends on the number of Fibre Channel cards installed.
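The effect of page-level versus change-level transmission can be estimated with simple arithmetic. In the Python sketch below, the 4KB page size comes from the discussion above, while the number of updates, the size of a journal entry, and the assumption that every update touches a different page are hypothetical worst-case inputs chosen for illustration.

```python
PAGE_BYTES = 4 * 1024   # 4KB page size


def page_level_bytes(pages_touched, page_bytes=PAGE_BYTES):
    """Bytes sent when every touched page is transmitted whole."""
    return pages_touched * page_bytes


def logical_bytes(updates, bytes_per_entry):
    """Bytes sent when only the journal entry for each change is transmitted."""
    return updates * bytes_per_entry


# Hypothetical workload: 1,000 small updates, each landing on a different page,
# with an assumed 200 bytes per journal entry.
UPDATES = 1000
ENTRY_BYTES = 200

hardware_traffic = page_level_bytes(UPDATES)
logical_traffic = logical_bytes(UPDATES, ENTRY_BYTES)

print("page-level:", hardware_traffic, "bytes")
print("logical:", logical_traffic, "bytes")
print("ratio:", round(hardware_traffic / logical_traffic, 2))
```

The gap narrows when many updates fall on the same page between transmissions or when individual changes are large, so the ratio for a real workload may be far smaller than in this worst case.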
Recovery Time Implications
Hardware-level and logical HA technologies both rely on journaling to provide for system recovery. The difference is that logical HA applies journal entries to the databases and files on the backup system as soon as they are received. As a result, the backup system is always a ready-to-run replica of the primary system.
In contrast, a hardware-level solution employs the remote journal to recover damaged and lost data and objects after a system failure. However, these recovery operations begin only after the failure has occurred. Consequently, recovery times may be longer for hardware-level HA technologies than for logical HA.
In addition, using hardware-level mirroring, the mirror IASP must be varied on to the backup system before it can take over operations. The vary-on process is an operating system–level operation that includes more than 30 sequential steps, which further lengthens recovery times.
The need to vary on the backup IASP mirror before it can be used for any purposes other than mirroring limits the flexibility of hardware-level HA solutions. Logical HA technologies maintain a fully functioning replica server. Thus, the backup can be used for read-only operations such as tape-backup jobs, queries, and report generation. Transferring these read-only operations to the backup system will reduce the workload on the primary system, thereby improving user response times on that system.
In contrast, the backup IASP in a hardware-level HA environment cannot be used for operational purposes until it is varied on to the backup system. Once it is varied on, it ceases to serve as a mirror target because the mirror must be attached to the primary system.
It is possible to detach the target IASP and vary it on to the backup system in order to perform read-only functions. When you do so, changes made on the primary system are buffered and mirrored when the mirror IASP is reattached to the primary system. However, buffer sizes then become an issue.
Thus, it is not practical to use this method to perform lengthy operations on the backup system, particularly during times when transaction volumes are high.
In addition to limiting the flexibility to use the backup system for other purposes, hardware-level HA also limits your storage hardware choices. As the name implies, hardware-level HA is hardware-dependent. You can typically mirror only storage devices from the same vendor. You can't, for example, use IBM storage devices on the primary system and mirror them to EMC devices on the backup.
In contrast, logical HA operates at the data level. It is neither aware of nor cares what make of storage units you use on either the primary or backup system as long as they are addressable by your IBM i system.
IBM and HA Business Partners
In the beginning, AS/400 did very little to address availability issues that weren't related to reliability. As the technology progressed through new releases of OS/400 and then i5/OS on AS/400, iSeries, System i, and now IBM i on Power Systems, IBM added availability functionality either inherent in the operating system or as add-on features. Nonetheless, IBM has always relied on HA Business Partners to fill in the gaps in availability coverage.
From a personal perspective, I first became interested in HA systems more than a quarter century ago when I was asked to serve on an IBM Corporate task force on HA. The purpose of this task force was to set the direction for HA on all future IBM systems. The initial focus was on the mainframe, because few System/36 or System/38 customers needed HA. That changed when we announced the AS/400.
Shortly after the AS/400 was announced, it became very clear to IBM that many of our customers, especially those in the financial and healthcare industries, needed HA solutions. Unfortunately, there was not enough time or budget inside of IBM to develop our own HA solution for the AS/400.
I knew of a few Business Partners and System/38 customers that had written HA packages for their own use. I visited several of these firms and determined that some of these packages with modifications could be used for the AS/400. This was the beginning of IBM's reliance on Business Partners for HA solutions.
Over time, new HA partners appeared, and several of the original partners merged into larger worldwide partners. IBM continues to work closely with these partners to define what new availability functions are needed. At almost every release, IBM adds some new or enhanced availability functions into the operating system or as add-on features.
So Many Choices…
No organization can afford to ignore issues concerning the availability of its data and applications. Downtime tolerance varies depending on the nature and volume of the business, but few, if any, modern enterprises could continue to exist indefinitely without access to their systems.
Yet, threats to availability are real and unavoidable. Even if you never suffer a disaster (somewhat possible; disasters are rare) or never experience a hardware, software, or power failure (less likely; these issues are more common), you will still have to shut your systems down for regular maintenance.
There are a number of technologies that can, to one extent or another, address these availability issues in the IBM i sphere. Each of these technologies has different costs (both acquisition costs and operating costs) and different availability capabilities associated with it.
It is not possible to prescribe an HA solution for you here because the technology that is most appropriate for your organization depends on your unique needs.