It is no secret that the IBM iSeries midrange computer is considered to be among the most reliable in the industry, but any sane IT manager wouldn't bet the company (and his/her job) that the system will always function flawlessly. Consider the following from the IBM Redbook, Clustering and IASPs for Higher Availability on the IBM eServer iSeries Server:
"According to one IBM study, the iSeries server averages 61 months between hardware failures. However, even this stellar record can be cause for availability concerns. Stated another way, 61 months between hardware failures means that nearly 67 percent of all iSeries servers can expect some type of hardware failure within the first five years."
Nobody should have to spell out to you the value of your data. But just in case you need another sobering reminder, consider this: According to the Hurwitz Group, 29% of businesses that experience the loss of a system due to disaster close within two years. Presumably, those are the businesses without a solid backup and recovery strategy. But forget for a moment the loss of an entire system; for some companies, the direct and indirect costs of simply losing access to their system for just one hour during peak business times can run into thousands, or even millions, of dollars. It's only healthy, therefore, to regularly and seriously consider the following questions:
- What is the longest amount of downtime that users, management, and customers will tolerate if a system failure occurs?
- How much data loss, if any, can the company endure once the system is restored?
- Will the current backup and recovery strategy meet the company's downtime and data-loss thresholds in the event of a system failure?
Sure, every company would like technology installed that guarantees 24x7 reliability, but only a small percentage of shops can afford that kind of solution. What is important is to come up with a practical backup and recovery strategy that strikes a balance between the amount of downtime and data loss the company can bear if the system fails and how much the company can afford to spend on a solution. For some companies, 24 hours of downtime and the re-creation of a day's worth of transactions may be no big deal, while for others, as already mentioned, being down for even a few hours could cost millions and being down for a day could spell doom. Of course, these are two ends of a wide spectrum, and the key is to determine exactly where your organization fits within this spectrum and then put together the right solution.
Before talking about specific solutions, I must emphasize that it is crucial to ensure that the solution works--whichever backup and recovery strategy you choose. This means regular, thorough testing of the data recovery process. Seasoned IT people know that, regardless of the level of technology implemented in a backup and recovery strategy, the recovery process is notoriously fraught with surprises. Good sleep depends on thoroughly testing your recovery process and having a very high level of confidence that the restoration of your data will meet your company's predefined downtime and data integrity thresholds.
The remainder of this article gives a high-level overview of backup and recovery approaches, from the simplest and least expensive (but generally having the largest downtime and data-loss exposure) to the most complex and expensive (yet providing the greatest level of protection). In addition to using this article as a resource for learning about various backup and recovery solutions, it is recommended that you read IBM's document, Backup, Recovery, and Availability--Planning a Backup and Recovery Strategy. In particular, be sure to read Chapter 4, Choosing Availability Options.
It should go without saying that it is vital for every shop, regardless of its size, to have reliable tape backups of critical data. This is the foundation of any backup and recovery strategy. Of course, recovering data only from tape means that data will be recovered only to the point of the last save. Any transactions, changed programs, etc. that occurred since that time will need to be manually recreated. If you are saving once each day, then you have an exposure window of up to 24 hours. In addition, if your backup and recovery strategy only includes saving to tape, a dedicated system is required while the save is being performed--that is, unless you are using the Save While Active (SWA) function of OS/400.
SWA has been a feature of OS/400 for many years and is considered a reliable way to perform daily saves without giving users the boot. The downside of SWA is that if an object holds an exclusive lock that isn't released within a predetermined time window, the object won't be saved. Unless you monitor for these skipped objects each day and have other good backups of them, your restore won't be complete. Another problem with SWA is that the saved objects are likely to be in different states. This means that if you need to restore from tape, you will have incomplete units of work and will need to audit your data to find and delete the incomplete transactions. You can avoid this by ending all jobs that use the objects to be saved (a "quiesce") so that SWA can establish synchronized control points before the save begins; jobs are typically ended for anywhere from 10 to 30 minutes. The trouble with this approach is that you still incur downtime during saves, although you can get around that by using journaling; more on this later.
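As a rough sketch, a quiesced SWA save of a single application library might look like the following CL; the library, job queue, and device names here are hypothetical, and parameter values will vary by shop:

```
/* Quiesce: hold the application job queue so no new jobs start  */
HLDJOBQ JOBQ(APPLIB/APPJOBQ)

/* Save the library while active. SAVACT(*LIB) checkpoints all   */
/* objects in the library together; SAVACTWAIT(120) waits up to  */
/* 120 seconds for object locks to be released before skipping   */
/* an object. A checkpoint message is sent to QSYSOPR.           */
SAVLIB LIB(APPLIB) DEV(TAP01) SAVACT(*LIB) SAVACTWAIT(120) +
       SAVACTMSGQ(QSYSOPR)

/* Once the checkpoint message arrives, users can be let back    */
/* in; the save itself continues in the background.              */
RLSJOBQ JOBQ(APPLIB/APPJOBQ)
```

The key point is that user access needs to be blocked only until the checkpoint is reached, not for the duration of the entire save.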
Besides SWA, other options are available for reducing or eliminating the downtime that occurs from data backups. The trouble is, they are expensive. But depending on your quantity of stored data, system transaction volumes, and other needs, they might be worth considering. The first option is to use a high availability (HA) solution to mirror selected objects to a separate ASP or LPAR. This allows you to perform tape backups on the mirrored objects rather than the production objects, thus eliminating downtime for the backup. Many HA vendors sell a scaled-down solution for this purpose (see Data Replication/High Availability below). Another solution is IBM's Enterprise Storage Server (ESS). This massive storage device is typically purchased only for a multiplatform storage area network (SAN). However, if you store your production data on the ESS, you can perform a "flash copy," which in a matter of seconds makes a complete snapshot of selected data. This data can then be backed up to tape without affecting the production data. Once the data is backed up, the flash copy is deleted from the ESS.
Several third-party solutions are available to help you manage the tape save-and-restore process as well as reduce the downtime required for dedicated saves. The solutions available typically involve managing tapes to ensure that current data isn't accidentally overwritten and that objects can be quickly located and restored. Additionally, some solutions provide the ability to perform parallel tape saves, allowing save time to be shortened by concurrently saving data to multiple tape drives. Finally, several hardware vendors offer automated tape backup devices that automate the process of rotating tapes and loading volumes and that come equipped with multiple tape drive units to allow simultaneous saving and restoring of data from multiple volumes.
Tip: To test how well your full-system save tapes would restore your system, for $1,500 you can send your tapes to IBM, where they will attempt to restore your data on a test machine. You will receive a comprehensive report describing how successfully the restore process went as well as detailed recommendations for improving your backup process. For more information, go to the IBM recovery options for your iSeries or AS/400 Web site.
Below is a partial list of vendors and their solutions that automate backup and restore processes and manage backup volumes (listed alphabetically by vendor):
Tape Backups Plus Journaling
By augmenting the tape saves in your backup and recovery strategy with the OS/400 journaling function, you can dramatically and economically increase the odds of recovering transactions up to the point of system failure. Essentially, journaling takes compact snapshots of data changes, called journal entries, and writes them to objects called journal receivers. In the event of a system failure, the journal entries can then be "applied" to their associated objects (restored from tape) to bring the data back to the point at which the system failed. Another real advantage of using journaling is that if a tape fails during the restoration process, it is possible to restore the previous day's tape (or even earlier ones) and then apply journal entries from that point to recover the data right up to the point of failure. Keep in mind that recovering data from journal receivers can take a fair bit of time, depending on the volume of transactions, the speed of your processor, and the number of days of journaled data being recovered. A general rule of thumb is one to two hours of recovery time for each day of journal entries applied.
When using journaling as part of your backup and recovery strategy, it is important to protect the journaled data by regularly saving journal receivers to tape or at least ensuring journal receivers reside on a separate disk or auxiliary storage pool (ASP) from the production data.
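To make this concrete, here is a minimal CL sketch of setting up journaling for one file and later applying the journaled changes after a restore. The library, journal, and file names are hypothetical:

```
/* Create a journal receiver and journal; ideally the receiver   */
/* library resides in a separate ASP from the production data    */
CRTJRNRCV JRNRCV(JRNLIB/APPRCV001)
CRTJRN    JRN(JRNLIB/APPJRN) JRNRCV(JRNLIB/APPRCV001)

/* Begin journaling changes to a production file, capturing      */
/* both before- and after-images of each record change           */
STRJRNPF  FILE(APPLIB/ORDERS) JRN(JRNLIB/APPJRN) IMAGES(*BOTH)

/* Recovery: after restoring ORDERS from the last save tape,     */
/* roll forward all journaled changes to the point of failure    */
APYJRNCHG JRN(JRNLIB/APPJRN) FILE((APPLIB/ORDERS)) TOENT(*LAST)
```

In practice, the receivers must also be saved or detached on a schedule so that the chain of entries between the last tape save and the failure is intact.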
Curiously, there seems to be only one vendor that supplies a solution to aid the management of journaling on the iSeries. Sure, many shops know journaling inside and out and use it confidently to augment their backup and recovery strategies. But many smaller shops that are a bit intimidated by journaling have missed out on its tremendous and inexpensive backup and recovery benefits. HA vendor iTera says its GuardianSave product fully automates and manages the data journaling and apply processes as well as automatically keeps journal entries saved on any remotely connected storage device (e.g., a network server or Linux PC). The offline journal entries can then be automatically retrieved and applied by the product when data recovery is necessary.
The remaining topics in this article cover a variety of technologies, from simple to complex, that allow for quick restoration of data when different kinds of hardware failures occur, without having to restore data from tape.
The first level of protection is disk protection. Several different levels of disk protection can be enabled within OS/400 when additional disk units are installed. These include parity protection (RAID), ASPs, and disk mirroring.
Device Parity Protection (RAID5)
Device parity protection is a hardware availability function that can protect data from being lost because of a disk unit failure or a damaged disk. It requires only 10% to 25% additional disk capacity, allowing data to be reconstructed from the parity value and the values of the bits in the same locations on the other disks. The system continues to run while the data is being reconstructed. This once provided a fairly cost-effective means of protecting data against a drive failure; however, given the current low cost of disk storage, the additional benefits of disk mirroring are available for only a bit more money.
Auxiliary Storage Pools (ASPs)
Without ASPs (aka disk pools) defined on a system, data is written equally across all disks. When data is organized into pools, however, a significant recovery advantage is gained if a disk failure occurs: recovery is required only for the objects that resided in the pool containing the failed disk, which dramatically shortens recovery time.
A relatively new twist in ASPs is the independent ASP (IASP). This is a storage pool located on a separate disk storage tower, connected to two or more iSeries servers via high-speed lines. The idea is that application data from one of the servers is written to the IASP so that, in the event the server fails, a secondary system can then access the data in order to quickly execute a recovery. Currently, only V5R2 supports the sharing of application data, while V5R1 only accommodates IFS objects. Of course, if a disk fails in the IASP, all of the data in the IASP will need to be restored from tape. The solution also has other limitations: Users cannot automatically be switched to the second server; there are distance limitations between the servers and the IASP; only one server can be attached to the IASP at a time; and some system objects are not supported (e.g., spool files, job queues, output queues, etc.).
Disk Mirroring
With disk mirroring, identical copies of data are kept on two separate but identical disk units in the same server. If one disk unit fails, the system can continue to operate without interruption by using the data on the mirror disk unit until the bad disk is repaired or replaced. Disk mirroring can be performed either for designated data or for the complete system.
Disk mirroring has significant advantages over device parity and disk pool protection, as those solutions have several vulnerable points of failure that might require a system reload. For instance, if a failed disk is in the base system ASP, or if two drives fail within a single parity set, the data cannot be reconstructed. Disk mirroring protects against most cases of lost disk drives.
All of these solutions can be implemented through OS/400 when additional DASD is purchased.
Data Replication/High Availability
Data replication is exactly that--creating an ongoing mirror image of changes to application data and other designated system objects either on the same machine (in a separate logical partition or ASP) or on one or more separate but connected iSeries servers. In the event of a hardware failure, users can quickly access the replicated data to continue working. In order to perform true data replication, a middleware HA solution must be installed.
Replicating Data to the Same Machine
When an HA solution is used to replicate selected data to a logical partition (LPAR) or a separate set of disks in an ASP on the same machine, disk failure protection is provided as long as multiple disks don't fail within different LPARs or ASPs (never forget Murphy's Law!). This solution is commonly employed not only to provide some data and application protection from disk failures, but also to eliminate downtime from daily backups, as the backup can be made from the mirrored data rather than the production data. This solution can even improve system performance if a significant amount of read-only processing (e.g., queries, reports, etc.) is being performed on production data, as this processing can be done against the mirrored data instead. The main advantage of using replication versus disk mirroring on a single machine is flexibility; you can choose selected objects to be mirrored instead of entire disks. Additionally, with disk mirroring you can't temporarily suspend the mirroring process in order to back up the mirrored disks as of a designated point in time, as you can with an HA solution.
Replicating Data to One or More Secondary Servers
By synchronizing data from your production server to one or more secondary servers, you can dramatically decrease the occurrence of significant downtime from most kinds of system failures. In addition, planned downtime events such as backups, file reorganizations, release updates, etc. can usually be eliminated. When properly configured, maintained, and tested, a secondary server can quickly assume the role of the production server, sometimes without users having to do anything other than sign on again. Depending on the HA software used, the performance power of the secondary server, and the communication bandwidth between the two servers, the amount of time it takes to move users to the secondary server can be anywhere from a few seconds to an hour.
Every HA solution uses journaling to capture system and data changes; however, the available solutions transmit the captured changes to the secondary node(s) one of two ways: either by using a proprietary data transit process or by using OS/400's remote journaling function.
Regardless of the method used, the data is transmitted in either synchronous or asynchronous mode, depending on the criticality of the data and the distance between nodes. Synchronous mode is usually used in situations that require absolute assurance that each transmission of data is accurately sent and received. Very often, banks and other financial institutions require synchronous data replication. This, of course, can dramatically affect performance and increase communication bandwidth requirements. Most shops that use mirroring between servers require only asynchronous mirroring, with the software performing some kind of ongoing object monitoring to ensure objects remain synchronized.
There is a great deal of discussion in the HA solution sector about the pros and cons of using remote journaling to perform the mirroring of data. Remote journaling is a function of OS/400 that moves journal entries from journal receivers on the production server to matching journal receivers on the secondary server(s), where they are then applied to duplicates of the objects. All HA vendors claim to offer solutions that use remote journaling, but the larger, more established vendors tend to promote their own proprietary replication technologies, claiming better control of replicated data and a wider spectrum of objects that can be replicated. The newer HA solutions that are built around a remote journaling engine claim that data is replicated faster, with better integrity and less overhead; plus, their solutions cost significantly less.
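For reference, activating OS/400 remote journaling for a journal involves commands along these lines. The relational database directory entry BACKUPSYS and the journal names are hypothetical, and the target journal library must already exist on the secondary system:

```
/* Associate the local journal with a journal on the secondary   */
/* system, identified by its RDB directory entry                 */
ADDRMTJRN RDB(BACKUPSYS) SRCJRN(JRNLIB/APPJRN) +
          TGTJRN(JRNLIB/APPJRN)

/* Start shipping journal entries. DELIVERY(*ASYNC) lets         */
/* production jobs continue without waiting for the target to    */
/* confirm receipt of each entry; *SYNC would make them wait.    */
CHGRMTJRN RDB(BACKUPSYS) SRCJRN(JRNLIB/APPJRN) +
          TGTJRN(JRNLIB/APPJRN) JRNSTATE(*ACTIVE) DELIVERY(*ASYNC)
```

Once entries are flowing, it is the HA software's job (or yours) to apply them to the replica objects on the secondary system.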
Data replication solutions can be further enhanced if a shop is using OS/400's clustering functionality, which was introduced with V4R4. Clustering essentially allows multiple systems and resources to be defined as a single device so resources can be more easily shared and workload more easily distributed. With clustering enabled, integrated communication processes and APIs that interface with the HA software and the application software permit early detection of system problems. This can allow for transparent switchovers between nodes at--or even prior to--the time of a system failure. Of course, the extra level of protection provided by clustering means an extra level of expense, and it also requires an additional level of system expertise. For a good overview of clustering on the iSeries, see IBM's High Availability and Clusters Web site.
Here's a list of HA software vendors and their products:
But Wait, There's More
This article has provided you with information about some of the solutions available and the vendors who provide them. The Vendor Directory at MC Press Online offers a more substantial list of Backup and Recovery vendors and their products.