Round-the-Clock Business Information System Management
System Administration - High Availability / Disaster Recovery
Written by Jeffrey Ashman
Sunday, 07 October 2007 19:00
Round-the-clock systems are more than just an intellectual exercise for an increasing number of companies. Some of the "modern" reasons why more companies, including smaller ones, are moving to 24x7 operations have become clichés. They include the need to support Web presences and global operations and the need to shoehorn lengthier maintenance tasks made necessary by rapidly growing databases into shrinking maintenance windows. Other reasons are mentioned less frequently but are still important. These include an insistence by some companies that the IT department never say no to CEOs, CFOs, marketing managers, and other authorized personnel who, without warning, want to use the VPN connection from their homes to investigate an idea that comes to them on the weekend, in the evening, or even in the middle of the night.
How High Is High?
When considering high availability and operations that extend beyond "nine-to-five," one question to ask is, how high is high? A decade or so ago, organizations had only three generic choices. They could do nothing to protect the availability of their data and applications and, instead, cross their fingers and hope that nothing went wrong. Few companies were that brave or that foolhardy.
More likely, they took nightly tape backups that allowed them to recover their data if disaster struck. Of course, back in the days of slow tape drives, the recovery process would probably have taken a few days. Possibly worse, they were able to recover data only up to the previous night. Updates entered after that point were lost in the event of a disaster. Systems generally had to be shut down during the nightly backup process, which was acceptable for traditional nine-to-five operations but not for the round-the-clock operations that are becoming more common. And neither of these solutions did anything to avert the downtime required for other regular maintenance, such as hardware and software upgrades and database reorganizations.
The third option was a full HA solution that provided real-time or near real-time replication of all data and objects to a hot-standby server. The HA software could also automatically fail over to the backup system when the primary system became unavailable. Or an administrator could initiate a switchover to accommodate maintenance on the primary server. Because the HA software could support a remote backup server, this alternative would also prevent business downtime in the event of a disaster. This was, and remains, the gold standard in HA. The only problem was that, until a few years ago, a full HA solution was too expensive for most small and medium-size businesses.
When the first true AS/400 (now System i) HA options were introduced more than a decade ago, they were directed primarily at large enterprises. They required considerable effort to implement, monitor, and manage. This administration workload, combined with high price tags, put them out of reach of many AS/400 shops. New options that have come on the scene in the intervening years changed the economics. Almost all companies can now cost-justify an investment that significantly improves the availability of their data and applications beyond a solely tape-based backup strategy. This is partly because the total cost of ownership of HA solutions has fallen and partly because the market now includes lower-cost options that fall between tape backups and full HA on the availability spectrum.
Tape drives have become considerably faster over the years. Consequently, one of the drawbacks of tape, namely the time required to perform save and restore tasks, has diminished. Nonetheless, the time to restore a full data center from tape after a disaster may still be too much of a burden on an organization.
Another former problem with tape saves—the need to shut systems down during backup operations—has also been alleviated. System i has had a save-while-active function for some time. This eliminates the need for downtime during tape save operations, but it is not totally satisfactory for 24x7 operations because tape saves still greatly impair application performance while they are running.
Even ignoring the performance issues associated with the save-while-active function, tape backups are still far from ideal. Data is generally saved to tape and sent offsite once a day, typically at night. Assuming you use journaling, changes made throughout the day will be saved to the journal, but if that journal is local to the production system, a disaster that destroys the data center will likely also destroy the journal. Thus, you may lose up to a day's worth of data after a disaster. The data loss will be greater still if the most recent tape had not yet been shipped offsite or if it was corrupted.
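To make that exposure concrete, the following is a rough sketch of the arithmetic. It assumes the local journal is lost along with the data center, so everything after the last usable offsite save is gone. The function name and the timeline are illustrative assumptions, not figures from any real site.

```python
from datetime import datetime, timedelta

def data_loss_window(last_usable_offsite_save: datetime, disaster: datetime) -> timedelta:
    """Time span of changes lost if the local journal is destroyed with the data center."""
    if disaster < last_usable_offsite_save:
        raise ValueError("disaster time precedes the last usable save")
    return disaster - last_usable_offsite_save

# Hypothetical timeline: nightly save at 23:00, disaster the next afternoon.
nightly_save = datetime(2007, 10, 1, 23, 0)
disaster = datetime(2007, 10, 2, 17, 30)

print(data_loss_window(nightly_save, disaster))                       # 18:30:00
# If the most recent tape is corrupted or never left the building,
# recovery falls back to the previous night's tape.
print(data_loss_window(nightly_save - timedelta(days=1), disaster))   # 1 day, 18:30:00
```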
Data vaulting offers an inexpensive way to overcome these liabilities. Data vaulting software captures changes made on the production system and saves them on disks on another system. Vaulting software can copy changes as they happen, batch them to be sent periodically, or allow you to choose between these two options.
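The difference between the two transmission modes can be sketched in a few lines of Python. This is a simplified model, not any vendor's implementation: the `Vault` and `ChangeCapture` classes are hypothetical, and batching is by change count here, whereas real products would more likely batch on a timer.

```python
from dataclasses import dataclass, field

@dataclass
class Vault:
    """Stand-in for the disk-based vault on a second system."""
    stored: list = field(default_factory=list)

    def receive(self, changes):
        self.stored.extend(changes)

@dataclass
class ChangeCapture:
    """Captures production changes and forwards them to the vault."""
    vault: Vault
    batch_size: int = 1          # 1 means each change is sent as it happens
    pending: list = field(default_factory=list)

    def record(self, change):
        self.pending.append(change)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.vault.receive(self.pending)
            self.pending = []

continuous = ChangeCapture(Vault(), batch_size=1)
batched = ChangeCapture(Vault(), batch_size=3)
for i in range(4):
    continuous.record(f"update-{i}")
    batched.record(f"update-{i}")

print(len(continuous.vault.stored))  # 4
print(len(batched.vault.stored))     # 3
print(batched.pending)               # ['update-3']
```

In the batched case, the one change still pending is what would be lost if the production system failed at that moment, which is the trade-off between the two modes.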
The system hosting the vault can be local or remote, and it doesn't necessarily have to match the production system. For example, you might use a low-cost Linux or Windows server to run the vault for your production System i server. Unlike HA, the objective of vaulting is not to have a hot-standby server ready to take over operations immediately whenever necessary, but rather to capture changes made between tape saves so you can recover data close to a point of failure—or right up to the point of failure if using continuous data capture and transmission.
If you use a data vault, you will likely still want to use tape saves as a last line of defense in the protection of your data. The vault can help you with that as well. Tapes can be created from the vault, eliminating the impact that save operations normally have on production systems.
Data vaulting offers another advantage over tape saves. Using the vault to recover single or multiple objects that become corrupted or are accidentally deleted is generally fast and easy.
Single-Point Availability Solutions
Data vaulting can fill in some of the gaps of a tape-only backup strategy, but it doesn't prevent the regular and lengthy downtime that results from maintenance tasks such as database reorganizations or hardware and software migrations. However, there are affordable products on the market that can address these availability issues.
Database reorganizations are periodically necessary to free up space consumed by logically deleted records and to improve application performance. Traditionally, the reorganization tool provided with the database required that applications be shut down while it ran, but there are ways to overcome this problem. As the name implies, reorganize-while-active tools reorganize databases while production systems remain active.
These tools might perform the reorganization in-place or they might create a mirrored file that is reorganized. In the latter case, the tool keeps the mirrored file synchronized with the production database by replicating any changes applied to the production file while the reorganization is in progress. When the reorganization is complete, the mirrored file becomes the production file.
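The mirrored-file approach can be illustrated with a toy model. The sketch below assumes a file is a Python dictionary in which `None` marks a logically deleted record. The function name is hypothetical, and real tools work at the database level rather than in memory.

```python
def reorganize_with_mirror(production, changes_during_reorg):
    """Return a compacted replacement for `production`.

    production: dict of key -> record, where None marks a logically deleted record.
    changes_during_reorg: (key, record_or_None) pairs that applications apply
    to the production file while the reorganization is running.
    """
    # 1. Build the mirror from the live records only (this is the reorganization).
    mirror = {key: rec for key, rec in production.items() if rec is not None}

    # 2. Production keeps taking changes; each one is replicated to the mirror.
    for key, rec in changes_during_reorg:
        production[key] = rec
        if rec is None:
            mirror.pop(key, None)
        else:
            mirror[key] = rec

    # 3. The synchronized mirror is returned to take over as the production file.
    return mirror

production = {1: "A", 2: None, 3: "C"}
changes = [(3, None), (4, "D"), (1, "A2")]
print(reorganize_with_mirror(production, changes))  # {1: 'A2', 4: 'D'}
```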
When it comes to system resource usage, nothing is free. While-active reorganization tools consume some System i resources. Fortunately, sophisticated reorganize-while-active software allows you to schedule the reorganization processes to run during periods when the demand on the system is expected to be low. The tool may also be able to split the reorganization job into smaller tasks that can fit into the limited windows allotted to them.
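As an illustration of fitting reorganization work into limited windows, here is a minimal sketch that uses a simple largest-first greedy placement. The task names, durations, and window labels are made up, and a real product's scheduler would likely be more sophisticated.

```python
def plan_reorg(tasks, windows):
    """Assign reorganization tasks to low-demand windows.

    tasks: list of (name, estimated_minutes)
    windows: list of (label, available_minutes), in order of preference
    Returns (plan, unscheduled): plan maps window label -> task names.
    """
    plan = {label: [] for label, _ in windows}
    remaining = {label: minutes for label, minutes in windows}
    unscheduled = []
    # Place the longest tasks first; each goes into the first window with room.
    for name, minutes in sorted(tasks, key=lambda t: t[1], reverse=True):
        for label, _ in windows:
            if remaining[label] >= minutes:
                plan[label].append(name)
                remaining[label] -= minutes
                break
        else:
            unscheduled.append(name)  # must wait for a later window
    return plan, unscheduled

tasks = [("ORDERS part 1", 40), ("ORDERS part 2", 40),
         ("CUSTOMERS", 25), ("INVOICES", 50)]
windows = [("Sat 02:00", 60), ("Sun 02:00", 90)]
plan, unscheduled = plan_reorg(tasks, windows)
print(plan)         # {'Sat 02:00': ['INVOICES'], 'Sun 02:00': ['ORDERS part 1', 'ORDERS part 2']}
print(unscheduled)  # ['CUSTOMERS']
```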
The downtime traditionally required to convert databases when upgrading applications or to migrate databases when upgrading servers can be eliminated with tools that work in a fashion similar to the reorganize-while-active tools. Convert-while-active, upgrade-while-active, and migrate-while-active tools typically create a mirror database on the new hardware or in the new format and then keep that mirrored database synchronized until you are ready to begin using the new database.
Lower TCO for HA
Data vaulting and single-point solutions are steps on the road to high availability, but they are still far from the ultimate goal. In a true HA environment, HA software replicates all data and objects to a second server and keeps that server synchronized with the primary server in real-time or near real-time, allowing the second server to act as hot-standby backup. Functionality in the HA software can automatically switch users to the backup when the primary server is unavailable, or operators can manually tell it to initiate a switchover to accommodate maintenance on the primary server.
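The distinction between an automatic failover and an operator-initiated switchover can be sketched as follows. This is a simplified model under stated assumptions: the `HAPair` class and its missed-heartbeat threshold are hypothetical, and commercial HA software uses its own, more elaborate detection logic.

```python
from dataclasses import dataclass

@dataclass
class HAPair:
    """Tracks which server in a primary/backup pair is serving users."""
    active: str = "primary"
    missed_heartbeats: int = 0
    failover_threshold: int = 3   # assumed value, for illustration only

    def heartbeat(self, primary_responded: bool) -> str:
        """Automatic failover: switch once the primary misses enough checks."""
        if self.active != "primary":
            return self.active
        if primary_responded:
            self.missed_heartbeats = 0
        else:
            self.missed_heartbeats += 1
            if self.missed_heartbeats >= self.failover_threshold:
                self.active = "backup"
        return self.active

    def switchover(self, target: str) -> str:
        """Planned switchover, for example before maintenance on the primary."""
        if target not in ("primary", "backup"):
            raise ValueError(f"unknown server role: {target}")
        self.active = target
        self.missed_heartbeats = 0
        return self.active

pair = HAPair()
for responded in [True, False, False, False]:
    pair.heartbeat(responded)
print(pair.active)                 # backup
print(pair.switchover("primary"))  # primary
```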
The need for redundant hardware and software, not to mention the cost of the HA software itself and the administrative burden that the early versions of that software imposed, used to put a full HA solution beyond the budgets of all but the largest of enterprises, but that's no longer the case.
Consider first the HA software. In the early days, it was designed specifically for large enterprises and came with a price tag to match. Over the years, new competitors entered the market with lower-priced products. To meet this competition, the early entrants created additional product editions designed specifically for small and medium-size shops. When used in fairly straightforward system environments, these less expensive editions typically offer the same HA sophistication as the more expensive editions, but they don't support some of the particularly complex system technologies and topologies that are generally used exclusively at larger enterprises.
The high cost of buying a second server that acts solely as a backup was another serious impediment for many small and medium-size companies. IBM has helped to lessen this burden. It offers many configurations of "Capacity BackUp" (CBU) Editions for its Model 520, 525, 570 and 595 System i servers. The CBU editions are available at a much lower price than their equivalent non-CBU editions, but, under normal circumstances, the CBU servers can be used only as a backup machine. While the primary system is handling normal operations, the only software other than the operating system that can legally run on the CBU machine is the HA software that maintains data and object redundancy.
When a disaster brings down the primary server or it must be taken offline for other reasons, its System i software licenses temporarily transfer to the CBU machine, allowing users to be switched to it without breaking any IBM licenses and without having to pay for second licenses. When the primary server is brought back online, the System i software licenses automatically revert to it. (It is likely also possible to transfer your application and other software licenses temporarily without having to buy a second license, but that depends on the terms of each vendor's license.)
The first generations of HA software were often cumbersome to install and required considerable monitoring and management. Small IT shops found it difficult to justify the extra headcount. This is no longer an issue for some of the products. Over the years, HA vendors have incorporated considerable automation into their products, making them self-installing and, to a large extent, self-managing. Autonomics also makes the more sophisticated of the products self-healing. In the background, they automatically check the integrity of replicated data and objects, correcting any problems without the need for operator intervention. Because of this increased automation and autonomics, monitoring and managing advanced HA software may require less than 15 minutes of an administrator's time each day.
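A background integrity check of this kind might look something like the sketch below. It assumes objects can be compared by checksum and repaired by re-copying them from the primary. The function names are hypothetical, and for brevity the sketch ignores objects that exist only on the backup.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def audit_and_repair(primary: dict, backup: dict) -> list:
    """Compare each replicated object with its source and re-copy any that
    are missing or differ. Returns the names of the repaired objects."""
    repaired = []
    for name, data in primary.items():
        if name not in backup or checksum(backup[name]) != checksum(data):
            backup[name] = data        # self-heal: resynchronize the object
            repaired.append(name)
    return repaired

primary = {"ORDERS": b"v2", "CUSTOMERS": b"v1", "PRICES": b"v1"}
backup = {"ORDERS": b"v1", "CUSTOMERS": b"v1"}
print(audit_and_repair(primary, backup))  # ['ORDERS', 'PRICES']
print(backup == primary)                  # True
```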
Round-the-clock system management is about more than just maintaining availability 24x7. It's also about keeping systems continually tuned and problem-free. Some products adhere to this holistic philosophy of system management by providing an integrated set of tools that perform a variety of functions, including resource utilization reporting and analysis, system performance reporting and analysis, file reorganizations, and more.
If all such a system management product does is provide a convenient interface to integrate that functionality, it provides a productivity benefit, but it doesn't help the small to medium-size IT department that wants to manage its systems 24x7, without the need for round-the-clock onsite staff. To meet this higher objective, the software requires automation and autonomics. Ideally, the tool should have considerable intelligence, with the ability to automatically recognize storage problems—not just the need for file reorganizations, but also the existence of obsolete data that can be safely archived, among other issues—that may arise on all types of System i storage, including System i disks, IFS files, ASPs, and iASPs. It should recommend appropriate actions to resolve the issues, allow you to schedule those actions to run at convenient times, and, in some cases, if the appropriate options are set, it should execute those actions on its own, without the need for manual intervention.
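The recognize-and-recommend step might be sketched as a simple set of rules. The thresholds, field names, and file names below are illustrative assumptions only. A real tool would draw on far richer statistics and would feed its recommendations into a scheduler.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    name: str
    total_records: int
    deleted_records: int
    days_since_last_use: int

def recommend_actions(files, deleted_ratio_limit=0.25, archive_after_days=365):
    """Return (file name, recommended action) pairs; thresholds are illustrative."""
    actions = []
    for f in files:
        # Many logically deleted records: space can be reclaimed.
        if f.total_records and f.deleted_records / f.total_records > deleted_ratio_limit:
            actions.append((f.name, "reorganize"))
        # Long-unused data: candidate for archiving.
        if f.days_since_last_use > archive_after_days:
            actions.append((f.name, "archive"))
    return actions

files = [
    FileStats("ORDERS", 1000, 400, 1),
    FileStats("HIST2001", 500, 0, 900),
    FileStats("CUSTOMERS", 200, 10, 2),
]
print(recommend_actions(files))  # [('ORDERS', 'reorganize'), ('HIST2001', 'archive')]
```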
The bottom line is that IT managers at small and medium-size System i shops now have more options for affordable round-the-clock system management and HA than they did in the past. If your knowledge of the vendors' products is more than a few years old, a review of today's market offerings may yield considerable value for your organization.