A 10,000-foot view shows that you have two major categories of data protection for your Power AIX systems: Disaster Recovery (DR) products and High Availability (HA) products.
Did you know that HA solutions are most often used for routine maintenance? And that the most common cause of an outage is human error? Network staff taking the network down without warning users, software bugs, and operator errors cause more than 50 percent of outages. Many people purchase HA products so that they can run their daily backups on the target.
One misconception is that HA products replace DR products. The truth is that DR can stand alone, but HA products need DR products to offer a complete solution. Why? A real-life example illustrates the point. Google keeps many redundant copies of the information it stores for us, which is another way of saying it uses HA to protect our data. But back in February 2011, Google was hit with a bug that deleted all the email for 0.02 percent of its customers' Gmail accounts. The problem was replicated across all servers that held data for the affected users. As a result, Google had to use tape backups to restore its customers' data.
Everyone needs DR, so we'll start with that.
Disaster Recovery (DR) Solutions
DR products come in different flavors, including tape, vaulting, disk-based backup appliances, and hybrid (disk and tape) backup appliances.
- Tape has been the mainstay of DR for decades, and it is still the least expensive and most prevalent DR solution. However, when you add the time it takes to handle the tape, the transportation cost to send the tape to your offsite storage, and the offsite storage cost itself, the overall cost goes up dramatically. Tape technology continues to improve, increasing both the reliability and density of tape. LTO-5 holds up to 3 TB compressed (1.5 TB native), has a native transfer speed of 140 MB/s, and provides encryption. Tape is a good choice for long-term storage and for small shops.
- Vaulting solutions (or today's cloud solutions) are a step up from saving to tape. All vaulting solutions save the backup data to offsite storage daily, reducing exposure risk and transportation cost. Many vaulting products use database logging to reduce the amount of data lost to a few hours' worth. They perform a full save once a day and then save the logs every few hours. Years ago, when I worked at a pump company, we created a process to read and apply journal entries. It worked great, but by the time we restored the system and the libraries from the last save and applied the entries, it took up to two days! Too many people think of vaulting as their HA, but you can lose too much data, and it takes too long to restore, for vaulting to be considered anything but a DR product. Vaulting is a good alternative to tape and gives you the benefit of less data loss, but you need to factor in your recovery time objective (RTO).
- Disk-based backup appliances allow you to back up to disk and eliminate tape. This is one of the fastest-growing DR solutions. Backup appliances also reduce the time it takes to perform saves, thus shortening the downtime of the daily backup window. Many backup appliances offer the ability to replicate to another device offsite. You could even build your own procedure to change journal receivers every few hours and save them to your backup appliance, but you would then have to create a program to apply the journal entries. If you've been there and done that, save yourself the headaches and use a vaulting solution or move to HA. Backup appliances compete well against vaulting solutions, unless you want to use the journal management and apply features of vaulting.
- Hybrid (disk and tape) backup appliances are a much more robust offering because they combine the benefits of disk and tape. You can back up to the appliance's disk, release the host, and then write to tape or disk from the appliance at a convenient time, either locally or remotely. You can send your data to another hybrid appliance, SAN, NAS, tape drive, or even other servers. Hybrid appliances combine the advantages of backup appliances with many of the features of vaulting solutions and add a lot of features of their own. This solution is particularly suitable for larger accounts with many servers.
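The full-save-plus-logs recovery described in the vaulting bullet can be sketched in a few lines. This is only a minimal illustration, not any vendor's method: the `LogEntry` type, the `restore` function, and the sample order keys are all hypothetical, and a real journal apply works against database records rather than a Python dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    """One logged change (hypothetical shape, for illustration only)."""
    sequence: int          # order in which the change was made
    key: str               # record that changed
    value: Optional[str]   # new value, or None for a delete


def restore(full_save: dict, log_entries: list) -> dict:
    """Rebuild data from the last full save plus the logs saved since."""
    data = dict(full_save)  # start from a copy of the full save
    # Entries must be applied in the order the changes were made.
    for entry in sorted(log_entries, key=lambda e: e.sequence):
        if entry.value is None:
            data.pop(entry.key, None)
        else:
            data[entry.key] = entry.value
    return data


# Hypothetical sample data: a nightly full save and the logs saved after it.
nightly_save = {"order-1001": "open", "order-1002": "open"}
logs_since_save = [
    LogEntry(2, "order-1003", "open"),
    LogEntry(1, "order-1001", "shipped"),
    LogEntry(3, "order-1002", None),
]
recovered = restore(nightly_save, logs_since_save)
print(recovered)
```

Anything entered after the last saved log is not in `logs_since_save` and so is not recovered, which is why the interval between log saves sets how much data you can lose.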
Is DR Enough?
For many companies, "Yes." But for more and more companies, "No!"
I ask my customers, "Can you easily reproduce the changes that will have been made since the last save?" If they say yes, I ask them how. The industry term for the point in time you will be able to recover to is recovery point objective (RPO), and you have to consider it carefully. In many companies, productivity improvements from leveraging the Internet have often come at the price of audit trails. If you lose your system, chances are that order data, move tickets, time and attendance, and shipping documents will all have disappeared, with no way to recover them. How do your customers input their orders? Are your work orders all online? What documents are retained in shipping?
The next question you need to ask is how long you can afford to be down, which is your recovery time objective (RTO). With tape, the tapes have to be retrieved from your offsite location, and after the system is restored, users have to re-enter the missing data. Vaulting solutions cut the RPO down to a few hours, but you still have to perform the same recovery process as with tape, and then you have to apply the journals or logs. Vaulting reduces your RPO, which is a great thing, but your RTO could be much longer when you add the time it takes to apply the logs. Backup appliances don't improve RPO by themselves, but they do reduce the RTO because you don't have to go get the tapes.
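To make the trade-off concrete, here is a rough sketch that compares worst-case RPO and estimated RTO for the three approaches. Every number in it is a made-up placeholder chosen only to show the arithmetic, and the `RecoveryPlan` class and its field names are assumptions of this sketch; substitute figures measured in your own shop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    name: str
    hours_between_saves: float   # gap between saved recovery points
    retrieval_hours: float       # time to get the backup back on site
    restore_hours: float         # time to restore the system and data
    log_apply_hours: float       # time to apply journals or logs

    def worst_case_rpo(self) -> float:
        """Most work (in hours) you can lose: the full gap between saves."""
        return self.hours_between_saves

    def estimated_rto(self) -> float:
        """Hours until users are back on the restored system."""
        return self.retrieval_hours + self.restore_hours + self.log_apply_hours


# Placeholder figures only; they are not measurements of any product.
plans = [
    RecoveryPlan("tape", 24, 4, 12, 0),
    RecoveryPlan("vaulting", 3, 4, 12, 8),
    RecoveryPlan("backup appliance", 24, 0, 12, 0),
]

for plan in plans:
    print(f"{plan.name}: RPO {plan.worst_case_rpo()} h, RTO {plan.estimated_rto()} h")
```

With these placeholder inputs, vaulting shows the smallest RPO but the longest RTO, and the backup appliance keeps the tape RPO while dropping the retrieval time.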
Another consideration is legal requirements. In the United States alone, you have Sarbanes-Oxley, HIPAA, GLBA, SEC, PCI, banking, and many other standards. Stockholders have expectations. Your users have expectations, and so do your customers. You could face financial penalties, damage to your company's reputation, decreased stock value, and even criminal charges.
Each year, fewer companies can risk using only DR products, so let's talk about HA.
High Availability (HA) Solutions
HA products all perform the same functions. They group at least two servers (nodes) together, creating a replication group (cluster). The production server (active node) sends all changes to the target server (passive node) as they are made. HA products control which node acts as the active node and direct users to it. The remaining nodes act as passive nodes, and the data is replicated from the active node to each of them.
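The active/passive arrangement can be pictured with a toy model. The sketch below is an illustration only: the `Node` and `Cluster` classes, the `fail_over` method, and the in-memory dictionaries are hypothetical stand-ins, and real HA products add heartbeats, network redirection, and resynchronization that are not modeled here.

```python
class Node:
    """A server in the replication group (hypothetical toy model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True


class Cluster:
    """Toy replication group: one active node, the rest passive."""

    def __init__(self, nodes):
        if len(nodes) < 2:
            raise ValueError("a cluster needs at least two nodes")
        self.nodes = list(nodes)
        self.active = self.nodes[0]

    def passive_nodes(self):
        return [n for n in self.nodes if n is not self.active]

    def write(self, key, value):
        """Apply a user change on the active node, then replicate it."""
        self.active.data[key] = value
        for node in self.passive_nodes():
            if node.healthy:
                node.data[key] = value

    def fail_over(self):
        """Promote the first healthy passive node to active."""
        for node in self.passive_nodes():
            if node.healthy:
                self.active = node
                return node
        raise RuntimeError("no healthy passive node available")


production = Node("production")
target = Node("target")
cluster = Cluster([production, target])
cluster.write("order-1001", "open")

production.healthy = False   # simulate losing the production server
cluster.fail_over()          # users are now directed to the target
cluster.write("order-1002", "open")
print(cluster.active.name, cluster.active.data)
```

Because every change was replicated as it was made, the target already holds `order-1001` when it takes over.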
Power AIX HA products replicate the data using either hardware (storage-based replication) or software.
Storage-based replication, or clustering, is the most common HA solution for AIX. In simple terms, storage-based replication allows two or more SAN or NAS units to keep each other in sync ("mirrored") using either a shared-everything topology or a shared-disk topology.
Shared-everything topology allows for a "no switching" operation because all servers, or nodes, can access and update the data at the same time. A node is in active mode when users make updates directly to it and in passive mode when it is receiving data from another node. Roles switch dynamically and require no intervention. Built-in locking operations ensure that only one node can write to a given piece of data at a time. HA software controls which server users are reassigned to when a failure occurs, and the data flows automatically.
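The role of locking can be shown with a small sketch in which two threads stand in for two nodes updating the same shared data. This is an analogy under stated assumptions, not how any storage product implements it: `SharedStore`, `add_stock`, and the `pump-42` item are hypothetical names, and a real cluster uses a distributed lock manager rather than an in-process `threading.Lock`.

```python
import threading


class SharedStore:
    """Data that every node can update (hypothetical toy model)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.stock = {"pump-42": 0}

    def add_stock(self, item, qty):
        # The lock lets only one node run this read-modify-write at a time,
        # so no update is overwritten by another node's stale read.
        with self._lock:
            current = self.stock[item]
            self.stock[item] = current + qty


def node_workload(store, updates):
    """Simulate one node applying a stream of user updates."""
    for _ in range(updates):
        store.add_stock("pump-42", 1)


store = SharedStore()
nodes = [
    threading.Thread(target=node_workload, args=(store, 10_000))
    for _ in range(2)
]
for node in nodes:
    node.start()
for node in nodes:
    node.join()

print(store.stock["pump-42"])  # 20000: every update from both nodes is kept
```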
Shared-disk topology lets the HA software specify which single node users can access (the active node), while the rest of the nodes stay in passive mode. The data flows from the user into the active node, and the active node pushes the data to the cluster's passive nodes.
Software-based replication uses features built into the operating system to capture the changes that are made on the production system. The HA solution then sends the data to the target system and applies the changes. Vendors often refer to this as "mirroring," but strictly speaking they are "replicating" the data and changes. Unlike many storage-based products, these solutions do not require the data to be stored in the same location on the target, which also lets them use different storage devices on the target than on the production system. Because they work from logs, these solutions are storage-independent, so you can easily change storage devices at the active or passive site.
In addition to normal replication and protections, the software can provide for recovering your replicated data from a point in time. Software-based replication products have extensive audits to verify that the target is the same as the source. Requirements vary, but you are expected to run the audits to ensure the integrity of your data. With storage-based replication, if these features exist, they are built into the SAN and performed automatically.
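An audit of this kind boils down to comparing what is on the source with what is on the target. The sketch below shows the idea with SHA-256 digests from Python's standard `hashlib` module. The `audit` function, the report keys, and the sample customer records are hypothetical; commercial products audit database files and objects, not in-memory dictionaries.

```python
import hashlib


def record_digest(value: bytes) -> str:
    """Fingerprint a record so contents can be compared cheaply."""
    return hashlib.sha256(value).hexdigest()


def audit(source: dict, target: dict) -> dict:
    """Report records that are missing, extra, or different on the target."""
    missing = sorted(k for k in source if k not in target)
    extra = sorted(k for k in target if k not in source)
    different = sorted(
        k for k in source
        if k in target and record_digest(source[k]) != record_digest(target[k])
    )
    return {
        "missing_on_target": missing,
        "extra_on_target": extra,
        "different": different,
    }


# Hypothetical sample data with one problem of each kind.
source = {"cust-1": b"Acme", "cust-2": b"Globex", "cust-3": b"Initech"}
target = {"cust-1": b"Acme", "cust-2": b"Globex Corp", "cust-4": b"Umbrella"}
report = audit(source, target)
print(report)
```

An empty report in all three lists is the result you want to see before you trust the target for a role swap.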
Storage-Based or Software-Based?
Which is better: storage-based or software-based? It depends. Hardware replication requires less user intervention, but it also requires expensive SAN or NAS storage that has to be the same on every node, and in general it costs more. Software replication allows more choice in what is replicated, and it offers features that are not available with storage-based products.
The Ultimate Question
Finally, ask yourself this question: When the CEO goes to the board to explain why you were down for two days, how you lost eight hours of data (including orders), and why your best customer pulled their orders after you missed a shipment because you didn't have a way to recover the lost orders, is he going to say that the RTO and RPO were agreed on and your recovery was within guidelines?
We know what will really happen, and I don't want it to happen to you. If you can't get management to agree to HA, then get your recovery objectives in writing and get a written sign-off. If you can't protect your company, at least protect yourself.