Disasters are rare, but when they occur, they can bankrupt an unprepared organization. Here are 20 quick tips that can help you to minimize a disaster's impact on your IT assets.
Disasters happen. They're uncommon, but they happen. Pretending otherwise serves only to amplify their negative effects.
In many people's minds, a "disaster" means a hurricane, earthquake, flood, fire, or other natural calamity or, possibly, a terrorist attack. Those clearly qualify as disasters, but for the purposes of this article, "disaster" has a broader meaning.
In this context, a disaster is any event that causes either of the following:
- The destruction of all online operational copies of an organization's data and/or applications. "Online operational copies" includes both the production copies and any ready-to-run backup copies that can be placed in the production role immediately and, preferably, seamlessly.
- The loss of access to all online operational copies of the organization's data and/or applications for a sufficiently long period such that a recovery operation will be faster and more cost-effective than waiting for the online operational copies to come back online.
In the event of a natural disaster or terrorist attack, the organization's first objective must be to protect the safety and security of its employees and other people on its premises. Once this objective has been achieved, or if the situation has not placed people at risk, the highest-priority task of the IT department is to get the business' critical systems running again as quickly as possible.
Failure to resume operations swiftly can compound the disaster and threaten the survival of the organization. According to one often-cited statistic attributed to the U.S. Bureau of Labor Statistics, 40 percent of all companies that experience a disaster never reopen, and more than 25 percent of the remaining companies close within two years.
Thus, disaster recovery (DR) is crucial. Nonetheless, DR doesn't just happen. Furthermore, in the midst of the excessive stress that is inevitable in any DR process, if something can go wrong, there is a high probability that it will go wrong. And in any complex IT environment, many unimaginable things can go wrong.
Fortunately, there are a number of ways to lessen the chance of things going wrong and to reduce the impact of a disaster. The following are 20 quick tips that can help you to ensure DR success.
- Inventory your IT assets. To recover from a disaster, you first must know what needs to be recovered. If you haven't already done so, make a detailed inventory of all of your IT assets--both tangible and intangible. What hardware, software, and data will have to be recovered? Which skills will be required to perform the recovery operations and then run the business' systems at a backup location if necessary?
The IT asset inventory list should be included in your disaster recovery plan, which is the subject of several of the tips that follow.
- Maintain offsite data backups. A comprehensive tape archive strategy is crucial. To minimize recovery times in situations where the physical assets of the primary data center are still operational, you must be able to recover data from tapes that are stored locally.
However, you also need to protect business operations against the risk of the destruction of the data center. Thus, you must also be able to recover from tapes at a secondary location.
Having an up-to-date copy of backup data at a remote location is worth almost any price. A local fireproof vault is not an adequate alternative to offsite storage because, depending on the circumstances, the vault may not offer sufficient protection or it may not be accessible quickly after a disaster.
- Prioritize your data and applications. Data and applications are not all created equal. Assess the varying criticality of data and applications. Some of them are utterly essential to reestablish the business. Those applications and data must be restored first. Recovery of secondary applications and data can be deferred until the critical apps and data are restored. Your DR plan should explicitly state the recovery order of data and applications to reflect these priorities.
- Define detailed disaster recovery processes. After creating your IT asset inventory and prioritizing your IT assets, map out detailed, step-by-step instructions for recovering each IT asset, in the order in which they should be recovered.
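As a rough illustration of the two preceding tips, a prioritized recovery list can be derived directly from the asset inventory. The following Python sketch is only one possible approach: the Asset fields, the numeric priority tiers, the asset names, and the runbook paths are all hypothetical and are not part of any particular DR product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    priority: int  # hypothetical tiering: 1 = must be restored first
    runbook: str   # hypothetical location of the step-by-step instructions


def recovery_order(assets):
    """Return assets sorted so that the most critical are restored first.

    Ties within a priority tier are broken alphabetically by name so the
    order is the same every time the list is generated.
    """
    return sorted(assets, key=lambda a: (a.priority, a.name))


# Illustrative inventory only; real entries come from your own asset list.
inventory = [
    Asset("intranet wiki", 3, "runbooks/wiki.txt"),
    Asset("order database", 1, "runbooks/orders-db.txt"),
    Asset("email", 2, "runbooks/email.txt"),
    Asset("billing application", 1, "runbooks/billing.txt"),
]

for step, asset in enumerate(recovery_order(inventory), start=1):
    print(f"{step}. {asset.name} (priority {asset.priority}): see {asset.runbook}")
```

Generating the list from the inventory, rather than maintaining it by hand, makes it harder for a newly added asset to be left without a place in the recovery order.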
- Don't omit "standalone" data. Increasingly, business-critical data and documents are stored on laptop and desktop computer disk drives. Your DR plan should include details on how this data will be backed up and recovered if lost.
And remember, a laptop or desktop computer may be destroyed in the same disaster that strikes a data center. Therefore, it is not enough to back up PC-based data onto a network drive in the primary data center. Critical PC-based data must also be included in the offsite backup datasets.
- Formally document the plan. A disaster recovery plan that exists only in someone's head is no plan at all. Keep in mind that you are creating a plan to recover from a disaster. While we'd rather not consider the prospect, it's possible that some critical employees will not be available due to the effects of the disaster. Even if the worst doesn't happen, some key staff may be on vacation and unreachable during a recovery operation. If the recovery plan exists only in those people's heads, the available staff won't be able to execute the plan.
- Keep hard copies of the plan. There may be some efficiencies to be gained from storing a disaster recovery plan online. For example, it may be possible to automate the initiation of some of the recovery processes and use the system to enforce the completion of checklists. Nonetheless, also keep printed copies of the recovery plan in secure locations, including at the recovery site. A plan for restarting the organization's systems that is locked inside a system that is unavailable will be of no use when it comes time to initiate the recovery operations.
Remember to replace the hard copies whenever the plan is updated.
- Keep multiple copies of the plan. A plan that exists only at the primary data center will be useless if the data center is destroyed. At a minimum, store a copy of the plan at the recovery site. Keeping additional copies of the plan at the homes of one or more of the key personnel who will be involved in the recovery operations will provide added safety and may allow those people to begin executing the plan without having to get to the recovery site first.
- Test the solution. In any complex system or process, what works in theory often fails in practice. Regular testing not only ensures that your recovery plan is viable, but also acts as a training tool. People who have already performed the recovery procedures a number of times during regular testing will be familiar with the plan and confident in their abilities to perform the required actions.
You should test the recovery processes at least three or four times per year. Tests will often reveal flaws in your recovery plan. When this happens, be sure to update the plan to fix the flaws.
- Create and maintain a test script. Avoid using an off-the-cuff approach to DR testing. Maintain a test script that follows your DR plan as closely as possible and tests as much of it as possible. (For operational reasons, it may not be possible to test all aspects of a recovery operation during every test, but every effort should be made to leave as little as possible out of the DR tests.)
Remember to update the test script when your DR plan changes.
- Consider disk-based remote backups. Traditional tape-based backups suffer from a variety of weaknesses. In addition to tape being slower than disk during backup and recovery operations, backup tapes are usually created only daily, typically at night. If a disaster occurs just before a new backup tape is created, there may be as much as a full day's worth of data that does not exist on any backup tape.
Disk-based backup products that transmit changed data to an offsite location much more frequently than daily--perhaps even continuously--can reduce the volume of unsaved data, possibly to zero.
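The arithmetic behind this tip is simple: in the worst case, a disaster strikes just before the next backup, so everything changed during one full backup interval is lost. The Python sketch below works through that calculation. The function name and the figure of 2,000 changes per hour are assumptions made purely for illustration.

```python
from datetime import timedelta


def changes_at_risk(backup_interval, changes_per_hour):
    """Estimate the worst-case number of unsaved changes.

    The worst case is a disaster just before the next backup runs, so one
    full backup interval of changes has not yet been copied offsite.
    """
    hours = backup_interval.total_seconds() / 3600
    return hours * changes_per_hour


# Hypothetical workload: 2,000 data changes per hour.
nightly = changes_at_risk(timedelta(hours=24), 2000)
frequent = changes_at_risk(timedelta(minutes=15), 2000)

print(f"Nightly tape backup:      up to {nightly:,.0f} changes at risk")
print(f"15-minute disk transfers: up to {frequent:,.0f} changes at risk")
```

Under these assumed numbers, shortening the interval from 24 hours to 15 minutes cuts the worst-case exposure from 48,000 changes to 500. Continuous transmission pushes the interval, and therefore the exposure, toward zero.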
- Store required passwords in multiple locations. You never know what a disaster will throw at you. If system passwords are available only at the primary site, you may find that you are unable to access critical information if that site is destroyed.
What's more, if only one person has the required high-level system passwords and that person stores them only in his or her head, you may be unable to restore your systems if that person is not available after a disaster. It is, therefore, essential to designate backups for all key staff.
- Ensure that backup procedures are followed. It sounds simple enough, but be sure that your data backup and protection procedures are followed rigorously on the prescribed schedule. After regularly backing up data for a long time without experiencing a disaster, and therefore not needing the backups, there is a tendency to become lax about compliance with backup policies. But, because you can't recover what you didn't save, this negligence could result in a business failure when a disaster does happen.
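One way to guard against this kind of drift is to check routinely that every dataset has a sufficiently recent backup. The Python sketch below shows the idea. The dataset names, timestamps, and 26-hour tolerance are hypothetical; in practice, the timestamps would come from your backup software's logs or catalog.

```python
from datetime import datetime, timedelta


def overdue_backups(last_backup_times, max_age, now):
    """Return the names of datasets whose latest backup is older than max_age."""
    return sorted(
        name
        for name, finished_at in last_backup_times.items()
        if now - finished_at > max_age
    )


# Hypothetical data: when each dataset's most recent backup completed.
last_backup_times = {
    "orders-db": datetime(2024, 3, 10, 1, 0),
    "file-server": datetime(2024, 3, 7, 1, 0),
    "laptops": datetime(2024, 3, 9, 2, 0),
}

# Assumed policy: daily backups, with two hours of slack for long-running jobs.
late = overdue_backups(
    last_backup_times,
    max_age=timedelta(hours=26),
    now=datetime(2024, 3, 10, 8, 0),
)
print("Overdue backups:", ", ".join(late) if late else "none")
```

Running a check like this on a schedule, and sending the result to someone accountable for it, turns a lapse in backup discipline into a visible exception rather than a surprise discovered during a recovery.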
- Respect tapes' "best-before" dates. Tapes have a limited service life that is determined primarily by the number of times the tapes are used. In addition to wear through use, tapes can become brittle and corrupted over time even if they aren't used.
Tapes should be rotated regularly and replaced as they age. If your tape supplier provides life-expectancy estimates, replace tapes before the recommended expiry dates.
Err on the side of caution. Tape life-expectancy values are only estimates. It is much less expensive to replace a tape that could have lasted for a few more runs than to find through brutal experience that you can't recover your data when necessary because a tape is unreadable.
As a general rule of thumb, tapes used on a daily basis should be replaced every six to nine months to avoid deterioration. Other tapes should be replaced on a regular, less-frequent schedule based on the frequency of use.
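A replacement schedule is easy to neglect unless something flags the tapes that are due. The Python sketch below applies the conservative end of the rule of thumb above (six months, approximated here as 182 days) to daily-use tapes. The tape labels and in-service dates are hypothetical, and the threshold should be adjusted to match your supplier's guidance.

```python
from datetime import date, timedelta

# Assumption: six months approximated as 182 days, the cautious end of the
# six-to-nine-month rule of thumb for tapes used daily.
DAILY_TAPE_MAX_AGE = timedelta(days=182)


def tapes_due(in_service_dates, today, max_age=DAILY_TAPE_MAX_AGE):
    """Return labels of tapes that have been in service for max_age or longer."""
    return sorted(
        label
        for label, started in in_service_dates.items()
        if today - started >= max_age
    )


# Hypothetical records of when each tape entered the daily rotation.
in_service_dates = {
    "MON-01": date(2024, 1, 1),
    "TUE-01": date(2024, 3, 1),
    "WED-01": date(2023, 11, 15),
}

due = tapes_due(in_service_dates, today=date(2024, 7, 1))
print("Replace now:", ", ".join(due) if due else "none")
```

Tapes used less often would need a different, longer threshold, which can be passed in through the max_age parameter.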
- Maintain multiple communication channels. When you need to notify your staff about a DR event, you may not have access to normal communication channels. Email may not be working, or the phone system may be down. Consider text messaging, personal email addresses, etc. as alternative communication vehicles. In addition, there are third-party companies that can handle this communication for you.
- Automate as much as possible. Human error is possible under any circumstances. In particularly stressful situations, it is almost inevitable. Thus, the more of the recovery processes that you can automate, thereby removing the human element, the better.
However, keep in mind that the systems responsible for automating the recovery operations may be unavailable after a disaster. Thus, just as your business applications and data need backups, you need manual backups for all of the automated recovery processes.
- Don't neglect security. When recovering from a disaster, it can be tempting to bypass your normal security protocols and policies in order to simplify and speed the recovery. Generally, this is a bad idea. Those security policies were established for a reason, and you don't want to create a potential security risk that can be as disruptive as, or more disruptive than, the disaster itself.
- View DR as an ongoing, evolving effort. Businesses change and grow, and their IT infrastructure, applications, and data evolve to support the changes and growth. As a result, a static DR plan will protect yesterday's data and applications, while leaving today's business operations exposed. Thus, don't approach DR as a one-time project, but rather as an ongoing exercise.
- Build a culture that emphasizes the importance of DR preparedness. If senior management is seen to have little concern for DR preparedness, that attitude will filter down to the front-line employees responsible for defining and executing the recovery processes and maintaining the backup data stores. Therefore, senior-level buy-in to business continuity initiatives is essential. In addition, that buy-in must be clearly communicated throughout the organization.
- Ask for help. Creating an effective DR plan can be challenging. A DR consultant with extensive knowledge and experience in the field can help. This allows you to leverage the experience of many companies and more effectively craft a plan that meets all of your business requirements at a cost that fits your budget and is justified by the benefits.
Furthermore, people often fail to notice what is most familiar to them. A DR consultant may spot an unprotected data store, application, process, or piece of hardware that employees overlook because its use has become second nature to them.