Organizations often view an investment in DR as insurance, but that perspective may blind them to the returns available from investments in advanced DR solutions.
Expenditures on disaster recovery (DR) solutions are frequently considered a cost of doing business, not an investment. Or they may be viewed as insurance policies that, hopefully, will never be called on to pay out claims. From this perspective, it's difficult to justify more than the minimum expenditure that will provide "good enough," but not necessarily optimal, protection against losses due to a disaster.
If it were true that DR solutions are merely insurance policies that pay out only when a disaster strikes, then the "good enough" strategy might, indeed, be appropriate. To understand why, first consider the definition of "disaster."
The "Insurance" Value of DR
From the DR insurance viewpoint, a disaster is an event that causes a level of destruction that forces you to restore data and applications from backup media and/or transfer system operations to another site. These situations are exceptionally rare. Most companies will go for many years without experiencing such a calamity, and some organizations may never experience one.
Economists generally suggest that the best way to value an uncertain outcome is with expected value theory. Put simply, the expected value is the value of the possible outcome times the probability of it occurring. Consider, for example, the following scenario (all numbers are hypothetical and likely bear little relationship to your circumstances):
- Disaster Angst Corp. (DAC) is considering an investment in DR technologies. With the technology in place, the improvement in disaster recovery time and completeness will reduce the cost of a disaster by $1 million compared to the status quo.
- DAC uses a five-year planning horizon for its technology investments.
- The probability of one and only one disaster occurring within DAC's planning horizon is 0.1 percent (= 0.001).
- The probability of two and only two disasters occurring is 0.05 percent (= 0.0005).
- The probability of three and only three disasters occurring is 0.005 percent (= 0.00005).
- The probability of more than three disasters occurring within DAC's planning horizon is small enough to be ignored.
Under the above scenario, the expected value of the DR technology is ($1,000,000 * .001) + ($1,000,000 * .0005) + ($1,000,000 * .00005) = $1,550, minus the cost of the technology. Using these numbers, an investment in DR technology would be a losing proposition if the technology costs more than $1,550.
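The arithmetic above can be reproduced with a short Python sketch. The first function follows the calculation exactly as given in the text; the second is an alternative reading (an assumption, not the article's method) in which the $1 million saving is earned again for each disaster in a scenario. The function and variable names are illustrative only.

```python
# Hypothetical figures from the DAC scenario above.
SAVINGS_PER_DISASTER = 1_000_000  # cost reduction per disaster, in dollars

# Probability of exactly n disasters within the five-year planning horizon.
PROBABILITIES = {1: 0.001, 2: 0.0005, 3: 0.00005}


def expected_benefit_flat(savings, probabilities):
    """Expected benefit as computed in the text: the saving is counted
    once per scenario, regardless of how many disasters occur."""
    return sum(savings * p for p in probabilities.values())


def expected_benefit_per_disaster(savings, probabilities):
    """Alternative reading (an assumption, not the article's method):
    the saving is earned again for each disaster in a scenario."""
    return sum(n * savings * p for n, p in probabilities.items())


flat = expected_benefit_flat(SAVINGS_PER_DISASTER, PROBABILITIES)
scaled = expected_benefit_per_disaster(SAVINGS_PER_DISASTER, PROBABILITIES)
print(f"Expected benefit (as in the text): ${flat:,.2f}")
print(f"Expected benefit (saving per disaster): ${scaled:,.2f}")
```

Under the alternative reading the figure rises to roughly $2,150, which is still tiny relative to the $1 million payout, so the conclusion is unchanged either way.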
Even this scenario overestimates the value because it does not account for the time value of money. The hardware and software costs associated with implementing DR solutions are incurred up front, whereas the "insurance" benefits will be received only when and if a disaster occurs sometime down the road. A discounted cash flow calculation would, therefore, be more appropriate. This calculation would further reduce the expected value, but it is beyond the scope of this article.
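Without attempting a full discounted cash flow analysis, a rough Python sketch can show the direction of the effect. It rests on assumptions that are not in the scenario: the expected benefit is spread evenly across the five years, it is received at the end of each year, and the discount rate is a hypothetical 8 percent.

```python
def present_value(expected_benefit, years, discount_rate):
    """Discount an expected benefit that is assumed to be spread evenly
    across the planning horizon and received at the end of each year."""
    annual_benefit = expected_benefit / years
    return sum(
        annual_benefit / (1 + discount_rate) ** year
        for year in range(1, years + 1)
    )


EXPECTED_BENEFIT = 1_550  # undiscounted figure from the scenario above
PLANNING_YEARS = 5        # DAC's planning horizon
DISCOUNT_RATE = 0.08      # hypothetical rate, not taken from the article

pv = present_value(EXPECTED_BENEFIT, PLANNING_YEARS, DISCOUNT_RATE)
print(f"Discounted expected benefit: ${pv:,.2f}")
```

With any positive discount rate, the result falls below the undiscounted $1,550, which is the point the paragraph above makes.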
As can be seen, even when the payout from the insurance aspect of DR is large, because the probability of a disaster is so low, expected value theory suggests that a large investment in a DR solution is unwarranted.
There are a few problems with the insurance view of DR. For one thing, the expected value calculations are dubious. Their accuracy depends on the accuracy of two component forecasts: the probability of disasters and the cost of disaster-related downtime and data losses. Both of these estimates are typically fraught with error.
History generally provides the only readily available estimate of the probability of disasters, but history is, at best, a weak predictor. The problem is a paucity of data points. Disasters happen at random intervals, and they occur only very rarely. Thus, historical averages rest on too few observations to be statistically reliable.
What's more, the historical data used must be restricted to companies in similar circumstances. Some geographic areas never see hurricanes, but others are at a high risk during hurricane seasons. The presence of nearby tectonic fault lines determines the probability of earthquakes. Forest fires, tornadoes, and wars are, likewise, more prevalent in some locations than in others. Clearly, a worldwide disaster frequency average would produce a poor disaster frequency prediction for a specific company. Yet restricting the data to just those companies in circumstances similar to yours forces you to base your forecast on a small subset of an already small set of data.
The costs of downtime and data losses can be forecast much more accurately than the disaster frequency. Nevertheless, most companies significantly underestimate these costs. Unless organizations perform a rigorous analysis, the true costs are likely multiples of their "back-of-the-envelope" estimates.
What's more, even when companies undertake a comprehensive analysis of their potential disaster-related costs, they often overlook one gloomy statistic. A sizeable proportion of companies that incur a cessation of operations lasting more than a couple of days go bankrupt within a few years or never reopen at all. Thus, unless you have DR solutions that allow you to recover effectively and rapidly, the true cost of a disaster may be the full value of the business.
And, in some industries, such as the financial sector, the choice may be taken out of the hands of the business. Regulations in critical industries make an investment in business continuity technologies a minimum cost of doing business.
The "Insurance" Legacy of DR
DR became viewed as insurance because the traditional DR technology, which is still the primary or only DR technology used in many companies, was incapable of acting as any more than that. Tape-based backups are a very cumbersome and time-consuming way to recover data, particularly if the tapes have been sent to an offsite location. As a result, backup tapes are typically used as only a last resort.
Yet tape-based backups don't even provide an especially good insurance policy. In addition to being a comparatively slow medium, they are more fallible than disk. Hence, it is possible that, when you try to recover data from tape, you will find that the most recent version is unusable, forcing you to rely on backup tapes that are up to 48 hours old.
Even when the most recent backup tapes are usable, they do not allow complete data restoration. Backup tapes are typically created once every 24 hours, usually in the middle of the night. Data that are added to or updated on a company's databases during the next day are not represented on any backup tapes. Consequently, if a disaster destroys the data center, including any online journals, data recovery from the backup tapes will omit up to 24 hours' worth of data, and possibly more if the most recent backup tapes have not been sent offsite at the time of the disaster.
Furthermore, it can take several hours or even a few days to fully recover a data center from backup tapes, particularly if those tapes have to be recovered from a vault some distance from the recovery site. As noted above, many companies would not survive such lengthy business outages.
Because of the difficulty in justifying large expenditures based solely on DR's insurance value, many companies don't move beyond tape-based backups, despite the considerable liabilities of this approach. Nevertheless, moving DR beyond tape unlocks significant ROI potential in addition to what's available from the insurance value of DR. And, unlike "insurance," the realization of that potential is assured, which makes these advanced solutions easier to justify to a cautious CFO.
The options for moving beyond tape are many, but they can be distilled into two broad categories: geographically distributed high availability (HA) and continuous data protection (CDP). To be clear, most, if not all, companies that adopt technologies in one or both of these categories will not abandon tape as a backup medium. However, tape will be relegated to a last line of defense, to be used only when all else fails.
Geographically Distributed HA
HA technologies create and maintain real-time replicas of production servers, including fully redundant copies of all data. Because HA solutions can replicate data and objects over any distance, a backup server can be located far enough away from the primary site that a single disaster will almost certainly not affect both sites.
This can be classified as a DR solution because of the protection it offers from the IT-related consequences of disasters. However, thinking of it as such requires a mind shift because, unlike tape-based backup technologies, there are no data or objects to recover before normal operations can resume after a disaster. Instead, users are simply switched to the remote, hot-standby, backup server. Then, when the primary site becomes available again, the HA software can automatically resynchronize the two sites.
Unlike an investment in tape-based backups, which provides only an insurance benefit, an HA investment yields an ROI that is much easier to predict. More importantly, you don't need to incur a disaster to earn the return.
Because switching to a redundant backup server can be done fairly quickly, you can use this option in a much broader range of circumstances than you can use tape-based recoveries. For example, when you need to perform maintenance on the primary server, rather than shut down operations until the maintenance work is finished, you can switch users to the backup system so they can continue their normal activities with minimal interruption.
Like tape-based DR, this option offers an insurance value, but it no longer derives its ROI primarily from its insurance value. Some maintenance is performed regularly, on a well-planned schedule. Other types of maintenance, such as hardware and software upgrades, occur at more random intervals, but they are sufficiently recurring that their frequency is predictable with a fair degree of accuracy.
In addition, companies have experience with most of the types of maintenance that will be required in the future. Thus, unlike disasters, with which most organizations have scant or no experience, organizations can accurately compute the cost of maintenance downtime. By measuring past costs and projecting them forward, it is possible to predict the value that will be contributed by this benefit with a high degree of precision.
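As a minimal sketch of such a projection, the Python below multiplies the historical frequency of each type of maintenance by the downtime that a switchover avoids and by an hourly cost of downtime. Every figure and name here is hypothetical; real inputs would come from an organization's own maintenance records.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceType:
    name: str
    events_per_year: float       # taken from past maintenance records
    hours_down_per_event: float  # downtime when operations must stop
    hours_down_with_ha: float    # residual interruption during a switchover


def annual_avoided_cost(maintenance_types, cost_per_downtime_hour):
    """Project the yearly downtime cost avoided by switching users to
    the backup server instead of halting operations."""
    return sum(
        m.events_per_year
        * (m.hours_down_per_event - m.hours_down_with_ha)
        * cost_per_downtime_hour
        for m in maintenance_types
    )


# Hypothetical inputs, for illustration only.
history = [
    MaintenanceType("scheduled maintenance", 12, 2.0, 0.1),
    MaintenanceType("hardware/software upgrades", 2, 8.0, 0.25),
]
avoided = annual_avoided_cost(history, cost_per_downtime_hour=5_000)
print(f"Projected annual avoided downtime cost: ${avoided:,.0f}")
```

Because each input is measured from past operations rather than guessed, the projection does not depend on estimating the probability of a rare event.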
Geographically dispersed HA delivers other benefits that can also be forecast with reasonable accuracy. For example, in the past, it was necessary to take applications offline while backing up related data. Save-while-active technologies make this unnecessary, but backup jobs usually consume much of the available disk I/O bandwidth, while also hogging processor resources. As a result, while business applications may, technically, be able to run while performing backup tasks, response times may slow unacceptably.
Geographically distributed HA can eliminate the burden that backup jobs place on production operations. Because HA maintains a complete, up-to-date replica of all data and applications, backup tapes can be created on the remote server, thereby eliminating the impact on the production system.
The cost of the productivity that is lost when backup jobs are run on the production system is calculable. Furthermore, backup jobs run at very regular intervals, typically exactly once every 24 hours. Thus, when considering an investment in geographically distributed HA, the value that will be received from moving backup jobs off the production system can be forecast with considerable accuracy.
What's more, because the backup server contains a current copy of all data, it can also be used to run other read-only tasks, such as batch reporting. Shifting processing off the primary server in this way may defer the need for a server upgrade.
Continuous Data Protection
The need to recover data most often arises not from disasters, but from much more common and, seemingly, less significant events. An operator accidentally deletes an important file. A user corrupts data with an inappropriate update. Simultaneous disk failures overcome RAID protection and destroy a portion of the company's data. A computer virus deletes or corrupts data. A disgruntled employee destroys some data before departing. The list of such occurrences is almost endless.
In these circumstances, the job of the IT department is not to recover the entire data center but, rather, to restore the individual file or data item in question, preferably to the point immediately prior to when it was deleted or corrupted. HA software alone can't do this because the software will immediately copy the deletion or corruption to the backup server to ensure that it is always an exact replica of the production server.
Tape-based backups offer a partial solution, but they force you to restore data to its state as of when the backup tape was created, probably the previous night. Doing so may discard several updates that were performed on the data between then and when the corruption or deletion occurred.
In addition, recovering a single data item from tape can be a labor-intensive, lengthy process—particularly if the tape has already been sent offsite. Because these are relatively common occurrences compared to disasters, many companies keep the most recent backup tapes onsite so they won't have to be recalled when they're needed for these sorts of recovery operations. However, doing so reduces the insurance value of tape-based backups. Should a disaster destroy the data center, including the most recent backup tapes, the company will lose up to two full days' worth of data rather than only one.
CDP provides a solution by copying data inserts, updates, and deletes to an online data store that is usually some distance from the production server. The copying happens either continuously (true CDP) or in batches at intervals, such as when a file is closed or saved (near CDP). Unlike HA, CDP does not attempt to maintain a replica of the production server. Instead, it stores information about each individual update. This information can then be used to restore one or more individual data items to their state at a time of the administrator's choosing, typically immediately before they were corrupted.
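The idea of restoring an item to a chosen point in time can be illustrated with a small Python sketch. It is not how any particular CDP product works; the journal layout, the `JournalEntry` fields, and the `restore_item` function are hypothetical, and the sketch assumes the journal holds every change to the item since it was created.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class JournalEntry:
    timestamp: float             # when the change was captured
    item_key: str                # identifier of the data item
    operation: str               # "insert", "update", or "delete"
    value: Optional[Any] = None  # new contents; None for deletes


def restore_item(journal, item_key, as_of):
    """Return the item's contents as they stood at time `as_of`,
    or None if the item did not exist at that time."""
    state = None
    for entry in sorted(journal, key=lambda e: e.timestamp):
        if entry.timestamp > as_of:
            break  # ignore everything after the chosen point in time
        if entry.item_key != item_key:
            continue
        state = None if entry.operation == "delete" else entry.value
    return state


journal = [
    JournalEntry(1.0, "invoice-42", "insert", "draft"),
    JournalEntry(2.0, "invoice-42", "update", "approved"),
    JournalEntry(2.5, "invoice-7", "insert", "paid"),
    JournalEntry(3.0, "invoice-42", "update", "CORRUPTED"),
]

# Restore the invoice to its state just before the bad update at time 3.0.
restored = restore_item(journal, "invoice-42", as_of=2.9)
print(restored)
```

Because each change is stored individually, the administrator can pick any moment before the corruption rather than being limited to the time of the last nightly backup.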
Depending on the vendor, CDP may be sold as a standalone product or bundled with HA software. Typically, CDP stores data in a simple file structure, and, as a result, the CDP server usually does not have to use the same platform as the production server. Instead, it can often be a low-cost Windows- or Linux-based server.
The problems that CDP resolves happen randomly, but they are frequent enough that history provides a reasonable forecast of their frequency. In addition, it is easy to measure how much operator time is consumed in recovering data from tape as opposed to how long that task will take when using a CDP solution. The product of these two values (frequency and avoided cost) can be compared to the cost of the CDP solution to provide an estimate of the return on an investment in CDP.
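That comparison can be sketched in a few lines of Python. All of the inputs below are hypothetical placeholders, and the function name is illustrative; the structure simply follows the frequency-times-avoided-cost reasoning described above.

```python
def cdp_annual_return(incidents_per_year, tape_hours, cdp_hours,
                      hourly_cost, annual_cdp_cost):
    """Estimate the yearly return on CDP from routine recoveries alone.

    Returns (avoided_cost, net_benefit, roi), where roi is the net
    benefit expressed as a fraction of the annual CDP cost.
    """
    avoided_cost = incidents_per_year * (tape_hours - cdp_hours) * hourly_cost
    net_benefit = avoided_cost - annual_cdp_cost
    roi = net_benefit / annual_cdp_cost
    return avoided_cost, net_benefit, roi


# Hypothetical inputs, for illustration only.
avoided_cost, net_benefit, roi = cdp_annual_return(
    incidents_per_year=40,   # recoveries per year, from past records
    tape_hours=4.0,          # operator hours per recovery from tape
    cdp_hours=0.5,           # operator hours per recovery with CDP
    hourly_cost=150,         # cost of an operator hour
    annual_cdp_cost=12_000,  # annualized cost of the CDP solution
)
print(f"Avoided cost: ${avoided_cost:,.0f}, net: ${net_benefit:,.0f}, ROI: {roi:.0%}")
```

This sketch counts only operator time. Any value from recovering updates that a tape restore would have discarded would come on top of it.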
CDP also provides value when a disaster occurs.
The CDP backup server does not contain a complete copy of all of an organization's data. Thus, when CDP, but not HA, is in place, the IT department begins a disaster recovery operation by first restoring data from the most recent backup tapes. The CDP database is then used to bring data up to date by applying the data updates that were made after the backups were created and up to the point of the disaster.
Because tape-based recovery is a necessary element in this scenario, IT's role in the recovery operation will probably take slightly longer than it would if recovery were from only backup tapes. However, because true CDP can be used to recover data right up to the point of failure, the organization will not have to manually restore data that is not on the backup tapes. Thus, the total recovery process across the whole organization usually takes considerably less time with CDP than without it.
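The two-step recovery described above, restoring from tape and then applying the CDP updates, can be sketched in Python as follows. The data layout (a dictionary for the tape snapshot and tuples for journal entries) and the function name are hypothetical simplifications, not a description of any vendor's format.

```python
def recover_after_disaster(tape_snapshot, backup_time, journal, failure_time):
    """Rebuild the data set: start from the tape snapshot, then apply the
    journaled changes made after the backup and up to the failure."""
    data = dict(tape_snapshot)  # step 1: restore from the most recent tapes
    # Step 2: replay CDP journal entries in chronological order.
    for timestamp, item_key, operation, value in sorted(journal, key=lambda e: e[0]):
        if timestamp <= backup_time or timestamp > failure_time:
            continue  # already on tape, or after the point of failure
        if operation == "delete":
            data.pop(item_key, None)
        else:  # "insert" or "update"
            data[item_key] = value
    return data


# Hypothetical example: a nightly backup at time 0 and a disaster at time 5.
snapshot = {"a": 1, "b": 2}
journal = [
    (1, "a", "update", 10),
    (2, "b", "delete", None),
    (3, "c", "insert", 3),
]
recovered = recover_after_disaster(snapshot, backup_time=0,
                                   journal=journal, failure_time=5)
print(recovered)
```

The replay step is what closes the gap of up to 24 hours that tape alone would leave, at the price of a somewhat longer IT procedure.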
Recognizing the Full ROI of DR
The above discussion is not intended to belittle the insurance value of DR. One buys insurance when the cost of the insured risk would be greater than one can bear should the threat come to pass. DR definitely fits this bill. And, in some industries, regulations demand that companies acquire DR technologies for this purpose.
The point is that, by viewing DR as only an insurance policy, organizations may blind themselves to the additional value that can be achieved through investments in advanced DR solutions. Those investments are well worth investigating, and they can often unlock returns that are far larger and more assured than the returns available through DR "insurance" alone.