This article is an excerpt from the book System i Disaster Recovery and Planning.
Many backup best practices seem basic, but accomplishing them isn't always easy. They depend on appropriate reporting and measurement capabilities and on staff competency within the organization. If you can't accomplish all of the best practices, address the most critical ones first. If time and resources are the issue, develop a plan to justify them. Against these hurdles, you must weigh the acceptable risk of unrecoverable data during a major outage.
Here are two key points to remember when developing a data recovery strategy:
Back up all critical data. Ask yourself, are you backing up the right stuff?
Backups are the backbone of any recovery situation. In most recovery situations, the backups are not adequate, so excessive time is spent recreating parts of the operating system.
Issue 1: My backups run on the night shift, so I never hear about any issues!
The key element in recovering your system from a disaster is the completeness of the backups. If your backups are incomplete or flawed prior to the disaster, then the disaster recovery plan simply will not work, no matter how many experts you recruit. Having a process in place means a lot more than simply signing your name to it. A sign-off implies correctness: it means you have followed all the necessary steps to verify that the backups are complete. That means 100% complete.
Many backup solutions are partially broken. I often observe graphs posted in IT shops stating things like, "We have a 96% backup success rate. We observe all standards to ensuring your data is backed up." A 96% on a math exam is amazing. A 96% in backups implies failure. It means that 4% of the time, the server isn't backed up completely. With nightly backups over a calendar year, that is roughly 14 or 15 days with incomplete backups. Is this acceptable to your business? My guess is no.
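The arithmetic behind that figure is easy to check. The following is a minimal sketch that assumes one backup per calendar day and a 365-day year; the function name is illustrative and not part of any backup product.

```python
def incomplete_backup_days(success_rate: float, backups_per_year: int = 365) -> float:
    """Expected number of backups per year that are not fully complete."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return (1.0 - success_rate) * backups_per_year


# Compare a few success rates; only 100% leaves no incomplete nights.
for rate in (0.96, 0.99, 1.00):
    days = incomplete_backup_days(rate)
    print(f"{rate:.0%} success rate -> about {days:.1f} incomplete backups per year")
```

Even a 99% success rate still leaves several nights a year on which a restore would be incomplete.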
Customer Backup Log Sample
System not in restricted state, SAVSYS processing completed with errors.
Starting SAVDLO of folder *ANY to devices TAP01.
2,574 document library objects saved.
Starting save of list *LINK to devices TAP01.
43,917 objects saved. 342 not saved.
Save of list *LINK completed with errors.
Starting save of media information at level *OBJ to device TAP01.
18 objects saved from library QUSRBRM.
Save of BRM media information at level *OBJ complete.
DAILY *BKU 0070 *EXIT CALL PGM(BBSYSTEM/ENDDAYBU.
Control group DAILY type *BKU completed with errors.
In this example, 342 objects were not saved on this web server. The response was, "Oh, we always get this message. It's no big deal." Backup is signed off as successful.
Was the backup really successful? Of course not. The backup could not get a lock on 342 objects, which probably means they were in use. If these 342 objects are in use, they must be critical to the function of this application. You need 100%.
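A sign-off check along these lines can be automated. The sketch below is only an illustration: it assumes the backup log is available as plain text lines and that the phrases "not saved" and "completed with errors" appear as they do in the sample log. The function names are hypothetical.

```python
import re

# Phrases taken from the sample log that indicate an incomplete save.
ERROR_PHRASE = "completed with errors"
NOT_SAVED = re.compile(r"([\d,]+)\s+not saved", re.IGNORECASE)


def review_backup_log(lines):
    """Return (objects_not_saved, error_lines) for a backup log."""
    not_saved = 0
    error_lines = []
    for line in lines:
        match = NOT_SAVED.search(line)
        if match:
            not_saved += int(match.group(1).replace(",", ""))
        if ERROR_PHRASE in line.lower():
            error_lines.append(line.strip())
    return not_saved, error_lines


def backup_is_complete(lines):
    """A backup counts as successful only if nothing was skipped or flagged."""
    not_saved, error_lines = review_backup_log(lines)
    return not_saved == 0 and not error_lines


sample_log = [
    "System not in restricted state, SAVSYS processing completed with errors.",
    "43,917 objects saved. 342 not saved.",
    "Save of list *LINK completed with errors.",
    "Control group DAILY type *BKU completed with errors.",
]

not_saved, errors = review_backup_log(sample_log)
print(f"Objects not saved: {not_saved}")
print(f"Lines reporting errors: {len(errors)}")
print("Sign off as successful?", backup_is_complete(sample_log))
```

The point of the design is that the pass condition is strict: a single unsaved object or error line is enough to withhold the sign-off.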
The person responsible for managing the backups is under pressure to report only the good. So, if no one asks . . . hey, no one asked. Backups are typically managed by a junior staffer and on an off-shift. Many times, backup failures are not reported to management because the people performing the backups think if they tell anyone how bad the backups are, they'll be fired. They're betting that the backups will work the next night, or everything will be captured with the weekly full system option 21 save.
Examine your own house and see just how well your backups are really running.
Issue 2: Develop a backup plan.
Backups are just one component of data protection. Backup planning should be a fundamental part of every backup and recovery program's design. Your backup solution must always account for rolling out new applications, adding partitions or guest operating systems, and managing disk growth. A proper backup plan enables the system administrators to fully understand the application needs and any additional business requirements.
Issue 3: Establish a backup lifecycle program.
An effective backup strategy requires certain tasks to be completed successfully each and every day. In addition, there are weekly, monthly, and even yearly tasks that are vital to your business. An effective backup lifecycle program demands that all tasks are documented and performed on a regular, published, and agreed-upon schedule. This also lends itself to ITIL and SOX compliance.
Daily tasks are the operational fundamentals that most backup administrators are familiar with. They include items such as these:
Problem analysis and resolution
Tape handling and library management
Offsite tape storage
Scheduling of special saves
Weekly, monthly, and long-term backups
Capacity backup planning for disk and tape drives
Review of backup policies
Recovery testing and verification
Evaluate your daily/weekly/monthly tasks as needed. Document them, and make sure they're recorded and signed off. All this will seem very tedious at first, but as you automate these processes, the benefits quickly become apparent.
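One way to keep the schedule and its sign-offs honest is to record them in a simple register and flag anything overdue. The sketch below is a hypothetical illustration: the class, the field names, and the intervals are assumptions, not recommendations.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class BackupTask:
    name: str
    interval_days: int                      # how often the task must be done
    last_signed_off: Optional[date] = None  # None means never signed off
    signed_off_by: str = ""

    def is_overdue(self, today: date) -> bool:
        if self.last_signed_off is None:
            return True
        return today - self.last_signed_off > timedelta(days=self.interval_days)


def overdue_tasks(tasks: List[BackupTask], today: date) -> List[str]:
    """Names of tasks whose last sign-off is older than their interval."""
    return [task.name for task in tasks if task.is_overdue(today)]


# Illustrative entries only; dates and intervals are made up.
tasks = [
    BackupTask("Tape handling and library management", 1, date(2024, 3, 4), "operator"),
    BackupTask("Offsite tape storage", 1, date(2024, 3, 1), "operator"),
    BackupTask("Review of backup policies", 30, date(2024, 2, 20), "admin"),
    BackupTask("Recovery testing and verification", 365),
]

print(overdue_tasks(tasks, today=date(2024, 3, 5)))
```

A task that has never been signed off is treated as overdue, so a newly documented task cannot silently go unperformed.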
Issue 4: Review backup logs daily.
A review of backup job logs or BRMS reports and the QSYSOPR message queue is a key daily task, but one that's not routinely performed. It can be time-consuming, but it pays dividends in ensuring backup completeness. DSPLOGBRM provides you with the BRMS activities per date, interactively. You can see today's history, yesterday's, last week's, or as far back as you keep the history.
Backup issues tend to compound. How many times has one backup event resulted in a series of subsequent failures? A weekly backup gets run instead of a daily, thus shutting down all subsystems, including QINTER. A very angry group of users is less than impressed. Then, you put the weekly job schedule entry on hold and forget to release it later. Now you have missed the weekend save. The system administrator responsible never informed anyone about putting the weekly backup job on hold.
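The forgotten hold in this scenario is the kind of thing a daily check can catch. The sketch below assumes the job schedule entries have already been exported into simple records; the record layout, the status values, the entry names, and the 12-hour threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def forgotten_holds(entries, now, max_hold=timedelta(hours=12)):
    """Return names of schedule entries that have been held longer than max_hold."""
    return [
        entry["name"]
        for entry in entries
        if entry["status"] == "HELD" and now - entry["held_since"] > max_hold
    ]


# Hypothetical export of job schedule entries.
entries = [
    {"name": "DAILYBKU", "status": "SCHEDULED", "held_since": None},
    {"name": "WEEKLYBKU", "status": "HELD", "held_since": datetime(2024, 3, 4, 9, 0)},
]

print(forgotten_holds(entries, now=datetime(2024, 3, 6, 8, 0)))
```

Reporting the result to the whole team, rather than only to the person who placed the hold, addresses the communication gap described above.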
It takes considerable skill to troubleshoot backup failures. It is therefore important to verify everything works. When it does not, determine the root cause rather than guess based on some symptom.
Issue 5: Protect your manual tape backup database or BRMS catalog.
BRMS backup control groups maintain a database or catalog that is absolutely critical to the recovery of your system from the backup tapes. Having no access to the BRMS catalog or, worse, a corrupted catalog means you have lost your ability to restore anything from your backup tapes.
Every backup performed through control groups automatically writes the catalog to the tape. The following objects will permit the catalog to be retrieved:
   Saved Item  Type   Name     Number  Date  Time  Control Group
   ----------  -----  -------  ------  ----  ----  -------------
__ QBRM        *FULL  *SYSBAS  00001   xx    xx    MTHLY
__ QMSE        *FULL  *SYSBAS  00001   xx    xx    MTHLY
__ Q1ABRMSF    *FULL  *SYSBAS  00001   xx    xx    MTHLY
__ Q1ABRMSF01  *FULL  *SYSBAS  00001   xx    xx    MTHLY
__ QUSRBRM     *FULL  *SYSBAS  00001   xx    xx    MTHLY
__ QUSRBRM     *QBRM  *SYSBAS  00033   xx    xx    DAILY
If you do not have a tape management software solution, how do you know what data is on which tapes? Many clients manually record volume information in Microsoft Excel or Word and store the information on a local PC. The key thing to remember is to get this information offsite, so that it does not go down in flames with the rest of your infrastructure.
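A manual catalog does not need to be elaborate to be useful. The sketch below is a hypothetical illustration of recording which saved items are on which tape volume in CSV form and looking them up again. The field names, volume identifiers, and dates are invented for the example, and an in-memory buffer stands in for the file.

```python
import csv
import io

FIELDS = ["volume", "save_date", "saved_item", "save_type"]


def write_catalog(rows, stream):
    """Write one row per saved item per tape volume."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def volumes_for(saved_item, stream):
    """Volumes that hold saved_item, most recent save first."""
    hits = [row for row in csv.DictReader(stream) if row["saved_item"] == saved_item]
    hits.sort(key=lambda row: row["save_date"], reverse=True)  # ISO dates sort as text
    return [row["volume"] for row in hits]


# Hypothetical volume identifiers and dates, for illustration only.
rows = [
    {"volume": "VOL001", "save_date": "2024-03-01", "saved_item": "QUSRBRM", "save_type": "*FULL"},
    {"volume": "VOL002", "save_date": "2024-03-04", "saved_item": "QUSRBRM", "save_type": "*QBRM"},
    {"volume": "VOL001", "save_date": "2024-03-01", "saved_item": "QBRM", "save_type": "*FULL"},
]

buffer = io.StringIO()
write_catalog(rows, buffer)
catalog_text = buffer.getvalue()  # in practice, a file with a copy kept offsite

print(volumes_for("QUSRBRM", io.StringIO(catalog_text)))
```

Whatever form the catalog takes, the copy that matters in a disaster is the one stored away from the site.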