Disaster Recovery – Are You Saving The Right Stuff?

For most IBMi shops, tape backup is the Disaster Recovery plan.  If data on the server is lost to human error, hardware failure, cyber-attack or disaster, it can be recovered from a tape backup that is ideally stored off-site.  Tape backup works and it is reliable, but it takes time.  First you need to recover from the incident that caused the loss.  That may involve repairing or replacing the server or remediating whatever caused the data loss.  That first step can take hours to days.  A full restore from a backup tape can take an additional day.  At a minimum, recovery using tape backup would take a day, and it could take a week or more.  Can your business tolerate a week-long outage?

Experts will tell you that the most important aspect of a Disaster Recovery plan is testing it to make sure that it works when needed.  How do you test a Disaster Recovery plan that uses tape backup?  You cannot simply do a backup, install a clean version of the operating system on the production server and then do a restore.  If the restore doesn’t work, you will have lost data, perhaps permanently!  How do you know that you are saving the right stuff?  How do you know that your staff can successfully perform a system recovery?  Recovering from a disaster is not the time for on-the-job training.

A full-system-save (Save 21) saves everything on the IBMi server.  But it takes time.  How much time?  That depends on how much is being saved, the speed of the server, the speed of the tape drive and the speed of the interface.  Some customers can perform a full-system-save every night.  Most cannot.  Some IBMi shops do not have a weekly, monthly or even quarterly window large enough to perform a full-system-save.  For those shops, a full-system-save can only be performed on national holidays when the business is closed.  When a full-system-save cannot be done regularly, daily incremental backups are necessary.  When backing up with both full-system-saves and incrementals, the recovery process takes even longer and becomes more complicated.  That is why you need to know if your recovery process really works.
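As a rough illustration of the "how much time?" question, the sketch below estimates a backup duration by dividing the amount of data by the slowest of the three speeds mentioned above (server, tape drive, interface).  The function names and all of the numbers are hypothetical, and the model is a simplification: it ignores compression, tape changes and the mix of object sizes, so treat the result as a first guess rather than a measurement.

```python
def estimate_backup_hours(data_gb, server_mb_s, drive_mb_s, interface_mb_s):
    """Rough backup duration in hours: data volume divided by the slowest stage.

    All throughput figures are sustained MB/s and are assumptions supplied
    by the caller, not values taken from any IBM documentation.
    """
    bottleneck_mb_s = min(server_mb_s, drive_mb_s, interface_mb_s)
    if bottleneck_mb_s <= 0:
        raise ValueError("throughput values must be positive")
    seconds = (data_gb * 1000) / bottleneck_mb_s  # 1 GB treated as 1000 MB
    return seconds / 3600


def fits_window(data_gb, window_hours, server_mb_s, drive_mb_s, interface_mb_s):
    """True if the estimated backup fits inside the available window."""
    hours = estimate_backup_hours(data_gb, server_mb_s, drive_mb_s, interface_mb_s)
    return hours <= window_hours


if __name__ == "__main__":
    # Hypothetical shop: 4 TB to save, tape drive is the slowest component.
    hours = estimate_backup_hours(4000, server_mb_s=500, drive_mb_s=200, interface_mb_s=800)
    print(f"Estimated full-system-save: {hours:.1f} hours")
    print("Fits a 4-hour nightly window:", fits_window(4000, 4, 500, 200, 800))
```

Running the same estimate against your nightly, weekly and holiday windows gives a quick sense of how often a full-system-save is realistic before you time a real one.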

There are several options for testing recovery using tape backups.  Not only can you make sure that you are saving the right stuff, you can also train the operations staff for recovery.

Options for Testing Disaster Recovery:

1. Backup Server – this option is possible if you have access to a second server that you can use to test recovery.  It could be a retired server, as long as it supports the version of the operating system on the current production server and it has enough disk.  It could also be a server that you borrow or rent to test the recovery.  If you just installed a new server, before you get rid of the old one, why not scratch install the OS on it and practice your recovery process?  That’s getting the most value out of your old server.

2. Virtual LPAR – current IBM servers (POWER6 and later) and operating systems (V6.1 and later) allow the creation of a virtual LPAR using hardware resources allocated from the production environment.  If you meet the server and OS requirements for Virtual LPAR and have enough available CPU, memory and disk, then you can create a Virtual LPAR and restore the backup to it.  Once the restore is completed, the restored system can be tested.  When testing is finished, the Virtual LPAR can be removed and the hardware resources returned to the production environment.  You will need IBM’s PowerVM software and a Hardware Management Console (HMC) to create and remove the Virtual LPAR.

3. Cloud Server – cloud providers can provision a virtual server for you and restore your backup to it.  You can then test the recovered server via the internet.  Some cloud providers will provision the virtual server just for the duration of the test.  Others may require a longer contract.

Options for reducing system outages for tape backup:

1. Save While Active – this option has been available on the IBMi since V2R2. It allows backups to occur while the system is active, after reaching a checkpoint.  The checkpoint is a point in time when all processing has been stopped and data has become synchronized.  This option works well for some customers, but requires skill to implement and operate. 

2. Faster hardware – this can include a faster tape drive, a faster tape interface (fibre), a faster tape adapter, more memory or disk and even a faster server. All these hardware components contribute to tape backup performance.

3. Parallel Backups – using IBM BRMS (Backup, Recovery and Media Services) you can allocate more than one tape drive to a backup and significantly reduce backup times. This can be done without BRMS, but recovery will be very complicated.

4. Virtual Tape Library (VTL) – appears to the IBMi as a tape device, but is actually an appliance that has Hard Disk Drives (HDD) or Solid State Drives (SSD). Backups, and recoveries, are much faster when they use disk devices instead of tape.  A tape drive may be attached to the VTL so that a tape backup can be created for off-site storage.

5. Cloud Backup – sends your backups to a cloud provider. Since backups sent directly to the cloud can be very slow, most cloud backup providers use an appliance that works like a VTL: the backup is written to the appliance quickly, and the appliance later sends the backup data to the cloud.  With Cloud Backup, not only is your data stored safely off-site, but your backups should also be much faster.

6. Data Replication – is a High Availability/Disaster Recovery solution that uses a backup server and data replication software to keep data on the backup server in sync with the production server. In the event of a failure of the production server, the backup server is ready to take over the production workload.  Since the backup server has the same data as the production server, backups can be performed on the backup server while the production server is active.  That means no downtime for the production server for backups.

7. Hardware Replication – instead of using internal disk on your IBMi, external disk is used.  External disk systems have various options for replicating data to another external disk system.  External disk on the IBMi is a viable solution, but one that requires management and coordination of an additional resource.  Most IBMi shops prefer internal storage because it is faster, less expensive and simpler to operate.

If you don’t know if you are saving the right stuff, or if you need to improve your tape backup performance, how do you get started?  Find an expert who offers ALL of the possible solutions so that you can work with them to determine which option is the best solution for your environment.