
Disaster Recovery
As a quick thought experiment, the next time you are in your data center,
look around, and imagine for a moment that it is gone. And not just the
computers. Imagine that the entire building no longer exists. Next, imagine
that your job is to get as much of the work that was being done in the data
center going in some fashion, somewhere, as soon as possible. What would you
do?
By thinking about this, you have taken the first step of disaster recovery.
Disaster recovery is the ability to recover from an event impacting the
functioning of your organization's data center as quickly and completely as
possible. The type of disaster may vary, but the end goal is always the same.
The steps involved in disaster recovery are numerous and wide-ranging. Here
is a high-level overview of the process, along with key points to keep in
mind.
Creating, Testing, and Implementing a Disaster Recovery Plan
A backup site is vital, but it is still useless without a disaster
recovery plan. A disaster recovery plan dictates every facet of the disaster
recovery process, including but not limited to:
- What events denote possible disasters
- What people in the organization have the authority to declare a
disaster and thereby put the plan into effect
- The sequence of events necessary to prepare the backup site once a
disaster has been declared
- The roles and responsibilities of all key personnel with respect to
carrying out the plan
- An inventory of the necessary hardware and software required to
restore production
- A schedule listing the personnel who will be staffing the backup
site, including a rotation schedule to support ongoing operations without
burning out the disaster team members
- The sequence of events necessary to move operations from the backup
site to the restored/new data center
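The rotation schedule mentioned in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name build_rotation, the staff names, and the rule that nobody works two consecutive shifts are all hypothetical and do not come from any particular plan. A real schedule would also need to account for skills, travel, and time off.

```python
from itertools import cycle


def build_rotation(staff, crew_size, shifts_per_day, days):
    """Return a list of (day, shift, crew) tuples, assigning staff round-robin.

    Assumption (not from any real plan): requiring at least two full crews
    guarantees that consecutive shifts never share a team member.
    """
    if crew_size < 1:
        raise ValueError("crew_size must be at least 1")
    if len(staff) < 2 * crew_size:
        raise ValueError("need at least two full crews to avoid back-to-back shifts")

    pool = cycle(staff)  # walk the staff list repeatedly, in order
    schedule = []
    for day in range(1, days + 1):
        for shift in range(1, shifts_per_day + 1):
            crew = [next(pool) for _ in range(crew_size)]
            schedule.append((day, shift, crew))
    return schedule


if __name__ == "__main__":
    # Hypothetical disaster team members.
    team = ["Avery", "Blake", "Casey", "Devon", "Emery"]
    for day, shift, crew in build_rotation(team, crew_size=2, shifts_per_day=2, days=2):
        print(f"day {day}, shift {shift}: {', '.join(crew)}")
```

Because the staff list is walked in a fixed order, the workload is spread evenly, and extending the schedule as a disaster drags on only requires raising the number of days.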
Disaster recovery plans often fill multiple looseleaf binders. This level
of detail is vital because in the event of an emergency, the plan may well
be the only thing left from your previous data center (other than the last
off-site backups, of course) to help you rebuild and restore operations.
While disaster recovery plans should be readily available at your
workplace, copies should also be stored off-site. This way, a disaster that
destroys your workplace will not take every copy of the disaster recovery
plan with it. A good place to store a copy is your off-site backup storage
location. If it does not violate your organization's security policies,
copies may also be kept in key team members' homes, ready for instant use.
Such an important document deserves serious thought (and possibly
professional assistance to create).
And once such an important document is created, the knowledge it contains
must be tested periodically. Testing a disaster recovery plan entails going
through the actual steps of the plan: going to the backup site and setting
up the temporary data center, running applications remotely, and resuming
normal operations after the "disaster" is over. Most tests do not attempt to
perform 100% of the tasks in the plan; instead, a representative system and
application are selected to be relocated to the backup site, put into
production for a period of time, and returned to normal operation at the end
of the test.
Backup Sites: Cold, Warm, and Hot
One of the most important aspects of disaster recovery is to have a
location from which the recovery can take place. This location is known as a
backup site. In the event of a disaster, a backup
site is where your data center will be recreated, and from which you will
operate for the duration of the disaster.
There are three different types of backup sites:
- Cold backup sites
- Warm backup sites
- Hot backup sites
Obviously these terms do not refer to the temperature of the backup site.
Instead, they refer to the effort required to begin operations at the backup
site in the event of a disaster.
A cold backup site is little more than an appropriately configured space
in a building. Everything required to restore service to your users must be
procured and delivered to the site before the process of recovery can begin.
As you can imagine, the delay going from a cold backup site to full
operation can be substantial.
Cold backup sites are the least expensive sites.
A warm backup site is already stocked with hardware representing a
reasonable facsimile of that found in your data center. To restore service,
the last backups from your off-site storage facility must be delivered, and
bare metal restoration completed, before the real work of recovery can
begin.
Hot backup sites have a virtual mirror image of your current data center,
with all systems configured and waiting only for the last backups of your
user data from your off-site storage facility. As you can imagine, a hot
backup site can often be brought up to full production in no more than a few
hours.
A hot backup site is the most expensive approach to disaster recovery.
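The difference between the three site types can be summarized as the work still outstanding when a disaster is declared. The following Python sketch encodes that idea. The step wording is paraphrased from the descriptions above, and the names SiteType, RECOVERY_STEPS, and remaining_steps are hypothetical. The cold-site entry assumes that, once equipment arrives, the same backup delivery and restoration steps apply as at a warm site.

```python
from enum import Enum


class SiteType(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


# Steps still to be done at each site type before production can resume.
# Assumption: a cold site needs the warm-site steps plus procurement.
RECOVERY_STEPS = {
    SiteType.COLD: [
        "procure and deliver hardware and software",
        "deliver last backups from off-site storage",
        "complete bare metal restoration",
        "restore user data",
    ],
    SiteType.WARM: [
        "deliver last backups from off-site storage",
        "complete bare metal restoration",
        "restore user data",
    ],
    SiteType.HOT: [
        "deliver last backups from off-site storage",
        "restore user data",
    ],
}


def remaining_steps(site):
    """Return a copy of the outstanding steps for the given site type."""
    return list(RECOVERY_STEPS[site])


if __name__ == "__main__":
    for site in SiteType:
        print(f"{site.value} site: {len(remaining_steps(site))} step(s) remaining")
        for step in remaining_steps(site):
            print(f"  - {step}")
```

The shorter the list, the faster the site can be brought into production, and the more it costs to keep the site in that state of readiness.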
Backup sites can come from three different sources:
- Companies specializing in providing disaster recovery services
- Other locations owned and operated by your organization
- A mutual agreement with another organization to share data center
facilities in the event of a disaster
Each approach has its good and bad points. For example, contracting with
a disaster recovery firm often gives you access to professionals skilled in
guiding organizations through the process of creating, testing, and
implementing a disaster recovery plan. As you might imagine, these services
do not come without cost.
Using space in another facility owned and operated by your organization
can be essentially a zero-cost option, but stocking the backup site and
maintaining its readiness is still an expensive proposition.
Crafting an agreement to share data centers with another organization can
be extremely inexpensive, but long-term operations under such conditions are
usually not possible, as the host's data center must still maintain its
normal production, making the situation strained at best.
In the end, the selection of a backup site is a compromise between cost
and your organization's need for the continuation of production.
Hardware and Software Availability
Your disaster recovery plan must include methods of procuring the
necessary hardware and software for operations at the backup site. A
professionally-managed backup site may already have everything you need (or
you may need to arrange the procurement and delivery of specialized
materials the site does not have available); on the other hand, a cold
backup site means that a reliable source for every single item must be
identified. Often organizations work with manufacturers to craft agreements
for the speedy delivery of hardware and/or software in the event of a
disaster.
Availability of Backups
When a disaster is declared, it is necessary to notify your off-site
storage facility for two reasons:
- To have the last backups brought to the backup site
- To arrange regular backup pickup and dropoff to the backup site (in
support of normal backups at the backup site)
Network Connectivity to the Backup Site
A data center is not of much use if it is totally disconnected from the
rest of the organization that it serves. Depending on the disaster recovery
plan and the nature of the disaster itself, your user community might be
located miles away from the backup site. In these cases, good connectivity
is vital to restoring production.
Another kind of connectivity to keep in mind is telephone connectivity.
You must ensure that there are sufficient telephone lines available to
handle all verbal communication with your users. What might have been a
simple shout over a cubicle wall may now entail a long-distance telephone
conversation, so plan on more telephone connectivity than might at first
appear necessary.
Backup Site Staffing
The problem of staffing a backup site is multi-dimensional. One aspect of
the problem is determining the staffing required to run the backup data
center for as long as necessary. While a skeleton crew may be able to keep
things going for a short period of time, as the disaster drags on more
people will be required to maintain the effort needed to run under the
extraordinary circumstances surrounding a disaster.
This includes ensuring that personnel have sufficient time off to unwind
and possibly travel back to their homes. If the disaster was wide-ranging
enough to affect people's homes and families, additional time must be
allotted to allow them to manage their own disaster recovery. Temporary
lodging near the backup site will be necessary, along with the
transportation required to get people to and from the backup site and their
lodgings.
Often a disaster recovery plan includes on-site representative staff from
all parts of the organization's user community. This depends on the ability
of your organization to operate with a remote data center. If user
representatives must work at the backup site, similar accommodations must be
made available for them, as well.
Moving Back Toward Normalcy
Eventually, all disasters end. The disaster recovery plan must address
this phase as well. The new data center must be outfitted with all the
necessary hardware and software; while this phase often does not have the
time-critical nature of the preparations made when the disaster was
initially declared, backup sites cost money every day they are in use, so
economic concerns dictate that the switchover take place as quickly as
possible.
The last backups from the backup site must be made and delivered to the
new data center. After they are restored onto the new hardware, production
can be switched over to the new data center.
At this point the backup data center can be decommissioned, with the
disposition of all temporary hardware dictated by the final section of the
plan. Finally, a review of the plan's effectiveness is held, with any
changes recommended by the reviewing committee integrated into an updated
version of the plan.
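The overall sequence described in this section, from declaring a disaster to reviewing the plan, can be written down as an ordered list of phases. The Python sketch below is one possible way to track progress through them. The phase names are paraphrased from the text, and PHASES and next_phase are hypothetical names used only for illustration.

```python
# Ordered phases of a recovery, paraphrased from the section above.
PHASES = [
    "disaster declared",
    "backup site prepared",
    "operating from backup site",
    "new data center outfitted",
    "last backups restored at new data center",
    "production switched to new data center",
    "backup site decommissioned",
    "plan reviewed and updated",
]


def next_phase(current):
    """Return the phase that follows `current`, or None after the last one.

    Raises ValueError if `current` is not a known phase.
    """
    index = PHASES.index(current)
    if index + 1 < len(PHASES):
        return PHASES[index + 1]
    return None


if __name__ == "__main__":
    phase = PHASES[0]
    while phase is not None:
        print(phase)
        phase = next_phase(phase)
```

Keeping the phases in a single ordered list makes it easy to check that the plan documents each of them, including the often overlooked return to normal operations.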