Small-to-medium-size enterprises that leverage the cloud often overlook the critical requirement of having a disaster recovery plan (DRP) in place for their data and production environments.
Designing and maintaining an enterprise DRP can be time-consuming and cumbersome, but its value to your business will be immeasurable if it prevents true disaster from happening.
What is a disaster recovery plan?
A disaster recovery plan (DRP) is a collection of processes that quickly migrate production application traffic from one cloud region to another in the case of a major catastrophe within the primary region.
Why are disaster recovery plans important?
Disaster recovery plans are critically important for business continuity because when enterprise resources go offline, revenue can be impacted and business reputation can suffer immensely.
Disaster Recovery Plans: Common Mistakes
The most common mistake companies make with disaster recovery plans is assuming that because enterprise resources are in the cloud, they will always be available. The cloud does not come with inherent disaster recovery (DR), because it’s always possible for an entire region’s data centers to go offline simultaneously.
Furthermore, while cloud providers shoulder the responsibility for storing and protecting clients’ data and running their mission-critical applications, it’s up to the enterprise to minimize downtime should any data center or even a subset of cloud services go offline.
Even for enterprises with a significant on-premises presence, the cloud can serve as a disaster recovery strategy in which Domain Name System (DNS) records point production traffic to a cloud-based DR site.
In these cases, the cloud significantly lowers costs because the enterprise does not have to own and maintain two on-premises data centers in different regions.
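The DNS-based failover described above can be sketched roughly as follows. This is a minimal illustration, not a production design: the hostnames are hypothetical, the health probe is passed in as a function, and the "three failed probes" rule is an assumption. In practice the DNS provider's own health checks and failover records would usually do this job.

```python
from typing import Callable

# Hypothetical endpoints, for illustration only.
PRIMARY = "app.onprem.example.com"
DR_SITE = "app.dr-cloud.example.com"


def choose_dns_target(
    is_healthy: Callable[[str], bool],
    primary: str = PRIMARY,
    dr_site: str = DR_SITE,
    failures_required: int = 3,
) -> str:
    """Return the endpoint the production DNS record should point to.

    The primary is probed up to `failures_required` times. Traffic moves to
    the DR site only if every probe fails, which avoids failing over on a
    single transient error.
    """
    for _ in range(failures_required):
        if is_healthy(primary):
            return primary
    return dr_site
```

Requiring several consecutive failures trades a slightly slower failover for fewer false alarms; the right number depends on the RTO agreed for the application.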
6 Steps to Build Your Disaster Recovery Plan:
- Create a Contingency Statement
- Conduct a Detailed Business Impact Analysis (BIA)
- Draft the Contingency Plan
- Outline the Control Measures
- Implement Testing and Training
- Plan for Maintenance
1. Create a contingency statement
Begin by formalizing a set of rules or guidelines that authorizes a DRP to be developed and implemented in your enterprise.
This is the mission statement that defines the boundaries and requirements of the DRP. It can be a reflection of your enterprise service-level agreement (SLA) that states that within a certain amount of time, mission-critical components will be redundant to a certain level.
2. Conduct a detailed Business Impact Analysis (BIA)
The BIA identifies and prioritizes your mission-critical IT applications and components. It should be a collaborative effort between the infrastructure, web, and product management teams to document these components in a tiered manner.
Determine the order of application and data store importance with the following classifications:
- Absolutely mission-critical: The major revenue generators, which can tolerate only minimal downtime, measured in minutes or hours.
- Semi-important applications or components: Minor revenue generators with larger acceptable downtimes.
- Low-tier applications or components: Little to no revenue-generating impact. These might have a downtime of several hours to days with little or no impact on the mission-critical applications.
Each tier should have its own SLA and detail on potential downtime losses and how the risks will affect business operations and growth. Emphasis should be placed on two key elements:
- Recovery Time Objective (RTO): The maximum acceptable time that your application can be offline.
- Recovery Point Objective (RPO): The maximum targeted period in which data might be lost from an IT service due to a major incident; that is, how far back in time the most recent recoverable copy of the data is allowed to be.
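One way to make the tiers and their objectives concrete is to record them as data and check proposals against them. The sketch below is illustrative only: the RTO and RPO values are made-up placeholders (real targets come out of the BIA), and it assumes that worst-case data loss is roughly the interval between backups or replications.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierObjective:
    name: str
    rto: timedelta  # maximum acceptable time offline
    rpo: timedelta  # maximum acceptable window of lost data


# Placeholder values for illustration; real targets come from the BIA.
TIERS = [
    TierObjective("mission-critical", rto=timedelta(minutes=30), rpo=timedelta(minutes=5)),
    TierObjective("semi-important", rto=timedelta(hours=8), rpo=timedelta(hours=1)),
    TierObjective("low-tier", rto=timedelta(days=2), rpo=timedelta(days=1)),
]


def meets_rpo(tier: TierObjective, backup_interval: timedelta) -> bool:
    """Assumes worst-case data loss is about one backup interval."""
    return backup_interval <= tier.rpo


def meets_rto(tier: TierObjective, measured_recovery: timedelta) -> bool:
    """Compare a recovery time measured in a drill against the tier's RTO."""
    return measured_recovery <= tier.rto
```

For example, `meets_rpo(TIERS[0], timedelta(hours=1))` is false under these placeholder numbers: hourly backups cannot satisfy a five-minute RPO, which points toward continuous replication for that tier.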
Personal data needs to be included in this discussion as well, in terms of the value your enterprise places on maintaining and protecting sensitive customer data.
The collaborative effort on the BIA comes from business owners articulating the biggest revenue losses, application owners illustrating how applications would behave during a shutdown, and operations and infrastructure team members who would be responsible for enacting the DRP.
In the end, you should have a solid outline of how to implement the DR strategy for each tier and, when a region is lost, the steps for bringing a site back online with as little business loss as possible. All of these conclusions will be part of the DRP documentation.
3. Draft the contingency plan
The contingency plan identifies “who does what,” distinctly naming those responsible for enacting the various DR procedures.
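A "who does what" list is easy to keep machine-checkable. The following sketch assumes a simple mapping from procedure to named people; the procedures and role names are hypothetical, and requiring a backup person for every procedure is an added assumption rather than something the plan itself mandates.

```python
# Hypothetical procedures and role names, for illustration only.
ASSIGNMENTS = {
    "declare the disaster": {"owner": "CTO", "backup": "VP of Engineering"},
    "repoint DNS to the DR site": {"owner": "network lead", "backup": "on-call SRE"},
    "restore the database": {"owner": "database administrator", "backup": None},
}


def unowned_procedures(assignments: dict) -> list:
    """Return procedures missing either a named owner or a named backup."""
    return [
        procedure
        for procedure, people in assignments.items()
        if not people.get("owner") or not people.get("backup")
    ]
```

Running a check like this whenever the plan changes would flag "restore the database" here, because nobody is named as a backup.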
4. Outline the control measures
This is the step-by-step process of the DR procedures, consisting of three types of control measures:
- Preventive measures: To identify and reduce risk should a disaster occur (having a current backup and restore model, for example).
- Detective measures: To uncover unwanted events within the IT infrastructure (via antivirus and network monitoring software, for example) that could stand in the way of corrective measures.
- Corrective measures: To restore the system in a secondary environment following a disaster event (a precise series of steps ensuring systems are up and running within the RTO and RPO constraints). The secondary production environment can be a reduced or exact replica of the primary production environment.
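For the corrective measures, a quick sanity check is to add up the estimated duration of each recovery step and compare the total with the RTO. The sketch below assumes the steps run one after another; the step names and estimates a team would feed in are their own, and nothing here is prescribed by a particular tool.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    description: str
    estimate: timedelta  # expected duration of this recovery step


def runbook_fits_rto(steps: List[Step], rto: timedelta) -> Tuple[bool, timedelta]:
    """Return (fits, total), assuming the steps are executed sequentially."""
    total = sum((step.estimate for step in steps), timedelta())
    return total <= rto, total
```

If the total exceeds the RTO, either the steps need automating or parallelizing, or the RTO for that tier needs to be renegotiated.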
To have an effective and attainable DRP, implement Infrastructure as Code (IaC) practices for the network infrastructure and the application and data tiers, so that the secondary production environment can be built from the same definitions as the primary.
AWS uses CloudFormation templates, Microsoft Azure uses ARM (Azure Resource Manager) templates, and Google Cloud uses Cloud Deployment Manager, all of which turn infrastructure into software that can be version controlled and backed up.
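As a small illustration of infrastructure expressed as software, the sketch below builds a minimal CloudFormation template as a Python dictionary and renders it as JSON, one of the formats CloudFormation accepts. The resource name `DrBackupBucket` and the choice of a single versioned S3 bucket are assumptions made for the example; a real template would describe the full network, application, and data tiers.

```python
import json


def backup_bucket_template(description: str) -> dict:
    """Build a minimal CloudFormation template with one versioned S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": description,
        "Resources": {
            # Hypothetical logical name, chosen for this example.
            "DrBackupBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


def render(template: dict) -> str:
    """Serialize with sorted keys so diffs in version control stay stable."""
    return json.dumps(template, indent=2, sort_keys=True)
```

Because the rendered template is plain text, it can be committed to version control and deployed unchanged to both the primary and the secondary region.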
The extent of DRP appropriate for your enterprise will depend on your BIA. It might be one of the following:
- Pilot light: A small implementation in another region that can be easily spun up to take full production traffic.
- Warm site: A larger implementation that is kept running in another region and kept current through frequent replication, so it can take over with less spin-up than a pilot light.
- Multi-site implementation: Both regions serve the same amount of traffic, but each region has sufficient resources so that if one region goes offline, the other region can take all the traffic seamlessly.
As organizations review the options for a given application, as well as the cost and budget impacts, it is common for the DRP to be updated and changed over time.
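The link between the BIA and the three options can be sketched as a simple rule keyed on the RTO. The thresholds below are purely illustrative assumptions, not industry standards, and a real decision would also weigh RPO and budget.

```python
from datetime import timedelta


def suggest_strategy(rto: timedelta) -> str:
    """Map an RTO to one of the three options; thresholds are illustrative."""
    if rto <= timedelta(minutes=5):
        return "multi-site"  # both regions already serving traffic
    if rto <= timedelta(hours=1):
        return "warm site"  # running replica, needs scaling up
    return "pilot light"  # minimal footprint, spun up on demand
```

The tighter the RTO, the more of the secondary environment has to be running all the time, which is why the cost review tends to move applications between options.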
5. Implement testing and training
A DRP is a wasted effort if it’s not tried and true. Review, and whenever possible test, all the steps in the DRP quarterly or twice a year to ensure the failover process works as intended.
Senior management and every employee must be trained in their part of the DR procedures to ensure they explicitly understand how to execute the steps.
If your development environment is close in scope to your production environment, you can run the tests there.
Keep in mind that the needs of a particular department in your organization may change over time, and regular testing can help to identify those changing needs. These changes should be taken into consideration after each testing process.
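A drill is most useful when it produces numbers that can be compared with the RTO. The sketch below times each step of a test run; the step functions are whatever the team supplies, and the injectable `clock` parameter is an assumption added so the timing logic itself can be tested without waiting.

```python
import time
from datetime import timedelta
from typing import Callable, List, Tuple


def timed_drill(
    steps: List[Tuple[str, Callable[[], None]]],
    rto: timedelta,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Run each (name, action) step in order and record how long it took."""
    results = []
    start = clock()
    for name, action in steps:
        step_start = clock()
        action()
        results.append((name, clock() - step_start))
    elapsed = clock() - start
    return {
        "steps": results,
        "elapsed_seconds": elapsed,
        "within_rto": elapsed <= rto.total_seconds(),
    }
```

Keeping the per-step timings from each drill makes it easier to see which steps are growing over time and should be automated first.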
6. Plan for maintenance
The maintenance plan should be a living document, updated on a regular basis to stay in step with system enhancements. This document should be updated any time regular testing is performed.
Also, keep in mind that cloud providers such as AWS and Azure regularly release new features that might impact your DRP and that also might help to automate some steps that currently require hands-on attention.
In addition to the DRP, enterprises need to have a formal backup and restore model as part of their DR strategy. It’s surprising how many firms actually do not have a reliable model in place.
The model should be documented with detail regarding the data that is backed up, the process to restore it when needed, and how often the process is tested.
A smart enterprise backs up its data at least every month, and more often where a tier’s RPO demands it. It’s wise to have regular backup and restore exercises because if you have a problem, it’s best to encounter it in one of these regular exercises rather than during a moment of disaster.
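A backup and restore model can be monitored with a check along these lines. It is a sketch under stated assumptions: the timestamps would come from whatever backup tooling is in use, and the function name and messages are invented for the example.

```python
from datetime import datetime, timedelta
from typing import List


def backup_problems(
    last_backup: datetime,
    last_restore_test: datetime,
    now: datetime,
    rpo: timedelta,
    restore_test_interval: timedelta,
) -> List[str]:
    """Return a list of findings; an empty list means both checks pass."""
    problems = []
    if now - last_backup > rpo:
        problems.append("latest backup is older than the RPO")
    if now - last_restore_test > restore_test_interval:
        problems.append("restore has not been tested within the agreed interval")
    return problems
```

Checking the restore test date alongside the backup date reflects the point above: a backup that has never been restored is an untested assumption.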
Disaster Recovery Plan: An Essential Insurance Policy
Think of a DRP as a critical risk-mitigation insurance policy for your business, and as an essential part of your business continuity planning.
If you don’t have a DRP in place, start by having conversations with stakeholders about the BIA and control measures. These conversations will start to build a roadmap of the requirements and assumptions for an attainable, sustainable, and cost-effective DR strategy.
If you need DR planning assistance, AIM can help by performing a comprehensive DR assessment and recommending the best strategy for your organization.