How cost and complexity are factored into AWS DR strategies

As reliable as they are, cloud services inevitably fail, making disaster recovery essential.

When developing a disaster recovery (DR) plan, AWS users should consider how much they are willing to spend, in both time and money, to achieve the DR results they want. They should also understand Availability Zones (AZs) and AWS Regions, and how these concepts play out in DR.

DR Levels in AWS

There are four levels of DR protection in AWS, and they vary in cost and complexity. They are listed below from the least expensive and least complex to the most expensive and most complex:

  • Backup and restore. Administrators typically choose this option for DR requirements such as minimizing data loss. Because data is recovered from cold storage, items are restored hours or days after the DR event.
  • Pilot light. This option replicates data from one AWS Region to another. It also provides a copy of the underlying application infrastructure, but resources such as servers are switched on only for testing or failover. There is some downtime, but the workloads are back online fairly quickly, within minutes or hours depending on the amount of replicated data.
  • Warm standby. This option keeps a scaled-down version of your production environment running in another AWS Region. Downtime is minimal, typically a few minutes, because the workloads are already functional in the other Region.
  • Multi-site active/active. With this option, users run workloads in multiple AWS Regions concurrently, ensuring little or no service disruption. Although this is the most complex and expensive option, it can reduce recovery times to near zero.

When choosing an AWS DR strategy, consider how much data you can afford to lose, how quickly you need to recover, and the cost of that recovery effort.
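The trade-off above can be sketched as a small selection routine. This is a minimal illustration, not AWS guidance: the function name `pick_dr_strategy` and the worst-case recovery times are assumptions loosely based on the descriptions in the list, and the sketch considers only recovery time, not data loss or budget.

```python
# Strategies ordered from cheapest/simplest to most expensive/complex.
# The worst-case recovery times (in minutes) are illustrative assumptions,
# not AWS-published figures.
STRATEGIES = [
    ("backup and restore", 24 * 60),   # hours to days
    ("pilot light", 4 * 60),           # minutes to hours; data replicated, servers off
    ("warm standby", 15),              # a few minutes
    ("multi-site active/active", 1),   # near zero
]


def pick_dr_strategy(rto_minutes: float) -> str:
    """Return the cheapest strategy whose assumed recovery time fits the RTO."""
    for name, worst_case_minutes in STRATEGIES:
        if worst_case_minutes <= rto_minutes:
            return name
    # Nothing cheaper fits, so fall back to the most protective option.
    return STRATEGIES[-1][0]
```

Under these assumed thresholds, `pick_dr_strategy(300)` selects pilot light and `pick_dr_strategy(30)` selects warm standby. Real thresholds would come from testing your own recovery runs.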

How Regions and Availability Zones Influence DR

Regions and Availability Zones are a key part of DR initiatives in AWS. A Region is a geographic location in which AWS data centers reside. An AZ is a logical group of data centers within a particular Region.

Many people think of DR as hardware redundancy and the need to distribute workloads across multiple AZs. This is mostly true. An AZ typically consists of multiple data centers that already have built-in redundancy in areas such as power and networking. However, this does not mean that the data centers in a given AZ are mirror copies of each other. Services and data can move between them, but more likely as a failover than as a fully synchronized pair. The result is partial, not complete, redundancy.

Compare AWS Availability Zones vs Regions

Each AWS Region has multiple Availability Zones. If you spread your application across two AZs in the same Region, latency between them is low, so failing over from one to the other involves minimal downtime. Each AZ can have multiple redundant data centers for failover, which provides protection within the zone. A Multi-AZ strategy is often used to protect against localized disasters, such as an earthquake or flood.

If you're concerned about an event that could affect all Availability Zones in a Region, such as a massive power outage on the East Coast, you can go a step further and span two AWS Regions. However, a multi-Region strategy adds complexity and cost.

Use Automation to Reduce Costs

An aggressive DR strategy can close most redundancy gaps, but at a cost. A full copy of your environment in another location can double your AWS bill. That's a big expense for infrastructure that sits idle until it is needed.
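The arithmetic behind that doubling can be sketched as follows. The standby fractions are hypothetical placeholders, not AWS pricing; the sketch also ignores storage and data transfer charges, which a real estimate would need to include.

```python
# Assumed fraction of the production footprint kept running in the DR
# location. These ratios are hypothetical placeholders, not AWS pricing.
STANDBY_FRACTION = {
    "backup and restore": 0.0,  # nothing running; storage cost ignored here
    "pilot light": 0.1,
    "warm standby": 0.3,
    "full copy": 1.0,           # complete duplicate environment
}


def monthly_bill(production_cost: float, strategy: str) -> float:
    """Production cost plus the assumed cost of the standby footprint."""
    return production_cost * (1.0 + STANDBY_FRACTION[strategy])
```

With a production bill of 10,000 per month, the full copy comes to 20,000, while the assumed 30% standby comes to roughly 13,000.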

This is why infrastructure as code (IaC) is ideal for disaster recovery. If your recovery time objective can tolerate a short outage, why not create the infrastructure for your data only when you need it? Automation can bring infrastructure up on demand, when you need it, rather than keeping it running in case you need it. This is a much cheaper approach to DR in AWS.
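As a rough sketch of on-demand infrastructure, the function below creates a CloudFormation stack from a template only when a failover is triggered. The function name, stack name, and template URL are hypothetical; the client is passed in so the sketch stays self-contained, and it is assumed to behave like the client returned by `boto3.client("cloudformation")`.

```python
def launch_recovery_stack(cfn_client, stack_name: str, template_url: str) -> str:
    """Create the recovery infrastructure from a template and wait for it.

    cfn_client is assumed to behave like boto3.client("cloudformation").
    """
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
    )
    # Block until CloudFormation reports that the stack has been created.
    waiter = cfn_client.get_waiter("stack_create_complete")
    waiter.wait(StackName=stack_name)
    return response["StackId"]
```

In practice, the client would be created for the recovery Region, for example `boto3.client("cloudformation", region_name="us-west-2")`, where the Region is only an example. The time the stack takes to build counts toward your recovery time, so it is worth measuring in a test run.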

You can also use smaller standby environments that still run in a limited active/active scenario. AWS Auto Scaling can grow these standby environments into a full production environment without human intervention and with limited downtime. Services may lag during recovery, but the cost savings may be significant enough to justify a short performance dip.

Automation via IaC and AWS Auto Scaling will, however, require staff time and effort for setup and testing.
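One way a failover script might grow a standby environment is to raise the capacity of its Auto Scaling group. The sketch below is an assumption about how this could be wired up, not a description of what AWS does automatically: the function name, group name, and the doubled maximum are hypothetical, and the client is assumed to behave like `boto3.client("autoscaling")`.

```python
def scale_up_standby(asg_client, group_name: str, production_size: int) -> None:
    """Grow a standby Auto Scaling group to production capacity.

    asg_client is assumed to behave like boto3.client("autoscaling").
    """
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=production_size,
        MaxSize=production_size * 2,  # assumed headroom for load spikes
        DesiredCapacity=production_size,
    )
```

New instances still need time to launch and pass health checks, which is where the brief lag in service comes from.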

If a workload calls for it, pursue a comprehensive Multi-AZ strategy. But a disaster recovery plan does not need to be built around a one-size-fits-all approach. Applying a single DR policy to all workloads would likely be costly and restrictive. Some workloads require a higher level of downtime protection, while others do not.

As an organization, make disaster recovery choices that reflect your priorities and cost preferences, and adjust them over time.

Sharon D. Cole