Whitepapers

Resilience Lifecycle Framework

This whitepaper shares services, strategies, best practices, and mechanisms you can incorporate into your organizational and developmental processes to drive continuous resilience.

Multi-Region Fundamentals

This whitepaper is intended for cloud architects and senior leaders building workloads on AWS who are interested in using a multi-Region architecture to improve resilience for their workloads.

Advanced Multi-AZ Resilience Patterns

This whitepaper provides guidance on how to instrument workloads to detect impact from gray failures that are isolated to a single Availability Zone, and then take action to mitigate that impact in the Availability Zone.

Using AWS Fault Isolation Boundaries

This whitepaper details how AWS uses its fault isolation boundaries, including Availability Zones (AZs), Regions, control planes, and data planes, to create zonal, Regional, and global services.

Disaster Recovery of Workloads on AWS

This whitepaper outlines best practices for planning and testing disaster recovery for any workload deployed to AWS, and offers different approaches to mitigate risks and meet the recovery objectives for that workload.

Resilience Analysis Framework

This whitepaper introduces a resilience analysis framework that provides a consistent way to analyze failure modes and how they could impact your workloads.

Blogs

Resilience Best Practices

Four things everyone should know about resilience 
New to resilience? Read this blog to learn the four most important concepts to get you started on your journey to building resilient applications in the cloud.
Building resilient Well-Architected workloads using AWS Resilience Hub
Learn how to use Resilience Hub to assess and improve the resilience of your single Availability Zone (AZ) architecture based on Resilience Hub recommendations.

High Availability Patterns

Series: Creating a multi-Region application with AWS services
Learn about the specific services and features AWS offers to help you build resilient, multi-Region architectures. 
Rapidly recover from application failures in a single AZ
Performing a zonal shift with Amazon Route 53 Application Recovery Controller enables you to achieve rapid recovery from application failures in a single Availability Zone (AZ).
Automating safe, hands-off deployments
Learn how Amazon automatically validates and safely deploys any type of source change to production, and how you can apply this strategy to your work. 
Reliability, constant work, and a good cup of coffee
Learn about building simple, scalable, resilient systems using a clever coffee analogy and AWS services such as Amazon Route 53 and Amazon S3.
Making retries safe with idempotent APIs
Learn strategies for using idempotent APIs to reduce complexity and manage retries.
Choosing the right health check with Elastic Load Balancing and EC2 Auto Scaling
Customers frequently use Elastic Load Balancing (ELB) load balancers and Amazon EC2 Auto Scaling groups (ASG) to build scalable, resilient workloads.

Disaster Recovery

Series: Disaster recovery (DR) architecture on AWS
This four-part series shares best practices for disaster recovery across four strategies: backup and restore, pilot light, warm standby, and multi-site active/active. 
Creating disaster recovery mechanisms using Amazon Route 53
Modern DNS services, like Amazon Route 53, offer health checks and failover records that you can use to simplify and strengthen your DR plan. 

Chaos Engineering

Any day can be Prime Day: How Amazon.com search uses chaos engineering to handle over 84K requests per second
Discover how Amazon Search combines technology and culture to empower its builder teams, ensuring platform resilience through Chaos Engineering.

Videos

Itaú Unibanco Improves Application Resilience with AWS (1:29)
Vanguard Improves Resilience and Communication with AWS Well-Architected (1:19)
Broadridge taps AWS to help improve resilience of their critical systems (1:05)
Multi-Region design patterns and best practices (ARC306) (58:05)
Reducing your area of impact and surviving difficult days (ARC305) (49:03)
Reliable scalability: How Amazon.com scales in the cloud (ARC206) (57:37)