The October 20 AWS Outage: A wake-up call for SMB resilience

By Eric Pinet

CONTEXT

On October 20, 2025, AWS’s US-East-1 region experienced a major service interruption that lasted nearly 15 hours, affecting DynamoDB, EC2, Lambda, and dozens of other services.

As Amazon CTO Werner Vogels famously said: “Everything fails all the time.” Harsh as it sounds, this statement captures a fundamental truth in the digital world: no infrastructure, whether on-premise or cloud, is immune to outages.

For Quebec and Canadian SMBs, this event reignited legitimate questions about cloud reliability and resilience strategies. Between calls for multi-cloud setups, suggestions to revert to on-premise infrastructure, and complex multi-region architecture recommendations, a pragmatic approach is essential. The reality is that every option comes with trade-offs: there is no “perfect” solution, only informed choices tailored to your business context.

What really happened: Technical analysis of the incident

The outage started with an apparently simple problem: a latent concurrency issue in DynamoDB’s automated DNS management system. This bug caused the deletion of DNS records for DynamoDB’s regional endpoint (dynamodb.us-east-1.amazonaws.com), making the service unavailable for nearly three hours.
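To make the failure mode concrete: once those records were gone, the endpoint name simply stopped resolving, and every SDK call depending on it failed at the DNS step. Below is a minimal, illustrative sketch of the kind of external check that would have surfaced the problem; the endpoint name is real, but the monitoring logic is a hypothetical example, not how AWS detects such issues.

```python
import socket

# The regional endpoint whose DNS records were deleted during the incident.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def endpoint_resolves(hostname: str) -> bool:
    """Return True if the hostname still resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    status = "resolves" if endpoint_resolves(ENDPOINT) else "does NOT resolve"
    print(f"{ENDPOINT} {status}")
```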

The domino effect across the AWS ecosystem

The DynamoDB outage triggered a chain reaction impacting critical services:

  • EC2: New instance launches failed. The Droplet Workflow Manager (DWFM), responsible for managing the physical servers hosting EC2 instances, relies on DynamoDB to maintain leases with these servers. Without DynamoDB, leases expired, rendering thousands of servers unavailable for new launches.
    When DynamoDB came back online at 2:25 a.m., DWFM tried to restore all leases simultaneously and entered a state of congestive collapse: the system was so overloaded it could not make progress.
  • Lambda: Lambda functions experienced invocation errors and processing delays. Event sources like SQS and Kinesis were particularly affected, resulting in large message backlogs.
  • Network Load Balancer (NLB): NLBs saw increased connection errors due to failing health checks. As EC2 recovered, newly launched instances were registered behind load balancers before their network configurations had finished propagating, so health checks alternated between “healthy” and “unhealthy” results.

Numbers that put it in perspective

According to AWS, us-east-1 accounts for roughly 20% of their global capacity. The incident affected part of that region for 15 hours. To put this in context: AWS guarantees 99.99% uptime for DynamoDB, which allows roughly 52 minutes of downtime per year. This single outage far exceeded that allowance, though incidents of this magnitude remain exceptional in AWS’s history.
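The arithmetic behind these figures is easy to verify; a quick sketch:

```python
# Downtime allowed per year by a few common SLA levels.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for sla in (0.999, 0.9999, 0.99999):
    allowed = MINUTES_PER_YEAR * (1 - sla)
    print(f"{sla:.3%} uptime -> about {allowed:.1f} minutes of downtime per year")

# 99.900% uptime -> about 525.6 minutes of downtime per year (~8.8 hours)
# 99.990% uptime -> about 52.6 minutes of downtime per year
# 99.999% uptime -> about 5.3 minutes of downtime per year
```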

Customers using DynamoDB Global Tables were able to fail over to replicas in other regions, demonstrating the effectiveness of multi-region architectures for those who had implemented them.
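On the client side, such a failover can stay remarkably simple. Here is a minimal sketch, assuming a hypothetical “orders” Global Table already replicated between us-east-1 and ca-central-1; the table and key names are illustrative.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Assumes a DynamoDB Global Table named "orders" (hypothetical) replicated
# between us-east-1 and ca-central-1.
PRIMARY = boto3.client("dynamodb", region_name="us-east-1")
REPLICA = boto3.client("dynamodb", region_name="ca-central-1")

def get_order(order_id: str) -> dict:
    """Read from the primary region; fall back to the replica if it is unreachable."""
    for client in (PRIMARY, REPLICA):
        try:
            resp = client.get_item(
                TableName="orders",
                Key={"order_id": {"S": order_id}},
            )
            return resp.get("Item", {})
        except (ClientError, EndpointConnectionError):
            continue  # try the next region
    raise RuntimeError("orders table unreachable in all configured regions")
```

Writes are more delicate (Global Tables resolve conflicts with last-writer-wins), which is one more reason failover strategies deserve testing before they are needed.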

Important note on billing

Customers can also request a credit when AWS fails to meet its SLA. Since this interruption far exceeded the downtime allowed by the SLAs of the affected services (DynamoDB’s SLA, for example, is 99.99%), affected customers are eligible to request compensation in the form of a service credit.

Multi-cloud: Great in theory, but unrealistic for SMBs

Faced with such an outage, the instinctive reaction is often: “Why not spread services across AWS, Azure, and Google Cloud? If one fails, the others take over!” In theory, it’s flawless. In practice, it’s rarely the right move for most SMBs.

Exponential complexity

Adopting a multi-cloud architecture means:

  • Duplicated expertise: Teams must master each provider’s services, which requires months of training and experimentation.
  • Increased development costs: Every feature must be built, tested, and maintained across platforms. Infrastructure-as-code tools must handle different syntaxes, and CI/CD pipelines become more complex.
  • Operational overhead: Monitoring, patching, and optimizing multiple environments simultaneously drives up monthly management costs.

The cost-benefit math doesn’t add up

For a typical SMB:

  • Expected benefit: Protection against major outages. AWS historically maintains uptime above 99.9%. Even assuming 2–3 major incidents per year lasting 10 hours each, that’s about 30 hours of potential downtime.
  • Revenue perspective: For an app generating $500,000/hour (≈$4.3B/year), protecting 30 hours justifies the investment. For an SMB earning $2–5M/year, 30 hours of downtime costs only $20–50k—far below multi-cloud costs.
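A back-of-the-envelope version of that comparison, using placeholder figures (the hourly revenue and multi-cloud overhead below are assumptions for illustration, not benchmarks):

```python
# Illustrative comparison only; plug in your own numbers.
expected_outage_hours_per_year = 30      # pessimistic scenario from the analysis above
hourly_revenue = 2_000                   # assumed SMB revenue per business hour
annual_multicloud_overhead = 250_000     # assumed extra engineering + operations cost

expected_annual_loss = expected_outage_hours_per_year * hourly_revenue
print(f"Expected annual downtime loss: ${expected_annual_loss:,}")       # $60,000
print(f"Assumed multi-cloud overhead:  ${annual_multicloud_overhead:,}")
print("Worth it" if expected_annual_loss > annual_multicloud_overhead else "Not worth it")
```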

On-premise: A misleading nostalgia

Some see cloud outages as validation for their skepticism: “See? AWS fails! Better to control our own infrastructure!” This overlooks a simple statistical reality: on-premise infrastructures fail too, often more frequently.

Outages happen, even if invisible

When AWS fails, the world notices. When your local data center has an outage, only your team and customers know. This visibility bias distorts perception.

According to Uptime Institute, 31% of organizations experienced at least one major on-premise outage in 2023, with an average downtime of seven hours. The most common causes: hardware failures (35%) and human error (27%).

The hidden cost of on-premise

Servers must be replaced every 3–5 years, data centers require maintenance, and network infrastructure must be kept up to date. SMBs hosting on-premise typically spend 40–60% more than in the cloud for often lower availability (95–98% vs. 99.9%+).

Multi-region: Effective, but selective

Multi-region architectures are often hailed as the ultimate solution. AWS offers Canadian regions (Montreal and Calgary), allowing compliance with data sovereignty requirements while providing redundancy. But pragmatism is still key.

When multi-region makes sense

  • Mission-critical applications: Payment systems, telemedicine platforms, financial apps where each minute of downtime is costly.
  • High SLA contracts: If clients demand 99.99% uptime (≈52 minutes downtime/year), single-region architecture is insufficient.
  • Significant revenue: Applications generating millions annually where even one hour of downtime impacts the bottom line.

True costs of multi-region

  • Data replication: DynamoDB Global Tables or Aurora Global Database cost 2–3x more than single-region databases. Cross-region replication adds data transfer fees (~$0.02/GB).
  • Duplicated services: Lambda, API Gateway, load balancers, caches—all deployed in each region, increasing infrastructure costs by 80–120%.
  • Routing complexity: Route 53 health checks, automatic failover, cross-region session management, and secret/config synchronization (see the failover sketch after this list).

  • Testing and validation: Every deployment must be tested in all regions; failover tests add 20–30% to release cycles.
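To give a concrete sense of the routing piece, here is a minimal sketch of a DNS failover pair in Route 53 using boto3. The hosted zone ID, record names, and health check ID are hypothetical placeholders; a real setup also needs the health check itself and a standby stack to fail over to.

```python
import boto3

route53 = boto3.client("route53")

# Primary record (us-east-1) guarded by a health check; the secondary record
# (ca-central-1) takes over automatically when the health check fails.
route53.change_resource_record_sets(
    HostedZoneId="Z0HYPOTHETICAL",          # placeholder hosted zone
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "app.example.com", "Type": "CNAME",
            "SetIdentifier": "primary", "Failover": "PRIMARY",
            "TTL": 60,
            "ResourceRecords": [{"Value": "nlb-primary.us-east-1.example.com"}],
            "HealthCheckId": "00000000-hypothetical-health-check",
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "app.example.com", "Type": "CNAME",
            "SetIdentifier": "secondary", "Failover": "SECONDARY",
            "TTL": 60,
            "ResourceRecords": [{"Value": "nlb-standby.ca-central-1.example.com"}],
        }},
    ]},
)
```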

A smarter approach: Selective multi-region

A realistic strategy is to identify truly critical components and deploy them multi-region, keeping others single-region:

  • Transactional databases: DynamoDB Global Tables or Aurora Global Database for critical data.
  • Stateless services: Lambda and containers can be deployed across regions at moderate cost.
  • Critical storage: Cross-region S3 replication for key documents (sketched below).
  • Non-critical services: Logs, metrics, dev environments remain single-region.

This approach can reduce overhead by 50–60% while protecting essentials.
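As an illustration of the “critical storage” item above, here is a minimal sketch of enabling cross-region S3 replication with boto3. Bucket names and the IAM role ARN are hypothetical, and versioning must already be enabled on both buckets.

```python
import boto3

s3 = boto3.client("s3")

# Replicate every new object from the source bucket to a replica bucket in
# another region. Both buckets (hypothetical names) must have versioning enabled.
s3.put_bucket_replication(
    Bucket="critical-documents",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # placeholder
        "Rules": [{
            "ID": "replicate-critical-docs",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": ""},            # replicate the whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::critical-documents-replica"},
        }],
    },
)
```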

Key lessons from the October 20 outage

Decoupled architecture: The crucial factor

Marc Sanfaçon, VP at Coveo, shared in an article that the company’s search engine (the core of its platform) remained fully operational throughout the outage. How? Thanks to a highly decoupled architecture where the critical query path was isolated from DynamoDB and new EC2 launches.
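What “isolating the critical path” can look like in practice: the hot read path serves from a cache and only refreshes from DynamoDB when it is reachable, so an outage degrades data freshness rather than availability. This is a hypothetical sketch (the table, key, and cache are illustrative), not Coveo’s actual design.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
_cache: dict[str, dict] = {}  # stand-in for Redis/ElastiCache or an in-process cache

def get_product(product_id: str) -> dict:
    """Serve the last known value if the primary data store is unavailable."""
    try:
        resp = dynamodb.get_item(
            TableName="products",                    # hypothetical table
            Key={"product_id": {"S": product_id}},
        )
        item = resp.get("Item", {})
        _cache[product_id] = item                    # refresh cache on every successful read
        return item
    except (ClientError, EndpointConnectionError):
        if product_id in _cache:
            return _cache[product_id]                # degraded but available
        raise
```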

Cédric Thibault’s analysis highlighted differences in recovery times across SaaS providers: Reddit was back online quickly, while Atlassian took several extra hours. The differentiator? Elastic architectures capable of absorbing message surges and well-prepared failover plans.

Operational preparedness: Beyond architecture

True resilience isn’t only technical. It’s also operational:

  • Response playbooks: Documented, tested procedures for every incident type. Who to call? What steps? In what order?
  • Deep observability: Real-time dashboards, smart alerts, metric correlation. You can’t fix what you can’t see.
  • Chaos engineering: Regularly simulate outages (service down, region unreachable) to validate recovery mechanisms.
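A chaos experiment doesn’t have to start with a full-blown fault-injection platform. Here is a minimal sketch of a test that reproduces the October 20 failure mode against the cached read path sketched earlier; the `catalog` module and its names are hypothetical.

```python
from unittest import mock
from botocore.exceptions import EndpointConnectionError

import catalog  # hypothetical module containing get_product() and its cache

def test_read_path_survives_dynamodb_outage():
    # Warm the cache while the dependency is still healthy.
    catalog._cache["sku-1"] = {"product_id": {"S": "sku-1"}}

    # Inject the failure: every DynamoDB call now fails, as on October 20.
    outage = EndpointConnectionError(
        endpoint_url="https://dynamodb.us-east-1.amazonaws.com"
    )
    with mock.patch.object(catalog.dynamodb, "get_item", side_effect=outage):
        # The critical path must keep answering, served from the cache.
        assert catalog.get_product("sku-1")
```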

Accept informed trade-offs

For SMBs, the question isn’t “How do I achieve 100% uptime?” That’s impossible and prohibitively expensive. The real question is: “Which level of uptime justifies what investment?”

  • Assess real downtime tolerance: How much does an hour of outage actually cost (in lost revenue, dissatisfied customers, contractual penalties)?
  • Review internal SLAs: Can your team respond at 3 a.m.? Do escalation procedures exist?
  • Prioritize quick wins: Before investing in multi-region, implement automated backups, robust health checks, and intelligent retry mechanisms (see the sketch after this list).
  • Invest gradually: Start with well-designed single-region architectures, add resilience for critical components, and expand to multi-region only when ROI is clear.
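On the “quick wins” point, retries are often the cheapest resilience you can buy. A minimal sketch of enabling the AWS SDK’s built-in adaptive retries in boto3, with shorter timeouts so failures surface quickly (the values shown are illustrative, to be tuned for your workload):

```python
import boto3
from botocore.config import Config

# Let the SDK retry throttled or transient errors with adaptive backoff
# instead of failing the request on the first error.
retry_config = Config(
    retries={"max_attempts": 10, "mode": "adaptive"},
    connect_timeout=2,   # seconds; illustrative values
    read_timeout=5,
)

dynamodb = boto3.client("dynamodb", region_name="ca-central-1", config=retry_config)
```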

Conclusion: Resilience is built, not bought

The AWS outage on October 20, 2025, reminds us of a fundamental truth: “Everything fails all the time.” Even with billions in investment and thousands of engineers, AWS isn’t immune. Your on-premise infrastructure isn’t either. And competitors claiming to run three clouds simultaneously are likely understating operational complexity.

The real question is not “How do I prevent every outage?” but “How do I build resilience proportionate to my needs and resources?” For most Quebec SMBs, the answer isn’t expensive multi-cloud setups, nostalgic on-premise solutions, or multi-region architectures worthy of GAFAM.

The answer is pragmatic: managed services in a Canadian region (Montreal or Calgary for sovereignty), systematic multi-AZ deployment, decoupled architecture, robust monitoring, and operational preparedness. Invest in resilience gradually, guided by real cost-benefit analysis, not fear or dogma.

After all, the best architecture isn’t the one that never fails. It’s the one that fails rarely, recovers quickly, and whose construction and maintenance costs are justified by the value it protects.

At Unicorne, we help Quebec businesses design cloud architectures tailored to their realities. We prioritize pragmatic solutions over technological chimeras and honest cost-benefit analysis over marketing rhetoric. Your infrastructure should serve your business, not the other way around.

Need an AWS architecture review? Our experts can assess your current setup, identify real (not theoretical) vulnerabilities, and propose a practical, financially justified improvement plan.
