
AWS Outage Analysis: October 20, 2025 - by ThousandEyes Internet Research Team


To keep up to date on all major outages, see the ThousandEyes Internet Outages Timeline at: https://www.thousandeyes.com/resources/internet-outages-timeline

Summary

On October 20, AWS experienced a significant disruption in its US-EAST-1 region, impacting multiple services that rely on AWS. See how the AWS outage unfolded in this analysis.


ThousandEyes actively monitors the reachability and performance of thousands of services and networks across the global Internet, which we use to analyze outages and other incidents. The following analysis is based on our extensive monitoring, as well as ThousandEyes’ global outage detection service, Internet Insights.


Outage Analysis

Updated on October 24, 2025

On October 20, Amazon Web Services (AWS) experienced a significant disruption in its US-EAST-1 region that lasted over 15 hours and impacted multiple services that rely on AWS including Slack, Atlassian, Snapchat, and others.


What began as a DNS race condition triggered a complex cascade of infrastructure failures across multiple dependent systems, demonstrating how a single technical defect in critical infrastructure can create ripple effects throughout interconnected cloud services.


This analysis examines what ThousandEyes network monitoring observed throughout the incident, what the patterns revealed about the nature of the outage, and key takeaways for network infrastructure teams.


Read on to learn more and check out the video below for insights from Kemal Sanjta of the ThousandEyes Internet Intelligence team.



What Happened During the AWS Outages?

At approximately 6:49 AM (UTC) on October 20, ThousandEyes monitoring detected packet loss at AWS edge nodes in Ashburn, Virginia—the first observable symptom of what would become an extended regional disruption. This loss was occurring at the last hop before AWS infrastructure, not on customer networks or intermediate providers, indicating the problem originated within AWS's network boundary.


Figure 1. ThousandEyes observed loss occurring at the last hop before AWS infrastructure

The timing of this packet loss coincided precisely with AWS's officially reported start time for the incident, when AWS noted increased error rates and latencies for AWS services in the US-EAST-1 region.


Packet loss at provider edge routers can indicate various conditions. Think of it like traffic problems: downstream congestion causes traffic to queue and overflow (packets get dropped when buffers fill), routing failures prevent forwarding (packets arrive but the router doesn't know where to send them), hardware issues cause random drops, or resource exhaustion means the router is simply overwhelmed. In this case, the packet loss location and timing suggested infrastructure-level problems—something wrong with AWS's internal systems rather than isolated failures in individual components.
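To make that concrete, below is a minimal sketch (not ThousandEyes tooling) of how packet loss toward an endpoint can be estimated from a single vantage point using the system ping utility. The hostname, probe count, and Linux-style flags are illustrative; per-hop loss like that shown in Figure 1 requires traceroute-style probing from many vantage points.

```python
# Minimal sketch: estimate packet loss toward an endpoint with the system
# `ping` utility (Linux-style flags). Hostname and probe count are illustrative.
import re
import subprocess

def loss_percent(host: str, count: int = 20, timeout_s: int = 2) -> float:
    """Return the percentage of echo requests that received no reply."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        capture_output=True, text=True,
    )
    match = re.search(r"([\d.]+)% packet loss", result.stdout)
    return float(match.group(1)) if match else 100.0

print(f"observed loss toward the endpoint: {loss_percent('dynamodb.us-east-1.amazonaws.com')}%")
```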



The Root Cause: A DNS Race Condition

AWS's post-incident report revealed that the root technical cause was a latent race condition in DynamoDB's automated DNS management system. Understanding this requires some context about how DynamoDB operates at scale.


DynamoDB maintains hundreds of thousands of DNS records to operate its massive fleet of load balancers across each region. Automation constantly updates these records to add capacity, handle hardware failures, and distribute traffic efficiently. This DNS management system is split into two independent components for reliability:


1. The DNS Planner monitors load balancer health and capacity, periodically creating new DNS plans for each service endpoint—essentially deciding which load balancers should receive traffic and with what weight distribution.


2. The DNS Enactor operates independently and redundantly across three different Availability Zones. Each instance watches for new plans and updates Route 53 (AWS's DNS service) by replacing the current plan with the new plan, helping ensure consistent updates even when multiple DNS Enactors work concurrently.

The race condition involved an unlikely interaction between two of these DNS Enactors. Here's what happened:


One DNS Enactor began applying an older plan but experienced unusually high delays, requiring retries on several endpoints. While it slowly worked through updates, the DNS Planner continued producing newer plan generations. A second DNS Enactor then picked up one of these newer plans and rapidly applied it across all endpoints.

When the second Enactor completed its updates, it triggered a cleanup process to delete significantly older plans. At that exact moment, the first (delayed) Enactor finally applied its much older plan to the regional DynamoDB endpoint, overwriting the newer plan. The timing meant that the initial check—which should have prevented applying an outdated plan—was stale due to the unusual delays.


The cleanup process then deleted this older plan (because it was many generations old), immediately removing all IP addresses for the regional endpoint. Additionally, because the active plan was deleted, the system was left in an inconsistent state that prevented any DNS Enactor from applying subsequent updates. This ultimately required manual operator intervention to correct.
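The sequence is easier to follow in code than in prose. Below is a deliberately simplified, hypothetical model of the race as AWS described it (none of the names, numbers, or data structures are real AWS internals), showing how a stale check combined with an aggressive cleanup can leave an endpoint with no IP addresses.

```python
# Hypothetical, heavily simplified model of the DNS Enactor race condition.
# All names, numbers, and data structures here are illustrative only.

endpoint = "dynamodb.us-east-1.example"

# The "DNS record" for the regional endpoint: a generation number plus IPs.
records = {endpoint: (8, ["10.0.0.1", "10.0.0.2"])}
STALE_THRESHOLD = 5   # cleanup deletes plans this many generations behind

def enactor_apply(plan_generation, ips, record_generation_at_check):
    """Apply a plan if it looked newer than the record *at check time*.
    The check can be stale if the Enactor was delayed after performing it."""
    if plan_generation <= record_generation_at_check:
        return  # would normally refuse to apply an outdated plan
    records[endpoint] = (plan_generation, ips)

def cleanup(latest_generation):
    """Delete plans that are many generations old, including (in this race)
    the plan that was just incorrectly made active."""
    generation, _ = records[endpoint]
    if latest_generation - generation >= STALE_THRESHOLD:
        records[endpoint] = (generation, [])  # endpoint left with no IPs

# The slow Enactor checks the record (generation 8), intending to apply plan 10,
# then stalls. Meanwhile the fast Enactor applies a much newer plan (20).
enactor_apply(20, ["10.0.1.1", "10.0.1.2"], record_generation_at_check=8)

# The delayed Enactor finally applies its old plan. Its check was performed
# before the delay, so it never sees generation 20 and overwrites it.
enactor_apply(10, ["10.0.0.9"], record_generation_at_check=8)

# The fast Enactor's cleanup then removes "old" plans, deleting the plan that
# is currently active and leaving the regional endpoint empty.
cleanup(latest_generation=20)

print(records[endpoint])   # (10, []) -- no IP addresses for the endpoint
```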


To understand why this had such a big impact, it helps to recall what DNS does. DNS acts like a phone book for the Internet—when an application wants to connect to DynamoDB, it first looks up the DNS name (like dynamodb.us-east-1.amazonaws.com) to find the actual IP addresses where DynamoDB servers are located. With an empty DNS record, applications literally couldn't find DynamoDB—it was like trying to call someone when the phone book has no listing for them.
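In code, that phone book lookup is a single resolver call, and it is the step that failed region-wide once the record was emptied. A minimal illustration using Python's standard library (the port and error handling are illustrative):

```python
# Resolve the regional DynamoDB endpoint to IP addresses -- the lookup every
# client performs before it can open a connection.
import socket

hostname = "dynamodb.us-east-1.amazonaws.com"
try:
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    print(f"{hostname} -> {addresses}")
except socket.gaierror as exc:
    # With an empty DNS record there is nothing to return: resolution fails
    # and no connection can even be attempted.
    print(f"DNS resolution failed for {hostname}: {exc}")
```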


Figure 2. Network and application layer symptoms during the AWS outage

By around 7:55 AM (UTC), the symptoms ThousandEyes was observing began to shift in a revealing way. The early packet loss at AWS edge routers had appeared to clear. Network conditions were no longer showing the consistent loss patterns seen at 6:49 AM (UTC). However, ThousandEyes monitoring revealed that a different issue was now happening: widespread connection timeouts at the application layer. Toward the end of this period, HTTP 503 (Service Unavailable) errors also began appearing, suggesting that some requests were reaching edge systems that were alive but too overwhelmed to process requests.


The presence of connection timeouts without corresponding network loss suggested that the network paths were clear but the infrastructure behind them couldn't process or route requests properly. Think of it like a phone ringing but no one picking up, versus getting a busy signal—both are failures, but they indicate different problems. The evolution from network-layer symptoms (packet loss) to application-layer symptoms (timeouts and HTTP 503s) indicated that while network congestion might have been addressed, the underlying systems dependent on DynamoDB remained impaired.
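That distinction between a phone that rings unanswered and a busy signal shows up directly in client behavior. Here is a rough sketch, assuming the third-party requests library and an illustrative URL, of how the different failure modes present to an application:

```python
# Classify a request failure by the way it fails: DNS/connection problems,
# timeouts (nobody answers), or an HTTP 503 (answered but overwhelmed).
import requests

def classify(url: str, timeout_s: float = 5.0) -> str:
    try:
        response = requests.get(url, timeout=timeout_s)
    except requests.exceptions.ConnectTimeout:
        return "connect timeout: the path may be fine, but nothing is answering"
    except requests.exceptions.ReadTimeout:
        return "read timeout: connection opened, but no response came back"
    except requests.exceptions.ConnectionError as exc:
        return f"connection/DNS failure: {exc}"
    if response.status_code == 503:
        return "HTTP 503: the front end is reachable but cannot process requests"
    return f"HTTP {response.status_code}: request processed"

print(classify("https://dynamodb.us-east-1.amazonaws.com/"))
```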


Figure 3. As the AWS outage progressed, ThousandEyes began observing 503 Service Unavailable errors

DNS Recovery and Persistent Cascading Effects

AWS engineers identified the DNS issue by 7:26 AM (UTC) and began implementing temporary mitigations by 8:15 AM (UTC) that enabled some internal services to connect to DynamoDB. Full DNS information was restored by 9:25 AM (UTC), and customers were able to resolve the DynamoDB endpoint and establish successful connections as cached DNS records expired between 9:25 AM and 9:40 AM (UTC).


However, restoring DNS didn't immediately restore all services. The three-hour DynamoDB outage had triggered cascading failures across systems that depend on it, and these effects persisted even after DynamoDB connectivity was restored.


Consider EC2 (Elastic Compute Cloud), AWS's virtual server service: when customers or AWS services need computing resources, they launch EC2 instances. These launches aren't simple operations; they require coordination across multiple AWS systems, including EC2's Droplet Workflow Manager (DWFM), the system responsible for managing the physical servers that host EC2 instances.


EC2's DWFM requires DynamoDB to maintain leases on these servers, checking in with each server every few minutes to maintain proper state tracking. During the DynamoDB outage, DWFM couldn't complete these required state checks, causing lease management to fail and the state to become inconsistent.
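A simplified sketch of that lease pattern helps show why the damage outlasted the DNS fix. In this hypothetical model (none of the names are AWS internals), a worker renews its lease against a state store every minute; once the store is unreachable long enough for the lease to lapse, the host's state has to be rebuilt even after the store comes back.

```python
# Hypothetical lease-renewal loop in the spirit of DWFM's host leases. The
# in-memory store stands in for DynamoDB; all names here are illustrative.
import time

LEASE_TTL_S = 300        # a lease must be renewed every few minutes
RENEW_INTERVAL_S = 60

class StateStoreUnavailable(Exception):
    """Raised when the backing store (DynamoDB, in the real system) is down."""

class InMemoryStore:
    def __init__(self):
        self.available = True
        self.leases = {}

    def put(self, host_id: str, expiry: float) -> None:
        if not self.available:
            raise StateStoreUnavailable(host_id)
        self.leases[host_id] = expiry

def lease_loop(store: InMemoryStore, host_id: str) -> None:
    expiry = time.time() + LEASE_TTL_S
    store.put(host_id, expiry)
    while True:
        time.sleep(RENEW_INTERVAL_S)
        try:
            new_expiry = time.time() + LEASE_TTL_S
            store.put(host_id, new_expiry)   # renew against the state store
            expiry = new_expiry
        except StateStoreUnavailable:
            if time.time() > expiry:
                # The lease has lapsed: this host's state is now inconsistent
                # and must be rebuilt before new launches on it can succeed,
                # which is why EC2 impact outlasted the DNS restoration.
                print(f"{host_id}: lease expired; state must be re-established")
```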


Even after DynamoDB connectivity was restored at 9:25 AM (UTC), the accumulated state inconsistencies from those three hours of lost leases continued to cause problems. New EC2 instance launches failed or experienced connectivity issues until 8:50 PM (UTC)—over 11 hours after DNS was restored. Between 12:30 PM and 9:09 PM (UTC), network load balancers experienced health check failures, resulting in increased connection errors during the broader recovery period.


Services like Amazon Connect, AWS Security Token Service, and Amazon Redshift experienced extended impact as effects rippled through their dependencies. Some Redshift clusters remained impaired even after DynamoDB recovered because EC2 launch failures prevented replacement workflows from completing. The backlog of these workflows didn't begin draining until EC2 launches started succeeding, and full restoration took until 11:05 AM (UTC) on October 21—more than a day after the initial DNS issue.


Understanding the Multiphase Recovery

This incident demonstrated how modern infrastructure failures rarely resolve in a single step, even after identifying and fixing the root technical cause. The recovery unfolded across distinct phases:


Phase 1: Identifying and fixing the root cause (DNS race condition): This phase lasted from the incident’s start at 6:49 AM until 9:40 AM (UTC), when DNS was fully restored and DynamoDB connectivity recovered. AWS engineers had to diagnose the problem, understand the inconsistent state, and manually correct it since the automated systems couldn't recover on their own.


During this phase, at the network layer, ThousandEyes observed the initial packet loss at AWS edge nodes, indicating infrastructure-level stress or congestion. This cleared relatively quickly as AWS addressed immediate routing and capacity issues.


Phase 2: Addressing cascading effects in dependent systems: Even with DynamoDB accessible, systems that had accumulated state problems during the outage needed additional recovery time. Health check systems experiencing state-related issues, lease management systems with expired or inconsistent leases, and coordination systems with stale or corrupted state all required additional remediation beyond the DNS fix.


During this period, while network symptoms decreased, ThousandEyes started seeing issues at the application layer: widespread connection timeouts emerged, indicating backend systems weren't responding even though network paths were functioning. The progression to HTTP 503 errors showed systems were reachable but unable to process requests—characteristic of systems with impaired routing logic or failing health checks.


Phase 3: Processing accumulated backlogs: Services needed to work through deferred operations, retry failed workflows, and process queued work that accumulated during the outage. Redshift cluster replacements, EC2 instance launch retries, and similar operations created backlogs that took hours to clear even after the underlying systems were functioning.


Phase 4: Application-level recovery: Even after AWS infrastructure recovery, customer applications required additional time to fully recover. This included clearing accumulated queues, re-establishing dropped connections, processing backlogs built up during degraded operation, resetting circuit breakers that opened for protection, refreshing caches with stale data, and allowing retry logic to exhaust and reset.


After AWS reported that all AWS services had returned to normal operations at 10:01 PM (UTC), ThousandEyes monitoring still observed recovery patterns that varied across different monitored endpoints during this phase.


Figure 4. As the environment recovered, ThousandEyes observed symptoms becoming more intermittent and varied

The total recovery time exceeded 15 hours not because the DNS race condition took that long to fix, but because each subsequent phase required time to complete. Attempting to accelerate recovery by skipping phases or rushing restoration of normal traffic levels would likely have triggered new problems as fragile systems struggled under load they weren't ready to handle.


Additionally, the multi-layered symptom progression—from network to application layer, from consistent failures to intermittent issues, from widespread impact to localized problems—reflected the complex dependencies in modern cloud infrastructure. Each layer of symptoms corresponded to a different aspect of the cascading failure and its recovery.


Understanding these patterns matters for incident response. When you see symptom evolution rather than simple clearing, it signals that you're dealing with cascading effects rather than an isolated failure. When recovery appears incomplete or intermittent despite addressing what seemed like the root cause, it indicates downstream effects that require additional remediation time.


Along those lines, it’s also key to understand the difference between infrastructure recovery and application recovery. When infrastructure providers restore their core systems to operational status, that's a critical milestone—but application-level recovery work often extends beyond that point. Applications need to clear accumulated queues of deferred work, re-establish connections dropped during outages, process backlogs built up during degraded operation, reset circuit breakers that opened for protection, refresh caches that may contain stale responses, and allow retry logic to exhaust and reset.
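Circuit breakers are a good example of why that application-level work takes time. Below is a minimal sketch of the pattern (thresholds and timings are illustrative and not tied to any particular library): the breaker opens while the dependency is failing and only closes again after a probe succeeds against a genuinely healthy dependency, which by definition happens after the provider's own recovery.

```python
# Minimal circuit-breaker sketch illustrating why application recovery lags
# infrastructure recovery. Thresholds and timings are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None        # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: request short-circuited")
            # Cooldown elapsed: allow one probe through ("half-open").
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # breaker opens during the outage
            raise
        # A success closes the circuit -- but only once the dependency is
        # actually healthy again, so this reset trails the provider's recovery.
        self.failures = 0
        self.opened_at = None
        return result
```

In practice, callers would wrap each downstream API call in breaker.call(...), so the first successful probes after restoration are what finally close the circuit.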


Infrastructure teams should plan for this application recovery phase in their runbooks: verify service health from actual user perspectives, monitor for elevated error rates as systems work through backlogs, check for circuit breakers or manual safety measures needing reset, validate data consistency, and plan for elevated resource usage as systems catch up on deferred work.
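For the "monitor for elevated error rates" item, even a simple sliding-window check against a user-facing endpoint can confirm whether backlogs are actually draining. A rough sketch, assuming the requests library and an entirely hypothetical URL, window size, and threshold:

```python
# Sliding-window error-rate check for the post-incident runbook phase.
from collections import deque
import time

import requests

URL = "https://app.example.com/health"   # hypothetical user-facing endpoint
WINDOW = 50                              # number of recent probes to consider
THRESHOLD = 0.05                         # flag anything above 5% errors

results = deque(maxlen=WINDOW)
while True:
    try:
        ok = requests.get(URL, timeout=5).status_code < 500
    except requests.exceptions.RequestException:
        ok = False
    results.append(ok)
    error_rate = 1 - (sum(results) / len(results))
    if len(results) == WINDOW and error_rate > THRESHOLD:
        print(f"elevated error rate while systems work through backlogs: {error_rate:.1%}")
    time.sleep(10)
```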


What Does the AWS Outage Reveal About Modern Infrastructure Outages?

The AWS incident revealed several key aspects of modern infrastructure outages and disruptions, highlighting key takeaways for infrastructure teams.


1. Single points of failure can hide in the most sophisticated architectures. AWS operates with extensive redundancy—multiple availability zones, distributed systems, automated failover. Yet a latent race condition in a single critical system (DNS management) was able to trigger widespread failures. The sophistication of the architecture actually contributed to the complexity of the failure—the more interconnected and automated systems become, the more subtle race conditions and edge cases can emerge.


The key lesson is that even with redundancy at multiple layers, critical shared subsystems represent concentration risk. DNS management, health check systems, coordination services, and similar infrastructure components often operate as hidden single points of failure. When they fail, they can affect an entire region regardless of how well individual services are architected. Infrastructure teams should identify these critical shared components and understand what happens when they fail, even if they're designed to be highly reliable.


2. Cascading failures create complexity that exceeds the initial problem. The root cause was specific—a DNS race condition that created an empty record and an inconsistent state. But the operational impact extended far beyond DNS. DynamoDB became unreachable. Services depending on DynamoDB failed. Systems that had been maintaining state via DynamoDB lost that state. Health checks failed. Lease management broke down. Instance launches failed. The cascade continued for hours after the root cause was fixed.


This is characteristic of modern distributed infrastructure. Services aren't just dependent on other services—they maintain state, hold leases, make assumptions about availability, and accumulate backlogs when dependencies fail. Fixing the root technical defect addresses only the first link in the chain. The accumulated state problems, lost leases, failed workflows, and stale caches all require additional time to resolve.

Infrastructure teams should plan for this reality: Root cause fixes are milestones, not endpoints. Recovery requires working through the cascade of effects the failure created. Understanding your dependency chains helps predict what secondary effects to expect when critical infrastructure fails.


3. Different layers can reveal different aspects of the same failure. ThousandEyes observed this outage across multiple layers—network, transport, and application—and each layer told part of the story. Network layer packet loss appeared early, suggesting infrastructure stress. This cleared relatively quickly. Application layer timeouts then emerged, indicating routing or backend selection problems. HTTP 503 errors appeared, showing systems accepting connections but unable to process them.


Each layer of symptoms corresponded to a different aspect of the cascading failure. Network symptoms reflected immediate infrastructure impact. Application symptoms revealed backend systems struggling with lost state and broken dependencies. The temporal progression of symptoms across layers helped understand how the failure was evolving and which recovery phases were complete versus ongoing.


For incident response, this means visibility at a single layer provides incomplete understanding. Network monitoring might show recovery while applications still fail. Application monitoring might miss infrastructure problems that will cause subsequent issues. Observing multiple layers simultaneously provides more complete insight into both the nature of failures and the progress of recovery.
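As an illustration of what that multi-layer view can look like from a single vantage point, the sketch below probes one target at the DNS, transport, and application layers and reports each independently; a clean TCP connect combined with HTTP failures, for example, points at the backend rather than the network. The target, port, and use of the requests library are illustrative.

```python
# Probe one target at three layers so failures can be attributed to the right
# layer. Target host and port are illustrative.
import socket

import requests

HOST = "dynamodb.us-east-1.amazonaws.com"

def dns_layer() -> str:
    try:
        socket.getaddrinfo(HOST, 443)
        return "ok"
    except socket.gaierror as exc:
        return f"fail ({exc})"

def transport_layer() -> str:
    try:
        with socket.create_connection((HOST, 443), timeout=5):
            return "ok"
    except OSError as exc:
        return f"fail ({exc})"

def application_layer() -> str:
    try:
        status = requests.get(f"https://{HOST}/", timeout=5).status_code
        return f"HTTP {status}"
    except requests.exceptions.RequestException as exc:
        return f"fail ({type(exc).__name__})"

print(f"DNS: {dns_layer()} | TCP: {transport_layer()} | HTTP: {application_layer()}")
```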


4. Recovery timelines are sums of dependent phases, not parallel operations. The AWS outage lasted over 15 hours from initial failure to complete restoration, not because any single fix took 15 hours, but because recovery required multiple sequential phases. DNS state had to be corrected manually. DynamoDB connectivity had to be restored. Systems that lost state during the outage had to rebuild it. Services with accumulated backlogs had to process them. Each phase depended on previous phases completing successfully.


Attempting to compress these timelines by skipping phases or rushing traffic restoration often backfires. Systems that are technically functioning but still fragile can't handle normal load levels. Backlogs that aren't processed create downstream problems. State that isn't properly restored causes subtle failures later.


Infrastructure teams planning for outages should account for multi-phase recovery in their incident response procedures, communication plans, and recovery time objectives. A two-hour fix might require eight additional hours of phased recovery before normal operations can safely resume. Setting accurate expectations requires understanding that recovery is a process, not an event.


5. Regional failures can have global impact through service dependencies. Although the DNS issue was isolated to US-EAST-1, services worldwide that depended on that region were affected. Applications with proper multi-region architecture still experienced failures if they relied on coordination services, data replication, or API calls that touched US-EAST-1. The interconnected nature of modern cloud infrastructure means regional boundaries don't always contain failure impact.


Even when you architect applications with geographic redundancy, you remain exposed to failures in critical infrastructure components that aren't regionally isolated. A cloud provider might have redundancy within regions, but if a critical shared subsystem fails, it can affect that entire region regardless of your application architecture.


Understanding your failure domains means recognizing that provider-level failures can bypass your architectural safeguards. Defense-in-depth strategies become important: multi-cloud approaches for truly critical workloads, tested procedures for failing over to alternative providers or regions, the ability to operate with degraded functionality, and communication plans that account for provider-level disruptions you cannot predict or prevent.


Your resilience has limits defined by your dependencies. Planning for infrastructure failures means accepting that some scenarios will exceed your architectural controls and having response plans ready when they do.


Previous Updates

[October 20, 2025, 10:00 AM PDT]

On October 20, at approximately 07:55 UTC, ThousandEyes began observing an issue affecting Amazon Web Services' Northern Virginia (US-EAST-1) region, impacting multiple dependent services including Slack, Atlassian, Snapchat, and others. Amazon attributed the outage to DNS resolution issues with the DynamoDB API, which cascaded to affect AWS services and global features dependent on US-EAST-1 endpoints.


During the incident, ThousandEyes data showed degradation in service availability with requests timing out or returning service-related errors, indicating backend infrastructure issues. Notably, ThousandEyes observed no coinciding network events, further suggesting the problem resided within AWS's internal service architecture. The service started to recover around 09:22 UTC, with the issue appearing to clear by approximately 09:35 UTC.



Stay up-to-date on the latest Internet outage intelligence by subscribing to our podcast, The Internet Report.
 
 
 
