Summary of the Amazon DynamoDB Service Disruption in the Northern Virginia (US-EAST-1) Region, 2025-10

Post Title Image (Illustration: Screenshot of AWS Health Dashboard at 2025-10-20 12:51 PDT. Image source: Ernest.)

✳️ tl;dr

  • The following content is from the official AWS report 1, segmented and annotated by AWS Community Hero Ernest 2 from the perspective of a developer and technical manager, with the aim of staying close to the facts and then reasoning and learning from them.
  • By studying this report, we hope that both sides (AWS, and we as AWS customers) can accumulate experience and keep improving together, whether in the cloud or on-premises.
  • Unless otherwise specified, all times below are in Pacific Daylight Time (PDT), the local time of AWS headquarters in Seattle on the U.S. West Coast.
  • This note begins with a knowledge graph, followed by a breakdown of the original official report, divided into four sections: Amazon DynamoDB, Amazon EC2, Network Load Balancer (NLB), and Other AWS Services.
  • If you have the budget to adjust your architecture for cross-Region high availability but not enough time for major architectural changes, it is worth looking at AWS services with “global” in their name. For example, “Amazon DynamoDB Global Tables”, from the same DynamoDB family, was almost unaffected during this incident (see the sketch at the end of this tl;dr).

  • We wanted to provide you with some additional information about the service disruption that occurred
    • in the N. Virginia (us-east-1) Region 3
    • on October 19 and 20, 2025.
    • While the event started at 11:48 PM PDT on October 19 (Taipei Timezone UTC+8, 2025-10-20 14:48)
    • and ended at 2:20 PM PDT on October 20 (Taipei Timezone UTC+8, 2025-10-21 05:20),
    • there were three distinct periods of impact to customer applications.
      • First, between 11:48 PM on October 19 and 2:40 AM on October 20, Amazon DynamoDB experienced increased API error rates in the N. Virginia (us-east-1) Region.
      • Second, between 5:30 AM and 2:09 PM on October 20, Network Load Balancer (NLB) experienced increased connection errors for some load balancers in the N. Virginia (us-east-1) Region.
        • This was caused by health check failures in the NLB fleet, which resulted in increased connection errors on some NLBs.
      • Third, between 2:25 AM and 10:36 AM on October 20, new EC2 instance launches failed and, while instance launches began to succeed from 10:37 AM, some newly launched instances experienced connectivity issues which were resolved by 1:50 PM.
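
As a minimal illustration of the Global Tables suggestion above: the sketch below (Python with boto3; the table name, Region list, timeouts, and error handling are illustrative assumptions, not taken from the report) prefers us-east-1 but falls back to a replica Region when the regional endpoint cannot be resolved or reached, which is roughly what callers of the public endpoint faced during the first impact window.

```python
# Minimal sketch: prefer us-east-1, fall back to a Global Tables replica Region.
# Table name, Region list, timeouts, and error handling are illustrative assumptions.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

TABLE_NAME = "orders"                    # hypothetical Global Table
REGIONS = ["us-east-1", "us-west-2"]     # primary first, then a replica Region

def get_item_with_fallback(key):
    """Read one item, trying each Region's replica in order."""
    last_error = None
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=2, read_timeout=2,
                          retries={"max_attempts": 2, "mode": "standard"}),
        )
        try:
            return client.get_item(TableName=TABLE_NAME, Key=key)
        except (EndpointConnectionError, ClientError) as err:
            last_error = err             # e.g. DNS resolution failure in us-east-1
            continue                     # try the next replica Region
    raise last_error

if __name__ == "__main__":
    response = get_item_with_fallback({"order_id": {"S": "12345"}})
    print(response.get("Item"))
```

During this event the replicas in other Regions stayed reachable, but replication to and from us-east-1 lagged, so a fallback read like this may return slightly stale data until replication catches up.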

✳️ Knowledge Graph

(More about Knowledge Graph…)

%%{init: {'theme':'default'}}%%
graph TB
    DynamoDB[DynamoDB]:::instance
    DNS_System[DNS Management System]:::concept
    DNS_Planner[DNS Planner]:::instance
    DNS_Enactor[DNS Enactor]:::instance
    Route53[Route53]:::instance
    Endpoint[Service Endpoint]:::concept
    Race_Condition[Race Condition]:::concept
    
    EC2[EC2]:::instance
    DWFM[DWFM]:::instance
    Network_Manager[Network Manager]:::instance
    Lease[Lease Mechanism]:::concept
    
    NLB[Network Load Balancer]:::instance
    Health_Check[Health Check Subsystem]:::instance
    AZ[Availability Zone]:::concept
    Failover[Failover Mechanism]:::concept
    
    Lambda[Lambda]:::instance
    ECS[ECS]:::instance
    
    Congestion[Congestive Collapse]:::concept
    Propagation[Network State Propagation]:::concept
    Throttling[Throttling]:::concept
    Recovery[Recovery Process]:::concept
    
    DynamoDB -->|provides| DNS_System
    DNS_System -->|consists of| DNS_Planner
    DNS_System -->|consists of| DNS_Enactor
    DNS_Planner -->|generates| DNS_Plan[DNS Plan]:::concept
    DNS_Enactor -->|applies plan to| Route53
    DNS_Enactor -->|updates| Endpoint
    DNS_Enactor -->|triggers| Race_Condition
    Race_Condition -->|causes| DNS_Record_Deletion[DNS Record Deletion]:::concept
    DNS_Record_Deletion -->|blocks| Endpoint
    
    Endpoint -->|enables connection to| DynamoDB
    Endpoint -->|enables connection to| EC2
    Endpoint -->|enables connection to| Lambda
    
    EC2 -->|managed by| DWFM
    DWFM -->|maintains| Lease
    DWFM -->|depends on| DynamoDB
    Lease -->|times out without| DynamoDB
    Lease -->|causes| Congestion
    Congestion -->|delays| Recovery
    
    EC2 -->|network configured by| Network_Manager
    Network_Manager -->|performs| Propagation
    Propagation -->|delays cause| Health_Check_Failure[Health Check Failure]:::concept
    
    NLB -->|uses| Health_Check
    Health_Check -->|monitors| EC2
    Health_Check_Failure -->|triggers| Failover
    Failover -->|operates across| AZ
    Failover -->|causes| Connection_Error[Connection Error]:::concept
    
    EC2 -->|hosts| Lambda
    EC2 -->|hosts| ECS
    Lambda -->|depends on| DynamoDB
    ECS -->|depends on| EC2
    
    Recovery -->|requires| Manual_Intervention[Manual Intervention]:::concept
    Recovery -->|applies| Throttling
    Throttling -->|protects| DWFM
    Throttling -->|protects| Network_Manager
    
    classDef concept fill:#FF8000,stroke:#CC6600,stroke-width:2px,color:#000
    classDef instance fill:#0080FF,stroke:#0066CC,stroke-width:2px,color:#fff

%%{init: {'theme':'default'}}%%
sequenceDiagram
    participant DNS_Enactor_1
    participant DNS_Enactor_2
    participant Route53
    participant DynamoDB_Endpoint
    participant Customer
    
    Note over DNS_Enactor_1: Experiences delays
    DNS_Enactor_2->>Route53: Apply new DNS plan
    Route53-->>DNS_Enactor_2: Success
    DNS_Enactor_2->>DNS_Enactor_2: Trigger cleanup
    DNS_Enactor_1->>Route53: Apply old DNS plan
    Note over Route53: Old plan overwrites new
    DNS_Enactor_2->>Route53: Delete old plan
    Note over Route53: All IPs removed
    Customer->>DynamoDB_Endpoint: Connection attempt
    DynamoDB_Endpoint-->>Customer: DNS resolution failure
    Note over Customer: Service disruption begins

%%{init: {'theme':'default'}}%%
graph TD
    Customer[Customer Application]:::external
    
    subgraph AWS_Services[AWS Services Layer]
        Lambda[Lambda]:::service
        ECS[ECS]:::service
        Redshift[Redshift]:::service
        Connect[Connect]:::service
    end
    
    subgraph Core_Infrastructure[Core Infrastructure]
        DynamoDB[DynamoDB]:::core
        EC2[EC2]:::core
        NLB[NLB]:::core
    end
    
    subgraph Management_Layer[Management Systems]
        DNS_System[DNS Management]:::mgmt
        DWFM[DWFM]:::mgmt
        Network_Mgr[Network Manager]:::mgmt
        Health_Check[Health Check]:::mgmt
    end
    
    Customer -->|uses| Lambda
    Customer -->|uses| ECS
    Customer -->|uses| Redshift
    Customer -->|uses| Connect
    
    Lambda -->|depends on| DynamoDB
    Lambda -->|runs on| EC2
    ECS -->|runs on| EC2
    Redshift -->|depends on| DynamoDB
    Redshift -->|runs on| EC2
    Connect -->|uses| NLB
    Connect -->|depends on| Lambda
    
    DynamoDB -->|managed by| DNS_System
    EC2 -->|managed by| DWFM
    EC2 -->|configured by| Network_Mgr
    NLB -->|monitored by| Health_Check
    
    DNS_System -.->|failure propagates| DynamoDB
    DWFM -.->|depends on| DynamoDB
    Health_Check -.->|monitors| EC2
    
    classDef external fill:#FFE6E6,stroke:#CC0000,stroke-width:2px
    classDef service fill:#E6F3FF,stroke:#0066CC,stroke-width:2px
    classDef core fill:#FFE6CC,stroke:#FF8000,stroke-width:3px
    classDef mgmt fill:#E6FFE6,stroke:#00CC00,stroke-width:2px

%%{init: {'theme':'default'}}%%
stateDiagram-v2
    [*] --> Normal_Operation
    Normal_Operation --> DNS_Failure: Race condition triggered
    
    DNS_Failure --> DynamoDB_Unavailable: Endpoint resolution fails
    DNS_Failure --> Manual_Fix_Required: System inconsistent
    
    DynamoDB_Unavailable --> DWFM_Lease_Timeout: State checks fail
    DynamoDB_Unavailable --> Lambda_Errors: API calls fail
    DynamoDB_Unavailable --> Redshift_Errors: Query processing fails
    
    DWFM_Lease_Timeout --> EC2_Launch_Failure: No valid droplets
    DWFM_Lease_Timeout --> Congestive_Collapse: Queue buildup
    
    Congestive_Collapse --> Network_Propagation_Delay: Backlog processing
    Network_Propagation_Delay --> NLB_Health_Failures: New instances not ready
    NLB_Health_Failures --> Connection_Errors: Nodes removed from service
    
    Manual_Fix_Required --> DNS_Restored: Operator intervention
    DNS_Restored --> DynamoDB_Available: Endpoint accessible
    
    DynamoDB_Available --> DWFM_Recovery: Lease establishment
    DWFM_Recovery --> EC2_Throttled: Gradual recovery
    EC2_Throttled --> EC2_Normal: Throttles removed
    
    Connection_Errors --> NLB_Recovered: Auto-failover disabled
    Lambda_Errors --> Lambda_Recovered: DynamoDB restored
    Redshift_Errors --> Redshift_Recovered: EC2 launches succeed
    
    EC2_Normal --> Normal_Operation
    NLB_Recovered --> Normal_Operation
    Lambda_Recovered --> Normal_Operation
    Redshift_Recovered --> Normal_Operation

1️⃣ Amazon DynamoDB

  • Between 11:48 PM PDT on October 19 and 2:40 AM PDT on October 20, customers experienced increased Amazon DynamoDB API error rates in the N. Virginia (us-east-1) Region.
    • During this period, customers and other AWS services with dependencies on DynamoDB were unable to establish new connections to the service.
    • The incident was triggered by a latent defect within the service’s automated DNS management system that caused endpoint resolution failures for DynamoDB.
  • Many of the largest AWS services rely extensively on DNS to provide seamless scale, fault isolation and recovery, low latency, and locality.
    • Services like DynamoDB maintain hundreds of thousands of DNS records to operate a very large heterogeneous fleet of load balancers in each Region.
      • Automation is crucial to ensuring that these DNS records are updated frequently to add additional capacity as it becomes available, to correctly handle hardware failures, and to efficiently distribute traffic to optimize customers’ experience.
      • This automation has been designed for resilience, allowing the service to recover from a wide variety of operational issues.
      • In addition to providing a public regional endpoint, this automation maintains additional DNS endpoints for several distinct DynamoDB variants, including a FIPS compliant endpoint, an IPv6 endpoint, and account-specific endpoints.
    • The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.
      • To explain this event, we need to share some details about the DynamoDB DNS management architecture.
      • The system is split across two independent components for availability reasons.
        • The first component, the DNS Planner, monitors the health and capacity of the load balancers and periodically creates a new DNS plan for each of the service’s endpoints consisting of a set of load balancers and weights.
          • We produce a single regional DNS plan, as this greatly simplifies capacity management and failure mitigation when capacity is shared across multiple endpoints, as is the case with the recently launched IPv6 endpoint and the public regional endpoint.
        • A second component, the DNS Enactor, which is designed to have minimal dependencies to allow for system recovery in any scenario, enacts DNS plans by applying the required changes in the Amazon Route53 service.
          • For resiliency, the DNS Enactor operates redundantly and fully independently in three different Availability Zones (AZs). 3
          • Each of these independent instances of the DNS Enactor looks for new plans and attempts to update Route53 by replacing the current plan with a new plan using a Route53 transaction, assuring that each endpoint is updated with a consistent plan even when multiple DNS Enactors attempt to update it concurrently.
        • The race condition involves an unlikely interaction between two of the DNS Enactors.
          • Under normal operations, a DNS Enactor picks up the latest plan and begins working through the service endpoints to apply this plan.
          • This process typically completes rapidly and does an effective job of keeping DNS state freshly updated.
          • Before it begins to apply a new plan, the DNS Enactor makes a one-time check that its plan is newer than the previously applied plan.
          • As the DNS Enactor makes its way through the list of endpoints, it is possible to encounter delays as it attempts a transaction and is blocked by another DNS Enactor updating the same endpoint.
          • In these cases, the DNS Enactor will retry each endpoint until the plan is successfully applied to all endpoints.
        • Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints.
          • As it was slowly working through the endpoints, several other things were also happening.
          • First, the DNS Planner continued to run and produced many newer generations of plans.
          • Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints.
          • The timing of these events triggered the latent race condition.
          • When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them.
          • At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan.
          • The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing.
          • Therefore, this did not prevent the older plan from overwriting the newer plan.
          • The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied.
          • As this plan was deleted, all IP addresses for the regional endpoint were immediately removed.
          • Additionally, because the active plan was deleted, the system was left in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactors.
          • This situation ultimately required manual operator intervention to correct (a minimal sketch of this check-then-act race follows at the end of this section).
  • When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB.
    • This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB.
    • Customers with DynamoDB global tables were able to successfully connect to and issue requests against their replica tables in other Regions, but experienced prolonged replication lag to and from the replica tables in the N. Virginia (us-east-1) Region.
    • Engineering teams for impacted AWS services were immediately engaged and began to investigate.
    • By 12:38 AM on October 20, our engineers had identified DynamoDB’s DNS state as the source of the outage.
      • ℹ️ Ernest’s note: Less than an hour! (Think back to your own services, observability tools, and workflows - could you locate the problem and identify the potential root cause within an hour?)
    • By 1:15 AM, the temporary mitigations that were applied enabled some internal services to connect to DynamoDB and repaired key internal tooling that unblocked further recovery.
    • By 2:25 AM, all DNS information was restored, and all global tables replicas were fully caught up by 2:32 AM.
    • Customers were able to resolve the DynamoDB endpoint and establish successful connections as cached DNS records expired between 2:25 AM and 2:40 AM.
    • This completed recovery from the primary service disruption event.
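
ℹ️ Ernest’s note: the failure above is a textbook check-then-act race. The toy model below (Python; the plan store, the in-memory stand-in for Route53, and all names are assumptions for illustration, not AWS’s actual implementation) shows why a one-time “is my plan newer?” check made before a slow apply loop is not enough: by the time the delayed Enactor finally writes, its check is stale, the older plan overwrites the newer one, and the clean-up of old generations then leaves the endpoint with no plan at all.

```python
# Toy model of the check-then-act race; names, data structures, and the in-memory
# stand-in for Route53 are assumptions for illustration, not AWS's implementation.
import threading
import time

plans = {gen: [f"10.0.0.{gen}"] for gen in range(1, 11)}  # generation -> plan (its IPs)
route53 = {"regional-endpoint": None}                     # endpoint -> live plan generation
lock = threading.Lock()

def apply_plan(generation, delay=0.0):
    # One-time freshness check, made BEFORE the (possibly slow) apply step.
    with lock:
        current = route53["regional-endpoint"]
    if current is not None and generation <= current:
        return                       # plan is not newer, skip it
    time.sleep(delay)                # the delayed Enactor stalls here; the check goes stale
    with lock:
        route53["regional-endpoint"] = generation   # may overwrite a newer plan

def clean_up(just_applied, keep_latest=3):
    # Delete plans many generations older than the one just applied, without
    # re-checking whether one of them has meanwhile become the live plan.
    with lock:
        for gen in list(plans):
            if gen <= just_applied - keep_latest:
                del plans[gen]
        if route53["regional-endpoint"] not in plans:
            route53["regional-endpoint"] = None     # live plan deleted: empty DNS record

slow = threading.Thread(target=apply_plan, args=(1, 0.2))   # delayed Enactor, old plan
fast = threading.Thread(target=apply_plan, args=(10,))      # healthy Enactor, newest plan
slow.start()
time.sleep(0.05)                     # let the delayed Enactor finish its freshness check
fast.start()
fast.join(); slow.join()             # the stale gen-1 apply lands after gen 10
clean_up(just_applied=10)            # clean-up deletes gen 1, which is now the live plan
print("live plan:", route53["regional-endpoint"])           # None -> all IPs removed
```

The usual fixes are to re-validate freshness atomically at write time (compare-and-swap on the plan generation) and to refuse to delete a plan that is still referenced as the live one, which is in the spirit of the protections AWS says it will add before re-enabling the automation.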

2️⃣ Amazon EC2

  • Between 11:48 PM PDT on October 19 and 1:50 PM PDT on October 20,
    • customers experienced increased EC2 API error rates, latencies, and instance launch failures in the N. Virginia (us-east-1) Region.
    • Existing EC2 instances that had been launched prior to the start of the event remained healthy and did not experience any impact for the duration of the event.
    • After resolving the DynamoDB DNS issue at 2:25 AM PDT, customers continued to see increased errors for launches of new instances.
    • Recovery started at 12:01 PM PDT with full EC2 recovery occurring at 1:50 PM PDT.
    • During this period new instance launches failed with either a “request limit exceeded” or “insufficient capacity” error.
  • To understand what happened,
    • we need to share some information about a few subsystems that are used for the management of EC2 instance launches,
    • as well as for configuring network connectivity for newly launched EC2 instances.
    • The first subsystem is DropletWorkflow Manager (DWFM), which is responsible for the management of all the underlying physical servers that are used by EC2 for the hosting of EC2 instances – we call these servers “droplets”.
    • The second subsystem is Network Manager, which is responsible for the management and propagation of network state to all EC2 instances and network appliances.
    • Each DWFM manages a set of droplets within each Availability Zone and maintains a lease for each droplet currently under management.
    • This lease allows DWFM to track the droplet state, ensuring that all actions from the EC2 API or within the EC2 instance itself, such as shutdown or reboot operations originating from the EC2 instance operating system, result in the correct state changes within the broader EC2 systems.
    • As part of maintaining this lease, each DWFM host has to check in and complete a state check with each droplet that it manages every few minutes.
  • Starting at 11:48 PM PDT on October 19,
    • these DWFM state checks began to fail as the process depends on DynamoDB and was unable to complete.
    • While this did not affect any running EC2 instance, it did result in the droplet needing to establish a new lease with a DWFM before further instance state changes could happen for the EC2 instances it is hosting.
    • Between 11:48 PM on October 19 and 2:24 AM on October 20,
    • leases between DWFM and droplets within the EC2 fleet slowly started to time out.
  • At 2:25 AM PDT,
    • with the recovery of the DynamoDB APIs,
    • DWFM began to re-establish leases with droplets across the EC2 fleet.
    • Since any droplet without an active lease is not considered a candidate for new EC2 launches,
    • the EC2 APIs were returning “insufficient capacity” errors for new incoming EC2 launch requests.
    • DWFM began the process of reestablishing leases with droplets across the EC2 fleet;
    • however, due to the large number of droplets,
    • efforts to establish new droplet leases took long enough that the work could not be completed before they timed out.
    • Additional work was queued to reattempt establishing the droplet lease.
    • At this point, DWFM had entered a state of congestive collapse and was unable to make forward progress in recovering droplet leases (a toy model of this failure mode follows at the end of this section).
    • Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues.
    • After attempting multiple mitigation steps,
    • at 4:14 AM engineers throttled incoming work and began selective restarts of DWFM hosts to recover from this situation.
    • Restarting the DWFM hosts cleared out the DWFM queues,
    • reduced processing times,
    • and allowed droplet leases to be established.
    • By 5:28 AM, DWFM had established leases with all droplets within the N. Virginia (us-east-1) Region and new launches were once again starting to succeed,
    • although many requests were still seeing “request limit exceeded” errors due to the request throttling that had been introduced to reduce overall request load.
  • When a new EC2 instance is launched,
    • a system called Network Manager propagates the network configuration that allows the instance to communicate with other instances within the same Virtual Private Cloud (VPC), other VPC network appliances, and the Internet.
    • At 5:28 AM PDT, shortly after the recovery of DWFM,
    • Network Manager began propagating updated network configurations to newly launched instances and instances that had been terminated during the event.
    • Since these network propagation events had been delayed by the issue with DWFM,
    • a significant backlog of network state propagations needed to be processed by Network Manager within the N. Virginia (us-east-1) Region.
    • As a result, at 6:21 AM, Network Manager started to experience increased latencies in network propagation times as it worked to process the backlog of network state changes.
    • While new EC2 instances could be launched successfully,
    • they would not have the necessary network connectivity due to the delays in network state propagation.
    • Engineers worked to reduce the load on Network Manager to address network configuration propagation times and took action to accelerate recovery.
    • By 10:36 AM, network configuration propagation times had returned to normal levels,
    • and new EC2 instance launches were once again operating normally.
  • The final step towards EC2 recovery was to fully remove the request throttles that had been put in place to reduce the load on the various EC2 subsystems.
    • As API calls and new EC2 instance launch requests stabilized,
    • at 11:23 AM PDT our engineers began relaxing request throttles as they worked towards full recovery.
    • At 1:50 PM, all EC2 APIs and new EC2 instance launches were operating normally.
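
ℹ️ Ernest’s note: the congestive collapse described above is easy to reproduce in a toy model. The sketch below (Python; the lease timeout, per-check cost, and fleet sizes are made-up constants, and this is not DWFM’s actual design) renews leases one droplet at a time against a virtual clock. When a full pass over the fleet takes longer than the lease timeout, leases renewed early in the pass expire before the pass finishes, expired droplets re-enter the queue, and the worker never drains its backlog; shrinking the batch (throttling incoming work) is exactly what lets recovery complete.

```python
# Toy model of lease expiry and congestive collapse; the lease timeout, per-check
# cost, and fleet sizes are made-up constants, not DWFM internals.
import collections
import heapq

LEASE_TIMEOUT = 5.0     # a lease stays valid this long after its last state check
CHECK_DURATION = 0.01   # virtual seconds spent on one droplet state check

def recover_leases(fleet_size, check_duration=CHECK_DURATION, work_cap_factor=10):
    """Re-establish leases one droplet at a time against a virtual clock."""
    clock = 0.0
    queue = collections.deque(range(fleet_size))   # droplets whose lease has lapsed
    leased = []                                    # min-heap of (expiry_time, droplet)
    renewals = 0
    while queue and renewals < work_cap_factor * fleet_size:   # cap the demo
        droplet = queue.popleft()
        clock += check_duration                    # perform one state check
        heapq.heappush(leased, (clock + LEASE_TIMEOUT, droplet))
        renewals += 1
        # Any lease that expired while we were busy re-enters the work queue.
        while leased and leased[0][0] <= clock:
            _, expired = heapq.heappop(leased)
            queue.append(expired)
    return renewals, len(leased)                   # (work performed, droplets holding a lease)

if __name__ == "__main__":
    # 1000 droplets x 0.01s per check = 10s per pass, which exceeds the 5s lease
    # timeout: leases expire faster than they are re-established, the queue never
    # drains, and the loop only stops because of the demo's work cap.
    print(recover_leases(fleet_size=1000))   # roughly (10000, 500)
    # Throttling the work so a pass fits inside the lease timeout lets recovery finish.
    print(recover_leases(fleet_size=200))    # (200, 200): every droplet holds a lease
```

This is why the mitigation at 4:14 AM combined throttling of incoming work with selective restarts that cleared the queues: both reduce the amount of work competing with lease re-establishment.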

3️⃣ Network Load Balancer (NLB)

  • The delays in network state propagations for newly launched EC2 instances also caused impact to the Network Load Balancer (NLB) service and AWS services that use NLB.
    • Between 5:30 AM and 2:09 PM PDT on October 20 some customers experienced increased connection errors on their NLBs in the N. Virginia (us-east-1) Region.
    • NLB is built on top of a highly scalable, multi-tenant architecture that provides load balancing endpoints and routes traffic to backend targets, which are typically EC2 instances.
    • The architecture also makes use of a separate health check subsystem that regularly executes health checks against all nodes within the NLB architecture and will remove any nodes from service that are considered unhealthy.
  • During the event, the NLB health checking subsystem began to experience increased health check failures.
    • This was caused by the health checking subsystem bringing new EC2 instances into service while the network state for those instances had not yet fully propagated.
    • This meant that in some cases health checks would fail even though the underlying NLB node and backend targets were healthy.
    • This resulted in health checks alternating between failing and healthy, which caused NLB nodes and backend targets to be removed from DNS,
    • only to be returned to service when the next health check succeeded (see the sketch at the end of this section).
  • Our monitoring systems detected this at 6:52 AM,
    • and engineers began working to remediate the issue.
    • The alternating health check results increased the load on the health check subsystem,
    • causing it to degrade,
    • resulting in delays in health checks and triggering automatic AZ DNS failover to occur.
    • For multi-AZ load balancers,
    • this resulted in capacity being taken out of service.
    • In this case,
    • an application experienced increased connection errors if the remaining healthy capacity was insufficient to carry the application load.
    • At 9:36 AM, engineers disabled automatic health check failovers for NLB,
    • allowing all available healthy NLB nodes and backend targets to be brought back into service.
    • This resolved the increased connection errors to affected load balancers.
    • Shortly after EC2 recovered, we re-enabled automatic DNS health check failover at 2:09 PM.
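
ℹ️ Ernest’s note: the improvement AWS describes later (a velocity control on how much capacity health checks may remove) can be sketched very simply. The code below (Python; the 20% threshold, the class and function names, and the “fail static” behavior are my own assumptions, not NLB’s implementation) keeps nodes in service when failing health checks would remove more than a bounded fraction of the fleet at once, on the reasoning that a sudden mass failure is more likely a health-check or propagation problem than a genuine node problem.

```python
# Illustrative velocity control for health-check-driven removals; the threshold,
# names, and "fail static" behavior are assumptions, not NLB's implementation.
from dataclasses import dataclass

MAX_REMOVAL_FRACTION = 0.2   # never remove more than 20% of nodes in one evaluation

@dataclass
class Node:
    name: str
    healthy: bool            # latest health-check result (may be flapping)

def nodes_to_remove(nodes, max_removal_fraction=MAX_REMOVAL_FRACTION):
    """Return the nodes that may safely be taken out of DNS this round.

    If failing health checks would remove more than the allowed fraction of the
    fleet, keep everything in service ("fail static") and leave the decision to
    an operator: a sudden mass failure is more likely a health-check or
    propagation problem than a genuine node problem.
    """
    failing = [node for node in nodes if not node.healthy]
    limit = int(len(nodes) * max_removal_fraction)
    if len(failing) > limit:
        return []            # velocity control engaged: remove nothing this round
    return failing

if __name__ == "__main__":
    # Half the fleet "fails" because its network state has not propagated yet;
    # removing it all would cause the connection errors described above.
    fleet = [Node(f"nlb-node-{i}", healthy=(i % 2 == 0)) for i in range(10)]
    print([node.name for node in nodes_to_remove(fleet)])   # [] -> capacity stays in service
```

Disabling automatic health check failover at 9:36 AM had a similar effect during the event: it stopped healthy capacity from being pulled out of service by flapping checks.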

4️⃣ Other AWS Services

  • Between October 19 at 11:51 PM PDT and October 20 at 2:15 PM PDT,
    • customers experienced API errors and latencies for Lambda functions in the N. Virginia (us-east-1) Region.
    • Initially, DynamoDB endpoint issues prevented function creation and updates
    • and caused processing delays for SQS/Kinesis event sources as well as invocation errors.
    • By 2:24 AM,
    • service operations recovered except for SQS queue processing,
    • which remained impacted because an internal subsystem responsible for polling SQS queues failed and did not recover automatically.
    • We restored this subsystem at 4:40 AM and processed all message backlogs by 6:00 AM.
    • Starting at 7:04 AM,
    • NLB health check failures triggered instance terminations, leaving a subset of Lambda internal systems under-scaled.
    • With EC2 launches still impaired,
    • we throttled Lambda Event Source Mappings and asynchronous invocations to prioritize latency-sensitive synchronous invocations.
    • By 11:27 AM,
    • sufficient capacity was restored, and errors subsided.
    • We then gradually reduced throttling and processed all backlogs by 2:15 PM, and normal service operations resumed.
  • Between October 19 at 11:45 PM PDT and October 20 at 2:20 PM PDT,
    • customers experienced container launch failures and cluster scaling delays across Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), and AWS Fargate in the N. Virginia (us-east-1) Region.
    • These services were recovered by 2:20 PM.
  • Between October 19 at 11:56 PM PDT and October 20 at 1:20 PM PDT,
    • Amazon Connect customers experienced elevated errors handling calls, chats, and cases in the N. Virginia (us-east-1) Region.
    • Following the restoration of DynamoDB endpoints,
    • most Connect features recovered, although customers continued to experience elevated errors for chats until 5:00 AM.
    • Starting at 7:04 AM,
    • customers again experienced increased errors handling new calls, chats, tasks, emails, and cases, which was caused by impact to the NLBs used by Connect as well as increased error rates and latencies for Lambda function invocations.
    • Inbound callers experienced busy tones, error messages, or failed connections.
    • Both agent-initiated and API-initiated outbound calls failed.
    • Answered calls experienced prompt playback failures, routing failures to agents, or dead-air audio.
    • Additionally, agents experienced elevated errors handling contacts, and some agents were unable to sign in.
    • Customers also faced elevated errors accessing APIs and Contact Search.
    • Real-time and Historical dashboard updates and Data Lake data updates were delayed,
    • and all data will be backfilled by October 28.
    • Service availability was restored at 1:20 PM as Lambda function invocation errors recovered.
  • Between October 19 at 11:51 PM PDT and October 20 at 9:59 AM PDT,
    • customers experienced AWS Security Token Service (STS) API errors and latency in the N. Virginia (us-east-1) Region.
    • STS recovered at 1:19 AM after the restoration of internal DynamoDB endpoints.
    • Between 8:31 AM and 9:59 AM,
    • STS API error rates and latency increased again as a result of NLB health check failures.
    • By 9:59 AM, we recovered from the NLB health check failures, and the service began normal operations.
  • Between October 19 at 11:51 PM PDT and October 20 at 1:25 AM PDT,
    • AWS customers attempting to sign into the AWS Management Console using an IAM user experienced increased authentication failures due to underlying DynamoDB issues in the N. Virginia (us-east-1) Region.
    • Customers with IAM Identity Center configured in N. Virginia (us-east-1) Region were also unable to sign in using Identity Center.
    • Customers using their root credential, and customers using identity federation configured to use signin.aws.amazon.com experienced errors when trying to log into the AWS Management Console in regions outside of the N. Virginia (us-east-1) Region.
    • As DynamoDB endpoints became accessible at 1:25 AM, the service began normal operations.
  • Between October 19 at 11:47 PM PDT and October 20 at 2:21 AM PDT,
    • customers experienced API errors when creating and modifying Redshift clusters or issuing queries against existing clusters in the N. Virginia (us-east-1) Region.
    • Redshift query processing relies on DynamoDB endpoints to read and write data from clusters.
    • As DynamoDB endpoints recovered, Redshift query operations resumed and by 2:21 AM, Redshift customers were successfully querying clusters as well as creating and modifying cluster configurations.
    • However, some Redshift compute clusters remained impaired and unavailable for querying after the DynamoDB endpoints were restored to normal operations.
    • As credentials expire for cluster nodes without being refreshed, Redshift automation triggers workflows to replace the underlying EC2 hosts with new instances.
    • With EC2 launches impaired, these workflows were blocked, putting clusters in a “modifying” state that prevented query processing and making the cluster unavailable for workloads.
    • At 6:45 AM, our engineers took action to stop the workflow backlog from growing and when Redshift clusters started to launch replacement instances at 2:46 PM, the backlog of workflows began draining.
    • By 4:05 AM PDT on October 21, AWS operators completed restoring availability for clusters impaired by replacement workflows.
    • In addition to the cluster availability impairment, between October 19 at 11:47 PM and October 20 at 1:20 AM, Amazon Redshift customers in all AWS Regions were unable to use IAM user credentials for executing queries, due to a Redshift defect that used an IAM API in the N. Virginia (us-east-1) Region to resolve user groups.
    • As a result, IAM’s impairment during this period caused Redshift to be unable to execute these queries.
    • Redshift customers in any AWS Region who use “local” users to connect to their Redshift clusters were unaffected.
  • Between October 19 at 11:48 PM PDT and October 20 at 2:40 AM PDT,
    • customers were unable to create, view, and update support cases through the AWS Support Console and API.
    • While the Support Center successfully failed over to another region as designed, a subsystem responsible for account metadata began providing responses that prevented legitimate users from accessing the AWS Support Center.
    • While we have designed the Support Center to bypass this system if responses were unsuccessful, in this event, this subsystem was returning invalid responses.
    • These invalid responses resulted in the system unexpectedly blocking legitimate users from accessing support case functions.
    • The issue was mitigated at 2:40 AM, and we took additional actions to prevent recurrence at 2:58 AM.
  • Other AWS services that rely on DynamoDB, new EC2 instance launches, Lambda invocations, and Fargate task launches
    • such as Managed Workflows for Apache Airflow and Outposts lifecycle operations, were also impacted in the N. Virginia (us-east-1) Region.
    • Refer to the event history for the full list of impacted services.
  • We are making several changes as a result of this operational event.
    • We have already disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide.
    • In advance of re-enabling this automation, we will fix the race condition scenario and add additional protections to prevent the application of incorrect DNS plans.
    • For NLB, we are adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause AZ failover.
    • For EC2, we are building an additional test suite to augment our existing scale testing, which will exercise the DWFM recovery workflow to identify any future regressions.
    • We will improve the throttling mechanism in our EC2 data propagation systems to rate limit incoming work based on the size of the waiting queue, protecting the service during periods of high load (see the sketch at the end of this section).
    • Finally, as we continue to work through the details of this event across all AWS services, we will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery.
  • In closing
    • We apologize for the impact this event caused our customers.
    • While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses.
    • We know this event impacted many customers in significant ways.
    • We will do everything we can to learn from this event and use it to improve our availability even further.
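
ℹ️ Ernest’s note: two of the action items above boil down to the same pattern, namely rate limiting incoming work based on the size of the waiting queue. A minimal sketch of such an admission controller is below (Python; the queue depth limit and the class and method names are illustrative assumptions, not any AWS subsystem’s implementation): once the backlog passes a bound, new work is rejected so callers back off and the workers can actually drain the queue instead of collapsing under it.

```python
# Illustrative queue-depth-based admission control; the depth limit and names are
# assumptions for illustration, not any AWS subsystem's implementation.
import collections

class AdmissionController:
    """Reject new work once the waiting queue is already too deep."""

    def __init__(self, max_queue_depth=1000):
        self.max_queue_depth = max_queue_depth
        self.queue = collections.deque()
        self.rejected = 0

    def submit(self, work_item):
        """Accept work only while the backlog is below the configured bound."""
        if len(self.queue) >= self.max_queue_depth:
            self.rejected += 1          # caller is expected to back off and retry later
            return False
        self.queue.append(work_item)
        return True

    def drain(self, batch_size=100):
        """Process up to batch_size queued items; returns how many were handled."""
        handled = 0
        while self.queue and handled < batch_size:
            self.queue.popleft()        # stand-in for the real propagation work
            handled += 1
        return handled

if __name__ == "__main__":
    ctrl = AdmissionController(max_queue_depth=5)
    accepted = [ctrl.submit(f"propagate-network-state-{i}") for i in range(8)]
    print(accepted)                     # [True] * 5 + [False] * 3: excess work is shed
    print(ctrl.drain(), ctrl.rejected)  # 5 processed, 3 callers asked to retry
```

Shedding load early (and telling callers to retry later) is usually cheaper than accepting work that will time out anyway; the DWFM recovery above is a concrete example of what happens without such a bound.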

✳️ Bottom Line

At the time of the incident, I was working on a static website deployment and encountered an unfamiliar error message while invalidating the CloudFront CDN cache. I thought it was odd - this type of error message shouldn’t occur. Then I gradually noticed that some websites were accessible but certain features were failing or stuck loading indefinitely. That’s when I realized this might be an AWS service disruption and quickly switched to the AWS Health Dashboard to monitor the situation.

I tracked the entire process that day 4 (though, to allow for candid discussion and exchange of opinions, access is limited to friends I’ve met in person). It was somewhat alarming: services were initially only marked as “Impacted,” and then DynamoDB was suddenly marked as “Disrupted.” I thought this was serious, as DynamoDB is the underlying database for many services. Yet in less than an hour, AWS engineers had already posted an update saying they had identified the likely problem and cause. I was impressed, and figured they must regularly run drills for scenarios like this.

Coincidentally, shortly after AWS’s DNS-related service disruption of 2025-10-20 concluded, Microsoft also experienced a DNS outage on 2025-10-29 5. At one point even the Microsoft official website was inaccessible, affecting services like Microsoft Azure and Microsoft 365. This made me wonder whether there might be DNS-related specification changes, constraints, or compliance requirements tied to this timeframe. Sure enough, I found the U.S. Federal Government’s OMB Memorandum M-21-07, issued in November 2020, which mandates that by the end of fiscal year 2025 (September 30, 2025) at least 80% of IP-enabled assets on federal networks must operate in IPv6-only environments. The phased targets are:

  • End of FY 2023: At least 20% IPv6-only
  • End of FY 2024: At least 50% IPv6-only
  • End of FY 2025: At least 80% IPv6-only

This may help explain why there was news on 2025-10-09 that “Amazon DynamoDB now supports Internet Protocol version 6 (IPv6)”. It’s unclear whether this is directly or indirectly related to this service disruption, but it certainly invites the association.
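
If you want to check for yourself whether an endpoint already publishes IPv6 (AAAA) records, a quick resolution check is enough. The sketch below uses only Python’s standard library; the dual-stack hostname shown follows AWS’s service.region.api.aws dual-stack naming convention and should be verified against the DynamoDB documentation.

```python
# Quick AAAA-record check using only the standard library; the dual-stack hostname
# follows AWS's service.region.api.aws naming convention and should be verified
# against the DynamoDB documentation.
import socket

def address_families(host):
    try:
        infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as err:
        return f"{host}: resolution failed ({err})"
    families = {"IPv6" if family == socket.AF_INET6 else "IPv4" for family, *_ in infos}
    return f"{host}: {sorted(families)}"

if __name__ == "__main__":
    print(address_families("dynamodb.us-east-1.amazonaws.com"))   # public regional endpoint
    print(address_families("dynamodb.us-east-1.api.aws"))         # dual-stack endpoint (assumed)
```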

DNS- and IPv6-related service disruptions may be worth watching over the next few months. AWS and Microsoft together hold over half of the cloud service market, but other U.S. service providers operating federal assets may also run into issues in the coming months, though likely with a smaller impact scope.

✳️ Further Reading