AWS outage affecting some Darwin services

Incident Report for DarwinCX

Postmortem

Some Darwin services were affected by an AWS outage. This included delays in our reporting and some scheduled jobs, as well as issues signing into our app. Below is the latest summary that AWS has provided:

Between 11:49 PM PDT on October 19 and 2:24 AM PDT on October 20, we experienced increased error rates and latencies for AWS Services in the US-EAST-1 Region. Additionally, services or features that rely on US-EAST-1 endpoints such as IAM and DynamoDB Global Tables also experienced issues during this time.
At 12:26 AM on October 20, we identified the trigger of the event as DNS resolution issues for the regional DynamoDB service endpoints. After resolving the DynamoDB DNS issue at 2:24 AM, services began recovering, but we had a subsequent impairment in the internal subsystem of EC2 that is responsible for launching EC2 instances, due to its dependency on DynamoDB.
As we continued to work through EC2 instance launch impairments, Network Load Balancer health checks also became impaired, resulting in network connectivity issues in multiple services such as Lambda, DynamoDB, and CloudWatch. We recovered the Network Load Balancer health checks at 9:38 AM. As part of the recovery effort, we temporarily throttled some operations such as EC2 instance launches, processing of SQS queues via Lambda Event Source Mappings, and asynchronous Lambda invocations.
Over time we reduced throttling of operations and worked in parallel to resolve network connectivity issues until the services fully recovered. By 3:01 PM, all AWS services returned to normal operations. Some services such as AWS Config, Redshift, and Connect continue to have a backlog of messages that they will finish processing over the next few hours.
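For context, the trigger AWS describes above was DNS resolution failing for the regional DynamoDB endpoint. Below is a minimal sketch of how such a failure could be checked from the client side, assuming the standard public endpoint name and boto3; this is illustrative placeholder code, not Darwin's production tooling.

```python
# Minimal sketch: verify that the regional DynamoDB endpoint resolves and answers.
# Assumes the standard public endpoint name (dynamodb.us-east-1.amazonaws.com) and
# that AWS credentials are available in the environment.
import socket

import boto3
from botocore.config import Config

ENDPOINT_HOST = "dynamodb.us-east-1.amazonaws.com"


def endpoint_resolves(host: str) -> bool:
    """Return True if DNS resolution for the endpoint succeeds."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False


def dynamodb_reachable() -> bool:
    """Issue a cheap ListTables call with conservative retries to confirm the API answers."""
    client = boto3.client(
        "dynamodb",
        region_name="us-east-1",
        config=Config(retries={"max_attempts": 3, "mode": "standard"}),
    )
    try:
        client.list_tables(Limit=1)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    print("DNS ok:", endpoint_resolves(ENDPOINT_HOST))
    print("API ok:", dynamodb_reachable())
```

During the event itself, the DNS lookup would have failed, so API calls against the regional endpoint never reached the service.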

Posted Oct 21, 2025 - 14:30 UTC

Resolved

All AWS services have returned to normal operations. Some services such as AWS Config, Redshift, and Connect continue to have a backlog of messages that they will finish processing over the next few hours.
Posted Oct 21, 2025 - 03:20 UTC

Update

We are experiencing delays in reporting and some scheduled jobs due to the AWS outage.

Latest update from AWS: Service recovery across all AWS services continues to improve. We continue to reduce throttles for new EC2 Instance launches in the US-EAST-1 Region that were put in place to help mitigate impact. Lambda invocation errors have fully recovered and function errors continue to improve. We have scaled up the rate of polling SQS queues via Lambda Event Source Mappings to pre-event levels.
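The SQS polling that AWS mentions is driven by Lambda Event Source Mappings, the configuration objects that tell Lambda to poll a queue on a function's behalf. Below is a minimal sketch of how one might confirm that a function's mappings are still enabled while polling rates recover; the function name is a placeholder, not one of Darwin's.

```python
# Minimal sketch: list the event source mappings for a Lambda function and report
# their State (e.g. Enabled/Disabled) so stalled or delayed SQS polling can be spotted.
# "my-report-worker" is a placeholder function name.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")


def mapping_states(function_name):
    """Return (UUID, State) pairs for every event source mapping on the function."""
    resp = lambda_client.list_event_source_mappings(FunctionName=function_name)
    return [(m["UUID"], m["State"]) for m in resp.get("EventSourceMappings", [])]


if __name__ == "__main__":
    for uuid, state in mapping_states("my-report-worker"):
        print(uuid, state)
```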
Posted Oct 20, 2025 - 21:22 UTC

Update

We are experiencing delays in reporting and some scheduled jobs due to the AWS outage.

Latest update from AWS: We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers and are now seeing connectivity and API recovery for AWS services. We have also identified and are applying next steps to mitigate throttling of new EC2 instance launches.
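The impaired health-monitoring subsystem AWS refers to is internal to AWS and not visible to customers; what can be checked from our side is the health of targets behind our own load balancers. Below is a minimal sketch of such a check, assuming a placeholder target group ARN.

```python
# Minimal sketch: report targets behind one of our own load balancers that are not
# currently "healthy". The target group ARN is a placeholder; AWS's internal
# health-check subsystem itself is not accessible to customers.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123"
)


def unhealthy_targets(target_group_arn):
    """Return the IDs of targets that are not currently reporting 'healthy'."""
    resp = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return [
        d["Target"]["Id"]
        for d in resp["TargetHealthDescriptions"]
        if d["TargetHealth"]["State"] != "healthy"
    ]


if __name__ == "__main__":
    print("Unhealthy targets:", unhealthy_targets(TARGET_GROUP_ARN))
```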
Posted Oct 20, 2025 - 18:34 UTC

Update

We are experiencing delays in reporting and some scheduled jobs due to the AWS outage.

Latest update from AWS: We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations.
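Because AWS was throttling new EC2 instance launches during this window, launch requests could fail with throttling errors rather than hard faults. Below is a minimal sketch of launching an instance with adaptive client-side retries so throttled calls back off instead of failing immediately; the AMI ID and instance type are placeholders, and this is not Darwin's provisioning code.

```python
# Minimal sketch: launch an EC2 instance with adaptive retries so throttled
# RunInstances calls are retried with backoff. Placeholder AMI ID and instance type.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

ec2 = boto3.client(
    "ec2",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)


def launch_instance(ami_id, instance_type="t3.micro"):
    """Try to launch one instance; return its ID, or None if still throttled after retries."""
    try:
        resp = ec2.run_instances(
            ImageId=ami_id, InstanceType=instance_type, MinCount=1, MaxCount=1
        )
        return resp["Instances"][0]["InstanceId"]
    except ClientError as err:
        if err.response["Error"]["Code"] in ("RequestLimitExceeded", "Throttling"):
            return None
        raise


if __name__ == "__main__":
    print(launch_instance("ami-0123456789abcdef0"))
```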
Posted Oct 20, 2025 - 16:06 UTC

Update

We are experiencing delays in reporting and some scheduled jobs due to the AWS outage.
Posted Oct 20, 2025 - 15:56 UTC

Update

Latest updates from AWS: We have confirmed multiple AWS services experienced network connectivity issues in the US-EAST-1 Region. We are seeing early signs of recovery for the connectivity issues and are continuing to investigate the root cause.
We can confirm significant API errors and connectivity issues across multiple services in the US-EAST-1 Region. We are investigating and will provide a further update in 30 minutes, or sooner if we have additional information.
Posted Oct 20, 2025 - 15:22 UTC

Identified

Latest update from AWS: We continue to observe recovery across most of the affected AWS Services. We can confirm global services and features that rely on US-EAST-1 have also recovered. We continue to work towards full resolution and will provide updates as we have more information to share.
Posted Oct 20, 2025 - 14:08 UTC
This incident affected: AWS (AWS CloudFront).