Between 08:41 UTC on October 20th and 09:00 UTC on October 21st, Neo4j Aura experienced service disruptions affecting the us-east-1 (N. Virginia) region in AWS. The incident was triggered by a broad AWS regional outage impacting Identity and Access Management (IAM) and the EC2 control plane. This resulted in delayed backups, temporary loss of database fault tolerance for a subset of users, and internal delays in administrative actions due to toolchain failures.
A detailed summary of the AWS regional incident can be found here: https://aws.amazon.com/message/101925/
The primary cause was a series of AWS regional service disruptions in us-east-1. We will cover how each of these affected Neo4j Aura and its users. Neo4j Aura is designed to isolate regional failures. This is achieved by deploying customer instances in Orchestras that span three availability zones and have no cross-region dependencies.
AWS IAM and Identity Center became unresponsive, preventing Neo4j Aura’s automated systems from authenticating with AWS resources in us-east-1. This delayed Neo4j Aura backups, which could not be written to AWS S3 buckets. Customers were also unable to resume paused instances during this period, as the resume process could not authenticate with AWS S3 buckets to retrieve the paused data set. Neo4j Aura’s inability to authenticate with Route53 to create new DNS entries also affected Neo4j Aura DB creation, because a DNS entry is created for each new database.
AWS Network Load Balancer health check system failures occurred in us-east-1 alongside AWS EC2 “request limit exceeded” and “insufficient capacity” errors. The NLB health checks reported false negatives, which resulted in some Neo4j Aura DB clusters in us-east-1 losing fault tolerance (1 out of 3 cluster members unavailable). When this happened, the affected instances were removed by Kubernetes, and we were not able to provision new ones due to the EC2 errors noted above, leaving the clusters without fault tolerance for a prolonged period of time. All of these clusters still had full availability of the other two cluster members.
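To illustrate why a cluster with 1 out of 3 members unavailable stays available but has no remaining fault tolerance, the following is a minimal sketch of the arithmetic. It assumes a simple majority-quorum model; the function names are hypothetical and are not part of Neo4j Aura or any AWS API.

```python
def quorum_size(total_members: int) -> int:
    """Smallest majority of a cluster with `total_members` voting members."""
    return total_members // 2 + 1


def is_available(total_members: int, healthy_members: int) -> bool:
    """Assumption: the cluster serves requests while a majority is healthy."""
    return healthy_members >= quorum_size(total_members)


def tolerable_failures(total_members: int, healthy_members: int) -> int:
    """How many more members can fail before the majority is lost."""
    return max(healthy_members - quorum_size(total_members), 0)


# A healthy 3-member cluster can lose one member and keep a majority.
print(is_available(3, 3), tolerable_failures(3, 3))
# With one member removed, the cluster is still available,
# but any further failure would cost it the majority.
print(is_available(3, 2), tolerable_failures(3, 2))
```

Under this assumed model, the affected clusters correspond to the second case: two healthy members still form a majority, but no further member loss can be absorbed until a replacement is provisioned.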
Neo4j Aura largely responded as designed to these events, isolating failures to us-east-1 with minimal to no cross-system failure propagation. The cross-cloud impact on creating DNS records for new instances was the main outlier.
To mitigate the scope of impact of future regional outages, Neo4j is implementing the following measures: