AuraDB instances on AWS affected by unavailability
Incident Report for Neo4j Aura
Postmortem

What happened

On 2024-04-29 at 08:57:35 UTC, Neo4j implemented a code update that had unintended consequences for our infrastructure management.

The update introduced a bug in our cluster management tool. The bug became apparent when an update to our Kubernetes clusters terminated the VMs hosting certain databases. Although the system self-healed, instances became temporarily unavailable for some AWS customers.

Since our code rollout occurs progressively across environments, the impact was staggered across AWS service tiers.

How the service was affected

Some of our AWS customers (running instances with up to 32GB of RAM) experienced a service interruption of less than 10 minutes. The interruptions occurred between 2024-04-29 08:57:35 UTC and 2024-04-29 12:33:40 UTC, when the last batch of impacted databases completed the self-healing process.

The impact occurred during multiple time intervals. Each interval affected a different group of instances, resulting in shorter recovery times for individual instances.

Because the rollout was staggered, the impact occurred in the following time intervals:

  • 08:57:35 - 09:09:50 UTC
  • 09:11:55 - 09:25:05 UTC
  • 09:37:05 - 09:48:55 UTC
  • 10:08:35 - 10:20:50 UTC
  • 11:13:00 - 11:27:10 UTC
  • 12:10:20 - 12:23:40 UTC
  • 12:19:30 - 12:33:40 UTC
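
For readers who want to reason about the windows above, the following is a minimal Python sketch that merges overlapping intervals and totals the time they cover. It is purely illustrative and is not part of Neo4j's tooling; the helper names (`parse_utc`, `merge_windows`, `total_covered`, `overall_span`) are hypothetical, and the only data used is the list of intervals copied from this report.

```python
from datetime import datetime, timedelta

# All times are UTC on the incident date stated in the report.
DAY = "2024-04-29"

# Impact windows copied verbatim from the list above.
WINDOWS = [
    ("08:57:35", "09:09:50"),
    ("09:11:55", "09:25:05"),
    ("09:37:05", "09:48:55"),
    ("10:08:35", "10:20:50"),
    ("11:13:00", "11:27:10"),
    ("12:10:20", "12:23:40"),
    ("12:19:30", "12:33:40"),
]


def parse_utc(clock):
    """Turn an HH:MM:SS string into a datetime on the incident date."""
    return datetime.strptime(f"{DAY} {clock}", "%Y-%m-%d %H:%M:%S")


def merge_windows(windows):
    """Return the windows sorted by start time, with overlapping ones merged."""
    spans = sorted((parse_utc(s), parse_utc(e)) for s, e in windows)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_covered(windows):
    """Total wall-clock time during which at least one window was open."""
    return sum((end - start for start, end in merge_windows(windows)), timedelta())


def overall_span(windows):
    """Time from the start of the first window to the end of the last one."""
    merged = merge_windows(windows)
    return merged[-1][1] - merged[0][0]


if __name__ == "__main__":
    for start, end in merge_windows(WINDOWS):
        print(f"{start:%H:%M:%S} - {end:%H:%M:%S} UTC ({end - start})")
    print("covered:", total_covered(WINDOWS))
    print("overall:", overall_span(WINDOWS))
```

Run against the listed intervals, the last two windows overlap and merge into one, so the windows together cover 1 hour 27 minutes within an overall span of 3 hours 36 minutes 5 seconds. Each window describes a group of instances, not the downtime of any single instance.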

What we are doing now

After conducting a comprehensive analysis of the situation, we are taking decisive actions to prevent such incidents from occurring in the future. Our efforts fall into two key areas:

Immediate Actions:
  • Enhancing monitoring and alerting systems for the underlying Aura infrastructure, with a focus on promptly detecting and responding to the unavailability of individual databases or small groups of databases.
  • Strengthening testing protocols and code review processes to identify and address bugs in the components managing Cloud Infrastructure at an earlier stage.

These measures are aimed at ensuring the reliability and stability of our services moving forward.

Posted Apr 30, 2024 - 15:59 UTC

Resolved
This incident is now resolved
Posted Apr 29, 2024 - 12:33 UTC
Monitoring
Some AuraDB instances running on AWS temporarily lost availability before recovering automatically.
Posted Apr 29, 2024 - 08:57 UTC
This incident affected: AuraDB Enterprise (*.databases.neo4j.io) (AuraDB Enterprise on AWS (*.databases.neo4j.io)), AuraDB Professional (*.databases.neo4j.io) (AuraDB Professional on AWS (*.databases.neo4j.io)), and AWS (Amazon Web Services) (AWS ec2-ap-south-1, AWS eks-ap-south-1, AWS s3-ap-south-1, AWS ec2-ap-southeast-2, AWS eks-ap-southeast-2, AWS s3-ap-southeast-2, AWS s3-ca-central-1, AWS ec2-ca-central-1, AWS eks-eu-central-1, AWS s3-eu-central-1, AWS ec2-eu-central-1, AWS eks-eu-west-1, AWS ec2-eu-west-1, AWS s3-eu-west-1, AWS s3-eu-west-2, AWS eks-eu-west-2, AWS ec2-eu-west-2, AWS ec2-eu-west-3, AWS eks-eu-west-3, AWS s3-eu-west-3, AWS eks-us-east-1, AWS s3-us-east-1, AWS eks-us-east-2, AWS ec2-us-east-2, AWS s3-us-east-2, AWS s3-us-west-2, AWS eks-us-west-2, AWS ec2-us-west-2).