Endpoint ingress degradation in AWS us-east-1, us-west-2 and ap-northeast-2

Incident Report for Neo4j Aura

Postmortem

What happened

A configuration change in Aura’s DB Ingress service caused intermittent connectivity issues for customer databases across multiple AWS regions. The issue was due to a misconfiguration of AWS Network Load Balancers (NLBs), which resulted in dropped inbound traffic.

A change intended to make the service more efficient was rolled out on 2025-01-15. It reduced db-ingress replicas to three per region, which affected AWS regions with more than three availability zones (AZs). When a request landed in an AZ without a db-ingress pod, the AWS Network Load Balancer (NLB) could not route it: cross-zone load balancing was not enabled, so the NLB could not forward traffic to pods in other zones. This led to intermittent connection failures in us-east-1, us-west-2, and ap-northeast-2. A fix was deployed on 2025-01-16.
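To make the failure mode concrete, here is a minimal sketch (a hypothetical model for illustration, not Aura's actual code) of why reducing replicas below the AZ count drops traffic when cross-zone load balancing is off. It assumes the NLB spreads new connections evenly across its per-AZ endpoints:

```python
from fractions import Fraction

def expected_drop_rate(total_azs, azs_with_pods, cross_zone_enabled):
    """Approximate fraction of inbound connections dropped.

    Hypothetical model: the NLB spreads connections evenly across
    its per-AZ endpoints. With cross-zone load balancing disabled,
    a connection that lands in an AZ with no healthy db-ingress pod
    has no target and is dropped; with it enabled, the NLB can
    forward across zones, so a coverage gap does not drop traffic.
    """
    if cross_zone_enabled:
        return Fraction(0)
    return Fraction(total_azs - azs_with_pods, total_azs)

# Four AZs, db-ingress reduced to three replicas (one AZ uncovered):
print(expected_drop_rate(4, 3, cross_zone_enabled=False))  # 1/4 of connections dropped
print(expected_drop_rate(4, 3, cross_zone_enabled=True))   # 0
```

Under this simplified model, a region with four AZs and three replicas would drop roughly a quarter of new connections, broadly consistent with the failure rates observed below; real routing also depends on DNS weighting and health checks.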

How the service was affected

Intermittent connectivity failures impacted the Aura Professional, Business Critical, and DS Enterprise tiers. Traffic was dropped by the AWS Network Load Balancer (NLB) due to improper routing. As a result, failure rates for queries using the Bolt protocol reached approximately 40% in us-east-1, 25% in us-west-2, and 25% in ap-northeast-2.

We reverted the change that reduced the number of db-ingress replicas, ensuring that instances were running in all availability zones; this restored normal database operations.

What we are doing now

Neo4j remains committed to providing reliable service and is implementing additional safeguards to prevent similar incidents in the future.

Specifically, we are making the following improvements:

  • Enabling cross-zone load balancing, so traffic is correctly distributed across all availability zones.
  • Adding automated monitoring to detect ingress failures before a deployment completes.
  • Improving alarms to detect and respond to connectivity issues more quickly.
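As a concrete example of the first item, cross-zone load balancing can be enabled on an NLB with a single attribute change. This is a sketch using the AWS CLI; the `$NLB_ARN` value is a placeholder for the load balancer's ARN, and this shows the general mechanism rather than Aura's actual rollout:

```shell
# Enable cross-zone load balancing on an existing NLB so that
# traffic arriving in any AZ can be forwarded to targets in other AZs.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn "$NLB_ARN" \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true
```

Note that enabling cross-zone load balancing on an NLB may incur inter-AZ data transfer charges, which is one reason it is not enabled by default.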
Posted Feb 21, 2025 - 13:41 UTC

Resolved

The incident is resolved. A postmortem will be available once it is complete.
Posted Jan 16, 2025 - 20:16 UTC

Monitoring

A fix has been deployed across all impacted regions. We will continue to monitor to ensure the service remains healthy before marking this incident Resolved.
Posted Jan 16, 2025 - 18:48 UTC

Identified

Endpoint ingress degradation in AWS us-east-1, us-west-2, and ap-northeast-2 began earlier on January 16, 2025. We have identified a fix and are currently deploying it to all impacted regions.
Posted Jan 16, 2025 - 15:56 UTC
This incident affected: AuraDB Professional on AWS (*.databases.neo4j.io), AuraDS on AWS (*.databases.neo4j.io), AuraDS Enterprise on AWS (*.databases.neo4j.io), and AuraDB Business Critical on AWS (*.databases.neo4j.io).