A configuration change in Aura’s DB Ingress service caused intermittent connectivity issues for customer databases across multiple AWS regions. The issue was due to a misconfiguration of AWS Network Load Balancers (NLBs), which resulted in dropped inbound traffic.
A change intended to make the service more efficient rolled out on 2025-01-15 and reduced db-ingress replicas to three per region, affecting AWS regions with more than three availability zones (AZs). Because cross-zone load balancing was not enabled, the AWS Network Load Balancer (NLB) could not redistribute traffic away from AZs that had no db-ingress pod, and requests landing in those AZs were dropped. This led to intermittent connection failures in us-east-1, us-west-2, and ap-northeast-2. The fix was deployed on 2025-01-16.
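For illustration only, the sketch below shows the two settings involved, assuming a Kubernetes deployment exposed through an AWS NLB managed by the AWS Load Balancer Controller; it is not Aura's actual configuration. It enables cross-zone load balancing via a Service annotation and spreads replicas across AZs with a topology spread constraint. Names such as the db-ingress Service, the app: db-ingress label, and the Bolt port 7687 are illustrative assumptions.

```python
# Illustrative sketch only -- not Aura's actual manifests. Assumes a Kubernetes
# deployment fronted by an AWS NLB managed by the AWS Load Balancer Controller.
from kubernetes import client

# Service for the (hypothetical) db-ingress pods. The aws-load-balancer-attributes
# annotation asks the controller to enable cross-zone load balancing, so the NLB
# can send traffic to healthy pods even when the local AZ has no db-ingress replica.
db_ingress_service = client.V1Service(
    api_version="v1",
    kind="Service",
    metadata=client.V1ObjectMeta(
        name="db-ingress",
        annotations={
            "service.beta.kubernetes.io/aws-load-balancer-type": "external",
            "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip",
            "service.beta.kubernetes.io/aws-load-balancer-attributes":
                "load_balancing.cross_zone.enabled=true",
        },
    ),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "db-ingress"},
        ports=[client.V1ServicePort(name="bolt", port=7687, target_port=7687)],
    ),
)

# Pod-level topology spread constraint: keep db-ingress replicas spread across
# availability zones so every AZ behind the NLB has at least one local pod.
zone_spread = client.V1TopologySpreadConstraint(
    max_skew=1,
    topology_key="topology.kubernetes.io/zone",
    when_unsatisfiable="DoNotSchedule",
    label_selector=client.V1LabelSelector(match_labels={"app": "db-ingress"}),
)
```

Either safeguard alone would have avoided the dropped traffic; applying both removes the single point of failure, since the NLB can route around empty zones and the scheduler keeps zones from being empty in the first place.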
Intermittent connectivity failures impacted Aura Professional, Business Critical, and DS Enterprise orchestras. Traffic was dropped when the AWS Network Load Balancer (NLB) routed requests to availability zones with no db-ingress pod. As a result, failure rates for queries using the Bolt protocol reached approximately 40% in us-east-1, 25% in us-west-2, and 25% in ap-northeast-2.
We reverted the change that reduced the number of db-ingress replicas, ensuring that db-ingress instances were running in all availability zones. This restored normal database operations.
Neo4j remains committed to providing a reliable service. To prevent similar incidents in the future, we are implementing the following improvements: