A small number of instances are not accepting write queries
Incident Report for Neo4j Aura
Postmortem

What happened

We rolled out the latest release of the database on Neo4j Aura. During the rollout, a small number of database instances switched to read-only mode because the out-of-disk protection was triggered erroneously.

How the service was affected

Affected database instances were placed in read-only mode (serving read queries only) but still displayed an online status.

We were notified of the issue by customers. Although we monitor disk usage (to help prevent data corruption), we did not detect this issue ourselves because it was triggered by some backup-restore component containers running into an out-of-memory condition (2024-09-24 at 2:10 UTC), which left them unable to serve disk metrics. If the operator component cannot read metrics, it falls back to using estimated values. Those estimated values triggered the safeguard, which placed the small number of affected instances in read-only mode.
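The failure mode described above can be illustrated with a minimal sketch. It is not the operator's actual code: the function name, the threshold and the fallback estimate are hypothetical, and it assumes the estimate is high enough to cross the threshold when no measurement is available.

```python
from typing import Optional

# Hypothetical values, chosen only for illustration.
READ_ONLY_THRESHOLD = 0.95  # fraction of disk used that triggers read-only mode
ESTIMATED_USAGE = 1.0       # assumed estimate used when no metric can be read


def should_go_read_only(measured_usage: Optional[float]) -> bool:
    """Return True when the safeguard would place the instance in read-only mode."""
    # When the metrics source is unavailable (None), fall back to the estimate.
    usage = ESTIMATED_USAGE if measured_usage is None else measured_usage
    return usage >= READ_ONLY_THRESHOLD


print(should_go_read_only(0.40))  # healthy disk, metric available
print(should_go_read_only(None))  # metric unavailable, estimate is used instead
```

In this sketch an instance with plenty of free disk is still switched to read-only mode as soon as its metric becomes unavailable, because the decision cannot tell "disk is full" apart from "disk usage is unknown".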

Working with our engineering teams, we quickly identified that adjusting the memory setting of the backup-restore component that serves the disk metrics would allow new cluster members to start successfully. A fix was released, tested in lower environments, and rolled out first to all affected instances (2024-09-24 at 15:10 UTC) and then to the whole service.

What we are doing now

This was an extreme case of a safety feature (out-of-disk protection) causing an issue. We immediately fixed the issue by preventing the backup-restore component from running out of memory under the same conditions. We believe this warrants a number of further changes, which we are carrying out to better prevent, detect and mitigate issues affecting the out-of-disk protection feature:

  • Fixing the issue: implement a fix for the out-of-disk protection for the case where it cannot receive any metrics.
  • Aura console instance status display: implement a change so that the displayed status reflects an instance's read-only mode.
  • Detection: implement an alert to detect a surge of instances going out of disk.
  • Prevention: implement an alert on the rate of out-of-memory (OOM) events for the class of components involved in the out-of-disk (OOD) protection decision chain.
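
As a rough illustration of the first and third items, the sketch below separates "metric unavailable" from "disk full" and flags a surge of read-only transitions within a time window. It is only a sketch under stated assumptions, not the implemented fix: every name, threshold and window size here is hypothetical.

```python
from collections import deque
from enum import Enum
from typing import Deque, Optional


class Action(Enum):
    STAY_WRITABLE = "stay_writable"
    GO_READ_ONLY = "go_read_only"
    ALERT_ONLY = "alert_only"  # metric missing: raise an alert, keep serving writes


def decide(measured_usage: Optional[float], threshold: float = 0.95) -> Action:
    """Hypothetical decision that treats a missing metric as unknown, not as full."""
    if measured_usage is None:
        return Action.ALERT_ONLY
    if measured_usage >= threshold:
        return Action.GO_READ_ONLY
    return Action.STAY_WRITABLE


class SurgeDetector:
    """Hypothetical sliding-window counter for read-only transitions."""

    def __init__(self, window_seconds: float = 300.0, max_events: int = 5) -> None:
        self.window_seconds = window_seconds
        self.max_events = max_events
        self._events: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one transition; return True if the window now holds a surge."""
        self._events.append(timestamp)
        # Drop transitions that have fallen out of the time window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.max_events
```

With this shape, a metrics outage produces an alert rather than a read-only transition, and many instances switching to read-only mode within a short period is itself treated as a signal worth alerting on.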
Posted Oct 01, 2024 - 09:16 UTC

Resolved
We have rolled out a fix and confirmed the full recovery of features and complete resolution with affected customers.
Posted Sep 24, 2024 - 14:41 UTC
Update
A fix is now being rolled out to production.
Posted Sep 24, 2024 - 12:35 UTC
Identified
We have identified the issue and are working on rolling out a fix.
Posted Sep 24, 2024 - 10:38 UTC
Investigating
We are investigating an issue with some instances not accepting write queries.
Posted Sep 24, 2024 - 10:20 UTC
This incident affected: AuraDB Virtual Dedicated Cloud (*.databases.neo4j.io) (AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io), AuraDB Virtual Dedicated Cloud on Azure (*.databases.neo4j.io)), AuraDB Business Critical (*.databases.neo4j.io) (AuraDB Business Critical (*.databases.neo4j.io) on AWS, AuraDB Business Critical (*.databases.neo4j.io) on Azure), and AuraDB Free (*.databases.neo4j.io).