We rolled out the latest release of the database on Neo4j Aura. During the rollout, a small number of database instances switched to read-only mode because the out-of-disk protection was triggered erroneously.
Affected database instances were placed in read-only mode (serving read queries only) but still displayed an online status.
Customers notified us of the issue. Although we monitor disk usage (to help prevent data corruption), we did not detect this issue ourselves because it was triggered by some backup-restore component containers running out of memory (2024-09-24 at 2:10 UTC), which left them unable to serve disk metrics. If the operator component cannot read metrics, it falls back to using estimated values. Those estimated values triggered the safeguard, which placed the small number of affected instances in read-only mode.
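The failure mode described above can be illustrated with a minimal sketch. This is not the operator's actual code: the function names, the `Decision` type, and the 0.95 threshold are all hypothetical and chosen only to show how falling back to an estimate can put a healthy instance into read-only mode when the real metric is unavailable.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical threshold: fraction of disk used at which writes are blocked.
READ_ONLY_THRESHOLD = 0.95


@dataclass
class Decision:
    read_only: bool
    source: str  # "measured" or "estimated"


def decide_mode(
    read_metric: Callable[[], Optional[float]],
    estimate: Callable[[], float],
    threshold: float = READ_ONLY_THRESHOLD,
) -> Decision:
    """Decide whether an instance should be read-only, and record which value decided it."""
    measured = read_metric()  # returns None when the metrics source is unavailable
    if measured is not None:
        return Decision(read_only=measured >= threshold, source="measured")
    # Metrics unavailable: fall back to an estimated value.
    return Decision(read_only=estimate() >= threshold, source="estimated")


# Metrics available: the instance stays writable.
healthy = decide_mode(lambda: 0.40, lambda: 0.99)

# Metrics source down and a pessimistic estimate: the same instance goes read-only.
degraded = decide_mode(lambda: None, lambda: 0.99)
```

Recording the `source` alongside the decision is one way to make this situation visible: a read-only decision based on an estimate rather than a measurement could be surfaced or alerted on separately.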
Working with our engineering teams, we quickly identified that adjusting the memory setting of the backup-restore component that serves the disk metrics would allow new cluster members to start successfully. A fix was released, tested in lower environments, and rolled out first to all affected instances (2024-09-24 at 15:10 UTC) and then to the whole service.
This was an extreme case of a safety feature (out-of-disk protection) causing an issue. We immediately fixed the issue by preventing the backup-restore component from running out of memory under the same conditions. We believe this deserves a number of further changes, which we are carrying out to better prevent, detect and mitigate issues affecting this out-of-disk protection feature: