AuraDB/DS: Some instance operations failing

Incident Report for Neo4j Aura

Postmortem

What happened

Our cloud provider rolled out a new GKE feature that manages the resources backing Kubernetes snapshots. A new configuration associated with this feature introduced a race condition, which was exacerbated by concurrent calls to the resource that used identical snapshot names.
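As a generic illustration of this class of problem (not the provider's actual code or fix), the sketch below shows why concurrent create calls with an identical name need the existence check and the write to happen as one atomic step. The SnapshotRegistry class, its methods, and the name "snapshot-a" are hypothetical and exist only for this example.

```python
import threading


class SnapshotRegistry:
    """Hypothetical in-memory stand-in for a snapshot resource store."""

    def __init__(self):
        self._snapshots = {}
        self._lock = threading.Lock()

    def create(self, name, owner):
        # The existence check and the write happen under one lock, so two
        # callers using the same name cannot both pass the check.
        with self._lock:
            if name in self._snapshots:
                return False
            self._snapshots[name] = owner
            return True

    def owner_of(self, name):
        with self._lock:
            return self._snapshots.get(name)


def run_concurrent_creates(name, callers=8):
    """Start several callers that all try to create the same snapshot name."""
    registry = SnapshotRegistry()
    barrier = threading.Barrier(callers)
    results = [None] * callers

    def worker(i):
        barrier.wait()  # release all callers at roughly the same moment
        results[i] = registry.create(name, owner=i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return registry, results


registry, results = run_concurrent_creates("snapshot-a")
print(results.count(True))  # number of callers whose create succeeded
```

Without the lock, two callers could both see the name as free and both proceed, which is the kind of interleaving a race condition allows. Giving each call a unique snapshot name would be another way to avoid the collision, again as an assumption rather than a description of what was changed.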

We first worked to isolate the issue and to rule out any involvement of our own changes. Once we had clarity, we declared this incident and reported the issue to our cloud service provider. We escalated soon after, as we became more confident of the nature and impact of the issue.

How the service was affected

Most Aura operations (Resume, Restore from backup, Loading, and Clone to new) were failing because they require the Kubernetes snapshot functionality. Some operations that customers initiated during this period may have become stuck. We proactively monitored these and assisted manually wherever possible.

What we are doing now

Our provider is reviewing their testing procedures to better detect possible race conditions.
Our provider is implementing additional safeguards to prevent race conditions from occurring.
We have established a better channel for working with our service provider and for escalating issues that affect our service.

Posted Jun 26, 2024 - 12:32 UTC

Resolved

The issue is now fixed. Our cloud provider has rolled out a change and confirmed that it resolves the underlying root cause.

Posted May 23, 2024 - 16:39 UTC

Update

Some operations (Resume, Restore from backup, Loading, and Clone to new) on Aura instances are impacted and may not complete successfully.
We are working closely with our cloud provider on a solution.

Posted May 23, 2024 - 11:49 UTC

Update

Posted May 23, 2024 - 09:00 UTC

Update

Posted May 23, 2024 - 06:04 UTC

Update

Posted May 23, 2024 - 03:00 UTC

Update

Posted May 23, 2024 - 00:25 UTC

Update

Posted May 22, 2024 - 18:53 UTC

Update

Posted May 22, 2024 - 16:00 UTC

Update

Posted May 22, 2024 - 13:00 UTC

Update

Posted May 22, 2024 - 10:00 UTC

Update

Posted May 22, 2024 - 07:01 UTC

Update

Posted May 22, 2024 - 04:20 UTC

Update

Posted May 22, 2024 - 01:01 UTC

Update

Posted May 21, 2024 - 14:32 UTC

Update

Posted May 21, 2024 - 11:25 UTC

Update

Posted May 21, 2024 - 08:25 UTC

Update

Posted May 21, 2024 - 05:00 UTC

Update

Posted May 21, 2024 - 02:11 UTC

Update

Posted May 20, 2024 - 23:08 UTC

Update

Posted May 20, 2024 - 20:21 UTC

Update

Posted May 20, 2024 - 16:40 UTC

Update

Posted May 20, 2024 - 13:09 UTC

Update

Posted May 20, 2024 - 10:00 UTC

Update

Posted May 20, 2024 - 07:03 UTC

Update

Posted May 20, 2024 - 04:01 UTC

Update

Posted May 20, 2024 - 01:10 UTC

Update

Posted May 19, 2024 - 21:30 UTC

Update

Posted May 19, 2024 - 18:30 UTC

Update

Posted May 19, 2024 - 15:30 UTC

Update

Posted May 19, 2024 - 12:31 UTC

Update

Posted May 19, 2024 - 09:31 UTC

Update

Posted May 19, 2024 - 06:31 UTC

Update

Some operations (Resume, Restore from backup, Loading, and Clone to new) on Aura instances are impacted and may not complete successfully.
We are working closely with our cloud provider on a solution.

Posted May 19, 2024 - 03:31 UTC

Update

Posted May 19, 2024 - 00:31 UTC

Update

Posted May 18, 2024 - 21:00 UTC

Update

Posted May 18, 2024 - 18:02 UTC

Update

Posted May 18, 2024 - 15:01 UTC

Update

Some operations (Resume, Restore from backup, Loading, and Clone to new) on Aura instances are impacted and may not be completed successfully.
We are working closely with our cloud provider on a solution.

Posted May 18, 2024 - 12:13 UTC

Update

Posted May 18, 2024 - 09:04 UTC

Update

Posted May 18, 2024 - 06:03 UTC

Update

Posted May 18, 2024 - 02:58 UTC

Update

Posted May 17, 2024 - 23:42 UTC

Update

Some operations (Resume, Restore from backup, Loading, and Clone to new) on Aura instances are impacted and may not be completed successfully.

We are working closely with our cloud provider on a solution.

Posted May 17, 2024 - 21:36 UTC

Update

Posted May 17, 2024 - 18:30 UTC

Update

Posted May 17, 2024 - 15:04 UTC

Update

Posted May 17, 2024 - 10:02 UTC

Update

Posted May 17, 2024 - 07:02 UTC

Update

Posted May 17, 2024 - 04:14 UTC

Update

Posted May 17, 2024 - 01:02 UTC

Update

Posted May 16, 2024 - 22:35 UTC

Update

Posted May 16, 2024 - 19:28 UTC

Update

Some operations (Resume, Restore from backup, Loading, and Clone to new) on Aura instances may fail intermittently.
We are working closely with our cloud provider on a solution.

Posted May 16, 2024 - 16:35 UTC

Update

We are currently working on a fix for the issue.
We will continue to update you as we work on this problem.

Posted May 16, 2024 - 12:56 UTC

Update

We are currently working on a fix for the issue.
We will continue to update you as we work on this problem.

Posted May 16, 2024 - 10:20 UTC