Our cloud provider rolled out a new feature in GKE that handles the resources relating to kubernetes snapshots. A new configuration associated with this feature led to a race condition exacerbated by concurrent calls to the resource with identical snapshot names.
We worked to isolate the issue and to discard any involvement from our changes. After we got clarity, we declared this incident and engaged with our cloud service provider and reported the issue and escalated soon after as we became more confident of the nature and impact of the issue.
Most Aura operations (Resume, Restore from backup, Loading, and Clone to new) were failing as they require the kubernetes snapshot functionality. Some customers may have initiated operations and those could have got stuck. We proactively monitored and assisted manually whenever possible.