Database clone feature impacted
Incident Report for Neo4j Aura
Postmortem

Incident postmortem - Database clone feature impacted

What happened

A provisioning pipeline deployment produced an incomplete configuration map that omitted two Google Cloud regions. As a result, some control components were unaware of the Kubernetes cluster environments in those two regions.

How the service was affected

Aura customers on Google Cloud, across all tiers (Free, Professional and Enterprise), were unable to clone database instances. In addition, users in the two omitted Google Cloud regions were unable to create, delete or suspend database instances, or to access the console.

The issue was resolved when engineers from multiple Aura teams fixed the underlying bugs in several back-end components.

What we are doing now

We don’t believe that the issues our customers experienced are acceptable for a DBaaS. We’ve carried out a thorough analysis of what went wrong in this situation. The actions that we’re carrying out to ensure that nothing like this can happen again fall into two areas:

  • Preventative

    • Add guardrails to the logic that is updating the configuration map. 
    • Ensure that when the configuration map is updated, orchestras with running databases can’t be removed.
    • Change the data type of one of the fields related to the previous identifier of the configuration map to an integer.
    • Consider adopting a policy of not using bash in high-impact components.
    • Investigate if Aura relies on other large configuration maps that are mounted as environment variables.
    • Investigate ways of storing configurations without using configuration maps due to the 1 MiB hard limit.
    • Consider adopting a policy of requiring components to be mergeable & deployable within x minutes.
  • Detection

    • Monitor the correctness of the configuration maps.
    • Monitor and alert based on the size of the configuration map.
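
As an illustration of the guardrail and size-monitoring actions listed above, here is a minimal Python sketch. It is not Aura's implementation: the function names, the region names, the 80% warning threshold and the way the size is measured (sum of UTF-8 encoded keys and values) are all assumptions made for the example. Only the 1 MiB limit and the rule that environments with running databases must not be removed come from the list above.

```python
# Hypothetical pre-deployment check for a configuration map update.
# All names and thresholds here are illustrative assumptions.

CONFIGMAP_LIMIT_BYTES = 1024 * 1024  # the 1 MiB hard limit mentioned above


def configmap_size_bytes(data):
    """Approximate the size as the sum of UTF-8 encoded keys and values."""
    return sum(
        len(key.encode("utf-8")) + len(value.encode("utf-8"))
        for key, value in data.items()
    )


def validate_update(old_regions, new_regions, regions_with_running_dbs,
                    new_data, warn_ratio=0.8):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # Guardrail: a region that still hosts running databases must not
    # disappear from the configuration map.
    removed = set(old_regions) - set(new_regions)
    blocked = sorted(removed & set(regions_with_running_dbs))
    if blocked:
        problems.append(
            "update removes regions with running databases: "
            + ", ".join(blocked)
        )

    # Detection: flag the map before it reaches the hard limit.
    size = configmap_size_bytes(new_data)
    if size > CONFIGMAP_LIMIT_BYTES * warn_ratio:
        problems.append(
            f"configuration map size {size} bytes exceeds "
            f"{int(warn_ratio * 100)}% of the {CONFIGMAP_LIMIT_BYTES} byte limit"
        )

    return problems


if __name__ == "__main__":
    # Example: the new map drops two regions, one of which has running databases.
    issues = validate_update(
        old_regions=["region-a", "region-b", "region-c"],
        new_regions=["region-a"],
        regions_with_running_dbs={"region-b"},
        new_data={"regions": "region-a"},
    )
    for issue in issues:
        print(issue)
```

A check of this kind could run in the deployment pipeline and block the rollout whenever the returned list is non-empty, with the size check also feeding an alert.
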
Posted Sep 01, 2023 - 14:28 UTC

Resolved
We have now verified the issue has been addressed and validated the fix.
All operations are now recovered and the Aura Console is fully functional.
Posted Aug 17, 2023 - 14:42 UTC
Monitoring
We have identified the root cause and deployed a fix for all affected users.
Any clone operation that is pending may need to be retried.
We will continue to monitor for any signs of error and update you shortly.
Posted Aug 17, 2023 - 14:12 UTC
Investigating
We have detected an issue with the database instance clone feature. We are investigating and will update you shortly.
Databases are running and other database operations are not affected.
Posted Aug 17, 2023 - 13:49 UTC
This incident affected: Aura Console (console.neo4j.io).