On 2024-04-29 at 08:57:35 UTC, Neo4j implemented a code update that had unintended consequences for our infrastructure management.
The aforementioned code introduced a bug in our cluster management tool, which became apparent when an update to our Kubernetes clusters resulted in the termination of VMs hosting certain databases. While the system self-healed, this led to instances becoming temporarily unavailable for some AWS customers.
Since our code rollout occurs progressively across environments, the impact was staggered across AWS service tiers.
Some of our AWS customers (running instances up to 32GB of RAM) experienced a <10 minute service interruption between 2024-04-29 08:57:35 UTC and 2024-04-29 12:33:40 UTC when the last batch of impacted databases completed the self-healing process.
The impact occurred during multiple time intervals. Each interval affected a different group of instances, resulting in shorter recovery times for individual instances.
This occurred in the following time intervals as the roll out was staggered:
After conducting a comprehensive analysis of the situation, we are taking decisive actions to prevent such incidents from occurring in the future. Our efforts fall into two key areas:
These measures are aimed at ensuring the reliability and stability of our services moving forward.