Performance degradation - High CPU load on Aura 5

Incident Report for Neo4j Aura

Postmortem

What happened

A change to how we call the Raft resolver (clustering protocol) when managing single-node instances caused a large increase in internal API calls. The extra load exposed a latent memory leak, which in turn forced the Java garbage collector to run at high intensity and consume CPU resources that workloads needed.

How the service was affected

The issue only affected single-node instances on the AuraDB Free tier. Users may have noticed degraded performance on queries or operations requiring CPU resources.

What we are doing now

Considering running changes through a soak period so that slow memory leaks are easier to detect.
Reviewing how we can better detect these conditions while the service is running, including recognizing patterns of abnormal CPU usage.
Improving our internal handling of early warning signs from alarms so that we can make a better impact assessment.
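
As an illustration of the soak-period idea, the following is a minimal sketch, not a description of Neo4j's tooling. It assumes that heap usage after garbage collection is sampled at regular intervals during the soak, and it flags a possible slow leak when the least-squares trend of those samples exceeds a threshold. The function names, the sample format and the 1 MB/hour threshold are all hypothetical.

```python
from statistics import mean


def heap_growth_slope(samples):
    """Return the least-squares slope of (hours, post-GC heap MB) samples.

    The result is in MB per hour. The sample format is an assumption
    made for this sketch.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    xs = [t for t, _ in samples]
    ys = [h for _, h in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("samples must cover more than one point in time")
    numer = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return numer / denom


def looks_like_slow_leak(samples, max_mb_per_hour=1.0):
    """Flag a soak run whose post-GC heap trend exceeds the threshold.

    The default threshold is a hypothetical value, not a recommendation.
    """
    return heap_growth_slope(samples) > max_mb_per_hour


# Synthetic 24-hour soak runs: one growing steadily, one merely noisy.
leaking = [(h, 500 + 5 * h) for h in range(24)]
steady = [(h, 500 + (h % 2)) for h in range(24)]
print(looks_like_slow_leak(leaking), looks_like_slow_leak(steady))
```

Fitting a trend to post-collection heap size, rather than comparing raw heap readings, keeps normal allocation churn from being mistaken for a leak. A longer soak period makes a shallow slope easier to distinguish from noise.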

Posted Jun 04, 2024 - 11:23 UTC

Resolved

The rollout of the fix is complete and the service is fully restored.

Posted Mar 15, 2024 - 15:23 UTC

Monitoring

The fix has been rolled out and we will now be monitoring.

Posted Mar 15, 2024 - 13:42 UTC

Update

We are progressing through the rollout of the fix and have now completed the Aura Professional tier.
We are on course to finish today.
We continue to monitor and will take proactive mitigating action where necessary until the fix is fully rolled out.

Posted Mar 15, 2024 - 13:03 UTC

Update

We are continuing to work on a fix for this issue.

Posted Mar 15, 2024 - 13:00 UTC

Update

The fix is being rolled out now and will be completed today.
We will continue actively monitoring instances and taking any mitigation actions as needed.

Posted Mar 15, 2024 - 11:07 UTC

Update

We have a fix ready and will be rolling it out today.
We will continue actively monitoring instances and taking any mitigation actions as needed.

Posted Mar 15, 2024 - 08:50 UTC

Update

Our engineers continue working on a fix that will be rolled out when it is ready. In the meantime, please be assured that we're actively monitoring instances and are taking mitigation actions.

Posted Mar 15, 2024 - 06:35 UTC

Update

Our engineers are continuing to work on a fix.
We will continue actively monitoring instances and taking any mitigation actions as needed.

Posted Mar 15, 2024 - 04:34 UTC

Update

Our engineers continue working on a fix that will be rolled out when it is ready. In the meantime, please be assured that we're actively monitoring instances and are taking mitigation actions.

Posted Mar 15, 2024 - 02:30 UTC

Update

Our engineers are continuing to work on a fix. We will continue actively monitoring instances and taking any mitigation actions as needed.

Posted Mar 15, 2024 - 00:13 UTC

Update

Our engineers continue working on a fix that will be rolled out when it is ready. In the meantime, please be assured that we're actively monitoring instances and are taking mitigation actions.

Posted Mar 14, 2024 - 22:03 UTC

Identified

Some customers running Aura 5 may experience higher-than-usual CPU usage on their instances.
We have identified the root cause and are actively working on a fix and preparing for its rollout.
Meanwhile, please be assured that we are actively monitoring instances and are taking mitigating actions.

Posted Mar 14, 2024 - 20:24 UTC

This incident affected: AuraDB Virtual Dedicated Cloud (*.databases.neo4j.io) (AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io)) and AuraDB Professional (*.databases.neo4j.io) (AuraDB Professional on AWS (*.databases.neo4j.io)).