Increased Latency on Customer Metric Integration Requests

Incident Report for Neo4j Aura

Postmortem

What happened

At approximately 17:00 UTC on 2025-05-20, our cloud provider released and rolled out a change to the managed version of Prometheus that we use to provide the Customer Metrics Integration (CMI) endpoint. This change degraded our production PromQL query performance because “the change to the PromQL query path now evaluates queries that previously had empty results”. We had no warning of this change and no control over it, and it affected multiple customers.

We quickly raised the issue with our cloud provider, and they rolled back the change.

While we investigated the root cause of the issue, we immediately recommended increasing the timeout value to 20 seconds as a remediation.

How the service was affected

Customers with low timeout settings on their PromQL queries to fetch metrics from the Neo4j Aura CMI endpoint would see an increase in query timeouts (HTTP error 499).
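
For customers who fetch metrics with their own client code, the recommended 20-second timeout can be applied roughly as in the minimal Python sketch below. The function name, the URL argument, and the bearer-token header are illustrative assumptions and are not taken from the Aura documentation; only the 20-second value comes from this report.

```python
import urllib.request
from typing import Callable

# Timeout value recommended in this incident report.
RECOMMENDED_TIMEOUT_SECONDS = 20.0


def fetch_metrics(
    url: str,
    token: str,
    timeout: float = RECOMMENDED_TIMEOUT_SECONDS,
    opener: Callable = urllib.request.urlopen,
) -> str:
    """Fetch a metrics payload, waiting up to `timeout` seconds.

    `url` and `token` are hypothetical placeholders: substitute the CMI
    endpoint URL and credentials configured for your own Aura project.
    `opener` is injectable so the function can be exercised offline.
    """
    request = urllib.request.Request(
        url,
        # Assumed auth scheme, for illustration only.
        headers={"Authorization": f"Bearer {token}"},
    )
    # A timeout that is too low makes the client give up while the
    # server is still evaluating a slow query.
    with opener(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

If you scrape the endpoint with Prometheus rather than custom code, the equivalent setting is the scrape_timeout value of the scrape job.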

What we are doing now

This incident was not caused by anything Neo4j directly controls, but we have looked at how to improve our handling of this kind of situation and have devised the following actions:

  • Provided feedback to our cloud provider on the impact this has had on our service.
  • Added the CMI endpoint to the status page as Aura Customer Metrics (customer-metrics-api.neo4j.io) to better represent the status of the service.
  • Updated our documentation and recommended that a larger timeout be set.
  • Reviewed improvements to our detection of and alerting on query timeouts and errors (499 and 5xx errors).
  • Reviewing our own queries to make them more efficient and more resilient to performance degradation in our suppliers’ services.
Posted Jun 13, 2025 - 10:17 UTC

Resolved

The latency issue affecting our Customer Metric Integration endpoints has been resolved.
All services are now operating normally, and timeout errors should no longer occur.
Posted May 22, 2025 - 09:09 UTC

Monitoring

The issue has been resolved, and we are currently monitoring the system to ensure continued stability.

If you were experiencing timeouts while fetching metrics, those should now be resolved.
We will continue to observe the system and provide further updates if necessary.
Posted May 22, 2025 - 07:37 UTC

Update

Our Engineering team is continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics, we recommend temporarily increasing your timeout value to 20 seconds until this incident is resolved.
Posted May 22, 2025 - 04:30 UTC

Update

Engineering are continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 22, 2025 - 00:45 UTC

Update

Engineering are continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 21, 2025 - 23:46 UTC

Update

Engineering are continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 21, 2025 - 22:47 UTC

Update

Engineering are continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 21, 2025 - 21:48 UTC

Update

Engineering are continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 21, 2025 - 20:50 UTC

Update

Engineering are continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 21, 2025 - 19:57 UTC

Update

Engineering are continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 21, 2025 - 18:51 UTC

Update

Engineering are continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 21, 2025 - 17:35 UTC

Update

Engineering are continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 21, 2025 - 16:13 UTC

Update

Engineering are continuing to investigate the root cause of the issue.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 21, 2025 - 14:44 UTC

Investigating

Engineering have identified an increase in latency with our Customer Metric Integration endpoints and are currently investigating the cause.

If you are seeing timeouts when attempting to fetch metrics we recommend increasing your timeout value to 20 seconds temporarily until this incident is resolved.
Posted May 21, 2025 - 13:40 UTC
This incident affected: AuraDB Virtual Dedicated Cloud (*.databases.neo4j.io) (AuraDB Virtual Dedicated Cloud on AWS (*.databases.neo4j.io), AuraDB Virtual Dedicated Cloud on Azure (*.databases.neo4j.io), AuraDB Virtual Dedicated Cloud on GCP (*.databases.neo4j.io)), AuraDB Professional (*.databases.neo4j.io) (AuraDB Professional on AWS (*.databases.neo4j.io), AuraDB Professional on Azure (*.databases.neo4j.io), AuraDB Professional on GCP (*.databases.neo4j.io)), AuraDS (*.databases.neo4j.io) (AuraDS on AWS (*.databases.neo4j.io), AuraDS on Azure (*.databases.neo4j.io), AuraDS on GCP (*.databases.neo4j.io)), AuraDS Enterprise (*.databases.neo4j.io) (AuraDS Enterprise on AWS (*.databases.neo4j.io), AuraDS Enterprise on Azure (*.databases.neo4j.io), AuraDS Enterprise on GCP (*.databases.neo4j.io)), AuraDB Business Critical (*.databases.neo4j.io) (AuraDB Business Critical (*.databases.neo4j.io) on AWS, AuraDB Business Critical (*.databases.neo4j.io) on Azure, AuraDB Business Critical (*.databases.neo4j.io) on GCP), and AuraDB Free (*.databases.neo4j.io).