At approximately 17:00 UTC on 2025-05-20, our cloud provider released and rolled out a change to the managed version of Prometheus that we use to provide the Customer Metrics Integration (CMI) endpoint. This change degraded our production PromQL query performance because, in the provider's words, “the change to the PromQL query path now evaluates queries that previously had empty results”. We received no advance warning of this change, had no control over it, and it affected multiple customers.
Customers with low timeout settings on the PromQL queries they use to fetch metrics from the Neo4j Aura CMI endpoint saw an increase in query timeouts (HTTP 499 errors).
We promptly raised the issue with our cloud provider, who rolled back the change.
While we investigated the root cause, we recommended that affected customers increase their query timeout to 20 seconds as an immediate mitigation.
This incident was not caused by anything Neo4j directly controls, but we have been looking at how to improve our handling of situations like this and have devised the following actions: