Some Cypher query regression
Incident Report for Neo4j Aura
Postmortem

What happened

On 2024-06-28 10:46 (UTC) we released a new version of the database (v. 5.21). This release contained a fix for a problem where queries returning an entity ordered by an index-backed property could return results in the wrong order due to concurrent writes (e.g. with an index-backed property: MATCH (n:L) WHERE n.x IS NOT NULL RETURN n ORDER BY n.x). Unfortunately, this introduced two undetected regressions:

  • key not found: VariableSlotKey(...), where the runtime would try to access a cached variable that was not in scope
  • NullCheckReferenceProperty cannot be cast to class ASTCachedProperty, a class cast exception in the slotted runtime that occurred in some cases when retrieving cached properties

Contrary to the usual process, a human error meant the release was rolled out directly to the AuraDB Enterprise tier without first having been exposed to the AuraDB Free and Professional tiers.

How the service was affected

Customers running certain types of queries, where the whole node would get cached as part of the optimization, saw failures with key not found: VariableSlotKey(...) or NullCheckReferenceProperty cannot be cast to class ASTCachedProperty, with no simple workarounds.
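As an illustration only, the two error signatures above could be matched against collected error messages with a sketch like the following. The function names and the idea of feeding it a list of message strings are assumptions made for this example; they are not part of any Neo4j or Aura tooling.

```python
import re

# The two error signatures quoted in this report. The ".*" allows for a
# fully qualified class name in the cast message (an assumption about
# how the message may be rendered).
SIGNATURES = {
    "variable_slot_key": re.compile(r"key not found: VariableSlotKey\("),
    "cached_property_cast": re.compile(
        r"NullCheckReferenceProperty cannot be cast to class .*ASTCachedProperty"
    ),
}


def classify_error(message):
    """Return the name of the matching signature, or None if neither matches."""
    for name, pattern in SIGNATURES.items():
        if pattern.search(message):
            return name
    return None


def count_regression_errors(messages):
    """Count how many messages match each of the two signatures."""
    counts = {name: 0 for name in SIGNATURES}
    for message in messages:
        name = classify_error(message)
        if name is not None:
            counts[name] += 1
    return counts
```

A non-zero count for either key suggests that a workload hit one of the two regressions described in this report.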

We fixed the regressions and rolled out a new version, 5.21.1 (2024-06-29 01:10), followed by a second, 5.21.2, to address both of these regressions.

What we are doing now

We consider this incident severe and have committed resources to deliver on the following actions:

  • Rolling out

    • Prevent rolling out to the Enterprise tier at the end of the working week
    • Update the release roll-out process and tooling to enforce the sequence in which tiers receive a new release
    • Automate the release and roll-out process to remove the risk of further human error
    • Reduce the time it takes to release an emergency fix
  • Database release

    • Add further tests around the release of improvements to Cypher
  • Detection

    • Build a dedicated monitoring dashboard to detect spikes in errors caused by Cypher queries
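The roll-out actions above (enforcing the sequence of tiers and avoiding Enterprise roll-outs at the end of the working week) could be expressed as a simple gate, sketched below. This is a hypothetical illustration, not the actual Aura release tooling: the function name, the relative order of the Free and Professional tiers, and the treatment of Friday through Sunday as "end of the working week" are all assumptions.

```python
from datetime import date

# Assumed tier order: Free first, then Professional, then Enterprise.
TIER_ORDER = ("free", "professional", "enterprise")
FRIDAY = 4  # date.weekday() numbers Monday as 0, so Friday is 4


def rollout_allowed(version, tier, completed, today):
    """Decide whether `version` may be rolled out to `tier` on `today`.

    `completed` maps a version string to the set of tiers that have
    already received that version.
    """
    if tier not in TIER_ORDER:
        raise ValueError(f"unknown tier: {tier}")

    # Every earlier tier in the sequence must already have the release.
    done = completed.get(version, set())
    earlier_tiers = TIER_ORDER[: TIER_ORDER.index(tier)]
    if any(t not in done for t in earlier_tiers):
        return False

    # Assumption: block Enterprise roll-outs from Friday through Sunday.
    if tier == "enterprise" and today.weekday() >= FRIDAY:
        return False

    return True
```

Under these assumptions, a request to roll out 5.21 to the Enterprise tier on Friday 2024-06-28 with no earlier tiers completed would be rejected on both checks.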
Posted Jul 12, 2024 - 18:29 UTC

Resolved
We have now completed the rollout of the fix to all affected Aura instances. This incident is resolved.
Posted Jun 29, 2024 - 00:54 UTC
Update
We have a fix and are currently progressing with the release of that fix.
The packaging and testing ran longer than earlier estimates.
The projected ETA for the rollout on Aura is currently June 29th 00:00 UTC.
Posted Jun 28, 2024 - 21:29 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jun 28, 2024 - 21:08 UTC
Update
We have a fix and are currently packaging and releasing it.
The ETA for the rollout on Aura is currently 22:00 UTC.
Posted Jun 28, 2024 - 17:05 UTC
Update
We continue to work on a fix for this issue.
To identify whether you are affected, look for this error signature: key not found: VariableSlotKey(...)
We are taking steps to contact affected customers.
Posted Jun 28, 2024 - 15:29 UTC
Identified
The latest Aura 5 update has introduced a query regression.

Error Signature: key not found: VariableSlotKey(ke)
Workaround: prefixing affected queries with CYPHER runtime=legacy should address the issue.
Posted Jun 28, 2024 - 13:58 UTC
This incident affected: AuraDB Enterprise (*.databases.neo4j.io) (AuraDB Enterprise on AWS (*.databases.neo4j.io), AuraDB Enterprise on Azure (*.databases.neo4j.io), AuraDB Enterprise on GCP (*.databases.neo4j.io)), AuraDB Professional (*.databases.neo4j.io) (AuraDB Professional on AWS (*.databases.neo4j.io), AuraDB Professional on Azure (*.databases.neo4j.io), AuraDB Professional on GCP (*.databases.neo4j.io)), AuraDS (*.databases.neo4j.io) (AuraDS on AWS (*.databases.neo4j.io), AuraDS on Azure (*.databases.neo4j.io), AuraDS on GCP (*.databases.neo4j.io)), AuraDS Enterprise (*.databases.neo4j.io) (AuraDS Enterprise on AWS (*.databases.neo4j.io), AuraDS Enterprise on Azure (*.databases.neo4j.io), AuraDS Enterprise on GCP (*.databases.neo4j.io)), and AuraDB Free (*.databases.neo4j.io).