During a routine automated deployment, the load balancers that sit in front of Aura’s Neo4j databases were inadvertently replaced. The new load balancers were assigned different IP addresses from the old ones. The DNS entries for the databases were automatically updated to point to the new load balancers.
All databases experienced an outage while the new IP addresses propagated through the DNS infrastructure to client applications. The length of the outage varied with the DNS caching behavior of each client’s infrastructure and applications. (The 50th percentile outage was approximately 5 minutes, the 95th percentile approximately 15 minutes.)
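To illustrate why application-level caching matters here, the following is a minimal Python sketch of a client that looks the hostname up again before every connection attempt rather than reusing an address it resolved earlier. It is an illustration only, not how Aura or the Neo4j drivers behave. The function names and the hostname in the usage note are hypothetical, and the system resolver or an upstream DNS cache may still return a stale address until its TTL expires.

```python
import socket
import time


def resolve_ipv4(host, port):
    """Look up the host's current IPv4 addresses via the system resolver."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return [info[4][0] for info in infos]


def connect_with_fresh_lookup(host, port, attempts=5, delay=1.0,
                              resolve=resolve_ipv4,
                              connect=socket.create_connection):
    """Retry a connection, resolving the hostname again before each attempt.

    `resolve` and `connect` are injectable (hypothetical hooks for this
    sketch) so the retry logic can be exercised without a network.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # Do not reuse an address from an earlier attempt: ask again.
            addresses = resolve(host, port)
        except OSError as exc:
            last_error = exc
            addresses = []
        for address in addresses:
            try:
                return connect((address, port))
            except OSError as exc:
                last_error = exc
        if attempt < attempts - 1:
            time.sleep(delay)
    raise ConnectionError(f"could not connect to {host}:{port}") from last_error
```

A caller might use `connect_with_fresh_lookup("db.example.invalid", 7687)`, where the hostname is a placeholder and 7687 is the default Bolt port. A client that instead resolves the name once at startup and keeps the address for the life of the process would not recover from an IP change until it is restarted.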
Some client applications that sit behind firewalls with explicit allowlisting of database IP addresses experienced a longer outage while those firewalls were reconfigured.
We don’t believe that an outage of up to 15 minutes is acceptable for a DBaaS. We’ve carried out a thorough analysis of what went wrong. The actions we’re taking to ensure that nothing like this can happen again fall into three areas.