Major Outage on Mambu Production (N. Virginia, USA)

Incident Report for Mambu

Postmortem

Summary

Mambu experienced 2m19s of downtime on Monday, Aug 13, 2018, between 5:00:47 PM and 5:03:07 PM UTC, on the shared UsEast1 (N. Virginia, USA) Production environment.

What Happened?

Mambu uses Auto Scaling Groups (ASG) to keep the desired number of healthy instances running to serve traffic. The Application Load Balancer (ALB) works closely with the ASG: it forwards traffic only to healthy instances, and the ASG scales the fleet based on traffic and latency. The destination of forwarded traffic is called a Target Group (TG). Under normal conditions, when the ASG spins up a new instance, the instance is added to the TG through a registration process.

Due to a failure in the instance registration process, some of our healthy instances were not successfully registered with the Target Group and remained in the "Pending" state for longer than usual. This significantly reduced the capacity available to serve the incoming request volume, which in turn overloaded the instances that were still operational. Under our current self-healing mechanism, we terminate and replace overloaded instances to proactively prevent CPU starvation. While the replacement was in progress, no operational instances were left to serve requests, so tenants experienced unavailability for the duration of the replacement.
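The cascade described above can be illustrated with a deliberately simplified model. It is a sketch, not our actual self-healing logic: it assumes load is spread evenly across registered instances, that each instance has a fixed capacity, and that every overloaded instance is terminated in the same pass. The function names and all numbers are hypothetical.

```python
import math


def required_instances(demand_rps, capacity_rps):
    """Smallest number of instances that can absorb the demand without overload."""
    return math.ceil(demand_rps / capacity_rps)


def surviving_instances(demand_rps, capacity_rps, registered):
    """Instances still in service after one self-healing pass.

    Assumption: load is split evenly, so either every registered instance
    is overloaded (and terminated) or none is.
    """
    if registered == 0:
        return 0
    load_per_instance = demand_rps / registered
    return 0 if load_per_instance > capacity_rps else registered


# Hypothetical figures: 900 requests/s of demand, 100 requests/s per instance.
needed = required_instances(900, 100)           # 9 instances needed
full_fleet = surviving_instances(900, 100, 10)  # all 10 registered: 90 rps each
degraded = surviving_instances(900, 100, 6)     # 4 stuck in "Pending": 150 rps each
```

In this model, once fewer instances are registered than the demand requires, every remaining instance crosses the overload threshold at once, which matches the brief period with no operational instances described above.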

What Are We Doing About This?

As preventive measures, we have added the TG and ASG metrics to our monitoring system and are creating an additional alarm in CloudWatch so that we can respond proactively to this kind of issue. We have also asked AWS support for clarification regarding the failed Target Group registration, in order to prevent similar issues in the future.
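As an illustration of the kind of alarm mentioned above, the sketch below builds the parameters for a CloudWatch alarm on the number of healthy targets in a Target Group. It is not the alarm we actually deployed: the helper name, alarm name, threshold, and period are assumptions chosen for the example. `HealthyHostCount` in the `AWS/ApplicationELB` namespace is the standard ALB metric for healthy targets.

```python
def healthy_host_alarm(target_group, load_balancer, min_healthy):
    """Build keyword arguments for a CloudWatch alarm on healthy target count.

    target_group and load_balancer are the CloudWatch dimension values
    (the trailing parts of the respective ARNs); min_healthy is a
    hypothetical threshold chosen by the operator.
    """
    return {
        "AlarmName": f"tg-healthy-hosts-below-{min_healthy}",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HealthyHostCount",
        "Dimensions": [
            {"Name": "TargetGroup", "Value": target_group},
            {"Name": "LoadBalancer", "Value": load_balancer},
        ],
        "Statistic": "Minimum",
        "Period": 60,               # evaluate one-minute datapoints
        "EvaluationPeriods": 1,     # fire on the first breaching datapoint
        "Threshold": float(min_healthy),
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # no data is treated as a problem
    }


# Placeholder dimension values, for illustration only.
params = healthy_host_alarm("targetgroup/example/0123", "app/example/4567", 4)
```

With boto3, a dictionary like this could be passed as `cloudwatch.put_metric_alarm(**params)`. Alarming on the minimum over a short period is one possible design choice: it reacts quickly when instances stay unregistered, at the cost of more noise during routine deployments.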

We realise that this incident disrupted the functionality you rely on. As always, if you have any questions or concerns, feel free to contact us via the usual support channels.

Posted Aug 23, 2018 - 10:55 UTC

Resolved

Mambu experienced an outage of 2m19s (139s in total) from Aug 13, 2018 5:00:47 PM to Aug 13, 2018 5:03:07 PM UTC. The issue was fixed by detecting the erroneous servers and shutting them down. We have started an internal investigation and will post a detailed post-mortem as soon as possible.

Sincere apologies for the issues this event has caused.

Posted Aug 13, 2018 - 17:49 UTC

This incident affected: Mambu Production (N. Virginia, USA) and Mambu Sandbox (N. Virginia, USA).