Major Outage on EuWest1 Production (Dublin, Ireland)
Incident Report for Mambu
Postmortem

Summary

The EU WEST live production environment was not available for 7 minutes during working hours.

What Happened?

We initiated a configuration change on the production environment in the EU WEST region. This type of change does not normally require downtime, and it was applied as a rolling update of the environment.

For improved performance, Mambu employs CPU-optimized instances. Our cloud services provider sets a limit on the total number of instances that can be used per day per region. Due to a previous incident on the sandbox environment, the CPU credits for the EU WEST region were exhausted, so the rolling update did not succeed. To resolve this, we changed the instance type to one that is not credit-based.

The whole resolution process, including the startup of the application, took 7 minutes.

What Are We Doing About This?

We will take preventive measures to improve the detection of this scenario and to monitor CPU credit consumption on AWS. In addition, we plan to optimize the bootstrap process of the application to reduce the startup time.
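
As an illustration only, the kind of check such monitoring might perform can be sketched as follows. This is a minimal sketch, not Mambu's actual tooling: the names `CreditSample`, `minutes_until_exhausted` and `should_alert`, as well as the threshold values, are hypothetical. It assumes credit balance samples have already been collected, for example from the `CPUCreditBalance` metric that AWS CloudWatch reports for credit-based instances, and it uses a simple linear estimate between the first and last sample.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CreditSample:
    """One observation of an instance's CPU credit balance (hypothetical type)."""
    minutes: float  # sample time, in minutes since an arbitrary start
    balance: float  # CPU credit balance reported at that time


def minutes_until_exhausted(samples: Sequence[CreditSample]) -> Optional[float]:
    """Linearly estimate the minutes left before the balance reaches zero.

    Returns None when there are too few samples or the balance is not falling.
    """
    if len(samples) < 2:
        return None
    first, last = samples[0], samples[-1]
    elapsed = last.minutes - first.minutes
    if elapsed <= 0:
        return None
    burn_rate = (first.balance - last.balance) / elapsed  # credits per minute
    if burn_rate <= 0:
        return None
    return last.balance / burn_rate


def should_alert(
    samples: Sequence[CreditSample],
    min_balance: float = 50.0,          # assumed threshold, tune per instance type
    min_runway_minutes: float = 60.0,   # assumed threshold
) -> bool:
    """Alert when the balance is already low or is projected to run out soon."""
    if not samples:
        return False
    if samples[-1].balance < min_balance:
        return True
    runway = minutes_until_exhausted(samples)
    return runway is not None and runway < min_runway_minutes


if __name__ == "__main__":
    draining = [CreditSample(0, 100.0), CreditSample(10, 80.0)]
    print(minutes_until_exhausted(draining))
    print(should_alert(draining))
```

A check like this would be run before starting a rolling update, so that an exhausted or rapidly draining credit balance is noticed before new instances are launched rather than during the rollout.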

We realize that this incident disrupted the service. We have conducted a detailed post-mortem analysis to identify the root causes and contributing factors, and we have scheduled work to improve both the application and the service overall.

As always, if you have any questions or concerns, feel free to contact us via the usual support channels.

Posted May 22, 2018 - 16:46 UTC

Resolved
This incident has been resolved.
Posted May 15, 2018 - 13:40 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 15, 2018 - 10:33 UTC
Investigating
We are currently investigating this issue.
Posted May 15, 2018 - 10:29 UTC
This incident affected: Mambu Production (Dublin, Ireland) and Mambu Sandbox (Dublin, Ireland).