Delay in processing notifications for the Dublin, Ireland region
Incident Report for Mambu
Postmortem

Post Mortem - EU-West1 production notifications were delayed

Summary

Notifications on our EU-West1 environment were delayed from 15th October 2019, 19:00 UTC until 16th October 2019, 14:52 UTC.

What Happened?

  • 2019-10-15, 19:00:00 UTC notifications stopped being sent via Mambu’s Cron Jobs capabilities

  • 2019-10-16, 06:52:41 UTC a tenant informed Mambu about this issue

  • 2019-10-16, 07:55 UTC the issue was handled by Mambu Tech Support

  • 2019-10-16, 08:01 UTC the CRON machine was restarted and notifications started to be sent

  • 2019-10-16, 09:19:06 UTC the same tenant informed Mambu again that their notifications were not being sent

  • After this, the investigation resumed; this time, restarting the CRON machine did not help

  • 2019-10-16, 14:00:00 UTC we identified the issue at the database level and killed a long-running query

  • 2019-10-16, 14:52:00 UTC we confirmed that notifications were being sent again

What Are We Doing About This?

We have not yet identified a root cause for the notification delays and we are continuing our investigations. To avoid future incidents, we have defined the following actions:

  1. Collect a thread dump and a memory dump every time we manually restart or terminate a machine. All data on the server is lost on restart, which makes finding the root cause very difficult. Collecting these dumps will ensure that we have the information needed to find the root cause of a problem and take the right measures to make our services more resilient
  2. Create a monitoring and alerting system for notifications, so that Mambu finds such malfunctions before our customers do and proactively addresses them
  3. As a long-term solution, add a tracing mechanism for notifications to speed up investigations of such issues
  4. As a temporary solution until the previous actions are implemented, update the run-book for this type of alert to ensure our team is equipped in case such incidents happen again
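
To illustrate the kind of check the monitoring action could rely on, here is a minimal sketch of a stall detector. It is not Mambu's implementation: the names `NotificationHealth`, `is_stalled` and `max_silence`, and the 15-minute threshold, are assumptions made for this example. The idea is to raise an alert when notifications are waiting but none has been sent for longer than an agreed period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class NotificationHealth:
    """Hypothetical snapshot of the notification pipeline."""
    pending: int                       # notifications waiting to be sent
    last_sent_at: Optional[datetime]   # time of the last successful send


def is_stalled(
    health: NotificationHealth,
    now: datetime,
    max_silence: timedelta = timedelta(minutes=15),  # assumed threshold
) -> bool:
    """Return True when work is queued but nothing was sent recently."""
    if health.pending == 0:
        # Nothing to send, so silence is expected.
        return False
    if health.last_sent_at is None:
        # Work is queued and nothing has ever been sent.
        return True
    return now - health.last_sent_at > max_silence


# Example using the start time from the timeline above.
snapshot = NotificationHealth(
    pending=40,
    last_sent_at=datetime(2019, 10, 15, 19, 0, tzinfo=timezone.utc),
)
check_time = datetime(2019, 10, 15, 19, 20, tzinfo=timezone.utc)
if is_stalled(snapshot, check_time):
    print("ALERT: notifications appear to be stalled")
```

A check of this shape, run on a schedule outside the CRON machine itself, would flag a stall within minutes rather than relying on a tenant report.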

At Mambu we take our commitment to deliver a high-quality service very seriously and we sincerely apologise for the inconvenience this issue has caused. If you have any questions, please contact us via our usual support channels.

Posted Oct 25, 2019 - 12:53 UTC

Resolved
This incident has been resolved.
Posted Oct 16, 2019 - 14:55 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 16, 2019 - 14:19 UTC
Identified
The issue has been identified and notifications have started again. Thank you for your patience in this matter.
Posted Oct 16, 2019 - 14:06 UTC
Update
We are continuing to investigate this situation and have eliminated several leads, but at this time we have not yet identified the root cause.
Posted Oct 16, 2019 - 13:09 UTC
Investigating
Mambu has become aware of a situation affecting notifications (SMS, email and webhooks) in the shared Dublin, Ireland environment. Users may experience delays in the processing of notifications.

We are currently investigating the root cause and will update you when we have identified it.
Posted Oct 16, 2019 - 12:14 UTC
This incident affected: Mambu Production (Dublin, Ireland) and Mambu Sandbox (Dublin, Ireland).