Incident Report for Ubidots

MQTT Socket Issues Postmortem








The MQTT socket connection service was degraded for 13 minutes on the morning of April 13, 2019 (UTC).


Medium: some socket connections from external devices could not be established.

Root Causes

The number of MQTT socket connections has grown over the last few months as more users opened active accounts at Ubidots. To support this growth and avoid future issues, the DevOps team decided to raise the number of available socket connections per server and to deploy the change on April 13.

Once the change was deployed, a large volume of logs began to be written to the hard disk, which slightly delayed access to some areas of the disk. This was expected. RAM usage, however, rose suddenly to unexpected levels, so the server began to use swap memory. Swap resides on the same hard drive, which was already slow because of the new log writes. This slowness caused socket timeout exceptions, and some of the new socket connections were rejected.
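
The condition described above (swap in use while available RAM is low) can be observed on a Linux host by reading /proc/meminfo. The following is a minimal sketch, not the tooling used during the incident. The function names and the 10% threshold are illustrative assumptions.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, in kB."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def is_under_memory_pressure(meminfo, min_available_ratio=0.10):
    """True when swap is in use and available RAM is below the given ratio.

    The 10% default is an assumption for illustration, not a value from this report.
    """
    total = meminfo.get("MemTotal", 0)
    if total == 0:
        return False
    available_ratio = meminfo.get("MemAvailable", 0) / total
    return swap_used_kb(meminfo) > 0 and available_ratio < min_available_ratio


if __name__ == "__main__":
    meminfo_path = Path("/proc/meminfo")
    if meminfo_path.exists():  # only present on Linux
        info = parse_meminfo(meminfo_path.read_text())
        print("swap used (kB):", swap_used_kb(info))
        print("memory pressure:", is_under_memory_pressure(info))
```

A check like this could feed an alert before swap usage starts to slow down socket handling.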


A latent bug triggered by a sudden increase in RAM usage.


The RAM cache was cleared, and the server went back to using volatile memory (RAM) instead of the swap space on the hard disk.


Detected by the automated internal service health checker.
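
The internal health checker is not described in this report. As a rough illustration of the kind of probe that can detect rejected socket connections, the sketch below attempts a plain TCP connection with a timeout. The function name and default timeout are assumptions. Port 1883 is the standard unencrypted MQTT port. The probe checks TCP reachability only and does not perform an MQTT CONNECT handshake.

```python
import socket


def can_connect(host, port=1883, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures and timeouts
        # (socket.timeout is a subclass of OSError).
        return False
```

A scheduler could call, for example, `can_connect("mqtt.example.com")` (a placeholder host name) every minute and raise an alert after several consecutive failures.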

Action Items

Action Item                                         | Type     | Owner   | Bug
Free RAM cache to avoid using swap memory           | mitigate | gustavo | DONE
Run a script that frees the RAM cache every 3 hours | prevent  | gustavo | DONE
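
The script from the second action item is not included in this report. On Linux, the page cache is commonly released by writing to /proc/sys/vm/drop_caches, so a minimal sketch under that assumption could look like the following. The function name and the `path` parameter are illustrative. The parameter exists only so the function can be exercised without root.

```python
import os

DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"


def drop_page_cache(path=DROP_CACHES_PATH, mode="3"):
    """Ask the Linux kernel to drop clean caches.

    mode "1" frees the page cache, "2" frees dentries and inodes, "3" frees both.
    Writing to the real path requires root privileges.
    """
    if mode not in ("1", "2", "3"):
        raise ValueError("mode must be '1', '2' or '3'")
    # Flush dirty pages to disk first; drop_caches only releases clean ones.
    os.sync()
    with open(path, "w") as f:
        f.write(mode + "\n")
```

To run it every 3 hours, a root crontab entry with the schedule `0 */3 * * *` could invoke a small script that calls `drop_page_cache()`.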

Lessons Learned

What went well

The automated health checker alerted the DevOps team as soon as the issue appeared.

What went wrong

RAM cache overflow was not an expected state in the normal operation workflow. It must be accounted for in future updates.

Supporting Information


Posted Apr 16, 2019 - 15:24 UTC

This incident has been resolved.
Posted Apr 13, 2019 - 16:52 UTC
A fix has been implemented and we are monitoring the results.
Posted Apr 13, 2019 - 16:41 UTC
We have identified an error with our MQTT publish service and are currently looking into the issue.
Posted Apr 13, 2019 - 16:31 UTC