MQTT
Incident Report for Ubidots
Postmortem

MQTT Socket Issues Postmortem

Date

2019-04-13

Authors

jose

Status

Complete

Summary

The MQTT socket connection service was degraded for 13 minutes on the morning of the 13th of April, 2019 (UTC).

Impact

Medium. Some socket connections from external devices could not be established.

Root Causes

The number of MQTT socket connections has grown over the last few months as more new users activated accounts at Ubidots. To support this growth and avoid future issues, the DevOps team decided to raise the number of socket connections available per server and to deploy the change on the 13th of April.

Once the change was deployed, a large volume of logs began to be written to the hard disk, which slightly delayed access to some areas of the disk. This was expected. RAM usage, however, rose suddenly to unexpected levels, and the server began to use swap memory. Because swap resides on the same hard drive that was already slowed by the new log writes, socket operations started raising timeout exceptions and some new socket connections were rejected.
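
The chain described above (RAM fills up, the server falls back to swap, swap sits on a slow disk) can be spotted early by watching swap usage. The following is a minimal sketch, assuming a Linux server that exposes /proc/meminfo. The function names and the 10% threshold are hypothetical and are not taken from the incident.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_fraction(meminfo):
    """Return the fraction of swap in use, or 0.0 if no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    free = meminfo.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return (total - free) / total


def memory_pressure(meminfo, swap_threshold=0.10):
    """Hypothetical rule: flag pressure once swap use exceeds the threshold."""
    return swap_used_fraction(meminfo) > swap_threshold
```

On a live server the input would be the contents of /proc/meminfo, for example `memory_pressure(parse_meminfo(open("/proc/meminfo").read()))`.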

Trigger

Latent bug triggered by a sudden increase in RAM usage.

Resolution

The RAM cache was cleared, and the server went back to using RAM instead of swap space on the hard disk.
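
The report does not say which command was used to clear the cache. On Linux, one common way is to write to /proc/sys/vm/drop_caches, and the sketch below assumes that mechanism. The function name and the `path` parameter are hypothetical; `path` exists only so the function can be tried against an ordinary file.

```python
import os

DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"  # standard Linux kernel interface


def drop_page_cache(mode=3, path=DROP_CACHES_PATH):
    """Ask the kernel to drop clean caches.

    mode 1 = page cache, 2 = dentries and inodes, 3 = both.
    Writing to the real /proc path requires root privileges.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # flush dirty pages first so more of the cache can be dropped
    with open(path, "w") as handle:
        handle.write(str(mode) + "\n")
```

Dropping caches only discards clean cached data. It is a stopgap for memory pressure, not a fix for whatever is consuming the memory.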

Detection

Detected by the automated internal service health checker.
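
The internals of the health checker are not described in this report. As an illustration only, a basic probe for this kind of failure could try to open a TCP connection to the MQTT port and treat a timeout or refusal as unhealthy. The function name is hypothetical, and port 1883 is the standard unencrypted MQTT port, assumed here rather than confirmed by the report.

```python
import socket


def mqtt_port_reachable(host, port=1883, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

This checks only that the socket can be opened. A fuller check would also complete an MQTT CONNECT handshake.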

Action Items

Action Item                               | Type     | Owner                      | Bug
Free RAM cache to avoid using swap memory | mitigate | gustavo woakas@ubidots.com | DONE
Script to free RAM cache every 3 hours    | prevent  | gustavo woakas@ubidots.com | DONE
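
The script from the second action item is not included in this report. A minimal sketch of a 3-hour schedule, assuming a long-running Python process instead of a cron entry, might look like the following. `run_periodically` and its parameters are hypothetical names, and `task` stands for whatever function clears the cache.

```python
import time

THREE_HOURS = 3 * 60 * 60  # seconds; interval taken from the action item


def run_periodically(task, interval=THREE_HOURS, max_runs=None, sleep=time.sleep):
    """Call task(), then wait interval seconds, and repeat.

    max_runs=None loops forever; a number stops after that many calls.
    Exceptions from task() are caught so one failure does not end the loop.
    Returns (runs, failures) when max_runs is reached.
    """
    runs = 0
    failures = 0
    while max_runs is None or runs < max_runs:
        try:
            task()
        except Exception as exc:  # keep the schedule alive after a failure
            failures += 1
            print("task failed:", exc)
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval)
    return runs, failures
```

A cron entry that runs every three hours would be an equally plausible choice and avoids keeping a process alive.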

Lessons Learned

What went well

The automated health checker alerted the DevOps team as soon as the issue appeared.

What went wrong

RAM cache overflow was not an expected state in the normal operation workflow. It must be handled in future updates.

Supporting Information

Support: support@ubidots.com

Posted Apr 16, 2019 - 15:24 UTC

Resolved
This incident has been resolved.
Posted Apr 13, 2019 - 16:52 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 13, 2019 - 16:41 UTC
Investigating
We have identified an error with our MQTT publish service and are currently looking into the issue.
Posted Apr 13, 2019 - 16:31 UTC