EVENTS & MQTT & LOGIN APPS
Incident Report for Ubidots
Postmortem

Events & MQTT & Login Apps Issues Postmortem

Date

2019-05-17

Authors

jose

Status

Complete

Summary

The HTTP data ingestion service was down for 22 minutes during the early morning of May 19, 2019 (UTC). This affected services such as Events, MQTT, and Login Apps because they depend on the REST API server.

Impact

Medium. Some requests from external devices and internal clusters were rejected.

Root Causes

The MQTT broker allows multiple TCP connections per device in order to support bulk data ingestion. Some devices from South America began opening TCP connections multiple times per second, and because the broker did not reject them (so as to support bulk operations), the amount of RAM needed to serve them exceeded the server's available physical memory. As a result, the kernel began paging memory out to disk, which caused high data access latency.

High latency in the REST API causes problems with data reception over any IoT protocol.

Trigger

Latent bug triggered by a sudden increase in RAM usage.

Resolution

IPs with a very high number of TCP connections were banned.
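
As an illustration of how such IPs might be identified, the following is a minimal sketch, not the procedure actually used during the incident. It assumes a snapshot of remote (IP, port) pairs for the open connections is already available; the function name `find_abusive_ips`, the threshold of 100, and the sample addresses are all hypothetical.

```python
from collections import Counter


def find_abusive_ips(remote_addrs, max_connections):
    """Return IPs with more than max_connections open connections, busiest first."""
    # Count open connections per remote IP, ignoring the source port.
    counts = Counter(ip for ip, _port in remote_addrs)
    return [ip for ip, n in counts.most_common() if n > max_connections]


# Hypothetical snapshot: one client with 300 connections, another with 2.
connections = [("203.0.113.7", 50000 + i) for i in range(300)]
connections += [("198.51.100.4", 40001), ("198.51.100.4", 40002)]

print(find_abusive_ips(connections, max_connections=100))  # ['203.0.113.7']
```

The resulting list could then be passed to whatever firewall or broker-level ban mechanism is in place.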

Detection

Detected by the automated internal service health checker.

Action Items

Action Item | Type | Owner | Bug
Ban IPs with multiple socket connections | mitigate | gustavo woakas@ubidots.com | DONE
Script to free RAM pagination every hour | prevent | gustavo woakas@ubidots.com | DONE
Alert through internal channels if server RAM usage exceeds 85% | prevent | gustavo woakas@ubidots.com | DONE
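
The 85% RAM alert can be sketched as follows. This is a hedged example, not the team's actual alerting script: it assumes a Linux host whose `/proc/meminfo` exposes `MemTotal` and `MemAvailable`, and the function names and sample values are hypothetical. The parser works on text so it can be checked without reading the real file.

```python
def memory_usage_percent(meminfo_text):
    """Compute used RAM as a percentage from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def should_alert(usage_percent, threshold=85.0):
    """Return True when usage is above the alert threshold (85% by default)."""
    return usage_percent > threshold


# Hypothetical sample; on a real host, read the text from /proc/meminfo.
sample = """MemTotal:       16000000 kB
MemFree:          500000 kB
MemAvailable:    1600000 kB
"""
usage = memory_usage_percent(sample)
print(usage, should_alert(usage))  # 90.0 True
```

A periodic job could run this check and post to an internal channel whenever `should_alert` returns True.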

Lessons Learned

What went well

The automated health checker alerted the DevOps team as soon as the issue appeared.

What went wrong

Supporting bulk operations requires allowing devices to open many socket connections, but a single device with hundreds of sockets is not expected. The DevOps team must analyze how to mitigate this in the upcoming MQTT broker update.
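
One possible direction is a per-client connection cap that still leaves room for bulk ingestion. The sketch below is only an illustration under assumptions: the `ConnectionLimiter` class, its method names, and the cap value are hypothetical and do not describe the broker's real behavior or the planned update.

```python
from collections import defaultdict


class ConnectionLimiter:
    """Track open connections per client and refuse new ones above a cap."""

    def __init__(self, max_per_client):
        self.max_per_client = max_per_client
        self._open = defaultdict(int)

    def try_open(self, client_id):
        """Register a new connection; return False if the client is at its cap."""
        if self._open[client_id] >= self.max_per_client:
            return False
        self._open[client_id] += 1
        return True

    def close(self, client_id):
        """Release one connection slot for the client, if it holds any."""
        if self._open.get(client_id, 0) > 0:
            self._open[client_id] -= 1
            if self._open[client_id] == 0:
                del self._open[client_id]


# Hypothetical cap of 2 connections per client, kept small for readability.
limiter = ConnectionLimiter(max_per_client=2)
print([limiter.try_open("device-a") for _ in range(3)])  # [True, True, False]
```

A cap of this kind keeps several parallel connections available per device while bounding the memory a single misbehaving client can consume.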

Supporting Information

Support: support@ubidots.com

Posted May 21, 2019 - 15:34 UTC

Resolved
This incident has been resolved.
Posted May 19, 2019 - 01:02 UTC
Update
We are continuing to monitor for any further issues.
Posted May 19, 2019 - 01:01 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 19, 2019 - 00:38 UTC
Investigating
Events engine, MQTT, HTTP, TCP/UDP and login apps services affected
Posted May 19, 2019 - 00:22 UTC
This incident affected: America (Login Apps Toronto, MQTT Publish Toronto, Events Engine Toronto).