EVENTS & MQTT & LOGIN APPS

Incident Report for Ubidots

Postmortem

Events & MQTT & Login Apps Issues Postmortem

Date

2019-05-17

Authors

jose

Status

Complete

Summary

The HTTP data ingestion service was down for 22 minutes during the early morning of May 19, 2019 (UTC). This affected services such as Events, MQTT and Login Apps, which depend on the REST API server.

Impact

Medium. Some requests from external devices and internal clusters were rejected.

Root Causes

The MQTT broker allows multiple TCP connections per device in order to support bulk data ingestion. Some devices in South America began TCP negotiations multiple times per second, and because the broker accepts such connections to support bulk operations, the RAM needed to serve them exceeded the server's available physical memory. As a result, the kernel began paging memory, which led to high data access latency.

High latency in the REST API causes data reception issues over every IoT protocol.

Trigger

Latent bug triggered by a sudden increase in RAM usage.

Resolution

IPs with a very high number of TCP connections were banned.
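
As an illustration only, the selection of IPs to ban can be sketched as a count of open connections per remote address. The function name `ips_to_ban`, the threshold of 100 and the sample addresses are hypothetical; the report does not state the threshold or tooling that was actually used.

```python
from collections import Counter
from typing import Iterable, List


def ips_to_ban(remote_ips: Iterable[str], max_connections: int) -> List[str]:
    """Return the IPs holding more than max_connections open connections."""
    counts = Counter(remote_ips)
    return sorted(ip for ip, n in counts.items() if n > max_connections)


# Hypothetical sample: one entry per open TCP connection to the broker.
open_connections = ["203.0.113.7"] * 300 + ["198.51.100.4"] * 2 + ["192.0.2.10"]

# Threshold of 100 is an assumption chosen for the example.
print(ips_to_ban(open_connections, max_connections=100))  # ['203.0.113.7']
```

The resulting list would then be handed to whatever firewall or broker-level block list is in place.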

Detection

Detected by the automated internal service health checker.

Action Items

Action Item	Type	Owner	Bug
Ban IPs with multiple socket connections	mitigate	gustavo woakas@ubidots.com	DONE
Script to free paged memory every hour	prevent	gustavo woakas@ubidots.com	DONE
Alert through internal channels if server RAM usage exceeds 85%	prevent	gustavo woakas@ubidots.com	DONE
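
The 85% RAM alert could be checked along the following lines. This is a minimal sketch that assumes a Linux server exposing `/proc/meminfo` with the `MemTotal` and `MemAvailable` fields; the function names and sample values are hypothetical, and the actual alerting script is not described in this report.

```python
def ram_usage_percent(meminfo_text: str) -> float:
    """Compute used RAM as a percentage from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def should_alert(meminfo_text: str, threshold_percent: float = 85.0) -> bool:
    """True when RAM usage is above the threshold (85% per the action item)."""
    return ram_usage_percent(meminfo_text) > threshold_percent


# Hypothetical sample; on a Linux host, pass open("/proc/meminfo").read().
sample = "MemTotal:        8000000 kB\nMemAvailable:    1000000 kB\n"
print(ram_usage_percent(sample))  # 87.5
print(should_alert(sample))       # True
```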

Lessons Learned

What went well

The automated health checker alerted the DevOps team as soon as the issue appeared.

What went wrong

Supporting bulk operations requires allowing devices to open many socket connections, but a single device with hundreds of sockets was not expected. The DevOps team must analyze how to mitigate this in the upcoming MQTT broker update.
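
One possible shape for such a mitigation is a per-device cap on concurrent connections that still leaves room for bulk ingestion. The sketch below is an assumption for illustration, not the broker's implementation; the class name `ConnectionLimiter`, the device ID and the cap of 3 are hypothetical.

```python
class ConnectionLimiter:
    """Track open connections per device and refuse those above a cap."""

    def __init__(self, max_per_device: int):
        if max_per_device < 1:
            raise ValueError("max_per_device must be at least 1")
        self.max_per_device = max_per_device
        self._open = {}  # device_id -> number of currently open connections

    def try_open(self, device_id: str) -> bool:
        """Register a new connection; return False if the cap is reached."""
        current = self._open.get(device_id, 0)
        if current >= self.max_per_device:
            return False
        self._open[device_id] = current + 1
        return True

    def close(self, device_id: str) -> None:
        """Release one connection previously accepted for device_id."""
        current = self._open.get(device_id, 0)
        if current <= 1:
            self._open.pop(device_id, None)
        else:
            self._open[device_id] = current - 1


# Hypothetical cap of 3 concurrent connections per device.
limiter = ConnectionLimiter(max_per_device=3)
print([limiter.try_open("device-1") for _ in range(5)])
# [True, True, True, False, False]
```

A cap like this bounds the memory a single device can consume while leaving other devices unaffected; the right value depends on how many parallel connections legitimate bulk ingestion needs.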

Supporting Information

Support: support@ubidots.com

Posted May 21, 2019 - 15:34 UTC

Resolved

This incident has been resolved.

Posted May 19, 2019 - 01:02 UTC

Update

We are continuing to monitor for any further issues.

Posted May 19, 2019 - 01:01 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 19, 2019 - 00:38 UTC

Investigating

Events engine, MQTT, HTTP, TCP/UDP and login apps services affected

Posted May 19, 2019 - 00:22 UTC

This incident affected: America (Login Apps, MQTT Publish, Events Engine).