The HTTP data ingestion service was down for 22 minutes in the early morning of May 19, 2019 (UTC). This affected dependent services such as Events, MQTT, and Login Apps, which rely on the REST API server.
Medium: some requests from external devices and internal clusters were rejected.
The MQTT broker allows each device to open multiple TCP connections in order to support bulk data ingestion. Some devices in South America began initiating TCP handshakes several times per second, and because the broker accepted them all (to support bulk operations), the RAM required to hold the connections exceeded the server's available physical memory. The kernel then began paging memory out to disk, which caused very high data-access latency.
High latency in the REST API leads to data-reception issues across every IoT protocol.
A latent bug triggered by a sudden increase in RAM usage.
IPs with a very high number of TCP connections were banned.
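The ban described above implies identifying which remote IPs hold an abnormal number of open TCP connections. A minimal sketch of that check is below; the connection limit of 100 and the sample data are assumptions for illustration, and in production the list of remote IPs would come from something like `ss -tn` output or the broker's own connection registry.

```python
from collections import Counter

# Assumed per-IP connection cap; the real threshold was not stated
# in the incident report.
CONNECTION_LIMIT = 100

def ips_to_ban(remote_ips, limit=CONNECTION_LIMIT):
    """Return the remote IPs holding more TCP connections than `limit`."""
    counts = Counter(remote_ips)
    return sorted(ip for ip, n in counts.items() if n > limit)

if __name__ == "__main__":
    # Hypothetical snapshot: one device flooding, one behaving normally.
    sample = ["10.0.0.5"] * 150 + ["10.0.0.9"] * 3
    print(ips_to_ban(sample))  # -> ['10.0.0.5']
```

The actual ban could then be applied at the firewall level (e.g. an iptables rule per flagged IP) rather than inside the broker.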
Detected by the automated internal service health checker.
| Action item | Type | Owner | Status |
| --- | --- | --- | --- |
| Ban IPs with multiple socket connections | mitigate | gustavo firstname.lastname@example.org | DONE |
| Script to free paged RAM every hour | prevent | gustavo email@example.com | DONE |
| Alert through internal channels if server RAM usage exceeds 85% | prevent | gustavo firstname.lastname@example.org | DONE |
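The 85% RAM alert from the action items can be implemented by reading `/proc/meminfo` on Linux. A minimal sketch follows; the `MemAvailable`-based definition of "used" RAM and the alerting hook are assumptions, not details from the report.

```python
def ram_usage_percent(meminfo_text):
    """Compute used-RAM percentage from the contents of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            fields[key.strip()] = int(rest.split()[0])  # values are in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total

def should_alert(meminfo_text, threshold=85.0):
    """True when used RAM meets or exceeds the alert threshold."""
    return ram_usage_percent(meminfo_text) >= threshold

if __name__ == "__main__":
    # On a real host: meminfo_text = open("/proc/meminfo").read()
    sample = "MemTotal: 8000000 kB\nMemAvailable: 1200000 kB"
    print(should_alert(sample))  # 85% used -> True
```

A cron job or the health checker itself could run this check and post to the internal channels when it returns `True`.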
The automated health checker alerted the DevOps team as soon as the issue appeared.
Supporting bulk operations requires allowing many devices to open multiple socket connections, but a single device holding hundreds of sockets was not expected. The DevOps team must analyze how to mitigate this scenario in the upcoming MQTT broker update.
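One option for the broker update is a per-device connection cap: bulk ingestion still gets multiple sockets, but a runaway device cannot hold hundreds. The sketch below is a hypothetical gate, not the broker's actual API; the cap of 16 sockets per device is an assumed value pending the analysis mentioned above.

```python
from collections import defaultdict

# Assumed per-device cap; the appropriate limit still needs analysis.
MAX_SOCKETS_PER_DEVICE = 16

class ConnectionGate:
    """Track open sockets per device and reject new ones over the cap."""

    def __init__(self, max_per_device=MAX_SOCKETS_PER_DEVICE):
        self.max_per_device = max_per_device
        self.open_sockets = defaultdict(int)

    def try_accept(self, device_id):
        """Accept a new socket for `device_id`, or reject it at the cap."""
        if self.open_sockets[device_id] >= self.max_per_device:
            return False  # device already holds its maximum
        self.open_sockets[device_id] += 1
        return True

    def release(self, device_id):
        """Record that one of the device's sockets was closed."""
        if self.open_sockets[device_id] > 0:
            self.open_sockets[device_id] -= 1
```

Rejecting the connection at accept time keeps the memory cost of a flooding device bounded, instead of paying for every half-open socket as happened during this incident.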