After fine-tuning the load balancer servers and activating the MQTT flapping detection mechanism last Friday (September 6th), our internal checks still detected failures to deliver data over MQTT, and the tests run by our support team reported similar issues. Nonetheless, the number of alerts has decreased with each measure our DevOps team has taken.
As our goal is to provide stability and ensure no data is lost, we continue to implement changes to fully mitigate these MQTT intermittencies. To that end, today we have:
1. Deployed an additional load balancer.
2. Increased the number of pods (containers) running the MQTT ingestion services (a brief sketch follows below).
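For illustration only, here is a minimal sketch of how an ingestion Deployment's pod count could be scaled up with the official Kubernetes Python client; the deployment name "mqtt-ingest", the namespace "messaging", and the replica count are hypothetical placeholders, not our actual resources.

```python
# Minimal sketch: scaling up an MQTT ingestion Deployment with the
# official Kubernetes Python client. The names "mqtt-ingest" and
# "messaging" are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

# Patch only the replica count, leaving the rest of the spec untouched.
apps.patch_namespaced_deployment_scale(
    name="mqtt-ingest",
    namespace="messaging",
    body={"spec": {"replicas": 6}},
)
```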
Posted Sep 09, 2024 - 20:55 UTC
Update
Our DevOps team has taken the following additional measures to reduce the MQTT intermittencies, although they are still present:
1. Fine-tuned the servers running our load balancers to support more concurrent connections.
2. Enabled a flapping-detection feature in the MQTT broker that automatically tracks client disconnections, and therefore the reconnection rate, within a time window. If a threshold is exceeded during that window, the offending MQTT client is banned for a configurable duration (a brief sketch follows below).
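For context, here is a minimal sketch of the flapping-detection idea described in point 2, assuming hypothetical window, threshold, and ban values; it illustrates counting disconnects in a sliding window and banning clients that exceed the limit, and is not the broker's actual implementation.

```python
# Illustrative sketch of flapping detection: count disconnects per client
# within a sliding time window and ban clients that exceed a threshold.
# The window, threshold, and ban duration below are hypothetical values.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # size of the detection window
MAX_DISCONNECTS = 15     # disconnects tolerated within the window
BAN_SECONDS = 300        # how long an offending client stays banned

disconnects = defaultdict(deque)  # client_id -> timestamps of recent disconnects
banned_until = {}                 # client_id -> unix time the ban expires

def on_disconnect(client_id, now=None):
    now = now or time.time()
    events = disconnects[client_id]
    events.append(now)
    # Drop events that have fallen out of the window.
    while events and now - events[0] > WINDOW_SECONDS:
        events.popleft()
    if len(events) > MAX_DISCONNECTS:
        banned_until[client_id] = now + BAN_SECONDS

def is_banned(client_id, now=None):
    now = now or time.time()
    return banned_until.get(client_id, 0) > now
```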
Posted Sep 06, 2024 - 23:40 UTC
Identified
After implementing the detailed log, we were able to identify that a significant portion of the traffic reaching our servers came from inactive users. This traffic was not being rejected directly at our load balancers; instead, it was allowed to connect and publish data.
We have now blocked that traffic completely, ensuring that only active customers are able to connect. This prevents the MQTT servers from being overloaded with invalid traffic. So far, internal alerts have decreased considerably, but some remain.
Our team continues to investigate what else is causing the remaining alerts and MQTT intermittencies.
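As an illustration of the kind of check now enforced, here is a minimal sketch that rejects connections from inactive accounts at authentication time; the helper names and status values are hypothetical, not our actual implementation.

```python
# Illustrative sketch: reject MQTT connections from inactive accounts at
# authentication time. `lookup_account_status` and `verify_token` are
# hypothetical helpers standing in for the real backend checks.
def lookup_account_status(username: str) -> str:
    # Placeholder; in practice this would query the customer database or cache.
    return "inactive"

def verify_token(username: str, token: str) -> bool:
    # Placeholder for the real credential validation.
    return bool(token)

def authenticate(username: str, token: str) -> bool:
    """Allow a connection only if the credentials belong to an active customer."""
    if lookup_account_status(username) != "active":
        return False  # inactive users are rejected before they can publish
    return verify_token(username, token)
```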
Posted Sep 05, 2024 - 21:59 UTC
Investigating
Over the past two weeks, our MQTT service has been experiencing latency and intermittencies when publishing data or establishing the connections to do so. In some cases this has resulted in data loss and a degraded perception of service quality.
Our DevOps team is aware of the problem through user reports channeled via our support team, and our internal checks have pointed to the issue as well.
Our DevOps team has been monitoring the behavior, and so far we believe sudden spikes of connections are causing the intermittencies. The team has:
1. Established more aggressive restrictions on concurrent connections per IP.
2. Established a lower rate limit on connections per second per IP (see the sketch below).
These two changes have improved the situation but have not fixed it completely.
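To make the second measure concrete, here is a minimal sketch of per-IP connection rate limiting using a token bucket; the rate and burst values are hypothetical examples, not our production limits.

```python
# Illustrative token-bucket rate limiter for new connections per source IP.
# The 5 connections/second rate and burst of 10 are hypothetical examples.
import time

RATE = 5.0    # tokens (new connections) added per second, per IP
BURST = 10.0  # maximum bucket size, i.e. the allowed burst of connections

buckets = {}  # ip -> (tokens, last_refill_timestamp)

def allow_connection(ip, now=None):
    now = now or time.time()
    tokens, last = buckets.get(ip, (BURST, now))
    # Refill the bucket based on elapsed time, capped at the burst size.
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens < 1.0:
        buckets[ip] = (tokens, now)
        return False  # over the per-IP rate limit; reject the connection
    buckets[ip] = (tokens - 1.0, now)
    return True
```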
As of the time of this note (September 5, 2024, 16:38 UTC), we are implementing more robust and detailed logging that allows us to trace networking and usage per client, with the aim of finding the direct cause of the spikes. This will allow us to determine paths toward a definitive solution.
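For illustration, here is a minimal sketch of the kind of per-client structured connection log being added, assuming JSON-lines output; the field names and example values are hypothetical.

```python
# Illustrative sketch of per-client structured logging for MQTT connection
# events, emitted as JSON lines. Field names here are hypothetical.
import json
import logging
import time

logger = logging.getLogger("mqtt.traffic")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_client_event(event, client_id, ip, **extra):
    """Record a connection/publish event with enough context to trace usage per client."""
    record = {
        "ts": time.time(),
        "event": event,        # e.g. "connect", "disconnect", "publish"
        "client_id": client_id,
        "ip": ip,
        **extra,
    }
    logger.info(json.dumps(record))

# Example usage:
# log_client_event("connect", client_id="abc123", ip="203.0.113.7", clean_session=True)
```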
We will keep updating this incident as more information becomes available.
Posted Sep 05, 2024 - 16:38 UTC
This incident affects: America (MQTT Publish Toronto).