Update - We continue to experience instances where data, although published and acknowledged (an ACK is sent back), is not routed correctly into the database.
As such, we have implemented new measures aimed at balancing connections better and ensuring that customers with potentially faulty MQTT client implementations don't overload the broker with unused connections.
Specifically, our DevOps team has:

1. Introduced an authentication rate limit: a token cannot send more than 4 auth messages per second (see the client-side sketch after this list).
2. Implemented a maximum number of MQTT connections per user. This means a particular customer cannot create more than X MQTT connections, where X is set based on the number of devices, license, and current usage.
3. Deployed what is known as "sticky connections", which ensure that an established session is always routed through the same balancing server as long as the connection remains alive. Along with this, instead of a single MQTT balancing server, there are now 3.
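For client implementations, the practical consequence of these limits is to keep one persistent MQTT session per device and to back off between reconnect attempts rather than hammering the broker with fresh connections. Below is a minimal sketch, assuming the paho-mqtt Python client; the host, token, topic, and payload shown are placeholders, not an official example.

```python
# Minimal sketch, assuming paho-mqtt 1.x (the 2.x constructor additionally
# requires a CallbackAPIVersion argument). Host, token, and topic are placeholders.
import paho.mqtt.client as mqtt

BROKER_HOST = "industrial.api.ubidots.com"   # assumed endpoint; verify for your account
TOKEN = "YOUR-TOKEN"                          # the token is used as the MQTT username

client = mqtt.Client(client_id="device-001", clean_session=False)
client.username_pw_set(TOKEN, password="")

# Bounded exponential backoff keeps reconnect (auth) attempts far below
# the 4-auth-messages-per-second limit described above.
client.reconnect_delay_set(min_delay=1, max_delay=60)

client.connect(BROKER_HOST, 1883, keepalive=60)
client.loop_start()  # one long-lived session instead of many short-lived ones

# Reuse the same connection for every publish; do not open a new client per message.
client.publish("/v1.6/devices/my-device", '{"temperature": {"value": 21.5}}', qos=1)
```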

Oct 10, 2024 - 12:21 UTC
Update - This week, after several deep dives into our MQTT broker logs, our DevOps team focused their efforts on troubleshooting one particular point in the MQTT data reception stack: the internal HTTP webhook that interfaces the broker with the internal data ingestion queues.

We found that the webhook timeout and connection pool size configuration play an important role in ensuring data reception. With that in mind:

1. We increased the internal HTTP webhook timeout.
2. We increased the maximum connection pool size for every MQTT node (an illustrative sketch of both settings follows this list).
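For illustration only, these two knobs look roughly like the snippet below when expressed with Python's requests library. The endpoint and payload names are hypothetical, and the real webhook lives inside the broker rather than in Python; the point is simply how a longer timeout and a larger keep-alive pool are configured on the HTTP client side.

```python
# Illustrative sketch only: hypothetical internal endpoint and example values.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()

# A larger connection pool lets more concurrent webhook deliveries reuse
# keep-alive connections instead of queueing or opening new sockets.
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=100)
session.mount("http://", adapter)
session.mount("https://", adapter)

def forward_to_ingestion(message: dict) -> bool:
    """Push one MQTT message to the internal data ingestion queue service."""
    try:
        resp = session.post(
            "http://ingestion.internal/webhook",  # hypothetical internal endpoint
            json=message,
            timeout=(3.05, 15),  # (connect, read) seconds: the increased webhook timeout
        )
        return resp.ok
    except requests.RequestException:
        return False  # the caller can retry or buffer the message
```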

After seeing a positive impact on the rate of alerts, the DevOps team took the following measures, on top of the two above, to verify the behavior:

1. Disconnected 1 of the 3 balancers. We had added one while testing whether the problem came from a connection overload.
2. Stopped routing a portion of the traffic to a separate MQTT broker, which had been deployed to minimize the load on the main deployment.
3. Increased the number of HTTP-service pods receiving the webhook requests from the MQTT broker.

In summary, these actions have led to a substantial reduction in data loss. We now see only very sporadic alerts, but after detailed monitoring of client data, we are no longer seeing data gaps.

We will continue to monitor the stability of the MQTT data reception service for further tuning.

Sep 20, 2024 - 21:40 UTC
Monitoring - After the fine-tuning of the balancer servers and the activation of the MQTT flapping-detection mechanism last Friday (September 6th), our internal checks still detected failures to deliver data over MQTT, and similar reports came from tests run by our support team. Nonetheless, the occurrence of alerts has been decreasing with each measure our DevOps team has taken.

As our goal is to provide stability and make sure data isn't lost, we continue to implement changes to completely mitigate these MQTT intermittencies. With that, today we have:

1. Deployed an additional load balancer.
2. Increased the number of pods (containers) running the MQTT ingestion services.

Sep 09, 2024 - 20:55 UTC
Update - Our DevOps team has taken the following additional measures to reduce the MQTT intermittencies, although they are still present:

1. Fine-tuned the servers running our load balancers to support more concurrent connections.
2. Enabled a feature in the MQTT broker that automatically detects client disconnections (which also reflects the connection rate) within a time window. If a threshold is exceeded during the window, that particular MQTT client is banned for a configurable duration (a simplified sketch of this logic follows this list).
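As a rough sketch of the flapping-detection idea in item 2 (this is not the broker's actual implementation, and the thresholds are example values rather than our production settings):

```python
# Simplified flapping-detection sketch: count connect/disconnect events per client
# in a sliding time window and ban clients that exceed the threshold.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # example observation window
MAX_EVENTS = 15       # example number of events tolerated per window
BAN_SECONDS = 300     # example (configurable) ban duration

events = defaultdict(deque)   # client_id -> timestamps of recent events
banned_until = {}             # client_id -> monotonic time when the ban expires

def register_event(client_id: str) -> bool:
    """Record a connect/disconnect event; return False if the client is banned."""
    now = time.monotonic()

    if banned_until.get(client_id, 0) > now:
        return False  # still banned

    window = events[client_id]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # drop events that fell outside the time window

    if len(window) > MAX_EVENTS:
        banned_until[client_id] = now + BAN_SECONDS
        window.clear()
        return False
    return True
```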

Sep 06, 2024 - 23:40 UTC
Identified - After implementing the detailed logging, we were able to spot that significant traffic reaching our servers came from inactive users. The traffic was not being rejected directly at our load balancers; instead, it was being allowed to connect and publish data.

We have now blocked said traffic completely, ensuring only active customers are able to connect. This prevents overloading the MQTT servers with invalid traffic.
So far, the internal alerts have decreased considerably, but some still remain.
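Conceptually, the gate described above boils down to a connect-time authentication check that denies any token not belonging to an active account. The sketch below is a hedged illustration with a hypothetical endpoint and response format; the real check runs in our broker and load-balancer layer, not in a Flask service.

```python
# Hypothetical auth-hook sketch: deny MQTT connections from inactive accounts.
from flask import Flask, jsonify, request

app = Flask(__name__)

ACTIVE_TOKENS = {"token-abc", "token-def"}  # would be looked up in the accounts database

@app.post("/mqtt/auth")  # hypothetical endpoint a broker auth hook could call
def mqtt_auth():
    payload = request.get_json(silent=True) or {}
    token = payload.get("username", "")
    verdict = "allow" if token in ACTIVE_TOKENS else "deny"
    return jsonify({"result": verdict})
```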

Our team continues investigating what else is causing the remaining alerts and MQTT intermittencies.

Sep 05, 2024 - 21:59 UTC
Investigating - Over the past 2 weeks, our MQTT service has been experiencing latencies and intermittencies when publishing data or creating connections to do so. In some cases, this has resulted in data loss and in a diminished perception of quality of service.

Our DevOps team is aware of the problem through user reports channeled via our support team. Our internal checks have pointed us to the issue as well.

Our DevOps team has been monitoring the behavior and, so far, we believe sudden spikes of connections are causing the intermittencies. The team has:

1. Established more aggressive restrictions on the number of connections per IP.
2. Established a lower rate limit on connections per second per IP.

These 2 changes have improved the issue, but not fixed it completely.
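For reference, these two per-IP restrictions behave roughly like the toy check below; the actual enforcement happens in our load balancers, and the limits shown are placeholders rather than the production values.

```python
# Toy illustration of per-IP connection limits: a concurrent-connection cap plus
# a cap on new connections per second.
import time
from collections import defaultdict

MAX_CONNECTIONS_PER_IP = 25      # placeholder: concurrent connections allowed per IP
MAX_NEW_CONNECTIONS_PER_SEC = 5  # placeholder: new connections per second per IP

open_conns = defaultdict(int)                 # ip -> currently open connections
rate_window = defaultdict(lambda: (0.0, 0))   # ip -> (bucket start time, attempts in bucket)

def allow_new_connection(ip: str) -> bool:
    now = time.monotonic()
    start, count = rate_window[ip]
    if now - start >= 1.0:
        start, count = now, 0                 # start a fresh one-second bucket
    count += 1
    rate_window[ip] = (start, count)

    if count > MAX_NEW_CONNECTIONS_PER_SEC:
        return False                          # connection rate per second exceeded
    if open_conns[ip] >= MAX_CONNECTIONS_PER_IP:
        return False                          # too many concurrent connections from this IP
    open_conns[ip] += 1
    return True

def close_connection(ip: str) -> None:
    open_conns[ip] = max(0, open_conns[ip] - 1)
```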

As of the time of this note (September 5, 2024, 16:38 UTC), we're implementing more robust and detailed logging that allows us to trace networking and usage per client, with the aim of finding the direct cause of the spikes. This will allow us to determine paths to implement a definitive solution.

We will keep updating this incident as more information becomes available.

Sep 05, 2024 - 16:38 UTC
Component status (uptime over the past 90 days):
Functions: Operational (100.0% uptime)
America: Degraded Performance (99.99% uptime)
Login Apps Toronto: Operational (100.0% uptime)
HTTP Post Toronto: Operational (100.0% uptime)
TCP Toronto: Operational (100.0% uptime)
MQTT Publish Toronto: Degraded Performance (99.99% uptime)
MQTT Subscribe Toronto: Operational (99.99% uptime)
Events Engine Toronto: Operational (100.0% uptime)
UDP Toronto: Operational (100.0% uptime)
Synthetic Variables: Operational (100.0% uptime)
Ubidots Australia (Private deployment): Operational (99.99% uptime)
Login Apps Sydney: Operational (100.0% uptime)
MQTT Publish Sydney: Operational (99.99% uptime)
TCP Sydney: Operational (99.99% uptime)
Events Engine Sydney: Operational (100.0% uptime)
HTTP: Operational (99.99% uptime)
Past Incidents
Nov 20, 2024
Completed - The scheduled maintenance has been completed.
Nov 20, 13:00 UTC
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Nov 20, 11:00 UTC
Scheduled - We will be updating our Kubernetes version. This should not take more than an hour, and the only impact it may have on the platform is mild latencies. No data will be lost during this maintenance window.
Nov 15, 13:57 UTC
Nov 19, 2024

No incidents reported.

Nov 18, 2024

No incidents reported.

Nov 17, 2024

No incidents reported.

Nov 16, 2024

No incidents reported.

Nov 15, 2024
Resolved - This incident has been resolved.
Nov 15, 13:57 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Nov 1, 12:45 UTC
Investigating - We have identified an error with our HTTP data ingestion service and are currently looking into the issue.
Nov 1, 12:27 UTC
Resolved - This incident has been resolved.
Nov 15, 13:57 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Nov 1, 12:45 UTC
Investigating - We have identified an error with our MQTT publish service and are currently looking into the issue.
Nov 1, 12:33 UTC
Nov 14, 2024

No incidents reported.

Nov 13, 2024

No incidents reported.

Nov 12, 2024

No incidents reported.

Nov 11, 2024

No incidents reported.

Nov 10, 2024

No incidents reported.

Nov 9, 2024

No incidents reported.

Nov 8, 2024

No incidents reported.

Nov 7, 2024

No incidents reported.

Nov 6, 2024

No incidents reported.