Data Ingestion service degradation
Incident Report for Ubidots
Postmortem

General Data Ingestion Service Outage (HTTP, MQTT, TCP, UDP, Events)

Date

2020-01-31

Authors

Jose Garcia

Status

Done

Summary

Between 20:49 and 22:39 UTC on January 27th, 2020, our data ingestion service experienced a critical issue that prevented both storing data in the database and retrieving data from it. The issue affected our HTTP REST API directly; the MQTT, TCP, and UDP protocols were affected in turn, as they operate as gateways that translate incoming data to HTTP. The issue also impacted data visualization in the web app. Because of the data gap during this time frame, alerts were not triggered.
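The gateway pattern described above, where a protocol front end translates incoming messages into HTTP requests, can be sketched as follows. This is an illustrative translation of an MQTT publish into an equivalent HTTP POST; the topic layout and endpoint follow Ubidots' public API conventions, but the function itself is a hypothetical sketch, not the actual internal implementation.

```python
import json

def mqtt_to_http(topic: str, payload: bytes) -> dict:
    """Translate an MQTT publish into an equivalent HTTP POST request.

    Illustrative sketch only. Assumes topics of the form
    /v1.6/devices/<device_label>, mirroring the public HTTP endpoint.
    """
    device_label = topic.rstrip("/").split("/")[-1]
    return {
        "method": "POST",
        "url": f"https://industrial.api.ubidots.com/api/v1.6/devices/{device_label}",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(json.loads(payload)),
    }

req = mqtt_to_http("/v1.6/devices/thermostat", b'{"temperature": 21.5}')
print(req["url"])
```

Because every protocol funnels into the same HTTP path, an outage at the HTTP layer propagates to all of them at once, which is exactly the failure mode seen in this incident.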

Impact

Critical. Dots (data points), regardless of protocol, were not saved to the database during the incident window.

Root Causes

Context

Ubidots' architecture requires data processing at several microservice levels. Among other implications, this means the availability of the services exposed in the Web App is managed by queues, each carrying a weight score within the task manager cluster. The task cluster handles all the messages shared across the platform's internal modules, including data inputs, data processing outputs, and processing results. The weight scores allow the task manager to prioritize task execution based on each microservice's business criticality; data ingestion holds the highest priority.
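A weighted priority queue of this kind can be sketched with Python's standard `heapq` module. The service names and scores below are illustrative assumptions; only the ordering principle, with data ingestion ranked first, comes from the description above.

```python
import heapq
import itertools

# Lower score = higher priority. Data ingestion outranks everything else,
# matching the priorities described above. Exact scores are illustrative.
PRIORITY = {"data_ingestion": 0, "data_processing": 1, "events": 2, "reports": 3}

class TaskQueue:
    """Minimal weighted task queue: tasks dequeue by microservice priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within a priority

    def push(self, service: str, task):
        heapq.heappush(self._heap, (PRIORITY[service], next(self._counter), task))

    def pop(self):
        _priority, _order, task = heapq.heappop(self._heap)
        return task

q = TaskQueue()
q.push("reports", "build weekly report")
q.push("data_ingestion", "store incoming dot")
print(q.pop())  # the ingestion task dequeues first despite being pushed later
```

This also explains the blast radius of the incident: when the broker cluster stopped queuing and resolving tasks, even the highest-priority ingestion work could not run.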

Issue

At 19:30 UTC on January 27th, 2020, the health checks raised an alert indicating degradation in the task queue system, specifically in the message broker cluster. This caused general slowness in HTTP REST API responses and Web App access, but no outage at that point, as the server was still responding correctly. The degradation did not match any pattern the DevOps team could track, so monitoring continued from that moment onward.

At 20:30 UTC, the degradation became a major issue, and it quickly escalated to an outage at 20:49 UTC. The cluster, for reasons still under investigation, stopped queuing new tasks and stopped resolving existing ones, ultimately making the system unresponsive and causing a general REST API outage.

Initially, the DevOps team deployed additional nodes to the cluster, but this had no effect on the issue: the machine the nodes were deployed on was simply not working as expected. A full reboot of the machine had no effect either. This is why our hypothesis points to a physical hardware or management-software issue on our infrastructure provider's side.

Finally, DevOps decided to launch a new machine with higher resources to host the task queue manager cluster, which required additional time to bring the system back into operation.

Trigger

The message broker cluster that handles queuing of new requests/messages in our system experienced a general downtime, causing, as a consequence, a general REST API outage.

Resolution

The DevOps team deployed a new cluster hosted on a machine with higher resources, also taking the opportunity to prepare for the API improvements planned for this year.

Detection

The issue was detected by the automated internal service health checker.
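A health checker of this kind needs to distinguish degradation (responding, but slowly, as at 19:30 UTC) from an outage (not responding at all, as at 20:49 UTC). The sketch below shows one way to classify a single probe; the threshold value and state names are illustrative assumptions, not Ubidots' actual configuration.

```python
def classify(response_ok: bool, latency_ms: float, slow_threshold_ms: float = 2000.0) -> str:
    """Classify one health probe. Threshold is illustrative, not an actual value."""
    if not response_ok:
        return "outage"    # e.g. the 20:49 UTC state: no response at all
    if latency_ms > slow_threshold_ms:
        return "degraded"  # e.g. the 19:30 UTC state: responding, but slowly
    return "healthy"

print(classify(True, 3500.0))  # slow but successful response -> "degraded"
```

In this incident, the checker alerted on both states in sequence, which gave the team roughly 80 minutes of warning between the first degradation alert and the full outage.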

Action Items

Action item | Type | Owner | Bug
1. Deployment of a new message broker cluster | mitigate | Gustavo Angulo, Juan Agudelo (DevOps Team) | Done
2. High availability tests for the new cluster, to avoid a general downtime in the future | pending | Gustavo Angulo, Juan Agudelo (DevOps Team) | Done

Lessons Learned

What went well

The health check alerted us not only to the degradation but also to the outage of our services.

What went wrong

We could not determine the root cause of the broker issue. Our current hypothesis involves factors outside our control, such as physical machine problems or issues in the cloud provider's management software.

Posted Feb 03, 2020 - 21:54 UTC

Resolved
The issue has been resolved. Data is being restored for all devices.
Posted Jan 28, 2020 - 04:22 UTC
Monitoring
We have deployed a hot patch that has fixed the data ingestion and retrieval issues. We are currently storing the dots missed during the downtime window for all Ubidots accounts.
Posted Jan 28, 2020 - 02:33 UTC
Identified
Our DevOps team has identified the issue and is working on a hot patch deployment. We are currently working to restore the service as soon as possible.
Posted Jan 27, 2020 - 21:45 UTC
Investigating
We are currently experiencing long delays in our data ingestion engine. We are looking for the root cause of this issue.
Posted Jan 27, 2020 - 21:00 UTC
This incident affected: Oceania (MQTT Publish Sydney, TCP Sydney) and America (HTTP Post Toronto, TCP Toronto, MQTT Publish Toronto, MQTT Subscribe Toronto).