Between 20:49 and 22:39 UTC on January 27th, 2020, our data ingestion service experienced a critical issue that prevented both storing data to the DB and retrieving data from it. The issue affected our HTTP REST API and, in turn, the MQTT, TCP and UDP protocols, which operate as gateways that translate incoming data to HTTP. Data visualization in the Web App was impacted as well, and alerts were not triggered because of the data gap during this time frame.
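To illustrate the gateway role described above, here is a minimal sketch of how a dot arriving over MQTT could be rewritten as an HTTP REST request. The topic layout, host, and token header are assumptions for illustration, not Ubidots' actual implementation.

```python
import json

API_BASE = "https://industrial.api.example.com/api/v1.6"  # hypothetical host

def mqtt_to_http(topic, payload, token):
    """Translate an MQTT publish into an equivalent HTTP REST request (sketch)."""
    # Assumed topic layout: ".../devices/<device>/<variable>"
    parts = topic.strip("/").split("/")
    device, variable = parts[-2], parts[-1]
    return {
        "method": "POST",
        "url": f"{API_BASE}/devices/{device}/{variable}/values",
        "headers": {"X-Auth-Token": token, "Content-Type": "application/json"},
        "body": json.dumps({"value": json.loads(payload)["value"]}),
    }

req = mqtt_to_http("/v1.6/devices/boiler/temperature", '{"value": 21.5}', "my-token")
```

Because every gateway funnels into the same HTTP layer, an outage there takes all protocols down at once, which is what this incident showed.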
Critical: dots (data points), regardless of protocol, were not being saved to the DB during the issue time frame.
Ubidots' architecture requires data processing at several microservice levels, which, among other implications, means that the availability of the services exposed in the Web App is managed by queues, each carrying a weight score within the task manager cluster. The task cluster handles all messages (data inputs, data processing outputs, processing results, etc.) shared across the platform's internal modules. The availability scores allow the task manager to define each task's execution priority based on the business criticality of its microservice; for that reason, data ingestion holds the highest priority.
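A minimal sketch of this priority scheme, assuming a model where each queue's weight score maps to an execution priority (the service names and weights below are hypothetical, not Ubidots' actual configuration):

```python
import heapq
import itertools

# Hypothetical weight scores: lower number = higher business criticality.
PRIORITY = {"data_ingestion": 0, "data_processing": 1, "notifications": 2}

class TaskQueue:
    """Priority task queue: critical services are dequeued first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def enqueue(self, service, payload):
        heapq.heappush(self._heap, (PRIORITY[service], next(self._counter), service, payload))

    def dequeue(self):
        _, _, service, payload = heapq.heappop(self._heap)
        return service, payload

q = TaskQueue()
q.enqueue("notifications", {"alert": "temperature high"})
q.enqueue("data_ingestion", {"dot": 42})
print(q.dequeue()[0])  # data_ingestion is served first despite arriving later
```

When the broker cluster stops both enqueuing and dequeuing, as happened here, even the highest-priority ingestion tasks stall, which explains why the outage hit data ingestion despite its rank.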
At 19:30 UTC on January 27th, 2020, the health checks raised an alert indicating a degradation in the task queue system, specifically in the message broker cluster, resulting in general slowness of HTTP REST API responses and Web App access, but no outage at that point, as the server was still responding correctly. The degradation did not follow any pattern the DevOps team could track, so monitoring continued from that moment onward.
At 20:30 UTC, the degradation became a major issue and quickly escalated to an outage at 20:49 UTC. The cluster, for reasons still under investigation, was neither queuing additional tasks nor resolving existing ones, ultimately making the system unresponsive and causing a general REST API outage.
Initially, the DevOps team deployed additional nodes to the cluster, but this had no effect on the issue: the machine the nodes were deployed on was simply not working as expected. A full reboot of the machine had no effect either, which is why our hypothesis points to a physical issue, or an issue with a management software service, on our infrastructure provider's side.
Finally, DevOps decided to launch a new machine with higher resources to host the task queue manager cluster, which required additional time to bring the system back into operation.
The message broker cluster that queues new requests/messages in our system experienced a general downtime, which in turn caused a general REST API outage.
The DevOps team deployed a new cluster hosted on a machine with higher resources, also taking the opportunity to prepare for the API improvements planned for this year.
Detected by the automated internal service health checker.
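A hedged sketch of how such a health check can distinguish degradation from an outage, in the spirit of the probe that detected this incident; the endpoint, latency threshold, and state names are assumptions, not the actual internal checker:

```python
import time
import urllib.request

DEGRADED_MS = 1500  # assumed latency threshold for marking a service degraded

def classify(status, latency_ms):
    """Map one probe result (HTTP status, latency) to a health state."""
    if status is None or status >= 500:
        return "outage"       # server error or no response at all
    if latency_ms > DEGRADED_MS:
        return "degraded"     # responding correctly, but slowly
    return "healthy"

def probe(url, timeout=5):
    """Issue a single HTTP probe against the service and classify it."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, (time.monotonic() - start) * 1000)
    except Exception:
        return classify(None, 0)  # connection failure counts as an outage
```

This two-level classification matches the incident timeline: the check first reported "degraded" at 19:30 UTC while the API was still responding, then "outage" once requests stopped resolving.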
| # | Action item | Type | Owner | Status |
|---|---|---|---|---|
| 1 | Deployment of a new message broker cluster | mitigate | Gustavo Angulo, Juan Agudelo (DevOps Team) | Done |
| 2 | High availability tests for the new cluster, to avoid a general downtime in the future | pending | Gustavo Angulo, Juan Agudelo (DevOps Team) | Done |
The health checks alerted us not only to the degradation but also to the outage of our services.
What went wrong
We could not determine the root cause of the issue in the broker. Our current hypothesis points to factors outside our control, such as physical machine problems or issues in the cloud provider's management software.