EVENTS & MQTT & WEB PORTAL

Incident Report for Ubidots

Postmortem

Events & MQTT & HTTP & Login Apps Issues Postmortem

Date

2019-05-17

Authors

jose

Status

Complete

Summary

The HTTP data ingestion service was down for 160 minutes during the early morning of May 20, 2019 (UTC). The outage also affected Events, MQTT and Login Apps, because these services depend on the REST API server.

Impact

Critical: all requests from both external devices and internal clusters were rejected.

Root Causes

Our infrastructure service provider, IBM, announced through its email support channel a maintenance job with ID 79902059 on Sunday. The notice specified: "Any customers that DO NOT have Dual path networking configured on a baremetal server will experience disruption".

The Ubidots message broker runs on virtual servers, so an outage was not expected. Unfortunately, we experienced ten minutes of downtime during this job, and the internal RabbitMQ of our cluster lost synchronization.

The DevOps team had to restart the whole cluster to resynchronize the message broker.
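
As an illustration of how lost synchronization could be spotted before resorting to a full restart, the sketch below scans queue descriptions for mirrors that are not in sync. It assumes classic mirrored queues and the `slave_nodes` / `synchronised_slave_nodes` fields that the RabbitMQ management API reports for them under `/api/queues`. Whether the Ubidots cluster used this setup is not stated in this report, and the queue and node names are hypothetical.

```python
def unsynchronised_queues(queues):
    """Map each queue name to the mirror nodes that are not in sync.

    `queues` is a list of dicts shaped like the JSON returned by the
    RabbitMQ management API for classic mirrored queues (assumption).
    Queues whose mirrors are all synchronised are left out.
    """
    lagging_by_queue = {}
    for queue in queues:
        mirrors = set(queue.get("slave_nodes", []))
        synced = set(queue.get("synchronised_slave_nodes", []))
        lagging = sorted(mirrors - synced)
        if lagging:
            lagging_by_queue[queue["name"]] = lagging
    return lagging_by_queue


# Hypothetical sample data, not taken from the incident.
sample = [
    {"name": "ingest", "slave_nodes": ["rabbit@b", "rabbit@c"],
     "synchronised_slave_nodes": ["rabbit@b"]},
    {"name": "events", "slave_nodes": ["rabbit@b"],
     "synchronised_slave_nodes": ["rabbit@b"]},
]
print(unsynchronised_queues(sample))
```

A check like this only reports the state of the mirrors; it does not repair them.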

Trigger

The message broker could not synchronize its queues after a maintenance job by the infrastructure provider.

Resolution

Hard reboot of the cluster.

Detection

Detected by the automated internal service health checker.

Action Items

Action Item	Type	Owner	Bug
Cluster reboot	mitigate	gustavo woakas@ubidots.com	DONE

Lessons Learned

What went well

The automated health checker alerted the DevOps team as soon as the issue appeared.

What went wrong

* According to the provider’s notification, the maintenance job was not supposed to result in downtime. We are asking for a complete job description to learn whether anything else should have been taken into account during this planned job.

* The health check alerted through email and internal messages, but the phone call to the DevOps leader was triggered several minutes after the incident began. The incident escalation policy needs to be updated to avoid that delay.
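
The escalation change described in the last point can be sketched as a small policy table. This is a minimal illustration, not the actual Ubidots alerting configuration: the severities, channel names and delays are all hypothetical. The idea is that a critical alert triggers the phone call at the same moment as email and chat, while lower severities escalate to a call only if nobody acknowledges the alert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_seconds: int  # delay since the alert first fired
    channel: str        # hypothetical channel name


# Hypothetical policy; every value here is illustrative only.
POLICY = {
    "critical": (
        EscalationStep(0, "email"),
        EscalationStep(0, "chat"),
        EscalationStep(0, "phone_call"),  # no delay for critical alerts
    ),
    "warning": (
        EscalationStep(0, "email"),
        EscalationStep(0, "chat"),
        EscalationStep(600, "phone_call"),  # call after 10 unacknowledged minutes
    ),
}


def channels_due(severity, elapsed_seconds, acknowledged):
    """Return the channels that should have been notified by now."""
    if acknowledged:
        return []
    return [step.channel
            for step in POLICY[severity]
            if step.after_seconds <= elapsed_seconds]


print(channels_due("critical", 0, False))
print(channels_due("warning", 60, False))
```

Keeping the delays in one table makes it easy to review how long each severity can go without a phone call.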

Supporting Information

Support: support@ubidots.com

Posted May 21, 2019 - 16:00 UTC

Resolved

This incident has been resolved.

Posted May 20, 2019 - 09:53 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 20, 2019 - 09:50 UTC

Investigating

Events engine, MQTT, HTTP, TCP/UDP and login apps services affected

Posted May 20, 2019 - 06:58 UTC

This incident affected: America (Login Apps, HTTP Post, MQTT Publish, Events Engine).