The HTTP data ingestion service was down for 160 minutes during the early morning of May 20, 2019 (UTC). Dependent services such as Events, MQTT, and Login Apps were also affected because of their dependency on the REST API server.
Critical: all requests, from both external devices and internal clusters, were rejected.
Our infrastructure service provider, IBM, announced through its email support channel a maintenance job (ID 79902059) scheduled for that Sunday. The notification specified: "Any customers that DO NOT have Dual path networking configured on a baremetal server will experience disruption".
The Ubidots message broker runs its services on virtual servers, so no outage was expected. Unfortunately, we experienced roughly ten minutes of downtime during this job, and the internal RabbitMQ cluster lost its queue synchronization.
The DevOps team had to restart the whole cluster to resynchronize the message broker.
The message broker could not synchronize its queues after a maintenance job performed by the infrastructure provider.
Hard reboot of the cluster.
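After a reboot like this, it is worth confirming that every mirrored queue reports synchronised mirrors again before declaring the incident resolved. A minimal sketch of such a check is below; it assumes the tab-separated output of `rabbitmqctl list_queues name synchronised_slave_pids` (classic mirrored queues) is piped into it, which is an assumption about the cluster's configuration, not something stated in this report.

```python
def unsynchronised_queues(rabbitmqctl_output: str) -> list:
    """Parse `rabbitmqctl list_queues name synchronised_slave_pids` output
    and return the names of queues with no synchronised mirrors.

    The first line is assumed to be the "Listing queues ..." banner, and
    data lines are assumed to be tab-separated: <name>\t<pid list>.
    """
    bad = []
    for line in rabbitmqctl_output.splitlines()[1:]:
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        name, pids = parts[0], parts[1].strip()
        # An empty list means the queue has no synchronised mirror.
        if pids in ("", "[]"):
            bad.append(name)
    return bad
```

In practice this would run as `rabbitmqctl list_queues name synchronised_slave_pids | python3 check_sync.py` (hypothetical wrapper script) on one of the cluster nodes.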
Detected by the automated internal service health checker.
| Action Item    | Type     | Owner                     | Status |
|----------------|----------|---------------------------|--------|
| Cluster reboot | mitigate | gustavo email@example.com | DONE   |
The automated health checker alerted the DevOps team as soon as the issue appeared.
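The health checker itself is not described in this report; as a rough sketch, the core of such a probe could be an HTTP request against a hypothetical `/health` endpoint of the REST API server (the endpoint name and timeout are assumptions for illustration):

```python
import urllib.request
import urllib.error

def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout.

    `url` points at a hypothetical health endpoint; the real checker's
    targets and alerting logic are not part of this report.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout all count as unhealthy.
        return False
```

A scheduler (cron, or a loop in a monitoring agent) would call this periodically and raise an alert after a few consecutive failures.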
* According to the provider's notification, the maintenance job was not supposed to result in downtime. We are requesting a complete job description to determine whether anything else should have been taken into account during this planned job.
* The health checker alerted through email and internal messages, but the phone call to the DevOps lead was only triggered several minutes after the incident began; the incident escalation policy needs to be updated to avoid this delay.
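The escalation fix above amounts to making each alerting channel fire after an explicit, short delay rather than implicitly. A minimal sketch, with hypothetical tiers and delays (the real policy is not part of this report):

```python
from datetime import datetime, timedelta

# Hypothetical escalation ladder: email and internal messages fire
# immediately; the phone call to the DevOps lead fires after a short,
# explicit delay instead of several minutes later.
ESCALATION_TIERS = [
    ("email", timedelta(minutes=0)),
    ("internal_message", timedelta(minutes=0)),
    ("phone_call_devops_lead", timedelta(minutes=2)),
]

def channels_due(incident_start: datetime, now: datetime,
                 already_fired: set) -> list:
    """Return the channels whose delay has elapsed and that have not fired."""
    elapsed = now - incident_start
    return [channel for channel, delay in ESCALATION_TIERS
            if elapsed >= delay and channel not in already_fired]
```

The monitoring loop would call `channels_due` on each tick and fire (then record) whatever it returns, so an unacknowledged incident escalates on a fixed schedule.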