HTTP, Login Apps

Incident Report for Ubidots

Postmortem

HTTP & Login Apps Issues Postmortem

Date

2021-02-09

Authors

Jose Garcia

Status

Complete

Summary

We experienced issues with our HTTP ingestion engine during three windows on February 2nd: 02:28-02:31, 02:55-02:57, and 03:13-03:17 UTC. Access to both the owners' and end users' web apps was also affected, which prevented users from visualizing data stored in Ubidots.

During these windows, our HTTP servers responded with standard 50x errors, so users should have been able to catch the error and buffer the values read by their devices to send them later.
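The buffering approach mentioned above can be sketched on the client side as follows. This is a minimal illustration, not an official Ubidots client: the `BufferedSender` class, the injected `send` callable (assumed to return the HTTP status code of the request), and the buffer size are all hypothetical names chosen for the example. Network exceptions such as timeouts are not handled here.

```python
from collections import deque
from typing import Callable, Deque, Dict


class BufferedSender:
    """Queue readings and retry them when the server answers with a 5xx status."""

    def __init__(self, send: Callable[[Dict], int], max_buffered: int = 1000):
        # Hypothetical transport: performs the HTTP request, returns the status code.
        self._send = send
        # Bounded queue: once full, the oldest reading is dropped to make room.
        self._pending: Deque[Dict] = deque(maxlen=max_buffered)

    def submit(self, reading: Dict) -> None:
        """Store the reading, then try to deliver everything that is queued."""
        self._pending.append(reading)
        self.flush()

    def flush(self) -> int:
        """Send queued readings oldest first; stop at the first server error.

        Returns the number of readings removed from the queue.
        """
        removed = 0
        while self._pending:
            status = self._send(self._pending[0])
            if 500 <= status <= 599:
                break  # server-side failure: keep the reading for a later retry
            # Delivered, or rejected with a status that a retry would not fix.
            self._pending.popleft()
            removed += 1
        return removed

    @property
    def pending(self) -> int:
        """Number of readings still waiting to be delivered."""
        return len(self._pending)
```

A device loop would call `submit()` for each new reading and could call `flush()` periodically, so that values buffered during an outage are delivered in order once the service recovers.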

Impact

Critical. All requests from external devices and scripts that used our REST API to ingest data were rejected.

Root Causes

We experienced a major issue with our PostgreSQL database that caused index errors. After other attempts failed, these had to be handled by sending a SIGKILL to our instance.
Once the DevOps team was notified, five different B-tree indexes that support some of our daily log tasks were reindexed.
With the database unavailable, both web access and HTTP ingestion were affected; because incoming data had no path to be stored in the DB, our servers responded with 50x error codes.

Trigger

We couldn't find the root cause of the database issue, so we have decided to address it by creating additional replica instances.

Resolution

SIGKILL and reindex of our PostgreSQL DB instance.
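As an illustration of the reindex step only, and not the exact procedure our DevOps team ran, the sketch below shows one way to list indexes that PostgreSQL has flagged as invalid in its `pg_index` catalog and to build the matching `REINDEX INDEX` statements. The helper names `quote_ident` and `reindex_statements` are hypothetical, and executing the query and the statements against a database is left out. Index corruption that the catalog does not flag would not be found this way.

```python
from typing import Iterable, List, Tuple

# Catalog query that lists indexes PostgreSQL has flagged as invalid.
INVALID_INDEXES_SQL = """
SELECT n.nspname AS schema_name, c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid
"""


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def reindex_statements(rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Build one REINDEX INDEX statement per (schema, index) row."""
    return [
        f"REINDEX INDEX {quote_ident(schema)}.{quote_ident(index)};"
        for schema, index in rows
    ]
```

The rows returned by `INVALID_INDEXES_SQL` can be passed directly to `reindex_statements`, and the resulting statements reviewed before they are run.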

Detection

Detected by the automated internal service health checker.

Action Items

Action	Type	Owner	Status
DB reindex	Mitigate	gustavo woakas@ubidots.com	Done
Create backup DB instances	Prevent	gustavo woakas@ubidots.com	Done

Lessons Learned

What went well

The automated health checker alerted the DevOps team as soon as the issue appeared.

What went wrong

We have been working on a Kubernetes cluster that should prevent this sort of issue; unfortunately, we must take some actions to secure our deployments before that work is complete.
Our PostgreSQL instance takes several minutes to restart and reindex, so we have decided to deploy more read-only instances that act as backups.

Supporting Information

Support: support@ubidots.com

Posted 6 minutes ago. May 21, 2019 - 16:00 UTC

Resolved

This incident has been resolved.

Posted 1 day ago. May 20, 2019 - 09:53 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted 1 day ago. May 20, 2019 - 09:50 UTC

Investigating

Events engine, MQTT, HTTP, TCP/UDP and login apps services affected

Posted 1 day ago. May 20, 2019 - 06:58 UTC

Posted Feb 09, 2021 - 15:47 UTC

Resolved

We experienced issues with our HTTP ingestion engine during three windows on February 2nd: 02:28-02:31, 02:55-02:57, and 03:13-03:17 UTC. Access to both the owners' and end users' web apps was also affected, which prevented users from visualizing data stored in Ubidots.

During these windows, our HTTP servers responded with standard 50x errors, so users should have been able to catch the error and buffer the values read by their devices to send them later.

The problem arose from unexpected behavior in our internal DB, which resulted in the creation of invalid indexes.

Posted Feb 03, 2021 - 15:10 UTC