HTTP, Login Apps
Incident Report for Ubidots
Postmortem

HTTP & Login Apps Issues Postmortem

Date

2021-02-09

Authors

Jose Garcia

Status

Complete

Summary

We experienced issues with our HTTP ingestion engine during three windows on February 2nd: 02:28-02:31, 02:55-02:57, and 03:13-03:17 UTC. Access to both the owners' and end users' web apps was also affected, which prevented users from viewing data stored in Ubidots.

During these windows, our HTTP servers responded with standard 50x errors, so clients should have been able to catch the error and hold the values read by their devices to send them later.
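The client-side behavior described above (catch the 50x response, keep the reading, send it later) can be sketched as follows. This is a minimal illustration, not Ubidots client code: the class name `BufferedSender`, the `post` callable, and the buffer size are all assumptions made for the example. `post` stands in for whatever function a device uses to send one payload and return the HTTP status code.

```python
from collections import deque
from typing import Callable, Deque, Dict


class BufferedSender:
    """Queue readings locally while the server answers with 5xx errors."""

    def __init__(self, post: Callable[[Dict], int], max_buffer: int = 1000):
        # `post` is a hypothetical callable: it sends one payload and
        # returns the HTTP status code of the response.
        self._post = post
        # Once the buffer is full, the oldest readings are dropped.
        self._pending: Deque[Dict] = deque(maxlen=max_buffer)

    def send(self, payload: Dict) -> bool:
        """Queue the payload, then try to deliver everything pending."""
        self._pending.append(payload)
        return self.flush()

    def flush(self) -> bool:
        """Deliver pending payloads in order; stop at the first 5xx."""
        while self._pending:
            status = self._post(self._pending[0])
            if 500 <= status < 600:
                return False  # server-side failure: keep the data for later
            # Delivered, or refused with a non-5xx status that a retry
            # would not fix, so the payload leaves the queue.
            self._pending.popleft()
        return True

    @property
    def pending(self) -> int:
        """Number of readings still waiting to be delivered."""
        return len(self._pending)
```

With this shape, a device that calls `send` during an outage window gets `False` back and keeps the reading in memory. The next successful call delivers the backlog in its original order before the new value.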

Impact

Critical. All requests from external devices and scripts that used our REST API to ingest data were rejected.

Root Causes

  • We experienced a major issue with our PostgreSQL database that caused index errors. After other attempts failed, we had to handle it by sending a SIGKILL to our instance.

  • Once the DevOps team was notified, they reindexed 5 different B-tree indexes that support some of our daily log tasks.

  • Because the database was unavailable, both web access and HTTP ingestion experienced issues. As data had no path to be stored in the DB, our servers responded with a 50x error response code.
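For readers unfamiliar with the reindex step, the sketch below shows one way to list indexes that PostgreSQL has flagged as invalid and to build the matching `REINDEX INDEX` statements. It is an illustration under assumptions, not the procedure the DevOps team ran: the report does not say how the five indexes were identified, the helper names are hypothetical, and the code assumes unqualified index names that resolve through the session's `search_path`. The `CONCURRENTLY` option is only available in PostgreSQL 12 and later.

```python
from typing import Iterable, List

# Catalog query that lists indexes PostgreSQL has marked as invalid.
# pg_index.indisvalid is false for such indexes.
FIND_INVALID_INDEXES = (
    "SELECT c.relname "
    "FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid "
    "WHERE NOT i.indisvalid;"
)


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def reindex_statements(
    index_names: Iterable[str], concurrently: bool = False
) -> List[str]:
    """Build one REINDEX INDEX statement per index name.

    The index names are supplied by the caller (hypothetical here).
    """
    option = "CONCURRENTLY " if concurrently else ""
    return [
        f"REINDEX INDEX {option}{quote_ident(name)};" for name in index_names
    ]
```

A plain `REINDEX INDEX` blocks writes to the index's table while it runs. On versions that support it, the concurrent form avoids that at the cost of a longer rebuild.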

Trigger

We couldn’t find the root cause of the database issue, so we decided to address it by creating additional replication instances.

Resolution

SIGKILL and reindex of our PostgreSQL DB instance.

Detection

Detected by the automated internal service health checker.

Action Items

Action                      Type      Owner                       Status
DB reindex                  Mitigate  gustavo woakas@ubidots.com  DONE
Create backup DB instances  Prevent   gustavo woakas@ubidots.com  Scheduled

Lessons Learned

What went well

The automated health checker alerted the DevOps team as soon as the issue appeared.

What went wrong

  • We have been working on a Kubernetes cluster that should prevent this sort of issue. Unfortunately, we must take some actions to safeguard our deployments before that work is complete.
  • Our PostgreSQL instance takes several minutes to restart and reindex, so we have decided to deploy more read-only instances that act as backups.

Supporting Information

Support: support@ubidots.com

Posted Feb 09, 2021 - 15:47 UTC

Resolved
We experienced issues with our HTTP ingestion engine during three windows on February 2nd: 02:28-02:31, 02:55-02:57, and 03:13-03:17 UTC. Access to both the owners' and end users' web apps was also affected, which prevented users from viewing data stored in Ubidots.

During these windows, our HTTP servers responded with standard 50x errors, so clients should have been able to catch the error and hold the values read by their devices to send them later.

The problem arose from unexpected behavior in our internal DB, which resulted in the creation of invalid indexes.
Posted Feb 03, 2021 - 15:10 UTC