We experienced issues with our website that prevented both account owners and end users from accessing their Ubidots accounts. Public dashboards and widget links were also unavailable. Additionally, some HTTP GET requests failed with 500 server errors after timing out. The Data Ingestion service was not affected: all Dots were received correctly and saved to our DB during the timeframe of this incident.
Critical: GET requests to retrieve values were not processed, and web access was unavailable.
We have a large logs table related to our Events engine that is penalizing database performance. In addition to this, the backup did not work properly.
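For context, one quick way to confirm which tables are consuming the most space in a PostgreSQL database is to query its built-in size functions. The sketch below is illustrative rather than our actual tooling; the connection DSN is a placeholder.

```python
# Sketch: list the largest tables in a PostgreSQL database to spot
# an oversized logs table. The DSN is an assumed placeholder.
import psycopg2

DSN = "dbname=mydb user=postgres host=localhost"  # placeholder, not our real DSN

QUERY = """
SELECT c.relname AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r' AND n.nspname = 'public'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for table_name, total_size in cur.fetchall():
            print(f"{table_name}: {total_size}")
```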
PostgreSQL DB instance reindex
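As a minimal sketch of what such a maintenance step can look like (not the exact routine we ran; the table name and DSN are assumptions), a reindex can be issued programmatically:

```python
# Sketch: rebuild the indexes of a bloated table with psycopg2.
# REINDEX ... CONCURRENTLY (PostgreSQL 12+) avoids exclusive locks,
# but it cannot run inside a transaction block, hence autocommit.
import psycopg2

DSN = "dbname=mydb user=postgres host=localhost"  # placeholder, not our real DSN

conn = psycopg2.connect(DSN)
conn.autocommit = True  # required: REINDEX CONCURRENTLY refuses to run in a transaction
try:
    with conn.cursor() as cur:
        # Table name is hypothetical; substitute the real logs table.
        cur.execute("REINDEX TABLE CONCURRENTLY events_logs;")
finally:
    conn.close()
```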
Detected by the automated internal service health checker.
| Action item | Type | Owner | Status |
| --- | --- | --- | --- |
| Cluster reboot | Mitigate | gustavo email@example.com | DONE |
| Create backup DB instance switching | Prevent | gustavo firstname.lastname@example.org | Scheduled |
| Deploy a new backup/replication database instance | Prevent | gustavo email@example.com | Scheduled |
| Test the backup database switching | Prevent | gustavo firstname.lastname@example.org | Scheduled |
The automated health checker alerted the DevOps team once the issue started.
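As a rough illustration of the kind of check involved (the endpoint and alerting hook here are hypothetical, not our internal checker):

```python
# Sketch: a minimal HTTP health check that flags 5xx responses,
# similar in spirit to the checker that caught this incident.
# The URL and alert hook are assumptions, not Ubidots internals.
import requests

HEALTH_URL = "https://example.com/status"  # placeholder, not the real endpoint

def alert(message: str) -> None:
    # Placeholder: page the on-call DevOps team (email, Slack, PagerDuty, ...).
    print(f"[ALERT] {message}")

def check_once(url: str, timeout: float = 5.0) -> bool:
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException as exc:
        alert(f"health check failed: {exc}")
        return False
    if resp.status_code >= 500:
        alert(f"server error {resp.status_code} from {url}")
        return False
    return True
```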
* Back on February 3rd, 2021, our service faced a similar degradation of the PostgreSQL DB. To prevent it from happening again, we deployed a task that would automatically deploy a new backup DB instance while the main DB resumed operation. Because of a human error, the server routine developed to bring the backup DB into production wasn't executed this time, so this previous prevention action didn't serve its purpose as expected (see the sketch after this list).
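One way to take the human step out of the loop is to trigger the promotion of the standby programmatically. The following is a minimal sketch, assuming a PostgreSQL 12+ streaming-replication standby and the built-in pg_promote() function; it is illustrative, not the routine referenced above, and the DSN is an assumption.

```python
# Sketch: promote a streaming-replication standby to primary using
# PostgreSQL 12+'s pg_promote(), so failover doesn't depend on a
# manually executed server routine. DSN is a placeholder.
import psycopg2

STANDBY_DSN = "dbname=mydb user=postgres host=standby.internal"  # assumed

def promote_standby(dsn: str) -> bool:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            # pg_promote(wait, wait_seconds) returns true once promotion completes.
            cur.execute("SELECT pg_promote(true, 60);")
            return cur.fetchone()[0]
    finally:
        conn.close()

if __name__ == "__main__":
    ok = promote_standby(STANDBY_DSN)
    print("promotion succeeded" if ok else "promotion timed out")
```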