Login App access
Incident Report for Ubidots
Postmortem

Login Apps Issues Postmortem

Date

2021-05-04

Authors

jose

Status

Complete

Summary

We experienced issues with our website that prevented both owners and end users from accessing their Ubidots accounts. Public dashboards and widget links were also unavailable. Additionally, some HTTP GET requests returned a 500 timeout server error. The Data Ingestion service was not impacted: all Dots were received correctly and saved to our DB during the timeframe of this incident.

Impact

Critical. GET requests to retrieve values were not processed, and web access was not available.

Root Causes

  • A major issue with our PostgreSQL database caused index errors. After other attempts failed, these had to be handled by sending a SIGKILL to our instance.
  • Our backup DB did not switch over, so the cluster that retrieves data and serves the frontend UI did not respond.
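
The switching behavior named in the second root cause can be illustrated with a minimal Python sketch. It is not Ubidots' actual implementation: the `Endpoint` type, the endpoint names, and the `is_healthy` probe are all hypothetical, and the sketch only shows the general idea of falling back to a backup instance when the primary stops responding.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    """A database endpoint; names and roles here are illustrative only."""
    name: str
    role: str  # "primary" or "backup"


def choose_endpoint(
    endpoints: Sequence[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the first healthy endpoint, preferring the primary.

    Returns None when nothing is healthy, so the caller can alert
    instead of silently sending traffic to a dead instance.
    """
    # sorted() is stable: primaries come first, original order is kept otherwise.
    ordered = sorted(endpoints, key=lambda e: e.role != "primary")
    for endpoint in ordered:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical scenario: the primary is down, the backup is reachable.
ENDPOINTS = [Endpoint("db-backup", "backup"), Endpoint("db-main", "primary")]
DOWN = {"db-main"}
selected = choose_endpoint(ENDPOINTS, lambda e: e.name not in DOWN)
```

Returning `None` when no endpoint is healthy is a deliberate choice in this sketch: a switch that fails silently is harder to notice than one that raises an alert.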

Trigger

We have a large logs table related to our Events engine that is penalizing database performance. In addition, the backup did not work properly.

Resolution

PostgreSQL DB instance reindex

Detection

Detected by the automated internal service health checker.
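
This report does not describe how the health checker works. As a rough illustration only, a checker of this kind often alerts after several consecutive failed probes rather than on a single failure. The function name `should_alert` and the default threshold of 3 are assumptions made for this sketch.

```python
from typing import Iterable


def should_alert(probe_results: Iterable[bool], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive probes have failed.

    Each item in `probe_results` is True for a successful probe and
    False for a failed one, in chronological order.
    """
    consecutive_failures = 0
    for ok in probe_results:
        # A successful probe resets the streak; a failure extends it.
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return True
    return False
```

Requiring a streak of failures trades a slightly slower alert for fewer false alarms caused by a single slow response.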

Action Items

Action                                              | Type     | Owner                      | Status
Cluster reboot                                      | Mitigate | gustavo woakas@ubidots.com | DONE
Create backup DB instance switching                 | Prevent  | gustavo woakas@ubidots.com | Scheduled
Deploy a new backup / replication database instance | Prevent  | gustavo woakas@ubidots.com | Scheduled
Test the backup database switching                  | Prevent  | gustavo woakas@ubidots.com | Scheduled

Lessons Learned

What went well

The automated health checker alerted the DevOps team once the issue started.

What went wrong

* Back on February 3rd, 2021, our service faced a similar degradation with the PostgreSQL DB. To prevent it from happening again, we built a task that would automatically deploy a new backup DB instance while the main DB resumed operation. Because of human error, the server routine developed to make the backup DB available in production wasn't executed this time, so this earlier preventive action didn't serve its purpose as expected.

Supporting Information

Support: support@ubidots.com

Posted May 04, 2021 - 15:34 UTC

Resolved
The issue is now resolved.
Posted Apr 30, 2021 - 21:33 UTC
Monitoring
We have deployed a new instance and have truncated our database to manage the issue. As a result, all users should have been logged out of their owner and end-user accounts. We will continue monitoring the patch.
Posted Apr 30, 2021 - 20:52 UTC
Update
We have detected an issue in one of our PSQL databases and are working right now to deploy a new instance that allows our system to process the incoming login app requests.
Posted Apr 30, 2021 - 20:48 UTC
Identified
We are experiencing issues with our website that prevent our users from accessing their owner accounts. We are working to solve this as soon as possible.
Posted Apr 30, 2021 - 19:55 UTC
This incident affected: America (Login Apps Toronto).