We experienced issues with our website that prevented both account owners and end users from accessing their Ubidots accounts. Public dashboards and widget links were also unavailable. Additionally, some HTTP GET requests failed with 500 server errors after timing out. The Data Ingestion service was not affected: all Dots were received correctly and saved to our DB during the timeframe of this incident.
Critical: GET requests to retrieve values were not processed, and web access was unavailable.
We have a large logs table related to our Events engine that is penalizing database performance. In addition to this, the backup did not work properly.
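For context, one quick way to confirm which tables are consuming the most space in a PostgreSQL database is to query its built-in size functions. The sketch below is illustrative rather than our actual tooling; the connection DSN is a placeholder.

```python
# Sketch: list the largest tables in a PostgreSQL database to spot
# an oversized logs table. The DSN is an assumed placeholder.
import psycopg2

DSN = "dbname=mydb user=postgres host=localhost"  # placeholder, not our real DSN

QUERY = """
SELECT c.relname AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r' AND n.nspname = 'public'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
"""

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for table_name, total_size in cur.fetchall():
            print(f"{table_name}: {total_size}")
```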
PostgreSQL DB instance reindex
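As a minimal sketch of what such a maintenance step can look like (not the exact routine we ran; the table name and DSN are assumptions), a reindex can be issued programmatically:

```python
# Sketch: rebuild the indexes of a bloated table with psycopg2.
# REINDEX ... CONCURRENTLY (PostgreSQL 12+) avoids exclusive locks,
# but it cannot run inside a transaction block, hence autocommit.
import psycopg2

DSN = "dbname=mydb user=postgres host=localhost"  # placeholder, not our real DSN

conn = psycopg2.connect(DSN)
conn.autocommit = True  # required: REINDEX CONCURRENTLY refuses to run in a transaction
try:
    with conn.cursor() as cur:
        # Table name is hypothetical; substitute the real logs table.
        cur.execute("REINDEX TABLE CONCURRENTLY events_logs;")
finally:
    conn.close()
```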
Detected by the automated internal service health checker.
| Action item | Type | Owner | Status |
| --- | --- | --- | --- |
| Cluster reboot | Mitigate | gustavo email@example.com | DONE |
| Create backup DB instance switching | Prevent | gustavo firstname.lastname@example.org | Scheduled |
| Deploy a new backup/replication database instance | Prevent | gustavo email@example.com | Scheduled |
| Test the backup database switching | Prevent | gustavo firstname.lastname@example.org | Scheduled |
The automated health checker alerted the DevOps team once the issue started.
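As a rough illustration of the kind of check involved (the endpoint and alerting hook here are hypothetical, not our internal checker):

```python
# Sketch: a minimal HTTP health check that flags 5xx responses,
# similar in spirit to the checker that caught this incident.
# The URL and alert hook are assumptions, not Ubidots internals.
import requests

HEALTH_URL = "https://example.com/status"  # placeholder, not the real endpoint

def alert(message: str) -> None:
    # Placeholder: page the on-call DevOps team (email, Slack, PagerDuty, ...).
    print(f"[ALERT] {message}")

def check_once(url: str, timeout: float = 5.0) -> bool:
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException as exc:
        alert(f"health check failed: {exc}")
        return False
    if resp.status_code >= 500:
        alert(f"server error {resp.status_code} from {url}")
        return False
    return True
```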
* Back on February 3rd, 2021, our service faced a similar degradation of the PostgreSQL DB. To prevent it from happening again, we deployed a task that would automatically deploy a new backup DB instance while the main DB resumed operation. Because of a human error, the server routine developed to bring the backup DB into production wasn't executed this time, so this previous prevention action didn't serve its purpose as expected (see the sketch after this list).
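One way to take the human step out of the loop is to trigger the promotion of the standby programmatically. The following is a minimal sketch, assuming a PostgreSQL 12+ streaming-replication standby and the built-in pg_promote() function; it is illustrative, not the routine referenced above, and the DSN is an assumption.

```python
# Sketch: promote a streaming-replication standby to primary using
# PostgreSQL 12+'s pg_promote(), so failover doesn't depend on a
# manually executed server routine. DSN is a placeholder.
import psycopg2

STANDBY_DSN = "dbname=mydb user=postgres host=standby.internal"  # assumed

def promote_standby(dsn: str) -> bool:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            # pg_promote(wait, wait_seconds) returns true once promotion completes.
            cur.execute("SELECT pg_promote(true, 60);")
            return cur.fetchone()[0]
    finally:
        conn.close()

if __name__ == "__main__":
    ok = promote_standby(STANDBY_DSN)
    print("promotion succeeded" if ok else "promotion timed out")
```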