We experienced issues with our ingestion engine exposed through HTTP during 02:28-02:31, 02:55-02:57, and 03:13-03:17 UTC on February 2nd. Access to both owners' and end users' web apps was also affected, preventing users from visualizing the data stored at Ubidots.
During these time windows, our HTTP servers responded with standard 50x errors, so client devices could capture the error and queue the values they read to be sent later.
Critically, all requests from external devices and scripts that used our REST API to ingest data were rejected.
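For illustration, the sketch below shows the kind of client-side buffering the 50x responses allow. It assumes the public v1.6 HTTP ingestion endpoint; the device label and token are hypothetical placeholders.

```python
import requests

# Hypothetical device label and token; replace with your own.
URL = "https://industrial.api.ubidots.com/api/v1.6/devices/my-device/"
HEADERS = {"X-Auth-Token": "YOUR-UBIDOTS-TOKEN"}

buffer = []  # readings that could not be delivered yet

def send(payload):
    """POST one reading; keep it in the local buffer on a 50x response."""
    try:
        resp = requests.post(URL, json=payload, headers=HEADERS, timeout=10)
    except requests.RequestException:
        buffer.append(payload)   # network failure: retry later
        return
    if resp.status_code >= 500:  # ingestion temporarily unavailable
        buffer.append(payload)
    # 2xx means delivered; a 4xx is a client error, so retrying won't help

def flush():
    """Retry buffered readings once the API has recovered."""
    pending, buffer[:] = buffer[:], []
    for payload in pending:
        send(payload)
```

A device loop would call `send({"temperature": 21.5})` for each reading and `flush()` periodically.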
We experienced a major issue with our PostgreSQL database that caused index errors; after several other remediation attempts, it had to be handled with a SIGKILL to our instance.
Once the DevOps team was notified, a reindex was performed on 5 different B-tree indexes that support some of our daily log tasks.
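A minimal sketch of that reindex step, assuming psycopg2 and hypothetical index names (the actual five indexes are internal):

```python
import psycopg2
from psycopg2 import sql

# Hypothetical names standing in for the five affected B-tree indexes.
INDEXES = [
    "daily_logs_created_at_idx",
    "daily_logs_device_id_idx",
]

conn = psycopg2.connect("dbname=ubidots")  # assumed connection string
conn.autocommit = True
with conn.cursor() as cur:
    for name in INDEXES:
        # REINDEX rebuilds the index from scratch, discarding any corrupted
        # pages; it blocks writes to the underlying table while it runs.
        cur.execute(sql.SQL("REINDEX INDEX {}").format(sql.Identifier(name)))
conn.close()
```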
The unavailability of the database affected both web access and HTTP ingestion, and since incoming data had no path to be stored in the DB, our servers responded with a 50x error code.
We could not find the root cause of the database issue, so we decided to mitigate it by creating additional replica instances.
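One way such replicas pay off, sketched here under assumed hostnames, is a connection helper that falls back to a standby when the primary is unreachable; this is an illustration, not our actual failover mechanism.

```python
import psycopg2

# Hypothetical hostnames for the primary and the new replicas.
HOSTS = ["db-primary.internal", "db-replica-1.internal", "db-replica-2.internal"]

def connect():
    """Return a connection to the first reachable database host."""
    last_error = None
    for host in HOSTS:
        try:
            return psycopg2.connect(host=host, dbname="ubidots",
                                    connect_timeout=3)
        except psycopg2.OperationalError as err:
            last_error = err  # host unreachable: try the next one
    raise last_error
```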
SIGKILL and reindex of our PostgreSQL DB instance
Detected by the automated internal service health checker.
| Action item | Type | Owner | Status |
| --- | --- | --- | --- |
| DB reindex | Mitigate | gustavo email@example.com | DONE |
| Create backup DB instances | Prevent | gustavo firstname.lastname@example.org | Scheduled |
The automated health checker alerted the DevOps team as soon as the issue appeared.
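As an illustration of this kind of probe (the probed URL and the alerting hook are assumptions, not our actual tooling):

```python
import time
import requests

ENDPOINT = "https://industrial.api.ubidots.com/api/v1.6/"  # assumed probe target

def notify(message):
    """Placeholder for the real paging/alerting integration."""
    print(f"ALERT: {message}")

while True:
    try:
        resp = requests.get(ENDPOINT, timeout=5)
        if resp.status_code >= 500:
            notify(f"HTTP ingestion returned {resp.status_code}")
    except requests.RequestException as err:
        notify(f"HTTP ingestion unreachable: {err}")
    time.sleep(30)  # probe interval
```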
May 21, 2019 - 16:00 UTC
This incident has been resolved.

May 20, 2019 - 09:53 UTC
A fix has been implemented and we are monitoring the results.

May 20, 2019 - 09:50 UTC
Events engine, MQTT, HTTP, TCP/UDP, and login apps services affected.

May 20, 2019 - 06:58 UTC