Incident Report for Ubidots
Postmortem

2021-09-21 - Postmortem - Data Ingestion Downtime

Date

2021-09-21

Authors

Benjamin

Status

Reviewed

Issue Summary

Occasional downtime in (HTTP) data ingestion, in windows of roughly 30 seconds, was caused by an increased influx of ingestion requests with payloads about 15x larger than average, which triggered an internal Kubernetes limit and caused the pods to be restarted. The issue was resolved by increasing the limit and provisioning more CPU and RAM capacity for this service (as well as for the other most critical services).

Impact

An outage in (exclusively HTTPS) ingestion totaling 8 minutes and 15 seconds, during which new dots were rejected completely. The 8 minutes and 15 seconds were not a consecutive time frame but occurred in intervals of roughly 30 seconds. The overall incident started on Monday, 21st of September 2021 at 09:54 am (Bogotá/Colombia) and lasted until 11:38 am (Bogotá/Colombia) the same day (1:44 h). Leading up to the 30-second intervals, the HTTPS ingestion API also showed a temporary increase in latency from 400 ms to 7000 ms.

Detection

The internal Ubidots checks (running every minute) detected the outage immediately, at 09:53 am.

Root Causes

In Reactor, our core service for data ingestion, CPU utilization increased substantially (hitting a limit in Kubernetes) due to two factors:

  1. A substantial increase in the number of requests (+33%) within a very short time frame.
  2. Requests with large payloads (3 KB), while payloads average 200 bytes; a 15-fold increase in payload size.

Larger payloads require more CPU because each payload has to be converted from JSON to a bytestring. Together with the larger number of requests, this drove CPU utilization above the limit. CPU utilization has averaged approximately 6.25%, reaching approximately 10% during past peaks in data ingestion. We had set our internal limit to 4x the average CPU utilization (4 × 6.25% = 25%). The combined effect pushed CPU utilization above 25%.
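
As a rough back-of-the-envelope illustration (assuming, purely for simplicity, that CPU cost scales with requests multiplied by payload size, which the real service only approximates), the combined effect of the two factors looks like this:

```python
# Back-of-the-envelope estimate of why the 4x limit was exceeded.
# The "CPU cost scales with requests * payload size" model is an assumption
# for illustration only; the utilization and limit figures are quoted above.

avg_utilization = 6.25                  # % CPU, normal average
limit_factor = 4                        # internal limit: 4x the average
limit = limit_factor * avg_utilization  # = 25%

request_factor = 1 + 0.33               # +33% more requests
payload_factor = 3000 / 200             # 3 KB payloads vs. 200-byte average = 15x
work_factor = request_factor * payload_factor  # ~20x more bytes to process

print(f"limit: {limit:.2f}% ({limit_factor}x average)")
print(f"estimated extra work: ~{work_factor:.0f}x average")
# Even if CPU cost grows far less than linearly with the extra work,
# a ~20x increase in work comfortably overwhelms a 4x headroom.
```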

Kubernetes limits are imposed for the global stability of the cluster, ensuring that no single service consumes excessive capacity and affects other services. When the limit was hit, Kubernetes noticed that the pod was responding with increased latency and restarted it as soon as it took too long to process the heavy requests. This resulted in downtime in HTTPS ingestion until the pod had restarted, which takes approximately 30-40 seconds.
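
For reference, below is a minimal sketch of how such a CPU limit and restart behavior are typically declared, using the official Kubernetes Python client. The container name, image, probe path, and concrete resource values are illustrative assumptions, not the actual Reactor manifest:

```python
# Illustrative only: not the actual Reactor deployment manifest.
from kubernetes import client

reactor_container = client.V1Container(
    name="reactor",                                   # hypothetical container name
    image="registry.example.com/reactor:latest",      # placeholder image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "250m", "memory": "256Mi"},  # capacity reserved for the pod
        limits={"cpu": "1", "memory": "512Mi"},       # hard cap enforced by Kubernetes
    ),
    # If CPU throttling makes the service too slow to answer its health checks,
    # a liveness probe like this is what triggers the pod restart described above.
    liveness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/health", port=8080),  # assumed endpoint
        period_seconds=10,
        failure_threshold=3,
    ),
)
```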

Resolution

The solution was to increase the limit in Kubernetes from 4x the average CPU utilization (25%) to 12x (75%). This provides much more buffer capacity to process temporary peaks.

What went well

The Ubidots checks identified the issue immediately.

What went wrong

Because the limits are not updated frequently, the limit in place no longer reflected the current influx of data.

Lessons Learned

  • Metrics and statistics for our Kubernetes cluster are key for each individual microservice.

  • Any request with a payload of 10 KB is rejected by the Ubidots backend (see the sketch after this list).

  • Kubernetes tolerates short peaks; a pod is only restarted if CPU utilization remains above the limit for a period of time.
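
To make the 10 KB lesson concrete, here is a minimal sketch of an HTTPS ingestion request that checks its own payload size before sending. The endpoint URL and auth header are assumptions based on Ubidots' public v1.6 HTTP API, the device label and token are hypothetical, and the exact threshold semantics (at vs. above 10 KB) are assumed:

```python
# Minimal sketch: send dots over HTTPS and refuse to send payloads that would
# be rejected by the backend anyway. Endpoint and auth header are assumptions
# based on the public Ubidots v1.6 HTTP API.
import json
import requests

UBIDOTS_URL = "https://industrial.api.ubidots.com/api/v1.6/devices/{device}"  # assumed endpoint
MAX_PAYLOAD_BYTES = 10 * 1024  # 10 KB figure quoted above; exact threshold semantics assumed

def send_dots(device_label: str, token: str, dots: dict) -> requests.Response:
    body = json.dumps(dots).encode("utf-8")
    if len(body) >= MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {len(body)} bytes; the backend rejects payloads of 10 KB")
    return requests.post(
        UBIDOTS_URL.format(device=device_label),
        data=body,
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
        timeout=10,
    )

# Example usage (hypothetical device label and token):
# send_dots("weather-station", "BBFF-xxxxxxxx", {"temperature": 21.4, "humidity": 0.55})
```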

Action Items

Action Item | Type | Owner | Status
Revise all limits in Kubernetes (minimum and maximum allocated CPU and RAM capacity) | Mitigate | Gustavo Angulo (woakas@ubidots.com) | To Do
Set up monitoring in Grafana for Kubernetes services | Check | Gustavo Angulo (woakas@ubidots.com) | To Do
Recurring item to check/adjust the Kubernetes limits | Mitigate | Benjamin Heinke (benjamin@ubidots.com) | Done
Add a new instance to the Kubernetes cluster, adding additional CPU and RAM capacity (starting in October) | Mitigate | DevOps Team | To Do

Supporting Information

Resolved: Monday, 20th of September 2021, 04:38 pm UTC
Monitoring: Monday, 20th of September 2021, 04:38 pm UTC
Investigating: Monday, 20th of September 2021, 02:54 pm UTC

Posted Sep 22, 2021 - 17:25 UTC
