Incident Report for Ubidots
Postmortem

2021-09-21 - Postmortem - Data Ingestion Downtime

Date

2021-09-21

Authors

Benjamin

Status

Reviewed

Issue Summary

Occasional downtime in (HTTP) data ingestion, in windows of roughly 30 seconds, was caused by an increased influx of ingestion requests with payloads about 15x larger than average, which triggered an internal Kubernetes limit and caused the pods to be restarted. The issue was resolved by increasing the limit and provisioning more CPU and RAM capacity for this service (as well as for the other most critical services).

Impact

An outage in (exclusively HTTPS) ingestion totaling 8 minutes and 15 seconds, during which new dots were rejected completely. The 8 minutes and 15 seconds were not a consecutive time frame but occurred in intervals of roughly 30 seconds. The overall incident started on Monday, 21st of September 2021 at 09:54 am (Bogotá/Colombia) and lasted until 11:38 am (Bogotá/Colombia) the same day (1:44 h). Leading up to the 30-second intervals, the HTTPS ingestion API also showed a temporary increase in latency from 400 ms to 7000 ms.

Detection

The internal Ubidots checks (running every minute) detected the outage immediately, at 09:53 am.

Root Causes

In Reactor, our core service for data ingestion, CPU utilization increased substantially (hitting a limit in Kubernetes) due to two factors:

  1. A substantial increase in the number of requests (+33%) within a very short time frame.
  2. Requests with large payloads (3 KB), while payloads average 200 bytes; a 15-fold increase in payload size.

Larger payloads require more CPU because each payload has to be converted from JSON to a bytestring. Together with the larger number of requests, this drove CPU utilization above the limit. CPU utilization has averaged approximately 6.25%, reaching approximately 10% during past peaks in data ingestion. We had set our internal limit to 4x the average CPU utilization (4 × 6.25% = 25%). The combined effect pushed CPU utilization above 25%.
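
As a rough back-of-the-envelope illustration (assuming, purely for simplicity, that CPU cost scales with requests multiplied by payload size, which the real service only approximates), the combined effect of the two factors looks like this:

```python
# Back-of-the-envelope estimate of why the 4x limit was exceeded.
# The "CPU cost scales with requests * payload size" model is an assumption
# for illustration only; the utilization and limit figures are quoted above.

avg_utilization = 6.25                  # % CPU, normal average
limit_factor = 4                        # internal limit: 4x the average
limit = limit_factor * avg_utilization  # = 25%

request_factor = 1 + 0.33               # +33% more requests
payload_factor = 3000 / 200             # 3 KB payloads vs. 200-byte average = 15x
work_factor = request_factor * payload_factor  # ~20x more bytes to process

print(f"limit: {limit:.2f}% ({limit_factor}x average)")
print(f"estimated extra work: ~{work_factor:.0f}x average")
# Even if CPU cost grows far less than linearly with the extra work,
# a ~20x increase in work comfortably overwhelms a 4x headroom.
```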

Kubernetes limits are imposed for the global stability of the cluster, ensuring that no single service consumes excessive capacity and affects other services. When the limit was hit, Kubernetes noticed that the pod was responding with increased latency and restarted it as soon as it took too long to process the heavy requests. This resulted in downtime in HTTPS ingestion until the pod had restarted, which takes approximately 30-40 seconds.
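
For reference, below is a minimal sketch of how such a CPU limit and restart behavior are typically declared, using the official Kubernetes Python client. The container name, image, probe path, and concrete resource values are illustrative assumptions, not the actual Reactor manifest:

```python
# Illustrative only: not the actual Reactor deployment manifest.
from kubernetes import client

reactor_container = client.V1Container(
    name="reactor",                                   # hypothetical container name
    image="registry.example.com/reactor:latest",      # placeholder image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "250m", "memory": "256Mi"},  # capacity reserved for the pod
        limits={"cpu": "1", "memory": "512Mi"},       # hard cap enforced by Kubernetes
    ),
    # If CPU throttling makes the service too slow to answer its health checks,
    # a liveness probe like this is what triggers the pod restart described above.
    liveness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/health", port=8080),  # assumed endpoint
        period_seconds=10,
        failure_threshold=3,
    ),
)
```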

Resolution

The solution was to increase the limit in Kubernetes from 4x the average CPU utilization (25%) to 12x (75%). This provides much more buffer capacity to process temporary peaks.

What went well

The Ubidots checks identified the issue immediately.

What went wrong

Because the limits are not updated frequently, the limit in place no longer reflected the current influx of data.

Lessons Learned

  • Metrics and statistics for our Kubernetes cluster are key for each individual microservice.

  • Any request with a payload of 10 KB is rejected by the Ubidots backend (see the sketch after this list).

  • Kubernetes tolerates short peaks; a pod is only restarted if CPU utilization remains above the limit for a period of time.
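
To make the 10 KB lesson concrete, here is a minimal sketch of an HTTPS ingestion request that checks its own payload size before sending. The endpoint URL and auth header are assumptions based on Ubidots' public v1.6 HTTP API, the device label and token are hypothetical, and the exact threshold semantics (at vs. above 10 KB) are assumed:

```python
# Minimal sketch: send dots over HTTPS and refuse to send payloads that would
# be rejected by the backend anyway. Endpoint and auth header are assumptions
# based on the public Ubidots v1.6 HTTP API.
import json
import requests

UBIDOTS_URL = "https://industrial.api.ubidots.com/api/v1.6/devices/{device}"  # assumed endpoint
MAX_PAYLOAD_BYTES = 10 * 1024  # 10 KB figure quoted above; exact threshold semantics assumed

def send_dots(device_label: str, token: str, dots: dict) -> requests.Response:
    body = json.dumps(dots).encode("utf-8")
    if len(body) >= MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {len(body)} bytes; the backend rejects payloads of 10 KB")
    return requests.post(
        UBIDOTS_URL.format(device=device_label),
        data=body,
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
        timeout=10,
    )

# Example usage (hypothetical device label and token):
# send_dots("weather-station", "BBFF-xxxxxxxx", {"temperature": 21.4, "humidity": 0.55})
```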

Action Items

Action Item | Type | Owner | Status
Revise all limits in Kubernetes (minimum and maximum allocated CPU and RAM capacity) | Mitigate | Gustavo Angulo (woakas@ubidots.com) | To Do
Set up monitoring in Grafana for Kubernetes services | Check | Gustavo Angulo (woakas@ubidots.com) | To Do
Recurring item to check/adjust the Kubernetes limits | Mitigate | Benjamin Heinke (benjamin@ubidots.com) | Done
Add a new instance to the Kubernetes cluster, adding additional CPU and RAM capacity (starting in October) | Mitigate | DevOps Team | To Do

Supporting Information

Resolved: Monday, 20th of September 2021, 04:38 pm UTC
Monitoring: Monday, 20th of September 2021, 04:38 pm UTC
Investigating: Monday, 20th of September 2021, 02:54 pm UTC

Posted Sep 22, 2021 - 17:25 UTC
