2021-09-21
Benjamin
Reviewed
Occasional downtimes in (HTTP) data ingestion, in time frames of roughly 30 seconds, were caused by an increased influx of ingestion requests with 15x larger payloads, which triggered an internal Kubernetes limit and caused the pods to be restarted. The issue was resolved by raising that limit, provisioning more CPU and RAM capacity for this service (as well as for the other most critical services).
A downtime in (exclusively HTTPS) ingestion of 8 minutes and 15 seconds in total, during which new dots were rejected completely. The 8 minutes and 15 seconds were not one consecutive time frame but occurred in intervals of roughly 30 seconds. The incident lasted from Monday, 20th of September 2021, 09:54am (Bogotá/Colombia) until 11:38am (Bogotá/Colombia) the same day (1:44h). Leading up to the 30-second intervals, the HTTPS ingestion API also showed a temporary increase in latency from 400ms to 7000ms.
The internal Ubidots checks (running every minute) detected the downtime immediately, at 09:53am.
In Reactor, the core service we use for data ingestion, CPU utilization increased substantially (hitting a Kubernetes limit) due to two factors:
Bigger payloads require more CPU because each payload has to be converted from JSON to a bytestring. Combined with the larger footprint of the requests (higher quantity), CPU utilization spiked above the limit. CPU utilization had averaged approx. 6.25%, reaching approx. 10% during past ingestion peaks. We set our internal limit to 4x the average CPU utilization (4 × 6.25% = 25%). The combined effect caused CPU utilization to skyrocket above 25%.
Kubernetes limits are imposed for the global stability of the cluster, ensuring that no service consumes excessive capacity and affects other services. When the limit was hit, Kubernetes noticed the pod responding with increased latency and restarted it as soon as it took too long to process the heavy requests. This resulted in downtime in HTTPS ingestion until the pod had restarted, which takes approx. 30-40 seconds.
The solution was to increase the 4x (25%) limit to 12x (75%) in Kubernetes. This provides a lot more buffer capacity to absorb temporary peaks.
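The limit arithmetic described above can be sketched as follows. The percentages come from this report; the helper function and variable names are illustrative, not part of the actual configuration:

```python
# Illustrative sketch of the CPU-limit headroom math from this report.
# Percentages are relative to the service's allocatable CPU.

AVERAGE_CPU = 6.25  # long-term average utilization (%)
PEAK_CPU = 10.0     # utilization seen during past ingestion peaks (%)

def cpu_limit(multiplier: float, average: float = AVERAGE_CPU) -> float:
    """Internal limit expressed as a multiple of the average utilization."""
    return multiplier * average

old_limit = cpu_limit(4)   # 25.0 -> exceeded during the incident
new_limit = cpu_limit(12)  # 75.0 -> leaves buffer for temporary peaks
```

Under the old 4x limit, even the historical 10% peaks left only 15 percentage points of headroom; the combined effect of more requests with 15x larger payloads consumed it.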
The Ubidots checks identified the issue immediately.
As the limits are not frequently updated, the configured value no longer reflected the current influx of data.
Metrics and statistics for our Kubernetes cluster are key for each individual microservice.
Any request with a payload larger than 10KB is rejected by the Ubidots backend.
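A payload guard of this kind could look like the following minimal sketch. The 10 KB threshold is from this report; the function name and the assumption that 1 KB = 1024 bytes are illustrative:

```python
# Hypothetical sketch of the backend's payload-size check,
# assuming 1 KB = 1024 bytes.
MAX_PAYLOAD_BYTES = 10 * 1024

def accept_payload(raw: bytes) -> bool:
    """Accept an ingestion request only if its payload fits under the cap."""
    return len(raw) <= MAX_PAYLOAD_BYTES
```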
Kubernetes tolerates peaks; a pod is only restarted if its CPU utilization stays above the limit for a period of time.
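The "sustained above the limit" condition can be sketched as below. The window length and sampling are illustrative and not Kubernetes' actual algorithm; the point is that a single spike does not trigger a restart:

```python
def sustained_above_limit(samples: list[float], limit: float, window: int = 3) -> bool:
    """True only if the last `window` consecutive samples all exceed `limit`.

    A brief spike is tolerated; the utilization must stay above the limit
    for the whole window before a restart is warranted.
    """
    if len(samples) < window:
        return False
    return all(s > limit for s in samples[-window:])

# A single spike above the 25% limit is tolerated...
assert not sustained_above_limit([10, 30, 12], limit=25)
# ...but sustained overload, as during the incident, is not.
assert sustained_above_limit([10, 30, 31, 32], limit=25)
```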
Action Item | Type | Owner | Status |
---|---|---|---|
Revise all limits in Kubernetes (minimum and maximum allocated CPU and RAM capacity) | Mitigate | Gustavo Angulo woakas@ubidots.com | To Do |
Set up monitoring in Grafana for Kubernetes services | Check | Gustavo Angulo woakas@ubidots.com | To Do |
Recurring item that asks to check/adjust the Kubernetes limits | Mitigate | Benjamin Heinke benjamin@ubidots.com | Done |
Add a new instance to the Kubernetes cluster to provide additional CPU and RAM capacity (starting in October) | Mitigate | DevOps Team | To Do |
Resolved
Monday, 20th of September 2021: 04:38pm UTC
Monitoring
Monday, 20th of September 2021: 04:38pm UTC
Investigating
Monday, 20th of September 2021: 02:54pm UTC