We finished migrating the majority of our production components from our hand-rolled container infrastructure to Kubernetes earlier this month. Our previous post discussed the problems we hit getting this far. This post covers recent production issues: how we got into trouble, how we got out of it, and how we plan to stay out of it.

Our first migration strategy exposed Kubernetes components through a single LoadBalancer service with ~14 ports. We could not figure out why this caused instant bad connection errors on those ELBs. We switched to multiple LoadBalancer services with a single port each. This “solved” the problem. Solved is in quotes because we never determined the root cause; we let it go since the setup was only relevant during the migration phase. We’ve switched to ClusterIP services now that all components are running in Kubernetes.
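For illustration, here is roughly what one of those per-component services looks like today. This is a minimal sketch; the names, labels, and ports are hypothetical, not our actual manifests.

```yaml
# Hypothetical ClusterIP service for the Search Service.
apiVersion: v1
kind: Service
metadata:
  name: search-service
spec:
  type: ClusterIP        # only reachable from inside the cluster
  selector:
    app: search-service  # must match the Deployment's pod labels
  ports:
    - name: thrift
      port: 9090         # port exposed on the cluster IP
      targetPort: 9090   # container port the pods listen on
```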

We also changed our migration approach to avoid a big-bang rollout. We used the existing HAProxy in our old infrastructure to do percentage-based load balancing between the container running in the old infrastructure and the matching Kubernetes LoadBalancer. That worked like a charm! Moral of the story: I’m now suspicious of services with a large number of ports (say, more than two?).
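For reference, that kind of percentage split can be expressed with HAProxy server weights. This is a minimal sketch with made-up hostnames, ports, and weights, not our actual configuration:

```
# Hypothetical HAProxy backend splitting traffic 80/20 between the
# old infrastructure and the matching Kubernetes LoadBalancer (ELB).
backend core_api
  balance roundrobin
  server legacy   old-infra.internal:8080    weight 80
  server kube-elb core-api-elb.internal:8080 weight 20
```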

Where are we now after about a month in production?

Situation Report

We have ~20 Deployments in our application. Two of them, the Core API and the Search Service, consume ~80% of the cluster’s CPU capacity.

The Core API powers our web site, Android, and iOS applications. We make classifieds sites, so the most common interaction in the product is either searching for ads or viewing an ad. Thus, the majority of API calls include calls to the Search Service, which translates application-level searches into Elasticsearch queries. The Core API and Search Service horizontal scales (replica counts) have increased by 2x (or more) since migrating to Kubernetes. We’re seeing severe latency in these two components, which directly impacts customer-facing flows. Our SERP (Search Engine Results Page) availability is ~75% during peak hours. Here’s the flow:

  1. An application (web site, Android, or iOS) requests GET /v1/serp. (Core API)
  2. The Core API validates parameters, does some transformation, and makes a Thrift RPC call to the Search Service
  3. The Search Service makes a query to Elasticsearch
  4. The Search Service generates an appropriate Thrift RPC response
  5. The Core API generates an appropriate JSON response

This particular flow (or similar) accounts for ~85% of API requests. Here’s what this looks like in numbers:

Search Service latencies. Purple/Blue: averages; Red/Yellow: 95th percentile. Numbers in seconds(!)

Latency issues started on July 15th, with the eventual resolution coming about a week later. The sharp p95 increase caused all sorts of problems. Let me explain what caused it and how we resolved it.

Getting into Trouble

Consider the operability characteristics of the Core API, the Search Service, and the Kubernetes nodes. The Core API and Search Service are both Ruby applications running with threaded servers. We use MRI, which has a GIL, so no two threads may execute Ruby code at the same time. The corollary is that each process will only use one CPU (or 1000m in Kubernetes terms). The cluster nodes were c4.xlarge with 4 vCPUs (or 4000m in Kubernetes terms). That makes it straightforward to estimate how much compute capacity the cluster requires. This is where we made our first mistakes, which were then aggravated by (seemingly) unrelated changes.
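Given that each MRI process can effectively use at most one CPU, the per-pod resource numbers follow almost mechanically. Here is a sketch of what that looks like in a Deployment’s pod spec; the values (the memory figures in particular) are hypothetical, not our actual manifests:

```yaml
# Hypothetical container resources for an MRI-based service.
# The GIL caps each process at ~1 CPU, so requesting and limiting
# 1000m per container keeps the scheduler's math honest: a 4000m
# (c4.xlarge) node can then hold at most ~4 such replicas.
resources:
  requests:
    cpu: "1000m"
    memory: "1Gi"   # illustrative; memory sizing is a separate exercise
  limits:
    cpu: "1000m"
    memory: "1Gi"
```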

We started seeing production issues seemingly related to CPU usage. Engineers increased replicas for the Core API and/or the Search Service. This temporarily resolved the unavailability issues, but aggravated other conditions, and that’s when the real problems started.

First, those components did not set CPU limits. Second, the Core API and the Search Service were not scaled up in tandem. The Core API calls the Search Service in the most frequent API requests, which makes the Search Service the bottleneck. In some cases the Core API replicas were doubled without doing the same for the Search Service, so more Core API replicas spent longer waiting for Search Service connections. That created more CPU load on each node, which in turn impacted other pods on the node and aggravated the mismatch between pod CPU requirements and node capacity.

We had created a noisy neighbor problem. There was a straightforward way to resolve it once we realized what was going on.

Getting Out of Trouble

We were packing more and more replicas onto the wrong number of incorrectly vertically scaled nodes. The fact that we did not set CPU limits introduced seemingly random problems for some pods: we would see more or fewer problems depending on how the scheduler placed pods onto individual nodes. We could end up with a single 4000m node running 15 replicas (each trying to use 1000m), or 10 nodes running 2 replicas each.

The solution required three changes:

  1. Increase the vertical node capacity
  2. Decrease the number of cluster nodes
  3. Decrease the replica counts for the Core API and the Search Service

Increasing the vertical node capacity gives each node much more headroom to handle Core API or Search Service replicas. Naturally, the number of cluster nodes can decrease given that more CPUs are available on each individual node. Decreasing the replica counts minimized the noisy neighbor effect and ensured the replicas that were running had full CPU access.

We went from 20 c4.xlarge nodes (4 vCPUs each, 80 in total) to 7 c4.4xlarge nodes (16 vCPUs each, 112 in total) and tuned the replica counts accordingly. The problems went away. Great success! The next question is: how do we avoid this in the future?

Staying Out of Trouble

The following preventative measures largely mitigate the problem.

  • Enforce the various CPU allocation limits on the node itself. This minimizes over-provisioning the node.
  • Enforce CPU limits on all containers. This ensures nodes are not over-provisioned (in combination with the previous point) and that pods always receive the CPU they require. One way to enforce this is sketched after this list.
  • Better estimate node vertical scale in relation to the pods that will eventually land on them. Not all pods are created equal; some are more CPU intensive than others.
  • Consider bottlenecks when changing replica counts. If component A calls component B, simply doubling the replica count for A will most likely not resolve issues in component B.
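One way to enforce per-container CPU limits across a namespace is a LimitRange, so containers that forget to declare requests and limits still get sane defaults. A minimal sketch, assuming a hypothetical namespace named production; the actual values should come from the capacity math above:

```yaml
# Hypothetical LimitRange: default CPU request/limit for containers
# that declare none, plus a cap on what any single container may request.
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: "500m"   # applied when a container sets no CPU request
      default:
        cpu: "1000m"  # applied when a container sets no CPU limit
      max:
        cpu: "2000m"  # upper bound for any single container
```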

I’m a bit embarrassed we got into this situation, but some growing pains are expected. We learned our lesson and will do better in the future. I hope this post helps you run Kubernetes in production.

Good luck out there. Happy shipping!