Saltside’s SRE team is leading the charge to replace our home-grown container orchestration setup with Kubernetes. Saltside builds classifieds sites (as in user-posted ads, not secrets). We operate in multiple markets; I’ll refer to the total number of markets as M going forward. Each market’s containers are deployed and operated separately from the others. This ensures that each market’s infrastructure can be scaled independently and that failures in one market do not impact the others. Since each market’s infrastructure is deployed separately, it must progress through every stage of the deployment process. I’ll refer to the number of stages as S going forward.

Our solution leverages Helm. We build one Helm chart per market per stage, which creates M * S charts. Technically we keep all the configuration in one values.yaml and pick out the appropriate values based on $.Values.market and $.Values.stage. This detail is not entirely relevant to this post, but I figured it may be interesting for those curious about the implementation. I’ll share details about the complete solution in future posts.
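To make that concrete, the build step might look roughly like the sketch below. The market names and the tmp/charts/<market>-<stage> layout come from the logs later in this post; the stage list, the chart/ source directory, and script/render-values are hypothetical stand-ins for our real implementation.

```bash
#!/usr/bin/env bash
# Sketch only: build one preconfigured chart directory per market/stage pair.
# Market names and the tmp/charts layout match the logs later in this post;
# the stage list, the chart/ source directory, and script/render-values are
# hypothetical stand-ins for our real implementation.
set -euo pipefail

markets=(bikroy ikman tonaton efritin)
stages=(sandbox)   # other stages omitted here

for market in "${markets[@]}"; do
  for stage in "${stages[@]}"; do
    target="tmp/charts/${market}-${stage}"
    rm -rf "${target}"
    cp -r chart "${target}"
    # Pull this market/stage's slice out of the single values.yaml
    # (the real chart does this via $.Values.market and $.Values.stage).
    script/render-values "${market}" "${stage}" > "${target}/values.yaml"
  done
done
```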

Last week’s objective was to demonstrate that the “clean-build-install-test” CI process worked reliably for our chart repo. Our team runs hundreds of CI builds a day without issue. We trust CI for the current codebases; there are no flakes, so failures are considered genuine. Our chart repo’s test suite must be equally reliable. Reliability for this repo is harder to achieve because of how many moving parts are involved. The high-level CI process looks something like this (a rough sketch in code follows the list):

  1. Clean up (delete all helm releases and ensure all k8s resources are removed from the cluster)
  2. Test our values.yaml for our specific semantics (e.g. each application has declared environment variables for every market and stage).
  3. Build the M * S charts. This produces a preconfigured chart for each combination of market and stage.
  4. Install the sandbox-stage charts on the non-production cluster.
  5. Run helm test against the releases installed in the previous step.
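Strung together, the pipeline is roughly the sketch below. make clean, script/await-release, script/test-release, and the test-<market> release names all appear later in this post; the validate-values and charts make targets are assumptions.

```bash
#!/usr/bin/env bash
# Rough sketch of the five CI steps above, not the actual script/ci/test.
set -euo pipefail

make clean            # 1. delete releases and wait for all resources to disappear
make validate-values  # 2. check values.yaml against our semantics (assumed target)
make charts           # 3. build the M * S preconfigured charts (assumed target)

for chart in tmp/charts/*-sandbox; do
  release="test-$(basename "${chart%-sandbox}")"
  helm install "${chart}" --name "${release}"   # 4. install the sandbox charts
  script/await-release "${release}"             #    wait until all Pods are Running
  script/test-release "${release}"              # 5. helm test via our wrapper
done
```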

I’d run script/ci/test over and over again from my own machine earlier in the week with varying results. The runs were inconsistent and time-consuming; the whole process takes ~20–25 minutes without parallelization. Instead I wanted something that would better approximate the conditions the engineering team will see.

The Experiment

My experiment is simple: repeat the script/ci/test command in an endless loop and count each success or failure. This experiment hits the following points:

  • make clean does, in fact, delete everything and sets up a clean state for the next run
  • A specific release name can be reused with helm install after helm delete --purge
  • Our Kubernetes cluster can consistently schedule, run, and delete all the resources in the application
  • Failures in one run do not impact the next run (this is important because parallel builds run against the same Kubernetes cluster)

This may not be the most scientific experiment, but it’s the best I can come up with right now. Please let me know any improvements in the comments.

The Execution

There are a few existing bugs in Helm that require workarounds in our code:

  1. helm install --wait does not work. Our install process executes helm install, then uses a separate script to wait until all Pods are running (sketched after this list).
  2. helm install does not report release names synchronously. We have a script that waits for the release name to appear in helm ls.
  3. helm delete has no --wait. This matters because our scripts need to be as synchronous as possible. I’ve written a script that waits for successful deletion and handles the case where helm delete was initiated but never completed.
  4. helm test does not use proper exit codes. I’ve written a wrapper that checks the helm test output instead of the exit code (also sketched below). This is a known issue with multiple failure modes.
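Workarounds 1 and 4 might look roughly like the following sketch. These are not our actual scripts; the release label selector, attempt counts, and sleep intervals are assumptions.

```bash
#!/usr/bin/env bash
# Hedged sketches of workarounds 1 and 4 above -- illustrative only.
set -euo pipefail

# 1. Wait until every Pod in a release is Running, since helm install --wait
#    is unreliable for us. Assumes the chart labels Pods with release=<name>.
await_pods_running() {
  local release="$1"
  for attempt in $(seq 1 15); do
    local total not_running
    total="$(kubectl get pods -l "release=${release}" --no-headers 2>/dev/null | wc -l)"
    not_running="$(kubectl get pods -l "release=${release}" --no-headers 2>/dev/null \
      | awk '$3 != "Running"' | wc -l)"
    if [ "${total}" -gt 0 ] && [ "${not_running}" -eq 0 ]; then
      echo "All ${release} pods running. Done!"
      return 0
    fi
    echo "${release}: ${not_running} pods not running yet. ${attempt}/15 checks completed; retrying."
    sleep 20
  done
  return 1
}

# 4. helm test does not reliably signal failure through its exit code, so
#    inspect its output for FAILED/PASSED instead of trusting $?.
run_helm_test() {
  local release="$1"
  local output
  output="$(helm test "${release}" 2>&1 || true)"
  echo "${output}"
  if echo "${output}" | grep -q "FAILED"; then
    return 1
  fi
  echo "${output}" | grep -q "PASSED"
}
```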

Here are the various scripts used as shims in the process:

The test script (known as script/harden) is executed with nohup script/harden > harden.log &.
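script/harden itself is essentially the loop below; the “=> Results:” counter matches the lines in the logs that follow, while the rest is a sketch.

```bash
#!/usr/bin/env bash
# Sketch of script/harden: run the full CI script in an endless loop and keep
# a running tally of successes and failures.

success=0
failure=0

while true; do
  if script/ci/test; then
    success=$((success + 1))
  else
    failure=$((failure + 1))
  fi
  echo "=> Results: failure=${failure} success=${success}"
done
```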

Understanding the Log File

One loop of this process creates a lot of output. Here is the log output (with some irrelevant text removed) for a successful run:

Let’s walk through it:

  1. First, make clean tears down everything. This is lines 3–17 of the log.
  2. Next, the charts are built and tested before installing them. This is lines 23–44.
  3. Now the charts are installed on the test cluster. helm install happens, then script/await-release checks Pods matching a label selector until everything is running. You can see that over time the number of pending Pods decreases; this is clearly visible in the log messages.
  4. Finally script/test-release runs against each release. Helm reports everything as PASSED.

This proves that the process does work. The question is: how often does it work? To answer that, the execution can be repeated and the results measured.

Test Run 1

I ran nohup script/harden > harden.log & on our kops bastion node before going to sleep. I checked in the morning after it had run for about 9 hours. Here are the results:

  • 28 consecutive successes
  • 1 failure
  • 29 consecutive failures

The 28 consecutive successes surprised me. Earlier in the week I could not get a single success for one market, let alone all four in succession. Then everything broke. Something happened on the 29th run that broke it and every run after it. Was it the cluster itself? Was it a bad node? Did helm/tiller break in some way? So what went wrong? Let’s turn to the log:

```
All test-efritin pods running. Done!
--- Testing installation tmp/charts/bikroy-sandbox
env KUBE_CONTEXT=np.k8s.saltside.io KUBE_NAMESPACE=development script/test-release test-bikroy
RUNNING: test-bikroy-kviberg-bikr-smoke-test
PASSED: test-bikroy-kviberg-bikr-smoke-test
--- Testing installation tmp/charts/ikman-sandbox
env KUBE_CONTEXT=np.k8s.saltside.io KUBE_NAMESPACE=development script/test-release test-ikman
RUNNING: test-ikman-kviberg-ikman-smoke-test
PASSED: test-ikman-kviberg-ikman-smoke-test
--- Testing installation tmp/charts/tonaton-sandbox
env KUBE_CONTEXT=np.k8s.saltside.io KUBE_NAMESPACE=development script/test-release test-tonaton
RUNNING: test-tonaton-kviberg-ton-smoke-test
PASSED: test-tonaton-kviberg-ton-smoke-test
--- Testing installation tmp/charts/efritin-sandbox
env KUBE_CONTEXT=np.k8s.saltside.io KUBE_NAMESPACE=development script/test-release test-efritin
RUNNING: test-efritin-kviberg-efr-smoke-test
E0429 21:01:37.203572   22725 portforward.go:175] lost connection to pod
Error: transport is closing
test-efritin test failed! Capturing logs and cleaning up
Makefile:132: recipe for target 'tmp/charts/efritin-sandbox.test-release' failed
make: *** [tmp/charts/efritin-sandbox.test-release] Error 1
=> Results: failure=1 success=28
```

This is the last step (helm test) of the process, so it got all the way to the end. The failure happened on the last market (test-efritin). Here’s the relevant section:

```
RUNNING: test-efritin-kviberg-efr-smoke-test
E0429 21:01:37.203572   22725 portforward.go:175] lost connection to pod
Error: transport is closing
test-efritin test failed! Capturing logs and cleaning up
Makefile:132: recipe for target 'tmp/charts/efritin-sandbox.test-release' failed
```

The Error: transport is closing message is a helm/tiller thing. This is another known issue with Helm/Tiller. My hypothesis is that a timeout occurs somewhere between kube-proxy, tiller, or the helm client. These failures may also cause tiller pods to crash; Kubernetes detects the failure and restarts the pod.

Helm runs a single tiller replica by default, so a crash creates downtime while the next pod starts up. This should not create permanent issues since tiller is a stateless process that stores its data in Kubernetes itself via ConfigMaps, at least in theory. Whether the replica crashed can be confirmed by looking at the tiller logs or the Kubernetes events.
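Concretely, those checks look something like this (a sketch that assumes the default Helm 2 setup: a tiller-deploy Deployment in kube-system with the standard app=helm,name=tiller labels):

```bash
# Did the tiller replica actually crash?
kubectl -n kube-system get pods -l app=helm,name=tiller     # RESTARTS > 0 means a crash
kubectl -n kube-system describe deployment tiller-deploy    # recent events, if any

tiller_pod="$(kubectl -n kube-system get pods -l app=helm,name=tiller \
  -o jsonpath='{.items[0].metadata.name}')"
kubectl -n kube-system logs "${tiller_pod}"
kubectl -n kube-system logs "${tiller_pod}" --previous      # only available after a restart
```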

Unfortunately there is nothing in kubectl logs for the tiller pod, nor any useful events in kubectl describe deployment for tiller. It seems there was no tiller crash in this case. I have to write it off as a transient issue if it does not repeat in the subsequent run. Here are the logs for the next run:

```
--- Cleaning
--> Deleting release test-bikroy
Error: could not find a ready tiller pod
==> No release found; deleting manually
==> Deleting any dangling Service
... snipped ...
==> Deleting any dangling Deployment
... snipped ...
==> Deleting any dangling Pod
... snipped ...
==> Deleting any dangling Secret
... snipped ...
==> Deleting any dangling ConfigMap
No resources found
--> Awaiting resource deleting confirmation
--> Awaiting helm confirmation
... snipped ...
test-bikroy still in tiller. 14/15 checks completed; retrying.
test-bikroy failed to purge from tiller delete in time.
Makefile:140: recipe for target 'clean' failed
make: *** [clean] Error 1
```

This log shows that the release failed to purge in time. The check works by grep'ing for the release name (test-bikroy in this case) in the output of helm ls --all -q, giving up after 15 attempts. It may simply take more than 15 attempts for the purge to complete, so the number of attempts could be increased in the next experiment. If that’s the issue, then it’s safe to assume the next run would wait for the purge again and eventually reach a consistent state.
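That check is roughly the loop below (a sketch: the messages mirror the log output, while the sleep interval is an assumption):

```bash
#!/usr/bin/env bash
# Sketch of the purge-confirmation loop.
release="$1"
for attempt in $(seq 1 15); do
  if ! helm ls --all -q | grep -qx "${release}"; then
    echo "${release} purged from tiller."
    exit 0
  fi
  echo "${release} still in tiller. ${attempt}/15 checks completed; retrying."
  sleep 10
done
echo "${release} failed to purge from tiller delete in time."
exit 1
```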

The logs for the next run show that the test-bikroy release was not found when it was checked, which indicates the --purge completed some time after the previous run’s timeout expired. The code dutifully continued on to the next market and then failed waiting for its pods to delete. Again, it is plausible that there are simply too few attempts to handle the deletion case. If so, we’d expect everything to eventually clean up after a large enough number of test runs.

The next run matches that assumption: test-bikroy and test-ikman were not found and were confirmed deleted. This run failed again, but for a different Helm release (test-efritin). It seems this may in fact be a timeout-related problem (solvable by increasing the timeout). Let’s fast-forward through the log until the initial clean step succeeds and examine the following runs for any patterns.

The logs show that the test-bikroy release registered as DEPLOYED. Unfortunately the next step failed to complete within a (generous) timeout, and the logs for all the subsequent runs fail in the same way. Here’s the tail of the last completed run before I terminated the experiment:

```
test-bikroy Pod still running. 14/15 tests completed; retrying.
NAME                                                            READY     STATUS        RESTARTS   AGE
test-bikroy-kviberg-bikr-admin-service-db                       0/1       Terminating   0          23m
test-bikroy-kviberg-bikr-image-service-db                       0/1       Terminating   0          23m
test-bikroy-kviberg-bikr-payment-service-db                     0/2       Terminating   0          23m
test-bikroy-kviberg-bikr-review-service-thrift-52563960-vj6rc   0/1       Terminating   0          23m
test-bikroy-kviberg-bikr-sms-service-thrift-1663037747-zcnb9    0/1       Terminating   0          23m
test-bikroy Pod failed to delete in time.
Makefile:140: recipe for target 'clean' failed
make: *** [clean] Error 1
=> Results: failure=34 success=28
```

Here the logs show that the same pods that stalled in Pending in earlier runs are also stalled in Terminating.

Analysis: Run 1

I terminated the experiment since it was apparent that there would be no change in any future runs. The results lead me to think:

  1. The delete-confirmation check gives up too soon. Doubling the attempts to 30 should account for more realistic --purge times and increase consistency.
  2. There’s something fishy about those pods. It’s strange that the same applications/containers (such as sms-service-thrift) would consistently fail for hours. It could be a node-level problem, but there’s no way to know that at the time of this analysis. The next run should add -o wide to show node information in the test output (see the snippet after this list).
  3. I have a hunch this is related to eventually consistent behavior across the components in the stack (helm -> tiller -> kubernetes -> docker -> kernel). Perhaps reusing the same Helm release names plays a part?
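The -o wide change from point 2 is as simple as the snippet below; the release label selector is an assumption about how our charts label Pods.

```bash
# -o wide adds NODE and IP columns, so a misbehaving node shows up directly
# in the test output.
kubectl get pods -l release=test-bikroy -o wide
```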

This may also be related to an issue I opened regarding periodic lockups in the Docker daemon, which resulted in pods stuck in the Pending state until the Docker daemon was restarted. Our kubelet checks showed that one node had been failing health checks for hours.

Failing kubelet checks notifications

Earlier experiments prepared me for this scenario. I added verbose logging to the Docker daemon to help debug this problem. The test logs do not show the node information, but it can be inferred from the AWS instance ID in the Slack ping. These logs will most likely not be conclusive for problems affecting all pods on that node, but they may yield enough information to add to the existing bug report.
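For reference, enabling verbose daemon logging can be as simple as the sketch below. It assumes the node reads /etc/docker/daemon.json and runs Docker under systemd, which may not exactly match our kops nodes.

```bash
# Turn on debug-level logging in the Docker daemon, then restart it.
# Note: restarting the daemon bounces the containers on that node.
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
```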

Relevant snippets from the kubelet logs regarding the pending pods:

```
22362 kubelet.go:1860] Failed cleaning pods: operation timeout: context deadline exceeded
Apr 30 06:18:37 ip-172-20-46-113 kubelet[22362]: E0430 06:18:37.298097   22362 kubelet.go:1522] Unable to mount volumes for pod "test-bikroy-kviberg-bikr-payment-service-db_development(e0c34301-2d53-11e7-a00a-02409e88de25)": timeout expired waiting for volumes to attach/mount for pod "development"/"test-bikroy-kviberg-bikr-payment-service-db". list of unattached/unmounted volumes=[default-token-1b86g]; skipping pod
Apr 30 06:18:37 ip-172-20-46-113 kubelet[22362]: E0430 06:18:37.298504   22362 pod_workers.go:184] Error syncing pod e0c34301-2d53-11e7-a00a-02409e88de25, skipping: timeout expired waiting for volumes to attach/mount for pod "development"/"test-bikroy-kviberg-bikr-payment-service-db". list of unattached/unmounted volumes=[default-token-1b86g]
Apr 30 06:18:37 ip-172-20-46-113 kubelet[22362]: E0430 06:18:37.299065   22362 kubelet.go:1522] Unable to mount volumes for pod "test-bikroy-kviberg-bikr-review-service-thrift-52563960-tw388_development(e0e7d017-2d53-11e7-a00a-02409e88de25)": timeout expired waiting for volumes to attach/mount for pod "development"/"test-bikroy-kviberg-bikr-review-service-thrift-52563960-tw388". list of unattached/unmounted volumes=[default-token-1b86g]; skipping pod
Apr 30 06:18:37 ip-172-20-46-113 kubelet[22362]: E0430 06:18:37.299085   22362 pod_workers.go:184] Error syncing pod e0e7d017-2d53-11e7-a00a-02409e88de25, skipping: timeout expired waiting for volumes to attach/mount for pod "development"/"test-bikroy-kviberg-bikr-review-service-thrift-52563960-tw388". list of unattached/unmounted volumes=[default-token-1b86g]
Apr 30 06:18:41 ip-172-20-46-113 kubelet[22362]: I0430 06:18:41.511963   22362 helpers.go:101] Unable to get network stats from pid 3962: couldn't read network stats: failure opening /proc/3962/net/dev: open /proc/3962/net/dev: no such file or directory
```

This snippet is repeated throughout the entire kubelet log. Here’s another repeated snippet documenting the failing health check:

```
Apr 30 07:11:38 ip-172-20-46-113 kubelet[22362]: logging error output: "[+]ping ok\n[+]syncloop ok\n[-]pleg failed: PLEG took longer than expected: pleg was last seen active at 2017-04-30 07:09:37.757209172 +0000 UTC\nhealthz check failed\n"
```

Unfortunately I cannot grok any useful information from 128MB of Docker daemon logs. I have archived the kubelet and Docker daemon logs for further analysis, though.

Conclusion

There is nothing conclusive about this experiment. It must be rerun after fixing the broken kubelet. Run 2 will use updated timeouts and, hopefully, hit no kubelet errors. Check back for that analysis in part two.