This post follows up on part 1. It documents my experience and process building a reliable test process for our helm chart. The general idea is that the “clean-build-test-install-verify” flow should work consistently across repeated runs, with success or failure in one run never impacting subsequent runs.

Part 1 covered the general setup, process, and analysis for experiment run 1. This post covers the analysis for runs 2–8.

Run 2

Changes since run 1:

  • Terminated the failing node mentioned in part 1
  • Added -o wide to the debugging output to provide more information on potentially failing pods (shown below).
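
For reference, the debugging command is just the standard pod listing with the wide output format (the release label selector matches what the test scripts use):

# -o wide adds the NODE and IP columns, which is what later ties failing
# pods back to a single bad node.
kubectl get pods -l "release=${release}" -o wide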

This run was a complete failure. There wasn’t a working tiller throughout the entire run: various helm commands failed with “transport closed” or “connection refused” type errors. There was not a single success across 10 hours. Unfortunately there was nothing in the tiller logs to debug with.

Run 3

Changes since run 2:

  • Deleted broken tiller pod

The results were similar to run 1: consecutive successes followed by consecutive failures. There was a single alert from our monitoring system regarding a failing kubelet PLEG (Pod Lifecycle Event Generator) which coincided with the failures. All failing pods were confined to a single failing node (visible via -o wide). This led me to conclude that this was a node issue and not something in helm/tiller.

Luckily I remembered to check dmesg as part of the follow-up on my periodic Docker daemon lock up issue. The logs revealed processes killed by the kernel OOM killer. The node was at 98% memory utilization. That’s not good for anything. This led me to conclude that the pod density per node was too high, and the next run needed to address that. It also made me wonder: what is the kubelet’s memory pressure check and why didn’t it alert in this case?
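
The checks on the node itself were roughly the following (the exact grep pattern is illustrative; any OOM killer message in dmesg is the smoking gun):

# Run over SSH on the suspect node.
dmesg | grep -iE 'out of memory|killed process'   # OOM killer activity
free -m                                           # current memory utilization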

Run 4

This run’s changes focus on memory pressure and monitoring:

  1. Set --eviction-hard=memory.available<500Mi as described here.
  2. Added resource requests and limits to all pods in the chart. Test installs happen using the sandbox stage. This stage is unique because it includes datastore (e.g. mongodb, redis, or mysql) pods for all applications in the chart. Unfortunately these pods were forgotten in the commit that added resource requests/limits to the other containers. On the other hand, adding limits revealed another issue: running the JVM in containers. We fixed this with our open source local DynamoDB image, which uses -XX:+UseCGroupMemoryLimitForHeap to derive the JVM heap limit from the Docker cgroup (see the example after this list). You can read up on JVM heap settings in Docker. Here’s more info on the new -XX:+UseCGroupMemoryLimitForHeap setting.
  3. Added alarms for memory usage > 90% on Kubernetes nodes.
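
As a quick illustration of the JVM flag mentioned in change 2 (using a stock openjdk image rather than our DynamoDB image), the reported heap ceiling follows the container’s cgroup memory limit rather than the host’s total memory:

# Illustrative only. The flag is experimental on JDK 8u131+, hence
# -XX:+UnlockExperimentalVMOptions. With a 512m cgroup limit the printed
# MaxHeapSize is derived from the container limit, not the host.
docker run --rm -m 512m openjdk:8-jre \
  java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap \
       -XX:+PrintFlagsFinal -version | grep -i maxheapsize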

The first change ensures that pods are evicted (moved to another node) in high memory pressure scenarios. Our cluster uses m4.large instances with 8GB of memory, so this eviction policy kicks in at roughly 95% memory utilization.

The second change minimizes the possibility that the eviction policy is triggered. Declaring resource requests ensures pods are only scheduled where they fit. Declaring resource limits caps each container’s resource utilization, preventing the unexpected growth that may trigger evictions or downtime.

These two changes should keep nodes in a healthier state and perhaps minimize the chance of a Docker daemon lock up. Computers don’t like it when they run out of memory — especially when there’s no swap!
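
Two quick checks help confirm the new settings are in effect (the node name below is a placeholder):

# MemoryPressure should report False once the eviction threshold leaves the
# kubelet enough headroom.
kubectl describe node ip-10-0-1-23.ec2.internal | grep MemoryPressure

# Spot-check that containers in the release now declare requests and limits.
kubectl describe pods -l "release=${release}" | grep -A 6 'Limits:'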

The results are encouraging! The experiment ran for 12 hours with a high success rate: 51 successes and 3 failures. There were also no alarms during the experiment. More importantly, the failures did not impact subsequent runs; there were successful runs after failed ones. Here’s a snippet from the log:

failures=0 success=1
failures=0 success=2
... snipped ...
failures=0 success=47
failures=0 success=48
failures=1 success=48
failures=2 success=48
failures=2 success=49
failures=2 success=50
failures=2 success=51
failures=3 success=51

The failures are internal errors from helm test. Here’s an example:

Run options: --seed 11474

# Running:

..E...E.....

Finished in 4.355899s, 2.7549 runs/s, 9.1829 assertions/s.

1) Error:
GenericRPCsTest#test_ping_email_service:
Thrift::TransportException: Could not connect to test-efritin-kviberg-efr-email-service-thrift:9090:
    /usr/local/bundle/gems/thrift-0.10.0.0/lib/thrift/transport/socket.rb:57:in `open'
    /usr/local/bundle/gems/thrift-0.10.0.0/lib/thrift/transport/framed_transport.rb:37:in `open'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:91:in `call'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/rpc_metrics.rb:9:in `block in call'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:30:in `time'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/rpc_metrics.rb:8:in `call'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/client_metrics.rb:9:in `block in call'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:30:in `time'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/client_metrics.rb:8:in `call'
    /usr/local/bundle/gems/middleware-0.1.0/lib/middleware/runner.rb:31:in `call'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:188:in `block in invoke'
    /usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:64:in `block (2 levels) in with'
    /usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `handle_interrupt'
    /usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `block in with'
    /usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `handle_interrupt'
    /usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `with'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:187:in `invoke'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:63:in `block (3 levels) in build'
    test/stage1/generic_rpcs_test.rb:60:in `assert_rpcs'
    test/stage1/generic_rpcs_test.rb:27:in `test_ping_email_service'

2) Error:
GenericRPCsTest#test_ping_search_service:
Thrift::TransportException: Could not connect to test-efritin-kviberg-efr-search-service-thrift:9090:
    /usr/local/bundle/gems/thrift-0.10.0.0/lib/thrift/transport/socket.rb:57:in `open'
    /usr/local/bundle/gems/thrift-0.10.0.0/lib/thrift/transport/framed_transport.rb:37:in `open'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:91:in `call'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/rpc_metrics.rb:9:in `block in call'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:30:in `time'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/rpc_metrics.rb:8:in `call'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/client_metrics.rb:9:in `block in call'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:30:in `time'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter/middleware/client_metrics.rb:8:in `call'
    /usr/local/bundle/gems/middleware-0.1.0/lib/middleware/runner.rb:31:in `call'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:188:in `block in invoke'
    /usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:64:in `block (2 levels) in with'
    /usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `handle_interrupt'
    /usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:63:in `block in with'
    /usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `handle_interrupt'
    /usr/local/bundle/gems/connection_pool-2.2.1/lib/connection_pool.rb:60:in `with'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:187:in `invoke'
    /app/vendor/cache/thrifter-adf42779d5bd/lib/thrifter.rb:63:in `block (3 levels) in build'
    test/stage1/generic_rpcs_test.rb:60:in `assert_rpcs'

These errors indicate that some containers were not ready when the test executed. The pods must have been ready at some point, though, because the await-release step blocks until all pods are ready, and readiness and liveness probes are defined for all containers. One of the following reasons likely caused the problem:

  1. Failed liveness probe
  2. Failed readiness probe
  3. Container restarted for scheduling reasons

There is no way to know the cause from the data captured at the time of this analysis. The test-release script requires more logging in the failed-test case.

Curiously, there seems to be a relationship between successive failures. Consider this log snippet:

failures=0 success=48
failures=1 success=48
failures=2 success=48
failures=2 success=49
failures=2 success=50
failures=2 success=51
failures=3 success=51

There were 48 successful executions, followed by 2 failures, 3 successes, and another failure. The test log shows the same error (the earlier log snippet) for all failing runs. This leads me to think there is something fishy going on with the specific containers in question, but there is no supporting evidence at this time. Hopefully the next run provides more data.

Run 5

This run’s goal is to acquire more data around pod state after failed test runs. This should provide enough data to correctly determine the failure reason. I’ve made the following changes for run 5:

  1. Print all pods in the release in the failure case
  2. Describe all pods in the non-running state in the failure case. This shows the event log — more importantly the liveness and readiness probe history — along with other useful information.

Here’s the updated script/test-release.
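
In essence, the failure case now does something like this (a simplified sketch rather than the full script):

# Simplified sketch of the run 5 debugging additions. On a failed helm test,
# dump pod state for the release and describe anything that is not Running.
if ! helm test "${release}"; then
  kubectl get pods -l "release=${release}" -o wide

  kubectl get pods -l "release=${release}" --no-headers \
    | awk '$3 != "Running" { print $1 }' \
    | xargs -r -n1 kubectl describe pod

  exit 1
fi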

The experiment ran for approximately 18 hours. The results are encouraging! There were 90 runs with 85 successes and 5 failures. Again, failures in one run did not impact subsequent runs. The results show that failed liveness probes caused some of the failures. Here’s the log, grepped for the specific success and failure messages. Note that the output also captures useful bits of kubectl describe to debug such failures:

Liveness: tcp-socket :9090 delay=5s timeout=1s period=10s #success=1 #failure=3
Liveness: tcp-socket :9090 delay=5s timeout=1s period=10s #success=1 #failure=3
  failures=1 success=0
  failures=1 success=1
  failures=1 success=2
  failures=1 success=3
  failures=1 success=4
  failures=1 success=5
  failures=1 success=6
  failures=1 success=7
  failures=1 success=8
  failures=1 success=9
  failures=1 success=10
  failures=1 success=11
  failures=1 success=12
  failures=1 success=13
  failures=1 success=14
  failures=1 success=15
  failures=1 success=16
  failures=1 success=17
  failures=1 success=18
  failures=1 success=19
  failures=1 success=20
  failures=1 success=21
  failures=1 success=22
  failures=1 success=23
  failures=1 success=24
  failures=1 success=25
  failures=2 success=25
  failures=2 success=26
  failures=2 success=27
  failures=2 success=28
  failures=2 success=29
  failures=2 success=30
  failures=2 success=31
  failures=2 success=32
  failures=2 success=33
  failures=2 success=34
  failures=2 success=35
  failures=2 success=36
  failures=2 success=37
  failures=3 success=37
  failures=3 success=38
  failures=3 success=39
  failures=3 success=40
  failures=3 success=41
  failures=3 success=42
  failures=3 success=43
  failures=3 success=44
  failures=3 success=45
  failures=3 success=46
  failures=3 success=47
  failures=3 success=48
  failures=3 success=49
  failures=3 success=50
  failures=3 success=51
  failures=3 success=52
  failures=3 success=53
  failures=3 success=54
  failures=3 success=55
  failures=3 success=56
  failures=3 success=57
  failures=3 success=58
  failures=3 success=59
  failures=3 success=60
  failures=3 success=61
  failures=3 success=62
  failures=3 success=63
  failures=3 success=64
  failures=3 success=65
  failures=3 success=66
  failures=3 success=67
  failures=3 success=68
  failures=3 success=69
  failures=3 success=70
Liveness: tcp-socket :9090 delay=5s timeout=1s period=10s #success=1 #failure=3
  failures=4 success=70
  failures=4 success=71
  failures=4 success=72
  failures=4 success=73
  failures=4 success=74
  failures=4 success=75
  failures=4 success=76
  failures=4 success=77
  failures=4 success=78
  failures=4 success=79
Liveness: tcp-socket :9090 delay=5s timeout=1s period=10s #success=1 #failure=3
Liveness: tcp-socket :9090 delay=5s timeout=1s period=10s #success=1 #failure=3
  failures=5 success=79
  failures=5 success=80
  failures=5 success=81
  failures=5 success=82
  failures=5 success=83
  failures=5 success=84
  failures=5 success=85

This run provides useful data for analyzing the 5 failures. Here’s the analysis of each failure from the log data.

  1. This is an interesting failure. The log shows script/await-release finished immediately. This has never happened before. I cannot believe that all pods were available just after helm install completed. I chalk this up to a curious bug in script/await-release. The test failure logs show that two RPCs timed out. The get pods output shows the relevant pods were still running their init containers. This run proves there is a bug somewhere, but it’s uncertain where. I will review script/await-release before run 6.
  2. This is a similar failure to #1. One RPC timed out to one of the same containers from failure #1, except this log shows that script/await-release finished correctly and all pods were running. I would expect this test to pass given the information on hand. The default RPC timeouts need confirmation. It’s possible that various load factors prevented the server from responding in time. Centralized logging would be useful here as well. I cannot think of a straightforward way to associate tests with pods for more accurate log collection in the failure case. It’s likely the only solution is to dump logs for all pods.
  3. Same situation as #2.
  4. A similar situation to #1. The test failure logs show that script/await-release completed successfully, but the pods were restarted at some point during the test run.
  5. Similar to #1. script/await-release immediately reported that all pods were available. The failed test results show that RPCs failed because pods were still running init containers.

These results lead me to conclude that:

  • There is a bug in script/await-release’s check for Running pods. It’s possible that kubectl get pods shows them as Running for a moment before the init containers start. The logs also show that the pods had zero restarts.
  • Liveness probe initial delay seconds may need to be increased
  • RPC timeouts may need to be increased
  • Recording pod state before running helm test would help debug which pods underwent status changes during the test run.
  • script/await-release should print the kubectl get pods output when everything is OK. This would help debug why grep -vqF 'Running' returned zero. I have a hunch that kubectl get pods is not returning more than two lines and the tail -n +2 is making the pipeline return an empty string, although I have no idea why that would occur. I’ll add a test that the output is non-empty as a compound condition for release readiness (see the snippet below). This, combined with printing the state of all pods, would confirm these assumptions.
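
To make that hunch concrete, here is how grep -vqF 'Running' behaves on the different kinds of input that could reach it. Note that a single blank line (what echoing an empty variable produces) behaves differently from truly empty input:

# Truly empty input: no lines at all, grep selects nothing and exits 1.
printf '' | grep -vqF 'Running'; echo "exit=$?"    # exit=1

# A single blank line: it does not contain 'Running', so grep selects it
# and exits 0.
echo '' | grep -vqF 'Running'; echo "exit=$?"      # exit=0

# A Pending pod in the listing: selected, exit 0.
printf 'pod-a 1/1 Running\npod-b 0/1 Pending\n' | grep -vqF 'Running'; echo "exit=$?"   # exit=0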

Run 6

This run’s goal is to debug issues in script/await-release. Changes in this run:

  1. Capture and print release pod status in the event of a failed helm test
  2. Print all pod statuses when script/await-release registers everything as ready
  3. Add a guard clause that kubectl get pods -l "release=${release}" | tail -n +2 returns a non-empty string before testing for non-Running pods (sketched below).
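
A simplified sketch of the guarded readiness check after these changes (not the verbatim script):

pods="$(kubectl get pods -l "release=${release}" | tail -n +2)"

# Guard: an empty capture must never count as "everything is Running".
if [ -n "${pods}" ] && ! echo "${pods}" | grep -vqF 'Running'; then
  echo "All pods in release ${release} are Running:"
  echo "${pods}"
else
  echo "Still waiting on:"
  echo "${pods}"
fi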

I’ve not changed anything regarding timeouts or liveness probes in this run. I have a feeling that probes themselves are configured with appropriate delays and timeouts. It’s more likely that pods were never ready in the first place, thus invalidating prerequisites for script/test-release.

I started the experiment and monitored it until failures occurred. I noticed two failures and terminated the run when it was clear that the changes were not reducing them.

The log shows some strange behavior in script/await-release. The following gist shows the problem area along with a log snippet. The script decided everything was “Running”, but the subsequent lines show the text grep ran against: it contains pods in both Pending and Running states. The script should not have counted this as OK.

At this point my only guess is strange grep behavior.

Run 7

The goal for this run is to further debug the failures in script/await-release. I’ve made a single change: set -x to print all commands. This makes it possible to see variable state and exactly what string grep ran against.

I eagerly babysat the test runs and waited for a failure. I got one after 10 hours or so. Here’s the overall result before analyzing each individual result:

failures=0 success=1       
failures=0 success=2       
failures=0 success=3       
... snipped ...       
failures=0 success=39       
failures=1 success=39       
failures=2 success=39       
failures=2 success=40       
failures=2 success=41       
failures=2 success=42       
failures=2 success=43       
failures=2 success=44

There are two failures in this run. The first one is the most interesting: script/await-release incorrectly exited before everything was ready. Here’s a gist with script/await-release and the log from the last inner loop.

The log has long lines. There are annotations at the end of each line to aid understanding and analysis. Unfortunately Medium’s default width is too small, so scroll right to pick them up or open the gist on GitHub.


Let’s step through this log.

  1. Log lines 1–2 correspond to line 60 in the script. This pipeline captures all the pods in the release.
  2. Log line 3 shows the complete string assigned to the pods variable. Note that the very first pod in the list is Pending.
  3. Log line 27 is the first part of the compound condition in the if on script line 68. The value correctly tests as non-empty.
  4. Log line 51 is the grep command in the if statement.
  5. Log line 88 is a second grep command. However, I cannot determine why it is printed. Could it cause a failure? What is it doing? It’s unknown.
  6. Log line 113 corresponds to script line 80. This means that some combination of those two grep commands (really, how is there a second call?) made the if statement evaluate to false and trigger the else body.
  7. Log line 117 corresponds to script line 82. $pods is written to stdout. You can see that the first pod in the list is still Pending.

You can see from point #2 that the string does contain Pending. Truthfully, I have no idea how this is happening, and it’s curious because it only happens occasionally. I can only guess that bash has somehow parsed that if statement (script line 68) incorrectly and run grep twice, causing a non-zero exit. That’s a hand-wavy answer because I really have no idea what’s going on. Simplifying the script for the next run could remove this strange behavior.

An internal error in one database container caused the second failure. My guess is that Elasticsearch index creation is not synchronous, or there is some (legitimate) bug in the application code in the empty-result case. Either way this is outside the scope of this experiment.

Run 8

The goal for this run is to (again) solve the bug in script/await-release. The script is updated to collapse the conditional: output is redirected to a scratch file and grep -qvF runs against that file instead of standard input. Here’s the complete script/await-release for run 8:
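
In essence, the inner loop now looks like this (a simplified sketch rather than the complete script):

# Capture the pod listing in a scratch file and run grep against the file
# instead of standard input.
scratch="$(mktemp)"

while true; do
  kubectl get pods -l "release=${release}" | tail -n +2 > "${scratch}"

  # Ready only when something was captured and no pod is in a state other
  # than Running.
  if [ -s "${scratch}" ] && ! grep -qvF 'Running' "${scratch}"; then
    cat "${scratch}"
    exit 0
  fi

  sleep 5
done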

The results are surprising! This run lasted for almost 48 hours simply because I had forgotten about it over the week. The results are 176 successful runs and 1 failure! I cannot say for certain that the change fixed the problem because I could not reproduce it. I also cannot say if it was just a transient problem or something to do with stdin inside a nohup process. Regardless, this is the best outcome of all the runs. A timeout caused the one failure. The occasional timeout always comes from the same container. This pushes me to bump the timeout for calls to the service running in that container. However, it happens rarely enough that it may not matter in day-to-day usage. Run 8 is enough for me to conclude this experiment.

Conclusion

It took two weeks to conclude this experiment. It has been interesting and informative. Here are my takeaways:

  • Declare resource requests and limits for every container
  • Set the proper node allocation limits. I think setting --eviction-hard had the biggest impact on this experiment.
  • Ensure your EBS volumes are provisioned with enough IOPS
  • Pre-pull images to lessen the initial pull load. The impact depends on how big the initial pulls are. If your images are small then you need not worry. It had an impact for us because our images total ~10GB.
  • The Docker daemon lock up issue is fixed in Kubernetes 1.6. This PR changes the kubelet health check to improve this failure mode. These experiments happened on Kubernetes 1.5.4. Thanks to the helpful #helm-dev channel for pointing this out. I’ll keep my issue open until I can repeat the experiments on Kubernetes 1.6.
  • Kops 1.5.x is missing the Docker babysitter, which may have resolved the stability issues in earlier runs. The fix is scheduled for Kops 1.6. The workaround until then is to enable it manually on each node; unfortunately, kops 1.5.x has no support for running user-provided commands on the nodes themselves, so this one will have to wait until we upgrade to Kops 1.6. Combined with the previous point, this should make nodes more resilient to different Docker daemon failure modes. These two points make me hopeful for Kubernetes 1.6 (when we’re ready to upgrade).
  • Getting a good helm install/delete/test UX requires (in my opinion) too much end-user scripting. helm install --wait will eventually be fixed; script/await-release works well enough until then. helm test requires the most legwork and is the most disappointing of the three. Our script/test-release works pretty well. It addresses the known helm test exit code issue, deletes existing test pods before running the test, prints test pod logs on failed test runs, and provides a debugging snapshot. I may add another step to print all container logs when the test fails. The general UX problem with helm test is that it does not provide the information required to understand what failed. Our wrapper for helm delete handles different cases which may not make sense to have in helm itself. I suggested that helm delete include a --wait flag to match helm install. This would improve the UX when either is part of synchronous execution.
  • HA Tiller is a pending story and it’s unknown what will happen here. Helm 2.4 fixes some known tiller panics, but I’m worried about new failures occurring and about how reliable tiller is in general. Hopefully this changes in future Helm releases. There is no workaround until then.
  • Running this clean-build-install-test loop has been a great test for our clusters. I’m happy with the results and will apply the same method to test Kubernetes/Helm versions before upgrading.

That’s all for this series. My next step is parallelizing the process. The non-parallel process takes about 20 minutes. That’s a long wait given that this process happens on every git commit. Ideally I’d like it to be 10 minutes. That remains to be seen though!

I suggest you run a similar experiment on your own clusters. You never know what you’ll find. I had no idea I’d encounter that grep thing.

Good luck out there and happy shipping!