Cleaning 12k Helm 2 Configmaps

If you’ve run Helm in a development k8s cluster for over a year and weren’t aware of TILLER_HISTORY_MAX, you’re not alone 🤣🤣

Introduction

I’m Ryan, and I’ve helped build and maintain the DevOps infrastructure at Calm for the past 1.5 years. We run Kubernetes in AWS EKS and deploy via Helmfile.

Here’s a quick post about a Helm 2 issue we ran into with our development k8s cluster.

‘Twas a Normal Day Until..

Last week, everything was running smoothly until we started noticing failing deployments and blocked developers.

When we scanned the CI/CD logs, we saw Helmfile deployments failing due to timeouts. This wasn’t a case of the application deploying into a broken state, though; the new version of the application never went out in the first place.

Digging deeper, we checked the Tiller logs and noticed errors mentioning timeouts against the configmaps endpoint of the k8s API server:

Error: Get https://172.20.0.1:443/api/v1/namespaces/kube-system/configmaps?labelSelector=OWNER%!D(MISSING)TILLER: read tcp 10.1.123.123:48172->172.20.0.1:443: read: connection timed out

We also noticed the helm ls command would fail, generating additional errors like this ^^.

However, kubectl get cm commands would succeed.
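
For anyone retracing these steps, assuming Tiller was installed under the default tiller-deploy name in kube-system, its logs can be pulled with:

-> kubectl -n kube-system logs deploy/tiller-deploy --tail=100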

We figured these symptoms were a bit odd, so we shoulder-tapped our friendly neighborhood AWS TAM. We wanted additional insight into the control plane and guidance on parsing the k8s API logs in AWS CloudWatch, just to rule out any deep-seated issues. 👍

Google to the Rescue

Of course we Googled the error and stumbled on an open Helm issue regarding this.

Lo and behold, we accumulated over 12k configmaps in kube-system:

-> kubectl -n kube-system get cm | wc -l
   12766
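
To see how many of those belonged to Tiller specifically, the OWNER=TILLER label from the error above works as a filter (assuming Tiller’s default configmap storage backend):

-> kubectl -n kube-system get cm -l OWNER=TILLER --no-headers | wc -l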

You’d think 10s of MB worth of configmaps wouldn’t cause issues.. but hey 🤷‍♂️

Resolution

From the Helm issue above, I grabbed and modified an existing script to clean up Helm configmaps older than each release’s last 10 versions:
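
The exact script isn’t reproduced here, but a minimal sketch of the idea, assuming Tiller’s default configmap storage (each revision stored in kube-system as <release>.v<version>, labeled OWNER=TILLER, NAME=<release>, and VERSION=<revision>), looks roughly like this:

#!/usr/bin/env bash
# Prune Tiller release configmaps, keeping only the last $KEEP revisions per release.
# Sketch only; relies on GNU head/xargs as found on most Linux CI images.
set -euo pipefail
KEEP=10

# Distinct release names, taken from the NAME label Tiller sets on each configmap.
releases=$(kubectl -n kube-system get cm -l OWNER=TILLER \
  -o jsonpath='{range .items[*]}{.metadata.labels.NAME}{"\n"}{end}' | sort -u)

for release in $releases; do
  # List this release's configmaps as "<version> <name>", oldest first,
  # then delete everything except the newest $KEEP.
  kubectl -n kube-system get cm -l "OWNER=TILLER,NAME=${release}" \
    -o jsonpath='{range .items[*]}{.metadata.labels.VERSION}{" "}{.metadata.name}{"\n"}{end}' \
    | sort -n \
    | head -n -"${KEEP}" \
    | awk '{print $2}' \
    | xargs -r kubectl -n kube-system delete cm
done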

After running this, the cluster began to behave properly once again. 😅

Here We Are in the Future

How can we prevent this from happening in the future?

“Migrate to Helm 3”

Lol, yes of course. It’s backlogged and we’ll get to it. 🤣

How can we prevent this from happening in the near future?

The quick and dirty solution that we should have already known about is to update the tiller-deploy k8s deployment to include the environment variable:

TILLER_HISTORY_MAX=10

Which grants us automatic garbage collection!
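
For example, assuming the default tiller-deploy deployment in kube-system, the env var can be added in place with kubectl set env (which triggers a Tiller rollout), or baked in via helm init’s --history-max flag, which sets the same variable:

-> kubectl -n kube-system set env deployment/tiller-deploy TILLER_HISTORY_MAX=10
-> helm init --upgrade --history-max 10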

Contact & Hiring

If you’ve enjoyed reading this, please reach out; I’d love to hear what you’re working on.

Please check out Calm’s careers page for all available job openings.