Lessons learned running Kubernetes
I was responsible for Kubernetes at Better.com for over three years. My contributions included evaluating Kubernetes, migrating all services to it, provisioning and upgrading clusters, and resolving production issues (like scalability and downtime). It’s been a valuable experience and I learned a lot.
In 2016 the Better.com engineering team finished its transition of microservices to Docker. We needed a new platform to operate these microservices since our solution at the time, AWS Elastic Beanstalk, had many shortcomings. I spent a few weeks researching options.
I was interested in solving a variety of production problems that had come up over the previous year, such as zero-downtime deployments, easy scalability, high availability, and preview environments (a.k.a. review apps). I also wanted a solution that was portable, open source, locally executable, and capable of running scheduled jobs.
Most solutions I considered didn’t fit these requirements. Kubernetes was last on the list because it needed the most setup, but it did everything necessary. At the time, its 1.0 release had been out for almost a year, so it seemed production-ready. It just barely was.
Kubernetes cluster provisioning was a minefield in 2016. There were many projects claiming to do it well. I tried several - most failed to deliver. After dealing with kube-up.sh (which was very limited), Kubernetes Ansible (which had users copying client SSL certs from masters), and kube-aws (which provisioned nodes that would mysteriously disappear from the control plane), I went with the official cluster provisioning tool at the time, kops.
kops had a few issues. Shortly after the first production cluster was provisioned with kops, I discovered that the default Docker storage driver couldn’t be changed, leading to slow performance. Later I noticed nodes in the cluster were becoming unresponsive (fixed only by restarting Docker), a problem that quietly disappeared a few versions later. One time, a kops update took down the production cluster by changing internal certificates and breaking intra-cluster communication, eventually disabling all incoming traffic.
I ran into issues with Kubernetes itself. Most issues fell into two categories: “cloud” (AWS-related) and “operational” (workload-related).
“Cloud” issues usually had to do with networking. For example, the default network plugin, kubenet, requires an AWS VPC route table entry for every node. Route tables have a 100-entry limit so clusters have a 100-node limit, which is a problem when scaling. Another example was missing support for AWS ALB, which had to be implemented in-house until Ticketmaster released its ALB Ingress Controller. A recurring favorite (fixed after 3 years!) was Kubernetes deleting security groups attached to an ELB when the service that created the ELB was deleted.
“Operational” issues were more varied. For example, failing Kubernetes jobs would generate tons of terminated pods. The extra pods would cause masters to slow down, eventually causing timeouts in automation (and maxing out master CPU). Masters would also occasionally lose etcd quorum and get confused about the current leader, breaking automation.
One day a cron job for an ETL process was scheduled on a node with production services. The job used so much memory it eventually crashed the production services and took the site down. After this happened, all cron jobs were moved to their own nodes and strict resource requirements were enforced.
There were many other issues encountered along the way, including DNS latency due to ndots:5, failing kops updates due to route table changes, kube-state-metrics breaking for various reasons, and the combination of imagePullPolicy: Always with an AWS ECR outage almost taking production down. The journey to a stable Kubernetes deployment had many surprising detours.
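The ndots:5 latency mentioned above comes from how a stub resolver expands names before querying them. The following is a rough Python sketch of glibc-style search-list behavior, not the resolver's actual code. The function name and the search domains are assumptions for illustration; real pods may have additional search entries.

```python
def candidate_queries(name, search_domains, ndots=5):
    """Return the names a glibc-style stub resolver would try, in order."""
    # A trailing dot marks the name as fully qualified: no search list is used.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = f"{name}."
    # Names with at least `ndots` dots are tried as-is first;
    # all others go through the search list before the absolute lookup.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# Hypothetical search list, shaped like what a pod in the "default"
# namespace typically gets in /etc/resolv.conf.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

queries = candidate_queries("api.example.com", SEARCH)
for query in queries:
    print(query)
```

With these assumed search domains, an external name like api.example.com is tried against every search domain before the lookup that can actually succeed. Writing the name with a trailing dot, or lowering ndots in the pod's DNS config, skips the extra lookups.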
Take the long view
A project like Kubernetes is rare, due to its large and active community that includes huge enterprises. The chance is very good that someone else has your problem (at much bigger scale) and is working on a solution. Unlike many open source projects, you can expect issues to be addressed with some urgency.
For these reasons, it’s better to take the long view with Kubernetes. If you wait long enough, most problems will be solved. Several times during my experience with early Kubernetes versions (1.2 to 1.9), a problem that needed a manual workaround was eventually fixed upstream or solved with a “community standard” project.
Isolate and replicate
Keep your workloads separate and highly available:
- Deploy to a minimum of 3 availability zones.
- Ensure deployments have a replica in each availability zone.
- Use node selectors and different nodes for different tasks.
- Use pod and node anti-affinity to keep pods evenly distributed among nodes and availability zones.
- Use pod disruption budgets to ensure your deployments avoid downtime during cluster updates.
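Two of the items above can be sketched as manifests. This is a minimal Python example that builds a pod disruption budget and a zone-level anti-affinity block as plain dictionaries and prints them as JSON, which kubectl accepts alongside YAML. The helper names, the `app` label, and the values are assumptions for illustration, and `policy/v1` assumes a reasonably recent cluster.

```python
import json


def pod_disruption_budget(app, min_available):
    """Build a PodDisruptionBudget that keeps `min_available` pods running."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app}},
        },
    }


def zone_anti_affinity(app):
    """Build an affinity block that prefers spreading pods across zones."""
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {"app": app}},
                        # Well-known node label holding the availability zone.
                        "topologyKey": "topology.kubernetes.io/zone",
                    },
                }
            ]
        }
    }


# "web" is a hypothetical app label.
pdb = pod_disruption_budget("web", min_available=2)
affinity = zone_anti_affinity("web")
print(json.dumps(pdb, indent=2))
print(json.dumps(affinity, indent=2))
```

The affinity block would go under `spec.template.spec.affinity` in a deployment. The "preferred" form is used here so pods can still be scheduled if a zone is unavailable; the "required" form is stricter.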
Pin, limit, and validate
Pin your versions. This goes for Kubernetes, Helm, Docker, and any packages, tools, and images used. The problems caused by version mismatches are entirely avoidable.
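For container images, pinning can be checked mechanically. Here is a small Python sketch that flags image references with no tag or with the `latest` tag. The function name and sample images are assumptions, and the parsing is deliberately simple rather than a full image reference parser.

```python
def is_pinned(image):
    """True if the image names a digest or an explicit tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    # The tag, if any, follows a ':' in the last path segment.
    # A ':' earlier in the reference belongs to a registry port.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the runtime falls back to "latest"
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


# Hypothetical image references.
images = [
    "nginx",
    "nginx:latest",
    "nginx:1.25.3",
    "registry.example.com:5000/app",
    "registry.example.com:5000/app:2.1.0",
]
unpinned = [image for image in images if not is_pinned(image)]
print(unpinned)
```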
Limit resource usage from the start. This involves setting up limit ranges and resource quotas. Use validation tools to ensure resources are properly configured.
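As a rough sketch of what this can look like, the Python below builds a LimitRange with default container requests and limits, and includes a small check for containers that declare no resources of their own. The names, namespace, and values are assumptions for illustration; requiring both CPU and memory in requests and limits is one possible policy, not the only reasonable one.

```python
import json


def limit_range(namespace):
    """Build a LimitRange that gives containers default requests and limits."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Applied when a container sets no request of its own.
                    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                    # Applied when a container sets no limit of its own.
                    "default": {"cpu": "500m", "memory": "512Mi"},
                }
            ]
        },
    }


def containers_missing_resources(pod_spec):
    """Return names of containers lacking CPU or memory requests or limits."""
    missing = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            values = resources.get(section, {})
            if "cpu" not in values or "memory" not in values:
                missing.append(container["name"])
                break
    return missing


# Hypothetical pod spec: one fully specified container, one with nothing set.
pod_spec = {
    "containers": [
        {
            "name": "api",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"cpu": "500m", "memory": "512Mi"},
            },
        },
        {"name": "etl"},
    ]
}
print(json.dumps(limit_range("batch"), indent=2))
print(containers_missing_resources(pod_spec))
```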
Validate your files. Use hadolint for Dockerfiles, kubeval for Kubernetes manifests, Polaris for Kubernetes compliance, and conftest for custom policies. Think of these tools as infrastructure tests.
Keep it simple, scalable, and cheap
Stay on a single cluster as long as you can. The RBAC and network policy mechanisms of Kubernetes do a good job of isolating workloads and users from each other. Managing multiple clusters adds tangible overhead.
Test the limits of your cluster scalability before you hit them. If you’re on AWS, use a networking plugin other than kubenet to get around the 100-node limit. Scale out before you need to.
Minimize costs as much as possible. Make use of spot and reserved instances (if on AWS) to get the same compute at a significant discount. This is an easy win if you have a highly available setup or if your cluster doesn’t need to scale.
Helm is a decent tool, but it has problems. Templating YAML files with Go templates can be painful. There are only loosely enforced conventions for Helm charts. Sometimes a chart will simply not work. Helm also has no native backup or declarative format, requiring community solutions like helmfile.
The issues I encountered are solved or solvable. Kubernetes today is a stable, mature system that is getting better with every release.
I didn’t find that Kubernetes is overly complex or that it needs a large team to manage. Kubernetes strikes a great balance between abstraction and flexibility - its alternatives don’t. As for needing a large “Kubernetes team,” I managed several production clusters mostly by myself for several years, only spending a few hours a month on maintenance. Some weeks the clusters required no attention at all, even with lost nodes and network outages. Kubernetes does its job well and fades into the background.
Although I’ve primarily used kops to manage Kubernetes clusters, these days I’d suggest starting with managed Kubernetes (such as GKE, DOKS, EKS, or AKS). Managed Kubernetes has its own issues and limitations, and it’s not always the turnkey solution its vendors promise, but it’s a good starting point. I’d still recommend setting up a cluster yourself, whether with k3s, microk8s, Rancher, OKD, kubeadm, or kops - it’s a valuable learning experience.
Kubernetes changed the way I think about distributed systems, web operations, and software in general. It’s more capable than it seems - with custom resources and the Service Catalog, it can be used to do much more than orchestrate containers. The Kubernetes project itself is well-managed - it has special interest groups (SIGs) covering broad feature areas (like the ACM), detailed and useful changelogs, certification programs, and a thoughtful approach to features.
All in all, I think Kubernetes is excellent software. I learned a lot putting it into production. It fully delivered on its original promise and, years later, it’s made significant improvements. I’m glad to see the industry at large adopting it and I’d use it again without any doubts.