Kubernetes lessons learned

February 2020

I was responsible for Kubernetes (among other things) at Better.com for over three years. My contributions included evaluating Kubernetes, migrating all services to it, provisioning and upgrading clusters, and resolving production issues (like scalability and downtime). It’s been a valuable experience and I learned a lot.

Background

In 2016 the Better.com engineering team finished moving its microservices to Docker. We needed a replacement for AWS Elastic Beanstalk, our solution at the time. I spent a few weeks researching options.

I was interested in solving several production problems that had come up over the previous year, such as zero-downtime deployments, easy scalability, high availability, and ephemeral (a.k.a. preview) environments. I also wanted a solution that was portable, open source, locally executable, and capable of running scheduled jobs.

Most solutions I found didn’t cover all of these requirements. Kubernetes was last on the list because it needed the most work, but it did everything necessary. Its 1.0 release had been out for almost a year, so it seemed production-ready.

Issues

Kubernetes cluster provisioning was a minefield in 2016. There were many projects claiming to do it well, but most failed to deliver. After dealing with a few, I went with the official cluster provisioning tool at the time, kops.

kops had a few issues. Shortly after the first production cluster was provisioned with kops, I discovered that the default Docker storage driver couldn’t be changed, leading to slow performance. Later I noticed nodes in the cluster were becoming unresponsive (only fixed by restarting Docker), a problem that was quietly resolved a few releases later. On one occasion, a kops update took down the production cluster by changing internal certificates and breaking intra-cluster communication, eventually disabling all incoming traffic.

I ran into issues with Kubernetes itself. Most issues fell into two categories: “cloud” (AWS-related) and “operational” (workload-related).

“Cloud” issues usually had to do with networking. For example, the default network plugin, kubenet, requires an AWS VPC route table entry for every node. Route tables have a 100-entry limit so clusters have a 100-node limit, which caused problems when we needed to scale out. Another example was missing support for AWS ALB, which had to be implemented in-house until Ticketmaster released its ALB Ingress Controller. One of my favorites (fixed after 3 years!) is Kubernetes deleting security groups attached to an ELB when the service that created the ELB was deleted.

“Operational” issues were more varied. For example, failing Kubernetes jobs would generate tons of terminated pods. The extra pods would cause masters to slow down, eventually causing timeouts in automation (and maxing out master CPU). Masters would also occasionally lose etcd quorum and get confused about the current leader, breaking automation.

One time a cron job for an ETL process was scheduled on a node running production services. The job used so much memory it eventually crashed the production services and took the site down. After this happened, all cron jobs were moved to their own nodes and strict resource requirements were enforced.
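As a rough sketch of what "own nodes and strict resource requirements" can look like, the Python below builds a CronJob manifest as a plain dict and prints it as JSON (which Kubernetes accepts alongside YAML). The `workload: batch` node label, the job name, the image, and the resource numbers are all hypothetical, and `batch/v1` assumes a recent cluster. It is not the configuration we ran, just an illustration of the idea.

```python
import json


def cron_job_manifest(name, image, schedule, node_label_value="batch"):
    """Build a CronJob manifest pinned to dedicated nodes with explicit resources."""
    return {
        "apiVersion": "batch/v1",  # assumption: older clusters used batch/v1beta1
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "concurrencyPolicy": "Forbid",    # never overlap two runs
            "failedJobsHistoryLimit": 1,      # keep failed jobs from piling up
            "successfulJobsHistoryLimit": 3,
            "jobTemplate": {"spec": {
                "backoffLimit": 2,            # cap retries of a failing job
                "template": {"spec": {
                    "restartPolicy": "Never",
                    # hypothetical label carried only by the batch node group
                    "nodeSelector": {"workload": node_label_value},
                    "containers": [{
                        "name": name,
                        "image": image,
                        "resources": {
                            "requests": {"cpu": "500m", "memory": "1Gi"},
                            "limits": {"cpu": "1", "memory": "2Gi"},
                        },
                    }],
                }},
            }},
        },
    }


if __name__ == "__main__":
    manifest = cron_job_manifest("nightly-etl", "registry.example.com/etl:1.4.2", "0 3 * * *")
    print(json.dumps(manifest, indent=2))
```

The memory limit matters most here: a job that exceeds it gets killed on its own, instead of starving whatever else shares the node.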

There were other issues encountered along the way, including DNS latency due to ndots:5, failing kops updates due to route table changes, kube-state-metrics breaking for various reasons, and the combination of imagePullPolicy: Always with an AWS ECR outage almost taking production down.
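The ndots:5 problem is easier to see with a small model of how a resolver expands names. The sketch below is a simplification of standard resolv.conf search behavior, not the actual resolver code: a name with fewer dots than ndots is tried against every search domain before being tried as-is. The search list shown is the usual one for a pod in the `default` namespace; real pods typically inherit extra domains from the node as well.

```python
def candidate_queries(name, search_domains, ndots=5):
    """Return the DNS names a resolver would try, in order, for `name`."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    searched = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + searched
    # Too few dots: walk the whole search list before trying the name as-is.
    return searched + [absolute]


if __name__ == "__main__":
    search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for query in candidate_queries("api.example.com", search):
        print(query)
```

With ndots:5, an external name like `api.example.com` (two dots) costs three failed lookups before the real one. Adding a trailing dot, or lowering ndots in the pod's DNS config, avoids the extra round trips.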

Lessons learned

Take the long view

A project like Kubernetes is rare due to its large and active community that includes huge enterprises. There’s a good chance that someone else has your problem (at scale) and is working on a solution. You can expect issues to be addressed with some urgency.

For this reason, it’s better to take the long view with Kubernetes. If you wait long enough, most problems will be solved. Several times during my experience with early Kubernetes versions (1.2 to 1.9), a problem that needed a manual workaround was eventually fixed upstream or solved with a “community standard” project.

Isolate and replicate

Keep your workloads separate and highly available:

Deploy to a minimum of 3 availability zones.
Make sure there are replicas in each availability zone.
Allocate enough headroom on nodes to support extra pods during deployments.
Use node selectors and different nodes for different tasks.
Use pod and node anti-affinity to keep pods evenly distributed among nodes and availability zones.
Use pod disruption budgets to ensure your deployments avoid downtime during cluster updates.
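Two of the items above, anti-affinity and pod disruption budgets, can be sketched as manifest fragments. The Python below builds them as dicts and prints JSON. The `app: web` label and the replica numbers are hypothetical, `policy/v1` assumes a recent cluster, and the zone key is the current well-known node label (older clusters used a `failure-domain.beta.kubernetes.io/zone` label instead).

```python
import json

ZONE_KEY = "topology.kubernetes.io/zone"  # well-known zone label on nodes


def zone_anti_affinity(labels):
    """Affinity block that prefers spreading matching pods across zones."""
    return {"podAntiAffinity": {"preferredDuringSchedulingIgnoredDuringExecution": [{
        "weight": 100,
        "podAffinityTerm": {
            "labelSelector": {"matchLabels": labels},
            "topologyKey": ZONE_KEY,
        },
    }]}}


def pod_disruption_budget(name, labels, min_available):
    """PodDisruptionBudget keeping at least `min_available` pods up during drains."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": labels},
        },
    }


if __name__ == "__main__":
    labels = {"app": "web"}  # hypothetical label on the deployment's pods
    print(json.dumps(zone_anti_affinity(labels), indent=2))
    print(json.dumps(pod_disruption_budget("web", labels, 2), indent=2))
```

The affinity block goes under the pod template's `spec.affinity`. The "preferred" form is used here so that pods still schedule when a zone is full; the "required" form is stricter but can leave pods pending.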

Pin, limit, and validate

Pin your versions. This goes for Kubernetes, Helm, Docker, and any packages, tools, and images used. The problems caused by version mismatches are entirely avoidable.
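For container images, pinning is something you can check mechanically. The helper below is a hypothetical, deliberately simple check and not part of any tool mentioned here: it treats an image as pinned if it has a digest or an explicit tag other than `latest`.

```python
def is_pinned(image):
    """Return True if the image reference has a digest or a non-latest tag."""
    if "@sha256:" in image:
        return True
    # Only look at the last path segment, so a registry port
    # (e.g. localhost:5000/app) is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means an implicit "latest"
    tag = last_segment.split(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    for ref in ["nginx", "nginx:latest", "nginx:1.25.3", "localhost:5000/app"]:
        print(ref, is_pinned(ref))
```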

Limit resource usage from the start. This involves setting up limit ranges and resource quotas. Use validation tools to ensure resources are properly configured.
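A minimal sketch of those two objects, built as Python dicts and printed as JSON. The namespace name and every number are placeholders to adjust, not recommendations. The LimitRange supplies defaults for containers that declare nothing; the ResourceQuota caps the namespace total.

```python
import json


def limit_range(namespace):
    """LimitRange giving containers default requests/limits and a ceiling."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "defaults", "namespace": namespace},
        "spec": {"limits": [{
            "type": "Container",
            "defaultRequest": {"cpu": "100m", "memory": "128Mi"},  # applied if unset
            "default": {"cpu": "500m", "memory": "512Mi"},         # default limits
            "max": {"cpu": "2", "memory": "4Gi"},                  # per-container cap
        }]},
    }


def resource_quota(namespace):
    """ResourceQuota capping the namespace's total requests, limits, and pods."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "compute", "namespace": namespace},
        "spec": {"hard": {
            "requests.cpu": "10",
            "requests.memory": "20Gi",
            "limits.cpu": "20",
            "limits.memory": "40Gi",
            "pods": "50",
        }},
    }


if __name__ == "__main__":
    for manifest in (limit_range("team-a"), resource_quota("team-a")):
        print(json.dumps(manifest, indent=2))
```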

Validate your files. Use hadolint for Dockerfiles, kubeval for Kubernetes manifests, Polaris for Kubernetes compliance, and conftest for custom policies. Think of these tools as infrastructure tests.
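To make the "infrastructure tests" idea concrete, here is a hand-rolled check in Python. It is a hypothetical stand-in for the kind of rule those tools enforce, not how any of them work internally: it flags containers in a Deployment-style manifest that are missing resource requests or limits.

```python
def containers_missing_resources(manifest):
    """Return names of containers lacking resource requests or limits."""
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        if not resources.get("requests") or not resources.get("limits"):
            missing.append(container.get("name", "<unnamed>"))
    return missing


if __name__ == "__main__":
    # Hypothetical Deployment with one well-configured container and one without.
    deployment = {
        "kind": "Deployment",
        "spec": {"template": {"spec": {"containers": [
            {"name": "web", "resources": {
                "requests": {"cpu": "100m"}, "limits": {"cpu": "500m"}}},
            {"name": "sidecar"},
        ]}}},
    }
    print(containers_missing_resources(deployment))
```

Run in CI, a check like this fails the build before a misconfigured manifest reaches the cluster.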

Keep it simple, scalable, and cheap

Stay on a single cluster as long as you can. The RBAC and network policy mechanisms of Kubernetes do a good job of isolating workloads and users from each other. Managing multiple clusters adds tangible overhead.

Test the limits of your cluster scalability before you hit them. If you’re on AWS, use a networking plugin other than kubenet to get around the 100-node limit. Scale out before you need to.

Minimize costs as much as possible. Make use of spot and reserved instances (if on AWS) to get the same compute at a significant discount. This is an easy win if you have a highly available setup or if your cluster doesn’t need to scale.

Reconsider Helm

Helm has problems. Templating YAML files with Go templates can be painful. There are only loosely enforced conventions for Helm charts. Sometimes a chart simply will not work. Helm also has no native backup or declarative format, requiring community solutions like helmfile.

Closing thoughts

Some people complain that Kubernetes is overly complex or that it needs a large team to manage. I haven’t found that to be the case. Kubernetes strikes a great balance between power, abstraction, and flexibility - its alternatives don’t. As for needing a large team, I managed several production clusters mostly by myself for several years, spending only a few hours a month on maintenance. Some weeks the clusters required no attention at all, even with lost nodes and network outages. Kubernetes does its job well and fades into the background.

Although I’ve primarily used kops to manage Kubernetes clusters, these days I’d suggest starting with managed Kubernetes (such as GKE, DOKS, EKS, or AKS). Managed Kubernetes has its own issues and limitations, and it’s not always simple, but it’s a good starting point. I’d still recommend setting up a cluster yourself, whether with k3s, microk8s, Rancher, OKD, kubeadm, or kops - it’s a valuable learning experience.

Kubernetes changed the way I think about distributed systems, web operations, and software in general. It’s more capable than it seems - with custom resources and the Service Catalog, it can be used to do much more than orchestrate containers. The Kubernetes project itself is well-managed - it has special interest groups (SIGs) covering broad feature areas (like the ACM), detailed and useful changelogs, certification programs, and a thoughtful approach to features.

All in all, I think Kubernetes is excellent. I learned a lot putting it into production. It fully delivered on its original promise and, years later, it’s made significant improvements. I’m glad to see the industry at large adopting it and I’d use it again without a doubt.