
Kubernetes as a Distro

I recently read this great piece by Timur Tukaev discussing how to approach the growing complexity of Kubernetes. You can read it here. Basically, as Kubernetes continues to expand into the everything platform, the functionality it contains takes longer and longer to learn.

Right now every business that adopts Kubernetes is basically rolling its own bespoke infrastructure. Timur's idea is to solve this problem by following the Linux distro model: groups of people with similar needs would work together to make an out-of-the-box Kubernetes setup geared towards their specific requirements. I wouldn't start from a blank cluster, but from a cluster already configured for my specific use case (ML, web applications, batch job processing).

I understand the idea, but I think the root cause of all of this is simply a lack of a meaningful package management system for Kubernetes. Helm has done the best it can, but practically speaking it's really far from where we would need to be in order to have something even approaching a Linux package manager.

More specifically, we need something between Helm, which is very easy to use but easy to mess up, and the Operator concept, which is highly bespoke and complex to write. A proper package manager would need:

  • Centralized State Management
    • Maintain a robust, centralized state store for all deployed resources, akin to a package database.
    • Provide consistency checks to detect and reconcile drifts between the desired and actual states.
  • Advanced Dependency Resolution
    • Implement dependency trees with conflict resolution.
    • Ensure dependencies are satisfied dynamically, including handling version constraints and providing alternatives where possible.
  • Granular Resource Lifecycle Control
    • Include better support for orchestrating changes across interdependent Kubernetes objects.
  • Secure Packaging Standards
    • Enforce package signing and verification mechanisms with a centralized trust system.
  • Native Support for Multi-Cluster Management
    • Allow packages to target multiple clusters and namespaces with standardized overrides.
    • Provide tools to synchronize package versions across clusters efficiently.
  • Rollback Mechanisms
    • Improve rollback functionality by snapshotting cluster states (beyond Helm’s existing rollback features) and ensuring consistent recovery even after partial failures.
  • Declarative and Immutable Design
    • Introduce a declarative approach where the desired state is managed directly (similar to GitOps) rather than relying on templates.
  • Integration with Kubernetes APIs
    • Directly leverage Kubernetes APIs like Custom Resource Definitions (CRDs) for managing installed packages and versions.
    • Provide better integration with Kubernetes-native tooling (e.g., kubectl, kustomize).
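To make the dependency-resolution bullet concrete, here's a minimal sketch (package names, versions, and constraints are all invented for illustration) of the apt-style install ordering with version checks that Helm doesn't give you today:

```python
# Hypothetical sketch: apt-style dependency resolution for k8s packages.
# The catalog and all names are invented for illustration.
from graphlib import TopologicalSorter  # Python 3.9+

# Each package declares dependencies with minimum-version constraints.
CATALOG = {
    "my-web-app":   {"version": (2, 1), "deps": {"ingress-ctrl": (1, 4), "cert-mgr": (1, 0)}},
    "ingress-ctrl": {"version": (1, 5), "deps": {"cert-mgr": (1, 2)}},
    "cert-mgr":     {"version": (1, 3), "deps": {}},
}

def resolve(root: str) -> list[str]:
    """Return an install order where every dependency is installed first,
    failing loudly if a version constraint can't be satisfied."""
    graph = {}
    stack = [root]
    while stack:
        name = stack.pop()
        if name in graph:
            continue
        pkg = CATALOG[name]
        for dep, minimum in pkg["deps"].items():
            if CATALOG[dep]["version"] < minimum:
                raise RuntimeError(f"{name} needs {dep} >= {minimum}")
        graph[name] = set(pkg["deps"])
        stack.extend(pkg["deps"])
    # TopologicalSorter takes node -> predecessors, so deps come out first
    return list(TopologicalSorter(graph).static_order())

print(resolve("my-web-app"))  # dependencies first, root last
```

Real package managers also handle conflict resolution and alternatives, but even this much (a dependency graph with version constraints) is more than the current tooling gives you.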

Again, a lot of this is Operators, but Operators are proving too complicated for normal people to write. I think we could reuse much of that work, keep the functionality, and create something that does what an Operator allows you to do with less maintenance complexity.

Still, I'd love to see the Kubernetes folks do anything in this area. The current state of the world is so bespoke and frankly broken that there is a ton of low-hanging fruit in this space.


Simple Kubernetes Secret Encryption with Python

I was recently working on a new side project in Python with Kubernetes and I needed to inject a bunch of secrets. The problem with secret management in Kubernetes is that you end up needing to set up a lot of it yourself, and it's time-consuming. When I'm working on a new idea, I typically don't want to waste hours setting up "the right way" to do something that isn't related to the core of the idea I'm trying out.

For the record, the right way to do secrets in Kubernetes is the following:

  • Turn on encryption at rest for ETCD
  • Carefully set up RBAC inside of Kubernetes to ensure the right users and service accounts can access the secrets
  • Give up on trying to do that and end up setting up Vault or paying your cloud provider for their Secret Management tool
  • There is a comprehensive list of secret managers and approaches outlined here: https://www.argonaut.dev/blog/secret-management-in-kubernetes

However, especially when trying ideas out, I wanted something more idiot-proof that didn't require any setup. So I wrote something simple with Python Fernet encryption that I thought might be useful to someone else out there.

You can see everything here: https://gitlab.com/matdevdug/example_kubernetes_python_encryption

Walkthrough

The script works in a pretty straightforward way. It reads the .env file you generate as outlined in the README, with secrets in the following format:

Make a .env file with the following parameters:

KEY=Make a fernet key: https://fernetkeygen.com/
CLUSTER_NAME=name_of_cluster_you_want_to_use
SECRET-TEST-1=9e68b558-9f6a-4f06-8233-f0af0a1e5b42
SECRET-TEST-2=a004ce4c-f22d-46a1-ad39-f9c2a0a31619

The KEY is the Fernet secret key and the CLUSTER_NAME tells the Kubernetes library which kubeconfig target to use. The tool then finds anything with the word SECRET in the .env file, encrypts it, and writes it to a .csv file.
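As an illustration of that flow, a sketch of the encrypt-and-record step might look like this (function and file names are mine; the actual script in the repo may differ in detail):

```python
# Sketch of the .env -> encrypted .csv step. Names are illustrative;
# the real script in the repo may differ.
import csv
import os
import tempfile

from cryptography.fernet import Fernet

def encrypt_env_secrets(env: dict[str, str], key: bytes) -> list[tuple[str, str]]:
    f = Fernet(key)
    rows = []
    for name, value in env.items():
        if "SECRET" not in name:
            continue  # KEY, CLUSTER_NAME etc. are config, not secrets
        # Kubernetes secret names must be lowercase, so convert up front
        rows.append((name.lower(), f.encrypt(value.encode()).decode()))
    return rows

env = {
    "KEY": "not-a-secret-entry",
    "CLUSTER_NAME": "dev",
    "SECRET-TEST-1": "9e68b558-9f6a-4f06-8233-f0af0a1e5b42",
}
key = Fernet.generate_key()
out_path = os.path.join(tempfile.gettempdir(), "secrets.csv")
with open(out_path, "w", newline="") as fh:
    csv.writer(fh).writerows(encrypt_env_secrets(env, key))
```

The CSV then doubles as your out-of-cluster record of what was injected.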

The .csv file is a record of each secret name and its encrypted value.

I really like to keep some record of the secrets injected into the cluster outside of the cluster, just so you can keep track of the encrypted values. The script then checks the namespace you selected to see if a secret with that name already exists and, if not, injects it for you.

Some quick notes about the script:

  • Secret names in Kubernetes need a specific format: lowercase, with words separated by - or . The script will convert the uppercase names in the .env file to lowercase. Just be aware it is doing that.
  • It does base64 encode the secret before it uploads it, so be aware that your application will need to decode it when it loads the secret.
  • Now the only secret you need to worry about is the Fernet key, which you can load into the application in a secure way. I find this much easier to mentally keep track of than trying to build an infinitely scalable secret solution. Plus it's cheaper, since many secret managers charge per secret.
  • The secrets are immutable which means they are lightweight on the k8s API and fast. Just be aware you'll need to delete the secrets if you need to replace them. I prefer this approach because I'd rather store more things as encrypted secrets and not worry about load.
  • It is easy to specify which namespace you intend to load the secrets into and I recommend using a different Fernet secret per application.
  • Mounting the secret works like it always does in k8s:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: your-image:latest
      volumeMounts:
        - name: secret-volume
          mountPath: /path/to/secret/data
  volumes:
    - name: secret-volume
      secret:
        secretName: my-secret

Inside of your application, you need to load the Fernet secret and decrypt the secrets. With Python that is pretty simple.

decrypt = fernet.decrypt(token)
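Fleshed out slightly, and assuming per the note above that what the app reads from the mounted file is the base64-encoded ciphertext, the decrypt side might look like:

```python
# Application-side decrypt, fleshed out. The path and the assumption that
# the mounted file holds base64(ciphertext) follow the notes above;
# adjust if your setup differs.
import base64

from cryptography.fernet import Fernet

def load_secret(path: str, key: bytes) -> str:
    """Read a mounted secret file, base64-decode it, and decrypt with Fernet."""
    with open(path, "rb") as fh:
        token = base64.b64decode(fh.read())
    return Fernet(key).decrypt(token).decode()

# e.g. value = load_secret("/path/to/secret/data/secret-test-1", key)
```

The Fernet key itself is the one value you still have to deliver securely, for example via an environment variable from your CI system.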

Q+A

  • Why not SOPS? This is easier and also handles the process of making the API call to your k8s cluster to make the secret.
  • Is Fernet secure? As far as I can tell it's secure enough. Let me know if I'm wrong.
  • Would you make a CLI for this? If people actually use this thing and get value out of it, I would be more than happy to make it a CLI. I'd probably rewrite it in Golang if I did that, so if people ask it'll take me a bit of time to do it.

Questions/comments/concerns: https://c.im/@matdevdug


The Time Linkerd Erased My Load Balancer


A cautionary tale of K8s CRDs and Linkerd.

A few months ago I had the genius idea of transitioning our production load balancer stack from Ingress to Gateway API in k8s. For those unaware, Ingress is the classic way of writing a configuration that tells a load balancer which routes should hit which services: effectively, how you expose services to the Internet. Gateway API is the re-imagined process for doing this, where the problem domain is scoped so teams get more granular control over their specific services' routes.

Ingress

apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: external-lb
spec:
  controller: example.com/ingress-controller
  parameters:
    apiGroup: k8s.example.com
    kind: IngressParameters
    name: external-lb

This is what setting up a load balancer with Ingress looks like.


After conversations with various folks at GCP it became clear to me that while Ingress wasn't deprecated or slated to be removed, Gateway API is where all the new development and features are moving to. I decided that we were a good candidate for the migration since we are a microservice based backend with lower and higher priority hostnames, meaning we could safely test the feature without cutting over all of our traffic at the same time.

I had this idea that we would turn on both Ingress and Gateway API and then cut between the two different IP addresses at the Cloudflare level. From my low-traffic testing this approach seemed to work ok, with me being able to switch between the two and then letting Gateway API bake for a week or two to shake out any problems. Then I decided to move to prod. Due to my lack of issues in the lower environments I decided that I wouldn't set up Cloudflare load balancing between the two and manage the cut-over in Terraform. This turned out to be a giant mistake.

The long and short of it is that the combination of Gateway API and Linkerd in GKE fell down under a high volume of requests. At low request volume there were no problems, but once we got to around 2k requests a second the linkerd-proxy sidecar container's memory usage started to grow unbounded. When I attempted to cut back from Gateway API to Ingress, I hit a GKE bug I hadn't seen in the lower environments.

"Translation failed: invalid ingress spec: service "my_namespace/my_service" is type "ClusterIP", expected "NodePort" or "LoadBalancer";

What we were seeing was a mismatch between the annotations automatically added by GKE for Ingress versus those added for Gateway.

Ingress adds these annotations:  
cloud.google.com/neg: '{"ingress":true}'
cloud.google.com/neg-status: '{"network_endpoint_groups":{"80":"k8s1pokfef..."},"zones":["us-central1-a","us-central1-b","us-central1-f"]}'


Gateway adds these annotations:
cloud.google.com/neg: '{"exposed_ports":{"80":{}}}'
cloud.google.com/neg-status: '{"network_endpoint_groups":{"80":"k8s1-oijfoijsdoifj-..."},"zones":["us-central1-a","us-central1-b","us-central1-f"]}'

Gateway doesn't understand the Ingress annotations and vice versa. This caused a massive problem and blew up in my face. I thought I had tested this exact failure case, but clearly prod surfaced different behavior than the lower environments. Effectively no traffic was getting to pods while I tried to figure out what had broken.
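If you find yourself mid-cutover and unsure which mode a Service is in, the distinguishing feature is the shape of the cloud.google.com/neg annotation. A hypothetical helper (names mine) that classifies it based on the shapes shown above:

```python
# Hypothetical helper: classify a Service's NEG annotation as
# Ingress-style or Gateway-style, based on the annotation shapes above.
import json

def neg_style(annotations: dict[str, str]) -> str:
    raw = annotations.get("cloud.google.com/neg")
    if raw is None:
        return "none"
    neg = json.loads(raw)
    if neg.get("ingress"):
        return "ingress"       # e.g. '{"ingress":true}'
    if "exposed_ports" in neg:
        return "gateway"       # e.g. '{"exposed_ports":{"80":{}}}'
    return "unknown"

print(neg_style({"cloud.google.com/neg": '{"ingress":true}'}))             # ingress
print(neg_style({"cloud.google.com/neg": '{"exposed_ports":{"80":{}}}'}))  # gateway
```

Feed it the annotations dict from kubectl get svc -o json and you know which controller last touched the Service.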

I ended up having to manually modify the annotations to get things working: a pretty embarrassing blow-up after what I had thought was careful testing (but clearly wasn't).

Fast Forward Two Months

I had learned from my mistake regarding Gateway API and Ingress and was functioning totally fine on Gateway API when I decided to attempt to solve the Linkerd issue. The issue was that high-volume services were seeing their proxies consume unlimited memory, steadily growing over time, but only while on Gateway API. I was installing Linkerd with their Helm charts, which have two components: the Linkerd CRD chart (https://artifacthub.io/packages/helm/linkerd2/linkerd-crds) and the Linkerd control plane (https://artifacthub.io/packages/helm/linkerd2/linkerd-control-plane).

Since debug logs and upgrades hadn't gotten me any closer to a solution as to why the proxies were consuming unlimited memory until they eventually were OOMkilled, I decided to start fresh. I removed the Linkerd injection from all deployments and removed the helm charts. Since this was a non-prod environment, I figured at least this way I could start fresh with debug logs and maybe come up with some justification for what was happening.

Except the second I uninstalled the charts, my graphs started to freak out. I couldn't understand what was happening. How did removing Linkerd break something? Did I have some policy set to require Linkerd? Why were my traffic levels quickly approaching zero in the non-prod environment?

Then a coworker said "oh it looks like all the routes are gone from the load balancer". I honestly hadn't even thought to look there, assuming the problem was some misaligned Linkerd policy where our deployments required encryption to communicate, or some mistake on my part in removing the Helm charts. But they were right: the load balancers didn't have any routes. kubectl confirmed it, no HTTPRoutes remained.

So of course I was left wondering "what just happened".

Gateway API

So a quick crash course in "what is gateway API". At a high level, as discussed before, it is a new way of defining Ingress which cleans up the annotation mess and allows for a clean separation of responsibility in an org.


So GCP defines the GatewayClass, I make the Gateway, and developers provide the HTTPRoutes. This means developers can safely change the routes to their own services without the risk of blowing up the load balancer. It also provides a ton of great customization for how traffic is routed to a specific service.


So first you make a Gateway like so in Helm or whatever:

---
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: {{ .Values.gateway_name }}
  namespace: {{ .Values.gateway_namespace }}
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        kinds:
        - kind: HTTPRoute
        namespaces:
          from: Same
    - name: https
      protocol: HTTPS
      port: 443
      allowedRoutes:
        kinds:
          - kind: HTTPRoute
        namespaces:
          from: All
      tls:
        mode: Terminate
        options:
          networking.gke.io/pre-shared-certs: "{{ .Values.pre_shared_cert_name }},{{ .Values.internal_cert_name }}"

Then you provide a separate HTTPRoute YAML for the redirect of http to https:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: redirect
  namespace: {{ .Values.gateway_namespace }}
spec:
  parentRefs:
  - namespace: {{ .Values.gateway_namespace }}
    name: {{ .Values.gateway_name }}
    sectionName: http
  rules:
  - filters:
    - type: RequestRedirect
      requestRedirect:
        scheme: https

Finally you can set policies.

---
apiVersion: networking.gke.io/v1
kind: GCPGatewayPolicy
metadata:
  name: tls-ssl-policy
  namespace: {{ .Values.gateway_namespace }}
spec:
  default:
    sslPolicy: tls-ssl-policy
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: {{ .Values.gateway_name }}

Then your developers can configure traffic to their services like so:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: store
spec:
  parentRefs:
  - kind: Gateway
    name: internal-http
  hostnames:
  - "store.example.com"
  rules:
  - backendRefs:
    - name: store-v1
      port: 8080
  - matches:
    - headers:
      - name: env
        value: canary
    backendRefs:
    - name: store-v2
      port: 8080
  - matches:
    - path:
        value: /de
    backendRefs:
    - name: store-german
      port: 8080

Seems Straightforward

Right? There isn't that much to the thing. I attempted to re-add the HTTPRoutes using Helm and Terraform, which of course didn't detect a diff even though the routes were gone (Helm never seems to do what I want it to in a crisis), and ended up bumping the chart version to finally force it to do the right thing. Then I started looking around: what the hell had I done to break this?

When I removed the Linkerd CRDs, it somehow took out my HTTPRoutes. So I went to the Helm chart, trying to work backwards. Immediately I saw this:

{{- if .Values.enableHttpRoutes }}
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    api-approved.kubernetes.io: https://github.com/kubernetes-sigs/gateway-api/pull/1923
    gateway.networking.k8s.io/bundle-version: v0.7.1-dev
    gateway.networking.k8s.io/channel: experimental
    {{ include "partials.annotations.created-by" . }}
  labels:
    helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    linkerd.io/control-plane-ns: {{.Release.Namespace}}
  creationTimestamp: null
  name: httproutes.gateway.networking.k8s.io
spec:
  group: gateway.networking.k8s.io
  names:
    categories:
    - gateway-api
    kind: HTTPRoute
    listKind: HTTPRouteList
    plural: httproutes
    singular: httproute
  scope: Namespaced
  versions:

Sure enough, the Linkerd CRD Helm chart has enableHttpRoutes set to true by default.

I also found this issue: https://github.com/linkerd/linkerd2/issues/12232

So yeah, Linkerd is, for some reason, pulling this CRD from a pull request from April 6th of last year that is marked as "do not merge". https://github.com/kubernetes-sigs/gateway-api/pull/1923

Linkerd is aware of the possible problem but presumes you'll catch the configuration option on the Helm chart: https://github.com/linkerd/linkerd2/issues/11586

To be clear I'm not "coming after Linkerd" here. I just thought the whole thing was extremely weird and wanted to make sure, given the amount of usage Linkerd gets out there, that other people were made aware of it before running the car into the wall at 100 MPH.

What are CRDs?

Kubernetes Custom Resource Definitions (CRDs) essentially extend the Kubernetes API to manage custom resources specific to your application or domain.

  • CRD Object: You create a YAML manifest file defining the Custom Resource Definition (CRD). This file specifies the schema, validation rules, and names of your custom resource.
  • API Endpoint: When you deploy the CRD, the Kubernetes API server creates a new RESTful API endpoint for your custom resource.

Effectively, when I enabled Gateway API in GKE with the following command, I hadn't considered that I could end up in a CRD conflict state with Linkerd:

  gcloud container clusters create CLUSTER_NAME \
    --gateway-api=standard \
    --cluster-version=VERSION \
    --location=CLUSTER_LOCATION

What I suspect happened is that, since I had Linkerd installed before I enabled gateway-api on GKE, GCP's attempt to install the CRD failed silently. Since I didn't know there was a CRD conflict, I didn't realize that the CRD the HTTPRoutes relied on was actually the Linkerd-maintained one, not the GCP one. Presumably, had I done this in the other order, it would have thrown an error when the Helm chart attempted to install a CRD that was already present.
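Before uninstalling anything that ships CRDs, it's worth checking who actually owns a CRD your traffic depends on. A rough sketch of that check, keyed off the labels visible in the chart template above (feed it the JSON from kubectl get crd httproutes.gateway.networking.k8s.io -o json):

```python
# Rough sketch: decide whether the HTTPRoute CRD on a cluster was
# installed by Linkerd's chart, using the labels seen in the chart
# template above. The input dict mirrors kubectl's -o json output.
def crd_owned_by_linkerd(crd: dict) -> bool:
    labels = crd.get("metadata", {}).get("labels", {})
    return (
        "linkerd.io/control-plane-ns" in labels
        or labels.get("helm.sh/chart", "").startswith("linkerd")
    )

linkerd_crd = {"metadata": {"labels": {
    "linkerd.io/control-plane-ns": "linkerd",
    "helm.sh/chart": "linkerd-crds-1.8.0",
}}}
print(crd_owned_by_linkerd(linkerd_crd))  # True
```

Had I run something like this before uninstalling, the surprise ownership of httproutes.gateway.networking.k8s.io would have been obvious.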

To be clear before you call me an idiot, I am painfully aware that the deletion of CRDs is a dangerous operation. I understand I should have carefully checked and I am admitting I didn't in large part because it just never occurred to me that something like Linkerd would do this. Think of my failure to check as a warning to you, not an indictment against Kubernetes or whatever.

Conclusion

If you are using Linkerd with Helm and intend to use Gateway API, this is your warning right now to go in there and flip enableHttpRoutes to false in the linkerd-crds chart. Learn from my mistake.

Questions/comments/concerns: https://c.im/@matdevdug


K8s Service Meshes: The Bill Comes Due


When you start using Kubernetes one of the first suggestions you'll get is to install a service mesh. This, of course, on top of the 900 other things you need to install. For those unaware, everything in k8s is open to everything else by default, and traffic isn't encrypted between services. Since encrypting traffic between services and controlling which services can talk to which requires something like JWTs and client certificates, teams aren't typically eager to take on this work even though it's increasingly a requirement of any stack.

Infrastructure teams can usually implement a feature faster than every app team in a company, so this tends to get solved by them. Service meshes exploded in popularity as it became clear they were an easy way to implement enforced encryption and granular service-to-service access control. You also get better monitoring and some cool features like circuit breaking and request retries for "free". As the scale of deployments grew with k8s and started to bridge multiple cloud providers, or a cloud provider and a datacenter, this went from nice-to-have to an operational requirement.

What is a service mesh?

Service-to-service communication before and after service mesh implementation

Service meshes let you do a few things easily:

  • Easy metrics on all service to service requests since it has a proxy that knows success/failure/RTT/number of requests
  • Knowledge that all requests are encrypted with automated rotation
  • Option to ensure only encrypted requests are accepted so you can have k8s in the same VPC as other things without needing to do firewall rules
  • Easy to set up network isolation at a route/service/namespace level (great for k8s hosting platform or customer isolation)
  • Automatic retries, global timeout limits, circuit breaking and all the features of a more robustly designed application without the work
  • Reduces change failure rate. With a proxy sitting there holding and retrying requests, small blips no longer register to the client. They shouldn't anyway if you set up k8s correctly, but it's another level of redundancy.
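To make the retry point concrete, here's roughly the logic a sidecar proxy runs on your behalf, sketched in Python with made-up numbers (no mesh's actual defaults); the point is that with a mesh, no application team writes this code:

```python
# Crude sketch of a mesh sidecar's retry-with-backoff behavior.
# Attempt counts and backoff values are illustrative, not any mesh's defaults.
import time

def call_with_retries(fn, max_attempts: int = 3, backoff: float = 0.05):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retry budget exhausted, surface the failure
            time.sleep(backoff * attempt)  # small blip? the client never sees it

attempts = 0
def flaky():
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("transient blip")
    return "ok"

print(call_with_retries(flaky))  # "ok" after two transient failures
```

Multiply this by circuit breaking, timeouts, and mTLS, and you can see why handing it all to a proxy is attractive.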

This adds up to a lot of value for places that adopt them with a minimum amount of work since they're sidecars injected into existing apps. For the most part they "just work" and don't require a lot of knowledge to keep working.

However, it's 2024 and stuff that used to be free isn't anymore. The free money train from VCs has ended and the bill has come due. Increasingly, this requirement for deploying production applications to k8s is going to come with a tax that you need to account for when budgeting for your k8s migration and determining whether it is worth it. Since December 2023 the service mesh landscape has changed substantially and it's a good time for a quick overview of what is going on.

NOTE: Before people jump down my throat, I'm not saying these teams shouldn't get paid. If your tool provides real benefits to businesses it isn't unreasonable to ask them to materially contribute to it. I just want people to be up to speed on what the state of the service mesh industry is and be able to plan accordingly.

Linkerd

My personal favorite of the service meshes, Linkerd is the most idiot-proof of the designs. It consists of a control plane and a data plane, with a monitoring option included.

Recently Linkerd has announced a change to their release process, which I think is a novel approach to the problem of "getting paid for your work". For those unaware, Linkerd has always maintained a "stable" and an "edge" version of their software, along with an enterprise product. As of Linkerd 2.15.0, they will no longer publish stable releases. Instead the concept of a stable release will be bundled into their Buoyant Enterprise for Linkerd option. You can read the blog post here.

It's important to note that, unlike some products, Linkerd doesn't just take a specific release of Edge and call it Enterprise. There are features that make it to Edge that never reach Enterprise, and Stable is not a static target either (there are patch releases to the Stable branch as well), so these are effectively three different products. You can't do the workaround of locking your org to the specific Edge releases that match up with Stable/Enterprise.

Pricing

Update: Linkerd changed their pricing to per-pod. You can see it here: https://buoyant.io/pricing. I'll leave the below for legacy purposes but the new pricing addresses my concerns.

Buoyant selected the surprisingly high price of $2,000 per cluster per month. The reason this surprises me is that the k8s model is increasingly moving towards more clusters with less in each one, versus the older monolithic cluster where the entire company lives in one. This pricing works against that goal and removes some of the value of the service mesh concept.

If the idea of the Linkerd team is that orgs are going to stick with fewer, larger clusters, then it makes less sense to me to go with Linkerd. With a ton of clusters, I don't want to think about IP address ranges or any of the east to west networking designs, but if I just have like 2-3 clusters that are entirely independent of each other, then I can get a similar experience to Linkerd with relatively basic firewall rules, k8s network policies and some minor changes to an app to encrypt connections. There's still value to Linkerd, but the per-cluster pricing when I was clearly fine hosting the entire thing myself before is strange.

$2000 a month for a site license makes sense to me to get access to enterprise. $2000 a month per cluster when Buoyant isn't providing me with dashboards or metrics on their side seems like they picked an arbitrary number out of thin air. There's zero additional cost for them per cluster added, it's just profit. It feels weird and bad. If I'm hosting and deploying everything and the only support you are providing me is letting me post to the forum, where do you come up with the calculation that I owe you per cluster regardless of size?
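Some quick back-of-the-envelope math on that original $2,000 per cluster per month figure shows why it fought the many-small-clusters model:

```python
# Back-of-the-envelope: per-cluster licensing cost vs. cluster count,
# at the original $2,000 per cluster per month figure.
PER_CLUSTER_MONTHLY = 2_000

def annual_cost(clusters: int) -> int:
    return clusters * PER_CLUSTER_MONTHLY * 12

for clusters in (1, 3, 10, 25):
    print(f"{clusters:>2} clusters: ${annual_cost(clusters):,}/year")

# A 25-cluster shop pays 25x what a single-cluster shop does,
# for the same self-hosted software and the same support relationship.
```

The since-announced per-pod pricing scales with actual usage instead, which is why it addresses the concern.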

Now you can continue to use Linkerd, but you need to switch to Edge. In my experience testing it, Edge is fine. It's mostly production ready, but there are sometimes features which you'll start using and then they'll disappear. I don't think it'll matter for most orgs most of the time, since you aren't likely constantly rolling out service mesh upgrades. You'll pick a version of Edge, test it, deploy it and then wait until you are forced to upgrade or you see a feature you like.

You also can't just buy a license, you need to schedule a call with them to buy a license with discounts available before March 21st, 2024. I don't know about you but the idea of needing to both buy a license and have a call to buy a license is equally disheartening. Maybe just let me buy it with the corporate card or work with the cloud providers to let me pay you through them.

Cilium

Cilium is the new cool kid on the block when it comes to service meshes. It eliminates the sidecar container, removing a major source of failure in the service mesh design. You still get encryption, load balancing, etc but since it uses eBPF and is injected right into the kernel you remove that entire element of the stack.

SDN = software defined networking

You also get a LOT with Cilium. It is its own CNI, which in my testing has amazing performance. It works with all the major cloud providers and gives you incredibly precise network security and observability. You can also replace kube-proxy with Cilium. Here is how it works in a normal k8s cluster with kube-proxy:

Effectively Kube-proxy works with the OS filtering layer (typically iptables) to allow network communication to your pods. This is a bit simplified but you get the idea.

With the BPF Kube-proxy replacement we remove a lot of pieces in that design.

This is only a tiny fraction of what Cilium does. It has developed a reputation for excellence: if you fully adopt the stack, you can replace almost all the cloud-provider-specific pieces of k8s with a generic stack that works across providers at lower cost and higher performance.

the UI for seeing service relationships in Cilium is world-class

A Wild Cisco Appears

Cisco acquired Isovalent in December of 2023, apparently to get involved in the eBPF space and likely to augment their acquisition of Splunk. Cilium provides the metrics and traces as well as generating great flow logs, and Splunk ingests them for you. If you are on Linkerd and considering moving over to Cilium to avoid paying, you should be aware that with Cisco having purchased them, the bill is inevitable.

You will eventually be expected to pay and my guess based on years of ordering Cisco licenses and hardware is you'll be expected to pay a lot. So factor that in when considering Cilium or migrating to Cilium. I'll go out on a limb here and predict that Cilium is priced as a premium multi-cloud product with a requirement of the enterprise license for many of the features before the end of 2024. I will also predict that Linkerd ends up as the cheapest option on the table by the end of 2024 for most orgs.

Take how expensive Splunk is and extrapolate that into a service mesh license and I suspect you'll be in the ballpark.

Istio

The overall architecture of an Istio-based application.

Istio, my least favorite service mesh. Conceptually Istio and Linkerd share many of the same ideas. Both platforms use a two-part architecture now: a control plane and a data plane. The control plane manages the data plane by issuing configuration updates to the proxies in the data plane. The control plane also provides security features such as mTLS encryption and authentication.

Istio uses Envoy proxies rather than rolling its own like Linkerd does, and tends to cover more possible scenarios than Linkerd.

Istio's primary differences are that it supports VMs, runs its own Ingress Controller, and is ten times as complex to set up as any other option. Istio has become infamous among k8s infrastructure staff as the cause of more problems than any other part of the stack. Many of these problems can be solved with minor configuration changes (there is absolutely nothing structurally wrong with Istio), but since a service mesh failure can mean "the entire cluster dies", it's tricky.

The reality is Istio is free and open source, but you pay in other ways. Istio has so many components and custom resources that can interact with each other in surprising and terrifying ways that you need someone in your team who is an Istio expert. Otherwise any attempt to create a self-service ecosystem will result in lots of downtime and tears. You are going to spend a lot of time in Istio tracking down performance problems, weird network connectivity issues or just strange reverse proxy behavior.

Some of the earlier performance complaints of Envoy as the sidecar have been addressed, but I still hear of problems when organizations scale up to a certain number of requests per second (less than I used to). The cost for Istio, to me, exceeds the value of a service mesh most of the time. Especially since Linkerd has caught up with most of the traffic management stuff like circuit breaking.

Consul Connect

The next service mesh we'll talk about is Consul Connect. If Istio is highly complicated to set up and Linkerd is the easiest but has the fewest knobs to turn, Consul sits right in the middle. It has a great observability story and performance right there with Linkerd and superior to Istio.

Consul is also very clearly designed to be deployed by large companies, with features around stability and cross-datacenter design that only apply to the biggest orgs. However people who have used it seem to really like it, based on the chats I've had. The ability to use Terraform with Consul with its Consul-Terraform-Sync functionality to get information about services and interact with those services at a networking level is massive, especially for teams managing thousands of nodes or where pods need strict enforced isolation (such as SaaS products where customer app servers can't interact).

Pricing

Consul starts at $0.027 an hour, but in practice your price is gonna be higher than that, since it goes up based on how many instances and clusters you are running. It's also not available on GCP, just AWS and Azure. You also don't get support at that tier, seemingly needing to upgrade your package to ask questions.

I'm pretty down on HashiCorp after the Terraform license change, but people have reported a lot of success with Consul, so if you are considering a move, this one makes a lot of sense.

Cloud Provider Service Meshes

GCP has Anthos (based on Istio) as part of their GKE Enterprise offering, which is $0.10/cluster/hour. It comes with a bunch of other features, but in my testing it was a much easier way to run Istio. Basically Istio without the annoying parts. AWS App Mesh still uses Envoy but has a pretty different architecture. However, it comes at no extra cost, which is nice.

App Mesh

AWS App Mesh is also great for orgs that aren't all-in on k8s. You can bridge systems like ECS and traditional EC2 with it, meaning it's a super flexible tool for hybrid groups or groups where the k8s-only approach isn't a great fit.

Azure uses Open Service Mesh, which is now a deprecated product. Despite that, it's still their recommended solution according to a Google search. Link

Once again the crack team at Azure blows me away with their attention to detail. Azure has a hosted Istio add-on in preview now and presumably they'll end up doing something similar to GKE with Anthos. You can see that here.

What do you need to do

So the era of the free Service Mesh is coming to a close. AWS has decided to use it as an incentive to stay on their platform, Linkerd is charging you, Cilium will charge you At Some Point and Consul is as far from free now as it gets. GKE and Azure seem to be betting on Istio where they move the complexity into their stack, which makes sense. This is a reflection of how valuable these meshes are for observability and resilience as organizations transition to microservices and more specifically split stacks, where you retain your ability to negotiate with your cloud provider by running things in multiple places.

Infrastructure teams will need to carefully pick what horse they want to back moving forward. It's a careful balance between cloud lock-in vs flexibility at the cost of budget or complexity. There aren't any clear-cut winners in the pack, which wasn't true six months ago when the recommendation was just Linkerd or Cilium. If you are locked into either Linkerd or Cilium, the time to start discussing a strategy moving forward is probably today. Either get ready for the bill, commit to running Edge with more internal testing, or brace yourself for a potentially much higher bill in the future.


GKE (Google Kubernetes Engine) Review


What if Kubernetes was idiot-proof?

Love/Hate Relationship

AWS and I have spent a frightening amount of time together. In that time I have come to love that weird web UI with bizarre application naming. It's like asking an alien not familiar with humans to name things. Why is Athena named Athena? Nothing else gets a deity name. CloudSearch, CloudFormation, CloudFront, Cloud9, CloudTrail, CloudWatch, CloudHSM, CloudShell are just lazy, we understand you are the cloud. Also Amazon if you are going to overuse a word that I'm going to search, use the second word so the right result comes up faster. All that said, I've come to find comfort in its primary color icons and "mobile phones don't exist" web UI.

Outside of AWS I've also done a fair amount of work with Azure, mostly in Kubernetes or k8s-adjacent spaces. All said, I've now worked with Kubernetes on bare metal in a datacenter, in a datacenter with VMs, on Raspberry Pis in a cluster with k3s, in AWS with EKS, in Azure with AKS, with DigitalOcean Kubernetes, and finally with GKE in GCP. Me and the Kubernetes help documentation site are old friends at this point, a sea of purple links. I say all this to suggest that I have made virtually every mistake one can with this particular platform.

When being told I was going to be working in GCP (Google Cloud Platform) I was not enthused. I try to stay away from Google products in my personal life. I switched off Gmail for Fastmail, Search for DuckDuckGo, Android for iOS and Chrome for Firefox. It has nothing to do with privacy, I actually feel like I understand how Google uses my personal data pretty well and don't object to it on an ideological level. I'm fine with making an informed decision about using my personal data if the return to me in functionality is high enough.

I mostly move off Google services in my personal life because I don't understand how Google makes decisions. I'm not talking about killing Reader or any of the Google graveyard things. Companies try things and often they don't work out, that's life. It's that I don't even know how fundamental technology is perceived. Is Golang, which relies extensively on Google employees, doing well? Are they happy with it, or is it in danger? Is Flutter close to death or thriving? Do they like Gmail or has it lost favor with whatever executives are in charge of it this month? My inability to get a sense of whether something is doing well or poorly inside of Google makes me nervous about adopting their stack into my life.

I say all this to explain that, even though I was not excited to use GCP and learn a new platform, and even though there are parts of GCP that I find deeply frustrating compared to its peers...there is a gem here. If you are serious about using Kubernetes, GKE is the best product I've seen on the market. It isn't even close. GKE is so good that if you are all-in on Kubernetes, it's worth considering moving from AWS or Azure.

I know, bold statement.

TL;DR

  • GKE is the best managed k8s product I've ever tried. It aggressively helps you do things correctly and is easy to set up and run.
  • GKE Autopilot is all of that but they handle all the node/upgrade/security etc. It's like Heroku-levels of easy to get something deployed. If you are a small company who doesn't want to hire or assign someone to manage infrastructure, you could grow forever on GKE Autopilot and still be able to easily migrate to another provider or the datacenter later on.
  • The rest of GCP is a bit of a mixed bag. Do your homework.

Disclaimer

I am not and have never been a Google employee/contractor/someone they know exists. I once bombed an interview when I was 23 for a job at Google. This interview stands out to me because, despite working with it every day for a year, my brain just forgot how RAID parity worked at a data transmission level. I got off the call and instantly all memory of how it worked returned to me. Needless to say, nobody at Google cares that I have written this and it is just my opinions.

Corrections are always appreciated. Let me know at: [email protected]

Traditional K8s Setup

One common complaint about k8s is that you have to set up everything. Even "hosted" platforms often just provide the control plane, meaning almost everything else is some variation of your problem. Here's the typical collection of what you need to make decisions about, in no particular order:

  • Secrets encryption: yes/no, and how
  • Version of Kubernetes to start on
  • What autoscaling technology are you going to use
  • Managed/unmanaged nodes
  • CSI drivers, do you need them, which ones
  • Which CNI, what does it mean to select a CNI, how do they work behind the scenes. This one in particular throws new cluster users because it seems like a nothing decision but it actually has profound impact in how the cluster operates
  • Can you provision load balancers from inside of the cluster?
  • CoreDNS, do you want it to cache DNS requests?
  • Vertical pod autoscaling vs horizontal pod autoscaling
  • Monitoring, what collects the stats, what default data do you get, where does it get stored (node-exporter setup to prometheus?)
  • Are you gonna use an OIDC provider? You probably want one, how do you set it up?
  • Helm, yes or no?
  • How do service accounts work?
  • How do you link IAM with the cluster?
  • How do you audit the cluster for compliance purposes?
  • Is the cluster deployed in the correct resilient way to guard against AZ outages?
  • Service mesh, do you have one, how do you install it, how do you manage it?
  • What OS is going to run on your nodes?
  • How do you test upgrades? What checks to make sure you aren't relying on a removed API? When is the right time to upgrade?
  • What is monitoring overall security posture? Do you have known issues with the cluster? What is telling you that?
  • Backups! Do you want them? What controls them? Can you test them?
  • Cost control. What tells you if you have a massively overprovisioned node group?

This isn't anywhere near all the questions you need to answer, but this is typically where you need to start. One frustration with a lot of k8s services I've tried in the past is they have multiple solutions to every problem and it's unclear which is the recommended path. I don't want to commit to the wrong CNI and then find out later that nobody has used that one in six months and I'm an idiot. (I'm often an idiot but I prefer to be caught for less dumb reasons).

Are these failings of kubernetes?

I don't think so. K8s is everything to every org. You can't make a universal tool that attempts to cover every edge case that doesn't allow for a lot of customization. With customization comes some degree of risk that you'll make the wrong choice. It's the Mac vs Linux laptop debate in an infrastructure sphere. You can get exactly what you need with the Linux box but you need to understand if all the hardware is supported and what tradeoffs each decision involves. With a Mac I'm getting whatever Apple thinks is the correct combination of all of those pieces, for better or worse.

If you can get away with Cloud Run or ECS, don't let me stop you. Pick the level of customization you need for the job, not whatever is hot right now.

Enter GKE

Alright, so when I was hired I was tasked with replacing an aging GKE cluster running Istio that was coming to end of life. After running some checks, we found we weren't using any of the features of Istio, so we decided to go with Linkerd since it's a much easier service mesh to maintain. I sat down and started my process for upgrading an old cluster.

  • Check the node OS for upgrades, check the node k8s version
  • Confirm API usage to see if we are using outdated APIs
  • How do I install and manage the ancillary services and what are they? What installs CoreDNS, service mesh, redis, etc.
  • Can I stand up a clean cluster from what I have or was critical stuff added by hand? It never should be but it often is.
  • Map out the application dependencies and ensure they're put into place in the right order.
  • What controls DNS/load balancing and how can I cut between cluster 1 and cluster 2

It's not a ton of work, but it's also not zero work. It's also a good introduction to how applications work and what dependencies they have. Now my experience with recreating old clusters in k8s has been, to be blunt, a fucking disaster in the past. It typically involves 1% trickle traffic, everything returning 500s, looking at logs, figuring out what is missing, adding it, turning 1% back on, errors everywhere, look at APM, oh that app's healthcheck is wrong, etc.

The process with GKE was so easy I was actually sweating a little bit when I cut over traffic, because I was sure this wasn't going to work. It took longer to map out the application dependencies and figure out the Istio -> Linkerd part than it did to actually recreate the cluster. That's a first and a lot of it has to do with how GKE holds your hand through every step.
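For what it's worth, the Linkerd half of that migration is mostly an annotation: mark a namespace for injection and the proxy gets added to pods as they're created. A minimal sketch, with an illustrative namespace name:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app   # illustrative; any namespace you want meshed
  annotations:
    linkerd.io/inject: enabled   # sidecar proxy injected into new pods here
```

Existing pods need a rolling restart to pick up the proxy, which fits neatly into a cluster cutover anyway.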

How does GKE make your life easier?

Let's walk through my checklist and how GKE solves pretty much all of them.

1. Node OS and k8s version on the node.

GCP offers a wide variety of OSes that you can run but recommends one I have never heard of before.

Container-Optimized OS from Google is an operating system image for your Compute Engine VMs that is optimized for running containers. Container-Optimized OS is maintained by Google and based on the open source Chromium OS project. With Container-Optimized OS, you can bring up your containers on Google Cloud Platform quickly, efficiently, and securely.

I'll be honest, my first thought when I saw "server OS based on Chromium" was "someone at Google really needed to get an OKR win". However, after using it for a year, I've really come to like it. Now, it's not a solution for everyone, but if you can operate within the limits it's a really nice solution. Here are the limits.

  • No package manager. They have something called the CoreOS Toolbox which I've used a few times to debug problems so you can still troubleshoot. Link
  • No non-containerized applications
  • No installing third-party kernel modules or drivers
  • It is not supported outside of the GCP environment

I know, it's a bad list. But when I read some of the nice features I decided to make the switch. Here's what you get:

  • The root filesystem is always mounted as read-only. Additionally, its checksum is computed at build time and verified by the kernel on each boot.
  • Stateless kinda. /etc/ is writable but stateless. So you can write configuration settings but those settings do not persist across reboots. (Certain data, such as users' home directories, logs, and Docker images, persist across reboots, as they are not part of the root filesystem.)
  • Ton of other security stuff you get for free. Link

I love all this. Google tests the OS internally, they're scanning for CVEs, they're slowly rolling out updates, and it's designed to just run containers correctly, which is all I need. This OS has been idiot-proof. In a year of running it I haven't had a single OS issue. Updates go out, they get patched, I never notice. Troubleshooting works fine. This means I never need to talk about a Linux upgrade ever again AND the limitations of the OS mean my applications can't rely on stuff they shouldn't use. Truly set and forget.

I don't run software I can't build from source.

Go nuts: https://cloud.google.com/container-optimized-os/docs/how-to/building-from-open-source

2. Outdated APIs.

There are a lot of third-party tools that do this for you, and they're all pretty good. However, GKE does it automatically in a really smart way.

Not my cluster but this is what it looks like

Basically the web UI warns you if you are relying on outdated APIs and will not upgrade if you are. Super easy to check "do I have bad API calls hiding somewhere".
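For an idea of what it catches: a PodDisruptionBudget written against the old `policy/v1beta1` API stops working on Kubernetes 1.25, and the fix is usually just the apiVersion line. A sketch of the current form, with an illustrative app name:

```yaml
# policy/v1beta1 PodDisruptionBudget was removed in Kubernetes 1.25;
# the policy/v1 equivalent is identical apart from the apiVersion.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb   # illustrative name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
```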

3. How do I install and manage the ancillary services and what are they?

GKE comes batteries included. DNS is there, but it's just a flag in Terraform to configure. Service accounts, same thing. Ingress and Gateway to GCP are also just in there working. Hooking up to your VPC is a toggle in Terraform, so pods are natively routable. They even reserve the Pod IPs before the pods are created, which is nice and eliminates a source of problems.

They have their own CNI which also just works. One end of the Virtual Ethernet Device pair is attached to the Pod and the other is connected to the Linux bridge device cbr0. I've never encountered any problems with any of the GKE defaults, from the subnets it offers to generate for pods to the CNI it is using for networking. The DNS cache is nice to be able to turn on easily.

4. Can I stand up a clean cluster from what I have or was critical stuff added by hand?

Because everything you need to do happens in Terraform for GKE, it's very simple to see if you can stand up another cluster. Load balancing is happening inside of YAMLs, ditto for deployments, so standing up a test cluster and seeing if apps deploy correctly to it is very fast. You don't have to install a million helm charts to get everything configured just right.

However they ALSO have backup and restore built in!

Here is your backup running happily and restoring it is just as easy to do through the UI.

So if you have a cluster with a bunch of custom stuff in there and don't have time to sort it out, you don't have to.

5. Map out the application dependencies and ensure they're put into place in the right order.

This obviously varies from place to place, but the web UI for GKE does make it very easy to inspect deployments and see what is going on with them. This helps a lot, but of course if you have a service mesh that's going to be the one-stop shop for figuring out what talks to what when. The Anthos service mesh provides this and is easy to add onto a cluster.

6. What controls DNS/load balancing and how can I cut between cluster 1 and cluster 2

Alright so this is the only bad part. GCP load balancers provide zero useful information. I don't know why, or who made the web UIs look like this. Again, making an internal or external load balancer as an Ingress or Gateway with GKE is stupid easy with annotations.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: my-static-address
    kubernetes.io/ingress.allow-http: "false"
    networking.gke.io/managed-certificates: managed-cert
    kubernetes.io/ingress.class: "gce"
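The `managed-cert` referenced above is its own GKE-specific resource. A minimal example, with a placeholder domain:

```yaml
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: managed-cert   # must match the Ingress annotation above
spec:
  domains:
    - example.com   # placeholder; DNS must point at the Ingress IP before issuance succeeds
```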


Why would this data be the most useful data?

I don't know who this is for or why I would care what region of the world my traffic is coming from. It's also not rendering correctly on Firefox, with the screen cut off on the right. For context, this is the correct information I want from a load balancer every single time:

The entire GCP load balancer thing is a tire fire. The web UI to make load balancers breaks all the time. Adding an SSL certificate through the web UI almost never works. They give you a ton of great information about the backend of the load balancer, but adding things like a new TLS policy requires kind of a lot of custom stuff. I could go on and on.

Autopilot

Alright, so let's say all of that was still a bit much for you. You want a basic infrastructure where you don't need to think about nodes, or load balancers, or operating systems. You write your YAML, you deploy it to The Cloud, and then things happen automagically. That is GKE Autopilot.

Here are all the docs on it. Let me give you the elevator pitch. It's a stupid easy way to run Kubernetes that is probably going to save you money. Why? Because selecting and adjusting the type and size of node you provision is something most starting companies mess up with Kubernetes and here you don't need to do that. You aren't billed for unused capacity on your nodes, because GKE manages the nodes. You also aren't charged for system Pods, operating system costs, or unscheduled workloads.
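Since Autopilot bills on what your Pods request rather than on nodes, the resources block in your Deployment effectively is your bill. A minimal sketch with illustrative names and made-up numbers:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app   # illustrative
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: us-docker.pkg.dev/my-project/my-repo/my-app:latest  # placeholder
          resources:
            requests:         # Autopilot charges based on these values
              cpu: "500m"
              memory: "512Mi"
```

Right-sizing those requests is the whole cost-control game on Autopilot, which is a much simpler problem than right-sizing node pools.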

Hardening Autopilot is also very easy. You can see all the options that exist and are already turned on here. If you are a person who is looking to deploy an application where maintaining it cannot be a big part of your week, this is a very flexible platform to do it on. You can move to standard GKE later if you'd like. Want off GCP? It is not that much work to convert your YAML to work with a different hosted provider or a datacenter.

I went in with low expectations and was very impressed.

Why shouldn't I use GKE?

I hinted at it above. As good as GKE is, the rest of GCP is crazy inconsistent. First the project structure for how things work is maddening. You have an organization and below that are projects (which are basically AWS accounts). They all have their own permission structure which can be inherited from folders that you put the projects in. However since GCP doesn't allow for the combination of IAM premade roles into custom roles, you end up needing to write hundreds of lines of Terraform for custom roles OR just find a premade role that is Pretty Close.

GCP excels at networking, data visualization (outside of load balancing), kubernetes, serverless with cloud run and cloud functions and big data work. A lot of the smaller services on the edge don't get a lot of love. If you are heavy users of the following, proceed with caution.

GCP Secret Manager

For a long time GCP didn't have any secret manager, instead having customers encrypt objects in buckets. Their secret manager product is about as bare-bones as it gets. Secret rotation is basically a cron job that pushes to a Pub/Sub topic and then you do the rest of it. No metrics, no compliance check integrations, no help with rotation.

It'll work for most use cases, but there's just zero bells and whistles.

GCP SSL Certificates

I don't know how Let's Encrypt, a free service, outperforms GCP's SSL certificate generation process. I've never seen a service that mangles SSL certificates as badly as this one. Let's start with just trying to find them.

The first two aren't what I'm looking for. The third doesn't take me to anything that looks like an SSL certificate. SSL certificates actually live at Security -> Certificate Manager. If you try to go there even if you have SSL certificates you get this screen.

I'm baffled. I have Google SSL certificates with their load balancers. How is the API not enabled?

To issue the certs it does the same sort of DNS and backend checking as a lot of other services. To be honest I've had more problems with this service issuing SSL certificates than any in my entire life. It was easier to buy certificates from Verisign. If you rely a lot on generating a ton of these quickly, be warned.

IAM recommender

GCP has this great feature which is it audits what permissions a role has and then tells you basically "you gave them too many permissions". It looks like this:

Great, right? Now sometimes this service will recommend you modify the permissions to either a new premade role or a custom role. It's unclear when or how that happens, but when it does there is a little lightbulb next to it. You can click it to apply the new permissions, but since my (and most people's) permissions are managed in code somewhere, this obviously doesn't do anything long-term.

Now you can push these recommendations to Big Query, but what I want is some sort of JSON or CSV that just says "switch these to use x premade IAM roles". My point is there is a lot of GCP stuff that is like 90% there. Engineers did the hard work of tracking IAM usage, generating the report, showing me the report, making a recommendation. I just need an easier way to act on that outside of the API or GCP web console.

These are just a few examples that immediately spring to mind. My point being when evaluating GCP please kick the tires on all the services, don't just see that one named what you are expecting exists. The user experience and quality varies wildly.

I'm interested, how do I get started?

GCP Terraform used to be bad, but now it is quite good. You can see the whole getting started guide here. I recommend trying Autopilot and seeing if it works for you, just because it's cheap.

Even if you've spent a lot of time running k8s, give GKE a try. It's really impressive, even if you don't intend to move over to it. The security posture auditing, workload metrics, backup, hosted Prometheus, etc. are all really nice. I don't love all the GCP products, but this one has super impressed me.