Simple Kubernetes Secret Encryption with Python

June 05, 2024 in Python

I was recently working on a new side project in Python with Kubernetes and I needed to inject a bunch of secrets. The problem with secret management in Kubernetes is you end up needing to set up a lot of it yourself and its time consuming. When I'm working on a new idea, I typically don't want to waste a bunch of hours setting up "the right way" to do something that isn't related to the core of the idea I'm trying out.

For the record, the right way to do secrets in Kubernetes is the following:

Turn on encryption at rest for ETCD
Carefully set up RBAC inside of Kubernetes to ensure the right users and service accounts can access the secrets
Give up on trying to do that and end up setting up Vault or paying your cloud provider for their Secret Management tool
There is a comprehensive list of secret managers and approaches outlined here: https://www.argonaut.dev/blog/secret-management-in-kubernetes

However especially when you are trying ideas out, I wanted something more idiot proof that didn't require any setup. So I wrote something simple with Python Fernet encryption that I thought might be useful to someone else out there.

You can see everything here: https://gitlab.com/matdevdug/example_kubernetes_python_encryption

Walkthrough

So the script works in a pretty straight forward way. It reads the .env file you generate as outlined in the README with secrets in the following format:

Make a .env file with the following parameters:

KEY=Make a fernet key: https://fernetkeygen.com/
CLUSTER_NAME=name_of_cluster_you_want_to_use
SECRET-TEST-1=9e68b558-9f6a-4f06-8233-f0af0a1e5b42
SECRET-TEST-2=a004ce4c-f22d-46a1-ad39-f9c2a0a31619

The KEY is the secret key and the CLUSTER_NAME tells the Kubernetes library what kubeconfig target you want to use. Then the tool finds anything with the word SECRET in the .env file and encrypts it, then writes it to the .csv file.

The .csv file looks like the following:

I really like to keep some sort of record of what secrets are injected into the cluster outside of the cluster just so you can keep track of the encrypted values. Then the script checks the namespace you selected to see if there are secrets with that name already and, if not, injects it for you.

Some quick notes about the script:

Secret names in Kubernetes need a specific format for the name. Lower case with words separated by - or . The script will take the uppercase in the .env and convert it into a lowercase. Just be aware it is doing that.
It does base64 encode the secret before it uploads it, so be aware that your application will need to decode it when it loads the secret.
Now the only secret you need to worry about is the Fernet secret that you can load into the application in a secure way. I find this is much easier to mentally keep track of than trying to build an infinitely scalable secret solution. Plus its cheaper since many secret managers charge per secret.
The secrets are immutable which means they are lightweight on the k8s API and fast. Just be aware you'll need to delete the secrets if you need to replace them. I prefer this approach because I'd rather store more things as encrypted secrets and not worry about load.
It is easy to specify which namespace you intend to load the secrets into and I recommend using a different Fernet secret per application.
Mounting the secret works like it always does in k8s

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: your-image:latest
      volumeMounts:
        - name: secret-volume
          mountPath: /path/to/secret/data
  volumes:
    - name: secret-volume
      secret:
        secretName: my-secret

Inside of your application, you need to load the Fernet secret and decrypt the secrets. With Python that is pretty simple.

decrypt = fernet.decrypt(token)

Q+A

Why not SOPS? This is easier and also handles the process of making the API call to your k8s cluster to make the secret.
Is Fernet secure? As far as I can tell it's secure enough. Let me know if I'm wrong.
Would you make a CLI for this? If people actually use this thing and get value out of it, I would be more than happy to make it a CLI. I'd probably rewrite it in Golang if I did that, so if people ask it'll take me a bit of time to do it.

Questions/comments/concerns: https://c.im/@matdevdug

The Worst Website In The Entire World

May 15, 2024 in broadcom

What if you set out to make the worst website you possibly could? So poorly designed and full of frustrating patterns that users would not only hate the experience of using this website, but would also come to hate your company. Could we make a web experience so terrible that it would express how much our company hated our users?

As a long-time Internet addict, I've encountered my fair share of terrible websites. Instagram where now half my feed is advertisements for stupid t-shirts and the other half is empty black space.

Or ARNGREN.net which is like if a newspaper ad threw up on my screen.

But Instagram still occasionally shows me pictures of people I follow and ultimately the stuff on ARNGREN is so cool I still want to buy it regardless of the layout.

No, I believe it is the crack team at Broadcom that have nailed it for the worst website in existence.

Lured in with free VMware

So through social media I discovered this blog post from VMware announcing that their popular virtualization software is free for personal use now. You can read that here. Great, I used VMware Fusion before and it was ok and maybe it will let me run Windows on an ARM Mac. Probably not but let's try it out and see.

This means that everyday users who want a virtual lab on their Mac, Windows or Linux computer can do so for free simply by registering and downloading the latest build from the new download portal located at support.broadcom.com. With the new commercial model, we have reduced our product group offerings down to a single SKU (VCF-DH-PRO) for users who require commercial use licensing. This simplification eliminates 40+ other SKUs and makes quoting and purchasing VMware Desktop Hypervisor apps, Fusion Pro and Workstation Pro, easier than ever. The new Desktop Hypervisor app subscription can be purchased from any Broadcom Advantage partner.

I don't want to register at support.broadcom.com but it looks like I don't have a choice as this is the screen on the VMware site.

Now this is where alarm bells start going crazy in my head. Nothing about this notice makes sense. "The store will be moving to a new domain". So it's...not...down for maintenance but actually is just gone? Or is it actually coming back? Because then you say "store will be shutdown" (just a quick note, you want "the store" and "will be shutting down on April 30th 2024"). Also why don't you just redirect to the new domain? What is happening here?

Broadcom

So then I go to support.broadcom.com which is where I was told to register and make an account.

Never a great sign when there's a link to an 11 page PDF of how to navigate your website. That's the "Learn how to navigate Broadcom Support" link. You can download that killer doc here: https://support.broadcom.com/documents/d/ecx/broadcom-support-portal-getting-started-guide

Alright let's register.

First the sentence "Enhance your skills through multiple self-service avenues by creating your Broadcom Account" leaps off the page as just pure corporate nonsense. I've also never seen a less useful CAPTCHA, it looks like it is from 1998 and any modern text recognition software would defeat it. In fact the Mac text recognition in Preview defeats 3 of the 4 characters with no additional work:

So completely pointless and user hostile. Scoring lots of points for the worst website ever. I'm also going to give some additional points for "Ask our chatbot for assistance", an idea so revolting normally I'd just give up on the entire idea. But of course I'm curious, so I click on the link for the "Ask our chatbot" and.....

It takes me back to the main page.

Slow clap Broadcom. Imagine being a customer that is so frustrated with your support portal that you actually click "Ask a chatbot" and the web developers at Broadcom come by and karate chop you right in the throat. Bravo. Now in Broadcom's defense in the corner IS a chatbot icon so I kinda see what happened here. Let's ask it a question.

I didn't say hello. I don't know why it decided I said hello to it. But in response to VMware it gives me this:

Did the chatbot just tell me to go fuck myself? Why did you make a chatbot if all you do is select a word from a list and it returns the link to the support doc? Would I like to "Type a Query"?? WHAT IS A CHATBOT IF NOT TYPING QUERIES?

Next Steps

I fill in the AI-proof CAPTCHA and hit next, only to be greeted with the following screen for 30 seconds.

Finally I'm allowed to make my user account.

Um....alright....seems like overkill Broadcom but you know what this is your show. I have 1Password so this won't be a problem. It's not letting me copy/paste from 1Password into this field but if I do Command + \ it seems to let me insert. Then I get this.

What are you doing to me Broadcom. Did I....wrong you in some way? I don't understand what is happening. Ok well I refresh the page, try again and it works this time. Except I can't copy/paste into the Confirm Password field.

I mean they can't expect me to type out the impossibly complicated password they just had me generate right? Except they have and they've added a check to ensure that I don't disable Javascript and treat it like a normal HTML form.

Hey front-end folks, just a quick note. Never ever ever ever ever mess with my browser. It's not yours, it's mine. I'm letting you use it for free to render your bloated sites. Don't do this to me. I get to copy paste whatever I want whenever I want. When you get your own browser you can do whatever you want but while you are living in my house under my rules I get to copy/paste whenever I goddamn feel like it.

Quickly losing enthusiasm for the idea of VMware

So after pulling up the password and typing it in, I'm treated to this absolutely baffling screen.

Do I need those? I feel like I might need those. eStore at least sounds like something I might want. I don't really want Public Semiconductors Case Management but I guess that one comes in the box. 44 seconds of this icon later

I'm treated to the following.

Broadcom, you clever bastards. Just when I thought I was out, they pulled me back in. Tricking users into thinking a link is going to help them and then telling them to get fucked by advising them to contact your sales rep? Genius.

So then I hit cancel and get bounced back to......you guessed it!

Except I'm not even logged into my newly created account. So then I go to login with my new credentials and I finally make it to my customer portal. Well no first they need to redirect me back to the Broadcom Support main page again with new icons.

Apparently my name was too long to show and instead of fixing that or only showing first name Broadcom wanted to ensure the disrespect continued and sorta trail off. Whatever, I'm finally in the Matrix.

Now where might I go to...actually download some VMware software. There's a search bar that says "Search the entire site", let's start there!

Nothing found except for a CVE. Broadcom you are GOOD! For a second I thought you were gonna help me and like Lucy with the football you made me eat shit again.

Lucy and the football - John Quiggin's Blogstack

My Downloads was also unhelpful.

But maybe I can add the entitlement to the account? Let's try All Products.

Of course the link doesn't work. What was I even thinking trying that? That one is really on me. However "All Products" on the left-hand side works and finally I find it. My white whale.

Except when I click on product details I'm brought back to....

The blank page with no information! Out of frustration I click on "My Downloads" again which is now magically full of links! Then I see it!

YES. Clicking on it I get my old buddy the Broadcom logo for a solid 2 minutes 14 seconds.

Now I have fiber internet with 1000 down, so this has nothing to do with me. Finally I click the download button and I get.....the Broadcom logo again.

30 seconds pass. 1 minute passes. 2 minutes pass. I'm not sure what to do.

No. No you piece of shit website. I've come too far and sacrificed too much of my human dignity. I am getting a fucking copy of VMware Fusion. Try 2 is the same thing. 3, 4, 5 all fail. Then finally.

I install it and like a good horror movie, I think it's all over. I've killed Jason. Except when I'm installing Windows I see this little link:

And think "wow I would like to know what the limitations are for Windows 11 for Arm!". Click on it and I'm redirected to...

Just one final fuck you from the team at Broadcom.

Conclusion

I've used lots of bad websites in my life. Hell, I've made a lot of bad websites in my life. But never before have I seen a website that so completely expresses just the pure hatred of users like this one. Everything was as poorly designed as possible, with user hostile design at every corner.

Honestly Broadcom, I don't even know why you bothered buying VMware. It's impossible for anyone to ever get this product from you. Instead of migrating from the VMware store to this disaster, maybe just shut this down entirely. Destroy the backups of this dumpster fire and start fresh. Maybe just consider a Shopify site because at least then an average user might have a snowballs chance in hell of ever finding something to download from you.

Do you know of a worse website? I want to see it. https://c.im/@matdevdug

The Time Linkerd Erased My Load Balancer

May 03, 2024 in kubernetes

A cautionary tale of K8s CRDs and Linkerd.

A few months ago I had the genius idea of transitioning our production load balancer stack from Ingress to Gateway API in k8s. For those unaware, Ingress is the classic way of writing a configuration to tell a load balancer what routes should hit what services, effectively how do you expose services to the Internet. Gateway API is the re-imagined process for doing this where the problem domain is scoped, allowing teams more granular control over their specific services routes.

Ingress

apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: external-lb
spec:
  controller: example.com/ingress-controller
  parameters:
    apiGroup: k8s.example.com
    kind: IngressParameters
    name: external-lb

This is what setting up a load balancer in Ingress looks like

After conversations with various folks at GCP it became clear to me that while Ingress wasn't deprecated or slated to be removed, Gateway API is where all the new development and features are moving to. I decided that we were a good candidate for the migration since we are a microservice based backend with lower and higher priority hostnames, meaning we could safely test the feature without cutting over all of our traffic at the same time.

I had this idea that we would turn on both Ingress and Gateway API and then cut between the two different IP addresses at the Cloudflare level. From my low-traffic testing this approach seemed to work ok, with me being able to switch between the two and then letting Gateway API bake for a week or two to shake out any problems. Then I decided to move to prod. Due to my lack of issues in the lower environments I decided that I wouldn't set up Cloudflare load balancing between the two and manage the cut-over in Terraform. This turned out to be a giant mistake.

The long and short of it is that the combination of Gateway API and Linkerd in GKE fell down under high volume of requests. Low request volume there were no problems, but once we got to around 2k requests a second the Linkerd-proxy sidecar container memory usage started to grow unbounded. When I attempted to cut back from Gateway API to Ingress, I encountered a GKE bug I hadn't seen before in the lower environments.

"Translation failed: invalid ingress spec: service "my_namespace/my_service" is type "ClusterIP", expected "NodePort" or "LoadBalancer";

What we were seeing was a mismatch between the annotations automatically added by GKE.

Ingress adds these annotations:
cloud.google.com/neg: '{"ingress":true}'
cloud.google.com/neg-status: '{"network_endpoint_groups":{"80":"k8s1pokfef..."},"zones":["us-central1-a","us-central1-b","us-central1-f"]}'

Gateway adds these annotations:
cloud.google.com/neg: '{"exposed_ports":{"80":{}}}'
cloud.google.com/neg-status: '{"network_endpoint_groups":{"80":"k8s1-oijfoijsdoifj-..."},"zones":["us-central1-a","us-central1-b","us-central1-f"]}'

Gateway doesn't understand the Ingress annotations and vice-versa. This obviously caused a massive problem and blew up in my face. I had thought I had tested this exact failure case, but clearly prod surfaced a different behavior than I had seen in lower environments. Effectively no traffic was getting to pods while I tried to figure out what had broken.

I ended up making to manually modify the annotations to get things working and had a pretty embarrassing blow-up in my face after what I had thought was careful testing (but was clearly wrong).

Fast Forward Two Months

I have learned from my mistake regarding the Gateway API and Ingress and was functioning totally fine on Gateway API when I decided to attempt to solve the Linkerd issue. The issue I was seeing with Linkerd was high-volume services were seeing their proxies consume unlimited memory, steadily growing over time but only while on Gateway API. I was installing Linkerd with their Helm libraries, which have 2 components, the Linkerd CRD chart here: https://artifacthub.io/packages/helm/linkerd2/linkerd-crds and the Linkerd control plane: https://artifacthub.io/packages/helm/linkerd2/linkerd-control-plane

Since debug logs and upgrades hadn't gotten me any closer to a solution as to why the proxies were consuming unlimited memory until they eventually were OOMkilled, I decided to start fresh. I removed the Linkerd injection from all deployments and removed the helm charts. Since this was a non-prod environment, I figured at least this way I could start fresh with debug logs and maybe come up with some justification for what was happening.

Except the second I uninstalled the charts, my graphs started to freak out. I couldn't understand what was happening, how did removing Linkerd break something? Did I have some policy set to require Linkerd? Why was my traffic levels quickly approaching zero in the non-prod environment?

Then a coworker said "oh it looks like all the routes are gone from the load balancer". I honestly hadn't even thought to look there, assuming the problem was some misaligned Linkerd policy where our deployments required encryption to communicate or some mistake on my part in the removal of the helm charts. But they were right, the load balancers didn't have any routes. kubectl confirmed, no HTTProutes remained.

So of course I was left wondering "what just happened".

Gateway API

So a quick crash course in "what is gateway API". At a high level, as discussed before, it is a new way of defining Ingress which cleans up the annotation mess and allows for a clean separation of responsibility in an org.

So GCP defines the GatewayClass, I make the Gateway and developer provide the HTTPRoutes. This means developers can safely change the routes to their own services without the risk that they will blow up the load balancer. It also provides a ton of great customization for how to route traffic to a specific service.

So first you make a Gateway like so in Helm or whatever:

---
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: {{ .Values.gateway_name }}
  namespace: {{ .Values.gateway_namespace }}
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        kinds:
        - kind: HTTPRoute
        namespaces:
          from: Same
    - name: https
      protocol: HTTPS
      port: 443
      allowedRoutes:
        kinds:
          - kind: HTTPRoute
        namespaces:
          from: All
      tls:
        mode: Terminate
        options:
          networking.gke.io/pre-shared-certs: "{{ .Values.pre_shared_cert_name }},{{ .Values.internal_cert_name }}"

Then you provide a different YAML of HTTPRoute for the redirect of http to https:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: redirect
  namespace: {{ .Values.gateway_namespace }}
spec:
  parentRefs:
  - namespace: {{ .Values.gateway_namespace }}
    name: {{ .Values.gateway_name }}
    sectionName: http
  rules:
  - filters:
    - type: RequestRedirect
      requestRedirect:
        scheme: https

Finally you can set policies.

---
apiVersion: networking.gke.io/v1
kind: GCPGatewayPolicy
metadata:
  name: tls-ssl-policy
  namespace: {{ .Values.gateway_namespace }}
spec:
  default:
    sslPolicy: tls-ssl-policy
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: {{ .Values.gateway_name }}

Then your developers can configure traffic to their services like so:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: store
spec:
  parentRefs:
  - kind: Gateway
    name: internal-http
  hostnames:
  - "store.example.com"
  rules:
  - backendRefs:
    - name: store-v1
      port: 8080
  - matches:
    - headers:
      - name: env
        value: canary
    backendRefs:
    - name: store-v2
      port: 8080
  - matches:
    - path:
        value: /de
    backendRefs:
    - name: store-german
      port: 8080

Seems Straightforward

Right? There isn't that much to the thing. So after I attempted to re-add the HTTPRoutes using Helm and Terraform (which of course didn't detect a diff even though the routes were gone because Helm never seems to do what I want it to do in a crisis) and then ended up bumping the chart version to finally force it do the right thing, I started looking around. What the hell had I done to break this?

When I removed linkerd crds it somehow took out my httproutes. So then I went to the Helm chart trying to work backwards. Immediately I see this:

{{- if .Values.enableHttpRoutes }}
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    api-approved.kubernetes.io: https://github.com/kubernetes-sigs/gateway-api/pull/1923
    gateway.networking.k8s.io/bundle-version: v0.7.1-dev
    gateway.networking.k8s.io/channel: experimental
    {{ include "partials.annotations.created-by" . }}
  labels:
    helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    linkerd.io/control-plane-ns: {{.Release.Namespace}}
  creationTimestamp: null
  name: httproutes.gateway.networking.k8s.io
spec:
  group: gateway.networking.k8s.io
  names:
    categories:
    - gateway-api
    kind: HTTPRoute
    listKind: HTTPRouteList
    plural: httproutes
    singular: httproute
  scope: Namespaced
  versions:

Sure enough, Linkerd CRD Helm chart has that set to default True:

I also found this issue: https://github.com/linkerd/linkerd2/issues/12232

So yeah, Linkerd is, for some reason, pulling this CRD from a pull request from April 6th of last year that is marked as "do not merge". https://github.com/kubernetes-sigs/gateway-api/pull/1923

Linkerd is aware of the possible problem but presumes you'll catch the configuration option on the Helm chart: https://github.com/linkerd/linkerd2/issues/11586

To be clear I'm not "coming after Linkerd" here. I just thought the whole thing was extremely weird and wanted to make sure, given the amount of usage Linkerd gets out there, that other people were made aware of it before running the car into the wall at 100 MPH.

What are CRDs?

Kubernetes Custom Resource Definitions (CRDs) essentially extend the Kubernetes API to manage custom resources specific to your application or domain.

CRD Object: You create a YAML manifest file defining the Custom Resource Definition (CRD). This file specifies the schema, validation rules, and names of your custom resource.
API Endpoint: When you deploy the CRD, the Kubernetes API server creates a new RESTful API endpoint for your custom resource.

Effectively when I enabled Gateway API in GKE with the following I hadn't considered that I could end up in a CRD conflict state with Linkerd:

  gcloud container clusters create CLUSTER_NAME \
    --gateway-api=standard \
    --cluster-version=VERSION \
    --location=CLUSTER_LOCATION

What I suspect happened is, since I had Linkerd installed before I had enabled the gateway-api on GKE, when GCP attempted to install the CRD it failed silently. Since I didn't know there was a CRD conflict, I didn't understand that the CRD that the HTTPRoutes relied on was actually the Linkerd maintained one, not the GCP one. Presumably had I attempted to do this the other way it would have thrown an error when the Helm chart attempted to install a CRD that was already present.

To be clear before you call me an idiot, I am painfully aware that the deletion of CRDs is a dangerous operation. I understand I should have carefully checked and I am admitting I didn't in large part because it just never occurred to me that something like Linkerd would do this. Think of my failure to check as a warning to you, not an indictment against Kubernetes or whatever.

Conclusion

If you are using Linkerd and Helm and intend to use Gateway API, this is your warning right now to go in there and flip that value in the Helm chart to false. Learn from my mistake.

Questions/comments/concerns: https://c.im/@matdevdug

AI Is Speaking In Tongues

April 23, 2024

In my hometown in Ohio, church membership was a given for middle-class people. With a population of 8,000 people, somehow 19 churches were kept open and running. A big part of your social fabric were the kids who went to the same church that you did, the people you would gravitate towards in a new social situation. The more rural and working class your family was, the less likely you actually went to a church on a regular basis. You'd see them on Christmas and Easter, but they weren't really part of the "church group".

My friend Mark was in this group. His family lived in an underground house next to their former motorcycle shop that had closed down when his dad died. It was an hour walk from my house to his through a beautiful forest filled with deer. I was often stuck within eyesight of the house as a freight train slowly rolled through in front of me, sitting on a tree stump until the train passed. The former motorcycle shop had a Dr. Pepper machine in front of the soaped windows I had a key to and we'd sneak in to explore the shop on sleepovers while his mom worked one of her many jobs.

Hocking Hills Underground Home - Earth houses for Rent in Rockbridge, Ohio, United States - Airbnb — What I mean when I say "underground house"

She was a chain-smoker who rarely spoke, often lighting one cigarette with the burning cherry of the first as she drove us to the local video store to rent a videogame. Mark's older brother also lived in the house, but rarely left his room. Mostly it was the two of us bouncing around unsupervised, watching old movies and drinking way too much soda as his German Shepard wheezed and coughed in the cigarette smoke filled house.

Families like this were often targeted by Evangelical Christians, groups that often tried to lure families in with offers of youth groups that could entertain your kids while you worked. At some point, one of the youth pastors convinced Mark's mom that she should send both of us to his church. Instead of Blockbuster, we got dropped off in front of an anonymous steel warehouse structure with a cross and a buzzing overhead light on a dark country road surrounded by cornfields. Mark hadn't really been exposed to religion, with his father and mother having been deep into biker culture. I had already seen these types of places before and was dreading what I knew came next.

When we walked in, we were introduced to "Pastor Michael", who looked a bit like if Santa Claus went on an extreme diet and bought serial killer glasses. Mark bounced over and started asking him questions, but I kept my distance. I had volunteered the year before to fix up an old train station which involved a large crew of youth "supervised" by the fundamentalist Christian church that wanted to turn the train station into a homeless shelter. We slept on the floors of the middle school in this smaller town neighboring mine, spending our days stripping paint and sanding floors and our evenings being lectured about the evils of sex and how America was in a "culture war". In retrospect I feel like there should have been more protective gear in removing lead paint as child labor, but I guess that was up to God.

After one of these long sessions where we were made to stand up and promise we wouldn't have sex before we got married, I made a joke in the walk back to our assigned classroom and was immediately set upon by the senior boy in the group. He had a military style haircut and he threw me against a locker hard enough that I saw stars. I had grown up going to Catholic school and mass every Sunday, spending my Wednesday nights going to CCD (Confraternity of Christian Doctrine), which was like Sunday school for Catholics. All of this was to say I had pretty established "Christian" credentials. This boy let me know he thought I was a bad influence, a fake Christian and that I should be careful since I'd be alone with him and his friends every night in the locked classroom. The experience had left me extremely wary of these Evangelical cults as I laid silently in my sleeping bag on the floor of a classroom, listening to a hamster running in a wheel that had clearly been forgotten.

To those of you not familiar with this world, allow me to provide some context. My Catholic education presented a very different relationship with holy figures. God spoke directly to very few people, saints mostly. There were many warnings growing up about not falling into the trap of believing you were such a person, worthy of a vision or a direct conversation with a deity. It was suggested softly this was more likely mental illness than divine intervention. He enters your heart and changes your behavior and gives you peace, but you aren't in that echelon of rare individuals for whom a chat was justified. So to me these Evangelicals claiming they could speak directly with God was heresy, a gross blasphemy where random "Pastors" were claiming they were saints.

The congregation started to file in and what would follow was one of the most surreal two hours of my life. People I knew, the woman who worked at the library and a local postal worker started to scream and wave their arms, blaming their health issues on Satan. Then at one point Scary Santa Claus started to shake and jerk, looking a bit like he was having a seizure. He started to babble loudly, moving around the room and I stared as more and more people seemed to pretend this babbling meant something and then doing it themselves. The bright lights and blaring music seemed to have worked these normal people into madness.

In this era before cell phones, there wasn't much I could do to leave the situation. I waited and watched as Mark became convinced these people were channeling the voice of God. "It's amazing, I really felt something in there, there was an energy in the room!" he whispered to me as I kept my eyes on the door. One of the youth pastors asked me if I felt the spirit moving through me, that I shouldn't resist the urge to join in. I muttered that I was ok and said I had to go to the bathroom, then waited in the stall until the service wrapped up almost two hours later.

In talking to the other kids, I couldn't wrap my mind around the reality that they believed this. "It's the language of God, only a select few can understand what the Holy Spirit is saying through us". The language was all powerful, allowing the Pastor to reveal prophesy to select members of the Church and assist them with their financial investments. This was a deadly serious business that these normal people completely believed, convinced this nonsense jabbering that would sometimes kind of sound like language was literally God talking through them.

I left confident that normal, rational people would never believe such nonsense. These people were gullible and once I got out of this dead town I'd never have to be subjected to this level of delusion. So imagine my surprise when, years later, I'm sitting in an giant conference hall in San Francisco as the CEO of Google explains to the crowd how AI is the future. This system that stitched together random words was going to replace all of us in the crowd, solve global warming, change every job. This was met with thunderous applause by the group, apparently excited to lose their health insurance. All this had been kicked off with techno music and bright lights, a church service with a bigger budget.

Every meeting I went to was filled with people ecstatic about the possibility of replacing staff with this divine text generator. A French venture capitalist who shared a Uber with me to the original Google campus for meetings was nearly breathless with excitement. "Soon we might not even need programmers to launch a startup! Just a founder and their ideas getting out to market as fast as they can dream it." I was tempted to comment that it seemed more likely I could replace him with an LLM, but it felt mean. "It is going to change the world" he muttered as we sat in a Tesla still being driven by a human.

It has often been suggested by religious people in my life that my community, the nonreligious tech enthusiasts, use technology as a replacement for religion. We reject the fantastical concept of gods and saints only to replace them with delusional ideas of the future. Self-driving cars were inevitable until it became clear that the problem was actually too hard and we quietly stopped talking about it. Establishing a colony on Mars is often discussed as if it is "soon", even if the idea of doing so far outstrips what we're capable of doing by a factor of 10. We tried to replace paper money with a digital currency and managed to create a global Ponzi scheme that accelerated the destruction of the Earth.

Typically I reject this logic. Technology, for its many faults, also produces a lot of things with actual benefits which is not a claim religion can make most of the time. But after months of hearing this blind faith in the power of AI, the comparisons between what I was hearing now and what the faithful had said to me after that service was eerily similar. Is this just a mass delusion, a desperate attempt by tech companies to convince us they are still worth a trillion dollars even though they have no new ideas? Is there anything here?

Glossolalia

Glossolalia, the technical term for speaking in tongues, is an old tradition with a more modern revival. It is a trademark of the Pentecostal Church, usually surrounded by loud music, screaming prayers and a leader trying to whip the crowd into a frenzy. Until the late 1950s it was confined to a few extreme groups, but since then has grown into a more and more common fixture in the US. The cultural interpretation of this trend presents it as a “heavenly language of the spirit” accessible only to the gifted ones. Glossolalists often report an intentional or spontaneous suspension of will to convey divine messages and prophecies.

In the early 1900s W. J. Seymour, a minister in the US, started to popularize the practice of whipping his congregation into a frenzy such that they could speak in tongues. This was in Los Angeles and quickly became the center of the movement. For those who felt disconnected from religion in a transplant city, it must have been quite the experience to feel your deity speaking directly through you.

It's Biblical basis is flimsy at best however. Joel 2:28-9 says:

And afterwards I will pour out my Spirit on all people. Your sons and daughters will prophesy, your old men will dream dreams, young men will see visions. Even on my servants, both men and women, I will pour out my Spirit in those days.

A lot of research has been done into whether this speech is a "language", with fascinating results. In the Psychology of Speaking in Tongues, Kildahl and Qualben attempted to figure that out. Their conclusions were that while it could sound like a language, it was a gibberish, closer to the fake language children use to practice the sounds of speaking. To believers though, this presented no problems.

He argued that glossolalia is real and that it is a gift from the Holy Spirit. He argued that a person cannot fake tongues. Tongues are an initial evidence of the spirit baptism. It is a spiritual experience. He observed that tongues cannot be understood by ordinary people. They can only be understood spiritually. He noted that when he speaks in tongues he feels out of himself. The feeling is very strange. One can cry, get excited, and laugh. As our respondent prayed, he uttered: Hiro---shi---shi---sha---a---karasha. He jumped and clapped his hands in excitement and charisma. He observed that if God allows a believer to speak in tongues there is a purpose for that. One can speak in tongues and interpret them at the same time. However, in his church there is no one who can interpret tongues. According to our respondent, tongues are intended to edify a person. Tongues are beneficial to the person who speaks in tongues. A person does not choose to pray in tongues. Tongues come through the Spirit of God. When speaking in tongues, it feels as if one has lost one's memory. It is as if one is drunk and the person seems to be psychologically disturbed. This is because of the power of the influence of the Holy Spirit. Tongues are a special visitation symbolising a further special touch of the Holy Spirit. Source

In reality glossolalic speech is not a random and disorganized production of sounds. It has specific accents, intonations and word-like units that resembles the original language of the speaker. [source] That doesn't make it language though, even if the words leave the speaker feeling warm and happy.

What it actually is, at its core, is another tool the Evangelical machine has at its disposal to use the power of music and group suggestion to work people into a frenzy.

The tongue-speaker temporarily discards some of his or her ego functioning as it happens in such times as in sleep or in sexual intercourse.⁴¹ This phenomenon was also noticed in 2006 at the University of Pennsylvania, USA, by researchers under the direction of Andrew Newburg, MD who completed the world's first brain-scan study of a group of Pentecostal practitioners while they were speaking in tongues. The researchers noticed that when the participants were engaged in glossolalia, activity in the language centres of the brain actually decreased, while activity in the emotional centres of the brain increased. The fact that the researchers observed no changes in any language areas, led them to conclude that this phenomenon suggests that glossolalia is not associated with usual language function or usage.

Source

It's the power of suggestion. You are in a group of people and someone, maybe a plant, kicks it off. You are encouraged to join in and watch as your peers enthusiastically get involved. The experience has been explained as positive, so of course you remember it as a positive experience, even if in your core you understand that you weren't "channeling voices". You can intellectually know it is a fake and still feel moved by the experience.

LLMs

LLMs, large language models, which have been rebranded as AI, share a lot with the Evangelical tool. AI was classically understood to be a true artificial intelligence, a thinking machine that actually processed and understood your request. It was seen as the Holy Grail of computer science, the ability to take the best of human intellect and combine it into an eternal machine that could guide and help us. This definition has leaked from the sphere of technology and now solidly lives on in Science Fiction, the talking robot who can help and assist the humans tasked with something.

If that's the positive spin, then there has always been a counter argument. Known as the "Chinese room argument" as shorthand, it says that a digital computer running code cannot have a mind, understanding or consciousness. You can create a very convincing fake though. The thought experiment is as follows:

You've made a computer that behaves as if it understands Chinese. It takes Chinese characters as input and returns Chinese characters as output. It does so at such a high level that it passes the Turing test in that a native Chinese speaker believes the thing it is speaking to is a human being speaking Chinese. But the distinction is that the machine doesn't understand Chinese, it is simulating the idea of speaking Chinese.

Searle suggests if you put him into a room with an English version of the program he could receive the same characters through a slot in the door, process them according to the code and produce Chinese characters as output, without understanding anything that is being said. However he still wouldn't speak or understand Chinese.

This topic has been discussed at length by experts, so if you are interested in the counterarguments I suggest the great site by Stanford: https://plato.stanford.edu/entries/chinese-room/

What Is An AI?

The Neural Networks powering AI at a high level look as follows:

The magic part of AI is the Transformer neural network, which uses self-attention to process not just the elements of text on their own but the way the data relates to each other. It has been fed enough examples to be able to take the input and produce a likely output. Effectively by ingesting every work spoken by humans on video or written on the internet, it can predict with some level of accuracy what a common collection of words to return to a query is.

The end result is a box that consumes a frightening amount of electricity to mimic the responses of an all-knowing eternal AI. The believers can commune directly with this machine spirit, asking it questions and getting back blocks of text or pictures that sometimes look like something a human would make. You cannot question the model or work backwards to determine how specifically it came to the conclusion it did. There is an element of the divine in the process, a specter of magic.

Hallucination

As AI was introduced, we were cautioned to be on the lookout for "hallucinations", where the model would produce convincing-looking fakes. The problem with this warning is it both flies in the face of how humans work (we're not amazing at questioning authoritative-sounding answers) and also suggested this was a temporary problem. But the more you read about AI, you realize it can never solve the hallucination problem because it would need to understand what is "real" and "fake" in order to do that.

What do I mean on a practical level? So I thought I would test out the Google Gemini AI with some easy questions. I had recently written a bash script to query GCP cloud deployments to get their status and write them to markdown. This should be an easy test of the system, right? It's Google AI writing a simple script using the Google CLI to interact with the Google Cloud Platform. Closed ecosystem, high degree of confident, should have definite sources of truth.

Can you write a bash script using the gcloud CLI to query the status of all deployments using cloud deploy, a GCP service, in the GCP project test123 and output the name and current name to a markdown file for each one?

#!/bin/bash

# Set project ID
PROJECT_ID="test123"

# Output directory for markdown files
OUTPUT_DIR="deployment_statuses"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Get all deployments
deployments=$(gcloud deploy deployments list --project "$PROJECT_ID" --format="value(NAME)")

# Loop through each deployment
for deployment in $deployments; do
  # Get deployment details
  deployment_details=$(gcloud deploy deployments describe "$deployment" --project "$PROJECT_ID" --format="yaml")

  # Parse name and state from YAML output
  name=$(echo "$deployment_details" | yq .name)
  state=$(echo "$deployment_details" | yq .status.state)

  # Create markdown file for deployment
  output_file="$OUTPUT_DIR/$name.md"
  echo "# Deployment: $name" > "$output_file"
  echo "Current State: $state" >> "$output_file"

  # Add additional information to markdown file as needed (e.g., rollout strategy, create time)
  # ...

done

echo "Deployment statuses written to markdown files in $OUTPUT_DIR"

Looks great right? Exactly what I wanted. Except it doesn't work.

gcloud doesn't have a gcloud deploy deployments argument.

NAME
    gcloud deploy - create and manage Cloud Deploy resources

SYNOPSIS
    gcloud deploy GROUP | COMMAND [GCLOUD_WIDE_FLAG ...]

DESCRIPTION
    Create and manage Cloud Deploy resources.

GCLOUD WIDE FLAGS
    These flags are available to all commands: --help.

    Run $ gcloud help for details.

GROUPS
    GROUP is one of the following:

     automation-runs
        Manages AutomationRuns resources for Cloud Deploy.

     automations
        Manages Automations resources for Cloud Deploy.

     custom-target-types
        Create and manage Custom Target Type resources for Cloud Deploy.

     delivery-pipelines
        Create and manage Delivery Pipeline resources for Cloud Deploy.

     job-runs
        Manages job runs resources for Cloud Deploy.

     releases
        Create and manage Release resources for Cloud Deploy.

     rollouts
        Create and manage Rollout resources for Cloud Deploy.

     targets
        Create and manage Target resources for Cloud Deploy.

Now I know this because I wrote it, but there's no way someone looking at this with no knowledge of the gcloud CLI would understand why this wouldn't work.

I tried again, this time with a simpler question. Maybe because not that many people use gcloud it doesn't have a big enough sample size to construct a real answer. Let's ask a more basic question, but let's keep it on Google technology. Kubernetes has been documented to death, books written about it, millions of articles and blog posts. Certainly that will work.

How do I, in hosted Kubernetes on GCP, write a deployment configuration yaml which sets nodeAffinity using preferredDuringSchedulingIgnoredDuringExecution. The nodepool should be nodes02 with a weight of one.

What I'm trying to prompt it here is to give me back a configuration file not with nodeSelector but with Node Affinity, the softer preference level for assigning pods to nodes. But there's a small trick in the question. I want a deployment not a Pod spec. This is a distinction which requires a more nuanced comprehension of the subject matter, the ability to not pull the most common example but the more specific example. What we want is this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3
  template: 
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
          preference:
            matchExpressions:
            - key: cloud.google.com/gke-nodepool
            operator: In
            values:
            - nodes02
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80

What I got back was this:

  spec:
    nodeSelector: 
      cloud.google.com/gke-nodepool: nodes02
    affinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        matchExpressions:
        - key: cloud.google.com/gke-nodepool
          operator: In
          values:
          - nodes02

The problem with this response is it does a very different thing from the thing I was trying to do. nodeSelector is a stricter approach, ensuring pods are only scheduled on nodes that match the label. nodeAffinity is a much softer preference, telling k8s I'd like the pods to go there if possible but if that isn't possible, put them where you would normally do.

Both of these examples seem reasonable. The machine responded with something that could be construed as the answer, a clever parody of human intelligence, but ultimately it is more like a child playing. It doesn't understand the question, but understands how to construct convincing looking fakes.

To the faithful though, this isn't a problem.

However, if the training data is incomplete or biased, the AI model may learn incorrect patterns. This can lead to the AI model making incorrect predictions, or hallucinating.

For example, an AI model that is trained on a dataset of medical images may learn to identify cancer cells. However, if the dataset does not include any images of healthy tissue, the AI model may incorrectly predict that healthy tissue is cancerous. This is an example of an AI hallucination.

[Source]

This creates a false belief that the problem lies with the training data, which for both of my examples simply cannot be true. Google controls both ends of that equation and can very confidently "ground" the model with verifiable sources of information. In theory this should tether their output and reduce the chances of inventing content. It reeks of a religious leader claiming while that prophecy was false, the next one will be real if you believe hard enough. It also moves the responsibility for the problem from the AI model to "the training data", which for these LLMs represents a black box of information. I don't know what the training data is, so I can't question whether its good or bad.

Is There Anything Here?

Now that isn't to say there isn't amazing work happening here. LLMs can do some fascinating things and the transformer work has the promise to change how we allow people to interact with computers. Instead of an HTML form with strict validation and obtuse error messages, we can instead help explain to people in real-time what is happening, how to fix problems, just in general provide more flexibility when dealing with human inputs. We can have a machine look at less-sorted data and find patterns, there are lots of ways for this tech to make meaningful differences in human life.

It just doesn't have a trillion dollars worth of value. This isn't a magic machine that will replace all human workers, which for some modern executives is the same as being able to talk directly to God in terms of the Holy Grail of human progress. Finally all the money can flow directly to the CEO himself, cutting out all those annoying middle steps. The demand of investors for these companies to produce something new has outstripped their ability to do that, resulting in a dangerous technology being unleashed upon the world with no safeties. We've made a lying machine that doesn't show you its work, making it even harder for people to tell truth from fiction.

If LLMs are going to turn into actual AI, we're still years and years from that happening. This represents an interesting trick, a feel-good exercise that, unless you look too closely, seems like you are actually talking to an immortal all-knowing being that lives in the clouds. But just like everything else, if your faith is shaken for even a moment the illusion collapses.

Questions/comments/concerns: https://c.im/@matdevdug

Why Can't My Mom Email Me?

April 10, 2024 in email

An investigation into Proton encrypted email.

Suddenly Silence

I'm a big user of email, preferring long chains to messaging apps for a lot of my friends and contacts. It's nice that it isn't tied to a single device or platform and since I own my domain, I can move it from service to service whenever I want and the sender doesn't have to learn some new address. However in the last two months I suddenly stopped getting emails from a percentage of my friends and even my mom.

What I was getting instead was PGP encrypted emails with blank bodies that looked like the following:

If I inspected the message, it was clearly an encrypted email which Fastmail doesn't support. They have a whole blog post on why they don't here: https://www.fastmail.com/blog/why-we-dont-offer-pgp/ but up to this point I haven't really cared one way or the other since nobody sends me encrypted emails.

Now I knew that Proton would send encrypted emails to other Proton email addresses, but obviously this isn't a Proton hosted email address which it would be able to tell pretty easily with DNS. Then it got even stranger when I tried my work email and got the same error.

Checking the raw message and there it is, Proton has encrypted this email. Now this address is hosted on Google Workspaces, so at this point I'm just baffled. Can Proton email users not send emails to people on Google Workspaces email addresses? That can't possibly be right? My friends and mom using Proton would have noticed that their emails seem to always disappear into the ether for the majority of the people they email.

I open a ticket with Fastmail hoping they've seen this problem before, but no luck. Then I opened a ticket with Proton but didn't hear back as of the time of me writing this.

How Proton Seems To Work

So the reason why so many people I know are moving to Proton is they seem to be the only game in town that has cracked sending encrypted emails in the least annoying way possible. Their encryption uses asymmetric PGP key pairs with lookup for other users public keys happening on their key server. This in conjunction with their Key Transparency technology that compares lookup requests by the client with requests on the server-side allows for easy encrypted message exchanges with a high degree of safety, at least according to them.

There seems to be three classes of keys at Proton.

User keys: encrypt account-specific stuff like contacts. Not shared.
Address keys: for encrypting messages and data.
Other keys: part of a key tree that leads back to the address key as the primary external key for people to use.

So that makes sense that Proton can lookup address keys for users on their system. But where are my keys coming from? So in their Proton Key Transparency whitepaper they have this little snippet on page 10:

For External Addresses, the server may return email encryption keys that it
found in the Web Key Directory (WKD) [6] (since email is hosted elsewhere).
The server may also return data encryption keys, used e.g. for Proton Drive.
The former should have an absence proof in KT, and the latter should have an
inclusion proof.
For Non-Proton Addresses, the server may also return keys that it found in the
WKD. This way clients can automatically encrypt emails to it. These keys won’t
be in ProtonKT, thus KT should return an absence proof.

What The Hell Is WKD?

WKD, or OpenPGP Web Key Directory is an IETF draft by Werner Koch. It describes a service where you can lookup OpenPGP keys by mail addresses using a service. It also allows the key owner and the mail provider to publish and revoke keys. The whole thing is very clever, an interesting way to get around the annoying parts of PGP encryption of email. You can read it here: https://www.ietf.org/archive/id/draft-koch-openpgp-webkey-service-16.txt

It outlines an enrollment process by which I would signal to a WKD service that I have a key that I want to enroll into the process. The only problem is I never did that, or at least certainly can't remember doing that. I'm certainly not hosting a page with any key verification stuff.

There seems to be a way to set a CNAME record to point towards keys.openpgp.org where I do have a key set, but that isn't set up on my domain.

nslookup openpgpkey.matduggan.com
Server:		2a01:4f8:c2c:123f::1
Address:	2a01:4f8:c2c:123f::1#53

Non-authoritative answer:
*** Can't find openpgpkey.matduggan.com: No answer

Source here: https://keys.openpgp.org/about/usage

I can't seem to find why Proton thinks they can use this key BUT I can confirm this is the key they're encrypting the emails with.

What?

So it seems if your email address returns a key from keys.openpgp.org then Proton will encrypt the message with your public key from there, even though (as far as I can tell) I haven't opted into them using this service. I also can't seem to figure out a way to signal to them they shouldn't do it.

Alright so what happens if I just remove my key from keys.openpgp.org. The process is pretty simple, just go to: https://keys.openpgp.org/manage and follow the instructions in the email. It seems to work more or less instantly.

Alright looks like we figured it out!

Proton Seriously What The Hell?

I'm at a little bit of a loss here. I totally understand sending me encrypted emails if I've gone through the steps to set the CNAME that indicates that I want to do that, but it doesn't seem like that's how the service works. As far as I can tell, the act of uploading a OpenPGP-compatible key seems to trigger their service to send it as an end-to-end encrypted message.

I'll update this with whatever I hear back from Proton but in the meantime if you stumble across this post after getting blank emails from people for months, you'll at least be able to fix it.

Is there some flag I've accidentally set somewhere that tells Proton to send me encrypted emails? Let me know at: https://c.im/@matdevdug

Why Don't I Like Git More?

April 05, 2024 in git

I've been working with git now full-time for around a decade now. I use it every day, relying on the command-line version primarily. I've read a book, watched talks, practiced with it and in general use it effectively to get my job done. I even have a custom collection of hooks I install in new repos to help me stay on the happy path. I should like it, based on mere exposure effect alone. I don't.

I don't feel like I can always "control" what git is going to do, with commands sometimes resulting in unexpected behavior that is consistent with the way git works but doesn't track with how I think it should work. Instead, I need to keep a lot in my mind to get it to do what I want. "Alright, I want to move unstaged edits to a new branch. If the branch doesn't exist, I want to use checkout, but if it does exist I need to stash, checkout and then stash pop." "Now if the problem is that I made changes on the wrong branch, I want stash apply and not stash pop." "I need to bring in some cross-repo dependencies. Do I want submodules or subtree?"

I need to always deeply understand the difference between reset, revert, checkout, clone, pull, fetch, cherrypick when I'm working even though some of those words mean the same thing in English. You need to remember that push and pull aren't opposites despite the name. When it comes to merging, you need to think through the logic of when you want rebase vs merge vs merge --squash. What is the direction of the merge? Shit, I accidentally deleted a file awhile ago. I need to remember git rev-list -n 1 HEAD – filename. Maybe I deleted a file and immediately realized it but accidentally committed it. git reset --hard HEAD~1 will fix my mistake but I need to remember what specifically --hard does when you use it and make sure it's the right flag to pass.

Nobody is saying this is impossible and clearly git works for millions of people around the world, but can we be honest for a second and acknowledge that this is massive overkill for the workflow I use at almost every job which looks as follows:

Make a branch
Push branch to remote
Do work on branch and then make a Pull Request
Merge PR, typically with a squash and merge cause it is easier to read
Let CI/CD do its thing.

I've never emailed a patch or restored a repo from my local copy. I don't spend weeks working offline only to attempt to merge a giant branch. We don't let repos get larger than 1-2 GB because then they become difficult to work with when I just need to change like three files and make a PR. None of the typical workflow benefits from the complexity of git.

More specifically, it doesn't work offline. It relies on merge controls that aren't even a part of git with Pull Requests. Most of that distributed history gets thrown away when I do a squash. I don't gain anything with my local disk being cluttered up with out-of-date repos I need to update before I start working anyway.

Now someone saying "I don't like how git works" is sort of like complaining about PHP in terms of being a new and novel perspective. Let me lay out what I think would be the perfect VCS and explore if I can get anywhere close to there with anything on the market.

Gitlite

What do I think a VCS needs (and doesn't need) to replace git for 95% of usecases.

Dump the decentralized model. I work with tons of repos, everyone works with tons of repos, I need to hit the server all the time to do my work anyway. The complexity of decentralized doesn't pay off and I'd rather be able to do the next section and lose it. If GitHub is down today I can't deploy anyway so I might as well embrace the server requirement as a perk.
Move a lot of the work server-side and on-demand. I wanna search for something in a repo. Instead of copying everything from the repo, running the search locally and then just accepting that it might be out of date, run it on the server and tell me what files I want. Then let me ask for just those files on-demand instead of copying everything.
I want big repos and I don't want to copy the entire thing to my disk. Just give me the stuff I want when I request it and then leave the rest of it up there. Why am I constantly pulling down hundreds of files when I work with like 3 of them.
Pull Request as a first-class citizen. We have the concept of branches and we've all adopted the idea of checks that a branch must pass before it can be merged. Let's make that a part of the CLI flow. How great would be it to be able to, inside the same tool, ask the server to "dry-run" a PR check and see if my branch passes? Imagine taking the functionality of the gh CLI and not making it platform specific ala kubectl with different hosted Kubernetes providers.
Endorsing and simplifying the idea of cross-repo dependencies. submodules don't work the way anybody wants them to. subtree does but taking work and pushing it back to the upstream dependency is confusing and painful to explain to people. Instead I want something like: https://gitmodules.com/
- My server keeps it in sync with the remote server if I'm pulling from a remote server but I can pin the version in my repo.
- My changes in my repo go to the remote dependency if I have permission
- If there are conflicts they are resolved through a PR.
Build in better visualization tools. Let me kick out to a browser or whatever to more graphically explore what I'm looking at here. A lot of people use the CLI + a GUI tool to do this with git and it seems like something we could roll into one step.
Easier centralized commit message and other etiquette enforcement. Yes I can distribute a bunch of git hooks but it would be nice if when you cloned the repo you got all the checks to make sure that you are doing things the right way before you wasted a bunch of time only to get caught by the CI linter or commit message format checker. I'd also love some prompts like "hey this branch is getting pretty big" or "every commit must be of a type fix/feat/docs/style/test/ci whatever".
Read replica concept. I'd love to be able to point my CI/CD systems at a read replica box and preserve my primary VCS box for actual users. Primary server fires a webhook that triggers a build with a tag, hits the read replica which knows to pull from the primary if it doesn't have that tag. Be even more amazing if we could do some sort of primary/secondary model where I can set both in the config and if primary (cloud provider) is down I can keep pushing stuff up to somewhere that is backed up.

So I tried out a few competitors to see "is there any system moving more towards this direction".

SVN in 2024

My first introduction to version control was SVN (Subversion), which was pitched to me as "don't try to make a branch until you've worked here a year". However getting SVN to work as a newbie was extremely easy because it doesn't do much. Add, delete, copy, move, mkdir, status, diff, update, commit, log, revert, update -r, co -r were pretty much all the commands you needed to get rolling. Subversion has a very simple mental model of how it works which also assists with getting you started. It's effectively "we copied stuff to a file server and back to your laptop when you ask us to".

I have to say though, svn is a much nicer experience than I remember. A lot of the rough edges seem to have been sanded down and I didn't hit any of the old issues I used to. Huge props to the Subversion team for delivering great work.

Subversion Basics

Effectively your Subversion client commits all your files as a single atomic transaction to the central server as the basic function. Whenever that happens, it creates a new version of the whole project, called a revision. This isn't a hash, it's just a number starting at zero, so there's no confusion as a new user what is the "newer" or "older" thing. These are global numbers, not tied to a file, so it's the state of the world. For each individual file there are 4 states it can be in:

Unchanged locally + current remote: leave it alone
Locally changed + current remote: to publish the change you need to commit it, an update will do nothing
Unchanged locally + out of date remotely: svn update will merge the latest copy into your working copy
Locally changed + out of date remotely: svn commit won't work, svn update will try to resolve the problem but if it can't then the user will need to figure out what to do.

It's nearly impossible to "break" SVN because pushing up doesn't mean you are pulling down. This means different files and directories can be set to different revisions, but only when you run svn update does the whole world true itself up to the latest revision.

Working with SVN looks as follows:

Ensure you are on the network
Run svn update to get your working copy up to latest
Make the changes you need, remembering not to use OS tooling to move or delete files and instead use svn copy and svn move so it knows about the changes.
Run svn diff to make sure you want to do what you are talking about doing
Run svn update again, resolve conflicts with svn resolve
Feeling good? Hit svn commit and you are done.

Why did SVN get dumped then? One word: branches.

SVN Branches

In SVN a branch is really just a directory that you stick into where you are working. Typically you do it as a remote copy and then start working with it, so it looks more like you are copying the URL path to a new URL path. But to users they just look like normal directories in the repository that you've made. Before SVN 1.4 merging a branch required a masters degree and a steady hand, but they added an svn merge which made it a bit easier.

Practically you are using svn merge against the main to keep your branch in sync and then when you are ready to go, you run svn merge --reintegrate to push the branch to master. Then you can delete the branch, but if you need to read the log the URL of the branch will always work to read the log of. This was particularly nice with ticket systems where the URL was just the ticket number. But you don't need to clutter things up forever with random directories.

In short a lot of the things that used to be wrong with svn branches aren't anymore.

What's wrong with it?

So SVN breaks down IME when it comes to automation. You need to make it all yourself. While you do you have nuanced access control over different parts of a repo, in practice this wasn't often valuable. What you don't have is the ability to block someone from merging in a branch without some sort of additional controls or check. It also can place a lot of burden on the SVN server since nobody seems to ever update them even when you add a lot more employees.

Also the UIs are dated and the entire tooling ecosystem has started to rot from users leaving. I don't know if I could realistically recommend someone jump from git to svn right now, but I do think it has a lot of good ideas that moves us closer to what I want. It would just need a tremendous amount of UI/UX investment in terms of web to get it to where I would like using it over git. But I think if someone was interested in that work, the fundamental "bones" of Subversion are good.

Sapling

One thing I've heard from every former Meta engineer I've worked with is how much they miss their VCS. Sapling is that team letting us play around with a lot of those pieces, adopted for a more GitHub-centric world. I've been using it for my own personal stuff for a few months and have really come to fall in love with it. It feels like Sapling is specifically designed to be easy to understand, which is a delightful change.

A lot of the stuff is the same. You clone with sl clone, you check the status with sl status and you commit with sl commit. The differences that immediately stick out are the concept of stacks and the concept of the smartlog. So stacks are "collections of commits" and the idea is that from the command line I can issue PRs for those changes with sl pr submit with each GitHub PR being one of the commits. This view (obviously) is cluttered and annoying, so there's another tool that helps you see the changes correctly which is ReviewStack.

None of this makes a lot of sense unless I show you what I'm talking about. I made a new repo and I'm adding files to it. First I check the status:

❯ sl st
? Dockerfile
? all_functions.py
? get-roles.sh
? gunicorn.sh
? main.py
? requirements.in
? requirements.txt

Then I add the files:

sl add .
adding Dockerfile
adding all_functions.py
adding get-roles.sh
adding gunicorn.sh
adding main.py
adding requirements.in
adding requirements.txt

If I want a nicer web UI running locally, I run sl web and get this:

So I added all those files in as one Initial Commit. Great, let's add some more.

❯ sl
@  5a23c603a  4 seconds ago  mathew.duggan
│  feat: adding the exceptions handler
│
o  2652cf416  17 seconds ago  mathew.duggan
│  feat: adding auth
│
o  2f5b8ee0c  9 minutes ago  mathew.duggan
   Initial Commit

Now if I want to navigate this stack, I can just use sl prev which moves me up and down the stack:

sl prev 1
0 files updated, 0 files merged, 1 files removed, 0 files unresolved
[2f5b8e] Initial Commit

And that is also represented in my sl output

❯ sl
o  5a23c603a  108 seconds ago  mathew.duggan
│  feat: adding the exceptions handler
│
o  2652cf416  2 minutes ago  mathew.duggan
│  feat: adding auth
│
@  2f5b8ee0c  11 minutes ago  mathew.duggan
   Initial Commit

This also shows up in my local web UI

Finally the flow ends with sl pr to create Pull Requests. They are GitHub Pull Requests but they don't look like normal GitHub pull requests and you don't want to review them the same way. The tool you want to use for this is ReviewStack.

I stole their GIF because it does a good job

Why I like it

Sapling lines up with what I expect a VCS to do. It's easier to see what is going on, it's designed to work with a large team and it surfaces the information I want in a way that makes more sense. The commands make more sense to me and I've never found myself unable to do something I needed to do.

More specifically I like throwing away the idea of branches. What I have is a collection of commits that fork off from the main line of development, but I don't have a distinct thing I want named that I'm asking you to add. I want to take the main line of work and add a stack of commits to it and then I want someone to look at that collection of commits and make sure it makes sense and then run automated checks against it. The "branch" concept doesn't do anything for me and ends up being something I delete anyway.

I also like that it's much easier to undo work. This is something where I feel like git makes it really difficult to handle and uncommit, unamend, unhide, and undo in Sapling just work better for me and always seem to result in the behavior that I expected. Losing the staging area and focusing more on easy to use commands is a more logical design.

Why you shouldn't switch

If I love Sapling so much, what's the problem? So to get Sapling to the place I actually want it to be, I need more of the Meta special sauce running. Sapling works pretty well on top of GitHub, but what I'd love is to get:

Mononoke, which is the server-side component of Sapling (https://github.com/facebook/sapling/blob/main/eden/mononoke/README.md)
EdenFS (https://github.com/facebook/sapling/blob/main/eden/fs/docs/Overview.md)

These seem to be the pieces to get all the goodness of this system as outlined below

On-demand historical file fetching (remotefilelog, 2013)
File system monitor for faster working copy status (watchman, 2014)
In-repo sparse profile to shrink working copy (2015)
Limit references to exchange (selective pull, 2016)
On-demand historical tree fetching (2017)
Incremental updates to working copy state (treestate, 2017)
New server infrastructure for push throughput and faster indexes (Mononoke, 2017)
Virtualized working copy for on-demand currently checked out file or tree fetching (EdenFS, 2018)
Faster commit graph algorithms (segmented changelog, 2020)
On-demand commit fetching (2021)

I'd love to try all of this together (and since there is source code for a lot of it, I am working on trying to get it started) but so far I don't think I've been able to see the full Sapling experience. All these pieces together would provide a really interesting argument for transitioning to Sapling but without them I'm really tacking a lot of custom workflow on top of GitHub. I think I could pitch migrating wholesale from GitHub to something else, but Meta would need to release more of these pieces in an easier to consume fashion.

Scalar

Alright so until Facebook decided to release the entire package end to end, Sapling exists as a great stack on top of GitHub but not something I could (realistically) see migrating a team to. Can I make git work more the way I want to? Or at least can I make it less of a pain to manage all the individual files?

Microsoft has a tool that does this, VFS for Git, but it's Windows only so that does nothing for me. However they also offer a cross-platform tool called Scalar that is designed to "enable working with large repos at scale". It was originally a Microsoft technology and was eventually moved to git proper, so maybe it'll do what I want.

What scalar does is effectively set all the most modern git options for working with a large repo. So this is the built-in file-system monitor, multi-pack index, commit graphs, scheduled background maintenance, partial cloning, and clone mode sparse-checkout.

So what are these things?

The file system monitor is FSMonitor, a daemon that tracks changes to files and directories from the OS and adds them to a queue. That means git status doesn't need to query every file in the repo to find changes.
Take the git pack directory with a pack file and break it into multiples.
Commit graphs which from the docs:
- " The commit-graph file stores the commit graph structure along with some extra metadata to speed up graph walks. By listing commit OIDs in lexicographic order, we can identify an integer position for each commit and refer to the parents of a commit using those integer positions. We use binary search to find initial commits and then use the integer positions for fast lookups during the walk."
Finally clone mode sparse-checkout. This allows people to limit their working directory to specific files

The purpose of this tool is to create an easy-mode for dealing with large monorepos, with an eye towards monorepos that are actually a collection of microservices. Ok but does it do what I want?

Why I like it

Well it's already built into git which is great and it is incredibly easy to use and get started with. Also it does some of what I want. Taking a bunch of existing repos and creating one giant monorepo, the performance was surprisingly good. The sparse-checkout means I get to designate what I care about and what I don't and also solves the problem of "what if I have a giant directory of binaries that I don't want people to worry about" since it follows the same pattern matching as .gitignore

Now what it doesn't do is radically change what git is. You could grow a repo to much much larger with these defaults set, but it's still handling a lot of things locally and requiring me to do the work. However I will say it makes a lot of my complaints go away. Combined with the gh CLI tool for PRs and I can cobble together a reasonably good workflow that I really like.

So while this is definitely the pattern I'm going to be adopting from now on (monorepo full of microservices where I manage scale with scalar), I think it represents how far you can modify git as an existing platform. This is the best possible option today but it still doesn't get me to where I want to be. It is closer though.

You can try it out yourself: https://git-scm.com/docs/scalar

Conclusion

So where does this leave us? Honestly, I could write another 5000 words on this stuff. It feels like as a field we get maddeningly close to cracking this code and then give up because we hit upon a solution that is mostly good enough. As workflows have continued to evolve, we haven't come back to touch this third rail of application design.

Why? I think the people not satisfied with git are told that is a result of them not understanding it. It creates a feeling that if you aren't clicking with the tool, then the deficiency is with you and not with the tool. I also think programmers love decentralized designs because it encourages the (somewhat) false hope of portability. Yes I am entirely reliant on GitHub actions, Pull Requests, GitHub access control, SSO, secrets and releases but in a pinch I could move the actual repo itself to a different provider.

Hopefully someone decides to take another run at this problem. I don't feel like we're close to done and it seems like, from playing around with all these, that there is a lot of low-hanging optimization fruit that anyone could grab. I think the primary blocker would be you'd need to leave git behind and migrate to a totally different structure, which might be too much for us. I'll keep hoping it's not though.

Corrections/suggestions: https://c.im/@matdevdug

IAM Is The Worst

March 15, 2024 in DevOps

Imagine your job was to clean a giant office building. You go from floor to floor, opening doors, collecting trash, getting a vacuum out of the cleaning closet and putting it back. It's a normal job and part of that job is someone gives you a key. The key opens every door everywhere. Everyone understands the key is powerful, but they also understand you need to do your job.

Then your management hears about someone stealing janitor keys. So they take away your universal key and they say "you need to tell Suzie, our security engineer, which keys you need at which time". But the keys don't just unlock one door, some unlock a lot of doors and some desk drawers, some open the vault (imagine this is the Die Hard building), some don't open any doors but instead turn on the coffee machine. Obviously the keys have titles, but the titles mean nothing. Do you need the "executive_floor/admin" key or the "executive_floor/viewer" key?

But you are a good employee and understand that security is a part of the job. So you dutifully request the keys you think you need, try to do your job, open a new ticket when the key doesn't open a door you want, try it again, it still doesn't open the door you want so then there's another key. Soon your keyring is massive, just a clanging sound as you walk down the hallway. It mostly works, but a lot of the keys open stuff you don't need, which makes you think maybe this entire thing was pointless.

The company is growing and we need new janitors, but they don't want to give all the new janitors your key ring. So they roll out a new system which says "now the keys can only open doors that we have written down that this key can open, even if it says "executive_floor/admin". The problem is people move offices all the time, so even if the list of what doors that key opened was true when it was issued, it's not true tomorrow. The Security team and HR share a list, but the list sometimes drifts or maybe someone moves offices without telling the right people.

Soon nobody is really 100% sure what you can or cannot open, including you. Sure someone can audit it and figure it out, but the risk of removing access means you cannot do your job and the office doesn't get cleaned. So practically speaking the longer someone works as a janitor the more doors they can open until eventually they have the same level of access as your original master key even if that wasn't the intent.

That's IAM (Identity and access management) in cloud providers today.

Stare Into Madness

It's Not Natural, It's Just Simple: Food Branding Co-Opts Another Mean

Honestly I don't even know why I'm complaining. Of course it's entirely reasonable to expect anyone working in a cloud environment to understand the dozen+ ways that they may or may not have access to a particular resource. Maybe they have permissions at a folder level, or an org level, but that permission is gated by specific resources.

Maybe they don't even have access but the tool they're interacting with the resource with has permission to do it, so they can do it but only as long as they are SSH'd into host01, not if they try to do it through some cloud shell. Possibly they had access to it before, but now they don't since they moved teams. Perhaps the members of this team were previously part of some existing group but now new employees aren't added to that group so some parts of the team can access X but others cannot. Or they actually have the correct permissions to the resource but the resource is located in another account and they don't have the right permission to traverse the networking link between the two VPCs.

Meanwhile someone is staring at these flowcharts trying to figure out what in hell is even happening here. As someone who has had to do this multiple times in my life, let me tell you the real-world workflow that ends up happening.

Developer wants to launch a new service using new cloud products. They put in a ticket for me to give them access to the correct "roles" to do this.
I need to look at two elements of it, both what are the permissions the person needs in order to see if the thing is working and then the permissions the service needs in order to complete the task it is trying to complete.
So I go through my giant list of roles and try to cobble together something that I think based on the names will do what I want. Do you feel like a roles/datastore.viewer or more of a roles/datastore.keyVisualizerViewer? To run backups is roles/datastore.backupsAdmin sufficient or do I need to add roles/datastore.backupSchedulesAdmin in there as well?
They try it and it doesn't work. Reopen the ticket with "I still get authorizationerror:foo". I switch that role with a different role, try it again. Run it through the simulator, it seems to work, but they report a new different error because actually in order to use service A you need to also have a role in service B. Go into bathroom, scream into the void and return to your terminal.
We end up cobbling together a custom role that includes all the permissions that this application needs and the remaining 90% of permissions are something it will never ever use but will just sit there as a possible security hole.
Because /* permissions are the work of Satan, I need to scope it to specific instances of that resource and just hope nobody ever adds a SQS queue without....checking the permissions I guess. In theory we should catch it in the non-prod environments but there's always the chance that someone messes up something at a higher level of permissions that does something in non-prod and doesn't exist in prod so we'll just kinda cross our fingers there.

GCP Makes It Worse

So that's effectively the AWS story, which is terrible but at least it's possible to cobble together something that works and you can audit. Google looked at this and said "what if we could express how much we hate Infrastructure teams as a service?" Expensive coffee robots were engaged, colorful furniture was sat on and the brightest minds of our generation came up with a system so punishing you'd think you did something to offend them personally.

Google looked at AWS and said "this is a tire fire" as corporations put non-prod and prod environments in the same accounts and then tried to divide them by conditionals. So they came up with a folder structure:

The problem is that this design encourages unsafe practices by promoting "groups should be set at the folder level with one of the default basic roles". It makes sense logically at first that you are a viewer, editor or owner. But as GCP adds more services this model breaks down quickly because each one of these encompasses thousands upon thousands of permissions. So additional IAM predefined roles were layered on.

People were encouraged to move away from the basic roles and towards the predefined roles. There are ServiceAgent roles that were designated for service accounts, aka the permissions you actual application has and then everything else. Then there are 1687 other roles for you to pick from to assign to your groups of users.

The problem is none of this is actually best practice. Even when assigning users "small roles", we're still not following the principal of least privilege. Also the roles don't remain static. As new services come online permissions are added to roles.

The above is an automated process that pulls down all the roles from the gcloud CLI tool and updates them for latest. It is a constant state of flux with roles with daily changes. It gets even more complicated though.

You also need to check the launch stage of a role.

Custom roles include a launch stage as part of the role's metadata. The most common launch stages for custom roles are ALPHA, BETA, and GA. These launch stages are informational; they help you keep track of whether each role is ready for widespread use. Another common launch stage is DISABLED. This launch stage lets you disable a custom role.

We recommend that you use launch stages to convey the following information about the role:

EAP or ALPHA: The role is still being developed or tested, or it includes permissions for Google Cloud services or features that are not yet public. It is not ready for widespread use.
BETA: The role has been tested on a limited basis, or it includes permissions for Google Cloud services or features that are not generally available.
GA: The role has been widely tested, and all of its permissions are for Google Cloud services or features that are generally available.
DEPRECATED: The role is no longer in use.

Who Cares?

Why would anyone care if Google is constantly changing roles? Well it matters because with GCP to make a custom role, you cannot combine predefined roles. Instead you need to go down to the permission level to list out all of the things those roles can do, then feed that list of permissions into the definition of your custom role and push that up to GCP.

In order to follow best practices this is what you have to do. Otherwise you will always be left with users that have a ton of unused permissions along with the fear of a security breach allowing someone to execute commands in your GCP account through an applications service account that cause way more damage than the actual application justifies.

So you get to build automated tooling which either queries the predefined roles for change over time and roll those into your custom roles so that you can assign a user or group one specific role that lets them do everything they need. Or you can assign these same folks multiple of the 1600+ predefined roles, accept that they have permissions they don't need and also just internalize that day to day you don't know how much the scope of those permissions have changed.

The Obvious Solution

Why am I ranting about this? Because the solution is so blindly obvious I don't understand why we're not already doing it. It's a solution I've had to build, myself, multiple times and at this point am furious that this keeps being my responsibility as I funnel hundreds of thousands of dollars to cloud providers.

What is this obvious solution? You, an application developer, need to launch a new service. I give you a service account that lets you do almost everything inside of that account along with a viewer account for your user that lets you go into the web console and see everything. You churn away happily, writing code that uses all those new great services. Meanwhile, we're tracking all the permissions your application and you are using.

At some time interval, 30 or 90 or whatever days, my tool looks at the permissions your application has used over the last 90 days and says "remove the global permissions and scope it to these". I don't need to ask you what you need, because I can see it. In the same vein I do the same thing with your user or group permissions. You don't need viewer everywhere because I can see what you've looked at.

Both GCP and AWS support this and have all this functionality baked in. GCP has the role recommendations which tracks exactly what I'm talking about and recommends lowering the role. AWS tracks the exact same information and can be used to do the exact same thing.

What if the user needs different permissions in a hurry?

This is not actually that hard to account for and again is something I and countless others have been forced to make over and over. You can issue expiring permissions in both situations where a user can request a role be temporarily granted to them and then it disappears in 4 hours. I've seen every version of these, from Slack bots to websites, but they're all the same thing. If user is in X group they're allowed to request Y temporary permissions. OR if the user is on-call as determined with an API call to the on-call provider they get more powers. Either design works fine.

That seems like a giant security hole

Compared to what? Team A guessing what Team B needs even though they don't ever do the work that Team B does? Some security team receiving a request for permissions and trying to figure out if the request "makes sense" or not? At least this approach is based on actual data and not throwing darts at a list of IAM roles and seeing what "feels right".

Conclusion

IAM started out as an easy idea that as more and more services were launched, started to become nightmarish to organize. It's too hard to do the right thing now and it's even harder to do the right thing in GCP compared to AWS. The solution is not complicated. We have all the tools, all the data, we understand how they fit together. We just need one of the providers to be brave enough to say "obviously we messed up and this legacy system you all built your access control on is bad and broken". It'll be horrible, we'll all grumble and moan but in the end it'll be a better world for us all.

Feedback: https://c.im/@matdevdug

State Of The Blog

March 13, 2024

Just a quick opportunity to check in with you all and say thanks!

Don't worry, nothing is changing. I just wanted to write this as a quick thank you to all of you for checking in with my little site. I also wanted to address a few questions I've gotten now that we've hit a certain threshold of traffic so I can direct people here in the future with questions. If I miss something hit me up on Mastodon here: https://c.im/@matdevdug

20k Average Weekly Visitors!

One of the milestones I never thought I'd hit was 10,000 average weekly visitors but we have blown past that without me noticing. Here are the last 30 days stats for those interested in such things. This obviously has a giant spike throwing the data off but if you look at a typical 30 day period we're just at 20,000 a week average.

I'm glad so many of you have found my little corner of the internet. It's been a pleasure to interact with (almost) all of you and for the vanishingly small percentage that have been unpleasant, we've had words.

Cost

Thanks to Cloudflare, running this site has not been expensive. We're still at about $6 a month to run. I'm running on the Hetzner ARM CAX11 instance class and have been really impressed with performance. Typically folks go with ARM-class instances for cost, but this thing has been a beast in terms of workload with zero ARM-specific issues I can point to. This mirrors my experience with AWS ARM instances, but in case you were considering doing the same thing, you can easily scale with even the cheapest instance.

Monetization

I've gotten a few offers to monetize the site, mostly either running ads (which for this audience would be a giant waste of time and ruin the visual of the site) or by running "promoted posts". After thinking it over I decided I don't want to earn any money on the site. It's fun to write, hopefully people enjoy reading it and I'm lucky enough to be at a point in my life where $10 a month is not a sum of money I miss.

If that ever changes in the future, I'll be sure to mark the posts as endorsed or paid for in some way so that nobody feels duped. But I haven't been interested so far.

Software

This is a Debian server that's initialized with cloud-init to set up a non-root user, install Docker and Docker compose, pull the Ghost images along with Nginx and then attach the volume. I also pull the Cloudflare certificate and insert that inside the Nginx container so I can have a long-running SSL certificate and let them handle the edge certificate rotation.

Previously I used Caddy in front of Ghost but did run into a few times when under load it seemed to struggle and required a restart. In general I had more problems than I expected with Caddy, which doesn't make it a bad webserver, but it is difficult to compete with the completely idiot-proof nature of Nginx. Plus since I'm not handling user-facing SSL certificates, the built-in SSL certificate functionality ended up not being much of a perk.

Ghost

As a platform Ghost has been pretty good with one exception, which is email newsletters. I'll touch on those later. It's held up well under load, with frequent updates. I don't use most of the admin panel which is really geared towards building an email newsletter business. I'm not interested in doing that, so a lot of the functionality is wasted on me.

However it is quite good at the basic stuff, which is you write text in their editor, hit publish and it goes out to the world. Most of the Dashboard stuff is pointless to me, with links to YouTubers and optimizing for SEO which I haven't done at all. Most of the new features they add has nothing to do with me and in retrospect I might have been better off with a static site.

In general though if you are interested in starting a blog with a focus on building a newsletter-based business, Ghost is great. For this it works well enough with some optimizations.

Email Newsletter

You may have noticed that the Subscribe button has disappeared. While I appreciate that people liked getting an email with the posts in it, the cost of sending emails exceeded the cost of hosting the rest of the platform by a lot. Ghost relies on Mailgun for sending newsletter and while it can use SMTP for transaction emails, the cost of Mailgun exceeds the value of what I get out of sending posts as newsletters. (If I post multiple times a month, we'd be looking at $90 a month for emails alone which is too rich for me). I also don't love having a database full of peoples names and email addresses in it, since the best way to prevent a data leak is to not have the data to begin with.

If anyone in the future complains I'll likely set this up: https://github.com/ItzNotABug/ghosler so I can use the much cheaper SMTP options. But so far the response to removing it has been small. For those reading this I would probably disable it from the get-go if started a new site OR set it up with SMTP from launch. Mailgun is too expensive for what it provides which was a pretty underwhelming user experience full of nested menus. (Insert my rant that transactional email API services are a scam business based on the false assertion that the reputation scores of senders are impossible to set up from scratch despite me having seen it done multiple times with IPv6 addresses).

However folks seem to be using RSS successfully, which is great. Some homegrown clients aren't intelligently checking whether there are new feeds or not, simply grabbing the entire RSS feed with every check. It's not a crisis by any means, but if that is you, maybe add a check for "is pubDate not todays date for the latest entry and if not, then maybe don't pull down the entire feed". But in general I strongly prefer RSS because the cost per user is extremely small and there are no personal data concerns around it. You are in control of your own feeds.

It does suck that less technical people still seem to struggle to find a functional RSS reader. It's still an unsolved problem as far as I can tell. I have many I like and recommend, but I constantly hear how hard it is to get "set up". If that's you, maybe check out Feedrabbit: https://feedrabbit.com/ to get RSS to emails.

Downsides

I don't love the Ghost team dumping sqlite for MySQL 8. Especially because there is no real scaling options here. So I'm not sure what the perk is for moving away from sqlite towards MySQL 8 if we're never going to be able to support multiple instances hosting the same site.
A lot of the technical work lately seems more in the direction of the headless CMS route, which is fine but does nothing for me.
Editor bugs. I get a lot of them. Markdown will stop working then resume working with a new update. Sometimes commands like /image will trigger the behavior I expect and sometimes it won't. The whole thing is a bit mysterious.

Nginx

Nothing to report here, just the official Nginx docker image with the following config. I have Authenticated Origin Pull set up so that I know all my traffic is coming from Cloudflare.

There's a couple of things happening here.

We have the proxy_cache setup so that Nginx can assist with any massive spikes in traffic.
This config attempts to both force SSL connections with the Strict-Transport-Security "max-age=63072000; includeSubdomains"; and also cache the SSL session parameters.
This is effectively my all-purpose Nginx configuration that I use for a lot of different things. There are a few sub-optimal things here (I don't think you need to do ssl_ciphers if you remove TLSv1) but in general this has been a pretty battle-tested config.

map $sent_http_content_type $expires {
    default                    off;
    text/css                   max;
    application/javascript     max;
    ~image/                    max;
}

server {
    listen 80;
    listen [::]:80;
    server_name matduggan.com www.matduggan.com;
    return 302 https://$server_name$request_uri;
}
proxy_cache_path /tmp/cache  levels=1:2    keys_zone=STATIC:10m inactive=24h  max_size=1g;
client_max_body_size 100M;

server {

    # SSL configuration

    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    charset UTF-8;
    gzip on;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 5m;
    ssl_prefer_server_ciphers on;
    ssl_ciphers ECDH+AESGCM:ECDH+AES256:ECDH+AES128:DH+3DES:!ADH:!AECDH:!MD5;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_buffer_size 4k;
    add_header Strict-Transport-Security "max-age=63072000; includeSubdomains";
    ssl_certificate         /etc/ssl/cert.pem;
    ssl_certificate_key     /etc/ssl/key.pem;
    ssl_client_certificate /etc/ssl/cloudflare.crt;
    ssl_verify_client on;
    expires $expires;


    server_name matduggan.com www.matduggan.com;

    location / {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        add_header X-Cache-Status $upstream_cache_status;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $http_host;
        proxy_buffering        on;
        proxy_cache            STATIC;
        proxy_cache_valid      200  1d;
        proxy_cache_use_stale  error timeout invalid_header updating
                               http_500 http_502 http_503 http_504;
        proxy_pass http://127.0.0.1:8080;
        proxy_redirect off;
    }

    location ~ ^/(ghost/|p/)/ {
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
        proxy_pass http://127.0.0.1:8080;
    }

}

Considering Starting a Site?

You should! If you run across this and need some help, let me know at https://c.im/@matdevdug. Glad to provide pointers.

I think that addresses all the questions I typically get. If I missed something let me know.

K8s Service Meshes: The Bill Comes Due

March 01, 2024 in kubernetes

When you start using Kubernetes one of the first suggestions you'll get is to install a service mesh. This, of course, on top of the 900 other things you need to install. For those unaware, everything in k8s is open to everything else by default when you start and traffic isn't encrypted between services. Since encrypting traffic between services and controlling what services can talk to which requires something like a JWT and client certificates, teams aren't typically eager to take on this work even though its increasingly a requirement of any stack.

Infrastructure teams can usually implement a feature faster than every app team in a company, so this tends to get solved by them. Service meshes exploded in popularity as it became clear they were easy ways to implement enforced encryption and granular service to service access control. You also get better monitoring and some cool features like circuit breaking and request retries for "free". As the scale of deployments grew with k8s and started to bridge multiple cloud providers or a cloud provider and a datacenter, this went from "nice to have" to an operational requirement.

What is a service mesh?

Service-to-service communication before and after service mesh implementation

Service meshes let you do a few things easily

Easy metrics on all service to service requests since it has a proxy that knows success/failure/RTT/number of requests
Knowledge that all requests are encrypted with automated rotation
Option to ensure only encrypted requests are accepted so you can have k8s in the same VPC as other things without needing to do firewall rules
Easy to set up network isolation at a route/service/namespace level (great for k8s hosting platform or customer isolation)
Automatic retries, global timeout limits, circuit breaking and all the features of a more robustly designed application without the work
Reduces change failure rate. With a proxy sitting there holding and retrying requests, small blips don't register anymore to the client. Now they shouldn't anyway if you set up k8s correctly but its another level of redundancy.

This adds up to a lot of value for places that adopt them with a minimum amount of work since they're sidecars injected into existing apps. For the most part they "just work" and don't require a lot of knowledge to keep working.

However, it's 2024 and stuff that used to be free isn't anymore. The free money train from VCs has ended and the bill has come due. Increasingly, this requirement for deploying production applications to k8s is going to come with a tax that you need to account for when budgeting for your k8s migration and determining whether it is worth it. Since December 2023 the service mesh landscape has changed substantially and it's a good time for a quick overview of what is going on.

NOTE: Before people jump down my throat, I'm not saying these teams shouldn't get paid. If your tool provides real benefits to businesses it isn't unreasonable to ask them to materially contribute to it. I just want people to be up to speed on what the state of the service mesh industry is and be able to plan accordingly.

Linkerd

My personal favorite of the service meshes, Linkerd is the most idiot proof of the designs. It consists of a control plane and a data plane with a monitoring option included. It looks like this:

Recently Linkerd has announced a change to their release process, which I think is a novel approach to the problem of "getting paid for your work". For those unaware, Linkerd has always maintained a "stable" and an "edge" version of their software, along with an enterprise product. As of Linkerd 2.15.0, they will no longer publish stable releases. Instead the concept of a stable release will be bundled into their Buoyant Enterprise for Linkerd option. You can read the blog post here.

Important to note that unlike some products, Linkerd doesn't just take a specific release of Edge and make it Enterprise. There are features that make it to Edge that never get to Enterprise, Stable is also not a static target (there are patch releases to the Stable branch as well), so these are effectively three different products. So you can't do the workaround of locking your org to specific Edge releases that match up with Stable/Enterprise.

Pricing

Update: Linkerd changed their pricing to per-pod. You can see it here: https://buoyant.io/pricing. I'll leave the below for legacy purposes but the new pricing addresses my concerns.

Buoyant has selected the surprisingly high price of $2000 a cluster per month. The reason this is surprising to me is the model for k8s is increasingly moving towards more clusters with less in a cluster, vs the older monolithic cluster where the entire company lives in one. This pricing works against that goal and removes some of the value of the service mesh concept.

If the idea of the Linkerd team is that orgs are going to stick with fewer, larger clusters, then it makes less sense to me to go with Linkerd. With a ton of clusters, I don't want to think about IP address ranges or any of the east to west networking designs, but if I just have like 2-3 clusters that are entirely independent of each other, then I can get a similar experience to Linkerd with relatively basic firewall rules, k8s network policies and some minor changes to an app to encrypt connections. There's still value to Linkerd, but the per-cluster pricing when I was clearly fine hosting the entire thing myself before is strange.

$2000 a month for a site license makes sense to me to get access to enterprise. $2000 a month per cluster when Buoyant isn't providing me with dashboards or metrics on their side seems like they picked an arbitrary number out of thin air. There's zero additional cost for them per cluster added, it's just profit. It feels weird and bad. If I'm hosting and deploying everything and the only support you are providing me is letting me post to the forum, where do you come up with the calculation that I owe you per cluster regardless of size?

Now you can continue to use Linkerd, but you need to switch to Edge. In my experience testing it, Edge is fine. It's mostly production ready, but there are sometimes features which you'll start using and then they'll disappear. I don't think it'll matter for most orgs most of the time, since you aren't likely constantly rolling out service mesh upgrades. You'll pick a version of Edge, test it, deploy it and then wait until you are forced to upgrade or you see a feature you like.

You also can't just buy a license, you need to schedule a call with them to buy a license with discounts available before March 21st, 2024. I don't know about you but the idea of needing to both buy a license and have a call to buy a license is equally disheartening. Maybe just let me buy it with the corporate card or work with the cloud providers to let me pay you through them.

Cilium

Cilium is the new cool kid on the block when it comes to service meshes. It eliminates the sidecar container, removing a major source of failure in the service mesh design. You still get encryption, load balancing, etc but since it uses eBPF and is injected right into the kernel you remove that entire element of the stack.

You also get a LOT with Cilium. It is its own CNI, which in my testing has amazing performance. It works with all the major cloud providers, it gives you incredibly precise network security and observability. You can also replace Kube-proxy with cilium. Here is how it works in a normal k8s cluster with Kube-proxy:

Effectively Kube-proxy works with the OS filtering layer (typically iptables) to allow network communication to your pods. This is a bit simplified but you get the idea.

With the BPF Kube-proxy replacement we remove a lot of pieces in that design.

This is only a tiny fraction of what Cilium does. It has developed a reputation for excellence, where if you full adopt the stack you can replace almost all the cloud-provider specific pieces for k8s to a generic stack that works across providers at a lower cost and high performance.

the UI for seeing service relationships in Cilium is world-class

A Wild Cisco Appears

Cisco recently acquired Isovalent in December of 2023, apparently to get involved in the eBPF space and also likely to augment their acquisition of Splunk. Cilium provides the metrics and traces as well as generating great flow logs and Splunk ingests them for you. If you are on Linkerd and considering moving over to Cilium to avoid paying, you should be aware that with Cisco having purchased them the bill is inevitable.

You will eventually be expected to pay and my guess based on years of ordering Cisco licenses and hardware is you'll be expected to pay a lot. So factor that in when considering Cilium or migrating to Cilium. I'll go out on a limb here and predict that Cilium is priced as a premium multi-cloud product with a requirement of the enterprise license for many of the features before the end of 2024. I will also predict that Linkerd ends up as the cheapest option on the table by the end of 2024 for most orgs.

Take how expensive Splunk is and extrapolate that into a service mesh license and I suspect you'll be in the ballpark.

Istio

Istio, my least favorite service mesh. Conceptually Istio and Linkerd share many of the same ideas. Both platforms use a two-part architecture now: a control plane and a data plane. The control plane manages the data plane by issuing configuration updates to the proxies in the data plane. The control plane also provides security features such as mTLS encryption and authentication.

Istio uses Envoy proxies vs rolling their own like Linkerd and tends to cover more possible scenarios than Linkerd. Here's a feature comparison:

Istio's primary differences are that it supports VMs, runs its own Ingress Controller and is 10x the complexity of setting up any other option. Istio has become infamous among k8s infrastructure staff as being the cause of more problems than any other part of the stack. Now many of these can be solved with minor modifications to the configuration (there is absolutely nothing structurally wrong with Istio), but since a service mesh failure can be "the entire cluster dies", it's tricky.

The reality is Istio is free and open source, but you pay in other ways. Istio has so many components and custom resources that can interact with each other in surprising and terrifying ways that you need someone in your team who is an Istio expert. Otherwise any attempt to create a self-service ecosystem will result in lots of downtime and tears. You are going to spend a lot of time in Istio tracking down performance problems, weird network connectivity issues or just strange reverse proxy behavior.

Some of the earlier performance complaints of Envoy as the sidecar have been addressed, but I still hear of problems when organizations scale up to a certain number of requests per second (less than I used to). The cost for Istio, to me, exceeds the value of a service mesh most of the time. Especially since Linkerd has caught up with most of the traffic management stuff like circuit breaking.

Consul Connect

The next service mesh we'll talk about is Consul Connect. If Istio is highly complicated to set up and Linkerd is easiest but fewest knobs to turn, Consul sits right in the middle. It has a great story when it comes to observability and has performance right there with Linkerd and superior to Istio.

Consul is also very clearly designed to be deployed by large companies, with features around stability and cross-datacenter design that only apply to the biggest orgs. However people who have used it seem to really like it, based on the chats I've had. The ability to use Terraform with Consul with its Consul-Terraform-Sync functionality to get information about services and interact with those services at a networking level is massive, especially for teams managing thousands of nodes or where pods need strict enforced isolation (such as SaaS products where customer app servers can't interact).

Pricing

Consul starts at $0.027 an hour, but in practice your price is gonna be higher than that. It goes up based on how many instances and clusters you are running. It's also not available on GCP, just AWS and Azure. You also don't get support with that, seemingly needing to upgrade your package to ask questions.

I'm pretty down on Hashicorp after the Terraform change, but people have reported a lot of success with Consul so if you are considering a move, this one makes a lot of sense.

Cloud Provider Service Meshes

GCP has Anthos (based on Istio) as part of their GKE Enterprise offering, which is $.10/cluster/hour. It comes with a bunch of other features but in my testing was a much easier way to run Istio. Basically Istio without the annoying parts. AWS App Mesh still uses Envoy but has a pretty different architecture. However it comes with no extra cost which is nice.

AWS App Mesh is also great for orgs that aren't all-in for k8s. You can bridge systems like ECS and traditional EC2 with it, meaning its a super flexible tool for hybrid groups or groups where the k8s-only approach isn't a great fit.

Azure uses Open Service Mesh which is now a deprecated product. Despite that, it's still their recommend solution according to a Google search. Link

Once again the crack team at Azure blows me away with their attention to detail. Azure has a hosted Istio add-on in preview now and presumably they'll end up doing something similar to GKE with Anthos. You can see that here.

What do you need to do

So the era of the free Service Mesh is coming to a close. AWS has decided to use it as an incentive to stay on their platform, Linkerd is charging you, Cilium will charge you At Some Point and Consul is as far from free now as it gets. GKE and Azure seem to be betting on Istio where they move the complexity into their stack, which makes sense. This is a reflection of how valuable these meshes are for observability and resilience as organizations transition to microservices and more specifically split stacks, where you retain your ability to negotiate with your cloud provider by running things in multiple places.

Infrastructure teams will need to carefully pick what horse they want to back moving forward. It's a careful balance between cloud lock-in vs flexibility at the cost of budget or complexity. There aren't any clear-cut winners in the pack, which wasn't true six months ago when the recommendation was just Linkerd or Cilium. If you are locked into either Linkerd or Cilium, the time to start discussing a strategy moving forward is probably today. Either get ready for the bill, commit to running Edge with more internal testing, or brace yourself for a potentially much higher bill in the future.

Python Dependencies Are Fixable

February 23, 2024

I like Python. I've had a lot of success with it on projects large and small. It is fast enough for most of the things I need and when it isn't, migrating from it to a more performant language isn't challenging. The depth of the standard library has been incredible for stable long-living code. However the one thing I hear often when discussing Python with younger programmers is well the dependency management is so bad I wouldn't bother with the language. Lately it seems the narrative is now evolving into "it is so broken that we need a new package system to fix it", which to me is the programming version of Spock dying in the warp core. Let's make absolutely sure we have no other options.

Kirk (William Shatner) bids farewell to Spock (Leonard Nimoy) in the emotional finale of Star Trek II: The Wrath of Khan. (Photo: Paramount/Courtesy Everett Collection)

The problem here isn't one of engineering. We have all the solutions to solve this problem for the majority of users. The issue is an incorrect approach to managing defaults. Pip, like many engineering-led projects, doesn't account for the power of defaults. Engineers tend towards maintaining existing behavior and providing the tooling for people to do it correctly. That's the wrong mentality. Experts who drive the project should be adjusting the default behavior to follow best practices.

Defaults are so important and I think so intimidating to change that this decision has been pushed back for years and years. If we have a better user experience for people and we know this is what they should be using, we should not expect users to discover that best way on their own. You have to make them opt out of the correct flow, not discover and opt in to the right way to do things. Change is scary though and maintainers don't have a massive corporate structure to hide behind. Whatever ire the change generates isn't directed at Faceless Corporation PR, it's directly at the people who make the decision.

Golang taught us this lesson. I work a lot with Golang at work and on some side projects. It is an amazing language to show the power of defaults and the necessity of experts pushing users forward. Golang code at every job looks like code at every other job, which is the direct result of intentional design. Shipping gofmt bundled in with the language increased the quality and readability of golang everywhere. Decentralizing dependency management became a "of course" moment for people when they tried it. Keep the language simple in the face of demands for increased complexity has preserved the appeal. The list goes on and on.

Pypa needs to push the ecosystem forward or give up on the project and officially endorse a new approach. Offering people 400 options is destroying confidence in the core language. The design mentality has to change from "it isn't a problem if there is a workaround" to the correct approach which is for most users the default is the only option they'll ever try.

Why it isn't that broken

Why do I think that we don't need to start fresh? Here's the workflow I use, which is not unique to me. I start a new Python repo and immediately make a venv with python -m venv venv. Then I activate it with source /venv/bin/activate and start doing whatever I want. I write all my code, feel pretty good about it and decide to lock down my dependencies.

I run pip freeze > requirement.in which gives me all the packages I have installed with their versions. It's 2024 so I need more security and confidence than a list of packages with a version number. The easiest way to get that is with package hashes, which is easy to do with pip-tools. pip-compile --generate-hashes requirements.in outputs a requirements.txt with the hashes I want along with the dependencies of the packages.

build==1.0.3 \
    --hash=sha256:538aab1b64f9828977f84bc63ae570b060a8ed1be419e7870b8b4fc5e6ea553b \
    --hash=sha256:589bf99a67df7c9cf07ec0ac0e5e2ea5d4b37ac63301c4986d1acb126aa83f8f
    # via
    #   -r requirements.in
    #   pip-tools
cachetools==5.3.2 \
    --hash=sha256:086ee420196f7b2ab9ca2db2520aca326318b68fe5ba8bc4d49cca91add450f2 \
    --hash=sha256:861f35a13a451f94e301ce2bec7cac63e881232ccce7ed67fab9b5df4d3beaa1
    # via
    #   -r requirements.in
    #   google-auth
certifi==2023.11.17 \
    --hash=sha256:9b469f3a900bf28dc19b8cfbf8019bf47f7fdd1a65a1d4ffb98fc14166beb4d1 \
    --hash=sha256:e036ab49d5b79556f99cfc2d9320b34cfbe5be05c5871b51de9329f0603b0474
    # via
    #   -r requirements.in
    #   aioquic
    #   aioquic-mitmproxy
    #   mitmproxy
    #   requests

Now I know all the packages I have, why I have the packages I have and also the specific hashes of those packages so I don't need to worry about supply chain issues. My Dockerfile is also pretty idiot-proof.

FROM python:3.12-slim

# Create a non-root user
RUN groupadd -r nonroot && useradd -r -g nonroot nonroot
WORKDIR /app

COPY requirements.txt .

RUN pip3 install -r requirements.txt

COPY . .

RUN chown -R nonroot:nonroot /app

USER nonroot

ENTRYPOINT ["./gunicorn.sh"]

Yay its running. I can feel pretty confident handing this project over to someone new and having them run into minimal problems getting all of this running. Need to check for updates? Not a big deal.

pip-review
Flask==3.0.2 is available (you have 3.0.0)
Jinja2==3.1.3 is available (you have 3.1.2)
MarkupSafe==2.1.5 is available (you have 2.1.3)
pip==24.0 is available (you have 23.3.1)

Basically if you know the happy path, there are no serious problems here. But you need to know all these steps, which are documented in random places all over the internet. How did we get here and what can be done to fix it?

Why do people think it is so bad

What is the combination of decisions that got us to this place? Why is the average users opinion so low? I think its everything below.

Bare pip sucks for daily tasks. We can't declare a minimum version of Python, we don't get any information as to dependency relationships in the file, we don't have a concept of developer dependency vs production dependency, we don't have hashes so we're very open to upstream attacks, it's slow, it's not clear how to check for updates, there's no alert for a critical update, the list goes on and on.
- What pip-compile does should always be the minimum. It should have been the minimum years and years ago.
- Where pip shines is the range of scenarios it covers and backwards compatibility. We don't want to throw out all that work if we can avoid it to switch to a new package manager unless the situation is unfixable. To me the situation is extremely fixable, but we need to change defaults.
People used Python as a bash replacement. This was a weird period where, similar to Perl, there was an assumption that Python would be installed on any system you were working with and so you could write Python scripts to do things and then package them up as Linux packages. If your script had dependencies, you would also pull those in as Linux packages.
- To be blunt, this dependency management system never should have been allowed. It caused no end of confusion for everyone and ended up with people using super old packages. If your Python application had dependencies, you should have included them.
- Starting to write Python in Linux and then running apt-get install requests but then later being told to use pip and remove the package even though packages are how you get software in Linux has thrown off beginners as long as I have been doing this job.
The nature of dependencies has changed and how we think of including third-party software has evolved. I was shocked when I started working with NodeJS teams at how aggressively and (frankly) recklessly they would add dependencies to a project. However NPM and Node are designed around that model of lots of external dependencies and they've adopted a lot of things that people have come to expect
- The package.json, package-lock.json and node_modules directory as a consistent design across all projects is huge. It completely eliminated confusion and ensures you can have super-easy project switching along with reproducible environments.
- Node defaulting to per-project and not global is what Python should have switched over to years ago. Again, this is just what people expect when they're talking about having lots of dependencies.
- People have a lot more dependencies. When I started in this field, the idea of every project adding a 66 MB dependency with boto would have been unthinkable. Not because of disk space, but because its just so much code to bring into a project. However now people don't even blink at adding more libraries. pip was designed in a world where requirements.txt were 10 lines. Now we could easily be talking 200.
- If we're not going to switch over to per-project dependencies, then at the very least you need to switch to venv as a default. I don't care how you do it. Make a file that sits at the top level of a directory that tells Python we're using a venv. Have it check for the existence of a folder and if it exists use it by default, you gotta have something a bit easier here.

Why this isn't a crisis is this is effectively a basic .profile fix

   cd() {
       builtin cd "$@"
       if [ -f "venv/bin/activate" ]; then
           source venv/bin/activate
       fi
   }

Finally people think its bad because Golang and Rust exist. Expectations for how dependency management evolved in the space. Work has been done to expand pip to meet more of these expectations but we're still pretty far.

Where to go from here

Anyone familiar with the Apple ecosystem will know the term "Sherlocking". It's where Apple monitors the ecosystem of third-party apps and will periodically copy one and make it part of the default OS. While unfair at times to those third parties, it's a clever design from Apple's perspective. They can let someone else do all the work of figuring out what users like and don't like, what designs succeed or fail on their platform and then swoop in when there is general consensus.

pip needs to do some Sherlocking. Pypa has already done a ton of hard engineering work. We have the ability to create a more secure, easier to debug dependency management system with existing almost-stock tooling. It doesn't require any fundamental changes to the ecosystem, or the investment of a lot of engineering effort.

What it requires is being confident enough in your work to make a better experience for everyone by enduring a period of some complaints. Or its time to give up and endorse something like uv. Sitting and waiting for the problem to resolve itself through some abstract concept of community consensus has been a trainwreck. Either make the defaults conform to modern expectations or warn users when they run pip this is a deprecated project and they should go install whatever else.

Questions/comments/concerns: https://c.im/@matdevdug