Replace Docker Compose with Quadlet for Servers

So for years I've used Docker Compose as my stepping stone to k8s. If the project is small, mostly for my own consumption, or the business requirements don't really support the complexity of k8s, I use Compose. It's simple to manage with bash scripts for deployments and not hard to set up on fresh servers with cloud-init. The process of removing a server from a load balancer, pulling the new container, then adding it back in has been bulletproof for teams with limited headcount, or for services where uptime is less critical than cost control and ease of long-term maintenance. You avoid almost all of the complexity of really "running" a server and can scale up to about 20 VMs while still having a reasonable deployment time.

What are you talking about

Sure, so one common issue I hear is "we're a small team, k8s feels like overkill, what else is on the market"? The issue is there are tons and tons of ways to run containers on virtually every cloud platform, but a lot of them are locked to that cloud platform. They're also typically billed at premium pricing because they remove all the elements of "running a server".

That's fine, but for small teams buying in too heavily to a vendor solution can be hard to get out of. Maybe they pick wrong and it gets deprecated, etc. So I try to push them towards a simpler stack that is more idiot-proof to manage. It varies by VPS provider but the basic stack looks like the following:

  • Debian servers set up with cloud-init to run all the updates, reboot, and install the container manager of choice.
  • This also sets up Cloudflare tunnels so we can access the boxes securely and easily. Tailscale also works great/better for this. Avoids needing public IPs for each box.
  • Add a tag to each one of those servers so we know what it does (redis, app server, database)
  • Put them into a VPC together so they can communicate
  • Take the deploy script, have it SSH into the box and run the container update process

Linux updates involve a straightforward process of de-registering the VM, destroying it and then starting fresh. The database is a bit more complicated but still doable. It's all easily done in simple scripts that you can tie to GitHub Actions if you are so inclined. Docker Compose has been the glue that handles the actual launching and restarting of the containers for this sample stack.
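
As a rough sketch, the deploy script in this setup can be as small as the following. The host names, paths and load balancer helpers are placeholders standing in for whatever your provider's CLI actually offers:

#!/usr/bin/env bash
# Hypothetical rolling deploy: drain each box, update it, add it back.
set -euo pipefail

for HOST in app-01 app-02 app-03; do            # placeholder server tags
  remove_from_load_balancer "$HOST"             # placeholder for your provider's CLI call
  ssh "deploy@$HOST" "cd /srv/myapp && docker compose pull && docker compose up -d"
  add_to_load_balancer "$HOST"                  # placeholder for your provider's CLI call
done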

When you outgrow this approach, you are big enough that you should have a pretty good idea of where to go now. Since everything is already in containers you haven't been boxed in and can migrate in whatever direction you want.

Why Not Docker

However I'm not thrilled with the current state of Docker as a full product. Even when I've paid for Docker Desktop I found it to be a profoundly underwhelming tool. It's slow, the UI is clunky, there's always an update pending, it's sort of expensive for what people use it for, and Windows users seem to hate it. When I've compared Podman vs Docker on servers or my local machines, Podman is faster, seems better designed and in general as a product is trending in a stronger direction. If I don't like Docker Desktop and prefer Podman Desktop, to me it's worth migrating the entire stack over and just dumping Docker as a tool I use. Fewer things to keep track of.

Now the problem is that while Podman has a sort of compatibility layer with Docker Compose, it's not a one-to-one replacement and you want to be careful using it. My testing showed it worked OK for basic examples, but with more complex setups you start to run into problems. It also seems like work on the project has mostly been abandoned by the core maintainers. You can see it here: https://github.com/containers/podman-compose

I think podman-compose is the right solution for local dev, where you aren't using terribly complex examples and the uptime of the stack matters less. It's hard to replace Compose in this role because it's just so straightforward. As a production deployment tool I would stay away from it. This is important to note because right now the local dev container story often involves running k3s on your laptop. My experience is people loathe Kubernetes for local development and will go out of their way to avoid it.

The people I know who are all-in on Podman pushed me towards Quadlet as an alternative which uses systemd to manage the entire stack. That makes a lot of sense to me, because my Linux servers already have systemd and it's already a critical piece of software that (as far as I can remember) works pretty much as expected. So the idea of building on top of that existing framework makes more sense to me than attempting to recreate the somewhat haphazard design of Compose.

Wait I thought this already existed?

Yeah I was also confused. So there was a command, podman-generate-systemd, that I had used previously to run containers with Podman under systemd. That has been deprecated in favor of Quadlets, which are more powerful and offer more of the Compose functionality, but are also more complex and less magically generated.

So if all you want to do is run a container or pod using Systemd, then you can still use podman-generate-systemd which in my testing worked fine and did exactly what it says on the box. However if you want to emulate the functionality of Compose with networks and volumes, then you want Quadlet.
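
For reference, the older workflow looked something like this; the container name here is a placeholder and the generated file name follows Podman's container-<name>.service convention:

podman generate systemd --new --files --name mycontainer
# Move the generated unit into place and enable it for the rootless user
mv container-mycontainer.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now container-mycontainer.service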

What is Quadlet

The name comes from this excellent pun:

What do you get if you squash a Kubernetes kubelet?
A quadlet

Actually laughed out loud at that. Anyway Quadlet is a tool for running Podman containers under Systemd in a declarative way. It has been merged into Podman 4.4 so it now comes in the box with Podman. When you install Podman it registers a systemd-generator that looks for files in the following directories:

/usr/share/containers/systemd/
/etc/containers/systemd/
# Rootless users
$HOME/.config/containers/systemd/
$XDG_RUNTIME_DIR/containers/systemd/
$XDG_CONFIG_HOME/containers/systemd/
/etc/containers/systemd/users/$(UID)
/etc/containers/systemd/users/

You put unit files in the directory you want (creating the directory if it doesn't already exist, which it probably doesn't), with the file extension telling you what kind of unit you are looking at.

For example, if I wanted a simple volume I would make the following file:

/etc/containers/systemd/example-db.volume

[Unit]
Description=Example Database Container Volume

[Volume]
Label=app=myapp

You have all the same options you would on the command line.

You can see the entire list here: https://docs.podman.io/en/latest/markdown/podman-systemd.unit.5.html

Here are the units you can create: name.container, name.volume, name.network, name.kube, name.image, name.build, name.pod

Workflow

So you have a pretty basic Docker Compose you want to replace with Quadlets. You probably need the following:

  • A network
  • Some volumes
  • A database container
  • An application container

The process is pretty straightforward.

Network

We'll make this one at: /etc/containers/systemd/myapp.network

[Unit]
Description=Myapp Network

[Network]
Label=app=myapp

Volume

/etc/containers/systemd/myapp.volume

[Unit]
Description=Myapp Container Volume

[Volume]
Label=app=myapp

/etc/containers/systemd/myapp-db.volume

[Unit]
Description=Myapp Database Container Volume

[Volume]
Label=app=myapp

Database

/etc/containers/systemd/postgres.container

[Unit]
Description=Myapp Database Container

[Service]
Restart=always

[Container]
Label=app=myapp
ContainerName=myapp-db
Image=docker.io/library/postgres:16-bookworm
Network=myapp.network
Volume=myapp-db.volume:/var/lib/postgresql/data
Environment=POSTGRES_PASSWORD=S3cret
Environment=POSTGRES_USER=user
Environment=POSTGRES_DB=myapp_db

[Install]
WantedBy=multi-user.target default.target

Application

/etc/containers/systemd/myapp.container

[Unit]
Description=Myapp Container
Requires=postgres.service
After=postgres.service

[Container]
Label=app=myapp
ContainerName=myapp
Image=wherever-you-get-this
Network=myapp.network
Volume=myapp.volume:/tmp/place_to_put_stuff
Environment=DB_HOST=myapp-db
Environment=WORDPRESS_DB_USER=user
Environment=WORDPRESS_DB_NAME=myapp_db
Environment=WORDPRESS_DB_PASSWORD=S3cret
PublishPort=9000:80

[Install]
WantedBy=multi-user.target default.target

Now you need to run

systemctl daemon-reload

and then start the units with systemctl start myapp.service (or reboot), after which you can use systemctl status to check all of these running processes. You don't need to run systemctl enable to get them to run on next boot IF you have the [Install] section defined. Also notice that when you are setting the dependencies (Requires, After), the unit is called name-of-thing.service, not name-of-thing.container or .volume. It threw me off at first but I just wanted to call that out.
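
For a root setup the whole loop looks roughly like this; the generator path below is the common one but can vary by distro:

sudo systemctl daemon-reload
sudo systemctl start myapp.service
systemctl status myapp.service
journalctl -u myapp.service -f
# If a unit doesn't show up at all, run the Quadlet generator by hand to see why:
/usr/lib/systemd/system-generators/podman-system-generator --dryrun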

One thing I want to call out

Containers support AutoUpdate, meaning that if you just want Podman to pull down the freshest image from your registry, that is supported out of the box. It's just AutoUpdate=registry. If you change that to local, Podman will restart the container when you trigger a new build of that image locally with a deployment. If you need more information about logging into registries with Podman you can find that here.

I find this very helpful for testing environments, where I can tell servers to just run podman auto-update and get the newest containers. It's also great because it has options to help handle rollbacks and failure scenarios, which are rare but can really blow up in your face with containers outside of k8s. You can see that here.
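
A minimal sketch of how that fits together, assuming your image lives in a registry the host can pull from:

# In the [Container] section of myapp.container:
AutoUpdate=registry

# Then on the host, either run updates on demand:
podman auto-update --dry-run    # show which units would be updated
podman auto-update              # pull newer images and restart those units

# ...or let the bundled systemd timer handle it on a schedule:
sudo systemctl enable --now podman-auto-update.timer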

What if you don't store images somewhere?

So often with smaller apps it doesn't make sense to add a middle layer of building the image, storing it somewhere and then pulling it, vs just building the image on the machine you are deploying to with docker compose up -d --no-deps --build myapp

You can do the same thing with Quadlet build files. The unit files are similar to the ones above but with a .build extension, and the documentation makes it pretty simple to figure out how to convert whatever you are looking at.
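
A minimal sketch of what that can look like, assuming a Podman version new enough to ship .build units and that your code and Containerfile get rsynced to /srv/myapp (paths and tags here are placeholders):

/etc/containers/systemd/myapp.build

[Build]
ImageTag=localhost/myapp:latest
SetWorkingDirectory=/srv/myapp
File=Containerfile

Then the .container file points its Image= at the build unit instead of a registry:

[Container]
Image=myapp.build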

I found this nice for quick testing so I could easily rsync changes to my test box and trigger a fast rebuild with the container layers mostly getting pulled from cache and only my code changes making a difference.

How do secrets work?

So secrets are supported with Quadlets. Effectively they just build on top of podman secret or secrets in Kubernetes. Assuming you don't want to go the Kubernetes route for this purpose, you have a couple of options.

  1. Make a secret from a local file (probably bad idea): podman secret create my_secret ./secret.txt
  2. Make a secret from an environmental variable on the box (better idea): podman secret create --env=true my_secret MYSECRET
  3. Use stdin: printf <secret> | podman secret create my_secret -

Then you can reference these secrets inside of the .container file with Secret=name-of-podman-secret followed by the options. By default these secrets are mounted at /run/secrets/secretname as a file inside of the container. You can configure it to be an environmental variable (along with a bunch of other stuff) with the options outlined here.
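
For example, something like this in the .container file exposes the secret created above as an environment variable inside the container; the secret name and target variable are just placeholders:

[Container]
Secret=my_secret,type=env,target=POSTGRES_PASSWORD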

Rootless

So my examples above were not rootless containers, which are best practice. You can get them to work, but the behavior is more complicated and has problems I wanted to call out. You need to use default.target and not multi-user.target, and it also looks like you do need loginctl enable-linger to allow your user's containers to start without that user being logged in.

Also remember that all of the systemctl commands need the --user argument and that you might need to change your sysctl parameters to allow rootless containers to run on privileged ports.

sudo sysctl net.ipv4.ip_unprivileged_port_start=80

That unblocks port 80 (and everything above it), for example.
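
Putting the rootless pieces together looks roughly like this, with the unit files living in ~/.config/containers/systemd/ instead:

loginctl enable-linger $USER
systemctl --user daemon-reload
systemctl --user start myapp.service
systemctl --user status myapp.service
journalctl --user -u myapp.service -f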

Networking

So for rootless networking Podman previously used slirp4netns and now uses pasta. Pasta doesn't do NAT and instead copies the IP address from your main network interface into the container namespace. "Main" in this case is defined as whatever interface has the default route. This can cause (obvious) problems with inter-container connections since it's all the same IP. You need to configure containers.conf to get around this problem.

[network]
pasta_options = ["-a", "10.0.2.0", "-n", "24", "-g", "10.0.2.2", "--dns-forward", "10.0.2.3"]

Also ping didn't work for me. You can fix that with the solution here.

That sounds like a giant pain in the ass.

Yeah I know. It's not actually the fault of the Podman team. The way rootless containers work is basically that they use user_namespaces to emulate the privileges needed to create containers. Inside of the user namespace they can do things like mount namespaces and set up networking. Outgoing connections are tricky because veth pairs cannot be created across user namespace boundaries without root, and inbound traffic relies on port forwarding.

So tools like slirp4netns and pasta are used, since they can translate Ethernet packets to unprivileged socket system calls by making a tap interface available in the namespace. However the end result is you need to account for a lot of potential strangeness in the configuration file. I'm confident this will get less fiddly as time goes on.

Podman also has a tutorial on how to get it set up here: https://github.com/containers/podman/blob/main/docs/tutorials/rootless_tutorial.md which did work for me. If you do the work of rootless containers now you have a much easier security story for the rest of your app, so I do think it ultimately pays off even if it is annoying in the beginning.

Impressions

So as a replacement for Docker Compose on servers, I've really liked Quadlet. I find the logging easier to figure out since we're just using the standard systemctl commands, and checking status is also easier and more straightforward. Getting the rootless containers running took... more time than I expected, because I hadn't thought about how, without the linger setup, they wouldn't start until the user logged back in.

It does stink that this is absolutely not a solution for local dev in most places. I prefer that Podman remains daemonless and instead hooks into the existing functionality of systemd, but for people not running Linux as their local workstation (most people on Earth) you are either going to need to use the Podman Desktop Kubernetes functionality or use podman-compose and just be aware that it's not something you should use in actual production deployments.

But if you are looking for something that scales well, runs containers and is super easy to manage and keep running, this has been a giant hit for me.

Questions/comments/concerns: https://c.im/@matdevdug


Teaching to the Test. Why IT Security Audits Aren't Making Stuff Safer

A lot has been written in the last few weeks about the state of IT security in the aftermath of the CrowdStrike outage. Opinions range from blaming Microsoft for signing the CrowdStrike software (who in turn blame the EU for making them do it) to blaming the companies themselves for allowing all of these machines access to the Internet to receive the automatic template update. Bike-shedding among the technical community continues to be focused on the underlying technical deployment, which misses the forest for the trees.

The better question is: what was the forcing mechanism that convinced every corporation in the world that it was a good idea to install software like this on every single machine? Why is there such a cottage industry of companies that are effectively undermining Operating System security with the argument that they are doing more "advanced" security features and allowing (often unqualified) security and IT departments to make fundamental changes to things like TLS encryption and basic OS functionality? How did all these smart people let a random company push updates to everyone on Earth with zero control? The justification often given is "to pass the audit".

These audits and certifications, of which there are many, are a fundamentally broken practice. The intent of the frameworks was good, allowing for the standardization of good cybersecurity practices while not relying on the expertise of an actual cybersecurity expert to validate the results. We can all acknowledge there aren't enough of those people on Earth to actually audit all the places that need to be audited. The issue is the audits don't actually fix real problems, but instead create busywork for people so it looks like they are fixing problems. It lets people cosplay as security experts without needing to actually understand what the stuff is.

I don't come to this analysis lightly. Between HIPAA, PCI, GDPR, ISO27001 and SOC2 I've seen every possible attempt to boil requirements down to a checklist that you can do. Add in the variations on these that large companies like to send out when you are attempting to sell them an enterprise SaaS and it wouldn't surprise me at all to learn that I've spent over 10,000 hours answering and implementing solutions to meet the arbitrary requirements of these documents. I have both produced the hundred page PDFs full of impressive-looking screenshots and diagrams AND received the PDFs full of diagrams and screenshots. I've been on many calls where it is clear neither of us understands what the other is talking about, but we agree that it sounds necessary and good.

I have also been there in the room when inept IT and Security teams use these regulations, or more specifically their interpretation of these regulations, to justify kicking off expensive and unnecessary projects. I've seen laptops crippled due to full filesystem scans looking for leaked AWS credentials and Social Security numbers, even if the employee has nothing to do with that sort of data. I've watched as TLS encryption is broken with proxies so that millions of files can be generated and stored inside of S3 for security teams to never ever look at again. Even I have had to reboot my laptop to apply a non-critical OS update in the middle of an important call. All this inflicted on poor people who had to work up the enthusiasm to even show up to their stupid jobs today.

Why?

Why does this keep happening? How is it that every large company keeps falling into the same trap of repeating the same expensive, bullshit processes?

  • The actual steps to improve cybersecurity are hard and involve making executives mad. You need to update your software, including planning ahead for end-of-life technology. Since this dark art is apparently impossible to do and would involve a lot of downtime to patch known-broken shit and reboot it, we won't do that. Better apparently to lose the entire Earth's personal data.
  • Everyone is terrified that there might be a government regulation with actual consequences if they don't have an industry solution to this problem that sounds impressive but has no real punishments. If Comcast executives could go to jail for knowingly running out-of-date Citrix NetScaler software, it would have been fixed. So instead we need impressive-sounding things which can be held up as evidence of compliance, so that if they ultimately don't end up preventing leaks, the consequences are minor.
  • Nobody questions the justification of "we need to do x because of our certification". The actual requirements are too boring to read so it becomes this blank check that can be used to roll out nearly anything.
  • Easier to complete a million nonsense steps than it is to get in contact with someone who understands why the steps are nonsense. The number of times I've turned on silly "security settings" to pass an audit when the settings weren't applicable to how we used the product is almost too high to count.
  • Most Security teams aren't capable of stopping a dedicated attacker and, in their souls, know that to be true. Especially with large organizations, the number of conceivable attack vectors becomes too painful to even think about. Therefore too much faith is placed in companies like Zscaler and CrowdStrike to use "machine learning and AI" (read: magic) to close up all the possible exploits before they happen.
  • If your IT department works exclusively with Windows and spends their time working with GPOs and Powershell, every problem you hand them will be solved with Windows. If you handed the same problem to a Linux person, you'd get a Linux solution. People just use what they know. So you end up with a one-size-fits-all approach to problems. Like mice in a maze where almost every step is electrified, if Windows loaded up with bullshit is what they are allowed to deploy without hassles that is what you are going to get.

Future

We all know this crap doesn't work and the sooner we can stop pretending it makes a difference, the better. AT&T had every certification on the planet and still didn't take the incredibly basic step of enforcing 2FA on a database of all the most sensitive data it has in the world. If following these stupid checklists and purchasing the required software ended up with more secure platforms, I'd say "well at least there is a payoff". But time after time we see the exact same thing which is an audit is not an adequate replacement for someone who knows what they are doing looking at your stack and asking hard questions about your process. These audits aren't resulting in organizations doing the hard but necessary step of taking downtime to patch critical flaws or even applying basic security settings across all of their platforms.

Because cryptocurrency now allows for hacking groups to demand millions of dollars in payments (thanks crypto!), the financial incentives to cripple critical infrastructure have never been better. At the same time most regulations designed to encourage the right behavior are completely toothless. Asking the tech industry to regulate itself has failed, without question. All that does is generate a lot of pain and suffering for their employees, who most businesses agree are disposable and idiots. All this while doing nothing to secure personal data. Even in organizations that had smart security people asking hard questions, that advice is entirely optional. There is no stick with cybersecurity and businesses, especially now that almost all of them have made giant mistakes.

I don't know what the solution is, but I know this song and dance isn't working. The world would be better off if organizations stopped wasting so much time and money on these vendor solutions and instead stuck to much more basic solutions. Perhaps if we could just start with "have we patched all the critical CVEs in our organization" and "did we remove the shared username and password from the cloud database with millions of call records", then perhaps AFTER all the actual work is done we can have some fun and inject dangerous software into the most critical parts of our employees devices.

Find me at: https://c.im/@matdevdug


Sears

It was 4 AM when I first heard the tapping on the glass. I had been working for 30 minutes trying desperately to get everything from the back store room onto the sales floor when I heard a light knocking. Peeking out from the back I saw an old woman wearing sweat pants and a Tweetie bird jacket, oxygen tank in tow, tapping a cane against one of the big front windows. "WE DON'T OPEN UNTIL 5" shouted my boss, who shook her head and resumed stacking boxes. "Black Friday is the worst" she said to nobody as we continued to pile the worthless garbage into neat piles on the store floor.

What people know now but didn't understand then was the items for sale on Black Friday weren't our normal inventory. These were TVs so poorly made they needed time to let their CRT tubes warm up before the image became recognizable. Radios with dials so brittle some came out of the box broken. Finally a mixer that when we tested it in the back let out such a stench of melted plastic we all screamed to turn it off before we burned down the building. I remember thinking as I unloaded it from the truck certainly nobody is gonna want this crap.

Well here they were and when we opened the doors they rushed in with a violence you wouldn't expect from a crowd of mostly senior citizens. One woman pushed me to get at the TVs, which was both unnecessary (I had already hidden one away for myself and put it behind the refrigerators in the back) and not helpful as she couldn't lift the thing on her own. I watched in silence as she tried to get her hands around the box with no holes cut out, presumably a cost savings on Sears part, grunting with effort as the box slowly slid while she held it. At the checkout desk a man told me he was buying the radio "as a Christmas gift for his son". "Alright but no returns ok?" I said keeping a smile on my face.

We had digital cameras the size of shoe-boxes, fire-hazard blenders and an automatic cat watering dish that I just knew was going to break a lot of hearts when Fluffy didn't survive the family trip to Florida. You knew it was quality when the dye from the box rubbed off on your hands when you picked it up. Despite my jokes about worthless junk, people couldn't purchase it fast enough. I saw arguments break out in the aisles and saw Robert, our marine veteran sales guy, whisper "forget this" and leave for a smoke by the loading dock. When I went over to ask if I could help, the man who had possession of the digital camera spun around and told me to "either find another one of these cameras or butt the fuck out". They resumed their argument and I resumed standing by the front telling newcomers that everything they wanted was already gone.

Hours later I was still doing that, informing everyone who walked in that the item they had circled in the newspaper was already sold out. "See, this is such a scam, why don't you stock more of it? It's just a trick to get us into the store". Customer after customer told me variations on the above, including one very kind looking grandfather type informing me I could "go fuck myself" when I wished him a nice holiday.

Beginnings

The store was in my small rural farming town in Ohio, nestled between the computer shop where I got my first job and a carpet store that was almost certainly a money laundering front since nobody ever went in or out. I was interviewed by the owner, a Vietnam veteran who spent probably half our interview talking about his two tours in Vietnam. "We used to throw oil drums in the water and shoot at them from our helicopter, god that was fun. Don't even get me started about all the beautiful local women." I nodded, unsure what this had to do with me but sensing this was all part of his process. In the years to come I would learn to avoid sitting down in his office, since then you would be trapped listening to stories like these for an hour plus.

After these tales of what honestly sounded like a super fun war full of drugs and joyrides on helicopters, he asked me why I wanted to work at Sears. "It's an American institution and I've always had a lot of respect for it" I said, not sure if he would believe it. He nodded and went on to talk about how Sears built America. "Those kit houses around town, all ordered from Sears. Boy we were something back in the day. Anyway fill out your availability and we'll get you out there helping customers." I had assumed at some point I would get training on the actual products, which never happened in the years I worked there. In the back were dust-covered training manuals which I was told I should look at "when I got some time". I obviously never did and still sometimes wonder about what mysteries they contained.

I was given my lanyard and put on the floor, which consisted of half appliances, one quarter electronics and the rest tools. Jane, one of the saleswomen, told me to "direct all the leads for appliances to her" and not check one out myself, since I didn't get commission. Most of my job consisted of swapping broken Craftsman tools since they had a lifetime warranty. You filled out a carbon paper form, dropped the broken tool into a giant metal barrel and then handed them a new one. I would also set up deliveries for rider lawnmowers and appliances, working on an ancient IBM POS terminal that required memorizing a series of strange keyboard shortcuts to navigate the calendar.

When there was downtime, I would go into the back and help Todd assemble the appliances and rider lawnmowers. Todd was a special needs student at my high school who was the entirety of our "expert assembly" service. He did a good job, carefully following the manual every time. Whatever sense of superiority I felt as an honor roll student disappeared when he watched me try to assemble a rider mower myself. "You need to read the instructions and then do what they say" he would helpfully chime in as I struggled to figure out why the brakes did nothing. His mowers always started on the first try while mine were safety hazards that I felt certain were going to be on the news. "Tonight a Craftsman rider lawnmower killed a family of 4. It was assembled by this idiot." Then just my yearbook photo, where I had decided to bleach my hair blonde like a chonky backstreet boy, overlaid on top of live footage of blood-splattered house siding.

Any feeling I had that people paying us $200 to assemble their rider mowers were getting ripped off disappeared when I saw the first one a customer had tried to assemble himself. If my mowers were death traps, these were actual IEDs whose only conceivable purpose on Earth would be to trick innocent people into thinking they were rider lawnmowers until you turned the key and they blew you into the atmosphere. One guy brought his back with several ziplock bags full of screws, bashfully explaining that he tried his best but "there's just no way that's right". That didn't stop me from holding my breath every time someone drove a mower I had worked on up the ramp into the back of the truck. "Please god just don't fall apart right now, wait until they get it home" was my prayer to whatever deity looked after idiots in jobs they shouldn't have.

Sometimes actual adults with real jobs would come in asking me questions about tools, conversations that both of us hated. "I'm looking for an oil filter wrench" they would say, as if this item was something I knew about and could find. "Uh sure, could you describe it?" "It's a wrench, used for changing oil filters, has a loop on it." I'd nod and then feebly offer them up random items until they finally grabbed it themselves. One mechanic, when I offered a claw hammer up in response to his request for a cross-pein hammer, said "you aren't exactly handy, are you?" I shook my head and went back behind the counter, attempting to establish what little authority I had left with the counter. I might not know anything about the products we sell, but only one of us is allowed back here, sir.

Sears Expert

As the months dragged on I was moved from the heavier foot traffic shifts to the night shifts. This was because customers "didn't like talking to me", a piece of feedback I felt was true but still unfair. I had learned a lot, like every incorrect way to assemble a lawn mower and that refrigerators are all the same except for the external panels. Night shifts were mostly getting things ready for the delivery company, a father and son team who were always amusing.

The father was a chain-smoking tough guy who would regularly talk about his "fuck up" of a son. "That idiot dents another oven when we're bringing it in I swear to god I'm going to replace him with one of those Japanese robots I keep seeing on the news." The son was the nicest guy on Earth, really hard working, always on time for deliveries and we got like mountains of positive feedback about him. Old ladies would tear up as they told me about the son hauling their old appliances away in a blizzard on his back. He would just sit there, smile frozen on his face while his father went on and on about how much of a failure he was. "He's just like this sometimes" the son would tell me by the loading dock, even though I would never get involved. "He's actually a nice guy". This was often punctuated by the father running into a minor inconvenience and flying off the handle. "What kind of jackass would sort the paperwork alphabetically instead of by order of delivery?" he'd scream from the parking lot.

When the son went off to college he was replaced by a Hispanic man who took zero shit. His response to customer complaints was always that they were liars and I think the father was afraid of him. "Oh hey don't bother Leo with that, he's not in the mood, I'll call them and work it out" the father would tell me as Leo glared at us from the truck. Leo was incredibly handy though, able to fix almost any dent or scratch in minutes. He popped the dent out of my car door by punching the panel, which is still one of the cooler things I've seen someone do.

Other than the father and son duo, I was mostly alone with a woman named Ruth. She fascinated me because her life was unspeakably bleak. She had been born and raised in this town and had only left the county once in her life, to visit the Sears headquarters in Chicago. She'd talk about it like she had been permitted to visit heaven. "Oh it was something, just a beautiful shiny building full of the smartest people you ever met. Boy I'd love to see it again sometime." She had married her high school boyfriend, had children and now worked here in her 60s as her reward for a life of hard work. She had such bad pain in her knees she had to lean on the stocking cart as she pushed it down the aisles, often stopping to catch her breath. The store would be empty except for the sounds of a wheezing woman and squeaky wheels.

When I would mention Chicago was a 4 hour drive and she could see it again, she'd roll her eyes at me and continue stocking shelves. Ruth was a type of rural person I encountered a lot who seemed to get off on the idea that we were actually isolated from the outside world by a force field. Mention leaving the county to go perhaps to the next county and she would laugh or make a comment about how she wasn't "that kind of person". Every story she would tell had these depressing endings that left me pondering what kind of response she was looking for. "My brother, well he went off to war and when he came back was just a shell of a man. Never really came back if you ask me. Anyway let's clean the counters."

She'd talk endlessly about her grandson, a 12 year old who was "stupid but kind". His incredibly minor infractions were relayed to me like she was telling me about a dark family scandal. "Then I said, who ate all the chips? I knew he had, but he just sat there looking at me and I told him you better wipe those crumbs off your t-shirt smartass and get back to your homework". He finally visited and I was shocked to discover there was also a granddaughter who I had never heard about. He smirked when he met me and told me that Ruth had said I was "a lazy snob".

I'll admit, I was actually a little hurt. Was I a snob compared to Ruth? Absolutely. To be honest with you I'm not entirely sure she was literate. I'd sneak books under the counter to read during the long periods where nothing was happening and she'd often ask me what they were about even if the title sort of explained it. "What is Battle Cry of Freedom: The Civil War Era about? Um well the Civil War." I'd often get called over to "check" documents for her, which typically included anything more complicated than a few sentences. I still enjoyed working with her.

Our relationship never really recovered after I went to Japan when I was 16. I went by myself and wandered around Tokyo, having a great time. When I returned full of stories and pictures of the trip, I could tell she was immediately sick of me. "Who wants to see a place like Japan? Horrible people" she'd tell me as I tried to tell her that things had changed a tiny bit since WWII. "No it's really nice and clean, the food was amazing, let me tell you about these cool trains they have". She wasn't interested and it was clear my getting a passport and leaving the US had changed her opinion of me.

So when her grandson confided that she had called me lazy AND a snob my immediate reaction was to lean over and tell him that she had called him "a stupid idiot". Now she had never actually said "stupid idiot", but in the heat of the moment I went with my gut. Moments after I did that the reality of a 16 year old basically bullying a 12 year old sunk in and I decided it was time for me to go take out some garbage. Ruth of course found out what I said and mentioned it every shift after that. "Saying I called my grandson a stupid idiot, who does that, a rude person that's who, a rude snob" she'd say loud enough for me to hear as the cart very slowly inched down the aisles. I deserved it.

Trouble In Paradise

At a certain point I was allowed back in front of customers and realized with a shock that I had worked there for a few years. The job paid very little, which was fine as I had nothing in the town to actually buy, but enough to keep my lime green Ford Probe full of gas. It shook violently if you exceeded 70 MPH, which I should have asked someone about but never did. I was paired with Jane, the saleswoman who was a devout Republican and liked to make fun of me for being a Democrat. This was during the George W Bush vs Kerry election and she liked to point out how Kerry was a "flipflopper" on things. "He just flips and flops, changes his mind all the time". I'd point out we had vaporized the country of Iraq for no reason and she'd roll her eyes and tell me I'd get it when I was older.

My favorite was when we were working together during Reagan's funeral, an event which elicited no emotion from me but drove her to tears multiple times. "Now that was a man and a president" she'd exclaim to the store while the funeral procession was playing on the 30 TVs. "He won the Cold War you know?" she'd shout at a woman looking for replacement vacuum cleaner bags. Afterwards she asked me what my favorite Reagan memory was. All I could remember was that he had invaded the small nation of Grenada for some reason, so I said that. "Really showed those people not to mess with the US" she responded. I don't think either one of us knew that Grenada is a tiny island nation with a population less than 200,000.

Jane liked to dispense country wisdom, witty one-liners that only sometimes were relevant to the situation at hand. When confronted with an angry customer she would often say afterwards that you "You can't make a silk purse out of a sow's ear" which still means nothing to me.  Whatever rural knowledge I was supposed to obtain through osmosis my brain clearly rejected. Jane would send me over to sell televisions since I understood what an HDMI cord was and the difference between SD and HD television.

Selling TVs was perhaps the only thing I did well, that and the fun vacuum demonstration where we would dump a bunch of dirt on a carpet tile and suck it up. Some poor customer would tell me she didn't have the budget for the Dyson and I'd put my hand up to silence her. "You don't have to buy it, just watch it suck up a bunch of pebbles. I don't make commission anyway so who cares." Then we'd both watch as the Dyson would make a horrible screeching noise and suck in a cup's worth of small rocks. "That's pretty cool huh?" and the customer would nod, probably terrified of what I would do if she said no.

Graduation

When I graduated high school and prepared to go off to college, I had the chance to say goodbye to everyone before I left. They had obviously already replaced me with another high school student, one that knew things about tools and was better looking. You like to imagine that people will miss you when you leave a job, but everyone knew that wasn't true here. I had been a normal employee who didn't steal and mostly showed up on time.

My last parting piece of wisdom from Ruth was not to let college "make me forget where I came from". Sadly for her I was desperate to do just that, entirely willing to adopt whatever new personality was presented to me. I hated rural life and still do: the spooky dark roads surrounded by corn, yelling at Amish teens to stop shoplifting during their Rumspringa, when they would get dropped off in the middle of town and left to their own devices.

Still I'm grateful that I at least know how to assemble a rider lawnmower, even if it did take a lot of practice runs on customers' mowers.


A Eulogy for DevOps

We hardly knew ye.

DevOps, like many trendy technology terms, has gone from the peak of optimism to the depths of exhaustion. While many of the fundamental ideas behind the concept have become second-nature for organizations, proving it did in fact have a measurable outcome, the difference between the initial intent and where we ended up is vast. For most organizations this didn't result in a wave of safer, easier to use software but instead encouraged new patterns of work that centralized risk and introduced delays and irritations that didn't exist before. We can move faster than before, but that didn't magically fix all our problems.

The cause of its death was a critical misunderstanding over what was making software hard to write. The belief was that by removing barriers to deployment, more software would get deployed and things would be easier and better; effectively, that developers and operations teams were being held back by ridiculous process and coordination. In reality these "soft problems" of communication and coordination are much more difficult to solve than the technical problems around pushing more code out into the world more often.

What is DevOps?

DevOps, when it was introduced around 2007, was a pretty radical concept of removing the divisions between people who ran the hardware and people who wrote the software. Organizations still had giant silos between teams, and I experienced a lot of that workflow myself.

Since all computer nerds also love space, it was basically us cosplaying as NASA. Copying a lot of the procedures and ideas from NASA to try and increase the safety around pushing code out into the world. Different organizations would copy and paste different parts, but the basic premise was every release was as close to bug free as time allowed. You were typically shooting for zero exceptions.

When I worked for a legacy company around that time, the flow for releasing software looked as follows.

  • The development team would cut a numbered release of the server software in conjunction with the frontend team, typically packaged together as a full entity. They would test this locally on their machines, then it would go to dev for QA to test, then finally out to customers once the QA checks were cleared.
  • Operations teams would receive a playbook of effectively what the software was changing and what to do if it broke. This would include how it was supposed to be installed, if it did anything to the database, it was a whole living document. The idea was the people managing the servers, networking equipment and SANs had no idea what the software did or how to fix it so they needed what were effectively step by step instructions. Sometimes you would even get this as a paper document.
  • Since these happened often inside of your datacenter, you didn't have unlimited elasticity for growth. So, if possible, you would slowly roll out the update and stop to monitor at intervals. But you couldn't do what people see now as a blue/green deployment because rarely did you have enough excess server capacity to run two versions at the same time for all users. Some orgs did do different datacenters at different times and cut between them (which was considered to be sort of the highest tier of safety).
  • You'd pick a deployment day, typically middle of the week around 10 AM local time and then would monitor whatever metrics you had to see if the release was successful or not. These were often pretty basic metrics of success, including some real eyebrow raising stuff like "is support getting more tickets" and "are we getting more hits to our uptime website". Effectively "is the load balancer happy" and "are customers actively screaming at us".
  • You'd finish the deployment and then the on-call team would monitor the progress as you went.

Why Didn't This Work

Part of the issue was this design was very labor-intensive. You needed enough developers coordinating together to put together a release. Then you needed a staffed QA team to actually take that software and ensure, on top of automated testing which was jusssttttt starting to become a thing, that the software actually worked. Finally you needed a technical writer working with the development team to put the release playbook together, and then the Operations team had to receive it, review it and implement the plan.

It was also slow. Features would often be pushed back for months even when they were done, just because a more important feature had to go out first, or because an update was making major changes to the database and we didn't want to bundle six other things in with the one possibly catastrophic change. It's effectively the Agile vs Waterfall debate broken out into practical steps.

Waterfall vs Agile in software development infographic

A lot of the lip service around this time that was given as to why organizations were changing was, frankly, bullshit. The real reason companies were so desperate to change was the following:

  • Having lots of mandatory technical employees they couldn't easily replace was a bummer
  • Recruitment was hard and expensive.
  • Sales couldn't easily inject whatever last-minute deal requirement they had into the release cycle since that was often set in stone.
  • It provided an amazing opportunity for SaaS vendors to inject themselves into the process by offloading complexity into their stack so they pushed it hard.
  • The change also emphasized the strengths of cloud platforms at the time when they were starting to gobble market share. You didn't need lots of discipline, just allocate more servers.
  • Money was (effectively) free so it was better to increase speed regardless of monthly bills.
  • Developers were understandably frustrated that minor changes could take weeks to get out the door while they were being blamed for customer complaints.

So executives went to a few conferences and someone asked them if they were "doing DevOps" and so we all changed our entire lives so they didn't feel like they weren't part of the cool club.

What Was DevOps?

Often this image is used to sum it up:

DevOps Infinity Wheel

In a nutshell, the basic premise was that development teams and operations teams were now one team. QA was fired and replaced with this idea that because you could very quickly deploy new releases and get feedback on those releases, you didn't need a lengthy internal test period where every piece of functionality was retested and determined to still be relevant.

Often this is conflated with the concept of SRE from Google, which I will argue until I die is a giant mistake. SRE is in the same genre but a very different tune, with a much more disciplined and structured approach to this problem. DevOps instead is about the simplification of the stack such that any developer on your team can deploy to production as many times in a day as they wish with only the minimal amounts of control on that deployment to ensure it had a reasonably high chance of working.

In reality DevOps as a practice looks much more like how Facebook operated, with employees committing to production on their first day and relying extensively on real-world signals to determine success or failure vs QA and tightly controlled releases.

In practice it looks like this:

  • Development makes a branch in git and adds a feature, fix, change, etc.
  • They open up a PR and then someone else on that team looks at it, sees it passes their internal tests, approves it and then it gets merged into main. This is effectively the only safety step, relying on the reviewer to have perfect knowledge of all systems.
  • This triggers a webhook to the CI/CD system which starts the build (often of an entire container with this code inside) and then once the container is built, it's pushed to a container registry.
  • The CD system tells the servers that the new release exists, often through a Kubernetes deployment, pushing a new version of an internal package, or using the CLI of the cloud provider's specific "run a container as a service" platform. It then monitors and tells you about the success or failure of that deployment.
  • Finally there are release-aware metrics which allow that same team, who is on-call for their application, to see if something has changed since they released it. Is latency up, error count up, etc. This is often just a line in a graph saying this was old and this is new.
  • Depending on the system, this can either be something where every time the container is deployed it is on brand-new VMs or it is using some system like Kubernetes to deploy "the right number" of containers.
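
As a concrete (and heavily simplified) sketch, the steps a pipeline runs after a merge often boil down to something like this; the registry, image and deployment names are placeholders:

# Build and publish an image tagged with the commit that triggered the run
docker build -t registry.example.com/myapp:${GIT_SHA} .
docker push registry.example.com/myapp:${GIT_SHA}

# Roll the new tag out and watch whether the deployment goes healthy
kubectl set image deployment/myapp myapp=registry.example.com/myapp:${GIT_SHA}
kubectl rollout status deployment/myapp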

The sales pitch was simple. Everyone can do everything so teams no longer need as many specialized people. Frameworks like Rails made database operations less dangerous, so we don't need a team of DBAs. Hell, use something like Mongo and you never need a DBA!

DevOps combined with Agile ended up with a very different philosophy of programming which had the following conceits:

  • The User is the Tester
  • Every System Is Your Specialization
  • Speed Of Shipping Above All
  • Catch It In Metrics
  • Uptime Is Free, SSO Costs Money (cheap features were sold as premium, expensive availability wasn't charged for)
  • Logs Are Business Intelligence

What Didn't Work

The first cracks in this model emerged pretty early on. Developers were testing on their local Mac and Windows machines and then deploying code to Linux servers configured from Ansible playbooks and left running for months, sometimes years. Inevitably small differences in the running fleet of production servers emerged, either from package upgrades for security reasons or just from random configuration events. This could be mitigated by frequently rotating the running servers by destroying and rebuilding them as fresh VMs, but in practice this wasn't done as often as it should have been.

Soon you would see things like "it's running fine on box 1,2, 4, 5, but 3 seems to be having problems". It wasn't clear in the DevOps model who exactly was supposed to go figure out what was happening or how. In the previous design someone who worked with Linux for years and with these specific servers would be monitoring the release, but now those team members often wouldn't even know a deployment was happening. Telling someone who is amazing at writing great Javascript to go "find the problem with a Linux box" turned out to be easier said than done.

Quickly feedback from developers started to pile up. They didn't want to have to spend all this time figuring out what Debian package they wanted for this or that dependency. It wasn't what they were interested in doing and also they weren't being rewarded for that work, since they were almost exclusively being measured for promotions by the software they shipped. This left the Operations folks in charge of "smoothing out" this process, which in practice often meant really wasteful practices.

You'd see really strange workflows around this time of doubling the number of production servers you were paying for by the hour during a deployment and then slowly scaling them down, all relying on the same AMI (server image) to ensure some baseline level of consistency. However since any update to the AMI required a full dev-stage-prod check, things like security upgrades took a very long time.

Soon you had just a pile of issues that became difficult to assign. Who "owned" platform errors that didn't result in problems for users? When a build worked locally but failed inside of Jenkins, what team needed to check that? The idea of we're all working on the same team broke down when it came to assigning ownership of annoying issues because someone had to own them or they'd just sit there forever untouched.

Enter Containers

DevOps got a real shot in the arm with the popularization of containers, which allowed the movement to progress past its awkward teenage years. Not only did this (mostly) solve the "it worked on my machine" thing, it also allowed for a massive simplification of the Linux server component. Now servers were effectively dumb boxes running containers, either on their own with Docker Compose or as part of a fleet with Kubernetes/ECS/App Engine/Nomad/whatever new thing has been invented in the last two weeks.

Combine that with being able to move almost everything that might previously have been a networking team problem or a SAN problem into configuration inside of the cloud provider through tools like Terraform, and you saw a real flattening of the skill curve. This greatly reduced the expertise required to operate these platforms and allowed for more automation. Soon you started to see what we now recognize as the current standard for development, which is "I push out a bajillion changes a day to production".

What Containers Didn't Fix

So there's a lot of other shit in that DevOps model we haven't talked about.

So far teams had improved the "build, test and deploy" parts. However operating the crap was still very hard. Observability was really really hard and expensive. Discoverability was actually harder than ever because stuff was constantly changing beneath your feet and finally the Planning part immediately collapsed into the ocean because now teams could do whatever they wanted all the time.

Operate

This meant someone going through and doing all the boring stuff. Upgrading Kubernetes, upgrading the host operating system, making firewall rules, setting up service meshes, enforcing network policies, running the bastion host, configuring the SSH keys, etc. What organizations quickly discovered was that this stuff was very time consuming to do and often required more specialization than the roles they had previously gotten rid of.

Before you needed a DBA, a sysadmin, a network engineer and some general Operations folks. Now you needed someone who not only understood databases but understood your specific cloud providers version of that database. You still needed someone with the sysadmin skills, but in addition they needed to be experts in your cloud platform in order to ensure you weren't exposing your database to the internet. Networking was still critical but now it all existed at a level outside of your control, meaning weird issues would sometimes have to get explained as "well that sometimes happens".

Often teams would delay maintenance tasks out of a fear of breaking something like k8s or their hosted database, but that resulted in delaying the pain and making their lives more difficult. This was the era where every startup I interviewed with basically just wanted someone to update all the stuff in their stack "safely". Every system was well past EOL and nobody knew how to Jenga it all together.

Observe

As applications shipped more often, knowing they worked became more important so you could roll back if a release blew up in your face. However replacing simple uptime checks with detailed traces, metrics and logs was hard. These technologies are specialized and require a detailed understanding of what they do and how they work. A centralized syslog box lasts to a point and then it doesn't. Prometheus scales to x amount of metrics and then no longer works on a single box. You needed someone who had a detailed understanding of how metrics, logs and traces worked and how to work with development teams in getting them sending the correct signal to the right places at the right amount of fidelity.

Or you could pay a SaaS a shocking amount to do it for you. The rise of companies like Datadog and the eye-watering bills that followed was proof that they understood how important what they were providing was. You quickly saw Observability bills exceed CPU and networking costs for organizations as one team would misconfigure their application logs and suddenly you have blown through your monthly quota in a week.

Developers were being expected to monitor with detailed precision what was happening with their applications without a full understanding of what they were seeing. How many metrics and logs were being dropped on the floor or sampled away, how the platform worked in displaying these logs to them, how to write a query against terabytes of logs so that you can surface what you need quickly: all of this was being passed around in Confluence pages written by desperate developers who were learning how all this shit works together as they were getting paged at 2 AM.

Continuous Feedback

This to me is the same problem as Observe. It's about whether your deployment worked or not and whether you had signal from internal tests if it was likely to work. It's also about feedback from the team on what in this process worked and what didn't, but because nobody ever did anything with that internal feedback we can just throw that one directly in the trash.

I guess in theory this would be retros where we all complain about the same six things every sprint and then continue with our lives. I'm not an Agile Karate Master so you'll need to talk to the experts.

Discover

A big pitch of combining these teams was the idea of more knowledge sharing. Development teams and Operation teams would be able to cross-share more about what things did and how they worked. Again it's an interesting idea and there was some improvement to discoverability, but in practice that isn't how the incentives were aligned.

Developers weren't rewarded for discovering more about how the platform operated and Operations didn't have any incentive to sit down and figure out how the frontend was built. It's not a lack of intellectual curiosity by either party, just the finite amount of time we all have before we die and what we get rewarded for doing. Being surprised that this didn't work is like being surprised a mouse didn't go down the tunnel with no cheese just for the experience.

In practice I "discovered" that if NPM was down nothing worked and the frontend team "discovered" that troubleshooting Kubernetes was a bit like Warhammer 40k Adeptus Mechanicus waving incense in front of machines they didn't understand in the hopes that it would make the problem go away.

Try restarting the Holy Deployment

Plan

Maybe more than anything else, this lack of centralization impacted planning. Since teams weren't syncing on a regular basis anymore, things could continue in crazy directions unchecked. In theory PMs were syncing with each other to try and ensure there were railroad tracks in front of the train before it plowed into the ground at 100 MPH, but that was a lot to put on a small cadre of people.

We see this especially in large orgs with microservices where it is easier to write a new microservice to do something rather than figure out which existing microservice does the thing you are trying to do. This model was sustainable when money was free and cloud budgets were unlimited, but once that gravy train crashed into the mountain of "businesses need to be profitable and pay taxes" that stopped making sense.

The Part Where We All Gave Up

A lot of orgs solved the problems above by simply throwing bodies into the mix. More developers meant it was possible for teams to have someone (anyone) learn more about the systems and how to fix them. Adding more levels of PMs and overall planning staff meant even with the frantic pace of change it was...more possible to keep an eye on what was happening. While cloud bills continued to grow unbounded, for the most part these services worked and allowed people to do the things they wanted to do.

Then the layoffs and budget cuts started. Suddenly it wasn't acceptable to spend unlimited money with your logging platform and your cloud provider as well as having a full team. Almost instantly I saw the shift as organizations started talking about "going back to basics". Among this was a hard turn in the narrative around Kubernetes, where it went from an amazing technology that lets you grow to Google-scale to a weight around an organization's neck that nobody understood.

Platform Engineering

Since there are no new ideas, just new terms, a successor to the throne has emerged. No longer are development teams expected to understand and troubleshoot the platforms that run their software, instead the idea is that the entire process is completely abstracted away from them. They provide the container and that is the end of the relationship.

From a certain perspective this makes more sense since it places the ownership for the operation of the platform with the people who should have owned it from the beginning. It also removes some of the ambiguity over what is whose problem. The development teams are still on-call for their specific application errors, but platform teams are allowed to enforce more global rules.

Well at least in theory. In practice this is another expansion of roles. You went from needing to be a Linux sysadmin to being a cloud-certified Linux sysadmin to being a Kubernetes-certified multicloud Linux sysadmin to finally being an application developer who can create a useful webUI for deploying applications on top of a multicloud stack that runs on Kubernetes in multiple regions with perfect uptime and observability that doesn't blow the budget. I guess at some point between learning the difference between AWS and GCP we were all supposed to go out and learn how to make useful websites.

This division of labor makes no sense but at least it's something I guess. Feels like somehow Developers got stuck with a lot more work and Operations teams now need to learn 600 technologies a week. Surprisingly tech executives didn't get any additional work with this system. I'm sure in the next reorg they'll chip in more.

Conclusion

We are now seeing a massive contraction of the Infrastructure space. Teams are increasingly looking for simple, less platform-specific tooling. In my own personal circles it feels like a real return to basics, as small and medium organizations abandon technology like Kubernetes and adopt much simpler, easier-to-troubleshoot workflows like "a bash script that pulls a new container".

In some respects it's a positive change, as organizations stop pretending they needed a "global scale" and can focus on actually servicing the users and developers they have. In reality a lot of this technology was adopted by organizations who weren't ready for it and didn't have a great plan for how to use it.

However Platform Engineering is not a magical solution to the problem. It is instead another fabrication of an industry desperate to show monthly growth, pushed by cloud providers who know teams lack the expertise to create the kinds of tooling described by such practices. In reality organizations need to be more brutally honest about what they actually need vs what bullshit they've been led to believe they need.

My hope is that we keep the gains from the DevOps approach and focus on simplification and stability over rapid transformation in the Infrastructure space. I think we desperately need a return to basics ideology that encourages teams to stop designing with the expectation that endless growth is the only possible outcome of every product launch.


GitHub Copilot Workspace Review

I was recently invited to try out the beta for GitHub's new AI-driven web IDE and figured it could be an interesting time to dip my toes into AI. So far I've avoided all of the AI tooling, trying the GitHub paid Copilot option and being frankly underwhelmed. It made more work for me than it saved. However this is free for me to try and I figured "hey why not".

Disclaimer: I am not and have never been an employee of GitHub, Microsoft, any company owned by Microsoft, etc. They don't care about me and likely aren't aware of my existence. Nobody from GitHub PR asked me to do this and probably won't like what I have to say anyway.

TL;DR

GitHub Copilot Workspace didn't work on a super simple task regardless of how easy I made the task. I wouldn't use something like this for free, much less pay for it. It sort of failed in every way it could at every step.

What is GitHub Copilot Workspace?

So GitHub Copilot has apparently been a success, at least according to them:

In 2022, we launched GitHub Copilot as an autocomplete pair programmer in the editor, boosting developer productivity by up to 55%. Copilot is now the most widely adopted AI developer tool. In 2023, we released GitHub Copilot Chat—unlocking the power of natural language in coding, debugging, and testing—allowing developers to converse with their code in real time.

They have expanded on this feature set with GitHub Copilot Workspace, a combination of an AI tool with an online IDE....sorta. It's all powered by GPT-4 so my understanding is this is the best LLM money can buy. The workflow of the tool is strange and takes a little bit of explanation to convey what it is doing.

GitHub has the marketing page here: https://githubnext.com/projects/copilot-workspace and the docs here: https://github.com/githubnext/copilot-workspace-user-manual. It's a beta product and I thought the docs were nicely written.

Effectively you start with a GitHub Issue, the classic way maintainers are harassed by random strangers. I've moved my very simple demo site: https://gcp-iam-reference.matduggan.com/ to a GitHub repo to show what I did. So I open the issue here: https://github.com/matdevdug/gcp-iam-reference/issues/1

Very simple, makes sense. Then I click "Open in Workspaces", which brings me to a kind of GitHub Actions-inspired flow.

It reads the Issue and creates a Specification, which is editable.

Then you generate a Plan:

Finally it generates the files of that plan and you can choose whether to implement them or not and open a Pull Request against the main branch.

Implementation:

It makes a Pull Request:

Great right? Well except it didn't do any of it right.

  • It didn't add a route to the Flask app to expose this information
  • It didn't stick with the convention of storing the information in JSON files, writing it out to Markdown for some reason
  • It decided the way that it was going to reveal this information was to add it to the README
  • Finally it didn't get anywhere near all the machine types.
Before you ping me: yes, I tried to change the Proposed plan

Baby Web App

So the app I've written here is primarily for my own use and it is very brain dead simple. The entire thing is the work of roughly an afternoon of poking around while responding to Slack messages. However I figured this would be a good example of maybe a more simple internal tool where you might trust AI to go a bit nuts since nothing critical will explode if it messes up.

The way the site works is it relies on the output of the gcloud CLI tool to generate JSON of all of the IAM permissions for GCP, then outputs them so that I can put them into categories and quickly look for the one I want. I found the official documentation to be slow and hard to use, so I made my own. It's a Flask app, which means it is pretty stupid simple.

import os
from flask import *
from all_functions import *
import json


app = Flask(__name__)

@app.route('/')
def main():
    items = get_iam_categories()
    role_data = get_roles_data()
    return render_template("index.html", items=items, role_data=role_data)

@app.route('/all-roles')
def all_roles():
    items = get_iam_categories()
    role_data = get_roles_data()
    return render_template("all_roles.html", items=items, role_data=role_data)

@app.route('/search')
def search():
    items = get_iam_categories()
    return render_template('search_page.html', items=items)

@app.route('/iam-classes')
def iam_classes():
    source = request.args.get('parameter')
    items = get_iam_categories()
    specific_items = get_specific_roles(source)
    print(specific_items)
    return render_template("iam-classes.html", specific_items=specific_items, items=items)

@app.route('/tsid', methods=['GET'])
def tsid():
    data = get_tsid()
    return jsonify(data)

@app.route('/eu-eea', methods=['GET'])
def eueea():
    country_code = get_country_codes()
    return is_eea(country_code)


if __name__ == '__main__':
    app.run(debug=False)

I also have an endpoint I use during testing of some specific GDPR code, so I can curl it and see if the IP address is coming from the EU/EEA or not, along with a TSID generator I used for a brief period of testing that I don't need anymore. So again, pretty simple. It could be rewritten to be much better but I'm the primary user and I don't care, so whatever.

So effectively what I want to add is another route where I would also have a list of all the GCP machine types because their official documentation is horrible and unreadable. https://cloud.google.com/compute/docs/machine-resource

What I'm looking to add is something like this: https://gcloud-compute.com/

Look how information packed it is! My god, I can tell at a glance if a machine type is eligible for Sustained Use Discounts, how many regions it is in, Hour/Spot/Month pricing and the breakout per OS along with Clock speed. If only Google had a team capable of making a spreadsheet.

Nothing I enjoy more than nested pages with nested submenus that lack all the information I would actually need. I'm also not clear what a Tier_1 bandwidth is but it does seem unlikely that it matters for machine types when so few have it.

I could complain about how GCP organizes information all day but regardless the information exists. So I don't need anything to this level, but could I make a simpler version of this that gives me some of the same information? Seems possible.

How I Would Do It

First let's try to stick with the gcloud CLI approach.

gcloud compute machine-types list --format="json"

The only problem is that while it does output the information I want, for some reason it outputs an entry for every single zone.

  {
    "creationTimestamp": "1969-12-31T16:00:00.000-08:00",
    "description": "4 vCPUs 4 GB RAM",
    "guestCpus": 4,
    "id": "903004",
    "imageSpaceGb": 0,
    "isSharedCpu": false,
    "kind": "compute#machineType",
    "maximumPersistentDisks": 128,
    "maximumPersistentDisksSizeGb": "263168",
    "memoryMb": 4096,
    "name": "n2-highcpu-4",
    "selfLink": "https://www.googleapis.com/compute/v1/projects/sybogames-artifact/zones/africa-south1-c/machineTypes/n2-highcpu-4",
    "zone": "africa-south1-c"
  }

I don't know why but sure. However I don't actually need every region so I can cheat here. gcloud compute machine-types list --format="json" gets me some of the way there.

Where's the price?

Yeah so Google doesn't expose pricing through the API as far as I can tell. You can download what is effectively a global price list for your account at https://console.cloud.google.com/billing/[your billing account id]/pricing. That's a 13 MB CSV that includes what your specific pricing will be, which is what I would use. So then I would combine the information from my region with the information from the CSV and then output the values. However since I don't know whether the pricing I have is relevant to you, I can't really use this to generate a public webpage.
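That said, for my own private use the combine step is pretty simple. Here's a rough sketch of what I mean, assuming you've saved the gcloud output to machine-types.json and the billing export to pricing.csv; the CSV column names are guesses since the export layout depends on your account:

import csv
import json

# gcloud compute machine-types list --format="json" > machine-types.json
with open("machine-types.json") as f:
    machine_types = json.load(f)

# Collapse the per-zone duplication down to one entry per machine type name
unique_types = {m["name"]: m for m in machine_types}

# Billing export CSV -- the "SKU description" and "List price ($)" headers
# are hypothetical, check the header row of your own download
prices = {}
with open("pricing.csv") as f:
    for row in csv.DictReader(f):
        prices[row["SKU description"]] = row["List price ($)"]

# Crude join: find SKUs whose description mentions the machine family
for name, machine in unique_types.items():
    family = name.split("-")[0].upper()  # "n2-highcpu-4" -> "N2"
    candidates = {d: p for d, p in prices.items() if family in d}
    print(name, machine["guestCpus"], machine["memoryMb"], len(candidates), "matching SKUs")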

Web Scraping

So realistically my only option would be to scrape the pricing page here: https://cloud.google.com/compute/all-pricing. Except of course it was designed in such a way as to make it as hard to do that as possible.

Boy it is hard to escape the impression GCP does not want me doing large-scale cost analysis. Wonder why?

So there's actually a tool called gcosts which seems to power a lot of these sites running price analysis. However it relies on a pricing.yml file which is automatically generated weekly. The work involved in generating this file is not trivial:

 +--------------------------+  +------------------------------+
 | Google Cloud Billing API |  | Custom mapping (mapping.csv) |
 +--------------------------+  +------------------------------+
               ↓                              ↓
 +------------------------------------------------------------+
 | » Export SKUs and add custom mapping IDs to SKUs (skus.sh) |
 +------------------------------------------------------------+
               ↓
 +----------------------------------+  +-----------------------------+
 | SKUs pricing with custom mapping |  | Google Cloud Platform info. |
 |             (skus.db)            |  |           (gcp.yml)         |
 +----------------------------------+  +-----------------------------+
                \                             /
         +--------------------------------------------------+
         | » Generate pricing information file (pricing.pl) |
         +--------------------------------------------------+
                              ↓
                +-------------------------------+
                |  GCP pricing information file |
                |          (pricing.yml)        |
                +-------------------------------+

Alright so looking through the GitHub Action that generates this pricing.yml file, here, I can see how it works and how the file is generated. But also I can just skip that part and pull the latest for my usecase whenever I regenerate the site. That can be found here.

Effectively with no assistance from AI, I have now figured out how I would do this (a rough sketch follows the list):

  • Pull down the pricing.yml file and parse it
  • Take that information and output it to a simple table structure
  • Make a new route on the Flask app and expose that information
  • Add a step to the Dockerfile to pull in the new pricing.yml with every Dockerfile build just so I'm not hammering the GitHub CDN all the time.
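Just to make the Flask part concrete, here's roughly what that route looks like in my head. This is a sketch, not the final code: I'm guessing at the pricing.yml layout (something like compute -> instance -> machine type) and at a machine_types.html template that extends base.html like everything else, so verify the key names against the real file:

import yaml
from flask import Flask, render_template

app = Flask(__name__)

def get_machine_types(path="pricing.yml"):
    # pricing.yml gets pulled in during the Docker build, per the plan above
    with open(path) as f:
        pricing = yaml.safe_load(f)
    machines = []
    # Assumed structure: {"compute": {"instance": {"n2-highcpu-4": {...}}}}
    for name, details in pricing.get("compute", {}).get("instance", {}).items():
        machines.append({
            "name": name,
            "cpu": details.get("cpu"),
            "ram": details.get("ram"),
            "cost": details.get("cost", {}),  # per-region pricing, if present
        })
    return sorted(machines, key=lambda m: m["name"])

@app.route("/machine-types")
def machine_types():
    # machine_types.html would extend base.html like the other templates
    return render_template("machine_types.html", machines=get_machine_types())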

Why Am I Saying All This?

So this is a perfect example of an operation that should be simple but, because the vendor doesn't want to make it simple, is actually pretty complicated. As we can tell from the PR generated before, AI is never going to be able to understand all the steps we just walked through to figure out how one actually gets the prices for these machines. We've also learned that because of the hard work of someone else, we can skip a lot of the steps. So let's try it again.

Attempt 2

Maybe if I give it super specific information, it can do a better job.

You can see the issue here: https://github.com/matdevdug/gcp-iam-reference/issues/4

I think I've maybe explained what I'm trying to do. Certainly a person would understand this. Obviously this isn't the right way to organize this information, I would want to do a different view and sort by region and blah blah blah. However this should be easier for the machine to understand.

Note: I am aware that Copilot has issues making calls to the internet to pull files, even from GitHub itself. That's why I've tried to include a sample of the data. If there's a canonical way to pass the tool information inside of the issue let me know at the link at the bottom.

Results

So at first things looked promising.

It seems to understand what I'm asking and why I'm asking it. This is roughly the correct thing. The plan also looks ok:

You can see the PR it generated here: https://github.com/matdevdug/gcp-iam-reference/pull/5

So this is much closer but it's still not really "right". First, like most Flask apps, I have a base template that I want to include on every page: https://github.com/matdevdug/gcp-iam-reference/blob/main/templates/base.html

Then for every HTML file after that we extend the base:

{% extends "base.html" %}

{% block main %}

<style>
        table {
            border-collapse: collapse;
            width: 100%;
        }

        th, td {
            border: 1px solid #dddddd;
            text-align: left;
            padding: 8px;
        }

        tr:nth-child(even) {
            background-color: #f2f2f2;
        }
</style>

The AI doesn't understand that we've done this and is just re-implementing Bootstrap: https://github.com/matdevdug/gcp-iam-reference/pull/5/files#diff-a8e8dd2ad94897b3e1d15ec0de6c7cfeb760c15c2bd62d828acba2317189a5a5

It's not adding it to the menu bar, there are actually a lot of pretty basic misses here. I wouldn't accept this PR from a person, but let's see if it works!

 => ERROR [6/8] RUN wget https://raw.githubusercontent.com/Cyclenerd/google-cloud-pricing-cost-calculator/master/pricing.yml -O pricing.yml                                             0.1s
------
 > [6/8] RUN wget https://raw.githubusercontent.com/Cyclenerd/google-cloud-pricing-cost-calculator/master/pricing.yml -O pricing.yml:
0.104 /bin/sh: 1: wget: not found

No worries, easy to fix.

Alright fixed wget, let's try again!

2024-06-18 11:18:57   File "/usr/local/lib/python3.12/site-packages/gunicorn/util.py", line 371, in import_app
2024-06-18 11:18:57     mod = importlib.import_module(module)
2024-06-18 11:18:57           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-18 11:18:57   File "/usr/local/lib/python3.12/importlib/__init__.py", line 90, in import_module
2024-06-18 11:18:57     return _bootstrap._gcd_import(name[level:], package, level)
2024-06-18 11:18:57            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
2024-06-18 11:18:57   File "<frozen importlib._bootstrap_external>", line 995, in exec_module
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
2024-06-18 11:18:57   File "/app/main.py", line 2, in <module>
2024-06-18 11:18:57     import yaml
2024-06-18 11:18:57 ModuleNotFoundError: No module named 'yaml'

Yeah I did anticipate this one. Alright let's add PyYAML so there's something to import. I'll give AI a break on this one, this is a dumb Python thing.

Ok so it didn't add it to the menu, it didn't follow the style conventions, but did it at least work? Also no.

I'm not sure how it could have done a worse job to be honest. I understand what it did wrong and why this ended up like it did, but the work involved in fixing it exceeds the amount of work it would take for me to do it myself from scratch. The point of this was to give it a pretty simple concept (parse a YAML file) and see what it did.

Conclusion

I'm sure this tool is useful to someone on Earth. That person probably hates programming and gets no joy out of it, looking for something that could help them spend less time doing it. I am not that person. Having a tool that makes stuff that looks right but ends up broken is worse than not having the tool at all.

If you are a person maintaining an extremely simple thing with amazing test coverage, I guess go for it. Otherwise this is just a great way to get PRs that look right and completely waste your time. I'm sure there are ways to "prompt engineer" this better and if someone wants to tell me what I could do, I'm glad to re-run the test. However as it exists now, this is not worth using.

If you want to use it, here are my tips:

  • Your source of data must be inside of the repo, it doesn't like making network calls
  • It doesn't seem to go check any sort of requirements file for Python, so assume the dependencies are wrong
  • It understands Dockerfiles but doesn't check whether a binary is present, so add a check for that
  • It seems to do better with JSON than YAML

Questions/comments/concerns: https://c.im/@matdevdug


Simple Kubernetes Secret Encryption with Python

I was recently working on a new side project in Python with Kubernetes and I needed to inject a bunch of secrets. The problem with secret management in Kubernetes is that you end up needing to set up a lot of it yourself and it's time consuming. When I'm working on a new idea, I typically don't want to waste a bunch of hours setting up "the right way" to do something that isn't related to the core of the idea I'm trying out.

For the record, the right way to do secrets in Kubernetes is the following:

  • Turn on encryption at rest for ETCD
  • Carefully set up RBAC inside of Kubernetes to ensure the right users and service accounts can access the secrets
  • Give up on trying to do that and end up setting up Vault or paying your cloud provider for their Secret Management tool
  • There is a comprehensive list of secret managers and approaches outlined here: https://www.argonaut.dev/blog/secret-management-in-kubernetes

However, especially when trying ideas out, I wanted something more idiot-proof that didn't require any setup. So I wrote something simple with Python Fernet encryption that I thought might be useful to someone else out there.

You can see everything here: https://gitlab.com/matdevdug/example_kubernetes_python_encryption

Walkthrough

So the script works in a pretty straightforward way. It reads the .env file you generate as outlined in the README, with secrets in the following format:

Make a .env file with the following parameters:

KEY=Make a fernet key: https://fernetkeygen.com/
CLUSTER_NAME=name_of_cluster_you_want_to_use
SECRET-TEST-1=9e68b558-9f6a-4f06-8233-f0af0a1e5b42
SECRET-TEST-2=a004ce4c-f22d-46a1-ad39-f9c2a0a31619

The KEY is the secret key and the CLUSTER_NAME tells the Kubernetes library what kubeconfig target you want to use. Then the tool finds anything with the word SECRET in the .env file and encrypts it, then writes it to the .csv file.

The .csv file looks like the following:

I really like to keep some sort of record of what secrets are injected into the cluster outside of the cluster just so you can keep track of the encrypted values. Then the script checks the namespace you selected to see if there are secrets with that name already and, if not, injects it for you.
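If you just want the shape of it without reading the repo, here's a condensed sketch of the approach (not the exact script): python-dotenv for the .env parsing, Fernet for the encryption and the official Kubernetes Python client for the injection. The namespace is hardcoded here for brevity and the CSV bookkeeping is left out:

import base64
from cryptography.fernet import Fernet
from dotenv import dotenv_values          # python-dotenv
from kubernetes import client, config
from kubernetes.client.exceptions import ApiException

env = dotenv_values(".env")
fernet = Fernet(env["KEY"])

# Point the Kubernetes client at the cluster named in the .env file
config.load_kube_config(context=env["CLUSTER_NAME"])
v1 = client.CoreV1Api()
namespace = "default"  # illustrative -- the real script lets you pick

for key, value in env.items():
    if not key.startswith("SECRET"):
        continue
    name = key.lower()  # secret names must be lowercase with - or .
    # Encrypt with Fernet, then base64 encode before upload, so the
    # application has to decode and then decrypt (see the notes below)
    payload = base64.b64encode(fernet.encrypt(value.encode())).decode()
    body = client.V1Secret(
        metadata=client.V1ObjectMeta(name=name),
        string_data={name: payload},
        immutable=True,
    )
    try:
        v1.read_namespaced_secret(name, namespace)  # already there, skip it
    except ApiException as e:
        if e.status == 404:
            v1.create_namespaced_secret(namespace, body)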

Some quick notes about the script:

  • Secret names in Kubernetes need a specific format: lowercase, with words separated by - or . The script will take the uppercase names in the .env and convert them to lowercase. Just be aware it is doing that.
  • It does base64 encode the secret before it uploads it, so be aware that your application will need to decode it when it loads the secret.
  • Now the only secret you need to worry about is the Fernet secret that you can load into the application in a secure way. I find this is much easier to mentally keep track of than trying to build an infinitely scalable secret solution. Plus it's cheaper, since many secret managers charge per secret.
  • The secrets are immutable which means they are lightweight on the k8s API and fast. Just be aware you'll need to delete the secrets if you need to replace them. I prefer this approach because I'd rather store more things as encrypted secrets and not worry about load.
  • It is easy to specify which namespace you intend to load the secrets into and I recommend using a different Fernet secret per application.
  • Mounting the secret works like it always does in k8s:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: your-image:latest
      volumeMounts:
        - name: secret-volume
          mountPath: /path/to/secret/data
  volumes:
    - name: secret-volume
      secret:
        secretName: my-secret

Inside of your application, you need to load the Fernet secret and decrypt the secrets. With Python that is pretty simple.

decrypt = fernet.decrypt(token)
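In a bit more detail, assuming the secret is mounted at the path from the Pod spec above and the Fernet key arrives via an environment variable (FERNET_KEY is a made-up name), the application side looks something like this:

import base64
import os
from cryptography.fernet import Fernet

# The Fernet key still has to reach the app some other way -- an environment
# variable set at deploy time is assumed here
fernet = Fernet(os.environ["FERNET_KEY"])

# The filename under the mountPath is the secret name
with open("/path/to/secret/data/secret-test-1") as f:
    token = base64.b64decode(f.read())  # undo the base64 step from the upload

value = fernet.decrypt(token).decode()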

Q+A

  • Why not SOPS? This is easier and also handles the process of making the API call to your k8s cluster to make the secret.
  • Is Fernet secure? As far as I can tell it's secure enough. Let me know if I'm wrong.
  • Would you make a CLI for this? If people actually use this thing and get value out of it, I would be more than happy to make it a CLI. I'd probably rewrite it in Golang if I did that, so if people ask it'll take me a bit of time to do it.

Questions/comments/concerns: https://c.im/@matdevdug


The Worst Website In The Entire World

The Worst Website In The Entire World

What if you set out to make the worst website you possibly could? So poorly designed and full of frustrating patterns that users would not only hate the experience of using this website, but would also come to hate your company. Could we make a web experience so terrible that it would express how much our company hated our users?

As a long-time Internet addict, I've encountered my fair share of terrible websites. Instagram where now half my feed is advertisements for stupid t-shirts and the other half is empty black space.

Who in the fuck would ever wear this

Or ARNGREN.net which is like if a newspaper ad threw up on my screen.

But Instagram still occasionally shows me pictures of people I follow and ultimately the stuff on ARNGREN is so cool I still want to buy it regardless of the layout.

No, I believe it is the crack team at Broadcom that have nailed it for the worst website in existence.

Lured in with free VMware

So through social media I discovered this blog post from VMware announcing that their popular virtualization software is free for personal use now. You can read that here. Great, I used VMware Fusion before and it was ok and maybe it will let me run Windows on an ARM Mac. Probably not but let's try it out and see.

This means that everyday users who want a virtual lab on their Mac, Windows or Linux computer can do so for free simply by registering and downloading the latest build from the new download portal located at support.broadcom.com. With the new commercial model, we have reduced our product group offerings down to a single SKU (VCF-DH-PRO) for users who require commercial use licensing. This simplification eliminates 40+ other SKUs and makes quoting and purchasing VMware Desktop Hypervisor apps, Fusion Pro and Workstation Pro, easier than ever. The new Desktop Hypervisor app subscription can be purchased from any Broadcom Advantage partner.

I don't want to register at support.broadcom.com but it looks like I don't have a choice as this is the screen on the VMware site.

Now this is where alarm bells start going crazy in my head. Nothing about this notice makes sense. "The store will be moving to a new domain". So it's...not...down for maintenance but actually is just gone? Or is it actually coming back? Because then you say "store will be shutdown" (just a quick note, you want "the store" and "will be shutting down on April 30th 2024"). Also why don't you just redirect to the new domain? What is happening here?

Broadcom

So then I go to support.broadcom.com which is where I was told to register and make an account.

Never a great sign when there's a link to an 11 page PDF of how to navigate your website. That's the "Learn how to navigate Broadcom Support" link. You can download that killer doc here: https://support.broadcom.com/documents/d/ecx/broadcom-support-portal-getting-started-guide

Alright let's register.

First the sentence "Enhance your skills through multiple self-service avenues by creating your Broadcom Account" leaps off the page as just pure corporate nonsense. I've also never seen a less useful CAPTCHA, it looks like it is from 1998 and any modern text recognition software would defeat it. In fact the Mac text recognition in Preview defeats 3 of the 4 characters with no additional work:

So completely pointless and user hostile. Scoring lots of points for the worst website ever. I'm also going to give some additional points for "Ask our chatbot for assistance", an idea so revolting normally I'd just give up on the entire idea. But of course I'm curious, so I click on the link for the "Ask our chatbot" and.....

It takes me back to the main page.

Slow clap Broadcom. Imagine being a customer that is so frustrated with your support portal that you actually click "Ask a chatbot" and the web developers at Broadcom come by and karate chop you right in the throat. Bravo. Now in Broadcom's defense in the corner IS a chatbot icon so I kinda see what happened here. Let's ask it a question.

I didn't say hello. I don't know why it decided I said hello to it. But in response to VMware it gives me this:

Did the chatbot just tell me to go fuck myself? Why did you make a chatbot if all you do is select a word from a list and it returns the link to the support doc? Would I like to "Type a Query"?? WHAT IS A CHATBOT IF NOT TYPING QUERIES?

Radio | Thanks! I Hate It

Next Steps

I fill in the AI-proof CAPTCHA and hit next, only to be greeted with the following screen for 30 seconds.

Finally I'm allowed to make my user account.

Um....alright....seems like overkill Broadcom but you know what this is your show. I have 1Password so this won't be a problem. It's not letting me copy/paste from 1Password into this field but if I do Command + \ it seems to let me insert. Then I get this.

What are you doing to me Broadcom. Did I....wrong you in some way? I don't understand what is happening. Ok well I refresh the page, try again and it works this time. Except I can't copy/paste into the Confirm Password field.

I mean they can't expect me to type out the impossibly complicated password they just had me generate right? Except they have and they've added a check to ensure that I don't disable Javascript and treat it like a normal HTML form.

Hey front-end folks, just a quick note. Never ever ever ever ever mess with my browser. It's not yours, it's mine. I'm letting you use it for free to render your bloated sites. Don't do this to me. I get to copy paste whatever I want whenever I want. When you get your own browser you can do whatever you want but while you are living in my house under my rules I get to copy/paste whenever I goddamn feel like it.

Quickly losing enthusiasm for the idea of VMware

So after pulling up the password and typing it in, I'm treated to this absolutely baffling screen.

Do I need those? I feel like I might need those. eStore at least sounds like something I might want. I don't really want Public Semiconductors Case Management but I guess that one comes in the box. 44 seconds of this icon later

I'm treated to the following.

Broadcom, you clever bastards. Just when I thought I was out, they pulled me back in. Tricking users into thinking a link is going to help them and then telling them to get fucked by advising them to contact your sales rep? Genius.

So then I hit cancel and get bounced back to......you guessed it!

Except I'm not even logged into my newly created account. So then I go to login with my new credentials and I finally make it to my customer portal. Well no first they need to redirect me back to the Broadcom Support main page again with new icons.

Apparently my name was too long to show and instead of fixing that or only showing first name Broadcom wanted to ensure the disrespect continued and sorta trail off. Whatever, I'm finally in the Matrix.

Now where might I go to...actually download some VMware software. There's a search bar that says "Search the entire site", let's start there!

Nothing found except for a CVE. Broadcom you are GOOD! For a second I thought you were gonna help me and like Lucy with the football you made me eat shit again.

Lucy and the football - John Quiggin's Blogstack

My Downloads was also unhelpful.

But maybe I can add the entitlement to the account? Let's try All Products.

Of course the link doesn't work. What was I even thinking trying that? That one is really on me. However "All Products" on the left-hand side works and finally I find it. My white whale.

Except when I click on product details I'm brought back to....

The blank page with no information! Out of frustration I click on "My Downloads" again which is now magically full of links! Then I see it!

YES. Clicking on it I get my old buddy the Broadcom logo for a solid 2 minutes 14 seconds.

Now I have fiber internet with 1000 down, so this has nothing to do with me. Finally I click the download button and I get.....the Broadcom logo again.

30 seconds pass. 1 minute passes. 2 minutes pass. I'm not sure what to do.

No. No you piece of shit website. I've come too far and sacrificed too much of my human dignity. I am getting a fucking copy of VMware Fusion. Try 2 is the same thing. 3, 4, 5 all fail. Then finally.

I install it and like a good horror movie, I think it's all over. I've killed Jason. Except when I'm installing Windows I see this little link:

And think "wow I would like to know what the limitations are for Windows 11 for Arm!". Click on it and I'm redirected to...

Just one final fuck you from the team at Broadcom.

Conclusion

I've used lots of bad websites in my life. Hell, I've made a lot of bad websites in my life. But never before have I seen a website that so completely expresses just the pure hatred of users like this one. Everything was as poorly designed as possible, with user hostile design at every corner.

Honestly Broadcom, I don't even know why you bothered buying VMware. It's impossible for anyone to ever get this product from you. Instead of migrating from the VMware store to this disaster, maybe just shut this down entirely. Destroy the backups of this dumpster fire and start fresh. Maybe just consider a Shopify site, because at least then an average user might have a snowball's chance in hell of ever finding something to download from you.

Do you know of a worse website? I want to see it. https://c.im/@matdevdug


The Time Linkerd Erased My Load Balancer

The Time Linkerd Erased My Load Balancer

A cautionary tale of K8s CRDs and Linkerd.

A few months ago I had the genius idea of transitioning our production load balancer stack from Ingress to Gateway API in k8s. For those unaware, Ingress is the classic way of writing a configuration to tell a load balancer what routes should hit what services, effectively how you expose services to the Internet. Gateway API is the re-imagined process for doing this where the problem domain is scoped, allowing teams more granular control over their specific services' routes.

Ingress

apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: external-lb
spec:
  controller: example.com/ingress-controller
  parameters:
    apiGroup: k8s.example.com
    kind: IngressParameters
    name: external-lb

This is what setting up a load balancer in Ingress looks like


After conversations with various folks at GCP it became clear to me that while Ingress wasn't deprecated or slated to be removed, Gateway API is where all the new development and features are moving to. I decided that we were a good candidate for the migration since we are a microservice based backend with lower and higher priority hostnames, meaning we could safely test the feature without cutting over all of our traffic at the same time.

I had this idea that we would turn on both Ingress and Gateway API and then cut between the two different IP addresses at the Cloudflare level. From my low-traffic testing this approach seemed to work ok, with me being able to switch between the two and then letting Gateway API bake for a week or two to shake out any problems. Then I decided to move to prod. Due to my lack of issues in the lower environments I decided that I wouldn't set up Cloudflare load balancing between the two and manage the cut-over in Terraform. This turned out to be a giant mistake.

The long and short of it is that the combination of Gateway API and Linkerd in GKE fell down under a high volume of requests. At low request volumes there were no problems, but once we got to around 2k requests a second the linkerd-proxy sidecar container's memory usage started to grow unbounded. When I attempted to cut back from Gateway API to Ingress, I encountered a GKE bug I hadn't seen before in the lower environments.

"Translation failed: invalid ingress spec: service "my_namespace/my_service" is type "ClusterIP", expected "NodePort" or "LoadBalancer";

What we were seeing was a mismatch between the annotations automatically added by GKE.

Ingress adds these annotations:  
cloud.google.com/neg: '{"ingress":true}'
cloud.google.com/neg-status: '{"network_endpoint_groups":{"80":"k8s1pokfef..."},"zones":["us-central1-a","us-central1-b","us-central1-f"]}'


Gateway adds these annotations:
cloud.google.com/neg: '{"exposed_ports":{"80":{}}}'
cloud.google.com/neg-status: '{"network_endpoint_groups":{"80":"k8s1-oijfoijsdoifj-..."},"zones":["us-central1-a","us-central1-b","us-central1-f"]}'

Gateway doesn't understand the Ingress annotations and vice-versa. This obviously caused a massive problem and blew up in my face. I had thought I had tested this exact failure case, but clearly prod surfaced a different behavior than I had seen in lower environments. Effectively no traffic was getting to pods while I tried to figure out what had broken.

I ended up having to manually modify the annotations to get things working, and had a pretty embarrassing blow-up in my face after what I had thought was careful testing (but was clearly wrong).

Fast Forward Two Months

I have learned from my mistake regarding the Gateway API and Ingress and was functioning totally fine on Gateway API when I decided to attempt to solve the Linkerd issue. The issue I was seeing with Linkerd was high-volume services were seeing their proxies consume unlimited memory, steadily growing over time but only while on Gateway API. I was installing Linkerd with their Helm libraries, which have 2 components, the Linkerd CRD chart here: https://artifacthub.io/packages/helm/linkerd2/linkerd-crds and the Linkerd control plane: https://artifacthub.io/packages/helm/linkerd2/linkerd-control-plane

Since debug logs and upgrades hadn't gotten me any closer to a solution as to why the proxies were consuming unlimited memory until they eventually were OOMkilled, I decided to start fresh. I removed the Linkerd injection from all deployments and removed the helm charts. Since this was a non-prod environment, I figured at least this way I could start fresh with debug logs and maybe come up with some justification for what was happening.

Except the second I uninstalled the charts, my graphs started to freak out. I couldn't understand what was happening, how did removing Linkerd break something? Did I have some policy set to require Linkerd? Why were my traffic levels quickly approaching zero in the non-prod environment?

Then a coworker said "oh it looks like all the routes are gone from the load balancer". I honestly hadn't even thought to look there, assuming the problem was some misaligned Linkerd policy where our deployments required encryption to communicate or some mistake on my part in the removal of the helm charts. But they were right, the load balancers didn't have any routes. kubectl confirmed, no HTTProutes remained.

So of course I was left wondering "what just happened".

Gateway API

So a quick crash course in "what is gateway API". At a high level, as discussed before, it is a new way of defining Ingress which cleans up the annotation mess and allows for a clean separation of responsibility in an org.


So GCP defines the GatewayClass, I make the Gateway and developers provide the HTTPRoutes. This means developers can safely change the routes to their own services without the risk that they will blow up the load balancer. It also provides a ton of great customization for how to route traffic to a specific service.


So first you make a Gateway like so in Helm or whatever:

---
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: {{ .Values.gateway_name }}
  namespace: {{ .Values.gateway_namespace }}
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        kinds:
        - kind: HTTPRoute
        namespaces:
          from: Same
    - name: https
      protocol: HTTPS
      port: 443
      allowedRoutes:
        kinds:
          - kind: HTTPRoute
        namespaces:
          from: All
      tls:
        mode: Terminate
        options:
          networking.gke.io/pre-shared-certs: "{{ .Values.pre_shared_cert_name }},{{ .Values.internal_cert_name }}"

Then you provide a different YAML of HTTPRoute for the redirect of http to https:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: redirect
  namespace: {{ .Values.gateway_namespace }}
spec:
  parentRefs:
  - namespace: {{ .Values.gateway_namespace }}
    name: {{ .Values.gateway_name }}
    sectionName: http
  rules:
  - filters:
    - type: RequestRedirect
      requestRedirect:
        scheme: https

Finally you can set policies.

---
apiVersion: networking.gke.io/v1
kind: GCPGatewayPolicy
metadata:
  name: tls-ssl-policy
  namespace: {{ .Values.gateway_namespace }}
spec:
  default:
    sslPolicy: tls-ssl-policy
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: {{ .Values.gateway_name }}

Then your developers can configure traffic to their services like so:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: store
spec:
  parentRefs:
  - kind: Gateway
    name: internal-http
  hostnames:
  - "store.example.com"
  rules:
  - backendRefs:
    - name: store-v1
      port: 8080
  - matches:
    - headers:
      - name: env
        value: canary
    backendRefs:
    - name: store-v2
      port: 8080
  - matches:
    - path:
        value: /de
    backendRefs:
    - name: store-german
      port: 8080

Seems Straightforward

Right? There isn't that much to the thing. So after I attempted to re-add the HTTPRoutes using Helm and Terraform (which of course didn't detect a diff even though the routes were gone, because Helm never seems to do what I want it to do in a crisis) and then ended up bumping the chart version to finally force it to do the right thing, I started looking around. What the hell had I done to break this?

When I removed the Linkerd CRDs it somehow took out my HTTPRoutes. So then I went to the Helm chart trying to work backwards. Immediately I see this:

{{- if .Values.enableHttpRoutes }}
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    api-approved.kubernetes.io: https://github.com/kubernetes-sigs/gateway-api/pull/1923
    gateway.networking.k8s.io/bundle-version: v0.7.1-dev
    gateway.networking.k8s.io/channel: experimental
    {{ include "partials.annotations.created-by" . }}
  labels:
    helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    linkerd.io/control-plane-ns: {{.Release.Namespace}}
  creationTimestamp: null
  name: httproutes.gateway.networking.k8s.io
spec:
  group: gateway.networking.k8s.io
  names:
    categories:
    - gateway-api
    kind: HTTPRoute
    listKind: HTTPRouteList
    plural: httproutes
    singular: httproute
  scope: Namespaced
  versions:

Sure enough, the Linkerd CRD Helm chart has enableHttpRoutes defaulted to true.

I also found this issue: https://github.com/linkerd/linkerd2/issues/12232

So yeah, Linkerd is, for some reason, pulling this CRD from a pull request from April 6th of last year that is marked as "do not merge". https://github.com/kubernetes-sigs/gateway-api/pull/1923

Linkerd is aware of the possible problem but presumes you'll catch the configuration option on the Helm chart: https://github.com/linkerd/linkerd2/issues/11586

To be clear I'm not "coming after Linkerd" here. I just thought the whole thing was extremely weird and wanted to make sure, given the amount of usage Linkerd gets out there, that other people were made aware of it before running the car into the wall at 100 MPH.

What are CRDs?

Kubernetes Custom Resource Definitions (CRDs) essentially extend the Kubernetes API to manage custom resources specific to your application or domain.

  • CRD Object: You create a YAML manifest file defining the Custom Resource Definition (CRD). This file specifies the schema, validation rules, and names of your custom resource.
  • API Endpoint: When you deploy the CRD, the Kubernetes API server creates a new RESTful API endpoint for your custom resource.

Effectively when I enabled Gateway API in GKE with the following I hadn't considered that I could end up in a CRD conflict state with Linkerd:

  gcloud container clusters create CLUSTER_NAME \
    --gateway-api=standard \
    --cluster-version=VERSION \
    --location=CLUSTER_LOCATION

What I suspect happened is, since I had Linkerd installed before I had enabled the gateway-api on GKE, when GCP attempted to install the CRD it failed silently. Since I didn't know there was a CRD conflict, I didn't understand that the CRD that the HTTPRoutes relied on was actually the Linkerd maintained one, not the GCP one. Presumably had I attempted to do this the other way it would have thrown an error when the Helm chart attempted to install a CRD that was already present.
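If you want to check whether you're in this state before deleting anything, look at who actually owns the HTTPRoute CRD in your cluster (kubectl get crd httproutes.gateway.networking.k8s.io -o yaml shows the same thing). Here's a quick sketch with the Kubernetes Python client, keying off the label and annotations visible in the chart snippet above:

from kubernetes import client, config

config.load_kube_config()
crd = client.ApiextensionsV1Api().read_custom_resource_definition(
    "httproutes.gateway.networking.k8s.io"
)

labels = crd.metadata.labels or {}
annotations = crd.metadata.annotations or {}

# The Linkerd CRD chart stamps its CRDs with this label, which is a decent
# tell that uninstalling the chart will take your HTTPRoutes with it
if "linkerd.io/control-plane-ns" in labels:
    print("HTTPRoute CRD is owned by the Linkerd CRD chart")

print("bundle-version:", annotations.get("gateway.networking.k8s.io/bundle-version"))
print("channel:", annotations.get("gateway.networking.k8s.io/channel"))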

To be clear before you call me an idiot, I am painfully aware that the deletion of CRDs is a dangerous operation. I understand I should have carefully checked and I am admitting I didn't in large part because it just never occurred to me that something like Linkerd would do this. Think of my failure to check as a warning to you, not an indictment against Kubernetes or whatever.

Conclusion

If you are using Linkerd and Helm and intend to use Gateway API, this is your warning right now to go in there and flip that value in the Helm chart to false. Learn from my mistake.

Questions/comments/concerns: https://c.im/@matdevdug


AI Is Speaking In Tongues

AI Is Speaking In Tongues

In my hometown in Ohio, church membership was a given for middle-class people. With a population of 8,000 people, somehow 19 churches were kept open and running. A big part of your social fabric was the kids who went to the same church that you did, the people you would gravitate towards in a new social situation. The more rural and working class your family was, the less likely you actually went to a church on a regular basis. You'd see them on Christmas and Easter, but they weren't really part of the "church group".

My friend Mark was in this group. His family lived in an underground house next to their former motorcycle shop that had closed down when his dad died. It was an hour walk from my house to his through a beautiful forest filled with deer. I was often stuck within eyesight of the house as a freight train slowly rolled through in front of me, sitting on a tree stump until the train passed. The former motorcycle shop had a Dr. Pepper machine in front of the soaped windows I had a key to and we'd sneak in to explore the shop on sleepovers while his mom worked one of her many jobs.

What I mean when I say "underground house"

She was a chain-smoker who rarely spoke, often lighting one cigarette with the burning cherry of the last as she drove us to the local video store to rent a videogame. Mark's older brother also lived in the house, but rarely left his room. Mostly it was the two of us bouncing around unsupervised, watching old movies and drinking way too much soda as his German Shepherd wheezed and coughed in the cigarette-smoke-filled house.

Families like this were often targeted by Evangelical Christians, groups that often tried to lure families in with offers of youth groups that could entertain your kids while you worked. At some point, one of the youth pastors convinced Mark's mom that she should send both of us to his church. Instead of Blockbuster, we got dropped off in front of an anonymous steel warehouse structure with a cross and a buzzing overhead light on a dark country road surrounded by cornfields. Mark hadn't really been exposed to religion, with his father and mother having been deep into biker culture. I had already seen these types of places before and was dreading what I knew came next.

When we walked in, we were introduced to "Pastor Michael", who looked a bit like if Santa Claus went on an extreme diet and bought serial killer glasses. Mark bounced over and started asking him questions, but I kept my distance. I had volunteered the year before to fix up an old train station which involved a large crew of youth "supervised" by the fundamentalist Christian church that wanted to turn the train station into a homeless shelter. We slept on the floors of the middle school in this smaller town neighboring mine, spending our days stripping paint and sanding floors and our evenings being lectured about the evils of sex and how America was in a "culture war". In retrospect I feel like there should have been more protective gear in removing lead paint as child labor, but I guess that was up to God.

After one of these long sessions where we were made to stand up and promise we wouldn't have sex before we got married, I made a joke in the walk back to our assigned classroom and was immediately set upon by the senior boy in the group. He had a military style haircut and he threw me against a locker hard enough that I saw stars. I had grown up going to Catholic school and mass every Sunday, spending my Wednesday nights going to CCD (Confraternity of Christian Doctrine), which was like Sunday school for Catholics. All of this was to say I had pretty established "Christian" credentials. This boy let me know he thought I was a bad influence, a fake Christian and that I should be careful since I'd be alone with him and his friends every night in the locked classroom. The experience had left me extremely wary of these Evangelical cults as I laid silently in my sleeping bag on the floor of a classroom, listening to a hamster running in a wheel that had clearly been forgotten.

To those of you not familiar with this world, allow me to provide some context. My Catholic education presented a very different relationship with holy figures. God spoke directly to very few people, saints mostly. There were many warnings growing up about not falling into the trap of believing you were such a person, worthy of a vision or a direct conversation with a deity. It was suggested softly this was more likely mental illness than divine intervention. He enters your heart and changes your behavior and gives you peace, but you aren't in that echelon of rare individuals for whom a chat was justified. So to me these Evangelicals claiming they could speak directly with God was heresy, a gross blasphemy where random "Pastors" were claiming they were saints.

The congregation started to file in and what would follow was one of the most surreal two hours of my life. People I knew, the woman who worked at the library and a local postal worker started to scream and wave their arms, blaming their health issues on Satan. Then at one point Scary Santa Claus started to shake and jerk, looking a bit like he was having a seizure. He started to babble loudly, moving around the room and I stared as more and more people seemed to pretend this babbling meant something and then doing it themselves. The bright lights and blaring music seemed to have worked these normal people into madness.

In this era before cell phones, there wasn't much I could do to leave the situation. I waited and watched as Mark became convinced these people were channeling the voice of God. "It's amazing, I really felt something in there, there was an energy in the room!" he whispered to me as I kept my eyes on the door. One of the youth pastors asked me if I felt the spirit moving through me, that I shouldn't resist the urge to join in. I muttered that I was ok and said I had to go to the bathroom, then waited in the stall until the service wrapped up almost two hours later.

In talking to the other kids, I couldn't wrap my mind around the reality that they believed this. "It's the language of God, only a select few can understand what the Holy Spirit is saying through us". The language was all powerful, allowing the Pastor to reveal prophesy to select members of the Church and assist them with their financial investments. This was a deadly serious business that these normal people completely believed, convinced this nonsense jabbering that would sometimes kind of sound like language was literally God talking through them.

I left confident that normal, rational people would never believe such nonsense. These people were gullible and once I got out of this dead town I'd never have to be subjected to this level of delusion. So imagine my surprise when, years later, I'm sitting in a giant conference hall in San Francisco as the CEO of Google explains to the crowd how AI is the future. This system that stitched together random words was going to replace all of us in the crowd, solve global warming, change every job. This was met with thunderous applause by the group, apparently excited to lose their health insurance. All of this had been kicked off with techno music and bright lights, a church service with a bigger budget.

Every meeting I went to was filled with people ecstatic about the possibility of replacing staff with this divine text generator. A French venture capitalist who shared an Uber with me to the original Google campus for meetings was nearly breathless with excitement. "Soon we might not even need programmers to launch a startup! Just a founder and their ideas getting out to market as fast as they can dream it." I was tempted to comment that it seemed more likely I could replace him with an LLM, but it felt mean. "It is going to change the world" he muttered as we sat in a Tesla still being driven by a human.

It has often been suggested by religious people in my life that my community, the nonreligious tech enthusiasts, use technology as a replacement for religion. We reject the fantastical concept of gods and saints only to replace them with delusional ideas of the future. Self-driving cars were inevitable until it became clear that the problem was actually too hard and we quietly stopped talking about it. Establishing a colony on Mars is often discussed as if it is "soon", even if the idea of doing so far outstrips what we're capable of doing by a factor of 10. We tried to replace paper money with a digital currency and managed to create a global Ponzi scheme that accelerated the destruction of the Earth.

Typically I reject this logic. Technology, for its many faults, also produces a lot of things with actual benefits, which is not a claim religion can make most of the time. But after months of hearing this blind faith in the power of AI, what I was hearing now and what the faithful had said to me after that service sounded eerily similar. Is this just a mass delusion, a desperate attempt by tech companies to convince us they are still worth a trillion dollars even though they have no new ideas? Is there anything here?

Glossolalia

Glossolalia, the technical term for speaking in tongues, is an old tradition with a more modern revival. It is a trademark of the Pentecostal Church, usually surrounded by loud music, screaming prayers and a leader trying to whip the crowd into a frenzy. Until the late 1950s it was confined to a few extreme groups, but since then has grown into a more and more common fixture in the US. The cultural interpretation of this trend presents it as a “heavenly language of the spirit” accessible only to the gifted ones. Glossolalists often report an intentional or spontaneous suspension of will to convey divine messages and prophecies.

In the early 1900s W. J. Seymour, a minister in the US, started to popularize the practice of whipping his congregation into a frenzy such that they could speak in tongues. His revival was based in Los Angeles, which quickly became the center of the movement. For those who felt disconnected from religion in a transplant city, it must have been quite the experience to feel your deity speaking directly through you.

Its Biblical basis is flimsy at best, however. Joel 2:28-29 says:

And afterwards I will pour out my Spirit on all people. Your sons and daughters will prophesy, your old men will dream dreams, young men will see visions. Even on my servants, both men and women, I will pour out my Spirit in those days.

A lot of research has been done into whether this speech is a "language", with fascinating results. In The Psychology of Speaking in Tongues, Kildahl and Qualben attempted to figure that out. Their conclusion was that while it could sound like a language, it was gibberish, closer to the fake language children use to practice the sounds of speaking. To believers though, this presented no problems.

He argued that glossolalia is real and that it is a gift from the Holy Spirit. He argued that a person cannot fake tongues. Tongues are an initial evidence of the spirit baptism. It is a spiritual experience. He observed that tongues cannot be understood by ordinary people. They can only be understood spiritually. He noted that when he speaks in tongues he feels out of himself. The feeling is very strange. One can cry, get excited, and laugh. As our respondent prayed, he uttered: Hiro---shi---shi---sha---a---karasha. He jumped and clapped his hands in excitement and charisma. He observed that if God allows a believer to speak in tongues there is a purpose for that. One can speak in tongues and interpret them at the same time. However, in his church there is no one who can interpret tongues. According to our respondent, tongues are intended to edify a person. Tongues are beneficial to the person who speaks in tongues. A person does not choose to pray in tongues. Tongues come through the Spirit of God. When speaking in tongues, it feels as if one has lost one's memory. It is as if one is drunk and the person seems to be psychologically disturbed. This is because of the power of the influence of the Holy Spirit. Tongues are a special visitation symbolising a further special touch of the Holy Spirit. Source

In reality glossolalic speech is not a random and disorganized production of sounds. It has specific accents, intonations and word-like units that resemble the original language of the speaker. [source] That doesn't make it language though, even if the words leave the speaker feeling warm and happy.

What it actually is, at its core, is another tool the Evangelical machine has at its disposal to use the power of music and group suggestion to work people into a frenzy.

The tongue-speaker temporarily discards some of his or her ego functioning as it happens in such times as in sleep or in sexual intercourse. This phenomenon was also noticed in 2006 at the University of Pennsylvania, USA, by researchers under the direction of Andrew Newburg, MD who completed the world's first brain-scan study of a group of Pentecostal practitioners while they were speaking in tongues. The researchers noticed that when the participants were engaged in glossolalia, activity in the language centres of the brain actually decreased, while activity in the emotional centres of the brain increased. The fact that the researchers observed no changes in any language areas, led them to conclude that this phenomenon suggests that glossolalia is not associated with usual language function or usage.

Source

It's the power of suggestion. You are in a group of people and someone, maybe a plant, kicks it off. You are encouraged to join in and watch as your peers enthusiastically get involved. The experience has been explained as positive, so of course you remember it as a positive experience, even if in your core you understand that you weren't "channeling voices". You can intellectually know it is a fake and still feel moved by the experience.

LLMs

LLMs, large language models, which have been rebranded as AI, share a lot with that Evangelical tool. AI was classically understood to be a true artificial intelligence, a thinking machine that actually processed and understood your request. It was seen as the Holy Grail of computer science, the ability to take the best of human intellect and combine it into an eternal machine that could guide and help us. This definition has leaked out of the sphere of technology and now solidly lives on in Science Fiction, the talking robot who can help and assist the humans with whatever they've been tasked to do.

If that's the positive spin, then there has always been a counterargument. Known in shorthand as the "Chinese room argument", it says that a digital computer running code cannot have a mind, understanding or consciousness. You can create a very convincing fake though. The thought experiment is as follows:

You've made a computer that behaves as if it understands Chinese. It takes Chinese characters as input and returns Chinese characters as output. It does so at such a high level that it passes the Turing test: a native Chinese speaker believes the thing it is speaking to is a human being speaking Chinese. But the distinction is that the machine doesn't understand Chinese; it is simulating the idea of speaking Chinese.

Searle suggests that if you put him into a room with an English version of the program, he could receive the same characters through a slot in the door, process them according to the instructions and produce Chinese characters as output, without understanding anything that is being said. He would pass the same test, yet he still wouldn't speak or understand Chinese.

This topic has been discussed at length by experts, so if you are interested in the counterarguments I suggest the great site by Stanford: https://plato.stanford.edu/entries/chinese-room/

What Is An AI?

At a high level, the neural networks powering AI work like this:

The magic part of AI is the Transformer neural network, which uses self-attention to process not just the individual elements of the text but the way they relate to each other. It has been fed enough examples to be able to take an input and produce a likely output. Effectively, by ingesting every word spoken by humans on video or written on the internet, it can predict with some level of accuracy what a plausible collection of words to return for a given query looks like.

The end result is a box that consumes a frightening amount of electricity to mimic the responses of an all-knowing eternal AI. The believers can commune directly with this machine spirit, asking it questions and getting back blocks of text or pictures that sometimes look like something a human would make. You cannot question the model or work backwards to determine how specifically it came to the conclusion it did. There is an element of the divine in the process, a specter of magic.

Hallucination

As AI was introduced, we were cautioned to be on the lookout for "hallucinations", where the model produces convincing-looking fakes. The problem with this warning is that it both flies in the face of how humans work (we're not amazing at questioning authoritative-sounding answers) and suggests this is a temporary problem. But the more you read about AI, the more you realize it can never solve the hallucination problem, because it would need to understand what is "real" and what is "fake" in order to do that.

What do I mean on a practical level? I thought I would test out the Google Gemini AI with some easy questions. I had recently written a bash script to query GCP Cloud Deploy to get the status of deployments and write them to markdown. This should be an easy test of the system, right? It's Google's AI writing a simple script using the Google CLI to interact with the Google Cloud Platform. Closed ecosystem, high degree of confidence, definite sources of truth.

Can you write a bash script using the gcloud CLI to query the status of all deployments using cloud deploy, a GCP service, in the GCP project test123 and output the name and current name to a markdown file for each one?
#!/bin/bash

# Set project ID
PROJECT_ID="test123"

# Output directory for markdown files
OUTPUT_DIR="deployment_statuses"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Get all deployments
deployments=$(gcloud deploy deployments list --project "$PROJECT_ID" --format="value(NAME)")

# Loop through each deployment
for deployment in $deployments; do
  # Get deployment details
  deployment_details=$(gcloud deploy deployments describe "$deployment" --project "$PROJECT_ID" --format="yaml")

  # Parse name and state from YAML output
  name=$(echo "$deployment_details" | yq .name)
  state=$(echo "$deployment_details" | yq .status.state)

  # Create markdown file for deployment
  output_file="$OUTPUT_DIR/$name.md"
  echo "# Deployment: $name" > "$output_file"
  echo "Current State: $state" >> "$output_file"

  # Add additional information to markdown file as needed (e.g., rollout strategy, create time)
  # ...

done

echo "Deployment statuses written to markdown files in $OUTPUT_DIR"

Looks great, right? Exactly what I wanted. Except it doesn't work.

gcloud doesn't have a gcloud deploy deployments command.

NAME
    gcloud deploy - create and manage Cloud Deploy resources

SYNOPSIS
    gcloud deploy GROUP | COMMAND [GCLOUD_WIDE_FLAG ...]

DESCRIPTION
    Create and manage Cloud Deploy resources.

GCLOUD WIDE FLAGS
    These flags are available to all commands: --help.

    Run $ gcloud help for details.

GROUPS
    GROUP is one of the following:

     automation-runs
        Manages AutomationRuns resources for Cloud Deploy.

     automations
        Manages Automations resources for Cloud Deploy.

     custom-target-types
        Create and manage Custom Target Type resources for Cloud Deploy.

     delivery-pipelines
        Create and manage Delivery Pipeline resources for Cloud Deploy.

     job-runs
        Manages job runs resources for Cloud Deploy.

     releases
        Create and manage Release resources for Cloud Deploy.

     rollouts
        Create and manage Rollout resources for Cloud Deploy.

     targets
        Create and manage Target resources for Cloud Deploy.

Now I know this because I had already written a working version myself, but there's no way someone looking at this with no knowledge of the gcloud CLI would understand why it doesn't work.
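
For comparison, here is roughly the shape a working version has to take. Cloud Deploy doesn't have a "deployments" resource at all, so you start from delivery pipelines and work down to releases. Treat this as a sketch rather than a drop-in script: the region is a placeholder and the exact flags (--delivery-pipeline, the default ordering of releases list) are worth verifying against gcloud deploy releases list --help before trusting it.

#!/bin/bash

PROJECT_ID="test123"
REGION="us-central1"  # placeholder, Cloud Deploy resources are regional
OUTPUT_DIR="deployment_statuses"

mkdir -p "$OUTPUT_DIR"

# Cloud Deploy has no "deployments" resource, so list delivery pipelines instead
pipelines=$(gcloud deploy delivery-pipelines list \
  --project "$PROJECT_ID" --region "$REGION" --format="value(name)")

for pipeline in $pipelines; do
  # The API returns full resource names (projects/.../deliveryPipelines/NAME)
  short_name=$(basename "$pipeline")

  # Grab a release for this pipeline; add --sort-by if you need the newest one
  release=$(gcloud deploy releases list \
    --project "$PROJECT_ID" --region "$REGION" \
    --delivery-pipeline "$short_name" \
    --format="value(name)" --limit=1)

  output_file="$OUTPUT_DIR/$short_name.md"
  echo "# Delivery Pipeline: $short_name" > "$output_file"
  echo "Latest Release: $(basename "$release")" >> "$output_file"
done

echo "Pipeline summaries written to markdown files in $OUTPUT_DIR"

Even the working shape requires you to know that releases and rollouts hang off delivery pipelines, which is exactly the kind of structure the model happily papered over.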

I tried again, this time with a simpler question. Maybe not enough people use gcloud for it to have a big enough sample size to construct a real answer. Let's ask a more basic question, but let's keep it on Google technology. Kubernetes has been documented to death: books written about it, millions of articles and blog posts. Certainly that will work.

How do I, in hosted Kubernetes on GCP, write a deployment configuration yaml which sets nodeAffinity using preferredDuringSchedulingIgnoredDuringExecution. The nodepool should be nodes02 with a weight of one.

What I'm trying to prompt it for here is a configuration file that uses node affinity rather than nodeSelector, the softer preference level for assigning pods to nodes. But there's a small trick in the question: I want a Deployment, not a Pod spec. This is a distinction that requires a more nuanced comprehension of the subject matter, the ability to pull not the most common example but the more specific one. What we want is this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3
  template: 
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: cloud.google.com/gke-nodepool
                operator: In
                values:
                - nodes02
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80

What I got back was this:

  spec:
    nodeSelector: 
      cloud.google.com/gke-nodepool: nodes02
    affinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        matchExpressions:
        - key: cloud.google.com/gke-nodepool
          operator: In
          values:
          - nodes02

The problem with this response is that it does something very different from what I asked for. nodeSelector is the stricter approach, ensuring pods are only scheduled on nodes that match the label. nodeAffinity is a much softer preference, telling k8s I'd like the pods to go there if possible, but if that isn't possible, to put them wherever it normally would. (The returned snippet isn't even a valid affinity block: preferredDuringSchedulingIgnoredDuringExecution has to live under nodeAffinity, and the matchExpressions belong under a preference key.)
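
If you want to see the difference for yourself, the quickest check is to apply the Deployment and look at where the pods actually land. A minimal sketch, assuming the corrected manifest above is saved as nginx-deployment.yaml and you're on GKE, where nodes carry the cloud.google.com/gke-nodepool label:

# Apply the Deployment with the preferred node affinity
kubectl apply -f nginx-deployment.yaml

# List nodes with their nodepool label so you know what's available
kubectl get nodes -L cloud.google.com/gke-nodepool

# See which node each nginx pod was scheduled onto
kubectl get pods -l app=nginx -o wide

With preferredDuringSchedulingIgnoredDuringExecution the pods should land on nodes02 when it has room but will still schedule elsewhere if it doesn't; with a strict nodeSelector they would sit in Pending instead.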

Both of these responses look reasonable at a glance. The machine responded with something that could be construed as the answer, a clever parody of human intelligence, but ultimately it is more like a child playing. It doesn't understand the question, but it understands how to construct convincing-looking fakes.

To the faithful though, this isn't a problem.

However, if the training data is incomplete or biased, the AI model may learn incorrect patterns. This can lead to the AI model making incorrect predictions, or hallucinating.
For example, an AI model that is trained on a dataset of medical images may learn to identify cancer cells. However, if the dataset does not include any images of healthy tissue, the AI model may incorrectly predict that healthy tissue is cancerous. This is an example of an AI hallucination.

[Source]

This creates a false belief that the problem lies with the training data, which for both of my examples simply cannot be true. Google controls both ends of that equation and can very confidently "ground" the model with verifiable sources of information; in theory this should tether the output and reduce the chances of inventing content. It reeks of a religious leader claiming that while that prophecy was false, the next one will be real if you believe hard enough. It also moves the responsibility for the problem from the AI model to "the training data", which for these LLMs is a black box of information. I don't know what the training data is, so I can't question whether it's good or bad.

Is There Anything Here?

Now that isn't to say there isn't amazing work happening here. LLMs can do some fascinating things, and the transformer work has the promise to change how we let people interact with computers. Instead of an HTML form with strict validation and obtuse error messages, we can explain to people in real time what is happening, how to fix problems, and in general be more flexible when dealing with human input. We can have a machine look at messy, loosely structured data and find patterns; there are lots of ways for this tech to make a meaningful difference in human life.

It just doesn't have a trillion dollars worth of value. This isn't a magic machine that will replace all human workers, which for some modern executives is the Holy Grail of human progress, the equivalent of being able to talk directly to God. Finally all the money can flow directly to the CEO himself, cutting out all those annoying middle steps. The demand from investors for these companies to produce something new has outstripped their ability to do so, resulting in a dangerous technology being unleashed on the world with no safeties. We've made a lying machine that doesn't show its work, making it even harder for people to tell truth from fiction.

If LLMs are ever going to turn into actual AI, we're still years and years away from that happening. What we have now is an interesting trick, a feel-good exercise that, unless you look too closely, makes it seem like you are talking to an immortal, all-knowing being that lives in the clouds. But just like everything else, if your faith is shaken for even a moment the illusion collapses.

Questions/comments/concerns: https://c.im/@matdevdug