
How Mobile Networks Work

I've spent a fair amount of time around networking. I've worked for a small ISP, helped set up campus and office networks, and even done a fair amount of BGP work assisting with ISP failover and routing. However, in my current role I've been doing a lot of mobile network diagnostics and troubleshooting, which made me realize I don't actually know anything about how mobile networks operate. So I figured it was a good idea for me to learn more and write up what I find.

It's interesting that cellular internet has either become, or will without a doubt become, the default Internet for most humans alive, yet almost no developers I know have any idea how it works (myself included until recently). As I hope to demonstrate below, untold amounts of amazing work have been applied to this problem over decades, producing incredible results. As it turns out, the network engineers working on cellular were doing nuclear physics while I was hot-gluing stuff together.

I am not an expert. I will update this as I get better information, but use this as a reference for stuff to look up, not a bible. It is my hope, over many revisions, to turn this into an easier-to-read PDF that folks can download. However I want to get it out in front of people to help find mistakes.

TL/DR: There is a shocking, eye-watering amount of complexity when it comes to cellular data as compared to a home or datacenter network connection. I could spend the next six months of my life reading about this and feel like I barely scratched the surface. However I'm hoping that I have provided some basic-level information about how this magic all works.

Corrections/Requests: https://c.im/@matdevdug. I know I didn't get it all right, I promise I won't be offended.

Basics

A modern cellular network, at its core, is composed of three basic elements:

  • the RAN (radio access network)
  • CN (core network)
  • Services network

RAN

The RAN contains the base stations that communicate with phones over radio signals. When we picture a cell tower, we are picturing part of the RAN. Most of what a cellular network provides in terms of services actually lives in the CN: user authorization, enabling or disabling services for a subscriber, and all the background machinery for the transfer and hand-off of user traffic. For most users today, think SMS and phone calls.

Key Components of the RAN:

  1. Base Transceiver Station (BTS): The BTS is a radio transmitter/receiver that communicates with your phone over the air interface.
  2. Node B (Evolved Node B/eNodeB in 4G, gNodeB in 5G): The 3G name for the base station. A Node B typically serves multiple cells and, in 3G, is managed by the RNC; in 4G and 5G the eNodeB/gNodeB absorbs most of the RNC's functions.
  3. Radio Network Controller (RNC): The RNC is responsible for managing the radio links between phones and the Node Bs it controls (3G/UMTS).
  4. Base Station Subsystem (BSS): The BSS is a term used in older (2G/GSM) networks, referring to the combination of the BTS and the Base Station Controller (BSC).
Link: https://www.cisco.com/c/en/us/products/collateral/wireless/nb-06-radio-access-networks-cte-en.html

Startup

  1. Cell Search and Network Acquisition. The device powers on and begins searching for available cells by scanning the frequencies of surrounding base stations (e.g., eNodeB for LTE, gNodeB for 5G).
┌──────────────┐             ┌──────────────┐
│  Base Station│             │   Mobile     │
│              │             │   Device     │
│   Broadcast  │             │              │
│  ──────────> │ Search for  │ <──────────  │
│              │ Sync Signals│ Synchronizes │
│              │             │              │
└──────────────┘             └──────────────┘

- Device listens for synchronization signals.
- Identifies the best base station for connection.
  2. Random Access. After identifying the cell to connect to, the device sends a random access request to establish initial communication with the base station. This is often called RACH (the Random Access Channel procedure). If you want to read about it, I found an incredible amount of detail here: https://www.sharetechnote.com/html/RACH_LTE.html
┌──────────────┐             ┌──────────────┐
│  Base Station│             │   Mobile     │
│              │             │   Device     │
│  Random Access Response    │              │
│  <────────── │ ──────────> │ Random Access│
│              │             │ Request      │
└──────────────┘             └──────────────┘

- Device sends a Random Access Preamble.
- Base station responds with timing and resource allocation.
  3. Dedicated Radio Connection Setup (RRC Setup). The base station allocates resources for the device to establish a dedicated radio connection using the Radio Resource Control (RRC) protocol.
┌──────────────┐             ┌──────────────┐
│  Base Station│             │   Mobile     │
│              │             │   Device     │
│  RRC Setup   │             │              │
│  ──────────> │ Send RRC    │              │
│              │ Request     │ <──────────  │
│              │             │ RRC Response │
└──────────────┘             └──────────────┘

- Device requests RRC connection.
- Base station assigns resources and confirms.
  4. Device-to-Core Network Communication (Authentication, Security, etc.). Once the RRC connection is established, the device communicates with the core network (e.g., EPC in LTE, 5GC in 5G) for authentication, security setup, and session establishment.
┌──────────────┐               ┌──────────────┐
│  Base Station│               │   Mobile     │
│  ──────────> │ Forward       │              │
│              │ Authentication Data          │
│              │ <──────────   │Authentication│
│              │               │ Request      │
│              │               │              │
└──────────────┘               └──────────────┘

- Device exchanges authentication and security data with the core network.
- Secure communication is established.
  5. Data Transfer (Downlink and Uplink). After setup, the device starts sending (uplink) and receiving (downlink) data using the established radio connection.
┌──────────────┐             ┌──────────────┐
│  Base Station│             │   Mobile     │
│  ──────────> │ Data        │              │
│  Downlink    │             │  <─────────  │
│  <────────── │ Data Uplink │ ──────────>  │
│              │             │              │
└──────────────┘             └──────────────┘

- Data is transmitted between the base station and the device.
- Downlink (BS to Device) and Uplink (Device to BS) transmissions.
  6. Handover. If the device moves out of range of the current base station, a handover is initiated to transfer the connection to a new base station without interrupting the service. (A toy sketch of the whole startup sequence follows.)
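
To keep the order of operations straight, here is a minimal, purely illustrative Python sketch of the startup sequence above as a state machine. The state names and descriptions are my own shorthand, not anything from a real protocol stack.

from enum import Enum, auto

class UEState(Enum):
    OFF = auto()
    SEARCHING = auto()
    RACH = auto()
    RRC_CONNECTED = auto()
    REGISTERED = auto()
    DATA = auto()

def attach_sequence():
    """Walk a hypothetical device (UE) through the startup steps described above."""
    steps = [
        (UEState.SEARCHING,     "scan frequencies, lock onto sync signals from the best cell"),
        (UEState.RACH,          "send Random Access Preamble, receive timing + resource grant"),
        (UEState.RRC_CONNECTED, "exchange RRC Setup Request/Response with the base station"),
        (UEState.REGISTERED,    "authenticate with the core network (EPC/5GC), activate security"),
        (UEState.DATA,          "bearers and an IP address assigned, uplink/downlink data flows"),
    ]
    state = UEState.OFF
    for new_state, description in steps:
        print(f"{state.name:>13} -> {new_state.name:<13} {description}")
        state = new_state

attach_sequence()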

Signaling

As shown in the diagram above, there are a lot of references to something called "signaling". Signaling seems to be a shorthand for handling a lot of configuration and hand-off between tower and device and the core network. As far as I can tell they can be broken into 3 types.

  1. Access Stratum Signaling
    1. Set of protocols to manage the radio link between your phone and cellular network.
    2. Handles authentication and encryption
    3. Radio bearer establishment (setting up a dedicated channel for data transfer)
    4. Mobility management (handovers, etc)
    5. Quality of Service control.
  2. Non-Access Stratum (NAS) Signaling
    1. Set of protocols used to manage the interaction between your phone and the cellular network's core infrastructure.
    2. It handles tasks such as authentication, billing, and location services.
    3. Authentication with the Home Location Register (HLR)
    4. Roaming management
    5. Charging and billing
    6. IMSI Attach/ Detach procedure
  3. Lower Layer Signaling on the Air Interface
    1. This refers to the control signaling that occurs between your phone and the cellular network's base station at the physical or data link layer.
    2. It ensures reliable communication over the air interface, error detection and correction, and efficient use of resources (e.g., allocating radio bandwidth).
    3. Modulation and demodulation control
    4. Error detection and correction using CRCs (Cyclic Redundancy Checks)

High Level Overview of Signaling

  1. You turn on your phone (AS signaling starts).
  2. Your phone sends an Initial Direct Transfer (IDT) message to establish a radio connection with the base station (lower layer signaling takes over).
  3. The base station authenticates your phone using NAS signaling, contacting the HLR for authentication.
  4. Once authenticated, lower layer signaling continues to manage data transfer between your phone and the base station.

What is HLR?

The Home Location Register contains the subscriber data for a network: the IMSI, phone number, and service information. It is also what keeps track of where in the world the subscriber physically is.

Duplexing

You have a lot of devices and you have a few towers. You need to do many uplinks and downlinks to many devices.

It is important that in any cellular communications system you can send and receive in both directions at the same time. This enables conversations, with either end able to talk and listen as required. To transmit in both directions, a device (UE) and base station must agree on a duplex scheme. There are several, including Frequency Division Duplex (FDD), Time Division Duplex (TDD), semi-static TDD and dynamic TDD.

Duplexing Types:

  1. Frequency Division Duplex (FDD): Uses separate frequency bands for downlink and uplink signals.
    1. Downlink: The mobile device receives data from the base station on a specific frequency (F1).
    2. Uplink: The mobile device sends data to the base station on a different frequency (F2).
    3. Key Principle: Separate frequencies for uplink and downlink enable simultaneous transmission and reception.
┌──────────────┐              ┌──────────────┐
│  Base Station│              │   Mobile     │
│              │              │   Device     │
│  ──────────> │ F1 (Downlink)│ <──────────  │
│              │              │              │
│  <────────── │ F2 (Uplink)  │ ──────────>  │
└──────────────┘              └──────────────┘

Separate frequency bands (F1 and F2)
  2. Time Division Duplex (TDD): Alternates between downlink and uplink signals over the same frequency band.
    1. Downlink: The base station sends data to the mobile device in a time slot.
    2. Uplink: The mobile device sends data to the base station in a different time slot using the same frequency.
    3. Key Principle: The same frequency is used for both uplink and downlink, but at different times.
 ┌──────────────┐                     ┌──────────────┐
 │  Base Station│                     │  Mobile Phone│
 │ (eNodeB/gNB) │                     │              │
 └──────────────┘                     └──────────────┘

     ───────────►  Time Slot 1 (Downlink)
                 (Base station sends data)

     ◄───────────  Time Slot 2 (Uplink)
                 (Mobile sends data)
     
     ───────────►  Time Slot 3 (Downlink)
                 (Base station sends data)
                 
     ◄───────────  Time Slot 4 (Uplink)
                 (Mobile sends data)
 
    - The same frequency is used for both directions.
    - Communication alternates between downlink and uplink in predefined time slots.

3. Semi-static Time Division Duplex (Semi-static TDD), which relies on a fixed frame design:

    1. Downlink/Uplink: There are predetermined time slots for uplink and downlink, but they can be changed periodically (e.g., minutes, hours).
    2. Key Principle: Time slots are allocated statically for longer durations but can be switched based on network traffic patterns (e.g., heavier downlink traffic during peak hours).
    3. A frame typically lasts 10 ms and is divided into time slots for downlink (DL) and uplink (UL).
    4. "Guard" time slots are used to allow switching between transmission and reception.


4. Dynamic Time Division Duplex (Dynamic TDD):

    1. Downlink/Uplink: Time slots for uplink and downlink are dynamically adjusted in real time based on instantaneous traffic demands.
    2. Key Principle: Uplink and downlink time slots are flexible and can vary dynamically to optimize the usage of the available spectrum in real-time, depending on the traffic load.
    3. See the second diagram below for what "guard periods" are: basically short windows that create gaps so the downlink and uplink signals don't overlap.
 ┌──────────────┐                     ┌──────────────┐
 │  Base Station│                     │  Mobile Phone│
 │ (eNodeB/gNB) │                     │              │
 └──────────────┘                     └──────────────┘

     ───────────►  Time Slot 1 (Downlink)
     ───────────►  Time Slot 2 (Downlink)
     ───────────►  Time Slot 3 (Downlink)
     ◄───────────  Time Slot 4 (Uplink)
     ───────────►  Time Slot 5 (Downlink)
     
     ◄───────────  Time Slot 6 (Uplink)
     
    - More slots for downlink in scenarios with high download traffic (e.g., streaming video).
    - Dynamic slot assignment can change depending on the real-time demand.
 ┌──────────────┐                     ┌──────────────┐
 │  Base Station│                     │  Mobile Phone│
 │ (eNodeB/gNB) │                     │              │
 └──────────────┘                     └──────────────┘

     ───────────►  Time Slot 1 (Downlink)
     ───────────►  Time Slot 2 (Downlink)
     [Guard Period]                          (Switch from downlink to uplink)
     ◄───────────  Time Slot 3 (Uplink)
     [Guard Period]                          (Switch from uplink to downlink)
     ───────────►  Time Slot 4 (Downlink)
     
    - Guard periods allow safe switching from one direction to another.
    - Guard periods prevent signals from overlapping and causing interference (a toy sketch of slot assignment with guards follows).
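
Here is a toy Python sketch of the TDD ideas above: one shared frequency, a frame divided into slots, each slot assigned to downlink or uplink, and a guard period inserted whenever the direction flips. The slot pattern itself is made up for illustration.

# Toy TDD frame: one frequency, slots assigned to DL or UL, guards on direction switches.
FRAME_SLOTS = ["DL", "DL", "DL", "UL", "DL", "UL"]  # a made-up (semi-static) pattern

def with_guards(slots):
    """Insert a guard period whenever the link direction flips."""
    out, prev = [], None
    for direction in slots:
        if prev is not None and direction != prev:
            out.append("GUARD")
        out.append(direction)
        prev = direction
    return out

print(" | ".join(with_guards(FRAME_SLOTS)))
# DL | DL | DL | GUARD | UL | GUARD | DL | GUARD | UL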

Core

So I've written a lot about what the RAN does, but we haven't really touched on what the core network does. Basically, once the device registers with the base station using the random access procedure discussed above, the core network can do a bunch of the stuff we typically associate with "having a cellular plan".

For modern devices, when we say authentication we mean "mutual authentication": the device authenticates the network and the network authenticates the device. Typically the network sends a random challenge, and the device combines a subscriber-specific secret key with that random number to generate a response. The network also sends an authentication token, and the device compares it with the token it expects in order to authenticate the network. It looks like the following:

┌───────────────────────┐
│    Encryption &       │
│  Integrity Algorithms │
├───────────────────────┤
│  - AES (Encryption)   │
│  - SNOW 3G (Encrypt.) │
│  - ZUC (Encryption)   │
│  - SHA-256 (Integrity)│
└───────────────────────┘

- AES: Strong encryption algorithm commonly used in LTE/5G.
- SNOW 3G: Stream cipher used for encryption in mobile communications.
- ZUC: Encryption algorithm used in 5G.
- SHA-256: Integrity algorithm ensuring data integrity.

The steps of the core network are as follows:

  • Registration (also called attach procedure): The device connects to the core network (e.g., EPC in LTE or 5GC in 5G) to register and declare its presence. This involves the device identifying itself and the network confirming its identity.
  • Mutual Authentication: The network and device authenticate each other to ensure a secure connection. The device verifies the network’s authenticity, and the network confirms the device’s identity (a toy sketch of this exchange follows the list).
  • Security Activation: After successful authentication, the network and the device establish a secure channel using encryption and integrity protection to ensure data confidentiality and integrity.
  • Session Setup and IP Address Allocation: The device establishes a data session with the core network, which includes setting up bearers (logical paths for data) and assigning an IP address to enable internet connectivity.
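
As a toy illustration of the challenge-response idea in the mutual authentication step (not the real 3GPP AKA/MILENAGE algorithms), here is a hypothetical Python sketch where both sides hold the same subscriber secret, the network issues a random challenge, and each side verifies the value the other computed.

import hmac, hashlib, os

SUBSCRIBER_KEY = os.urandom(16)  # shared secret: stored on the SIM and in the operator's database

def mac(key, *parts):
    return hmac.new(key, b"|".join(parts), hashlib.sha256).digest()

# Network side: generate a challenge, the response it expects, and a token proving its own identity.
rand = os.urandom(16)
expected_res = mac(SUBSCRIBER_KEY, b"RES", rand)
network_token = mac(SUBSCRIBER_KEY, b"AUTN", rand)

# Device side: compute its response and the token it expects the network to send.
device_res = mac(SUBSCRIBER_KEY, b"RES", rand)
expected_token = mac(SUBSCRIBER_KEY, b"AUTN", rand)

assert hmac.compare_digest(device_res, expected_res)        # network authenticates the device
assert hmac.compare_digest(network_token, expected_token)   # device authenticates the network
print("mutual authentication succeeded (toy example)")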

How Data Gets To Phone

Alright we've talked about how the phone finds a tower to talk to, how the tower knows who the phone is and all the millions of steps involved in getting the mobile phone an actual honest-to-god IP address. How is data actually getting to the phone itself?

  1. Configuration for Downlink Measurement: Before downlink data transmission can occur, the mobile device (UE) must be configured to perform downlink measurements. This helps the network optimize transmission based on the channel conditions. Configuration messages are sent from the base station (eNodeB in LTE or gNB in 5G) to instruct the UE to measure certain DL reference signals.
  2. Reference Signal (Downlink Measurements): The mobile device receives reference signals from the network. These reference signals are used by the UE to estimate DL channel conditions. In LTE, Cell-specific Reference Signals (CRS) are used, and in 5G, Channel State Information-Reference Signals (CSI-RS) are used.
  3. DL Channel Conditions (CQI, PMI, RI): The mobile device processes the reference signals to assess the downlink channel conditions and generates reports such as CQI (Channel Quality Indicator), PMI (Precoding Matrix Indicator), and RI (Rank Indicator). These reports are sent back to the base station.
  4. DL Resource Allocation and Packet Transmission: Based on the UE’s channel reports (CQI, PMI, RI), the base station allocates appropriate downlink resources. It determines the modulation scheme, coding rate, MIMO layers, and frequency resources (PRBs) and sends a DL scheduling grant to the UE. The data packets are then transmitted over the downlink.
  5. Positive/Negative Acknowledgement (HARQ Feedback): After the UE receives the downlink data, it checks the integrity of the packets using CRC (Cyclic Redundancy Check). If the CRC passes, the UE sends a positive acknowledgement (ACK) back to the network. If the CRC fails, a negative acknowledgement (NACK) is sent, indicating that retransmission is needed.
  6. New Transmission or Retransmission (HARQ Process): If the network receives a NACK, it retransmits the packet using the HARQ process. The retransmission is often incremental (IR-HARQ), meaning the device combines the new transmission with previously received data to improve decoding.

Uplink is a little different but is basically the device asking for a timeslot to upload, getting a grant, sending the data up and then getting an ack that it is sent.
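
To make steps 5 and 6 concrete, here is a minimal Python sketch of the HARQ ACK/NACK loop, assuming a made-up channel that randomly corrupts packets. Real IR-HARQ combines retransmissions at the soft-bit level, which this toy version only hints at.

import random, zlib

def send_with_harq(payload: bytes, loss_rate: float = 0.3, max_attempts: int = 4) -> bool:
    """Pretend to transmit a packet, returning True once the receiver's CRC check passes."""
    sent_crc = zlib.crc32(payload)
    for attempt in range(1, max_attempts + 1):
        corrupted = random.random() < loss_rate          # the pretend radio channel damages some packets
        received_crc = sent_crc ^ 0xDEADBEEF if corrupted else sent_crc
        if received_crc == zlib.crc32(payload):
            print(f"attempt {attempt}: CRC ok -> ACK")
            return True
        print(f"attempt {attempt}: CRC failed -> NACK, retransmitting")
    return False

random.seed(1)
send_with_harq(b"downlink packet")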

Gs

So as everyone knows cellular networks have gone through a series of revisions over the years around the world. I'm going to talk about them and just try to walk through how they are different and what they mean.

1G

  • Starts in Japan, moves to Europe and then the US and UK.
  • Speeds up to 2.4kbps, using analog radio in bands in the hundreds of MHz (roughly 450-900MHz depending on the system).
  • Didn't work between countries, had low capacity, unreliable handoff and no security. Basically any receiver can listen to a conversation.

2G

  • Launched in 1991 in Finland
  • Allows for text messages, picture messages and MMS.
  • Speeds up to 14.4kbps between 900MHz and 1800MHz bands
  • Actual security between sender and receiver with messages digitally encrypted.

Wait, are text messages encrypted?

So this was completely new to me but I guess my old Nokia brick had some encryption on it. Here's how that process worked:

  1. Mobile device stores a secret key in the SIM card and the network generates a random challenge and sends it to the mobile device.
  2. The A3 algorithm is used to compute a Signed Response (SRES) using the secret key and the random value.
  3. Then the A8 algorithm is used with secret and the random value to generate a session encryption key Kc (64-bit key). This key will be used for encrypting data, including SMS.
  4. After the authentication process and key generation, encryption of SMS messages begins. GSM uses a stream cipher to encrypt both voice and data traffic, including text messages. The encryption algorithm used for SMS is either A5/1 or A5/2, depending on the region and network configuration.
    1. A5/1: A stronger encryption algorithm used in Europe and other regions.
    2. A5/2: A weaker variant used in some regions, but deprecated due to its vulnerabilities.
    3. The A5 algorithm generates a keystream that is XORed with the plaintext message (SMS) to produce the ciphertext, ensuring the confidentiality of the message.

So basically text messages from the phone to the base station were encrypted and then exposed there. However I honestly didn't even know that was happening.
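
Here is a toy Python sketch of the stream-cipher idea (a keystream XORed with the plaintext). The real A5/1 keystream comes from three clocked shift registers seeded with Kc and the frame number; this sketch swaps in a hypothetical hash-based keystream purely for illustration.

import hashlib
from itertools import count

def keystream(kc: bytes, frame_number: int):
    """Hypothetical stand-in for A5: an endless stream of pseudo-random bytes from Kc + frame number."""
    for block in count():
        yield from hashlib.sha256(kc + frame_number.to_bytes(4, "big") + block.to_bytes(4, "big")).digest()

def xor_cipher(data: bytes, kc: bytes, frame_number: int) -> bytes:
    ks = keystream(kc, frame_number)
    return bytes(b ^ next(ks) for b in data)

kc = b"\x01" * 8                      # the 64-bit session key Kc produced by the A8 step
sms = b"Running late, be there at 5"
ciphertext = xor_cipher(sms, kc, frame_number=42)
assert xor_cipher(ciphertext, kc, frame_number=42) == sms  # XOR with the same keystream decrypts
print(ciphertext.hex())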

TDMA and CDMA

I remember a lot of conversations about GSM vs CDMA when you were talking about cellular networks but at the time all I really knew was "GSM is European and CDMA is US".

  • GSM uses TDMA (Time Division Multiple Access), giving each user time slots on a shared channel
  • CDMA allocates each user a special code to communicate over multiple physical channels
  • GSM is where we see services like voice mail, SMS, call waiting

EDGE

So everyone who is old like me remembers EDGE on cellphones, including the original iPhone I waited in line for. EDGE was effectively a retrofit you could put on top of an existing GSM network, keeping the cost of adding it low. You got speeds of 9.6-200kbps.

3G

  • Welcome to the year 2000
  • Frequency spectrum of 3G transmissions is 1900-2025MHz and 2110-2200MHz.
  • UMTS takes over for GSM and CDMA2000 takes over from CDMA.
  • Maxes out around 8-10Mbps
  • IMT-2000 = 3G

So let's just recap quickly how we got here.

  • 2G (GSM): Initially focused on voice communication and slow data services (up to 9.6 kbps using Circuit Switched Data).
  • 2.5G (GPRS): Introduced packet-switched data with rates of 40-50 kbps. It allowed more efficient use of radio resources for data services.
  • 2.75G (EDGE): Enhanced the data rate by improving modulation techniques (8PSK). This increased data rates to around 384 kbps, making it more suitable for early mobile internet usage.

EDGE introduced 8-PSK (8-Phase Shift Keying) modulation, which allowed the encoding of 3 bits per symbol (as opposed to 1 bit per symbol with the original GSM’s GMSK (Gaussian Minimum Shift Keying) modulation). This increased spectral efficiency and data throughput.

EDGE had really high latency so it wasn't really usable for things like video streaming or online gaming.

  • 3G (WCDMA): Max data rate: 2 Mbps (with improvements over EDGE in practice). Introduced spread-spectrum (CDMA) technology with QPSK modulation.
  • 3.5G (HSDPA): Enhanced WCDMA by introducing adaptive modulation (AMC), HARQ, and NodeB-based scheduling. Max data rate: 14.4 Mbps (downlink).

So when we say 3G we actually mean a pretty wide range of technologies all underneath the same umbrella.

4G

  • 4G, or LTE as it is usually called, evolved from WCDMA. Instead of developing entirely new radio interfaces and technology in isolation, existing and newly developed wireless systems like GPRS, EDGE, Bluetooth, WLAN and HiperLAN were integrated together
  • In practice 4G delivers download speeds of around 67.65Mbps and upload speeds of around 29.37Mbps (theoretical LTE peaks are considerably higher)
  • One common 4G band (band 7) uses 2500-2570MHz for uplink and 2620-2690MHz for downlink, with channel bandwidths of 1.25-20MHz
  • 4G has a few key technologies, mainly OFDM, SDR and Multiple-Input Multiple-Output (MIMO). A toy OFDM sketch follows this list.
    • OFDM (Orthogonal Frequency Division Multiplexing)
      • Allows for more efficient use of the available bandwidth by breaking down data into smaller pieces and sending them simultaneously
      • Since each channel uses a different frequency, if one channel experiences interference or errors, the others remain unaffected.
      • OFDM can adapt to changing network conditions by dynamically adjusting the power levels and frequencies used for each channel.
    • SDR (Software Defined Radio)
      • Like it sounds, it is a technology that enables flexible and efficient implementation of wireless communication systems by using software algorithms to control and process radio signals in real-time. In cellular 4G, SDR is used to improve performance, reduce costs, and enable advanced features like multi-band support and spectrum flexibility.
    • MIMO (multiple-input multiple-output)
      • A technology used in cellular 4G to improve the performance and capacity of wireless networks. It allows for the simultaneous transmission and reception of multiple data streams over the same frequency band, using multiple antennas at both the base station and mobile device.
      • Works by having both the base station and the mobile device equipped with multiple antennas
      • Each antenna transmits and receives a separate data stream, allowing for multiple streams to be transmitted over the same frequency band
      • There is Spatial Multiplexing where multiple data streams are transmitted over the same frequency band using different antennas. Then Beamforming where advanced signal processing techniques to direct the transmitted beams towards specific users, improving signal quality and reducing interference. Finally Massive MIMO where you use a lot of antennas (64 or more) to improve capacity and performance.
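
As a rough sketch of the OFDM idea above (assuming numpy is available): a block of modulation symbols is placed on orthogonal subcarriers and a single inverse FFT turns them into one time-domain OFDM symbol, which the receiver undoes with an FFT. Real LTE adds cyclic prefixes, pilot signals, and a great deal more.

import numpy as np

rng = np.random.default_rng(0)
n_subcarriers = 64

# Random QPSK symbols, one per subcarrier (a stand-in for the user's data bits).
qpsk = (rng.choice([-1, 1], n_subcarriers) + 1j * rng.choice([-1, 1], n_subcarriers)) / np.sqrt(2)

time_domain = np.fft.ifft(qpsk)      # transmitter: all subcarriers sent at once as one OFDM symbol
recovered = np.fft.fft(time_domain)  # receiver: an FFT separates the subcarriers again

assert np.allclose(recovered, qpsk)
print(f"recovered {n_subcarriers} subcarriers with no interference between them")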

5G

  • The International Telecommunication Union (ITU) defines 5G as a wireless communication system that supports speeds of at least 20 Gbps (gigabits per second), with ultra-low latency of less than 1 ms (millisecond).
  • 5G operates on a much broader range of frequency bands than 4G
    • Low-band frequencies: These frequencies are typically below 3 GHz and are used for coverage in rural areas or indoor environments. Examples include the 600 MHz, 700 MHz, and 850 MHz bands.
    • Mid-band frequencies: These frequencies range from approximately 3-10 GHz and are used for both coverage and capacity in urban areas. Examples include the 3.5 GHz, 4.5 GHz, and 6 GHz bands.
    • High-band frequencies: These frequencies range from approximately 10-90 GHz and are used primarily for high-speed data transfer in dense urban environments. Examples include the 28 GHz, 39 GHz, and 73 GHz bands.
  • 5G network designs are a step up in complexity from their 4G predecessors: the control plane and the user plane are split, with each function implemented as a separate network function (the service-based architecture), whereas the 4G core bundles these roles into fewer, more monolithic nodes.
  • 5G uses advanced modulation schemes such as 256-Quadrature Amplitude Modulation (QAM) to achieve higher data transfer rates than 4G, which typically uses 64-QAM or 16-QAM
  • All the MIMO stuff discussed above.

What the hell is Quadrature Amplitude Modulation?

I know, it sounds like a Star Trek thing. It is a way to send digital information over a communication channel, like a wireless network or cable. It's a method of "modulating" the signal, which means changing its characteristics in a way that allows us to transmit data.

When we say 256-QAM, it refers to the specific type of modulation being used. Here's what it means:

  • Quadrature: This refers to the fact that the signal is built from two components that are 90 degrees out of phase with each other (the "quadratures"). Think of it like a coordinate system with x and y axes.
  • Amplitude Modulation: We change the amplitude of each of those two components to represent digital information, which ends up varying both the amplitude and the phase of the combined signal.
  • 256: This refers to the number of possible states, or constellation points, the signal can take on. Since 256 = 2^8, each symbol carries 8 bits.

Why does 5G want this?

  • More information per symbol: With 256-QAM, each "symbol" (or signal change) can represent one of 256 different values. This means we can pack more data into the same amount of time.
  • Faster transmission speeds: As a result, we can transmit data at higher speeds without compromising quality (a quick bits-per-symbol calculation follows).
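
A quick back-of-the-envelope calculation of why higher-order modulation matters: the bits carried per symbol is log2 of the constellation size, so 256-QAM carries 8 bits per symbol versus 6 for 64-QAM and 1 for GSM's GMSK (ignoring the much better signal-to-noise ratio the higher orders require).

import math

schemes = {
    "GMSK (GSM)":   2,
    "8-PSK (EDGE)": 8,
    "QPSK":         4,
    "16-QAM":       16,
    "64-QAM":       64,
    "256-QAM":      256,
}

for name, points in schemes.items():
    bits = int(math.log2(points))
    print(f"{name:>13}: {points:>3} constellation points = {bits} bits per symbol")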

Kubernetes and 5G

Kubernetes is a popular technology in 5G and is used for a number of functions, including the following:

  • Virtual Network Functions (VNFs): VNFs are software-based implementations of traditional network functions, such as firewalls or packet filters. Kubernetes is used to deploy and manage these VNFs.
  • Cloud-Native Network Functions (CNFs): CNFs are cloud-native applications that provide network function capabilities, such as traffic management or security filtering. Kubernetes is used to deploy and manage these CNFs.
  • Network Function Virtualization (NFV) Infrastructure: NFV infrastructure provides the underlying hardware and software resources for running VNFs and CNFs. Kubernetes is used to orchestrate and manage this infrastructure.

Conclusion

So one of the common sources of frustration for developers I've worked with when debugging cellular network problems is that often while there is plenty of bandwidth for what they are trying to do, the latency involved can be quite variable. If you look at all the complexity behind the scenes and then factor in that the network radio on the actual cellular device is constantly flipping between an Active and Idle state in an attempt to save battery life, this suddenly makes sense.

All of the complexity I'm talking about ultimately gets you back to the same TCP stack we've been using for years, with all the overhead involved in that back and forth; we still end up with a SYN -> SYN-ACK. There are ways to shorten this somewhat (TCP Fast Open, tuning the initial congestion window), but you are mostly dealing with the same level of overhead you always dealt with.
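
For example, on Linux you can experiment with TCP Fast Open from Python. This is a minimal client-side sketch, assuming a kernel with net.ipv4.tcp_fastopen enabled and a server that supports TFO; the request data rides along with the SYN instead of waiting for the handshake to complete.

import socket

# MSG_FASTOPEN is Linux-only; fall back to the raw flag value if the constant isn't exposed.
MSG_FASTOPEN = getattr(socket, "MSG_FASTOPEN", 0x20000000)
request = b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # sendto() with MSG_FASTOPEN sends the data in the SYN (or falls back to a normal handshake).
    sock.sendto(request, MSG_FASTOPEN, ("example.com", 80))
    print(sock.recv(4096)[:120])
finally:
    sock.close()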

Ultimately there isn't much you can do with this information, as developers have almost no control over the elements present here. However I think it's useful as cellular networks continue to become the dominant default Internet for the Earth's population that more folks understand the pieces happening in the background of this stack.


Replace Docker Compose with Quadlet for Servers

So for years I've used Docker Compose as my stepping stone to k8s. If the project is small, or mostly for my own consumption, OR if the business requirements don't really support the complexity of k8s, I use Compose. It's simple to manage with bash scripts for deployments and not hard to set up on fresh servers with cloud-init, and the process of removing a server from a load balancer, pulling the new container, then adding it back in has been bulletproof for teams with limited headcount or services where uptime is less critical than cost control and ease of long-term maintenance. You avoid almost all of the complexity of really "running" a server while being able to scale up to about 20 VMs while still having a reasonable deployment time.

What are you talking about

Sure, so one common issue I hear is "we're a small team, k8s feels like overkill, what else is on the market"? The issue is there are tons and tons of ways to run containers on virtually every cloud platform, but a lot of them are locked to that cloud platform. They're also typically billed at premium pricing because they remove all the elements of "running a server".

That's fine, but for small teams buying in too heavily to a vendor solution can be hard to get out of. Maybe they pick wrong and it gets deprecated, etc. So I try to push them towards a simpler stack that is more idiot-proof to manage. It varies by VPS provider but the basic stack looks like the following:

  • Debian servers setup with cloud-init to run all the updates, reboot, install the container manager of choice.
  • This also sets up Cloudflare tunnels so we can access the boxes securely and easily. Tailscale also works great/better for this. Avoids needing public IPs for each box.
  • Add a tag to each one of those servers so we know what it does (redis, app server, database)
  • Put them into a VPC together so they can communicate
  • Take the deploy script, have it SSH into the box and run the container update process

Linux updates involve a straightforward process of de-registering, destroying the VM and then starting fresh. Database is a bit more complicated but still doable. It's all easily done in simple scripts that you can tie to github actions if you are so inclined. Docker compose has been the glue that handles the actual launching and restarting of the containers for this sample stack.

When you outgrow this approach, you are big enough that you should have a pretty good idea of where to go now. Since everything is already in containers you haven't been boxed in and can migrate in whatever direction you want.

Why Not Docker

However I'm not thrilled with the current state of Docker as a full product. Even when I've paid for Docker Desktop I found it to be a profoundly underwhelming tool. It's slow, the UI is clunky, there's always an update pending, it's sort of expensive for what people use it for, and Windows users seem to hate it. When I've compared Podman vs Docker on servers or my local machines, Podman is faster, seems better designed and in general is trending in a stronger direction as a product. If I don't like Docker Desktop and prefer Podman Desktop, to me it's worth migrating the entire stack over and just dumping Docker as a tool I use. Fewer things to keep track of.

Now the problem is that while Podman has sort of a compatibility layer with Docker Compose, it's not a one-to-one replacement and you want to be careful using it. My testing showed it worked OK for basic examples, but with more complex stuff you start to run into problems. It also seems like work on the project has mostly been abandoned by the core maintainers. You can see it here: https://github.com/containers/podman-compose

I think podman-compose is the right solution for local dev, where you aren't using terribly complex examples and the uptime of the stack matters less. It's hard to replace Compose in this role because it's just so straightforward. As a production deployment tool I would stay away from it. This is important to note because right now the local dev container story often involves running k3s on your laptop. My experience is people loathe Kubernetes for local development and will go out of their way to avoid it.

The people I know who are all-in on Podman pushed me towards Quadlet as an alternative which uses systemd to manage the entire stack. That makes a lot of sense to me, because my Linux servers already have systemd and it's already a critical piece of software that (as far as I can remember) works pretty much as expected. So the idea of building on top of that existing framework makes more sense to me than attempting to recreate the somewhat haphazard design of Compose.

Wait I thought this already existed?

Yeah I was also confused. So there was a command, podman-generate-systemd, that I had used previously to run containers with Podman using systemd. That has been deprecated in favor of Quadlet units, which are more powerful and offer more of the Compose functionality, but are also more complex and less magically generated.

So if all you want to do is run a container or pod using Systemd, then you can still use podman-generate-systemd which in my testing worked fine and did exactly what it says on the box. However if you want to emulate the functionality of Compose with networks and volumes, then you want Quadlet.

What is Quadlet

The name comes from this excellent pun:

What do you get if you squash a Kubernetes kubelet?
A quadlet

Actually laughed out loud at that. Anyway Quadlet is a tool for running Podman containers under Systemd in a declarative way. It has been merged into Podman 4.4 so it now comes in the box with Podman. When you install Podman it registers a systemd-generator that looks for files in the following directories:

/usr/share/containers/systemd/
/etc/containers/systemd/
# Rootless users
$HOME/.config/containers/systemd/
$XDG_RUNTIME_DIR/containers/systemd/
$XDG_CONFIG_HOME/containers/systemd/
/etc/containers/systemd/users/$(UID)
/etc/containers/systemd/users/

You put unit files in whichever of these directories applies (creating the directory if it doesn't exist, which it probably doesn't), with the file extension telling you what kind of unit you are looking at.

For example, if I wanted a simple volume I would make the following file:

/etc/containers/systemd/example-db.volume

[Unit]
Description=Example Database Container Volume

[Volume]
Label=app=myapp

You have all the same options you would on the command line.

You can see the entire list here: https://docs.podman.io/en/latest/markdown/podman-systemd.unit.5.html

Here are the units you can create: name.container, name.volume, name.network, name.kube, name.image, name.build, name.pod

Workflow

So you have a pretty basic Docker Compose you want to replace with Quadlets. You probably need the following:

  • A network
  • Some volumes
  • A database container
  • An application container

The process is pretty straightforward.

Network

We'll make this one at: /etc/containers/systemd/myapp.network

[Unit]
Description=Myapp Network

[Network]
Label=app=myapp

Volume

/etc/containers/systemd/myapp.volume

[Unit]
Description=Myapp Container Volume

[Volume]
Label=app=myapp

/etc/containers/systemd/myapp-db.volume

[Unit]
Description=Myapp Database Container Volume

[Volume]
Label=app=myapp

Database

/etc/containers/systemd/postgres.container

[Unit]
Description=Myapp Database Container

[Service]
Restart=always

[Container]
Label=app=myapp
ContainerName=myapp-db
Image=docker.io/library/postgres:16-bookworm
Network=myapp.network
Volume=myapp-db.volume:/var/lib/postgresql/data
Environment=POSTGRES_PASSWORD=S3cret
Environment=POSTGRES_USER=user
Environment=POSTGRES_DB=myapp_db

[Install]
WantedBy=multi-user.target default.target

Application

/etc/containers/systemd/myapp.container

[Unit]
Description=Myapp Container
Requires=postgres.service
After=postgres.service

[Container]
Label=app=myapp
ContainerName=myapp
Image=wherever-you-get-this
Network=myapp.network
Volume=myapp.volume:/tmp/place_to_put_stuff
Environment=DB_HOST=postgres
Environment=WORDPRESS_DB_USER=user
Environment=WORDPRESS_DB_NAME=myapp_db
Environment=WORDPRESS_DB_PASSWORD=S3cret
PublishPort=9000:80

[Install]
WantedBy=multi-user.target default.target

Now you need to run

systemctl daemon-reload

and you should be able to use systemctl status to check each of these running processes (start them the first time with systemctl start). You don't need to run systemctl enable to get them to run on next boot IF you have the [Install] section defined. Also notice that when you set the dependencies (Requires, After), the target is called name-of-thing.service, not name-of-thing.container or .volume. It threw me off at first so I just wanted to call that out.
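
For reference, the loop I ended up with looks roughly like this (assuming the unit names from the examples above):

systemctl daemon-reload
systemctl start postgres.service myapp.service
systemctl status myapp.service
journalctl -u myapp.service -f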

One thing I want to call out

Containers support AutoUpdate, which means if you just want Podman to pull down the freshest image from your registry, that is supported out of the box. It's just AutoUpdate=registry. If you change that to local, Podman will restart the container when you build a new version of that image locally as part of a deployment. If you need more information about logging into registries with Podman you can find that here.

I find this very helpful for testing environments, where I can tell servers to just run podman auto-update and get the newest containers. It's also great because it has options to help handle rollbacks and failure scenarios, which are rare but can really blow up in your face with containers outside of k8s. You can see that here.
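
As a sketch, wiring this into the container unit from earlier and then running the update on whatever schedule you like looks roughly like this:

# In /etc/containers/systemd/myapp.container
[Container]
AutoUpdate=registry

# Then, from cron, a systemd timer, or by hand:
podman auto-update --dry-run   # show what would be updated
podman auto-update             # pull newer images and restart the affected units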

What if you don't store images somewhere?

So often with smaller apps it doesn't make sense to add a middle layer of building and storing the image in one place and then pulling it, versus just building the image on the machine you are deploying to with docker compose up -d --no-deps --build myapp

You can do the same thing with Quadlet build files. The unit files are similar to the ones above but with a .build extension, and the documentation makes it pretty simple to figure out how to convert whatever you are looking at.

I found this nice for quick testing so I could easily rsync changes to my test box and trigger a fast rebuild with the container layers mostly getting pulled from cache and only my code changes making a difference.
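
Here's a rough sketch of what such a build unit might look like, assuming your code and Containerfile get rsynced to /srv/myapp; treat the keys as a starting point and check the podman-systemd.unit docs for your Podman version:

/etc/containers/systemd/myapp.build

[Unit]
Description=Myapp Image Build

[Build]
ImageTag=localhost/myapp:latest
File=Containerfile
SetWorkingDirectory=/srv/myapp

[Install]
WantedBy=multi-user.target default.target

The .container unit should then be able to point Image= at the build unit (Image=myapp.build), so the container uses the freshly built image.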

How do secrets work?

So secrets are supported with Quadlets. Effectively they just build on top of podman secret or secrets in Kubernetes. Assuming you don't want to go the Kubernetes route for this purpose, you have a couple of options.

  1. Make a secret from a local file (probably bad idea): podman secret create my_secret ./secret.txt
  2. Make a secret from an environmental variable on the box (better idea): podman secret create --env=true my_secret MYSECRET
  3. Use stdin: printf <secret> | podman secret create my_secret -

Then you can reference these secrets inside of the .container file with Secret=name-of-podman-secret plus options. By default these secrets are mounted at /run/secrets/secretname as a file inside the container. You can configure it to be an environment variable (along with a bunch of other stuff) with the options outlined here.
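
As a sketch, referencing the secret created above from the postgres unit might look like this (the type/target options are described in the Quadlet docs):

# In /etc/containers/systemd/postgres.container
[Container]
Secret=my_secret,type=env,target=POSTGRES_PASSWORD

# Or, with no options, it shows up as a file at /run/secrets/my_secret inside the container:
Secret=my_secret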

Rootless

So my examples above were not rootless containers, which are best practice. You can get them to work, but the behavior is more complicated and has problems I wanted to call out. You need to use default.target instead of multi-user.target, and it also looks like you need loginctl enable-linger so your user can start the containers without being logged in.

Also remember that all of the systemctl commands need the --user argument and that you might need to change your sysctl parameters to allow rootless containers to run on privileged ports.

sudo sysctl net.ipv4.ip_unprivileged_port_start=80

Unblocks 80, for example.
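
Putting the rootless pieces together, the workflow looks roughly like this (unit files go under ~/.config/containers/systemd/ instead of /etc/containers/systemd/):

loginctl enable-linger $USER
systemctl --user daemon-reload
systemctl --user start myapp.service
systemctl --user status myapp.service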

Networking

So for rootless networking Podman previously used slirp4netns and now uses pasta. Pasta doesn't do NAT and instead copies the IP address from your main network interface to the container namespace. "Main" in this case is whichever interface has the default route. This can cause (obvious) problems with inter-container connections since it's all the same IP. You need to configure containers.conf to get around this problem.

[network]
pasta_options = ["-a", "10.0.2.0", "-n", "24", "-g", "10.0.2.2", "--dns-forward", "10.0.2.3"]

Also ping didn't work for me. You can fix that with the solution here.

That sounds like a giant pain in the ass.

Yeah I know. It's not actually the fault of the Podman team. The way rootless containers work is basically that they use user_namespaces to emulate the privileges needed to create containers. Inside the UserNS they can do things like create mount namespaces and set up networking. Outgoing connections are tricky because veth pairs cannot be created across UserNS boundaries without root. Inbound relies on port forwarding.

So tools like slirp4netns and pasta are used since they can translate Ethernet packets to unprivileged socket system calls by making a tap interface available in the namespace. However the end result is you need to account for a lot of potential strangeness in the configuration file. I'm confident this will get less fiddly as time goes on.

Podman also has a tutorial on how to get it set up here: https://github.com/containers/podman/blob/main/docs/tutorials/rootless_tutorial.md which did work for me. If you do the work of rootless containers now you have a much easier security story for the rest of your app, so I do think it ultimately pays off even if it is annoying in the beginning.

Impressions

So as a replacement for Docker Compose on servers, I've really liked Quadlet. I find the logging easier to figure out since we're just using the standard systemctl commands, and checking status is also easier and more straightforward. Getting the rootless containers running took... more time than I expected, because I didn't think about how they wouldn't start by default until the user logged back in without the linger work.

It does stink that this is absolutely not a solution for local dev for most places. I prefer that Podman remains daemonless and instead hooks into the existing functionality of systemd, but for people not running Linux as their local workstation (most people on Earth) you are either going to need to use the Podman Desktop Kubernetes functionality or use podman-compose and just be aware that it's not something you should use in actual production deployments.

But if you are looking for something that scales well, runs containers and is super easy to manage and keep running, this has been a giant hit for me.

Questions/comments/concerns: https://c.im/@matdevdug


Teaching to the Test. Why IT Security Audits Aren’t Making Stuff Safer

A lot has been written in the last few weeks about the state of IT security in the aftermath of the CrowdStrike outage. A range of opinions have emerged, ranging from blaming Microsoft for signing the CrowdStrike software (who in turn blame the EU for making them do it) to blaming the companies themselves for allowing all of these machines access to the Internet to receive the automatic template update. Bike-shedding among the technical community continues to be focused on the underlying technical deployment, which misses the forest for the trees.

The better question is: what was the forcing mechanism that convinced every corporation in the world that it was a good idea to install software like this on every single machine? Why is there such a cottage industry of companies that are effectively undermining operating system security with the argument that they are doing more "advanced" security features, allowing (often unqualified) security and IT departments to make fundamental changes to things like TLS encryption and basic OS functionality? How did all these smart people let a random company push updates to everyone on Earth with zero control? The justification often given is "to pass the audit".

These audits and certifications, of which there are many, are a fundamentally broken practice. The intent of the frameworks was good, allowing for the standardization of good cybersecurity practices while not relying on the expertise of an actual cybersecurity expert to validate the results. We can all acknowledge there aren't enough of those people on Earth to actually audit all the places that need to be audited. The issue is the audits don't actually fix real problems, but instead create busywork for people so it looks like they are fixing problems. It lets people cosplay as security experts without needing to actually understand what the stuff is.

I don't come to this analysis lightly. Between HIPAA, PCI, GDPR, ISO27001 and SOC2 I've seen every possible attempt to boil requirements down to a checklist that you can do. Add in the variations on these that large companies like to send out when you are attempting to sell them an enterprise SaaS and it wouldn't surprise me at all to learn that I've spent over 10,000 hours answering and implementing solutions to meet the arbitrary requirements of these documents. I have both produced the hundred page PDFs full of impressive-looking screenshots and diagrams AND received the PDFs full of diagrams and screenshots. I've been on many calls where it is clear neither of us understands what the other is talking about, but we agree that it sounds necessary and good.

I have also been there in the room when inept IT and Security teams use these regulations, or more specifically their interpretation of these regulations, to justify kicking off expensive and unnecessary projects. I've seen laptops crippled due to full filesystem scans looking for leaked AWS credentials and Social Security numbers, even if the employee has nothing to do with that sort of data. I've watched as TLS encryption is broken with proxies so that millions of files can be generated and stored inside of S3 for security teams to never ever look at again. Even I have had to reboot my laptop to apply a non-critical OS update in the middle of an important call. All this inflicted on poor people who had to work up the enthusiasm to even show up to their stupid jobs today.

Why?

Why does this keep happening? How is it that every large company keeps falling into the same trap of repeating the same expensive, bullshit processes?

  • The actual steps to improve cybersecurity are hard and involve making executives mad. You need to update your software, including planning ahead for end-of-life technology. Since this dark art is apparently impossible to do and would involve a lot of downtime to patch known-broken shit and reboot it, we won't do that. Better apparently to lose the entire Earth's personal data.
  • Everyone is terrified that there might be a government regulation with actual consequences, so the industry needs a solution to this problem that sounds impressive but carries no real punishments. If Comcast executives could go to jail for knowingly running out-of-date Citrix NetScaler software, it would have been fixed. So instead we get impressive-sounding things which can be held up as evidence of compliance, so that if they ultimately don't end up preventing leaks, the consequences are minor.
  • Nobody questions the justification of "we need to do x because of our certification". The actual requirements are too boring to read so it becomes this blank check that can be used to roll out nearly anything.
  • Easier to complete a million nonsense steps than it is to get in contact with someone who understands why the steps are nonsense. The number of times I've turned on silly "security settings" to pass an audit when the settings weren't applicable to how we used the product is almost too high to count.
  • Most Security teams aren't capable of stopping a dedicated attacker and, in their souls, know that to be true. Especially in large organizations, the number of conceivable attack vectors becomes too painful to even think about. Therefore too much faith is placed in companies like Zscaler and CrowdStrike to use "machine learning and AI" (read: magic) to close up all the possible exploits before they happen.
  • If your IT department works exclusively with Windows and spends their time working with GPOs and Powershell, every problem you hand them will be solved with Windows. If you handed the same problem to a Linux person, you'd get a Linux solution. People just use what they know. So you end up with a one-size-fits-all approach to problems. Like mice in a maze where almost every step is electrified, if Windows loaded up with bullshit is what they are allowed to deploy without hassles that is what you are going to get.

Future

We all know this crap doesn't work and the sooner we can stop pretending it makes a difference, the better. AT&T had every certification on the planet and still didn't take the incredibly basic step of enforcing 2FA on a database of all the most sensitive data it has in the world. If following these stupid checklists and purchasing the required software resulted in more secure platforms, I'd say "well at least there is a payoff". But time after time we see the exact same thing, which is that an audit is not an adequate replacement for someone who knows what they are doing looking at your stack and asking hard questions about your process. These audits aren't resulting in organizations doing the hard but necessary step of taking downtime to patch critical flaws, or even applying basic security settings across all of their platforms.

Because cryptocurrency now allows hacking groups to demand millions of dollars in payments (thanks crypto!), the financial incentives to cripple critical infrastructure have never been better. At the same time most regulations designed to encourage the right behavior are completely toothless. Asking the tech industry to regulate itself has failed, without question. All that does is generate a lot of pain and suffering for their employees, who most businesses agree are disposable and idiots, while doing nothing to secure personal data. Even in organizations that had smart security people asking hard questions, that advice is entirely optional. There is no stick when it comes to cybersecurity and businesses, especially now that almost all of them have made giant mistakes.

I don't know what the solution is, but I know this song and dance isn't working. The world would be better off if organizations stopped wasting so much time and money on these vendor solutions and instead stuck to much more basic solutions. Perhaps if we could just start with "have we patched all the critical CVEs in our organization" and "did we remove the shared username and password from the cloud database with millions of call records", then perhaps AFTER all the actual work is done we can have some fun and inject dangerous software into the most critical parts of our employees devices.

Find me at: https://c.im/@matdevdug


Sears

It was 4 AM when I first heard the tapping on the glass. I had been working for 30 minutes trying desperately to get everything from the back store room onto the sales floor when I heard a light knocking. Peeking out from the back I saw an old woman wearing sweat pants and a Tweetie bird jacket, oxygen tank in tow, tapping a cane against one of the big front windows. "WE DON'T OPEN UNTIL 5" shouted my boss, who shook her head and resumed stacking boxes. "Black Friday is the worst" she said to nobody as we continued to pile the worthless garbage into neat piles on the store floor.

What people know now but didn't understand then was the items for sale on Black Friday weren't our normal inventory. These were TVs so poorly made they needed time to let their CRT tubes warm up before the image became recognizable. Radios with dials so brittle some came out of the box broken. Finally a mixer that when we tested it in the back let out such a stench of melted plastic we all screamed to turn it off before we burned down the building. I remember thinking as I unloaded it from the truck certainly nobody is gonna want this crap.

Well here they were and when we opened the doors they rushed in with a violence you wouldn't expect from a crowd of mostly senior citizens. One woman pushed me to get at the TVs, which was both unnecessary (I had already hidden one away for myself and put it behind the refrigerators in the back) and not helpful as she couldn't lift the thing on her own. I watched in silence as she tried to get her hands around the box with no holes cut out, presumably a cost savings on Sears part, grunting with effort as the box slowly slid while she held it. At the checkout desk a man told me he was buying the radio "as a Christmas gift for his son". "Alright but no returns ok?" I said keeping a smile on my face.

We had digital cameras the size of shoe-boxes, fire-hazard blenders and an automatic cat watering dish that I just knew was going to break a lot of hearts when Fluffy didn't survive the family trip to Florida. You knew it was quality when the dye from the box rubbed off on your hands when you picked it up. Despite my jokes about worthless junk, people couldn't purchase it fast enough. I saw arguments break out in the aisles and saw Robert, our marine veteran sales guy, whisper "forget this" and leave for a smoke by the loading dock. When I went over to ask if I could help, the man who had possession of the digital camera spun around and told me to "either find another one of these cameras or butt the fuck out". They resumed their argument and I resumed standing by the front telling newcomers that everything they wanted was already gone.

Hours later I was still doing that, informing everyone who walked in that the item they had circled in the newspaper was already sold out. "See, this is such a scam, why don't you stock more of it? It's just a trick to get us into the store". Customer after customer told me variations on the above, including one very kind looking grandfather type informing me I could "go fuck myself" when I wished him a nice holiday.

Beginnings

The store was in my small rural farming town in Ohio, nestled between the computer shop where I got my first job and a carpet store that was almost certainly a money laundering front since nobody ever went in or out. I was interviewed by the owner, a Vietnam veteran who spent probably half our interview talking about his two tours in Vietnam. "We used to throw oil drums in the water and shoot at them from our helicopter, god that was fun. Don't even get me started about all the beautiful local woman." I nodded, unsure what this had to do with me but sensing this was all part of his process. In the years to come I would learn to avoid sitting down in his office, since then you would be trapped listening to stories like these for an hour plus.

After these tales of what honestly sounded like a super fun war full of drugs and joyrides on helicopters, he asked me why I wanted to work at Sears. "It's an American institution and I've always had a lot of respect for it" I said, not sure if he would believe it. He nodded and went on to talk about how Sears built America. "Those kit houses around town, all ordered from Sears. Boy we were something back in the day. Anyway fill out your availability and we'll get you out there helping customers." I had assumed at some point I would get training on the actual products, which never happened in the years I worked there. In the back were dust-covered training manuals which I was told I should look at "when I got some time". I obviously never did and still sometimes wonder about what mysteries they contained.

I was given my lanyard and put on the floor, which consisted of half appliances, one quarter electronics and the rest tools. Jane, one of the saleswomen, told me to "direct all the leads for appliances to her" and not check one out myself, since I didn't get commission. Most of my job consisted of swapping broken Craftsman tools since they had a lifetime warranty. You filled out a carbon paper form, dropped the broken tool into a giant metal barrel and then handed them a new one. I would also set up deliveries for rider lawnmowers and appliances, working on an ancient IBM POS terminal that required memorizing a series of strange keyboard shortcuts to navigate the calendar.

When there was downtime, I would go into the back and help Todd assemble the appliances and rider lawnmowers. Todd was a special needs student at my high school who was the entirety of our "expert assembly" service. He did a good job, carefully following the manual every time. Whatever sense of superiority I felt as an honor roll student disappeared when he watched me try to assemble a rider mower myself. "You need to read the instructions and then do what they say" he would helpfully chime in as I struggled to figure out why the brakes did nothing. His mowers always started on the first try while mine were safety hazards that I felt certain were going to end up on the news. "Tonight a Craftsman rider lawnmower killed a family of 4. It was assembled by this idiot." Then just my yearbook photo where I had decided to bleach my hair blonde like a chonky backstreet boy overlaid on top of live footage of blood splattered house siding.

Any feeling I had that people were wasting $200 by paying us to assemble their rider mowers disappeared when I saw the first one a customer had tried to assemble himself. If my mowers were death traps these were actual IEDs whose only conceivable purpose on Earth would be to trick innocent people into thinking they were rider lawnmowers until you turned the key and they blew you into the atmosphere. One guy brought his back with several ziplock bags full of screws, bashfully explaining that he tried his best but "there's just no way that's right". That didn't stop me from holding my breath every time someone drove a mower I had worked on up the ramp into the back of the truck. "Please god just don't fall apart right now, wait until they get it home" was my prayer to whatever deity looked after idiots in jobs they shouldn't have.

Sometimes actual adults with real jobs would come in asking me questions about tools, conversations that both of us hated. "I'm looking for an oil filter wrench" they would say, as if this item was something I knew about and could find. "Uh sure, could you describe it?" "It's a wrench, used for changing oil filters, has a loop on it." I'd nod and then feebly offer them up random items until they finally grabbed it themselves. One mechanic, when I offered up a claw hammer in response to his request for a cross-pein hammer, said "you aren't exactly handy, are you?" I shook my head and went back behind the counter, attempting to establish what little authority I had left with the counter. I might not know anything about the products we sell, but only one of us is allowed back here sir.

Sears Expert

As the months dragged on I was moved from the heavier foot traffic shifts to the night shifts. This was because customers "didn't like talking to me", a piece of feedback I felt was true but still unfair. I had learned a lot, like every incorrect way to assemble a lawn mower and that refrigerators are all the same except for the external panels. Night shifts were mostly getting things ready for the delivery company, a father and son team who were always amusing.

The father was a chain-smoking tough guy who would regularly talk about his "fuck up" of a son. "That idiot dents another oven when we're bringing it in I swear to god I'm going to replace him with one of those Japanese robots I keep seeing on the news." The son was the nicest guy on Earth, really hard working, always on time for deliveries and we got like mountains of positive feedback about him. Old ladies would tear up as they told me about the son hauling their old appliances away in a blizzard on his back. He would just sit there, smile frozen on his face while his father went on and on about how much of a failure he was. "He's just like this sometimes" the son would tell me by the loading dock, even though I would never get involved. "He's actually a nice guy". This was often punctuated by the father running into a minor inconvenience and flying off the handle. "What kind of jackass would sort the paperwork alphabetically instead of by order of delivery?" he'd scream from the parking lot.

When the son went off to college he was replaced by a Hispanic man who took zero shit. His response to customer complaints was always that they were liars and I think the father was afraid of him. "Oh hey don't bother Leo with that, he's not in the mood, I'll call them and work it out" the father would tell me as Leo glared at us from the truck. Leo was incredibly handy though, able to fix almost any dent or scratch in minutes. He popped the dent out of my car door by punching the panel, which is still one of the cooler things I've seen someone do.

Other than the father and son duo, I was mostly alone with a woman named Ruth. She fascinated me because her life was unspeakably bleak. She had been born and raised in this town and had only left the county once in her life, to visit the Sears headquarters in Chicago. She'd talk about it like she had been permitted to visit heaven. "Oh it was something, just a beautiful shiny building full of the smartest people you ever met. Boy I'd love to see it again sometime." She had married her high school boyfriend, had children and now worked here in her 60s as her reward for a life of hard work. She had such bad pain in her knees she had to lean on the stocking cart as she pushed it down the aisles, often stopping to catch her breath. The store would be empty except for the sounds of a wheezing woman and squeaky wheels.

When I would mention Chicago was a 4 hour drive and she could see it again, she'd roll her eyes at me and continue stocking shelves. Ruth was a type of rural person I encountered a lot who seemed to get off on the idea that we were actually isolated from the outside world by a force field. Mention leaving the county to go perhaps to the next county and she would laugh or make a comment about how she wasn't "that kind of person". Every story she would tell had these depressing endings that left me pondering what kind of response she was looking for. "My brother, well he went off to war and when he came back was just a shell of a man. Never really came back if you ask me. Anyway let's clean the counters."

She'd talk endlessly about her grandson, a 12 year old who was "stupid but kind". His incredibly minor infractions were relayed to me like she was telling me about a dark family scandal. "Then I said, who ate all the chips? I knew he had, but he just sat there looking at me and I told him you better wipe those crumbs off your t-shirt smartass and get back to your homework". He finally visited and I was shocked to discover there was also a granddaughter who I had never heard about. He smirked when he met me and told me that Ruth had said I was "a lazy snob".

I'll admit, I was actually a little hurt. Was I a snob compared to Ruth? Absolutely. To be honest with you I'm not entirely sure she was literate. I'd sneak books under the counter to read during the long periods where nothing was happening and she'd often ask me what they were about even if the title sort of explained it. "What is Battle Cry of Freedom: The Civil War Era about? Um well the Civil War." I'd often get called over to "check" documents for her, which typically included anything more complicated than a few sentences. I still enjoyed working with her.

Our relationship never really recovered after I went to Japan when I was 16. I went by myself and wandered around Tokyo, having a great time. When I returned full of stories and pictures of the trip, I could tell she was immediately sick of me. "Who wants to see a place like Japan? Horrible people" she'd tell me as I tried to tell her that things had changed a tiny bit since WWII. "No it's really nice and clean, the food was amazing, let me tell you about these cool trains they have". She wasn't interested and it was clear my getting a passport and leaving the US had changed her opinion of me.

So when her grandson confided that she had called me lazy AND a snob my immediate reaction was to lean over and tell him that she had called him "a stupid idiot". Now she had never actually said "stupid idiot", but in the heat of the moment I went with my gut. Moments after I did that the reality of a 16 year old basically bullying a 12 year old sunk in and I decided it was time for me to go take out some garbage. Ruth of course found out what I said and mentioned it every shift after that. "Saying I called my grandson a stupid idiot, who does that, a rude person that's who, a rude snob" she'd say loud enough for me to hear as the cart very slowly inched down the aisles. I deserved it.

Trouble In Paradise

At a certain point I was allowed back in front of customers and realized with a shock that I had worked there for a few years. The job paid very little, which was fine as I had nothing in the town to actually buy, but enough to keep my lime green Ford Probe full of gas. It shook violently if you exceeded 70 MPH, which I should have asked someone about but never did. I was paired with Jane, the saleswoman who was a devout Republican and liked to make fun of me for being a Democrat. This was during the George W Bush vs Kerry election and she liked to point out how Kerry was a "flipflopper" on things. "He just flips and flops, changes his mind all the time". I'd point out we had vaporized the country of Iraq for no reason and she'd roll her eyes and tell me I'd get it when I was older.

My favorite was when we were working together during Reagan's funeral, an event which elicited no emotion from me but drove her to tears multiple times. "Now that was a man and a president" she'd exclaim to the store while the funeral procession was playing on the 30 TVs. "He won the Cold War you know?" she'd shout at a woman looking for replacement vacuum cleaner bags. Afterwards she asked me what my favorite Reagan memory was. All I could remember was that he had invaded the small nation of Grenada for some reason, so I said that. "Really showed those people not to mess with the US" she responded. I don't think either one of us knew that Grenada is a tiny island nation with a population less than 200,000.

Jane liked to dispense country wisdom, witty one-liners that only sometimes were relevant to the situation at hand. When confronted with an angry customer she would often say afterwards that "You can't make a silk purse out of a sow's ear", which still means nothing to me. Whatever rural knowledge I was supposed to obtain through osmosis my brain clearly rejected. Jane would send me over to sell televisions since I understood what an HDMI cord was and the difference between SD and HD television.

Selling TVs was perhaps the only thing I did well, that and the fun vacuum demonstration where we would dump a bunch of dirt on a carpet tile and suck it up. Some poor customer would tell me she didn't have the budget for the Dyson and I'd put my hand up to silence her. "You don't have to buy it, just watch it suck up a bunch of pebbles. I don't make commission anyway so who cares." Then we'd both watch as the Dyson made a horrible screeching noise and sucked in a cup's worth of small rocks. "That's pretty cool huh?" and the customer would nod, probably terrified of what I would do if she said no.

Graduation

When I graduated high school and prepared to go off to college, I had the chance to say goodbye to everyone before I left. They had obviously already replaced me with another high school student, one that knew things about tools and was better looking. You like to imagine that people will miss you when you leave a job, but everyone knew that wasn't true here. I had been a normal employee who didn't steal and mostly showed up on time.

My last parting piece of wisdom from Ruth was not to let college "make me forget where I came from". Sadly for her I was desperate to do just that, entirely willing to adopt whatever new personality was presented to me. I hated rural life and still do: the spooky dark roads surrounded by corn, yelling at Amish teens to stop shoplifting during their Rumspringa when they would get dropped off in the middle of town and left to their own devices.

Still I'm grateful that I at least know how to assemble a rider lawnmower, even if it did take a lot of practice runs on customers' mowers.


A Eulogy for DevOps

We hardly knew ye.

DevOps, like many trendy technology terms, has gone from the peak of optimism to the depths of exhaustion. While many of the fundamental ideas behind the concept have become second-nature for organizations, proving it did in fact have a measurable outcome, the difference between the initial intent and where we ended up is vast. For most organizations this didn't result in a wave of safer, easier to use software but instead encouraged new patterns of work that centralized risk and introduced delays and irritations that didn't exist before. We can move faster than before, but that didn't magically fix all our problems.

The cause of its death was a critical misunderstanding over what was making software hard to write. The belief was that by removing barriers to deployment, more software would get deployed and things would be easier and better; effectively, that the issue was developers and operations teams being held back by ridiculous process and coordination. In reality these "soft problems" of communication and coordination are much more difficult to solve than the technical problems around pushing more code out into the world more often.

What is DevOps?

DevOps, when it was introduced around 2007, was a pretty radical concept of removing the divisions between the people who ran the hardware and the people who wrote the software. Organizations still had giant silos between teams, and I experienced a lot of that workflow myself.

Since all computer nerds also love space, it was basically us cosplaying as NASA, copying a lot of the procedures and ideas from NASA to try to increase the safety around pushing code out into the world. Different organizations would copy and paste different parts, but the basic premise was that every release was as close to bug free as time allowed. You were typically shooting for zero exceptions.

When I worked for a legacy company around that time, the flow for releasing software looked as follows.

  • Development team would cut a release of the server software with a release number in conjunction with the frontend team typically packaged together as a full entity. They would test this locally on their machines, then it would go to dev for QA to test, then finally out to customers once the QA checks were cleared.
  • Operations teams would receive a playbook of effectively what the software was changing and what to do if it broke. This would include how it was supposed to be installed, whether it did anything to the database and so on; it was a whole living document. The idea was that the people managing the servers, networking equipment and SANs had no idea what the software did or how to fix it, so they needed what were effectively step by step instructions. Sometimes you would even get this as a paper document.
  • Since these happened often inside of your datacenter, you didn't have unlimited elasticity for growth. So, if possible, you would slowly roll out the update and stop to monitor at intervals. But you couldn't do what people see now as a blue/green deployment because rarely did you have enough excess server capacity to run two versions at the same time for all users. Some orgs did do different datacenters at different times and cut between them (which was considered to be sort of the highest tier of safety).
  • You'd pick a deployment day, typically middle of the week around 10 AM local time and then would monitor whatever metrics you had to see if the release was successful or not. These were often pretty basic metrics of success, including some real eyebrow raising stuff like "is support getting more tickets" and "are we getting more hits to our uptime website". Effectively "is the load balancer happy" and "are customers actively screaming at us".
  • You'd finish the deployment and then the on-call team would monitor the progress as you went.

Why Didn't This Work

Part of the issue was this design was very labor-intensive. You needed enough developers coordinating together to put together a release. Then you needed a staffed QA team to actually take that software and ensure, on top of automated testing which was jusssttttt starting to become a thing, that the software actually worked. Finally you needed a technical writer working with the development team to figure out what the release playbook should look like, and then the Operations team had to receive the book, review it and implement the plan.

It was also slow. Features that were already done would often sit for months just because a more important feature had to go out first, or because an update was making major changes to the database and we didn't want to bundle six other things in with the one possibly catastrophic change. It's effectively the Agile vs Waterfall divide broken out into practical steps.

Waterfall vs Agile in software development infographic

A lot of the lip service around this time that was given as to why organizations were changing was, frankly, bullshit. The real reason companies were so desperate to change was the following:

  • Having lots of mandatory technical employees they couldn't easily replace was a bummer
  • Recruitment was hard and expensive.
  • Sales couldn't easily inject whatever last-minute deal requirement they had into the release cycle since that was often set in stone.
  • It provided an amazing opportunity for SaaS vendors to inject themselves into the process by offloading complexity into their stack so they pushed it hard.
  • The change also emphasized the strengths of cloud platforms at the time when they were starting to gobble market share. You didn't need lots of discipline, just allocate more servers.
  • Money was (effectively) free so it was better to increase speed regardless of monthly bills.
  • Developers were understandably frustrated that minor changes could take weeks to get out the door while they were being blamed for customer complaints.

So executives went to a few conferences and someone asked them if they were "doing DevOps" and so we all changed our entire lives so they didn't feel like they weren't part of the cool club.

What Was DevOps?

Often this image is used to sum it up:

DevOps Infinity Wheel

In a nutshell, the basic premise was that development teams and operations teams were now one team. QA was fired and replaced with this idea that because you could very quickly deploy new releases and get feedback on those releases, you didn't need a lengthy internal test period where every piece of functionality was retested and determined to still be relevant.

Often this is conflated with the concept of SRE from Google, which I will argue until I die is a giant mistake. SRE is in the same genre but a very different tune, with a much more disciplined and structured approach to this problem. DevOps instead is about the simplification of the stack such that any developer on your team can deploy to production as many times in a day as they wish with only the minimal amounts of control on that deployment to ensure it had a reasonably high chance of working.

In reality DevOps as a practice looks much more like how Facebook operated, with employees committing to production on their first day and relying extensively on real-world signals to determine success or failure vs QA and tightly controlled releases.

In practice it looks like this (with a toy sketch of the build-and-push step after the list):

  • Development makes a branch in git and adds a feature, fix, change, etc.
  • They open up a PR and then someone else on that team looks at it, sees it passes their internal tests, approves it and then it gets merged into main. This is effectively the only safety step, relying on the reviewer to have perfect knowledge of all systems.
  • This triggers a webhook to the CI/CD system which starts the build (often of an entire container with this code inside) and then once the container is built, it's pushed to a container registry.
  • The CD system tells the servers that the new release exists, often through a Kubernetes deployment, pushing a new version of an internal package or using the internal CLI of the cloud provider's specific "run a container as a service" platform. It then monitors and tells you about the success or failure of that deployment.
  • Finally there are release-aware metrics which allow that same team, who is on-call for their application, to see if something has changed since they released it. Is latency up, error count up, etc. This is often just a line in a graph saying this was old and this is new.
  • Depending on the system, this can either be something where every time the container is deployed it is on brand-new VMs or it is using some system like Kubernetes to deploy "the right number" of containers.
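
To make the middle of that list concrete, here is a toy sketch of the "webhook triggers a build and push" step as a tiny Flask handler. It isn't any real CI system; the endpoint path, registry name and payload fields are illustrative assumptions only.

import subprocess
from flask import Flask, request

app = Flask(__name__)
REGISTRY = "registry.example.com/myapp"  # hypothetical registry/image name

@app.route("/hooks/push", methods=["POST"])
def on_push():
    payload = request.get_json(silent=True) or {}
    # Only react to merges into main; a real CI system would also verify the webhook signature.
    if payload.get("ref") != "refs/heads/main":
        return {"ignored": payload.get("ref")}, 202
    sha = payload.get("after", "latest")
    image = f"{REGISTRY}:{sha}"
    # Build the container and push it to the registry; something else (Kubernetes,
    # a CD tool, a bash script) then notices the new tag and rolls it out.
    subprocess.run(["docker", "build", "-t", image, "."], check=True)
    subprocess.run(["docker", "push", image], check=True)
    return {"built": image}, 201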

The sales pitch was simple. Everyone can do everything so teams no longer need as many specialized people. Frameworks like Rails made database operations less dangerous, so we don't need a team of DBAs. Hell, use something like Mongo and you never need a DBA!

DevOps combined with Agile ended up with a very different philosophy of programming which had the following conceits:

  • The User is the Tester
  • Every System Is Your Specialization
  • Speed Of Shipping Above All
  • Catch It In Metrics
  • Uptime Is Free, SSO Costs Money (cheap-to-build features were sold as premium while expensive-to-provide availability wasn't charged for)
  • Logs Are Business Intelligence

What Didn't Work

The first cracks in this model emerged pretty early on. Developers were testing on their local Mac and Windows machines and then deploying code to Linux servers configured from Ansible playbooks and left running for months, sometimes years. Inevitably small differences in the running fleet of production servers emerged, either from package upgrades for security reasons or just from random configuration events. This could be mitigated by frequently rotating the running servers by destroying and rebuilding them as fresh VMs, but in practice this wasn't done as often as it should have been.

Soon you would see things like "it's running fine on box 1, 2, 4, 5, but 3 seems to be having problems". It wasn't clear in the DevOps model who exactly was supposed to go figure out what was happening or how. In the previous design someone who worked with Linux for years and with these specific servers would be monitoring the release, but now those team members often wouldn't even know a deployment was happening. Telling someone who is amazing at writing great Javascript to go "find the problem with a Linux box" turned out to be easier said than done.

Quickly feedback from developers started to pile up. They didn't want to have to spend all this time figuring out what Debian package they wanted for this or that dependency. It wasn't what they were interested in doing and also they weren't being rewarded for that work, since they were almost exclusively being measured for promotions by the software they shipped. This left the Operations folks in charge of "smoothing out" this process, which in practice often meant really wasteful practices.

You'd see really strange workflows around this time of doubling the number of production servers you were paying for by the hour during a deployment and then slowly scaling them down, all relying on the same AMI (server image) to ensure some baseline level of consistency. However since any update to the AMI required a full dev-stage-prod check, things like security upgrades took a very long time.

Soon you had just a pile of issues that became difficult to assign. Who "owned" platform errors that didn't result in problems for users? When a build worked locally but failed inside of Jenkins, what team needed to check that? The idea of we're all working on the same team broke down when it came to assigning ownership of annoying issues because someone had to own them or they'd just sit there forever untouched.

Enter Containers

DevOps got a real shot in the arm with the popularization of containers, which allowed the movement to progress past its awkward teenage years. Not only did this (mostly) solve the "it worked on my machine" thing but it also allowed for a massive simplification of the Linux server component part. Now servers were effectively dumb boxes running containers, either on their own with Docker compose or as part of a fleet with Kubernetes/ECS/App Engine/Nomad/whatever new thing that has been invented in the last two weeks.

Combined with the fact that you could move almost everything that might previously have been a networking team problem or a SAN problem into configuration inside of the cloud provider through tools like Terraform, you saw a real flattening of the skill curve. This greatly reduced the expertise required to operate these platforms and allowed for more automation. Soon you started to see what we now recognize as the current standard for development, which is "I push out a bajillion changes a day to production".

What Containers Didn't Fix

So there's a lot of other shit in that DevOps model we haven't talked about.

So far teams had improved the "build, test and deploy" parts. However operating the crap was still very hard. Observability was really really hard and expensive. Discoverability was actually harder than ever because stuff was constantly changing beneath your feet and finally the Planning part immediately collapsed into the ocean because now teams could do whatever they wanted all the time.

Operate

This meant someone going through and doing all the boring stuff. Upgrading Kubernetes, upgrading the host operating system, making firewall rules, setting up service meshes, enforcing network policies, running the bastion host, configuring the SSH keys, etc. What organizations quickly discovered was that this stuff was very time consuming to do and often required more specialization than the roles they had previously gotten rid of.

Before you needed a DBA, a sysadmin, a network engineer and some general Operations folks. Now you needed someone who not only understood databases but understood your specific cloud provider's version of that database. You still needed someone with the sysadmin skills, but in addition they needed to be experts in your cloud platform in order to ensure you weren't exposing your database to the internet. Networking was still critical but now it all existed at a level outside of your control, meaning weird issues would sometimes have to get explained as "well that sometimes happens".

Often teams would delay maintenance tasks out of a fear of breaking something like k8s or their hosted database, but that resulted in delaying the pain and making their lives more difficult. This was the era where every startup I interviewed with basically just wanted someone to update all the stuff in their stack "safely". Every system was well past EOL and nobody knew how to Jenga it all together.

Observe

As applications shipped more often, knowing whether they worked became more important so you could roll back if a release blew up in your face. However replacing simple uptime checks with detailed traces, metrics and logs was hard. These technologies are specialized and require a detailed understanding of what they do and how they work. A centralized syslog box works up to a point and then it doesn't. Prometheus scales to a certain volume of metrics and then no longer works on a single box. You needed someone with a detailed understanding of how metrics, logs and traces work, and of how to work with development teams to get them sending the correct signals to the right places at the right fidelity.

Or you could pay a SaaS a shocking amount to do it for you. The rise of companies like Datadog and the eye-watering bills that followed was proof that they understood how important what they were providing was. You quickly saw Observability bills exceed CPU and networking costs for organizations, as one team would misconfigure their application logs and suddenly you had blown through your monthly quota in a week.

Developers were expected to monitor with detailed precision what was happening with their applications without a full understanding of what they were seeing. How many metrics and logs were being dropped on the floor or sampled away, how the platform displayed those logs to them, how to write a query over terabytes of logs so you can surface what you need quickly: all of this was being passed around in Confluence pages written by desperate developers who were learning how all this shit works together while getting paged at 2AM.

Continuous Feedback

This to me is the same problem as Observe. It's about whether your deployment worked or not and whether you had signal from internal tests if it was likely to work. It's also about feedback from the team on what in this process worked and what didn't, but because nobody ever did anything with that internal feedback we can just throw that one directly in the trash.

I guess in theory this would be retros where we all complain about the same six things every sprint and then continue with our lives. I'm not an Agile Karate Master so you'll need to talk to the experts.

Discover

A big pitch of combining these teams was the idea of more knowledge sharing. Development teams and Operation teams would be able to cross-share more about what things did and how they worked. Again it's an interesting idea and there was some improvement to discoverability, but in practice that isn't how the incentives were aligned.

Developers weren't rewarded for discovering more about how the platform operated and Operations didn't have any incentive to sit down and figure out how the frontend was built. It's not a lack of intellectual curiosity by either party, just the finite amount of time we all have before we die and what we get rewarded for doing. Being surprised that this didn't work is like being surprised a mouse didn't go down the tunnel with no cheese just for the experience.

In practice I "discovered" that if NPM was down nothing worked, and the frontend team "discovered" that troubleshooting Kubernetes was a bit like the Warhammer 40k Adeptus Mechanicus: waving incense in front of machines they didn't understand in the hope that it would make the problem go away.

The Adeptus Mechanicus - Warhammer Universe (2024)
Try restarting the Holy Deployment

Plan

Maybe more than anything else, this lack of centralization impacted planning. Since teams weren't syncing on a regular basis anymore, things could continue in crazy directions unchecked. In theory PMs were syncing with each other to try and ensure there were railroad tracks in front of the train before it plowed into the ground at 100 MPH, but that was a lot to put on a small cadre of people.

We see this especially in large orgs with microservices where it is easier to write a new microservice to do something rather than figure out which existing microservice does the thing you are trying to do. This model was sustainable when money was free and cloud budgets were unlimited, but once that gravy train crashed into the mountain of "businesses need to be profitable and pay taxes" that stopped making sense.

The Part Where We All Gave Up

A lot of orgs solved the problems above by simply throwing bodies into the mix. More developers meant it was possible for teams to have someone (anyone) learn more about the systems and how to fix them. Adding more levels of PMs and overall planning staff meant even with the frantic pace of change it was...more possible to keep an eye on what was happening. While cloud bills continued to go unbounded, for the most part these services worked and allowed people to do the things they wanted to do.

Then the layoffs and budget cuts started. Suddenly it wasn't acceptable to spend unlimited money with your logging platform and your cloud provider on top of having a full team. Almost instantly I saw the shift as organizations started talking about "going back to basics". Part of this was a hard turn in the narrative around Kubernetes, where it went from an amazing technology that lets you grow to Google scale to a weight around an organization's neck that nobody understood.

Platform Engineering

Since there are no new ideas, just new terms, a successor to the throne has emerged. No longer are development teams expected to understand and troubleshoot the platforms that run their software, instead the idea is that the entire process is completely abstracted away from them. They provide the container and that is the end of the relationship.

From a certain perspective this makes more sense since it places the ownership for the operation of the platform with the people who should have owned it from the beginning. It also removes some of the ambiguity over what is whose problem. The development teams are still on-call for their specific application errors, but platform teams are allowed to enforce more global rules.

Well at least in theory. In practice this is another expansion of roles. You went from needing to be a Linux sysadmin to being a cloud-certified Linux sysadmin to being a Kubernetes-certified multicloud Linux sysadmin to finally being an application developer who can create a useful webUI for deploying applications on top of a multicloud stack that runs on Kubernetes in multiple regions with perfect uptime and observability that doesn't blow the budget. I guess at some point between learning the difference between AWS and GCP we were all supposed to go out and learn how to make useful websites.

This division of labor makes no sense but at least it's something I guess. Feels like somehow Developers got stuck with a lot more work and Operation teams now need to learn 600 technologies a week. Surprisingly tech executives didn't get any additional work with this system. I'm sure the next reorg they'll chip in more.

Conclusion

We are now seeing a massive contraction of the Infrastructure space. Teams are increasingly looking for simple, less platform specific tooling. In my own personal circles it feels like a real return to basics, as small and medium organizations abandon technology like Kubernetes and adopt much more simple and easy-to-troubleshoot workflows like "a bash script that pulls a new container".

In some respects it's a positive change, as organizations stop pretending they needed a "global scale" and can focus on actually servicing the users and developers they have. In reality a lot of this technology was adopted by organizations who weren't ready for it and didn't have a great plan for how to use it.

However Platform Engineering is not a magical solution to the problem. It is instead another fabrication of an industry desperate to show monthly growth for cloud providers, who know teams lack the expertise to create the kinds of tooling described by such practices. In reality organizations need to be more brutally honest about what they actually need vs what bullshit they've been led to believe they need.

My hope is that we keep the gains from the DevOps approach and focus on simplification and stability over rapid transformation in the Infrastructure space. I think we desperately need a return to basics ideology that encourages teams to stop designing with the expectation that endless growth is the only possible outcome of every product launch.


GitHub Copilot Workspace Review

I was recently invited to try out the beta for GitHub's new AI-driven web IDE and figured it could be an interesting time to dip my toes into AI. So far I've mostly avoided AI tooling, having tried the paid GitHub Copilot option and been frankly underwhelmed; it made more work for me than it saved. However this is free for me to try and I figured "hey why not".

Disclaimer: I am not and have never been an employee of GitHub, Microsoft, any company owned by Microsoft, etc. They don't care about me and likely aren't aware of my existence. Nobody from GitHub PR asked me to do this and probably won't like what I have to say anyway.

TL;DR

GitHub Copilot Workspace didn't work on a super simple task regardless of how easy I made the task. I wouldn't use something like this for free, much less pay for it. It sort of failed in every way it could at every step.

What is GitHub Copilot Workspace?

So this builds on GitHub Copilot, which seems to have been a success according to them:

In 2022, we launched GitHub Copilot as an autocomplete pair programmer in the editor, boosting developer productivity by up to 55%. Copilot is now the most widely adopted AI developer tool. In 2023, we released GitHub Copilot Chat—unlocking the power of natural language in coding, debugging, and testing—allowing developers to converse with their code in real time.

They have expanded on this feature set with GitHub Copilot Workspace, a combination of an AI tool with an online IDE....sorta. It's all powered by GPT-4 so my understanding is this is the best LLM money can buy. The workflow of the tool is strange and takes a little bit of explanation to convey what it is doing.

GitHub has the marketing page here: https://githubnext.com/projects/copilot-workspace and the docs here: https://github.com/githubnext/copilot-workspace-user-manual. It's a beta product and I thought the docs were nicely written.

Effectively you start with a GitHub Issue, the classic way maintainers are harassed by random strangers. I've moved my very simple demo site: https://gcp-iam-reference.matduggan.com/ to a GitHub repo to show what I did. So I open the issue here: https://github.com/matdevdug/gcp-iam-reference/issues/1

Very simple, makes sense. Then I click "Open in Workspaces", which brings me to a kind of GitHub Actions-inspired flow.

It reads the Issue and creates a Specification, which is editable.

Then you generate a Plan:

Finally it generates the files of that plan and you can choose whether to implement them or not and open a Pull Request against the main branch.

Implementation:

It makes a Pull Request:

Great right? Well except it didn't do any of it right.

  • It didn't add a route to the Flask app to expose this information
  • It didn't stick with the convention of storing the information in JSON files, writing it out to Markdown for some reason
  • It decided the way that it was going to reveal this information was to add it to the README
  • Finally it didn't get anywhere near all the machine types.
Before you ping me yes I tried to change the Proposed plan

Baby Web App

So the app I've written here is primarily for my own use and it is very brain dead simple. The entire thing is the work of roughly an afternoon of poking around while responding to Slack messages. However I figured this would be a good example of maybe a more simple internal tool where you might trust AI to go a bit nuts since nothing critical will explode if it messes up.

The way the site works is it relies on the output of the gcloud CLI tool to generate JSON of all the IAM permissions for GCP, which I then sort into categories so I can quickly look for the one I want. I found the official documentation to be slow and hard to use, so I made my own. It's a Flask app, which means it is pretty stupid simple.

import os
from flask import *
from all_functions import *
import json


app = Flask(__name__)

@app.route('/')
def main():
    items = get_iam_categories()
    role_data = get_roles_data()
    return render_template("index.html", items=items, role_data=role_data)

@app.route('/all-roles')
def all_roles():
    items = get_iam_categories()
    role_data = get_roles_data()
    return render_template("all_roles.html", items=items, role_data=role_data)

@app.route('/search')
def search():
    items = get_iam_categories()
    return render_template('search_page.html', items=items)

@app.route('/iam-classes')
def iam_classes():
    source = request.args.get('parameter')
    items = get_iam_categories()
    specific_items = get_specific_roles(source)
    print(specific_items)
    return render_template("iam-classes.html", specific_items=specific_items, items=items)

@app.route('/tsid', methods=['GET'])
def tsid():
    data = get_tsid()
    return jsonify(data)

@app.route('/eu-eea', methods=['GET'])
def eueea():
    country_code = get_country_codes()
    return is_eea(country_code)


if __name__ == '__main__':
    app.run(debug=False)

I also have an endpoint I use when I need to test some specific GDPR code, so I can curl it and see whether the IP address is coming from the EU/EEA, along with a TSID generator I used for a brief period of testing and don't need anymore. So again, pretty simple. It could be rewritten to be much better but I'm the primary user and I don't care, so whatever.

So effectively what I want to add is another route where I would also have a list of all the GCP machine types because their official documentation is horrible and unreadable. https://cloud.google.com/compute/docs/machine-resource

What I'm looking to add is something like this: https://gcloud-compute.com/

Look how information packed it is! My god, I can tell at a glance if a machine type is eligible for Sustained Use Discounts, how many regions it is in, Hour/Spot/Month pricing and the breakout per OS along with Clock speed. If only Google had a team capable of making a spreadsheet.

Nothing I enjoy more than nested pages with nested submenus that lack all the information I would actually need. I'm also not clear what a Tier_1 bandwidth is but it does seem unlikely that it matters for machine types when so few have it.

I could complain about how GCP organizes information all day but regardless the information exists. So I don't need anything to this level, but could I make a simpler version of this that gives me some of the same information? Seems possible.

How I Would Do It

First let's try to stick with the gcloud CLI approach.

gcloud compute machine-types list --format="json"

Only problem with this is that while it does output the information I want, for some reason it outputs a separate entry for every single zone.

  {
    "creationTimestamp": "1969-12-31T16:00:00.000-08:00",
    "description": "4 vCPUs 4 GB RAM",
    "guestCpus": 4,
    "id": "903004",
    "imageSpaceGb": 0,
    "isSharedCpu": false,
    "kind": "compute#machineType",
    "maximumPersistentDisks": 128,
    "maximumPersistentDisksSizeGb": "263168",
    "memoryMb": 4096,
    "name": "n2-highcpu-4",
    "selfLink": "https://www.googleapis.com/compute/v1/projects/sybogames-artifact/zones/africa-south1-c/machineTypes/n2-highcpu-4",
    "zone": "africa-south1-c"
  }

I don't know why but sure. However I don't actually need every region so I can cheat here. gcloud compute machine-types list --format="json" gets me some of the way there.
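
A minimal sketch of that cheat, assuming you just want one entry per machine type name and don't care which zone it came from (the function name is mine):

import json
import subprocess

def unique_machine_types():
    # Run the same gcloud command as above and parse its JSON output.
    raw = subprocess.run(
        ["gcloud", "compute", "machine-types", "list", "--format=json"],
        capture_output=True, text=True, check=True,
    ).stdout
    seen = {}
    for entry in json.loads(raw):
        # The same machine type shows up once per zone; keep the first one we see.
        seen.setdefault(entry["name"], entry)
    return sorted(seen.values(), key=lambda e: e["name"])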

Where's the price?

Yeah so Google doesn't expose pricing through the API as far as I can tell. You can download what is effectively a global price list for your account at https://console.cloud.google.com/billing/[your billing account id]/pricing. That's a 13 MB CSV that includes what your specific pricing will be, which is what I would use. So then I would combine the information from my region with the information from the CSV and then output the values. However since I don't know whether the pricing I have is relevant to you, I can't really use this to generate a public webpage.

Web Scraping

So realistically my only option would be to scrape the pricing page here: https://cloud.google.com/compute/all-pricing. Except of course it was designed in such a way as to make it as hard to do that as possible.

Boy it is hard to escape the impression GCP does not want me doing large-scale cost analysis. Wonder why?

So there's actually a tool called gcosts which seems to power a lot of these sites running price analysis. However it relies on a pricing.yml file which is automatically generated weekly. The work involved in generating this file is not trivial:

 +--------------------------+  +------------------------------+
 | Google Cloud Billing API |  | Custom mapping (mapping.csv) |
 +--------------------------+  +------------------------------+
               ↓                              ↓
 +------------------------------------------------------------+
 | » Export SKUs and add custom mapping IDs to SKUs (skus.sh) |
 +------------------------------------------------------------+
               ↓
 +----------------------------------+  +-----------------------------+
 | SKUs pricing with custom mapping |  | Google Cloud Platform info. |
 |             (skus.db)            |  |           (gcp.yml)         |
 +----------------------------------+  +-----------------------------+
                \                             /
         +--------------------------------------------------+
         | » Generate pricing information file (pricing.pl) |
         +--------------------------------------------------+
                              ↓
                +-------------------------------+
                |  GCP pricing information file |
                |          (pricing.yml)        |
                +-------------------------------+

Alright so looking through the GitHub Action that generates this pricing.yml file, here, I can see how it works and how the file is generated. But I can also just skip that part and pull the latest version for my use case whenever I regenerate the site. That can be found here.

Effectively with no assistance from AI, I have now figured out how I would do this (a rough sketch in code follows the list):

  • Pull down the pricing.yml file and parse it
  • Take that information and output it to a simple table structure
  • Make a new route on the Flask app and expose that information
  • Add a step to the Dockerfile to pull in the new pricing.yml with every Dockerfile build just so I'm not hammering the GitHub CDN all the time.
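
A rough sketch of those steps in Flask, just to show the shape of it. I haven't verified the schema of pricing.yml, so the nested key names below are placeholder assumptions, as are the template name and default region; in the real version the file would be pulled at Docker build time rather than per request.

import requests
import yaml
from flask import Flask, render_template

PRICING_URL = (
    "https://raw.githubusercontent.com/Cyclenerd/"
    "google-cloud-pricing-cost-calculator/master/pricing.yml"
)

app = Flask(__name__)

def load_pricing():
    # In practice, bake this into the image at build time instead of fetching per request.
    resp = requests.get(PRICING_URL, timeout=30)
    resp.raise_for_status()
    return yaml.safe_load(resp.text)

def machine_rows(pricing, region="europe-west1"):
    rows = []
    # Placeholder structure: pricing["compute"]["instance"][name]["cost"][region]["hour"]
    instances = pricing.get("compute", {}).get("instance", {})
    for name, info in sorted(instances.items()):
        hour = info.get("cost", {}).get(region, {}).get("hour")
        rows.append({"name": name, "region": region, "hour": hour})
    return rows

@app.route("/machine-types")
def machine_types():
    # New route exposing a simple table of machine types and hourly prices.
    return render_template("machine_types.html", rows=machine_rows(load_pricing()))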

Why Am I Saying All This?

So this is a perfect example of an operation that should be simple but, because the vendor doesn't want to make it simple, is actually pretty complicated. As we can now tell from the PR generated before, AI is never going to be able to understand all the steps we just walked through to figure out how one actually gets the prices for these machines. We've also learned that because of the hard work of someone else, we can skip a lot of the steps. So let's try it again.

Attempt 2

Maybe if I give it super specific information, it can do a better job.

You can see the issue here: https://github.com/matdevdug/gcp-iam-reference/issues/4

I think I've explained maybe what I'm trying to do. Certainly a person would understand this. Obviously this isn't the right way to organize this information, I would want to do a different view and sort by region and blah blah blah. However this should be easier for the machine to understand.

Note: I am aware that Copilot has issues making calls to the internet to pull files, even from GitHub itself. That's why I've tried to include a sample of the data. If there's a canonical way to pass the tool information inside of the issue let me know at the link at the bottom.

Results

So at first things looked promising.

It seems to understand what I'm asking and why I'm asking it. This is roughly the correct thing. The plan also looks ok:

You can see the PR it generated here: https://github.com/matdevdug/gcp-iam-reference/pull/5

So this is much closer but it's still not really "right". First like most Flask apps I have a base template that I want to include on every page: https://github.com/matdevdug/gcp-iam-reference/blob/main/templates/base.html

Then for every HTML file after that we extend the base:

{% extends "base.html" %}

{% block main %}

<style>
        table {
            border-collapse: collapse;
            width: 100%;
        }

        th, td {
            border: 1px solid #dddddd;
            text-align: left;
            padding: 8px;
        }

        tr:nth-child(even) {
            background-color: #f2f2f2;
        }
</style>

The AI doesn't understand that we've done this and is just re-implementing Bootstrap: https://github.com/matdevdug/gcp-iam-reference/pull/5/files#diff-a8e8dd2ad94897b3e1d15ec0de6c7cfeb760c15c2bd62d828acba2317189a5a5

It's not adding it to the menu bar; there are actually a lot of pretty basic misses here. I wouldn't accept this PR from a person, but let's see if it works!

 => ERROR [6/8] RUN wget https://raw.githubusercontent.com/Cyclenerd/google-cloud-pricing-cost-calculator/master/pricing.yml -O pricing.yml                                             0.1s
------
 > [6/8] RUN wget https://raw.githubusercontent.com/Cyclenerd/google-cloud-pricing-cost-calculator/master/pricing.yml -O pricing.yml:
0.104 /bin/sh: 1: wget: not found

No worries, easy to fix.

Alright fixed wget, let's try again!

2024-06-18 11:18:57   File "/usr/local/lib/python3.12/site-packages/gunicorn/util.py", line 371, in import_app
2024-06-18 11:18:57     mod = importlib.import_module(module)
2024-06-18 11:18:57           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-18 11:18:57   File "/usr/local/lib/python3.12/importlib/__init__.py", line 90, in import_module
2024-06-18 11:18:57     return _bootstrap._gcd_import(name[level:], package, level)
2024-06-18 11:18:57            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
2024-06-18 11:18:57   File "<frozen importlib._bootstrap_external>", line 995, in exec_module
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
2024-06-18 11:18:57   File "/app/main.py", line 2, in <module>
2024-06-18 11:18:57     import yaml
2024-06-18 11:18:57 ModuleNotFoundError: No module named 'yaml'

Yeah I did anticipate this one. Alright let's add PyYAML so there's something to import. I'll give AI a break on this one, this is a dumb Python thing.

Ok so it didn't add it to the menu, it didn't follow the style conventions, but did it at least work? Also no.

I'm not sure how it could have done a worse job to be honest. I understand what it did wrong and why this ended up like it did, but the work involved in fixing it exceeds the amount of work it would take for me to do it myself from scratch. The point of this was to give it a pretty simple concept (parse a YAML file) and see what it did.

Conclusion

I'm sure this tool is useful to someone on Earth. That person probably hates programming and gets no joy out of it, looking for something that could help them spend less time doing it. I am not that person. Having a tool that makes stuff that looks right but ends up broken is worse than not having the tool at all.

If you are a person maintaining an extremely simple thing with amazing test coverage, I guess go for it. Otherwise this is just a great way to get PRs that look right and completely waste your time. I'm sure there are ways to "prompt engineer" this better and if someone wants to tell me what I could do, I'm glad to re-run the test. However as it exists now, this is not worth using.

If you want to use it, here are my tips:

  • Your source of data must be inside of the repo, it doesn't like making network calls
  • It doesn't seem to go check any sort of requirements file for Python, so assume the dependencies are wrong
  • It understands Dockerfiles but doesn't check whether a binary is present, so add that check yourself
  • It seems to do better with JSON than YAML

Questions/comments/concerns: https://c.im/@matdevdug


Simple Kubernetes Secret Encryption with Python

I was recently working on a new side project in Python with Kubernetes and I needed to inject a bunch of secrets. The problem with secret management in Kubernetes is you end up needing to set up a lot of it yourself and it's time consuming. When I'm working on a new idea, I typically don't want to waste a bunch of hours setting up "the right way" to do something that isn't related to the core of the idea I'm trying out.

For the record, the right way to do secrets in Kubernetes is the following:

  • Turn on encryption at rest for ETCD
  • Carefully set up RBAC inside of Kubernetes to ensure the right users and service accounts can access the secrets
  • Give up on trying to do that and end up setting up Vault or paying your cloud provider for their Secret Management tool
  • There is a comprehensive list of secret managers and approaches outlined here: https://www.argonaut.dev/blog/secret-management-in-kubernetes

However, especially when you are just trying ideas out, I wanted something more idiot-proof that didn't require any setup. So I wrote something simple with Python Fernet encryption that I thought might be useful to someone else out there.

You can see everything here: https://gitlab.com/matdevdug/example_kubernetes_python_encryption

Walkthrough

So the script works in a pretty straightforward way. It reads the .env file you generate as outlined in the README, with secrets in the following format:

Make a .env file with the following parameters:

KEY=Make a fernet key: https://fernetkeygen.com/
CLUSTER_NAME=name_of_cluster_you_want_to_use
SECRET-TEST-1=9e68b558-9f6a-4f06-8233-f0af0a1e5b42
SECRET-TEST-2=a004ce4c-f22d-46a1-ad39-f9c2a0a31619

The KEY is the secret key and the CLUSTER_NAME tells the Kubernetes library what kubeconfig target you want to use. Then the tool finds anything with the word SECRET in the .env file and encrypts it, then writes it to the .csv file.

The .csv file looks like the following:

I really like to keep some sort of record of what secrets are injected into the cluster outside of the cluster just so you can keep track of the encrypted values. Then the script checks the namespace you selected to see if there are secrets with that name already and, if not, injects it for you.
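
For a sense of the moving parts, here is a condensed sketch of that flow; the real script lives in the repo linked above, and details like using python-dotenv, the "default" namespace, the CSV path and the "value" data key are simplifications of mine.

import base64
import csv
from cryptography.fernet import Fernet
from dotenv import dotenv_values
from kubernetes import client, config
from kubernetes.client.rest import ApiException

env = dotenv_values(".env")
fernet = Fernet(env["KEY"])
config.load_kube_config(context=env["CLUSTER_NAME"])  # assuming CLUSTER_NAME maps to a kubeconfig context
api = client.CoreV1Api()
namespace = "default"  # the real script lets you pick the namespace

with open("secrets.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for name, value in env.items():
        if "SECRET" not in name:
            continue
        token = fernet.encrypt(value.encode())       # Fernet-encrypt the raw value
        encoded = base64.b64encode(token).decode()   # the k8s 'data' field wants base64
        k8s_name = name.lower()                      # Secret names must be lowercase
        writer.writerow([k8s_name, encoded])         # keep a record outside the cluster
        body = client.V1Secret(
            metadata=client.V1ObjectMeta(name=k8s_name),
            data={"value": encoded},
            immutable=True,
        )
        try:
            api.read_namespaced_secret(k8s_name, namespace)   # already there? skip it
        except ApiException as e:
            if e.status == 404:
                api.create_namespaced_secret(namespace, body)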

Some quick notes about the script:

  • Secret names in Kubernetes need a specific format: lowercase, with words separated by - or . characters. The script will take the uppercase names in the .env and convert them to lowercase. Just be aware it is doing that.
  • It does base64 encode the secret before it uploads it, so be aware that your application will need to decode it when it loads the secret.
  • Now the only secret you need to worry about is the Fernet secret that you can load into the application in a secure way. I find this is much easier to mentally keep track of than trying to build an infinitely scalable secret solution. Plus it's cheaper since many secret managers charge per secret.
  • The secrets are immutable which means they are lightweight on the k8s API and fast. Just be aware you'll need to delete the secrets if you need to replace them. I prefer this approach because I'd rather store more things as encrypted secrets and not worry about load.
  • It is easy to specify which namespace you intend to load the secrets into and I recommend using a different Fernet secret per application.
  • Mounting the secret works like it always does in k8s
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: your-image:latest
      volumeMounts:
        - name: secret-volume
          mountPath: /path/to/secret/data
  volumes:
    - name: secret-volume
      secret:
        secretName: my-secret

Inside of your application, you need to load the Fernet secret and decrypt the secrets. With Python that is pretty simple.

decrypt = fernet.decrypt(token)
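
Fleshing that out a little, a minimal consuming-side sketch might look like this, assuming the Fernet key arrives via an environment variable (the variable name is mine) and the secret is mounted at the path from the Pod spec above under a "value" key:

import os
from cryptography.fernet import Fernet

fernet = Fernet(os.environ["FERNET_KEY"])  # hypothetical env var holding the Fernet key

with open("/path/to/secret/data/value", "rb") as f:
    token = f.read()

# Depending on how the secret was written, you may need a base64.b64decode()
# here first, as noted above, before handing the token to Fernet.
plaintext = fernet.decrypt(token).decode()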

Q+A

  • Why not SOPS? This is easier and also handles the process of making the API call to your k8s cluster to make the secret.
  • Is Fernet secure? As far as I can tell it's secure enough. Let me know if I'm wrong.
  • Would you make a CLI for this? If people actually use this thing and get value out of it, I would be more than happy to make it a CLI. I'd probably rewrite it in Golang if I did that, so if people ask it'll take me a bit of time to do it.

Questions/comments/concerns: https://c.im/@matdevdug


The Worst Website In The Entire World

The Worst Website In The Entire World

What if you set out to make the worst website you possibly could? So poorly designed and full of frustrating patterns that users would not only hate the experience of using this website, but would also come to hate your company. Could we make a web experience so terrible that it would express how much our company hated our users?

As a long-time Internet addict, I've encountered my fair share of terrible websites. There's Instagram, where half my feed is now advertisements for stupid t-shirts and the other half is empty black space.

Who in the fuck would ever wear this

Or ARNGREN.net, which is like if a newspaper ad threw up on my screen.

But Instagram still occasionally shows me pictures of people I follow and ultimately the stuff on ARNGREN is so cool I still want to buy it regardless of the layout.

No, I believe it is the crack team at Broadcom that have nailed it for the worst website in existence.

Lured in with free VMware

So through social media I discovered this blog post from VMware announcing that their popular virtualization software is now free for personal use. You can read that here. Great: I've used VMware Fusion before and it was OK, and maybe it will let me run Windows on an ARM Mac. Probably not, but let's try it out and see.

This means that everyday users who want a virtual lab on their Mac, Windows or Linux computer can do so for free simply by registering and downloading the latest build from the new download portal located at support.broadcom.com. With the new commercial model, we have reduced our product group offerings down to a single SKU (VCF-DH-PRO) for users who require commercial use licensing. This simplification eliminates 40+ other SKUs and makes quoting and purchasing VMware Desktop Hypervisor apps, Fusion Pro and Workstation Pro, easier than ever. The new Desktop Hypervisor app subscription can be purchased from any Broadcom Advantage partner.

I don't want to register at support.broadcom.com but it looks like I don't have a choice as this is the screen on the VMware site.

Now this is where the alarm bells start going crazy in my head. Nothing about this notice makes sense. "The store will be moving to a new domain". So it's...not...down for maintenance but actually is just gone? Or is it actually coming back? Because then you say "store will be shutdown" (just a quick note: you want "the store" and "will be shutting down on April 30th 2024"). Also, why don't you just redirect to the new domain? What is happening here?

Broadcom

So then I go to support.broadcom.com which is where I was told to register and make an account.

Never a great sign when there's a link to an 11 page PDF of how to navigate your website. That's the "Learn how to navigate Broadcom Support" link. You can download that killer doc here: https://support.broadcom.com/documents/d/ecx/broadcom-support-portal-getting-started-guide

Alright let's register.

First, the sentence "Enhance your skills through multiple self-service avenues by creating your Broadcom Account" leaps off the page as pure corporate nonsense. I've also never seen a less useful CAPTCHA; it looks like it is from 1998 and any modern text recognition software would defeat it. In fact, the Mac text recognition in Preview defeats 3 of the 4 characters with no additional work:

So, completely pointless and user hostile. Lots of points scored for the worst website ever. I'm also going to give some additional points for "Ask our chatbot for assistance", an idea so revolting that normally I'd just give up entirely. But of course I'm curious, so I click on the link for "Ask our chatbot" and.....

It takes me back to the main page.

Slow clap, Broadcom. Imagine being a customer so frustrated with the support portal that you actually click "Ask a chatbot", and the web developers at Broadcom come by and karate chop you right in the throat. Bravo. Now, in Broadcom's defense, there IS a chatbot icon in the corner, so I kinda see what happened here. Let's ask it a question.

I didn't say hello. I don't know why it decided I said hello to it. But in response to VMware it gives me this:

Did the chatbot just tell me to go fuck myself? Why did you make a chatbot if all you do is select a word from a list and it returns the link to the support doc? Would I like to "Type a Query"?? WHAT IS A CHATBOT IF NOT TYPING QUERIES?


Next Steps

I fill in the AI-proof CAPTCHA and hit next, only to be greeted with the following screen for 30 seconds.

Finally I'm allowed to make my user account.

Um....alright....seems like overkill, Broadcom, but you know what, this is your show. I have 1Password, so this won't be a problem. It's not letting me copy/paste from 1Password into this field, but if I do Command + \ it seems to let me insert. Then I get this.

What are you doing to me, Broadcom? Did I....wrong you in some way? I don't understand what is happening. OK, well, I refresh the page, try again, and it works this time. Except I can't copy/paste into the Confirm Password field.

I mean, they can't expect me to type out the impossibly complicated password they just had me generate, right? Except they do, and they've added a check to ensure that I don't disable JavaScript and treat it like a normal HTML form.

Hey front-end folks, just a quick note. Never ever ever ever ever mess with my browser. It's not yours, it's mine. I'm letting you use it for free to render your bloated sites. Don't do this to me. I get to copy paste whatever I want whenever I want. When you get your own browser you can do whatever you want but while you are living in my house under my rules I get to copy/paste whenever I goddamn feel like it.

Quickly losing enthusiasm for the idea of VMware

So after pulling up the password and typing it in, I'm treated to this absolutely baffling screen.

Do I need those? I feel like I might need those. eStore at least sounds like something I might want. I don't really want Public Semiconductors Case Management but I guess that one comes in the box. 44 seconds of this icon later

I'm treated to the following.

Broadcom, you clever bastards. Just when I thought I was out, they pulled me back in. Tricking users into thinking a link is going to help them and then telling them to get fucked by advising them to contact your sales rep? Genius.

So then I hit cancel and get bounced back to......you guessed it!

Except I'm not even logged into my newly created account. So then I go to log in with my new credentials and I finally make it to my customer portal. Well, no, first they need to redirect me back to the Broadcom Support main page again, with new icons.

Apparently my name was too long to show, and instead of fixing that or only showing my first name, Broadcom wanted to ensure the disrespect continued and just let it sort of trail off. Whatever, I'm finally in the Matrix.

Now where might I go to...actually download some VMware software. There's a search bar that says "Search the entire site", let's start there!

Nothing found except for a CVE. Broadcom, you are GOOD! For a second I thought you were gonna help me, and like Lucy with the football, you made me eat shit again.


My Downloads was also unhelpful.

But maybe I can add the entitlement to the account? Let's try All Products.

Of course the link doesn't work. What was I even thinking trying that? That one is really on me. However "All Products" on the left-hand side works and finally I find it. My white whale.

Except when I click on product details I'm brought back to....

The blank page with no information! Out of frustration I click on "My Downloads" again which is now magically full of links! Then I see it!

YES. Clicking on it I get my old buddy the Broadcom logo for a solid 2 minutes 14 seconds.

Now I have fiber internet with 1000 down, so this has nothing to do with me. Finally I click the download button and I get.....the Broadcom logo again.

30 seconds pass. 1 minute passes. 2 minutes pass. I'm not sure what to do.

No. No you piece of shit website. I've come too far and sacrificed too much of my human dignity. I am getting a fucking copy of VMware Fusion. Try 2 is the same thing. 3, 4, 5 all fail. Then finally.

I install it and like a good horror movie, I think it's all over. I've killed Jason. Except when I'm installing Windows I see this little link:

And think "wow I would like to know what the limitations are for Windows 11 for Arm!". Click on it and I'm redirected to...

Just one final fuck you from the team at Broadcom.

Conclusion

I've used lots of bad websites in my life. Hell, I've made a lot of bad websites in my life. But never before have I seen a website that so completely expresses pure hatred of its users like this one. Everything was as poorly designed as possible, with user-hostile design around every corner.

Honestly, Broadcom, I don't even know why you bothered buying VMware. It's impossible for anyone to ever get this product from you. Instead of migrating from the VMware store to this disaster, maybe just shut it down entirely. Destroy the backups of this dumpster fire and start fresh. Maybe just consider a Shopify site, because at least then an average user might have a snowball's chance in hell of ever finding something to download from you.

Do you know of a worse website? I want to see it. https://c.im/@matdevdug


The Time Linkerd Erased My Load Balancer

The Time Linkerd Erased My Load Balancer

A cautionary tale of K8s CRDs and Linkerd.

A few months ago I had the genius idea of transitioning our production load balancer stack from Ingress to Gateway API in k8s. For those unaware, Ingress is the classic way of writing a configuration that tells a load balancer which routes should hit which services, effectively how you expose services to the Internet. Gateway API is the re-imagined process for doing this, where the problem domain is scoped, allowing teams more granular control over their specific services' routes.

Ingress

apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: external-lb
spec:
  controller: example.com/ingress-controller
  parameters:
    apiGroup: k8s.example.com
    kind: IngressParameters
    name: external-lb

This is what defining the load balancer class looks like with Ingress.


After conversations with various folks at GCP it became clear to me that while Ingress wasn't deprecated or slated to be removed, Gateway API is where all the new development and features are moving to. I decided that we were a good candidate for the migration since we are a microservice based backend with lower and higher priority hostnames, meaning we could safely test the feature without cutting over all of our traffic at the same time.

I had this idea that we would turn on both Ingress and Gateway API and then cut between the two different IP addresses at the Cloudflare level. From my low-traffic testing this approach seemed to work OK, with me being able to switch between the two and then letting Gateway API bake for a week or two to shake out any problems. Then I decided to move to prod. Due to the lack of issues in the lower environments, I decided that I wouldn't set up Cloudflare load balancing between the two and would instead manage the cut-over in Terraform. This turned out to be a giant mistake.

The long and short of it is that the combination of Gateway API and Linkerd in GKE fell down under a high volume of requests. At low request volume there were no problems, but once we got to around 2k requests a second, the linkerd-proxy sidecar container's memory usage started to grow unbounded. When I attempted to cut back from Gateway API to Ingress, I encountered a GKE bug I hadn't seen before in the lower environments.

"Translation failed: invalid ingress spec: service "my_namespace/my_service" is type "ClusterIP", expected "NodePort" or "LoadBalancer";

What we were seeing was a mismatch between the annotations automatically added by GKE.

Ingress adds these annotations:  
cloud.google.com/neg: '{"ingress":true}'
cloud.google.com/neg-status: '{"network_endpoint_groups":{"80":"k8s1pokfef..."},"zones":["us-central1-a","us-central1-b","us-central1-f"]}'


Gateway adds these annotations:
cloud.google.com/neg: '{"exposed_ports":{"80":{}}}'
cloud.google.com/neg-status: '{"network_endpoint_groups":{"80":"k8s1-oijfoijsdoifj-..."},"zones":["us-central1-a","us-central1-b","us-central1-f"]}'

Gateway doesn't understand the Ingress annotations and vice-versa. This obviously caused a massive problem and blew up in my face. I had thought I had tested this exact failure case, but clearly prod surfaced a different behavior than I had seen in lower environments. Effectively no traffic was getting to pods while I tried to figure out what had broken.

I ended up having to manually modify the annotations to get things working, and had a pretty embarrassing blow-up in my face after what I had thought was careful testing (but clearly wasn't).
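
For the record, "manually modify" just means putting the NEG annotation back into the form the active controller expects on each affected Service. As a hedged illustration (not the exact fix I ran, and the Service name is made up), the equivalent with the kubernetes Python client would be something like:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Overwrite the NEG annotation with the form the Ingress controller expects
patch = {"metadata": {"annotations": {"cloud.google.com/neg": '{"ingress":true}'}}}
v1.patch_namespaced_service(name="my-service", namespace="my-namespace", body=patch)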

Fast Forward Two Months

I had learned from my mistake regarding Gateway API and Ingress, and everything was functioning totally fine on Gateway API when I decided to attempt to solve the Linkerd issue. The issue I was seeing with Linkerd was that high-volume services' proxies were consuming unlimited memory, steadily growing over time, but only while on Gateway API. I was installing Linkerd with their Helm charts, which have two components: the Linkerd CRD chart here: https://artifacthub.io/packages/helm/linkerd2/linkerd-crds and the Linkerd control plane: https://artifacthub.io/packages/helm/linkerd2/linkerd-control-plane

Since debug logs and upgrades hadn't gotten me any closer to a solution as to why the proxies were consuming unlimited memory until they eventually were OOMkilled, I decided to start fresh. I removed the Linkerd injection from all deployments and removed the helm charts. Since this was a non-prod environment, I figured at least this way I could start fresh with debug logs and maybe come up with some justification for what was happening.

Except the second I uninstalled the charts, my graphs started to freak out. I couldn't understand what was happening. How did removing Linkerd break something? Did I have some policy set to require Linkerd? Why were my traffic levels quickly approaching zero in the non-prod environment?

Then a coworker said "oh it looks like all the routes are gone from the load balancer". I honestly hadn't even thought to look there, assuming the problem was some misaligned Linkerd policy where our deployments required encryption to communicate, or some mistake on my part in the removal of the Helm charts. But they were right, the load balancers didn't have any routes. kubectl confirmed it: no HTTPRoutes remained.

So of course I was left wondering "what just happened".

Gateway API

So, a quick crash course in "what is Gateway API". At a high level, as discussed before, it is a new way of defining Ingress that cleans up the annotation mess and allows for a clean separation of responsibility in an org.


So GCP defines the GatewayClass, I create the Gateway, and developers provide the HTTPRoutes. This means developers can safely change the routes to their own services without the risk that they will blow up the load balancer. It also provides a ton of great customization for how to route traffic to a specific service.


So first you make a Gateway like so in Helm or whatever:

---
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: {{ .Values.gateway_name }}
  namespace: {{ .Values.gateway_namespace }}
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        kinds:
        - kind: HTTPRoute
        namespaces:
          from: Same
    - name: https
      protocol: HTTPS
      port: 443
      allowedRoutes:
        kinds:
          - kind: HTTPRoute
        namespaces:
          from: All
      tls:
        mode: Terminate
        options:
          networking.gke.io/pre-shared-certs: "{{ .Values.pre_shared_cert_name }},{{ .Values.internal_cert_name }}"

Then you provide a different YAML of HTTPRoute for the redirect of http to https:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: redirect
  namespace: {{ .Values.gateway_namespace }}
spec:
  parentRefs:
  - namespace: {{ .Values.gateway_namespace }}
    name: {{ .Values.gateway_name }}
    sectionName: http
  rules:
  - filters:
    - type: RequestRedirect
      requestRedirect:
        scheme: https

Finally you can set policies.

---
apiVersion: networking.gke.io/v1
kind: GCPGatewayPolicy
metadata:
  name: tls-ssl-policy
  namespace: {{ .Values.gateway_namespace }}
spec:
  default:
    sslPolicy: tls-ssl-policy
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: {{ .Values.gateway_name }}

Then your developers can configure traffic to their services like so:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: store
spec:
  parentRefs:
  - kind: Gateway
    name: internal-http
  hostnames:
  - "store.example.com"
  rules:
  - backendRefs:
    - name: store-v1
      port: 8080
  - matches:
    - headers:
      - name: env
        value: canary
    backendRefs:
    - name: store-v2
      port: 8080
  - matches:
    - path:
        value: /de
    backendRefs:
    - name: store-german
      port: 8080

Seems Straightforward

Right? There isn't that much to the thing. So after I attempted to re-add the HTTPRoutes using Helm and Terraform (which of course didn't detect a diff even though the routes were gone, because Helm never seems to do what I want it to do in a crisis) and then ended up bumping the chart version to finally force it to do the right thing, I started looking around. What the hell had I done to break this?

When I removed the Linkerd CRDs, it had somehow taken out my HTTPRoutes. So I went to the Helm chart, trying to work backwards. Immediately I see this:

{{- if .Values.enableHttpRoutes }}
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    api-approved.kubernetes.io: https://github.com/kubernetes-sigs/gateway-api/pull/1923
    gateway.networking.k8s.io/bundle-version: v0.7.1-dev
    gateway.networking.k8s.io/channel: experimental
    {{ include "partials.annotations.created-by" . }}
  labels:
    helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    linkerd.io/control-plane-ns: {{.Release.Namespace}}
  creationTimestamp: null
  name: httproutes.gateway.networking.k8s.io
spec:
  group: gateway.networking.k8s.io
  names:
    categories:
    - gateway-api
    kind: HTTPRoute
    listKind: HTTPRouteList
    plural: httproutes
    singular: httproute
  scope: Namespaced
  versions:

Sure enough, the Linkerd CRD Helm chart has enableHttpRoutes set to true by default.

I also found this issue: https://github.com/linkerd/linkerd2/issues/12232

So yeah, Linkerd is, for some reason, pulling this CRD from a pull request from April 6th of last year that is marked as "do not merge". https://github.com/kubernetes-sigs/gateway-api/pull/1923

Linkerd is aware of the possible problem but presumes you'll catch the configuration option on the Helm chart: https://github.com/linkerd/linkerd2/issues/11586

To be clear I'm not "coming after Linkerd" here. I just thought the whole thing was extremely weird and wanted to make sure, given the amount of usage Linkerd gets out there, that other people were made aware of it before running the car into the wall at 100 MPH.

What are CRDs?

Kubernetes Custom Resource Definitions (CRDs) essentially extend the Kubernetes API to manage custom resources specific to your application or domain.

  • CRD Object: You create a YAML manifest file defining the Custom Resource Definition (CRD). This file specifies the schema, validation rules, and names of your custom resource.
  • API Endpoint: When you deploy the CRD, the Kubernetes API server creates a new RESTful API endpoint for your custom resource.
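
That second point is the part that matters here: once the CRD exists, HTTPRoute is just another resource the API server will happily serve, no matter which chart installed the definition. As a sketch with the kubernetes Python client (group and version taken from the manifests earlier in this post):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# The CRD above creates this endpoint:
# /apis/gateway.networking.k8s.io/v1beta1/namespaces/<ns>/httproutes
routes = api.list_namespaced_custom_object(
    group="gateway.networking.k8s.io",
    version="v1beta1",
    namespace="default",
    plural="httproutes",
)
for item in routes["items"]:
    print(item["metadata"]["name"])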

Effectively, when I enabled Gateway API in GKE with the following, I hadn't considered that I could end up in a CRD conflict state with Linkerd:

  gcloud container clusters create CLUSTER_NAME \
    --gateway-api=standard \
    --cluster-version=VERSION \
    --location=CLUSTER_LOCATION

What I suspect happened is that, since I had Linkerd installed before I enabled gateway-api on GKE, when GCP attempted to install the CRD it failed silently. Since I didn't know there was a CRD conflict, I didn't understand that the CRD the HTTPRoutes relied on was actually the Linkerd-maintained one, not the GCP one. Presumably, had I attempted to do this the other way around, it would have thrown an error when the Helm chart tried to install a CRD that was already present.
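
Had I checked who actually owned the CRD before touching the Linkerd charts, the conflict would have been obvious. A quick sketch of that check (again with the kubernetes Python client, purely as an illustration):

from kubernetes import client, config

config.load_kube_config()
ext = client.ApiextensionsV1Api()

crd = ext.read_custom_resource_definition("httproutes.gateway.networking.k8s.io")
# The Linkerd chart stamps its labels and annotations on the CRD (see the template above),
# which is the giveaway that removing the chart removes the CRD and every HTTPRoute with it.
print(crd.metadata.labels)       # e.g. helm.sh/chart, linkerd.io/control-plane-ns
print(crd.metadata.annotations)  # e.g. gateway.networking.k8s.io/channel, bundle-version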

To be clear before you call me an idiot, I am painfully aware that the deletion of CRDs is a dangerous operation. I understand I should have carefully checked and I am admitting I didn't in large part because it just never occurred to me that something like Linkerd would do this. Think of my failure to check as a warning to you, not an indictment against Kubernetes or whatever.

Conclusion

If you are using Linkerd and Helm and intend to use Gateway API, this is your warning to go in there right now and flip enableHttpRoutes in the linkerd-crds Helm chart to false. Learn from my mistake.

Questions/comments/concerns: https://c.im/@matdevdug