For as long as I’ve been around tech enthusiasts, there has been a recurring “decentralization dream.” While the specifics evolve, the essence remains the same: everyone would own a domain name and host their digital identity. This vision promises that people, liberated from the chore of digital maintenance, would find freedom in owning their slice of the internet. The basic gist is at some point people would wake up to how important online services are to them and demand some ownership over how they work.
This idea, however, always fails. From hosting email and simple HTML websites in my youth to the current attempts at decentralized Twitter- or YouTube-like platforms, the tech community keeps waiting for everyday people to take the baton of self-hosting. They never will—because the effort and cost of maintaining self-hosted services far exceed the skill and interest of the audience. The primary “feature” of self-hosting is, for most, a fatal flaw: it’s a chore. It’s akin to being “free” to change the oil in your car—it’s an option, but not a welcome one for most.
Inadvertently, self-hosting advocates may undermine their own goal of better privacy and data ownership. By promoting open-source, self-hosted tools as the solution for those concerned about their privacy, they provide an escape valve for companies and regulators alike. Companies can claim, “Well, if users care about privacy, they can use X tool.” This reduces the pressure for legal regulation. Even Meta’s Threads, with its integration of ActivityPub, can claim to be open and competitive, deflecting criticism and regulation—despite this openness largely flowing from Threads to ActivityPub and not the other way around.
What people actually need are laws. Regulations like the GDPR must become the international standard for platforms handling personal data. These laws ensure a basic level of privacy and data rights, independent of whether a judge forces a bored billionaire to buy your favorite social network. Suggesting self-hosting as a solution in the absence of such legal protections is as naive as believing encrypted messaging platforms alone can protect you from government or employer overreach.
What do users actually deserve?
We don't need to treat this as a hypothetical. What citizens in the EU get is the logical "floor" of what citizens around the world should demand.
Right to Access
What data do you have on me?
How long do you keep it?
Why do you have it? What purpose does it serve?
Right to Rectification
Fix errors in your personal data.
Right to be Forgotten
There's no reason a platform should get to keep your contributions forever after you leave.
Right to Data Portability
Transfer your data to another platform in a standardized machine-readable format.
Right to Withdraw Consent
Opt out of data collection whenever you want, even if you originally agreed.
These are not all GDPR rights, but they form the backbone of what allows users to engage with platforms confidently, knowing they have levers to control their data. Regulations like these are binding and create accountability—something neither self-hosting nor relying on tech billionaires can achieve.
Riding this roller coaster of "I need digital platforms to provide me essential information and access" and trying to balance it with "whatever rich bored people are doing this week" has been a disaster. It's time to stop pretending these companies are our friends and force them to do the things they say they'll do when they're attempting to attract new users.
The fallacies of decentralization as a solution
The decentralization argument often assumes that self-hosted platforms or volunteer-driven networks are inherently superior. But this isn’t practical:
Self-hosting platforms are fragile.
Shutting down a small self-hosted platform running on a VPS provider is pretty trivial. These are basically paid for by one or two people and they would be insane to fight any challenge, even a bad one. How many self-hosted platforms would stand up to a threatening letter from a lawyer, much less an actual government putting pressure on their hosting provider?
Even without external pressure there isn't any practical way to fund these efforts. You can ask for donations, but that's not a reliable source of revenue for a cost that will only grow over time. At a certain size the maintainer will need to form a nonprofit in order to continue collecting the donations, a logistical and legal challenge well outside of the skillset of the people we're talking about.
It's effectively free labor. You are taking a job, running a platform, removing the pay for that job, and adding in all the complexity of running a nonprofit plus the joys of being the CPA, the CEO and the sysadmin. At some point people get sick, or they lose interest.
Decentralization doesn’t replace regulation.
While decentralization aligns with the internet’s original ethos, it doesn’t negate the need for legal protections. Regulations like GDPR raise the minimum level of privacy and security, while decentralization remains an optional enhancement. You lose nothing by moving the floor up.
Regulation is not inherently bad.
A common refrain among technical enthusiasts is a libertarian belief that market pressures and having a superior technical product will "win out" and legislation is bad because it constrains future development. You saw this a lot in the US tech press over the EU move from proprietary chargers to USB-C, a sense of "well when the next big thing comes we won't be able to use it because of silly government regulation".
Global legislation forces all companies—not just a niche few catering to privacy enthusiasts—to respect users’ rights. Unlike market-driven solutions or self-hosting, laws are binding and provide universal protections.
It is impossible for an average user to keep track of who owns which platforms and what their terms of service are now. Since they can be changed with almost no notice, whatever "protections" they can provide are laughably weak. In resisting legislation you make the job of large corporations easier, not harder.
The reality of privacy as a privilege
Right now, privacy often depends on technical skills, financial resources, or sheer luck:
• I value privacy and have money: You can pay for premium platforms like Apple or Fastmail. These platforms could change the rules whenever they want to but likely won't because their entire brand is based on the promise of privacy.
• I value privacy and have technical skills: You can self-host and manage your own services.
• I value privacy but lack money and technical skills: You’re left hoping that volunteers or nonprofits continue offering free tools—and that they don’t disappear overnight. Or you try to keep abreast of a constantly churning ecosystem where companies change hands all the time and the rules change whenever they want.
This is a gatekeeping problem. Privacy should not be a luxury or dependent on arbitrary skill sets. Everyone deserves it.
It actually makes a difference
As someone who has experienced the difference between the U.S. and the EU’s approach to privacy, I can attest to how much better life is with stronger regulations. GDPR isn’t perfect, but it provides a foundation that improves quality of life for everyone. Instead of treating regulation as burdensome or unrealistic, we should view it as essential.
The dream of a decentralized internet isn’t inherently wrong, but waiting for it to materialize as a universal solution is a mistake. Laws—not utopian ideals—are the only way to ensure that users everywhere have the protections they deserve. It’s time to stop pretending companies will prioritize ethics on their own and instead force them to.
Every few years I will be on a team and the topic of quantum computing will come up. Inevitably the question will get asked "well is there something we are supposed to be doing about that or is it just a looming threat?" We will all collectively stare at each other and shrug, then resume writing stuff exactly like we were writing it before.
In 2024 it would be hard to make a strong justification for worrying a lot about post-quantum cryptography in a world where your most likely attack vector is someone breaking into your company Slack and just asking for access to something. However it is a question developers like to worry about because it involves a lot of math and cool steampunk-looking computers. It's definitely a more interesting problem than how to get everyone to stop blindly approving access to the company Confluence.
Looks like something in Star Trek someone would trip and pull a bunch of wires out of.
Since I get asked the question every few years and I basically have no idea what I'm talking about, I figured I'd do the research now and then refer back to this in the future when someone asks and I need to look clever in a hurry.
TL/DR: The tooling to create post-quantum safe secrets exists and mostly works, but for normal developers dealing with data that is of little interest 12 months after it is created, I think this is more of a "nice to have". That said, these approaches are different enough from encryption today that developers operating with more important data would be well-served by investing the time now in researching how to integrate some of them. Now that the standards are out I suspect there will be more professional interest in supporting these approaches and the tooling will get more open-source developer contributions.
Think of a conventional computer like a regular Pokémon player. This player makes decisions based on clear rules and can only make one move at a time.
In the Pokémon card game, you have:
A limited number of cards (like a computer’s memory)
You play one card at a time (like a computer performing one calculation at a time)
You follow a clear set of rules (like how classical computers follow step-by-step instructions)
Every time you want to pick a card, attack, or use a move, you do it one by one in a specific order, just like a classical computer processes 0s and 1s in a step-by-step manner. If you want to calculate something or figure out the best strategy, you would test one option, then another, and so on, until you find the right solution. This makes conventional computers good at handling problems that can be broken down into simple steps.
Quantum Computers:
Now, imagine a quantum computer is like a player who can somehow look at all the cards in their deck at the same time and choose the best one without flipping through each card individually.
In the quantum world:
Instead of playing one card at a time, it’s like you could play multiple cards at once, but in a way that combines all possibilities (like a super-powered move).
You don’t just pick one strategy, you could explore all possible strategies at once. It’s as if you’re thinking of all possible moves simultaneously, which could lead to discovering new ways to win the game much faster than in a regular match.
Quantum computers rely on something called superposition, which is like having your Pokémon be both active and benched at the same time, until you need to make a decision. Then, they “collapse” into one state—either active or benched.
This gives quantum computers the ability to solve certain types of problems much faster because they are working with many possibilities at once, unlike classical computers that work on problems step-by-step.
Why Aren't Quantum Computers More Relevant To Me?
We'll explain this with Pokémon cards again.
The deck of cards (which represents the quantum system) in a quantum player’s game is extremely fragile. The cards are like quantum bits (qubits), and they can be in many states at once (active, benched, etc.). However, if someone bumps the table or even just looks at the cards wrong, the whole system can collapse and go back to a simple state.
In the Pokémon analogy, this would be like having super rare and powerful cards, but they’re so sensitive that if you shuffle too hard or drop the deck, the cards get damaged or lost. Because of this, it’s hard to keep the quantum player’s strategy intact without ruining their game.
In real life, quantum computers need extremely controlled environments to work—like keeping them at near absolute zero temperatures. Otherwise, they make too many errors to be useful for most tasks.
The quantum player might be amazing at playing certain types of Pokémon battles, like tournaments that require deep strategy or involve many complex moves. However, if they try to play a quick, casual game with a simple strategy, their special abilities don’t help much. They may even be worse at simple games than regular players.
Got it, so Post-Quantum Cryptography
So conventional encryption algorithms often work with the following design. They select two very large prime numbers and then multiply them to obtain an even larger number. Multiplying the primes is easy, but it's hard to work backwards from the output to figure out which numbers you used. Those two numbers are the prime factors, and recovering them from the product is what people mean when they talk about breaking this kind of encryption.
Sometimes you hear this referred to as "the RSA problem": how do you get the private key with only the public key? Since this not-yet-existing quantum computer would be good at finding those prime factors (this is what Shor's algorithm does), a lot of the assumptions we have about how encryption works would be broken. For years and years the idea that it is safe to share a public key has been an underpinning of much of the software that has been written. Cue much panic.
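To make the asymmetry concrete, here is a tiny sketch in Python. The primes here are laughably small compared to the hundreds-of-digits primes real RSA keys use, and the trial-division function is just something I made up for illustration:

import time

p, q = 104729, 1299709        # two small primes; real keys use enormous ones
n = p * q                     # multiplying them is instant

def naive_factor(n):
    # Trial division: fine for toy numbers, hopeless for 2048-bit ones
    # (assumes n is odd, which it is here)
    f = 3
    while f * f <= n:
        if n % f == 0:
            return f, n // f
        f += 2
    return None

start = time.time()
print("n =", n)
print("factors:", naive_factor(n))
print(f"took {time.time() - start:.3f}s, and the cost explodes as the primes grow")

The whole bet of RSA is that the gap between those two operations stays enormous for classical computers.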
But since it takes 20 years for us to do anything as an industry we have to start planning now even though it seems more likely in 20-30 years we'll be struggling to keep any component of the internet functional through massive heat waves and water wars. Anywho.
So NIST, starting in 2016, asked for help selecting some post-quantum standards and ended up settling on three of them. Let's talk about them and why they are (probably) better suited to this problem.
FIPS 203 (Module-Lattice-Based Key-Encapsulation Mechanism Standard)
Basically we have two different things happening here. We have a Key-Encapsulation Mechanism, which is a known thing you have probably used. Underneath that sits the module-lattice math that is supposed to make it quantum-resistant.
Key-Encapsulation Mechanism
You and another entity need to establish a private key between the two of you, but only using non-confidential communication. Basically the receiver generates a key pair and transmits the public key to the sender. The sender needs to ensure they got the right public key. The sender then uses that public key to generate a shared secret plus an encapsulation (ciphertext) of it, and sends the encapsulation back to the receiver over a channel that could be either secure or insecure. The receiver uses their private key to decapsulate it and recover the same shared secret. You've probably done this a lot in your career in some way or another.
More Lattices
There are two common hardness problems used to build these key-encapsulation mechanisms.
Ring Learning with Errors
Learning with Errors
Ring Learning with Errors
So we have three parts to this:
Ring: A set of polynomials with a capped degree, whose coefficients are reduced modulo a number q. If you, like me, forgot what a polynomial is I've got you.
Modulus: The number q that the coefficients are reduced by (e.g., q = 1024).
Error Term: Small random values added during key generation and encryption, simulating noise. This noise is what makes recovering the secret hard.
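To make the "ring" part less abstract, here is a minimal sketch of multiplying two polynomials in the ring Z_q[x]/(x^n + 1), which is the arithmetic everything below is built on. The parameters are made up and tiny so you can follow along by hand:

import numpy as np

n, q = 4, 17   # toy parameters: degree-4 polynomials, coefficients mod 17

def ring_mul(a, b):
    # Multiply two polynomials, then reduce modulo (x^n + 1) and modulo q.
    # Because x^n = -1 in this ring, any term that overflows past degree n-1
    # wraps back around with its sign flipped.
    full = np.convolve(a, b)            # ordinary polynomial multiplication
    res = full[:n].copy()
    res[:len(full) - n] -= full[n:]     # fold the high-degree terms back in
    return res % q

a = np.array([3, 1, 4, 1])              # 3 + x + 4x^2 + x^3
b = np.array([2, 0, 5, 6])              # 2 + 5x^2 + 6x^3
print(ring_mul(a, b))                   # four coefficients, each in [0, 17)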
How Does It Work?
Key generation:
Choose a modulus q (usually a prime) and a polynomial degree n; together these define the ring.
Generate a uniformly random polynomial a and two small random polynomials within the ring: s (the secret) and e (the error). These will be used to create the public and private keys.
Public Key Creation
Compute b = a·s + e. The error term e represents the "noise" that hides the secret s inside b.
Private Key: Keep s secret. It's used for decapsulation.
Public Key: Publish the pair (a, b). This is how others can send you an encapsulated secret.
Assuming you have a public key, in order to encapsulate a secret you need to do the following:
Generate small random polynomials r, e1 and e2
Compute u = a·r + e1 and v = b·r + e2 + encode(m), where m is the secret bits being shared and encode just scales each bit up toward q/2. The pair (u, v) is the ciphertext you send.
To decapsulate:
Compute the difference d = v - u·s. The big a·r·s terms cancel, eliminating everything except encode(m) plus a small amount of leftover noise.
Rounding away the noise: each coefficient of d sits near 0 for a 0 bit and near q/2 for a 1 bit, so rounding recovers m.
Extracting the shared secret key: both sides now hold the same bits m, which can be hashed down into a shared secret key.
What does this look like in Python?
Note: This is not a good example to use for real data. I'm trying to show how it works at a basic level. Never use some rando's Python script to encrypt actual real data.
import numpy as np

# Toy parameters, chosen so everything is easy to print.
# Real ML-KEM uses n = 256, q = 3329 and a module of several polynomials.
n = 8       # polynomial degree: the ring is Z_q[x] / (x^n + 1)
q = 3329    # coefficient modulus

def ring_mul(a, b):
    # Polynomial multiplication reduced mod (x^n + 1) and mod q
    full = np.convolve(a, b)
    res = full[:n].copy()
    res[:len(full) - n] -= full[n:]   # x^n wraps around to -1
    return res % q

def small_poly():
    # Small random polynomial, used for secrets and error ("noise") terms
    return np.random.randint(-2, 3, size=n)

def rlwe_keygen():
    a = np.random.randint(0, q, size=n)   # uniform public polynomial
    s = small_poly()                       # secret
    e = small_poly()                       # error term
    b = (ring_mul(a, s) + e) % q           # public: b = a*s + e
    return (a, b), s                       # public key, private key

def rlwe_encapsulate(public_key, secret_bits):
    a, b = public_key
    r, e1, e2 = small_poly(), small_poly(), small_poly()
    u = (ring_mul(a, r) + e1) % q
    v = (ring_mul(b, r) + e2 + np.array(secret_bits) * (q // 2)) % q
    return u, v    # this pair is what gets sent over the wire

def rlwe_decapsulate(s, u, v):
    d = (v - ring_mul(u, s)) % q
    # Each coefficient lands near 0 for a 0 bit and near q/2 for a 1 bit
    return [1 if q // 4 < x < 3 * q // 4 else 0 for x in d]

public_key, private_key = rlwe_keygen()
secret_bits = [int(bit) for bit in np.random.randint(0, 2, size=n)]  # the secret to share
u, v = rlwe_encapsulate(public_key, secret_bits)
recovered_bits = rlwe_decapsulate(private_key, u, v)
print("sender's secret bits:", secret_bits)
print("receiver recovered:  ", recovered_bits)
FIPS 204 (Module-Lattice-Based Digital Signature Standard)
A digital signature is a way to verify the authenticity and integrity of electronic documents, messages, or data. This is pretty important for software supply chains and packaging along with a million other things.
How It Works
Key Generation: A random public matrix of polynomials A is generated, along with a set of small secret polynomials (there are no RSA-style primes here). This process creates two keys:
Public Key (A): Published for others to use when verifying a digital signature.
Private Key (s): Kept secret by the sender and used to create a digital signature.
Message Hashing: The sender takes their message or document, which is often large in size, and converts it into a fixed-size string of characters called a message digest or hash value using a hash function (e.g., SHA-256). This process ensures that any small change to the original message will result in a completely different hash value.
Digital Signature Creation: The sender combines the message digest with their private key (s) and some freshly generated randomness to produce a short lattice vector, plus a small challenge value derived from a hash. Together these form a unique digital signature for the original message.
Message Transmission: The sender transmits the digitally signed message (message + digital signature) to the recipient.
Digital Signature Verification:
When receiving the digitally signed message, the recipient can verify its authenticity using the sender's public key (A). Here's how:
Recompute the Check Value: The recipient uses the sender's public key (A) and the received digital signature to recompute a check value. The sender's private key is never recovered or revealed in this process.
Message Hashing (Again): The recipient recreates the message digest from the original message, which should match the one obtained during the digital signature creation process.
Verification: If the two hash values match, it confirms that the original message hasn't been tampered with and was indeed signed by the sender.
Module-Lattice-Based Digital Signature
So a lot of this is the same as the stuff in FIPS 203. I'll provide a Python example for you to see how similar it is.
import hashlib
import numpy as np

# Same toy ring parameters as the FIPS 203 example
n, q = 8, 3329

def ring_mul(a, b):
    # Multiply two polynomials and reduce mod (x^n + 1) and mod q
    full = np.convolve(a, b)
    res = full[:n].copy()
    res[:len(full) - n] -= full[n:]
    return res % q

def small_poly(bound=2):
    # Small random polynomial used for secrets, errors and masking values
    return np.random.randint(-bound, bound + 1, size=n)

def challenge(w, message):
    # Hash the commitment w together with the message down to a small challenge polynomial
    data = repr([int(x) for x in w]).encode() + message
    digest = hashlib.sha256(data).digest()
    return np.array([digest[i] % 3 - 1 for i in range(n)])   # coefficients in {-1, 0, 1}

def sig_keygen():
    a = np.random.randint(0, q, size=n)   # public random polynomial
    s = small_poly()                       # private key
    t = ring_mul(a, s)                     # public key component: t = a*s
    return (a, t), s

def sig_sign(private_key, public_key, message):
    a, t = public_key
    y = small_poly(bound=200)              # masking randomness, much bigger than s
    w = ring_mul(a, y)                     # commitment
    c = challenge(w, message)
    # Real ML-DSA also checks that z stays small and retries if not ("aborts")
    z = (y + ring_mul(c, private_key)) % q
    return z, c

def sig_verify(public_key, message, signature):
    a, t = public_key
    z, c = signature
    # a*z - c*t reconstructs the commitment w if the signature is genuine
    w_prime = (ring_mul(a, z) - ring_mul(c, t)) % q
    return np.array_equal(challenge(w_prime, message), c)

public_key, private_key = sig_keygen()
message = b"verify me please"
signature = sig_sign(private_key, public_key, message)
print("valid signature: ", sig_verify(public_key, message, signature))
print("tampered message:", sig_verify(public_key, b"something else", signature))
Basically the same concept as before but for signatures.
FIPS 205 (Stateless Hash-Based Digital Signature Standard)
SLH-DSA (the Stateless Hash-Based Digital Signature Algorithm) is a family of digital signature schemes built entirely out of hash functions. "Stateless" means the signer doesn't have to keep track of which one-time keys have already been used, which is what made earlier hash-based schemes easy to get wrong. SLH-DSAs are designed to be conservative and secure, relying on almost nothing beyond the hash function itself, making them suitable for various applications.
Basically because they use hashes and are stateless they are more resistant to quantum computers.
Basic Parts
Forest of Random Subsets (FORS): A collection of random subsets generated from a large set.
Hash Functions: Used to compute the hash values for the subsets.
Subset Selection: A mechanism for selecting a subset of subsets based on the message to be signed.
How It Works
Key Generation: Generate multiple random subsets from a large set using a hash function (e.g., SHA-256).
Message Hashing: Compute the hash value of the message to be signed.
Subset Selection: Select a subset of subsets based on the hash value of the message.
Signature Generation: Generate a signature by combining the selected subsets.
The Extended Merkle Signature Scheme (XMSS) is a multi-time signature scheme that uses a Merkle tree to tie many one-time keys to a single public key. Using it boils down to the following four steps.
Key Generation: Generate a Merkle Tree using multiple levels of random hash values.
Message Hashing: Compute the hash value of the message to be signed.
Tree Traversal: Traverse the Merkle Tree to select nodes that correspond to the message's hash value.
Signature Generation: Generate a signature by combining the selected nodes.
Can I have a Python example?
Honestly I really tried on this. There is not a lot on the internet about how to do this. Here is where I ended up:
Python QRL library. This seems like it'll work but I couldn't get the package to install successfully with Python 3.10, 3.11 or 3.12.
Quantcrypt: This worked but honestly the "example" doesn't really show you anything interesting except that it seems to output what you think it should output.
Standard library: my first attempt just hashed the private key together with the message, which produces something you can never check with only the public key. The version below swaps that out for a simplified Lamport-style one-time signature (reveal one of two secret values per bit of the message hash) with a Merkle tree over the one-time public keys, which is the actual building block schemes like XMSS are based on.
import hashlib
import os

# Utility function to generate a hash of data
def hash_data(data):
    return hashlib.sha256(data).digest()

# Generate a Lamport-style one-time key pair: the private key is 256 pairs of
# random values, the public key is the hash of each of those values.
def generate_keypair():
    private_key = [(os.urandom(32), os.urandom(32)) for _ in range(256)]
    public_key = [(hash_data(a), hash_data(b)) for a, b in private_key]
    return private_key, public_key

# Collapse a one-time public key into a single leaf value for the Merkle tree
def public_key_to_leaf(public_key):
    return hash_data(b"".join(a + b for a, b in public_key))

# Turn a 32-byte hash into its 256 individual bits
def hash_to_bits(message_hash):
    return [(byte >> i) & 1 for byte in message_hash for i in range(8)]

# Create a simplified Merkle tree over the leaves and return the root
def create_merkle_tree(leaves):
    current_level = list(leaves)
    while len(current_level) > 1:
        next_level = []
        # Pair nodes and hash them to create the next level
        for i in range(0, len(current_level), 2):
            left_node = current_level[i]
            right_node = current_level[i + 1] if i + 1 < len(current_level) else left_node  # Handle odd number of nodes
            next_level.append(hash_data(left_node + right_node))
        current_level = next_level
    return current_level[0]  # Root of the Merkle tree

# Sign by revealing one of the two secret values for each bit of the message hash
def sign_message(message, private_key):
    bits = hash_to_bits(hash_data(message))
    return [private_key[i][bit] for i, bit in enumerate(bits)]

# Verify by hashing each revealed value and checking it against the public key
def verify_signature(message, signature, public_key):
    bits = hash_to_bits(hash_data(message))
    return all(hash_data(revealed) == public_key[i][bit]
               for i, (revealed, bit) in enumerate(zip(signature, bits)))

# Example of using the above functions
# 1. Generate one-time key pairs for the leaf nodes of the Merkle tree
tree_height = 4  # This allows for 2^tree_height leaves
num_leaves = 2 ** tree_height
key_pairs = [generate_keypair() for _ in range(num_leaves)]
private_keys, public_keys = zip(*key_pairs)

# 2. Create the Merkle tree from the one-time public keys (leaf nodes)
merkle_root = create_merkle_tree([public_key_to_leaf(pk) for pk in public_keys])
print(f"Merkle Tree Root: {merkle_root.hex()}")

# 3. Sign a message using one of the one-time private keys
message = b"Hello, this is a test message for XMSS-like scheme"
leaf_index = 0  # Choose which key to sign with (0 in this case)
signature = sign_message(message, private_keys[leaf_index])
print(f"Signature contains {len(signature)} revealed values")

# 4. Verify the signature, then check that a tampered message fails
is_valid = verify_signature(message, signature, public_keys[leaf_index])
print("Signature is valid!" if is_valid else "Signature is invalid!")
print("Tampered message valid?", verify_signature(b"Some other message", signature, public_keys[leaf_index]))
If you run it, the signature verifies against the one-time public key and a tampered message fails. It is still toy code you shouldn't be using for anything real, but hopefully it at least provides some context.
Is This Something I Should Worry About Now?
That really depends on what data you are dealing with. If I was dealing with tons of super-sensitive data, I would probably start preparing the way now. This isn't a change you are going to want to make quickly, in no small part because you need to account for the performance difference between some of these approaches and more standard encryption. Were I working on something like a medical device or secure communications it would definitely be something I'd at least spike out and try to see what it looked like.
So basically if someone asks you about this, I hope now you can at least talk intelligently about it for 5 minutes until they wander away out of boredom. If this is something you actually have to deal with, start with PQClean and work from there.
I've spent a fair amount of time around networking. I've worked for a small ISP, helped to set up campus and office networks and even done a fair amount of work with BGP and assisting with ISP failover and route work. However in my current role I've been doing a lot of mobile network diagnostics and troubleshooting which made me realize I actually don't know anything about how mobile networks operate. So I figured it was a good idea for me to learn more and write up what I find.
It's interesting that cellular internet either has become or will soon become the default Internet for most humans alive, yet almost no developers I know have any idea how it works (including myself until recently). As I hope I demonstrate below, an untold amount of amazing work has been applied to this problem over decades, and it has produced incredible results. As it turns out the network engineers working with cellular were doing nuclear physics while I was hot-gluing stuff together.
I am not an expert. I will update this as I get better information, but use this as a reference for stuff to look up, not a bible. It is my hope, over many revisions, to turn this into an easier-to-read PDF that folks can download. However I want to get it out in front of people to help find mistakes.
TL/DR: There is a shocking, eye-watering amount of complexity when it comes to cellular data as compared to a home or datacenter network connection. I could spend the next six months of my life reading about this and feel like I barely scratched the surface. However I'm hoping that I have provided some basic-level information about how this magic all works.
Corrections/Requests: https://c.im/@matdevdug. I know I didn't get it all right, I promise I won't be offended.
Basics
A modern cellular network at its core is composed of three basic elements:
the RAN (radio access network)
CN (core network)
Services network
RAN
The RAN contains the base stations that communicate with the phones using radio signals. When we think of a cell tower we are thinking of the RAN. A lot of what we think of as the services a cellular network provides actually lives in the CN: user authorization, which services are turned on or off for a user, and all the background work for transferring and handing off user traffic. Think SMS and phone calls for most users today.
Key Components of the RAN:
Base Transceiver Station (BTS): The BTS is a radio transmitter/receiver that communicates with your phone over the air interface.
Node B (or Evolved Node B for 4G, gNodeB for 5G): In modern cellular networks, Node B refers to a base station that serves multiple cell sites. It aggregates data from these cell sites and forwards it to the RAN controller.
Radio Network Controller (RNC): The RNC is responsible for managing the radio link between your phone and the BTS/Node B.
Base Station Subsystem (BSS): The BSS is a term used in older cellular networks, referring to the combination of the BTS and RNC.
Cell Search and Network Acquisition. The device powers on and begins searching for available cells by scanning the frequencies of surrounding base stations (e.g., eNodeB for LTE, gNodeB for 5G).
┌──────────────┐ ┌──────────────┐
│ Base Station│ │ Mobile │
│ │ │ Device │
│ Broadcast │ │ │
│ ──────────> │ Search for │ <────────── │
│ │ Sync Signals│ Synchronizes │
│ │ │ │
└──────────────┘ └──────────────┘
- Device listens for synchronization signals.
- Identifies the best base station for connection.
Random Access. After identifying the cell to connect to, the device sends a random access request to establish initial communication with the base station. This is often called RACH (Random Access Channel). If you want to read about it I found an incredible amount of detail here: https://www.sharetechnote.com/html/RACH_LTE.html
┌──────────────┐ ┌──────────────┐
│ Base Station│ │ Mobile │
│ │ │ Device │
│ Random Access Response │ │
│ <────────── │ ──────────> │ Random Access│
│ │ │ Request │
└──────────────┘ └──────────────┘
- Device sends a Random Access Preamble.
- Base station responds with timing and resource allocation.
Dedicated Radio Connection Setup (RRC Setup). The base station allocates resources for the device to establish a dedicated radio connection using the Radio Resource Control (RRC) protocol.
Device-to-Core Network Communication (Authentication, Security, etc.). Once the RRC connection is established, the device communicates with the core network (e.g., EPC in LTE, 5GC in 5G) for authentication, security setup, and session establishment.
┌──────────────┐ ┌──────────────┐
│ Base Station│ │ Mobile │
│ ──────────> │ Forward │ │
│ │ Authentication Data │
│ │ <────────── │Authentication│
│ │ │ Request │
│ │ │ │
└──────────────┘ └──────────────┘
- Device exchanges authentication and security data with the core network.
- Secure communication is established.
Data Transfer (Downlink and Uplink). After setup, the device starts sending (uplink) and receiving (downlink) data using the established radio connection.
┌──────────────┐ ┌──────────────┐
│ Base Station│ │ Mobile │
│ ──────────> │ Data │ │
│ Downlink │ │ <───────── │
│ <────────── │ Data Uplink │ ──────────> │
│ │ │ │
└──────────────┘ └──────────────┘
- Data is transmitted between the base station and the device.
- Downlink (BS to Device) and Uplink (Device to BS) transmissions.
Handover. If the device moves out of range of the current base station, a handover is initiated to transfer the connection to a new base station without interrupting the service.
Signaling
As shown in the diagram above, there are a lot of references to something called "signaling". Signaling seems to be shorthand for the configuration and hand-off traffic exchanged between the tower, the device and the core network. As far as I can tell it can be broken into three types.
Access Stratum Signaling
Set of protocols to manage the radio link between your phone and cellular network.
Handles authentication and encryption
Radio bearer establishment (setting up a dedicated channel for data transfer)
Mobility management (handovers, etc)
Quality of Service control.
Non-Access Stratum (NAS) Signaling
Set of protocols used to manage the interaction between your phone and the cellular network's core infrastructure.
It handles tasks such as authentication, billing, and location services.
Authentication with the Home Location Register (HLR)
Roaming management
Charging and billing
IMSI Attach/ Detach procedure
Lower Layer Signaling on the Air Interface
This refers to the control signaling that occurs between your phone and the cellular network's base station at the physical or data link layer.
It ensures reliable communication over the air interface, error detection and correction, and efficient use of resources (e.g., allocating radio bandwidth).
Modulation and demodulation control
Error detection and correction using CRCs (Cyclic Redundancy Checks)
High Level Overview of Signaling
You turn on your phone (AS signaling starts).
Your phone sends an Initial Direct Transfer (IDT) message to establish a radio connection with the base station (lower layer signaling takes over).
The base station authenticates your phone using NAS signaling, contacting the HLR for authentication.
Once authenticated, lower layer signaling continues to manage data transfer between your phone and the base station.
What is HLR?
The Home Location Register contains the subscriber data for a network: the IMSI, phone number and service information. It is also what keeps track of where in the world the user physically is so calls and messages can be routed to them.
Duplexing
You have a lot of devices and you have a few towers. You need to do many uplinks and downlinks to many devices.
It is important that in any cellular communications system you can send and receive in both directions at the same time. This enables conversations, with either end able to talk and listen as required. In order to transmit in both directions, a device (UE) and base station must agree on a duplex scheme. There are a lot of them, including Frequency Division Duplex (FDD), Time Division Duplex (TDD), Semi-static TDD and Dynamic TDD.
Duplexing Types:
Frequency Division Duplex (FDD): Uses separate frequency bands for downlink and uplink signals.
Downlink: The mobile device receives data from the base station on a specific frequency (F1).
Uplink: The mobile device sends data to the base station on a different frequency (F2).
Key Principle: Separate frequencies for uplink and downlink enable simultaneous transmission and reception.
┌──────────────┐ ┌──────────────┐
│ Base Station│ │ Mobile │
│ │ │ Device │
│ ──────────> │ F1 (Downlink)│ <────────── │
│ │ │ │
│ <────────── │ F2 (Uplink) │ ──────────> │
└──────────────┘ └──────────────┘
Separate frequency bands (F1 and F2)
Time Division Duplex (TDD): Alternates between downlink and uplink signals over the same frequency band.
Downlink: The base station sends data to the mobile device in a time slot.
Uplink: The mobile device sends data to the base station in a different time slot using the same frequency.
Key Principle: The same frequency is used for both uplink and downlink, but at different times.
┌──────────────┐ ┌──────────────┐
│ Base Station│ │ Mobile Phone│
│ (eNodeB/gNB) │ │ │
└──────────────┘ └──────────────┘
───────────► Time Slot 1 (Downlink)
(Base station sends data)
◄─────────── Time Slot 2 (Uplink)
(Mobile sends data)
───────────► Time Slot 3 (Downlink)
(Base station sends data)
◄─────────── Time Slot 4 (Uplink)
(Mobile sends data)
- The same frequency is used for both directions.
- Communication alternates between downlink and uplink in predefined time slots.
Semi-static TDD (Frame design):
Downlink/Uplink: There are predetermined time slots for uplink and downlink, but they can be changed periodically (e.g., minutes, hours).
Key Principle: Time slots are allocated statically for longer durations but can be switched based on network traffic patterns (e.g., heavier downlink traffic during peak hours).
A frame typically lasts 10 ms and is divided into time slots for downlink (DL) and uplink (UL).
"Guard" time slots are used to allow switching between transmission and reception.
Dynamic Time Division Duplex (Dynamic TDD):
Downlink/Uplink: Time slots for uplink and downlink are dynamically adjusted in real time based on instantaneous traffic demands.
Key Principle: Uplink and downlink time slots are flexible and can vary dynamically to optimize the usage of the available spectrum in real-time, depending on the traffic load.
See second diagram for what "guard periods" are. Basically windows to ensure there are gaps and the signal doesn't overlap.
┌──────────────┐ ┌──────────────┐
│ Base Station│ │ Mobile Phone│
│ (eNodeB/gNB) │ │ │
└──────────────┘ └──────────────┘
───────────► Time Slot 1 (Downlink)
───────────► Time Slot 2 (Downlink)
───────────► Time Slot 3 (Downlink)
◄─────────── Time Slot 4 (Uplink)
───────────► Time Slot 5 (Downlink)
◄─────────── Time Slot 6 (Uplink)
- More slots for downlink in scenarios with high download traffic (e.g., streaming video).
- Dynamic slot assignment can change depending on the real-time demand.
┌──────────────┐ ┌──────────────┐
│ Base Station│ │ Mobile Phone│
│ (eNodeB/gNB) │ │ │
└──────────────┘ └──────────────┘
───────────► Time Slot 1 (Downlink)
───────────► Time Slot 2 (Downlink)
[Guard Period] (Switch from downlink to uplink)
◄─────────── Time Slot 3 (Uplink)
[Guard Period] (Switch from uplink to downlink)
───────────► Time Slot 4 (Downlink)
- Guard periods allow safe switching from one direction to another.
- Guard periods prevent signals from overlapping and causing interference.
Core
So I've written a lot about what the RAN does. But we haven't really touched on what the core network does. Basically once the device registers with the base station using the random access procedure discussed above, the core network can do all the stuff we typically associate with "having a cellular plan".
For modern devices when we say authentication we mean "mutual authentication": the device authenticates the network and the network authenticates the device. This typically works off a subscriber-specific secret key: the network sends a random challenge, the device uses the key to compute a response, and the network checks it against the value it expected. The network also sends an authentication token, and the device compares this token with the expected token to authenticate the network in return. The algorithms involved look like the following:
┌───────────────────────┐
│ Encryption & │
│ Integrity Algorithms │
├───────────────────────┤
│ - AES (Encryption) │
│ - SNOW 3G (Encryption│
│ - ZUC (Encryption) │
│ - SHA-256 (Integrity)│
└───────────────────────┘
- AES: Strong encryption algorithm commonly used in LTE/5G.
- SNOW 3G: Stream cipher used for encryption in mobile communications.
- ZUC: Encryption algorithm used in 5G.
- SHA-256: Integrity algorithm ensuring data integrity.
The steps of the core network are as follows:
Registration (also called attach procedure): The device connects to the core network (e.g., EPC in LTE or 5GC in 5G) to register and declare its presence. This involves the device identifying itself and the network confirming its identity.
Mutual Authentication: The network and device authenticate each other to ensure a secure connection. The device verifies the network’s authenticity, and the network confirms the device’s identity (a toy sketch of this exchange follows after this list).
Security Activation: After successful authentication, the network and the device establish a secure channel using encryption and integrity protection to ensure data confidentiality and integrity.
Session Setup and IP Address Allocation: The device establishes a data session with the core network, which includes setting up bearers (logical paths for data) and assigning an IP address to enable internet connectivity.
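To make the mutual authentication step a bit more concrete, here is a toy sketch of a challenge-response exchange built on a shared secret. The real procedure (AKA) runs on the SIM and typically uses the MILENAGE or TUAK algorithm sets rather than HMAC-SHA-256, and the key never leaves the SIM or the operator's systems, so treat this purely as an illustration of the shape of the exchange; the names here are mine:

import hashlib
import hmac
import os

# Shared secret provisioned in the SIM and stored by the operator (illustrative only)
K = os.urandom(32)

def respond(key, challenge_bytes):
    # Both sides derive a response from the shared key and the random challenge
    return hmac.new(key, challenge_bytes, hashlib.sha256).digest()

# 1. Network sends a random challenge to the device
rand = os.urandom(16)

# 2. Device computes a response with its copy of K; network computes the expected value
device_response = respond(K, rand)
expected_response = respond(K, rand)
print("network accepts device:", hmac.compare_digest(device_response, expected_response))

# 3. Network also sends a token derived from K so the device can authenticate
#    the network in return, which is what makes the authentication "mutual"
network_token = respond(K, b"network" + rand)
expected_token = respond(K, b"network" + rand)
print("device accepts network:", hmac.compare_digest(network_token, expected_token))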
How Data Gets To Phone
Alright we've talked about how the phone finds a tower to talk to, how the tower knows who the phone is and all the millions of steps involved in getting the mobile phone an actual honest-to-god IP address. How is data actually getting to the phone itself?
Configuration for Downlink Measurement: Before downlink data transmission can occur, the mobile device (UE) must be configured to perform downlink measurements. This helps the network optimize transmission based on the channel conditions. Configuration messages are sent from the base station (eNodeB in LTE or gNB in 5G) to instruct the UE to measure certain DL reference signals.
Reference Signal (Downlink Measurements): The mobile device receives reference signals from the network. These reference signals are used by the UE to estimate DL channel conditions. In LTE, Cell-specific Reference Signals (CRS) are used, and in 5G, Channel State Information-Reference Signals (CSI-RS) are used.
DL Channel Conditions (CQI, PMI, RI): The mobile device processes the reference signals to assess the downlink channel conditions and generates reports such as CQI (Channel Quality Indicator), PMI (Precoding Matrix Indicator), and RI (Rank Indicator). These reports are sent back to the base station.
DL Resource Allocation and Packet Transmission: Based on the UE’s channel reports (CQI, PMI, RI), the base station allocates appropriate downlink resources. It determines the modulation scheme, coding rate, MIMO layers, and frequency resources (PRBs) and sends a DL scheduling grant to the UE. The data packets are then transmitted over the downlink.
Positive/Negative Acknowledgement (HARQ Feedback): After the UE receives the downlink data, it checks the integrity of the packets using CRC (Cyclic Redundancy Check). If the CRC passes, the UE sends a positive acknowledgement (ACK) back to the network. If the CRC fails, a negative acknowledgement (NACK) is sent, indicating that retransmission is needed.
New Transmission or Retransmission (HARQ Process): If the network receives a NACK, it retransmits the packet using the HARQ process. The retransmission is often incremental (IR-HARQ), meaning the device combines the new transmission with previously received data to improve decoding.
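A toy version of that ACK/NACK loop, with zlib's CRC-32 standing in for the CRC the radio layer actually uses, looks something like this. Keep in mind real HARQ soft-combines the failed attempts instead of throwing them away; this sketch only shows the check-and-retransmit shape:

import os
import random
import zlib

def transmit(payload):
    # Append a CRC so the receiver can tell whether the packet survived the channel
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def noisy_channel(packet, error_rate=0.5):
    # Corrupt one byte some of the time to simulate a bad radio link
    packet = bytearray(packet)
    if random.random() < error_rate:
        packet[random.randrange(len(packet))] ^= 0xFF
    return bytes(packet)

def receive(packet):
    payload, crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None   # None means "send a NACK"

payload = os.urandom(32)
attempt = 1
while receive(noisy_channel(transmit(payload))) is None:
    print(f"NACK on attempt {attempt}, retransmitting")
    attempt += 1
print(f"ACK on attempt {attempt}")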
Uplink is a little different but is basically the device asking for a timeslot to upload, getting a grant, sending the data up and then getting an ack that it is sent.
Gs
So as everyone knows cellular networks have gone through a series of revisions over the years around the world. I'm going to talk about them and just try to walk through how they are different and what they mean.
1G
Starts in Japan, moves to Europe and then the US and UK.
Speeds up to 2.4kbps and operated in the frequency band of 150 KHz.
Didn't work between countries, had low capacity, unreliable handoff and no security. Basically any receiver can listen to a conversation.
2G
Launched in 1991 in Finland
Allows for text messages, picture messages and MMS.
Speeds up to 14.4kbps between 900MHz and 1800MHz bands
Actual security between sender and receiver with messages digitally encrypted.
Wait, are text messages encrypted?
So this was completely new to me but I guess my old Nokia brick had some encryption on it. Here's how that process worked:
Mobile device stores a secret key in the SIM card and the network generates a random challenge and sends it to the mobile device.
The A3 algorithm is used to compute a Signed Response (SRES) using the secret key and the random value.
Then the A8 algorithm is used with the secret key and the random value to generate a session encryption key Kc (a 64-bit key). This key will be used for encrypting data, including SMS.
After the authentication process and key generation, encryption of SMS messages begins. GSM uses a stream cipher to encrypt both voice and data traffic, including text messages. The encryption algorithm used for SMS is either A5/1 or A5/2, depending on the region and network configuration.
A5/1: A stronger encryption algorithm used in Europe and other regions.
A5/2: A weaker variant used in some regions, but deprecated due to its vulnerabilities.
The A5 algorithm generates a keystream that is XORed with the plaintext message (SMS) to produce the ciphertext, ensuring the confidentiality of the message.
So basically text messages were encrypted from the phone to the base station and then exposed from there onward. However I honestly didn't even know that was happening.
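The keystream idea itself is easy to demo. A5/1 is a hardware design built out of shift registers (and a thoroughly broken one at this point), so this sketch substitutes SHA-256 run in counter mode as the keystream generator; the XOR step is the part that is the same idea:

import hashlib

def keystream(key, length):
    # Stand-in keystream generator: hash the key with a counter (this is NOT A5/1)
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def xor_cipher(key, data):
    ks = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

kc = b"session-key-Kc"                      # stands in for the 64-bit Kc
sms = b"MEET AT THE USUAL PLACE AT NOON"
ciphertext = xor_cipher(kc, sms)
print(ciphertext.hex())
print(xor_cipher(kc, ciphertext))           # XORing again with the same keystream decrypts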
TDMA and CDMA
I remember a lot of conversations about GSM vs CDMA when you were talking about cellular networks but at the time all I really knew was "GSM is European and CDMA is US".
TDMA (Time Division Multiple Access) is what GSM uses: each user gets dedicated time slots on a shared channel
CDMA allocates each user a special spreading code so many users can communicate over the same physical channel at once
GSM is where we see services like voice mail, SMS, call waiting
EDGE
So everyone who is old like me remembers EDGE on cellphones, including the original iPhone I waited in line for. EDGE was effectively a retrofit you could put on top of an existing GSM network, keeping the cost of adding it low. You got speeds of 9.6-200kbps.
3G
Welcome to the year 2000
Frequency spectrum of 3G transmissions is 1900-2025MHz and 2110-2200MHz.
UMTS takes over for GSM and CDMA2000 takes over from CDMA.
Maxes out around 8-10Mbps
IMT-2000 = 3G
So let's just recap quickly how we got here.
2G (GSM): Initially focused on voice communication and slow data services (up to 9.6 kbps using Circuit Switched Data).
2.5G (GPRS): Introduced packet-switched data with rates of 40-50 kbps. It allowed more efficient use of radio resources for data services.
2.75G (EDGE): Enhanced the data rate by improving modulation techniques (8PSK). This increased data rates to around 384 kbps, making it more suitable for early mobile internet usage.
EDGE introduced 8-PSK (8-Phase Shift Keying) modulation, which allowed the encoding of 3 bits per symbol (as opposed to 1 bit per symbol with the original GSM’s GMSK (Gaussian Minimum Shift Keying) modulation). This increased spectral efficiency and data throughput.
EDGE had really high latency so it wasn't really usable for things like video streaming or online gaming.
3G (WCDMA): Max data rate: 2 Mbps (with improvements over EDGE in practice). Introduced spread-spectrum (CDMA) technology with QPSK modulation.
3.5G (HSDPA): Enhanced WCDMA by introducing adaptive modulation (AMC), HARQ, and NodeB-based scheduling. Max data rate: 14.4 Mbps (downlink).
So when we say 3G we actually mean a pretty wide range of technologies all underneath the same umbrella.
4G
4G, or LTE as it is often called, evolved from WCDMA. Instead of developing all-new radio interfaces and technology, existing and newly developed wireless systems like GPRS, EDGE, Bluetooth, WLAN and HiperLAN were integrated together.
4G has a download speed of 67.65Mbps and upload speed of 29.37Mbps
4G operates at frequency bands of 2500-2570MHz for uplink and 2620-2690MHz for downlink with channel bandwidth of 1.25-20MHz
4G has a few key technologies, mainly OFDM, SDR and Multiple-Input Multiple-Output (MIMO).
OFDM (Orthogonal Frequency Division Multiplexing)
Allows for more efficient use of the available bandwidth by breaking data into smaller pieces and sending them simultaneously on many narrow subcarriers
Since each channel uses a different frequency, if one channel experiences interference or errors, the others remain unaffected.
OFDM can adapt to changing network conditions by dynamically adjusting the power levels and frequencies used for each channel.
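A stripped-down version of the OFDM idea in numpy: put one QPSK symbol on each subcarrier, run an inverse FFT to get the time-domain waveform that is actually transmitted, and run an FFT at the receiver to get the symbols back. Real LTE/5G adds cyclic prefixes, pilots, channel estimation and so on; this only shows the core transform:

import numpy as np

num_subcarriers = 64

# Random bits mapped onto QPSK symbols, one symbol per subcarrier
bits = np.random.randint(0, 2, size=(num_subcarriers, 2))
symbols = (2 * bits[:, 0] - 1) + 1j * (2 * bits[:, 1] - 1)

# Transmitter: the inverse FFT turns per-subcarrier symbols into a time-domain signal
time_signal = np.fft.ifft(symbols)

# Receiver: the FFT recovers the symbol sitting on each subcarrier
recovered = np.fft.fft(time_signal)

print("max reconstruction error:", np.max(np.abs(recovered - symbols)))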
SDR (Software Defined Radio)
Like it sounds, it is a technology that enables flexible and efficient implementation of wireless communication systems by using software algorithms to control and process radio signals in real-time. In cellular 4G, SDR is used to improve performance, reduce costs, and enable advanced features like multi-band support and spectrum flexibility.
MIMO (multiple-input multiple-output)
A technology used in cellular 4G to improve the performance and capacity of wireless networks. It allows for the simultaneous transmission and reception of multiple data streams over the same frequency band, using multiple antennas at both the base station and mobile device.
Works by having both the base station and the mobile device equipped with multiple antennas
Each antenna transmits and receives a separate data stream, allowing for multiple streams to be transmitted over the same frequency band
There is Spatial Multiplexing, where multiple data streams are transmitted over the same frequency band using different antennas. Then Beamforming, where advanced signal processing techniques direct the transmitted beams towards specific users, improving signal quality and reducing interference. Finally Massive MIMO, where you use a lot of antennas (64 or more) to improve capacity and performance.
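Here is a tiny numpy sketch of the spatial multiplexing idea: two antennas transmit two different symbols at the same time on the same frequency, the two receive antennas each hear a mix of both (described by the channel matrix H), and if the receiver knows H it can separate the streams by inverting it. The channel values are invented, and real receivers estimate H from reference signals and handle noise far more carefully than this zero-forcing approach does:

import numpy as np

# Two independent symbols, one per transmit antenna
tx = np.array([1 + 1j, -1 + 1j])

# 2x2 channel matrix: how much of each transmit antenna each receive antenna hears
H = np.array([[0.9 + 0.1j, 0.3 - 0.2j],
              [0.2 + 0.4j, 0.8 - 0.1j]])

noise = 0.01 * (np.random.randn(2) + 1j * np.random.randn(2))
rx = H @ tx + noise                  # what actually arrives at the two receive antennas

# Zero-forcing receiver: undo the mixing by inverting the channel matrix
estimate = np.linalg.inv(H) @ rx
print("sent:     ", tx)
print("estimated:", np.round(estimate, 2))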
5G
The International Telecommunication Union (ITU) defines 5G as a wireless communication system that supports speeds of at least 20 Gbps (gigabits per second), with ultra-low latency of less than 1 ms (millisecond).
5G operates on a much broader range of frequency bands than 4G
Low-band frequencies: These frequencies are typically below 3 GHz and are used for coverage in rural areas or indoor environments. Examples include the 600 MHz, 700 MHz, and 850 MHz bands.
Mid-band frequencies: These frequencies range from approximately 3-10 GHz and are used for both coverage and capacity in urban areas. Examples include the 3.5 GHz, 4.5 GHz, and 6 GHz bands.
High-band frequencies: These frequencies range from approximately 10-90 GHz and are used primarily for high-speed data transfer in dense urban environments. Examples include the 28 GHz, 39 GHz, and 73 GHz bands.
5G network designs are a step up in complexity from their 4G predecessors, with a control plane and a user plane, each plane using its own set of network functions; 4G networks largely combined these roles into single elements.
5G uses advanced modulation schemes such as 256-Quadrature Amplitude Modulation (QAM) to achieve higher data transfer rates than 4G, which typically uses 64-QAM or 16-QAM
All the MIMO stuff discussed above.
What the hell is Quadrature Amplitude Modulation?
I know, it sounds like a Star Trek thing. It is a way to send digital information over a communication channel, like a wireless network or cable. It's a method of "modulating" the signal, which means changing its characteristics in a way that allows us to transmit data.
When we say 256-QAM, it refers to the specific type of modulation being used. Here's what it means:
Quadrature: This refers to the fact that the signal is being modulated using two different dimensions (or "quadratures"). Think of it like a coordinate system with x and y axes.
Amplitude Modulation (AM): This is the way we change the signal's characteristics. In this case, we're changing the amplitude (magnitude) of the signal to represent digital information.
256: This refers to the number of possible states or levels that the signal can take on. Think of it like a binary alphabet with 2^8 = 256 possible combinations.
Why does 5G want this?
More information per symbol: With 256-QAM, each "symbol" (or signal change) can represent one of 256 different values. This means we can pack more data into the same amount of time.
Faster transmission speeds: As a result, we can transmit data at higher speeds without compromising quality.
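The bits-per-symbol math is easy to sanity check: a 256-point constellation carries log2(256) = 8 bits per symbol, versus 6 for 64-QAM and 4 for 16-QAM. A toy mapper that turns one byte into a point on a 16x16 grid looks like this (real constellations use Gray coding and normalized power, which this skips, and the function name is mine):

import math

def qam256_map(byte):
    # Split 8 bits into two 4-bit halves: one picks the I (in-phase) level,
    # the other the Q (quadrature) level, giving a 16x16 grid of points.
    i_level = byte >> 4          # high 4 bits
    q_level = byte & 0x0F        # low 4 bits
    # Center the 0..15 levels around zero: -15, -13, ..., +13, +15
    return complex(2 * i_level - 15, 2 * q_level - 15)

for modulation, points in [("QPSK", 4), ("16-QAM", 16), ("64-QAM", 64), ("256-QAM", 256)]:
    print(f"{modulation}: {int(math.log2(points))} bits per symbol")

print("byte 0xA7 maps to constellation point", qam256_map(0xA7))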
Kubernetes and 5G
Kubernetes is a popular technology in 5G and is used for a number of functions, including the following:
Virtual Network Functions (VNFs): VNFs are software-based implementations of traditional network functions, such as firewalls or packet filters. Kubernetes is used to deploy and manage these VNFs.
Cloud-Native Network Functions (CNFs): CNFs are cloud-native applications that provide network function capabilities, such as traffic management or security filtering. Kubernetes is used to deploy and manage these CNFs.
Network Function Virtualization (NFV) Infrastructure: NFV infrastructure provides the underlying hardware and software resources for running VNFs and CNFs. Kubernetes is used to orchestrate and manage this infrastructure.
Conclusion
So one of the common sources of frustration for developers I've worked with when debugging cellular network problems is that often while there is plenty of bandwidth for what they are trying to do, the latency involved can be quite variable. If you look at all the complexity behind the scenes and then factor in that the network radio on the actual cellular device is constantly flipping between an Active and Idle state in an attempt to save battery life, this suddenly makes sense.
Because all of the complexity I'm talking about ultimately gets you back to the same TCP stack we've been using for years, with all the overhead involved in that back and forth. We're still ending up with a SYN -> SYN-ACK. There are tools you can use to shorten this process somewhat, like TCP Fast Open or raising the initial congestion window, but you are still mostly dealing with the same level of overhead you always dealt with.
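If you do control a server that mobile clients hit a lot, TCP Fast Open is one of the few knobs you can actually turn. On Linux it is a socket option plus the net.ipv4.tcp_fastopen sysctl (which has to allow the server side); a rough sketch of the server end, guarded because the constant only exists on platforms that support it:

import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# TCP_FASTOPEN takes the length of the queue of pending Fast Open connections
if hasattr(socket, "TCP_FASTOPEN"):
    server.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, 16)

server.bind(("0.0.0.0", 8080))
server.listen(128)
print("listening, TCP Fast Open", "requested" if hasattr(socket, "TCP_FASTOPEN") else "unavailable")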
Ultimately there isn't much you can do with this information, as developers have almost no control over the elements present here. However, as cellular networks continue to become the dominant default Internet for the Earth's population, I think it's useful for more folks to understand the pieces working in the background of this stack.
I've complained a lot about the gaps in offerings for login security in the past. The basic problem is this domain of security serves a lot of masters. To get the widest level of buy-in from experts, the solution has to scale from normal logins to national security. This creates a frustrating experience for users because it is often overkill for the level of security they need. Basically, is it reasonable that you need Google Authenticator to access your gym website? In terms of communication, the solutions we hear about the most, i.e. the ones with the most marketing, allow for the insertion of SaaS services into the chain so that an operation that was previously free now carries a monthly fee based on usage.
This creates a lopsided set of incentives where only the most technologically complex and extremely secure solutions are endorsed, and when teams are (understandably) overwhelmed by those requirements, a SaaS vendor attempts to get inserted into a critical junction of their product.
The tech community has mostly agreed that usernames and passwords chosen by the user are not sufficient for even basic security. What we haven't done is precisely explain what it is that we want normal, average, non-genius developers to do about that. We've settled on this really weird place with the following rules:
Email accounts are always secure but SMS is never secure. You can always email a magic link and that's fine for some reason.
You should have TOTP, but we've settled on very short time windows because I guess we decided NTP was a solved problem. There's no actual requirement that the code change every 30 seconds; we're just pretending that we're all spies and someone is watching your phone (a toy TOTP implementation follows after this list). Also consumers should be given recovery codes, which are basically just passwords you generate and give to them and only allow to be used once. It is unclear why generating a one-time password for the user is bad but if we call the password a "recovery code" it is suddenly sufficient.
TOTP serves two purposes. One is it ensures there is one randomly generated secret associated with the account that we don't hash (even though I think you could....but nobody seems to), so it's actually kind of a dangerous password that we need to encrypt and can't rotate. The other is we tacked on this stupid idea that it is multi-device, even though there's zero requirement that the code lives on another device. Just someone decided that because there is a QR code it is now multi-device because phones scan QR codes.
At some point we decided to add a second device requirement, but those devices live in entirely different ecosystems. Even if you have an iPhone and a work MacBook, they shouldn't be using the same Apple ID, so I'm not really clear how things would ever line up. It seems like most people sync things like TOTP with their personal Google accounts across different work devices over time. I can't imagine that was the intended functionality.
Passkeys are great but also their range of behavior is bizarre and unpredictable so if you implement them you will be expected to effectively build every other possible recovery flow into this system. Even highly technical users cannot be relied upon to know whether they will lose their passkey when they do something.
Offloading the task to a large corporation is good, but you cannot pick one big corporation. You must have a relationship with Apple and Facebook and Microsoft and Google and Discord and anyone else who happens to be wandering around when you build this. Their logins are secured with magic and unbreakable, but if they are bypassed you can go fuck yourself because that is your problem, not theirs.
All of this is sort of a way to talk around the basic problem. I need a username and a password for every user on my platform. That password needs to be randomly generated and never stored as plain text in my database. If I had a way to know that the browser generated and stored the password, this basic level of security is met. As far as I can tell, there's no way for me to know that for sure. I can guess based on the length of the password and how quickly it was entered into a form field.
Keep in mind all I am trying to do is build a simple login route on an application that is portable, somewhat future proof and doesn't require a ton of personal data from the user to resolve common human error problems. Ideally I'd like to be able to hand this to someone else, they generate a new secret and they too can enroll as many users as they want. This is a simple thing to build so it should be simple to solve the login story as well.
Making a simple CMS
The site you are reading this on is hosted on Ghost, a CMS that is written in Node. It supports a lot of very exciting features I don't use and comes with a lot of baggage I don't need. Effectively all I actually use it for is:
RSS
Writing posts in its editor
Fixing typos in the posts I publish (sometimes, my writing is not good)
Writing a million drafts for everything I publish
Minimizing the amount of JS I'm inflicting on people, sticking whenever possible to just HTML and CSS
Ghost supports a million things on top of the things I have listed and it also comes with some strange requirements like running MySQL. I don't really need a lot of that stuff and running a full MySQL for a CMS that doesn't have any sort of multi-instance scaling functionality seems odd. I also don't want to stick something this complicated on the internet for people to use for long periods of time without regular maintenance.
Before you say it I don't care for static site generators. I find it's easier for me to have a tab open, write for ten minutes, then go back to what I was doing before.
My goal with this is just to make a normal friendly baby CMS that I could share with a group of people, less technical people, so they could write stuff when they felt like it. We're not trading nuclear secrets here. The requirements are:
Needs to be open to the public internet with no special device enrollment or network segmentation
Not administered by me. Whatever normal problem arises it has to be solvable by a non-technical person.
Making the CMS
So in a day when I was doing other stuff I put this together: https://gitlab.com/matdevdug/ezblog. It's nothing amazing, just sort of a basic template I can build on top of later. Uses sqlite and it does the things you would expect it to do. I can:
Write posts in Quill
Save the posts as drafts or as published posts
Edit the posts after I publish them
Have a valid RSS feed of the posts
The whole frontend is just HTML/CSS so it'll load fast and be easy to cache
Then there is the whole workflow of draft to published.
For one day's work this seems to be roughly where I hoped to be. Now we get to the crux of the matter: how do I log in?
What you built is bad and I hate it
The point is that I should be able to solve this problem quickly and easily for a hobby website, not that you personally like what I made. The examples aren't fully fleshed out, just templates to demonstrate the problem. Also I'm allowed to make stuff that serves no other function than amusing me.
Password Login
The default for most sites (including Ghost) is just a username and password. The reason for this: it's easy, works on everything and it's pretty simple to work out a fallback flow for users. Everyone understands it, and there are no concerns around data ownership or platform lock-in.
The form is just a username field, a password field and a csrf_token, and the rest is pretty straightforward. Server-side is also pretty easy.
from flask import flash, redirect, render_template, request, session, url_for
from werkzeug.security import check_password_hash

# bp, limiter and get_db() are defined elsewhere in the app

@bp.route('/login', methods=('GET', 'POST'))
@limiter.limit("5 per minute")
def login():
    if request.method == 'POST':
        username = request.form['username']
        password = request.form['password']
        db = get_db()
        error = None

        user = db.execute(
            'SELECT * FROM user WHERE username = ?', (username,)
        ).fetchone()

        if user is None:
            error = 'Incorrect username.'
        elif not check_password_hash(user['password'], password):
            error = 'Incorrect password.'

        if error is None:
            session.clear()
            session['user_id'] = user['id']
            return redirect(url_for('index'))

        flash(error)

    return render_template('auth/login.html')
I'm not storing the raw password, just the hash. It requires almost no work to do. It works exactly the way I think it should. Great, fine.
Why are passwords insufficient?
This has been talked to death but let's recap for the sake of me being able to say I did it and you can just kinda scroll quickly through this part.
Users reuse usernames and passwords, so even though I might not know the raw text of the password another website might be (somehow) even lazier than me and their database gets leaked and then oh no I'm hacked.
The password might be a bad one that attackers routinely try, and oh no, they're in the system.
I have to build in a password reset flow because humans are bad at remembering things and that's just how it is.
Password Reset Flow
Everyone has seen this, but let's talk about what I would need to modify about this small application to allow more than one person to use it.
I would need to add a route that handles allowing the user to reset their password by requesting it through their email
To know where to send that email, I would need to receive and store the email address for every user
I would also need to verify the user's email to ensure it worked
All of this hinges on having a token I could send to that user that I could generate with something like the following:
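A sketch of what that could look like, here using itsdangerous (my assumption; it ships with Flask), with the stored password hash as the salt and the user row from the login route above:

from flask import current_app
from itsdangerous import BadSignature, SignatureExpired, URLSafeTimedSerializer

def generate_reset_token(user):
    # Salt with the stored hash: once the password changes, old tokens stop validating
    s = URLSafeTimedSerializer(current_app.config['SECRET_KEY'], salt=user['password'])
    return s.dumps(user['id'])

def verify_reset_token(user, token, max_age=3600):
    s = URLSafeTimedSerializer(current_app.config['SECRET_KEY'], salt=user['password'])
    try:
        return s.loads(token, max_age=max_age) == user['id']
    except (BadSignature, SignatureExpired):
        return False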
Since I'm salting it with the hash of the current password which will change when they change the password, the token can only be used once. Makes sense.
Why is this bad?
For a ton of reasons.
I don't want to know an email address if I don't need it. There's no reason to store more personal information about a user that makes the database more valuable if someone were to steal it.
Email addresses change. You need to write another route which handles that process, which isn't hard, but then you need to decide whether to confirm that the user still has access to the original address with another magic URL or whether being currently logged in is sufficient.
Finally it sort of punts the problem to email and says "well I assume and hope your email is secure even if statistically you probably use the same password for both".
How do you fix this?
The problem can be boiled down to 2 basic parts:
I don't want the user to tell me a username, I want a randomly generated username so it further reduces the value of information stored in my database. It also makes it harder to do a random drive-by login attempt.
I don't want to own the password management story. Ideally I want the browser to do this on its side.
In a perfect world I want a response that says "yes we have stored these credentials somewhere under this users control" and I can wash my hands of that until we get into the situation where somehow they've lost access to the sync account (which should hopefully be rare enough that we can just do that in the database).
The annoying thing is this technology already exists.
The Credential Management API does the things I am talking about. Effectively I would need to add some JavaScript to my registration page:
<script>
  document.getElementById('register-form').addEventListener('submit', function(event) {
    event.preventDefault(); // Prevent form submission

    const username = document.getElementById('username').value;
    const password = document.getElementById('password').value;

    // Save credentials using Credential Management API
    if ('credentials' in navigator) {
      const cred = new PasswordCredential({
        id: username,
        password: password
      });

      // Store credentials in the browser's password manager
      navigator.credentials.store(cred).then(() => {
        console.log('Credentials stored successfully');
        // Proceed with registration, for example, send credentials to your server
        registerUser(username, password);
      }).catch(error => {
        console.error('Error storing credentials:', error);
      });
    }
  });

  function registerUser(username, password) {
    // Simulate server registration request
    fetch('/register', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ username: username, password: password })
    }).then(response => {
      if (response.ok) {
        console.log('User registered successfully');
        // Redirect or show success message
      } else {
        console.error('Registration failed');
      }
    });
  }
</script>
Then on my login page something like this:
function attemptAutoLogin() {
  if ('credentials' in navigator) {
    navigator.credentials.get({password: true}).then(cred => {
      if (cred) {
        // Send the credentials to the server to log in the user
        fetch('/login', {
          method: 'POST',
          body: new URLSearchParams({
            'username': cred.id,
            'password': cred.password
          })
        }).then(response => {
          // Handle login success or failure
          if (response.ok) {
            console.log('User logged in');
          } else {
            console.error('Login failed');
          }
        });
      }
    }).catch(error => {
      console.error('Error retrieving credentials:', error);
    });
  }
}

// Call the function when the page loads
document.addEventListener('DOMContentLoaded', attemptAutoLogin);
So great, I assign a random cred.id and cred.password, stick it in the browser and then I sorta wash my hands of it.
We know the password is stored somewhere and can be synced for free
We know the user can pull the password out and put it somewhere else if they want to switch platforms
Browsers handle password migrations for users
The problem with this approach is I don't know if I'm supposed to use it.
I have no idea what this means in practice. Could this go away? In testing the behavior is all over the place: Firefox seems to have some issues with it, Chrome seems to always nail it, and iOS Safari also has some problems. So this doesn't seem reliable enough to use.
Please just make this a thing that works everywhere.
Before you yell at me about Math.random I think the following would make a good password:
function generatePassword(length) {
  const charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
  let password = "";
  for (let i = 0; i < length; i++) {
    // Math.random is not a CSPRNG; swap in crypto.getRandomValues() if that bothers you
    const randomIndex = Math.floor(Math.random() * charset.length);
    password += charset.charAt(randomIndex);
  }
  return password;
}

const password = generatePassword(32);
console.log(password);
Alright so I can't get away with just a password, so I have to assume the password is bunk and use it as one element of login. Then I have to use either TOTP or HOTP.
From a user perspective TOTP works as follows:
Set up 2FA for your online account.
Get a QR code.
You scan this QR code with an authenticator app of your choice
Your app will immediately start generating these six-digit tokens.
The website asks you to provide one of these six-digit tokens.
Practically this is pretty straightforward. I add a few extra libraries:
import io
import pyotp
import qrcode
from flask import send_file
I have to generate a secret totp_secret = pyotp.random_base32() which then I have to store in the database. Then I have to generate a QR code to show the user so they can generate the time-based codes.
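To make that concrete, here's a rough sketch of the enrollment route using the imports above; get_current_user() and the 'ezblog' issuer string are stand-ins, not code from the repo:

@bp.route('/2fa/qr')
def two_factor_qr():
    user = get_current_user()  # hypothetical helper that loads the logged-in user row
    totp = pyotp.TOTP(user['totp_secret'])
    uri = totp.provisioning_uri(name=str(user['id']), issuer_name='ezblog')

    img = qrcode.make(uri)     # renders the otpauth:// URI as a QR code image
    buf = io.BytesIO()
    img.save(buf)              # defaults to PNG
    buf.seek(0)
    return send_file(buf, mimetype='image/png')

# And checking the six digits the user types back in:
def totp_code_is_valid(user, code):
    return pyotp.TOTP(user['totp_secret']).verify(code)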
However the more you look into this, the more complicated it gets.
You actually don't need the token to be 6 digits. It can be up to 10. I don't know why I'd want more or less. Presumably more is better.
The token can be valid for longer than 30 seconds. From reading, it seems like that makes the code less reliant on perfect time sync between client and server (great) but also widens the window for someone to steal the TOTP code and use it. That doesn't seem like a super likely attack vector here, so I'll make it way longer (there's a quick sketch of both knobs after this list). But then why don't more services use longer tokens if the only remaining concern is someone watching me type my code? Is this just people being unspeakably annoying?
I need to add some recovery step in case you lose access to the TOTP code.
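Both knobs are exposed in pyotp if you want to experiment; a quick sketch (the specific numbers are arbitrary):

totp = pyotp.TOTP(totp_secret, digits=8, interval=60)  # 8-digit codes that rotate every 60 seconds
totp.verify(code)                                       # strict: only the current window counts
totp.verify(code, valid_window=1)                       # forgiving: tolerate one window of clock drift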
How do you recover from a TOTP failure?
Effectively I'm back to my original problem. I can either:
Go back to the email workflow I don't want because again I don't want to rely on email as some sort of super-secure bastion and I really don't want to store email addresses.
Or I generate a recovery code and give you those codes which let you bypass the TOTP requirement. That at least lets me be like "this is no longer my fault". I like that.
How do I make a recovery code?
Honest to god I have no idea. As far as I can tell a "recovery code" is just a randomly generated value I hash and stick in the database and then when the user enters it on a form, check the hash. It's just another password. I don't know why all the recovery codes I see are numbers, since it seems to have no relationship to that and would likely work with any string.
Effectively all I need to do with the recovery code is ensure it gets burned once used. Which is fine, but now I'm confused. So I'm generating passwords for the user and then I give the passwords back to the user and tell them to store them somewhere? Why don't I just give them the one good password for the initial login and call it a day? Why is one forbidden and the other is mandatory?
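To make that concrete, here's a sketch of the whole lifecycle, assuming a hypothetical recovery_code table alongside the user table:

import secrets
from werkzeug.security import check_password_hash, generate_password_hash

def issue_recovery_codes(db, user_id, count=8):
    codes = [secrets.token_hex(5) for _ in range(count)]  # ten hex characters each
    for code in codes:
        db.execute(
            'INSERT INTO recovery_code (user_id, code_hash, used) VALUES (?, ?, 0)',
            (user_id, generate_password_hash(code)),
        )
    db.commit()
    return codes  # show these to the user exactly once

def redeem_recovery_code(db, user_id, submitted):
    rows = db.execute(
        'SELECT id, code_hash FROM recovery_code WHERE user_id = ? AND used = 0',
        (user_id,),
    ).fetchall()
    for row in rows:
        if check_password_hash(row['code_hash'], submitted):
            # Burn the code so it can never be used again
            db.execute('UPDATE recovery_code SET used = 1 WHERE id = ?', (row['id'],))
            db.commit()
            return True
    return False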
Does HOTP help?
I'm really still not clear how HOTP works. Like I understand the basics:
@app.route('/verify_2fa', methods=['GET', 'POST'])
def verify_2fa():
    if request.method == 'POST':
        hotp = pyotp.HOTP(user_data['hotp_secret'])
        otp = request.form['otp']
        if hotp.verify(otp, user_data['counter']):
            user_data['counter'] += 1  # Increment the counter after successful verification
            return redirect(url_for('index'))
        flash('Invalid OTP')
    return render_template('verify_2fa.html')
There is a secret per-user and a counter and then I increment the counter every single time the user logs in. As far as I can tell there isn't a forcing mechanism which keeps the client and the server in-sync, so basically you tap a button and generate a password and then if you accidentally tap the button again the two counters are off. It seems like then the server has to decide "are you a reasonable number of times off or an unreasonable amount of counts off". With the PyOTP library I don't see a way for me to control that:
verify(otp: str, counter: int) → bool
    Verifies the OTP passed in against the current counter OTP.

    Parameters:
        otp – the OTP to check against
        counter – the OTP HMAC counter
So I mean I could test it against a certain range of counters from the counter I know and then accept it if it falls within that window, but you still are either running a different application or an app on your phone to enter this code. I'm not sure exactly why I would ever use this over TOTP, but it definitely doesn't seem easier to recover from.
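Here's the sort of look-ahead loop I mean, sketched by hand since pyotp's HOTP.verify only takes a single counter value:

def verify_hotp_with_window(secret, otp, server_counter, window=5):
    hotp = pyotp.HOTP(secret)
    # Accept the code if it matches any counter within `window` presses ahead of ours
    for candidate in range(server_counter, server_counter + window + 1):
        if hotp.verify(otp, candidate):
            return candidate + 1  # the new counter value to persist
    return None  # nothing in the window matched, treat as a failed login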
So TOTP would work with the recovery codes, but it seems aggressive to ask normal people to install a separate program on their computer or phone in order to log in with a time-based code that will stop working if the client and server (who have zero way to sync time with each other) drift too far apart. Then I need to give you recovery codes and just sorta hope you have somewhere good to put those.
That said, it is the closest to solving the problem because those are at least normal understandable human problems and it does meet my initial requirement of "the user has one good password". It's also portable and allows administrators to be like "well you fell through the one safety net, account is locked, make a new one".
What is the expected treatment of the TOTP secret?
When I was writing this out I became unsure if I'm allowed to hash this secret. Thinking it through, I can't: the server has to feed the raw secret into the HMAC every single time it verifies a code, so a one-way hash would make verification impossible. If the user goes through a TOTP reset flow I'd just generate a brand new secret, but between resets the secret has to be recoverable, which rules out a strong key derivation function.
None of the tutorials I was able to find seemed to have any opinion on this topic. It seems like using encryption is the SOP, which is fine (it's not sitting on disk as a plain string) but introduces another failure point. It seems odd there isn't a way to negotiate a rotation with a client or really provide any sort of feedback. It meets my initial requirement, but the more I read about TOTP the more surprised I was it hasn't been better thought out.
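For reference, the encrypt-at-rest approach usually looks something like this sketch using Fernet from the cryptography package; TOTP_KEY is an assumed app-level key loaded from the environment, not something in the original code:

from cryptography.fernet import Fernet

fernet = Fernet(TOTP_KEY)  # TOTP_KEY generated once with Fernet.generate_key() and kept out of the database

encrypted_secret = fernet.encrypt(totp_secret.encode())   # this is what goes in the user row
totp_secret = fernet.decrypt(encrypted_secret).decode()   # needed every single time you verify a code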
Things I would love from TOTP/HOTP
Some sort of secret rotation process would be great. It doesn't have to be common, but it would be nice if there was some standard way of informing the client.
Be great if we more clearly explained to people how long the codes should be valid for. Certainly 1 hour is sufficient for consumer-level applications right?
Explain like what would I do if the counters get off with HOTP. Certainly some human error must be accounted for by the designers. People are going to hit the button too many times at some point.
Use Google/Facebook/Apple
I'm not against using these sorts of login buttons except I can't offer just one, I need to offer all of them. I have no idea what login that user is going to have or what makes sense for them to use. It also means I need to manage some sort of app registration with each one of these companies for each domain, which they can suspend approximately whenever they feel like it because they're giant megacorps.
So now I can't just spin up as many copies of this thing as I want with different URLs and I need to go through and test each one to ensure they work. I also need to come up with some sort of migration path for if one of them disappears and I need to authenticate the users into their existing accounts but using a different source of truth.
Since I cannot think of a way to do that which doesn't involve me basically emailing a magic link to the email address I get sent in the response from your corpo login and then allowing that form to update your user account with a different "real_user_id" I gotta abandon this. It just seems like a tremendous amount of work to not really "solve" the problem but just make the problem someone else's fault if it doesn't work.
Like if a user could previously log into a Facebook account and now no longer can, there's no customer service escalation they can go on. They can effectively go fuck themselves because nobody cares about one user encountering a problem. But that means you would still need some way of being like "you were a Facebook user and now you are a Google user". Or what if the user typically logs in with Google, clicks Facebook instead and now has two accounts? Am I expected to reconcile the two?
It's also important to note that I don't want any permissions and I don't want all the information I get back. I don't want to store email address or real name or anything like that, so again like the OAuth flow is overkill for my usage. I have no intention of requesting permissions on behalf of these users with any of these providers.
Use Passkeys
Me and passkeys don't get along super well, mostly because I think they're insane. I've written a lot about them in the past: https://matduggan.com/passkeys-as-a-tool-for-user-retention/ and I won't dwell on it except to say I don't think passkeys are designed with the first goal being an easy user experience.
But regardless passkeys do solve some of my problems.
Since I'm getting a public key I don't care if my database gets leaked
In theory I don't need an email address for fallback because on some platforms some of the time they sync
If users care a lot about ownership of personal data they can use a password manager sometimes if the password manager knows the right people and idk is friends with the mayor of passkeys or something. I don't really understand how that works, like what qualifies you to store the passkeys.
Mayor of passkeys
My issue with passkeys is I cannot conceive of an even "somewhat ok" fallback plan. So you set it up on an iPhone with a Windows computer at home. You break your iPhone and get an Android. It doesn't seem that crazy of a scenario to me to not have any solution for. Do I need your phone number on top of all of this? I don't want that crap sitting in a database.
Tell the users to buy a cross-platform password manager
Oh ok yeah absolutely normal people care enough about passwords to pay a monthly fee. Thanks for the awesome tip. I think everyone on Earth would agree they'd give up most of the price of a streaming platform full of fun content to pay for a password manager. Maybe I should tell them to spin up a docker container and run bitwarden while we're at it.
Anyway I have a hyper-secure biometric login as step 1 and then what is step 2, as the fallback? An email magic link? Based on what? Do I give you "recovery codes" like I did with TOTP? It seems crazy to layer TOTP on top of passkeys but maybe that...makes some sense as a fallback route? That seems way too secure but also possibly the right answer?
I'm not even trying to be snarky, I just don't understand what would be the acceptable position to take here.
What to do from here
Basically I'm left where I started. Here are my options:
Let the user assign a username and password and hope they let the browser or password manager do it and assume it is a good one.
Use the API in the browser to generate a good username and password and store it, hoping they always use a supported browser and that this API doesn't go away in the future.
Generate a TOTP but then also give them passwords called "recovery codes" and then hope they store those passwords somewhere good.
Use email magic links a lot and hope they remember to change their email address here when they lose access to an old email.
Use passkeys and then add on one of the other recovery systems and sort of hope for the best.
What basic stuff would I need to solve this problem forever:
The browser could tell me if it generated the password or if the user typed the password. If they type the password, force the 2FA flow. If not, don't. Let me tell the user "seriously let the system make the password". 1 good password criteria met.
Have the PasswordCredential API work everywhere all the time and I'll make a random username and password on the client and then we can just be done with this forever.
Passkeys but they live in the browser and sync like a normal password. Passkey lite. Passkey but not for nuclear secrets.
TOTP but if recovery codes are gonna be a requirement can we make it part of the spec? It seems like a made-up concept we sorta tacked on top.
I don't think these are crazy requirements. I just think if we want people to build more stuff and for that stuff to be secure, someone needs to sit down and realistically map out "how does a normal person do this". We need consistent reliable conventions I can build on top of, not weird design patterns we came up with because the initial concept was never tested on normal people before being formalized into a spec.
So for years I've used Docker Compose as my stepping stone to k8s. If the project is small, mostly for my own consumption, or the business requirements don't really support the complexity of k8s, I use Compose. It's simple to manage with bash scripts for deployments and not hard to set up on fresh servers with cloud-init, and the process of removing a server from a load balancer, pulling the new container, then adding it back in has been bulletproof for teams with limited headcount or services where uptime is less critical than cost control and ease of long-term maintenance. You avoid almost all of the complexity of really "running" a server while being able to scale up to about 20 VMs with a reasonable deployment time.
What are you talking about
Sure, so one common issue I hear is "we're a small team, k8s feels like overkill, what else is on the market"? The issue is there are tons and tons of ways to run containers on virtually every cloud platform, but a lot of them are locked to that cloud platform. They're also typically billed at premium pricing because they remove all the elements of "running a server".
That's fine, but for small teams buying in too heavily to a vendor solution can be hard to get out of. Maybe they pick wrong and it gets deprecated, etc. So I try to push them towards a simpler stack that is more idiot-proof to manage. It varies by VPS provider but the basic stack looks like the following:
Debian servers set up with cloud-init to run all the updates, reboot, and install the container manager of choice.
This also sets up Cloudflare Tunnels so we can access the boxes securely and easily (Tailscale also works just as well or better for this). It avoids needing a public IP for each box.
Add a tag to each one of those servers so we know what it does (redis, app server, database)
Put them into a VPC together so they can communicate
Take the deploy script, have it SSH into the box and run the container update process
Linux updates involve a straightforward process of de-registering, destroying the VM and then starting fresh. Database is a bit more complicated but still doable. It's all easily done in simple scripts that you can tie to github actions if you are so inclined. Docker compose has been the glue that handles the actual launching and restarting of the containers for this sample stack.
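For concreteness, a rough sketch of the kind of deploy script I mean; the host names, paths and the load balancer drain/enable commands are placeholders for whatever your provider gives you:

import subprocess

HOSTS = ["app-1.internal", "app-2.internal"]  # placeholder hostnames

def run(host, cmd):
    subprocess.run(["ssh", host, cmd], check=True)

for host in HOSTS:
    run(host, "./lb-drain.sh")  # placeholder: pull this box out of the load balancer
    run(host, "cd /srv/app && docker compose pull && docker compose up -d")
    run(host, "./lb-enable.sh")  # placeholder: put it back in once the new container is healthy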
When you outgrow this approach, you are big enough that you should have a pretty good idea of where to go now. Since everything is already in containers you haven't been boxed in and can migrate in whatever direction you want.
Why Not Docker
However I'm not thrilled with the current state of Docker as a full product. Even when I've paid for Docker Desktop I found it to be a profoundly underwhelming tool. It's slow, the UI is clunky, there's always an update pending, it's sort of expensive for what people use it for, and Windows users seem to hate it. When I've compared Podman vs Docker on servers or my local machines, Podman is faster, seems better designed and in general as a product is trending in a stronger direction. If I don't like Docker Desktop and prefer Podman Desktop, to me it's worth migrating the entire stack over and just dumping Docker as a tool I use. Fewer things to keep track of.
Now the problem is that while Podman has sort of a compatibility layer with Docker Compose, it's not a one-to-one replacement and you want to be careful using it. My testing showed it worked OK for basic examples, but with more complex stuff you start to run into problems. It also seems like work on the project has mostly been abandoned by the core maintainers. You can see it here: https://github.com/containers/podman-compose
I think podman-compose is the right solution for local dev, where you aren't using terribly complex examples and the uptime of the stack matters less. It's hard to replace Compose in this role because it's just so straightforward. As a production deployment tool I would stay away from it. This is important to note because right now the local dev container story often involves running k3s on your laptop. My experience is people loathe Kubernetes for local development and will go out of their way to avoid it.
The people I know who are all-in on Podman pushed me towards Quadlet as an alternative which uses systemd to manage the entire stack. That makes a lot of sense to me, because my Linux servers already have systemd and it's already a critical piece of software that (as far as I can remember) works pretty much as expected. So the idea of building on top of that existing framework makes more sense to me than attempting to recreate the somewhat haphazard design of Compose.
Wait I thought this already existed?
Yeah I was also confused. So there was a command, podman-generate-systemd, that I had used previously to run containers with Podman using Systemd. That has been deprecated in favor of Quadlets, which are more powerful and offer more of the Compose functionality, but are also more complex and less magically generated.
So if all you want to do is run a container or pod using Systemd, then you can still use podman-generate-systemd, which in my testing worked fine and did exactly what it says on the box. However if you want to emulate the functionality of Compose with networks and volumes, then you want Quadlet.
What is Quadlet
The name comes from this excellent pun:
What do you get if you squash a Kubernetes kubelet?
A quadlet
Actually laughed out loud at that. Anyway, Quadlet is a tool for running Podman containers under Systemd in a declarative way. It was merged into Podman 4.4, so it now comes in the box with Podman. When you install Podman it registers a systemd generator that looks for unit files in a few directories, chiefly /etc/containers/systemd/ for rootful containers and ~/.config/containers/systemd/ for rootless ones.
You put unit files in the directory you want (creating the directory if it isn't present, which it probably isn't), with the file extension telling you what you are looking at.
For example, if I wanted a simple volume I would make the following file:
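Something in this spirit, as a sketch; the file names, image and ports here are placeholders, not the exact units from my setup:

# /etc/containers/systemd/mydata.volume
[Volume]

# /etc/containers/systemd/myapp.container
[Unit]
Requires=mydata-volume.service
After=mydata-volume.service

[Container]
Image=docker.io/library/nginx:latest
Volume=mydata.volume:/usr/share/nginx/html
PublishPort=8080:80
AutoUpdate=registry

[Service]
Restart=always

[Install]
WantedBy=multi-user.target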
Once you run systemctl daemon-reload (or reboot) the generator picks these up, and you should be able to use systemctl status to check all of these running processes. You don't need to run systemctl enable to get them to run on next boot IF you have the [Install] section defined. Also notice that when you are setting the dependencies (Requires, After) the target is called name-of-thing.service, not name-of-thing.container or .volume. It threw me off at first, but I just wanted to call that out.
One thing I want to call out
Containers support AutoUpdate, meaning that if you just want Podman to pull down the freshest image from your registry, that's supported out of the box. It's just AutoUpdate=registry. If you change that to local, Podman will restart the container when you trigger a new build of that image locally with a deployment. If you need more information about logging into registries with Podman you can find that here.
I find this very helpful for testing environments, where I can tell servers to just run podman auto-update and get the newest containers. It's also great because it has options to help handle rollbacks and failure scenarios, which are rare but can really blow up in your face when you're running containers without k8s. You can see that here.
What if you don't store images somewhere?
So often with smaller apps it doesn't make sense to add a middle layer of building and storing the image in one place and then pulling it, vs just building the image on the machine you are deploying to with docker compose up -d --no-deps --build myapp
You can do the same thing with Quadlet build files. The unit files are similar to the ones above but with a .build extension, and the documentation makes it pretty simple to figure out how to convert whatever you are looking at.
I found this nice for quick testing so I could easily rsync changes to my test box and trigger a fast rebuild with the container layers mostly getting pulled from cache and only my code changes making a difference.
How do secrets work?
So secrets are supported with Quadlets. Effectively they just build on top of podman secret or secrets in Kubernetes. Assuming you don't want to go the Kubernetes route for this purpose, you have a couple of options.
Make a secret from a local file (probably a bad idea): podman secret create my_secret ./secret.txt
Make a secret from an environment variable on the box (better idea): podman secret create --env=true my_secret MYSECRET
Use stdin: printf <secret> | podman secret create my_secret -
Then you can reference these secrets inside of the .container file with Secret=name-of-podman-secret plus any options. By default a secret is mounted as a file at /run/secrets/secretname inside of the container. You can configure it to be an environment variable (along with a bunch of other stuff) with the options outlined here.
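As a sketch, inside the .container file that might look like this (the secret and variable names are just examples):

[Container]
Image=docker.io/library/myapp:latest
# Default behavior: mounted as a file at /run/secrets/my_secret
Secret=my_secret
# Or exposed as an environment variable instead
Secret=my_secret,type=env,target=MYSECRET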
Rootless
So my examples above were not rootless containers, which are the best practice. You can get rootless to work, but the behavior is more complicated and has problems I wanted to call out. You need to use default.target and not multi-user.target, and it also looks like you do need loginctl enable-linger to allow the containers to start without that user being logged in.
Also remember that all of the systemctl commands need the --user argument and that you might need to change your sysctl parameters to allow rootless containers to run on privileged ports.
So for rootless networking Podman previously used slirp4netns and now uses pasta. Pasta doesn't do NAT and instead copies the IP address from your main network interface to the container namespace. Main in this case is defined as whatever interface has the default route. This can cause (obvious) problems with inter-container connections since it's all the same IP. You need to configure containers.conf to get around this problem.
Also ping didn't work for me. You can fix that with the solution here.
That sounds like a giant pain in the ass.
Yeah I know. It's not actually the fault of the Podman team. The way rootless containers work is basically that they use user_namespaces to emulate the privileges needed to create containers. Inside of the UserNS they can do things like create mount namespaces and set up networking. Outgoing connections are tricky because veth pairs cannot be created across UserNS boundaries without root, and inbound traffic relies on port forwarding.
So tools like slirp4netns and pasta are used since they can translate Ethernet packets to unprivileged socket system calls by making a tap interface available in the namespace. However the end result is you need to account for a lot of potential strangeness in the configuration file. I'm confident this will get less fiddly as time goes on.
Podman also has a tutorial on how to get it set up here: https://github.com/containers/podman/blob/main/docs/tutorials/rootless_tutorial.md which did work for me. If you do the work of rootless containers now you have a much easier security story for the rest of your app, so I do think it ultimately pays off even if it is annoying in the beginning.
Impressions
So as a replacement for Docker Compose on servers, I've really liked Quadlet. I find the logging to be easier to figure out since we're just using the standard systemctl commands, checking status is also easier and more straightforward. Getting the rootless containers running took....more time than I expected because I didn't think about how they wouldn't start by default until the user logged back in without the linger work.
It does stink that this is absolutely not a solution for local dev in most places. I prefer that Podman remains daemonless and instead hooks into the existing functionality of Systemd, but for people not running Linux as their local workstation (most people on Earth) you are either going to need to use the Podman Desktop Kubernetes functionality or use podman-compose and just be aware that it's not something you should use in actual production deployments.
But if you are looking for something that scales well, runs containers and is super easy to manage and keep running, this has been a giant hit for me.
A lot has been written in the last few weeks about the state of IT security in the aftermath of the CrowdStrike outage. Opinions range from blaming Microsoft for signing the CrowdStrike software (Microsoft, in turn, blames the EU for making them do it) to blaming the companies themselves for allowing all of these machines access to the Internet to receive the automatic template update. Bike-shedding among the technical community continues to focus on the underlying technical deployment, which misses the forest for the trees.
The better question is: what was the forcing mechanism that convinced every corporation in the world it was a good idea to install software like this on every single machine? Why is there such a cottage industry of companies effectively undermining operating system security with the argument that they provide more "advanced" security features, allowing (often unqualified) security and IT departments to make fundamental changes to things like TLS encryption and basic OS functionality? How did all these smart people let a random company push updates to everyone on Earth with zero control? The justification often given is "to pass the audit".
These audits and certifications, of which there are many, are a fundamentally broken practice. The intent of the frameworks was good, allowing for the standardization of good cybersecurity practices while not relying on the expertise of an actual cybersecurity expert to validate the results. We can all acknowledge there aren't enough of those people on Earth to actually audit all the places that need to be audited. The issue is the audits don't actually fix real problems, but instead create busywork for people so it looks like they are fixing problems. It lets people cosplay as security experts without needing to actually understand what the stuff is.
I don't come to this analysis lightly. Between HIPAA, PCI, GDPR, ISO27001 and SOC2 I've seen every possible attempt to boil requirements down to a checklist that you can do. Add in the variations on these that large companies like to send out when you are attempting to sell them an enterprise SaaS and it wouldn't surprise me at all to learn that I've spent over 10,000 hours answering and implementing solutions to meet the arbitrary requirements of these documents. I have both produced the hundred page PDFs full of impressive-looking screenshots and diagrams AND received the PDFs full of diagrams and screenshots. I've been on many calls where it is clear neither of us understands what the other is talking about, but we agree that it sounds necessary and good.
I have also been there in the room when inept IT and Security teams use these regulations, or more specifically their interpretation of these regulations, to justify kicking off expensive and unnecessary projects. I've seen laptops crippled due to full filesystem scans looking for leaked AWS credentials and Social Security numbers, even if the employee has nothing to do with that sort of data. I've watched as TLS encryption is broken with proxies so that millions of files can be generated and stored inside of S3 for security teams to never ever look at again. Even I have had to reboot my laptop to apply a non-critical OS update in the middle of an important call. All this inflicted on poor people who had to work up the enthusiasm to even show up to their stupid jobs today.
Why?
Why does this keep happening? How is it that every large company keeps falling into the same trap of repeating the same expensive, bullshit processes?
The actual steps to improve cybersecurity are hard and involve making executives mad. You need to update your software, including planning ahead for end-of-life technology. Since this dark art is apparently impossible and would involve a lot of downtime to patch known-broken shit and reboot it, we won't do that. Better, apparently, to lose the entire Earth's personal data.
Everyone is terrified that there might someday be a government regulation with actual consequences, so the industry needs a solution that sounds impressive but carries no real punishments. If Comcast executives could go to jail for knowingly running out-of-date Citrix NetScaler software, it would have been fixed. So instead we get impressive-sounding things which can be held up as evidence of compliance, and if they ultimately don't end up preventing leaks, the consequences are minor.
Nobody questions the justification of "we need to do x because of our certification". The actual requirements are too boring to read so it becomes this blank check that can be used to roll out nearly anything.
Easier to complete a million nonsense steps than it is to get in contact with someone who understands why the steps are nonsense. The number of times I've turned on silly "security settings" to pass an audit when the settings weren't applicable to how we used the product is almost too high to count.
Most Security teams aren't capable of stopping a dedicated attacker and, in their souls, know that to be true. Especially with large organizations, the number of conceivable attack vectors becomes too painful to even think about. Therefore too much faith is placed in companies like Zscaler and CrowdStrike to use "machine learning and AI" (read: magic) to close up all the possible exploits before they happen.
If your IT department works exclusively with Windows and spends their time working with GPOs and PowerShell, every problem you hand them will be solved with Windows. If you handed the same problem to a Linux person, you'd get a Linux solution. People just use what they know. So you end up with a one-size-fits-all approach to problems. Like mice in a maze where almost every step is electrified, if Windows loaded up with bullshit is what they are allowed to deploy without hassle, that is what you are going to get.
Future
We all know this crap doesn't work and the sooner we can stop pretending it makes a difference, the better. AT&T had every certification on the planet and still didn't take the incredibly basic step of enforcing 2FA on a database of all the most sensitive data it has in the world. If following these stupid checklists and purchasing the required software resulted in more secure platforms, I'd say "well, at least there is a payoff". But time after time we see the exact same thing, which is that an audit is not an adequate replacement for someone who knows what they are doing looking at your stack and asking hard questions about your process. These audits aren't resulting in organizations taking the hard but necessary step of scheduling downtime to patch critical flaws, or even applying basic security settings across all of their platforms.
Because cryptocurrency now allows for hacking groups to demand millions of dollars in payments (thanks crypto!), the financial incentives to cripple critical infrastructure have never been better. At the same time most regulations designed to encourage the right behavior are completely toothless. Asking the tech industry to regulate itself has failed, without question. All that does is generate a lot of pain and suffering for their employees, who most businesses agree are disposable and idiots. All this while doing nothing to secure personal data. Even in organizations that had smart security people asking hard questions, that advice is entirely optional. There is no stick with cybersecurity and businesses, especially now that almost all of them have made giant mistakes.
I don't know what the solution is, but I know this song and dance isn't working. The world would be better off if organizations stopped wasting so much time and money on these vendor solutions and instead stuck to much more basic solutions. Perhaps if we could just start with "have we patched all the critical CVEs in our organization" and "did we remove the shared username and password from the cloud database with millions of call records", then perhaps AFTER all the actual work is done we can have some fun and inject dangerous software into the most critical parts of our employees devices.
It was 4 AM when I first heard the tapping on the glass. I had been working for 30 minutes trying desperately to get everything from the back store room onto the sales floor when I heard a light knocking. Peeking out from the back I saw an old woman wearing sweat pants and a Tweetie bird jacket, oxygen tank in tow, tapping a cane against one of the big front windows. "WE DON'T OPEN UNTIL 5" shouted my boss, who shook her head and resumed stacking boxes. "Black Friday is the worst" she said to nobody as we continued to pile the worthless garbage into neat piles on the store floor.
What people know now but didn't understand then was the items for sale on Black Friday weren't our normal inventory. These were TVs so poorly made they needed time to let their CRT tubes warm up before the image became recognizable. Radios with dials so brittle some came out of the box broken. Finally a mixer that when we tested it in the back let out such a stench of melted plastic we all screamed to turn it off before we burned down the building. I remember thinking as I unloaded it from the truck certainly nobody is gonna want this crap.
Well here they were and when we opened the doors they rushed in with a violence you wouldn't expect from a crowd of mostly senior citizens. One woman pushed me to get at the TVs, which was both unnecessary (I had already hidden one away for myself and put it behind the refrigerators in the back) and not helpful as she couldn't lift the thing on her own. I watched in silence as she tried to get her hands around the box with no holes cut out, presumably a cost savings on Sears' part, grunting with effort as the box slowly slid while she held it. At the checkout desk a man told me he was buying the radio "as a Christmas gift for his son". "Alright but no returns ok?" I said, keeping a smile on my face.
We had digital cameras the size of shoe-boxes, fire-hazard blenders and an automatic cat watering dish that I just knew was going to break a lot of hearts when Fluffy didn't survive the family trip to Florida. You knew it was quality when the dye from the box rubbed off on your hands when you picked it up. Despite my jokes about worthless junk, people couldn't purchase it fast enough. I saw arguments break out in the aisles and saw Robert, our marine veteran sales guy, whisper "forget this" and leave for a smoke by the loading dock. When I went over to ask if I could help, the man who had possession of the digital camera spun around and told me to "either find another one of these cameras or butt the fuck out". They resumed their argument and I resumed standing by the front telling newcomers that everything they wanted was already gone.
Hours later I was still doing that, informing everyone who walked in that the item they had circled in the newspaper was already sold out. "See, this is such a scam, why don't you stock more of it? It's just a trick to get us into the store". Customer after customer told me variations on the above, including one very kind looking grandfather type informing me I could "go fuck myself" when I wished him a nice holiday.
Beginnings
The store was in my small rural farming town in Ohio, nestled between the computer shop where I got my first job and a carpet store that was almost certainly a money laundering front since nobody ever went in or out. I was interviewed by the owner, a Vietnam veteran who spent probably half our interview talking about his two tours in Vietnam. "We used to throw oil drums in the water and shoot at them from our helicopter, god that was fun. Don't even get me started about all the beautiful local women." I nodded, unsure what this had to do with me but sensing this was all part of his process. In the years to come I would learn to avoid sitting down in his office, since then you would be trapped listening to stories like these for an hour plus.
After these tales of what honestly sounded like a super fun war full of drugs and joyrides on helicopters, he asked me why I wanted to work at Sears. "It's an American institution and I've always had a lot of respect for it" I said, not sure if he would believe it. He nodded and went on to talk about how Sears built America. "Those kit houses around town, all ordered from Sears. Boy we were something back in the day. Anyway fill out your availability and we'll get you out there helping customers." I had assumed at some point I would get training on the actual products, which never happened in the years I worked there. In the back were dust-covered training manuals which I was told I should look at "when I got some time". I obviously never did and still sometimes wonder about what mysteries they contained.
I was given my lanyard and put on the floor, which consisted of half appliances, one quarter electronics and then the rest being tools. Jane, one of the saleswomen, told me to "direct all the leads for appliances to her" and not check one out myself, since I didn't get commission. Most of my job consisted of swapping broken Craftsman tools since they had a lifetime warranty. You filled out a carbon paper form, dropped the broken tool into a giant metal barrel and then handed them a new one. I would also set up deliveries for rider lawnmowers and appliances, working on an ancient IBM POS terminal that required memorizing a series of strange keyboard shortcuts to navigate the calendar.
When there was downtime, I would go into the back and help Todd assemble the appliances and rider lawnmowers. Todd was a special needs student at my high school who was the entirety of our "expert assembly" service. He did a good job, carefully following the manual every time. Whatever sense of superiority I felt as an honor roll student disappeared when he watched me try to assemble a rider mower myself. "You need to read the instructions and then do what they say" he would helpfully chime in as I struggled to figure out why the brakes did nothing. His mowers always started on the first try while mine were safety hazards that I felt certain were going to end up on the news. "Tonight a Craftsman rider lawnmower killed a family of 4. It was assembled by this idiot." Then just my yearbook photo where I had decided to bleach my hair blonde like a chonky backstreet boy overlaid on top of live footage of blood-splattered house siding.
Any feeling I had that people were wasting money paying us $200 to assemble their rider mowers disappeared when I saw the first one a customer had tried to assemble himself. If my mowers were death traps these were actual IEDs whose only conceivable purpose on Earth would be to trick innocent people into thinking they were rider lawnmowers until you turned the key and they blew you into the atmosphere. One guy brought his back with several ziplock bags full of screws, bashfully explaining that he tried his best but "there's just no way that's right". That didn't stop me from holding my breath every time someone drove a mower I had worked on up the ramp into the back of the truck. "Please god just don't fall apart right now, wait until they get it home" was my prayer to whatever deity looked after idiots in jobs they shouldn't have.
Sometimes actual adults with real jobs would come in asking me questions about tools, conversations that both of us hated. "I'm looking for an oil filter wrench" they would say, as if this item was something I knew about and could find. "Uh sure, could you describe it?" "It's a wrench, used for changing oil filters, has a loop on it." I'd nod and then feebly offer them up random items until they finally grabbed it themselves. One mechanic, when I offered up a claw hammer in response to his request for a cross-pein hammer, said "you aren't exactly handy, are you?" I shook my head and went back behind the counter, attempting to establish what little authority I had left with the counter. I might not know anything about the products we sell, but only one of us is allowed back here sir.
Sears Expert
As the months dragged on I was moved from the heavier foot traffic shifts to the night shifts. This was because customers "didn't like talking to me", a piece of feedback I felt was true but still unfair. I had learned a lot, like every incorrect way to assemble a lawn mower and that refrigerators are all the same except for the external panels. Night shifts were mostly getting things ready for the delivery company, a father and son team who were always amusing.
The father was a chain-smoking tough guy who would regularly talk about his "fuck up" of a son. "That idiot dents another oven when we're bringing it in I swear to god I'm going to replace him with one of those Japanese robots I keep seeing on the news." The son was the nicest guy on Earth, really hard working, always on time for deliveries and we got like mountains of positive feedback about him. Old ladies would tear up as they told me about the son hauling their old appliances away in a blizzard on his back. He would just sit there, smile frozen on his face while his father went on and on about how much of a failure he was. "He's just like this sometimes" the son would tell me by the loading dock, even though I would never get involved. "He's actually a nice guy". This was often punctuated by the father running into a minor inconvenience and flying off the handle. "What kind of jackass would sort the paperwork alphabetically instead of by order of delivery?" he'd scream from the parking lot.
When the son went off to college he was replaced by a Hispanic man who took zero shit. His response to customer complaints was always that they were liars and I think the father was afraid of him. "Oh hey don't bother Leo with that, he's not in the mood, I'll call them and work it out" the father would tell me as Leo glared at us from the truck. Leo was incredibly handy though, able to fix almost any dent or scratch in minutes. He popped the dent out of my car door by punching the panel, which is still one of the cooler things I've seen someone do.
Other than the father and son duo, I was mostly alone with a woman named Ruth. She fascinated me because her life was unspeakably bleak. She had been born and raised in this town and had only left the county once in her life, to visit the Sears headquarters in Chicago. She'd talk about it like she had been permitted to visit heaven. "Oh it was something, just a beautiful shiny building full of the smartest people you ever met. Boy I'd love to see it again sometime." She had married her high school boyfriend, had children and now worked here in her 60s as her reward for a life of hard work. She had such bad pain in her knees she had to lean on the stocking cart as she pushed it down the aisles, often stopping to catch her breath. The store would be empty except for the sounds of a wheezing woman and squeaky wheels.
When I would mention Chicago was a 4 hour drive and she could see it again, she'd roll her eyes at me and continue stocking shelves. Ruth was a type of rural person I encountered a lot who seemed to get off on the idea that we were actually isolated from the outside world by a force field. Mention leaving the county to go perhaps to the next county and she would laugh or make a comment about how she wasn't "that kind of person". Every story she would tell had these depressing endings that left me pondering what kind of response she was looking for. "My brother, well he went off to war and when he came back was just a shell of a man. Never really came back if you ask me. Anyway let's clean the counters."
She'd talk endlessly about her grandson, a 12 year old who was "stupid but kind". His incredibly minor infractions were relayed to me like she was telling me about a dark family scandal. "Then I said, who ate all the chips? I knew he had, but he just sat there looking at me and I told him you better wipe those crumbs off your t-shirt smartass and get back to your homework". He finally visited and I was shocked to discover there was also a granddaughter who I had never heard about. He smirked when he met me and told me that Ruth had said I was "a lazy snob".
I'll admit, I was actually a little hurt. Was I a snob compared to Ruth? Absolutely. To be honest with you I'm not entirely sure she was literate. I'd sneak books under the counter to read during the long periods where nothing was happening and she'd often ask me what they were about even if the title sort of explained it. "What is Battle Cry of Freedom: The Civil War Era about? Um well the Civil War." I'd often get called over to "check" documents for her, which typically included anything more complicated than a few sentences. I still enjoyed working with her.
Our relationship never really recovered after I went to Japan when I was 16. I went by myself and wandered around Tokyo, having a great time. When I returned full of stories and pictures of the trip, I could tell she was immediately sick of me. "Who wants to see a place like Japan? Horrible people" she'd tell me as I tried to tell her that things had changed a tiny bit since WWII. "No it's really nice and clean, the food was amazing, let me tell you about these cool trains they have". She wasn't interested and it was clear my getting a passport and leaving the US had changed her opinion of me.
So when her grandson confided that she had called me lazy AND a snob my immediate reaction was to lean over and tell him that she had called him "a stupid idiot". Now she had never actually said "stupid idiot", but in the heat of the moment I went with my gut. Moments after I did that the reality of a 16 year old basically bullying a 12 year old sunk in and I decided it was time for me to go take out some garbage. Ruth of course found out what I said and mentioned it every shift after that. "Saying I called my grandson a stupid idiot, who does that, a rude person that's who, a rude snob" she'd say loud enough for me to hear as the cart very slowly inched down the aisles. I deserved it.
Trouble In Paradise
At a certain point I was allowed back in front of customers and realized with a shock that I had worked there for a few years. The job paid very little, which was fine as I had nothing in the town to actually buy, but enough to keep my lime green Ford Probe full of gas. It shook violently if you exceeded 70 MPH, which I should have asked someone about but never did. I was paired with Jane, the saleswoman who was a devout Republican and liked to make fun of me for being a Democrat. This was during the George W Bush vs Kerry election and she liked to point out how Kerry was a "flipflopper" on things. "He just flips and flops, changes his mind all the time". I'd point out we had vaporized the country of Iraq for no reason and she'd roll her eyes and tell me I'd get it when I was older.
My favorite was when we were working together during Reagan's funeral, an event which elicited no emotion from me but drove her to tears multiple times. "Now that was a man and a president" she'd exclaim to the store while the funeral procession was playing on the 30 TVs. "He won the Cold War you know?" she'd shout at a woman looking for replacement vacuum cleaner bags. Afterwards she asked me what my favorite Reagan memory was. All I could remember was that he had invaded the small nation of Grenada for some reason, so I said that. "Really showed those people not to mess with the US" she responded. I don't think either one of us knew that Grenada is a tiny island nation with a population less than 200,000.
Jane liked to dispense country wisdom, witty one-liners that only sometimes were relevant to the situation at hand. When confronted with an angry customer she would often say afterwards that "you can't make a silk purse out of a sow's ear," which still means nothing to me. Whatever rural knowledge I was supposed to obtain through osmosis my brain clearly rejected. Jane would send me over to sell televisions since I understood what an HDMI cord was and the difference between SD and HD television.
Selling TVs was perhaps the only thing I did well, that and the fun vacuum demonstration where we would dump a bunch of dirt on a carpet tile and suck it up. Some poor customer would tell me she didn't have the budget for the Dyson and I'd put my hand up to silence her. "You don't have to buy it, just watch it suck up a bunch of pebbles. I don't make commission anyway so who cares." Then we'd both watch as the Dyson would make a horrible screeching noise and suck in a cup's worth of small rocks. "That's pretty cool huh?" and the customer would nod, probably terrified of what I would do if she said no.
Graduation
When I graduated high school and prepared to go off to college, I had the chance to say goodbye to everyone before I left. They had obviously already replaced me with another high school student, one that knew things about tools and was better looking. You like to imagine that people will miss you when you leave a job, but everyone knew that wasn't true here. I had been a normal employee who didn't steal and mostly showed up on time.
My last parting piece of wisdom from Ruth was not to let college "make me forget where I came from". Sadly for her I was desperate to do just that, entirely willing to adopt whatever new personality was presented to me. I'd hated rural life and still do: the spooky dark roads surrounded by corn, yelling at Amish teens to stop shoplifting during their Rumspringa when they would get dropped off in the middle of town and left to their own devices.
Still, I'm grateful that I at least know how to assemble a riding lawnmower, even if it did take a lot of practice runs on customers' mowers.
DevOps, like many trendy technology terms, has gone from the peak of optimism to the depths of exhaustion. While many of the fundamental ideas behind the concept have become second nature for organizations, proving it did in fact have a measurable outcome, the difference between the initial intent and where we ended up is vast. For most organizations this didn't result in a wave of safer, easier-to-use software, but instead encouraged new patterns of work that centralized risk and introduced delays and irritations that didn't exist before. We can move faster than before, but that didn't magically fix all our problems.
The cause of its death was a critical misunderstanding of what was making software hard to write. The belief was that by removing barriers to deployment, more software would get deployed and things would be easier and better; effectively, that developers and operations teams were being held back by ridiculous process and coordination. In reality these "soft problems" of communication and coordination are much more difficult to solve than the technical problems around pushing more code out into the world more often.
What is DevOps?
DevOps, when it was introduced around 2007, was a pretty radical concept of removing the divisions between people who ran the hardware and people who wrote the software. Organizations still had giant silos between teams, and I experienced a lot of that workflow firsthand.
Since all computer nerds also love space, it was basically us cosplaying as NASA, copying a lot of the procedures and ideas from NASA to try to increase the safety around pushing code out into the world. Different organizations would copy and paste different parts, but the basic premise was that every release was as close to bug-free as time allowed. You were typically shooting for zero exceptions.
When I worked for a legacy company around that time, the flow for releasing software looked as follows.
The development team would cut a release of the server software, with a release number, in conjunction with the frontend team; the two were typically packaged together as a full entity. They would test this locally on their machines, then it would go to dev for QA to test, then finally out to customers once the QA checks were cleared.
Operations teams would receive a playbook of effectively what the software was changing and what to do if it broke. This would include how it was supposed to be installed and whether it did anything to the database; it was a whole living document. The idea was that the people managing the servers, networking equipment and SANs had no idea what the software did or how to fix it, so they needed what were effectively step-by-step instructions. Sometimes you would even get this as a paper document.
Since these releases often happened inside your own datacenter, you didn't have unlimited elasticity for growth. So, if possible, you would slowly roll out the update and stop to monitor at intervals. But you couldn't do what people now recognize as a blue/green deployment, because you rarely had enough excess server capacity to run two versions at the same time for all users. Some orgs did deploy to different datacenters at different times and cut between them (which was considered to be sort of the highest tier of safety).
You'd pick a deployment day, typically middle of the week around 10 AM local time and then would monitor whatever metrics you had to see if the release was successful or not. These were often pretty basic metrics of success, including some real eyebrow raising stuff like "is support getting more tickets" and "are we getting more hits to our uptime website". Effectively "is the load balancer happy" and "are customers actively screaming at us".
You'd finish the deployment and then the on-call team would monitor the progress as you went.
Why Didn't This Work
Part of the issue was this design was very labor-intensive. You needed enough developers coordinating together to put together a release. Then you needed a staffed QA team to actually take that software and ensure, on top of automated testing which was jusssttttt starting to become a thing, that the software actually worked. Finally you needed a technical writer working with the development team to walk through what a release playbook looks like, and then the Operations team had to receive the book, review it and implement the plan.
It was also slow. Features would often be pushed back for months even when they were done, just because a more important feature had to go out first, or because an update was making major changes to the database and we didn't want to bundle six other things in with the one possibly catastrophic change. It's effectively the Agile vs Waterfall debate broken out into practical steps.
A lot of the lip service around this time that was given as to why organizations were changing was, frankly, bullshit. The real reason companies were so desperate to change was the following:
Having lots of mandatory technical employees they couldn't easily replace was a bummer
Recruitment was hard and expensive.
Sales couldn't easily inject whatever last-minute deal requirement they had into the release cycle since that was often set in stone.
It provided an amazing opportunity for SaaS vendors to inject themselves into the process by offloading complexity into their stack so they pushed it hard.
The change also emphasized the strengths of cloud platforms at the time when they were starting to gobble market share. You didn't need lots of discipline, just allocate more servers.
Money was (effectively) free so it was better to increase speed regardless of monthly bills.
Developers were understandably frustrated that minor changes could take weeks to get out the door while they were being blamed for customer complaints.
So executives went to a few conferences and someone asked them if they were "doing DevOps" and so we all changed our entire lives so they didn't feel like they weren't part of the cool club.
What Was DevOps?
Often this image is used to sum it up:
In a nutshell, the basic premise was that development teams and operations teams were now one team. QA was fired and replaced with this idea that because you could very quickly deploy new releases and get feedback on those releases, you didn't need a lengthy internal test period where every piece of functionality was retested and determined to still be relevant.
Often this is conflated with the concept of SRE from Google, which I will argue until I die is a giant mistake. SRE is in the same genre but a very different tune, with a much more disciplined and structured approach to this problem. DevOps instead is about simplifying the stack such that any developer on your team can deploy to production as many times in a day as they wish, with only a minimal amount of control on that deployment to ensure it has a reasonably high chance of working.
In reality DevOps as a practice looks much more like how Facebook operated, with employees committing to production on their first day and relying extensively on real-world signals to determine success or failure vs QA and tightly controlled releases.
In practice it looks like this:
Development makes a branch in git and adds a feature, fix, change, etc.
They open up a PR and then someone else on that team looks at it, sees it passes their internal tests, approves it and then it gets merged into main. This is effectively the only safety step, relying on the reviewer to have perfect knowledge of all systems.
This triggers a webhook to the CI/CD system which starts the build (often of an entire container with this code inside) and then once the container is built, it's pushed to a container registry.
The CD system tells the servers that the new release exists, often through a Kubernetes deployment, pushing a new version of an internal package, or using the internal CLI of the cloud provider's specific "run a container as a service" platform. It then monitors and tells you about the success or failure of that deployment (there's a rough sketch of this step after the list).
Finally there are release-aware metrics which allow that same team, who is on-call for their application, to see if something has changed since they released it. Is latency up, error count up, etc. This is often just a line in a graph saying this was old and this is new.
Depending on the system, this can either be something where every time the container is deployed it is on brand-new VMs or it is using some system like Kubernetes to deploy "the right number" of containers.
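To make that deploy-and-monitor step concrete, here is a minimal sketch of what it can look like with the Kubernetes Python client. The Deployment name, namespace and image tag are placeholders, not a description of any real pipeline.

# Minimal sketch of the "CD tells the cluster about the new release" step.
# Assumes a Deployment named "myapp" already exists in "default"; names and
# image tag are placeholders.
import time
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

NEW_IMAGE = "registry.example.com/myapp:v2"   # hypothetical image tag

# Patch the Deployment with the new image, which triggers a rolling update.
apps.patch_namespaced_deployment(
    name="myapp",
    namespace="default",
    body={"spec": {"template": {"spec": {"containers": [
        {"name": "myapp", "image": NEW_IMAGE}
    ]}}}},
)

# Crudely poll the rollout status and report success or failure.
for _ in range(60):
    dep = apps.read_namespaced_deployment(name="myapp", namespace="default")
    desired = dep.spec.replicas or 0
    updated = dep.status.updated_replicas or 0
    available = dep.status.available_replicas or 0
    if updated == desired and available == desired:
        print("rollout complete")
        break
    time.sleep(5)
else:
    print("rollout did not finish in time; consider rolling back")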
The sales pitch was simple. Everyone can do everything so teams no longer need as many specialized people. Frameworks like Rails made database operations less dangerous, so we don't need a team of DBAs. Hell, use something like Mongo and you never need a DBA!
DevOps combined with Agile ended up with a very different philosophy of programming which had the following conceits:
The User is the Tester
Every System Is Your Specialization
Speed Of Shipping Above All
Catch It In Metrics
Uptime Is Free, SSO Costs Money (cheap-to-provide features were sold as premium, while expensive availability wasn't charged for)
Logs Are Business Intelligence
What Didn't Work
The first cracks in this model emerged pretty early on. Developers were testing on their local Mac and Windows machines and then deploying code to Linux servers configured from Ansible playbooks and left running for months, sometimes years. Inevitably small differences in the running fleet of production servers emerged, either from package upgrades for security reasons or just from random configuration events. This could be mitigated by frequently rotating the running servers by destroying and rebuilding them as fresh VMs, but in practice this wasn't done as often as it should have been.
Soon you would see things like "it's running fine on box 1,2, 4, 5, but 3 seems to be having problems". It wasn't clear in the DevOps model who exactly was supposed to go figure out what was happening or how. In the previous design someone who worked with Linux for years and with these specific servers would be monitoring the release, but now those team members often wouldn't even know a deployment was happening. Telling someone who is amazing at writing great Javascript to go "find the problem with a Linux box" turned out to be easier said than done.
Quickly feedback from developers started to pile up. They didn't want to have to spend all this time figuring out what Debian package they wanted for this or that dependency. It wasn't what they were interested in doing and also they weren't being rewarded for that work, since they were almost exclusively being measured for promotions by the software they shipped. This left the Operations folks in charge of "smoothing out" this process, which in practice often meant really wasteful practices.
You'd see really strange workflows around this time of doubling the number of production servers you were paying for by the hour during a deployment and then slowly scaling them down, all relying on the same AMI (server image) to ensure some baseline level of consistency. However since any update to the AMI required a full dev-stage-prod check, things like security upgrades took a very long time.
Soon you had just a pile of issues that became difficult to assign. Who "owned" platform errors that didn't result in problems for users? When a build worked locally but failed inside of Jenkins, what team needed to check that? The idea of "we're all working on the same team" broke down when it came to assigning ownership of annoying issues, because someone had to own them or they'd just sit there forever untouched.
Enter Containers
DevOps got a real shot in the arm with the popularization of containers, which allowed the movement to progress past its awkward teenage years. Not only did this (mostly) solve the "it worked on my machine" thing but it also allowed for a massive simplification of the Linux server component part. Now servers were effectively dumb boxes running containers, either on their own with Docker compose or as part of a fleet with Kubernetes/ECS/App Engine/Nomad/whatever new thing that has been invented in the last two weeks.
Combine that with being able to move almost everything that might previously have been a networking team problem or a SAN problem into configuration inside the cloud provider through tools like Terraform, and you saw a real flattening of the skill curve. This greatly reduced the expertise required to operate these platforms and allowed for more automation. Soon you started to see what we now recognize as the current standard for development, which is "I push out a bajillion changes a day to production".
What Containers Didn't Fix
So there's a lot of other shit in that DevOps model we haven't talked about.
So far teams had improved the "build, test and deploy" parts. However operating the crap was still very hard. Observability was really really hard and expensive. Discoverability was actually harder than ever because stuff was constantly changing beneath your feet and finally the Planning part immediately collapsed into the ocean because now teams could do whatever they wanted all the time.
Operate
This meant someone going through and doing all the boring stuff. Upgrading Kubernetes, upgrading the host operating system, making firewall rules, setting up service meshes, enforcing network policies, running the bastion host, configuring the SSH keys, etc. What organizations quickly discovered was that this stuff was very time consuming to do and often required more specialization than the roles they had previously gotten rid of.
Before you needed a DBA, a sysadmin, a network engineer and some general Operations folks. Now you needed someone who not only understood databases but understood your specific cloud providers version of that database. You still needed someone with the sysadmin skills, but in addition they needed to be experts in your cloud platform in order to ensure you weren't exposing your database to the internet. Networking was still critical but now it all existed at a level outside of your control, meaning weird issues would sometimes have to get explained as "well that sometimes happens".
Often teams would delay maintenance tasks out of a fear of breaking something like k8s or their hosted database, but that resulted in delaying the pain and making their lives more difficult. This was the era where every startup I interviewed with basically just wanted someone to update all the stuff in their stack "safely". Every system was well past EOL and nobody knew how to Jenga it all together.
Observe
As applications shipped more often, knowing they worked became more important so you could roll back if a release blew up in your face. However replacing simple uptime checks with detailed traces, metrics and logs was hard. These technologies are specialized and require a detailed understanding of what they do and how they work. A centralized syslog box works up to a point and then it doesn't. Prometheus scales to some number of metrics on a single box and then it no longer does. You needed someone who had a detailed understanding of how metrics, logs and traces worked, and who could work with development teams to get them sending the correct signals to the right places at the right level of fidelity.
Or you could pay a SaaS a shocking amount to do it for you. The rise of companies like Datadog and the eye-watering bills that followed was proof that they understood how important what they were providing was. You quickly saw Observability bills exceed CPU and networking costs for organizations as one team would misconfigure their application logs and suddenly you have blown through your monthly quota in a week.
Developers were being expected to monitor with detailed precision what was happening with their applications without a full understanding of what they were seeing. How many metrics and logs were being dropped on the floor or sampled away, how the platform worked in displaying those logs to them, how to write a query over terabytes of logs so you can surface what you need quickly: all of this was being passed around in Confluence pages written by desperate developers who were learning how all this shit works together as they got paged at 2 AM.
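For a sense of what "sending the correct signal" even looks like at the application level, here is a minimal sketch using the prometheus_client Python library. The metric names, labels and port are made up for illustration; deciding label cardinality, sampling and retention is exactly the judgment call that kept landing on people who had never done it before.

# Minimal sketch of application-level metrics with prometheus_client.
# Metric names, labels and the port are illustrative only.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests", ["route", "status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds", ["route"])

def handle_request(route: str) -> None:
    start = time.time()
    status = "200" if random.random() > 0.05 else "500"   # stand-in for real work
    LATENCY.labels(route=route).observe(time.time() - start)
    REQUESTS.labels(route=route, status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes /metrics on this port
    while True:
        handle_request("/checkout")
        time.sleep(1)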
Continuous Feedback
This to me is the same problem as Observe. It's about whether your deployment worked or not and whether you had signal from internal tests if it was likely to work. It's also about feedback from the team on what in this process worked and what didn't, but because nobody ever did anything with that internal feedback we can just throw that one directly in the trash.
I guess in theory this would be retros where we all complain about the same six things every sprint and then continue with our lives. I'm not an Agile Karate Master so you'll need to talk to the experts.
Discover
A big pitch of combining these teams was the idea of more knowledge sharing. Development teams and Operation teams would be able to cross-share more about what things did and how they worked. Again it's an interesting idea and there was some improvement to discoverability, but in practice that isn't how the incentives were aligned.
Developers weren't rewarded for discovering more about how the platform operated and Operations didn't have any incentive to sit down and figure out how the frontend was built. It's not a lack of intellectual curiosity by either party, just the finite amount of time we all have before we die and what we get rewarded for doing. Being surprised that this didn't work is like being surprised a mouse didn't go down the tunnel with no cheese just for the experience.
In practice I "discovered" that if NPM was down nothing worked and the frontend team "discovered" that troubleshooting Kubernetes was a bit like Warhammer 40k Adeptus Mechanicus waving incense in front of machines they didn't understand in the hopes that it would make the problem go away.
Try restarting the Holy Deployment
Plan
Maybe more than anything else, this lack of centralization impacted planning. Since teams weren't syncing on a regular basis anymore, things could continue in crazy directions unchecked. In theory PMs were syncing with each other to try and ensure there were railroad tracks in front of the train before it plowed into the ground at 100 MPH, but that was a lot to put on a small cadre of people.
We see this especially in large orgs with microservices where it is easier to write a new microservice to do something rather than figure out which existing microservice does the thing you are trying to do. This model was sustainable when money was free and cloud budgets were unlimited, but once that gravy train crashed into the mountain of "businesses need to be profitable and pay taxes" that stopped making sense.
The Part Where We All Gave Up
A lot of orgs solved the problems above by simply throwing bodies into the mix. More developers meant it was possible for teams to have someone (anyone) learn more about the systems and how to fix them. Adding more levels of PMs and overall planning staff meant even with the frantic pace of change it was...more possible to keep an eye on what was happening. While cloud bills continued to go unbounded, for the most part these services worked and allowed people to do the things they wanted to do.
Then the layoffs and budget cuts started. Suddenly it wasn't acceptable to spend unlimited money with your logging platform and your cloud provider as well as having a full team. Almost instantly I saw the shift as organizations started talking about "going back to basics". Among this was a hard turn in the narrative around Kubernetes, where it went from an amazing technology that lets you grow to Google-scale to a weight around an organization's neck that nobody understood.
Platform Engineering
Since there are no new ideas, just new terms, a successor to the throne has emerged. No longer are development teams expected to understand and troubleshoot the platforms that run their software, instead the idea is that the entire process is completely abstracted away from them. They provide the container and that is the end of the relationship.
From a certain perspective this makes more sense since it places the ownership for the operation of the platform with the people who should have owned it from the beginning. It also removes some of the ambiguity over what is whose problem. The development teams are still on-call for their specific application errors, but platform teams are allowed to enforce more global rules.
Well at least in theory. In practice this is another expansion of roles. You went from needing to be a Linux sysadmin to being a cloud-certified Linux sysadmin to being a Kubernetes-certified multicloud Linux sysadmin to finally being an application developer who can create a useful webUI for deploying applications on top of a multicloud stack that runs on Kubernetes in multiple regions with perfect uptime and observability that doesn't blow the budget. I guess at some point between learning the difference between AWS and GCP we were all supposed to go out and learn how to make useful websites.
This division of labor makes no sense, but at least it's something I guess. Feels like somehow Developers got stuck with a lot more work and Operations teams now need to learn 600 technologies a week. Surprisingly tech executives didn't get any additional work with this system. I'm sure they'll chip in more at the next reorg.
Conclusion
We are now seeing a massive contraction of the Infrastructure space. Teams are increasingly looking for simple, less platform-specific tooling. In my own personal circles it feels like a real return to basics, as small and medium organizations abandon technology like Kubernetes and adopt much simpler, easier-to-troubleshoot workflows like "a bash script that pulls a new container".
In some respects it's a positive change, as organizations stop pretending they needed a "global scale" and can focus on actually servicing the users and developers they have. In reality a lot of this technology was adopted by organizations who weren't ready for it and didn't have a great plan for how to use it.
However Platform Engineering is not a magical solution to the problem. It is instead another fabrication of an industry desperate to show monthly growth, pushed by cloud providers who know teams lack the expertise to create the kind of tooling such practices describe. In reality organizations need to be more brutally honest about what they actually need vs what bullshit they've been led to believe they need.
My hope is that we keep the gains from the DevOps approach and focus on simplification and stability over rapid transformation in the Infrastructure space. I think we desperately need a return to basics ideology that encourages teams to stop designing with the expectation that endless growth is the only possible outcome of every product launch.
I was recently invited to try out the beta for GitHub's new AI-driven web IDE and figured it could be an interesting time to dip my toes into AI. So far I've mostly avoided AI tooling, having tried the paid GitHub Copilot option and been frankly underwhelmed; it made more work for me than it saved. However this is free for me to try and I figured "hey why not".
Disclaimer: I am not and have never been an employee of GitHub, Microsoft, any company owned by Microsoft, etc. They don't care about me and likely aren't aware of my existence. Nobody from GitHub PR asked me to do this and probably won't like what I have to say anyway.
TL;DR
GitHub Copilot Workspace didn't work on a super simple task regardless of how easy I made the task. I wouldn't use something like this for free, much less pay for it. It sort of failed in every way it could at every step.
What is GitHub Copilot Workspace?
So after the success of GitHub Copilot, which seems successful according to them:
In 2022, we launched GitHub Copilot as an autocomplete pair programmer in the editor, boosting developer productivity by up to 55%. Copilot is now the most widely adopted AI developer tool. In 2023, we released GitHub Copilot Chat—unlocking the power of natural language in coding, debugging, and testing—allowing developers to converse with their code in real time.
They have expanded on this feature set with GitHub Copilot Workspace, a combination of an AI tool with an online IDE....sorta. It's all powered by GPT-4 so my understanding is this is the best LLM money can buy. The workflow of the tool is strange and takes a little bit of explanation to convey what it is doing.
Very simple, makes sense. Then I click "Open in Workspaces", which brings me to a kind of GitHub Actions-inspired flow.
It reads the Issue and creates a Specification, which is editable.
Then you generate a Plan:
Finally it generates the files of that plan and you can choose whether to implement them or not and open a Pull Request against the main branch.
Implementation:
It makes a Pull Request:
Great right? Well except it didn't do any of it right.
It didn't add a route to the Flask app to expose this information
It didn't stick with the convention of storing the information in JSON files, writing it out to Markdown for some reason
It decided the way that it was going to reveal this information was to add it to the README
Finally it didn't get anywhere near all the machine types.
Before you ping me: yes, I tried to change the Proposed plan.
Baby Web App
So the app I've written here is primarily for my own use and it is very brain dead simple. The entire thing is the work of roughly an afternoon of poking around while responding to Slack messages. However I figured this would be a good example of maybe a more simple internal tool where you might trust AI to go a bit nuts since nothing critical will explode if it messes up.
How the site works: it relies on the output of the gcloud CLI tool to generate JSON of all the IAM permissions for GCP, then outputs them so that I can put them into categories and quickly look for the one I want. I found the official documentation to be slow and hard to use, so I made my own. It's a Flask app, which means it is pretty stupid simple.
I also have an endpoint I use during testing if I need to test some specific GDPR code so I can curl it and see if the IP address is coming from EU/EEA or not along with a TSID generator I used for a brief period of testing that I don't need anymore. So again, pretty simple. It could be rewritten to be much better but I'm the primary user and I don't care, so whatever.
So effectively what I want to add is another route where I would also have a list of all the GCP machine types because their official documentation is horrible and unreadable. https://cloud.google.com/compute/docs/machine-resource
Look how information packed it is! My god, I can tell at a glance if a machine type is eligible for Sustained Use Discounts, how many regions it is in, Hour/Spot/Month pricing and the breakout per OS along with Clock speed. If only Google had a team capable of making a spreadsheet.
Nothing I enjoy more than nested pages with nested submenus that lack all the information I would actually need. I'm also not clear what a Tier_1 bandwidth is but it does seem unlikely that it matters for machine types when so few have it.
I could complain about how GCP organizes information all day but regardless the information exists. So I don't need anything to this level, but could I make a simpler version of this that gives me some of the same information? Seems possible.
How I Would Do It
First let's try to stick with the gcloud CLI approach.
gcloud compute machine-types list --format="json"
Only problem with this is that it does output the information I want, but for some reason it outputs a JSON file per region.
I don't know why but sure. However I don't actually need every region so I can cheat here. gcloud compute machine-types list --format="json" gets me some of the way there.
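For reference, here's roughly how I'd consume that output in Python. It just shells out to the same gcloud command; the field names follow the Compute API MachineType resource (name, guestCpus, memoryMb), but double-check them against your own output.

# Rough sketch: shell out to gcloud and summarize machine types.
import json
import subprocess

raw = subprocess.run(
    ["gcloud", "compute", "machine-types", "list", "--format=json"],
    check=True,
    capture_output=True,
    text=True,
).stdout

machine_types = json.loads(raw)
seen = set()
for mt in machine_types:
    name = mt["name"]
    if name in seen:          # the same type shows up once per zone
        continue
    seen.add(name)
    print(f'{name}: {mt["guestCpus"]} vCPU, {mt["memoryMb"] / 1024:.1f} GB')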
Where's the price?
Yeah so Google doesn't expose pricing through the API as far as I can tell. You can download what is effectively a global price list for your account at https://console.cloud.google.com/billing/[your billing account id]/pricing. That's a 13 MB CSV that includes what your specific pricing will be, which is what I would use. So then I would combine the information from my region with the information from the CSV and then output the values. However since I don't know whether the pricing I have is relevant to you, I can't really use this to generate a public webpage.
Web Scraping
So realistically my only option would be to scrape the pricing page here: https://cloud.google.com/compute/all-pricing. Except of course it was designed in such a way as to make it as hard to do that as possible.
Boy it is hard to escape the impression GCP does not want me doing large-scale cost analysis. Wonder why?
So there's actually a tool called gcosts which seems to power a lot of these sites running price analysis. However it relies on a pricing.yml file which is automatically generated weekly. The work involved in generating this file is not trivial:
+--------------------------+ +------------------------------+
| Google Cloud Billing API | | Custom mapping (mapping.csv) |
+--------------------------+ +------------------------------+
↓ ↓
+------------------------------------------------------------+
| » Export SKUs and add custom mapping IDs to SKUs (skus.sh) |
+------------------------------------------------------------+
↓
+----------------------------------+ +-----------------------------+
| SKUs pricing with custom mapping | | Google Cloud Platform info. |
| (skus.db) | | (gcp.yml) |
+----------------------------------+ +-----------------------------+
\ /
+--------------------------------------------------+
| » Generate pricing information file (pricing.pl) |
+--------------------------------------------------+
↓
+-------------------------------+
| GCP pricing information file |
| (pricing.yml) |
+-------------------------------+
Alright so looking through the GitHub Action that generates this pricing.yml file, here, I can see how it works and how the file is generated. But also I can just skip that part and pull the latest for my usecase whenever I regenerate the site. That can be found here.
Effectively with no assistance from AI, I have now figured out how I would do this (a rough sketch follows the list):
Pull down the pricing.yml file and parse it
Take that information and output it to a simple table structure
Make a new route on the Flask app and expose that information
Add a step to the Dockerfile to pull in the new pricing.yml with every Dockerfile build just so I'm not hammering the GitHub CDN all the time.
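Here's that rough sketch, covering steps 1 through 3. I'm hand-waving the actual structure of pricing.yml (the keys below are placeholders; check the real schema in the gcosts repo before trusting any of this), but the point is how little code the route itself needs.

# Sketch of the Flask route. The pricing.yml layout assumed here is a
# placeholder; inspect the real file from the gcosts project first.
import yaml
from flask import Flask, render_template_string

app = Flask(__name__)
PRICING_FILE = "pricing.yml"   # pulled at image build time, see step 4

TABLE = """
<table>
  <tr><th>Machine type</th><th>vCPU</th><th>Memory (GB)</th><th>Hourly (USD)</th></tr>
  {% for row in rows %}
  <tr><td>{{ row.name }}</td><td>{{ row.cpu }}</td><td>{{ row.ram }}</td><td>{{ row.hour }}</td></tr>
  {% endfor %}
</table>
"""

@app.route("/machine-types")
def machine_types():
    with open(PRICING_FILE) as f:
        pricing = yaml.safe_load(f)
    rows = []
    # Hypothetical layout: a mapping of machine type name -> details.
    for name, details in pricing.get("instances", {}).items():
        rows.append({
            "name": name,
            "cpu": details.get("cpu"),
            "ram": details.get("ram"),
            "hour": details.get("hour"),
        })
    return render_template_string(TABLE, rows=rows)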
Why Am I Saying All This?
So this is a perfect example of an operation that should be simple but, because the vendor doesn't want to make it simple, is actually pretty complicated. As we can now tell from the PR generated before, AI is never going to be able to understand all the steps we just walked through to figure out how one actually gets the prices for these machines. We've also learned that because of the hard work of someone else, we can skip a lot of the steps. So let's try it again.
Attempt 2
Maybe if I give it super specific information, it can do a better job.
I think I've explained maybe what I'm trying to do. Certainly a person would understand this. Obviously this isn't the right way to organize this information, I would want to do a different view and sort by region and blah blah blah. However this should be easier for the machine to understand.
Note: I am aware that Copilot has issues making calls to the internet to pull files, even from GitHub itself. That's why I've tried to include a sample of the data. If there's a canonical way to pass the tool information inside of the issue let me know at the link at the bottom.
Results
So at first things looked promising.
It seems to understand what I'm asking and why I'm asking it. This is roughly the correct thing. The plan also looks ok:
It's not adding it to the menu bar, and there are actually a lot of pretty basic misses here. I wouldn't accept this PR from a person, but let's see if it works!
=> ERROR [6/8] RUN wget https://raw.githubusercontent.com/Cyclenerd/google-cloud-pricing-cost-calculator/master/pricing.yml -O pricing.yml 0.1s
------
> [6/8] RUN wget https://raw.githubusercontent.com/Cyclenerd/google-cloud-pricing-cost-calculator/master/pricing.yml -O pricing.yml:
0.104 /bin/sh: 1: wget: not found
No worries, easy to fix.
Alright fixed wget, let's try again!
2024-06-18 11:18:57 File "/usr/local/lib/python3.12/site-packages/gunicorn/util.py", line 371, in import_app
2024-06-18 11:18:57 mod = importlib.import_module(module)
2024-06-18 11:18:57 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-18 11:18:57 File "/usr/local/lib/python3.12/importlib/__init__.py", line 90, in import_module
2024-06-18 11:18:57 return _bootstrap._gcd_import(name[level:], package, level)
2024-06-18 11:18:57 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-18 11:18:57 File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
2024-06-18 11:18:57 File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
2024-06-18 11:18:57 File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
2024-06-18 11:18:57 File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
2024-06-18 11:18:57 File "<frozen importlib._bootstrap_external>", line 995, in exec_module
2024-06-18 11:18:57 File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
2024-06-18 11:18:57 File "/app/main.py", line 2, in <module>
2024-06-18 11:18:57 import yaml
2024-06-18 11:18:57 ModuleNotFoundError: No module named 'yaml'
Yeah I did anticipate this one. Alright let's add PyYAML so there's something to import. I'll give AI a break on this one, this is a dumb Python thing.
Ok so it didn't add it to the menu, it didn't follow the style conventions, but did it at least work? Also no.
I'm not sure how it could have done a worse job, to be honest. I understand what it did wrong and why this ended up like it did, but the work involved in fixing it exceeds the amount of work it would take for me to do it myself from scratch. The point of this was to give it a pretty simple concept (parse a YAML file) and see what it did.
Conclusion
I'm sure this tool is useful to someone on Earth. That person probably hates programming and gets no joy out of it, looking for something that could help them spend less time doing it. I am not that person. Having a tool that makes stuff that looks right but ends up broken is worse than not having the tool at all.
If you are a person maintaining an extremely simple thing with amazing test coverage, I guess go for it. Otherwise this is just a great way to get PRs that look right and completely waste your time. I'm sure there are ways to "prompt engineer" this better and if someone wants to tell me what I could do, I'm glad to re-run the test. However as it exists now, this is not worth using.
If you want to use it, here are my tips:
Your source of data must be inside of the repo; it doesn't like making network calls
It doesn't seem to go check any sort of requirements file for Python, so assume the dependencies are wrong
It understands Dockerfiles but doesn't check whether a binary is actually present, so add a check for that yourself
I was recently working on a new side project in Python with Kubernetes and I needed to inject a bunch of secrets. The problem with secret management in Kubernetes is that you end up needing to set up a lot of it yourself and it's time-consuming. When I'm working on a new idea, I typically don't want to waste a bunch of hours setting up "the right way" to do something that isn't related to the core of the idea I'm trying out.
For the record, the right way to do secrets in Kubernetes is the following:
Turn on encryption at rest for ETCD
Carefully set up RBAC inside of Kubernetes to ensure the right users and service accounts can access the secrets
Give up on trying to do that and end up setting up Vault or paying your cloud provider for their Secret Management tool
However, especially when trying ideas out, I wanted something more idiot-proof that didn't require any setup. So I wrote something simple with Python Fernet encryption that I thought might be useful to someone else out there.
So the script works in a pretty straightforward way. It reads the .env file you generate as outlined in the README, with secrets in the following format:
Make a .env file with the following parameters:
KEY=Make a fernet key: https://fernetkeygen.com/
CLUSTER_NAME=name_of_cluster_you_want_to_use
SECRET-TEST-1=9e68b558-9f6a-4f06-8233-f0af0a1e5b42
SECRET-TEST-2=a004ce4c-f22d-46a1-ad39-f9c2a0a31619
The KEY is the secret key and the CLUSTER_NAME tells the Kubernetes library what kubeconfig target you want to use. Then the tool finds anything with the word SECRET in the .env file and encrypts it, then writes it to the .csv file.
The .csv file looks like the following:
I really like to keep some sort of record of what secrets are injected into the cluster outside of the cluster, just so you can keep track of the encrypted values. Then the script checks the namespace you selected to see if a secret with that name already exists and, if not, injects it for you.
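This isn't the actual script, but a minimal sketch of the same idea might look like the following: parse the .env, Fernet-encrypt anything with SECRET in the name, record the encrypted values to a CSV, and create an immutable Kubernetes secret if one doesn't already exist. The namespace and file names are assumptions.

# Minimal sketch of the approach described above (not the actual script).
import base64
import csv

from cryptography.fernet import Fernet
from kubernetes import client, config
from kubernetes.client.rest import ApiException

NAMESPACE = "default"          # assumption; pick your target namespace

env = {}
with open(".env") as f:
    for line in f:
        if "=" in line:
            key, _, value = line.strip().partition("=")
            env[key] = value

fernet = Fernet(env["KEY"].encode())
config.load_kube_config(context=env.get("CLUSTER_NAME"))
v1 = client.CoreV1Api()

with open("secrets.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for key, value in env.items():
        if "SECRET" not in key:
            continue
        name = key.lower()                       # k8s wants lowercase names
        token = fernet.encrypt(value.encode())
        writer.writerow([name, token.decode()])  # keep a record outside the cluster
        try:
            v1.read_namespaced_secret(name, NAMESPACE)   # already exists? skip it
            continue
        except ApiException as e:
            if e.status != 404:
                raise
        secret = client.V1Secret(
            metadata=client.V1ObjectMeta(name=name),
            data={"value": base64.b64encode(token).decode()},  # base64, per the notes below
            immutable=True,
        )
        v1.create_namespaced_secret(NAMESPACE, secret)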
Some quick notes about the script:
Secret names in Kubernetes need a specific format: lowercase, with words separated by - or . The script will take the uppercase names in the .env and convert them to lowercase. Just be aware it is doing that.
It does base64 encode the secret before it uploads it, so be aware that your application will need to decode it when it loads the secret.
Now the only secret you need to worry about is the Fernet secret, which you can load into the application in a secure way. I find this is much easier to mentally keep track of than trying to build an infinitely scalable secret solution. Plus it's cheaper, since many secret managers charge per secret.
The secrets are immutable which means they are lightweight on the k8s API and fast. Just be aware you'll need to delete the secrets if you need to replace them. I prefer this approach because I'd rather store more things as encrypted secrets and not worry about load.
It is easy to specify which namespace you intend to load the secrets into and I recommend using a different Fernet secret per application.
Mounting the secret works like it always does in k8s
Inside of your application, you need to load the Fernet secret and decrypt the secrets. With Python that is pretty simple.
decrypt = fernet.decrypt(token)
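Fleshing that one-liner out a little, a sketch of the application side might look like this. The mount path and environment variable name are assumptions, and whether you need the extra base64 decode depends on how the value was stored (see the note about base64 encoding above).

# Sketch of the application side: load the Fernet key, read the mounted
# secret, and decrypt it. Paths and names are assumptions.
import base64
import os

from cryptography.fernet import Fernet

fernet = Fernet(os.environ["FERNET_KEY"].encode())   # the one secret you manage directly

# Kubernetes mounts each key of the secret as a file; this path is an assumption.
with open("/etc/secrets/value") as f:
    stored = f.read().strip()

token = base64.b64decode(stored)   # skip this if the mounted value is already the raw token
plaintext = fernet.decrypt(token).decode()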
Q+A
Why not SOPS? This is easier and also handles the process of making the API call to your k8s cluster to make the secret.
Is Fernet secure? As far as I can tell it's secure enough. Let me know if I'm wrong.
Would you make a CLI for this? If people actually use this thing and get value out of it, I would be more than happy to make it a CLI. I'd probably rewrite it in Golang if I did that, so if people ask it'll take me a bit of time to do it.