Work - matduggan.com

Slack: The Art of Being Busy Without Getting Anything Done

March 17, 2025 in Work

My first formal IT helpdesk role was basically "resetting stuff". I would get a ticket, an email or a phone call and would take the troubleshooting as far as I could go. Reset the password, check the network connection, confirm the clock time was right, ensure the issue persisted past a reboot, check the logs and see if I could find the failure event, then I would package the entire thing up as a ticket and escalate it up the chain.

It was effectively on the job training. We were all trying to get better at troubleshooting to get a shot at one of the coveted SysAdmin jobs. Moving up from broken laptops and desktops to broken servers was about as big as 22 year old me dreamed.

This is not what we looked like but how creepy is this photo?

Sometimes people would (rightfully) observe that they were spending a lot of time interacting with us, while the more senior IT people were working quietly behind us and they could probably fix the issue immediately. We would explain that, while that was true, our time was less valuable than theirs. Our role was to eliminate all of the most common causes of failure then to give them the best possible information to take the issue and continue looking at it.

There are people who understand waiting in a line and there are people who make a career around skipping lines. These VIPs encountered this flow in their various engineering organizations and decided that a shorter line between their genius and the cogs making the product was actually the "secret sauce" they needed.

Thus, Slack was born, a tool pitched to the rank and file as a nicer chat tool and to the leadership as a all-seeing eye that allowed them to plug directly into the nervous system of the business and get instant answers from the exact right person regardless of where they were or what they were doing.

My job as a professional Slacker

At first Slack-style chat seemed great. Email was slow and the signal to noise ratio was off, while other chat systems I had used before at work either didn't preserve state, so whatever conversation happened while you were offline didn't get pushed to you, or they didn't scale up to large conversations well. Both XMPP and IRC has the same issue, which is if you were there when the conversation was happening you had context, but otherwise no message history for you.

There were attempts to resolve this (https://xmpp.org/extensions/xep-0313.html) but support among clients was all over the place. The clients just weren't very good and were constantly going through cycles of intense development only to be abandoned. It felt like when an old hippie would tell you about Woodstock. "You had to be there, man".

Slack brought channels and channels bought a level of almost voyeurism into what other teams were doing. I knew exactly what everyone was doing all the time, down to I knew where the marketing team liked to go for lunch. Responsiveness became the new corporate religion and I was a true believer. I would stop walking in the hallway to respond to a DM or answer a question I knew the answer to, ignoring the sighs of frustration as people walked around my hoodie-clad roadblock of a body.

Sounds great, what's the issue?

So what's the catch? Well I first noticed it on the train. My daily commute home through the Chicago snowy twilight used to be a sacred ritual of mental decompression. A time to sift through the day's triumphs and (more often) the screw-ups. What needed fixing tomorrow? What problem had I pushed off maybe one day too long?

But as I got further and further into Slack, I realized I was coming home utterly drained yet strangely...hollow. I hadn't done any actual work that day.

My days had become a never-ending performance of "work". I was constantly talking about the work, planning the work, discussing the requirements of the work, and then in a truly Sisyphean twist, linking new people to old conversations where we had already discussed the work to get them up to speed on our conversation. All the while diligently monitoring my channels, a digital sentry ensuring no question went unanswered, no emoji not +1'd. That was it, that was the entire job.

Show up, spend eight hours orchestrating the idea of work, and then go home feeling like I'd tried to make a sandcastle on the beach and getting upset when the tide did what it always does. I wasn't making anything, I certainly wasn't helping our users or selling the product. I was project managing, but poorly, like a toddler with a spreadsheet.

And for the senior engineers? Forget about it. Why bother formulating a coherent question for a team channel when you could just DM the poor bastard who wrote the damn code in the first place? Sure, they could push back occasionally, feigning busyness or pointing to some obscure corporate policy about proper channel etiquette. But let's be real. If the person asking was important enough (read: had a title that could sign off on their next project), they were answering. Immediately.

So, you had your most productive people spending their days explaining why they weren't going to answer questions they already knew the answer to, unless they absolutely had to. It's the digital equivalent of stopping a concert pianist to teach you "Twinkle Twinkle Little Star" 6 times a day.

It's a training problem too

And don't even get me started on the junior folks. Slack was actively robbing them of the chance to learn. Those small, less urgent issues? That's where the real education happens. You get to poke around in the systems, see how the gears grind, understand the delicate dance of interconnectedness. But why bother troubleshooting when Jessica, the architect of the entire damn stack, could just drop the answer into a DM in 30 seconds? People quickly figured out the pecking order. Why wait four hours for a potentially wrong answer when the Oracle of Code was just a direct message away?

You think you are too good to answer questions???

Au contraire! I genuinely enjoy feeling connected to the organizational pulse. I like helping people. But that, my friends, is the digital guillotine. The nice guys (and gals) finish last in this notification-driven dystopia. The jerks? They thrive. They simply ignore the incoming tide of questions, their digital silence mistaken for deep focus. And guess what? People eventually figure out who will respond and only bother those poor souls. Humans are remarkably adept at finding the path of least resistance, even if it leads directly to someone else's burnout.

Then comes review time. The jerk, bless his oblivious heart, has been cranking out code, uninterrupted by the incessant digital demands. He has tangible projects to point to, gleaming monuments to his uninterrupted focus. The nice person, the one everyone loves, the one who spent half their day answering everyone else's questions? Their accomplishments are harder to quantify. "Well, they were really helpful in Slack..." doesn't quite have the same ring as "Shipped the entire new authentication system."

It's the same problem with being the amazing pull request reviewer. Your team appreciates you, your code quality goes up, you’re contributing meaningfully. But how do you put a number on "prevented three critical bugs from going into production"? You can't. So, you get a pat on the back and maybe a gift certificate to a mediocre pizza place.

Slackifying Increases

Time marches on, and suddenly, email is the digital equivalent of that dusty corner in your attic where you throw things you don't know what to do with. It's a wasteland of automated notifications from systems nobody cares about. But Slack? There’s no rhyme or reason to it. Can I message you after hours with the implicit understanding you'll ignore it until morning? Should I schedule the message for later, like some passive-aggressive digital time bomb?

And the threads! Oh, the glorious, nested chaos of threads. Should I respond in a thread to keep the main channel clean? Or should I keep it top-level so that if there's a misunderstanding, the whole damn team can pile on and offer their unsolicited opinions? What about DMs? Is there a secret protocol there? Or is it just a free-for-all of late-night "u up?" style queries about production outages?

It felt like every meeting had a pre-meeting in Slack to discuss the agenda, followed by an actual meeting on some other platform to rehash the same points, and then a post-meeting discussion in a private channel to dissect the meeting itself. And inevitably, someone who missed the memo would then ask about the meeting in the public channel, triggering a meta-post-meeting discussion about the pre-meeting, the meeting, and the initial post-meeting discussion.

The only way I could actually get any work done was to actively ignore messages. But then, of course, I was completely out of the loop. The expectation became this impossible ideal of perfect knowledge, of being constantly aware of every initiative across the entire company. It was like trying to play a gameshow and write a paper at the same time. To be seen as "on it", I needed to hit the buzzer and answer the question, but come review time none of those points mattered and the scoring was made up.

I was constantly forced to choose: stay informed or actually do something. If I chose the latter, I risked building the wrong thing or working with outdated information because some crucial decision had been made in a Slack channel I hadn't dared to open for fear of being sucked into the notification vortex. It started to feel like those brief moments when you come up for air after being underwater for too long. I'd go dark on Slack for a few weeks, actually accomplish something, and then spend the next week frantically trying to catch up on the digital deluge I'd missed.

Attention has a cost

One of the hardest lessons for anyone to learn is the profound value of human attention. Slack is a fantastic tool for those who organize and monitor work. It lets you bypass the pesky hierarchy, see who's online, and ensure your urgent request doesn't languish in some digital abyss. As an executive, you can even cut out middle management and go straight to the poor souls actually doing the work. It's digital micromanagement on steroids.

But if you're leading a team that's supposed to be building something, I'd argue that Slack and its ilk are a complete and utter disaster. Your team's precious cognitive resources are constantly being bled dry by a relentless stream of random distractions from every corner of the company. There are no real controls over who can interrupt you or how often. It's the digital equivalent of having your office door ripped off its hinges and replaced with glass like a zoo. Visitors can come and peer in on what your team is up to.

Turns out, the lack of history in tools like XMPP and IRC wasn't a bug, it was a feature. If something important needed to be preserved, you had to consciously move it to a more permanent medium. These tools facilitated casual conversation without fostering the expectation of constant, searchable digital omniscience.

Go look at the Slack for any large open-source project. It's pure, unadulterated noise. A cacophony of voices shouting into the void. Developers are forced to tune out, otherwise it's all they'd do all day. Users have a terrible experience because it's just a random stream of consciousness, people asking questions to other people who are also just asking questions. It's like replacing a structured technical support system with a giant conference call where everyone is on hold and told to figure it out amongst themselves.

My dream

So, what do I even want here? I know, I know, it's a fool's errand. We're all drowning in Slack clones now. You can't stop this productivity-killing juggernaut. It's like trying to un-ring a bell, or perhaps more accurately, trying to silence a thousand incessantly pinging notifications.

But I disagree. I still think it's not too late to have a serious conversation about how many hours a day it's actually useful for someone to spend on Slack. What do you, as a team, even want out of a chat client? For many teams, especially smaller ones, it makes far more sense to focus your efforts where there's a real payoff. Pick one tool, one central place for conversations, and then just…turn off the rest. Everyone will be happier, even if the tool you pick has limitations, because humans actually thrive within reasonable constraints. Unlimited choice, as it turns out, is just another form of digital torture.

Try to get away with the most basic, barebones thing you can for as long as you can. I knew a (surprisingly productive) team that did most of their conversation on an honest-to-god phpBB internal forum. Another just lived and died in GitHub with Issues. Just because it's a tool a lot of people talk about doesn't make it a good tool and just because it's old, doesn't make it useless.

As for me? I'll be here, with my Slack and Teams and Discord open trying to see if anything has happened in any of the places I'm responsible for seeing if something has happened. I will consume gigs of RAM on what, even ten years ago, would have been an impossibly powerful computer to watch basically random forum posts stream in live.

Why Login Security Sucks

September 06, 2024 in security

I've complained a lot about the gaps in offerings for login security in the past. The basic problem is this domain of security serves a lot of masters. To get the widest level of buy-in from experts, the solution has to scale from normal logins to national security. This creates a frustrating experience for users because it is often overkill for the level of security they need. Basically is it reasonable that you need Google Authenticator to access your gym website? In terms of communication, the solutions we hear about the most, i.e. with the most marketing, allow for the insertion of SaaS services into the chain so that an operation that was previously free now pays a monthly fee based on usage.

This creates a lopsided set of incentives where only the most technologically complex and extremely secure solutions are endorsed and when teams are (understandably) overwhelmed by their requirements a SaaS attempts to get inserted into a critical junction of their product.

The tech community have mostly agreed that username and passwords assigned by the user are not sufficient for even basic security. What we haven't done is precisely explained what it is that we want normal average non-genius developers to do about that. We've settled on this really weird place with the following rules:

Email accounts are always secure but SMS is never secure. You can always email a magic link and that's fine for some reason.
You should have TOTP but we've settled on very short time windows because I guess we decided NTP was a solved problem. There's no actual requirement the code changes every 30 seconds, we're just pretending that we're all spies and someone is watching your phone. Also consumers should be given recovery codes, which are basically just passwords you generate and give to them and only allow to be used once. It is unclear why generating a one-time password for the user is bad but if we call the password a "recovery code" it is suddenly sufficient.

TOTP serves two purposes. One is it ensures there is one randomly generated secret associated with the account that we don't hash (even though I think you could....but nobody seems to), so it's actually kind of a dangerous password that we need to encrypt and can't rotate. The other is we tacked on this stupid idea that it is multi-device, even though there's zero requirement that the code lives on another device. Just someone decided that because there is a QR code it is now multi-device because phones scan QR codes.
At some point we decided to add a second device requirement, but those devices live in entirely different ecosystems. Even if you have an iPhone and a work MacBook, they shouldn't be using the same Apple ID, so I'm not really clear how things would ever line up. It seems like most people sync things like TOTP with their personal Google accounts across different work devices over time. I can't imagine that was the intended functionality.
Passkeys are great but also their range of behavior is bizarre and unpredictable so if you implement them you will be expected to effectively build every other possible recovery flow into this system. Even highly technical users cannot be relied upon to know whether they will lose their passkey when they do something.
Offloading the task to a large corporation is good, but you cannot pick one big corporation. You must have a relationship with Apple and Facebook and Microsoft and Google and Discord and anyone else who happens to be wandering around when you build this. Their logins are secured with magic and unbreakable, but if they are bypassed you can go fuck yourself because that is your problem, not theirs.

All of this is sort of a way to talk around the basic problem. I need a username and a password for every user on my platform. That password needs to be randomly generated and never stored as plain text in my database. If I had a way to know that the browser generated and stored the password, this basic level of security is met. As far as I can tell, there's no way for me to know that for sure. I can guess based on the length of the password and how quickly it was entered into a form field.

Keep in mind all I am trying to do is build a simple login route on an application that is portable, somewhat future proof and doesn't require a ton of personal data from the user to resolve common human error problems. Ideally I'd like to be able to hand this to someone else, they generate a new secret and they too can enroll as many users as they want. This is a simple thing to build so it should be simple to solve the login story as well.

Making a simple CMS

The site you are reading this on is hosted on Ghost, a CMS that is written in Node. It supports a lot of very exciting features I don't use and comes with a lot of baggage I don't need. Effectively all I actually use for is:

RSS
Writing posts in its editor
Fixing typos in the posts I publish (sometimes, my writing is not good)
Let me write a million drafts for every thing I publish
Minimize the amount of JS I'm inflicting on people and try whenever possible to stick to just HTML and CSS

Ghost supports a million things on top of the things I have listed and it also comes with some strange requirements like running MySQL. I don't really need a lot of that stuff and running a full MySQL for a CMS that doesn't have any sort of multi-instance scaling functionality seems odd. I also don't want to stick something this complicated on the internet for people to use for long periods of time without regular maintenance.

Before you say it I don't care for static site generators. I find it's easier for me to have a tab open, write for ten minutes, then go back to what I was doing before.

My goal with this is just to make a normal friendly baby CMS that I could share with a group of people, less technical people, so they could write stuff when they felt like it. We're not trading nuclear secrets here. The requirements are:

Needs to be open to the public internet with no special device enrollment or network segmentation
Not administered by me. Whatever normal problem arises it has to be solvable by a non-technical person.

Making the CMS

So in a day when I was doing other stuff I put this together: https://gitlab.com/matdevdug/ezblog. It's nothing amazing, just sort of a basic template I can build on top of later. Uses sqlite and it does the things you would expect it to do. I can:

Write posts in Quill
Save the posts as drafts or as published posts
Edit the posts after I publish them
Have a valid RSS feed of the posts
The whole frontend is just HTML/CSS so it'll load fast and be easy to cache

Then there is the whole workflow of draft to published.

For one days work this seems to be roughly where I hoped to be. Now we get to the crux of the matter. How do I log in?

What you built is bad and I hate it

The point is I should be able to solve this problem quickly and easily for a hobby website, not that you personally like what I made. The examples are not fully-fleshed out examples, just templates to demonstrate the problem. Also I'm allowed to make stuff that serves no other function than it amuses me.

The default for most sites (including Ghost) is just a username and password. The reason for this: it's easy, works on everything and it's pretty simple to work out a fallback flow for users. Everyone understands it, there's no concerns around data ownership or platform lock-in.

My login page:

{% extends 'base.html' %}

{% block header %}
  <h1>{% block title %}Log In{% endblock %}</h1>
{% endblock %}

{% block content %}
  <form method="post">
    <input type="hidden" name="csrf_token" value="{{ csrf_token() }}">
    <label for="username">Username</label>
    <input name="username" id="username" required>
    <label for="password">Password</label>
    <input type="password" name="password" id="password" required>
    <input type="submit" value="Log In">
  </form>
{% endblock %}

I've got a csrf_token in there and the rest is pretty straight forward. Server-side is also pretty easy.

@bp.route('/login', methods=('GET', 'POST'))
@limiter.limit("5 per minute")
def login():
    if request.method == 'POST':
        username = request.form['username']
        password = request.form['password']
        db = get_db()
        error = None
        user = db.execute(
            'SELECT * FROM user WHERE username = ?', (username,)
        ).fetchone()

        if user is None:
            error = 'Incorrect username.'
        elif not check_password_hash(user['password'], password):
            error = 'Incorrect password.'

        if error is None:
            session.clear()
            session['user_id'] = user['id']
            return redirect(url_for('index'))

        flash(error)

    return render_template('auth/login.html')

I'm not storing the raw password, just the hash. It's requires almost no work to do. It works exactly the way I think it should. Great fine.

Why are passwords insufficient?

This has been talked to death but let's recap for the sake of me being able to say I did it and you can just kinda scroll quickly through this part.

Users reuse usernames and passwords, so even though I might not know the raw text of the password another website might be (somehow) even lazier than me and their database gets leaked and then oh no I'm hacked.
The password might be a bad password and it's just one people try and oh no they are in the system.
I have to build in a password reset flow because humans are bad at remembering things and that's just how it is.

Password Reset Flow

Everyone has seen this, but let's talk about what I would need to modify about this small application to allow more than one person to use it.

I would need to add a route that handles allowing the user to reset their password by requesting it through their email
To know where to send that email, I would need to receive and store the email address for every user
I would also need to verify the users email to ensure it worked
All of this hinges on having a token I could send to that user that I could generate with something like the following:

def generate_reset_password_token(self):
    serializer = URLSafeTimedSerializer(current_app.config["SECRET_KEY"])

    return serializer.dumps(self.email, salt=self.password_hash)

Since I'm salting it with the hash of the current password which will change when they change the password, the token can only be used once. Makes sense.

Why is this bad?

For a ton of reasons.

I don't want to know an email address if I don't need it. There's no reason to store more personal information about a user that makes the database more valuable if someone were to steal it.
Email addresses change. You need to write another route which handles that process, which isn't hard but then you need to decide whether you need to confirm that the user has access to address 1 with another magic URL or if it is sufficient to say they are currently logged in.
Finally it sort of punts the problem to email and says "well I assume and hope your email is secure even if statistically you probably use the same password for both".

How do you fix this?

The problem can be boiled down to 2 basic parts:

I don't want the user to tell me a username, I want a randomly generated username so it further reduces the value of information stored in my database. It also makes it harder to do a random drive-by login attempt.
I don't want to own the password management story. Ideally I want the browser to do this on its side.
In a perfect world I want a response that says "yes we have stored these credentials somewhere under this users control" and I can wash my hands of that until we get into the situation where somehow they've lost access to the sync account (which should hopefully be rare enough that we can just do that in the database).

The annoying thing is this technology already exists.

The Credential Manager API does the things I am talking about. Effectively I would need to add some Javascript to my Registration page:

    <script>
        document.getElementById('register-form').addEventListener('submit', function(event) {
            event.preventDefault(); // Prevent form submission

            const username = document.getElementById('username').value;
            const password = document.getElementById('password').value;

            // Save credentials using Credential Management API
            if ('credentials' in navigator) {
                const cred = new PasswordCredential({
                    id: username,
                    password: password
                });

                // Store credentials in the browser's password manager
                navigator.credentials.store(cred).then(() => {
                    console.log('Credentials stored successfully');
                    // Proceed with registration, for example, send credentials to your server
                    registerUser(username, password);
                }).catch(error => {
                    console.error('Error storing credentials:', error);
                });
            } 
        });

        function registerUser(username, password) {
            // Simulate server registration request
            fetch('/register', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ username: username, password: password })
            }).then(response => {
                if (response.ok) {
                    console.log('User registered successfully');
                    // Redirect or show success message
                } else {
                    console.error('Registration failed');
                }
            });
        }
    </script>

Then on my login page something like this:

function attemptAutoLogin() {
    if ('credentials' in navigator) {
        navigator.credentials.get({password: true}).then(cred => {
            if (cred) {
                // Send the credentials to the server to log in the user
                fetch('/login', {
                    method: 'POST',
                    body: new URLSearchParams({
                        'username': cred.id,
                        'password': cred.password
                    })
                }).then(response => {
                    // Handle login success or failure
                    if (response.ok) {
                        console.log('User logged in');
                    } else {
                        console.error('Login failed');
                    }
                });
            }
        }).catch(error => {
            console.error('Error retrieving credentials:', error);
        });
    }
}

// Call the function when the page loads
document.addEventListener('DOMContentLoaded', attemptAutoLogin);

So great, I assign a random cred.id and cred.password, stick it in the browser and then I sorta wash my hands of it.

We know the password is stored somewhere and can be synced for free
We know the user can pull the password out and put it somewhere else if they want to switch platforms
Browsers handle password migrations for users

The problem with this approach is I don't know if I'm supposed to use it.

I have no idea what this means. Could this go away? In testing it does seem like the performance is all over the place. Firefox seems to have some issues with this, whereas Chrome seems to always nail it. iOS Safari also seems to have some problems. So this isn't seemingly reliable enough to use.

Just please just make this a thing that works everywhere.

Before you yell at me about Math.random I think the following would make a good password:

function generatePassword(length) {
  const charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
  let password = "";
  for (let i = 0; i < length; i++) {
    const randomIndex = Math.floor(Math.random() * charset.length);
    password += charset.charAt(randomIndex);
  }
  return password;
}

const password = generatePassword(32);
console.log(password);

Source

But also maybe I'm wrong. I never have any idea if I'm generating random things in a safe way. Every time I do it someone tells me I did it wrong.

TOTP

Alright so I can't get away with just a password, so I have to assume the password is bunk and use it as one element of login. Then I have to use either TOTP or HOTP.

From a user perspective TOTP works as follows:

Set up 2FA for your online account.
Get a QR code.
You scan this QR code with an authenticator app of your choice
Your app will immediately start generating these six-digit tokens.
The website asks you to provide one of these six-digit tokens.

Practically this is pretty straight forward. I add a few extra libraries:

import io
import pyotp
import qrcode
from flask import send_file

I have to generate a secret totp_secret = pyotp.random_base32() which then I have to store in the database. Then I have to generate a QR code to show the user so they can generate the time-based codes.

otp_uri = pyotp.totp.TOTP(totp_secret).provisioning_uri(username, issuer_name="ezblog")
                qr = qrcode.make(otp_uri)
                buf = io.BytesIO()
                qr.save(buf)
                buf.seek(0)

                return send_file(buf, mimetype='image/png')

However the more you look into this, the more complicated it gets.

You actually don't need the token to be 6 digits. It can be up to 10. I don't know why I'd want more or less. Presumably more is better.
The token can be valid for longer than 30 seconds. From reading it seems like that makes the code less reliant on perfect time sync between client and server (great) but also increases the probability of someone stealing the TOTP and using it. That doesn't seem like a super likely attack vector here so I'll make it way longer. But then why don't more services use longer tokens if the only concern then is if someone sees my code? Is this just people being unspeakably annoying?
I need to add some recovery step in case you lose access to the TOTP code.

How do you recover from a TOTP failure?

Effectively I'm back to my original problem. I can either:

Go back to the email workflow I don't want because again I don't want to rely on email as some sort of super-secure bastion and I really don't want to store email addresses.
Or I generate a recovery code and give you those codes which let you bypass the TOTP requirement. That at least lets me be like "this is no longer my fault". I like that.

How do I make a recovery code?

Honest to god I have no idea. As far as I can tell a "recovery code" is just a randomly generated value I hash and stick in the database and then when the user enters it on a form, check the hash. It's just another password. I don't know why all the recovery codes I see are numbers, since it seems to have no relationship to that and would likely work with any string.

Effectively all I need to do with the recovery code is ensure it gets burned once used. Which is fine, but now I'm confused. So I'm generating passwords for the user and then I give the passwords back to the user and tell them to store them somewhere? Why don't I just give them the one good password for the initial login and call it a day? Why is one forbidden and the other is mandatory?

Does HOTP help?

I'm really still not clear how HOTP works. Like I understand the basics:

@app.route('/verify_2fa', methods=['GET', 'POST'])
def verify_2fa():
    if request.method == 'POST':
        hotp = pyotp.HOTP(user_data['hotp_secret'])
        otp = request.form['otp']
        if hotp.verify(otp, user_data['counter']):
            user_data['counter'] += 1  # Increment the counter after successful verification
            return redirect(url_for('index'))
        flash('Invalid OTP')
    return render_template('verify_2fa.html')

There is a secret per-user and a counter and then I increment the counter every single time the user logs in. As far as I can tell there isn't a forcing mechanism which keeps the client and the server in-sync, so basically you tap a button and generate a password and then if you accidentally tap the button again the two counters are off. It seems like then the server has to decide "are you a reasonable number of times off or an unreasonable amount of counts off". With the PyOTP library I don't see a way for me to control that:

verify(otp: str, counter: int) → bool[source]

    Verifies the OTP passed in against the current counter OTP.

    Parameters:

            otp – the OTP to check against

            counter – the OTP HMAC counter

So I mean I could test it against a certain range of counters from the counter I know and then accept it if it falls within that window, but you still are either running a different application or an app on your phone to enter this code. I'm not sure exactly why I would ever use this over TOTP, but it definitely doesn't seem easier to recover from.

So TOTP would work with the recovery code but this seems aggressive to ask a normal people to install a different program on their computer or phone in order to login based on a time-based code which will stop working if the client and server (who have zero way to sync time with each other) drift too far apart. Then I need to give you recovery codes and just sorta hope you have somewhere good to put those.

That said, it is the closest to solving the problem because those are at least normal understandable human problems and it does meet my initial requirement of "the user has one good password". It's also portable and allows administrators to be like "well you fell through the one safety net, account is locked, make a new one".

What is the expected treatment of the TOTP secret?

When I was writing this out I became unsure if I'm allowed to hash this secret. Like in theory I should be able to, because I don't need to recover it. If the user was to go through a TOTP reset flow, then I would probably (presumably) want to generate a new secret in which case there's nothing stopping me from using a strong key derivation function.

None of the tutorials I was able to find seemed to have any opinion on this topic. It seems like using encryption is the SOP, which is fine (it's not sitting on disk as a plain string) but introduces another failure point. It seems odd there isn't a way to negotiate a rotation with a client or really provide any sort of feedback. It meets my initial requirement, but the more I read about TOTP the more surprised I was it hasn't been better thought out.

Things I would love from TOTP/HOTP

Some sort of secret rotation process would be great. It doesn't have to be common, but it would be nice if there was some standard way of informing the client.
Be great if we more clearly explained to people how long the codes should be valid for. Certainly 1 hour is sufficient for consumer-level applications right?
Explain like what would I do if the counters get off with HOTP. Certainly some human error must be accounted for by the designers. People are going to hit the button too many times at some point.

Use Google/Facebook/Apple

I'm not against using these sorts of login buttons except I can't offer just one, I need to offer all of them. I have no idea what login that user is going to have or what make sense for them to use. It also means I need to manage some sort of app registration with each one of these companies for each domain that they can suspend approximately whenever they feel like it because they're giant megacorps.

So now I can't just spin up as many copies of this thing as I want with different URLs and I need to go through and test each one to ensure they work. I also need to come up with some sort of migration path for if one of them disappears and I need to authenticate the users into their existing accounts but using a different source of truth.

Since I cannot think of a way to do that which doesn't involve me basically emailing a magic link to the email address I get sent in the response from your corpo login and then allowing that form to update your user account with a different "real_user_id" I gotta abandon this. It just seems like a tremendous amount of work to not really "solve" the problem but just make the problem someone else's fault if it doesn't work.

Like if a user could previously log into a Facebook account and now no longer can, there's no customer service escalation they can go on. They can effectively go fuck themselves because nobody cares about one user encountering a problem. But that means you would still need some way of being like "you were a Facebook user and now you are a Google user". Or what if the user typically logs in with Google, clicks Facebook instead and now has two accounts? Am I expected to reconcile the two?

It's also important to note that I don't want any permissions and I don't want all the information I get back. I don't want to store email address or real name or anything like that, so again like the OAuth flow is overkill for my usage. I have no intention of requesting permissions on behalf of these users with any of these providers.

Use Passkeys

Me and passkeys don't get along super well, mostly because I think they're insane. I've written a lot about them in the past: https://matduggan.com/passkeys-as-a-tool-for-user-retention/ and I won't dwell on it except to say I don't think passkeys are designed with the first goal being an easy user experience.

But regardless passkeys do solve some of my problems.

Since I'm getting a public key I don't care if my database gets leaked
In theory I don't need an email address for fallback because on some platforms some of the time they sync
If users care a lot about ownership of personal data they can use a password manager sometimes if the password manager knows the right people and idk is friends with the mayor of passkeys or something. I don't really understand how that works, like what qualifies you to store the passkeys.

My issue with passkeys is I cannot conceive of a even "somewhat ok" fallback plan. So you set it up on an iPhone with a Windows computer at home. You break your iPhone and get an Android. It doesn't seem that crazy of a scenario to me to not have any solution for. Do I need your phone number on top of all of this? I don't want that crap sitting in a database.

Tell the users to buy a cross-platform password manager

Oh ok yeah absolutely normal people care enough about passwords to pay a monthly fee. Thanks for the awesome tip. I think everyone on Earth would agree they'd give up most of the price of a streaming platform full of fun content to pay for a password manager. Maybe I should tell them to spin up a docker container and run bitwarden while we're at it.

Anyway I have a hyper-secure biometric login as step 1 and then what is step 2, as the fallback? An email magic link? Based on what? Do I give you "recovery codes" like I did with TOTP? It seems crazy to layer TOTP on top of passkeys but maybe that...makes some sense as a fallback route? That seems way too secure but also possibly the right answer?

I'm not even trying to be snarky, I just don't understand what would be the acceptable position to take here.

What to do from here

Basically I'm left where I started. Here are my options:

Let the user assign a username and password and hope they let the browser or password manager do it and assume it is a good one.
Use the API in the browser to generate a good username and password and store it, hoping they always use a supported browser and that this API doesn't go away in the future.
Generate a TOTP but then also give them passwords called "recovery codes" and then hope they store those passwords somewhere good.
Use email magic links a lot and hope they remember to change their email address here when they lose access to an old email.
Use passkeys and then add on one of the other recovery systems and sort of hope for the best.

What basic stuff would I need to solve this problem forever:

The browser could tell me if it generated the password or if the user typed the password. If they type the password, force the 2FA flow. If not, don't. Let me tell the user "seriously let the system make the password". 1 good password criteria met.
Have the PasswordCredential API work everywhere all the time and I'll make a random username and password on the client and then we can just be done with this forever.
Passkeys but they live in the browser and sync like a normal password. Passkey lite. Passkey but not for nuclear secrets.
TOTP but if recovery codes are gonna be a requirement can we make it part of the spec? It seems like a made-up concept we sorta tacked on top.

I don't think these are crazy requirements. I just think if we want people to build more stuff and for that stuff to be secure, someone needs to sit down and realistically map out "how does a normal person do this". We need consistent reliable conventions I can build on top of, not weird design patterns we came up with because the initial concept was never tested on normal people before being formalized into a spec.

The Time Linkerd Erased My Load Balancer

May 03, 2024 in kubernetes

A cautionary tale of K8s CRDs and Linkerd.

A few months ago I had the genius idea of transitioning our production load balancer stack from Ingress to Gateway API in k8s. For those unaware, Ingress is the classic way of writing a configuration to tell a load balancer what routes should hit what services, effectively how do you expose services to the Internet. Gateway API is the re-imagined process for doing this where the problem domain is scoped, allowing teams more granular control over their specific services routes.

Ingress

apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: external-lb
spec:
  controller: example.com/ingress-controller
  parameters:
    apiGroup: k8s.example.com
    kind: IngressParameters
    name: external-lb

This is what setting up a load balancer in Ingress looks like

After conversations with various folks at GCP it became clear to me that while Ingress wasn't deprecated or slated to be removed, Gateway API is where all the new development and features are moving to. I decided that we were a good candidate for the migration since we are a microservice based backend with lower and higher priority hostnames, meaning we could safely test the feature without cutting over all of our traffic at the same time.

I had this idea that we would turn on both Ingress and Gateway API and then cut between the two different IP addresses at the Cloudflare level. From my low-traffic testing this approach seemed to work ok, with me being able to switch between the two and then letting Gateway API bake for a week or two to shake out any problems. Then I decided to move to prod. Due to my lack of issues in the lower environments I decided that I wouldn't set up Cloudflare load balancing between the two and manage the cut-over in Terraform. This turned out to be a giant mistake.

The long and short of it is that the combination of Gateway API and Linkerd in GKE fell down under high volume of requests. Low request volume there were no problems, but once we got to around 2k requests a second the Linkerd-proxy sidecar container memory usage started to grow unbounded. When I attempted to cut back from Gateway API to Ingress, I encountered a GKE bug I hadn't seen before in the lower environments.

"Translation failed: invalid ingress spec: service "my_namespace/my_service" is type "ClusterIP", expected "NodePort" or "LoadBalancer";

What we were seeing was a mismatch between the annotations automatically added by GKE.

Ingress adds these annotations:
cloud.google.com/neg: '{"ingress":true}'
cloud.google.com/neg-status: '{"network_endpoint_groups":{"80":"k8s1pokfef..."},"zones":["us-central1-a","us-central1-b","us-central1-f"]}'

Gateway adds these annotations:
cloud.google.com/neg: '{"exposed_ports":{"80":{}}}'
cloud.google.com/neg-status: '{"network_endpoint_groups":{"80":"k8s1-oijfoijsdoifj-..."},"zones":["us-central1-a","us-central1-b","us-central1-f"]}'

Gateway doesn't understand the Ingress annotations and vice-versa. This obviously caused a massive problem and blew up in my face. I had thought I had tested this exact failure case, but clearly prod surfaced a different behavior than I had seen in lower environments. Effectively no traffic was getting to pods while I tried to figure out what had broken.

I ended up making to manually modify the annotations to get things working and had a pretty embarrassing blow-up in my face after what I had thought was careful testing (but was clearly wrong).

Fast Forward Two Months

I have learned from my mistake regarding the Gateway API and Ingress and was functioning totally fine on Gateway API when I decided to attempt to solve the Linkerd issue. The issue I was seeing with Linkerd was high-volume services were seeing their proxies consume unlimited memory, steadily growing over time but only while on Gateway API. I was installing Linkerd with their Helm libraries, which have 2 components, the Linkerd CRD chart here: https://artifacthub.io/packages/helm/linkerd2/linkerd-crds and the Linkerd control plane: https://artifacthub.io/packages/helm/linkerd2/linkerd-control-plane

Since debug logs and upgrades hadn't gotten me any closer to a solution as to why the proxies were consuming unlimited memory until they eventually were OOMkilled, I decided to start fresh. I removed the Linkerd injection from all deployments and removed the helm charts. Since this was a non-prod environment, I figured at least this way I could start fresh with debug logs and maybe come up with some justification for what was happening.

Except the second I uninstalled the charts, my graphs started to freak out. I couldn't understand what was happening, how did removing Linkerd break something? Did I have some policy set to require Linkerd? Why was my traffic levels quickly approaching zero in the non-prod environment?

Then a coworker said "oh it looks like all the routes are gone from the load balancer". I honestly hadn't even thought to look there, assuming the problem was some misaligned Linkerd policy where our deployments required encryption to communicate or some mistake on my part in the removal of the helm charts. But they were right, the load balancers didn't have any routes. kubectl confirmed, no HTTProutes remained.

So of course I was left wondering "what just happened".

Gateway API

So a quick crash course in "what is gateway API". At a high level, as discussed before, it is a new way of defining Ingress which cleans up the annotation mess and allows for a clean separation of responsibility in an org.

So GCP defines the GatewayClass, I make the Gateway and developer provide the HTTPRoutes. This means developers can safely change the routes to their own services without the risk that they will blow up the load balancer. It also provides a ton of great customization for how to route traffic to a specific service.

So first you make a Gateway like so in Helm or whatever:

---
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: {{ .Values.gateway_name }}
  namespace: {{ .Values.gateway_namespace }}
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        kinds:
        - kind: HTTPRoute
        namespaces:
          from: Same
    - name: https
      protocol: HTTPS
      port: 443
      allowedRoutes:
        kinds:
          - kind: HTTPRoute
        namespaces:
          from: All
      tls:
        mode: Terminate
        options:
          networking.gke.io/pre-shared-certs: "{{ .Values.pre_shared_cert_name }},{{ .Values.internal_cert_name }}"

Then you provide a different YAML of HTTPRoute for the redirect of http to https:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: redirect
  namespace: {{ .Values.gateway_namespace }}
spec:
  parentRefs:
  - namespace: {{ .Values.gateway_namespace }}
    name: {{ .Values.gateway_name }}
    sectionName: http
  rules:
  - filters:
    - type: RequestRedirect
      requestRedirect:
        scheme: https

Finally you can set policies.

---
apiVersion: networking.gke.io/v1
kind: GCPGatewayPolicy
metadata:
  name: tls-ssl-policy
  namespace: {{ .Values.gateway_namespace }}
spec:
  default:
    sslPolicy: tls-ssl-policy
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: {{ .Values.gateway_name }}

Then your developers can configure traffic to their services like so:

kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: store
spec:
  parentRefs:
  - kind: Gateway
    name: internal-http
  hostnames:
  - "store.example.com"
  rules:
  - backendRefs:
    - name: store-v1
      port: 8080
  - matches:
    - headers:
      - name: env
        value: canary
    backendRefs:
    - name: store-v2
      port: 8080
  - matches:
    - path:
        value: /de
    backendRefs:
    - name: store-german
      port: 8080

Seems Straightforward

Right? There isn't that much to the thing. So after I attempted to re-add the HTTPRoutes using Helm and Terraform (which of course didn't detect a diff even though the routes were gone because Helm never seems to do what I want it to do in a crisis) and then ended up bumping the chart version to finally force it do the right thing, I started looking around. What the hell had I done to break this?

When I removed linkerd crds it somehow took out my httproutes. So then I went to the Helm chart trying to work backwards. Immediately I see this:

{{- if .Values.enableHttpRoutes }}
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    api-approved.kubernetes.io: https://github.com/kubernetes-sigs/gateway-api/pull/1923
    gateway.networking.k8s.io/bundle-version: v0.7.1-dev
    gateway.networking.k8s.io/channel: experimental
    {{ include "partials.annotations.created-by" . }}
  labels:
    helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    linkerd.io/control-plane-ns: {{.Release.Namespace}}
  creationTimestamp: null
  name: httproutes.gateway.networking.k8s.io
spec:
  group: gateway.networking.k8s.io
  names:
    categories:
    - gateway-api
    kind: HTTPRoute
    listKind: HTTPRouteList
    plural: httproutes
    singular: httproute
  scope: Namespaced
  versions:

Sure enough, Linkerd CRD Helm chart has that set to default True:

I also found this issue: https://github.com/linkerd/linkerd2/issues/12232

So yeah, Linkerd is, for some reason, pulling this CRD from a pull request from April 6th of last year that is marked as "do not merge". https://github.com/kubernetes-sigs/gateway-api/pull/1923

Linkerd is aware of the possible problem but presumes you'll catch the configuration option on the Helm chart: https://github.com/linkerd/linkerd2/issues/11586

To be clear I'm not "coming after Linkerd" here. I just thought the whole thing was extremely weird and wanted to make sure, given the amount of usage Linkerd gets out there, that other people were made aware of it before running the car into the wall at 100 MPH.

What are CRDs?

Kubernetes Custom Resource Definitions (CRDs) essentially extend the Kubernetes API to manage custom resources specific to your application or domain.

CRD Object: You create a YAML manifest file defining the Custom Resource Definition (CRD). This file specifies the schema, validation rules, and names of your custom resource.
API Endpoint: When you deploy the CRD, the Kubernetes API server creates a new RESTful API endpoint for your custom resource.

Effectively when I enabled Gateway API in GKE with the following I hadn't considered that I could end up in a CRD conflict state with Linkerd:

  gcloud container clusters create CLUSTER_NAME \
    --gateway-api=standard \
    --cluster-version=VERSION \
    --location=CLUSTER_LOCATION

What I suspect happened is, since I had Linkerd installed before I had enabled the gateway-api on GKE, when GCP attempted to install the CRD it failed silently. Since I didn't know there was a CRD conflict, I didn't understand that the CRD that the HTTPRoutes relied on was actually the Linkerd maintained one, not the GCP one. Presumably had I attempted to do this the other way it would have thrown an error when the Helm chart attempted to install a CRD that was already present.

To be clear before you call me an idiot, I am painfully aware that the deletion of CRDs is a dangerous operation. I understand I should have carefully checked and I am admitting I didn't in large part because it just never occurred to me that something like Linkerd would do this. Think of my failure to check as a warning to you, not an indictment against Kubernetes or whatever.

Conclusion

If you are using Linkerd and Helm and intend to use Gateway API, this is your warning right now to go in there and flip that value in the Helm chart to false. Learn from my mistake.

Questions/comments/concerns: https://c.im/@matdevdug

IAM Is The Worst

March 15, 2024 in DevOps

Imagine your job was to clean a giant office building. You go from floor to floor, opening doors, collecting trash, getting a vacuum out of the cleaning closet and putting it back. It's a normal job and part of that job is someone gives you a key. The key opens every door everywhere. Everyone understands the key is powerful, but they also understand you need to do your job.

Then your management hears about someone stealing janitor keys. So they take away your universal key and they say "you need to tell Suzie, our security engineer, which keys you need at which time". But the keys don't just unlock one door, some unlock a lot of doors and some desk drawers, some open the vault (imagine this is the Die Hard building), some don't open any doors but instead turn on the coffee machine. Obviously the keys have titles, but the titles mean nothing. Do you need the "executive_floor/admin" key or the "executive_floor/viewer" key?

But you are a good employee and understand that security is a part of the job. So you dutifully request the keys you think you need, try to do your job, open a new ticket when the key doesn't open a door you want, try it again, it still doesn't open the door you want so then there's another key. Soon your keyring is massive, just a clanging sound as you walk down the hallway. It mostly works, but a lot of the keys open stuff you don't need, which makes you think maybe this entire thing was pointless.

The company is growing and we need new janitors, but they don't want to give all the new janitors your key ring. So they roll out a new system which says "now the keys can only open doors that we have written down that this key can open, even if it says "executive_floor/admin". The problem is people move offices all the time, so even if the list of what doors that key opened was true when it was issued, it's not true tomorrow. The Security team and HR share a list, but the list sometimes drifts or maybe someone moves offices without telling the right people.

Soon nobody is really 100% sure what you can or cannot open, including you. Sure someone can audit it and figure it out, but the risk of removing access means you cannot do your job and the office doesn't get cleaned. So practically speaking the longer someone works as a janitor the more doors they can open until eventually they have the same level of access as your original master key even if that wasn't the intent.

That's IAM (Identity and access management) in cloud providers today.

Stare Into Madness

It's Not Natural, It's Just Simple: Food Branding Co-Opts Another Mean

Honestly I don't even know why I'm complaining. Of course it's entirely reasonable to expect anyone working in a cloud environment to understand the dozen+ ways that they may or may not have access to a particular resource. Maybe they have permissions at a folder level, or an org level, but that permission is gated by specific resources.

Maybe they don't even have access but the tool they're interacting with the resource with has permission to do it, so they can do it but only as long as they are SSH'd into host01, not if they try to do it through some cloud shell. Possibly they had access to it before, but now they don't since they moved teams. Perhaps the members of this team were previously part of some existing group but now new employees aren't added to that group so some parts of the team can access X but others cannot. Or they actually have the correct permissions to the resource but the resource is located in another account and they don't have the right permission to traverse the networking link between the two VPCs.

Meanwhile someone is staring at these flowcharts trying to figure out what in hell is even happening here. As someone who has had to do this multiple times in my life, let me tell you the real-world workflow that ends up happening.

Developer wants to launch a new service using new cloud products. They put in a ticket for me to give them access to the correct "roles" to do this.
I need to look at two elements of it, both what are the permissions the person needs in order to see if the thing is working and then the permissions the service needs in order to complete the task it is trying to complete.
So I go through my giant list of roles and try to cobble together something that I think based on the names will do what I want. Do you feel like a roles/datastore.viewer or more of a roles/datastore.keyVisualizerViewer? To run backups is roles/datastore.backupsAdmin sufficient or do I need to add roles/datastore.backupSchedulesAdmin in there as well?
They try it and it doesn't work. Reopen the ticket with "I still get authorizationerror:foo". I switch that role with a different role, try it again. Run it through the simulator, it seems to work, but they report a new different error because actually in order to use service A you need to also have a role in service B. Go into bathroom, scream into the void and return to your terminal.
We end up cobbling together a custom role that includes all the permissions that this application needs and the remaining 90% of permissions are something it will never ever use but will just sit there as a possible security hole.
Because /* permissions are the work of Satan, I need to scope it to specific instances of that resource and just hope nobody ever adds a SQS queue without....checking the permissions I guess. In theory we should catch it in the non-prod environments but there's always the chance that someone messes up something at a higher level of permissions that does something in non-prod and doesn't exist in prod so we'll just kinda cross our fingers there.

GCP Makes It Worse

So that's effectively the AWS story, which is terrible but at least it's possible to cobble together something that works and you can audit. Google looked at this and said "what if we could express how much we hate Infrastructure teams as a service?" Expensive coffee robots were engaged, colorful furniture was sat on and the brightest minds of our generation came up with a system so punishing you'd think you did something to offend them personally.

Google looked at AWS and said "this is a tire fire" as corporations put non-prod and prod environments in the same accounts and then tried to divide them by conditionals. So they came up with a folder structure:

The problem is that this design encourages unsafe practices by promoting "groups should be set at the folder level with one of the default basic roles". It makes sense logically at first that you are a viewer, editor or owner. But as GCP adds more services this model breaks down quickly because each one of these encompasses thousands upon thousands of permissions. So additional IAM predefined roles were layered on.

People were encouraged to move away from the basic roles and towards the predefined roles. There are ServiceAgent roles that were designated for service accounts, aka the permissions you actual application has and then everything else. Then there are 1687 other roles for you to pick from to assign to your groups of users.

The problem is none of this is actually best practice. Even when assigning users "small roles", we're still not following the principal of least privilege. Also the roles don't remain static. As new services come online permissions are added to roles.

The above is an automated process that pulls down all the roles from the gcloud CLI tool and updates them for latest. It is a constant state of flux with roles with daily changes. It gets even more complicated though.

You also need to check the launch stage of a role.

Custom roles include a launch stage as part of the role's metadata. The most common launch stages for custom roles are ALPHA, BETA, and GA. These launch stages are informational; they help you keep track of whether each role is ready for widespread use. Another common launch stage is DISABLED. This launch stage lets you disable a custom role.

We recommend that you use launch stages to convey the following information about the role:

EAP or ALPHA: The role is still being developed or tested, or it includes permissions for Google Cloud services or features that are not yet public. It is not ready for widespread use.
BETA: The role has been tested on a limited basis, or it includes permissions for Google Cloud services or features that are not generally available.
GA: The role has been widely tested, and all of its permissions are for Google Cloud services or features that are generally available.
DEPRECATED: The role is no longer in use.

Who Cares?

Why would anyone care if Google is constantly changing roles? Well it matters because with GCP to make a custom role, you cannot combine predefined roles. Instead you need to go down to the permission level to list out all of the things those roles can do, then feed that list of permissions into the definition of your custom role and push that up to GCP.

In order to follow best practices this is what you have to do. Otherwise you will always be left with users that have a ton of unused permissions along with the fear of a security breach allowing someone to execute commands in your GCP account through an applications service account that cause way more damage than the actual application justifies.

So you get to build automated tooling which either queries the predefined roles for change over time and roll those into your custom roles so that you can assign a user or group one specific role that lets them do everything they need. Or you can assign these same folks multiple of the 1600+ predefined roles, accept that they have permissions they don't need and also just internalize that day to day you don't know how much the scope of those permissions have changed.

The Obvious Solution

Why am I ranting about this? Because the solution is so blindly obvious I don't understand why we're not already doing it. It's a solution I've had to build, myself, multiple times and at this point am furious that this keeps being my responsibility as I funnel hundreds of thousands of dollars to cloud providers.

What is this obvious solution? You, an application developer, need to launch a new service. I give you a service account that lets you do almost everything inside of that account along with a viewer account for your user that lets you go into the web console and see everything. You churn away happily, writing code that uses all those new great services. Meanwhile, we're tracking all the permissions your application and you are using.

At some time interval, 30 or 90 or whatever days, my tool looks at the permissions your application has used over the last 90 days and says "remove the global permissions and scope it to these". I don't need to ask you what you need, because I can see it. In the same vein I do the same thing with your user or group permissions. You don't need viewer everywhere because I can see what you've looked at.

Both GCP and AWS support this and have all this functionality baked in. GCP has the role recommendations which tracks exactly what I'm talking about and recommends lowering the role. AWS tracks the exact same information and can be used to do the exact same thing.

What if the user needs different permissions in a hurry?

This is not actually that hard to account for and again is something I and countless others have been forced to make over and over. You can issue expiring permissions in both situations where a user can request a role be temporarily granted to them and then it disappears in 4 hours. I've seen every version of these, from Slack bots to websites, but they're all the same thing. If user is in X group they're allowed to request Y temporary permissions. OR if the user is on-call as determined with an API call to the on-call provider they get more powers. Either design works fine.

That seems like a giant security hole

Compared to what? Team A guessing what Team B needs even though they don't ever do the work that Team B does? Some security team receiving a request for permissions and trying to figure out if the request "makes sense" or not? At least this approach is based on actual data and not throwing darts at a list of IAM roles and seeing what "feels right".

Conclusion

IAM started out as an easy idea that as more and more services were launched, started to become nightmarish to organize. It's too hard to do the right thing now and it's even harder to do the right thing in GCP compared to AWS. The solution is not complicated. We have all the tools, all the data, we understand how they fit together. We just need one of the providers to be brave enough to say "obviously we messed up and this legacy system you all built your access control on is bad and broken". It'll be horrible, we'll all grumble and moan but in the end it'll be a better world for us all.

Feedback: https://c.im/@matdevdug

K8s Service Meshes: The Bill Comes Due

March 01, 2024 in kubernetes

When you start using Kubernetes one of the first suggestions you'll get is to install a service mesh. This, of course, on top of the 900 other things you need to install. For those unaware, everything in k8s is open to everything else by default when you start and traffic isn't encrypted between services. Since encrypting traffic between services and controlling what services can talk to which requires something like a JWT and client certificates, teams aren't typically eager to take on this work even though its increasingly a requirement of any stack.

Infrastructure teams can usually implement a feature faster than every app team in a company, so this tends to get solved by them. Service meshes exploded in popularity as it became clear they were easy ways to implement enforced encryption and granular service to service access control. You also get better monitoring and some cool features like circuit breaking and request retries for "free". As the scale of deployments grew with k8s and started to bridge multiple cloud providers or a cloud provider and a datacenter, this went from "nice to have" to an operational requirement.

What is a service mesh?

Service-to-service communication before and after service mesh implementation

Service meshes let you do a few things easily

Easy metrics on all service to service requests since it has a proxy that knows success/failure/RTT/number of requests
Knowledge that all requests are encrypted with automated rotation
Option to ensure only encrypted requests are accepted so you can have k8s in the same VPC as other things without needing to do firewall rules
Easy to set up network isolation at a route/service/namespace level (great for k8s hosting platform or customer isolation)
Automatic retries, global timeout limits, circuit breaking and all the features of a more robustly designed application without the work
Reduces change failure rate. With a proxy sitting there holding and retrying requests, small blips don't register anymore to the client. Now they shouldn't anyway if you set up k8s correctly but its another level of redundancy.

This adds up to a lot of value for places that adopt them with a minimum amount of work since they're sidecars injected into existing apps. For the most part they "just work" and don't require a lot of knowledge to keep working.

However, it's 2024 and stuff that used to be free isn't anymore. The free money train from VCs has ended and the bill has come due. Increasingly, this requirement for deploying production applications to k8s is going to come with a tax that you need to account for when budgeting for your k8s migration and determining whether it is worth it. Since December 2023 the service mesh landscape has changed substantially and it's a good time for a quick overview of what is going on.

NOTE: Before people jump down my throat, I'm not saying these teams shouldn't get paid. If your tool provides real benefits to businesses it isn't unreasonable to ask them to materially contribute to it. I just want people to be up to speed on what the state of the service mesh industry is and be able to plan accordingly.

Linkerd

My personal favorite of the service meshes, Linkerd is the most idiot proof of the designs. It consists of a control plane and a data plane with a monitoring option included. It looks like this:

Recently Linkerd has announced a change to their release process, which I think is a novel approach to the problem of "getting paid for your work". For those unaware, Linkerd has always maintained a "stable" and an "edge" version of their software, along with an enterprise product. As of Linkerd 2.15.0, they will no longer publish stable releases. Instead the concept of a stable release will be bundled into their Buoyant Enterprise for Linkerd option. You can read the blog post here.

Important to note that unlike some products, Linkerd doesn't just take a specific release of Edge and make it Enterprise. There are features that make it to Edge that never get to Enterprise, Stable is also not a static target (there are patch releases to the Stable branch as well), so these are effectively three different products. So you can't do the workaround of locking your org to specific Edge releases that match up with Stable/Enterprise.

Pricing

Update: Linkerd changed their pricing to per-pod. You can see it here: https://buoyant.io/pricing. I'll leave the below for legacy purposes but the new pricing addresses my concerns.

Buoyant has selected the surprisingly high price of $2000 a cluster per month. The reason this is surprising to me is the model for k8s is increasingly moving towards more clusters with less in a cluster, vs the older monolithic cluster where the entire company lives in one. This pricing works against that goal and removes some of the value of the service mesh concept.

If the idea of the Linkerd team is that orgs are going to stick with fewer, larger clusters, then it makes less sense to me to go with Linkerd. With a ton of clusters, I don't want to think about IP address ranges or any of the east to west networking designs, but if I just have like 2-3 clusters that are entirely independent of each other, then I can get a similar experience to Linkerd with relatively basic firewall rules, k8s network policies and some minor changes to an app to encrypt connections. There's still value to Linkerd, but the per-cluster pricing when I was clearly fine hosting the entire thing myself before is strange.

$2000 a month for a site license makes sense to me to get access to enterprise. $2000 a month per cluster when Buoyant isn't providing me with dashboards or metrics on their side seems like they picked an arbitrary number out of thin air. There's zero additional cost for them per cluster added, it's just profit. It feels weird and bad. If I'm hosting and deploying everything and the only support you are providing me is letting me post to the forum, where do you come up with the calculation that I owe you per cluster regardless of size?

Now you can continue to use Linkerd, but you need to switch to Edge. In my experience testing it, Edge is fine. It's mostly production ready, but there are sometimes features which you'll start using and then they'll disappear. I don't think it'll matter for most orgs most of the time, since you aren't likely constantly rolling out service mesh upgrades. You'll pick a version of Edge, test it, deploy it and then wait until you are forced to upgrade or you see a feature you like.

You also can't just buy a license, you need to schedule a call with them to buy a license with discounts available before March 21st, 2024. I don't know about you but the idea of needing to both buy a license and have a call to buy a license is equally disheartening. Maybe just let me buy it with the corporate card or work with the cloud providers to let me pay you through them.

Cilium

Cilium is the new cool kid on the block when it comes to service meshes. It eliminates the sidecar container, removing a major source of failure in the service mesh design. You still get encryption, load balancing, etc but since it uses eBPF and is injected right into the kernel you remove that entire element of the stack.

You also get a LOT with Cilium. It is its own CNI, which in my testing has amazing performance. It works with all the major cloud providers, it gives you incredibly precise network security and observability. You can also replace Kube-proxy with cilium. Here is how it works in a normal k8s cluster with Kube-proxy:

Effectively Kube-proxy works with the OS filtering layer (typically iptables) to allow network communication to your pods. This is a bit simplified but you get the idea.

With the BPF Kube-proxy replacement we remove a lot of pieces in that design.

This is only a tiny fraction of what Cilium does. It has developed a reputation for excellence, where if you full adopt the stack you can replace almost all the cloud-provider specific pieces for k8s to a generic stack that works across providers at a lower cost and high performance.

the UI for seeing service relationships in Cilium is world-class

A Wild Cisco Appears

Cisco recently acquired Isovalent in December of 2023, apparently to get involved in the eBPF space and also likely to augment their acquisition of Splunk. Cilium provides the metrics and traces as well as generating great flow logs and Splunk ingests them for you. If you are on Linkerd and considering moving over to Cilium to avoid paying, you should be aware that with Cisco having purchased them the bill is inevitable.

You will eventually be expected to pay and my guess based on years of ordering Cisco licenses and hardware is you'll be expected to pay a lot. So factor that in when considering Cilium or migrating to Cilium. I'll go out on a limb here and predict that Cilium is priced as a premium multi-cloud product with a requirement of the enterprise license for many of the features before the end of 2024. I will also predict that Linkerd ends up as the cheapest option on the table by the end of 2024 for most orgs.

Take how expensive Splunk is and extrapolate that into a service mesh license and I suspect you'll be in the ballpark.

Istio

Istio, my least favorite service mesh. Conceptually Istio and Linkerd share many of the same ideas. Both platforms use a two-part architecture now: a control plane and a data plane. The control plane manages the data plane by issuing configuration updates to the proxies in the data plane. The control plane also provides security features such as mTLS encryption and authentication.

Istio uses Envoy proxies vs rolling their own like Linkerd and tends to cover more possible scenarios than Linkerd. Here's a feature comparison:

Istio's primary differences are that it supports VMs, runs its own Ingress Controller and is 10x the complexity of setting up any other option. Istio has become infamous among k8s infrastructure staff as being the cause of more problems than any other part of the stack. Now many of these can be solved with minor modifications to the configuration (there is absolutely nothing structurally wrong with Istio), but since a service mesh failure can be "the entire cluster dies", it's tricky.

The reality is Istio is free and open source, but you pay in other ways. Istio has so many components and custom resources that can interact with each other in surprising and terrifying ways that you need someone in your team who is an Istio expert. Otherwise any attempt to create a self-service ecosystem will result in lots of downtime and tears. You are going to spend a lot of time in Istio tracking down performance problems, weird network connectivity issues or just strange reverse proxy behavior.

Some of the earlier performance complaints of Envoy as the sidecar have been addressed, but I still hear of problems when organizations scale up to a certain number of requests per second (less than I used to). The cost for Istio, to me, exceeds the value of a service mesh most of the time. Especially since Linkerd has caught up with most of the traffic management stuff like circuit breaking.

Consul Connect

The next service mesh we'll talk about is Consul Connect. If Istio is highly complicated to set up and Linkerd is easiest but fewest knobs to turn, Consul sits right in the middle. It has a great story when it comes to observability and has performance right there with Linkerd and superior to Istio.

Consul is also very clearly designed to be deployed by large companies, with features around stability and cross-datacenter design that only apply to the biggest orgs. However people who have used it seem to really like it, based on the chats I've had. The ability to use Terraform with Consul with its Consul-Terraform-Sync functionality to get information about services and interact with those services at a networking level is massive, especially for teams managing thousands of nodes or where pods need strict enforced isolation (such as SaaS products where customer app servers can't interact).

Pricing

Consul starts at $0.027 an hour, but in practice your price is gonna be higher than that. It goes up based on how many instances and clusters you are running. It's also not available on GCP, just AWS and Azure. You also don't get support with that, seemingly needing to upgrade your package to ask questions.

I'm pretty down on Hashicorp after the Terraform change, but people have reported a lot of success with Consul so if you are considering a move, this one makes a lot of sense.

Cloud Provider Service Meshes

GCP has Anthos (based on Istio) as part of their GKE Enterprise offering, which is $.10/cluster/hour. It comes with a bunch of other features but in my testing was a much easier way to run Istio. Basically Istio without the annoying parts. AWS App Mesh still uses Envoy but has a pretty different architecture. However it comes with no extra cost which is nice.

AWS App Mesh is also great for orgs that aren't all-in for k8s. You can bridge systems like ECS and traditional EC2 with it, meaning its a super flexible tool for hybrid groups or groups where the k8s-only approach isn't a great fit.

Azure uses Open Service Mesh which is now a deprecated product. Despite that, it's still their recommend solution according to a Google search. Link

Once again the crack team at Azure blows me away with their attention to detail. Azure has a hosted Istio add-on in preview now and presumably they'll end up doing something similar to GKE with Anthos. You can see that here.

What do you need to do

So the era of the free Service Mesh is coming to a close. AWS has decided to use it as an incentive to stay on their platform, Linkerd is charging you, Cilium will charge you At Some Point and Consul is as far from free now as it gets. GKE and Azure seem to be betting on Istio where they move the complexity into their stack, which makes sense. This is a reflection of how valuable these meshes are for observability and resilience as organizations transition to microservices and more specifically split stacks, where you retain your ability to negotiate with your cloud provider by running things in multiple places.

Infrastructure teams will need to carefully pick what horse they want to back moving forward. It's a careful balance between cloud lock-in vs flexibility at the cost of budget or complexity. There aren't any clear-cut winners in the pack, which wasn't true six months ago when the recommendation was just Linkerd or Cilium. If you are locked into either Linkerd or Cilium, the time to start discussing a strategy moving forward is probably today. Either get ready for the bill, commit to running Edge with more internal testing, or brace yourself for a potentially much higher bill in the future.

Typewriters and WordPerfect

February 06, 2024 in Work

The first and greatest trick of all technology is to make words appear. I will remember forever the feeling of writing my first paper on a typewriter as a kid. The tactile clunk and slight depression of the letters on the page made me feel like I was making something. It transformed my trivial thoughts to something more serious and weighty. I beamed with pride when I would be the only person who would hand in typed documents instead of the cursive of my classmates.

I learned how to type on the schools Brother Charger 11 typewriter, which by the time I got there were one step away from being thrown away. It was one of the last of its kind, being a manual portable typewriter before electric typewriters took over the entire market. Our typing teacher was a nun who had learned how to type on them and insisted they be what we tried first. Typewriters were heavy things, with a thunk and a clang going along with almost anything you did.

Despite being used to teach kids to type for years, they were effectively the same as the day they had been purchased. The typewriters sat against the wall in their little attached cases with colors that seemed to exist from the 1950s until the end of the 70s and then we stopped remembering how to mix them. The other kids in my class hated the typewriters since it was easier to just write on loose leaf paper and hand that in, plus the typing tests involved your hands being covered with a cardboard shell to prevent you from looking.

I, like all tech people, decided that instead of fixing my terrible handwriting, I would put in 10x as much work to skip the effort. So I typed everything I could, trying to get out of as many cursive class requirements as possible. As I was doing that, my father was bringing me along to various courthouses and law offices in Ohio when I had snow days or days off school and he didn't want to leave me alone in the house.

These trips were great, mostly because people forgot I was there. I'd watch violent criminal trials, sit in the secretary areas of courthouses eating cookies that were snuck over to me, the whole thing was great. Multiple times I would be sitting on the bench outside of holding cell for prisoners before they would appear in court (often for some procedural thing) and they'd give me advice. I remember one guy who was just covered in tattoos advising me that "stealing cars may look fun and it is fun, but don't crash because the police WILL COME and ask for registration information". 10 year old me would nod sagely and store this information for the future.

It was at one of these courthouses that I was introduced to something mind-blowing. It was a computer running WordPerfect.

WordPerfect?

For a long time the word processor of choice by professionals was WordPerfect. I got to watch the transformation from machine-gun sounding electric typewriters to the glow of CRT monitors. While the business world had switched over pretty quickly, it took a bit longer for government organizations to drop the typewriters and switch. I started learning how to use a word processor with WordPerfect 5.1, which came with an instruction manual big enough to stop a bullet.

For those unaware, WordPerfect introduced some patterns that have persisted throughout time as the best way to do things. It was very reliable software that came with 2 killer features that put the bullet in the head of typewriters: Move and Cancel. Ctrl-F4 let you grab a sentence and then hit enter to move it anywhere else. In an era of dangerous menus, F1 would reliably back you out of any setting in WordPerfect and get you back to where you started without causing damage. Add in some basic file navigation with F5 and you had the beginnings of every text processing tool that came after.

I fell in love with it, eventually getting one of the old courthouse computers in my house to do papers on. We set it up on a giant table next to the front door and I would happily bang away at the thing, churning out papers with the correct date in there (without having to look it up with Shift-F5). In many ways this was the most formative concept of how software worked that I would encounter.

WordPerfect was the first software I saw that understood the idea of WYSIWYG. If you changed the margins in the program, the view reflected that change. You weren't limited to one page of text at a time but could quickly wheel through all the text. It didn't have "modes", similar to Vim today, where you needed to pick Create, Edit or Insert. WordPerfect if you started typing it would insert text. It would then push the other text out of the way instead of overwriting it. It clicks as a natural way for text to work on a screen.

Thanks to the magic of emulation, I'm still able to run this software (and in fact am typing this on it right now). It turns out it is just as good as I remember, if not better. If you are interested in how there is a great write-up here. However as good as the software is, it turns out there is an amazing history of WordPerfect available for free online.

Almost Perfect is the story of WordPerfect's rise and fall from the perspective of someone who was there. I loved reading this and am so grateful that the entire text exists online. It contains some absolute gems like:

One other serious problem was our growing reputation for buggy software. Any complex software program has a number of bugs which evade the testing process. We had ours, and as quickly as we found them, we fixed them. Every couple of months we issued improved software with new release numbers. By the spring of 1983, we had already sent out versions 2.20, 2.21, and 2.23 (2.22 was not good enough to make it out the door). Unfortunately, shipping these new versions with new numbers was taken as evidence by the press and by our dealers that we were shipping bad software. Ironically, our reputation was being destroyed because we were efficient at fixing our bugs.

Our profits were penalized as well. Every time we changed a version number on the outside of the box, dealers wanted to exchange their old software for new. We did not like exchanging their stock, because the costs of remanufacturing the software and shipping it back and forth were steep. This seemed like a waste of money, since the bug fixes were minor and did not affect most users.

Our solution was not to stop releasing the fixes, but to stop changing the version numbers. We changed the date of the software on the diskettes inside the box, but we left the outside of the box the same, a practice known in the industry as slipstreaming. This was a controversial solution, but our bad reputation disappeared. We learned that perception was more important than reality. Our software was no better or worse than it had been before, but in the absence of the new version numbers, it was perceived as being much better.

You can find the entire thing here: http://www.wordplace.com/ap/index.shtml

Fixing Macs Door to Door

January 05, 2024 in Apple

When I graduated college in 2008, even our commencement speaker talked about how moving back in with your parents is nothing to be ashamed of. I sat there thinking well that certainly can't be a good sign. Since I had no aspirations and my girlfriend was moving to Chicago, I figured why not follow her. I had been there a few times and there were no jobs in Michigan. We found a cheap apartment near her law school and I started job hunting.

After a few weeks applying to every job on Craigslist, I landed an odd job working for an Apple Authorized Repair Center. The store was in a strip mall in the suburbs of Chicago with a Dollar Store and a Chinese buffet next door. My primary qualifications were that I was willing to work for not a lot of money and I would buy my own tools. My interview was with a deeply Catholic boss who focused on how I had been an alter boy growing up. Like all of my bosses early on, his primary quality was he was a bad judge of character.

I was hired to do something that I haven't seen anyone else talk about on the Internet and wanted to record before it was lost to time. It was a weird program, a throwback to the pre-Apple Store days of Apple Mac support that was called AppleCare Dispatch. It still appears to exist (https://www.apple.com/support/products/mac/) but I don't know of any AASPs still dispatching employees. It's possible that Apple has subcontracted it out to someone else.

AppleCare Dispatch

Basically if you owned a desktop Mac and lived in certain geographic areas, when you contacted AppleCare to get warranty support they could send someone like me out with a part. Normally they'd do this only for customers who were extremely upset or had a store repair go poorly. I'd get a notice that AppleCare was dispatching a part, we'd get it from FedEx and then I'd fill a backpack full of tools and head out to you on foot.

While we had the same certifications as an Apple Genius, unlike the Genius Bar we weren't trained on any sort of "customer service" element. All we did was Mac hardware repairs all day, with pretty tight expectations of turnaround. So how it worked at the time was basically if the Apple Store was underwater with in-house repairs, or you asked for at-home or the customer was Very Important, we would get sent out. I would head out to you on foot with my CTA card.

That's correct, I didn't own a car. AppleCare didn't pay a lot for each dispatch and my salary of $25,000 a year plus some for each repair didn't go far in Chicago even in the Great Recession. So this job involved me basically taking every form of public transportation in Chicago to every corner of the city. I'd show up at your door within a 2 hour time window, take your desktop Mac apart in your house, swap the part, run the diagnostic and then take the old part with me and mail it back to Apple.

Apple provided a backend web panel which came with a chat client. Your personal Apple ID was linked with the web tool (I think it was called ASX) where you could order parts for repairs as well as open up a chat with the Apple rep there to escalate an issue or ask for additional assistance. The system worked pretty well, with Apple paying reduced rates for each additional part after the first part you ordered. This encouraged us all to get pretty good at specific diagnostics with a minimal number of swaps.

Our relationship to Apple was bizarre. Very few people at Apple even knew the program existed, seemingly only senior AppleCare support people. We could get audited for repair quality, but I don't remember that ever happening. Customer satisfaction was extremely important and basically determined the rate we got paid, so we were almost never late to appointments and typically tried to make the experience as nice as possible. Even Apple Store staff seemed baffled by us on the rare occasions we ran into each other.

There weren't a lot of us working in Chicago around 2008-2010, maybe 20 in total. The community was small and I quickly met most of my peers who worked at other independent retail shops. If our customer satisfaction numbers were high, Apple never really bothered us. They'd provide all the internal PDF repair guides, internal diagnostic tools and that was it.

It is still surprising that Apple turned us loose onto strangers without anyone from Apple speaking to us or making us watch a video. Our exam was mostly about not ordering too many parts and ensuring we could read the PDF guide of how to fix a Mac. A lot of the program was a clear holdover from the pre-iPod Apple, where resources were scarce and oversight minimal. As Apple Retail grew, the relationship to Apple Authorized Service Providers got more adversarial and controlling. But that's a story for another time.

Tools etc

For the first two years I used a Manhattan Portage bag, which looked nice but was honestly a mistake. My shoulder ended up pretty hurt after carrying a heavy messenger bag for 6+ hours a day.

The only screwdrivers I bothered with was Wiha precision screwdrivers. I tried all of them and Wiha were consistently the best by a mile. Wiha has a list of screwdrivers by Apple model available here: https://www.wihatools.com/blogs/articles/apple-and-wiha-tools

Macs of this period booted off of FireWire, so that's what I had with me. FireWire 800 LaCie drives were the standard issue drives in the field.

LaCie Rugged Triple 2TB 2TB 5400rpm FireWire 800, USB 3.0 Orange, Sølv (LAC9000448) | Dustin.dk

You'd partition it to have a series of OS X Installers on there (so you could restore the customer back to what they had before) along with a few bootable installs of OS X. These were where you'd run your diagnostic software. The most commonly used ones were as follows:

https://www.cleverfiles.com/pro.html

Mac Repair North York — Remember back when Macs were something you could fix? Crazy times

9/11 Truther

One of my first calls was for a Mac Pro at a private residence. It was a logic board, which means the motherboard of the Mac. I wasn't thrilled, because removing and replacing the Mac Pro logic board was a time-consuming repair that required a lot of light. Instead of a clean workspace with bright lights I got a guy who would not let me go until I had watched how 9/11 was an inside job.

Mac Pro 4.1 2009 Motherboard Logic Board | Kaufen auf Ricardo — The logic board in question

"Look, you don't really think the towers were blown up by planes do you?" he said as he dug around this giant stack of papers to find...presumably some sort of Apple-related document. I had told him that I had everything I needed, but that I had a tight deadline and needed to start right now. "Sure, but I'll put the video on in the background and you can just listen to it while you work." So while I took a Mac Pro down to the studs and rebuilt it, this poorly narrated video explained how it was the CIA behind 9/11.

His office or, "command center", looked like a set of the X-Files. There were folders and scraps of paper everywhere along with photos of buildings, planes, random men wearing sunglasses. I think it was supposed to come across as if he was doing an investigation, but it reminded me more of a neighbor who struggled with hoarding. If there was an organizational system, I couldn't figure it out. Why was this person so willing to dedicate an large portion of their house to "solving a mystery" the rest of us had long since moved on from?

The Mac Pro answered all my questions when it booted up. The desktop was full of videos he had edited of 9/11 truth material along with website assets for where he sold these videos. This guy wasn't just a believer, he produced the stuff. When I finished, we had to run a diagnostic test to basically confirm the thing still worked as well as move the serial number onto the logic board. When it cleared diagnostic I took off, thanking him for his time and wishing him a nice day. He looked devastated and asked if I wanted to go grab a drink at the bar and continue our conversation. I declined, jogging to the L.

The Doctors

One of the rich folks I was sent out to lived in one of those short, super expensive buildings on Lake Shore Drive. For those unfamiliar, these shorter buildings facing the water in Chicago are often divided into a few large houses. Basically you pass through an army of doorman, get shown into an elevator that opens into the persons house. That was, if you could get through the doormen.

The staff in rich peoples houses want to immediately establish with any contractor coming into the home that they're superior to you. This happened to me constantly, from personal assistants to doormen, maids, nannies, etc. Doormen in particular liked to make a big deal of demonstrating that they could stop me from going up. This one stuck out because he made me take the freight elevator, letting me know "the main elevator is for people who live here and people who work here". I muttered about how I was also working there and he rolled his eyes and called me an asshole.

On another visit to a different building I had a doorman physically threaten "to throw me down" if I tried to get on the elevator. The reason was all contractors had to have insurance registered with the building before they did work there, even though I wasn't.....removing wires from the wall. The owner came down and explained that I wasn't going to do any work, I was just "a friend visiting". I felt bad for the doorman in that moment, in a dumb hat and ill-fitting jacket with his brittle authority shattered.

So I took the freight elevator up, getting let into what I would come to see as "the rich persons template home". My time going into rich peoples houses were always disappointing, as they are often a collection of nice items sort of strewn around. I was shown by the husband into the library, a beautiful room full of books with what I (assumed) were prints of paintings in nice frames leaning against the bookshelves. There was an iMac with a dead hard drive, which is an easy repair.

The process for fixing a hard drive was "boot to DiskWarrior, attempt to fix disk, have it fail, swap the drive". Even if DiskWarrior fixed the Mac and it booted, I would still swap the drive (why not and it's what I was paid to do) but then I didn't have to have the talk. This is where I would need to basically sit someone down and tell them their data was gone. "What about my taxes?!?" I would shake my head sadly. Thankfully this time the drive was still functional so I could copy the data over with a SATA to USB adapter.

As I reinstalled OS X, I walked around the room and looked at the books. I realized they were old, really old and the paintings on the floor were not prints. There were sketches by Picasso, other names I had heard in passing through going through art museums. When he came back in, I asked why there was a lot of art. "Oh sure, my dads, his big collection, I'm going to hang it up once we get settled." He, like his wife, didn't really acknowledge my presence unless I directed a question right at him. I started to google some of the books, my eyes getting wide. There was millions of dollars in this room gathering dust. He never made eye contact with me during this period and quickly left the room.

This seems strange but was really common among these clients. I truly think for many of the C-level type people whose house I went to, they didn't really even see me. I had people turn the lights off in rooms I was in, forget I was there and leave (while arming the security system). For whatever reason I instantly became part of the furniture. When I went to the kitchen for a drink of water, the maid let me know that they have lived there for coming up on 5 years.

This was surprising to me because the apartment looked like they had moved in two weeks ago. There were still boxes on the floor, a tv sitting on the windowsill and what I would come to understand was a "prop fridge". It had bottled water, a single bottle of expensive champagne, juices, fruit and often some sort of energy drink. No leftovers, everything gets swapped out before it goes bad and gets replaced. "They're always at work" she explained, grabbing her bag and offering to let me out before she locked up. They were both specialist doctors and this was apparently where they recharged their batteries.

After the first AppleCare Dispatch visit they would call me back for years to fix random problems. I don't think either of them ever learned my name.

HARPO Studio

I was once called to fix a "high profile" computer at HARPO studios in Chicago. This was where they filmed the Oprah Winfrey Show, which I obviously knew of the existence of but had never watched. Often these celebrity calls went to me, likely because I didn't care and didn't particularly want them. I was directed to park across the street and told even though the signs said "no parking" that they had a "deal with the city".

This repair was suspicious and I got the sense that someone had name dropped Oprah to maybe get it done. AppleCare rarely sent me multiple parts unless the case was unusual or the person had gotten escalated through the system. If you emailed Steve Jobs back in the day and his staff approved a repair, it attached a special code to the serial number that allowed us to order effectively unlimited items against the serial number. However with the rare "celebrity" case, we would often find AppleCare did the same thing, throwing parts at us to make the problem go away.

The back area of HARPO was busy, with what seemed like an almost exclusively female staff. "Look it's important that if you see Oprah, you act normally, please don't ask her for an autograph or a photo". I nodded, only somewhat paying attention because never in a million years would I do that. This office felt like the set of The West Wing, with people constantly walking and talking along with a lot of hand motions. My guide led me to a back office with a desk on one side and a long table full of papers and folders. The woman told me to "fix the iMac" and left the room.

harpo-studio-chicago-office-sterling-bay-joshpabstphoto-(9) — Architecture Photography | Commercial Real Estate Photographer — Not the exact office but you get the jist

I swapped the iMac hard drive and screen, along with the memory and wifi then dived under the desk the second Oprah walked in. The woman and Oprah had a conversation about scheduling someone at a farm, or how shooting at a farm was going and then she was gone. When I popped my head up, the woman looked at me and was like "can you believe you got to meet Oprah?" She had a big smile, like she had given me the chance of a lifetime.

The bummer about the aluminum iMac repairs is you have to take the entire screen off to get anything done. This meant I couldn't just run away and hide my shame after effectively diving under a table to escape Oprah, a woman who I am certain couldn't have cared less what came out of my mouth. I could have said "I love to eat cheese sometimes" and she would have nodded and left the room.

NO TOOLS NEEDED How to replace your 27 inch iMac screen glass monitor - YouTube

So you have to pop the glass off (with suction cups, not your hands like a psycho as shown above), then unscrew and remove the LCD and then finally you get access to the actual components. Any dust that got on the LCD would stick and annoy people, so you had to try and keep it as clean as possible while moving quickly to get the swap done. The nightmare was breaking the thick cables that connected the screen to the logic board, something I did once and required a late night trip to an electronics repair guy who got me sorted out with a soldering iron.

The back alley electronics repair guy is the dark secret of the Dispatch world. If you messed up a part, pulled a cable or broke a connector, Apple could ask you to pay for that part. The Apple list price for parts were hilariously overpriced. Logic boards were like $700-$900, each stick of RAM was like $90 for ones you could buy on crucial for $25. This could destroy your pay for that month, so you'd end up going to Al, who ran basically a "solder Apple stuff back together" business in his garage. He wore overalls and talked a lot about old airplanes, which you'd need to endure in order to get the part fixed. Then I'd try to get the part swapped and just pray that the thing would turn on long enough for you to get off the property. Ironically his parts often lasted longer than the official Apple refurbished parts.

After I hid under the desk deliberately, I lied for years afterwards, telling people I didn't have time to say hi. In reality my mind completely blanked when she walked in. I stayed under the desk because I was nervous that everyone was going to look at me to be like "I loved when you did X" and my brain couldn't form a single memory of anything Oprah had ever done. I remembered Tom Cruise jumping on a couch but I couldn't recall if this was a good thing or a bad thing when it happened.

Oh and the car that I parked in the area the city didn't enforce? It had a parking ticket, which was great because I had borrowed the car. Most of the payment from my brush with celebrity went to the ticket and a tank of gas.

Brownstone Moms

One of the most common calls I got was to rich peoples houses in Lincoln Park, Streeterville, Old Town and a few other wealthy neighborhoods. They often live in distinct brownstone houses with small yards with a "public" entrance in the front, a family entrance on the side and then a staff entrance through the back or in the basement.

These houses were owned by some of the richest people in Chicago. The houses themselves were beautiful, but they don't operate like normal houses. Mostly they were run by the wives, who often had their own personal assistants. It was an endless sea of contractors coming in and out, coordinated by the mom and sometimes the nanny.

Once I was there, they'd pay me to do whatever random technical tasks existed outside of the initial repair. I typically didn't mind since I was pretty fast at the initial repair and the other stuff was easy, mostly setting up printers or routers. The sense I got was if the household made the AppleCare folks life a living hell, I would get sent out to make the problem disappear. These people often had extremely high expectations of customer service, which could be difficult at times.

There was a whole ecosystem of these small companies I started to run into more and more. They seemed to specialize in catering to rich people, providing tutoring services, in-house chefs, drivers, security and every other service under the sun. One of the AV installation companies and I worked together off the books after-hours to set up Apple TVs and Mac Minis as the digital media hubs in a lot of these houses. They'd pay me to set up 200 iPods as party favors or wire an iPad into every room.

Often I'd show up only to tell them their hard drive was dead and everything was gone. This was just how things worked before iCloud Photos, nobody kept backups and everything was constantly lost forever. Here they would often threaten or plead with me, sometimes insinuating they "knew people" at Apple or could get me fired. Jokes on you people, I don't even know people at Apple was often what ran through my head. Threats quickly lost their power when you realized nobody at any point had asked your name or any information about yourself. It's hard to threaten an anonymous person.

The golden rule that every single one of these assistants warned me about was not to bother the husband when he gets home. Typically these CEO-types would come in, say a few words to their kids and then retreat to their own area of the house. These were often TV rooms or home theaters, elaborate set pieces with $100,000+ of AV equipment in there that was treated like it was a secret lair of the house. To be clear, none of these men ever cared at all that I was there. They didn't seem to care that anybody was there, often barely acknowledging their wives even though an immense amount of work had gone into preparing for his return.

As smartphones became more of a thing, the number of "please spy on my teen" requests exploded. These varied from installing basically spyware on their kids laptops to attempting to install early MDM software on the kids iPhones. I was always uncomfortable with these jobs, in large part because the teens were extremely mean to me. One girl waited until her mom left the room to casually turn to me and say "I will pay you $500 to lie to my mom and say you set this up".

I was offended that this 15 year old thought she could buy me, in large part because she was correct. I took the $500 and told the mom the tracking software was all set up. She nodded and told me she would check that it was working and "call me back if it wasn't". I knew she was never going to check, so that part didn't spook me. I just hoped the kid didn't get kidnapped or something and I would end up on the evening news. But I was also a little short that month for rent so what can you do.

Tip for anyone reading this looking to get into this rich person Mac business

So the short answer is Time Machine is how you get paid month after month. Nobody checks Time Machine or pays attention to the "days since" notification. I wrote an AppleScript back in the day to alert you to Time Machine failures through email, but there is an app now that does the same thing: https://tmnotifier.com/

Basically when the backups fail, you schedule a visit and fix the problem. When they start to run out of space, you buy a new bigger drive. Then you backup the Time Machine to some sort of encrypted external location so when the drive (inevitably) gets stolen you can restore the files. The reason they keep paying you is you'll get a call at some point to come to the house at a weird hour and recover a PDF or a school assignment. That one call is how you get permanent standing appointments.

Nobody will ever ask you how it works, so just find the system you like best and do that. I preferred local Time Machine over something like remote backup only because you'll be sitting there until the entire restore is done and nothing beats local. Executives will often fill the "family computer" with secret corporate documents they needed printed off, so be careful with these backups. Encrypt, encrypt, encrypt then encrypt again. Don't bother explaining how the encryption works, just design the system with the assumption that someone will at some point put a CSV with your social security number onto this fun family iMac covered in dinosaur stickers.

Robbed for Broken Parts

A common customer for repairs would be schools, who would work with Apple to open a case for 100 laptops or 20 iMacs at a time. I liked these "mass repair" days, typically because the IT department for Chicago Public Schools would set us up with a nice clean area to work and I could just listen to a podcast and swap hard drives or replace top cases. However this mass repair was in one of Chicago's rougher neighborhoods.

Personal safety was a common topic among the dispatch folks when we would get together for a pizza and a beer. Everyone had bizarre stories but I was the only one not working out of my car. The general sense among the community was it was not an "if" but a "when" until you were robbed. Typically my rule was if I started to get nervous I'd "call back to the office" to check if a part had arrived. Often this would calm people down, reminding them that people knew where I was. Everyone had a story of getting followed back to their car and I had been followed back to the train once or twice.

On this trip though everything went wrong that could go wrong. My phone, the HTC Dream running v1 of Android had decided to effectively "stop phoning". It was still on but decided we were not, in fact, in the middle of a large city. I was instead in a remote forest miles away from a cell tower. I got to the school later than I wanted to be there, showing up at noon. When I tried to push it and come back the next day the staff let me know the janitors knew I would be there and would let me out.

So after replacing a ton of various Mac parts I walked out with boxes of broken parts in my bag and a bunch in an iMac box that someone had given me. My plan was I would head back home, get them checked in and labeled and then drop them off at a FedEx store. When I got out and realized it was dark, I started to accept something bad was likely about to happen to me. Live in a city for any amount of time and you'll start to develop a subconscious odds calculator. The closing line on this wasn't looking great.

Sure enough while waiting for the bus, I was approached by a man who made it clear he wanted the boxes. He didn't have a weapon but started to go on about "kicking the shit" out of me and I figured that was good enough for me. He clearly thought there was an iMac in the box and I didn't want to be here when he realized that wasn't true. I handed over my big pile of broken parts and sprinted to the bus that was just pulling up, begging the driver to keep driving. As a CTA bus driver, he had of course witnessed every possible horror a human can inflict on another human and was entirely unphased by my outburst. "Sit down or get off the bus".

When I got home I opened a chat with as Apple rep who seemed unsure of what to do. I asked if they wanted me to go to the police and the rep said if I wanted to I could, but after "talking to some people on this side" they would just mark the parts as lost in transit and it wouldn't knock my metrics. I thanked them and didn't think much more of the incident until weeks later when someone from Apple mailed me a small Apple notebook.

They never directly addressed the incident (truly the notebook might be unrelated) but I always thought the timing was funny. Get robbed, get a notebook. I still have the notebook.

Questions/comments/concerns? Find me on Mastodon: https://c.im/@matdevdug

AWS Elastic Kubernetes Service (EKS) Review

January 31, 2022 in DevOps

A bit of history

Over the last 5 years, containers have become the de-facto standard by which code is currently run. You may agree with this paradigm or think its a giant waste of resources (I waver between these two opinions on a weekly basis), but this is where we are. AWS, as perhaps the most popular cloud hosting platform, bucked the trend and attempted to launch their own container orchestration system called ECS. This was in direct competition to Kubernetes, the open-source project being embraced by organizations of every size.

ECS isn't a bad service, but unlike running EC2 instances inside of a VPC, switching to ECS truly removes any illusion that you will be able to migrate off of AWS. I think the service got a bad reputation for its deep tie-in with AWS along with some of the high Fargate pricing for the actual CPU units. As years went on, it became clear to the industry at large that ECS was not going to become the standard way we run containerized applications. This, in conjunction with the death of technology like docker swarm meant a clear winner started to emerge in k8s.

AWS EKS was introduced in 2017 and became generally available in 2018. When it launched, the response from the community was....tepid at best. It was missing a lot of the features found in competing products and seemed to be far behind the offerings from ECS at the time. The impression among folks I was talking to at the time was: AWS was forced to launch a Kubernetes service to stay competitive, but they had zero interest in diving into the k8s world and this was to meet the checkbox requirement. AWS itself didn't really want the service to explode in popularity was the rumor being passed around various infrastructure teams.

This is how EKS stacked up to its competition at launch

I became involved with EKS at a job where they had bet hard on docker swarm, it didn't work out and they were seeking a replacement. Over the years I've used it quite a bit, both as someone whose background is primary in AWS and someone who generally enjoys working with Kubernetes. So I've used the service extensively for production loads over years and have seen it change and improve in that time.

EKS Today

So its been years and you might think to yourself "certainly AWS has fixed all these problems and the service is stronger than ever". You would be...mostly incorrectly. EKS remains both a popular service and a surprisingly difficult one to set up correctly. Unlike services like RDS or Lambda, where the quality increases by quite a bit on a yearly basis, EKS has mostly remained a clearly internally disliked product.

This is how I like to imagine how AWS discusses EKS

New customer experience

There exists out of the box tooling for making a fresh EKS cluster that works well, but the tool isn't made or maintained by Amazon, which seems strange. eksctl which is maintained by Weaveworks here mostly works as promised. You will get a fully functional EKS cluster through a CLI interface that you can deploy to. However you are still pretty far from an actual functional k8s setup, which is a little bizarre.

Typically the sales pitch for AWS services is they are less work than doing it yourself. The pitch for EKS is its roughly the same amount of work to set it up, it is less maintenance in the long term. Once running, it keeps running and AWS managing the control plane means its difficult to get into a situation in which you cannot fix the cluster, which very nice. Typically k8s control plane issues are the most serious and everything else you can mostly resolve with tweaks.

However if you are considering going with EKS, understand you are going to need to spend a lot of time reading before you touch anything. You need to make hard-to-undo architectural decisions early in the setup process and probably want to experiment with test clusters before going with a full-on production option. This is not like RDS, where you can mostly go with the defaults, hit "new database" and continue on with your life.

Where do I go to get the actual setup information?

The best resource I've found is EKS Workshop, which walks you through tutorials of all the actual steps you are going to need to follow. Here is the list of things you would assume come out of the box, but strangely do not:

You are going to need to set up autoscaling. There are two options, the classic Autoscaler and the new AWS-made hotness which is Karpenter. You want to use Karpenter, it's better and doesn't have all the lack-of-awareness when it comes to AZs and EBS storage (basically autoscaling doesn't work if you have state on your servers unless you manually configure it to work, it's a serious problem). AWS has a blog talking about it.
You will probably want to install some sort of DNS service, if for nothing else so services not running inside of EKS can find your dynamic endpoints. You are ganna wanna start here.
You almost certainly want Prometheus and Grafana to see if stuff is working as outlined here.
You will want to do a lot of reading about IAM and RBAC to understand specifically how those work together. Here's a link. Also just read this entire thing from end to end: link.
Networking is obviously a choice you will need to make. AWS has a network plugin that I recommend, but you'll still need to install it: link.
Also you will need a storage driver. EBS is supported by kubernetes in the default configuration but you'll want the official one for all the latest and greatest features. You can get that here.
Maybe most importantly you are going to need some sort of ingress and load balancer controller. Enjoy setting that up with the instructions here.
OIDC auth and general cluster access is pretty much up to you. You get the IAM auth out of the box which is good enough for most use cases, but you need to understand RBAC and how that interacts with IAM. It's not as complicated as that sentence implies it will be, but it's also not super simple.

That's a lot of setup

I understand that part of the appeal of k8s is that you get to make all sorts of decisions on your own. There is some power in that, but in the same way that RDS became famous not because you could do anything you wanted, but because AWS stopped you from doing dumb stuff that would make you life hard; I'm surprised EKS is not more "out of the box". My assumption would have been that AWS launched the clusters with all their software installed by default and let you switch off those defaults in a config file or CLI option.

There is an ok Terraform module for this, but even this was not plug and play. You can find that here. This is what I use for most things and it isn't the fault of the module maintainers, EKS touches a lot of systems and because it is so customizable, it's hard to really turn into a simple module. I don't really considering this the problem of the maintainers, they've already applied a lot of creativity to getting around limitations in what they can do through the Go APIs.

Overall Impression - worth it?

I would still say yes after using it for a few years. The AWS SSO integration is great, allowing us to have folks access the EKS cluster through their SSO access even with the CLI. aws sso login --profile name_of_profile and aws eks --profile name_of_profile --region eu-west-1 update-kubeconfig --name name_of_cluster is all that is required for me to interact with the cluster through normal kubectl commands, which is great. I also appreciate the regular node AMI upgrades, removing the responsibility from me and the team for keeping on top of those.

Storage

There are some issues here. The ebs experience is serviceable but slow, resizing and other operations are very rate-limited and can be an error-prone experience. So if your service relies a lot on resizing volumes or changing volumes, ebs is not the right choice inside of EKS, you are probably going to want the efs approach. I also feel like AWS could provide more feedback to you about the dangers of relying on ebs since they are tied to AZs, requiring some planning on your part before starting to keep all those pieces working. Otherwise the pods will simply be undeployable. By default the old autoscaling doesn't resolve this problem, requiring you to set node affinities.

Storage and Kubernetes have a bad history, like a terrible marriage. I understand the container purists which say "containers should never have state, it's against the holy container rules" but I live in the real world where things must sometimes save to disk. I understand we don't like that, but like so many things in life, we must accept the things we cannot change. That being said it is still a difficult design pattern in EKS that requires planning. You might want to consider some node affinity for applications that do require persistent storage.

Networking

The AWS CNI link is quite good and I love that it allows me to generate IP addresses on the subnets in my VPCs. Not only does that simplify some of the security policy logic and the overall monitoring and deployment story, but mentally it is easier to not add another layer of abstraction to what is already an abstraction. The downside is they are actual network interfaces, so you need to monitor free IP addresses and ENIs.

AWS actually provides a nice tool to keep an eye on this stuff which you can find here. But be aware that if you intend on running a large cluster or if your cluster might scale up to be quite large, you are gonna get throttled with API calls. Remember the maximum number of ENIs you can attach to a node is determined by instance type, so its not an endlessly flexible approach. AWS maintains a list of ENI and IP limits per instance type here.

The attachment process also isn't instant, so you might be waiting a while for all this to come together. The point being you will need to understand the networking pattern of your pods inside of the cluster in a way you don't need to do as much with other CNIs.

Load Balancing and Ingress

This part of the experience is so bad it is kind of incredible. If you install the tool AWS recommends, you are going to get a new ALB for every single Ingress defined unless you group them together into groups. Ditto with every single internal LoadBalancer target, you'll end up with NLBs for each one. Now you can obviously group them together, but this process is incredibly frustrating in practice.

What do you mean?

You have a new application and you add this to your YAML:

kind: Ingress
metadata:
  namespace: new-app
  name: new-app
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: new-app
              servicePort: 80

Every time you do this you are going to end up with another ALB, which aren't free to use. You can group them together as shown here but my strong preference would have been to have AWS do this. I get they are not a charity, but it seems kind of crazy that this becomes the default unless you and the rest of your organization know how to group stuff together (and what kinds of apps can even be grouped together).

I understand it is in AWS's best interest to set it up like this, but in practice it means you'll see ALB and NLB cost exceed EC2 costs because by default you'll end up with so many of the damn things barely being used. My advice for someone setting up today would be to go with HAProxy as shown here. It's functionally a better experience and you'll save so much money you can buy the HAProxy enterprise license from the savings.

Conclusion in 2022

If you are going to run Kubernetes, don't manage it yourself unless you can dedicate staff to it. EKS is a very low-friction day to day experience but it frontloads the bad parts. Setup is a pain in the ass and requires a lot of understanding about how these pieces fit together that is very different from more "hands off" AWS products.

I think things are looking good though that AWS has realized it makes more sense to support k8s as opposed to fighting it. Karpenter as the new autoscaler is much better, I'm optimistic they'll do something with the load balancing situation and the k8s add-on stack is relatively stable. If folks are interested in buying-in today I think you'll be in ok shape in 5 years. Clearly whatever internal forces at AWS that wanted this service to fail have been pushed back by its success.

However if I were a very small company new to AWS I wouldn't touch this with a ten foot pole. It's far too complicated for the benefits and any sort of concept of "portability" is nonsense bullshit. You'll be so tied into the IAM elements of RBAC that you won't be able to lift and shift anyway, so if you are going to buy in then don't do this. EKS is for companies already invested in k8s or looking for a deeply tweakable platform inside of AWS when they have outgrown services like Lightsail and Elastic Beanstalk. If that isn't you, don't waste your time.

How to make the Mac better for developers

December 10, 2021 in Apple

A what-if scenario

A few years ago I became aware of the existence of the Apple Pro Workflow teams. These teams exist inside Apple to provide feedback to the owners of various professional-geared hardware and software teams inside the company. This can be everything from advising the Mac Pro team what kind of expansion a pro workstation might need all the way to feedback to the Logic Pro and Final Cut teams on ways the make the software fit better into conventional creative workflows.

I think the idea is really clever and wish more companies did something like this. It would be amazing, for instance, if I worked on accounting software if we had a few accountants in-house attempting to just use our software to solve problems. Shortening the feedback loop and relying less on customers reporting problems and concerns is a great way of demonstrating to people you take their time seriously. However it did get me thinking: what would a Developer Pro Workflow team ask for?

The History of the MacBook Pro and me

I've owned MacBook Pros since they launched, with the exception of the disastrous TouchBar Mac with the keyboard that didn't work. While I use Linux every day for my work, I prefer to have macOS as a "base operating system". There are a lot of reasons for that, but I think you can break them down into a few big advantages to me:

The operating system is very polished. "Well wait, maybe you haven't tried Elementary/Manjaro/etc." I have and they're great for mostly volunteer efforts, but it's very hard to beat the overall sense of quality and "pieces fitting together" that comes with macOS. This is an OS maintained by , I suspect, a large team in terms of software development team sizes. Improvements and optimizations are pretty common and there is a robust community around security. It doesn't just work but for the most part it does keep ticking along.
The hardware is able to be fixed around the world. One thing that bothered me about my switch to Linux: how do I deal with repairs and replacements. PCs change models all the time and, assuming I had a functional warranty, how long could I go without a computer? I earn my living on this machine, I can't wait six weeks for a charger or spend a lot of time fighting with some repair shop on what needs to be done to fix my computer. Worst case I can order the exact same model I have now from Apple, which is huge.
The third-party ecosystem is robust, healthy and frankly time-tested. If I need to join a Zoom call, I don't want to think about whether Zoom is going to work. If someone calls me on Slack, I don't want to deal with quitting and opening it four times to get sound AND video to work. When I need to use third-party commercial software its often non-optional (a customer requires that I use it) and I don't have a ton of time to debug it. With the Mac, commercial apps are just higher quality than Linux. They get a lot more attention internally from companies and they just have a much higher success rate of working.

Suggestions from my fake Workflow team

Now that we've established why I like the platform, what could be done to improve it for developers? How could we take this good starting point and further refine it. These suggestions are not ranked.

An official way to run Linux

Microsoft changed the conversation in developer circles with Windows Subsystem for Linux. Suddenly Windows machines, previously considered inferior for developing applications for Linux, mostly had that limitation removed. The combination of that and the rise of containerization has really reduced the appeal of Macs for developers, as frankly you just don't use the base OS UNIX tooling as much as you used to.

If Apple wants to be serious about winning back developer mindshare, they wouldn't need to make their own distro. I think something like Canonical Multipass would work great, still giving us a machine that resembles the cloud computers we're likely going to be deploying to. Making them more of a first-class citizen inside the Apple software stack like Microsoft would be a big win.

Well why would Apple do this if there are already third-party tools? Part of it is marketing, Apple just can tell a lot more users about something like this. Part of it is that it signals a seriousness about the effort and a commitment on their part to keep it working. I would be hesitant to rely heavily on third-party tooling around the Apple hyperkit VM system just because I know at any point without warning Apple could release new hardware that doesn't work with my workflow. If Apple took the tool in-house there would be at least some assurances that it would continue to work.

Podman for the Mac

Docker Desktop is no longer a product you should be using. With the changes to their licensing, Docker has decided to take a much more aggressive approach to monetization. It's their choice, obviously, but to me it invalidates using Docker Desktop for even medium sized companies. The license terms of: fewer than 250 employees AND less than $10 million in annual revenue means I can accidentally violate the license by having a really good year, or merging with someone. I don't need that stress in my life and would never accept that kind of aggressive license in business-critical software unless there was absolutely no choice.

Apple could do a lot to help with me by assisting with the porting of podman to the Mac. Container-based workflows aren't going anywhere and if Apple installed podman as part of the "Container Developer Tools" command, it would not only remove a lot of concerns from users about Docker licensing, but also would just be very nice. Again this is solvable by users using a Linux VM but its a clunky solution and nothing something I think of as very Apple-like. If there is a team sitting around thinking about the button placement in Final Cut Pro, making my life easier when running podman run for the 12th time would be nice as well.

Xcode is the worst IDE I've ever used and needs a total rewrite

I don't know how to put this better. I'm sure the Xcode team is full of nice people and I bet they work hard. It's the worst IDE I've ever used. Every single thing in Xcode is too slow. I'm talking storyboards, suggestions, every single thing drags. A new single view app, just starting from scratch, takes multiple minutes to generate.

You'll see this screen so often, you begin to wonder where "Report" even goes

Basic shit in Xcode doesn't work. Interface Builder is useless, autocomplete is random in terms of how well or poorly it will do that second, even git commands don't work all the time for me inside Xcode. I've never experienced anything like it with free IDEs, so the idea that Apple ships this software for actual people to use is shocking to me.

If you are a software developer who has never tried to make a basic app in Xcode and are curious what people are talking about, give it a try. I knew mobile development was bad, but spending a week working inside Xcode after years of PyCharm and Vim/Tmux, I got shivers imagining if I paid my mortage with this horrible tool. Every 40 GB update you must just sweat bullets, worried that this will be the one that stops letting you update your apps.

I also cannot imagine that people inside Apple use this crap. Surely there's some sort of special Xcode build that's better or something, right? I would bet money a lot of internal teams at Apple are using AppCode to do their jobs, leaving Xcode hidden away. But I digress: Xcode is terrible and needs a total rewrite or just call JetBrains and ask how much they would charge to license every single Apple Developer account with AppCode, then move on with your life. Release a smaller standalone utility just for pushing apps to the App Store and move on.

It is embarrassing that Apple asks people to use this thing. It's hostile to developers and it's been too many years and too many billions of dollars made off the back of App developers at this point. Android Studio started out bad and has gotten ok. In that time period Xcode started pretty bad and remains pretty bad. Either fix it or give up on it.

Make QuickLook aware of the existence of code

QuickLook, the Mac functionality where you click on a file and hit spacebar to get a quick look at it is a great tool. You can quickly go through a ton of files and take a quick glance at all of them, or at least creative professionals can. I want to as well. There's no reason macOS can't understand what yaml is, or how to show me JSON in a non-terrible way. There are third party tools that do this but I don't see this as something Apple couldn't bring in-house. It's a relatively low commitment and would be just another nice daily improvement to my workflow.

An Apple package manager

I like Homebrew and I have nothing but deep respect for the folks who maintain it. But it's gone on too long now. The Mac App Store is a failure, the best apps don't live there and the restrictions and sandboxing along with Apple's cut means nobody is rushing to get listed there. Not all ideas work and its time to give up on that one.

Instead just give us an actual package manager. It doesn't need to be insanely complicated, heck we can make it similar to how Apple managed their official podcast feed for years with a very hands-off approach. Submit your package along with a URL and Apple will allow users a simple CLI to install it along with declared dependencies. We don't need to start from scratch on this, you can take the great work from homebrew and add some additional validation/hosting.

How great would it be if I could write a "setup script" for new developers when I hand them a MacBook Pro that went out and got everything they needed, official Apple apps, third-party stuff, etc? You wouldn't need some giant complicated MDM solution or an internal software portal and you would be able to add some structure and vetting to the whole thing.

Just being able to share a "new laptop setup" bash script with the internet would be huge. We live in a world where more and more corporations don't take backups of work laptops and they use tooling to block employees from maintaining things like Time Machine backups. While it would be optimal to restore my new laptop from my old laptop, sometimes it isn't possible. Or maybe you just want to get rid of the crap built up over years of downloading stuff, opening it once and never touching it again.

To me this is a no-brainer, taking the amazing work from the homebrew folks, bringing them in-house and adding some real financial resources to it, with the goal of making a robust CLI-based package installation process for the Mac. Right now Homebrew developers actively have to work around Apple's restrictions to get their tool to work, a tool tons of people rely on every day. That's just not acceptable. Pay the homebrew maintainers a salary, give it a new Apple name and roll it out at WWDC. Everybody will love you and you can count the downloads as "Mac App Store" downloads during investor calls.

A git-aware text editor

I love BBedit so much I bought a sweatshirt for it and a pin I proudly display on my backpack. That's a lot of passion for a text editor, but BBedit might be the best software in the world. I know, you are actively getting upset with me as you read that, but hear me out. It just works, it never loses files and it saves you so much time. Search and replace functionality in BBedit has, multiple times, gotten me out of serious jams.

But for your average developer who isn't completely in love with BBedit, there are still huge advantages to Apple shipping even a much more stripped down text editor that Markdown and Git-functionality. While you likely have your primary work editor, be it vim for me or vscode for you, there are times when you need to make slight tweaks to a file or you just aren't working on something that needs the full setup. A stripped version of BBedit or something similar that would allow folks to write quick docs, small shell scripts or other tasks that you do a thousand times a year would be great.

A great example of this is Code for Elementary OS:

This doesn't have to be the greatest text editor ever made, but it would be a huge step forward for the developer community to have something that works out of the box. Plus formats like Markdown are becoming more common in non-technical environments.

Why would Apple do this if third-party apps exist?

For the same reason they make iPhoto even though photo editors exist. Your new laptop from Apple should be functional out of the box for a wide range of tasks and adding a text editor that can work on files that lots of Apple professional users interact with on a daily basis underscores how seriously Apple takes that market. Plus maybe it'll form the core of a new Xcode lite, codenamed Xcode functional.

An Apple monitor designed for text

This is a bit more of a reach, I know that. But Apple could really use a less-expensive entry into the market and in the same way the Pro Display XDR is designed for the top-end of the video editing community, I would love something like that but designed more for viewing text. It wouldn't have nearly the same level of complexity but it would be nice to be able to get a "full end to end" developer solution from Apple again.

IPS panels would be great for this, as the color accuracy and great viewing angels would be ideal for development but you won't care about them being limited to 60hz. A 27 inch panel is totally sufficient and I'd love to be able to order them at the same time as my laptop for a new developer. There are lots of third-party alternatives but frankly dealing with the endless word soup that is monitors these days is difficult to even begin to track. I love my Dell UltraSharp U2515H but I don't even know how long I can keep buying them or how to find the successor to it when it gets decommissioned.

Actually I guess its already been decommissioned and also no monitors are for sale.

What did I miss?

What would you add to the MacBook Pro to make it a better developer machine? Let me know at @duggan_mathew on twitter.

Don't Make My Mistakes: Common Infrastructure Errors I've Made

December 03, 2021 in DevOps

One surreal experience as my career has progressed is the intense feeling of deja vu you get hit with during meetings. From time to time, someone will mention something and you'll flash back to the same meeting you had about this a few jobs ago. A decision was made then, a terrible choice that ruined months of your working life. You spring back to the present day, almost bolting out of your chair to object, "Don't do X!". Your colleagues are startled by your intense reaction, but they haven't seen the horrors you have.

I wanted to take a moment and write down some of my worst mistakes, as a warning to others who may come later. Don't worry, you'll make all your own new mistakes instead. But allow me a moment to go back through some of the most disastrous decisions or projects I ever agreed to (or even fought to do, sometimes).

Don't migrate an application from the datacenter to the cloud

Ah the siren call of cloud services. I'm a big fan of them personally, but applications designed for physical datacenters rarely make the move to the cloud seamlessly. I've been involved now in three attempts to do large-scale migrations of applications written for a specific datacenter to the cloud and every time I have crashed upon the rocks of undocumented assumptions about the environment.

Me encountering my first unsolvable problem with a datacenter to cloud migration

As developer write and test applications, they develop expectations of how their environment will function. How do servers work, what kind of performance does my application get, how reliable is the network, what kind of latency can I expect, etc. These are reasonable thing that any person would do upon working inside of an environment for years, but it means when you package up an application and run it somewhere else, especially old applications, weird things happen. Errors that you never encountered before start to pop up and all sorts of bizarre architectural decisions need to be made to try and allow for this transition.

Soon you've eliminated a lot of the value of the migration to begin with, maybe even doing something terrible like connecting your datacenter to AWS with direct connect in an attempt to bridge the two environments seamlessly. Your list of complicated decisions start to grow and grow, hitting increasingly more and more edge cases of your cloud provider. Inevitable you find something you cannot move and you are now stuck with two environments, a datacenter you need to maintain and a new cloud account. You lament your hubris.

Instead....

Port the application to the cloud. Give developers a totally isolated from the datacenter environment, let them port the application to the cloud and then schedule 4-8 hours of downtime for your application. This will allow persistence layers to cut over and then you can change your DNS entries to point to your new cloud presence. The attempt to prevent this downtime will drown you in bad decision after bad decision. Better to just bite the bullet and move on.

Or even better, develop your application in the same environment you expect to run it in.

Don't write your own secrets system

I don't know why I keep running into this. For some reason, organizations love to write their own secrets management system. Often these are applications written by the infrastructure teams, commonly either environmental variable injection systems or some sort of RSA-key based decrypt API call. Even I have fallen victim to this idea, thinking "well certainly it can't be that difficult".

For some reason, maybe I had lost my mind or something, I decided we were going to manage our secrets inside of PostgREST application I would manage. I wrote an application that would generate and return JWTs back to applications depending on a variety of criteria. These would allow them to access their secrets in a totally secure way.

Now in defense of PostgREST, it worked well at what it promised to do. But the problem of secrets management is more complicated than it first appears. First we hit the problem of caching, how do you keep from hitting this service a million times an hour but still maintain some concept of using the server as the source of truth. This was solvable through some Nginx configs but was something I should have thought of.

Then I smacked myself in the face with the rake of rotation. It was trivial to push a new version, but secrets aren't usually versioned to a client. I authenticate with my application and I see the right secrets. But during a rotation period there are two right secrets, which is obvious when I say it but hadn't occurred to me when I was writing it. Again, not a hard thing to fix, but as time went on and I encountered more and more edge cases for my service, I realized I had made a huge mistake.

The reality is secrets management is a classic high risk and low reward service. It's not gonna help my customers directly, it won't really impress anyone in leadership that I run it, it will consume a lot of my time debugging it and its going to need a lot of domain specific knowledge in terms of running it. I had to rethink a lot of the pieces as I went, everything from multi-region availability (which like, syncing across regions is a drag) to hardening the service.

Instead....

Just use AWS Secrets Manager or Vault. I prefer Secrets Manager, but whatever you prefer is fine. Just don't write your own, there are a lot of edge cases and not a lot of benefits. You'll be the cause of why all applications are down and the cost savings at the end of the day are minimal.

Don't run your own Kubernetes cluster

I know, you have the technical skill to do it. Maybe you absolutely love running etcd and setting up the various certificates. Here is a very simple decision tree when thinking about "should I run my own k8s cluster or not":

Are you a Fortune 100 company? If no, then don't do it.

The reason is you don't have to and letting someone else run it allows you to take advantage of all this great functionality they add. AWS EKS has some incredible features, from support for AWS SSO in your kubeconfig file to allowing you to use IAM roles inside of ServiceAccounts for pod access to AWS resources. On top of all of that, they will run your control plane for less than $1000 a year. Setting all that aside for a moment, let's talk frankly for a second.

One advantage of the cloud is other people beta test upgrades for you.

I don't understand why people don't talk about this more. Yes you can run your own k8s cluster pretty successfully, but why? I have literally tens of thousands of beta testers going ahead of me in line to ensure EKS upgrades work. On top of that, I get tons of AWS engineers working on it. There's no advantage if I'm going to run my infrastructure in AWS anyway to running my own cluster except that I can maintain the illusion that at some point I could "switch cloud providers". Which leads me on to my next point.

Instead....

Let the cloud provider run it. It's their problem now. Focus on making your developers lives easier.

Don't Design for Multiple Cloud Providers

This one irks me on a deeply personal level. I was convinced by a very persuasive manager that we needed to ensure we had the ability to switch cloud providers. Against my better judgement, I fell in with the wrong crowd. We'll call them the "premature optimization" crowd.

Soon I was auditing new services for "multi-cloud compatibility", ensuring that instead of using the premade SDKs from AWS, we maintained our own. This would allow us to, at the drop of a hat, switch between them in the unlikely event this company exploded in popularity and we were big enough to somehow benefit from this migration. I guess in our collective minds this was some sort of future proofing or maybe we just had delusions of grandeur.

What we were actually doing is the worst thing you can do, which is just being a pain in the ass for people trying to ship features to customers. If you are in AWS, don't pretend that there is a real need for your applications to be deployable to multiple clouds. If AWS disappeared tomorrow, yes you would need to migrate your applications. But the probability of AWS outliving your company is high and the time investment of maintaining your own cloud agnostic translation layers is not one you are likely to ever get back.

We ended up with a bunch of libraries that were never up to date with the latest features, meaning developers were constantly reading about some great new feature of AWS they weren't able to use or try out. Tutorials obviously didn't work with our awesome custom library and we never ended up switching cloud providers or even dual deploying because financially it never made sense to do it. We ended up just eating a ton of crow from the entire development team.

Instead....

If someone says "we need to ensure we aren't tied to one cloud provider", tell them that ship sailed the second you signed up. Similar to a data center, an application designed, tested and run successfully for years in AWS is likely to pick up some expectations and patterns of that environment. Attempting to optimize for agnostic design is losing a lot of the value of cloud providers and adding a tremendous amount of busy work for you and everyone else.

Don't be that person. Nobody likes the person who is constantly saying "no we can't do that" in meeting. If you find yourself in a situation where migrating to a new provider makes financial sense, set aside at least 3 months an application for testing and porting. See if it still makes financial sense after that.

Cloud providers are a dependency, just like a programming language. You can't arbitrarily switch them without serious consideration and even then, often "porting" is the wrong choice. Typically you want to practice like you play, so developing in the same environment as your customers will use your product.

Don't let alerts grow unbounded

I'm sure you've seen this at a job. There is a tv somewhere in the office and on that tv is maybe a graph or CloudWatch alerts or something. Some alarm will trigger at an interval and be displayed on that tv, which you will be told to ignore because it isn't a big deal. "We just want to know if that happens too much" is often what is reported.

Eventually these start to trickle into on-call alerts, which page you. Again you'll be told they are informative, often by the team that owns that service. As enough time passes, it becomes unclear what the alert was supposed to tell you, only that new people will get confusing information about whether an alert is important or not. You'll eventually have an outage because the "normal" alert will fire with an unusual condition, leading to a person to silence the page and go back to sleep.

I have done this, where I even defended the system on the grounds of "well surely the person who wrote the alert had some intention behind it". I should have been on the side of "tear it all down and start again", but instead I choose a weird middle ground. It was the wrong decision for me years ago and its the wrong decision for you today.

Instead....

If an alert pages someone, it has to be a situation in which the system could never recover on its own. It needs to be serious and it cannot be something where the failure is built into the application design. An example of that would be "well sometimes our service needs to be restarted, just SSH in and restart it". Nope, not an acceptable reason to wake me up. If your service dies like that, figure out a way to bring it back.

Don't allow for the slow gradual pollution of your life with garbage alerts and feel free to declare bankruptcy on all alerts in a platform if they start to stink. If a system emails you 600 times a day, it's not working. If there is a slack channel so polluted with garbage that nobody goes in there, it isn't working as an alert system. It isn't how human attention works, you can't spam someone constantly with "not-alerts" and then suddenly expect them to carefully parse every string of your alert email and realize "wait this one is different".

Don't write internal cli tools in python

I'll keep this one short and sweet.

Nobody knows how to correctly install and package Python apps. If you write an internal tool in Python, it either needs to be totally portable or just write it in Go or Rust. Save yourself a lot of heartache as people struggle to install the right thing.

My job as a professional Slacker

It's a training problem too

Slackifying Increases

Attention has a cost

My dream

Making a simple CMS

Making the CMS

Password Login

Why are passwords insufficient?

Password Reset Flow

How do you fix this?

TOTP

Use Google/Facebook/Apple

Use Passkeys

What to do from here

Fast Forward Two Months

Gateway API

Seems Straightforward

What are CRDs?

Conclusion

Stare Into Madness

GCP Makes It Worse

Who Cares?

The Obvious Solution

Conclusion

Linkerd

Pricing

Cilium

A Wild Cisco Appears

Istio

Consul Connect

Pricing

Cloud Provider Service Meshes

What do you need to do

WordPerfect?

AppleCare Dispatch

Tools etc

9/11 Truther

The Doctors

HARPO Studio

Brownstone Moms

Robbed for Broken Parts

A bit of history

EKS Today

New customer experience

Where do I go to get the actual setup information?

That's a lot of setup

Overall Impression - worth it?

Storage

Networking

Load Balancing and Ingress

What do you mean?

Conclusion in 2022

The History of the MacBook Pro and me

Suggestions from my fake Workflow team

What did I miss?

Don't migrate an application from the datacenter to the cloud

Instead....

Don't write your own secrets system

Instead....

Don't run your own Kubernetes cluster

Don't Design for Multiple Cloud Providers

Don't let alerts grow unbounded

Don't write internal cli tools in python