Don't Make My Mistakes: Common Infrastructure Errors I've Made

December 03, 2021 in DevOps

One surreal experience as my career has progressed is the intense feeling of deja vu you get hit with during meetings. From time to time, someone will mention something and you'll flash back to the same meeting you had about this a few jobs ago. A decision was made then, a terrible choice that ruined months of your working life. You spring back to the present day, almost bolting out of your chair to object, "Don't do X!". Your colleagues are startled by your intense reaction, but they haven't seen the horrors you have.

I wanted to take a moment and write down some of my worst mistakes, as a warning to others who may come later. Don't worry, you'll make all your own new mistakes instead. But allow me a moment to go back through some of the most disastrous decisions or projects I ever agreed to (or even fought to do, sometimes).

Don't migrate an application from the datacenter to the cloud

Ah the siren call of cloud services. I'm a big fan of them personally, but applications designed for physical datacenters rarely make the move to the cloud seamlessly. I've been involved now in three attempts to do large-scale migrations of applications written for a specific datacenter to the cloud and every time I have crashed upon the rocks of undocumented assumptions about the environment.

Me encountering my first unsolvable problem with a datacenter to cloud migration

As developer write and test applications, they develop expectations of how their environment will function. How do servers work, what kind of performance does my application get, how reliable is the network, what kind of latency can I expect, etc. These are reasonable thing that any person would do upon working inside of an environment for years, but it means when you package up an application and run it somewhere else, especially old applications, weird things happen. Errors that you never encountered before start to pop up and all sorts of bizarre architectural decisions need to be made to try and allow for this transition.

Soon you've eliminated a lot of the value of the migration to begin with, maybe even doing something terrible like connecting your datacenter to AWS with direct connect in an attempt to bridge the two environments seamlessly. Your list of complicated decisions start to grow and grow, hitting increasingly more and more edge cases of your cloud provider. Inevitable you find something you cannot move and you are now stuck with two environments, a datacenter you need to maintain and a new cloud account. You lament your hubris.

Instead....

Port the application to the cloud. Give developers a totally isolated from the datacenter environment, let them port the application to the cloud and then schedule 4-8 hours of downtime for your application. This will allow persistence layers to cut over and then you can change your DNS entries to point to your new cloud presence. The attempt to prevent this downtime will drown you in bad decision after bad decision. Better to just bite the bullet and move on.

Or even better, develop your application in the same environment you expect to run it in.

Don't write your own secrets system

I don't know why I keep running into this. For some reason, organizations love to write their own secrets management system. Often these are applications written by the infrastructure teams, commonly either environmental variable injection systems or some sort of RSA-key based decrypt API call. Even I have fallen victim to this idea, thinking "well certainly it can't be that difficult".

For some reason, maybe I had lost my mind or something, I decided we were going to manage our secrets inside of PostgREST application I would manage. I wrote an application that would generate and return JWTs back to applications depending on a variety of criteria. These would allow them to access their secrets in a totally secure way.

Now in defense of PostgREST, it worked well at what it promised to do. But the problem of secrets management is more complicated than it first appears. First we hit the problem of caching, how do you keep from hitting this service a million times an hour but still maintain some concept of using the server as the source of truth. This was solvable through some Nginx configs but was something I should have thought of.

Then I smacked myself in the face with the rake of rotation. It was trivial to push a new version, but secrets aren't usually versioned to a client. I authenticate with my application and I see the right secrets. But during a rotation period there are two right secrets, which is obvious when I say it but hadn't occurred to me when I was writing it. Again, not a hard thing to fix, but as time went on and I encountered more and more edge cases for my service, I realized I had made a huge mistake.

The reality is secrets management is a classic high risk and low reward service. It's not gonna help my customers directly, it won't really impress anyone in leadership that I run it, it will consume a lot of my time debugging it and its going to need a lot of domain specific knowledge in terms of running it. I had to rethink a lot of the pieces as I went, everything from multi-region availability (which like, syncing across regions is a drag) to hardening the service.

Instead....

Just use AWS Secrets Manager or Vault. I prefer Secrets Manager, but whatever you prefer is fine. Just don't write your own, there are a lot of edge cases and not a lot of benefits. You'll be the cause of why all applications are down and the cost savings at the end of the day are minimal.

Don't run your own Kubernetes cluster

I know, you have the technical skill to do it. Maybe you absolutely love running etcd and setting up the various certificates. Here is a very simple decision tree when thinking about "should I run my own k8s cluster or not":

Are you a Fortune 100 company? If no, then don't do it.

The reason is you don't have to and letting someone else run it allows you to take advantage of all this great functionality they add. AWS EKS has some incredible features, from support for AWS SSO in your kubeconfig file to allowing you to use IAM roles inside of ServiceAccounts for pod access to AWS resources. On top of all of that, they will run your control plane for less than $1000 a year. Setting all that aside for a moment, let's talk frankly for a second.

One advantage of the cloud is other people beta test upgrades for you.

I don't understand why people don't talk about this more. Yes you can run your own k8s cluster pretty successfully, but why? I have literally tens of thousands of beta testers going ahead of me in line to ensure EKS upgrades work. On top of that, I get tons of AWS engineers working on it. There's no advantage if I'm going to run my infrastructure in AWS anyway to running my own cluster except that I can maintain the illusion that at some point I could "switch cloud providers". Which leads me on to my next point.

Instead....

Let the cloud provider run it. It's their problem now. Focus on making your developers lives easier.

Don't Design for Multiple Cloud Providers

This one irks me on a deeply personal level. I was convinced by a very persuasive manager that we needed to ensure we had the ability to switch cloud providers. Against my better judgement, I fell in with the wrong crowd. We'll call them the "premature optimization" crowd.

Soon I was auditing new services for "multi-cloud compatibility", ensuring that instead of using the premade SDKs from AWS, we maintained our own. This would allow us to, at the drop of a hat, switch between them in the unlikely event this company exploded in popularity and we were big enough to somehow benefit from this migration. I guess in our collective minds this was some sort of future proofing or maybe we just had delusions of grandeur.

What we were actually doing is the worst thing you can do, which is just being a pain in the ass for people trying to ship features to customers. If you are in AWS, don't pretend that there is a real need for your applications to be deployable to multiple clouds. If AWS disappeared tomorrow, yes you would need to migrate your applications. But the probability of AWS outliving your company is high and the time investment of maintaining your own cloud agnostic translation layers is not one you are likely to ever get back.

We ended up with a bunch of libraries that were never up to date with the latest features, meaning developers were constantly reading about some great new feature of AWS they weren't able to use or try out. Tutorials obviously didn't work with our awesome custom library and we never ended up switching cloud providers or even dual deploying because financially it never made sense to do it. We ended up just eating a ton of crow from the entire development team.

Instead....

If someone says "we need to ensure we aren't tied to one cloud provider", tell them that ship sailed the second you signed up. Similar to a data center, an application designed, tested and run successfully for years in AWS is likely to pick up some expectations and patterns of that environment. Attempting to optimize for agnostic design is losing a lot of the value of cloud providers and adding a tremendous amount of busy work for you and everyone else.

Don't be that person. Nobody likes the person who is constantly saying "no we can't do that" in meeting. If you find yourself in a situation where migrating to a new provider makes financial sense, set aside at least 3 months an application for testing and porting. See if it still makes financial sense after that.

Cloud providers are a dependency, just like a programming language. You can't arbitrarily switch them without serious consideration and even then, often "porting" is the wrong choice. Typically you want to practice like you play, so developing in the same environment as your customers will use your product.

Don't let alerts grow unbounded

I'm sure you've seen this at a job. There is a tv somewhere in the office and on that tv is maybe a graph or CloudWatch alerts or something. Some alarm will trigger at an interval and be displayed on that tv, which you will be told to ignore because it isn't a big deal. "We just want to know if that happens too much" is often what is reported.

Eventually these start to trickle into on-call alerts, which page you. Again you'll be told they are informative, often by the team that owns that service. As enough time passes, it becomes unclear what the alert was supposed to tell you, only that new people will get confusing information about whether an alert is important or not. You'll eventually have an outage because the "normal" alert will fire with an unusual condition, leading to a person to silence the page and go back to sleep.

I have done this, where I even defended the system on the grounds of "well surely the person who wrote the alert had some intention behind it". I should have been on the side of "tear it all down and start again", but instead I choose a weird middle ground. It was the wrong decision for me years ago and its the wrong decision for you today.

Instead....

If an alert pages someone, it has to be a situation in which the system could never recover on its own. It needs to be serious and it cannot be something where the failure is built into the application design. An example of that would be "well sometimes our service needs to be restarted, just SSH in and restart it". Nope, not an acceptable reason to wake me up. If your service dies like that, figure out a way to bring it back.

Don't allow for the slow gradual pollution of your life with garbage alerts and feel free to declare bankruptcy on all alerts in a platform if they start to stink. If a system emails you 600 times a day, it's not working. If there is a slack channel so polluted with garbage that nobody goes in there, it isn't working as an alert system. It isn't how human attention works, you can't spam someone constantly with "not-alerts" and then suddenly expect them to carefully parse every string of your alert email and realize "wait this one is different".

Don't write internal cli tools in python

I'll keep this one short and sweet.

Nobody knows how to correctly install and package Python apps. If you write an internal tool in Python, it either needs to be totally portable or just write it in Go or Rust. Save yourself a lot of heartache as people struggle to install the right thing.

Coding for non-programmers: Why we need better GUI automation tools

October 22, 2021

I was talking to a friend recently, a non-technical employee at a large company. He was complaining about his mouse making strange noises. Curious, I asked what he had been doing. "Oh well I need to fill in these websites every quarter with information I get off an Excel sheet. It takes about 30 hours to do and I need to click around a lot to do it." I was a bit baffled and asked him to elaborate, which he did, explaining that there was an internal business process where he got an Excel spreadsheet full of tags. He had to go into an internal website and make sure the tags on their internal app matched the spreadsheet.

"Surely you all have a development team or someone who can help automate that?" I asked, a little shocked at what he was saying. "No well we asked but they just don't have the spare capacity right now, so we're gonna keep doing it this year." He went on to explain this wasn't even the only one of these he had to do, in fact there were many web pages where he needed to manually update information, seemingly only possible through his browser. I thought of the thousands of people at just this company who are spending tens of thousands of hours copy and pasting from a spreadsheet.

When I was a kid and first introduced to computers this was exactly the sort of tedium they were supposed to help with. If you did all the work of getting the data into the computer and had everything set up, part of the sales pitch of a computer in every workplace was the incredible amount of time saved. But this didn't seem any better as compared to him pulling paper files and sorting them. What happened? Weren't we supposed to have fixed this already?

My First Automation

My first introduction to the concept of programming was a technology called HyperCard. HyperCard was a very basic GUI scripting program, which allowed you to create stacks of black and white cards. Apple gave you a series of GUI controls that you could drag and drop into the cards and you linked everything together with HyperTalk, a programming language that still structurally looks "correct" to me.

put the value of card field "typehere" into theValue is an example of a HyperTalk command, but you actually had a lot of interesting options available to you. This was my first introduction to the idea of basically prompting the user for input with commands like:

onmouseUp
answer "This is an example of the answer command" with "Reply 1" or -. "Reply 2" or "Reply 3"
endmouseUp

One of the things I enjoyed doing a lot was trying to script art, making it seem like I had drawn something on the computer that in fact had been done through HyperTalk. Those commands looked something like this:

choose spray can tool
drag from 300,100 to 400, 200 wait 1 second
choose eraser tool
drag from 300,100 to 400, 200 choose browse tool
endmouseUp

The syntax was very readable and easy to debug along with being very forgiving. HyperCard stacks weren't presented as complicated or difficult, they were supposed to be fun. What I remember about HyperCard vs the programming tools I would be exposed to later was it was impossible to get into an unrecoverable state. What I mean is backing out of a card while keeping the stack intact didn't require any special knowledge, a kid could figure it out with very little instruction.

For those who never got the chance to try it out: check out this great emulator here.

Automator and AppleScript

Years went by and my interest in IT continued, but for the most part this didn't involve a lot of scripting or programming. In the late 90s and early 2000s the focus for IT was all about local GUI applications, trying to transition power tools out of the terminal and into a place where they looked more like normal applications. Even things like server management and LDAP configuration mostly happened through GUI tools, with a few PowerShell or Bash scripts used on a regular basis.

The problem with this approach is the same problem with modern web applications, which is its great for the user who is doing this for the first time to have all this visual information and guides on what to do. For someone doing it for the 5000th time, there's a lot of slowness inherit in the design. The traditional path folks took in IT was to basically abandon all GUI tooling at that point if possible, leaving people lower on the totem pole to suffer through clicking through screens a thousand times.

I was primarily a Mac user supporting other Mac users, meaning I was very much considered a niche. Apple was considered all but dead at the time, with IT teams preferring to silo off the few departments they couldn't force to adopt windows onto a junior helpdesk person. Mostly my job was setting up the odd Mac server, debugging why Active Directory logins weren't working, nothing too shocking. However it was here when I started to encounter this category of non-programmers writing programs to solve surprisingly important problems.

You would discover these Macs running in higher education or in design studios with a post-it note on it saying "Don't turn off". Inevitably there would be some sort of AppleScript and/or Automator process running, resizing files or sending emails, sometimes even connecting to databases (or most often using Excel as a database). I saw grading tools that would take CSVs, convert them into SQL and then put them into a database as a common example.

For those who haven't used Apple Automator, part of the reason I think people liked it for these tasks in Mac-heavy environments is because they're difficult to mess up. You drag and drop a series of commands into a Workflow and then run the Workflow.

After discussing it with users over the years, a similar story emerged about how these workflows came to exist.

Someone technical introduced the user to the existence of Automator because they started using it to automate a task.
The less-technical user "recorded" themselves doing something they needed to do a thousand times. The Automator record basically just captured what you did with a mouse and ran it again when you hit play.
Eventually they'd get more comfortable with the interface and move away from recording and move towards the workflows since they were more reliable.
Or if they did batch file processing, they'd go with a Folder Action, which triggered something when an item was placed in a folder.
At some point they'd run into the limit of what that could to and call out to either a Bash script or an AppleScript depending on whatever the more technical person they had access to knew.

Why do this?

For a lot of them, this was the only route to automation they were ever going to have. They weren't programmers, they didn't want to become programmers but they needed to automate some task. Automator isn't an intimidating tool, the concept of "record what my mouse does and do it again" makes sense to users. As time went on and more functionality was added to these homegrown programs, they became mission-critical, often serving as the only way that these tasks were done.

From an IT perspective that's a nightmare. I found myself in an uncomfortable place being asked to support what were basically "hacks" around missing functionality or automation in applications. I decided early on that it was at least going to be a more interesting part of my job and so I steered into it and found myself really enjoying the process. AppleScript wasn't an amazing language to write in but the documentation was pretty good and I loved how happy it made people.

For many of you reading I'm sure you rolled your eyes a bit at that part, how happy it made people. I cannot stress enough some of the most impactful programming I maybe have ever done was during this period. You are, quite literally, freeing people to do other things. I had teachers telling me they were staying late 5+ hours a week to do the things we could automate and run. This form of basic programming with the end customer right there providing you direct feedback on whether it works is incredibly empowering for both parties.

It was also very easy to turn into a feature request. Think of all the bad internal tooling feature requests you get, with complicated explanations or workflows that make no sense when they're written down. Instead you had the full thing, from beginning to end, with all the steps laid out. I noticed a much shorter pipeline from "internal request" to "shipped feature" with these amateur automations I think in part to the clarity of the request that comes hand-in-hand with already seeing how the request works end to end.

As time went on, less and less work happened in desktop applications. Platform specific tooling like Automator and AppleScript became less relevant as the OS hooks they need to operate aren't present or can't be guaranteed. You can't really depend on mouse placement on a web page where the entire layout might change any second.

Now we're back at a place where I see people often wasting a ton of time doing exactly the sort of thing computers were supposed to prevent. Endless clicking and dragging, mind-numbing repetition of selecting check-boxes and copy/pasting values in form fields. The ideas present in technology like HyperCard and AppleScript have never been more needed in normal non-technical peoples lives, but there just doesn't seem to be any tooling for it in a browser world. Why?

What about RPA?

RPA, or "robotic process automation", is the attempted reintroduction of these old ideas into the web space. There are a variety of tools and the ones I've tried are UIPath and OpenRPA. Both have similar capabilities although UIPath is certainly more packed with features. The basic workflow is similar, but there are some pretty critical problems with it that make it hard to put into a non-technical users hands.

For those unaware the basic workflow of an RPA tool is mostly browser-based, often with some "orchestrator" process and a "client" process. A user installs a plugin to their browser that records their actions and allows them to replay it. You can then layer in more automation, either using built-in plugins to call APIs or by writing actions that look for things like CSS tags on the page.

Here are some of the big issues with the RPA ecosystem as it exists now:

There isn't a path for non-technical users to take from "Record" to building resilient workflows that functions unattended for long periods of time. The problem is again that websites are simply not stable enough targets to really program against using what you can see. You need to, at the very least, understand a bit about CSS in order to "target" the thing you want to hit reliably. Desktop applications you control the updates, but websites you don't.
The way to do this stuff reliably is often through that services API but then we've increased the difficulty exponentially. You now need to understand what an API is, how they work and be authorized by your organization to have a key to access that API. It's not a trivial thing in a large workplace to get a Google Workspaces key, so it unfortunately would be hard to get off the ground to begin with.
The tools don't provide good feedback to non-technical users about what happened. You just need too much background information about how browsers work, how websites function, what APIs are, etc in order to get even minimal value out of this tooling.
They're also really hard to setup on your computer. Getting started, running your first automation, is just not something you can do without help if you aren't technically inclined.

I understand why there has been an explosion of interest in the RPA space recently and why I see so many "RPA Developer" job openings. There's a legitimate need here to fill in the gaps between professional programming and task automation that is consuming millions of hours of peoples lives, but I don't think this is the way forward. If your technology relies on the use of existing APIs and professional tech workers to function, then its bringing minimal value to the table over just learning Python and writing a script.

The key secret for a technology like this has to be that the person who understands the nuance of the task has to be able to debug it. If it doesn't work, if the thing doesn't write to the database or whatever, maybe you don't know exactly why that happened, but because you have some context on how the task is done manually there exists the possibility of you fixing it.

You see this all the time in non-technical workplaces, a "black box" understanding of fixes. "Oh if you close that window, wait five minutes and open it again, it'll work" is one I remember hearing in a courthouse I was doing contract work in. Sure enough, that did fix the issue, although how they figured it out remains a mystery to me. People are not helpless in the face of technical problems but it also isn't their full-time job to fix them.

Why not just teach everyone to program?

I hate this ideology and I always have. Human civilization progresses in part due to specialization, that we as a society don't have to learn to do everything in order to keep things functional. There's a good argument to be made that we possibly have gone too far in the other direction, that now we understand too little about how the world around us functions, but that's outside the scope of this conversation.

There is a big difference between "having a task that would benefit from automation" and "being interested in investing the hundreds of hours to learn how to program". For most people it has nothing to do with their skills or interest and would be something they would need to learn in their off-time. But I love to program, won't they? My parents both love being lawyers but I can't think of anything I would want to do less right now.

Giving people a path from automation to "real programming", whatever that means, is great and I fully support it. But creating the expectation that the only way for people to automate tasks is to become an expert in the space is selfish thinking on the part of the people who make those services. It is easier for us if they take on the cognitive load vs if we design stuff that is easier to use and automate.

Is there a way forward to help people automate their lives?

I think there is, but I'm not sure if the modern paradigm for how we ship website designs works with it. Similar to website scraping, automation based on website elements and tags is not a forever solution and it can be difficult, if not impossible, for a user to discover that it isn't working anymore without adding manual checks.

There are tools users can use, things like Zapier that work really well for public stuff. However as more and more internal work moves to the browser, this model breaks down for obvious reasons. What would be great is if there was a way to easily communicate to the automation tool "these are the hooks on the page we promise aren't going to go away", communicated through the CSS or even some sort of "promise" page.

If some stability could be added to the process of basically scraping a webpage, I think the tooling could catch up and at least surface to the user "here are the stable targets on this page". As it exists now though, the industry treats APIs as the stable interface for their services and their sites as dynamic content that changes whenever they want. While understandable to some extent, I think it misses this opportunity to really expand the possibility of what a normal user can do with your web service.

It's now become very much the norm for employees to get onboarded and handed a computer where the most used application is the web browser. This has enabled incredible velocity when it comes to shipping new features for companies and has really changed the way we think about the idea of software. But with this has come cost incurred on the side of the actual user of this software, removing the ability to reliably automate their work.

If there was a way to give people some agency in this process, I think it would pay off hugely. Not only in the sheer hours saved of human life, but also with the quality of life for your end users. For the people who found and embraced desktop automation technology, it really allowed normal people to do pretty incredible things with their computers. I would love to see a future in which normal people felt empowered to do that with their web applications.

Warp Terminal Emulator Review

October 06, 2021

An opinionated take on the tool I use the most

Like many of you, my terminal emulator is probably my most used piece of software. My day begins with getting a cup of coffee, opening up Slack and iTerm 2, my terminal emulator for years. iTerm 2 has an incredible number of features, almost too many to list. Here are some of my most-used features just off the top of my head:

Hotkey global terminal dropdown, meaning I can get into the terminal from any application I'm in
Really good search, including support for regex
Paste history, which like come on who doesn't use that 100 times a day
Password manager. With the death of sudolikeaboss I've come to rely on this functionality just to deal with the mess of passwords that fill my life. Also if you know a replacement for sudolikeaboss that isn't the 1Password CLI let me know.
Triggers, meaning you can write basic actions that fire when text matching a regex pattern is encountered. Nice for when you want the icon to bounce in the dock when a job is done in a dock or when you want the password manager to automatically open when a certain login prompt is encountered.

With all this flexibility comes complexity, which smacks you in the face the second you open the Preference pane inside of iTerm 2. I don't blame the developers for this at all, they've done a masterful job of handling this level of customization. But I've seen new users jaw drop when they click around this preference pane:

I have very few complaints with iTerm 2, but I'm always open to try something new. Someone on Twitter told me about Warp, a new terminal emulator written in Rust with some very interesting design patterns. I don't know if its the right terminal for me but it definitely solves problems in a new way.

What makes a good terminal emulator?

This is a topic that can stir a lot of feelings for people. Terminal emulators are a tool that people invest a lot of time into, moving them from job to job. However in general I would say these are the baseline features I would expect from a modern terminal emulator:

Control over color, people don't all have the same setups
Tabs, they're great in browsers and even better with terminals
Profiles. Different setups for different terminals when you are doing totally isolated kinds of work. I like a visual indicator I'm working in production vs testing, for instance.
Access to command history through the tool itself
Bookmarks, while not a must-have are nice so you don't need to define a ton of bookmarks in your bash profile.
Notifications of some sort.
Control over fonts. I love fonts, it's just one of those things.

So why am I reviewing a terminal emulator missing most of these features or having them present in only limited configurations? Because by breaking away from this list of commonly agreed-upon "good features" they've managed to make something that requires almost no customization to get started. Along the way, they've added some really interesting features I've never seen before.

I requested an invite on their site and a few weeks later got the email inviting me to download it. First, huge credit to the Warp team. I respect the hell out of software with an opinion and Warp has a strong point of view. The default for development tools is to offer options for everything under the sun and to see someone come to the conversation with a tool that declares "there is a right way to do this" is intriguing. Here is what you see when you open warp:

From launch it wants you to know this is not your normal terminal emulator. It is trying to get you to do things the warp way from minute 1, which is great. The Command Palette is a lightning fast dropdown of all the Warp commands you might need. Search commands is just bringing up the previous commands from your history.

Executing commands in Warp is unlike anything I've ever seen before. Every command is broken into a Block which is a total rethink of the terminal. Instead of focusing primarily on the manipulation of text, you are focused on each command run as an independent unit you can manipulate through the UI. This is me trying to show what it looks like:

You'll notice the space and blocking between each command. Clicking those 3 dots gives you this dropdown:

They made sudo !! a keyboard shortcut. The audacity.

All this functionality is available on your local machine but they are also available on machines you SSH (if the remote host is using bash). This opens up a massive collection of power text editing functionality on remote machines that might not be configured to be used as a "development machine". Check out that list here and think of how much time this might have saved you in your life.

I think the primary sales point of Warp is the Sharing functionality, which allows for some very interesting workflows. It appears at some point you'll be able to add things like approval before you run commands (which I think is kind of a weird anti-pattern but still I applaud the idea). Right now though you can generate links to your specific block and share them with folks.

I made an example link you can see here. However I don't know how long the links last so here is a quick screenshot.

I love this for a million uses:

it beats sharing snippets of text in Slack all the time
it's a game changer for folks trying to do coding meetups or teaching a class
adding in concepts like approval or review to commands would be mind-blowing for emergency middle of the night fixes where you want a group of people to review it. It doesn't seem to have this functionality yet but appears to be coming.

Daily Usage

Alright so I love a lot of the concepts but how much do I like using it as a daily driver? Let's focus on the positive stuff on Mac. I'm testing this on a 16 inch MacBook Pro with 32 GB of RAM, so about as powerful as it gets.

Pros

Warp is just as fast as iTerm 2, which is to say so fast I can't make it choke on anything I tried.
Steps into an area of the market that desperately needs more options, which is the multi-platform terminal emulator space. Warp is going to eventually be on Linux, Windows and Mac which right now is something only a handful of emulators can say, the biggest being alacritty.
All this functionality comes out of the box. No giant configuration screens to go through, this all works from launch.

This is the configuration menu. No joke, this is it.

I can't stress how "baked" this software feels. Everything works pretty much like they promise every time. Even with weird setups like inside of a tmux or processing tons of text, it kept working.
The team is open to feedback and seems to be responsive. You can see their repo here.

Cons

I don't love how much space the blocks end up taking up, even with "compact mode" turned on. I often take my laptop to my balcony to work and I miss the screen real estate with Warp that I get with iTerm 2.
Missing profiles is a bummer. I like to be able to tweak stuff per workflow.
I'd love some concept of bookmark if I'm going to lose so much space to the "Block" concept.
My workflow is heavily invested in tmux and Vim, meaning I already have a series of shortcuts for how to organize and search my data into distinct blocks. I can change it for Warp, but right now that would be more of a lateral move than something that gives me a lot of benefits today.
You really don't get a lot of customization. You can change the theme to one of their 7 preset themes. In terms of fonts, you have one of 11 options. My favorite themes and fonts weren't on this list.
I would love some documentation on how I might write a plugin for Warp. There's stuff I would love to add but I couldn't really see how I might do that.
A lot of the game-changing stuff is still in the pipeline, things like real-time collaboration and shared environmental variables.
I wish they would share a bit more about how the app works in general. Even opening up the app bundle didn't tell me a lot. I have no reason to not trust this program, but anything they would be willing to share would be appreciated.

A Rust Mac App?

I'm very curious how they managed to make a Rust GUI application on the Mac. I'd love to see if there is some Swift UI or AppKit code in there or if they managed to get it done with the referenced Rust library. If they managed to make an application that feels this snappy without having to write Swift or Objective-C, all the more credit to this team. I'd love more information on how the app is constructed and specifically how they wrote the client front-end.

This does not feel like a "Mac app" though. Immediately you'll notice the lack of Preference pane underneath the "Warp" header on the menu bar. Help is not a drop-down but a search and in general there aren't a lot of MacOS specific options in the menu bar. This likely fits with their model of a common work platform across every OS, but if feeling "Mac-like" is important to you, know this doesn't. It's just as fast as a native application, but it doesn't have the UI feel of one.

Summary

If you are just starting out on the Mac as a development machine and want to use a terminal emulator, this is maybe the fastest to start with. It comes with a lot of the quality of life improvements you normally need to install a bunch of different pieces of software for. Also if you teach or end up needing to share a lot of code as you go, this "Sharing" functionality could be a real game-changer.

However if you, like me, spend your time mostly editing large blocks of text with Vim in the terminal, you aren't going to get a ton out of Warp right now. I just don't have a workflow that is going to really benefit from most of this stuff and while I appreciate their great tab completion, most of the commands I use are muscle memory at this point and have been for years.

I wish this team all the success though and cannot stress enough how great it is to see someone actually experimenting in this space. There are great ideas here, well executed with an eye for stability and quality. I'm going to keep a close eye on this product and will definitely revisit as things like "command reviews" are added down the line.

Like this? Think I'm out of my mind? Ping me on Twitter.

The hunt for a better Dockerfile

October 03, 2021 in DevOps

Time to thank Dockerfiles for their service and send them on their way

For why I don't think Dockerfiles are good enough anymore, click here. After writing about my dislike of Dockerfiles and what I think is a major regression in the tools Operations teams had to work with, I got a lot of recommendations of things to look at. I'm going to try to do a deeper look at some of these options and see if there is a reasonable option to switch to.

My ideal solution would be an API I could hit and just supply the parameters for the containers to. This would let me standardize the process with the same language I use for the app, write some tests around the containers and hook in things like CI logging conventions and exception tracking.

BuildKit

BuildKit is a child of the Moby project, an open-source project designed to advance the container space to allow for more specialized uses for containers. Judging from its about page, it seems to be staffed by some Docker employees and some folks from elsewhere in the container space.

What is the Moby project? Honestly I have no idea. They have on their list of projects high-profile things like containerd, runc, etc. You can see the list here. This seems to be the best explanation of what the Moby project is:

Docker uses the Moby Project as an open R&D lab, to experiment, develop new components, and collaborate with the ecosystem on the future of container technology. All our open source collaboration will move to the Moby project.

My guess is the Moby project is how Docker gets involved in open-source projects and in turns open-sources some elements of its stack. Like many things Docker does, it is a bit inscrutable from the outside. I'm not exactly sure who staffs most of this project or what their motivations are.

BuildKit walkthrough

BuildKit is built around a totally new model for building images. At its core is a new format for defining builds called LLB. It's an intermediate binary format that uses the Go Marshal function to seralize your data. This new model allows for actual concurrency in your builds, as well as a better model for caching. You can see more about the format here.

LLB is really about decoupling the container build process from Dockerfiles, which is nice. This is done through the use of Frontends, of which Docker is one of many. You run a frontend to convert a build definition (most often a Dockerfile) into LLB. This concept seems strange, but if you look at the Dockerfile frontend you will get a better idea of the new options open to you. That can be found here.

Of the most interest for most folks is the inclusion of a variety of different mounts. You have: --mount=type=cache which takes advantage of the more precise caching available due to LLB to persist the cache between building invocations. There is also --mount=type=secret which allows you to give the container access to secrets while ensuring they aren't baked into the image. Finally there is --mount=type=ssh which uses SSH agents to allow containers to connect using the hosts SSH to things like git over ssh.

In theory this allows you to build images using a ton of tooling. Any language that supports Protocol Buffers could be used to make images, meaning you can move your entire container build process to a series of scripts. I like this a lot, not only because the output of the build process gives you a lot of precise data about what was done, but you can add testing and whatever else.

In practice, while many Docker users are currently enjoying the benefits of LLB and BuildKit, this isn't a feasible tool to use right now to build containers using Go unless you are extremely dedicated to your own tooling. The basic building blocks are still shell commands you are executing against the frontend of Docker, although at least you can write tests.

If you are interested in what a Golang Dockerfile looks like, they have some good examples here.

buildah

With the recent announcement of Docker Desktop new licensing restrictions along with the IP based limiting of pulling images from Docker Hub, the community opinion of Docker has never been lower. There has been an explosion of interest in Docker alternatives, with podman being the frontrunner. Along with podman is a docker build alternative called buildah. I started playing around with the two for an example workflow and have to say I'm pretty impressed.

podman is a big enough topic that I'll need to spend more time on it another time, but buildah is the build system for podman. It actually predates podman and in my time testing it, offers substantial advantages over docker build with conventional Dockerfiles. The primary way that you use buildah is through writing shell scripts to construct images, but with much more precise control over layers. I especially enjoyed being able to start with an empty container that is just a directory and build up from there.

If you want to integrate buildah into your existing flow, you can also use it to build containers from Dockerfiles. Red Hat has a series of good tutorials to get you started you can check out here. In general the whole setup works well and I like moving away from the brittle Dockerfile model towards something more sustainable and less dependent on Docker.

I've never heard of PouchContainer before, an offering from Alibaba but playing around with it has been eye-opening. It's much more ambitious than a simple Docker replacement, instead adding on a ton of shims to various container technologies. The following diagram lays out just what we're talking about here:

The CLI called just pouch includes some standard options like building from a Dockerfile with pouch build. However this tool is much more flexible in terms of where you can get containers from, including concepts like pouch load which allows you to load up a tar file full of containers it will parse. Outside of just the CLI, you have a full API in order to do all sorts of things. Interested in creating a container with an API call? Check this out.

There is also a cool technology they call a "rich container", which seems to be designed for legacy applications where the model of one process running isn't sufficient and you need to kick off a nested series of processes. They aren't wrong, this is actually a common problem when migrating legacy applications to containers and it's not a bad solution to what is an antipattern. You can check out more about it here.

PouchContainer is designed around kubernetes as well, allowing for it to serve as the container plugin for k8s without needing to recompile. This combined with a P2P model for sharing containers using Dragonfly means this is really a fasinating approach to the creation and distribution of containers. I'm surprised I've never heard of it before, but alas looking at the repo it doesn't look like it's currently maintained.

Going through what is here though, I'm very impressed with the ambition and scope of PouchContainer. There are some great ideas here, from models around container distribution to easy to use APIs. If anyone has more information about what happened here or if is a sandbox somewhere I can use to learn more about this, please let me know on Twitter.

Packer, for those unfamiliar with it, is maybe the most popular tool out there for the creation of AMIs. These are the images that are used when an EC2 instance is launched, allowed organizations to install whatever software they need for things like autoscaling groups. Packer uses two different concepts for the creation of images:

Builders, which are the list of platforms you can build an image on as found here.
Provisioners, which is what you are running to make the image which supports everything under the sun. You can see the full list here.

This allows for organizations that are using things like Ansible to configure boxes after they launch to switch to baking the AMI before the instance is started. This saves time and involves less overhead. What's especially interesting for us is this allows us to set up Docker as a builder, meaning we can construct our containers using any technology we want.

How this works in practice is we can create a list of provisioners in our packer json file like so:

"provisioners": [{
        "type": "ansible",
        "user": "root",
        "playbook_file": "provision.yml"
    }],

So if we want to write most of our configuration in Ansible and construct the whole thing with Packer, that's fine, We can also use shell scripts, Chef, puppet or whatever other tooling we like. In practice you define a provisioner with whatever you want to run, then a post-processor pushing the image to your registry. All done.

Summary

I'm glad that there exists options for organizations looking to streamline their container experience. If I were starting out today and either had existing Ansible/Puppet/Chef infrastructure as code, I would go with Packer. It's easy to use and allows you to keep what you have with some relatively minor tweaks. If I were starting out fresh, I'd see how far I could get with buildah. There seems to be more community support around it and Docker as a platform is not looking particularlly robust at this particular moment.

While I strongly prefer using Ansible for creating containers vs Dockerfiles, I think the closest to the "best" solution is the buildkit Go client approach. You would still get the benefits of buildkit while being able to very precisely control exactly how a container is made, cache, etc. However the buildah process is an excellent middle group, allowing for shell scripts to create images that, ideally, contain the optimizations inherit with the newer process.

Outstanding questions I would love the answers to:

Is there a library or abstraction that allows for a less complicated time dealing with buildkit? Ideally something in Golang or Python, where we could more easily interact with it?
Or are there better docs for how to build containers in code with buildkit that I missed?
With buildah are there client libraries out there to interact with its API? Shell scripts are fine, but again ideally I'd like to be writing critical pieces of infrastructure in a language with some tests and something where the amount of domain specific knowledge would be minimal.
Is there another system like PouchContainer that I could play around with? An API that allows for the easy creation of containers through standard REST calls?

Know the answers to any of these questions or know of a Dockerfile alternative I missed? I'd love to know about it and I'll test it. Twitter

Stuff to read

September 14, 2021

There is a lot of downtime in a modern tech workers life. From meetings to waiting for deployments to finish, you can spend a lot of time watching a progress bar. Or maybe you've been banging your head against a problem all day, getting nowhere. Perhaps the endless going from one room of your apartment to another that is modern work has driven you insane, dreading your "off-time" of sitting in a different room and watching a different screen.

I can't make the problems go away, but I can give you a little break from the monotony. Take advantage of one of the perks of forever WFH and read for 5-10 minutes. I find the context switching to be a huge mental relief and it makes me feel like I'm still getting something out of the time. I've tried to organize them for a variety of moods, but feel free to tell me on Twitter if I'm missing some.

Watching a progress bar

Got something compiling, deploying or maybe just running your tests? Here's some stuff to read while you wait.

Ignition!: An informal history of liquid rocket propellants

A funny and insightful look into the history of how modern rocket fuel was developed. The writing style is very light and you can pick it up, read a few pages, check on your progress bar and get back to it. This is a part of the Space race story I didn't know much about.

Why Fish Don't Exist: A Story of Loss, Love, and the Hidden Order of Life

A history of Taxonomy in the United States and the story of one of its stars, David Starr Jordan. It is a wild story, touching on the founding of Stanford University, the history of sterilization programs in the US and everything between. The primary theme is attempting to impose order among chaos, something I think many technology workers can relate to. I was hooked from the intro on.

Picture the person you love the most. Picture them sitting on the couch, eating cereal, ranting about something totally charming, like how it bothers them when people sign their emails with a single initial instead of taking those four extra keystrokes to just finish the job-

Chaos will get them. Chaos will crack them from the outside - with a falling branch, a speeding car, a bullet - or unravel them from the inside, with the mutiny of their very own cells. Chaos will rot your plants and kill your dog and rust your bike. It will decay your most precious memories, topple your favorite cities, wreck any sanctuary you can ever build.

Blacktop Wasteland

A mistake is a lesson, unless you make the same mistake twice.

This is a great thriller, crime novel that is easy to pick up and burn through. The first chapter might be some of the tightest writing I've seen in awhile, introducing everything you need and not wasting your time. It's all about the last job a wheel-man needs to pull, a classic in these kinds of novels. Great little novel to pick up, read for 15 minutes and put back down.

Forensics: What Bugs, Burns, Prints, DNA and More Tell Us About Crime

This is the history of Forensics as told through a series of cases, making each section digestable in a limited amount of time. While there is an understandable amount of skepticism about some areas of forensics, this book really sticks to the more established scientific practices. Of particular interest to me is why you might use one tool over another in different situations.

You are frustrated by a problem and want a break

Maybe you've thrown everything you have at a problem and somehow made it worse. Is that project you inherited from the person who no longer works here an undocumented rats nest of madness? Go sit in the beanbag chairs your office provides but nobody ever intended for you to sit in. You know the ones, by the dusty PlayStation that serves as a prop of how fun your office is. Take a load off and distract yourself with one of these gems.

Trainspotting

I know, you saw the movie. The book is a classic for a good reason. It's funny, it is sad and the characters are incredibly authentic. However the biggest reason I recommend it for people needing a short mental break from a problem is it is written in a Scottish accent. You'll need to focus up a bit to get the jokes, which for me helps push problems out of my head for a few minutes.

A Confederacy of Dunces

"Oh, Fortuna, blind, heedless goddess, I am strapped to your wheel," Ignatius belched. "Do not crush me beneath your spokes. Raise me on high, divinity."

If you have never had the pleasure of reading this book, I'm so jealous of you. Ignatius J. Reilly is one of the strangest characters I've ever been introduced to in a book. His journey around New Orleans is bizarre and hilarious, with a unique voice I've never read from an author before or since. You'll forget what you were working on in seconds.

I Feel Bad About My Neck

“One of my favorite things about New York is that you can pick up the phone and order anything and someone will deliver it to you. Once I lived for a year in another city, and almost every waking hour of my life was spent going to stores, buying things, loading them into the car, bringing them home, unloading them, and carrying them into the house. How anyone gets anything done in these places is a mystery to me.”

Written by the hilarious Nora Ephron, it is an honest and deeply funny commentary about being a woman of a certain age. She's done it all, from writing hits like When Harry Met Sally to interning in the Kennedy White House. If you are still thinking about your problem 5 pages in, you aren't holding it right.

Three Men in a Boat

It's from the late 1800s, so you will need to focus up a bit to follow the story. But there is a reason this comedic gem hasn't been out of print since it was introduced. It's the story of three men and a dog, but on a bigger level is about the "clerking class" of London, a group of people who if they lived today would likely be working in startups around the world. So sit back and enjoy a ride on the Thames.

You hate this work

We've all been there. You realize that whatever passion you had for programming or technology in general is gone, replaced with a sense of mild dread every time you open a terminal and go to the git repository. Maybe it was a boss telling you that you need to migrate from one cloud provider to another. Or maybe you found yourself staring out a window at someone hauling garbage and think well at least they can say they did something at the end of the day. I can't solve burnout, but I can allow you to indulge those feelings for awhile. Then back to the git repo, you little feature machine! We need a new graph for that big customer.

New Maps - deindustrial fiction

A successor to the much-enjoyed Into the Ruins, this quarterly journal of short stories explores a post-industrial world. However this isn't Star Trek, but instead stories of messy futures with people making do. When it all feels too much in the face of climate change and the endless cycle of creation and destruction in technology, I like to reach for these short stories. They don't fill you with hope necessarily, but they are like aloe cream for burnout.

An Archdruid's Tales: Fiction From The Archdruid Report

This is an anthology of the short stories published the now-gone Archdruid Report. They're a little strange and out there, but it's a similar energy to New Maps.

The Golden Apples

If you want to escape modern life for a few hours and go to the 1950s South, this will do it. It's here, the good and bad, not really short stories but more a loosely combined collection of stories. If you've never been exposed to Eudora Welty’s writing, get ready for writing that is both light and surprisingly dense, packed full of meaning and substance.

The Fortnight in September

This is a story of normal people living ordinary lives in England, after World War One. There is a simplicity to it that is delightful because it's not a trick. You will start to care about what happens to this community and the entire thing has a refreshing lack of gravity to it. It's a beach read, something to enjoy with low stakes but will stick with you days after you finish it.

The Baron in the Trees

What's not to enjoy about an Italian noble who leaves everything behind to live in a tree for the rest of his life? About as right to the point as you can get, but it is also about the passing of an age in civilization, which feels appropriate right now.

TIL I've been changing directories incorrectly

September 12, 2021 in TIL

One of my first tasks when I start at a new job is making a series of cd alias in my profile. These are usually to the git repositories where I'm going to be doing the most work, but its not an ideal situation because obviously sometimes I work with repos only once in a while. This is to avoid endless cd ../../../ or starting from my home directory every time.

I recently found out about zoxide and after a week of using it I'm not really sure why I would ever go back to shortcuts. It basically learns the paths you use, allowing you to say z directory_name or z term_a term_b. Combined with fzf you can really zoom around your entire machine with no manually defined shortcuts. Huge fan.

Are Dockerfiles good enough?

September 10, 2021

For those looking for a fast overview of containers click here.

Containers have quickly become the favorite way to deploy software, for a lot of good reasons. They have allowed, for the first time, developers to test "as close to production" as possible. Unlike say, VMs, containers have a minimal performance hit and overhead. Almost all of the new orchestration technology like Kubernetes relies on them and they are an open standard, with a diverse range of corporate rulers overseeing them. In terms of the sky-high view, containers have never been in a better place.

I would argue though that in our haste to adopt this new workflow, we missed some steps. To be clear, this is not to say containers are bad (they aren't) or that they aren't working correctly (they are working mostly as advertised). However many of the benefits to containers aren't being used by organizations correctly, resulting in a worse situation than before. While it is possible to use containers in a stable and easy-to-replicate workflow across a fleet of servers, most businesses don't.

We're currently in a place where most organizations relying on containers don't use them correctly. At the same time, we also went back 10+ years in terms of the quality of tools Operation teams have for managing servers as defined broadly as "places where our code runs and accepts requests". There has been a major regression inside of many orgs who now tolerate risks inside containers that never would have been allowed on a fleet of virtual machines.

For me a lot of the blame seems to rest with Dockerfiles. They aren't opinionated enough or flexible enough, forcing people into workflows where they can make catastrophic mistakes with no warning, relying too much on brittle bash scripts and losing a lot of the tools we gained in Operations over the last decade.

What did containers replace?

In the beginning, there were shell scripts and they were bad. The original way a fleet of servers was managed when I started was, without a doubt, terrible. There was typically two physical machines for the databases, another 4 physical machines for the application servers, some sort of load balancer, and then networking gear at the top of the rack. You would PXE boot the box onto an install VLAN and it would kind of go from there.

There was a user with an SSH key added, usually admin. You would then run a utility to rsync a directory of bash scripts that would run. Very quickly, you would run into problems. Writing bash scripts is not "programming light", it's just real programming. But it's programming with both hands tied behind your back. You still need to be writing functions, encapsulate the logic, handling errors, etc. But bash doesn't want to help you do this.

You can still get thrown with undefined variables, comparisons vs assignment is a constant issue when people start writing bash ( foo=bar vs foo = bar), you might not check to make sure bash is the shell you are running and a million other problems. Often you had these carefully composed scripts with raw sh just in case the small things bash does to make your life better were not there. I have worked with people who are expert bash programmers and can do it correctly, but it is not a safer, easier or more reliable programming environment.

Let's look at a basic example I see all the time.

for f in $(ls *.csv); do    
    some command $f         
done

I wish this worked like I assumed it did for years. But it doesn't. You can't treat ls like a stable list and iterate over it. You'll have to account for whitespace in the file name, checking for glob characters, ls can mangle filenames. This is just a basic example of something that everyone assumes they are doing right until it causes a massive problem.

The correct way I know to do this looks like this:

while IFS= read -r -d '' file; do
  some command "$file"
done < <(find . -type f -name '*.csv' -print0)

Do you know what IFS is? It's ok if you don't, I didn't for a long time. My point is that this requires a lot of low-level understanding of how these commands work in conjunction with each other. But for years around the world, we all made the same mistakes over and over. However, things began to change for the better.

As time went on new languages became the go-to for Sys Admin tasks. We started to replace bash with python, which was superior in every imaginable way. Imagine being able to run a debugger with a business-critical bootstrapping script? This clearly emerged as the superior paradigm for Operations. Bash still has a place, but it couldn't be the first tool we reached for every time.

So we got new tools to match this new understanding. While there are a lot of tools that were used to manage fleets of servers, I'm going to focus on the one I used the most professionally: Ansible. Ansible is a configuration management framework famous for a minimal set of dependencies (Python and SSH), being lightweight enough to deploy to thousands of targets from a laptop and having a very easy-to-use playbook structure.

Part of the value of Ansible was its flexibility. It was very simple to write playbooks that could be used across your organization, applying different configurations to different hosts depending on a variety of criteria, like which inventory they were in or what their hostnames were. There was something truly magical about being able to tag a VLAN at the same time as you stood up a new database server.

Ansible took care of the abstraction between things like different Linux distributions, but its real value was in the higher-level programming concepts you could finally use. Things like sharing playbooks between different sets of servers, writing conditionals for whether to run a task or not on that resource, even writing tests on information I could query from the system. Finally, I could have event-driven system administration code, an impossibility with bash.

With some practice, it was possible to use tools like Ansible to do some pretty incredible stuff, like calling out to an API with lookups to populate information. It was a well-liked, stable platform that allowed for a lot of power. Tools like Ansible Tower allowed you to run Ansible from a SaaS platform that made it possible to keep a massive fleet of servers in exact configuration sync. While certainly not without work, it was now possible to say with complete confidence "every server in our fleet is running the exact same software". You could even do actual rolling deploys of changes.

This change didn't eliminate all previous sources of tension though. Developers could not just upgrade to a new version of a language or install random new binaries from package repositories on the system. It created a bottleneck, as changes had to be added to the existing playbooks and then rolled out. The process wasn't too terrible but it wasn't hands-off and could not be done on-demand, in that you could not decide in the morning to have a new cool-apt-package in production by that afternoon.

Then containers appeared

When I was first introduced to Docker, I was overjoyed. This seemed like a great middle step between the two sets of demands. I could still rely on my mature tooling to manage the actual boxes, but developers would have control and responsibility for what ran inside of their containers. Obviously, we would help but this could be a really good middle ground. It certainly seemed superior to developers running virtual machines on their laptops.

Then I sat down and started working with containers and quickly the illusion was shattered. I was shocked and confused, this was the future? I had to write cron jobs to clean up old images, why isn't this a config file somewhere? Why am I managing the docker user and group here? As it turns out installing docker would be the easy part.

Application teams began to write Dockerfiles and my heart started to sink. First, because these were just the bash scripts of my youth again. The learning curve was exactly the same, which is to say a very fast start and then a progressively more brutal arc. Here are some common problems I saw the first week I was exposed to Dockerfiles that I still see all the time:

FROM: ubuntu:latest Already we have a problem. You can pull that down to your laptop, work for a month, deploy it to production, and be running a totally different version of Ubuntu. You shouldn't use latest but you also shouldn't be using other normal tags. The only tool Docker gives you to ensure everyone is running the exact same thing is the SHA. Please use it. FROM ubuntu@sha256:cf25d111d193288d47d20a4e5d42a68dc2af24bb962853b067752eca3914355e is less catchy but it is likely what you intended. Even security updates should be deliberate.
apt-get is a problem. First, don't run apt-get upgrade otherwise we just upgraded all the packages and defeated the point. We want consistent, replicable builds. I've also seen a lot of confusion between users on apt vs apt-get.
COPY yourscript.py before RUN install dependencies breaks the caching functionality.
Running everything as root. We never let your code run as root before, why is it now suddenly a good idea? RUN useradd --create-home cuteappusername should be in there.
Adding random Linux packages from the internet. I understand it worked for you, but please stick to the official package registry. I have no idea what this package does or who maintains it. Looking at you, random curl in the middle of the Dockerfile.
Writing brittle shell scripts in the middle of the Dockerfile to handle complicated operations like database migrations or external calls, then not accounting for what happens if they fail.
Please stop putting secrets in ENV. I know, we all hate secrets management.
Running ADD against unstable URL targets. If you need it, download it and copy it to the repo. Stop assuming random URL will always work.
Obsessing about container size over everything else. If you have a team of Operations people familiar with Debian, following Debian releases, plugging into the ecosystem, why throw all that expertise in the trash for a smaller container?
Less && and && \please. This one isn't your fault but sometimes looking at complicated Dockerfiles makes my eyes hurt.
Running a full Linux container for a script. Thankfully Google has already solved this one.

This is not your fault

You may be looking at this list and be like "I know all this because I read some article or book". Or maybe you are looking at this list and thinking "oh no I do all of that". I'm not here to judge you or your life. Operations people knew this was a problem and it was a problem we had come to the conclusion could not be fixed by assuming people would magically discover this information.

My frustration is that we already went through this learning. We know that the differences between how distros handle packages throw people off. We know bash scripts are hard to write and easy to mess up. The entire industry learned through years of pain that it was essential you be able to roll back not just your application, but the entire infrastructure that the application is running on. Creating endless drift in infrastructure worked until it didn't when suddenly teams had to spend hours trying to reverse engineer what of the dozens of changes introduced with the latest update caused a problem.

In our rush to get to a place where obstacles were removed from developers, we threw away years of hard-earned experience. Hoping for the best and having absolutely no way to recover if it doesn't work isn't a plan. It isn't even really a philosophy. Saying "well the Linux part isn't the important part of my application" is fine until that is very much not the case. Then you are left in an extremely difficult position, reaching for troubleshooting skills your organization might not even have anymore.

Stuff we can do now

Start running a linter against our Dockerfiles: https://github.com/hadolint/hadolint
Look at alternatives to conventional Dockerfiles. Below is an example of combining Ansible and the Dockerfile template.

FROM debian@sha256:47b63f4456821dcd40802ac634bd763ae2d87735a98712d475c523a49e4cc37e

# Install Ansible
RUN apt-get update && apt-get install -y wget gcc make python python-dev python-setuptools python-pip libffi-dev libssl-dev libyaml-dev
RUN pip install -U pip
RUN pip install -U ansible

# Setup environment
RUN mkdir /ansible
COPY . /ansible
ENV ANSIBLE_ROLES_PATH /ansible/roles
ENV ANSIBLE_VAULT_PASSWORD_FILE /ansible/.vaultpass

# Launch Ansible playbook
RUN cd /ansible && ansible-playbook -c local -v example.yml

# Cleanup
RUN rm -rf /ansible
RUN apt-get purge -y python-dev python-pip
RUN apt-get autoremove -y && apt-get autoclean -y && apt-get clean -y

# Final steps
ENV HOME /home/test
WORKDIR /
USER test

CMD ["/bin/bash"]

It's not perfect but is is better.

Better than this would be to use Packer. It allows for developers to string together Docker as a builder and Ansible or Puppet as a provisioner! It's the best of all possible worlds. Here are the details. Plus you can still run all the Dockerfile commands you want.

Place I would love to get to

I would love for some blessed alternative to Dockerfiles to emerge. We don't want to break backwards compatibility but I would love a less brittle tool to work with. Think like Terraform or Packer, something sitting between me and the actual build. It doesn't need to be a full programming language but some guardrails around me making common mistakes is desperately needed, especially as there are fewer and fewer restrictions between developers and production.

Questions/comments/does this tool already exist and I don't know about it? Hit me up on twitter.

Operations is not Developer IT

September 03, 2021

My code doesn't compile. Why?

The number of times in my career I have been asked a variation on "why doesn't my application work" is shocking. When you meet up with Operations people for drinks, you'll hear endless variations on it. Application teams attempting to assign ownership of a bug to a networking team because they didn't account for timeouts. Infrastructure teams being paged in the middle of the night because an application suddenly logs 10x what it did before and there are disk space issues. But to me nothing beats the developer who pings me being like "I'm getting an error message in testing from my application and I'd like you to take a look".

It is baffling on many levels to me. First, I am not an application developer and never have been. I enjoy writing code, mostly scripting in Python, as a way to reliably solve problems in my own field. I have very little context on what your application may even do, as I deal with many application demands every week. I'm not in your retros or part of your sprint planning. I likely don't even know what "working" means in the context of your app.

Yet as the years go on the number of developers who approach me and say "this worked on my laptop, it doesn't work in the test environment, why" has steadily increased. Often they have not even bothered to do basic troubleshooting, things like read the documentation on what the error message is attempting to tell you. Sometimes I don't even get an error message in these reports, just a development saying "this page doesn't load for me now but it did before". The number of times I have sent a full-time Node developer a link to the Node.js docs is too high.

Part of my bafflement is this is not acceptable behavior among Operations teams. When I was starting out, I would never have wandered up to the Senior Network Administrator and reported a bug like "I sometimes have timeouts that go away. Do you know why?" I would have been politely but sternly told to do more troubleshooting. Because in my experience Operations is learned on the job, there was a culture of training and patience with junior members of the team. Along with that was a clear understanding that it was my obligation to demonstrate to the person I was reporting this error to:

1. What specifically the error was.

2. Why that error was something that belonged to them.

Somehow in the increased lack of distinction between Development and Operations, some developers, especially younger ones, have come to see Operations as their IT department. If a problem wasn't immediately recognizable as one resulting from their work, it might be because of "the servers" or "the network", meaning we could stop what we were doing and ask Operations to rule that out before continuing.

How did we get here?

First they came for my QA team...

When I started my career in Operations, it was very different from what exists today. Everything lived in or around datacenters for us, the lead time on new servers was often measured in months not minutes and we mostly lived in our own world. There were network engineers, managing the switches and routers. The sysadmins ruled over the boxes and, if the org was large enough, we had a SAN engineer managing the massive collection of data.

Our flow in the pipeline was pretty simple. The QA team approved some software for release and we took that software to our system along with a runbook. We treated software mostly like a black box, with whatever we needed to know contained inside of the runbook. Inside were instructions on how to deploy it, how to tell if it was working and what to do if it wasn't working. There was very little expectation that we could do much to help you. If a deployment went poorly in the initial rollout, we would roll back and then basically wait for development to tell us what to do.

There was not a lot of debate over who "owned" an issue. If the runbook for an application didn't result in the successful deployment of an application, it went back to development. Have a problem with the database? That's why we have two DBAs. Getting errors on the SAN? Talk to the SAN engineer. It was a slow process at times, but it wasn't confusing. Because it was slower, often developers and these experts could sit down and share knowledge. Sometimes we didn't agree, but we all had the same goal: ship a good product to the customer in a low-stress way.

Deployments were events and we tried to steal from more mature industries. Runbooks were an attempt to quantify the chaotic nature of software development, requiring at least someone vaguely familiar with how the application worked to sit down and write something about it. We would all sit there and watch error logs, checking to see if some bash script check failed. It was not a fast process as compared to now but it was simple to understand.

Of course this flow was simply too straightforward and involved too many employees for MBAs to allow it to survive. First we killed QA, something I am still angry about. The responsibility for ensuring that the product "worked as intended" was shifted to development teams, armed with testing frameworks that allowed them to confirm that their API endpoints returned something like the right thing. Combined with the complete garbage fire that is browser testing, we now had incredibly clunky long running testing stacks that could roughly approximate a single bad QA engineer. Thank god for that reduced headcount.

With the removal of QA came increased pressure to ship software more often. This made sense to a lot of us as smaller more frequent changes certainly seemed less dangerous than infrequent massive changes of the entire codebase. Operations teams started to see more and more pressure to get stuff out the door quickly. New features attract customers, so being the first and fastest to ship had a real competitive advantage. Release windows shrunk, from cutting a release every month to every week. The pressure to ship also increased as management looked at the competitive landscape growing more aggressive with cycles.

Soon the runbook was gone and now developers ran their own deployment schedule, pushing code out all the time. This was embraced with a philosophy called DevOps, a concept that the two groups, now that QA was dead and buried, would be able to tightly integrate to close this gap even more. Of course this was sold to Development and Operations as if it would somehow "empower" better work out of them, which was of course complete nonsense.

Instead we now had a world where all ownership of problems was muddled and everyone ended up owning everything.

This is the funniest image maybe on the internet.

DevOps is not a decision made in isolation

When Operations shifted focus to the cloud and to more GitOps style processes, there was an understanding that we were all making a very specific set of tradeoffs. We were trading tight cost control for speed, so never again would a lack of resources in our data centers cause a single feature not to launch. We were also trading safety for speed. Nobody was going to sit there and babysit a deploy, the source of truth was in the code. If something went wrong or the entire stack collapsed, we could "roll back", a concept that works better in annoying tech conference slide decks then in practice.

We soon found ourselves more pressed than ever. We still had all the responsibilities we had before, ensuring the application was available, monitored, secure and compliant. However we also built and maintained all these new pipelines, laying the groundwork for empowering development to get code out quickly and safely without us being involved. This involved massive retraining among operations teams, shifting from their traditional world of bash scripts and Linux to learning the low-level details of their cloud provider and an infrastructure as code system like Terraform.

For many businesses, the wheels came off the bus pretty quickly. Operations teams struggled to keep the balls in the air, shifting focus between business concerns like auditing and compliance to laying the track for Development to launch their products. Soon many developers, frustrated with waiting, would attempt to simply "jump over" Operations. If something was easy to do in the AWS web console on their personal account, certainly it was trivial and safe to do in the production system? We can always roll back!

In reality there are times when you can "roll back" infrastructure and there are times you can't. There are mistakes or errors you can make in configuring infrastructure that are so catastrophic it is difficult to estimate their potential impact to a business. So quickly Operations teams learned they needed to install rails to infrastructure as code, guiding people to the happy safe path in a reliable and consistent way. This is slow though and after awhile started to look a lot like what was happening before with datacenters. Businesses were spending more on the cloud than on their old datacenters but where was the speed?

Inside engineering, getting both sides of the equation to agree in the beginning "fewer blockers to deploying to production is good" was the trivial part. The fiercer fights were over ownership. Who is responsible in the middle of the night if an application starts to return errors? Historically operations was on-call, relying on those playbooks to either resolve the problem or escalate it. Now we had applications going out with no documentation, no clear safety design, no QA vetting and sometimes no developers on-call to fix it. Who owns an RDS problem?

Tools like Docker made this problem worse, with developers able to craft perfect application stacks on their laptops and push them to production with mixed results. As cloud providers came to provide more and more of the functionality, soon for many teams every problem with those providers also fell into Operations lap. Issues with SQS? Probably an Operations issue. Not sure why you are getting a CORS error on S3? I guess also an Operations problem!

The dream of perfect harmony was destroyed with the harsh reality that someone has to own a problem. It can't be a community issue, someone needs to sit down and work on it. You have an incentive in modern companies to not be the problem person, but instead to ship new features today. Nobody gets promoted for maintenance or passing a security audit.

Where we are now

In my opinion the situation has never been more bleak. Development has been completely overwhelmed with a massive increase in the scope of their responsibilities (RIP QA) but also with unrealistic expectations by management as to speed. With all restrictions lifted, it is now possible and expected that a single application will get deployed multiple times a day. There are no real limiters except for the team itself in terms of how fast they can ship features to customers.

Of course this is just a fundamental misunderstanding about how software development works. It isn't a factory and they aren't "code machines". The act of writing code is a creative exercise, something that people take pride in. Developers, in my experience, don't like shipping bad or rushed features. What we call "technical debt" can best be described as "the shortcuts taken today that have to be paid off later". Making an application is like building a house, you can take shortcuts but they aren't free. Someone pays for them later, but probably not the current executive in charge of your specific company so who cares.

Due to this, developers are not incentivized or even encouraged to gain broader knowledge of how their systems work. Whereas before you might reasonable be expected to understand how RabbitMQ works, SQS is "put message in, get message out, oh no message is not there, open ticket with Ops". This situation has gotten so bad that we have now seen the adoption of widespread large-scale systems like Kubernetes who attempt to abstract away the entire stack. Now there is a network overlay, a storage overlay, healthchecks and rollbacks all inside the stack running inside of the abstraction that is a cloud provider.

Despite the bullshit about how this was going to empower us to do "our best work faster", the results have been clear. Operations is drowning, forced to learn both all the fundamentals their peers had to learn (Linux, networking, scripting languages, logging and monitoring) along with one or more cloud providers (how do network interfaces attach to EC2 instances, what are the specific rules for how to invalidate caches on Cloudfront, walk me through IAM Profiles). On top of all of that, they need to understand the abstraction on top of this abstraction, the nuance of how K8 and AWS interact, how storage works with EBS, what are you monitoring and what is it doing. They also need to learn more code than before, now often expected to write relatively complicated internal applications which manage these processes.

They're all your problem now. $100 to the first C-level who can explain what a trace is

With this came monitoring and observability responsibilities as well. Harvesting the metrics and logs, shipping them somewhere, parsing and shipping them, then finally making them consumable by development. A group of engineers who know nothing about how the application works, who have no control over how it functions or what decisions it makes, need to own determining whether it is working or not. The concept makes no sense. Nuclear reactor technicians don't ask me if the reactor is working well or not, I have no idea what to even look for.

Developers simply do not have the excess capacity to sit down and learn this. They are certainly intellectually capable, but their incentives are totally misaligned. Every meeting, retro and sprint is about getting features out the door faster, but of course with full test coverage and if it could be done in the new cool language that would be ideal. When they encounter a problem they don't know the answer to, they turn to the Operations team because we have decided that means "the people who own everything else in the entire stack".

It's ridiculous and unsustainable. Part of it is our fault, we sell tools like Docker and Kubernetes and AWS as "incredibly easy to use", not being honest that all of them have complexity which matter more as you go. That testing an application on your laptop and hitting a "go to production" button works, until it doesn't. Someone will always have to own that gap and nobody wants to, because there is no incentive to. Who wants to own the outage, the fuck up or the slow down? Not me.

In the meantime I'll be here, explaining to someone we cannot give full administrator IAM rights to their serverless application just because the internet said that made it easier to deploy. It's not their fault, they were told this was easy.

Thoughts/opinions? @duggan_mathew on twitter

How does Apple Private Relay Work?

August 31, 2021 in Apple

What is Apple Private Relay?

Private Relay is an attempt by Apple to change the way traffic is routed from user to internet service and back. This is designed to break the relationship between user IP address and information about that user, reducing the digital footprint of that user and eliminating certain venues of advertising information.

It is a new feature in the latest version of iOS and MacOS that will be launching in "beta mode". It is available to all users who pay Apple for iCloud storage and I became interested in it after watching the WWDC session about preparing for it.

TL;DR

Private Relay provides real value to users, but also fundamentally changes the way network traffic flows across the internet for those users. Network administrators, programmers and owners of businesses which rely on IP addresses from clients for things like whitelisting, advertising and traffic analysis should be aware of this massive change. It is my belief that this change is not getting enough attention in the light of the CSAM scanning.

What happens when you turn on Private Relay?

The following traffic is impacted by Private Relay

All Safari web browsing
All DNS queries
All insecure HTTP traffic

Traffic from those sources will no longer take the normal route to their destination, instead being run through servers controlled by either Apple or its partners. They will ingress at a location close to you and then egress somewhere else, with an IP address known to be from your "region". In theory websites will still know roughly where you are coming from, but won't be able to easily combine that with other information they know about your IP address to enrich targeted advertisements. Access logs and other raw sources of data will also be less detailed, with the personally identifiable information that is your IP address no longer listed on logs for every website you visit.

Why is Apple doing this?

When you go to a website, you are identified in one of a thousand ways, from cookies to device fingerprinting. However one of the easiest ways is through your IP address. Normal consumers don't have "one" IP address, they are either given one by their ISP when their modem comes online and asks for one, or their ISP has them behind "carrier-grade NAT". So normally what happens is that you get your modem, plug it in, it receives an IP address from the ISP and that IP addresses identifies you to the world.

Normally how the process works is something like this:

Your modems MAC address appears on the ISPs network and requests an IP address
The ISP does a lookup for the MAC address, makes sure it is in the table and then assigns an IP, ideally the same IP over and over again so whatever cached routes from the ISPs side exist are still used.
All requests from your home are mapped to a specific IP addresses and, over time, given the combination of other information about the browsing history and advertising data, it is possible to combine the data together to know where you live and who you are within a specific range.
You can see how close the geographic data is by checking out the map available here. For me it got me within a few blocks of my house, which is spooky.

CGNAT

Because of IPv4 address exhaustion, it's not always possible to assign every customer their own IP address. You know you have a setup like this because the IP address your router gets is in the "private range" of IP addresses, but when you go to IP Chicken you'll have a non-private IP address.

Private IP ranges include:

10.0.0.0 – 10.255.255.255
172.16.0.0 – 172.31.255.255
192.168.0.0 – 192.168.255.255

For those interested you can get more information about how CGNAT works here.

Doesn't my home router do that?

Yeah so your home router kind of does something similar with its own IP address ranges. So next time a device warns you about "double-NAT" this might be what it is talking about, basically nested NAT. (Most often double-NAT is caused by your modem also doing NAT though.) Your home router runs something called PAT or PAT in overload. I think more often it is called NAPT in modern texts.

This process is not that different from what we see above. One public IP address is shared and the different internal targets are identified with ports. Your machine makes an outbound connection, your router receives the request and rewrites the packet with a random high port. Every outbound connection gets its own entry in this table.

IP Exposed

So during the normal course of using the internet, your IP address is exposed to the following groups

Every website or web service you are connecting to
Your DNS server also can have a record of every website you looked up.
Your ISP can obviously see where every request to and from your home went to

This means there are three groups of people able to turn your request into extremely targeted advertising. The most common one I see using IP address is hyper-local advertising. If you have ever gotten an online ad for a local business or service and wondered "how did they know it was me", there is a good chance it was through your IP.

DNS is one I think is often forgotten in the conversation about leaking IPs, but since it is a reasonable assumption that if you make a DNS lookup for a destination you will go to that destination, it is as valuable as the more invasive systems without requiring nearly as much work. Let's look at one popular example, Google DNS.

Google DNS

The famous 8.8.8.8. Google DNS has become famous because of the use of DNS around the world as a cheap and fast way to block network access for whole countries or regions. A DNS lookup is just what turns domain names into IP address. So for this site:

➜  ~ host matduggan.com
matduggan.com has address 67.205.139.103

Since DNS servers are normally controlled by ISPs and subject to local law, it is trivial if your countries leadership wants to block access for users to get to Twitter by simply blocking lookups to twitter.com. DNS is a powerful service that is normally treated as an afterthought. Alternatives came up, the most popular being Google DNS. But is it actually more secure?

Google asserts that they only store your IP address for 24-48 hours in their temporary logs. When they migrate your data to their permanent DNS logs, they remove IP address and replace with region data. So instead of being able to drill down to your specific house, they will only be able to tell your city. You can find more information here. I consider their explanation logical though and think they are certainly more secure when compared to a normal ISP DNS server.

Most ISPs don't offer that luxury, simply prefilling their DNS servers when you get your equipment from them and add it to the network. There is very little information about what they are doing with it that I was able to find, but they are allowed now to sell that information if they so choose. This means the default setting for US users is to provide an easy to query copy of every website their household visits to their ISP.

So most users will not take the proactive step to switch their DNS servers to one provided by Google or other parties. However since most folks won't do that, the information is just being openly shared with whoever has access to that DNS server.

NOTE: If you are looking to switch your DNS servers off your ISP, I recommend dns.watch. I've been using them for years and feel strongly they provide an excellent service with a minimum amount of fuss.

How does Private Relay address these concerns?

DNS

This is how a normal DNS lookup works.

Apple and Cloudflare engineers have proposed a new standard, which they discuss in their blog post here. ODNS or "oblivious DNS" is a system which allows clients to mask the originator of the request from the server making the lookup, breaking the IP chain.

This is what ODNS looks like:

Source: Princeton paper

This is why all DNS queries are getting funneled through Private Relay, removing the possibility of ISP DNS servers getting this valuable information. It is unclear to me in my testing if I am using Apple's servers or Cloudflares 1.1.1.1 DNS service. With this system it shouldn't matter in terms of privacy.

2. Website IP Tracking

When on Private Relay, all traffic is funneled first through an Apple ingress service and then out through a CDN partner. Your client makes a lookup to one of these two DNS entries using our new fancy ODNS:

mask.icloud.com
mask-h2.icloud.com

This returns a long list of IP addresses for you to choose from:

mask.icloud.com is an alias for mask.apple-dns.net.
mask.apple-dns.net has address 172.224.41.7
mask.apple-dns.net has address 172.224.41.4
mask.apple-dns.net has address 172.224.42.5
mask.apple-dns.net has address 172.224.42.4
mask.apple-dns.net has address 172.224.42.9
mask.apple-dns.net has address 172.224.41.9
mask.apple-dns.net has address 172.224.42.7
mask.apple-dns.net has address 172.224.41.6
mask.apple-dns.net has IPv6 address 2a02:26f7:34:0:ace0:2909::
mask.apple-dns.net has IPv6 address 2a02:26f7:36:0:ace0:2a05::
mask.apple-dns.net has IPv6 address 2a02:26f7:36:0:ace0:2a07::
mask.apple-dns.net has IPv6 address 2a02:26f7:34:0:ace0:2904::
mask.apple-dns.net has IPv6 address 2a02:26f7:34:0:ace0:2905::
mask.apple-dns.net has IPv6 address 2a02:26f7:36:0:ace0:2a04::
mask.apple-dns.net has IPv6 address 2a02:26f7:36:0:ace0:2a08::
mask.apple-dns.net has IPv6 address 2a02:26f7:34:0:ace0:2907::

These IP addresses are owned by Akamai and are here in Denmark, meaning all Private Relay traffic first goes to a CDN endpoint. These are globally situated datacenters which allow companies to cache content close to users to improve response time and decrease load on their own servers. So then my client opens a connection to one of these endpoints using a new protocol, QUIC. Quick, get it? Aren't network engineers fun.

QUIC integrates TLS to encrypt all payload data and most control information. Its based on UDP for speed but is designed to replace TCP, the venerable protocol that requires a lot of overhead in terms of connections. By baking in encryption, Apple is ensuring a very high level of security for this traffic with a minimum amount of trust required between the partners. It also removes the loss recovery elements of TCP, instead shifting that responsibility to each QUIC stream. There are other advantages such as better shifting between different network providers as well.

So each user makes an insecure DNS lookup to mask.apple-dns.net, establishes a QUIC connection to the local ingress node and then that traffic is passed through to the egress CDN node. Apple maintains a list of those egress CDN nodes you can see here. However users can choose whether they want to reveal even city-level information to websites through the Private Relay settings panel.

If I choose to leave "Maintain General Location" checked, websites will know I'm coming from Copenhagen. If I select the "Country and Time Zone" you just know I'm coming fron Denmark. The traffic will appear to be coming from a variety of CDN IP addresses. You can tell Apple very delibertly did not want to offer any sort of "region hopping" functionality like users require from VPNs, letting you access things like streaming content in other countries. You will always appear to be coming from your country.

3. ISP Network Information

Similar to how the TOR protocol (link) works, this will allow you to effectively hide most of what you are doing. To the ISP your traffic will simply be going to the CDN endpoint closest to you, with no DNS queries flowing to them. Those partner CDN nodes lack the complete information to connect your IP address to the request to the site. In short, it should make the information flowing across their wires much less valuable from an advertising perspective.

In terms of performance hit it should be minimal, unlike TOR. Since we are using a faster protocol with only one hop (CDN 1 -> CDN 2 -> Destination) as opposed to TOR, in my testing its pretty hard to tell the difference. While there are costs for Apple to offer the service, by limiting the traffic to just Safari, DNS and http traffic they are greatly limiting how much raw bandwidth will pass through these servers. Most traffic (like Zoom, Slack, Software Updates, etc) will all be coming from HTTPS servers.

Conclusion

Network operators, especially with large numbers of Apple devices, should take the time to read through the QUIC management document. Since the only way Apple is allowing people to "opt out" of Private Relay at a network level is by blocking DNS lookups to mask.icloud.com and mask-h2.icloud.com, many smaller shops or organizations that choose to not host their own DNS will see a large change in how traffic flows.

For those that do host their own DNS, users receive an alert that you have blocked Private Relay on the network. This is to caution you in case you think that turning it off will result in no user complaints. I won't presume to know your requirements, but nothing I've seen on the spec document for managing QUIC suggests there is anything worth blocking from a network safety perspective. If anything, it should be a maginal reduction in the amount of packets flowing across the wire.

Apple is making some deliberate choices here with Private Relay and for the most part I support them. I think it will hurt the value of some advertising and I suspect that for the months following its release the list of Apple egress nodes will confuse network operators on why they are seeing so much traffic from the same IP addresses. I am also concerned that eventually Apple will want all traffic to flow through Private Relay, adding another level of complexity for teams attempting to debug user error reports of networking problems.

From a privacy standpoint I'm still unclear on how secure this process is from Apple. Since they are controlling the encryption and key exchange, along with authenticating with the service, it seems obvious that they can work backwards and determine the IP address. I would love for them to publish more whitepapers or additional clarification on the specifics of how they handle logging around the establishment of connections.

Any additional information people have been able to find out would be much appreciated. Feel free to ping me on twitter at: @duggan_mathew.

DevOps Crash Course - Section 2: Servers

August 30, 2021 in DevOps

via GIFER

Section 2 - Server Wrangling

For those of you just joining us you can see part 1 here.

If you went through part 1, you are a newcomer to DevOps dropped into the role with maybe not a lot of prep. We now have a good idea of what is running in our infrastructure, how it gets there and we have an idea of how secure our AWS setup is from an account perspective.

Next we're going to build on top of that knowledge and start the process of automating more of our existing tasks, allowing AWS (or really any cloud provider) to start doing more of the work. This provides a more reliable infrastructure and means we can focus more on improving quality of life.

What matters here?

Before we get into the various options for how to run your actual code in production, let's take a step back and walk about what matters. Whatever choice you and your organization end up making, here are the important questions we are trying to answer.

Can we, demonstrably and without human intervention, stand up a new server and deploy code to it?
Do we have an environment or way for a developer to precisely replicate the environment that their code runs in production in another place not hosting or serving customer traffic?
If one of our servers become unhealthy, do we have a way for customer traffic to stop being sent to that box and ideally for that box to be replaced without human intervention?

Can you use "server" in a sentence?

Alright you caught me. It's becoming increasingly more difficult to define what we mean what we say "server". To me, a server is still the piece of physical hardware running in a data center and the software for a particular host is a virtual machine. You can think of EC2 instances as virtual machines, different from docker containers in ways we'll discuss later. For our purposes EC2 instances are usually what we mean when we say servers. These instances are defined through the web UI, CLI or Terraform and launched in every possible configuration and memory size.

Really a server is just something that allows customers to interact with our application code. An API Gateway connected to Lambdas still meets my internal definition of server, except in that case we are completely hands off.

What are we dealing with here

In conversations with DevOps engineers who have had the role thrust on them or perhaps dropped into a position where maintenance hasn't happened a lot, a common theme has emerged. They often are placed in charge of a stack that looks something like this:

A user flow normally looks something like this:

A DNS request is made to the domain and it points towards (often) a classic load balancer. This load balancer handles SSL termination and then forwards traffic on to servers running inside of a VPC on port 80.
These servers are often running whatever linux distribution was installed on them when they were made. Something they are in autoscaling groups, but often they are not. These servers normally have Nginx installed along with something called uWSGI. Requests come in, they are handed by Nginx to the uWSGI workers and these interact with your application code.
The application will make calls to the database server, often running MySQL because that is what the original developer knew to use. Sometimes these are running on a managed database service and sometimes they are not.
Often with these sorts of stacks the deployment is something like "zip up a directory in the CICD stack, copy it to the servers, then unzip and move to the right location".
There are many times some sort of script to remove that box from the load balancer at the time of deploy and then readd them.

Often there are additional AWS services being used, things like S3, Cloudfront, etc but we're going to focus on the servers right now.

This stack seems to work fine. Why should I bother changing it?

Configuration drift. The inevitable result of launching different boxes at different times and hand-running commands on them which make them impossible to test or reliably replicate.

Large organizations go through lots of work to ensure every piece of their infrastructure is uniformly configured. There are tons of tools to do this, from things like Ansible Tower, Puppet, Chef, etc to check each server on a regular cycle and ensure the correct software is installed everywhere. Some organizations rely on building AMIs, building whole new server images for each variation and then deploying them to auto-scaling groups with tools like Packer. All of this work is designed to try and eliminate differences between Box A and Box B. We want all of our customers and services to be running on the same platforms we have running in our testing environment. This catches errors in earlier testing environments and means our app in production isn't impacted.

The problem we want to avoid is the nightmare one for DevOps people, which is where you can't roll forward or backwards. Something in your application has broken on some or all of your servers but you aren't sure why. The logs aren't useful and you didn't catch it before production because you don't test with the same stack you run in prod. Now your app is dying and everyone is trying to figure out why, eventually discovering some sort of hidden configuration option or file that was set a long time ago that now causes problem.

These sorts of issues plagued traditional servers for a long time, resulting in bespoke hand-crafted servers that could never be replaced without tremendous amounts of work. You either destroyed and remade your servers on a regular basis to ensure you still could or you accepted the drift and tried to ensure that you had some baseline configuration that would launch your application. Both of these scenarios suck and for a small team its just not reasonable to expect you to run that kind of maintenance operation. It's too complicated.

What if there was a tool that let you test exactly like you run your code in production? It would be easy to use, work on your local machine as well as on your servers and would even let you quickly audit the units to see if there are known security issues.

This all just sounds like Docker

It is Docker! You caught me, it's just containers all the way down. You've probably used Docker containers a few times in your life, but running your applications inside of containers is the vastly preferred model for how to run code. It simplifies testing, deployments and dependency management, allowing you to move all of it inside of git repos.

What is a container?

Let's start with what a container isn't. We talked about virtual machines before. Docker is not a virtual machine, it's a different thing. The easiest way to understand the difference is to think of a virtual machine like a physical server. It isn't, but for the most part the distinction is meaningless for your normal day to day life. You have a full file system with a kernel, users, etc.

Containers are just what your application needs to run. They are just ways of moving collections of code around along with their dependencies. Often new folks to DevOps will think of containers in the wrong paradigm, asking questions like "how do you back up a container" or using them as permanent stores of state. Everything in a container is designed to be temporary. It's just a different way of running a process on the host machine. That's why if you run ps fauxx | grep name_of_your_service on the host machine you still see it.

Are containers my only option?

Absolutely not. I have worked with organizations that manage their code in different ways outside of containers. Some of the tools I've worked with have been NPM for Node applications, RPMs for various applications linked together with Linux dependencies. Here are the key questions when evaluating something other than Docker containers?

Can you reliably stand up a new server using a bash script + this package? Typically bash scripts should be under 300 lines, so if we can make a new server with a script like that and some "other package" I would consider us to be in ok shape.
How do I roll out normal security upgrades? All linux distros have constant security upgrades, how do I do that on a normal basis while still confirming that the boxes still work?
How much does an AWS EC2 maintenance notice scare me? This is where AWS or another cloud provider emails you and says "we need to stop one of your instances randomly due to hardware failures". Is it a crisis for my business or is it a mostly boring event?
If you aren't going to use containers but something else, just ensure there is more than one source of truth for that.
For Node I have had a lot of sucess with Verdaccio as an NPM cache: https://verdaccio.org/
However in general I recommend paying for Packagecloud and pushing whatever package there: https://packagecloud.io/

How do I get my application into a container?

I find the best way to do this is to sit down with the person who has worked on the application the longest. I will spin up a brand new, fresh VM and say "can you walk me through what is required to get this app running?". Remember this is something they likely have done on their own machines a few hundred times, so they can pretty quickly recite the series of commands needed to "run the app". We need to capture those commands because they are how we write the Dockerfile, the template for how we make our application.

Once you have the list of commands, you can string them together in a Dockerfile.

How Do Containers Work?

It's a really fascinating story. Let's teleport back in time. It's the year 2000, we have survived Y2K, the most dangerous threat to human existence at the time. FreeBSD rolls out a new technology called "jails". FreeBSD jails were introduced in FreeBSD 4.X and are still being developed now.

Jails are layered on top of chroot, which allows you to change the root directory of processes. For those of you who use Python, think of chroot like virtualenv. It's a safe distinct location that allows you to simulate having a new "root" directory. These processes cannot access files or resources outside of that environment.

Jails take that concept and expanded it, virtualizing access to the file system, users, networking and every other part of the system. Jails introduce 4 things that you will quickly recognize as you start to work with Docker:

A new directory structure of dependencies that a process cannot escape.
A hostname for the specific jail
A new IP address which is often just an alias for an existing interface
A command that you want to run inside of the jail.

www {
    host.hostname = www.example.org;           # Hostname
    ip4.addr = 192.168.0.10;                   # IP address of the jail
    path = "/usr/jail/www";                    # Path to the jail
    devfs_ruleset = "www_ruleset";             # devfs ruleset
    mount.devfs;                               # Mount devfs inside the jail
    exec.start = "/bin/sh /etc/rc";            # Start command
    exec.stop = "/bin/sh /etc/rc.shutdown";    # Stop command
}

What a Jail looks like.

From FreeBSD the technology makes it way to Linux via the VServer project. As time went on more people build on the technology, taking advantage of cgroups. Control groups, shorted to cgroups is a technology that was added to Linux in 2008 from engineers at Google. It is a way of defining a collection of processes that are bound by the same restrictions. Progress has continued with cgroups from its initial launch, now at a v2.

There are two parts of a cgroup, a core and a controller. The core is responsible for organizing processes. The controller is responsible for distributing a type of resource along the hierarchy. With this continued work we have gotten incredible flexibility with how to organize, isolate and allocate resources to processes.

Finally in 2008 we got Docker, adding a simple CLI, the concept of a Docker server, a way to host and share images and more. Now containers are too big for one company, instead being overseen by the Open Container Initiative. Now instead of there being exclusively Docker clients pushing images to Dockerhub running on Docker server, we have a vibrant and strong open-source community around containers.

I could easily fill a dozen pages with interesting facts about containers, but the important thing is that containers are a mature technology built on a proven pattern of isolating processes from the host. This means we have complete flexibility for creating containers and can easily reuse a simple "base" host regardless of what is running on it.

For those interested in more details:

Anatomy of a Dockerfile

FROM debian:latest
# Copy application files
COPY . /app
# Install required system packages
RUN apt-get update
RUN apt-get -y install imagemagick curl software-properties-common gnupg vim ssh
RUN curl -sL https://deb.nodesource.com/setup_10.x | bash -
RUN apt-get -y install nodejs
# Install NPM dependencies
RUN npm install --prefix /app
EXPOSE 80
CMD ["npm", "start", "--prefix", "app"]

This is an example of a not great Dockerfile. Source

When writing Dockerfiles, open a tab to the official Docker docs. You will need to refer to them all the time at first, because very little about it. Typically Dockerfile are stored in the top level of an existing repository and their file operations, such as COPY as shown above, operate on that principal. You don't have to do that, but it is a common pattern to see the Dockerfile at the root level of a repo. Whatever you do keep it consistent.

Formatting

Dockerfile instructions are not case-sensitive, but are usually written in uppercase so that they can be differentiated from arguments more easily. Comments have the hash symbol (#) at the beginning of the line.

FROM

First is a FROM, which just says "what is our base container that we are starting from". As you progress in your Docker experience, FROM containers are actually great ways of speeding up the build process. If all of your containers have the same requirements for packages, you can actually just make a "base container" and then use that as a FROM. But when building your first containers I recommend just sticking with Debian.

Don't Use latest

Docker images rely on tags, which you can see in the example above as: debian:latest. This is docker for "give me the more recently pushed image". You don't want to do that for production systems. Typically upgrading the base container should be a affirmative action, not just something you accidentally do.

The correct way to reference a FROM image in a Dockerfile is through the use of a hash. So we want something like this:

FROM debian@sha256:c6e865b5373b09942bc49e4b02a7b361fcfa405479ece627f5d4306554120673

Which I got from the Debian Dockerhub page here. This protects us in a few different ways.

We won't accidentally upgrade our containers without meaning to
If the team in charge of pushing Debian containers to Dockerhub makes a mistake, we aren't suddenly going to be in trouble
It eliminates another source of drift

But I see a lot of people using Alpine

That's fine, use Alpine if you want. I have more confidence in Debian when compared to Alpine and always base my stuff off Debian. I think its a more mature community and more likely to catch problems. But again, whatever you end up doing, make it consistent.

If you do want a smaller container, I recommend minideb. It still lets you get the benefits of Debian with a smaller footprint. It is a good middle ground.

COPY

COPY is very basic. The . just means "current working directory", which in this case if the Dockerfile is at the top level of a git repository. It just takes whatever you specify and copy it into the Dockerfile.

COPY vs ADD

A common question I get is "what is the difference between COPY and ADD". Super basic, ADD is for going out and fetching something from a URL or opening a compressed file into the container. So if all you need to do is copy some files into the container from the repo, just use COPY. If you have to grab a compressed directory from somewhere or unzip something use ADD.

RUN

RUN is the meat and potatoes of the Dockerfile. These are the bash commands we are running in order to basically put together all the requirements. The file we have above doesn't follow best practices. We want to compress the RUN commands down so that they are all part of one layer.

RUN wget https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2 \
&& tar jxf samtools-1.2.tar.bz2 \
&& cd samtools-1.2 \
&& make \
&& make install

A good RUN example so all of these are one layer

WORKDIR

Allows you to set the directory inside the container from which all the other commands will run. Saves you from having to write out the absolute path every time.

CMD

The command we are executing when we run the container. Usually for most web applications this would be where we run the framework start command. This is an example from a Django app I run:

CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]

If you need more detail, Docker has a decent tutorial: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

One more thing

This obviously depends on your application, but many applications will also need a reverse proxy. This allows Nginx to listen on port 80 inside the container and forward requests on to your application. Docker has a good tutorial on how to add Nginx to your container: https://www.docker.com/blog/how-to-use-the-official-nginx-docker-image/

I cannot stress this enough, making the Dockerfile to run the actual application is not something a DevOps engineer should try to do on your own. You likely can do it by reverse engineering how your current servers work, but you need to pull in the other programmers in your organization.

Docker also has a good tutorial from beginning to end for Docker novices here: https://docs.docker.com/get-started/

Docker build, compose etc

Once you have a Dockerfile in your application repository, you are ready to move on to the next steps.

Have your CICD system build the images. Typing in the words "name of your CICD + build docker images" into Google to see how.
You'll need to make an IAM user for your CICD system in order to push the docker images from your CI workers to the ECR private registry. You can find the required permissions here.
Get ready to push those images to a registry. For AWS users I strongly recommend AWS ECR.
Here is how you make a private registry.
Then you need to push your image to the registry. I want to make sure you see AWS ECR helper, a great tool that makes the act of pushing from your laptop much easier. https://github.com/awslabs/amazon-ecr-credential-helper. This also can help developers pull these containers down for local testing.
Pay close attention to tags. You'll notice that the ECR registry is part of the tag along with the : and then a version information. You can use different registries for different applications or use the same registry for all your applications. Remember secrets shouldn't be in your container regardless, or customer data.
Go get a beer, you earned it.

Some hard decisions

Up to this point, we've been following a pretty conventional workflow. Get stuff into containers, push the containers up to a registry, automate the process of making new containers. Now hopefully we have our developers able to test their applications locally and everyone is very impressed with you and all the work you have gotten done.

The reason we did all this work is because now that our applications are in Docker containers, we have a wide range of options for ways to quickly and easily run this application. I can't tell you what the right option is for your organization without being there, but I can lay out the options so you can walk into the conversation armed with the relevant data.

Deploying Docker containers directly to EC2 Instances

This is a workflow you'll see quite a bit among organizations just building confidence in Docker. It works something like this -

Your CI system builds the Docker container using a worker and the Dockerfile you defined before. It pushes it to your registry with the correct tag.
You make a basic AMI with a tool like packer.
New Docker containers are pulled down to the EC2 instances running the AMIs we made with Packer.

Packer

Packer is just a tool that spins up an EC2 instance, installs the software you want installed and then saves it as an AMI. These AMIs can be deployed when new machines launch, ensuring you have identical software for each host. Since we're going to be keeping all the often-updated software inside the Docker container, this AMI can be used as a less-often touched tool.

First, go through the Packer tutorial, it's very good.

Here is another more comprehensive tutorial.

Here are the steps we're going to follow

Install Packer: https://www.packer.io/downloads.html
Pick a base AMI for Packer. This is what we're going to install all the other software on top of.

Here is a list of Debian AMI IDs based on regions: https://wiki.debian.org/Cloud/AmazonEC2Image/Bullseye which we will use for our base image. Our Packer JSON file is going to look something like this:

{
    "variables": {
        "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
        "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}"
    },
    "builders": [
        {
            "access_key": "{{user `aws_access_key`}}",
            "ami_name": "docker01",
            "instance_type": "t3.micro",
            "region": "eu-west-1",
            "source_ami": "ami-05b99bc50bd882a41",
            "ssh_username": "admin",
            "type": "amazon-ebs"
        }
    ]
}
``

The next step is to add a provisioner step, as outlined in the Packer documentation you can find here. Basically you will write a bash script that installs the required software to run Docker. Docker actually provides you with a script that should install what you need which you can find here.

The end process you will be running looks like this:

CI process builds a Docker image and pushes it to ECR.
Your deployment process is to either configure your servers to pull the latest file from ECR with a cron job so your servers are eventually consistent, or more likely to write a deployment job which connects to each server, run Docker pull and then restarts the containers as needed.

Why this a bad long-term strategy

A lot of organizations start here, but its important not to end here. This is not a good sustainable long-term workflow.

This is all super hand-made, which doesn't fit with our end goal
The entire process is going to be held together with hand-made scripts, You need something to remove the instance you are deploying to from the load balancer, pull the latest image, restart it, etc.
You'll need to configure a health check on the Docker container to know if it has started correctly.

Correct ways to run containers in AWS

If you are trying to very quickly make a hands-off infrastructure with Docker, choose Elastic Beanstalk

Elastic Beanstalk is the AWS attempt to provide infrastructure in a box. I don't approve of everything EB does, but it is one of the fastest ways to stand up a robust and easy to manage infrastructure. You can check out how to do that with the AWS docs here.

AWS with EB stands up everything you need, from a load balancer to the server and even the database if you want. It is pretty easy to get going, but Elastic Beanstalk is not a magic solution.

Elastic Beanstalk is a good solution if:

You are attempting to run a very simple application. You don't have anything too complicated you are trying to do. In terms of complexity, we are talking about something like a Wordpress site.
You aren't going to need anything in the .ebextension world. You can see that here.
There is a good logging and metrics story that developers are already using.
You want rolling deployments, load balancing, auto scaling, health checks, etc out of the box

Don't use Elastic Beanstalk if:

You need to do a lot of complicated networking stuff on the load balancer for your application.
You have a complicated web application. Elastic Beanstalk had bad documentation and its hard to figure out why stuff is working or not.
Service-to-service communication is something you are going to need now or in the future.

If you need something more robust, try ECS

AWS ECS is a service by AWS designed to quickly and easy run Docker containers. You can find the tutorial here: https://aws.amazon.com/getting-started/hands-on/deploy-docker-containers/

Use Elastic Container Service if:

You are already heavily invested in AWS resources. The integration with ECS and other AWS resources is deep and works well.
You want the option of going completely unmanaged server with Fargate
You have looked at the cost of running a stack on Fargate and are OK with it.

Don't use Elastic Container Service if:

You may need to deploy this application to a different cloud provider

What about Kubernetes?

I love Kubernetes, but its too complicated to get into this article. Kubernetes is a full stack solution that I adore but is probably too complicated for one person to run. I am working on a Kubernetes writeup, but if you are a small team I wouldn't strongly consider it. ECS is just easier to get running and keep running.

Coming up!

Logging, metrics and traces
Paging and alerts. What is a good page vs a bad page
Databases. How do we move them, what do we do with them
Status pages. Where do we tell customers about problems or upcoming maintenance.
CI/CD systems. Do we stick with Jenkins or is there something better?
Serverless. How does it work, should we be using it?
IAM. How do I give users and applications access to AWS without running the risk of bringing it all down.

Questions / Concerns?

Let me know on twitter @duggan_mathew