Don't Make My Mistakes: Common Infrastructure Errors I've Made

One surreal experience as my career has progressed is the intense feeling of deja vu you get hit with during meetings. From time to time, someone will mention something and you'll flash back to the same meeting you had about this a few jobs ago. A decision was made then, a terrible choice that ruined months of your working life. You spring back to the present day, almost bolting out of your chair to object, "Don't do X!". Your colleagues are startled by your intense reaction, but they haven't seen the horrors you have.

I wanted to take a moment and write down some of my worst mistakes, as a warning to others who may come later. Don't worry, you'll make all your own new mistakes instead. But allow me a moment to go back through some of the most disastrous decisions or projects I ever agreed to (or even fought to do, sometimes).

Don't migrate an application from the datacenter to the cloud

Ah the siren call of cloud services. I'm a big fan of them personally, but applications designed for physical datacenters rarely make the move to the cloud seamlessly. I've been involved now in three attempts to do large-scale migrations of applications written for a specific datacenter to the cloud and every time I have crashed upon the rocks of undocumented assumptions about the environment.

Me encountering my first unsolvable problem with a datacenter to cloud migration

As developer write and test applications, they develop expectations of how their environment will function. How do servers work, what kind of performance does my application get, how reliable is the network, what kind of latency can I expect, etc. These are reasonable thing that any person would do upon working inside of an environment for years, but it means when you package up an application and run it somewhere else, especially old applications, weird things happen. Errors that you never encountered before start to pop up and all sorts of bizarre architectural decisions need to be made to try and allow for this transition.

Soon you've eliminated a lot of the value of the migration to begin with, maybe even doing something terrible like connecting your datacenter to AWS with direct connect in an attempt to bridge the two environments seamlessly. Your list of complicated decisions start to grow and grow, hitting increasingly more and more edge cases of your cloud provider. Inevitable you find something you cannot move and you are now stuck with two environments, a datacenter you need to maintain and a new cloud account. You lament your hubris.

Instead....

Port the application to the cloud. Give developers a totally isolated from the datacenter environment, let them port the application to the cloud and then schedule 4-8 hours of downtime for your application. This will allow persistence layers to cut over and then you can change your DNS entries to point to your new cloud presence. The attempt to prevent this downtime will drown you in bad decision after bad decision. Better to just bite the bullet and move on.

Or even better, develop your application in the same environment you expect to run it in.

Don't write your own secrets system

I don't know why I keep running into this. For some reason, organizations love to write their own secrets management system. Often these are applications written by the infrastructure teams, commonly either environmental variable injection systems or some sort of RSA-key based decrypt API call. Even I have fallen victim to this idea, thinking "well certainly it can't be that difficult".

For some reason, maybe I had lost my mind or something, I decided we were going to manage our secrets inside of PostgREST application I would manage. I wrote an application that would generate and return JWTs back to applications depending on a variety of criteria. These would allow them to access their secrets in a totally secure way.

Now in defense of PostgREST, it worked well at what it promised to do. But the problem of secrets management is more complicated than it first appears. First we hit the problem of caching, how do you keep from hitting this service a million times an hour but still maintain some concept of using the server as the source of truth. This was solvable through some Nginx configs but was something I should have thought of.

Then I smacked myself in the face with the rake of rotation. It was trivial to push a new version, but secrets aren't usually versioned to a client. I authenticate with my application and I see the right secrets. But during a rotation period there are two right secrets, which is obvious when I say it but hadn't occurred to me when I was writing it. Again, not a hard thing to fix, but as time went on and I encountered more and more edge cases for my service, I realized I had made a huge mistake.

The reality is secrets management is a classic high risk and low reward service. It's not gonna help my customers directly, it won't really impress anyone in leadership that I run it, it will consume a lot of my time debugging it and its going to need a lot of domain specific knowledge in terms of running it. I had to rethink a lot of the pieces as I went, everything from multi-region availability (which like, syncing across regions is a drag) to hardening the service.

Instead....

Just use AWS Secrets Manager or Vault. I prefer Secrets Manager, but whatever you prefer is fine. Just don't write your own, there are a lot of edge cases and not a lot of benefits. You'll be the cause of why all applications are down and the cost savings at the end of the day are minimal.

Don't run your own Kubernetes cluster

I know, you have the technical skill to do it. Maybe you absolutely love running etcd and setting up the various certificates. Here is a very simple decision tree when thinking about "should I run my own k8s cluster or not":

Are you a Fortune 100 company? If no, then don't do it.

The reason is you don't have to and letting someone else run it allows you to take advantage of all this great functionality they add. AWS EKS has some incredible features, from support for AWS SSO in your kubeconfig file to allowing you to use IAM roles inside of ServiceAccounts for pod access to AWS resources. On top of all of that, they will run your control plane for less than $1000 a year. Setting all that aside for a moment, let's talk frankly for a second.

One advantage of the cloud is other people beta test upgrades for you.

I don't understand why people don't talk about this more. Yes you can run your own k8s cluster pretty successfully, but why? I have literally tens of thousands of beta testers going ahead of me in line to ensure EKS upgrades work. On top of that, I get tons of AWS engineers working on it. There's no advantage if I'm going to run my infrastructure in AWS anyway to running my own cluster except that I can maintain the illusion that at some point I could "switch cloud providers". Which leads me on to my next point.

Instead....

Let the cloud provider run it. It's their problem now. Focus on making your developers lives easier.

Don't Design for Multiple Cloud Providers

This one irks me on a deeply personal level. I was convinced by a very persuasive manager that we needed to ensure we had the ability to switch cloud providers. Against my better judgement, I fell in with the wrong crowd. We'll call them the "premature optimization" crowd.

Soon I was auditing new services for "multi-cloud compatibility", ensuring that instead of using the premade SDKs from AWS, we maintained our own. This would allow us to, at the drop of a hat, switch between them in the unlikely event this company exploded in popularity and we were big enough to somehow benefit from this migration. I guess in our collective minds this was some sort of future proofing or maybe we just had delusions of grandeur.

What we were actually doing is the worst thing you can do, which is just being a pain in the ass for people trying to ship features to customers. If you are in AWS, don't pretend that there is a real need for your applications to be deployable to multiple clouds. If AWS disappeared tomorrow, yes you would need to migrate your applications. But the probability of AWS outliving your company is high and the time investment of maintaining your own cloud agnostic translation layers is not one you are likely to ever get back.

We ended up with a bunch of libraries that were never up to date with the latest features, meaning developers were constantly reading about some great new feature of AWS they weren't able to use or try out. Tutorials obviously didn't work with our awesome custom library and we never ended up switching cloud providers or even dual deploying because financially it never made sense to do it. We ended up just eating a ton of crow from the entire development team.

Instead....

If someone says "we need to ensure we aren't tied to one cloud provider", tell them that ship sailed the second you signed up. Similar to a data center, an application designed, tested and run successfully for years in AWS is likely to pick up some expectations and patterns of that environment. Attempting to optimize for agnostic design is losing a lot of the value of cloud providers and adding a tremendous amount of busy work for you and everyone else.

Don't be that person. Nobody likes the person who is constantly saying "no we can't do that" in meeting. If you find yourself in a situation where migrating to a new provider makes financial sense, set aside at least 3 months an application for testing and porting. See if it still makes financial sense after that.

Cloud providers are a dependency, just like a programming language. You can't arbitrarily switch them without serious consideration and even then, often "porting" is the wrong choice. Typically you want to practice like you play, so developing in the same environment as your customers will use your product.

Don't let alerts grow unbounded

I'm sure you've seen this at a job. There is a tv somewhere in the office and on that tv is maybe a graph or CloudWatch alerts or something. Some alarm will trigger at an interval and be displayed on that tv, which you will be told to ignore because it isn't a big deal. "We just want to know if that happens too much" is often what is reported.

Eventually these start to trickle into on-call alerts, which page you. Again you'll be told they are informative, often by the team that owns that service. As enough time passes, it becomes unclear what the alert was supposed to tell you, only that new people will get confusing information about whether an alert is important or not. You'll eventually have an outage because the "normal" alert will fire with an unusual condition, leading to a person to silence the page and go back to sleep.

I have done this, where I even defended the system on the grounds of "well surely the person who wrote the alert had some intention behind it". I should have been on the side of "tear it all down and start again", but instead I choose a weird middle ground. It was the wrong decision for me years ago and its the wrong decision for you today.

Instead....

If an alert pages someone, it has to be a situation in which the system could never recover on its own. It needs to be serious and it cannot be something where the failure is built into the application design. An example of that would be "well sometimes our service needs to be restarted, just SSH in and restart it". Nope, not an acceptable reason to wake me up. If your service dies like that, figure out a way to bring it back.

Don't allow for the slow gradual pollution of your life with garbage alerts and feel free to declare bankruptcy on all alerts in a platform if they start to stink. If a system emails you 600 times a day, it's not working. If there is a slack channel so polluted with garbage that nobody goes in there, it isn't working as an alert system. It isn't how human attention works, you can't spam someone constantly with "not-alerts" and then suddenly expect them to carefully parse every string of your alert email and realize "wait this one is different".

Don't write internal cli tools in python

I'll keep this one short and sweet.

Nobody knows how to correctly install and package Python apps. If you write an internal tool in Python, it either needs to be totally portable or just write it in Go or Rust. Save yourself a lot of heartache as people struggle to install the right thing.