I've had the opportunity lately to speak to a lot of DevOps engineers at startups around Europe. Some come from a more traditional infrastructure background, beginning their careers in network administration or system administration. Most are coming from either frontend or backend teams, choosing to focus more on the infrastructure work (which hey, that's great, different perspectives are always appreciated).
However, a pretty alarming trend has emerged through these conversations. The stories tend to start with the existing sysadmin or DevOps person leaving, and suddenly they are dropped into the role with almost no experience or training. Left to their own devices with root access to the AWS account, they often have no idea where to even start. Learning on the job is one thing, but being responsible for the critical functioning of an entire company's infrastructure with no time to ramp up is crazy and frankly terrifying.
For some of these folks, it was the beginning of a love affair with infrastructure work. For others, it caused them to quit those jobs immediately in panic. I even spoke to a few who left programming as a career as a result of the stress they felt at the sudden pressure. That's sad for a lot of reasons, especially when these people are forced into the role. But it did spark an idea.
What advice and steps would I tell someone who suddenly had my job with no time to prepare? My goal is to try and document what I would do, if dropped into a position like that, along with my reasoning.
Disclaimer
These solutions aren't necessarily the best fit for every organization or application stack. I tried to focus on relatively straightforward tips for people who have been dropped into a role they have very little context on. As hard as this might be to believe for some people out there, a lot of smaller companies just don't have any additional infrastructure capacity, especially in some areas of Europe.
These aren't all strictly DevOps concepts as I understand the term to mean. I hate to be the one to tell you but, like SRE and every other well-defined term before it, businesses took the title "DevOps" and slapped it on a generic "infrastructure" concept. We're gonna try to stick to some key guiding principles but I'm not a purist.
Key Concepts
- We are a team of one or two people. There is no debate about build vs buy. We're going to buy everything that isn't directly related to our core business.
- These systems have to fix themselves. We do not have the time or capacity to apply the love and care of a larger infrastructure team. Think less woodworking and more building with Legos. We are trying to snap pre-existing pieces together in a sustainable pattern.
- Boring > New. We are not trying to make the world's greatest infrastructure here. We need something sustainable, easy to operate, and ideally something we do not need to constantly be responsible for. This means teams rolling out their own resources, monitoring their own applications, and allocating their own time.
- We are not the gatekeepers. Infrastructure is a tool and like all tools, it can be abused. Your organization is going to learn to do this better collectively.
- You cannot become an expert on every element you interact with. A day in my job can be managing Postgres operations, writing a PR against an application, or sitting in a planning session helping to design a new application. The scope of what many businesses call "DevOps" is too vast to be a deep-dive expert in all parts of it.
Most importantly, we'll do the best we can, but push the guilt out of your head. Mistakes are the cost of their failure to plan, not your failure to learn. A lot of the people I have spoken to who find themselves in this situation feel intense shame or guilt for not "being able to do a better job". Your employer messed up, not you.
Section One - Into the Fray
Maybe you expressed some casual interest in infrastructure work during a one on one a few months ago, or possibly you are known as the "troubleshooting person", assisting other developers with writing Docker containers. Whatever got you here, your infrastructure person has left, maybe suddenly. You have been moved into the role with almost no time to prepare. We're going to assume you are on AWS for this, but for the most part, the advice should be pretty universal.
I've tried to order these tasks in terms of importance.
1. Get a copy of the existing stack
Alright, you got your AWS credentials, the whole team is trying to reassure you not to freak out because "mostly the infrastructure just works and there isn't a lot of work that needs to be done". You sit down at your desk and your mind starts racing. Step 1 is to get a copy of the existing cloud setup.
We want to get your infrastructure as it exists right now into code because chances are you are not the only one who can log into the web panel and change things. There's a great tool for exporting existing infrastructure state in Terraform called terraformer.
Terraformer
So terraformer is a CLI tool written in Go that allows you to quickly and easily dump out all of your existing cloud resources into a Terraform repo. These files, either in TF format or JSON, will let you basically snapshot the entire AWS account. First, set up the AWS CLI and your credentials as shown here. Then, once you have the credentials saved, make a new git repo.
# Example flow
# Set up our credentials
aws configure --profile production
# Make sure they work
aws s3 ls --profile production
# Make our new repo
mkdir infrastructure && cd infrastructure/
git init
# Install terraformer
# Linux
curl -LO https://github.com/GoogleCloudPlatform/terraformer/releases/download/0.8.15/terraformer-all-linux-amd64
chmod +x terraformer-all-linux-amd64
sudo mv terraformer-all-linux-amd64 /usr/local/bin/terraformer
# Intel Mac
curl -LO https://github.com/GoogleCloudPlatform/terraformer/releases/download/0.8.15/terraformer-all-darwin-amd64
chmod +x terraformer-all-darwin-amd64
sudo mv terraformer-all-darwin-amd64 /usr/local/bin/terraformer
# Other Platforms
https://github.com/GoogleCloudPlatform/terraformer/releases/tag/0.8.15
# Install terraform
https://learn.hashicorp.com/tutorials/terraform/install-cli
First, if you don't know what region your AWS resources are in you can find that here.
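If the console isn't telling you, one low-tech way to see which regions actually contain anything is to loop over them with the AWS CLI. This is just a sketch using EC2 as an example; other services (RDS, Lambda, etc.) need their own describe calls:
# Print the EC2 instance IDs (if any) in every region the production profile can see
for r in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text --profile production); do
  echo "== $r =="
  aws ec2 describe-instances --region "$r" --profile production \
    --query 'Reservations[].Instances[].InstanceId' --output text
done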
So what we're gonna do is run:
terraformer import aws --regions INSERT_AWS_REGIONS_HERE --resources="*" --profile=production
You will get a directory structure that looks like this:
generated/
└── aws
├── acm
│ ├── acm_certificate.tf
│ ├── outputs.tf
│ ├── provider.tf
│ └── terraform.tfstate
└── rds
├── db_instance.tf
├── db_parameter_group.tf
├── db_subnet_group.tf
├── outputs.tf
├── provider.tf
└── terraform.tfstate
So if you wanted to modify something for RDS, you would cd into the rds directory and run terraform init. You may get an error: Error: Invalid legacy provider address
If so, no problem. Just run
terraform state replace-provider registry.terraform.io/-/aws hashicorp/aws
Once that is set up, you now have the ability to restore the AWS account using terraform at any time. You will want to add this repo to a CICD job eventually so this gets done automatically, but at first, you might need to run it locally.
$ export AWS_ACCESS_KEY_ID="anaccesskey"
$ export AWS_SECRET_ACCESS_KEY="asecretkey"
$ export AWS_DEFAULT_REGION="us-west-2"
$ terraform plan
You should see Terraform run and report that there are no changes.
Why Does This Matter?
Terraform lets us do a few things, one of which is roll out infrastructure changes like we would with any other code change. This is great because, in the case of unintended outages or problems, we can roll back. It also matters because often with small companies things will get broken when someone logs into the web console and clicks something they shouldn't. Running a terraform plan can tell you exactly what changed across the entire region in a few minutes, meaning you should be able to roll it back.
Should I do this if our team already manages our stack in code?
I would. There are tools like Ansible and Puppet which are great at managing servers that some people use to manage AWS. Often these setups are somewhat custom, relying on some trial and error before you figure out exactly how they work and what they are doing. Terraform is very stock and anyone on a DevOps chat group or mailing list will be able to help you run the commands. We're trying to establish basically a "restore point". You don't need to use Terraform to manage stuff if you don't want to, but you probably won't regret having a copy now.
Later on, we're going to be putting this into a CICD pipeline so we don't need to manage who adds infrastructure resources. We'll do that by requiring approval on PRs vs us having to write everything. It'll distribute the load but still let us ensure that we have some insight into how the system is configured. Right now though, since you are responsible for infrastructure you can at least roll this back.
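As a sketch of where that's headed: the CI job doesn't need to be fancy. Something like the following, run on a schedule or on every PR, is enough to surface drift. It assumes the generated/aws layout terraformer created above and that the job has AWS credentials in its environment:
#!/usr/bin/env bash
# drift-check.sh -- minimal sketch of a CI drift check against the terraformer repo
set -uo pipefail
failed=0
for dir in generated/aws/*/; do
  echo "== ${dir} =="
  (cd "${dir}" && terraform init -input=false >/dev/null && terraform plan -input=false -detailed-exitcode)
  # terraform plan -detailed-exitcode: 0 = no changes, 2 = drift detected, anything else = error
  if [ $? -ne 0 ]; then
    failed=1
  fi
done
exit "${failed}"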
2. Write down how deployments work
Every stack is a little different in terms of how it gets deployed, and deployments are a constant source of problems for folks starting out. You need to be able to answer the question of how exactly code goes from a repo -> production. Maybe it's Jenkins, GitLab runners, GitHub Actions, CodeDeploy, etc., but you need to know the answer for each application. Most importantly, you need to read through whatever shell script they're running to actually deploy the application, because that will start to give you an idea of what hacks are required to get this thing up and running (a hypothetical example of such a script appears below, after the questions).
Here are some common questions to get you started.
- Are you running Docker? If so, where do the custom images come from? What runs the Dockerfile, where does it push the images, etc.
- How do you run migrations against the database? Is it part of the normal code base, is there a different utility?
- What is a server to your organization? Is it a stock EC2 instance running Linux and Docker with everything else getting deployed with your application? Is it a server where your CICD job just rsyncs files to a directory Nginx reads from?
- Where do secrets come from? Are they stored in the CICD pipeline? Are they stored in a secrets system like Vault or Secrets Manager? (Man if your organization actually does secrets correctly with something like this, bravo).
- Do you have a "cron box"? This is a server that runs cron jobs on a regular interval outside of the normal fleet. I've seen these called "snowflake", "worker", etc. These are usually the least maintained boxes in the organization but often the most critical to how the business works.
- How similar or different are different applications? Often organizations have mixes of serverless applications (managed either through the AWS web UI or tools like serverless) and conventional web servers. Lambdas in AWS are awesome tools that often are completely unmanaged in small businesses, so try and pay special attention to these.
The goal of all of this is to be able to answer "how does code go from a developer laptop to our customers". Once you understand that specific flow, then you will be much more useful in terms of understanding a lot of how things work. Eventually, we're going to want to consolidate these down into one flow, ideally into one "target" so we can keep our lives simple and be able to really maximize what we can offer the team.
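To make that concrete, here is a completely hypothetical example of the kind of deploy script you might dig up. None of these hostnames, registries, or paths are real; the point is that every line raises a question worth writing down:
#!/usr/bin/env bash
# Hypothetical deploy script of the kind small companies accumulate
set -euo pipefail

# Build and push the image -- who can push to this registry, and from where?
docker build -t registry.example.com/app:latest .
docker push registry.example.com/app:latest

# "Deploying" is an SSH to a hardcoded box -- is this the only server? Is it in your diagram?
ssh deploy@app1.example.com '
  docker pull registry.example.com/app:latest &&
  (docker rm -f app || true) &&
  docker run -d --name app --env-file /etc/app/env registry.example.com/app:latest
'

# Migrations run from the freshly started container -- what happens if two deploys overlap?
ssh deploy@app1.example.com 'docker exec app ./manage.py migrate'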
Where do logs go and what stores them?
All applications and services generate logs. Logs are critical to debugging the health of an application, and knowing how that data is gathered and stored is critical to empowering developers to understand problems. This is the first week, so we're not trying to change anything, we just want to document how it works. How are logs generated by the application?
Some likely scenarios:
- They are written to disk on the application server and pushed somewhere through syslog. Great: document the syslog configuration, where it comes from, and finally whether logrotate is set up to keep the boxes from running out of disk space.
- They get pushed to either the cloud provider or a monitoring provider (Datadog, etc.). Fine, couldn't be easier, but write down where the permission to push the logs comes from. What I mean by that is: does the app push the logs to AWS, or does an agent running on the box take the logs and push them up to AWS? Either is fine, but knowing which makes a difference.
Document the flow, looking out for expiration or deletion policies. Also see how access control works: how do developers access these raw logs? Hopefully through some sort of web UI, but if it is through SSH access to the log aggregator, that's fine, just write it down.
For more information about CloudWatch logging check out the AWS docs here.
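If you land in the first scenario (logs written to disk and shipped via syslog), a few read-only commands will answer most of these questions. The paths here are common Debian/Ubuntu defaults with rsyslog; adjust for your distro:
# Where is rsyslog forwarding logs? Forwarding rules contain @ (UDP) or @@ (TCP)
grep -R "@" /etc/rsyslog.conf /etc/rsyslog.d/ 2>/dev/null

# Is logrotate configured for the application's log files?
ls /etc/logrotate.d/
cat /etc/logrotate.d/*

# Are the boxes at any risk of filling up?
df -h /var/log
sudo du -sh /var/log/*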
3. How does SSH access work?
You need to know exactly how SSH works from the developers' laptop to the server they are trying to access. Here are some questions to kick it off.
- How do SSH public keys get onto a server? Is there a script, does it sync from somewhere, are they put on by hand?
- What IP addresses are allowed to SSH into a server? Hopefully not all of them, most organizations have at least a bastion host or VPN set up. But test it out, don't assume the documentation is correct. Remember we're building new documentation from scratch and approaching this stack with the respect it deserves as an unknown problem.
- IMPORTANT: HOW DO EMPLOYEES GET OFFBOARDED? Trust me, people forget this all the time and it wouldn't surprise me if you find some SSH keys that shouldn't be there (a quick way to check is sketched below).
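For the offboarding question in particular, a quick read-only audit on each box will show you whose keys are actually present. This is just a sketch; run it as root (or with sudo) and adjust the home directory paths to match your servers:
# List the comment field (usually user@host) of every authorized SSH key on this box.
# Keys without a comment will print the key material instead.
for f in /root/.ssh/authorized_keys /home/*/.ssh/authorized_keys; do
  [ -f "$f" ] && echo "== $f ==" && awk '{print $NF}' "$f"
done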
I don't know anything about SSH
Don't worry we got you. Take a quick read through this tutorial. You've likely used SSH a lot, especially if you have ever set up a Digital Ocean or personal EC2 instance on a free tier. You have public keys synced to the server and private keys on the client device.
What is a bastion host?
They're just servers that exist to allow traffic from a public subnet to a private subnet. Not all organizations use them, but a lot do, and given the conversations I've had it seems like a common pattern around the industry to use them. We're using a box between the internet and our servers as a bridge.
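In practice, reaching a box on the private subnet through a bastion usually looks something like this from a developer laptop (the hostname and IP here are made up):
# Jump through the bastion to an instance on the private subnet
ssh -J deploy@bastion.example.com deploy@10.0.12.34
Many teams bake the same thing into ~/.ssh/config as a ProxyJump entry so developers don't have to remember it; if you find one of those config snippets in an onboarding doc, write down what it contains.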
Do all developers need to access bastion hosts?
Nope they sure don't. Access to the Linux instances should be very restricted and ideally, we can get rid of it as we go. There are much better and easier to operate options now through AWS that let you get rid of the whole concept of bastion servers. But in the meantime, we should ensure we understand the existing stack.
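One of those options is AWS Systems Manager Session Manager, which gives you a shell through AWS APIs with no inbound SSH at all. Assuming the instances run the SSM agent and you have the session-manager-plugin installed locally, a session looks like this (the instance ID is a placeholder):
# Shell into an instance through AWS APIs instead of SSH through a bastion
aws ssm start-session --target i-0123456789abcdef0 --profile production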
Questions to answer
- How do keys get onto the bastion host?
- How does access work from the bastion host to the servers?
- Are the Linux instances we're accessing in a private subnet or are they on a public subnet?
- Is the bastion host up to date? Is the Linux distribution running current with the latest patches? There shouldn't be any other processes running on these boxes so upgrading them shouldn't be too bad.
- Do you rely on SFTP anywhere? Are you pulling something down that is critical or pushing something up to SFTP? A lot of businesses still rely heavily on automated jobs around SFTP and you want to know how that authentication is happening.
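For the "is the bastion up to date" question, a couple of commands on the box itself go a long way. These assume a Debian/Ubuntu style distro; swap in your package manager if it's something else:
# What is this box actually running?
cat /etc/os-release
uname -r

# Anything waiting to be patched?
sudo apt update && apt list --upgradable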
4. How do we know the applications are running?
It seems from conversations that these organizations often have bad alerting stories. They don't know applications are down until customers tell them or they happen to notice. So you want to establish some sort of baseline early on, basically "how do you know the app is still up and running". Often there is some sort of health check path, something like /health or /check, used by a variety of services like load balancers and Kubernetes to determine if something is up and functional or not.
First, understand what this health check is actually doing. Sometimes they are just hitting a webserver and ensuring Nginx is up and running. While interesting to know that Nginx is a reliable piece of software (it is quite reliable), this doesn't tell us much. Ideally, you want a health check that interacts with as many pieces of the infrastructure as possible. Maybe it runs a read query against the database to get back some sort of UUID (which is a common pattern).
This next part depends a lot on what alerting system you use, but you want to make a dashboard that you can use very quickly to determine "are my applications up and running". Infrastructure modifications are high-risk operations and sometimes when they go sideways, they'll go very sideways. So you want some visual system to determine whether or not the stack is functional and ideally, this should alert you through Slack or something. If you don't have a route like this, consider doing the work to add one. It'll make your life easier and probably isn't too complicated to do in your framework.
My first alerting tool is almost always Uptime Robot. So we're gonna take our health route and we are going to want to set an Uptime Robot alert on that endpoint. You shouldn't allow traffic from the internet at large to hit this route (because it is a computationally expensive route, it is susceptible to abuse by malicious actors). However, Uptime Robot provides a list of their IP addresses for whitelisting, so we can add them to our security groups in the terraform repo we made earlier.
If you need a free alternative I have had a good experience with Hetrix. Setting up the alerts should be self-explanatory, basically hit an endpoint and get back either a string or a status code.
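Whichever service you use, hit the route yourself first so you know exactly what the monitor will see. The URL here is a placeholder for your own health route:
# Just the HTTP status code, which is what most uptime checks key off of
curl -s -o /dev/null -w "%{http_code}\n" https://app.example.com/health

# The full body, if you'd rather match on a string in the response
curl -s https://app.example.com/health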
5. Run a security audit
Is he out of his mind? In the first week? Security is a super hard problem and one that startups mess up all the time. We can't make this stack secure in the first week (or likely month) of this work, but we can ensure we don't make it worse and, when we get a chance, move closer to an ideal state.
The tool I like for this is Prowler. Not only does it allow you a ton of flexibility with what security audits you run, but it lets you export the results in a lot of different formats, including a very nice-looking HTML option.
Steps to run Prowler
1. Install Prowler. We're gonna run this from our local workstation using the AWS profile we made before.
On our local workstation:
git clone https://github.com/toniblyx/prowler
cd prowler
2. Run Prowler: ./prowler -p production -r INSERT_REGION_HERE -M csv,json,json-asff,html -g cislevel1
The command above writes the results out in several formats, but I want to focus for a second on the -g option. That's the group option and it basically means "which security audit are we going to run". The CIS Amazon Web Services Foundations Benchmark has 2 levels, which can be thought of broadly as:
Level 1: Stuff you should absolutely be doing right now that shouldn't impact most application functionality.
Level 2: Stuff you should probably be doing but is more likely to impact the functioning of an application.
We're running Level 1, because ideally, our stack should already pass a level 1 and if it doesn't, then we want to know where. The goal of this audit isn't to fix anything right now, but it IS to share it with leadership. Let them know the state of the account now while you are onboarding, so if there are serious security gaps that will require development time they know about it.
Finally, take the CSV file that was output from Prowler and stick it in Google Sheets with a date. We're going to want to have a historical record of the audit.
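Assuming your Prowler version writes a PASS/FAIL status into the CSV (check the header row of the file it produced), a one-liner gives you a headline number before you even open the spreadsheet. The filename and location vary by version, so treat the path as a placeholder:
# Rough count of failed checks in the Prowler CSV output
grep -c FAIL prowler-output-*.csv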
6. Make a Diagram!
The last thing we really want to do is make a diagram and have the folks who know more about the stack verify it. One tool that can kick this off is Cloudmapper. This is not going to get you all of the way there (you'll need to add meaningful labels and likely fill in some missing pieces) but should get you a template to work off of.
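If you do try Cloudmapper, the rough flow looks like this once you've worked through the install steps in its README. The account name is a placeholder, and credentials come from the same AWS profile we set up earlier:
git clone https://github.com/duo-labs/cloudmapper.git
cd cloudmapper
# Install dependencies (the README covers prerequisites; pipenv is one option)
pipenv install --skip-lock
# Describe your account in config.json (the README shows the format), then:
AWS_PROFILE=production pipenv run python cloudmapper.py collect --account my_account
pipenv run python cloudmapper.py prepare --account my_account
# Serves the interactive network diagram locally
pipenv run python cloudmapper.py webserver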
What we're primarily looking for here is understanding flow and dependencies. Here are some good questions to get you started.
- Where are my application persistence layers? What hosts them? How do they talk to each other?
- Overall network design. How does traffic ingress and egress? Do all my resources talk directly to the internet or do they go through some sort of NAT gateway? Are my resources in different subnets, security groups, etc?
- Are there less obvious dependencies? SQS, RabbitMQ, S3, elasticsearch, varnish, any and all of these are good candidates.
The ideal state here is to have a diagram that we can look at and say "yes I understand all the moving pieces". For some stacks that might be much more difficult, especially serverless stacks. These often have mind-boggling designs that change deploy to deploy and might be outside of the scope of a diagram like this. We should still be able to say "traffic from our customers comes in through this load balancer to that subnet after meeting the requirements in x security group".
If your organization has LucidChart, they make this really easy; you can find out more about that here. That said, you can do almost everything LucidChart or AWS Config can do with Cloudmapper, without the additional cost.
Cloudmapper is too complicated, what else have you got?
Does the setup page freak you out a bit? It does take a lot to set up and run the first time. AWS actually has a pretty nice pre-made solution to this problem. Here is the link to their setup: https://docs.aws.amazon.com/solutions/latest/aws-perspective/overview.html
It does cost a little bit but is pretty much "click and go" so I recommend it if you just need a fast overview of the entire account without too much hassle.
End of section one
Ideally the state we want to be in looks something like the following.
- We have a copy of our infrastructure that we've run terraform plan against and there are no diffs, so we know we can go back.
- We have an understanding of how the most important applications are deployed and what they are deployed to.
- The process of generating, transmitting, and storing logs is understood.
- We have some idea of how secure (or not) our setup is.
- There are some basic alerts on the entire stack, end to end, which give us some degree of confidence that "yes the application itself is functional".
For many of you who are more experienced with this type of work, I'm sure you are shocked. A lot of this should already exist, and really this is a process of you getting up to speed with how it works. However, sadly, in my experience talking to folks who have had this job forced on them, many of these pieces were set up a few employees ago and the specifics of how they work are lost to time. Since we know we can't rely on the documentation, we need to make our own. In the process, we become more comfortable with the overall stack.
Stuff still to cover!
If there is any interest I'll keep going with this. Some topics I'd love to cover.
- Metrics! How to make a dashboard that doesn't suck.
- Email. Do your apps send it, are you set up for DMARC, how do you know if email is successfully getting to customers, where does it send from?
- DNS. If it's not in the terraform directory we made before under Route53, it must be somewhere else. We gotta manage that like we manage a server because users logging into the DNS control panel and changing something can cripple the business.
- Kubernetes. Should you use it? Are there other options? If you are using it now, what do you need to know about it?
- Migrating to managed services. If your company is running its own databases or baking its own AMIs, now might be a great time to revisit that decision.
- Sandboxes and multi-account setups. How do you ensure developers can test their apps in the least annoying way while still keeping the production stack up?
- AWS billing. What are some common gotchas, how do you monitor spending, and what do you do institutionally about it?
- SSO, do you need it, how to do it, what does it mean?
- Exposing logs through a web interface. What are the fastest ways to do that on a startup budget?
- How do you get up to speed? What courses and training resources are worth the time and energy?
- Where do you get help? Are there communities with people interested in providing advice?
Did I miss something obvious?
Let me know! I love constructive feedback. Bother me on Twitter. @duggan_mathew