
DevOps Crash Course - Section 2: Servers


Section 2 - Server Wrangling

For those of you just joining us you can see part 1 here.

If you went through part 1, you are a newcomer to DevOps who was dropped into the role with maybe not a lot of prep. At this point we have a good idea of what is running in our infrastructure, how it gets there, and how secure our AWS setup is from an account perspective.

Next we're going to build on top of that knowledge and start the process of automating more of our existing tasks, allowing AWS (or really any cloud provider) to start doing more of the work. This provides a more reliable infrastructure and means we can focus more on improving quality of life.

What matters here?

Before we get into the various options for how to run your actual code in production, let's take a step back and talk about what matters. Whatever choice you and your organization end up making, here are the important questions we are trying to answer.

  • Can we, demonstrably and without human intervention, stand up a new server and deploy code to it?
  • Do we have a way for a developer to precisely replicate the production environment their code runs in, somewhere that isn't hosting or serving customer traffic?
  • If one of our servers becomes unhealthy, do we have a way for customer traffic to stop being sent to that box and, ideally, for that box to be replaced without human intervention?

Can you use "server" in a sentence?

Alright, you caught me. It's becoming increasingly difficult to define what we mean when we say "server". To me, a server is still the piece of physical hardware running in a data center, and what actually runs a particular host is a virtual machine on that hardware. You can think of EC2 instances as virtual machines, different from Docker containers in ways we'll discuss later. For our purposes, EC2 instances are usually what we mean when we say servers. These instances are defined through the web UI, CLI or Terraform and launched in every possible configuration and memory size.

Really a server is just something that allows customers to interact with our application code. An API Gateway connected to Lambdas still meets my internal definition of server, except in that case we are completely hands off.

What are we dealing with here?

In conversations with DevOps engineers who have had the role thrust upon them, or who have been dropped into a position where maintenance hasn't happened in a while, a common theme has emerged. They are often placed in charge of a stack that looks something like this:

A user flow normally looks something like this:

  1. A DNS request is made to the domain and it points towards (often) a classic load balancer. This load balancer handles SSL termination and then forwards traffic on to servers running inside of a VPC on port 80.
  2. These servers are often running whatever Linux distribution was installed on them when they were made. Sometimes they are in autoscaling groups, but often they are not. These servers normally have Nginx installed along with something called uWSGI. Requests come in, Nginx hands them off to the uWSGI workers, and those workers interact with your application code.
  3. The application will make calls to the database server, often running MySQL because that is what the original developer knew to use. Sometimes these are running on a managed database service and sometimes they are not.
  4. Often with these sorts of stacks the deployment is something like "zip up a directory in the CICD stack, copy it to the servers, then unzip and move to the right location".
  5. There is often some sort of script to remove a box from the load balancer at deploy time and then re-add it afterwards.

Often there are additional AWS services being used, things like S3, CloudFront, etc., but we're going to focus on the servers right now.

This stack seems to work fine. Why should I bother changing it?

Configuration drift: the inevitable result of launching different boxes at different times and hand-running commands on them, which makes them impossible to test or reliably replicate.

Large organizations put a lot of work into ensuring every piece of their infrastructure is uniformly configured. There are tons of tools to do this, from Ansible Tower, Puppet and Chef, which check each server on a regular cycle and ensure the correct software is installed everywhere, to building whole new server images (AMIs) for each variation with tools like Packer and deploying them to auto-scaling groups. All of this work is designed to eliminate differences between Box A and Box B. We want all of our customers and services running on the same platforms we have running in our testing environment. That way errors are caught in earlier testing environments and our app in production isn't impacted.

The problem we want to avoid is the nightmare scenario for DevOps people: you can't roll forward or backwards. Something in your application has broken on some or all of your servers but you aren't sure why. The logs aren't useful and you didn't catch it before production because you don't test with the same stack you run in prod. Now your app is dying and everyone is trying to figure out why, eventually discovering some sort of hidden configuration option or file that was set a long time ago and now causes problems.

These sorts of issues plagued traditional servers for a long time, resulting in bespoke hand-crafted servers that could never be replaced without tremendous amounts of work. You either destroyed and remade your servers on a regular basis to ensure you still could, or you accepted the drift and tried to ensure that you had some baseline configuration that would launch your application. Both of these scenarios suck and for a small team it's just not reasonable to expect you to run that kind of maintenance operation. It's too complicated.

What if there was a tool that let you test exactly like you run your code in production? It would be easy to use, work on your local machine as well as on your servers, and would even let you quickly audit what's inside for known security issues.

This all just sounds like Docker

It is Docker! You caught me, it's just containers all the way down. You've probably used Docker containers a few times in your life, but running your applications inside of containers is the vastly preferred model for how to run code. It simplifies testing, deployments and dependency management, allowing you to move all of it inside of git repos.

What is a container?

Let's start with what a container isn't. We talked about virtual machines before. Docker is not a virtual machine, it's a different thing. The easiest way to understand the difference is to think of a virtual machine like a physical server. It isn't, but for the most part the distinction is meaningless for your normal day to day life. You have a full file system with a kernel, users, etc.

Containers are just what your application needs to run. They are just ways of moving collections of code around along with their dependencies. Often folks new to DevOps will think of containers in the wrong paradigm, asking questions like "how do you back up a container" or using them as permanent stores of state. Everything in a container is designed to be temporary. It's just a different way of running a process on the host machine. That's why if you run ps faux | grep name_of_your_service on the host machine you still see it.
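A quick way to see this for yourself, assuming you have Docker installed locally and using the public nginx image as a stand-in for your app:

# Start a container, then look at it from the host
docker run -d --name demo nginx

# The nginx processes show up as ordinary processes on the host
ps faux | grep nginx

# Clean up
docker rm -f demo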

Are containers my only option?

Absolutely not. I have worked with organizations that manage their code in different ways outside of containers. Some of the tools I've worked with have been NPM for Node applications and RPMs for various applications tied together with Linux dependencies. Here are the key questions when evaluating something other than Docker containers:

  • Can you reliably stand up a new server using a bash script plus this package? Typically bash scripts should be under 300 lines, so if we can make a new server with a script like that and some "other package", I would consider us to be in OK shape (see the sketch after this list).
  • How do I roll out normal security upgrades? All Linux distros have constant security upgrades; how do I apply them on a regular basis while still confirming that the boxes still work?
  • How much does an AWS EC2 maintenance notice scare me? This is where AWS or another cloud provider emails you and says "we need to stop one of your instances randomly due to hardware failures". Is it a crisis for my business or is it a mostly boring event?
  • If you aren't going to use containers but something else, just make sure that package is available from more than one source (for example a cache or mirror) so a single registry outage can't stop you from building servers.
  • For Node I have had a lot of success with Verdaccio as an NPM cache: https://verdaccio.org/
  • However in general I recommend paying for Packagecloud and pushing whatever packages you build there: https://packagecloud.io/
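To make that first bullet concrete, here is a rough sketch of what a non-container bootstrap script could look like. This is hypothetical: the repo URL, keyring path, package name and service name are all placeholders you would swap for your own.

#!/bin/bash
set -euo pipefail

# Base packages
apt-get update
apt-get -y install nginx curl

# Add your internal package repo (placeholder URL and key)
curl -fsSL https://packages.example.com/gpgkey -o /usr/share/keyrings/yourorg.asc
echo "deb [signed-by=/usr/share/keyrings/yourorg.asc] https://packages.example.com/debian bullseye main" \
  > /etc/apt/sources.list.d/yourorg.list

# Install the application package and start it
apt-get update
apt-get -y install myapp
systemctl enable --now myapp

If the script stays roughly this size and you can run it end to end against a fresh instance, you are in reasonable shape.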

How do I get my application into a container?

I find the best way to do this is to sit down with the person who has worked on the application the longest. I will spin up a brand new, fresh VM and say "can you walk me through what is required to get this app running?" Remember this is something they have likely done on their own machines a few hundred times, so they can pretty quickly recite the series of commands needed to "run the app". We need to capture those commands, because they are how we write the Dockerfile, the template for how we build our application image.

Once you have the list of commands, you can string them together in a Dockerfile.

How Do Containers Work?

It's a really fascinating story.  Let's teleport back in time. It's the year 2000, we have survived Y2K, the most dangerous threat to human existence at the time. FreeBSD rolls out a new technology called "jails". FreeBSD jails were introduced in FreeBSD 4.X and are still being developed now.

Jails are layered on top of chroot, which allows you to change the root directory of processes. For those of you who use Python, think of chroot like virtualenv. It's a safe distinct location that allows you to simulate having a new "root" directory. These processes cannot access files or resources outside of that environment.

Jails take that concept and expand it, virtualizing access to the file system, users, networking and every other part of the system. Jails introduce four things that you will quickly recognize as you start to work with Docker:

  • A new directory structure of dependencies that a process cannot escape.
  • A hostname for the specific jail
  • A new IP address which is often just an alias for an existing interface
  • A command that you want to run inside of the jail.
www {
    host.hostname = www.example.org;           # Hostname
    ip4.addr = 192.168.0.10;                   # IP address of the jail
    path = "/usr/jail/www";                    # Path to the jail
    devfs_ruleset = "www_ruleset";             # devfs ruleset
    mount.devfs;                               # Mount devfs inside the jail
    exec.start = "/bin/sh /etc/rc";            # Start command
    exec.stop = "/bin/sh /etc/rc.shutdown";    # Stop command
}
What a Jail looks like.

From FreeBSD the technology makes its way to Linux via the VServer project. As time went on more people built on the technology, taking advantage of cgroups. Control groups, shortened to cgroups, is a technology added to Linux in 2008 by engineers at Google. It is a way of defining a collection of processes that are bound by the same restrictions. Progress has continued with cgroups since its initial launch, and it is now on v2.

There are two parts of a cgroup, a core and a controller. The core is responsible for organizing processes. The controller is responsible for distributing a type of resource along the hierarchy. With this continued work we have gotten incredible flexibility with how to organize, isolate and allocate resources to processes.

Finally in 2013 we got Docker, adding a simple CLI, the concept of a Docker server, a way to host and share images and more. Now containers are too big for one company, instead being overseen by the Open Container Initiative. Instead of there being exclusively Docker clients pushing images to Dockerhub and running on the Docker server, we have a vibrant and strong open-source community around containers.

I could easily fill a dozen pages with interesting facts about containers, but the important thing is that containers are a mature technology built on a proven pattern of isolating processes from the host. This means we have complete flexibility for creating containers and can easily reuse a simple "base" host regardless of what is running on it.

For those interested in more details:

Anatomy of a Dockerfile

FROM debian:latest
# Copy application files
COPY . /app
# Install required system packages
RUN apt-get update
RUN apt-get -y install imagemagick curl software-properties-common gnupg vim ssh
RUN curl -sL https://deb.nodesource.com/setup_10.x | bash -
RUN apt-get -y install nodejs
# Install NPM dependencies
RUN npm install --prefix /app
EXPOSE 80
CMD ["npm", "start", "--prefix", "app"]
This is an example of a not great Dockerfile. Source

When writing Dockerfiles, open a tab to the official Docker docs. You will need to refer to them all the time at first, because very little of this is intuitive in the beginning. Typically Dockerfiles are stored in the top level of an existing repository, and their file operations, such as COPY shown above, operate on that principle. You don't have to do that, but it is a common pattern to see the Dockerfile at the root level of a repo. Whatever you do, keep it consistent.

Formatting

Dockerfile instructions are not case-sensitive, but are usually written in uppercase so that they can be differentiated from arguments more easily. Comments have the hash symbol (#) at the beginning of the line.

FROM

First is a FROM, which just says "what is our base container that we are starting from". As you progress in your Docker experience, FROM containers are actually great ways of speeding up the build process. If all of your containers have the same requirements for packages, you can actually just make a "base container" and then use that as a FROM. But when building your first containers I recommend just sticking with Debian.

Don't Use latest

Docker images rely on tags, which you can see in the example above as debian:latest. This is Docker for "give me the most recently pushed image". You don't want to do that for production systems. Upgrading the base container should be an affirmative action, not just something you do accidentally.

The correct way to reference a FROM image in a Dockerfile is through the use of a hash. So we want something like this:

FROM debian@sha256:c6e865b5373b09942bc49e4b02a7b361fcfa405479ece627f5d4306554120673

I got that digest from the Debian Dockerhub page here (the sketch below shows how to look it up from the command line). This protects us in a few different ways.

  • We won't accidentally upgrade our containers without meaning to
  • If the team in charge of pushing Debian containers to Dockerhub makes a mistake, we aren't suddenly going to be in trouble
  • It eliminates another source of drift
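If you'd rather look the digest up from the command line than copy it off Dockerhub, something like this works (the tag here is just an example):

# Pull the tag once, then read the digest Docker recorded for it
docker pull debian:bullseye
docker inspect --format='{{index .RepoDigests 0}}' debian:bullseye
# => debian@sha256:...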

But I see a lot of people using Alpine

That's fine, use Alpine if you want. I have more confidence in Debian than in Alpine and always base my stuff off Debian. I think it's a more mature community and more likely to catch problems. But again, whatever you end up doing, make it consistent.

If you do want a smaller container, I recommend minideb. It still lets you get the benefits of Debian with a smaller footprint. It is a good middle ground.

COPY

COPY is very basic. The . just means "current working directory", which in this case is wherever the Dockerfile lives, typically the top level of the git repository. It takes whatever you specify and copies it into the image.

COPY vs ADD

A common question I get is "what is the difference between COPY and ADD?" Super basic: ADD can go out and fetch something from a URL, or automatically extract a compressed archive into the container. So if all you need to do is copy some files into the container from the repo, just use COPY. If you have to grab a compressed directory from somewhere or unzip something, use ADD.
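A small illustration of the two, with made-up file names:

# COPY: plain files from the build context into the image
COPY requirements.txt /app/requirements.txt

# ADD: can fetch from a URL (saved as-is, not extracted)
ADD https://example.com/somefile.txt /tmp/somefile.txt

# ADD: a local tar archive gets automatically extracted into the target directory
ADD vendor.tar.gz /opt/vendor/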

RUN

RUN is the meat and potatoes of the Dockerfile. These are the shell commands we run to put together all the requirements. The file we have above doesn't follow best practices: we want to compress the RUN commands down so that they are all part of one layer.

RUN wget https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2 \
&& tar jxf samtools-1.2.tar.bz2 \
&& cd samtools-1.2 \
&& make \
&& make install
A good RUN example: chaining the commands means all of them become a single layer.

WORKDIR

Allows you to set the directory inside the container from which all the other commands will run. Saves you from having to write out the absolute path every time.
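A minimal sketch, assuming a Node app, of how WORKDIR tidies up the rest of the file:

WORKDIR /app
# Everything below now runs relative to /app
COPY . .
RUN npm install
CMD ["npm", "start"]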

CMD

The command we are executing when we run the container. Usually for most web applications this would be where we run the framework start command. This is an example from a Django app I run:

CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]

If you need more detail, Docker has a decent tutorial: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

One more thing

This obviously depends on your application, but many applications will also need a reverse proxy. This allows Nginx to listen on port 80 inside the container and forward requests on to your application. Docker has a good tutorial on how to add Nginx to your container: https://www.docker.com/blog/how-to-use-the-official-nginx-docker-image/

I cannot stress this enough: writing the Dockerfile that runs the actual application is not something a DevOps engineer should try to do on their own. You likely can do it by reverse engineering how your current servers work, but you need to pull in the other programmers in your organization.

Docker also has a good tutorial from beginning to end for Docker novices here: https://docs.docker.com/get-started/

Docker build, compose etc

Once you have a Dockerfile in your application repository, you are ready to move on to the next steps.

  1. Have your CICD system build the images. Type "name of your CICD + build docker images" into Google to see how.
  2. You'll need to make an IAM user for your CICD system in order to push the docker images from your CI workers to the ECR private registry. You can find the required permissions here.
  3. Get ready to push those images to a registry. For AWS users I strongly recommend AWS ECR.
  4. Here is how you make a private registry.
  5. Then you need to push your image to the registry. I want to make sure you see AWS ECR helper, a great tool that makes the act of pushing from your laptop much easier. https://github.com/awslabs/amazon-ecr-credential-helper. This also can help developers pull these containers down for local testing.
  6. Pay close attention to tags. You'll notice that the ECR registry is part of the tag, along with a colon and then the version information. You can use different registries for different applications or use the same registry for all your applications. Remember: secrets and customer data shouldn't be in your container regardless. A sketch of the build-tag-push flow follows this list.
  7. Go get a beer, you earned it.
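Here is roughly what the build-tag-push flow looks like as shell commands. The account ID, region and repository name are placeholders; your CICD system will run the same thing with its own credentials.

# Placeholders you would replace with your own values
AWS_ACCOUNT_ID=123456789012
AWS_REGION=eu-west-1
REPO="$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/my-app"
TAG=$(git rev-parse --short HEAD)

# Log in to ECR, then build and push the image
aws ecr get-login-password --region "$AWS_REGION" | \
  docker login --username AWS --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com"
docker build -t "$REPO:$TAG" .
docker push "$REPO:$TAG"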

Some hard decisions

Up to this point, we've been following a pretty conventional workflow. Get stuff into containers, push the containers up to a registry, automate the process of making new containers. Now hopefully we have our developers able to test their applications locally and everyone is very impressed with you and all the work you have gotten done.

The reason we did all this work is because now that our applications are in Docker containers, we have a wide range of options for ways to quickly and easily run this application. I can't tell you what the right option is for your organization without being there, but I can lay out the options so you can walk into the conversation armed with the relevant data.

Deploying Docker containers directly to EC2 Instances

This is a workflow you'll see quite a bit among organizations just building confidence in Docker. It works something like this -

  • Your CI system builds the Docker container using a worker and the Dockerfile you defined before. It pushes it to your registry with the correct tag.
  • You make a basic AMI with a tool like Packer.
  • New Docker containers are pulled down to the EC2 instances running the AMIs we made with Packer.

Packer

Packer is just a tool that spins up an EC2 instance, installs the software you want installed and then saves it as an AMI. These AMIs can be deployed when new machines launch, ensuring you have identical software for each host. Since we're going to be keeping all the often-updated software inside the Docker container, this AMI can be used as a less-often touched tool.

First, go through the Packer tutorial, it's very good.

Here is another more comprehensive tutorial.

Here are the steps we're going to follow

  1. Install Packer: https://www.packer.io/downloads.html
  2. Pick a base AMI for Packer. This is what we're going to install all the other software on top of.

Here is a list of Debian AMI IDs based on regions: https://wiki.debian.org/Cloud/AmazonEC2Image/Bullseye which we will use for our base image. Our Packer JSON file is going to look something like this:

{
    "variables": {
        "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
        "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}"
    },
    "builders": [
        {
            "access_key": "{{user `aws_access_key`}}",
            "ami_name": "docker01",
            "instance_type": "t3.micro",
            "region": "eu-west-1",
            "source_ami": "ami-05b99bc50bd882a41",
            "ssh_username": "admin",
            "type": "amazon-ebs"
        }
    ]
}

The next step is to add a provisioner step, as outlined in the Packer documentation you can find here. Basically you will write a bash script that installs the required software to run Docker. Docker actually provides you with a script that should install what you need which you can find here.
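A provisioner block, added to the JSON file above at the same level as "builders", might look something like this. It's a sketch: it leans on Docker's convenience install script and assumes the admin user from the builder.

"provisioners": [
    {
        "type": "shell",
        "inline": [
            "curl -fsSL https://get.docker.com -o get-docker.sh",
            "sudo sh get-docker.sh",
            "sudo usermod -aG docker admin"
        ]
    }
]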

The end process you will be running looks like this:

  • CI process builds a Docker image and pushes it to ECR.
  • Your deployment process is either to configure your servers to pull the latest image from ECR with a cron job, so your servers are eventually consistent, or (more likely) to write a deployment job which connects to each server, runs docker pull and then restarts the containers as needed. A rough sketch of that second approach is below.
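Here's a rough sketch of that second approach. Everything in it is a placeholder (hostnames, image name, ports), and it assumes the hosts can already pull from ECR.

#!/bin/bash
set -euo pipefail

# Image tag is passed in as the first argument
IMAGE="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-app:$1"

for host in web01 web02 web03; do
  echo "Deploying to $host"
  ssh admin@"$host" "
    docker pull $IMAGE &&
    (docker rm -f my-app || true) &&
    docker run -d --name my-app --restart unless-stopped -p 80:8000 $IMAGE
  "
done

Notice how much this leaves out: draining the load balancer, waiting for health checks, rolling back. That's the hand-made glue we're trying to get away from.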

Why this is a bad long-term strategy

A lot of organizations start here, but it's important not to end here. This is not a good, sustainable long-term workflow.

  • This is all super hand-made, which doesn't fit with our end goal
  • The entire process is going to be held together with hand-made scripts. You need something to remove the instance you are deploying to from the load balancer, pull the latest image, restart it, etc.
  • You'll need to configure a health check on the Docker container to know whether it started correctly (see the example below this list).
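For that last point, Docker has a HEALTHCHECK instruction you can put in the Dockerfile. A minimal example, assuming curl is installed in the image and the app answers on /health on port 8000:

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

docker ps will then report the container as healthy or unhealthy, but your scripts still have to check and act on that, which is another thing the managed options below do for you.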

Correct ways to run containers in AWS

If you are trying to very quickly make a hands-off infrastructure with Docker, choose Elastic Beanstalk

Elastic Beanstalk is the AWS attempt to provide infrastructure in a box. I don't approve of everything EB does, but it is one of the fastest ways to stand up a robust and easy to manage infrastructure. You can check out how to do that with the AWS docs here.

With EB, AWS stands up everything you need, from the load balancer to the servers and even the database if you want. It is pretty easy to get going, but Elastic Beanstalk is not a magic solution.

Elastic Beanstalk is a good solution if:

  1. You are attempting to run a very simple application. You don't have anything too complicated you are trying to do. In terms of complexity, we are talking about something like a Wordpress site.
  2. You aren't going to need anything in the .ebextensions world. You can see that here.
  3. There is a good logging and metrics story that developers are already using.
  4. You want rolling deployments, load balancing, auto scaling, health checks, etc out of the box

Don't use Elastic Beanstalk if:

  1. You need to do a lot of complicated networking stuff on the load balancer for your application.
  2. You have a complicated web application. Elastic Beanstalk has bad documentation and it's hard to figure out why stuff is or isn't working.
  3. Service-to-service communication is something you are going to need now or in the future.

If you need something more robust, try ECS

AWS ECS is a service designed to quickly and easily run Docker containers. You can find the tutorial here: https://aws.amazon.com/getting-started/hands-on/deploy-docker-containers/

Use Elastic Container Service if:

  1. You are already heavily invested in AWS resources. The integration with ECS and other AWS resources is deep and works well.
  2. You want the option of not having to manage servers at all with Fargate
  3. You have looked at the cost of running a stack on Fargate and are OK with it.

Don't use Elastic Container Service if:

  1. You may need to deploy this application to a different cloud provider

What about Kubernetes?

I love Kubernetes, but it's too complicated to get into in this article. Kubernetes is a full stack solution that I adore, but it is probably too complicated for one person to run. I am working on a Kubernetes writeup, but if you are a small team I wouldn't seriously consider it. ECS is just easier to get running and keep running.

Coming up!

  • Logging, metrics and traces
  • Paging and alerts. What is a good page vs a bad page
  • Databases. How do we move them, what do we do with them
  • Status pages. Where do we tell customers about problems or upcoming maintenance.
  • CI/CD systems. Do we stick with Jenkins or is there something better?
  • Serverless. How does it work, should we be using it?
  • IAM. How do I give users and applications access to AWS without running the risk of bringing it all down.

Questions / Concerns?

Let me know on twitter @duggan_mathew


DevOps Engineer Crash Course - Section 1

Fake it till you make it Starfleet Captain Kelsey Grammer Link

I've had the opportunity lately to speak to a lot of DevOps engineers at startups around Europe. Some come from a more traditional infrastructure background, beginning their careers in network administration or system administration. Most are coming from either frontend or backend teams, choosing to focus more on the infrastructure work (which hey, that's great, different perspectives are always appreciated).

However, a pretty alarming trend has emerged through these conversations. The story seems to start with the existing sysadmin or DevOps person leaving, and suddenly they are dropped into the role with almost no experience or training. Left to their own devices with root access to the AWS account, they often have no idea where to even start. Learning on the job is one thing, but being responsible for the critical functioning of an entire company's infrastructure with no time to ramp up is crazy and frankly terrifying.

For some of these folks, it was the beginning of a love affair with infrastructure work. For others, it caused them to quit those jobs immediately in panic. I even spoke to a few who left programming as a career as a result of the stress they felt at the sudden pressure. That's sad for a lot of reasons, especially when these people are forced into the role. But it did spark an idea.

What advice and steps would I tell someone who suddenly had my job with no time to prepare? My goal is to try and document what I would do, if dropped into a position like that, along with my reasoning.

Disclaimer

These solutions aren't necessarily the best fit for every organization or application stack. I tried to focus on easy, relatively straightforward tips for people who are dropped into a role they have very little context on. As hard as this might be to believe for some people out there, a lot of smaller companies just don't have any additional infrastructure capacity, especially in some areas of Europe.

These aren't all strictly DevOps concepts as I understand the term to mean. I hate to be the one to tell you but, like SRE and every other well-defined term before it, businesses took the title "DevOps" and slapped it on a generic "infrastructure" concept. We're gonna try to stick to some key guiding principles but I'm not a purist.

Key Concepts

  1. We are a team of one or two people. There is no debate about build vs buy. We're going to buy everything that isn't directly related to our core business.
  2. These systems have to fix themselves. We do not have the time or capacity to apply the love and care of a larger infrastructure team. Think less woodworking and more building with Legos. We are trying to snap pre-existing pieces together in a sustainable pattern.
  3. Boring > New. We are not trying to make the world's greatest infrastructure here. We need something sustainable, easy to operate, and ideally something we do not need to constantly be responsible for. This means teams rolling out their own resources, monitoring their own applications, and allocating their own time.
  4. We are not the gatekeepers. Infrastructure is a tool and like all tools, it can be abused. Your organization is going to learn to do this better collectively.
  5. You cannot become an expert on every element you interact with. A day in my job can be managing Postgres operations, writing a PR against an application, or sitting in a planning session helping to design a new application. The scope of what many businesses call "DevOps" is too vast to be a deep-dive expert in all parts of it.

Most importantly we'll do the best we can, but push the guilt out of your head. Mistakes are the cost of their failure to plan, not your failure to learn.  A lot of the people who I have spoken to who find themselves in this problem feel intense shame or guilt for not "being able to do a better job". Your employer has messed up, you didn't.

Section One - Into the Fray

Maybe you expressed some casual interest in infrastructure work during a one on one a few months ago, or possibly you are known as the "troubleshooting person", assisting other developers with writing Docker containers. Whatever got you here, your infrastructure person has left, maybe suddenly. You have been moved into the role with almost no time to prepare. We're going to assume you are on AWS for this, but for the most part, the advice should be pretty universal.

I've tried to order these tasks in terms of importance.

1. Get a copy of the existing stack

Alright, you got your AWS credentials, the whole team is trying to reassure you not to freak out because "mostly the infrastructure just works and there isn't a lot of work that needs to be done". You sit down at your desk and your mind starts racing. Step 1 is to get a copy of the existing cloud setup.

We want to get your infrastructure as it exists right now into code because chances are you are not the only one who can log into the web panel and change things. There's a great tool for exporting existing infrastructure state in Terraform called terraformer.

Terraformer

So terraformer is a CLI tool written in Go that allows you to quickly and easily dump out all of your existing cloud resources into a Terraform repo. These files, either as TF format or JSON, will let you basically snapshot the entire AWS account. First, set up AWS CLI and your credentials as shown here. Then once you have the credentials saved, make a new git repo.

# Example flow

# Set up our credentials
aws configure --profile production

# Make sure they work
aws s3 ls --profile production 

# Make our new repo
mkdir infrastructure && cd infrastructure/
git init 

# Install terraformer
# Linux
curl -LO https://github.com/GoogleCloudPlatform/terraformer/releases/download/0.8.15/terraformer-all-linux-amd64
chmod +x terraformer-all-linux-amd64
sudo mv terraformer-all-linux-amd64 /usr/local/bin/terraformer

# Intel Mac
curl -LO https://github.com/GoogleCloudPlatform/terraformer/releases/download/0.8.15/terraformer-all-darwin-amd64
chmod +x terraformer-all-darwin-amd64
sudo mv terraformer-all-darwin-amd64 /usr/local/bin/terraformer

# Other Platforms
https://github.com/GoogleCloudPlatform/terraformer/releases/tag/0.8.15

# Install terraform
https://learn.hashicorp.com/tutorials/terraform/install-cli

First, if you don't know what region your AWS resources are in you can find that here.

So what we're gonna do is run:

terraformer import aws --regions INSERT_AWS_REGIONS_HERE --resources="*" --profile=production

### You will get a directory structure that looks like this
generated/
└── aws
    ├── acm
    │   ├── acm_certificate.tf
    │   ├── outputs.tf
    │   ├── provider.tf
    │   └── terraform.tfstate
    └── rds
        ├── db_instance.tf
        ├── db_parameter_group.tf
        ├── db_subnet_group.tf
        ├── outputs.tf
        ├── provider.tf
        └── terraform.tfstate

So if you wanted to modify something for rds, you would cd to the rds directory, then run terraform init. You may get an error: Error: Invalid legacy provider address

If so, no problem. Just run

terraform state replace-provider registry.terraform.io/-/aws hashicorp/aws

Once that is set up, you now have the ability to restore the AWS account using terraform at any time. You will want to add this repo to a CICD job eventually so this gets done automatically, but at first, you might need to run it locally.

$ export AWS_ACCESS_KEY_ID="anaccesskey"
$ export AWS_SECRET_ACCESS_KEY="asecretkey"
$ export AWS_DEFAULT_REGION="us-west-2"
$ terraform plan

You should see terraform run and tell you no changes.

Why Does This Matter?

Terraform lets us do a few things, one of which is roll out infrastructure changes like we would with any other code change. This is great because, in the case of unintended outages or problems, we can rollback. It also matters because often with small companies things will get broken when someone logs into the web console and clicks something they shouldn't. Running a terraform plan can tell you exactly what changed across the entire region in a few minutes, meaning you should be able to roll it back.

Should I do this if our team already manages our stack in code?

I would. There are tools like Ansible and Puppet which are great at managing servers that some people use to manage AWS. Often these setups are somewhat custom, relying on some trial and error before you figure out exactly how they work and what they are doing. Terraform is very stock and anyone on a DevOps chat group or mailing list will be able to help you run the commands. We're trying to establish basically a "restore point". You don't need to use Terraform to manage stuff if you don't want to, but you probably won't regret having a copy now.

Later on, we're going to be putting this into a CICD pipeline so we don't need to manage who adds infrastructure resources. We'll do that by requiring approval on PRs vs us having to write everything. It'll distribute the load but still let us ensure that we have some insight into how the system is configured. Right now though, since you are responsible for infrastructure you can at least roll this back.

2. Write down how deployments work


Every stack is a little different in terms of how it gets deployed, and this is a constant source of problems for folks starting out. You need to be able to answer the question of how exactly code goes from a repo -> production. Maybe it's Jenkins, or GitLab runners or GitHub, CodeDeploy, etc., but you need to know the answer for each application. Most importantly, you need to read through whatever shell script they're running to actually deploy the application, because that will start to give you an idea of what hacks are required to get this thing up and running.

Here are some common questions to get you started.

  • Are you running Docker? If so, where do the custom images come from? What runs the Dockerfile, where does it push the images, etc.
  • How do you run migrations against the database? Is it part of the normal code base, is there a different utility?
  • What is a server to your organization? Is it a stock EC2 instance running Linux and Docker, with everything else getting deployed with your application? Is it a server where your CICD job just rsyncs files to a directory Nginx reads from?
  • Where do secrets come from? Are they stored in the CICD pipeline? Are they stored in a secrets system like Vault or Secrets Manager? (Man if your organization actually does secrets correctly with something like this, bravo).
  • Do you have a "cron box"? This is a server that runs cron jobs on a regular interval outside of the normal fleet. I've seen these called "snowflake", "worker", etc. These are usually the least maintained boxes in the organization but often the most critical to how the business works.
  • How similar or different are the applications? Often organizations have mixes of serverless applications (managed either through the AWS web UI or with tools like serverless) and conventional web servers. Lambdas in AWS are awesome tools that are often completely unmanaged in small businesses, so try to pay special attention to these.

The goal of all of this is to be able to answer "how does code go from a developer laptop to our customers". Once you understand that specific flow, then you will be much more useful in terms of understanding a lot of how things work. Eventually, we're going to want to consolidate these down into one flow, ideally into one "target" so we can keep our lives simple and be able to really maximize what we can offer the team.

Where do logs go and what stores them?

All applications and services generate logs. Logs are critical to debugging the health of an application, and knowing how that data is gathered and stored is critical to empowering developers to understand problems. This is the first week, so we're not trying to change anything, we just want to document how it works. How are logs generated by the application?

Some likely scenarios:

  • They are written to disk on the application server and pushed somewhere through syslog. Great: document the syslog configuration, where it forwards to, and finally whether logrotate is set up to keep the boxes from running out of disk space.
  • They get pushed to either the cloud provider or a monitoring provider (Datadog etc). Fine, couldn't be easier, but write down where the permission to push the logs comes from. What I mean by that is: does the app push the logs to AWS, or does an agent running on the box pick up the logs and push them to AWS? Either is fine, but knowing which makes a difference.

Document the flow, looking out for expiration or deletion policies. Also check how access control works: how do developers access these raw logs? Hopefully through some sort of web UI, but if it is through SSH access to the log aggregator that's fine, just write it down. A few quick commands for poking at a syslog/logrotate setup are below.
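If you're on the syslog/logrotate scenario above, a few quick checks help while documenting (adjust paths for your distribution):

# Where is syslog forwarding logs to? (@ / @@ lines are remote targets)
grep -rh '^[^#]*@' /etc/rsyslog.conf /etc/rsyslog.d/ 2>/dev/null

# What is being rotated, and how full is the log volume?
ls /etc/logrotate.d/
df -h /var/log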

For more information about CloudWatch logging check out the AWS docs here.

3. How does SSH access work?

You need to know exactly how SSH works from the developers' laptop to the server they are trying to access. Here are some questions to kick it off.

  • How do SSH public keys get onto a server? Is there a script, does it sync from somewhere, are they put on by hand?
  • What IP addresses are allowed to SSH into a server? Hopefully not all of them, most organizations have at least a bastion host or VPN set up. But test it out, don't assume the documentation is correct. Remember we're building new documentation from scratch and approaching this stack with the respect it deserves as an unknown problem.
  • IMPORTANT: HOW DO EMPLOYEES GET OFFBOARDED? Trust me, people forget this all the time and it wouldn't surprise me if you find some SSH keys that shouldn't be there (a quick audit sketch follows this list).
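A quick, hypothetical way to audit that last point: list the comment field of every authorized key on each box and check the names against your current employee list. Hostnames and the user are placeholders.

for host in bastion web01 web02; do
  echo "== $host =="
  ssh admin@"$host" 'cut -d" " -f3- ~/.ssh/authorized_keys'
done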

I don't know anything about SSH

Don't worry we got you. Take a quick read through this tutorial. You've likely used SSH a lot, especially if you have ever set up a Digital Ocean or personal EC2 instance on a free tier. You have public keys synced to the server and private keys on the client device.

What is a bastion host?

They're just servers that exist to allow traffic from a public subnet to a private subnet. Not all organizations use them, but a lot do, and given the conversations I've had it seems like a common pattern around the industry to use them. We're using a box between the internet and our servers as a bridge.
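In practice that usually means one extra hop on the way in. With modern OpenSSH that's a single flag (hostnames here are placeholders):

# One hop through the bastion to an instance in the private subnet
ssh -J admin@bastion.example.com admin@10.0.1.15

# Or put it in ~/.ssh/config so "ssh web01" just works:
#   Host web01
#     HostName 10.0.1.15
#     User admin
#     ProxyJump admin@bastion.example.com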

Do all developers need to access bastion hosts?

Nope they sure don't. Access to the Linux instances should be very restricted and ideally, we can get rid of it as we go. There are much better and easier to operate options now through AWS that let you get rid of the whole concept of bastion servers. But in the meantime, we should ensure we understand the existing stack.

Questions to answer

  • How do keys get onto the bastion host?
  • How does access work from the bastion host to the servers?
  • Are the Linux instances we're accessing in a private subnet or are they on a public subnet?
  • Is the bastion host up to date? Is the Linux distribution running current with the latest patches? There shouldn't be any other processes running on these boxes so upgrading them shouldn't be too bad.
  • Do you rely on SFTP anywhere? Are you pulling something down that is critical or pushing something up to SFTP? A lot of businesses still rely heavily on automated jobs around SFTP and you want to know how that authentication is happening.

4. How do we know the applications are running?

It seems from conversations that these organizations often have bad alerting stories. They don't know applications are down until customers tell them or they happen to notice. So you want to establish some sort of baseline early on, basically "how do you know the app is still up and running". Often now there is some sort of health check path, something like domain/health or /check or something, used by a variety of services like load balancers and Kubernetes to determine if something is up and functional or not.

First, understand what this health check is actually doing. Sometimes they are just hitting a webserver and ensuring Nginx is up and running. While interesting to know that Nginx is a reliable piece of software (it is quite reliable), this doesn't tell us much. Ideally, you want a health check that interacts with as many pieces of the infrastructure as possible. Maybe it runs a read query against the database to get back some sort of UUID (which is a common pattern).

This next part depends a lot on what alerting system you use, but you want to make a dashboard that you can use very quickly to determine "are my applications up and running". Infrastructure modifications are high-risk operations and sometimes when they go sideways, they'll go very sideways. So you want some visual system to determine whether or not the stack is functional and ideally, this should alert you through Slack or something. If you don't have a route like this, consider doing the work to add one. It'll make your life easier and probably isn't too complicated to do in your framework.

My first alerting tool is almost always Uptime Robot. So we're gonna take our health route and we are going to want to set an Uptime Robot alert on that endpoint. You shouldn't allow traffic from the internet at large to hit this route (because it is a computationally expensive route it is susceptible to malicious actors). However, Uptime Robot provides a list of their IP addresses for whitelisting. So we can add them to our security groups in the terraform repo we made earlier.

If you need a free alternative I have had a good experience with Hetrix. Setting up the alerts should be self-explanatory, basically hit an endpoint and get back either a string or a status code.

5. Run a security audit

Is he out of his mind? On the first week? Security is a super hard problem and one that startups mess up all the time. We can't make this stack secure in the first week (or likely month) of this work, but we can ensure we don't make it worse and, when we get a chance, we move closer to an ideal state.

The tool I like for this is Prowler. Not only does it allow you a ton of flexibility with what security audits you run, but it lets you export the results in a lot of different formats, including a very nice-looking HTML option.

Steps to run Prowler

  1. Install Prowler. We're gonna run this from our local workstation using the AWS profile we made before.

On our local workstation:
git clone https://github.com/toniblyx/prowler
cd prowler

2. Run prowler. ./prowler -p production -r INSERT_REGION_HERE -M csv,json,json-asff,html -g cislevel1

The command above outputs the results in all of the formats Prowler supports, but I want to focus for a second on the -g option. That's the group option and it basically means "which security audit are we going to run". The CIS Amazon Web Services Foundations benchmark has two levels, which can be thought of broadly as:


Level 1: Stuff you should absolutely be doing right now that shouldn't impact most application functionality.

Level 2: Stuff you should probably be doing but is more likely to impact the functioning of an application.

We're running Level 1, because ideally, our stack should already pass a level 1 and if it doesn't, then we want to know where. The goal of this audit isn't to fix anything right now, but it IS to share it with leadership. Let them know the state of the account now while you are onboarding, so if there are serious security gaps that will require development time they know about it.

Finally, take the CSV file that was output from Prowler and stick it in Google Sheets with a date. We're going to want to have a historical record of the audit.

6. Make a Diagram!

The last thing we really want to do is make a diagram and have the folks who know more about the stack verify it. One tool that can kick this off is Cloudmapper. This is not going to get you all of the way there (you'll need to add meaningful labels and likely fill in some missing pieces) but should get you a template to work off of.

What we're primarily looking for here is understanding flow and dependencies. Here are some good questions to get you started.

  • Where are my application persistence layers? What hosts them? How do they talk to each other?
  • Overall network design. How does traffic ingress and egress? Do all my resources talk directly to the internet or do they go through some sort of NAT gateway? Are my resources in different subnets, security groups, etc?
  • Are there less obvious dependencies? SQS, RabbitMQ, S3, elasticsearch, varnish, any and all of these are good candidates.

The ideal state here is to have a diagram that we can look at and say "yes I understand all the moving pieces". For some stacks that might be much more difficult, especially serverless stacks. These often have mind-boggling designs that change deploy to deploy and might be outside of the scope of a diagram like this. We should still be able to say "traffic from our customers comes in through this load balancer to that subnet after meeting the requirements in x security group".

We're looking for something like this

If your organization has LucidChart they make this really easy. You can find out more about that here. You can do almost everything Lucid or AWS Config can do with Cloudmapper without the additional cost.

Cloudmapper is too complicated, what else have you got?

Does the setup page freak you out a bit? It does take a lot to set up and run the first time. AWS actually has a pretty nice pre-made solution to this problem. Here is the link to their setup: https://docs.aws.amazon.com/solutions/latest/aws-perspective/overview.html

It does cost a little bit but is pretty much "click and go" so I recommend it if you just need a fast overview of the entire account without too much hassle.

End of section one

Ideally the state we want to be in looks something like the following.

  • We have a copy of our infrastructure that we've run terraform plan against and there are no diffs, so we know we can go back.
  • We have an understanding of how the most important applications are deployed and what they are deployed to.
  • The process of generating, transmitting, and storing logs is understood.
  • We have some idea of how secure (or not) our setup is.
  • There are some basic alerts on the entire stack, end to end, which give us some degree of confidence that "yes the application itself is functional".

For many of you who are more experienced with this type of work, I'm sure you are shocked. A lot of this should already exist and really this is a process of you getting up to speed with how it works. However sadly in my experience talking to folks who have had this job forced on them, many of these pieces were set up a few employees ago and the specifics of how they work are lost to time. Since we know we can't rely on the documentation we need to make our own. In the process, we become more comfortable with the overall stack.

Stuff still to cover!

If there is any interest I'll keep going with this. Some topics I'd love to cover.

  • Metrics! How to make a dashboard that doesn't suck.
  • Email. Do your apps send it, are you set up for DMARC, how do you know if email is successfully getting to customers, where does it send from?
  • DNS. If it's not in the terraform directory we made before under Route53, it must be somewhere else. We gotta manage that like we manage a server because users logging into the DNS control panel and changing something can cripple the business.
  • Kubernetes. Should you use it? Are there other options? If you are using it now, what do you need to know about it?
  • Migrating to managed services. If your company is running its own databases or baking its own AMIs, now might be a great time to revisit that decision.
  • Sandboxes and multi-account setups. How do you ensure developers can test their apps in the least annoying way while still keeping the production stack up?
  • AWS billing. What are some common gotchas, how do you monitor spending, and what do you do about it institutionally?
  • SSO, do you need it, how to do it, what does it mean?
  • Exposing logs through a web interface. What are the fastest ways to do that on a startup budget?
  • How do you get up to speed? What courses and training resources are worth the time and energy?
  • Where do you get help? Are there communities with people interested in providing advice?

Did I miss something obvious?

Let me know! I love constructive feedback. Bother me on Twitter. @duggan_mathew


TIL Command to get memory usage by process in Linux

If like me you are constantly trying to figure out using a combination of ps and free to see what is eating all your memory, check this out:

ps -eo size,pid,user,command --sort -size | \
    awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }'

TIL docker-compose lies to you....

You, like me, might assume that when you write a docker-compose healthcheck, it does something useful with that information. So for instance you might add something like this to your docker-compose file:

healthcheck:
      test: ["CMD", "curl", "-f", "-L", "http://localhost/website.aspx"]
      interval: 5s
      timeout: 10s
      retries: 2
      start_period: 60s

You run your docker container in production and when the container is running but no longer working, your site will go down. Being a reasonable human being you check docker-compose ps to see if docker knows your container is down. Weirdly, docker DOES know that the docker container is unhealthy but seems to do nothing with this information.

Wait, so Docker just records that the container is unhealthy?

Apparently! I have no idea why you would do that or what the purpose of a healthcheck is if not to kill and restart the container. However there is a good solution.

The quick fix to make standalone Docker do what you want

autoheal:
  image: willfarrell/autoheal:latest
  restart: always
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  environment:
    - AUTOHEAL_CONTAINER_LABEL=all
    - AUTOHEAL_START_PERIOD=60

This small container will automatically restart unhealthy containers and works great. Huge fan.


This Bloomberg Story Makes No Sense

Someone Is Making Something Up

One of the biggest stories right now in the tech world is the bombshell dropped by Bloomberg that the biggest names in tech have been hacked. Not through a software exploit or through some gap in a firewall, but by the Chinese government infiltrating their server supply chain. You can read the story here.

This resonates with tech professionals for a lot of good reasons. First, the people who know how to make these servers do something useful for an end user are rarely the same people who understand the specifics of how they work. I might know how to bootstrap and set up a bunch of Dell servers, but if asked to explain in intricate detail how iDRAC functions or what specifically can be done to harden the server itself, I would have to plead some ignorance. There's just so much to learn in this field that one person can't do it all. At some point I have to trust that the thing I just unboxed from Dell or Supermicro or whomever is safe to use.

So when the assertion came out that the supply chain itself for the hardware we all rely on may have been breached, it shook us all to our cores. First, we rely on these companies to provide us with services not easily duplicated elsewhere. I don't know a lot of stacks anymore that don't rely on some AWS functionality, so saying all of that information may have been released to the Chinese government is akin to your bank saying it might have lost the keys to the safety deposit boxes. There's also the knowledge that we cannot individually all vet these vendors. I lack the training to inspect every PCB inside of every model of device in my datacenter and determine whether it's safe and up to spec. I don't even have the diagrams to do this assuming I could! Basically Bloomberg asserted everyone's worst fears.

Then cracks started to show up. Not small cracks, not companies covering their butts kind of problems. Amazon denies the story 100% in a way where there is no room for debate as to what they mean. Amazon wasn't alone in these strong denials. Apple denied it as well in a letter to Congress. I would challenge anyone to read these statements and come away thinking these were parties attempting to leave room for doubt. Which is incredibly strange!

As time goes on we start to hear more problems with the actual reporting. One of the few named sources in the story, who provided technical background and context, admits it feels strange that almost everything he told the reporter about how these attacks might happen is later apparently confirmed by different sources. I would encourage folks to listen to Joe Fitzpatrick here. Then the people who WERE supporting the narrative started to come out of the woodwork, and they raised more questions than they answered. Yossi Appleboum, CEO of Sepio Systems (a firm with deep ties to the intelligence community, based on their company's "About" page), comes out swinging that this is a much bigger problem! The scope is even bigger than Bloomberg asserts. You can read his take here.

Someone is lying, clearly. Either this hack didn't happen, or it did happen and companies are willing to lie on the record to everyone involved. The latter scenario feels unlikely for a number of reasons. One, because we know from recent events like the Facebook hack and the Google+ hack that the penalty for being caught leaking user data isn't that high. It would be a stock-price hit for Apple and Amazon if indeed this was true, but certainly one they could recover from. However the PR hit would appear to be relatively minor and you could bury the entire thing in enough technical details to throw off the average reader. Like I doubt if I told a random person on the street who uses and enjoys their iPhone that some of the servers making up the iCloud infrastructure might have been compromised by the Chinese government, they would swear off the company forever.

If it isn't "this hack happened and everyone involved is attempting to cover it up", then what is it? A false flag attack by the intelligence community on China? Sure maybe, but why? I've never seen anything like this before where a major news outlet that seemingly specializes in business news gets shut down by major industry players and continues writing the story. Whatever this story is, it feels like these companies coming out strongly against it is sending a message. They're not going to play along with whatever this narrative is. It also doesn't appear that foreign nationals would even need to go through all this trouble. Turns out the Supermicro firmware wasn't the best to start with on its own.

At the end of this saga I think either some folks at Bloomberg need to be let go or we need to open investigations into being deceived by officers of publicly traded companies. I don't see any other way for this to resolve.


Git Tips and Tricks

Git is an amazing tool that does a lot of incredible things. It's also a tool that is easy to screw up. Here is a list of tips I've found useful for when working with git.

Git Up

So when you run git pull, chances are what you actually want to do is rebase on top of the incoming changes. You are also probably working in a "dirty workspace", which means you want to pull the commits ahead of you and rebase over them, but you don't want your uncommitted work to get in the way until you are ready to commit. (I'm saying all this like it's a certainty; I'm sure for a lot of people it's not. However I work in an extremely busy repo all day.)

We're gonna set a new git alias for us that will do what we actually want to do with git.
git config --global alias.up 'pull --rebase --autostash'

Autostash basically makes a temporary stash before you run the rest of the rebase and then applies it back after the operation ends. This lets you work out of a dirty directory. Make sure you sanity check your changes though since the final stash application has the potential to do some weird things.

Approving Stuff As You Git Add

Been working for a while and ready to add stuff to a commit, but maybe you don't remember all the changes you've made? Not a problem.

git add -p lets you step through your changes one by one and only add the ones to this commit you need. I use this as a final sanity check to ensure some fix I didn't mean to go out with this deploy doesn't sneak out.

Go Back In Commit History

git reflog will show you a list of everything done across all branches along with the index HEAD@{index}

git reset HEAD@{index} will let you fix it

Committed to the wrong branch

Go to the branch you want: git checkout right-branch

Grab your commit: git cherry-pick SHA of commit (get this from git log), or if it's the last commit on a branch, just the name of the branch

If needed, delete the commit from the wrong branch: git checkout wrong-branch then git reset HEAD~ --hard to knock out the last commit

You can pass an argument to HEAD to undo multiple commits if needed, knucklehead: git reset HEAD~3 to undo three

Squash Commits Down

First just do a soft reset passing HEAD the argument for how many commits you want to go back: git reset --soft HEAD~3

Then take your previous commit messages and make them your new unified commit: git commit --edit -m"$(git log --format=%B --reverse HEAD..HEAD@{1})"

Fix Messed Up Commit Message

git commit --amend and then just follow the prompts


MegaCLI Cheat Sheet

Everyone who has worked with certain vendors' servers for a while has likely encountered megacli. This is the command line tool to manage your RAID controller and disks. Since I often only end up doing this every six months or so, I usually forget the syntax and decided to write it down here.

Install megacli

sudo apt-get install megacli

Common Parameters

Controller Syntax: -aN

This is telling MegaCLI the adapter ID. Don't use the ALL flag; always specify the adapter, because otherwise you'll get into a bad habit and forget which boxes have multiple controllers.

Physical drive: -PhysDrv [E:S]

I hate this format. E is the enclosure ID where the drive lives and S is the slot number, starting at 0. You get this info with megacli -EncInfo -aALL

Virtual drive: -Lx

Used for specifying the virtual drive (where x is a number starting with zero or the string all).

Information Commands

Get Controller Information:

  • megacli -AdpAllInfo -aALL
  • megacli -CfgDsply -aALL
  • megacli -adpeventlog -getevents -f lsi-events.log -a0 -nolog

Common Operations

Replace drive:

  • Set the drive offline with megacli -PDOffline -PhysDrv[E:S] -aN
  • Mark the drive as missing: megacli -PDMarkMissing -PhysDrv [E:S] -aN
  • Get the drive ready to yank out: megacli -PDPrpRmv -PhysDrv [E:S] -aN
  • Change/replace the drive: megacli -PdReplaceMissing -PhysDrv[E:S] -ArrayN -rowN -aN
  • Start the drive: megacli -PDRbld -Start -PhysDrv [E:S] -aN

Fix Drive marked as "foreign" when inserted:

This happens when you steal a drive from an existing RAID to put in your more critical array. I know, you've never done anything so evil. Anyway, when you need to, here's how you do it.

  • Flip to good: megacli -PDMakeGood -PhysDrv [E:S] -aALL
  • Clear the foreign (that feels offensive but we're gonna move on): megacli -CfgForeign -Clear -aALL

Shut the stupid alarm off for good

  • megacli -AdpSetProp AlarmDsbl -aALL