GitHub Copilot Workspace Review

I was recently invited to try out the beta for GitHub's new AI-driven web IDE and figured it could be an interesting time to dip my toes into AI. So far I've mostly avoided the AI tooling, having tried the paid GitHub Copilot option and been frankly underwhelmed. It made more work for me than it saved. However this one is free for me to try, so I figured "hey, why not".

Disclaimer: I am not and have never been an employee of GitHub, Microsoft, any company owned by Microsoft, etc. They don't care about me and likely aren't aware of my existence. Nobody from GitHub PR asked me to do this and probably won't like what I have to say anyway.

TL;DR

GitHub Copilot Workspace didn't work on a super simple task regardless of how easy I made the task. I wouldn't use something like this for free, much less pay for it. It sort of failed in every way it could at every step.

What is GitHub Copilot Workspace?

So, following the success of GitHub Copilot, which does seem successful at least according to them:

In 2022, we launched GitHub Copilot as an autocomplete pair programmer in the editor, boosting developer productivity by up to 55%. Copilot is now the most widely adopted AI developer tool. In 2023, we released GitHub Copilot Chat—unlocking the power of natural language in coding, debugging, and testing—allowing developers to converse with their code in real time.

They have expanded on this feature set with GitHub Copilot Workspace, a combination of an AI tool with an online IDE....sorta. It's all powered by GPT-4 so my understanding is this is the best LLM money can buy. The workflow of the tool is strange and takes a little bit of explanation to convey what it is doing.

GitHub has the marketing page here: https://githubnext.com/projects/copilot-workspace and the docs here: https://github.com/githubnext/copilot-workspace-user-manual. It's a beta product and I thought the docs were nicely written.

Effectively you start with a GitHub Issue, the classic way maintainers are harassed by random strangers. I've moved my very simple demo site: https://gcp-iam-reference.matduggan.com/ to a GitHub repo to show what I did. So I open the issue here: https://github.com/matdevdug/gcp-iam-reference/issues/1

Very simple, makes sense. Then I click "Open in Workspaces", which brings me to a kind of GitHub Actions-inspired flow.

It reads the Issue and creates a Specification, which is editable.

Then you generate a Plan:

Finally it generates the files of that plan and you can choose whether to implement them or not and open a Pull Request against the main branch.

Implementation:

It makes a Pull Request:

Great, right? Well, except it didn't do any of it right.

  • It didn't add a route to the Flask app to expose this information
  • It didn't stick with the convention of storing the information in JSON files, writing it out to Markdown for some reason
  • It decided the way that it was going to reveal this information was to add it to the README
  • Finally it didn't get anywhere near all the machine types.
Before you ping me: yes, I tried to change the Proposed plan.

Baby Web App

So the app I've written here is primarily for my own use and it is very brain-dead simple. The entire thing is the work of roughly an afternoon of poking around while responding to Slack messages. However I figured this would be a good example of a simpler internal tool where you might trust AI to go a bit nuts, since nothing critical will explode if it messes up.

How the site works is it relies on the output of the gcloud CLI tool to generate JSON of all the IAM permissions for GCP and then outputs them so that I can put them into categories and quickly look for the one I want. I found the official documentation to be slow and hard to use, so I made my own. It's a Flask app, which means it is pretty stupid simple.

import os
from flask import *
from all_functions import *
import json


app = Flask(__name__)

@app.route('/')
def main():
    items = get_iam_categories()
    role_data = get_roles_data()
    return render_template("index.html", items=items, role_data=role_data)

@app.route('/all-roles')
def all_roles():
    items = get_iam_categories()
    role_data = get_roles_data()
    return render_template("all_roles.html", items=items, role_data=role_data)

@app.route('/search')
def search():
    items = get_iam_categories()
    return render_template('search_page.html', items=items)

@app.route('/iam-classes')
def iam_classes():
    source = request.args.get('parameter')
    items = get_iam_categories()
    specific_items = get_specific_roles(source)
    print(specific_items)
    return render_template("iam-classes.html", specific_items=specific_items, items=items)

@app.route('/tsid', methods=['GET'])
def tsid():
    data = get_tsid()
    return jsonify(data)

@app.route('/eu-eea', methods=['GET'])
def eueea():
    country_code = get_country_codes()
    return is_eea(country_code)


if __name__ == '__main__':
    app.run(debug=False)

I also have an endpoint I use during testing when I need to test some specific GDPR code, so I can curl it and see if the IP address is coming from the EU/EEA or not, along with a TSID generator I used for a brief period of testing that I don't need anymore. So again, pretty simple. It could be rewritten to be much better but I'm the primary user and I don't care, so whatever.

So effectively what I want to add is another route where I would also have a list of all the GCP machine types because their official documentation is horrible and unreadable. https://cloud.google.com/compute/docs/machine-resource

What I'm looking to add is something like this: https://gcloud-compute.com/

Look how information packed it is! My god, I can tell at a glance if a machine type is eligible for Sustained Use Discounts, how many regions it is in, Hour/Spot/Month pricing and the breakout per OS along with Clock speed. If only Google had a team capable of making a spreadsheet.

Nothing I enjoy more than nested pages with nested submenus that lack all the information I would actually need. I'm also not clear what a Tier_1 bandwidth is but it does seem unlikely that it matters for machine types when so few have it.

I could complain about how GCP organizes information all day but regardless the information exists. So I don't need anything to this level, but could I make a simpler version of this that gives me some of the same information? Seems possible.

How I Would Do It

First let's try to stick with the gcloud CLI approach.

gcloud compute machine-types list --format="json"

The only problem with this is that it does output the information I want, but for some reason it outputs a separate entry for every zone.

  {
    "creationTimestamp": "1969-12-31T16:00:00.000-08:00",
    "description": "4 vCPUs 4 GB RAM",
    "guestCpus": 4,
    "id": "903004",
    "imageSpaceGb": 0,
    "isSharedCpu": false,
    "kind": "compute#machineType",
    "maximumPersistentDisks": 128,
    "maximumPersistentDisksSizeGb": "263168",
    "memoryMb": 4096,
    "name": "n2-highcpu-4",
    "selfLink": "https://www.googleapis.com/compute/v1/projects/sybogames-artifact/zones/africa-south1-c/machineTypes/n2-highcpu-4",
    "zone": "africa-south1-c"
  }

I don't know why but sure. However I don't actually need every region so I can cheat here. gcloud compute machine-types list --format="json" gets me some of the way there.
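
To show what I mean, here is roughly how I would collapse that per-zone output into one entry per machine type. This is just a sketch built on the fields from the sample above, and it assumes the gcloud output has been dumped to a machine_types.json file (the filename is my own placeholder).

import json
from collections import OrderedDict

# Collapse the per-zone gcloud output into one entry per machine type.
# Assumes `gcloud compute machine-types list --format="json"` was saved
# to machine_types.json (placeholder name).
with open("machine_types.json") as f:
    zone_entries = json.load(f)

machine_types = OrderedDict()
for entry in zone_entries:
    name = entry["name"]
    # The hardware specs are identical across zones, so the first one wins.
    if name not in machine_types:
        machine_types[name] = {
            "name": name,
            "vcpus": entry["guestCpus"],
            "memory_gb": entry["memoryMb"] / 1024,
            "shared_cpu": entry["isSharedCpu"],
        }

print(f"{len(zone_entries)} zone entries -> {len(machine_types)} machine types")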

Where's the price?

Yeah so Google doesn't expose pricing through the API as far as I can tell. You can download what is effectively a global price list for your account at https://console.cloud.google.com/billing/[your billing account id]/pricing. That's a 13 MB CSV that includes what your specific pricing will be, which is what I would use. So then I would combine the information from my region with the information from the CSV and then output the values. However since I don't know whether the pricing I have is relevant to you, I can't really use this to generate a public webpage.

Web Scraping

So realistically my only option would be to scrape the pricing page here: https://cloud.google.com/compute/all-pricing. Except of course it was designed in such a way as to make it as hard to do that as possible.

Boy it is hard to escape the impression GCP does not want me doing large-scale cost analysis. Wonder why?

So there's actually a tool called gcosts which seems to power a lot of these sites running price analysis. However it relies on a pricing.yml file which is automatically generated weekly. The work involved in generating this file is not trivial:

 +--------------------------+  +------------------------------+
 | Google Cloud Billing API |  | Custom mapping (mapping.csv) |
 +--------------------------+  +------------------------------+
               ↓                              ↓
 +------------------------------------------------------------+
 | » Export SKUs and add custom mapping IDs to SKUs (skus.sh) |
 +------------------------------------------------------------+
               ↓
 +----------------------------------+  +-----------------------------+
 | SKUs pricing with custom mapping |  | Google Cloud Platform info. |
 |             (skus.db)            |  |           (gcp.yml)         |
 +----------------------------------+  +-----------------------------+
                \                             /
         +--------------------------------------------------+
         | » Generate pricing information file (pricing.pl) |
         +--------------------------------------------------+
                              ↓
                +-------------------------------+
                |  GCP pricing information file |
                |          (pricing.yml)        |
                +-------------------------------+

Alright so looking through the GitHub Action that generates this pricing.yml file, here, I can see how it works and how the file is generated. But also I can just skip that part and pull the latest for my use case whenever I regenerate the site. That can be found here.

Effectively with no assistance from AI, I have now figured out how I would do this:

  • Pull down the pricing.yml file and parse it
  • Take that information and output it to a simple table structure
  • Make a new route on the Flask app and expose that information
  • Add a step to the Dockerfile to pull in the new pricing.yml with every build, just so I'm not hammering the GitHub CDN all the time (a rough sketch of the first three steps follows this list).
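
Here's that sketch, written as a standalone file so it runs on its own; in the real app the route would just be added to main.py next to the others. To be clear about what I'm guessing at: the "compute"/"instance"/"cost" keys below are placeholders for whatever the actual pricing.yml layout is, and machine_types.html is a hypothetical template that would extend base.html like the other pages.

import yaml  # PyYAML, which also needs to land in requirements.txt
from urllib.request import urlretrieve
from flask import Flask, render_template

PRICING_URL = ("https://raw.githubusercontent.com/Cyclenerd/"
               "google-cloud-pricing-cost-calculator/master/pricing.yml")

app = Flask(__name__)  # in the real app, reuse the existing app from main.py

def load_pricing(path="pricing.yml"):
    # In the real setup the Dockerfile pulls this file at build time;
    # fetching it here just keeps the sketch self-contained.
    urlretrieve(PRICING_URL, path)
    with open(path) as f:
        return yaml.safe_load(f)

@app.route('/machine-types')
def machine_types():
    pricing = load_pricing()
    rows = []
    # WARNING: guessed keys -- verify against the actual pricing.yml schema.
    for name, info in pricing.get("compute", {}).get("instance", {}).items():
        rows.append({"name": name, "hourly": info.get("cost", {}).get("hour")})
    return render_template("machine_types.html", rows=rows)

If the YAML turns out to be big, parsing it once at startup instead of on every request is the obvious next tweak.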

Why Am I Saying All This?

So this is a perfect example of an operation that should be simple but, because the vendor doesn't want to make it simple, is actually pretty complicated. As we can now tell from the PR generated before, AI is never going to be able to understand all the steps we just walked through to understand how one actually gets the prices for these machines. We've also learned that, because of the hard work of someone else, we can skip a lot of the steps. So let's try it again.

Attempt 2

Maybe if I give it super specific information, it can do a better job.

You can see the issue here: https://github.com/matdevdug/gcp-iam-reference/issues/4

I think I've explained maybe what I'm trying to do. Certainly a person would understand this. Obviously this isn't the right way to organize this information, I would want to do a different view and sort by region and blah blah blah. However this should be easier for the machine to understand.

Note: I am aware that Copilot has issues making calls to the internet to pull files, even from GitHub itself. That's why I've tried to include a sample of the data. If there's a canonical way to pass the tool information inside of the issue let me know at the link at the bottom.

Results

So at first things looked promising.

It seems to understand what I'm asking and why I'm asking it. This is roughly the correct thing. The plan also looks ok:

You can see the PR it generated here: https://github.com/matdevdug/gcp-iam-reference/pull/5

So this is much closer but it's still not really "right". First, like most Flask apps, I have a base template that I want to include on every page: https://github.com/matdevdug/gcp-iam-reference/blob/main/templates/base.html

Then for every HTML file after that we extend the base:

{% extends "base.html" %}

{% block main %}

<style>
        table {
            border-collapse: collapse;
            width: 100%;
        }

        th, td {
            border: 1px solid #dddddd;
            text-align: left;
            padding: 8px;
        }

        tr:nth-child(even) {
            background-color: #f2f2f2;
        }
</style>

The AI doesn't understand that we've done this and is just re-implementing Bootstrap: https://github.com/matdevdug/gcp-iam-reference/pull/5/files#diff-a8e8dd2ad94897b3e1d15ec0de6c7cfeb760c15c2bd62d828acba2317189a5a5

It didn't add it to the menu bar, and there are actually a lot of pretty basic misses here. I wouldn't accept this PR from a person, but let's see if it works!

 => ERROR [6/8] RUN wget https://raw.githubusercontent.com/Cyclenerd/google-cloud-pricing-cost-calculator/master/pricing.yml -O pricing.yml                                             0.1s
------
 > [6/8] RUN wget https://raw.githubusercontent.com/Cyclenerd/google-cloud-pricing-cost-calculator/master/pricing.yml -O pricing.yml:
0.104 /bin/sh: 1: wget: not found

No worries, easy to fix.

Alright fixed wget, let's try again!

2024-06-18 11:18:57   File "/usr/local/lib/python3.12/site-packages/gunicorn/util.py", line 371, in import_app
2024-06-18 11:18:57     mod = importlib.import_module(module)
2024-06-18 11:18:57           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-18 11:18:57   File "/usr/local/lib/python3.12/importlib/__init__.py", line 90, in import_module
2024-06-18 11:18:57     return _bootstrap._gcd_import(name[level:], package, level)
2024-06-18 11:18:57            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
2024-06-18 11:18:57   File "<frozen importlib._bootstrap_external>", line 995, in exec_module
2024-06-18 11:18:57   File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
2024-06-18 11:18:57   File "/app/main.py", line 2, in <module>
2024-06-18 11:18:57     import yaml
2024-06-18 11:18:57 ModuleNotFoundError: No module named 'yaml'

Yeah I did anticipate this one. Alright let's add PyYAML so there's something to import. I'll give AI a break on this one, this is a dumb Python thing.

Ok so it didn't add it to the menu, it didn't follow the style conventions, but did it at least work? Also no.

I'm not sure how it could have done a worse job, to be honest. I understand what it did wrong and why this ended up like it did, but the work involved in fixing it exceeds the amount of work it would take for me to do it myself from scratch. The point of this was to give it a pretty simple concept (parse a YAML file) and see what it did.

Conclusion

I'm sure this tool is useful to someone on Earth. That person probably hates programming, gets no joy out of it, and is looking for something that could help them spend less time doing it. I am not that person. Having a tool that makes stuff that looks right but ends up broken is worse than not having the tool at all.

If you are a person maintaining an extremely simple thing with amazing test coverage, I guess go for it. Otherwise this is just a great way to get PRs that look right and completely waste your time. I'm sure there are ways to "prompt engineer" this better and if someone wants to tell me what I could do, I'm glad to re-run the test. However as it exists now, this is not worth using.

If you want to use it, here are my tips:

  • Your source of data must be inside of the repo, it doesn't like making network calls
  • It doesn't seem to go check any sort of requirements file for Python, so assume the dependencies are wrong
  • It understands Dockerfiles but doesn't check whether a binary is actually present, so add a check for that
  • It seems to do better with JSON than YAML (a quick conversion snippet follows below)
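
On that last point, if your data only exists as YAML, it's trivial to check a JSON copy into the repo and point the issue at that instead. A minimal conversion sketch (the file names are just examples):

import json
import yaml  # PyYAML, the same dependency the generated PR forgot about

# Convert a YAML data file to JSON so it can live inside the repo,
# since the tool copes better with JSON and dislikes network fetches.
with open("pricing.yml") as src:
    data = yaml.safe_load(src)

with open("pricing.json", "w") as dst:
    json.dump(data, dst, indent=2)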

Questions/comments/concerns: https://c.im/@matdevdug


Why Don't I Like Git More?

I've been working with git full-time for around a decade now. I use it every day, relying primarily on the command-line version. I've read a book, watched talks, practiced with it and in general use it effectively to get my job done. I even have a custom collection of hooks I install in new repos to help me stay on the happy path. I should like it, based on mere exposure effect alone. I don't.

I don't feel like I can always "control" what git is going to do, with commands sometimes resulting in unexpected behavior that is consistent with the way git works but doesn't track with how I think it should work. Instead, I need to keep a lot in my mind to get it to do what I want. "Alright, I want to move unstaged edits to a new branch. If the branch doesn't exist, I want to use checkout, but if it does exist I need to stash, checkout and then stash pop." "Now if the problem is that I made changes on the wrong branch, I want stash apply and not stash pop." "I need to bring in some cross-repo dependencies. Do I want submodules or subtree?"

I always need to deeply understand the difference between reset, revert, checkout, clone, pull, fetch, and cherry-pick when I'm working, even though some of those words mean the same thing in English. You need to remember that push and pull aren't opposites despite the name. When it comes to merging, you need to think through the logic of when you want rebase vs merge vs merge --squash. What is the direction of the merge? Shit, I accidentally deleted a file a while ago. I need to remember git rev-list -n 1 HEAD -- filename. Maybe I deleted a file and immediately realized it but accidentally committed the deletion. git reset --hard HEAD~1 will fix my mistake, but I need to remember what specifically --hard does when you use it and make sure it's the right flag to pass.

Nobody is saying this is impossible and clearly git works for millions of people around the world, but can we be honest for a second and acknowledge that this is massive overkill for the workflow I use at almost every job which looks as follows:

  • Make a branch
  • Push branch to remote
  • Do work on branch and then make a Pull Request
  • Merge PR, typically with a squash and merge cause it is easier to read
  • Let CI/CD do its thing.

I've never emailed a patch or restored a repo from my local copy. I don't spend weeks working offline only to attempt to merge a giant branch. We don't let repos get larger than 1-2 GB because then they become difficult to work with when I just need to change like three files and make a PR. None of the typical workflow benefits from the complexity of git.

More specifically, my workflow doesn't work offline. It relies on merge controls (Pull Requests) that aren't even a part of git. Most of that distributed history gets thrown away when I do a squash. I don't gain anything with my local disk being cluttered up with out-of-date repos I need to update before I start working anyway.

Now someone saying "I don't like how git works" is sort of like complaining about PHP in terms of being a new and novel perspective. Let me lay out what I think would be the perfect VCS and explore if I can get anywhere close to there with anything on the market.

Gitlite

What do I think a VCS needs (and doesn't need) to replace git for 95% of use cases?

  • Dump the decentralized model. I work with tons of repos, everyone works with tons of repos, I need to hit the server all the time to do my work anyway. The complexity of decentralization doesn't pay off and I'd rather lose it in exchange for being able to do the things in the next few bullets. If GitHub is down today I can't deploy anyway so I might as well embrace the server requirement as a perk.
  • Move a lot of the work server-side and on-demand. I wanna search for something in a repo. Instead of copying everything from the repo, running the search locally and then just accepting that it might be out of date, run it on the server and tell me what files I want. Then let me ask for just those files on-demand instead of copying everything.
  • I want big repos and I don't want to copy the entire thing to my disk. Just give me the stuff I want when I request it and then leave the rest of it up there. Why am I constantly pulling down hundreds of files when I work with like 3 of them?
  • Pull Request as a first-class citizen. We have the concept of branches and we've all adopted the idea of checks that a branch must pass before it can be merged. Let's make that a part of the CLI flow. How great would it be to be able to, inside the same tool, ask the server to "dry-run" a PR check and see if my branch passes? Imagine taking the functionality of the gh CLI and not making it platform-specific, a la kubectl with different hosted Kubernetes providers.
  • Endorsing and simplifying the idea of cross-repo dependencies. submodules don't work the way anybody wants them to. subtree does but taking work and pushing it back to the upstream dependency is confusing and painful to explain to people. Instead I want something like: https://gitmodules.com/
    • My server keeps it in sync with the remote server if I'm pulling from a remote server but I can pin the version in my repo.
    • My changes in my repo go to the remote dependency if I have permission
    • If there are conflicts they are resolved through a PR.
  • Build in better visualization tools. Let me kick out to a browser or whatever to more graphically explore what I'm looking at here. A lot of people use the CLI + a GUI tool to do this with git and it seems like something we could roll into one step.
  • Easier centralized commit message and other etiquette enforcement. Yes I can distribute a bunch of git hooks but it would be nice if when you cloned the repo you got all the checks to make sure that you are doing things the right way before you wasted a bunch of time only to get caught by the CI linter or commit message format checker. I'd also love some prompts like "hey this branch is getting pretty big" or "every commit must be of a type fix/feat/docs/style/test/ci whatever".
  • Read replica concept. I'd love to be able to point my CI/CD systems at a read replica box and preserve my primary VCS box for actual users. Primary server fires a webhook that triggers a build with a tag, hits the read replica which knows to pull from the primary if it doesn't have that tag. Be even more amazing if we could do some sort of primary/secondary model where I can set both in the config and if primary (cloud provider) is down I can keep pushing stuff up to somewhere that is backed up.

So I tried out a few competitors to see "is there any system moving more towards this direction".

SVN in 2024

My first introduction to version control was SVN (Subversion), which was pitched to me as "don't try to make a branch until you've worked here a year". However getting SVN to work as a newbie was extremely easy because it doesn't do much. Add, delete, copy, move, mkdir, status, diff, update, commit, log, revert, update -r, co -r were pretty much all the commands you needed to get rolling. Subversion has a very simple mental model of how it works which also assists with getting you started. It's effectively "we copied stuff to a file server and back to your laptop when you ask us to".

I have to say though, svn is a much nicer experience than I remember. A lot of the rough edges seem to have been sanded down and I didn't hit any of the old issues I used to. Huge props to the Subversion team for delivering great work.

Subversion Basics

The basic function is that your Subversion client commits all your files to the central server as a single atomic transaction. Whenever that happens, it creates a new version of the whole project, called a revision. This isn't a hash, it's just a number starting at zero, so there's no confusion for a new user about what is "newer" or "older". These are global numbers, not tied to a file, so a revision is the state of the whole world. Each individual file can be in one of 4 states:

  • Unchanged locally + current remote: leave it alone
  • Locally changed + current remote: to publish the change you need to commit it, an update will do nothing
  • Unchanged locally + out of date remotely: svn update will merge the latest copy into your working copy
  • Locally changed + out of date remotely: svn commit won't work, svn update will try to resolve the problem but if it can't then the user will need to figure out what to do.

It's nearly impossible to "break" SVN because pushing up doesn't mean you are pulling down. This means different files and directories can be set to different revisions, but only when you run svn update does the whole world true itself up to the latest revision.

Working with SVN looks as follows:

  • Ensure you are on the network
  • Run svn update to get your working copy up to latest
  • Make the changes you need, remembering not to use OS tooling to move or delete files and instead use svn copy and svn move so it knows about the changes.
  • Run svn diff to make sure you want to do what you are talking about doing
  • Run svn update again, resolve conflicts with svn resolve
  • Feeling good? Hit svn commit and you are done.

Why did SVN get dumped then? One word: branches.

SVN Branches

In SVN a branch is really just a directory that you stick wherever you are working. Typically you do it as a remote copy and then start working with it, so it looks more like you are copying a URL path to a new URL path. But to users branches just look like normal directories in the repository that you've made. Before SVN 1.4, merging a branch required a master's degree and a steady hand, but they added svn merge, which made it a bit easier.

Practically you are using svn merge against the mainline to keep your branch in sync, and then when you are ready to go, you run svn merge --reintegrate to push the branch back to master. Then you can delete the branch, but if you need to read the log, the URL of the branch will always work for that. This was particularly nice with ticket systems where the branch URL was just the ticket number. But you don't need to clutter things up forever with random directories.

In short a lot of the things that used to be wrong with svn branches aren't anymore.

What's wrong with it?

So SVN breaks down IME when it comes to automation. You need to build it all yourself. While you do have nuanced access control over different parts of a repo, in practice this wasn't often valuable. What you don't have is the ability to block someone from merging in a branch without some sort of additional controls or check. It also can place a lot of burden on the SVN server, since nobody seems to ever upgrade those even when you add a lot more employees.

Also the UIs are dated and the entire tooling ecosystem has started to rot from users leaving. I don't know if I could realistically recommend someone jump from git to svn right now, but I do think it has a lot of good ideas that move us closer to what I want. It would just need a tremendous amount of UI/UX investment on the web side to get it to where I would prefer using it over git. But I think if someone was interested in that work, the fundamental "bones" of Subversion are good.

Sapling

One thing I've heard from every former Meta engineer I've worked with is how much they miss their VCS. Sapling is that team letting us play around with a lot of those pieces, adapted for a more GitHub-centric world. I've been using it for my own personal stuff for a few months and have really come to fall in love with it. It feels like Sapling is specifically designed to be easy to understand, which is a delightful change.

A lot of the stuff is the same. You clone with sl clone, you check the status with sl status and you commit with sl commit. The differences that immediately stick out are the concept of stacks and the concept of the smartlog. So stacks are "collections of commits" and the idea is that from the command line I can issue PRs for those changes with sl pr submit with each GitHub PR being one of the commits. This view (obviously) is cluttered and annoying, so there's another tool that helps you see the changes correctly which is ReviewStack.

None of this makes a lot of sense unless I show you what I'm talking about. I made a new repo and I'm adding files to it. First I check the status:

❯ sl st
? Dockerfile
? all_functions.py
? get-roles.sh
? gunicorn.sh
? main.py
? requirements.in
? requirements.txt

Then I add the files:

sl add .
adding Dockerfile
adding all_functions.py
adding get-roles.sh
adding gunicorn.sh
adding main.py
adding requirements.in
adding requirements.txt

If I want a nicer web UI running locally, I run sl web and get this:

So I added all those files in as one Initial Commit. Great, let's add some more.

❯ sl
@  5a23c603a  4 seconds ago  mathew.duggan
│  feat: adding the exceptions handler
│
o  2652cf416  17 seconds ago  mathew.duggan
│  feat: adding auth
│
o  2f5b8ee0c  9 minutes ago  mathew.duggan
   Initial Commit

Now if I want to navigate this stack, I can just use sl prev (and sl next) to move up and down the stack:

sl prev 1
0 files updated, 0 files merged, 1 files removed, 0 files unresolved
[2f5b8e] Initial Commit

And that is also represented in my sl output

❯ sl
o  5a23c603a  108 seconds ago  mathew.duggan
│  feat: adding the exceptions handler
│
o  2652cf416  2 minutes ago  mathew.duggan
│  feat: adding auth
│
@  2f5b8ee0c  11 minutes ago  mathew.duggan
   Initial Commit

This also shows up in my local web UI

Finally the flow ends with sl pr to create Pull Requests. They are GitHub Pull Requests but they don't look like normal GitHub pull requests and you don't want to review them the same way. The tool you want to use for this is ReviewStack.

I stole their GIF because it does a good job

Why I like it

Sapling lines up with what I expect a VCS to do. It's easier to see what is going on, it's designed to work with a large team and it surfaces the information I want in a way that makes more sense. The commands make more sense to me and I've never found myself unable to do something I needed to do.

More specifically I like throwing away the idea of branches. What I have is a collection of commits that fork off from the main line of development, but I don't have a distinct thing I want named that I'm asking you to add. I want to take the main line of work and add a stack of commits to it and then I want someone to look at that collection of commits and make sure it makes sense and then run automated checks against it. The "branch" concept doesn't do anything for me and ends up being something I delete anyway.

I also like that it's much easier to undo work. This is something git makes really difficult to handle, while uncommit, unamend, unhide, and undo in Sapling just work better for me and always seem to result in the behavior I expected. Losing the staging area and focusing on easy-to-use commands is a more logical design.

Why you shouldn't switch

If I love Sapling so much, what's the problem? To get Sapling to the place I actually want it to be, I need more of the Meta special sauce running. Sapling works pretty well on top of GitHub, but what I'd love is to get the rest of the stack.

These seem to be the pieces needed to get all the goodness of the full system:

  • On-demand historical file fetching (remotefilelog, 2013)
  • File system monitor for faster working copy status (watchman, 2014)
  • In-repo sparse profile to shrink working copy (2015)
  • Limit references to exchange (selective pull, 2016)
  • On-demand historical tree fetching (2017)
  • Incremental updates to working copy state (treestate, 2017)
  • New server infrastructure for push throughput and faster indexes (Mononoke, 2017)
  • Virtualized working copy for on-demand currently checked out file or tree fetching (EdenFS, 2018)
  • Faster commit graph algorithms (segmented changelog, 2020)
  • On-demand commit fetching (2021)

I'd love to try all of this together (and since there is source code for a lot of it, I am working on trying to get it started) but so far I don't think I've been able to see the full Sapling experience. All these pieces together would provide a really interesting argument for transitioning to Sapling but without them I'm really tacking a lot of custom workflow on top of GitHub. I think I could pitch migrating wholesale from GitHub to something else, but Meta would need to release more of these pieces in an easier to consume fashion.

Scalar

Alright so until Facebook decides to release the entire package end to end, Sapling exists as a great stack on top of GitHub but not something I could (realistically) see migrating a team to. Can I make git work more the way I want to? Or at least can I make it less of a pain to manage all the individual files?

Microsoft has a tool that does this, VFS for Git, but it's Windows only so that does nothing for me. However they also offer a cross-platform tool called Scalar that is designed to "enable working with large repos at scale". It was originally a Microsoft technology and was eventually moved to git proper, so maybe it'll do what I want.

What scalar does is effectively set all the most modern git options for working with a large repo. So this is the built-in file-system monitor, multi-pack index, commit graphs, scheduled background maintenance, partial cloning, and clone mode sparse-checkout.

So what are these things?

  • The file system monitor is FSMonitor, a daemon that tracks changes to files and directories from the OS and adds them to a queue. That means git status doesn't need to query every file in the repo to find changes.
  • The multi-pack index lets git break the single pack file in the pack directory into multiple packs while keeping object lookups fast across all of them.
  • Commit graphs which from the docs:
    • " The commit-graph file stores the commit graph structure along with some extra metadata to speed up graph walks. By listing commit OIDs in lexicographic order, we can identify an integer position for each commit and refer to the parents of a commit using those integer positions. We use binary search to find initial commits and then use the integer positions for fast lookups during the walk."
  • Finally, clone mode sparse-checkout, which allows people to limit their working directory to specific files.

The purpose of this tool is to create an easy-mode for dealing with large monorepos, with an eye towards monorepos that are actually a collection of microservices. Ok but does it do what I want?

Why I like it

Well it's already built into git, which is great, and it is incredibly easy to use and get started with. Also it does some of what I want. When I took a bunch of existing repos and created one giant monorepo, the performance was surprisingly good. The sparse-checkout means I get to designate what I care about and what I don't, and it also solves the problem of "what if I have a giant directory of binaries that I don't want people to worry about" since it follows the same pattern matching as .gitignore.

Now what it doesn't do is radically change what git is. You could grow a repo much, much larger with these defaults set, but it's still handling a lot of things locally and requiring me to do the work. However I will say it makes a lot of my complaints go away. Combined with the gh CLI tool for PRs, I can cobble together a reasonably good workflow that I really like.

So while this is definitely the pattern I'm going to be adopting from now on (monorepo full of microservices where I manage scale with scalar), I think it represents how far you can modify git as an existing platform. This is the best possible option today but it still doesn't get me to where I want to be. It is closer though.

You can try it out yourself: https://git-scm.com/docs/scalar

Conclusion

So where does this leave us? Honestly, I could write another 5000 words on this stuff. It feels like as a field we get maddeningly close to cracking this code and then give up because we hit upon a solution that is mostly good enough. As workflows have continued to evolve, we haven't come back to touch this third rail of application design.

Why? I think the people not satisfied with git are told that it is a result of them not understanding it. It creates a feeling that if you aren't clicking with the tool, then the deficiency is with you and not with the tool. I also think programmers love decentralized designs because they encourage the (somewhat) false hope of portability. Yes I am entirely reliant on GitHub Actions, Pull Requests, GitHub access control, SSO, secrets and releases, but in a pinch I could move the actual repo itself to a different provider.

Hopefully someone decides to take another run at this problem. I don't feel like we're close to done and it seems like, from playing around with all these, that there is a lot of low-hanging optimization fruit that anyone could grab. I think the primary blocker would be you'd need to leave git behind and migrate to a totally different structure, which might be too much for us. I'll keep hoping it's not though.

Corrections/suggestions: https://c.im/@matdevdug


How to make the Mac better for developers

A what-if scenario

A few years ago I became aware of the existence of the Apple Pro Workflow teams. These teams exist inside Apple to provide feedback to the owners of various professional-geared hardware and software teams inside the company. This can be everything from advising the Mac Pro team on what kind of expansion a pro workstation might need all the way to feedback to the Logic Pro and Final Cut teams on ways to make the software fit better into conventional creative workflows.

I think the idea is really clever and wish more companies did something like this. It would be amazing, for instance, if I worked on accounting software, to have a few accountants in-house attempting to just use our software to solve problems. Shortening the feedback loop and relying less on customers reporting problems and concerns is a great way of demonstrating to people that you take their time seriously. However it did get me thinking: what would a Developer Pro Workflow team ask for?

The History of the MacBook Pro and me

I've owned MacBook Pros since they launched, with the exception of the disastrous TouchBar Mac with the keyboard that didn't work. While I use Linux every day for my work, I prefer to have macOS as a "base operating system". There are a lot of reasons for that, but I think you can break them down into a few big advantages to me:

  1. The operating system is very polished. "Well wait, maybe you haven't tried Elementary/Manjaro/etc." I have, and they're great for mostly volunteer efforts, but it's very hard to beat the overall sense of quality and "pieces fitting together" that comes with macOS. This is an OS maintained by what I suspect is a large team, as software development teams go. Improvements and optimizations are pretty common and there is a robust community around security. It doesn't just work, but for the most part it does keep ticking along.
  2. The hardware is able to be fixed around the world. One thing that bothered me about my switch to Linux: how do I deal with repairs and replacements. PCs change models all the time and, assuming I had a functional warranty, how long could I go without a computer? I earn my living on this machine, I can't wait six weeks for a charger or spend a lot of time fighting with some repair shop on what needs to be done to fix my computer. Worst case I can order the exact same model I have now from Apple, which is huge.
  3. The third-party ecosystem is robust, healthy and frankly time-tested. If I need to join a Zoom call, I don't want to think about whether Zoom is going to work. If someone calls me on Slack, I don't want to deal with quitting and opening it four times to get sound AND video to work. When I need to use third-party commercial software it's often non-optional (a customer requires that I use it) and I don't have a ton of time to debug it. With the Mac, commercial apps are just higher quality than on Linux. They get a lot more attention internally from companies and they just have a much higher success rate of working.

Suggestions from my fake Workflow team

Now that we've established why I like the platform, what could be done to improve it for developers? How could we take this good starting point and further refine it? These suggestions are not ranked.

  • An official way to run Linux

Microsoft changed the conversation in developer circles with Windows Subsystem for Linux. Suddenly Windows machines, previously considered inferior for developing applications for Linux, mostly had that limitation removed. The combination of that and the rise of containerization has really reduced the appeal of Macs for developers, as frankly you just don't use the base OS UNIX tooling as much as you used to.

If Apple wants to be serious about winning back developer mindshare, they wouldn't need to make their own distro. I think something like Canonical Multipass would work great, still giving us a machine that resembles the cloud computers we're likely going to be deploying to. Making Linux VMs more of a first-class citizen inside the Apple software stack, the way Microsoft has, would be a big win.

Well why would Apple do this if there are already third-party tools? Part of it is marketing: Apple can just tell a lot more users about something like this. Part of it is that it signals a seriousness about the effort and a commitment on their part to keep it working. I would be hesitant to rely heavily on third-party tooling around the Apple hyperkit VM system just because I know at any point, without warning, Apple could release new hardware that doesn't work with my workflow. If Apple took the tool in-house there would be at least some assurance that it would continue to work.

  • Podman for the Mac

Docker Desktop is no longer a product you should be using. With the changes to their licensing, Docker has decided to take a much more aggressive approach to monetization. It's their choice, obviously, but to me it invalidates using Docker Desktop for even medium-sized companies. The license terms (fewer than 250 employees AND less than $10 million in annual revenue) mean I can accidentally violate the license by having a really good year, or by merging with someone. I don't need that stress in my life and would never accept that kind of aggressive license in business-critical software unless there was absolutely no choice.

Apple could do a lot to help me by assisting with the porting of podman to the Mac. Container-based workflows aren't going anywhere and if Apple installed podman as part of the "Container Developer Tools" command, it would not only remove a lot of concerns from users about Docker licensing, but also would just be very nice. Again this is solvable by users running a Linux VM, but it's a clunky solution and not something I think of as very Apple-like. If there is a team sitting around thinking about the button placement in Final Cut Pro, making my life easier when running podman run for the 12th time would be nice as well.

  • Xcode is the worst IDE I've ever used and needs a total rewrite

I don't know how to put this better. I'm sure the Xcode team is full of nice people and I bet they work hard. It's the worst IDE I've ever used. Every single thing in Xcode is too slow. I'm talking storyboards, suggestions, every single thing drags. A new single view app, just starting from scratch, takes multiple minutes to generate.

You'll see this screen so often, you begin to wonder where "Report" even goes

Basic shit in Xcode doesn't work. Interface Builder is useless, autocomplete is random in terms of how well or poorly it will do that second, even git commands don't work all the time for me inside Xcode. I've never experienced anything like it with free IDEs, so the idea that Apple ships this software for actual people to use is shocking to me.

If you are a software developer who has never tried to make a basic app in Xcode and are curious what people are talking about, give it a try. I knew mobile development was bad, but spending a week working inside Xcode after years of PyCharm and Vim/Tmux, I got shivers imagining if I paid my mortgage with this horrible tool. Every 40 GB update you must just sweat bullets, worried that this will be the one that stops letting you update your apps.

I also cannot imagine that people inside Apple use this crap. Surely there's some sort of special Xcode build that's better or something, right? I would bet money a lot of internal teams at Apple are using AppCode to do their jobs, leaving Xcode hidden away. But I digress: Xcode is terrible and needs a total rewrite or just call JetBrains and ask how much they would charge to license every single Apple Developer account with AppCode, then move on with your life. Release a smaller standalone utility just for pushing apps to the App Store and move on.

It is embarrassing that Apple asks people to use this thing. It's hostile to developers and it's been too many years and too many billions of dollars made off the back of App developers at this point. Android Studio started out bad and has gotten ok. In that time period Xcode started pretty bad and remains pretty bad. Either fix it or give up on it.

  • Make QuickLook aware of the existence of code

QuickLook, the Mac functionality where you click on a file and hit spacebar to get a quick look at it, is a great tool. You can quickly go through a ton of files and take a quick glance at all of them, or at least creative professionals can. I want to as well. There's no reason macOS can't understand what YAML is, or how to show me JSON in a non-terrible way. There are third-party tools that do this but I don't see it as something Apple couldn't bring in-house. It's a relatively low commitment and would be just another nice daily improvement to my workflow.

Look how great that is!
  • An Apple package manager

I like Homebrew and I have nothing but deep respect for the folks who maintain it. But it's gone on too long now. The Mac App Store is a failure: the best apps don't live there, and the restrictions and sandboxing along with Apple's cut mean nobody is rushing to get listed there. Not all ideas work and it's time to give up on that one.

Instead just give us an actual package manager. It doesn't need to be insanely complicated, heck we can make it similar to how Apple managed their official podcast feed for years with a very hands-off approach. Submit your package along with a URL and Apple will allow users a simple CLI to install it along with declared dependencies. We don't need to start from scratch on this, you can take the great work from homebrew and add some additional validation/hosting.

How great would it be if I could write a "setup script" for new developers when I hand them a MacBook Pro that went out and got everything they needed, official Apple apps, third-party stuff, etc? You wouldn't need some giant complicated MDM solution or an internal software portal and you would be able to add some structure and vetting to the whole thing.

Just being able to share a "new laptop setup" bash script with the internet would be huge. We live in a world where more and more corporations don't take backups of work laptops and they use tooling to block employees from maintaining things like Time Machine backups. While it would be optimal to restore my new laptop from my old laptop, sometimes it isn't possible. Or maybe you just want to get rid of the crap built up over years of downloading stuff, opening it once and never touching it again.

To me this is a no-brainer, taking the amazing work from the homebrew folks, bringing them in-house and adding some real financial resources to it, with the goal of making a robust CLI-based package installation process for the Mac. Right now Homebrew developers actively have to work around Apple's restrictions to get their tool to work, a tool tons of people rely on every day. That's just not acceptable. Pay the homebrew maintainers a salary, give it a new Apple name and roll it out at WWDC. Everybody will love you and you can count the downloads as "Mac App Store" downloads during investor calls.

  • A git-aware text editor

I love BBedit so much I bought a sweatshirt for it and a pin I proudly display on my backpack. That's a lot of passion for a text editor, but BBedit might be the best software in the world. I know, you are actively getting upset with me as you read that, but hear me out. It just works, it never loses files and it saves you so much time. Search and replace functionality in BBedit has, multiple times, gotten me out of serious jams.

But for your average developer who isn't completely in love with BBedit, there are still huge advantages to Apple shipping even a much more stripped-down text editor with Markdown and Git functionality. While you likely have your primary work editor, be it vim for me or vscode for you, there are times when you need to make slight tweaks to a file or you just aren't working on something that needs the full setup. A stripped version of BBedit or something similar that would allow folks to write quick docs, small shell scripts or other things you do a thousand times a year would be great.

A great example of this is Code for Elementary OS:

This doesn't have to be the greatest text editor ever made, but it would be a huge step forward for the developer community to have something that works out of the box. Plus formats like Markdown are becoming more common in non-technical environments.

Why would Apple do this if third-party apps exist?

For the same reason they make iPhoto even though photo editors exist. Your new laptop from Apple should be functional out of the box for a wide range of tasks and adding a text editor that can work on files that lots of Apple professional users interact with on a daily basis underscores how seriously Apple takes that market. Plus maybe it'll form the core of a new Xcode lite, codenamed Xcode functional.

  • An Apple monitor designed for text

This is a bit more of a reach, I know that. But Apple could really use a less-expensive entry into the market and in the same way the Pro Display XDR is designed for the top-end of the video editing community, I would love something like that but designed more for viewing text. It wouldn't have nearly the same level of complexity but it would be nice to be able to get a "full end to end" developer solution from Apple again.

IPS panels would be great for this, as the color accuracy and great viewing angles would be ideal for development, and you won't care about them being limited to 60Hz. A 27-inch panel is totally sufficient and I'd love to be able to order them at the same time as my laptop for a new developer. There are lots of third-party alternatives but frankly the endless word soup that is monitors these days is difficult to even begin to track. I love my Dell UltraSharp U2515H but I don't even know how long I can keep buying them or how to find the successor to it when it gets decommissioned.

Actually I guess it's already been decommissioned and also no monitors are for sale.

What did I miss?

What would you add to the MacBook Pro to make it a better developer machine? Let me know at @duggan_mathew on twitter.


Don't Make My Mistakes: Common Infrastructure Errors I've Made

One surreal experience as my career has progressed is the intense feeling of deja vu you get hit with during meetings. From time to time, someone will mention something and you'll flash back to the same meeting you had about this a few jobs ago. A decision was made then, a terrible choice that ruined months of your working life. You spring back to the present day, almost bolting out of your chair to object, "Don't do X!". Your colleagues are startled by your intense reaction, but they haven't seen the horrors you have.

I wanted to take a moment and write down some of my worst mistakes, as a warning to others who may come later. Don't worry, you'll make all your own new mistakes instead. But allow me a moment to go back through some of the most disastrous decisions or projects I ever agreed to (or even fought to do, sometimes).

Don't migrate an application from the datacenter to the cloud

Ah the siren call of cloud services. I'm a big fan of them personally, but applications designed for physical datacenters rarely make the move to the cloud seamlessly. I've been involved now in three attempts to do large-scale migrations of applications written for a specific datacenter to the cloud and every time I have crashed upon the rocks of undocumented assumptions about the environment.

Me encountering my first unsolvable problem with a datacenter to cloud migration

As developers write and test applications, they develop expectations of how their environment will function. How do servers work, what kind of performance does my application get, how reliable is the network, what kind of latency can I expect, etc. These are reasonable expectations for anyone to develop after working inside of an environment for years, but it means when you package up an application and run it somewhere else, especially an old application, weird things happen. Errors that you never encountered before start to pop up and all sorts of bizarre architectural decisions need to be made to try and allow for this transition.

Soon you've eliminated a lot of the value of the migration to begin with, maybe even doing something terrible like connecting your datacenter to AWS with Direct Connect in an attempt to bridge the two environments seamlessly. Your list of complicated decisions starts to grow and grow, hitting more and more edge cases of your cloud provider. Inevitably you find something you cannot move and you are now stuck with two environments, a datacenter you need to maintain and a new cloud account. You lament your hubris.

Instead....

Port the application to the cloud. Give developers an environment totally isolated from the datacenter, let them port the application to the cloud and then schedule 4-8 hours of downtime for your application. This will allow persistence layers to cut over and then you can change your DNS entries to point to your new cloud presence. The attempt to prevent this downtime will drown you in bad decision after bad decision. Better to just bite the bullet and move on.

Or even better, develop your application in the same environment you expect to run it in.

Don't write your own secrets system

I don't know why I keep running into this. For some reason, organizations love to write their own secrets management system. Often these are applications written by the infrastructure teams, commonly either environmental variable injection systems or some sort of RSA-key based decrypt API call. Even I have fallen victim to this idea, thinking "well certainly it can't be that difficult".

For some reason, maybe I had lost my mind or something, I decided we were going to manage our secrets inside of a PostgREST application I would manage. I wrote an application that would generate and return JWTs back to applications depending on a variety of criteria. These would allow them to access their secrets in a totally secure way.

Now in defense of PostgREST, it worked well at what it promised to do. But the problem of secrets management is more complicated than it first appears. First we hit the problem of caching: how do you keep from hitting this service a million times an hour while still maintaining some concept of the server as the source of truth? This was solvable through some Nginx configs but was something I should have thought of.

Then I smacked myself in the face with the rake of rotation. It was trivial to push a new version, but secrets aren't usually versioned to a client. I authenticate with my application and I see the right secrets. But during a rotation period there are two right secrets, which is obvious when I say it but hadn't occurred to me when I was writing it. Again, not a hard thing to fix, but as time went on and I encountered more and more edge cases for my service, I realized I had made a huge mistake.

The reality is secrets management is a classic high-risk, low-reward service. It's not gonna help my customers directly, it won't really impress anyone in leadership that I run it, it will consume a lot of my time debugging it and it's going to need a lot of domain-specific knowledge in terms of running it. I had to rethink a lot of the pieces as I went, everything from multi-region availability (which, like, syncing across regions is a drag) to hardening the service.

Instead....

Just use AWS Secrets Manager or Vault. I prefer Secrets Manager, but whatever you prefer is fine. Just don't write your own: there are a lot of edge cases and not a lot of benefits. You'll be the reason all the applications are down, and the cost savings at the end of the day are minimal.
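To give a sense of how little work the buy option is, here's a minimal sketch of the whole lifecycle with the AWS CLI (the secret name and value are made up for illustration):

# store a secret once
aws secretsmanager create-secret --name prod/db/password --secret-string 'hunter2'

# applications (or people) read it back at runtime
aws secretsmanager get-secret-value --secret-id prod/db/password --query SecretString --output text

Caching, rotation, auditing and multi-region replication are all problems on their side of the fence now, which is exactly the stuff that ate my time when I rolled my own.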

Don't run your own Kubernetes cluster

I know, you have the technical skill to do it. Maybe you absolutely love running etcd and setting up the various certificates. Here is a very simple decision tree when thinking about "should I run my own k8s cluster or not":

Are you a Fortune 100 company? If no, then don't do it.

The reason is you don't have to, and letting someone else run it allows you to take advantage of all the great functionality they add. AWS EKS has some incredible features, from support for AWS SSO in your kubeconfig file to letting you use IAM roles inside of ServiceAccounts for pod access to AWS resources. On top of all of that, they will run your control plane for less than $1000 a year. Setting all that aside for a moment, let's talk frankly for a second.

One advantage of the cloud is other people beta test upgrades for you.

I don't understand why people don't talk about this more. Yes you can run your own k8s cluster pretty successfully, but why? I have literally tens of thousands of beta testers going ahead of me in line to ensure EKS upgrades work. On top of that, I get tons of AWS engineers working on it. If I'm going to run my infrastructure in AWS anyway, there's no advantage to running my own cluster except that I can maintain the illusion that at some point I could "switch cloud providers". Which leads me on to my next point.

Instead....

Let the cloud provider run it. It's their problem now. Focus on making your developers' lives easier.
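For a sense of scale, here is roughly what "letting AWS run it" looks like in practice, assuming eksctl is installed and using a made-up cluster name of demo:

# stand up a managed cluster (control plane, networking, a node group)
eksctl create cluster --name demo --region eu-west-1

# wire up kubectl access
aws eks update-kubeconfig --name demo --region eu-west-1

# give a pod's ServiceAccount an IAM role instead of baking in credentials
eksctl create iamserviceaccount --cluster demo --namespace default --name app \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess --approve

That last command is the IAM-roles-for-ServiceAccounts feature mentioned above, and it's the kind of thing you don't get for free when you run your own control plane.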

Don't Design for Multiple Cloud Providers

This one irks me on a deeply personal level. I was convinced by a very persuasive manager that we needed to ensure we had the ability to switch cloud providers. Against my better judgement, I fell in with the wrong crowd. We'll call them the "premature optimization" crowd.

Soon I was auditing new services for "multi-cloud compatibility", ensuring that instead of using the premade SDKs from AWS, we maintained our own. This would allow us to, at the drop of a hat, switch between them in the unlikely event this company exploded in popularity and we were big enough to somehow benefit from this migration. I guess in our collective minds this was some sort of future proofing or maybe we just had delusions of grandeur.

What we were actually doing is the worst thing you can do, which is just being a pain in the ass for people trying to ship features to customers. If you are in AWS, don't pretend that there is a real need for your applications to be deployable to multiple clouds. If AWS disappeared tomorrow, yes you would need to migrate your applications. But the probability of AWS outliving your company is high and the time investment of maintaining your own cloud agnostic translation layers is not one you are likely to ever get back.

We ended up with a bunch of libraries that were never up to date with the latest features, meaning developers were constantly reading about some great new feature of AWS they weren't able to use or try out. Tutorials obviously didn't work with our awesome custom library and we never ended up switching cloud providers or even dual deploying because financially it never made sense to do it. We ended up just eating a ton of crow from the entire development team.

Instead....

If someone says "we need to ensure we aren't tied to one cloud provider", tell them that ship sailed the second you signed up. Similar to a data center, an application designed, tested and run successfully for years in AWS is likely to pick up some expectations and patterns of that environment. Attempting to optimize for agnostic design is losing a lot of the value of cloud providers and adding a tremendous amount of busy work for you and everyone else.

Don't be that person. Nobody likes the person who is constantly saying "no we can't do that" in meetings. If you find yourself in a situation where migrating to a new provider makes financial sense, set aside at least 3 months per application for testing and porting. See if it still makes financial sense after that.

Cloud providers are a dependency, just like a programming language. You can't arbitrarily switch them without serious consideration and even then, often "porting" is the wrong choice. Typically you want to practice like you play, developing in the same environment your customers will use your product in.

Don't let alerts grow unbounded

I'm sure you've seen this at a job. There is a TV somewhere in the office and on that TV is maybe a graph or CloudWatch alerts or something. Some alarm will trigger at an interval and be displayed on that TV, which you will be told to ignore because it isn't a big deal. "We just want to know if that happens too much" is often the explanation.

Eventually these start to trickle into on-call alerts, which page you. Again you'll be told they are informative, often by the team that owns that service. As enough time passes, it becomes unclear what the alert was supposed to tell you; new people just get confusing signals about whether an alert is important or not. You'll eventually have an outage because the "normal" alert will fire under an unusual condition, leading someone to silence the page and go back to sleep.

I have done this, where I even defended the system on the grounds of "well surely the person who wrote the alert had some intention behind it". I should have been on the side of "tear it all down and start again", but instead I chose a weird middle ground. It was the wrong decision for me years ago and it's the wrong decision for you today.

Instead....

If an alert pages someone, it has to be a situation in which the system could never recover on its own. It needs to be serious and it cannot be something where the failure is built into the application design. An example of that would be "well sometimes our service needs to be restarted, just SSH in and restart it". Nope, not an acceptable reason to wake me up. If your service dies like that, figure out a way to bring it back.

Don't allow for the slow, gradual pollution of your life with garbage alerts, and feel free to declare bankruptcy on all alerts in a platform if they start to stink. If a system emails you 600 times a day, it's not working. If there is a Slack channel so polluted with garbage that nobody goes in there, it isn't working as an alert system. That isn't how human attention works: you can't spam someone constantly with "not-alerts" and then suddenly expect them to carefully parse every string of your alert email and realize "wait this one is different".
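For contrast, here is the shape of an alert I'm happy to be woken up for: the system is demonstrably unable to recover on its own. A sketch with the AWS CLI, where the alarm name, SNS topic, target group and load balancer values are all placeholders:

# page only when there have been zero healthy hosts behind the load balancer for 3 straight minutes
aws cloudwatch put-metric-alarm \
  --alarm-name prod-no-healthy-hosts \
  --namespace AWS/ApplicationELB \
  --metric-name HealthyHostCount \
  --dimensions Name=TargetGroup,Value=targetgroup/prod/abc123 Name=LoadBalancer,Value=app/prod/def456 \
  --statistic Minimum --period 60 --evaluation-periods 3 \
  --threshold 1 --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:page-the-oncall

If that fires, nobody argues about whether it matters.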

Don't write internal CLI tools in Python

I'll keep this one short and sweet.

Nobody knows how to correctly install and package Python apps. If you write an internal tool in Python, it either needs to be totally portable or you should just write it in Go or Rust instead. Save yourself a lot of heartache as people struggle to install the right thing.
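The whole distribution story for a Go tool is a single static binary, which is the point. A quick sketch, assuming the tool lives in ./cmd/mytool:

# build a static binary with no runtime dependencies
CGO_ENABLED=0 go build -o mytool ./cmd/mytool

# "installation" is copying one file
scp mytool some-host:/usr/local/bin/mytool

Nobody has to think about virtualenvs, interpreter versions or which pip they ran.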


The hunt for a better Dockerfile

Source

Time to thank Dockerfiles for their service and send them on their way

For why I don't think Dockerfiles are good enough anymore, click here. After writing about my dislike of Dockerfiles and what I think is a major regression in the tools Operations teams had to work with, I got a lot of recommendations of things to look at. I'm going to try to do a deeper look at some of these options and see if there is a reasonable option to switch to.

My ideal solution would be an API I could hit and just supply the container parameters to. This would let me standardize the process with the same language I use for the app, write some tests around the containers and hook in things like CI logging conventions and exception tracking.

BuildKit

BuildKit is a child of the Moby project, an open-source project designed to advance the container space to allow for more specialized uses for containers. Judging from its about page, it seems to be staffed by some Docker employees and some folks from elsewhere in the container space.

What is the Moby project? Honestly I have no idea. They have on their list of projects high-profile things like containerd, runc, etc. You can see the list here. This seems to be the best explanation of what the Moby project is:

Docker uses the Moby Project as an open R&D lab, to experiment, develop new components, and collaborate with the ecosystem on the future of container technology. All our open source collaboration will move to the Moby project.

My guess is the Moby project is how Docker gets involved in open-source projects and in turn open-sources some elements of its stack. Like many things Docker does, it is a bit inscrutable from the outside. I'm not exactly sure who staffs most of this project or what their motivations are.

BuildKit walkthrough

BuildKit is built around a totally new model for building images. At its core is a new format for defining builds called LLB. It's an intermediate binary format that uses the Go Marshal function to serialize your data. This new model allows for actual concurrency in your builds, as well as a better model for caching. You can see more about the format here.

LLB is really about decoupling the container build process from Dockerfiles, which is nice. This is done through the use of Frontends, of which Docker is one of many. You run a frontend to convert a build definition (most often a Dockerfile) into LLB. This concept seems strange, but if you look at the Dockerfile frontend you will get a better idea of the new options open to you. That can be found here.

Of most interest to most folks is the inclusion of a variety of different mount types. There is --mount=type=cache, which takes advantage of the more precise caching LLB enables to persist a cache between build invocations. There is also --mount=type=secret, which allows you to give the build access to secrets while ensuring they aren't baked into the image. Finally there is --mount=type=ssh, which uses SSH agent forwarding to let the build connect to things like git over SSH using the host's keys.
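As a taste of what that looks like from the command line: a Dockerfile containing a line like RUN --mount=type=secret,id=npm_token npm install can be built with the secret supplied at build time. A sketch, where the token file path and image name are made up:

# BuildKit must be enabled for the new mount types to work
DOCKER_BUILDKIT=1 docker build \
  --secret id=npm_token,src=$HOME/.npmrc \
  --ssh default \
  -t my-app .

The secret is mounted only for the duration of that RUN step and never ends up in a layer, which is a big improvement over the old ARG-based hacks.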

In theory this allows you to build images using a ton of tooling. Any language that supports Protocol Buffers could be used to make images, meaning you can move your entire container build process to a series of scripts. I like this a lot, not only because the output of the build process gives you a lot of precise data about what was done, but you can add testing and whatever else.

In practice, while many Docker users are currently enjoying the benefits of LLB and BuildKit, this isn't a feasible tool to use right now to build containers using Go unless you are extremely dedicated to your own tooling. The basic building blocks are still shell commands you are executing against the frontend of Docker, although at least you can write tests.

If you are interested in what a Golang Dockerfile looks like, they have some good examples here.

buildah

With the recent announcement of Docker Desktop's new licensing restrictions, along with the IP-based rate limiting of image pulls from Docker Hub, community opinion of Docker has never been lower. There has been an explosion of interest in Docker alternatives, with podman being the frontrunner. Along with podman is a docker build alternative called buildah. I started playing around with the two for an example workflow and have to say I'm pretty impressed.

podman is a big enough topic that I'll need to spend more time on it another time, but buildah is the build system for podman. It actually predates podman and, in my time testing it, offers substantial advantages over docker build with conventional Dockerfiles. The primary way that you use buildah is through writing shell scripts to construct images, but with much more precise control over layers. I especially enjoyed being able to start with an empty container that is just a directory and build up from there.
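To give a feel for the workflow, here is a rough sketch of a buildah build script (the image contents are invented, a trivial Python app):

#!/usr/bin/env bash
set -euo pipefail

# start from a base image and get a handle on the working container
ctr=$(buildah from docker.io/library/debian:bookworm)

# copy the app in and install its runtime
buildah copy "$ctr" . /app
buildah run "$ctr" -- apt-get update
buildah run "$ctr" -- apt-get install -y --no-install-recommends python3

# set the runtime configuration and commit the image
buildah config --workingdir /app --cmd "python3 app.py" "$ctr"
buildah commit "$ctr" my-app:latest

Because it's just a shell script, you can branch, loop and test it like any other piece of infrastructure code, which is exactly what Dockerfiles make awkward.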

If you want to integrate buildah into your existing flow, you can also use it to build containers from Dockerfiles. Red Hat has a series of good tutorials to get you started you can check out here. In general the whole setup works well and I like moving away from the brittle Dockerfile model towards something more sustainable and less dependent on Docker.

PouchContainer

I had never heard of PouchContainer before, an offering from Alibaba, but playing around with it has been eye-opening. It's much more ambitious than a simple Docker replacement, instead adding a ton of shims to various container technologies. The following diagram lays out just what we're talking about here:

The CLI, called just pouch, includes some standard options like building from a Dockerfile with pouch build. However this tool is much more flexible in terms of where you can get containers from, including concepts like pouch load which allows you to load up a tar file full of containers it will parse. Outside of just the CLI, you have a full API in order to do all sorts of things. Interested in creating a container with an API call? Check this out.

There is also a cool technology they call a "rich container", which seems to be designed for legacy applications where the model of one process running isn't sufficient and you need to kick off a nested series of processes. They aren't wrong, this is actually a common problem when migrating legacy applications to containers and it's not a bad solution to what is an antipattern. You can check out more about it here.

PouchContainer is designed around Kubernetes as well, allowing it to serve as the container plugin for k8s without needing to recompile. This, combined with a P2P model for sharing containers using Dragonfly, means this is really a fascinating approach to the creation and distribution of containers. I'm surprised I've never heard of it before, but alas, looking at the repo it doesn't look like it's currently maintained.

Going through what is here though, I'm very impressed with the ambition and scope of PouchContainer. There are some great ideas here, from models around container distribution to easy-to-use APIs. If anyone has more information about what happened here or if there is a sandbox somewhere I can use to learn more about this, please let me know on Twitter.

Packer

Packer, for those unfamiliar with it, is maybe the most popular tool out there for the creation of AMIs. These are the images that are used when an EC2 instance is launched, allowing organizations to install whatever software they need for things like autoscaling groups. Packer uses two different concepts for the creation of images: builders, which define the platform the image is made for, and provisioners, which define what gets installed on it.

This allows for organizations that are using things like Ansible to configure boxes after they launch to switch to baking the AMI before the instance is started. This saves time and involves less overhead. What's especially interesting for us is this allows us to set up Docker as a builder, meaning we can construct our containers using any technology we want.

How this works in practice is we can create a list of provisioners in our packer json file like so:

"provisioners": [{
        "type": "ansible",
        "user": "root",
        "playbook_file": "provision.yml"
    }],

So if we want to write most of our configuration in Ansible and construct the whole thing with Packer, that's fine. We can also use shell scripts, Chef, Puppet or whatever other tooling we like. In practice you define a provisioner with whatever you want to run, then a post-processor that pushes the image to your registry. All done.
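Assuming the provisioner above lives in a file called packer.json next to provision.yml, running it is about as boring as you'd hope:

# sanity check the template, then build the image
packer validate packer.json
packer build packer.json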

Summary

I'm glad that options exist for organizations looking to streamline their container experience. If I were starting out today and had existing Ansible/Puppet/Chef infrastructure as code, I would go with Packer. It's easy to use and allows you to keep what you have with some relatively minor tweaks. If I were starting out fresh, I'd see how far I could get with buildah. There seems to be more community support around it and Docker as a platform is not looking particularly robust at this particular moment.

While I strongly prefer using Ansible for creating containers vs Dockerfiles, I think the closest to the "best" solution is the BuildKit Go client approach. You would still get the benefits of BuildKit while being able to very precisely control exactly how a container is made, cached, etc. However the buildah process is an excellent middle ground, allowing for shell scripts to create images that, ideally, contain the optimizations inherent in the newer process.

Outstanding questions I would love the answers to:

  • Is there a library or abstraction that allows for a less complicated time dealing with buildkit? Ideally something in Golang or Python, where we could more easily interact with it?
  • Or are there better docs for how to build containers in code with buildkit that I missed?
  • With buildah are there client libraries out there to interact with its API? Shell scripts are fine, but again ideally I'd like to be writing critical pieces of infrastructure in a language with some tests and something where the amount of domain specific knowledge would be minimal.
  • Is there another system like PouchContainer that I could play around with? An API that allows for the easy creation of containers through standard REST calls?

Know the answers to any of these questions or know of a Dockerfile alternative I missed? I'd love to know about it and I'll test it. Twitter


TIL I've been changing directories incorrectly

One of my first tasks when I start at a new job is making a series of cd aliases in my profile. These are usually to the git repositories where I'm going to be doing the most work, but it's not an ideal situation because obviously sometimes I work with repos only once in a while. This is to avoid endless cd ../../../ or starting from my home directory every time.

I recently found out about zoxide and after a week of using it I'm not really sure why I would ever go back to shortcuts. It basically learns the paths you use, allowing you to say z directory_name or z term_a term_b. Combined with fzf you can really zoom around your entire machine with no manually defined shortcuts. Huge fan.
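Setup is one line in your shell profile, and day-to-day usage is about as terse as it gets. A quick sketch, assuming bash and a directory you've visited before with "infra" in its path:

# add to ~/.bashrc (zsh and fish have equivalent init commands)
eval "$(zoxide init bash)"

# then
z infra     # jump to the highest-ranked match for "infra"
zi infra    # interactively pick between matches (uses fzf if installed)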


DevOps Crash Course - Section 2: Servers

via GIFER

Section 2 - Server Wrangling

For those of you just joining us you can see part 1 here.

If you went through part 1, you are a newcomer to DevOps dropped into the role with maybe not a lot of prep. We now have a good idea of what is running in our infrastructure, how it gets there and we have an idea of how secure our AWS setup is from an account perspective.

Next we're going to build on top of that knowledge and start the process of automating more of our existing tasks, allowing AWS (or really any cloud provider) to start doing more of the work. This provides a more reliable infrastructure and means we can focus more on improving quality of life.

What matters here?

Before we get into the various options for how to run your actual code in production, let's take a step back and talk about what matters. Whatever choice you and your organization end up making, here are the important questions we are trying to answer.

  • Can we, demonstrably and without human intervention, stand up a new server and deploy code to it?
  • Do we have an environment or way for a developer to precisely replicate the environment that their code runs in production in another place not hosting or serving customer traffic?
  • If one of our servers becomes unhealthy, do we have a way for customer traffic to stop being sent to that box and ideally for that box to be replaced without human intervention?

Can you use "server" in a sentence?

Alright, you caught me. It's becoming increasingly difficult to define what we mean when we say "server". To me, a server is still the piece of physical hardware running in a data center and the software for a particular host is a virtual machine. You can think of EC2 instances as virtual machines, different from docker containers in ways we'll discuss later. For our purposes EC2 instances are usually what we mean when we say servers. These instances are defined through the web UI, CLI or Terraform and launched in every possible configuration and memory size.

Really a server is just something that allows customers to interact with our application code. An API Gateway connected to Lambdas still meets my internal definition of server, except in that case we are completely hands off.

What are we dealing with here

In conversations with DevOps engineers who have had the role thrust on them or perhaps dropped into a position where maintenance hasn't happened a lot, a common theme has emerged. They often are placed in charge of a stack that looks something like this:

A user flow normally looks something like this:

  1. A DNS request is made to the domain and it points towards (often) a classic load balancer. This load balancer handles SSL termination and then forwards traffic on to servers running inside of a VPC on port 80.
  2. These servers are often running whatever Linux distribution was installed on them when they were made. Sometimes they are in autoscaling groups, but often they are not. These servers normally have Nginx installed along with something called uWSGI. Requests come in, they are handed off by Nginx to the uWSGI workers and these interact with your application code.
  3. The application will make calls to the database server, often running MySQL because that is what the original developer knew to use. Sometimes these are running on a managed database service and sometimes they are not.
  4. Often with these sorts of stacks the deployment is something like "zip up a directory in the CICD stack, copy it to the servers, then unzip and move to the right location".
  5. There is often some sort of script to remove a box from the load balancer at deploy time and then re-add it.

Often there are additional AWS services being used, things like S3, Cloudfront, etc but we're going to focus on the servers right now.

This stack seems to work fine. Why should I bother changing it?

Configuration drift. The inevitable result of launching different boxes at different times and hand-running commands on them which make them impossible to test or reliably replicate.

Large organizations go through lots of work to ensure every piece of their infrastructure is uniformly configured. There are tons of tools to do this, from things like Ansible Tower, Puppet, Chef, etc to check each server on a regular cycle and ensure the correct software is installed everywhere. Some organizations rely on building AMIs, building whole new server images for each variation and then deploying them to auto-scaling groups with tools like Packer. All of this work is designed to try and eliminate differences between Box A and Box B. We want all of our customers and services to be running on the same platforms we have running in our testing environment. This catches errors in earlier testing environments and means our app in production isn't impacted.

The problem we want to avoid is the nightmare one for DevOps people, which is where you can't roll forward or backwards. Something in your application has broken on some or all of your servers but you aren't sure why. The logs aren't useful and you didn't catch it before production because you don't test with the same stack you run in prod. Now your app is dying and everyone is trying to figure out why, eventually discovering some sort of hidden configuration option or file that was set a long time ago that now causes problems.

These sorts of issues plagued traditional servers for a long time, resulting in bespoke hand-crafted servers that could never be replaced without tremendous amounts of work. You either destroyed and remade your servers on a regular basis to ensure you still could or you accepted the drift and tried to ensure that you had some baseline configuration that would launch your application. Both of these scenarios suck and for a small team it's just not reasonable to expect you to run that kind of maintenance operation. It's too complicated.

What if there was a tool that let you test exactly like you run your code in production? It would be easy to use, work on your local machine as well as on your servers and would even let you quickly audit the units to see if there are known security issues.

This all just sounds like Docker

It is Docker! You caught me, it's just containers all the way down. You've probably used Docker containers a few times in your life, but running your applications inside of containers is the vastly preferred model for how to run code. It simplifies testing, deployments and dependency management, allowing you to move all of it inside of git repos.

What is a container?

Let's start with what a container isn't. We talked about virtual machines before. Docker is not a virtual machine, it's a different thing. The easiest way to understand the difference is to think of a virtual machine like a physical server. It isn't, but for the most part the distinction is meaningless for your normal day to day life. You have a full file system with a kernel, users, etc.

Containers are just what your application needs to run. They are just ways of moving collections of code around along with their dependencies. Often folks new to DevOps will think of containers in the wrong paradigm, asking questions like "how do you back up a container" or using them as permanent stores of state. Everything in a container is designed to be temporary. It's just a different way of running a process on the host machine. That's why if you run ps fauxx | grep name_of_your_service on the host machine you still see it.
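You can see this for yourself in about thirty seconds. A quick sketch using the stock nginx image:

# run a container, then look at the host's process list
docker run -d --name web nginx
ps faux | grep nginx     # the nginx master and workers show up as ordinary host processes
docker rm -f web

There is no hidden machine in there, just a process with a restricted view of the system.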

Are containers my only option?

Absolutely not. I have worked with organizations that manage their code in different ways outside of containers. Some of the tools I've worked with have been NPM for Node applications and RPMs for various applications linked together with Linux dependencies. Here are the key questions when evaluating something other than Docker containers:

  • Can you reliably stand up a new server using a bash script + this package? Typically bash scripts should be under 300 lines, so if we can make a new server with a script like that and some "other package" I would consider us to be in ok shape.
  • How do I roll out normal security upgrades? All linux distros have constant security upgrades, how do I do that on a normal basis while still confirming that the boxes still work?
  • How much does an AWS EC2 maintenance notice scare me? This is where AWS or another cloud provider emails you and says "we need to stop one of your instances randomly due to hardware failures". Is it a crisis for my business or is it a mostly boring event?
  • If you aren't going to use containers but something else, just ensure there is more than one source of truth for that.
  • For Node I have had a lot of success with Verdaccio as an NPM cache: https://verdaccio.org/
  • However in general I recommend paying for Packagecloud and pushing whatever package there: https://packagecloud.io/

How do I get my application into a container?

I find the best way to do this is to sit down with the person who has worked on the application the longest. I will spin up a brand new, fresh VM and say "can you walk me through what is required to get this app running?". Remember this is something they likely have done on their own machines a few hundred times, so they can pretty quickly recite the series of commands needed to "run the app". We need to capture those commands because they are how we write the Dockerfile, the template for how we make our application.

Once you have the list of commands, you can string them together in a Dockerfile.

How Do Containers Work?

It's a really fascinating story.  Let's teleport back in time. It's the year 2000, we have survived Y2K, the most dangerous threat to human existence at the time. FreeBSD rolls out a new technology called "jails". FreeBSD jails were introduced in FreeBSD 4.X and are still being developed now.

Jails are layered on top of chroot, which allows you to change the root directory of processes. For those of you who use Python, think of chroot like virtualenv. It's a safe distinct location that allows you to simulate having a new "root" directory. These processes cannot access files or resources outside of that environment.

Jails took that concept and expanded it, virtualizing access to the file system, users, networking and every other part of the system. Jails introduce 4 things that you will quickly recognize as you start to work with Docker:

  • A new directory structure of dependencies that a process cannot escape.
  • A hostname for the specific jail
  • A new IP address which is often just an alias for an existing interface
  • A command that you want to run inside of the jail.
www {
    host.hostname = www.example.org;           # Hostname
    ip4.addr = 192.168.0.10;                   # IP address of the jail
    path = "/usr/jail/www";                    # Path to the jail
    devfs_ruleset = "www_ruleset";             # devfs ruleset
    mount.devfs;                               # Mount devfs inside the jail
    exec.start = "/bin/sh /etc/rc";            # Start command
    exec.stop = "/bin/sh /etc/rc.shutdown";    # Stop command
}
What a Jail looks like.

From FreeBSD the technology made its way to Linux via the VServer project. As time went on more people built on the technology, taking advantage of cgroups. Control groups, shortened to cgroups, is a technology that was added to Linux in 2008 by engineers at Google. It is a way of defining a collection of processes that are bound by the same restrictions. Progress has continued with cgroups since its initial launch, now at a v2.

There are two parts of a cgroup, a core and a controller. The core is responsible for organizing processes. The controller is responsible for distributing a type of resource along the hierarchy. With this continued work we have gotten incredible flexibility with how to organize, isolate and allocate resources to processes.
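If you want to see how thin this layer actually is, you can poke at cgroup v2 by hand on a modern Linux host. A rough sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup and the memory controller is enabled for the parent group:

# which cgroup does this shell belong to?
cat /proc/self/cgroup

# make a group, cap its memory, then move this shell into it
sudo mkdir /sys/fs/cgroup/demo
echo 100M | sudo tee /sys/fs/cgroup/demo/memory.max
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs

Container runtimes are doing a far more elaborate version of exactly this, plus namespaces, on your behalf.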

Finally in 2013 we got Docker, adding a simple CLI, the concept of a Docker server, a way to host and share images and more. Now containers are too big for one company, instead being overseen by the Open Container Initiative. Instead of there being exclusively Docker clients pushing images to Docker Hub and running on a Docker server, we have a vibrant and strong open-source community around containers.

I could easily fill a dozen pages with interesting facts about containers, but the important thing is that containers are a mature technology built on a proven pattern of isolating processes from the host. This means we have complete flexibility for creating containers and can easily reuse a simple "base" host regardless of what is running on it.

For those interested in more details:

Anatomy of a Dockerfile

FROM debian:latest
# Copy application files
COPY . /app
# Install required system packages
RUN apt-get update
RUN apt-get -y install imagemagick curl software-properties-common gnupg vim ssh
RUN curl -sL https://deb.nodesource.com/setup_10.x | bash -
RUN apt-get -y install nodejs
# Install NPM dependencies
RUN npm install --prefix /app
EXPOSE 80
CMD ["npm", "start", "--prefix", "app"]
This is an example of a not great Dockerfile. Source

When writing Dockerfiles, open a tab to the official Docker docs. You will need to refer to them all the time at first, because very little about the format is intuitive. Typically Dockerfiles are stored in the top level of an existing repository and their file operations, such as COPY as shown above, operate on that principle. You don't have to do that, but it is a common pattern to see the Dockerfile at the root level of a repo. Whatever you do, keep it consistent.

Formatting

Dockerfile instructions are not case-sensitive, but are usually written in uppercase so that they can be differentiated from arguments more easily. Comments have the hash symbol (#) at the beginning of the line.

FROM

First is a FROM, which just says "what is our base container that we are starting from". As you progress in your Docker experience, FROM containers are actually great ways of speeding up the build process. If all of your containers have the same requirements for packages, you can actually just make a "base container" and then use that as a FROM. But when building your first containers I recommend just sticking with Debian.

Don't Use latest

Docker images rely on tags, which you can see in the example above as: debian:latest. This is Docker for "give me the most recently pushed image". You don't want to do that for production systems. Upgrading the base container should be an affirmative action, not just something you accidentally do.

The correct way to reference a FROM image in a Dockerfile is through the use of a hash. So we want something like this:

FROM debian@sha256:c6e865b5373b09942bc49e4b02a7b361fcfa405479ece627f5d4306554120673

Which I got from the Debian Dockerhub page here (a quick way to look a digest up yourself is shown after the list below). This protects us in a few different ways.

  • We won't accidentally upgrade our containers without meaning to
  • If the team in charge of pushing Debian containers to Dockerhub makes a mistake, we aren't suddenly going to be in trouble
  • It eliminates another source of drift
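If you don't want to dig around Dockerhub for the digest, you can pull the tag once and ask Docker what it resolved to. A small sketch, using a tag chosen purely as an example:

# resolve a tag to its content digest, then pin your FROM line to that digest
docker pull debian:bullseye-slim
docker inspect --format '{{index .RepoDigests 0}}' debian:bullseye-slim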

But I see a lot of people using Alpine

That's fine, use Alpine if you want. I have more confidence in Debian when compared to Alpine and always base my stuff off Debian. I think it's a more mature community and more likely to catch problems. But again, whatever you end up doing, make it consistent.

If you do want a smaller container, I recommend minideb. It still lets you get the benefits of Debian with a smaller footprint. It is a good middle ground.

COPY

COPY is very basic. The . just means "current working directory", which in this case is wherever the Dockerfile lives, i.e. the top level of the git repository. It just takes whatever you specify and copies it into the image.

COPY vs ADD

A common question I get is "what is the difference between COPY and ADD". Super basic, ADD is for going out and fetching something from a URL or opening a compressed file into the container. So if all you need to do is copy some files into the container from the repo, just use COPY. If you have to grab a compressed directory from somewhere or unzip something use ADD.

RUN

RUN is the meat and potatoes of the Dockerfile. These are the bash commands we are running in order to basically put together all the requirements. The file we have above doesn't follow best practices. We want to compress the RUN commands down so that they are all part of one layer.

RUN wget https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2 \
&& tar jxf samtools-1.2.tar.bz2 \
&& cd samtools-1.2 \
&& make \
&& make install
A good RUN example so all of these are one layer

WORKDIR

Allows you to set the directory inside the container from which all the other commands will run. Saves you from having to write out the absolute path every time.

CMD

The command we are executing when we run the container. Usually for most web applications this would be where we run the framework start command. This is an example from a Django app I run:

CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]

If you need more detail, Docker has a decent tutorial: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

One more thing

This obviously depends on your application, but many applications will also need a reverse proxy. This allows Nginx to listen on port 80 inside the container and forward requests on to your application. Docker has a good tutorial on how to add Nginx to your container: https://www.docker.com/blog/how-to-use-the-official-nginx-docker-image/

I cannot stress this enough: making the Dockerfile that runs the actual application is not something a DevOps engineer should try to do on their own. You likely can do it by reverse engineering how your current servers work, but you need to pull in the other programmers in your organization.

Docker also has a good tutorial from beginning to end for Docker novices here: https://docs.docker.com/get-started/

Docker build, compose etc

Once you have a Dockerfile in your application repository, you are ready to move on to the next steps.

  1. Have your CICD system build the images. Type "name of your CICD + build docker images" into Google to see how.
  2. You'll need to make an IAM user for your CICD system in order to push the docker images from your CI workers to the ECR private registry. You can find the required permissions here.
  3. Get ready to push those images to a registry. For AWS users I strongly recommend AWS ECR.
  4. Here is how you make a private registry.
  5. Then you need to push your image to the registry. I want to make sure you see AWS ECR helper, a great tool that makes the act of pushing from your laptop much easier. https://github.com/awslabs/amazon-ecr-credential-helper. This also can help developers pull these containers down for local testing.
  6. Pay close attention to tags. You'll notice that the ECR registry is part of the tag along with the : and then version information (see the sketch after this list). You can use different registries for different applications or use the same registry for all your applications. Remember secrets shouldn't be in your container regardless, nor should customer data.
  7. Go get a beer, you earned it.
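Here is roughly what that build-tag-push flow looks like end to end with the AWS CLI. The account ID, region, repository name and tag are all placeholders:

# log in to ECR, build, tag and push
aws ecr get-login-password --region eu-west-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com
docker build -t my-app:git-abc123 .
docker tag my-app:git-abc123 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-app:git-abc123
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-app:git-abc123

Your CICD system runs exactly the same commands, just with its own IAM user instead of your laptop credentials.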

Some hard decisions

Up to this point, we've been following a pretty conventional workflow. Get stuff into containers, push the containers up to a registry, automate the process of making new containers. Now hopefully we have our developers able to test their applications locally and everyone is very impressed with you and all the work you have gotten done.

The reason we did all this work is because now that our applications are in Docker containers, we have a wide range of options for ways to quickly and easily run this application. I can't tell you what the right option is for your organization without being there, but I can lay out the options so you can walk into the conversation armed with the relevant data.

Deploying Docker containers directly to EC2 Instances

This is a workflow you'll see quite a bit among organizations just building confidence in Docker. It works something like this -

  • Your CI system builds the Docker container using a worker and the Dockerfile you defined before. It pushes it to your registry with the correct tag.
  • You make a basic AMI with a tool like packer.
  • New Docker containers are pulled down to the EC2 instances running the AMIs we made with Packer.

Packer

Packer is just a tool that spins up an EC2 instance, installs the software you want installed and then saves it as an AMI. These AMIs can be deployed when new machines launch, ensuring you have identical software for each host. Since we're going to be keeping all the often-updated software inside the Docker container, this AMI can be used as a less-often touched tool.

First, go through the Packer tutorial, it's very good.

Here is another more comprehensive tutorial.

Here are the steps we're going to follow

  1. Install Packer: https://www.packer.io/downloads.html
  2. Pick a base AMI for Packer. This is what we're going to install all the other software on top of.

Here is a list of Debian AMI IDs based on regions: https://wiki.debian.org/Cloud/AmazonEC2Image/Bullseye which we will use for our base image. Our Packer JSON file is going to look something like this:

{
    "variables": {
        "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
        "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}"
    },
    "builders": [
        {
            "access_key": "{{user `aws_access_key`}}",
            "ami_name": "docker01",
            "instance_type": "t3.micro",
            "region": "eu-west-1",
            "source_ami": "ami-05b99bc50bd882a41",
            "ssh_username": "admin",
            "type": "amazon-ebs"
        }
    ]
}

The next step is to add a provisioner step, as outlined in the Packer documentation you can find here. Basically you will write a bash script that installs the required software to run Docker. Docker actually provides you with a script that should install what you need which you can find here.
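For a sense of what that provisioner script tends to boil down to, here is a sketch using Docker's convenience script (the "admin" user is the default on the Debian AMIs referenced above):

#!/usr/bin/env bash
set -euo pipefail

# install Docker using the convenience script Docker publishes
curl -fsSL https://get.docker.com -o /tmp/get-docker.sh
sudo sh /tmp/get-docker.sh

# let the default user run docker without sudo
sudo usermod -aG docker admin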

The end process you will be running looks like this:

  • CI process builds a Docker image and pushes it to ECR.
  • Your deployment process is either to configure your servers to pull the latest image from ECR with a cron job so your servers are eventually consistent, or more likely to write a deployment job which connects to each server, runs docker pull and then restarts the containers as needed (something like the sketch below).
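That deployment job usually ends up looking something like this. Purely illustrative, with a made-up image name and a hosts.txt file listing the instances:

# the kind of hand-written deploy loop this approach produces
IMAGE="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-app:latest"
for host in $(cat hosts.txt); do
  ssh admin@"$host" "docker pull $IMAGE; docker rm -f app; docker run -d --name app -p 80:8000 $IMAGE"
done

Notice everything it doesn't do: no draining from the load balancer, no health check before moving on, no rollback. That's the problem.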

Why this is a bad long-term strategy

A lot of organizations start here, but it's important not to end here. This is not a good, sustainable long-term workflow.

  • This is all super hand-made, which doesn't fit with our end goal
  • The entire process is going to be held together with hand-made scripts. You need something to remove the instance you are deploying to from the load balancer, pull the latest image, restart it, etc.
  • You'll need to configure a health check on the Docker container to know if it has started correctly.

Correct ways to run containers in AWS

If you are trying to very quickly make a hands-off infrastructure with Docker, choose Elastic Beanstalk

Elastic Beanstalk is the AWS attempt to provide infrastructure in a box. I don't approve of everything EB does, but it is one of the fastest ways to stand up a robust and easy to manage infrastructure. You can check out how to do that with the AWS docs here.

AWS with EB stands up everything you need, from a load balancer to the server and even the database if you want. It is pretty easy to get going, but Elastic Beanstalk is not a magic solution.

Elastic Beanstalk is a good solution if:

  1. You are attempting to run a very simple application. You don't have anything too complicated you are trying to do. In terms of complexity, we are talking about something like a Wordpress site.
  2. You aren't going to need anything in the .ebextension world. You can see that here.
  3. There is a good logging and metrics story that developers are already using.
  4. You want rolling deployments, load balancing, auto scaling, health checks, etc out of the box

Don't use Elastic Beanstalk if:

  1. You need to do a lot of complicated networking stuff on the load balancer for your application.
  2. You have a complicated web application. Elastic Beanstalk has bad documentation and it's hard to figure out why stuff is or isn't working.
  3. Service-to-service communication is something you are going to need now or in the future.

If you need something more robust, try ECS

AWS ECS is a service by AWS designed to quickly and easily run Docker containers. You can find the tutorial here: https://aws.amazon.com/getting-started/hands-on/deploy-docker-containers/

Use Elastic Container Service if:

  1. You are already heavily invested in AWS resources. The integration with ECS and other AWS resources is deep and works well.
  2. You want the option of going completely serverless with Fargate
  3. You have looked at the cost of running a stack on Fargate and are OK with it.

Don't use Elastic Container Service if:

  1. You may need to deploy this application to a different cloud provider

What about Kubernetes?

I love Kubernetes, but it's too complicated to get into in this article. Kubernetes is a full-stack solution that I adore but is probably too complicated for one person to run. I am working on a Kubernetes writeup, but if you are a small team I wouldn't strongly consider it. ECS is just easier to get running and keep running.

Coming up!

  • Logging, metrics and traces
  • Paging and alerts. What is a good page vs a bad page
  • Databases. How do we move them, what do we do with them
  • Status pages. Where do we tell customers about problems or upcoming maintenance.
  • CI/CD systems. Do we stick with Jenkins or is there something better?
  • Serverless. How does it work, should we be using it?
  • IAM. How do I give users and applications access to AWS without running the risk of bringing it all down.

Questions / Concerns?

Let me know on twitter @duggan_mathew


DevOps Engineer Crash Course - Section 1

Fake it till you make it Starfleet Captain Kelsey Grammer Link

I've had the opportunity lately to speak to a lot of DevOps engineers at startups around Europe. Some come from a more traditional infrastructure background, beginning their careers in network administration or system administration. Most are coming from either frontend or backend teams, choosing to focus more on the infrastructure work (which hey, that's great, different perspectives are always appreciated).

However, a pretty alarming trend has emerged through these conversations. They seem to start with the existing sysadmin or DevOps person leaving, and suddenly they are dropped into the role with almost no experience or training. Left to their own devices with root access to the AWS account, they often have no idea where to even start. Learning on the job is one thing, but being responsible for the critical functioning of an entire company's infrastructure with no time to ramp up is crazy and frankly terrifying.

For some of these folks, it was the beginning of a love affair with infrastructure work. For others, it caused them to quit those jobs immediately in panic. I even spoke to a few who left programming as a career as a result of the stress they felt at the sudden pressure. That's sad for a lot of reasons, especially when these people are forced into the role. But it did spark an idea.

What advice and steps would I tell someone who suddenly had my job with no time to prepare? My goal is to try and document what I would do, if dropped into a position like that, along with my reasoning.

Disclaimer

These solutions aren't necessarily the best fit for every organization or application stack. I tried to focus on easy, relatively straightforward tips for people dropped into a role that they have very little context on. As hard as this might be to believe for some people out there, a lot of smaller companies just don't have any additional infrastructure capacity, especially in some areas of Europe.

These aren't all strictly DevOps concepts as I understand the term to mean. I hate to be the one to tell you but, like SRE and every other well-defined term before it, businesses took the title "DevOps" and slapped it on a generic "infrastructure" concept. We're gonna try to stick to some key guiding principles but I'm not a purist.

Key Concepts

  1. We are a team of one or two people. There is no debate about build vs buy. We're going to buy everything that isn't directly related to our core business.
  2. These systems have to fix themselves. We do not have the time or capacity to apply the love and care of a larger infrastructure team. Think less woodworking and more building with Legos. We are trying to snap pre-existing pieces together in a sustainable pattern.
  3. Boring > New. We are not trying to make the world's greatest infrastructure here. We need something sustainable, easy to operate, and ideally something we do not need to constantly be responsible for. This means teams rolling out their own resources, monitoring their own applications, and allocating their own time.
  4. We are not the gatekeepers. Infrastructure is a tool and like all tools, it can be abused. Your organization is going to learn to do this better collectively.
  5. You cannot become an expert on every element you interact with. A day in my job can be managing Postgres operations, writing a PR against an application, or sitting in a planning session helping to design a new application. The scope of what many businesses call "DevOps" is too vast to be a deep-dive expert in all parts of it.

Most importantly we'll do the best we can, but push the guilt out of your head. Mistakes are the cost of their failure to plan, not your failure to learn. A lot of the people I have spoken to who find themselves in this position feel intense shame or guilt for not "being able to do a better job". Your employer has messed up, you didn't.

Section One - Into the Fray

Maybe you expressed some casual interest in infrastructure work during a one on one a few months ago, or possibly you are known as the "troubleshooting person", assisting other developers with writing Docker containers. Whatever got you here, your infrastructure person has left, maybe suddenly. You have been moved into the role with almost no time to prepare. We're going to assume you are on AWS for this, but for the most part, the advice should be pretty universal.

I've tried to order these tasks in terms of importance.

1. Get a copy of the existing stack

Alright, you got your AWS credentials, the whole team is trying to reassure you not to freak out because "mostly the infrastructure just works and there isn't a lot of work that needs to be done". You sit down at your desk and your mind starts racing. Step 1 is to get a copy of the existing cloud setup.

We want to get your infrastructure as it exists right now into code because chances are you are not the only one who can log into the web panel and change things. There's a great tool for exporting existing infrastructure state in Terraform called terraformer.

Terraformer

So terraformer is a CLI tool written in Go that allows you to quickly and easily dump out all of your existing cloud resources into a Terraform repo. These files, either as TF format or JSON, will let you basically snapshot the entire AWS account. First, set up AWS CLI and your credentials as shown here. Then once you have the credentials saved, make a new git repo.

# Example flow

# Set up our credentials
aws configure --profile production

# Make sure they work
aws s3 ls --profile production 

# Make our new repo
mkdir infrastructure && cd infrastructure/
git init 

# Install terraformer
# Linux
curl -LO https://github.com/GoogleCloudPlatform/terraformer/releases/download/0.8.15/terraformer-all-linux-amd64
chmod +x terraformer-all-linux-amd64
sudo mv terraformer-all-linux-amd64 /usr/local/bin/terraformer

# Intel Mac
curl -LO https://github.com/GoogleCloudPlatform/terraformer/releases/download/0.8.15/terraformer-all-darwin-amd64
chmod +x terraformer-all-darwin-amd64
sudo mv terraformer-all-darwin-amd64 /usr/local/bin/terraformer

# Other Platforms
https://github.com/GoogleCloudPlatform/terraformer/releases/tag/0.8.15

# Install terraform
https://learn.hashicorp.com/tutorials/terraform/install-cli

First, if you don't know what region your AWS resources are in you can find that here.

So what we're gonna do is run:

terraformer import aws --regions INSERT_AWS_REGIONS_HERE --resources="*" --profile=production

### You will get a directory structure that looks like this
generated/
└── aws
    ├── acm
    │   ├── acm_certificate.tf
    │   ├── outputs.tf
    │   ├── provider.tf
    │   └── terraform.tfstate
    └── rds
        ├── db_instance.tf
        ├── db_parameter_group.tf
        ├── db_subnet_group.tf
        ├── outputs.tf
        ├── provider.tf
        └── terraform.tfstate

So if you wanted to modify something for rds, you would cd to the rds directory, then run terraform init. You may get an error: Error: Invalid legacy provider address

If so, no problem. Just run

terraform state replace-provider registry.terraform.io/-/aws hashicorp/aws

Once that is set up, you now have the ability to restore the AWS account using terraform at any time. You will want to add this repo to a CICD job eventually so this gets done automatically, but at first, you might need to run it locally.

$ export AWS_ACCESS_KEY_ID="anaccesskey"
$ export AWS_SECRET_ACCESS_KEY="asecretkey"
$ export AWS_DEFAULT_REGION="us-west-2"
$ terraform plan

You should see terraform run and tell you no changes.

Why Does This Matter?

Terraform lets us do a few things, one of which is roll out infrastructure changes like we would with any other code change. This is great because, in the case of unintended outages or problems, we can roll back. It also matters because often with small companies things will get broken when someone logs into the web console and clicks something they shouldn't. Running a terraform plan can tell you exactly what changed across the entire region in a few minutes, meaning you should be able to roll it back.

Should I do this if our team already manages our stack in code?

I would. There are tools like Ansible and Puppet which are great at managing servers that some people use to manage AWS. Often these setups are somewhat custom, relying on some trial and error before you figure out exactly how they work and what they are doing. Terraform is very stock and anyone on a DevOps chat group or mailing list will be able to help you run the commands. We're trying to establish basically a "restore point". You don't need to use Terraform to manage stuff if you don't want to, but you probably won't regret having a copy now.

Later on, we're going to be putting this into a CICD pipeline so we don't need to manage who adds infrastructure resources. We'll do that by requiring approval on PRs vs us having to write everything. It'll distribute the load but still let us ensure that we have some insight into how the system is configured. Right now though, since you are responsible for infrastructure you can at least roll this back.

2. Write down how deployments work


Every stack is a little different in terms of how it gets deployed and a constant source of problems for folks starting out. You need to be able to answer the question of how exactly code goes from a repo -> production. Maybe it's Jenkins, or GitLab runners or GitHub, CodeDeploy, etc but you need to know the answer for each application. Most importantly you need to read through whatever shell script they're running to actually deploy the application because that will start to give you an idea of what hacks are required to get this thing up and running.

Here are some common questions to get you started.

  • Are you running Docker? If so, where do the custom images come from? What runs the Dockerfile, where does it push the images, etc.
  • How do you run migrations against the database? Is it part of the normal code base, is there a different utility?
  • What is a server to your organization? Is it a stock EC2 instance running Linux and Docker with everything else getting deployed with your application? Is it a server where your CICD job just rsyncs files to a directory Nginx reads from?
  • Where do secrets come from? Are they stored in the CICD pipeline? Are they stored in a secrets system like Vault or Secrets Manager? (Man if your organization actually does secrets correctly with something like this, bravo).
  • Do you have a "cron box"? This is a server that runs cron jobs on a regular interval outside of the normal fleet. I've seen these called "snowflake", "worker", etc. These are usually the least maintained boxes in the organization but often the most critical to how the business works.
  • How similar or different are different applications? Often organizations have mixes of serverless applications (managed either through the AWS web UI and tools like serverless) and conventional web servers. Lambdas in AWS are awesome tools that often are completely unmanaged in small businesses, so try and pay special attention to these.

The goal of all of this is to be able to answer "how does code go from a developer laptop to our customers". Once you understand that specific flow, then you will be much more useful in terms of understanding a lot of how things work. Eventually, we're going to want to consolidate these down into one flow, ideally into one "target" so we can keep our lives simple and be able to really maximize what we can offer the team.

Where do logs go and what stores them?

All applications and services generate logs. Logs are critical to debugging the health of an application, and knowing how that data is gathered and stored is critical to empowering developers to understand problems. This is the first week, so we're not trying to change anything, we just want to document how it works. How are logs generated by the application?

Some likely scenarios:

  • They are written to disk on the application server and pushed somewhere through syslog. Great, document the syslog configuration, where it comes from and finally whether logrotate is set up to keep the boxes from running out of disk space.
  • They get pushed to either the cloud provider or a monitoring provider (Datadog, etc.). Fine, couldn't be easier, but write down where the permission to push the logs comes from. What I mean by that is: does the app push the logs to AWS, or does an agent running on the box take the logs and push them up to AWS? Either is fine, but knowing which is which makes a difference.

Document the flow, looking out for expiration or deletion policies. Also check how access control works: how do developers access these raw logs? Hopefully through some sort of web UI, but if it is through SSH access to the log aggregator that's fine, just write it down.

For more information about CloudWatch logging check out the AWS docs here.
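
If your logs do land in CloudWatch, a few lines of boto3 will show you which log groups have no retention policy set (meaning they are kept, and billed, forever). A rough sketch, reusing the "production" profile from earlier:

import boto3

# List CloudWatch log groups and their retention settings so you can spot
# groups that never expire. Assumes the "production" profile has read access.
session = boto3.Session(profile_name="production")
logs = session.client("logs")

for page in logs.get_paginator("describe_log_groups").paginate():
    for group in page["logGroups"]:
        retention = group.get("retentionInDays", "NEVER EXPIRES")
        print(f"{group['logGroupName']}: retention={retention}")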

3. How does SSH access work?

You need to know exactly how SSH works from the developers' laptop to the server they are trying to access. Here are some questions to kick it off.

  • How do SSH public keys get onto a server? Is there a script, does it sync from somewhere, are they put on by hand?
  • What IP addresses are allowed to SSH into a server? Hopefully not all of them; most organizations have at least a bastion host or VPN set up. But test it out, don't assume the documentation is correct. Remember we're building new documentation from scratch and approaching this stack with the respect it deserves as an unknown problem.
  • IMPORTANT: HOW DO EMPLOYEES GET OFFBOARDED? Trust me, people forget this all the time and it wouldn't surprise me if you find some SSH keys that shouldn't be there.
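
One low-effort way to start on the offboarding question is to dump the fingerprint and comment of every key in a server's authorized_keys file and compare them against your current employee list. A small sketch (the path is an assumption; check every user account on every box):

import subprocess
from pathlib import Path

# Print the fingerprint and comment of every key in an authorized_keys file
# so you can spot keys belonging to people who no longer work here.
authorized_keys = Path.home() / ".ssh" / "authorized_keys"

result = subprocess.run(
    ["ssh-keygen", "-lf", str(authorized_keys)],
    capture_output=True,
    text=True,
)
# One line per key: "<bits> <fingerprint> <comment> (<key type>)"
print(result.stdout)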

I don't know anything about SSH

Don't worry we got you. Take a quick read through this tutorial. You've likely used SSH a lot, especially if you have ever set up a Digital Ocean or personal EC2 instance on a free tier. You have public keys synced to the server and private keys on the client device.

What is a bastion host?

They're just servers that exist to allow traffic from a public subnet to reach a private subnet. Not all organizations use them, but a lot do, and given the conversations I've had it seems like a common pattern around the industry. The idea is a box sitting between the internet and our servers that acts as a bridge.

Do all developers need to access bastion hosts?

Nope they sure don't. Access to the Linux instances should be very restricted and ideally, we can get rid of it as we go. There are much better and easier to operate options now through AWS that let you get rid of the whole concept of bastion servers. But in the meantime, we should ensure we understand the existing stack.

Questions to answer

  • How do keys get onto the bastion host?
  • How does access work from the bastion host to the servers? (A rough sketch of the hop follows this list.)
  • Are the Linux instances we're accessing in a private subnet or are they on a public subnet?
  • Is the bastion host up to date? Is the Linux distribution running current with the latest patches? There shouldn't be any other processes running on these boxes so upgrading them shouldn't be too bad.
  • Do you rely on SFTP anywhere? Are you pulling something down that is critical or pushing something up to SFTP? A lot of businesses still rely heavily on automated jobs around SFTP and you want to know how that authentication is happening.
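
To make the bastion pattern concrete, this is roughly what the hop looks like in code with paramiko (on the command line it's just ssh -J bastion target). Every hostname, IP, username and key path below is a made-up placeholder:

import os
import paramiko

key = os.path.expanduser("~/.ssh/id_ed25519")

# 1. Connect to the bastion, which is the only box exposed to the internet.
bastion = paramiko.SSHClient()
bastion.set_missing_host_key_policy(paramiko.AutoAddPolicy())
bastion.connect("bastion.example.com", username="ubuntu", key_filename=key)

# 2. Open a tunnel from the bastion to the private instance's SSH port.
channel = bastion.get_transport().open_channel(
    "direct-tcpip", ("10.0.1.20", 22), ("127.0.0.1", 0)
)

# 3. SSH to the private instance through that tunnel.
target = paramiko.SSHClient()
target.set_missing_host_key_policy(paramiko.AutoAddPolicy())
target.connect("10.0.1.20", username="ubuntu", sock=channel, key_filename=key)

_, stdout, _ = target.exec_command("hostname")
print(stdout.read().decode())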

4. How do we know the applications are running?

It seems from conversations that these organizations often have bad alerting stories. They don't know applications are down until customers tell them or they happen to notice. So you want to establish some sort of baseline early on, basically "how do you know the app is still up and running". Often there is some sort of health check path, something like /health or /check, used by a variety of services like load balancers and Kubernetes to determine whether something is up and functional.

First, understand what this health check is actually doing. Sometimes they are just hitting a webserver and ensuring Nginx is up and running. While interesting to know that Nginx is a reliable piece of software (it is quite reliable), this doesn't tell us much. Ideally, you want a health check that interacts with as many pieces of the infrastructure as possible. Maybe it runs a read query against the database to get back some sort of UUID (which is a common pattern).

This next part depends a lot on what alerting system you use, but you want to make a dashboard that you can use very quickly to determine "are my applications up and running". Infrastructure modifications are high-risk operations and sometimes when they go sideways, they'll go very sideways. So you want some visual system to determine whether or not the stack is functional and ideally, this should alert you through Slack or something. If you don't have a route like this, consider doing the work to add one. It'll make your life easier and probably isn't too complicated to do in your framework.
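
If you do end up adding one, it doesn't have to be fancy. Here's a minimal sketch of the idea in Flask, assuming a SQL database behind SQLAlchemy; the route name, connection string and framework are all assumptions, so translate to whatever you actually run:

from flask import Flask, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)

# Placeholder connection string; point this at your real database.
engine = create_engine("postgresql://app:app@localhost:5432/app")

@app.route("/health")
def health():
    # Touch the database so the check proves more than "Nginx is up".
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
    except Exception as exc:
        return jsonify(status="unhealthy", error=str(exc)), 503
    return jsonify(status="ok"), 200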

My first alerting tool is almost always Uptime Robot. So we're gonna take our health route and set an Uptime Robot alert on that endpoint. You shouldn't allow traffic from the internet at large to hit this route (because it is computationally expensive, it's an easy target for malicious actors). However, Uptime Robot provides a list of their IP addresses for whitelisting, so we can add them to our security groups in the Terraform repo we made earlier.

If you need a free alternative I have had a good experience with Hetrix. Setting up the alerts should be self-explanatory, basically hit an endpoint and get back either a string or a status code.

5. Run a security audit

Is he out of his mind? In the first week? Security is a super hard problem and one that startups mess up all the time. We can't make this stack secure in the first week (or likely month) of this work, but we can ensure we don't make it worse and, when we get the chance, move closer to an ideal state.

The tool I like for this is Prowler. Not only does it allow you a ton of flexibility with what security audits you run, but it lets you export the results in a lot of different formats, including a very nice-looking HTML option.

Steps to run Prowler

  1. Install Prowler. We're gonna run this from our local workstation using the AWS profile we made before.

On our local workstation:
git clone https://github.com/toniblyx/prowler
cd prowler

2. Run Prowler:

./prowler -p production -r INSERT_REGION_HERE -M csv,json,json-asff,html -g cislevel1

The command above runs Prowler and outputs the results in several formats, but I want to focus for a second on the -g option. That's the group option and it basically means "which security audit are we going to run". The CIS Amazon Web Services Foundations Benchmark has 2 levels, which can be thought of broadly as:


Level 1: Stuff you should absolutely be doing right now that shouldn't impact most application functionality.

Level 2: Stuff you should probably be doing but is more likely to impact the functioning of an application.

We're running Level 1, because ideally, our stack should already pass a level 1 and if it doesn't, then we want to know where. The goal of this audit isn't to fix anything right now, but it IS to share it with leadership. Let them know the state of the account now while you are onboarding, so if there are serious security gaps that will require development time they know about it.

Finally, take the CSV file that was output from Prowler and stick it in Google Sheets with a date. We're going to want to have a historical record of the audit.
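
Before pasting it in, a quick tally of results can be useful for the leadership summary. Prowler's CSV column names vary between versions, so the column name here is passed in as an argument rather than assumed; check your file's header row (and its delimiter) first.

import csv
import sys
from collections import Counter

# Usage: python3 summarize_prowler.py prowler-output.csv RESULT
# The second argument is whichever column in your Prowler CSV holds the
# PASS/FAIL value; it differs between Prowler versions.
csv_path, result_column = sys.argv[1], sys.argv[2]

counts = Counter()
with open(csv_path, newline="") as f:
    for row in csv.DictReader(f):
        counts[row.get(result_column, "UNKNOWN")] += 1

for result, count in counts.most_common():
    print(f"{result}: {count}")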

6. Make a Diagram!

The last thing we really want to do is make a diagram and have the folks who know more about the stack verify it. One tool that can kick this off is Cloudmapper. This is not going to get you all of the way there (you'll need to add meaningful labels and likely fill in some missing pieces) but should get you a template to work off of.

What we're primarily looking for here is understanding flow and dependencies. Here are some good questions to get you started.

  • Where are my application persistence layers? What hosts them? How do they talk to each other?
  • Overall network design. How does traffic ingress and egress? Do all my resources talk directly to the internet or do they go through some sort of NAT gateway? Are my resources in different subnets, security groups, etc?
  • Are there less obvious dependencies? SQS, RabbitMQ, S3, Elasticsearch, Varnish: any and all of these are good candidates.

The ideal state here is to have a diagram that we can look at and say "yes I understand all the moving pieces". For some stacks that might be much more difficult, especially serverless stacks. These often have mind-boggling designs that change deploy to deploy and might be outside of the scope of a diagram like this. We should still be able to say "traffic from our customers comes in through this load balancer to that subnet after meeting the requirements in x security group".

We're looking for something like this

If your organization has Lucidchart, they make this really easy. You can find out more about that here. That said, you can do almost everything Lucidchart or AWS Config can do with Cloudmapper, without the additional cost.

Cloudmapper is too complicated, what else have you got?

Does the setup page freak you out a bit? It does take a lot to set up and run the first time. AWS actually has a pretty nice pre-made solution to this problem. Here is the link to their setup: https://docs.aws.amazon.com/solutions/latest/aws-perspective/overview.html

It does cost a little bit but is pretty much "click and go" so I recommend it if you just need a fast overview of the entire account without too much hassle.
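
And if all you want is a raw inventory to sketch from by hand, a few lines of boto3 will dump your instances along with the subnets and security groups they sit in. Another rough sketch using the "production" profile from earlier:

import boto3

# List EC2 instances with the networking details you need for a first diagram.
session = boto3.Session(profile_name="production")
ec2 = session.client("ec2")

for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            name = next(
                (t["Value"] for t in instance.get("Tags", []) if t["Key"] == "Name"),
                instance["InstanceId"],
            )
            groups = ",".join(g["GroupName"] for g in instance.get("SecurityGroups", []))
            print(
                name,
                instance.get("SubnetId", "no-subnet"),
                instance.get("PrivateIpAddress", "-"),
                groups,
            )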

End of section one

Ideally the state we want to be in looks something like the following.

  • We have a copy of our infrastructure that we've run terraform plan against and there are no diffs, so we know we can go back.
  • We have an understanding of how the most important applications are deployed and what they are deployed to.
  • The process of generating, transmitting, and storing logs is understood.
  • We have some idea of how secure (or not) our setup is.
  • There are some basic alerts on the entire stack, end to end, which give us some degree of confidence that "yes the application itself is functional".

For many of you who are more experienced with this type of work, I'm sure you are shocked. A lot of this should already exist and really this is a process of you getting up to speed with how it works. However sadly in my experience talking to folks who have had this job forced on them, many of these pieces were set up a few employees ago and the specifics of how they work are lost to time. Since we know we can't rely on the documentation we need to make our own. In the process, we become more comfortable with the overall stack.

Stuff still to cover!

If there is any interest I'll keep going with this. Some topics I'd love to cover.

  • Metrics! How to make a dashboard that doesn't suck.
  • Email. Do your apps send it, are you set up for DMARC, how do you know if email is successfully getting to customers, where does it send from?
  • DNS. If it's not in the terraform directory we made before under Route53, it must be somewhere else. We gotta manage that like we manage a server because users logging into the DNS control panel and changing something can cripple the business.
  • Kubernetes. Should you use it? Are there other options? If you are using it now, what do you need to know about it?
  • Migrating to managed services. If your company is running its own databases or baking its own AMIs, now might be a great time to revisit that decision.
  • Sandboxes and multi-account setups. How do you ensure developers can test their apps in the least annoying way while still keeping the production stack up?
  • AWS billing. What are some common gotchas, how do you monitor spending, and what do you do institutionally about it?
  • SSO, do you need it, how to do it, what does it mean?
  • Exposing logs through a web interface. What are the fastest ways to do that on a startup budget?
  • How do you get up to speed? What courses and training resources are worth the time and energy?
  • Where do you get help? Are there communities with people interested in providing advice?

Did I miss something obvious?

Let me know! I love constructive feedback. Bother me on Twitter. @duggan_mathew


How does FaceTime Work?

As an expat living in Denmark, I use FaceTime audio a lot. Not only is it simple to use and reliable, but the sound quality is incredible. For those of you old enough to remember landlines, it reminds me of those, but as if you had a good headset. When we all switched to cell service, audio quality took a huge hit, and with modern VoIP home phones the problem hasn't gotten better. So when my mom and I chat over FaceTime Audio and the quality is so good it is like she is in the room with me, it really stands out compared to my many other phone calls in the course of a week.

So how does Apple do this? As someone who has worked as a systems administrator for their entire career, the technical challenges are kind of immense when you think about them. We need to establish a connection between two devices through various levels of networking abstraction, both at the ISP level and home level. This connection needs to be secure, reliable enough to maintain a conversation and also low bandwidth enough to be feasible given modern cellular data limits and home internet data caps. All of this needs to run on a device with a very impressive CPU but limited battery capacity.

What do we know about FaceTime?

A lot of our best information for how FaceTime worked (past tense is important here) is from interested parties around the time the feature was announced, so around the 2010 timeframe. During this period there was a lot of good packet capture work done by the community and we got a sense for how the protocol functioned. For those who have worked with VoIP technologies in their career, it's going to look pretty similar to what you may have seen before (with some Apple twists). Here were the steps to a FaceTime call around 2010:

  • A TCP connection over port 5223 is established with an Apple server. We know that 5223 is used by a lot of things, but for Apple it's used for their push notification service. Interestingly, it is ALSO used for XMPP connections, which will come up later.
  • UDP traffic between the iOS device and Apple servers on ports 16385 and 16386. These ports might be familiar to those of you who have worked with firewalls. These are ports associated with audio and video RTP, which makes sense. RTP, or Real-time Transport Protocol, was designed to facilitate video and audio communications over the internet with low latency.
  • RTP relies on something else to establish a session and in Apple's case it appears to rely on XMPP. This XMPP connection relies on a client certificate on the device issued by Apple. This is why non-iOS devices cannot use FaceTime, even if they could reverse engineer the connection they don't have the certificate.
  • Apple uses ICE, STUN and TURN to negotiate a way for these two devices to communicate directly with each other. These are common tools used to negotiate peer to peer connections between NAT so that devices without public IP addresses can still talk to each other.
  • The device itself is identified by registering either a phone number or email address with Apple's server. This, along with STUN information, is how Apple knows how to connect the two devices. STUN, or Session Traversal Utilities for NAT, is when a device reaches out to a publicly available server and the server determines how this client can be reached. (A minimal STUN example follows this list.)
  • At the end of all of this negotiation and network traversal, a SIP INVITE message is sent. This has the name of the person along with the bandwidth requirements and call parameters.
  • Once the call is established there are a series of SIP MESSAGE packets that are likely used to authenticate the devices. Then the actual connection is established and FaceTime's protocols take over, using the UDP ports discussed before.
  • Finally the call is terminated using the SIP protocol when it is concluded. The assumption I'm making is that for FaceTime audio vs video the difference is minor, the primary distinction being the codec used for audio, AAC-ELD. There is nothing magical about Apple using this codec but it is widely seen as an excellent choice.
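
To make the STUN step a little less abstract, here is a minimal sketch of a STUN Binding Request in Python. This is the generic NAT-discovery mechanism from RFC 5389, not anything Apple-specific, and the Google STUN server used below is just a well-known public endpoint to test against:

import os
import socket
import struct

# Ask a public STUN server "what IP and port do I look like from the internet?"
STUN_SERVER = ("stun.l.google.com", 19302)
MAGIC_COOKIE = 0x2112A442

# Binding Request header: type, length (no attributes), magic cookie, transaction ID
transaction_id = os.urandom(12)
request = struct.pack("!HHI", 0x0001, 0, MAGIC_COOKIE) + transaction_id

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(3)
sock.sendto(request, STUN_SERVER)
data, _ = sock.recvfrom(2048)

# Walk the response attributes looking for XOR-MAPPED-ADDRESS (0x0020)
attrs = data[20:]
while attrs:
    attr_type, attr_len = struct.unpack("!HH", attrs[:4])
    value = attrs[4:4 + attr_len]
    if attr_type == 0x0020:
        port = struct.unpack("!H", value[2:4])[0] ^ (MAGIC_COOKIE >> 16)
        ip_int = struct.unpack("!I", value[4:8])[0] ^ MAGIC_COOKIE
        print("public address:", socket.inet_ntoa(struct.pack("!I", ip_int)), port)
        break
    # Attributes are padded to 4-byte boundaries
    attrs = attrs[4 + ((attr_len + 3) // 4) * 4:]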

That was how the process worked. But we know that in the later years Apple changed FaceTime, adding more functionality and presumably more capacity. According to their port requirements these are the ones required now. I've added what I suspect they are used for.

Port and likely reason:

  • 80 (TCP): unclear, but possibly XMPP since it uses these ports as backups
  • 443 (TCP): same as above, since it is never blocked
  • 3478 through 3497 (UDP): STUN
  • 5223 (TCP): APN/XMPP
  • 16384 through 16387 (UDP): audio/video RTP
  • 16393 through 16402 (UDP): FaceTime exclusive

Video and Audio Quality

A FaceTime video call consists of 4 media streams. The audio is AAC-ELD as described above, with an observed 68 kbps consumed in each direction (or about 136 kbps total, give or take). Video is H.264 and varies quite a bit in quality, presumably depending on whatever bandwidth calculations were passed through SIP. We know that SIP has allowances for communicating H.264 bandwidth information, although the specifics of how FaceTime does on-the-fly calculations for what capacity is available to a consumer are still unknown to me.

You can observe this behavior by switching from cellular to WiFi during a video call, where video compression is often visible during the switch (but interestingly the call doesn't drop, a testament to effective network interface handoff inside of iOS). With audio calls, however, this behavior is not replicated: the call either maintains roughly the same quality or drops entirely, suggesting less flexibility (which makes sense given the much lower bandwidth requirements).

So does FaceTime still work like this?

I think a lot of it is still true, but I wasn't entirely sure if the XMPP component is still there. However, after more reading I believe this is still how it works, and indeed how a lot of Apple's iOS infrastructure works. While Apple doesn't have a lot of documentation available about the internals of FaceTime, one document that stood out to me was the security document. You can find that document here.

FaceTime is Apple’s video and audio calling service. Like iMessage, FaceTime calls use the Apple Push Notification service (APNs) to establish an initial connection to the user’s registered devices. The audio/video contents of FaceTime calls are protected by end-to-end encryption, so no one but the sender and receiver can access them. Apple can’t decrypt the data.

So we know that port 5223 (TCP) is used by both Apple's push notification service and XMPP over SSL. We know from older packet dumps that Apple used to use 5223 to establish a connection to their own Jabber servers as the initial starting point of the entire process. My suspicion here is that Apple's push notifications work similarly to a normal XMPP pubsub setup.

  • Apple kind of says as much in their docs here.

This is interesting because it suggests the underlying technology for a lot of Apple's backend is XMPP, surprising because for most of us XMPP is thought of as an older, less used technology. As discussed later I'm not sure if this is XMPP or just uses the same port. Alright so messages are exchanged, but how about the key sharing? These communications are encrypted, but I'm not uploading or sharing public keys (nor do I seem to have any sort of access to said keys).

Keys? I'm lost, I thought we were talking about calls

One of Apple's big selling points is security, and iMessage became famous for being an encrypted text message exchange. Traditional SMS was not encrypted, and neither were most text-based communications, including email. Encryption is computationally expensive and wasn't seen as a high priority until Apple really made it a large part of the conversation for text communication. But why hasn't encryption been a bigger part of the consumer computing ecosystem?

In short: because managing keys sucks ass. If I want to send an encrypted message to you I need to first know your public key. Then I can encrypt the body of a message and you can decrypt it. Traditionally this process is super manual and frankly, pretty shitty.

Credit: Protonmail

So Apple must have some way of generating the keys (presumably on device) and then sharing the public keys. They do in fact have one: a service called IDS, or Apple Identity Service. This is what links up your phone number or email address to the public key for that device.

Apple has a nice little diagram explaining the flow:

As far as I can tell the process is much the same for FaceTime calls as it is for iMessage but with some nuance for the audio/video channels. The certificates are used to establish a shared secret and the actual media is streamed over SRTP.

Not exactly the same but still gets the point across

Someone at Apple read the SSL book

Alright so SIP itself has a mechanism for how to handle encryption, but FaceTime and iMessage work on devices going all the way back to the iPhone 4. So the principle makes sense, but then I don't understand why we don't see tons of iMessage clones for Android. If there are billions of Apple devices floating around and most of this relies on local client-side negotiation, isn't there a way to fake it?

Alright so this is where it gets a bit strange. There's a defined way of sending client certificates, as outlined in RFC 5246. It appears Apple used to do this, but they have changed their process. Now it's sent through the application, along with a public token, a nonce and a signature. We're gonna focus on the token and the certificate for a moment.

Token

  • 256-bit binary string

Example:
NSLog(@"%@", deviceToken);
// Prints "<965b251c 6cb1926d e3cb366f dfb16ddd e6b9086a 8a3cac9e 5f857679 376eab7C>"

Certificate

  • Generated on the device at APN activation
  • Certificate request sent to albert.apple.com
  • Uses two TLS extensions, ALPN and Server Name Indication (SNI)

So why don't I have a bunch of great Android apps able to send this stuff?

As near as I can tell, the primary issue is two-fold. First, the protocol to establish the connection isn't standard. Apple uses ALPN to handle the negotiation and the client advertises a protocol called apns-pack-v1. So if you wanted to write your own application to interface with Apple's servers, you would first need to get the x509 client certificate (which seems to be generated at the time of activation). You would then need to be able to establish a connection to the server using ALPN, passing the server name, which I don't know if Android supports. You also can't just generate this one time, as Apple only allows each device one connection. So if you made an app using values taken from a real Mac or iOS device, I think it would just cause the actual Apple device to drop. If your Mac connected, then the fake device would drop.

But how do Hackintoshes work? For those that don't know, these are normal x86 computers running macOS. Presumably they would have the required extensions to establish these connections and would also be able to generate the required certificates. This is where it gets a little strange. It appears the Mac's serial number is a crucial part of how this process functions, presumably passing some check on Apple's side to figure out "should this device be allowed to initiate a connection".

The way to do this is by generating fake Mac serial numbers, as outlined here. The process seems pretty fraught, relying on a couple of factors. First, the Apple ID seems to need to be activated through some other device, and apparently the age of the ID matters. This is likely some sort of weighting system to keep the process from getting flooded with fake requests. However, it seems that before Apple completes the registration process it looks at the plist of the device and attempts to determine "is this a real Apple device".

Apple device serial numbers are not random values, though; they are actually a pretty interesting data format that packs in a lot of info. Presumably this was done to make service easier, giving the AppleCare website and Apple Stores a way to very quickly determine model and age without having to check with some "master Apple serial number server". You can check out the old Apple serial number format here: link.

This ability to brute force new serial numbers is, I suspect, behind Apple's decision to change the format of the serial number. By switching from a value that can be generated to a totally random value that varies in length, I assume Apple will be able to say with a much higher degree of certainty that "yes, this is a MacBook Pro with x serial number" by doing a lookup against an internal database. This would make generating fake serial numbers for these generations of devices virtually impossible, since you would need to get incredibly lucky with the model, MAC address information, logic board ID, and serial number all at once.

How secure is all this?

It's as secure as Apple, for all the good and the bad that suggests. Apple is entirely in control of enrollment, token generation, certificate verification and exchange, along with the TLS handshake process. The inability for users to provide their own keys for encryption isn't surprising (this is Apple, and uploading public keys for users doesn't seem on-brand for them), but I was surprised that there isn't any way for me to display a user's key. This would seem like a logical safeguard against man-in-the-middle attacks.

So if Apple wanted to enroll another email address, associate it with an Apple ID, and allow it to receive the APN notifications for FaceTime and receive a call, there isn't anything I can see that would stop them from doing that. I'm not suggesting they do or would, simply that it seems technically feasible (since we already know multiple devices receive a FaceTime call at the same time, and the enrollment of a new notification target depends more on the particular URI for that piece of the Apple ID, be it a phone number or email address).

So is this all XMPP or not?

I'm not entirely sure. The port is the same and there are some similarities in terms of message subscription, but the large amount of modification to handle the actual transfer of messages tells me that if this is XMPP behind the scenes, it has been heavily modified. I suspect the original design may have been something closer to stock, but over the years Apple has made substantial changes to how the secret sauce all works.

To me it still looks a lot like how I would expect this to function, with a massive distributed message queue. You connect to a random APN server, rand(0,255)-courier.push.apple.com, initiate a TLS handshake, and then messages are pushed to your device as identified by your token. Presumably at Apple's scale of billions of messages flowing at all times the process is more complicated on the back end, but I suspect a lot of the concepts are similar.
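
You can see a hint of that fan-out yourself by resolving one of the courier hostnames; this does nothing more than illustrate the naming pattern mentioned above:

import random
import socket

# Pick a courier hostname following the rand(0,255)-courier pattern and resolve it.
# Not every number in the range necessarily resolves.
host = f"{random.randint(0, 255)}-courier.push.apple.com"
try:
    print(host, socket.gethostbyname(host))
except socket.gaierror:
    print(host, "did not resolve; try another number")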

Conclusion

FaceTime is a great service that seems to rely on a very well understood and battle-tested part of the Apple ecosystem: their push notification service along with their Apple ID registration service. This process, which is also used by non-Apple applications to receive notifications, allows individual devices to quickly negotiate a client certificate, initiate a secure connection, use normal networking protocols to let Apple assist them with bypassing NAT, and then establish a connection between devices using standard SIP protocols. The quality is the result of Apple licensing good codecs and making devices capable of taking advantage of those codecs.

FaceTime and iMessage are linked together along with the rest of the Apple ID services, allowing users to register a phone number or email address as a unique destination.

Still a lot we don't know

I am confident a lot of this is wrong or out of date. It is difficult to get more information about this process, even with running some commands locally. I would love any additional information folks would be willing to share or to point me towards articles or documents I should read.


TIL Easy way to encrypt and decrypt files with Python and GnuPG

I often have to share files with outside parties at work, a process which previously involved a lot of me manually running gpg commands. I finally decided to automate the process and was surprised at how little time it took. Now I have a very simple Lambda-based encryption flow: importing keys from S3, encrypting files for delivery to end users, and then sending the encrypted message as the body of an email with SES.
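
The glue is roughly the sketch below: a hypothetical Lambda handler that pulls a recipient's public key from S3, encrypts a payload, and sends the ASCII-armored result with SES. Every bucket name, key name and email address here is a placeholder, and the individual pieces are broken out in the sections that follow.

import os

import boto3
import gnupg

s3 = boto3.client("s3")
ses = boto3.client("ses")

def handler(event, context):
    # Lambda only lets us write to /tmp, so keep the GnuPG home there.
    os.makedirs("/tmp/gnupg", exist_ok=True)
    gpg = gnupg.GPG(gnupghome="/tmp/gnupg")

    # 1. Import the recipient's public key from S3 (placeholder bucket/key).
    key = s3.get_object(Bucket="my-key-bucket", Key="recipient_public_key.asc")
    gpg.import_keys(key["Body"].read().decode("utf-8"))

    # 2. Encrypt the payload for that recipient (ASCII-armored by default).
    encrypted = gpg.encrypt(event["payload"], "recipient@example.com", always_trust=True)

    # 3. Send the encrypted blob as the body of an email with SES.
    ses.send_email(
        Source="noreply@example.com",
        Destination={"ToAddresses": ["recipient@example.com"]},
        Message={
            "Subject": {"Data": "Encrypted file"},
            "Body": {"Text": {"Data": str(encrypted)}},
        },
    )
    return {"ok": encrypted.ok}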

Requirements

  • gnupg installed, so the gpg binary is on your PATH (the scripts below check for it)
  • A Python GnuPG wrapper providing the gnupg module used below (python-gnupg, installable with pip install python-gnupg)

How to Import Keys

from pprint import pprint
import sys
from shutil import which

import gnupg

# Pass the key you want to import like this: python3 import_keys.py filename_of_public_key.asc
if which('gpg') is None:
    sys.exit("Please install gnupg on this system")

gpg = gnupg.GPG()
key_data = open(sys.argv[1], encoding="utf-8").read()
import_result = gpg.import_keys(key_data)
pprint(import_result.results)

public_keys = gpg.list_keys()
pprint(public_keys)

Encrypt a File

import sys
from shutil import which

import gnupg

# Example: python3 encrypt_file.py name_of_file.txt recipient@example.com

if which('gpg') is None:
    sys.exit("Please install gnupg on this system")

gpg = gnupg.GPG()
with open(sys.argv[1], 'rb') as f:
    status = gpg.encrypt_file(
        f,
        recipients=[sys.argv[2]],
        output=sys.argv[1] + '.gpg',
        always_trust=True,
    )

print('ok: ', status.ok)
print('status: ', status.status)
print('stderr: ', status.stderr)

Decrypt a File

import sys
from shutil import which

import gnupg

# Example: python3 decrypt_file.py name_of_file.txt.gpg passphrase

if which('gpg') is None:
    sys.exit("Please install gnupg on this system")

gpg = gnupg.GPG()
with open(sys.argv[1], 'rb') as f:
    status = gpg.decrypt_file(
        f,
        passphrase=sys.argv[2],
        output="decrypted-" + sys.argv[1],
    )

print('ok: ', status.ok)
print('status: ', status.status)
print('stderr: ', status.stderr)