
DevOps Engineer Crash Course - Section 1

Fake it till you make it Starfleet Captain Kelsey Grammer

I've had the opportunity lately to speak to a lot of DevOps engineers at startups around Europe. Some come from a more traditional infrastructure background, beginning their careers in network administration or system administration. Most are coming from either frontend or backend teams, choosing to focus more on the infrastructure work (which hey, that's great, different perspectives are always appreciated).

However, a pretty alarming trend has emerged through these conversations. The story usually starts with the existing sysadmin or DevOps person leaving, and suddenly they are dropped into the role with almost no experience or training. Left to their own devices with root access to the AWS account, they often have no idea where to even start. Learning on the job is one thing, but being responsible for the critical functioning of an entire company's infrastructure with no time to ramp up is crazy and frankly terrifying.

For some of these folks, it was the beginning of a love affair with infrastructure work. For others, it caused them to quit those jobs immediately in panic. I even spoke to a few who left programming as a career as a result of the stress they felt at the sudden pressure. That's sad for a lot of reasons, especially when these people are forced into the role. But it did spark an idea.

What advice and steps would I give someone who suddenly had my job with no time to prepare? My goal is to document what I would do if dropped into a position like that, along with my reasoning.

Disclaimer

These solutions aren't necessarily the best fit for every organization or application stack. I tried to focus on easy, relatively straightforward tips for people dropped into a role they have very little context on. As hard as this might be to believe for some people out there, a lot of smaller companies just don't have any additional infrastructure capacity, especially in some areas of Europe.

These aren't all strictly DevOps concepts as I understand the term to mean. I hate to be the one to tell you but, like SRE and every other well-defined term before it, businesses took the title "DevOps" and slapped it on a generic "infrastructure" concept. We're gonna try to stick to some key guiding principles but I'm not a purist.

Key Concepts

  1. We are a team of one or two people. There is no debate about build vs buy. We're going to buy everything that isn't directly related to our core business.
  2. These systems have to fix themselves. We do not have the time or capacity to apply the love and care of a larger infrastructure team. Think less woodworking and more building with Legos. We are trying to snap pre-existing pieces together in a sustainable pattern.
  3. Boring > New. We are not trying to make the world's greatest infrastructure here. We need something sustainable, easy to operate, and ideally something we do not need to constantly be responsible for. This means teams rolling out their own resources, monitoring their own applications, and allocating their own time.
  4. We are not the gatekeepers. Infrastructure is a tool and like all tools, it can be abused. Your organization is going to learn to do this better collectively.
  5. You cannot become an expert on every element you interact with. A day in my job can be managing Postgres operations, writing a PR against an application, or sitting in a planning session helping to design a new application. The scope of what many businesses call "DevOps" is too vast to be a deep-dive expert in all parts of it.

Most importantly, we'll do the best we can, but push the guilt out of your head. Mistakes are the cost of their failure to plan, not your failure to learn. A lot of the people I have spoken to who find themselves in this situation feel intense shame or guilt for not "being able to do a better job". Your employer messed up, you didn't.

Section One - Into the Fray

Maybe you expressed some casual interest in infrastructure work during a one on one a few months ago, or possibly you are known as the "troubleshooting person", assisting other developers with writing Docker containers. Whatever got you here, your infrastructure person has left, maybe suddenly. You have been moved into the role with almost no time to prepare. We're going to assume you are on AWS for this, but for the most part, the advice should be pretty universal.

I've tried to order these tasks in terms of importance.

1. Get a copy of the existing stack

Alright, you got your AWS credentials, the whole team is trying to reassure you not to freak out because "mostly the infrastructure just works and there isn't a lot of work that needs to be done". You sit down at your desk and your mind starts racing. Step 1 is to get a copy of the existing cloud setup.

We want to get your infrastructure as it exists right now into code because chances are you are not the only one who can log into the web panel and change things. There's a great tool for exporting existing infrastructure state in Terraform called terraformer.

Terraformer

So terraformer is a CLI tool written in Go that allows you to quickly and easily dump out all of your existing cloud resources into a Terraform repo. These files, either as TF format or JSON, will let you basically snapshot the entire AWS account. First, set up AWS CLI and your credentials as shown here. Then once you have the credentials saved, make a new git repo.

# Example flow

# Set up our credentials
aws configure --profile production

# Make sure they work
aws s3 ls --profile production 

# Make our new repo
mkdir infrastructure && cd infrastructure/
git init 

# Install terraformer
# Linux
curl -LO https://github.com/GoogleCloudPlatform/terraformer/releases/download/0.8.15/terraformer-all-linux-amd64
chmod +x terraformer-all-linux-amd64
sudo mv terraformer-all-linux-amd64 /usr/local/bin/terraformer

# Intel Mac
curl -LO https://github.com/GoogleCloudPlatform/terraformer/releases/download/0.8.15/terraformer-all-darwin-amd64
chmod +x terraformer-all-darwin-amd64
sudo mv terraformer-all-darwin-amd64 /usr/local/bin/terraformer

# Other Platforms
# https://github.com/GoogleCloudPlatform/terraformer/releases/tag/0.8.15

# Install terraform
# https://learn.hashicorp.com/tutorials/terraform/install-cli

First, if you don't know what region your AWS resources are in you can find that here.

So what we're gonna do is run:

terraformer import aws --regions INSERT_AWS_REGIONS_HERE --resources="*" --profile=production

### You will get a directory structure that looks like this
generated/
└── aws
    ├── acm
    │   ├── acm_certificate.tf
    │   ├── outputs.tf
    │   ├── provider.tf
    │   └── terraform.tfstate
    └── rds
        ├── db_instance.tf
        ├── db_parameter_group.tf
        ├── db_subnet_group.tf
        ├── outputs.tf
        ├── provider.tf
        └── terraform.tfstate

So if you wanted to modify something for rds, you would cd to the rds directory, then run terraform init. You may get an error: Error: Invalid legacy provider address

If so, no problem. Just run

terraform state replace-provider registry.terraform.io/-/aws hashicorp/aws

Once that is set up, you now have the ability to restore the AWS account using terraform at any time. You will want to add this repo to a CICD job eventually so this gets done automatically, but at first, you might need to run it locally.

$ export AWS_ACCESS_KEY_ID="anaccesskey"
$ export AWS_SECRET_ACCESS_KEY="asecretkey"
$ export AWS_DEFAULT_REGION="us-west-2"
$ terraform plan

You should see terraform run and tell you no changes.

Why Does This Matter?

Terraform lets us do a few things, one of which is roll out infrastructure changes like we would with any other code change. This is great because, in the case of unintended outages or problems, we can rollback. It also matters because often with small companies things will get broken when someone logs into the web console and clicks something they shouldn't. Running a terraform plan can tell you exactly what changed across the entire region in a few minutes, meaning you should be able to roll it back.

Should I do this if our team already manages our stack in code?

I would. There are tools like Ansible and Puppet which are great at managing servers that some people use to manage AWS. Often these setups are somewhat custom, relying on some trial and error before you figure out exactly how they work and what they are doing. Terraform is very stock and anyone on a DevOps chat group or mailing list will be able to help you run the commands. We're trying to establish basically a "restore point". You don't need to use Terraform to manage stuff if you don't want to, but you probably won't regret having a copy now.

Later on, we're going to be putting this into a CICD pipeline so we don't need to manage who adds infrastructure resources. We'll do that by requiring approval on PRs vs us having to write everything. It'll distribute the load but still let us ensure that we have some insight into how the system is configured. Right now though, since you are responsible for infrastructure you can at least roll this back.
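
A minimal sketch of what that CI check could look like, assuming the generated/aws layout Terraformer produced above (the loop is plain shell, so it should drop into whatever runner you end up with):

#!/usr/bin/env bash
# Hypothetical drift check: fail the pipeline if the AWS account no longer matches
# what is in the repo. terraform plan -detailed-exitcode exits with 2 when there is
# a diff, which set -e turns into a failed job.
set -euo pipefail

for dir in generated/aws/*/; do
  echo "Checking ${dir}"
  (
    cd "${dir}"
    terraform init -input=false > /dev/null
    terraform plan -detailed-exitcode -input=false
  )
done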

2. Write down how deployments work


Every stack is a little different in terms of how it gets deployed, and deployments are a constant source of problems for folks starting out. You need to be able to answer the question of how exactly code goes from a repo -> production. Maybe it's Jenkins, GitLab runners, GitHub Actions, CodeDeploy, etc., but you need to know the answer for each application. Most importantly, you need to read through whatever shell script they're running to actually deploy the application, because that will start to give you an idea of what hacks are required to get this thing up and running.

Here are some common questions to get you started.

  • Are you running Docker? If so, where do the custom images come from? What runs the Dockerfile, where does it push the images, etc.
  • How do you run migrations against the database? Is it part of the normal code base, is there a different utility?
  • What is a server to your organization? Is it a stock EC2 instance running Linux and Docker with everything else getting deployed with your application? Is it a server where your CICD job just rsyncs files to a directory Nginx reads from?
  • Where do secrets come from? Are they stored in the CICD pipeline? Are they stored in a secrets system like Vault or Secrets Manager? (Man if your organization actually does secrets correctly with something like this, bravo).
  • Do you have a "cron box"? This is a server that runs cron jobs on a regular interval outside of the normal fleet. I've seen these called "snowflake", "worker", etc. These are usually the least maintained boxes in the organization but often the most critical to how the business works.
  • How similar or different are the applications? Often organizations have a mix of serverless applications (managed either through the AWS web UI or tools like serverless) and conventional web servers. Lambdas in AWS are awesome tools that are often completely unmanaged in small businesses, so try and pay special attention to these.

The goal of all of this is to be able to answer "how does code go from a developer laptop to our customers". Once you understand that specific flow, then you will be much more useful in terms of understanding a lot of how things work. Eventually, we're going to want to consolidate these down into one flow, ideally into one "target" so we can keep our lives simple and be able to really maximize what we can offer the team.

Where do logs go and what stores them?

All applications and services generate logs. Logs are critical to debugging the health of an application, and knowing how that data is gathered and stored is critical to empowering developers to understand problems. This is the first week, so we're not trying to change anything, we just want to document how it works. How are logs generated by the application?

Some likely scenarios:

  • They are written to disk on the application server and pushed somewhere through syslog. Great: document the syslog configuration, where the logs get forwarded to, and finally whether logrotate is set up to keep the boxes from running out of disk space.
  • They get pushed to either the cloud provider or a monitoring provider (Datadog, etc.). Fine, couldn't be easier, but write down where the permission to push the logs comes from. What I mean by that is: does the app push the logs to AWS, or does an agent running on the box take the logs and push them up to AWS? Either is fine, but knowing which makes a difference.

Document the flow, looking out for expiration or deletion policies. Also check how access control works: how do developers access these raw logs? Hopefully through some sort of web UI, but if it is through SSH access to the log aggregator that's fine, just write it down.

For more information about CloudWatch logging check out the AWS docs here.
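
If some or all of these logs end up in CloudWatch, a quick inventory like this (using the production profile we set up earlier) is worth pasting into your notes. Log groups with no retentionInDays value are kept forever, which quietly costs money:

# List every CloudWatch log group with its retention policy and stored size
aws logs describe-log-groups \
  --profile production \
  --query 'logGroups[].[logGroupName,retentionInDays,storedBytes]' \
  --output table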

3. How does SSH access work?

You need to know exactly how SSH works from the developers' laptop to the server they are trying to access. Here are some questions to kick it off.

  • How do SSH public keys get onto a server? Is there a script, does it sync from somewhere, are they put on by hand?
  • What IP addresses are allowed to SSH into a server? Hopefully not all of them; most organizations have at least a bastion host or VPN set up. But test it out, don't assume the documentation is correct. Remember we're building new documentation from scratch and approaching this stack with the respect it deserves as an unknown problem.
  • IMPORTANT: HOW DO EMPLOYEES GET OFFBOARDED? Trust me, people forget this all the time and it wouldn't surprise me if you find some SSH keys that shouldn't be there.
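
A quick and dirty audit, assuming a hosts.txt file you maintain with one reachable hostname per line (the filename and the approach are just a sketch):

# Print the comment field of every authorized key on every box so ex-employees stand out
while read -r host; do
  echo "== ${host} =="
  ssh "${host}" "cut -d' ' -f3- ~/.ssh/authorized_keys"
done < hosts.txt

# The IAM credential report is the AWS-side equivalent: stale access keys and console
# users that should have been offboarded show up here (generation is async, so give it
# a few seconds before fetching)
aws iam generate-credential-report --profile production
aws iam get-credential-report --profile production --query Content --output text | base64 --decode > credential-report.csv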

I don't know anything about SSH

Don't worry we got you. Take a quick read through this tutorial. You've likely used SSH a lot, especially if you have ever set up a Digital Ocean or personal EC2 instance on a free tier. You have public keys synced to the server and private keys on the client device.

What is a bastion host?

They're just servers that exist to allow traffic from a public subnet to a private subnet. Not all organizations use them, but given the conversations I've had they seem to be a common pattern around the industry. The idea is a box that sits between the internet and our servers and acts as a bridge.
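
In practice the access pattern usually looks something like this (the hostnames, user and IP below are made up for illustration):

# Classic two-hop: SSH to the bastion first, then hop to the private instance
ssh ec2-user@bastion.example.com
ssh ec2-user@10.0.12.34

# Or in one step with a jump host (OpenSSH 7.3+), which is easier to document and script
ssh -J ec2-user@bastion.example.com ec2-user@10.0.12.34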

Do all developers need to access bastion hosts?

Nope, they sure don't. Access to the Linux instances should be very restricted and ideally we can get rid of it entirely as we go. There are much better and easier-to-operate options now through AWS (such as Systems Manager Session Manager) that let you get rid of the whole concept of bastion servers. But in the meantime, we should ensure we understand the existing stack.

Questions to answer

  • How do keys get onto the bastion host?
  • How does access work from the bastion host to the servers?
  • Are the Linux instances we're accessing in a private subnet or are they on a public subnet?
  • Is the bastion host up to date? Is the Linux distribution current with the latest patches? There shouldn't be any other processes running on these boxes, so upgrading them shouldn't be too bad.
  • Do you rely on SFTP anywhere? Are you pulling something down that is critical or pushing something up to SFTP? A lot of businesses still rely heavily on automated jobs around SFTP and you want to know how that authentication is happening.

4. How do we know the applications are running?

It seems from conversations that these organizations often have bad alerting stories. They don't know applications are down until customers tell them or they happen to notice. So you want to establish some sort of baseline early on, basically "how do you know the app is still up and running". Often now there is some sort of health check path, something like domain/health or /check or something, used by a variety of services like load balancers and Kubernetes to determine if something is up and functional or not.

First, understand what this health check is actually doing. Sometimes they are just hitting a webserver and ensuring Nginx is up and running. While interesting to know that Nginx is a reliable piece of software (it is quite reliable), this doesn't tell us much. Ideally, you want a health check that interacts with as many pieces of the infrastructure as possible. Maybe it runs a read query against the database to get back some sort of UUID (which is a common pattern).

This next part depends a lot on what alerting system you use, but you want to make a dashboard that you can check very quickly to determine "are my applications up and running". Infrastructure modifications are high-risk operations and sometimes when they go sideways, they'll go very sideways. So you want some visual system to determine whether or not the stack is functional, and ideally this should alert you through Slack or something similar. If you don't have a route like this, consider doing the work to add one. It'll make your life easier and probably isn't too complicated to do in your framework.
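
Even before a proper dashboard exists, a tiny curl check you can run from your laptop (or a cron job somewhere) buys some confidence during risky changes. The URL below is a placeholder for whatever your health route turns out to be:

#!/usr/bin/env bash
# Minimal smoke check: exit non-zero if the health route is down or slow
set -euo pipefail

URL="https://app.example.com/health"

# -f treats HTTP 4xx/5xx as a failure, --max-time bounds how long we wait
if curl -fsS --max-time 5 "${URL}" > /dev/null; then
  echo "OK: ${URL} is responding"
else
  echo "DOWN: ${URL} failed the health check" >&2
  exit 1
fi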

My first alerting tool is almost always Uptime Robot. So we're gonna take our health route and set an Uptime Robot alert on that endpoint. You shouldn't allow traffic from the internet at large to hit this route (because it is computationally expensive, it is susceptible to abuse by malicious actors). However, Uptime Robot provides a list of their IP addresses for whitelisting, so we can add those to our security groups in the terraform repo we made earlier.
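
One way to do that without hand-typing dozens of addresses is to turn their published list into CIDR blocks you can paste straight into the security group rule in that repo. The download URL here is an assumption, so check Uptime Robot's documentation for the current location of their IP list:

# Fetch Uptime Robot's IPv4 list and print it as /32 CIDR blocks ready for a
# Terraform security group rule (strip carriage returns, skip blank lines)
curl -s https://uptimerobot.com/inc/files/ips/IPv4.txt | tr -d '\r' \
  | awk 'NF { printf "  \"%s/32\",\n", $1 }'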

If you need a free alternative I have had a good experience with Hetrix. Setting up the alerts should be self-explanatory, basically hit an endpoint and get back either a string or a status code.

5. Run a security audit

Is he out of his mind? On the first week? Security is a super hard problem and one that startups mess up all the time. We can't make this stack secure in the first week (or likely month) of this work, but we can ensure we don't make it worse and, when we get a chance, we move closer to an ideal state.

The tool I like for this is Prowler. Not only does it allow you a ton of flexibility with what security audits you run, but it lets you export the results in a lot of different formats, including a very nice-looking HTML option.

Steps to run Prowler

  1. Install Prowler. We're gonna run this from our local workstation using the AWS profile we made before.

On our local workstation:
git clone https://github.com/toniblyx/prowler
cd prowler

  2. Run Prowler:

./prowler -p production -r INSERT_REGION_HERE -M csv,json,json-asff,html -g cislevel1

The command above runs the audit and writes the results out in several formats, but I want to focus for a second on the -g option. That's the group option and it basically means "which security audit are we going to run". The CIS Amazon Web Services Foundations Benchmark has 2 levels, which can be thought of broadly as:


Level 1: Stuff you should absolutely be doing right now that shouldn't impact most application functionality.

Level 2: Stuff you should probably be doing but is more likely to impact the functioning of an application.

We're running Level 1, because ideally, our stack should already pass a level 1 and if it doesn't, then we want to know where. The goal of this audit isn't to fix anything right now, but it IS to share it with leadership. Let them know the state of the account now while you are onboarding, so if there are serious security gaps that will require development time they know about it.

Finally, take the CSV file that was output from Prowler and stick it in Google Sheets with a date. We're going to want to have a historical record of the audit.

6. Make a Diagram!

The last thing we really want to do is make a diagram and have the folks who know more about the stack verify it. One tool that can kick this off is Cloudmapper. This is not going to get you all of the way there (you'll need to add meaningful labels and likely fill in some missing pieces) but should get you a template to work off of.

What we're primarily looking for here is understanding flow and dependencies. Here are some good questions to get you started.

  • Where are my application persistence layers? What hosts them? How do they talk to each other?
  • Overall network design. How does traffic ingress and egress? Do all my resources talk directly to the internet or do they go through some sort of NAT gateway? Are my resources in different subnets, security groups, etc?
  • Are there less obvious dependencies? SQS, RabbitMQ, S3, elasticsearch, varnish, any and all of these are good candidates.

The ideal state here is to have a diagram that we can look at and say "yes I understand all the moving pieces". For some stacks that might be much more difficult, especially serverless stacks. These often have mind-boggling designs that change deploy to deploy and might be outside of the scope of a diagram like this. We should still be able to say "traffic from our customers comes in through this load balancer to that subnet after meeting the requirements in x security group".

We're looking for something like this

If your organization has LucidChart they make this really easy. You can find out more about that here. You can do almost everything Lucid or AWS Config can do with Cloudmapper without the additional cost.

Cloudmapper is too complicated, what else have you got?

Does the setup page freak you out a bit? It does take a lot to set up and run the first time. AWS actually has a pretty nice pre-made solution to this problem. Here is the link to their setup: https://docs.aws.amazon.com/solutions/latest/aws-perspective/overview.html

It does cost a little bit but is pretty much "click and go" so I recommend it if you just need a fast overview of the entire account without too much hassle.

End of section one

Ideally the state we want to be in looks something like the following.

  • We have a copy of our infrastructure that we've run terraform plan against and there are no diffs, so we know we can go back.
  • We have an understanding of how the most important applications are deployed and what they are deployed to.
  • The process of generating, transmitting, and storing logs is understood.
  • We have some idea of how secure (or not) our setup is.
  • There are some basic alerts on the entire stack, end to end, which give us some degree of confidence that "yes the application itself is functional".

For many of you who are more experienced with this type of work, I'm sure you are shocked. A lot of this should already exist and really this is a process of you getting up to speed with how it works. Sadly, in my experience talking to folks who have had this job forced on them, many of these pieces were set up a few employees ago and the specifics of how they work are lost to time. Since we know we can't rely on the documentation, we need to make our own. In the process, we become more comfortable with the overall stack.

Stuff still to cover!

If there is any interest I'll keep going with this. Some topics I'd love to cover:

  • Metrics! How to make a dashboard that doesn't suck.
  • Email. Do your apps send it, are you set up for DMARC, how do you know if email is successfully getting to customers, where does it send from?
  • DNS. If it's not in the terraform directory we made before under Route53, it must be somewhere else. We gotta manage that like we manage a server because users logging into the DNS control panel and changing something can cripple the business.
  • Kubernetes. Should you use it? Are there other options? If you are using it now, what do you need to know about it?
  • Migrating to managed services. If your company is running its own databases or baking its own AMIs, now might be a great time to revisit that decision.
  • Sandboxes and multi-account setups. How do you ensure developers can test their apps in the least annoying way while still keeping the production stack up?
  • AWS billing. What are some common gotchas, how do you monitor spending, and what do you do about it institutionally?
  • SSO, do you need it, how to do it, what does it mean?
  • Exposing logs through a web interface. What are the fastest ways to do that on a startup budget?
  • How do you get up to speed? What courses and training resources are worth the time and energy?
  • Where do you get help? Are there communities with people interested in providing advice?

Did I miss something obvious?

Let me know! I love constructive feedback. Bother me on Twitter. @duggan_mathew


How does FaceTime Work?

As an expat living in Denmark, I use FaceTime audio a lot. Not only is it simple to use and reliable, but the sound quality is incredible. For those of you old enough to remember landlines, it reminds me of those, but as if you had a good headset. When we all switched to cell service, audio quality took a huge hit, and with modern VoIP home phones the problem hasn't gotten better. So when my mom and I chat over FaceTime Audio and the quality is so good it is like she is in the room with me, it really stands out compared to my many other phone calls in the course of a week.

So how does Apple do this? As someone who has worked as a systems administrator for their entire career, the technical challenges are kind of immense when you think about them. We need to establish a connection between two devices through various levels of networking abstraction, both at the ISP level and home level. This connection needs to be secure, reliable enough to maintain a conversation and also low bandwidth enough to be feasible given modern cellular data limits and home internet data caps. All of this needs to run on a device with a very impressive CPU but limited battery capacity.

What do we know about FaceTime?

A lot of our best information for how FaceTime worked (past tense is important here) comes from interested parties around the time the feature was announced, so around the 2010 timeframe. During this period there was a lot of good packet capture work done, and we got a sense of how the protocol functioned. For those who have worked in VoIP technologies in their career, it's going to look pretty similar to what you may have seen before (with some Apple twists). Here were the steps to a FaceTime call around 2010:

  • A TCP connection over port 5223 is established with an Apple server. We know that 5223 is used by a lot of things, but for Apple it's used for their push notification service. Interestingly, it is ALSO used for XMPP connections, which will come up later.
  • UDP traffic between the iOS device and Apple servers on ports 16385 and 16386. These ports might be familiar to those of you who have worked with firewalls. These are ports associated with audio and video RTP, which makes sense. RTP, or real-time transport protocol was designed to facilitate video and audio communications over the internet with low latency.
  • RTP relies on something else to establish a session and in Apple's case it appears to rely on XMPP. This XMPP connection relies on a client certificate on the device issued by Apple. This is why non-iOS devices cannot use FaceTime, even if they could reverse engineer the connection they don't have the certificate.
  • Apple uses ICE, STUN and TURN to negotiate a way for these two devices to communicate directly with each other. These are common tools used to negotiate peer to peer connections between NAT so that devices without public IP addresses can still talk to each other.
  • The device itself is identified by registering either a phone number or email address with Apple's server. This, along with STUN information, is how Apple knows how to connect the two devices. STUN, or Session Traversal Utilities for NAT, is when a device reaches out to a publicly available server and the server determines how this client can be reached.
  • At the end of all of this negotiation and network traversal, a SIP INVITE message is sent. This has the name of the person along with the bandwidth requirements and call parameters.
  • Once the call is established there are a series of SIP MESSAGE packets that are likely used to authenticate the devices. Then the actual connection is established and FaceTime's protocols take over using the UDP ports discussed before.
  • Finally the call is terminated using the SIP protocol when it is concluded. The assumption I'm making is that for FaceTime audio vs video the difference is minor, the primary distinction being the codec used for audio, AAC-ELD. There is nothing magical about Apple using this codec, but it is widely seen as an excellent choice.

That was how the process worked. But we know that in the later years Apple changed FaceTime, adding more functionality and presumably more capacity. According to their port requirements these are the ones required now. I've added what I suspect they are used for.

Port Likely Reason
80 (TCP) unclear but possibly XMPP since it uses these as backups
443 (TCP) same as above since they are never blocked
3478 through 3497 (UDP) STUN
5223 (TCP) APN/XMPP
16384 through 16387 (UDP) Audio/video RTP
16393 through 16402 (UDP) FaceTime exclusive

Video and Audio Quality

A FaceTime video call consists of 4 media streams in each call. The audio is AAC-ELD as described above, with an observed 68 kbps in each direction (or about 136 kbps total, give or take) consumed. Video is H.264 and varies quite a bit in quality, presumably depending on whatever bandwidth calculations were passed through SIP. We know that SIP has allowances for H.264 information about total consumed bandwidth, although the specifics of how FaceTime does on-the-fly calculations for what capacity is available to a consumer are still unknown to me.

You can observe this behavior by switching from cellular to wifi during a video call, where video compression is often visible during the switch (but interestingly the call doesn't drop, a testament to effective network interface handoff inside of iOS). With audio calls, this behavior is not replicated: the call either maintains roughly the same quality or drops entirely, suggesting less flexibility (which makes sense given the much lower bandwidth requirements).

So does FaceTime still work like this?

I think a lot of it is still true, but I wasn't entirely sure whether the XMPP component is still there. However, after more reading, I believe this is still how it works and indeed how a lot of Apple's iOS infrastructure works. While Apple doesn't have a lot of documentation available about the internals of FaceTime, one document that stood out to me was the security document. You can find that document here.

FaceTime is Apple’s video and audio calling service. Like iMessage, FaceTime calls use the Apple Push Notification service (APNs) to establish an initial connection to the user’s registered devices. The audio/video contents of FaceTime calls are protected by end-to-end encryption, so no one but the sender and receiver can access them. Apple can’t decrypt the data.

So we know that port 5223 (TCP) is used by both Apple's push notification service and also XMPP over SSL. We know from older packet dumps that Apple used to use 5223 to establish a connection to their own Jabber servers as the initial starting point of the entire process. My suspicion here is that Apple's push notifications work similarly to a normal XMPP pubsub setup.

  • Apple kind of says as much in their docs here.

This is interesting because it suggests the underlying technology for a lot of Apple's backend is XMPP, surprising because for most of us XMPP is thought of as an older, less-used technology. As discussed later, I'm not sure whether this is actually XMPP or something that just uses the same port. Alright, so messages are exchanged, but how about the key sharing? These communications are encrypted, but I'm not uploading or sharing public keys (nor do I seem to have any sort of access to said keys).

Keys? I'm lost, I thought we were talking about calls

One of Apple's big selling points is security, and iMessage became famous for being an encrypted text message exchange. Traditional SMS was not encrypted, and neither was most text-based communication, including email. Encryption is computationally expensive and wasn't seen as a high priority until Apple really made it a large part of the conversation for text communication. But why hasn't encryption been a bigger part of the consumer computer ecosystem?

In short: because managing keys sucks ass. If I want to send an encrypted message to you I need to first know your public key. Then I can encrypt the body of a message and you can decrypt it. Traditionally this process is super manual and frankly, pretty shitty.

Credit: Protonmail

So Apple must have some way of generating the keys (presumably on device) and then sharing the public keys. They do in fact: a service called IDS, or Apple Identity Service. This is what links up your phone number or email address to the public key for that device.

Apple has a nice little diagram explaining the flow:

As far as I can tell the process is much the same for FaceTime calls as it is for iMessage but with some nuance for the audio/video channels. The certificates are used to establish a shared secret and the actual media is streamed over SRTP.

Not exactly the same but still gets the point across

Someone at Apple read the SSL book

Alright, so SIP itself has a mechanism for handling encryption, but FaceTime and iMessage work on devices going all the way back to the iPhone 4. So the principle makes sense, but then I don't understand why we don't see tons of iMessage clones for Android. If there are billions of Apple devices floating around and most of this relies on local client-side negotiation, isn't there a way to fake it?

Alright, so this is where it gets a bit strange. There's a defined way of sending client certificates as outlined in RFC 5246. It appears Apple used to do this but they have changed their process. Now it's sent through the application, along with a public token, a nonce and a signature. We're gonna focus on the token and the certificate for a moment.

Token

  • 256-bit binary string

Example:

NSLog(@"%@", deviceToken);
// Prints "<965b251c 6cb1926d e3cb366f dfb16ddd e6b9086a 8a3cac9e 5f857679 376eab7C>"

Certificate

  • Generated on device APN activation
  • Certificate request sent to albert.apple.com
  • Uses two TLS extensions, ALPN and Server Name (SNI)

So why don't I have a bunch of great Android apps able to send this stuff?

As near as I can tell, the primary issue is two-fold. First, the protocol to establish the connection isn't standard. Apple uses ALPN to handle the negotiation and the client uses a protocol called apns-pack-v1 to handle this. So if you wanted to write your own application to interface with Apple's servers, you would first need to get the x.509 client certificate (which seems to be generated at the time of activation). You would then need to be able to establish a connection to the server using ALPN while passing the server name, which I don't know if Android supports. You also can't just generate this one time, as Apple only allows each device one connection. So if you made an app using values taken from a real Mac or iOS device, I think it would just cause the actual Apple device to drop: if your Mac connected, then the fake device would drop.

But how do Hackintoshes work? For those that don't know, these are normal x86 computers running macOS. Presumably they would have the required extensions to establish these connections and would also be able to generate the required certificates. This is where it gets a little strange. It appears the Mac's serial number is a crucial part of how this process functions, presumably passing some check on Apple's side to figure out "should this device be allowed to initiate a connection".

The way to do this is by generating fake Mac serial numbers as outlined here. The process seems pretty fraught, relying on a couple of factors. First, the Apple ID seems to need to be activated through some other device, and apparently the age of the ID matters. This is likely some sort of weighting system to keep the process from getting flooded with fake requests. However, it seems that before Apple completes the registration process it looks at the plist of the device and attempts to determine "is this a real Apple device".

Apple device serial numbers are not random values though, they are actually a pretty interesting data format that packs in a lot of info. Presumably this was done to make service easier, allowing the AppleCare website and Apple Stores a way to very quickly determine model and age without having to check with some "master Apple serial number server". You can check out the old Apple serial number format here: link.

This ability to brute force new serial numbers is, I suspect, behind Apple's decision to change the format of the serial number. By switching from a value that can be generated to a totally random value that varies in length, I assume Apple will be able to say with a much higher degree of certainty that "yes this is a MacBook Pro with x serial number" by doing a lookup in an internal database. This would make generating fake serial numbers for these generations of devices virtually impossible, since you would need to get incredibly lucky with the model, MAC address information, logic board ID and serial number all lining up.

How secure is all this?

It's as secure as Apple, for all the good and the bad that suggests. Apple is entirely in control of enrollment, token generation, certificate verification and exchange, along with the TLS handshake process. The inability for users to provide their own keys for encryption isn't surprising (this is Apple, and uploading public keys for users doesn't seem on-brand for them), but I was surprised that there isn't any way for me to display a user's key. That would seem like a logical safeguard against man-in-the-middle attacks.

So if Apple wanted to enroll another email address, associate it with an Apple ID and allow it to receive the APN notifications for FaceTime and receive a call, there isn't anything I can see that would stop them from doing that. I'm not suggesting they do or would, simply that it seems technically feasible (since we already know multiple devices receive a FaceTime call at the same time, and the enrollment of a new target for a notification depends on the particular URI for that piece of the Apple ID, be it phone number or email address).

So is this all XMPP or not?

I'm not entirely sure. The port is the same and there are some similarities in terms of message subscription, but the amount of modification around the actual transfer of messages tells me that if this is XMPP behind the scenes, it has been heavily modified. I suspect the original design may have been something closer to stock, but over the years Apple has made substantial changes to how the secret sauce all works.

To me it still looks a lot like how I would expect this to function, with a massive distributed message queue. You connect to a random APN server, rand(0,255)-courier.push.apple.com, initiate a TLS handshake and then messages are pushed to your device as identified by your token. Presumably at Apple's scale of billions of messages flowing at all times, the process is more complicated on the back end, but I suspect a lot of the concepts are similar.
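
You can poke at this yourself from any machine. The hostname below is just one instance of the courier naming pattern described above, and this assumes your network allows outbound traffic on 5223:

# Resolve one of the APN courier endpoints and check that something answers on 5223
dig +short 17-courier.push.apple.com
nc -vz 17-courier.push.apple.com 5223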

Conclusion

FaceTime is a great service that seems to rely on a very well understood and battle-tested part of the Apple ecosystem: their push notification service along with their Apple ID registration service. This process, which is also used by non-Apple applications to receive notifications, allows individual devices to quickly negotiate a client certificate, initiate a secure connection, use normal networking protocols to let Apple assist them with bypassing NAT, and then establish a connection between devices using standard SIP protocols. The quality is the result of Apple licensing good codecs and making devices capable of taking advantage of those codecs.

FaceTime and iMessage are linked together along with the rest of the Apple ID services, allowing users to register a phone number or email address as a unique destination.

Still a lot we don't know

I am confident a lot of this is wrong or out of date. It is difficult to get more information about this process, even with running some commands locally. I would love any additional information folks would be willing to share or to point me towards articles or documents I should read.



Why I'm Excited for the Steam Deck

Looks like a Nintendo Switch and a Game Gear had a baby

When the Steam Deck preorders went live, I went nuts. I was standing in my living room with an iPad, laptop and phone ready to go. Thankfully I got my order in quickly and I'm one of the lucky ones that gets to enjoy the Steam Deck in December of 2021. As someone who doesn't play a ton of PC games, mostly indie titles, I was asked by a few friends "why bother with a new console".

It's a good question, especially coming from a company like Valve. While I love them, Valve has been attempting to crack this particular nut for years. The initial salvo was "Steam OS", a Debian fork that was an attempt by Valve to create an alternative to Windows. Microsoft had decided to start selling applications and games through its Windows Store and Valve was concerned about Microsoft locking partners out. It's not crazy to think of a world in which Microsoft would require games to be signed with a Microsoft client certificate to access DirectX APIs, so an alternative was needed.

Well...kinda

So SteamOS launches with big dreams in 2014 and for the most part flops. While it has some nice controller-centric design elements that play well with the new Steam Controller, these "Big Picture" UI changes also come to Windows. Game compatibility is bad at first, then slowly gets better, but a lack of support for the big anti-cheat tools means multiplayer games are mostly out of the question. Steam Machines launch to a fizzle, with consumers not sure what they're paying for and Valve making a critical error.

Since they don't make the actual pieces of hardware, relying instead on third parties like Alienware to do it, they're basically trying to have their cake and eat it too. Traditionally game consoles work like this: companies sell the console at cost or for a slight profit. Then they make money on every game sold, originally just through licensing fees; now it's the licensing fee plus a cut of the console store transaction as games go digital. Steam as a platform makes its billions of dollars there, taking around 30% of the transaction for every digital good sold on its store.

So if you look at the original Steambox with SteamOS from the perspective of a consumer, it's a terrible deal. All of the complexity of migrating to Linux has been shifted to you or to Dell customer support. You need to know whether your games will work or not and you need to be in charge of fixing any problems that arise. The hardware partner can't sell the hardware at the kind of margin consoles usually get sold for, so you are paying more for your hardware. Game developers don't have any financial incentive to do the work of porting, because almost immediately the Steam Machine manufacturers shipped Windows versions of the same hardware, so chances are they don't care if it doesn't work on SteamOS.

The picture doesn't get much better if you are a game developer. Valve is still taking 30% from you, the hardware isn't flying off the shelf so chances are these aren't even new customers, just existing customers playing games they already paid for. You need to handle all the technical complexity of the port plus now your QA process is 2x as complicated. In short it was kind of a ridiculous play by Valve, an attempt to get the gaming community to finance and support their migration away from Windows with no benefit to the individual except getting to run Linux.

Alright so why is the Steam Deck different?

  • The Steam Deck follows the traditional console route. Valve is selling the units at close to cost, meaning you aren't paying the markup required to support a hardware manufacturer AND Valve. Instead they are eating the hardware cost to build a base, something everyone else has already done.
  • We know this form factor works. The Nintendo Switch is a massive hit among casual and serious gamers for allowing people to play both a large catalog of Nintendo titles on the go (which obviously the Steam Deck will not be able to) and a massive library of indies. Given the slow pace of Nintendo releases, I would argue it is the indie titles and ports of existing PC games that have contributed in large part to the Switch's success.
  • Valve has done the work through Proton (a fork of Wine, the Windows not-emulator) to ensure a deep library of games work. They have also addressed the anti-cheat vendors, meaning the cost to consumers in terms of what titles they will have access to has been greatly reduced.
  • They switched away from Debian, going with Arch. This means faster access to drivers and other technology in the Linux kernel and less waiting time for fixes to make their way to users. There is obviously some sacrifice in terms of stability, but given that they have a hardware target they can test against, I think the pros outweigh the cons.
  • A common CPU architecture. This is a similar chipset to the current crop of Sony and Microsoft consoles, hopefully reducing the amount of work required by engine makers and game developers to port to this stack.

Who Cares, I Already Have a Switch

The reason the Steam Deck matters in a universe where the Nintendo Switch is a massive success is that Nintendo simply cannot stay out of their own way. For long-term fans of the company, many of their decisions are frankly...baffling. A simple example is their lack of emphasis on online play, considered table stakes for most services now. Their account system is still a mess, playing with friends and communicating with them still relies on you either using your phone or using apps not owned by Nintendo, and in general they seem to either hate the online experience or would prefer to pretend it doesn't exist.

Dan Adelman, a former Nintendo employee who worked a lot with indie developers, shed some light on their internal culture years ago, which I think is still relevant:

Nintendo is not only a Japanese company, it is a Kyoto-based company. For people who aren't familiar, Kyoto-based companies are to Japanese companies as Japanese companies are to US companies. They're very traditional, and very focused on hierarchy and group decision making. Unfortunately, that creates a culture where everyone is an advisor and no one is a decision maker – but almost everyone has veto power.
Even Mr. Iwata is often loathe to make a decision that will alienate one of the executives in Japan, so to get anything done, it requires laying a lot of groundwork: talking to the different groups, securing their buy-in, and using that buy-in to get others on board. At the subsidiary level, this is even more pronounced, since people have to go through this process first at NOA or NOE (or sometimes both) and then all over again with headquarters. All of this is not necessarily a bad thing, though it can be very inefficient and time consuming. The biggest risk is that at any step in that process, if someone flat out says no, the proposal is as good as dead. So in general, bolder ideas don't get through the process unless they originate at the top.
There are two other problems that come to mind. First, at the risk of sounding ageist, because of the hierarchical nature of Japanese companies, it winds up being that the most senior executives at the company cut their teeth during NES and Super NES days and do not really understand modern gaming, so adopting things like online gaming, account systems, friends lists, as well as understanding the rise of PC gaming has been very slow. Ideas often get shut down prematurely just because some people with the power to veto an idea simply don't understand it.
The last problem is that there is very little reason to try and push these ideas. Risk taking is generally not really rewarded. Long-term loyalty is ultimately what gets rewarded, so the easiest path is simply to stay the course. I'd love to see Nintendo make a more concerted effort to encourage people at all levels of the company to feel empowered to push through ambitious proposals, and then get rewarded for doing so.

None of this is necessarily a bad culture; in fact I suspect this steady leadership and focus on long-term thinking is likely the reason we don't see Nintendo fall victim to every passing fad. However, it does mean that the things we don't like about the current situation with Nintendo (locking down their hardware, not playing well with online services, reselling old games instead of backwards compatibility) are unlikely to change.

On the flip side it also means we know Nintendo will make truly mysterious decisions on a regular basis and will not react to or even acknowledge criticism. On my Nintendo Switch I've burned through three Joy-Cons due to drift. I'm not a professional gamer and I play maximum an hour a day. If I am burning through these little controllers at this rate I imagine that more serious enthusiasts have either switched to the Pro controller a long time ago or are just living with tremendous problems. Despite two new models coming out, Nintendo hasn't redesigned their controllers to use better joysticks.

Even though the hardware supports it, the Switch doesn't allow me to use a Bluetooth headset. Online play for certain games either doesn't work or is designed in such a way as to be almost user-hostile. Splatoon 2, a flagship title for Nintendo, has largely abandoned its online community, just stopping its normal rotation of activities. Animal Crossing, maybe the biggest game of the COVID-19 lockdown, is a perfect game for casual gamers to enjoy online. But you cannot enjoy a large community of other gamers' islands without the heavy use of third-party tools, and even then the game is fighting you every step of the way.

So with a company like Nintendo, while I currently have a good experience with the Switch, it increasingly feels like it was a fluke. I'm not sure they know why it's so successful or what is currently holding it back, so it becomes difficult to have a lot of confidence that their future versions will prioritize the things I value. It would not surprise me at all if the Switch 2 didn't have backwards compatibility with previous games, or if there wasn't a Switch 2 at all but instead a shift back to a traditional box under the TV. I just can't assume with Nintendo that their next decision will make any sense.

What Challenges does the Steam Deck Face?

Loads. The Steam Deck, even with the work Valve has already put in, faces quite an uphill battle. Some of these will be familiar to Linux fans who have run Linux at work and on their personal machines for years. A few of these are just the realities of launching a new console.

  • Linux still doesn't do amazingly at battery life for portable devices. You can tune this (and I fully expect that Valve will) but considerable attention will need to be paid to battery consumption in the OS. With the wide range of games Valve is showing off, the Steam Deck is going to get a bad reputation among less technical folks if the battery lasts 30 minutes.
  • Technical support. Despite its flaws, the Nintendo Switch just works. There isn't anything you need to do in order to get it to function. Valve is not a huge company and games don't need to go through a long vetting process before you can launch them on the Deck. This means that when users encounter problems, which they will a lot at first, Valve is not going to be there to help. They simply have too much software. So it's entirely conceivable you can buy this thing, launch three games in a row that crash or barely run, and there is no number to call to help you.
  • Build quality and QA. I've purchased all the hardware that Valve has made up to this point and so far it's been pretty good. I especially like the controller, even though it is kind of a bizarre design. However, a controller is a lot less complicated than the Deck, and how Valve manages QA for the devices is going to be a big thing for consumers. You might love the Google Pixel phone, but their hardware support has been garbage compared to Apple and it makes a difference, especially to less technical users. How I can get the Deck fixed, what kind of build quality and consistency there is, etc. are all outstanding questions.
  • Finally, is Valve going to support the machine long-term? Valve loves experiments and has a work culture that is very flat and decentralized. Employees enjoy a great deal of flexibility in terms of what they work on, which is...a strategy. I don't know if it's the best strategy but it does seem to have worked pretty well for them. For this machine to be the kind of success I think they want it to be, customers are going to want to see a pretty high level of software quality out of the gate and for that quality to improve over time. If Valve loses interest (or if the Proton model of compatibility turns out to require a lot of hand-holding per title for the Deck) I could easily see Valve abandoning this device with the justification that users "can load their own OS on there".

In closing, the Steam Deck is a fascinating opportunity for the Linux gaming community. We might finally have a first-class hardware target for developers, backed by a company with the financial assets and interest in solving the myriad of technical problems along the way. It could be a huge step towards breaking Microsoft's dominance of the PC gaming market and, more importantly, bringing some of the value of the less regulated PC gaming space to the console market.

However a lot of this is going to depend on Valve's commitment to the device for the first 12 months of its life. Skeptics are going to be looking closely to see how quickly software incompatibility issues are addressed, consumers are going to want to have an experience similar to the Switch in terms of "pick up and play" and Linux fans are going to want to enjoy a lot of flexibility. These are hard things to balance, especially for a company with some hardware experience but likely nothing on the anticipated scale of the Steam Deck.


TIL Easy way to encrypt and decrypt files with Python and GnuPG

I often have to share files with outside parties at work, a process which previously involved a lot of me manually running gpg commands. I finally decided to automate the process and was surprised at how little time it took. Now I have a very simple Lambda based encryption flow importing keys from S3, encrypting files for delivery to end users and then sending the encrypted message as the body of an email with SES.

Requirements

  • The gpg binary installed and available on your PATH (the scripts below check for it with shutil.which)
  • The python-gnupg library, installed with pip install python-gnupg (this provides the gnupg module imported below)

How to Import Keys

from pprint import pprint
import sys
from shutil import which

import gnupg


# Pass the key you want to import like this: python3 import_keys.py filename_of_public_key.asc
if which('gpg') is None:
    sys.exit("Please install gnupg in linux")

gpg = gnupg.GPG()
key_data = open(sys.argv[1], encoding="utf-8").read()
import_result = gpg.import_keys(key_data)
pprint(import_result.results)

public_keys = gpg.list_keys()
pprint(public_keys)

Encrypt a File

import sys
from shutil import which

import gnupg

# Example: python3 encrypt_file.py name_of_file.txt recipient@example.com

if which('gpg') is None:
    sys.exit("Please install gnupg in linux")

gpg = gnupg.GPG()
with open(sys.argv[1], 'rb') as f:
    status = gpg.encrypt_file(
        f, recipients=[sys.argv[2]],
        output=sys.argv[1] + '.gpg',
        always_trust=True
    )

    print('ok: ', status.ok)
    print('status: ', status.status)
    print('stderr: ', status.stderr)

Decrypt a File

import sys
from shutil import which

import gnupg

# Example: python3 decrypt_file.py name_of_file.txt passphrase

if which('gpg') is None:
    sys.exit("Please install gnupg in linux")

gpg = gnupg.GPG()
with open(sys.argv[1], 'rb') as f:
    status = gpg.decrypt_file(
        file=f,
        passphrase=sys.argv[2],
        output=("decrypted-" + sys.argv[1])
    )

    print('ok: ', status.ok)
    print('status: ', status.status)
    print('stderr: ', status.stderr)
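
For reference, a round trip with the three scripts above might look like this (the filenames, recipient address and passphrase are all made up):

# Import the other party's public key
python3 import_keys.py their_public_key.asc

# Encrypt a file so only they can read it
python3 encrypt_file.py report.csv recipient@example.com

# Decrypt something they sent back to us (writes decrypted-report-reply.csv.gpg)
python3 decrypt_file.py report-reply.csv.gpg 'our-key-passphrase'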

Easier alternative to Nginx + Lets Encrypt with Caddy Docker Proxy

So this is a request I get probably 4-5 times a year. "I'm looking to host a small application in docker and I need it to be easy to run through a GitLab/GitHub CICD pipeline, it needs SSL and I never ever want to think about how it works." Up until this point in my career the solution has been pretty consistent: Nginx with Let's Encrypt. Now you might think "oh, this must be a super common request and very easy to do." You would think that.

However the solution I've used up to this point has been frankly pretty shitty. It usually involves a few files that look like this:

services:
    web: 
        image: nginx:latest
        restart: always
        volumes:
            - ./public:/var/www/html
            - ./conf.d:/etc/nginx/conf.d
            - ./certbot/conf:/etc/nginx/ssl
            - ./certbot/data:/var/www/certbot
        ports:
            - 80:80
            - 443:443

    certbot:
        image: certbot/certbot:latest
        command: certonly --webroot --webroot-path=/var/www/certbot --email you@domain.com --agree-tos --no-eff-email -d domain.com -d www.domain.com
        volumes:
            - ./certbot/conf:/etc/letsencrypt
            - ./certbot/logs:/var/log/letsencrypt
            - ./certbot/data:/var/www/certbot

This sets up my webserver with Nginx bound to host ports 80 and 443, along with the certbot image. Then I need to add the Nginx configuration to handle forwarding traffic to the actual application, which is defined later in the docker-compose file along with everything else I need. It works, but it's a hassle. There's a good walkthrough of how to set this up if you are interested here: https://pentacent.medium.com/nginx-and-lets-encrypt-with-docker-in-less-than-5-minutes-b4b8a60d3a71

This obviously works but I'd love something less terrible. Enter Caddy Docker Proxy: https://github.com/lucaslorentz/caddy-docker-proxy. Here is an example of Grafana running behind SSL:

services:
  caddy:
    image: lucaslorentz/caddy-docker-proxy:ci-alpine
    ports:
      - 80:80
      - 443:443
    environment:
      - CADDY_INGRESS_NETWORKS=caddy
    networks:
      - caddy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - caddy_data:/data
    restart: unless-stopped
    
  grafana:
    environment:
      GF_SERVER_ROOT_URL: "https://GRAFANA_EXTERNAL_HOST"
      GF_INSTALL_PLUGINS: "digiapulssi-breadcrumb-panel,grafana-polystat-panel,yesoreyeram-boomtable-panel,natel-discrete-panel"
    image: grafana/grafana:latest-ubuntu
    restart: unless-stopped
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/grafana.ini:/etc/grafana/grafana.ini
    networks:
      - caddy
    labels:
      caddy: grafana.example.com
      caddy.reverse_proxy: "{{upstreams 3000}}"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - prometheus-data:/prometheus
      - ./prometheus/config:/etc/prometheus
    ports:
      - 9090:9090
    restart: unless-stopped
    networks:
      - caddy

How it works is super simple. Caddy listens on the external ports and proxies traffic to your docker applications. In return, your docker applications tell Caddy Docker Proxy what URL they need via labels. It goes out, generates the SSL certificate for grafana.example.com as specified above, and stores it in its volume. That's it; otherwise you are good to go.
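Once the stack is up, a quick sanity check that the certificate was really issued only takes a few lines of Python (grafana.example.com here is just the hostname from the example above):

# Connect over TLS and print who issued the certificate and when it expires.
import socket
import ssl

hostname = "grafana.example.com"  # replace with your own domain
context = ssl.create_default_context()
with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        cert = tls.getpeercert()
        print("Issuer: ", dict(item[0] for item in cert["issuer"]))
        print("Expires:", cert["notAfter"])

If Caddy did its job, the issuer will be Let's Encrypt (or whichever ACME CA Caddy picked).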

Let's use the example from this blog as a good test case. If you want to set up a site that is identical to this one, here is a great template docker compose for you to run.

services:

  ghost:
    image: ghost:latest
    restart: always
    networks:
      - caddy
    environment:
      url: https://matduggan.com
    volumes:
      - /opt/ghost_content:/var/lib/ghost/content
    labels:
      caddy: matduggan.com
      caddy.reverse_proxy: "{{upstreams 2368}}"

  caddy:
    image: lucaslorentz/caddy-docker-proxy:ci-alpine
    ports:
      - 80:80
      - 443:443
    environment:
      - CADDY_INGRESS_NETWORKS=caddy
    networks:
      - caddy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - caddy_data:/data
    restart: unless-stopped

networks:
  caddy:
    external: true

volumes:
  caddy_data: {}

So all you need to do in order to make a copy of this site in docker-compose is:

  1. Install Docker Compose.
  2. Run docker network create caddy
  3. Replace matduggan.com with your domain name
  4. Run docker-compose up -d
  5. Go to your domain and set up your Ghost credentials.

It really couldn't be easier, and it works like that for a ton of things like WordPress, Magento, etc. It is, in many ways, idiot-proof AND super easy to automate with a CICD pipeline using a free tier.

I've dumped all the older Let's Encrypt + Nginx configs for this one and couldn't be happier. Performance has been amazing, Caddy as a tool is extremely stable, and I have not had to think about SSL certificates since I started. Strongly recommended for small developers looking to just get something on the internet without thinking about it, or even for larger shops with an application that doesn't justify going behind a load balancer.


My Development Setup

I've been seeing a lot of these going around lately and thought it might be fun to write up my own. I have no idea if this is typical or super bizarre, but it has worked extremely well for me for the last few years.

Development Hardware

  • Raspberry Pi 4
  • KODI Raspberry Pi 4 Case: Link
  • Fideco M.2 NVME External Enclosure: Link
  • WD Blue SN550 1TB SSD: Link
  • Anker Powered USB hub: Link
Beautiful collection of loose wires

Traditionally I've been a desktop person for doing Serious Work™, but most jobs have gotten rid of the desktop option. With the lockdown and remote work becoming the new normal, it's unlikely we are ever going back to a desktop lifestyle. However, the benefits of a desktop still matter to me: a stable hardware target that lets me keep the same virtual terminal sessions alive for weeks or months, easily expandable storage, and enough CPU and memory to let it sit unattended.

While the $75 Raspberry Pi 4 probably doesn't seem like it would fall into that category, it is actually plenty fast for the work I do, with the exception of Docker. Writing Terraform, Python, and Go is fast and pleasant, the box itself is extremely stable, and with the new option to boot off of USB I have tons of storage space and a drive that can handle the read/write load of daily work. While the Raspberry Pi 4 as my headless work machine started as a bit of a lark, it's grown into an incredibly useful tool. There are also a lot of Docker images available for the Raspberry Pi 4 out of the box.

Software

I know there are more options for Raspberry Pi OS than ever before, but I've stuck with Raspbian for a number of years now. A number of other folks swear by Ubuntu, but I've had enough negative experiences that I'm soured on that ecosystem. Raspbian has been plenty stable for development work, mostly getting rebooted for kernel upgrades.

I have no opinion on the merits of Vim vs Emacs; I've only ever used Vim, and at this point my interest in learning a new text editor is extremely low. Vim works reliably and never seems to introduce anything I would consider a shocking change in behavior. I understand that Vim vs NeoVim is really a conversation about community-based development vs a single maintainer, but in general I don't really care until I'm forced to care.

If you are interested in learning how to use Vim, there are a ton of great resources. Vim itself has a tutorial but I've never seen newcomers get a lot out of it. For me Vim didn't click until I worked my way through the Vim Bible. In terms of hours saved in my life, working through that book might be one of the best decisions I ever made for my career. Easily thousands of hours saved. If you prefer a more practical tutorial I love Vim Golf.

  • Tmux terminal multiplexer

I use Tmux a few hundred times a week and have nothing but good things to say about it. For those who don't know, Tmux allows you to have several terminal windows open and active at the same time, while still allowing you to disconnect and leave them running. This means when I start working in the morning, I connect to my existing Tmux session and all of my work is still there. You can do things like run long-running scripts in the background, etc. It's great and you can get started using it here: Tmux tutorial.

  • Environment variable management with direnv

I might be the last person on earth to discover this. For a long time I've been overloading my ~/.profile with all the different environment variables needed to do my work. Since I spend a lot of time working with and testing CICD pipelines, serverless applications, etc., environment variables and their injection are how a lot of application configuration is handled. Direnv is great, letting you dynamically load and unload those values per project by directory, meaning you never have to wonder whether a program is broken because you accidentally reused the same environment variable in two projects.

  • Manage my various dotfiles with chezmoi

Chezmoi is an interesting tool and one whose utility is so obvious that I'm shocked nobody made it before now. It's a tool that allows you to manage all of those various configuration files with git, a tool you likely use a hundred times a day anyway. Basically you add dotfiles to a repo with the chezmoi tool, push them to a remote repo and then pull them down again on a new machine or just keep them updated across all your various work devices.

None of this would be too amazing if all it did was make a git repo, but it also includes tools like templates to deploy different defaults to different machines as shown here. It also does all the hard work to make secret management as easy as possible, integrating with my favorite password manager 1Password. See how that works here. With Chezmoi, the time I spent customizing my configurations to match my workflow exactly is not wasted when I switch jobs or laptops, and I can easily write setup scripts to get a new Raspberry Pi or Raspbian install back to exactly how I want it without having to make something like an Ansible playbook.

I just started using Starship a few weeks ago, and I'm still not sure if I love it. Typically, this sort of stuff annoys me, overwhelming my terminal window with useless information. But I have to say the team behind this tool really nailed it.

Very simple and easy to read

Without any configuration the tool understood my AWS region from my AWS Config file, told me my directory and otherwise got out of my way.

Even reminds me I'm inside of a virtual environment for python!

Inside a Git repo it tracked my git status and the Python version for this project, and it even let me set a fun emoji for Python, which definitely isn't required but which I also don't hate. One problem I ran into was that emojis didn't render correctly by default. I solved this by installing this font and setting it as my font in iTerm. However, if that doesn't work, Starship has more troubleshooting information here.

I use SSH all the time, my keys all have passphrases, but I hate entering them a million times a day. Keychain manages all that for me and is one of the first things I install.

If you write in a language where you want to trigger some action when a file changes, entr will save you from rerunning the same six commands in the terminal a thousand times. I use it constantly when writing any compiled language.

Moreutils is a collection of tools that you would think came out of the box. Tools like sponge, which writes standard input to a file, isutf8, which just checks whether a file is valid UTF-8, and more. I use sponge on a weekly basis at least and love all these tools.

A CLI tool to work with PDFs. I've used this in serious business applications without issue for years. It lets you combine and modify PDFs in shell scripts.

I work with Git and GitHub all day every day. Hub lets me do more with GitHub, handling almost everything I would normally do through the web UI from the terminal. While I like GitHub's interface quite a bit, this just saves me time during the day and keeps me from breaking out of my task and getting distracted. For GitLab users this seems to be roughly the same: link

When you work with web applications in Docker, you spend a lot of time curling endpoints to see if stuff is working. I use this for healthchecks, metrics endpoints, etc. So imagine my pleasure at discovering httpie, a much nicer-to-read curl. It has options like --session=, which lets you simulate having a consistent session, and --offline, which builds the request without actually sending it. I use this tool all the time.

I use man a lot with my command line tools, but sometimes I don't want to get into all the millions of options a tool has and just want some examples of commonly used things. For that, tldr can save me some time and tealdeer is a very nice interface for those pages.

Datamash is one of the weirder tools I use. Basically it allows you to run numeric and statistical operations against text files, and sometimes do analysis of those files even if the format is messed up. I'm not exactly sure how it works, but sometimes it really saves me a ton of time with stranger files.

If you work locally with remote APIs, get ngrok. It handles all the tunneling for you, allowing you to simulate having a publicly available server on your local laptop. It has revolutionized my workflow and I cannot recommend it highly enough.


TIL Command to get memory usage by process in Linux

If, like me, you are constantly juggling a combination of ps and free to figure out what is eating all your memory, check this out:

ps -eo size,pid,user,command --sort -size | \
    awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }'

TIL docker-compose lies to you....

You, like me, might assume that when you write a docker-compose healthcheck, it does something useful with that information. So for instance you might add something like this to your docker-compose file:

healthcheck:
      test: ["CMD", "curl", "-f", "-L", "http://localhost/website.aspx"]
      interval: 5s
      timeout: 10s
      retries: 2
      start_period: 60s

You run your container in production, and when the container is running but no longer working, your site goes down. Being a reasonable human being, you check docker-compose ps to see if Docker knows your container is down. Weirdly, Docker DOES know that the container is unhealthy, but it seems to do nothing with this information.

Wait, so Docker just records that the container is unhealthy?

Apparently! I have no idea why you would do that or what the purpose of a healthcheck is if not to kill and restart the container. However there is a good solution.

The quick fix to make standalone Docker do what you want

Add the autoheal service to your docker-compose file alongside your other services:

  autoheal:
    image: willfarrell/autoheal:latest
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all
      - AUTOHEAL_START_PERIOD=60

This small container will automatically restart unhealthy containers and works great. Huge fan.
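If you'd rather not run an extra container, the same idea is only a few lines with the Docker SDK for Python. This is just a rough sketch of the technique, assuming the docker package is installed and the process can reach the Docker socket:

# Minimal stand-in for what autoheal does: find unhealthy containers and restart them.
# Assumes "pip install docker" and access to /var/run/docker.sock.
import time

import docker

client = docker.from_env()

while True:
    for container in client.containers.list(filters={"health": "unhealthy"}):
        print("Restarting unhealthy container:", container.name)
        container.restart()
    time.sleep(30)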


GRADO SR80e Headphone Review

The Best Headphones I've Ever Owned

I'm pretty new to the whole audiophile world. It wasn't until I started working in an open office in Chicago that the need for headphones became an obsession. One concept I've run across a lot is the idea of "endgame headphones", which are presumably the last headphones you'll ever need to buy. I don't know if the SR80e's are that, but they're damn close.

Wait, who the hell is Grado?

Don't be embarrassed, I also had no idea. As someone who spent years going through Apple headphones, I'm far from an audiophile. It turns out Grado is a fascinating business. They are a US-based family business in south Brooklyn, and you would have no idea what you were looking at if you drove by.

They've been making the real deal for the audiophile community since the 50s, starting out with phono cartridges for turntables. I strongly recommend reading through their company timeline, which they've put on their website as an easy-to-read scrolling page. You can find that here.

What's not to love about a global HQ like this?

Packaging

The SR80e came in one of the strangest packages for electronics I've ever seen. I bought it from Amazon and got a very nice but extremely flimsy cardboard box with the headphones. It didn't bother me, but I am glad I bought a carrying case. This is the one I ended up with.

This is minimal packaging at its best. You get: Headphones, Warranty, Grado story-sheet, 6.5mm Golden Adapter and that's it. So if you need anything more, make sure you buy it. I recommend a DAC at the very least, which I'll have a review up later about the ones I tried. One surprising thing was the headphones are made in the US, which shocked me at the $99 price point.

Fit and Feel

First impression: these headphones remind me of my dad's ancient hi-fi gear. They feel solid, with a nice weight that is good to pick up but isn't too heavy on the head. The headband adjusts nicely to my head and the cord is remarkably thick, like industrial thick. There is something incredible, in this modern age of aluminum and glass, about having something that feels retro in a fun way. Throwing them on the scale, they weigh about 235 g, not counting the cord. I found these a lot more comfortable to wear than the AirPods Max I tried around the same time, which weigh in at 385 grams.

The best way to describe these headphones is "professional grade". They feel like they could last for years and I have no doubt I could use them daily with no problems. The foam ear cushions are comfortable enough, and I love that they are replaceable for when I eventually wear them out. There are no bells and whistles here, no mic or anything extra. These are designed to play music.

I love the grill mesh look that lets you see the drivers. The ear cups are fully rotatable, and you get the sense that if you ever needed to break these open and solder a wire back on, you could. The sturdy design philosophy extends to the cable, which clocks in at an extremely long 2 m, or about 7 ft. Whatever way Apple designs their incredibly terrible cables, Grado does the opposite, with a thick cable and durable strain relief at the jack.

Sound Quality

These are some of the best-selling headphones in the "beginning audiophile" section of websites, and once you start listening to them, you can tell why. I don't "burn in" headphones because I think it's junk science; I think you just get used to how they sound, which is why people report an "increase in quality". Most of the headphones I've owned have had some sort of "boost" in them, boosting either the bass or the midrange.

The SR80e doesn't seem to do that, and while it's hard to explain, the result is music that sounds "correct". There's a smoothness to the sound that reveals layers to music that I have not experienced before. I've always been suspicious of people who claim they can instantly tell the quality of speakers or headphones, mostly because sound feels like a very subjective experience to me. But when re-listening to old favorite albums I felt like I was in the studio or listening to them live.

Common Questions about Sound:

  1. Are they good for an open office or shared working space? No, they're open-back headphones which means everyone will hear your music.
  2. Are these good for planes? No, they have no sound isolation or noise cancellation.
  3. What kinds of music sound awesome on these? I love classical music on these headphones along with rock/alternative that has vocals. EDM was less good and I felt I needed more bass to really get into it.

Should I buy them?

I love them and strongly recommend them.


Download Mister Rogers Neighborhood with Python

A dad posted on a forum I frequent in Denmark asking for some help. His child loves Mister Rogers, but he was hoping for a way to download a bunch of episodes that didn't involve streaming them from the website to stick on an iPad. I love simple Python projects like this and so I jumped on the chance. Let me walk you through what I did.

If you just want to download the script you can skip all this and find the full script here.

Step 1: Download Youtube-DL

My first thought was of youtube-dl for the actual downloading and thankfully it worked great. This is one of those insanely useful utilities that I cannot recommend highly enough. You can find the download instructions here: http://ytdl-org.github.io/youtube-dl/download.html
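The script just calls youtube-dl for each episode URL; the real invocation lives in the GitLab snippet linked further down, but the general shape, shown here with subprocess and an example output template, is roughly this:

# Rough sketch of driving youtube-dl from Python. The output template and the
# helper name are examples only; see the GitLab snippet for the actual script.
import subprocess


def download_episode(url, destination):
    subprocess.run(
        ["youtube-dl", "--output", destination + "/%(title)s.%(ext)s", url],
        check=True,
    )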

Step 2: Install Python 3

You shouldn't need a super modern version of Python. I wrote this with Python 3.7.3, so anything that version or newer should be good. We are using f-strings because I love them, so you will need 3.6 or newer.

Download Python here.

I'm checking the version here but only to confirm that you are running Python 3, on the assumption that if you have 3 you have a relatively recent version of 3.

import platform
import sys

version = platform.python_version_tuple()
if version[0] != "3":
    print("You are not running Python 3. Please check your version.")
    sys.exit(1)

Step 3: Decide where you are going to download the files

I have my download location in the script here:

path = "/mnt/usb/television/mister-rogers-neighborhood/"

However, if you just want the episodes to go to your Downloads folder, delete the line above and uncomment the line right before it by removing the #, so that path = str(Path.home() / "Downloads") is the active line instead.
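After that change, the relevant lines end up looking like this (Path comes from the standard library's pathlib):

from pathlib import Path

# Download into the current user's Downloads folder instead of a fixed path
path = str(Path.home() / "Downloads")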

Step 4: Run the script

Not sure how to run a Python script? We've got you covered. Click here for Windows. Here are some Mac tips.

You can find the script on Gitlab here: https://gitlab.com/-/snippets/2100082

Download the script and run it locally. The script checks if it is the first or third Monday of the month and only runs the download if it is. This is to basically keep us from endlessly spamming the servers hosting this great free content.

The first Monday of every month will feature programs from the early years 1968-1975. The third Monday of every month will feature programs from the “Theme Weeks” library 1979-2001.

NOTE: If you just want to download 5 episodes right now, delete these lines:

today = date.today()
# Only proceed on the first or third Monday of the month
week_of_month = (today.day - 1) // 7 + 1
if today.weekday() == 0 and week_of_month in (1, 3):
    logging.info("There is a new download available.")
else:
    logging.info("There are no new downloads today.")
    sys.exit(0)

Step 5: Set the script to run every day

This script is designed to be run every day and only go out to the servers if there is a new file to get.

Here is how to run a python script every day on Windows.

For Linux and Mac open up your terminal, run crontab -e and enter in the frequency you want to run the script at. Here is a useful site to generate the whole entry.

File Formatting

Here is the metadata formatting I followed for Infuse, my favorite iOS media player app. You may want a different filename format depending on your application.

Questions?

If people actually use this script I'll rewrite it to use celery beat to handle the scheduling of the downloads, but for my own use case I'm comfortable writing cron jobs. However if you run into issues running this, either add a comment on the GitLab link or shoot me an email: mat at matduggan.com.