Networking

How Mobile Networks Work

September 27, 2024 in Networking

I've spent a fair amount of time around networking. I've worked for a small ISP, helped to set up campus and office networks and even done a fair amount of work with BGP and assisting with ISP failover and route work. However in my current role I've been doing a lot of mobile network diagnostics and troubleshooting which made me realize I actually don't know anything about how mobile networks operate. So I figured it was a good idea for me to learn more and write up what I find.

It's interesting that without a doubt cellular internet is either going to become or has become the default Internet for most humans alive, but almost no developers I know have any idea how it works (including myself until recently). As I hope that I demonstrate below, it is untold amounts of amazing work that has been applied to this problem over decades that has really produced incredible results. As it turns out the network engineers working with cellular were doing nuclear physics while I was hot-gluing stuff together.

I am not an expert. I will update this as I get better information, but use this as a reference for stuff to look up, not a bible. It is my hope, over many revisions, to turn this into a easier to read PDF that folks can download. However I want to get it out in front of people to help find mistakes.

TL/DR: There is a shocking, eye-watering amount of complexity when it comes to cellular data as compared to a home or datacenter network connection. I could spend the next six months of my life reading about this and feel like I barely scratched the surface. However I'm hoping that I have provided some basic-level information about how this magic all works.

Corrections/Requests: https://c.im/@matdevdug. I know I didn't get it all right, I promise I won't be offended.

Basics

A modern cellular network at the core is comprised of three basic elements:

the RAN (radio access network)
CN (core network)
Services network

RAN

The RAN contains the base stations that allow for the communication with the phones using radio signals. When we think of a cell tower we are thinking of a RAN. When we are thinking of what a cellular network provides in terms of services, a lot of that is actually contained within the CN. That's where the stuff like user authorization, services turned on or off for the user and all the background stuff for the transfer and hand-off of user traffic. Think SMS and phone calls for most users today.

Key Components of the RAN:

Base Transceiver Station (BTS): The BTS is a radio transmitter/receiver that communicates with your phone over the air interface.
Node B (or Evolved Node B for 4G or gNodeB for 5G): In modern cellular networks, Node B refers to a base station that's managed by multiple cell sites. It aggregates data from these cell sites and forwards it to the RAN controller.
Radio Network Controller (RNC): The RNC is responsible for managing the radio link between your phone and the BTS/Node B.
Base Station Subsystem (BSS): The BSS is a term used in older cellular networks, referring to the combination of the BTS and RNC.

Link: https://www.cisco.com/c/en/us/products/collateral/wireless/nb-06-radio-access-networks-cte-en.html

Startup

Cell Search and Network Acquisition. The device powers on and begins searching for available cells by scanning the frequencies of surrounding base stations (e.g., eNodeB for LTE, gNodeB for 5G).

┌──────────────┐             ┌──────────────┐
│  Base Station│             │   Mobile     │
│              │             │   Device     │
│   Broadcast  │             │              │
│  ──────────> │ Search for  │ <──────────  │
│              │ Sync Signals│ Synchronizes │
│              │             │              │
└──────────────┘             └──────────────┘

- Device listens for synchronization signals.
- Identifies the best base station for connection.

Random Access. After identifying the cell to connect to, the device sends a random access request to establish initial communication with the base station.This is often called RACH. If you want to read about it I found an incredible amount of detail here: https://www.sharetechnote.com/html/RACH_LTE.html

┌──────────────┐             ┌──────────────┐
│  Base Station│             │   Mobile     │
│              │             │   Device     │
│  Random Access Response    │              │
│  <────────── │ ──────────> │ Random Access│
│              │             │ Request      │
└──────────────┘             └──────────────┘

- Device sends a Random Access Preamble.
- Base station responds with timing and resource allocation.

Dedicated Radio Connection Setup (RRC Setup). The base station allocates resources for the device to establish a dedicated radio connection using the Radio Resource Control (RRC) protocol.

┌──────────────┐             ┌──────────────┐
│  Base Station│             │   Mobile     │
│              │             │   Device     │
│  RRC Setup   │             │              │
│  ──────────> │ Send RRC    │              │
│              │ Request     │ <──────────  │
│              │             │ RRC Response │
└──────────────┘             └──────────────┘

- Device requests RRC connection.
- Base station assigns resources and confirms.

Device-to-Core Network Communication (Authentication, Security, etc.). Once the RRC connection is established, the device communicates with the core network (e.g., EPC in LTE, 5GC in 5G) for authentication, security setup, and session establishment.

┌──────────────┐               ┌──────────────┐
│  Base Station│               │   Mobile     │
│  ──────────> │ Forward       │              │
│              │ Authentication Data          │
│              │ <──────────   │Authentication│
│              │               │ Request      │
│              │               │              │
└──────────────┘               └──────────────┘

- Device exchanges authentication and security data with the core network.
- Secure communication is established.

Data Transfer (Downlink and Uplink). After setup, the device starts sending (uplink) and receiving (downlink) data using the established radio connection.

┌──────────────┐             ┌──────────────┐
│  Base Station│             │   Mobile     │
│  ──────────> │ Data        │              │
│  Downlink    │             │  <─────────  │
│  <────────── │ Data Uplink │ ──────────>  │
│              │             │              │
└──────────────┘             └──────────────┘

- Data is transmitted between the base station and the device.
- Downlink (BS to Device) and Uplink (Device to BS) transmissions.

Handover. If the device moves out of range of the current base station, a handover is initiated to transfer the connection to a new base station without interrupting the service.

Signaling

As shown in the diagram above, there are a lot of references to something called "signaling". Signaling seems to be a shorthand for handling a lot of configuration and hand-off between tower and device and the core network. As far as I can tell they can be broken into 3 types.

Access Stratum Signaling
1. Set of protocols to manage the radio link between your phone and cellular network.
2. Handles authentication and encryption
3. Radio bearer establishment (setting up a dedicated channel for data transfer)
4. Mobility management (handovers, etc)
5. Quality of Service control.
Non-Access Stratum (NAS) Signaling
1. Set of protocols used to manage the interaction between your phone and the cellular network's core infrastructure.
2. It handles tasks such as authentication, billing, and location services.
3. Authentication with the Home Location Register (HLR)
4. Roaming management
5. Charging and billing
6. IMSI Attach/ Detach procedure
Lower Layer Signaling on the Air Interface
1. This refers to the control signaling that occurs between your phone and the cellular network's base station at the physical or data link layer.
2. It ensures reliable communication over the air interface, error detection and correction, and efficient use of resources (e.g., allocating radio bandwidth).
3. Modulation and demodulation control
4. Error detection and correction using CRCs (Cyclic Redundancy Checks)

High Level Overview of Signaling

You turn on your phone (AS signaling starts).
Your phone sends an Initial Direct Transfer (IDT) message to establish a radio connection with the base station (lower layer signaling takes over).
The base station authenticates your phone using NAS signaling, contacting the HLR for authentication.
Once authenticated, lower layer signaling continues to manage data transfer between your phone and the base station.

What is HLR?

Home Location Register contains the subscriber data for a network. Their IMSI, phone number, service information and is what negotiates where in the world the user physically is.

Duplexing

You have a lot of devices and you have a few towers. You need to do many uplinks and downlinks to many devices.

It is important that any cellular communications system you can send and receive in both directions at the same time. This enables conversations to be made, with either end being able to talk and listen as required. In order to be able to transmit in both directions, a device (UE) and base station must have a duplex scheme. There are a lot of them including Frequency Division Duplex (FDD), Time Division Duplex (TDD), Semi-static TDD and Dynamic TDD.

Duplexing Types:

Frequency Division Duplex (FDD): Uses separate frequency bands for downlink and uplink signals.
1. Downlink: The mobile device receives data from the base station on a specific frequency (F1).
2. Uplink: The mobile device sends data to the base station on a different frequency (F2).
3. Key Principle: Separate frequencies for uplink and downlink enable simultaneous transmission and reception.

┌──────────────┐              ┌──────────────┐
│  Base Station│              │   Mobile     │
│              │              │   Device     │
│  ──────────> │ F1 (Downlink)│ <──────────  │
│              │              │              │
│  <────────── │ F2 (Uplink)  │ ──────────>  │
└──────────────┘              └──────────────┘

Separate frequency bands (F1 and F2)

Time Division Duplex (TDD): Alternates between downlink and uplink signals over the same frequency band.
1. Downlink: The base station sends data to the mobile device in a time slot.
2. Uplink: The mobile device sends data to the base station in a different time slot using the same frequency.
3. Key Principle: The same frequency is used for both uplink and downlink, but at different times.

 ┌──────────────┐                     ┌──────────────┐
 │  Base Station│                     │  Mobile Phone│
 │ (eNodeB/gNB) │                     │              │
 └──────────────┘                     └──────────────┘

     ───────────►  Time Slot 1 (Downlink)
                 (Base station sends data)

     ◄───────────  Time Slot 2 (Uplink)
                 (Mobile sends data)
     
     ───────────►  Time Slot 3 (Downlink)
                 (Base station sends data)
                 
     ◄───────────  Time Slot 4 (Uplink)
                 (Mobile sends data)
 
    - The same frequency is used for both directions.
    - Communication alternates between downlink and uplink in predefined time slots.

Frame design

Downlink/Uplink: There are predetermined time slots for uplink and downlink, but they can be changed periodically (e.g., minutes, hours).
Key Principle: Time slots are allocated statically for longer durations but can be switched based on network traffic patterns (e.g., heavier downlink traffic during peak hours).
A frame typically lasts 10 ms and is divided into time slots for downlink (DL) and uplink (UL).
"Guard" time slots are used to allow switching between transmission and reception.

4. Dynamic Time Division Duplex (Dynamic TDD):

Downlink/Uplink: Time slots for uplink and downlink are dynamically adjusted in real time based on instantaneous traffic demands.
Key Principle: Uplink and downlink time slots are flexible and can vary dynamically to optimize the usage of the available spectrum in real-time, depending on the traffic load.
See second diagram for what "guard periods" are. Basically windows to ensure there are gaps and the signal doesn't overlap.

 ┌──────────────┐                     ┌──────────────┐
 │  Base Station│                     │  Mobile Phone│
 │ (eNodeB/gNB) │                     │              │
 └──────────────┘                     └──────────────┘

     ───────────►  Time Slot 1 (Downlink)
     ───────────►  Time Slot 2 (Downlink)
     ───────────►  Time Slot 3 (Downlink)
     ◄───────────  Time Slot 4 (Uplink)
     ───────────►  Time Slot 5 (Downlink)
     
     ◄───────────  Time Slot 6 (Uplink)
     
    - More slots for downlink in scenarios with high download traffic (e.g., streaming video).
    - Dynamic slot assignment can change depending on the real-time demand.

 ┌──────────────┐                     ┌──────────────┐
 │  Base Station│                     │  Mobile Phone│
 │ (eNodeB/gNB) │                     │              │
 └──────────────┘                     └──────────────┘

     ───────────►  Time Slot 1 (Downlink)
     ───────────►  Time Slot 2 (Downlink)
     [Guard Period]                          (Switch from downlink to uplink)
     ◄───────────  Time Slot 3 (Uplink)
     [Guard Period]                          (Switch from uplink to downlink)
     ───────────►  Time Slot 4 (Downlink)
     
    - Guard periods allow safe switching from one direction to another.
    - Guard periods prevent signals from overlapping and causing interference.

Core

So I've written a lot about what the RAN does. But we haven't really touched on what the core network concept does. Basically once the device registers with the base station using the random access procedure discussed above, the device is enabled and allows the core network to do a bunch of stuff that we typically associate with "having a cellular plan".

For modern devices when we say authentication we mean "mutual authentication", which means the device authenticates the network and the network authenticates the device. This is typically something like a subscriber-specific secret key and a random number to generate a response to the request sent by the device. Then the network sends an authentication token and the device compares this token with the expected token to authenticate the network. It looks like the following:

┌───────────────────────┐
│    Encryption &       │
│  Integrity Algorithms │
├───────────────────────┤
│  - AES (Encryption)   │
│  - SNOW 3G (Encryption│
│  - ZUC (Encryption)   │
│  - SHA-256 (Integrity)│
└───────────────────────┘

- AES: Strong encryption algorithm commonly used in LTE/5G.
- SNOW 3G: Stream cipher used for encryption in mobile communications.
- ZUC: Encryption algorithm used in 5G.
- SHA-256: Integrity algorithm ensuring data integrity.

The steps of the core network are as follows:

Registration (also called attach procedure): The device connects to the core network (e.g., EPC in LTE or 5GC in 5G) to register and declare its presence. This involves the device identifying itself and the network confirming its identity.

Mutual Authentication: The network and device authenticate each other to ensure a secure connection. The device verifies the network’s authenticity, and the network confirms the device’s identity.
Security Activation: After successful authentication, the network and the device establish a secure channel using encryption and integrity protection to ensure data confidentiality and integrity.
Session Setup and IP Address Allocation: The device establishes a data session with the core network, which includes setting up bearers (logical paths for data) and assigning an IP address to enable internet connectivity.

How Data Gets To Phone

Alright we've talked about how the phone finds a tower to talk to, how the tower knows who the phone is and all the millions of steps involved in getting the mobile phone an actual honest-to-god IP address. How is data actually getting to the phone itself?

Configuration for Downlink Measurement: Before downlink data transmission can occur, the mobile device (UE) must be configured to perform downlink measurements. This helps the network optimize transmission based on the channel conditions. Configuration messages are sent from the base station (eNodeB in LTE or gNB in 5G) to instruct the UE to measure certain DL reference signals.
Reference Signal (Downlink Measurements): The mobile device receives reference signals from the network. These reference signals are used by the UE to estimate DL channel conditions. In LTE, Cell-specific Reference Signals (CRS) are used, and in 5G, Channel State Information-Reference Signals (CSI-RS) are used.
DL Channel Conditions (CQI, PMI, RI): The mobile device processes the reference signals to assess the downlink channel conditions and generates reports such as CQI (Channel Quality Indicator), PMI (Precoding Matrix Indicator), and RI (Rank Indicator). These reports are sent back to the base station.
DL Resource Allocation and Packet Transmission: Based on the UE’s channel reports (CQI, PMI, RI), the base station allocates appropriate downlink resources. It determines the modulation scheme, coding rate, MIMO layers, and frequency resources (PRBs) and sends a DL scheduling grant to the UE. The data packets are then transmitted over the downlink.
Positive/Negative Acknowledgement (HARQ Feedback): After the UE receives the downlink data, it checks the integrity of the packets using CRC (Cyclic Redundancy Check). If the CRC passes, the UE sends a positive acknowledgement (ACK) back to the network. If the CRC fails, a negative acknowledgement (NACK) is sent, indicating that retransmission is needed.
New Transmission or Retransmission (HARQ Process): If the network receives a NACK, it retransmits the packet using the HARQ process. The retransmission is often incremental (IR-HARQ), meaning the device combines the new transmission with previously received data to improve decoding.

Uplink is a little different but is basically the device asking for a timeslot to upload, getting a grant, sending the data up and then getting an ack that it is sent.

Gs

So as everyone knows cellular networks have gone through a series of revisions over the years around the world. I'm going to talk about them and just try to walk through how they are different and what they mean.

1G

Starts in Japan, moves to Europe and then the US and UK.
Speeds up to 2.4kbps and operated in the frequency band of 150 KHz.
Didn't work between countries, had low capacity, unreliable handoff and no security. Basically any receiver can listen to a conversation.

2G

Launched in 1991 in Finland
Allows for text messages, picture messages and MMS.
Speeds up to 14.4kbps between 900MHz and 1800MHz bands
Actual security between sender and receiver with messages digitally encrypted.

Wait, are text messages encrypted?

So this was completely new to me but I guess my old Nokia brick had some encryption on it. Here's how that process worked:

Mobile device stores a secret key in the SIM card and the network generates a random challenge and sends it to the mobile device.
The A3 algorithm is used to compute a Signed Response (SRES) using the secret key and the random value.
Then the A8 algorithm is used with secret and the random value to generate a session encryption key Kc (64-bit key). This key will be used for encrypting data, including SMS.
After the authentication process and key generation, encryption of SMS messages begins. GSM uses a stream cipher to encrypt both voice and data traffic, including text messages. The encryption algorithm used for SMS is either A5/1 or A5/2, depending on the region and network configuration.
1. A5/1: A stronger encryption algorithm used in Europe and other regions.
2. A5/2: A weaker variant used in some regions, but deprecated due to its vulnerabilities.
3. The A5 algorithm generates a keystream that is XORed with the plaintext message (SMS) to produce the ciphertext, ensuring the confidentiality of the message.

So basically text messages from the phone to the base station were encrypted and then exposed there. However I honestly didn't even know that was happening.

TSMA and CDMA

I remember a lot of conversations about GSM vs CDMA when you were talking about cellular networks but at the time all I really knew was "GSM is European and CDMA is US".

TSMA is GSM and uses time slots
CDMA allocates each user a special code to communicate over multiple physical channels
GSM is where we see services like voice mail, SMS, call waiting

EDGE

So everyone who is old like me remembers EDGE on cellphones, including the original iPhone I waited in line for. EDGE was effectively a retrofit you could put on top of an existing GSM network, keeping the cost for adding it low. You got speeds on 9.6-200kbps.

3G

Welcome to the year 2000
Frequency spectrum of 3G transmissions is 1900-2025MHz and 2110-2200MHz.
UTMS takes over for GSM and CDMA2000 takes over from CDMA.
Maxes out around 8-10Mbps
IMT-2000 = 3G

So let's just recap quickly how we got here.

2G (GSM): Initially focused on voice communication and slow data services (up to 9.6 kbps using Circuit Switched Data).
2.5G (GPRS): Introduced packet-switched data with rates of 40-50 kbps. It allowed more efficient use of radio resources for data services.
2.75G (EDGE): Enhanced the data rate by improving modulation techniques (8PSK). This increased data rates to around 384 kbps, making it more suitable for early mobile internet usage.

EDGE introduced 8-PSK (8-Phase Shift Keying) modulation, which allowed the encoding of 3 bits per symbol (as opposed to 1 bit per symbol with the original GSM’s GMSK (Gaussian Minimum Shift Keying) modulation). This increased spectral efficiency and data throughput.

EDGE had really high latency so it wasn't really usable for things like video streaming or online gaming.

3G (WCDMA): Max data rate: 2 Mbps (with improvements over EDGE in practice). Introduced spread-spectrum (CDMA) technology with QPSK modulation.
3.5G (HSDPA): Enhanced WCDMA by introducing adaptive modulation (AMC), HARQ, and NodeB-based scheduling. Max data rate: 14.4 Mbps (downlink).

So when we say 3G we actually mean a pretty wide range of technologies all underneath the same umbrella.

4G

4G or as it is sometimes called LTE evolved from WCDMA. Instead of developing new radio interfaces and new technology existing and newly developed wireless system like GPRS, EDGE, Bluetooth, WLAN and Hiper-LAN were integrated together
4G has a download speed of 67.65Mbps and upload speed of 29.37Mbps
4G operates at frequency bands of 2500-2570MHz for uplink and 2620-2690MHz for downlink with channel bandwidth of 1.25-20MHz
4G has a few key technologies, mainly OFDM, SDR and Multiple-Input Multiple-Output (MIMO).
- OFDM (Orthogonal Frequency Division Multiplexing)
  - Allows for more efficient use of the available bandwidth by breaking down data into smaller pieces and sending them simultaneously
  - Since each channel uses a different frequency, if one channel experiences interference or errors, the others remain unaffected.
  - OFDM can adapt to changing network conditions by dynamically adjusting the power levels and frequencies used for each channel.
- SDR (Software Defined Radio)
  - Like it sounds, it is a technology that enables flexible and efficient implementation of wireless communication systems by using software algorithms to control and process radio signals in real-time. In cellular 4G, SDR is used to improve performance, reduce costs, and enable advanced features like multi-band support and spectrum flexibility.
- MIMO (multiple-input multiple-output)
  - A technology used in cellular 4G to improve the performance and capacity of wireless networks. It allows for the simultaneous transmission and reception of multiple data streams over the same frequency band, using multiple antennas at both the base station and mobile device.
  - Works by having both the base station and the mobile device equipped with multiple antennas
  - Each antenna transmits and receives a separate data stream, allowing for multiple streams to be transmitted over the same frequency band
  - There is Spatial Multiplexing where multiple data streams are transmitted over the same frequency band using different antennas. Then Beamforming where advanced signal processing techniques to direct the transmitted beams towards specific users, improving signal quality and reducing interference. Finally Massive MIMO where you use a lot of antennas (64 or more) to improve capacity and performance.

5G

The International Telecommunication Union (ITU) defines 5G as a wireless communication system that supports speeds of at least 20 Gbps (gigabits per second), with ultra-low latency of less than 1 ms (millisecond).
5G operates on a much broader range of frequency bands than 4G
- Low-band frequencies: These frequencies are typically below 3 GHz and are used for coverage in rural areas or indoor environments. Examples include the 600 MHz, 700 MHz, and 850 MHz bands.
- Mid-band frequencies: These frequencies range from approximately 3-10 GHz and are used for both coverage and capacity in urban areas. Examples include the 4.5 GHz, 6 GHz, and 24 GHz bands.
- High-band frequencies: These frequencies range from approximately 10-90 GHz and are used primarily for high-speed data transfer in dense urban environments. Examples include the 28 GHz, 39 GHz, and 73 GHz bands.
5g network designs are a step up in complexity from their 4g predecessors, with a control plane and a userplane with each plane using a separate network function. 4G networks have a single plane.
5G uses advanced modulation schemes such as 256-Quadrature Amplitude Modulation (QAM) to achieve higher data transfer rates than 4G, which typically uses 64-QAM or 16-QAM
All the MIMO stuff discussed above.

What the hell is Quadrature Amplitude Modulation?

I know, it sounds like a Star Trek thing. It is a way to send digital information over a communication channel, like a wireless network or cable. It's a method of "modulating" the signal, which means changing its characteristics in a way that allows us to transmit data.

When we say 256-QAM, it refers to the specific type of modulation being used. Here's what it means:

Quadrature: This refers to the fact that the signal is being modulated using two different dimensions (or "quadratures"). Think of it like a coordinate system with x and y axes.
Amplitude Modulation (AM): This is the way we change the signal's characteristics. In this case, we're changing the amplitude (magnitude) of the signal to represent digital information.
256: This refers to the number of possible states or levels that the signal can take on. Think of it like a binary alphabet with 2^8 = 256 possible combinations.

Why does 5G want this?

More information per symbol: With 256-QAM, each "symbol" (or signal change) can represent one of 256 different values. This means we can pack more data into the same amount of time.
Faster transmission speeds: As a result, we can transmit data at higher speeds without compromising quality.

Kubernetes and 5G

Kubernetes is a popular technology in 5G and is used for a number of functions, including the following:

Virtual Network Functions (VNFs): VNFs are software-based implementations of traditional network functions, such as firewalls or packet filters. Kubernetes is used to deploy and manage these VNFs.
Cloud-Native Network Functions (CNFs): CNFs are cloud-native applications that provide network function capabilities, such as traffic management or security filtering. Kubernetes is used to deploy and manage these CNFs.
Network Function Virtualization (NFV) Infrastructure: NFV infrastructure provides the underlying hardware and software resources for running VNFs and CNFs. Kubernetes is used to orchestrate and manage this infrastructure.

Conclusion

So one of the common sources of frustration for developers I've worked with when debugging cellular network problems is that often while there is plenty of bandwidth for what they are trying to do, the latency involved can be quite variable. If you look at all the complexity behind the scenes and then factor in that the network radio on the actual cellular device is constantly flipping between an Active and Idle state in an attempt to save battery life, this suddenly makes sense.

Because all of the complexity I'm talking about ultimately gets you back to the same TCP stack we've been using for years with all the overhead involved in that back and forth. We're still ending up with a SYN -> SYN-ACK. There are tools you can use to shorten this process somewhat (TCP Fast Open) and changing the initial congestion window but still you are mostly dealing with the same level of overhead you always dealt with.

Ultimately there isn't much you can do with this information, as developers have almost no control over the elements present here. However I think it's useful as cellular networks continue to become the dominant default Internet for the Earth's population that more folks understand the pieces happening in the background of this stack.

IPv6 Is A Disaster (but we can fix it)

August 04, 2023 in Networking

IP addresses have been in the news a lot lately and not for good reasons. AWS has announced they are charging $.005 per IPv4 address per hour, joining other cloud providers in charging for the luxury of a public IPv4 address. GCP charges $.004, same with Azure and Hetzner charges €0.001/h. Clearly the era of cloud providers going out and purchasing more IPv4 space is coming to an end. As time goes on, the addresses are just more valuable and it makes less sense to give them out for free.

So the writing is on the wall. We need to switch to IPv6. Now I was first told that we were going to need to switch to IPv6 when I was in high school in my first Cisco class and I'm 36 now, to give you some perspective on how long this has been "coming down the pipe". Up to this point I haven't done much at all with IPv6, there has been almost no market demand for those skills and I've never had a job where anybody seemed all that interested in doing it. So I skipped learning about it, which is a shame because it's actually a great advancement in networking.

Now is the second best time to learn though, so I decided to migrate this blog to IPv6 only. We'll stick it behind a CDN to handle the IPv4 traffic, but let's join the cool kids club. What I found was horrifying: almost nothing works out of the box. Major dependencies cease functioning right away and workarounds cannot be described as production ready. The migration process for teams to IPv6 is going to be very rocky, mostly because almost nobody has done the work. We all skipped it for years and now we'll need to pay the price.

Why is IPv6 worth the work?

I'm not gonna do a thing about what is IPv4 vs IPv6. There are plenty of great articles on the internet about that. Let's just quickly recap though "why would anyone want to make the jump to IPv6".

Address space (obviously)
Smaller number of header fields (8 vs 13 on v4)

Faster processing: No more checksum, so routers don't have to do a recalculation for every packet.
Faster routing: More summary routes and hierarchical routes. (Don't know what that is? No stress. Summary route = combining multiple IPs so you don't need all the addresses, just the general direction based on the first part of the address. Ditto with routes, since IPv6 is globally unique you can have small and efficient backbone routing.)
QoS: Traffic Class and Flow Label fields make QoS easier.
Auto-addressing. This allows IPv6 hosts on a LAN to connect without a router or DHCP server.
You can add IPsec to IPv6 with the Authentication Header and Encapsulating Security Payload.

Finally the biggest one: because IPv6 addresses are free and IPv4 ones are not.

Setting up an IPv6-Only Server

The actual setup process was simple. I provisioned a Debian box and selected "IPv6". Then I got my first surprise. My box didn't get an IPv6 address. I was given a /64 of addresses, which is 18,446,744,073,709,551,616. It is good to know that my small ARM server could scale to run all the network infrastructure for every company I've ever worked for on all public addresses.

Now this sounds wasteful but when you look at how IPv6 works, it really isn't. Since IPv6 is much less "chatty" than IPv4, even if I had 10,000 hosts on this network it doesn't matter. As discussed here it actually makes sense to keep all the IPv6 space, even if at first it comes across as insanely wasteful. So just don't think about how many addresses are getting sent to each device.

Important: resist the urge to optimize address utilization. Talking to more experienced networking folks, this seems to be a common trap people fall into. We've all spent so much time worrying about how much space we have remaining in an IPv4 block and designing around that problem. That issue doesn't exist anymore. A /64 prefix is the smallest you should configure on an interface.

Attempting to stick a smaller prefix, which is something I've heard people try, like a /68 or a /96 can break stateless address auto-configuration. Your mentality should be a /48 per site. That's what the Regional Internet Registries hands out when allocating IPv6. When thinking about network organization, you need to think about the nibble boundary. (I know, it sounds like I'm making shit up now). It's basically a way to make IPv6 easier to read.

Let's say you have 2402:9400:10::/48. You would divide it up as follows if you wanted only /64 for each box as a flat network.

Subnet #	Subnet Address
0	2402:9400:10::/64
1	2402:9400:10:1::/64
2	2402:9400:10:2::/64
3	2402:9400:10:3::/64
4	2402:9400:10:4::/64
5	2402:9400:10:5::/64

A /52 works a similar way.

Subnet #	Subnet Address
0	2402:9400:10::/52
1	2402:9400:10:1000::/52
2	2402:9400:10:2000::/52
3	2402:9400:10:3000::/52
4	2402:9400:10:4000::/52
5	2402:9400:10:5000::/52

You can still at a glance know which subnet you are looking at.

Alright I've got my box ready to go. Let's try to set it up like a normal server.

Problem 1 - I can't SSH in

This was a predictable problem. Neither my work or home ISP supports IPv6. So it's great that I have this box set up, but now I can't really do anything with it. Fine, I attach an IPv4 address for now, SSH in and I'll set up cloudflared to run a tunnel. Presumably they'll handle the conversion on their side.

Except that isn't how Cloudflare rolls. Imagine my surprise when the tunnel collapses when I remove the IPv4 address. By default the cloudflared utility assumes IPv4 and you need to go in and edit the systemd service file to add: --edge-ip-version 6. After this, the tunnel is up and I'm able to SSH in.

Problem 2 - I can't use GitHub

Alright so I'm on the box. Now it's time to start setting up stuff. I run my server setup script and it immediately fails. It's trying to access the installation script for hishtory, a great shell history utility I use on all my personal stuff. It's trying to pull the install file from GitHub and failing. "Certainly that can't be right. GitHub must support IPv6?"

Nope. Alright fine, seems REALLY bad that the service the entire internet uses to release software doesn't work with IPv6, but you know Microsoft is broke and also only cares about fake AI now, so whatever. I ended up using the TransIP Github Proxy which worked fine. Now I have access to Github. But then Python fails with urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>. Alright I give up on this. My guess is the version of Python 3 in Debian doesn't like IPv6, but I'm not in the mood to troubleshoot it right now.

Problem 3 - Can't set up Datadog

Let's do something more basic. Certainly I can set up Datadog to keep an eye on this box. I don't need a lot of metrics, just a few historical load numbers. Go to Datadog, log in and start to walk through the process. Immediately collapses. The simple setup has you run curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh. Now S3 supports IPv6, so what the fuck?

curl -v https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh
*   Trying [64:ff9b::34d9:8430]:443...
*   Trying 52.216.133.245:443...
* Immediate connect fail for 52.216.133.245: Network is unreachable
*   Trying 54.231.138.48:443...
* Immediate connect fail for 54.231.138.48: Network is unreachable
*   Trying 52.217.96.222:443...
* Immediate connect fail for 52.217.96.222: Network is unreachable
*   Trying 52.216.152.62:443...
* Immediate connect fail for 52.216.152.62: Network is unreachable
*   Trying 54.231.229.16:443...
* Immediate connect fail for 54.231.229.16: Network is unreachable
*   Trying 52.216.210.200:443...
* Immediate connect fail for 52.216.210.200: Network is unreachable
*   Trying 52.217.89.94:443...
* Immediate connect fail for 52.217.89.94: Network is unreachable
*   Trying 52.216.205.173:443...
* Immediate connect fail for 52.216.205.173: Network is unreachable

It's not S3 or the box, because I can connect to the test S3 bucket AWS provides just fine.

curl -v  http://s3.dualstack.us-west-2.amazonaws.com/
*   Trying [2600:1fa0:40bf:a809:345c:d3f8::]:80...
* Connected to s3.dualstack.us-west-2.amazonaws.com (2600:1fa0:40bf:a809:345c:d3f8::) port 80 (#0)
> GET / HTTP/1.1
> Host: s3.dualstack.us-west-2.amazonaws.com
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 307 Temporary Redirect
< x-amz-id-2: r1WAG/NYpaggrPl3Oja4SG1CrcBZ+1RIpYKivAiIhiICtfwiItTgLfm6McPXXJpKWeM848YWvOQ=
< x-amz-request-id: BPCVA8T6SZMTB3EF
< Date: Tue, 01 Aug 2023 10:31:27 GMT
< Location: https://aws.amazon.com/s3/
< Server: AmazonS3
< Content-Length: 0
<
* Connection #0 to host s3.dualstack.us-west-2.amazonaws.com left intact

Fine I'll do it the manual way through apt.

0% [Connecting to apt.datadoghq.com (18.66.192.22)]

Goddamnit. Alright Datadog is out. It's at this point I realize the experiment of trying to go IPv6 only isn't going to work. Almost nothing seems to work right without proxies and hacks. I'll try to stick as much as I can on IPv6 but going exclusive isn't an option at this point.

NAT64

So in order to access IPv4 resources from IPv6 you need to go through a NAT64 service. I ended up using this one: https://nat64.net/. Immediately all my problems stopped and I was able to access resources normally. I am a little nervous about relying exclusively on what appears to be a hobby project for accessing critical internet resources, but since nobody seems to care upstream of me about IPv6 I don't think I have a lot of choice.

I am surprised there aren't more of these. This is the best list I was able to find:

Most of them seem to be gone now. Dresel's link doesn't work, Trex in my testing had problems, August Internet is gone, most of the Go6lab test devices are down, Tuxis worked but they launched the service in 2019 and seem to have no further interaction with it. Basically Kasper Dupont seems to be the only person on the internet with any sort of widespread interest in allowing IPv6 to actually work. Props to you Kasper.

Basically one person props up this entire part of the internet.

Kasper Dupont

So I was curious about Kasper and emailed him to ask a few questions. You can see that back and forth below.

Me: I found the Public NAT64 service super useful in the transition but would love to know a little bit more about why you do it.

Kasper: I do it primarily because I want to push IPv6 forward. For a few years
I had the opportunity to have a native IPv6-only network at home with
DNS64+NAT64, and I found that to be a pleasant experience which I
wanted to give more people a chance to experience.

When I brought up the first NAT64 gateway it was just a proof of
concept of a NAT64 extension I wanted to push. The NAT64 service took
off, the extension - not so much.

A few months ago I finally got native IPv6 at my current home, so now
I can use my own service in a fashion which much more resembles how my
target users would use it.

Me: You seem to be one of the few remaining free public services like this on the internet and would love to know a bit more about what motivated you to do it, how much it costs to run, anything you would feel comfortable sharing.

Kasper: For my personal products I have a total of 7 VMs across different
hosting providers. Some of them I purchase from Hetzner at 4.51 Euro
per month: https://hetzner.cloud/?ref=fFum6YUDlpJz

The other VMs are a bit more expensive, but not a lot.

Out of those VMs the 4 are used for the NAT64 service and the others
are used for other IPv6 transition related services. For example I
also run this service on a single VM: http://v4-frontend.netiter.com/

I hope to eventually make arrangements with transit providers which
will allow me to grow the capacity of the service and make it
profitable such that I can work on IPv6 full time rather than as a
side gig. The ideal outcome of that would be that IPv4-only content
providers pay the cost through their transit bandwidth payments.

Me: Any technical details you would like to mention would also be great

Kasper: That's my kind of audience :-)

I can get really really technical.

I think what primarily sets my service aside from other services is
that each of my DNS64 servers is automatically updated with NAT64
prefixes based on health checks of all the gateways. That means the
outage of any single NAT64 gateway will be mostly invisible to users.
This also helps with maintenance. I think that makes my NAT64 service
the one with the highest availability among the public NAT64 services.

The NAT64 code is developed entirely by myself and currently runs as a
user mode daemon on Linux. I am considering porting the most
performance critical part to a kernel module.

This site

Alright so I got the basics up and running. In order to pull docker containers over IPv6 you need to add: registry.ipv6.docker.com/library/ to the front of the image name. So for instance:
image: mysql:8.0 becomes image: registry.ipv6.docker.com/library/mysql:8.0

Docker warns you this setup isn't production ready. I'm not really sure what that means for Docker. Presumably if it were to stop you should be able to just pull normally?

Once that was done, we set up the site as an AAAA DNS record and allowed Cloudflare to proxy, meaning they handle the advertisement of IPv6 and bring the traffic to me. One thing I did modify from before was previously I was using Caddy webserver but since I now have a hard reliance on Cloudflare for most of my traffic, I switched to Nginx. One nice thing you can do now that you know all traffic is coming from Cloudflare is switch how SSL works.

Now I have an Origin Certificate from Cloudflare hard-loaded into Nginx with Authenticated Origin Pulls set up so that I know for sure all traffic is running through Cloudflare. The certificate is signed for 15 years, so I can feel pretty confident sticking it in my secrets management system and not thinking about it ever again. For those that are interested there is a tutorial here on how to do it: https://www.digitalocean.com/community/tutorials/how-to-host-a-website-using-cloudflare-and-nginx-on-ubuntu-22-04

Alright the site is back up and working fine. It's what you are reading right now, so if it's up then the system works.

Unsolved Problems

My containers still can't communicate with IPv4 resources even though they're on an IPv6 network with an IPv6 bridge. The DNS64 resolution is working, and I've added fixed-cidr-v6 into Docker. I can talk to IPv6 resources just fine, but the NAT64 conversion process doesn't work. I'm going to keep plugging away at it.
Before you ping me I did add NAT with ip6tables.
SMTP server problems. I haven't been able to find a commercial SMTP service that has an AAAA record. Mailgun and SES were both duds as were a few of the smaller ones I tried. Even Fastmail didn't have anything that could help me. If you know of one please let me know: https://c.im/@matdevdug

Why not stick with IPv4?

Putting aside "because we're running out of addresses" for a minute. If we had adopted IPv6 earlier, the way we do infrastructure could be radically different. So often companies use technology like load balancers and tunnels not because they actually need anything that these things do, but because they need some sort of logical division between private IP ranges and a public IP address they can stick in an DNS A record.

If you break a load balancer into its basic parts, it is doing two things. It is distributing incoming packets onto the back-end servers and it s checking the health of those servers and taking unhealthy ones out of the rotation. Nowadays they often handle things like SSL termination and metrics, but it's not a requirement to be called a load balancer.

There are many ways to load balance, but the most common are as follows:

Round-robin of connection requests.
Weighted Round-Robin with different servers getting more or less.
Least-Connection with servers that have the fewest connections getting more requests.
Weighted Least-Connection, same thing but you can tilt it towards certain boxes.

What you notice is there isn't anything there that requires, or really even benefits from a private IP address vs a public IP address. Configuring the hosts to accept traffic from only one source (the load balancer) is pretty simple and relatively cheap to do, computationally speaking. A lot of the infrastructure designs we've been forced into, things like VPCs, NAT gateways, public vs private subnets, all of these things could have been skipped or relied on less.

The other irony is that IP whitelisting, which currently is a broken security practice that is mostly a waste of time as we all use IP addresses owned by cloud providers, would actually be something that mattered. The process for companies to purchase a /44 for themselves would have gotten easier with demand and it would have been more common for people to go and buy a block of IPs from American Registry for Internet Numbers (ARIN), Réseaux IP Européens Network Coordination Centre (RIPE), or Asia-Pacific Network Information Centre (APNIC).

You would never need to think "well is Google going to buy more IP addresses" or "I need to monitor GitHub support page to make sure they don't add more later". You'd have one block they'd use for their entire business until the end of time. Container systems wouldn't need to assign internal IP addresses on each host, it would be trivial to allocate chunks of public IPs for them to use and also advertise over standard public DNS as needed.

Obviously I'm not saying private networks serve no function. My point is a lot of the network design we've adopted isn't based on necessity but on forced design. I suspect we would have ended up designing applications with the knowledge that they sit on the open internet vs relying entirely on the security of a private VPC. Given how security exploits work this probably would have been a benefit to overall security and design.

So even if cost and availability isn't a concern for you, allowing your organization more ownership and control over how your network functions has real measurable value.

Is this gonna get better?

So this sucks. You either pay cloud providers more money or you get a broken internet. My hope is that the folks who don't want to pay push more IPv6 adoption, but it's also a shame that it has taken so long for us to get here. All these problems and issues could have been addressed gradually and instead it's going to be something where people freak out until the teams that own these resources make the required changes.

I'm hopeful the end result might be better. I think at the very least it might open up more opportunities for smaller companies looking to establish themselves permanently with an IP range that they'll own forever, plus as IPv6 gets more mainstream it will (hopefully) get easier for customers to live with. But I have to say right now this is so broken it's kind of amazing.

If you are a small company looking to not pay the extra IP tax, set aside a lot of time to solve a myriad of problems you are going to encounter.

Thoughts/corrections/objections: [email protected]

AWS Elastic Kubernetes Service (EKS) Review

January 31, 2022 in DevOps

A bit of history

Over the last 5 years, containers have become the de-facto standard by which code is currently run. You may agree with this paradigm or think its a giant waste of resources (I waver between these two opinions on a weekly basis), but this is where we are. AWS, as perhaps the most popular cloud hosting platform, bucked the trend and attempted to launch their own container orchestration system called ECS. This was in direct competition to Kubernetes, the open-source project being embraced by organizations of every size.

ECS isn't a bad service, but unlike running EC2 instances inside of a VPC, switching to ECS truly removes any illusion that you will be able to migrate off of AWS. I think the service got a bad reputation for its deep tie-in with AWS along with some of the high Fargate pricing for the actual CPU units. As years went on, it became clear to the industry at large that ECS was not going to become the standard way we run containerized applications. This, in conjunction with the death of technology like docker swarm meant a clear winner started to emerge in k8s.

AWS EKS was introduced in 2017 and became generally available in 2018. When it launched, the response from the community was....tepid at best. It was missing a lot of the features found in competing products and seemed to be far behind the offerings from ECS at the time. The impression among folks I was talking to at the time was: AWS was forced to launch a Kubernetes service to stay competitive, but they had zero interest in diving into the k8s world and this was to meet the checkbox requirement. AWS itself didn't really want the service to explode in popularity was the rumor being passed around various infrastructure teams.

This is how EKS stacked up to its competition at launch

I became involved with EKS at a job where they had bet hard on docker swarm, it didn't work out and they were seeking a replacement. Over the years I've used it quite a bit, both as someone whose background is primary in AWS and someone who generally enjoys working with Kubernetes. So I've used the service extensively for production loads over years and have seen it change and improve in that time.

EKS Today

So its been years and you might think to yourself "certainly AWS has fixed all these problems and the service is stronger than ever". You would be...mostly incorrectly. EKS remains both a popular service and a surprisingly difficult one to set up correctly. Unlike services like RDS or Lambda, where the quality increases by quite a bit on a yearly basis, EKS has mostly remained a clearly internally disliked product.

This is how I like to imagine how AWS discusses EKS

New customer experience

There exists out of the box tooling for making a fresh EKS cluster that works well, but the tool isn't made or maintained by Amazon, which seems strange. eksctl which is maintained by Weaveworks here mostly works as promised. You will get a fully functional EKS cluster through a CLI interface that you can deploy to. However you are still pretty far from an actual functional k8s setup, which is a little bizarre.

Typically the sales pitch for AWS services is they are less work than doing it yourself. The pitch for EKS is its roughly the same amount of work to set it up, it is less maintenance in the long term. Once running, it keeps running and AWS managing the control plane means its difficult to get into a situation in which you cannot fix the cluster, which very nice. Typically k8s control plane issues are the most serious and everything else you can mostly resolve with tweaks.

However if you are considering going with EKS, understand you are going to need to spend a lot of time reading before you touch anything. You need to make hard-to-undo architectural decisions early in the setup process and probably want to experiment with test clusters before going with a full-on production option. This is not like RDS, where you can mostly go with the defaults, hit "new database" and continue on with your life.

Where do I go to get the actual setup information?

The best resource I've found is EKS Workshop, which walks you through tutorials of all the actual steps you are going to need to follow. Here is the list of things you would assume come out of the box, but strangely do not:

You are going to need to set up autoscaling. There are two options, the classic Autoscaler and the new AWS-made hotness which is Karpenter. You want to use Karpenter, it's better and doesn't have all the lack-of-awareness when it comes to AZs and EBS storage (basically autoscaling doesn't work if you have state on your servers unless you manually configure it to work, it's a serious problem). AWS has a blog talking about it.
You will probably want to install some sort of DNS service, if for nothing else so services not running inside of EKS can find your dynamic endpoints. You are ganna wanna start here.
You almost certainly want Prometheus and Grafana to see if stuff is working as outlined here.
You will want to do a lot of reading about IAM and RBAC to understand specifically how those work together. Here's a link. Also just read this entire thing from end to end: link.
Networking is obviously a choice you will need to make. AWS has a network plugin that I recommend, but you'll still need to install it: link.
Also you will need a storage driver. EBS is supported by kubernetes in the default configuration but you'll want the official one for all the latest and greatest features. You can get that here.
Maybe most importantly you are going to need some sort of ingress and load balancer controller. Enjoy setting that up with the instructions here.
OIDC auth and general cluster access is pretty much up to you. You get the IAM auth out of the box which is good enough for most use cases, but you need to understand RBAC and how that interacts with IAM. It's not as complicated as that sentence implies it will be, but it's also not super simple.

That's a lot of setup

I understand that part of the appeal of k8s is that you get to make all sorts of decisions on your own. There is some power in that, but in the same way that RDS became famous not because you could do anything you wanted, but because AWS stopped you from doing dumb stuff that would make you life hard; I'm surprised EKS is not more "out of the box". My assumption would have been that AWS launched the clusters with all their software installed by default and let you switch off those defaults in a config file or CLI option.

There is an ok Terraform module for this, but even this was not plug and play. You can find that here. This is what I use for most things and it isn't the fault of the module maintainers, EKS touches a lot of systems and because it is so customizable, it's hard to really turn into a simple module. I don't really considering this the problem of the maintainers, they've already applied a lot of creativity to getting around limitations in what they can do through the Go APIs.

Overall Impression - worth it?

I would still say yes after using it for a few years. The AWS SSO integration is great, allowing us to have folks access the EKS cluster through their SSO access even with the CLI. aws sso login --profile name_of_profile and aws eks --profile name_of_profile --region eu-west-1 update-kubeconfig --name name_of_cluster is all that is required for me to interact with the cluster through normal kubectl commands, which is great. I also appreciate the regular node AMI upgrades, removing the responsibility from me and the team for keeping on top of those.

Storage

There are some issues here. The ebs experience is serviceable but slow, resizing and other operations are very rate-limited and can be an error-prone experience. So if your service relies a lot on resizing volumes or changing volumes, ebs is not the right choice inside of EKS, you are probably going to want the efs approach. I also feel like AWS could provide more feedback to you about the dangers of relying on ebs since they are tied to AZs, requiring some planning on your part before starting to keep all those pieces working. Otherwise the pods will simply be undeployable. By default the old autoscaling doesn't resolve this problem, requiring you to set node affinities.

Storage and Kubernetes have a bad history, like a terrible marriage. I understand the container purists which say "containers should never have state, it's against the holy container rules" but I live in the real world where things must sometimes save to disk. I understand we don't like that, but like so many things in life, we must accept the things we cannot change. That being said it is still a difficult design pattern in EKS that requires planning. You might want to consider some node affinity for applications that do require persistent storage.

The AWS CNI link is quite good and I love that it allows me to generate IP addresses on the subnets in my VPCs. Not only does that simplify some of the security policy logic and the overall monitoring and deployment story, but mentally it is easier to not add another layer of abstraction to what is already an abstraction. The downside is they are actual network interfaces, so you need to monitor free IP addresses and ENIs.

AWS actually provides a nice tool to keep an eye on this stuff which you can find here. But be aware that if you intend on running a large cluster or if your cluster might scale up to be quite large, you are gonna get throttled with API calls. Remember the maximum number of ENIs you can attach to a node is determined by instance type, so its not an endlessly flexible approach. AWS maintains a list of ENI and IP limits per instance type here.

The attachment process also isn't instant, so you might be waiting a while for all this to come together. The point being you will need to understand the networking pattern of your pods inside of the cluster in a way you don't need to do as much with other CNIs.

Load Balancing and Ingress

This part of the experience is so bad it is kind of incredible. If you install the tool AWS recommends, you are going to get a new ALB for every single Ingress defined unless you group them together into groups. Ditto with every single internal LoadBalancer target, you'll end up with NLBs for each one. Now you can obviously group them together, but this process is incredibly frustrating in practice.

What do you mean?

You have a new application and you add this to your YAML:

kind: Ingress
metadata:
  namespace: new-app
  name: new-app
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: new-app
              servicePort: 80

Every time you do this you are going to end up with another ALB, which aren't free to use. You can group them together as shown here but my strong preference would have been to have AWS do this. I get they are not a charity, but it seems kind of crazy that this becomes the default unless you and the rest of your organization know how to group stuff together (and what kinds of apps can even be grouped together).

I understand it is in AWS's best interest to set it up like this, but in practice it means you'll see ALB and NLB cost exceed EC2 costs because by default you'll end up with so many of the damn things barely being used. My advice for someone setting up today would be to go with HAProxy as shown here. It's functionally a better experience and you'll save so much money you can buy the HAProxy enterprise license from the savings.

Conclusion in 2022

If you are going to run Kubernetes, don't manage it yourself unless you can dedicate staff to it. EKS is a very low-friction day to day experience but it frontloads the bad parts. Setup is a pain in the ass and requires a lot of understanding about how these pieces fit together that is very different from more "hands off" AWS products.

I think things are looking good though that AWS has realized it makes more sense to support k8s as opposed to fighting it. Karpenter as the new autoscaler is much better, I'm optimistic they'll do something with the load balancing situation and the k8s add-on stack is relatively stable. If folks are interested in buying-in today I think you'll be in ok shape in 5 years. Clearly whatever internal forces at AWS that wanted this service to fail have been pushed back by its success.

However if I were a very small company new to AWS I wouldn't touch this with a ten foot pole. It's far too complicated for the benefits and any sort of concept of "portability" is nonsense bullshit. You'll be so tied into the IAM elements of RBAC that you won't be able to lift and shift anyway, so if you are going to buy in then don't do this. EKS is for companies already invested in k8s or looking for a deeply tweakable platform inside of AWS when they have outgrown services like Lightsail and Elastic Beanstalk. If that isn't you, don't waste your time.

How does Apple Private Relay Work?

August 31, 2021 in Apple

What is Apple Private Relay?

Private Relay is an attempt by Apple to change the way traffic is routed from user to internet service and back. This is designed to break the relationship between user IP address and information about that user, reducing the digital footprint of that user and eliminating certain venues of advertising information.

It is a new feature in the latest version of iOS and MacOS that will be launching in "beta mode". It is available to all users who pay Apple for iCloud storage and I became interested in it after watching the WWDC session about preparing for it.

TL;DR

Private Relay provides real value to users, but also fundamentally changes the way network traffic flows across the internet for those users. Network administrators, programmers and owners of businesses which rely on IP addresses from clients for things like whitelisting, advertising and traffic analysis should be aware of this massive change. It is my belief that this change is not getting enough attention in the light of the CSAM scanning.

What happens when you turn on Private Relay?

The following traffic is impacted by Private Relay

All Safari web browsing
All DNS queries
All insecure HTTP traffic

Traffic from those sources will no longer take the normal route to their destination, instead being run through servers controlled by either Apple or its partners. They will ingress at a location close to you and then egress somewhere else, with an IP address known to be from your "region". In theory websites will still know roughly where you are coming from, but won't be able to easily combine that with other information they know about your IP address to enrich targeted advertisements. Access logs and other raw sources of data will also be less detailed, with the personally identifiable information that is your IP address no longer listed on logs for every website you visit.

Why is Apple doing this?

When you go to a website, you are identified in one of a thousand ways, from cookies to device fingerprinting. However one of the easiest ways is through your IP address. Normal consumers don't have "one" IP address, they are either given one by their ISP when their modem comes online and asks for one, or their ISP has them behind "carrier-grade NAT". So normally what happens is that you get your modem, plug it in, it receives an IP address from the ISP and that IP addresses identifies you to the world.

Normally how the process works is something like this:

Your modems MAC address appears on the ISPs network and requests an IP address
The ISP does a lookup for the MAC address, makes sure it is in the table and then assigns an IP, ideally the same IP over and over again so whatever cached routes from the ISPs side exist are still used.
All requests from your home are mapped to a specific IP addresses and, over time, given the combination of other information about the browsing history and advertising data, it is possible to combine the data together to know where you live and who you are within a specific range.
You can see how close the geographic data is by checking out the map available here. For me it got me within a few blocks of my house, which is spooky.

CGNAT

Because of IPv4 address exhaustion, it's not always possible to assign every customer their own IP address. You know you have a setup like this because the IP address your router gets is in the "private range" of IP addresses, but when you go to IP Chicken you'll have a non-private IP address.

Private IP ranges include:

10.0.0.0 – 10.255.255.255
172.16.0.0 – 172.31.255.255
192.168.0.0 – 192.168.255.255

For those interested you can get more information about how CGNAT works here.

Doesn't my home router do that?

Yeah so your home router kind of does something similar with its own IP address ranges. So next time a device warns you about "double-NAT" this might be what it is talking about, basically nested NAT. (Most often double-NAT is caused by your modem also doing NAT though.) Your home router runs something called PAT or PAT in overload. I think more often it is called NAPT in modern texts.

This process is not that different from what we see above. One public IP address is shared and the different internal targets are identified with ports. Your machine makes an outbound connection, your router receives the request and rewrites the packet with a random high port. Every outbound connection gets its own entry in this table.

IP Exposed

So during the normal course of using the internet, your IP address is exposed to the following groups

Every website or web service you are connecting to
Your DNS server also can have a record of every website you looked up.
Your ISP can obviously see where every request to and from your home went to

This means there are three groups of people able to turn your request into extremely targeted advertising. The most common one I see using IP address is hyper-local advertising. If you have ever gotten an online ad for a local business or service and wondered "how did they know it was me", there is a good chance it was through your IP.

DNS is one I think is often forgotten in the conversation about leaking IPs, but since it is a reasonable assumption that if you make a DNS lookup for a destination you will go to that destination, it is as valuable as the more invasive systems without requiring nearly as much work. Let's look at one popular example, Google DNS.

Google DNS

The famous 8.8.8.8. Google DNS has become famous because of the use of DNS around the world as a cheap and fast way to block network access for whole countries or regions. A DNS lookup is just what turns domain names into IP address. So for this site:

➜  ~ host matduggan.com
matduggan.com has address 67.205.139.103

Since DNS servers are normally controlled by ISPs and subject to local law, it is trivial if your countries leadership wants to block access for users to get to Twitter by simply blocking lookups to twitter.com. DNS is a powerful service that is normally treated as an afterthought. Alternatives came up, the most popular being Google DNS. But is it actually more secure?

Google asserts that they only store your IP address for 24-48 hours in their temporary logs. When they migrate your data to their permanent DNS logs, they remove IP address and replace with region data. So instead of being able to drill down to your specific house, they will only be able to tell your city. You can find more information here. I consider their explanation logical though and think they are certainly more secure when compared to a normal ISP DNS server.

Most ISPs don't offer that luxury, simply prefilling their DNS servers when you get your equipment from them and add it to the network. There is very little information about what they are doing with it that I was able to find, but they are allowed now to sell that information if they so choose. This means the default setting for US users is to provide an easy to query copy of every website their household visits to their ISP.

So most users will not take the proactive step to switch their DNS servers to one provided by Google or other parties. However since most folks won't do that, the information is just being openly shared with whoever has access to that DNS server.

NOTE: If you are looking to switch your DNS servers off your ISP, I recommend dns.watch. I've been using them for years and feel strongly they provide an excellent service with a minimum amount of fuss.

How does Private Relay address these concerns?

DNS

This is how a normal DNS lookup works.

Apple and Cloudflare engineers have proposed a new standard, which they discuss in their blog post here. ODNS or "oblivious DNS" is a system which allows clients to mask the originator of the request from the server making the lookup, breaking the IP chain.

This is what ODNS looks like:

Source: Princeton paper

This is why all DNS queries are getting funneled through Private Relay, removing the possibility of ISP DNS servers getting this valuable information. It is unclear to me in my testing if I am using Apple's servers or Cloudflares 1.1.1.1 DNS service. With this system it shouldn't matter in terms of privacy.

2. Website IP Tracking

When on Private Relay, all traffic is funneled first through an Apple ingress service and then out through a CDN partner. Your client makes a lookup to one of these two DNS entries using our new fancy ODNS:

mask.icloud.com
mask-h2.icloud.com

This returns a long list of IP addresses for you to choose from:

mask.icloud.com is an alias for mask.apple-dns.net.
mask.apple-dns.net has address 172.224.41.7
mask.apple-dns.net has address 172.224.41.4
mask.apple-dns.net has address 172.224.42.5
mask.apple-dns.net has address 172.224.42.4
mask.apple-dns.net has address 172.224.42.9
mask.apple-dns.net has address 172.224.41.9
mask.apple-dns.net has address 172.224.42.7
mask.apple-dns.net has address 172.224.41.6
mask.apple-dns.net has IPv6 address 2a02:26f7:34:0:ace0:2909::
mask.apple-dns.net has IPv6 address 2a02:26f7:36:0:ace0:2a05::
mask.apple-dns.net has IPv6 address 2a02:26f7:36:0:ace0:2a07::
mask.apple-dns.net has IPv6 address 2a02:26f7:34:0:ace0:2904::
mask.apple-dns.net has IPv6 address 2a02:26f7:34:0:ace0:2905::
mask.apple-dns.net has IPv6 address 2a02:26f7:36:0:ace0:2a04::
mask.apple-dns.net has IPv6 address 2a02:26f7:36:0:ace0:2a08::
mask.apple-dns.net has IPv6 address 2a02:26f7:34:0:ace0:2907::

These IP addresses are owned by Akamai and are here in Denmark, meaning all Private Relay traffic first goes to a CDN endpoint. These are globally situated datacenters which allow companies to cache content close to users to improve response time and decrease load on their own servers. So then my client opens a connection to one of these endpoints using a new protocol, QUIC. Quick, get it? Aren't network engineers fun.

QUIC integrates TLS to encrypt all payload data and most control information. Its based on UDP for speed but is designed to replace TCP, the venerable protocol that requires a lot of overhead in terms of connections. By baking in encryption, Apple is ensuring a very high level of security for this traffic with a minimum amount of trust required between the partners. It also removes the loss recovery elements of TCP, instead shifting that responsibility to each QUIC stream. There are other advantages such as better shifting between different network providers as well.

So each user makes an insecure DNS lookup to mask.apple-dns.net, establishes a QUIC connection to the local ingress node and then that traffic is passed through to the egress CDN node. Apple maintains a list of those egress CDN nodes you can see here. However users can choose whether they want to reveal even city-level information to websites through the Private Relay settings panel.

If I choose to leave "Maintain General Location" checked, websites will know I'm coming from Copenhagen. If I select the "Country and Time Zone" you just know I'm coming fron Denmark. The traffic will appear to be coming from a variety of CDN IP addresses. You can tell Apple very delibertly did not want to offer any sort of "region hopping" functionality like users require from VPNs, letting you access things like streaming content in other countries. You will always appear to be coming from your country.

3. ISP Network Information

Similar to how the TOR protocol (link) works, this will allow you to effectively hide most of what you are doing. To the ISP your traffic will simply be going to the CDN endpoint closest to you, with no DNS queries flowing to them. Those partner CDN nodes lack the complete information to connect your IP address to the request to the site. In short, it should make the information flowing across their wires much less valuable from an advertising perspective.

In terms of performance hit it should be minimal, unlike TOR. Since we are using a faster protocol with only one hop (CDN 1 -> CDN 2 -> Destination) as opposed to TOR, in my testing its pretty hard to tell the difference. While there are costs for Apple to offer the service, by limiting the traffic to just Safari, DNS and http traffic they are greatly limiting how much raw bandwidth will pass through these servers. Most traffic (like Zoom, Slack, Software Updates, etc) will all be coming from HTTPS servers.

Conclusion

Network operators, especially with large numbers of Apple devices, should take the time to read through the QUIC management document. Since the only way Apple is allowing people to "opt out" of Private Relay at a network level is by blocking DNS lookups to mask.icloud.com and mask-h2.icloud.com, many smaller shops or organizations that choose to not host their own DNS will see a large change in how traffic flows.

For those that do host their own DNS, users receive an alert that you have blocked Private Relay on the network. This is to caution you in case you think that turning it off will result in no user complaints. I won't presume to know your requirements, but nothing I've seen on the spec document for managing QUIC suggests there is anything worth blocking from a network safety perspective. If anything, it should be a maginal reduction in the amount of packets flowing across the wire.

Apple is making some deliberate choices here with Private Relay and for the most part I support them. I think it will hurt the value of some advertising and I suspect that for the months following its release the list of Apple egress nodes will confuse network operators on why they are seeing so much traffic from the same IP addresses. I am also concerned that eventually Apple will want all traffic to flow through Private Relay, adding another level of complexity for teams attempting to debug user error reports of networking problems.

From a privacy standpoint I'm still unclear on how secure this process is from Apple. Since they are controlling the encryption and key exchange, along with authenticating with the service, it seems obvious that they can work backwards and determine the IP address. I would love for them to publish more whitepapers or additional clarification on the specifics of how they handle logging around the establishment of connections.

Any additional information people have been able to find out would be much appreciated. Feel free to ping me on twitter at: @duggan_mathew.