Developer's Guide to Moving to Denmark

I've wanted to write a guide for tech workers looking to leave the US and move to Denmark for a while now. I made the move over four years ago and finally feel like I can write on the topic with enough detail to answer most questions.

Denmark gets a lot of press in the US for being a socialist paradise and is often held up as the standard by which things are judged. The truth is more complicated: Europe has its own issues that may affect you more or less depending on your background.

Here's the short version: moving to Europe from the US is a significant
improvement in quality of life for most people. There are pitfalls, but
especially if you have children, every aspect of being a parent, from
the amount of time you get to spend with them to the safety and quality
of their schools, is better. If you have never considered it, you
should, even if not Denmark (although I can't help you with how). It
takes a year to do, so even if things seem OK right this second, try to think longer-term.

TL;DR
Reasons to move to Denmark

  • 5 or 6 weeks of vacation a year (with some asterisks)
  • Very good public healthcare
  • Amazing work/life balance, with lots of time for hobbies and activities
  • Great public childcare at really affordable prices
  • Be in a union
  • Amazing summer weather
  • Safety. Denmark is really safe compared to even a safe US area.
  • Very low stress. You don't worry about retirement or health insurance or childcare or any of the things that you might constantly obsess over.
  • Freedom from religious influence. Denmark has a very good firewall against religious influence in politics.
  • Danes are actually quite friendly. I know, media portrayals suggest otherwise, but they are. They won't start a conversation with you, but they'd love to chat if you start one.

Reasons not to move to Denmark

  • You are gonna miss important stuff back home. People are gonna die and you won't be there. Weddings, birthdays, promotions and divorces are all still happening and you aren't there. I won't sugarcoat it, you are leaving most of your old life behind.
  • Eating out is a big part of your life; there are restaurants, but they're expensive and for the most part unimpressive. If someone who worked at Noma owns it, then it's probably great, but otherwise often meh.
  • You refuse to bike or take public transit. Owning a car here is possible but very expensive and difficult to park.
  • Lower salary. Tech workers make a lot less here before taxes.
  • Taxes. They're high. All Danes pay a base 15%, and if you hit the "top tax" bracket you pay another 15% on income above the threshold. A tech salary will almost certainly bump you into it.
  • Food. High quality ingredients at stores but overall very bland. You'll eat a lot of meals at your work cafeteria and it's healthy but uninspired.
  • Buying lots of items is your primary joy in life. Electronics are more expensive in the EU, and Amazon doesn't exist in Denmark, so your selection is much more limited.

Leaving

There is certainly no shortage of reasons today why one would consider leaving the US. From mass shootings to a broken political system where the majority is held hostage by the minority rural voter, it's not the best time to live in the US. When Trump was elected, I decided it was time to get out. Trump wasn't the reason I left (although man, wouldn't you be a little bit impressed if it were?).

I was tired of everything being so hopeless. Everyone I knew in Chicago was working professional jobs, often working late hours, but nobody was getting ahead. I was lucky that I could afford a house and a moderately priced car, but for so many I knew it just seemed pointless. Grind forever and maybe you would get to buy a condo. All you could do was run the numbers with people and you realized they were never gonna get out.

Everything felt like it was building up to this explosion. I would go back to rural Ohio and people were desperately poor even though the economy had, on paper, been good for years. Flex scheduling at chain retail locations meant you couldn't take another job because your first job might call you at any time. Nobody had health insurance outside of the government-provided "Buckeye card". Friends I knew, people who had been moderates growing up, were carrying AR-15s to Wal-Mart and talking about the upcoming war with liberals. There were Confederate flags everywhere in a state that fought on the Union side.

I'm not really qualified to talk about politics, so I won't. The only relevant skill I bring to this situation is having watched a lot of complex systems fail. This felt like watching a giant company's infrastructure collapse. Every piece had problems, different problems, that nobody seemed to understand holistically. I couldn't get it out of my head, this fear that I would need to flee quickly and wouldn't be able to. "If you think it's gonna explode, you wanna go now before people realize how bad it is" was the thought that ran over and over in my head.

So I'd sit out the "burning to the ground" part. Denmark seemed the perfect place to do it: often held up as the perfect society, with its well-functioning welfare state, low levels of corruption and high political stability. They have a shortage of tech workers and I was one of those. I'd sell my condo, car and all my possessions and wait out the collapse. It wasn't a perfect plan (the US economy is too important to think its collapse wouldn't be felt in Denmark), but it was the best plan I could come up with.

My thought was that doing something is better than sitting on Twitter getting sad all the time.

Is it a paradise?

No. I think the US media builds Denmark, Norway and Sweden up to unrealistic levels. It's a nice country with decent people, and while DSB doesn't run on time, the trains do run and exist, which is more than I can say for US trains. There are problems, and often the problems don't get discussed until you get here. I'll try to give you some top-level stuff you should be aware of.

There is a lot of anger towards communities that have immigrated from the Muslim world. These neighborhoods are often officially classified as "ghettos" under a 2018 law:

And the Danish state decides whether areas are deemed ghettoes not just by their crime, unemployment or education rates, but on the proportion of residents who are deemed “non-western” – meaning recent, first-, or second-generation migrants.

You'll sometimes hear this discussed as the "parallel societies" problem: the idea that Denmark is not Danish enough anymore unless steps are taken to break up these neighborhoods and disperse their residents. The proposed solution was to change the terms: the Interior Ministry revealed proposed reforms that would remove the word "ghetto" from current legislation and reduce the share of people of "non-Western" origin in social housing to 30% within 10 years. Families removed from these areas would be relocated to other parts of the country.

It's not a problem that stops at one generation either. I've seen Danes whose families immigrated over a generation ago, who speak fluent Danish (as they are Danish), be asked "where they come from" at social events multiple times. So even if you are a citizen, speak the language and go through the educational system, you aren't fully integrated by a lot of folks' standards. This can be demoralizing for some people.

I also love their health system, but it's not fair to all the workers who maintain it. Medical staff in Denmark don't get paid enough for all the work they do, especially when you compare, say, Danish nurses to nurses in the US. Similarly, a lot of critical workers like daycare workers, teachers, etc., are in the same boat. It's not as bad as the US for teachers, but there's still definitely a gap there.

Denmark also doesn't apply all these great benefits uniformly. Rural Denmark is pretty poor and has limited access to a lot of these services. It's not West Virginia, but some of the magic of a completely flat fair society disappears when you spend a week in rural Jutland. These towns are peeling paint and junk on the front lawn, just like you would expect in any poor small town. There's still a safety net and they still have a much better time of it than an American in the same situation, but still.

I hope Danish people reading this don't get upset. I'm not saying this to speak ill of your country, but sometimes I see people emotionally crash and burn when they come expecting liberal paradise and encounter many problems which look similar to ones back home. It's important to be realistic about what living here looks like. Denmark has not solved all racial, gender or economic issues in their society. They are still trying to, which is more than I can say for some.

Steps

The next part is all the practical steps I could think of. I'm glad to elaborate on any of them if there is useful information that is missing. If I missed something or you disagree, the easiest way to reach me is on the Fediverse at: [email protected].

Why do I say this is a guide for developers? Because I haven't run this by a large group to fact-check, just developers moving from the US or Canada. So this is true of what we've experienced, but I have no idea how true it is for other professions. Most of it should be applicable to anyone, but maybe not, you have been warned etc etc.

Getting a Visa

The first part of the process is the most time consuming but not very difficult. You need to get a job from an employer looking to sponsor someone for a work visa. Since developers tend towards the higher end of the pay scale, you'll likely qualify for a specific (and easier) visa process.

In terms of job posting sites I like Work in Denmark and Job Index. I used Work in Denmark and it was great. If a job listing is in Danish, don't bother translating it and applying; it means they want a local. Danish CVs are similar to US resumes, but folks often include a photo in theirs. It's not a requirement, but I've seen it a fair amount when people are applying to jobs.

It can be a long time before you hear anything, which is just how it works. Even if you seem amazing for a job, expect to wait: my experience with US tech companies was often hearing back within a week for an interview, while with Denmark it's often 2-3 weeks just to get a rejection. You just gotta wait them out.

Where do you want to live

Short answer? As close to Copenhagen as possible. It's the capital city, it has the most resources by a lot and it is the easiest to adjust to IMO. I originally moved to Odense, the third largest city and found it far too small for me. I ended up passing time by hanging out in the Ikea food court because I ran out of things to do, which is as depressing as it sounds.

The biggest cities in Denmark are Copenhagen, Aarhus on the Jutland peninsula and Odense on the island of Fyn sitting between the two. Here's a map that shows you what I'm talking about.

A lot of jobs will contact you that are based in Jutland. I would think long and hard before committing to living in Jutland if you haven't spent a lot of time in Denmark. The further you get from Copenhagen, the more expectation there is that you are fluent in Danish. These cities are also painfully small by US standards.

  • Copenhagen: 1,153,615
  • Aarhus: 237,551
  • Odense: 145,931

Typically jobs in Jutland are more desperate for applicants and are easier to get for foreign workers. If you are looking for a smaller city or maybe even living out in the countryside (which is incredibly beautiful), it's a good option. Just be sure that's what you want to do. You'll want to enroll in Danish classes immediately, and at a faster pace, to get around and do things like "read menus" and "answer the phone".

There are perks to not living in Copenhagen. My wife got to ride horses once a week, which is something she did as a little kid and could do again for a very reasonable $50 a month. I enjoyed the long walks through empty nature around Fyn and the leisurely pace of life for a while. Just be sure, because these towns are very sleepy and can make you go a bit insane.

Interviews

Danish interviews are a bit different from US ones. Take home assignments and test projects are less common, with most companies comfortable assuming you aren't lying on your resume. They may ask for a GitHub handle just to see if you have anything up there. The pace is pretty relaxed compared to the US, don't expect a lot of live code challenges or random quizzes. You walk through the work you've done and they'll ask follow ups.

Even though the interviews are relaxed, they're surprisingly easy to fail. Danes really don't like bragging or "dominating" the conversation. Make sure you attribute victories to your team, you were part of a group that did all this great work. It's not cheap to move someone to Denmark, so try and express why you want to do it. A lot of foreign workers bounce off of Denmark when they move here, so you are trying to convince them you are worth the work.

After the interview you'll have....another interview. Then another interview. You'll be shocked how often people want to talk to you. This is part of the group consensus thing that is pretty important here. Jobs really want the whole team to be happy with a decision and get a chance to weigh in on who they work with. Managers and bosses have a lot less power than in the US and you see it from the very beginning of the interview.

Remember, keep it light, lots of self-deprecating humor. Danes love that stuff, poking fun at yourself or just injecting some laughter into the interview. They also love to hear how great Denmark is, so drop some of that in too. You'll feel a little weird celebrating their country in a job interview, but I've found it really creates positive feelings among the people you are talking to.

Don't answer the baby question. They can't directly ask you if you are gonna have kids, but often places bringing foreign workers over will dance around the question. "Oh it's just you and your partner? Denmark is a great place for kids." The right answer is no. I gave a sad no and stared off screen for a moment. I don't have any fertility issues, it just seemed an effective way to sell it.

Alright you got the job. Now we start the visa process for real. That was actually the easy part.

Sitting in VFS

This wasn't going to work. That was my thought as I sat in the waiting room of VFS Chicago, a visa application processing company. Think airport waiting area meets DMV. Basically for smaller countries it doesn't make sense for them to pay to staff places with employees to intake immigrants, so they outsource it to this depressing place. I was surrounded by families all carrying heavy binders and all I had was my tiny thin binder.

I watched in horror as a French immigration official told a woman "she was never getting to France" as a binder was closed with authority. Apparently the French staff their counter with actual French people who seem to take some joy in crushing dreams. This woman immediately started to cry and plead that she needed the visa, she had to get back. She had easily 200 pages of some sort of documentation. I looked on in horror as she collapsed sobbing into a seat.

On the flip side I had just watched a doctor get approved in three minutes. He walked in still wearing scrubs, said "I'm here to move to Sweden", they checked his medical credentials and stamped a big "APPROVED" on the document. If you or your spouse is a doctor or nurse, there's apparently nowhere in the EU that won't instantly give you a visa.

My process ended up fine, with some confusion over whether I was trying to move to the Netherlands or Denmark. "You don't want a Dutch visa, correct?" I was asked more than once. They took my photo and fingerprints and we moved on. Then I waited for a long time for a PDF saying "approved". I was a little bit sad they didn't mail me anything.

Work Visa Process

Just because it seems like nobody in either sphere understands how the other works

The specific visa we are trying to get is outlined here. This website is where you do everything. Danish immigration doesn't have phone support and nothing happens on paper. It's all through this website. Basically your employer fills out one part and you fill out the rest. It's pretty straightforward and the form is hard to mess up. But also, your workplace has probably done it before and can answer most questions.

This can be weird for Americans where we are still a paper-based society. Important things come with a piece of paper generally. When my daughter was born in a Danish hospital I freaked out because when it was time to discharge her they were like "ok time to go!". "Certainly there's a birth certificate or something that I get about her?" The nurse looked confused and then told me "the church handles all that sort of stuff." She was correct, the church (for some reason) is where we got the document that we provided to the US to get her citizenship.

Almost nothing you'll get in this entire process is on paper. It's all through websites and email. Once you get used to it, it's fine, but I have a natural aversion to important documents existing only on government websites where (in the US) they can disappear with no warning. I recommend backups of everything, even though it rarely comes up. The Danish systems mostly just work, or if they break they break for everyone.

IMPORTANT

There is a part of the process that they don't draw particular attention to. You need to get your biometrics taken, which means a photo and fingerprints. This process is a giant pain in the ass in the US. You have a very limited time window from when you submit the application to get your biometrics recorded, so check the appointment availability BEFORE you hit submit. The place that offers biometric intake is VFS. You have to get it done within 14 days of submitting and there are often no appointments.

Here are the documents you will need over and over:

  • full color copies of your passport including covers
  • the receipt from the application website showing you paid the fee. THIS IS INCREDIBLY IMPORTANT and the website does not tell you how important it is when you pay the fee. That ID number it generates is needed by everything.
  • Employment contract
  • Documentation of education. For me this included basically my resume and jobs I had done as a proxy for not having a computer science degree.

Make a binder and put all this stuff in, with multiple copies. It will save you a ton of work in the long-term. This binder is your life for this entire process. All hail the binder.

Alright, you've applied after checking for a biometrics appointment. You paid your fee, sat through the interviews, put in the application. Now you wait for an email. It can take a mysterious amount of time, but you just need to be patient. Hopefully you get the good-news email with your new CPR number. Congrats, you are in the Danish system.

Moving

Moving stuff to Denmark is a giant pain in the ass. There are a lot of international moving companies and I hear pretty universally bad things about all of them. You need to think of your possessions in terms of cargo containers: how many containers' worth of stuff do you currently have in your house, and how much of it can you get rid of? Our moving company advised us to try to fit within a 20-foot cargo container for the best pricing.

It's not a ton of space. We're talking 1,094 cubic feet.

You gotta get everything inside there and ideally you go way smaller. Moving prices can vary wildly between $1000 and $10,000 depending on how much junk you have. You cannot be sentimental here, you want to get rid of everything possible. Don't bring furniture, buy new stuff at Ikea. Forget bringing a car, the cost to register it in Denmark will be roughly what you paid for the car to begin with. Sell the car, sell the furniture, get rid of everything you can.

Check to see if anything with a plug will work. If your device shows a rating for a range of 110V-220V, then all you need is a plug adapter. If you only see a rating for 110V, then you need a transformer to step the electricity down from 220V to 110V. If you attempt to plug in your device without a transformer, bad things happen. I wouldn't bother bringing anything that won't work with 220V. Plug adapters are cheap, but transformers aren't.

Stuff you will want to stockpile

This is a pretty good idea of what American stuff you can get.

  • Over-the-counter medicine: it doesn't really exist here outside of Panodil
      • Pepto, aspirin, melatonin, cold and flu pills; buy a lot because you can't get more
  • Spices and sauces
      • Cream of tartar
      • Pumpkin pie spice
      • Meatloaf mix
      • Good chili spice mixes, or chili spices in general
      • Hot peppers: a variety of dried peppers, especially ones from Mexico, are almost impossible to find here
      • Everything bagel seasoning, I just love it
      • Ranch dressing
      • Hot sauces, they're terrible here
      • BBQ sauces, also terrible here
      • Liquid butter for popcorn if that's your thing
      • Taco mix, it's way worse here
  • Foods
      • Cheez-Its and Goldfish crackers don't exist
      • Gatorade powder (you can buy it per bottle but it's expensive)
      • Tex-Mex anything, really Mexican food in general
      • Cereal, American sugar cereal doesn't exist
      • Cooler Ranch Doritos
      • Mac and cheese
      • Good dill pickles (Danish pickles are sweet and gross)
      • Peanut butter, it's here but it's expensive

You are going to get used to Danish food, I promise, but it's painfully bland at first. There's a transition period and spices can help get you over the hurdle.

Note: If you eat a lot of peppers like jalapeños, it is too expensive to buy them every time. You will want to grow them in your house. This is common among American expats, but be aware if you are used to them being everywhere and cheap.

Medical Records
When you get your yellow card (your health insurance card), you are also assigned a doctor. In order to get your medical records into the Danish system, you need to bring them with you. If you don't have a complicated medical history I think it's fine to skip this step (they'll ask you all the standard questions), but if you have a more complicated health issue you'll want those documents with you. The lead time to get a GP appointment here in Denmark isn't long, typically same week for kids and two weeks for adults.

Different people have different experiences with the health system in Denmark, but I want to give you a few high-level notes. Typically Danes get a lot less medication than Americans, so don't expect to walk out of the doctor's office with a prescription. There is a small fee for medicine, but it's a small fraction of what it costs with insurance in the US. Birth control pills, IUDs and other resources are easy to get and quite affordable (or free).

If you need a specific medication for a disease, try to get as much as you can from the US doctor. The process for getting specific medicine can sometimes be complicated in Denmark, possibly requiring a referral to a specialist and additional testing. You'll want to allocate some time between when you arrive and when you can get a new script. Generally it works, but it might take a while.

Landing

The pets and I waiting for the bus with my stolen luggage cart

My first week was one of the harder weeks I've had in my life. I landed and then immediately had to take off to go grab the dog and cat. The plan was simple: the pets had been flown on a better airline than me. I would grab them and then take the train from the airport to Odense. It's like an hour and a half train ride. Should be simple. I am all jitters when I land but I find the warehouse where the pets were unloaded.

Outside are hundreds of truck drivers and I realize I have made a critical error. People had told me over and over that I didn't need to rent a car, which might have been true if I didn't have pets. But the distance between the warehouse and where I needed to be was too far to walk with animals in crates. The truck drivers are sitting around laughing and drinking energy drinks while I wander around waiting for the warehouse to let me in.

I decide to steal an abandoned luggage cart outside of the UPS building. "I'm bringing it closer to where it should be anyway" is my logic. The drivers find this quite funny, with many jokes being made at my expense. Typically I'd chalk this up to paranoia but they are pointing and laughing at me.  I get the dog and cat, they're not in great shape but they're alive. I give them some water and take off for the bus to the airport.

Loading two crated animals onto a city bus isn't easy in the best of times. Doing it while the cat-pee smell coming out of one crate is strong enough to make your eyes water is another thing entirely. I have taken over the middle of this bus and people are waving their hands in front of their faces because of the smell. After loading everyone on, I check Google Maps again and feel great. This bus is going to turn around but will take me back to the front of the airport, where I want to go.

It does not do that. Instead it takes off to the countryside. After ten minutes of watching the airport disappear far into the background, I get off at the next stop. In front of a long line of older people (tourists?) I get the dog out of the box, throw the giant kennel into a dumpster, zip tie the cat kennel to the top of my suitcase and start off again.

We make it to the train, where a conductor is visibly disgusted by the smell. I sit next to the bathroom hoping the smell of a public train bathroom will cover it. I attempt to grab a taxi to take me to where I am staying to get set up. No go, there are no taxis. I had not planned for there to be no taxis. On the train I had swapped out the cat pad so the smell was not nearly so intense, but it still wasn't great.

I then walked the kilometers from the train station to where I was staying, sweating the entire time. The dog was desperately trying to escape after the trauma of flying and staying in the SAS animal holding area with race horses and other exotic animals. There were giant slugs on the ground everywhere, something I have since learned is just a Thing in Denmark. We eventually get there and I collapse onto the unmade bed.

What I have with me is what I'm going to need to get set up. There is a multi-month delay between when you land and when your stuff gets there, so for a long time you are starting completely fresh. The next day I start the millions of appointments you need to get set up.

Week 1

Alright you've landed, your stuff is on a boat on its way to you. Typically jobs will either put you up in corporate housing to let you find an apartment or they'll stick you in a hotel. You are gonna be overwhelmed at first, so try to take care of the basics. There is a great outline of all the steps here.

It is a pretty extreme culture shock at first. My first night in Denmark was a disaster. I didn't realize you had to buy the shopping bags and just stole a few by accident. So basically within 24 hours of landing I was already committing crimes. My first meal included a non-alcoholic beer because I assumed Carlsberg Nordic meant "lots of booze", not "no booze".

When you wake up, have a plan for what you need to get done that day. It's really tiring, you are gonna be jet-lagged, you aren't used to biking, so don't beat yourself up if you only get that one thing done. But you are time-limited here, so it's important to hit these milestones quickly. You are also going to burn through kind of a lot of cash to get set up. You'll make it up over time, but be aware.

Get a phone plan

You can bring a cellphone from the US and have it work here. Cellphone plans are quite cheap, with a pay-as-you-go SIM available for 99 DKK a month with 100 GB of data and 100 hours of talk time. You can get that deal here. If you require an eSIM, I recommend 3, although it is a bit more. They are here.

Find an apartment
The gold standard for apartment hunting is BoligPortal here. Findboliger was also OK but has a much smaller amount of inventory. You can get a list of all the good websites here.

These services cost you money. I'm not exactly sure why (presumably because they can, so why not). Just remember to cancel once you find the apartment.

Some tips for apartment hunting

  • Moving into an apartment in Denmark can be jaw-droppingly expensive. Landlords are allowed to ask for up to 3 months of rent as a deposit AND 3 months of prepaid rent before you move in. You may have to pay 6 months of rent before you get a single paycheck from your new job.
  • You aren't going to get back all of that deposit. Danish landlord companies are incredibly predatory about this. They will act quite casual when you move in, but when you move out they will inspect everything for an hour plus. You need to document all damage when you move in, same as in the US. But mentally you should write off half that deposit.
  • After you have moved in, you have 14 days to fill out a list of defects and send it to your landlord.
  • Don't pay rent in cash. If the landlord says pay in cash it's a scam. Move on.
  • See if you have separate meters in your apartment for water/electric. You want this ideally.
  • Fiber internet is surprisingly common in Denmark. In general they have awesome internet. If this is a priority ask the apartment folks about it. Even if the building you are looking at doesn't have it, chances are there is a building they manage that does.
This doesn't have anything to do with this, I just love this picture

Appliances
Danish washers and dryers are great. Their refrigerators suck so goddamn hard. They're small, a pool of water often forms at the bottom for some reason, the seal needs to be reglued from time to time, and stuff freezes if it's anywhere near the back wall. I've never seen a good fridge after three tries, so just expect it to be crap.

All the normal kitchen appliances are here, but there are distinct tiers of fancy. Grocery stores like Netto often have cheap appliances like toasters, Ikea sells some, but stay away from the electronics stores like Power unless you know you want a fancy one. Amazon Germany will ship to Denmark and that's where I got my vacuum and a few other small items.

Due to the cost of eating out in Denmark you are going to be cooking a lot. So get whatever you need to make that process less painful. Here's what I found to be great:

  • Instant Pot: slow cooker and a rice cooker
  • Salad washer: their lettuce is very dirty
  • Hand blender: if you wanna do soups
  • Microwave: I got the cheapest I could find, weirdly no digital controls just a knob you turn. Not sure why
  • Coffee bean grinder: Pre-ground coffee is always bad, Danish stuff is nightmarish bad
  • Hot water kettle: just get one you'll use it all the time
  • Drip coffee maker: again surprisingly hard to find. Amazon Germany for me.
  • Vacuum

Kitchen Tools

  • Almost all stove-tops are induction so expect to have to buy new pots and pans, don't bring non-induction ones from the US
  • Counter space is limited and there is not a ton of kitchen storage in your average Danish apartment so think carefully about anything you might not need or use on a regular basis
  • Magasin will sell you any exotic tools you might want or need and there are plenty of specialist cooking stores around town

Go visit ICS
You can make an appointment here.

They will get you set up with MitID, the digital ID service. This is what you use to log into your bank account, government websites, the works. They'll also get you your yellow card as well as sign you up for your doctor. The process is pretty painless.

Bank

  • pick whichever you want, bring your US passport, Danish yellow card and employment contract
  • it takes forever, so also maybe a book
  • they'll walk you through what you need there, but it's pretty straightforward
  • credit card rewards don't exist in Denmark and you don't really need a credit card for anything

If the bank person tells you they need to contact the US, ask to speak to someone else. I'm not sure why some Danish bank employees think this, but there is nobody at the US Department of Treasury they can speak to. It was a bizarre roadblock that left me trying to hunt down who they would be talking to at a giant federal organization. In the end another clerk explained she was wrong and just set me up, but I've heard this issue from other Americans so be aware.

I did enjoy how the woman was like "I'll just call the US" and I thought I am truly baffled at who she might be calling.

First night

Moving In

  • Danish apartments don't come with light fixtures installed. This means your first night is gonna be pretty dark if you aren't prepared. Trust me, I know from having spent my first night sleeping on the floor in the dark because I assumed I would have lights to unpack stuff. You are gonna see these on the wall:

Here's the process to install a light fixture:

  1. Turn off the power
  2. Pop the inner plastic part out with a screwdriver
  3. Put the wire from the light fixture through the hole
  4. Strip the cables from the light fixture like 4 cm
  5. Insert the two leads of your lamp into the N and M1 terminals
  6. If colored, the blue wire goes into N and the brown wire into M1
  7. If not colored it shouldn't matter

Here is a video that walks you through it.

You are gonna wanna do this while the sun is out for obvious reasons so plan ahead.

Buying a Bike

See me wearing jeans? Like a fucking idiot?

Your bike in Denmark is going to be your primary form of transportation. You ride it, rain or shine, everywhere. You'll haul groceries on it, carry Ikea stuff home on it, this thing is going to be a giant part of your life. Buying one is....tricky. You want something like this:

Here's the stuff you want:

  • Mudguards. It rains a lot in Denmark.
  • Kevlar tires. Your bike tires will get popped at the worst possible moments, typically during a massive downpour.
  • Basket. You want a basket on the front and you want them to put it on. Sometimes men get weird about this but this isn't the time for that. Just get the basket.
  • Cargo rack on the back.
  • Wheel lock, the weird circular lock on the back wheel. It's what keeps people from stealing it (kinda). You also need a chain if the bike is new.
  • Lights, ideally permanently mounted lights. They're a legal requirement here for bikes and police do give tickets.
  • If you haven't changed a tube on a bike in a while, practice it. You'll have to do it on the road sometime.
  • Get a road tool kit.
  • Get a flashlight in this tool kit, because the sun sets early in the winter in Denmark and hell is trying to swap a tube in the dark by the light of a cellphone while it's raining.
  • If you can get disc brakes, they're less work and last longer
  • Minimum three gears, five if you can.
  • Denmark always has a bike lane. Use it; never ride in the car lanes with traffic.
You need all that

It doesn't have to be that one but it should have everything that one does plus a flashlight

Bike Ownership

  • home insurance mostly covers your bike, but make sure you have that option (and get home insurance)
  • write down the frame number off the bike, it's also on the receipt. You need it for insurance claims
  • You should lubricate the chain every week with daily use and clean the chain at least once a month. A lot of people don't and end up with very broken bikes
  • Danes use hand signals to indicate turns and stops.

You are expected to use these every time.

  • Danes are very serious about biking. You need to treat it like driving a car. Stay to the right unless you are passing, don't ride together blocking people from passing, move out of the way of people who ring their bells.
  • Never ever walk in a bike lane
  • Wear a helmet
  • Buy rain gear. It rained every morning on my way to work for a month when I first moved here. I got hit in the eye with hail and fell off the bike. You need gear.

Rain Gear

Rain jackets: Regnjakker
Best stuff is: https://www.hellyhansen.com/en_dk/ or McKinley on a budget.

Rain pants: regnbukser
I love the Patagonia rain pants cause they're not just hot rubber pants. Get some with air slots if you can.

You can grab a full set here if you don't want to mix and match: https://www.spejdersport.dk/asivik-rain-regnsaet-dame

Rain boots:
Tretorn is the brand to beat. You can grab that here: https://www.tretorn.dk/ They also sell all the gear you need.

Backpack:
Get a rain cover for the backpack and also get a waterproof backpack. I'm not kidding when I say it rains a lot. Rain covers are everywhere and I used a shopping bag for two months when I kept forgetting mine.

Alright you got your apartment, yellow card, bank account, bike and rain gear. You are ready to start going to work. Get ready for Danish work culture, which is pretty different from US work culture.

Work

Danish work can be a rough adjustment for someone growing up in the American style of work. I'll try to guide you through it. Danes have to work 37 hours a week, but in practice this can be a bit flexible. You'll want to be there at 9 your first day but don't be shocked if you are pretty alone when you get there. Danes often get to work a little later.

You'll want to join your union. You aren't eligible for the unemployment payouts since you are here on a work visa, but the union is still the best place to turn to in Denmark to get information about whether something is allowed or not. They're easy to talk to, with my union I submit an email and get a call the next day. They are also the ones who track what salaries are across the industry and whether you are underpaid. This is critical to salary negotiation and can be an immense amount of leverage when sitting down with your boss or employer.

Just another day biking to work

Seriously, join a union

If you get fired in Denmark, you have the right to get your union in there to negotiate the best possible exit package for you. I have heard a lot of horror stories from foreigners moving to Denmark about not getting paid, about being lied to about what to do if they get hurt on the job, the list goes on and on. This is the group that can help you figure out what is and isn't allowed. They're a bargain at twice the price.

Schedules tend to be pretty relaxed in Denmark as long as you are hitting around that 37. It's socially acceptable to take an hour to run an appointment or take care of something. Lunches are typically short, like 30 minutes, with most workplaces providing food you pay for in a canteen. It's cheaper than bringing lunch and usually pretty good. A lot of Danes are vegetarian or vegan so that shouldn't be a problem.

Titles don't mean anything

This can be tricky for Americans who see "CTO" or "principal engineer" and act really deferential. Danes will give (sometimes harsh) feedback to management pretty often. This is culturally acceptable where management isn't really "above" anyone, it's just another role. You really want to avoid making decisions that impact other people without their approval, or at least the opportunity to give that approval, even in high management positions.

Danish work isn't the same level of competitive as US/China/India

As an American, if you want a high-paying job you need a combination of luck, family background and basically winning a series of increasingly tight competitions. You need to do well in high school and standardized tests to get into an ok university where you need to major in the right thing to make enough money to pay back the money you borrowed to go to the university. You need a job that offers good enough health insurance that you don't declare bankruptcy with every medical issue you encounter.

US Tech interviews are grueling, multi-day affairs involving a phone screen, take home, on-site personality and practical exam AND the job can fire you at any second with zero warning. You have to be consistently providing value on a project the executive level cares about. So it's not even enough to be doing a good job, you have to do a good job on whatever hobby project is hot that quarter.

Danes don't live in that universe. They are competitive people when it comes to sports or certain schools, but they don't have the "if I fail I'm going to be in serious physical distress" mentality. So things like job titles, which to Americans are "how I tell you how important I am", mean nothing here. Don't try to impress with a long list of your previous titles, just be like "I worked a bunch of places and here's what I did". Always shoot for casual, not panicked and intense.

Cultural Norms

Dress is pretty casual. I've never seen people working in suits and ties outside of a bank or government office. There isn't AC in most places, so dress in the summer for comfort. Typically once a week someone brings in cake and there are beers or sodas provided by the workplace. Friday beer is actually kind of important and you don't want to always skip it. It's one of the big bonding opportunities in Denmark among coworkers.

Many things considered taboo in American workplaces are fine here. You are free to discuss salary and people often will. You are encouraged to join a union, which I did and found to be worthwhile. They'll help with any dispute or provide you with advice if you aren't sure if something is allowed. Saying you need to leave early is totally fine. Coffee and tea are always free but soda isn't and it's not really encouraged at any workplace I've been at in Denmark to consume soda every day.

There are requirements around desk ergonomics which means you can ask for things like a standing desk, ergonomic mouse and keyboard, standing pad, etc. Often workplaces will bring in someone to assess desks and provide recommendations, which can be useful. If you need something ask for it. Typically places will provide it without too much hassle.

Working Late/On-Call

It happens, but a lot less. Typically if you work after-hours or late, you would be expected to get that time back later on by leaving early or coming in late. The 37 hours is all hours worked. The expectations around on-call are a bit mixed and, as far as I know, aren't defined in any formal rules. Just be aware that your boss shouldn't be asking you to work late, and unlike the US, being on salary doesn't mean you can be asked to work unlimited hours in a week.

Vacation

Danish summer isn't bad

Danish vacation is mostly awesome. Here's the part that kinda stinks. Some jobs will ask that you use a big chunk of your vacation over a summer holiday, which is two or three weeks when the office is closed, somewhere between May 1 and September 30. Your boss can require that you use your vacation during this period, which is a disaster for foreigners: you don't have anywhere to go, everything in Denmark is already booked during the summer vacation, and everything travel-related is more expensive.

Plus you'll probably want to spend more of that vacation back home with family. So try to find a job that doesn't mandate when you use your vacation. Otherwise you'll be stuck either flying out at higher prices or doing a lame staycation in your apartment while everyone else flees to their summer houses in Jutland.

Conclusion

Is it worth it? I think so. You'll feel the reduction in stress within six months. For the first time maybe in your entire adult life, you'll have time to explore new hobbies. Wanna try basketweaving or kayaking or horseback riding? There's a club for that. You'll also have the time to try those things. It sounds silly but the ability to just relax during your off-time and not have to do something related to tech at all has had a profound impact on my stress levels.

Some weeks are easier than others. You'll miss home. It'll be sad. But you can push through and adapt if you want to. If I missed something or you need more information, please reach out at [email protected] on the Fediverse. Good luck!


Monitoring is a Pain

And we're all doing it wrong (including me)

I have a confession. Despite having been hired multiple times in part due to my experience with monitoring platforms, I have come to hate monitoring. Monitoring and observability tools commit the cardinal sin of tricking people into thinking this is an easy problem. It is very simple to monitor a small application or service. Almost none of those approaches scale.

Instead monitoring becomes an endless series of small failures. Metrics disappear for a while, logs get dropped for a few hours, the web UI for traces doesn't work anymore. You set up these tools with a "set and forget" mentality, but they actually require ever-increasing amounts of maintenance. Some of the tools break and are never fixed. The number of times I've joined a company to find an unloved, broken Jaeger deployment is far too high.

It feels like we have more tools than ever to throw at monitoring, but we're not making progress. Instead the focus seems to be on getting applications to emit ever more telemetry, which increases the revenue of the companies selling the monitoring. Very little seems to be happening around the idea of transmitting fewer logs and metrics over the wire from the client. I'm running more complicated stacks to capture massive amounts of data in order to use less and less of it.

Here are the best suggestions I have along with my hopes and dreams. I encourage you to tell me I'm wrong and there are better solutions. It would (actually) make my life much easier so feel free: https://c.im/@matdevdug

Logs

They seem like a good idea, right? Small little notes you leave for future you, letting you know what is going on. Logs begin, in my experience, as basically "print statements stored to disk". This quickly stops scaling as disk space gets consumed storing useless information that served a function during testing but that nobody cares about now. "Let's use log levels." Alright, now we're off to the confusing Olympics.

1. Log Levels Don't Mean Anything

[Level tables: Syslog levels, Python levels, Golang levels]
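
To make the mismatch concrete, here's a small sketch of the same idea in code. The syslog severities come from RFC 5424, the numeric values are Python's logging module, and Go's standard log package has no levels at all (log/slog adds just four):

    import logging

    # RFC 5424 syslog severities: lower number = more severe.
    SYSLOG = {0: "emerg", 1: "alert", 2: "crit", 3: "err",
              4: "warning", 5: "notice", 6: "info", 7: "debug"}

    # Python's logging module: higher number = more severe, and there is
    # no "notice", "alert" or "emergency" level at all.
    PYTHON = {50: "CRITICAL", 40: "ERROR", 30: "WARNING", 20: "INFO", 10: "DEBUG"}

    # Go's log/slog defines only Debug (-4), Info (0), Warn (4) and Error (8);
    # the classic log package has no levels.

    if __name__ == "__main__":
        # The "same" warning is severity 4 in syslog, 30 in Python and 4 in slog,
        # which is why cross-language log pipelines end up with mapping tables.
        logging.basicConfig(level=logging.INFO)
        logging.warning("disk usage high")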

2. Log formats are all over the place

  • JSON logging - easy to parse, but nested JSON can break parsers and the format is easy to change by developers
  • Windows event log - tons of data, unclear from docs how much of a "standard" it is
  • Common Event Format - good spec (you can read it here) but I've never seen anyone use it outside of network hardware companies.
  • GELF - a really good format designed to work nicely with UDP for logging (which is a requirement of some large companies) that I've never heard of before writing this. You can see it here.
  • Common Log Format - basically Apache logs (parsed in the sketch after this list): 127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
  • Nginx Log Format - log_format combined '$remote_addr - $remote_user [$time_local] ' '"$request" $status $body_bytes_sent ' '"$http_referer" "$http_user_agent"';
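
Since the Common Log Format line above is the kind of thing you inevitably end up parsing by hand, here's a rough sketch of what that looks like. The regex and field names are my own illustration rather than any standard library, and the whole thing breaks the moment the format shifts:

    import re

    # One regex per log format, and it breaks when someone "improves" the format.
    CLF = re.compile(
        r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
        r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<size>\d+|-)'
    )

    line = ('127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
            '"GET /apache_pb.gif HTTP/1.0" 200 2326')

    match = CLF.match(line)
    if match:
        entry = match.groupdict()
        print(entry["status"], entry["request"])  # -> 200 GET /apache_pb.gif HTTP/1.0
    else:
        # This branch is where unparsed lines silently pile up in real pipelines.
        print("unparseable line:", line)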

No consensus on what logs are even for

Traditionally the idea is that you use debug locally or in your dev environment, info is mostly thrown away (you probably don't need to know in intense detail when an application has done something normal) and you keep everything above info. The problem is that, especially with modern microservices and distributed requests, logging is often the only place you can say with any degree of confidence "we know everything that happened inside of the system".

What often happens is that someone will attach some ID header to the request, like a UUID. Then this UUID is returned back to the end consumer and this is how customer service can look through requests and determine "what happened to this customer at this time". So suddenly the logging platform becomes much more than capturing print statements that happen when stuff crashes, it's the primary tool that people use to debug any problems inside of the platform.
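
Here's a minimal sketch of that pattern using only the standard library: reuse or mint a request ID, emit every log line as flat JSON carrying it, and hand the same ID back to the caller. The header name and JSON field names are just illustrative choices, not any standard:

    import json
    import logging
    import uuid

    class JsonFormatter(logging.Formatter):
        """Emit flat JSON so the log pipeline can index request_id directly."""
        def format(self, record):
            return json.dumps({
                "level": record.levelname,
                "message": record.getMessage(),
                "request_id": getattr(record, "request_id", None),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("api")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    def handle_request(headers: dict) -> dict:
        # Reuse the caller's ID if present, otherwise mint one.
        request_id = headers.get("X-Request-ID", str(uuid.uuid4()))
        logger.info("charging card", extra={"request_id": request_id})
        # Returning the ID is what lets customer service search for it later.
        return {"status": "ok", "request_id": request_id}

    print(handle_request({}))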

That pattern impacts customer service (what customer did what, when), and it matters for auditing requirements that mandate you keep a record of every interaction. So soon the simple requirement of "please capture and send everything above info" turns into a MUCH bigger project, where the log search and capture infrastructure is super mission critical. It's the only way you can work backwards to what specifically happened with any individual user or interaction. Soon this feeds into business analytics, where logs become the source of truth for how many requests you got, whether a new customer is using the platform, etc.

Suddenly your very simple syslog setup isn't sufficient to do this, because you cannot have someone SSHing into a box to run a query for customer service. You need some sort of user-friendly interface. Maybe you start with an ELK stack, but running Elasticsearch is actually a giant pain in the ass. You try out SigNoz and that works, but now it's a new mission-critical piece of infrastructure that often kinda gets thrown out there.

Chances are this isn't someone's full-time job in your org, they just happened to pick up logging. It's not supposed to be a full-time gig, so I totally get it. They installed a few Helm charts, put it behind an OAuth proxy and basically hoped for the best. Instead they get a constant flood of complaints from consumers of the logging system: "Logs are missing, the search doesn't work, my parser doesn't return what I expect".

Logs start to serve as business intelligence source of truth, customer service tool, primary debugging tool, the way you know deploys worked, etc. I've seen this pattern at several jobs, and often the fragility of this approach is met with a "well, it's worked pretty well up to this point".

Not me, I just put it in cloud/SaaS/object storage.

Great, but since you need every log line your costs grow with every incoming customer. That sucks for a ton of reasons, but if your applications are chatty or you just have a lot of requests in a day, it can actually become a serious problem. My experience is companies do not anticipate that the cost of monitoring an application can easily exceed the cost of hosting the application even for simple applications.

Logging always ends up the same way. You eventually either add some sort of not-log system for the user requests you actually care about, stick with the SaaS and aggressively monitor usage while hoping for the best, and/or maintain a full end-to-end logging infrastructure that writes everything out to disks you manage.

Logs make sense as a concept, but they don't work as an actual tool unless you are willing to commit real engineering time every cycle to keeping the logging functional OR you are willing to throw a lot of cash at a provider. On top of that, soon you'll have people writing log parsers to alert on certain situations, which seems fine, but then the logs become even MORE critical and now you need to enforce logging structure standards or convert old log formats to the new one.

The other problem is that logs are such a stupid thing to have to store. 99.9999% of them are never useful, the ones that are look exactly like the rest, and at some point you end up sticking them in object storage forever, where no human being will ever interact with them until the end of time. The number of times I've written some variation on a "take terabytes of logs nobody has ever looked at from A and move them to B" script is too high. Even worse, the cost of tools like Athena to run a query against a massive bucket means this isn't something where you want developers splunking around looking for info.

Suggestions

  • If log messages are the primary way you monitor the entirety of a microservice-based system, you need to sit down and really think that through. What does it cost, how often does it have problems, can you scale it? Can you go without logs being stored?
  • When you have a log that must be stored for compliance or legal reasons, don't stick it into the same system you use to store every 200 OK line. Write it to a database (ideally) or an object store outside of the logging pipeline. I've used DynamoDB for this and had it work pretty well by sticking it in an SQS pipeline -> Lambda -> Dynamo (a sketch of that handler follows this list). Then your internal application can query this and you don't need to worry about log expiration thanks to DynamoDB TTL.
  • If you aren't going to make logging a priority (which I totally respect), then you need to set and enforce a low SLA. An SLA of 99% is about 7 hours and 14 minutes of downtime a month. This is primarily a management problem, but it means you need to let the system experience problems to break people of the habit of treating it as an infinitely reliable source of truth.
  • Your org needs a higher SLA than that? Pay a SaaS and calculate that into the cost of running the app. It's important to set billing labels with an external SaaS on as per-app a basis as possible. You need to be able to go back to teams and say "your application is costing us too much in observability", not "the business as a whole is spending a lot on observability".
  • Sampling is your friend. OpenTelemetry supports log sampling as an alpha feature here. It supports sampling based on priority, which to me is key. You want some percentage of lower-priority logs, but ideally as services mature you can continue to tune that down.
  • If you have to write a bunch of regex to parse it, start praying to whatever gods you believe in that it's a stable format.
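
For the compliance-log suggestion above, here's a rough sketch of the Lambda end of that SQS -> Lambda -> DynamoDB pipeline. The table name, key fields and retention period are made up for illustration; the only real requirement is a TTL attribute enabled on the table so DynamoDB expires records for you:

    import json
    import time

    import boto3

    dynamodb = boto3.resource("dynamodb")
    # Hypothetical table with TTL enabled on the "expires_at" attribute.
    table = dynamodb.Table("compliance-logs")

    RETENTION_SECONDS = 7 * 365 * 24 * 3600  # e.g. keep for roughly seven years

    def handler(event, context):
        """Triggered by SQS: persist each compliance-relevant record."""
        for record in event["Records"]:
            body = json.loads(record["body"])
            table.put_item(Item={
                # Keys are illustrative; pick whatever your internal app
                # will actually query by (customer, request ID, ...).
                "customer_id": body["customer_id"],
                "timestamp": body["timestamp"],
                "payload": json.dumps(body),
                # DynamoDB TTL deletes the item after this epoch second.
                "expires_at": int(time.time()) + RETENTION_SECONDS,
            })
        return {"stored": len(event["Records"])}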

Hopes and Dreams

  • Schema validation as a component of collectors for JSON logs (a sketch of what I mean follows this list). It seems weird that I can't really do this already, but it should be possible to globally enforce whether logs are ingested into my system by ensuring they follow an org schema. It'd be great to enforce it in the dev environment so people immediately see "hey, logs don't show up".
  • Sampled logs being more of a thing. My dream would be to tie them to deployments, so I crank the retention to 100% before I deploy, as I deploy and then for some period of time after I deploy. The collector makes an API call to see what the normal failure rate for this application is (how many 2xx, 4xx, 5xx) and then, if the application sticks with that breakdown, increases the sampling.
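
And a minimal sketch of the schema-enforcement wish, using the jsonschema package. The schema fields and the drop-versus-ingest behavior are my assumptions about how I'd want a collector to behave, not a description of any existing collector feature:

    import json

    from jsonschema import Draft7Validator

    # Hypothetical org-wide schema: every line must be flat JSON with these fields.
    ORG_LOG_SCHEMA = {
        "type": "object",
        "required": ["level", "message", "service", "request_id"],
        "properties": {
            "level": {"enum": ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]},
            "message": {"type": "string"},
            "service": {"type": "string"},
            "request_id": {"type": "string"},
        },
    }

    validator = Draft7Validator(ORG_LOG_SCHEMA)

    def ingest(raw_line: str) -> bool:
        """Return True to accept the line, False to reject it at the collector."""
        try:
            doc = json.loads(raw_line)
        except json.JSONDecodeError:
            return False
        errors = list(validator.iter_errors(doc))
        if errors:
            # In dev you'd surface this loudly so people notice immediately.
            print("rejected:", errors[0].message)
            return False
        return True

    print(ingest('{"level": "INFO", "message": "ok", "service": "api", "request_id": "abc"}'))  # True
    print(ingest('{"msg": "oops"}'))  # False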

I love what GCP does here for flow logs:

Even though Google Cloud doesn't capture every packet, log record captures can be quite large. You can balance your traffic visibility and storage cost needs by adjusting the following aspects of logs collection:

  • Aggregation interval: Sampled packets for a time interval are aggregated into a single log entry. This time interval can be 5 seconds (default), 30 seconds, 1 minute, 5 minutes, 10 minutes, or 15 minutes.
  • Sample rate: Before being written to Logging, the number of logs can be sampled to reduce their number. By default, the log entry volume is scaled by 0.5 (50%), which means that half of entries are kept. You can set this from 1.0 (100%, all log entries are kept) to 0.0 (0%, no logs are kept).
  • Metadata annotations: By default, flow log entries are annotated with metadata information, such as the names of the source and destination VMs or the geographic region of external sources and destinations. Metadata annotations can be turned off, or you can specify only certain annotations, to save storage space.
  • Filtering: By default, logs are generated for every flow in the subnet. You can set filters so that only logs that match certain criteria are generated.

I want that for everything all the time.

Metrics

Alright logs are crap and the signal to noise ratio is all off. We're gonna use metrics instead. Great! Metrics begin as super easy. Adding Prometheus-compatible metrics to applications is simple with one of the client libraries. You ensure that Prometheus grabs those metrics, typically with some k8s DNS regex or internal zone DNS work. Finally you slap Grafana in front of Prometheus, adding in Login with Google and you are good to go.
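
For reference, this is roughly what that "simple" starting point looks like with the official Python client library, prometheus_client. The metric names and port are arbitrary examples:

    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    # Arbitrary example metrics; Prometheus scrapes them from /metrics.
    REQUESTS = Counter("app_requests_total", "Total requests handled", ["status"])
    LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

    @LATENCY.time()
    def handle_request():
        time.sleep(random.uniform(0.01, 0.1))
        status = "200" if random.random() > 0.05 else "500"
        REQUESTS.labels(status=status).inc()

    if __name__ == "__main__":
        start_http_server(8000)  # exposes http://localhost:8000/metrics
        while True:
            handle_request()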

Except you aren't actually good to go, right? Prometheus is really designed to run on one server. You can scale vertically as you add more metrics and targets, but there's a finite cap on how big you can grow. Plus, when there is a Prometheus problem, you lose visibility into your entire stack at once. Then you need to start designing for federation. This is where people panic and start to talk about paying someone to do it.

Three Scaling Options

You can either:
1. Adopt hierarchical federation, where a higher-level Prometheus server scrapes aggregated metrics from the lower-level servers. It looks like this:

The complexity jump here cannot be overstated. You go from "store everything and let god figure it out" to needing to understand which metrics matter and which matter less, how to do aggregations inside of Prometheus, and how to add out-of-band monitoring for all these new services. I've done it, it's doable, but it is a pain in the ass.

2. Cross-service federation, which is less complicated to set up but has its own weirdness. Basically it's normal Prometheus servers at the bottom, lower-cardinality Prometheus servers reading from them, and you point everything at a "primary" node, for lack of a better term.

This design works, but it uses a lot of disk space and you still have the same monitoring problems as before. Plus, again, it's a big leap in complexity (though in practice I find managing this level of complexity, even solo, to be doable).
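
Both options are built on the same mechanism: the "higher" Prometheus scrapes the /federate endpoint of the servers below it. A minimal sketch of that scrape job, assuming you only federate pre-aggregated recording rules; the job name and hostnames are placeholders:

# prometheus.yml on the "global" server
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"job:.*"}'   # only pull pre-aggregated series
    static_configs:
      - targets:
          - prometheus-leaf-1:9090   # placeholder leaf servers
          - prometheus-leaf-2:9090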

I need an actual end-to-end metrics and alerting platform

Alright, so my examples work fine for short-term metrics. You can scale to basically "the disk size of a machine", which in practice in the cloud is probably fine. However, all of this has been about metrics as a tool for developers. Similar to logs, as metrics get more useful they attract interest from outside the scope of just debugging applications.

You can now track all sorts of things across the stack and compare things like "how successful was a marketing campaign". "Hey we need to know if Big Customer suddenly gets 5xxs on their API integration so we can tell their account manager." "Can you tell us if a customer stops using the platform so we know to reach out to them with a discount code?" These are all requests I've gotten and so many more, at multiple jobs.

I need A Lot of Metrics Forever

So as time goes on, the retention people want for metrics will inevitably increase, as will the cardinality. They want more specific information about not just services but, in many cases, customers or specific routes. They'll also want to alert on those routes, store them for (maybe) forever, and do all sorts of upstream things with the metrics.

This is where it starts to get Very Complicated.

Cortex

Cortex is a push service: you push metrics to it from your Prometheus servers and it takes over from there. There are some really nice features in Cortex, including deduplicating incoming samples from redundant Prometheus servers, so you can stand up a pair of identical Prometheus servers, point them at Cortex, and store only one copy of each sample. For this to work, though, you need to add a key-value store, so that's another thing on the list of services you are now running.
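
The push itself is just Prometheus remote write. A minimal sketch of what each redundant Prometheus server ships with, assuming Cortex's default HA-deduplication labels; the endpoint and label values are placeholders:

# Added to each (redundant) Prometheus server's prometheus.yml
global:
  external_labels:
    cluster: prod            # HA pair identifier Cortex dedupes on
    __replica__: replica-1   # unique per replica; dropped after deduplication
remote_write:
  - url: http://cortex-distributor:9009/api/v1/push   # placeholder Cortex endpoint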

I've used Cortex once; it's very good, but it is a lot of work to run. Between the Prometheus servers you are already running, managing Cortex itself, writing the configs and monitoring everything, it has reached Big Project status. You probably want it running in its own k8s cluster or server group.

Thanos

Similar goals to Cortex, different design. It's a sidecar process that ingests the metrics and moves them around using (to me) a simpler, more modular system. I've only just started to use Thanos but have found it to be a pretty straightforward system. However, it's still a lot to add on top of what started as a pretty simple problem. Of the two, I'd recommend Thanos just based on ease of getting started. Here are the services you are adding (a sample object-storage config follows the list):

  • Sidecar: connects to Prometheus, reads its data for query and/or uploads it to cloud storage.
  • Store Gateway: serves metrics inside of a cloud storage bucket.
  • Compactor: compacts, downsamples and applies retention on the data stored in the cloud storage bucket.
  • Receiver: receives data from Prometheus’s remote write write-ahead log, exposes it, and/or uploads it to cloud storage.
  • Ruler/Rule: evaluates recording and alerting rules against data in Thanos for exposition and/or upload.
  • Querier/Query: implements Prometheus’s v1 API to aggregate data from the underlying components.
  • Query Frontend: implements Prometheus’s v1 API to proxy it to Querier while caching the response and optionally splitting it by queries per day.

This is too complicated I'm gonna go with SaaS

Great but they're expensive. All the same rules as logging apply. You need to carefully monitor ingestion and ensure you aren't capturing high-cardinality metrics for no reason. Sticker shock when you get the first bill is common, so run some estimates and tests before you plug it in.

Suggestions

  • Define a hard limit for retention for metrics from day 1. What you are going to build really differs greatly depending on how long you are gonna keep this stuff. I personally cap the "only Prometheus" design at 30 days of metrics. I know people who go way longer with the federated designs but I find it helps to keep the 30 days as my north star of design.
  • If metrics are going to be your primary observability tool, don't do it in half measures. It's way harder to upgrade just Prometheus once the entire business is relying on it and downtime needs to be communicated up and down the chain. I'd start with either Thanos or Cortex from launch so you have a lot more flexibility if you want to keep a lot of metrics for a long period of time.
  • Outline an acceptable end state. If you are looking at a frightening number of metrics, Cortex is a better tool for sheer volume. I've seen a small group of people who knew it well manage Cortex at 1.6 million metrics a second with all the tools it provides to control and process that much data. However if the goal is less about sheer volume and more about long-term storage and accessibility, I'd go with Thanos.
  • Unlike a lot of folks, I think you just need to accept that metrics are going to be something you spend a lot of time working with. I've never seen a completely hands-off system that Just Works at high volume without insane costs. You need to monitor them, change their ingestion, tinker with the configuration, and then go back and do it again; it's time consuming.

Tracing

Logs are good for knowing exactly what happened but have a bad signal-to-noise ratio. Metrics are great for knowing what happened but can't work for infinite cardinality. Enter tracing, the hot new thing from 5 years ago. Traces solve a lot of the problems above, allowing tremendous amounts of data to be collected about requests as they move through your stack. In addition, tracing allows for amazing platform-agnostic monitoring: you can follow a request from your app to your load balancer to backend services and microservices and back.

Now, the real advantage of tracing to me is that it comes out of the box with the idea of sampling. It is a debugging and troubleshooting tool, not something with compliance or business uses, so it hasn't been completely mucked up with people jamming all sorts of weird requirements into it over time. You can very safely sample because it's only for developer troubleshooting.
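
As a concrete example of sampling being a first-class concept, here's a minimal sketch with the OpenTelemetry Python SDK that keeps roughly 10% of traces. The exporter here just prints to the console; in practice you'd swap in your backend's exporter.

# Head-based sampling: the keep/drop decision is derived from the trace ID at span creation.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

provider = TracerProvider(sampler=TraceIdRatioBased(0.1))  # keep ~10% of traces
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("charge-card"):
    pass  # the actual request handling goes here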

I'll be honest with you: my experience with setting up tracing has been "use SaaS and configure tracing". The one I've used the most and had the best experience with is Cloud Trace. It was easy to implement, controlling pricing was pretty straightforward, and I liked the troubleshooting workflow.

The problem with me and tracing is that nobody uses it. When I monitor the team's usage of traces, it is always a small fraction of the development team that ever logs in to use them. I don't know why the tool hasn't gained more popularity among developers. It's possible folks are more comfortable with metrics and logs, or perhaps they don't see the value (or maybe they feel like they already know where the time-consuming services are in their stack, so they just need the round-trip time off the load balancer). So far I haven't seen tracing "done right".

Hopefully that'll change some day.

Conclusion

Maybe I'm wrong about monitoring. Maybe everyone else is having a great time with it and I'm the one struggling. My experience has been that monitoring is an unloved internal service of the worst kind: it requires a lot of work, costs a lot of money and never makes the company any money.


The point of AI chat is selling ads

It's all advertising, all the way down.

The Robots are Coming

My entire life, automation has been presented as a threat. It is hard to count how often business has used that threat to keep wages down and push workers to keep increasing productivity. While the mechanism of the threatened automation changes over time (factory line robots, computers, AI), the basic message remains the same: if you demand anything more from work at any time, we'll replace you.

The reason this never happens is that automation is hard and requires intense organizational precision. You can't buy a factory robot and then decide to arbitrarily change things about the product. Human cashiers can deal with a much wider range of situations than a robotic cashier. If an organization wants to automate everything, it needs a structure capable of detailing what it wants to happen at every step, along with leadership informed enough about how the product works to account for every edge case.

Is this possible? Absolutely; in fact we see it with call center decision trees, customer support flows and chat bots. Does it work? Define work! Does it reduce the number of human workers you need giving unhelpful answers to questions? Yes. Are your users happy? No, but that's not a metric we care about anymore.

Let us put aside the narrative that AI is coming for your job for a minute. Why are companies so interested in this technology that they're willing to pour billions into it? The appeal, I think, is the delivery system: a conversation versus serving you up a bunch of results. You see advertising in search results. Users are now used to scrolling down until the ads are gone (or blocking them when possible).

With AI bots, users interact with data only through a service controlled by one company. The opportunity for selling ads to those users is immense. There already exist advertising marketplaces for companies to bid on spots to users depending on a wide range of criteria. If you are the company that controls all those pieces, you can now run ads inside the answer itself.

There is also the reality that AI is going to destroy web searching and social media. If these systems can replicate normal human text well enough that a casual read cannot detect them, and generate images on demand good enough that it takes detailed examination to determine they're fake, conventional social media and web search cannot survive. Any algorithm can be instantly gamed; people can be endlessly impersonated or just overwhelmed with fake users posting real-sounding opinions and objections.

So now we're in an arms race. The winner gets to be the exclusive source of truth for users and do whatever they want to monetize that position. The losers stop being relevant within a few years and join the hall of dead and dying tech companies.

Scenario 1 - Buying a Car Seat

Meet Todd. He works a normal job, with the AI chatbot installed on his Android phone. He hasn't opted out of GAID, so his unique advertising ID is tracked across all of his applications. Advertising networks know he lives in Baltimore and have a pretty good idea of his income, both from location information and the phone model information they get. Todd uses Chrome with the Topics API enabled and rolled out.

Right off the bat we know a lot about Todd. Based on the initial spec sheet for the taxonomy of topics (which is not a final draft, could change, etc.) available here: https://github.com/patcg-individual-drafts/topics, there's a ton of information we can get about Todd. You can download the IAB Tech Lab list of topics here: https://iabtechlab.com/wp-content/uploads/2023/03/IABTL-Audience-Taxonomy-1.1-Final-3.xlsx

Let's say Todd is in the following:

  • Demographic | Age Range | 30-34
  • Demographic | Education & Occupation | Undergraduate Education
  • Demographic | Education & Occupation | Skilled/Manual Work
  • Demographic | Education & Occupation | Full-Time
  • Demographic | Household Data | $40,000-$49,999
  • Demographic | Household Data | Adults (no children)
  • Demographic | Household Data | Median Home Value (USD) | $200,000-$299,999
  • Demographic | Household Data | Monthly Housing Payment (USD) | $1,000-$1,499
  • Interest | Automotive | Classic Cars

That's pretty precise data about Todd. We can answer a lot of questions about him, what he does, where he lives, what kind of house he has and what kinds of advertising would speak to him. Now let's say we know all that already and can combine that information with a new topic which is:

Interest | Family and Relationships | Parenting |

Todd opens his AI chat app and starts to ask questions about what the best car seat is. Anyone who has ever done this search in real life knows Google search results are jam-packed with SEO spam, so you end up needing to search "best car seat reddit" or "best car seat wirecutter". Todd doesn't know that trick, so instead he turns to his good friend the AI. When the AI gets that query, it can route the request to the auction system to decide who is going to get returned as an answer.

Is this nefarious? Only if you consider advertising on the web nefarious. This is mostly a more efficient way of doing the same thing other advertising is trying to do, but with a hyper-focus that other systems lack.

Auction System

The existing ad auction system is actually pretty well equipped to do this. The AI parses the question, determines what keywords apply, and then sees who is bidding on those keywords. Depending on the information Google knows about the user (a ton of information), it can adjust the Ad Rank of different ads to serve up the response that is most relevant to that specific user. So Todd won't get a response for a $5000 car seat that is a big seller in the Bay Area, because he doesn't make enough money to reasonably consider a purchase like that.

Instead Todd gets a response back from the bot steering him towards a cheaper model. He assumes the bot has considered the safety, user scores and any possible recalls when doing this calculation, but it didn't. It offered up the most relevant advertising response to his question with a link to buy the product in question. Google is paid for this response at likely a much higher rate than their existing advertising structure since it is so personalized and companies are more committed than ever to expanding their advertising buy with Google.

Since the bot doesn't show sources when it returns an answer, just the text of the answer, he cannot do any further research without going back to search. There is no safety check for this data, since Amazon reviews are also broken. Another bot might return a different answer, but how would he compare them?

Unless Todd wants to wander the neighborhood asking people what they bought, this response is a likely winner. Even if the bot discloses that the link is a sponsored link, which presumably it will have to do, it doesn't change the effect of the approach.

Scenario 2 - Mary is Voting

Mary is standing in line waiting to vote. She knows who she wants to vote for in the big races, but the ballot is going to have a lot of smaller candidates on it as well. She's a pretty well-informed person, but even she doesn't know where the local sheriff candidates stand on the issues or which judge is better than another. She has some time before she gets to vote, so she asks the AI who is running for sheriff and for information about them.

Mary uses an iPhone, so it hides her IP from the AI. She has also declined ATT, so the amount of information we know about her is pretty limited: some geoIP data off the Private Relay IP address. Yet we don't need that much information to do what we want to do.

Let's assume these companies aren't going to be cartoonishly evil for a minute and place some ethical guidelines on responses. If she were to ask "who is the better candidate for sheriff", we would assume the bot would return a list of candidates and information about them. Yet we can still follow that ethical guideline and have an opportunity to make a lot of money.

One of the candidates for sheriff recently had an embarrassing scandal. He's the front-runner and will likely win as long as enough voters don't hear about the terrible thing he did. How much could an advertising company charge to not mention it? It's not a lie; you are still answering the question, you just leave out some context. You could charge a tremendous amount for this service and still be (somewhat) ok. You might not even have to disclose it.

You already see this with conservative- and liberal-bent news in the US, so there is an established pattern. Instead of the bent being one way or the other, adjust the weights based on who pays more. It doesn't even need to be that blatant: the AI can still answer the question if asked directly "what is the recent scandal with candidate for sheriff X", so the omission appears accidental.

Mary gets the list of candidates and reviews their stances on positions important to her. Everything she interacted with looked legitimate and data-driven with detailed answers to questions. It didn't mention the recent scandal so she proceeds to act as if it had never happened.

In a world where the majority of people consume information from their phones after searching for it, the ability to keep information a company wants hidden from ever surfacing to users is massive. Even if the company has no particular interest in doing so for its own benefit, the ability to offer it, or to tilt the scales, is so powerful that it is hard to ignore.

The value of AI to advertising is the perception of its intelligence

What we are doing right now is publishing as many articles and media pieces as we can claiming how intelligent AI is. It can pass the bar exam, it can pass certain medical exams, it can even interpret medical results. This creates the perception among people that these systems are highly intelligent. The assumption people then make is that this intelligence will be used to replace existing workers in those fields.

While that might happen, Google is primarily an ad company. YouTube ads account for 10.2% of its revenue, Google Network ads for 11.4%, and ads from Google Search & other properties for 57.2%. Meta is even more one-dimensional, with 97.5% of its revenue coming from advertising. None of these companies are going to turn down opportunities to deploy their AI systems into workplaces, but those are slow-growth businesses. It'll take years to convince hospitals to let an AI review results, work through the regulatory problems of doing so, have the results peer-checked, etc.

Instead there's simpler, lower-hanging fruit we're all missing. By funneling users away from different websites where they do the data analysis themselves and towards the AI "answer", you can directly target users with high-cost advertising that will have a higher ROI than any conventional system. Users will be convinced they are receiving unbiased data-based answers while these companies will be able to use their control of side systems like phone OS, browser and analytics to enrich the data they know about the user.

That's the gold-rush element of AI. Whoever can establish their platform as the one users see as intelligent first, and get it installed on phones, will win. Once established, it's going to be difficult to convince users to double-check answers across different bots. The winner will be able to grab the gold ring of advertising: a personalized recommendation from a trusted voice.

If this obvious approach occurred to me, I assume it's old news for people inside of these respective teams. Even if regulators "cracked down" we know the time delay between launching the technology and regulation of that technology is measured in years, not months. That's still enough time to generate the kind of insane year over year growth demanded by investors.

I'll always double-check the results

That presupposes you can. The ability to detect whether content was generated by an AI is extremely bad right now, and there's no reason to think it will get better quickly. So you will be alone, cruising the internet looking for trusted sources on topics, with search results increasingly jam-packed with SEO-optimized junk text.

Will there be websites you can trust? Of course, you'll still be able to read the news. But even news sites are going to start adopting this technology (on top of many now being owned by politically-motivated owners). In a sea of noise, it's going to become harder and harder to figure out what is real and what is fake. These AI bots are going to be able to deliver concise answers without dealing with the noise.

Firehose of Falsehoods

According to a 2016 RAND Corporation study, the firehose of falsehood model has four distinguishing factors: it (1) is high-volume and multichannel, (2) is rapid, continuous, and repetitive, (3) lacks a commitment to objective reality; and (4) lacks commitment to consistency.[1] The high volume of messages, the use of multiple channels, and the use of internet bots and fake accounts are effective because people are more likely to believe a story when it appears to have been reported by multiple sources.[1] In addition to the recognizably-Russian news source, RT, for example, Russia disseminates propaganda using dozens of proxy websites, whose connection to RT is "disguised or downplayed."[8] People are also more likely to believe a story when they think many others believe it, especially if those others belong to a group with which they identify. Thus, an army of trolls can influence a person's opinion by creating the false impression that a majority of that person's neighbors support a given view.[1]

I think you are going to see this technique everywhere. The low cost of flooding conventional information channels with fake messages, even obviously fake ones, means real sources get drowned out. People will need to turn to this automation just to get quick answers to simple questions. By destroying the basic functionality of search and the open internet, these tools will be positioned to be the only source of truth.

The amount of work you will need to do in order to find primary-source independent information about a particular topic, especially a controversial topic, is going to be so high that it will simply exceed the capacity of your average person. So while some simply live with the endless barrage of garbage information, others use AI bots to return relevant results.

That's the value. Tech companies won't have to compete with each other, or with the open internet or start-up social media websites. If you want your message to reach its intended audience, this will be the only way to do it in a sea of fake. That's the point and why these companies are going to throw every resource they have at this problem. Whoever wins will be able to exclude the others for long enough to make them functionally irrelevant.

Think I'm wrong? Tell me why on Mastodon: https://c.im/@matdevdug


MRSK Review

I, like the entire internet, have enjoyed watching the journey of 37Signals from cloud to managed datacenter. For those unfamiliar, it's worth a read here. This has spawned endless debates about whether the cloud is worth it or whether we should all be buying hardware again, which is always fun; I enjoy having the same debates every 5 years just like every person who works in tech. However, mentioned in their migration documentation was a reference to an internal tool called "MRSK" which they used to manage their infrastructure. You can find their site for it here.

When I read this, my immediate thought was "oh god no". I have complicated emotions about creating custom in-house tooling unless it directly benefits your customers (which can include internal customers) enough that the inevitable burden of maintaining it over the years is worth it. It's often easier to yeet out software than it is to keep it running and design around its limitations, especially in the deployment space. My fear is that this kind of software is often the baby of one engineer; it gets adopted by other teams, that engineer leaves, and now the entire business is on a custom stack nobody can hire for.

All that said, 37Signals has open-sourced MRSK and I tried it out. It was better than expected (clearly someone has put love into it) and the underlying concepts work. However, if the argument is that this is an alternative to a cloud provider, I would expect to hit fewer sharp edges. This reeks of an internal tool made by a few passionate people who assumed nobody would run it any differently than they do. Currently it's hard to recommend to anyone outside of maybe "single developers who work with no one else and don't mind running into all the sharp corners".

How it works

The process to run it is pretty simple. Set up a server wherever (I'll use DigitalOcean) and configure it to start with an SSH key. You need to select Ubuntu (a tiny bummer, I would have preferred Debian, but whatever) and then you are off to the races.

Then select a public SSH key you already have in the account.

Setting up MRSK

On your computer run gem install mrsk if you have ruby or alias mrsk='docker run --rm -it -v $HOME/.ssh:/root/.ssh -v /var/run/docker.sock:/var/run/docker.sock -v ${PWD}/:/workdir  ghcr.io/mrsked/mrsk' if you want to do it as a Docker container. I did the second option, sticking that line in my .zshrc file.

Once installed you run mrsk init which generates all you need.

The following is the configuration file that is generated and gives you an idea of how this all works.

# Name of your application. Used to uniquely configure containers.
service: my-app

# Name of the container image.
image: user/my-app

# Deploy to these servers.
servers:
  - 192.168.0.1

# Credentials for your image host.
registry:
  # Specify the registry server, if you're not using Docker Hub
  # server: registry.digitalocean.com / ghcr.io / ...
  username: my-user

  # Always use an access token rather than real password when possible.
  password:
    - MRSK_REGISTRY_PASSWORD

# Inject ENV variables into containers (secrets come from .env).
# env:
#   clear:
#     DB_HOST: 192.168.0.2
#   secret:
#     - RAILS_MASTER_KEY

# Call a broadcast command on deploys.
# audit_broadcast_cmd:
#   bin/broadcast_to_bc

# Use a different ssh user than root
# ssh:
#   user: app

# Configure builder setup.
# builder:
#   args:
#     RUBY_VERSION: 3.2.0
#   secrets:
#     - GITHUB_TOKEN
#   remote:
#     arch: amd64
#     host: ssh://[email protected]

# Use accessory services (secrets come from .env).
# accessories:
#   db:
#     image: mysql:8.0
#     host: 192.168.0.2
#     port: 3306
#     env:
#       clear:
#         MYSQL_ROOT_HOST: '%'
#       secret:
#         - MYSQL_ROOT_PASSWORD
#     files:
#       - config/mysql/production.cnf:/etc/mysql/my.cnf
#       - db/production.sql.erb:/docker-entrypoint-initdb.d/setup.sql
#     directories:
#       - data:/var/lib/mysql
#   redis:
#     image: redis:7.0
#     host: 192.168.0.2
#     port: 6379
#     directories:
#       - data:/data

# Configure custom arguments for Traefik
# traefik:
#   args:
#     accesslog: true
#     accesslog.format: json

# Configure a custom healthcheck (default is /up on port 3000)
# healthcheck:
#   path: /healthz
#   port: 4000

Good to go?

Well not 100%. On first run I get this:

❯ mrsk deploy
Acquiring the deploy lock
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
  ERROR (RuntimeError): Can't use commit hash as version, no git repository found in /workdir

Apparently the directory you work in needs to be a git repo. Fine, easy fix. Then I got a perplexing SSH error.

❯ mrsk deploy
Acquiring the deploy lock
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
  INFO [39265e18] Running /usr/bin/env mkdir mrsk_lock && echo "TG9ja2VkIGJ5OiAgYXQgMjAyMy0wNS0wOVQwOToyNzoxNloKVmVyc2lvbjog
SEVBRApNZXNzYWdlOiBBdXRvbWF0aWMgZGVwbG95IGxvY2s=
" > mrsk_lock/details on 206.81.22.60
  ERROR (Net::SSH::AuthenticationFailed): Authentication failed for user [email protected]

❯ ssh [email protected]
Welcome to Ubuntu 22.10 (GNU/Linux 5.19.0-23-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Tue May  9 09:26:40 UTC 2023

  System load:  0.0               Users logged in:       0
  Usage of /:   6.7% of 24.06GB   IPv4 address for eth0: 206.81.22.60
  Memory usage: 19%               IPv4 address for eth0: 10.19.0.5
  Swap usage:   0%                IPv4 address for eth1: 10.114.0.2
  Processes:    98

0 updates can be applied immediately.

New release '23.04' available.
Run 'do-release-upgrade' to upgrade to it.


Last login: Tue May  9 09:26:41 2023 from 188.177.18.83
root@ubuntu-s-1vcpu-1gb-fra1-01:~#

So Ruby's SSH authentication failed even though I had the host configured in my SSH config and a standard SSH login worked without issue. Then a bad thought occurred to me: "Does it care... what the key is called? Nobody would make a tool that relies on SSH and assume it's id_rsa, right?"

❯ mrsk deploy
Acquiring the deploy lock
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
  INFO [6c25e218] Running /usr/bin/env mkdir mrsk_lock && echo "TG9ja2VkIGJ5OiAgYXQgMjAyMy0wNS0wOVQwOTo1Mjo0NloKVmVyc2lvbjog
SEVBRApNZXNzYWdlOiBBdXRvbWF0aWMgZGVwbG95IGxvY2s=
" > mrsk_lock/details on 142.93.110.241
Enter passphrase for /root/.ssh/id_rsa:
Booooooh

Moving past the bad SSH

Then I get this error:

❯ mrsk deploy
Acquiring the deploy lock
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
  INFO [3b53d161] Running /usr/bin/env mkdir mrsk_lock && echo "TG9ja2VkIGJ5OiAgYXQgMjAyMy0wNS0wOVQwOTo1ODoyOVoKVmVyc2lvbjog
SEVBRApNZXNzYWdlOiBBdXRvbWF0aWMgZGVwbG95IGxvY2s=
" > mrsk_lock/details on 142.93.110.241
Enter passphrase for /root/.ssh/id_rsa:
  INFO [3b53d161] Finished in 6.094 seconds with exit status 0 (successful).
Log into image registry...
  INFO [2522df8b] Running docker login -u [REDACTED] -p [REDACTED] on localhost
  INFO [2522df8b] Finished in 1.209 seconds with exit status 0 (successful).
  INFO [2e872232] Running docker login -u [REDACTED] -p [REDACTED] on 142.93.110.241
  Finished all in 1.3 seconds
Releasing the deploy lock
  INFO [2264c2db] Running /usr/bin/env rm mrsk_lock/details && rm -r mrsk_lock on 142.93.110.241
  INFO [2264c2db] Finished in 0.064 seconds with exit status 0 (successful).
  ERROR (SSHKit::Command::Failed): docker exit status: 127
docker stdout: Nothing written
docker stderr: bash: line 1: docker: command not found

docker command not found? I thought MRSK set it up.

From the GitHub:

This will:

    Connect to the servers over SSH (using root by default, authenticated by your ssh key)
    Install Docker on any server that might be missing it (using apt-get): root access is needed via ssh for this.
    Log into the registry both locally and remotely
    Build the image using the standard Dockerfile in the root of the application.
    Push the image to the registry.
    Pull the image from the registry onto the servers.
    Ensure Traefik is running and accepting traffic on port 80.
    Ensure your app responds with 200 OK to GET /up.
    Start a new container with the version of the app that matches the current git version hash.
    Stop the old container running the previous version of the app.
    Prune unused images and stopped containers to ensure servers don't fill up.

However:

root@ubuntu-s-1vcpu-1gb-fra1-01:~# which docker
root@ubuntu-s-1vcpu-1gb-fra1-01:~#

Fine I guess I'll install Docker. Not feeling like this is saving a lot of time vs rsyncing a Docker Compose file over.

sudo apt update
sudo apt upgrade -y
sudo apt install -y docker.io curl git
sudo usermod -a -G docker ubuntu

Now we have Docker on the machine.

Did it work after that?

Yeah, so my basic Flask app needed a new route added to it, but once I saw that you need to serve a route at /up and added one, it worked fine. Traffic is successfully paused during deployment and resumed once the application is healthy again. Overall, once I got it running it worked much as intended.
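
For anyone else hitting that: the route itself is about this much code. A minimal Flask sketch (not my actual app), matching the default healthcheck of GET /up noted in the generated config above:

from flask import Flask

app = Flask(__name__)

@app.route("/up")
def up():
    # MRSK/Traefik only needs a 200 back to consider the container healthy.
    return "OK", 200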

I also tried accessories, which is their term for necessary supporting services like MySQL. These are more like standard Docker Compose definitions, but it's nice to be able to include them. Again, it feels a little retro to say "please install MySQL on the MySQL box" and just hope that box doesn't go down, but it's totally serviceable. I didn't encounter anything interesting with the accessory testing.

Impressions

MRSK is an interesting tool. I think, if the community adopts it and irons out the edge cases, it'll be a good building-block technology for people not interested in running infrastructure. Comparing it to Kubernetes is madness, in the same way I wouldn't compare a go-kart I made in my garage to a semi-truck.

That isn't to hate on MRSK; I think it's a good tool for people with less complicated needs. However, part of the reason more complicated tools are complicated is that they cover more edge cases and automate more failure scenarios. MRSK doesn't cover those, so it gets to be simpler, but as you grow, more of those concerns shift back to you.

It's the difference between managing 5 hosts with Ansible and managing 1,500: 5 is easy, 1,500 becomes a nightmare. MRSK in its current state should be seen as a bridge technology unless your team expends the effort to customize it for your workflow and fill in the gaps in monitoring.

If it were me and I was starting a company today, I'd probably invest the effort in something like GKE Autopilot where GCP manages almost all the node elements and I worry exclusively about what my app is doing. But I have a background in k8s so I understand I'm an edge case. If you are looking to start a company or a project and want to keep it strictly cloud-agnostic, MRSK does do it.

What I would love to see added to MRSK to make it more production-ready:

  • Adding support for 1Password/secret manager for the SSH key component so it isn't a key on your local machine
  • Adding support for multiple users with different keys on the box, managed inside of some secret configuration, so you can tell which user did which deployment, and so key rotation can be part of deployment as needed (you can set a user per config file, but that isn't really granular enough to scale)
  • Fixing the issue where the ssh_config doesn't seem to be respected
  • Providing an example project in the documentation showing exactly what you need to hit mrsk deploy and have a functional project up and running
  • Let folks know that having the configuration file inside of a git repo is a requirement
  • Ideally integrating some concept of an autoscaling group into the configuration, with some lookup back to the config file (which you can do with a template, but it would be nice to build in)
  • Do these servers update themselves? What happens if Docker crashes? Can I pass resource limits to the service and not just accessories? A lot of missing pieces there.
  • mrsk details is a great way to quickly see the health status, but you obviously need to do more to monitor whether your app is functional or not. That's more on you than the MRSK team.

Should you use MRSK today

If you are a single developer who runs a web application, ideally a Rails application, and you provision your servers one by one with Terraform or whatever, where static IP addresses (internal or external) are something you can get and that don't change often, this is a good tool for you. I wouldn't recommend using the accessories functionality; I think you'll probably want to use a hosted database service if possible. However, it did work, so just consider how critical uptime is to you when you roll this out.

However if you are on a team, I don't know if I can recommend this at the current juncture. Certainly not run from a laptop. If you integrate this into a CI/CD system where the users don't have access to the SSH key and you can lock that down such that it stops being a problem, it's more workable. However as (seemingly) envisioned this tool doesn't really scale to multiple employees unless you have another system swapping the deployment root SSH key at a regular interval and distributing that to end users.

You also need to do a lot of work around upgrades, health monitoring of the actual VMs, and writing some sort of replacement system if a VM dies and you need to put another one in its place. What is the feedback loop back to this static config file to populate IP addresses? What about automating rollbacks if something fails, monitoring deployments to ensure they're not left in a bad state, or staggering the rollout (which MRSK does support)? A lot of what comes in the box with conventional tooling you have to write yourself here.

If you want to use it today

Here's the minimum I would recommend.

  • I'd use something like the 1Password SSH agent so you can at least distribute keys across the servers without having to manually add them to each laptop: https://developer.1password.com/docs/ssh/agent/
  • I'd set up a bastion server (which is supported by MRSK and did work in my testing). This is a cheap box that means you don't need to expose your application and database servers directly to the internet. There is a decent tutorial on how to make one here: https://zanderwork.com/blog/jump-host/
  • Ideally do this all from within a CI/CD stack so that you are running it from one central location and can more easily centralize the secret storage.

Parse YAML and push to Confluence in Python

I recently rewrote a system to output a YAML file containing a bunch of information for internal users. However, we use Confluence as our primary information-sharing system, so I needed to parse the YAML file on GitHub (where I was pushing it after every generation), generate some HTML and then push this up to Confluence on a regular basis. This was surprisingly easy to do, so I wanted to share how I did it.

from atlassian import Confluence
from bs4 import BeautifulSoup
import yaml
import requests
import os

# Debugging only: this prints every environment variable (including secrets) to the logs.
print(os.environ)
git_username = "github-username"
git_token = os.environ['GIT-TOKEN']
confluence_password = os.environ['CONFLUENCE-PASSWORD']
url = 'https://raw.githubusercontent.com/org/repo/file.yaml'
page_id=12345678
page_title='Title-Of-Confluence-Page'
path='/tmp/file.yaml'
original_html =  '''<table>
  <tr>
    <th>Column Header 1</th>
    <th>Column Header 2</th>
    <th>Column Header 3</th>
    <th>Column Header 4</th>
  </tr>
</table>'''

def get_file_from_github(url, username, password):
    response = requests.get(url, stream=True, auth=(username,password))
    print(response)
    with open(path, 'wb') as out_file:
        out_file.write(response.content)
        print('The file was saved successfully')

def update_confluence(path, page_id, page_title, original_html):
    with open(path, 'r') as yamlfile:
        current_yaml = yaml.safe_load(yamlfile)

    confluence = Confluence(
            url='https://your-hosted-confluence.atlassian.net',
            username='[email protected]',
            password=confluence_password,
            cloud=True)
    soup = BeautifulSoup(original_html, 'html5lib')
    table = soup.find('table')
    
    #This part is going to change based on what you are parsing but hopefully provides a template. 

    for x in current_yaml['top-level-yaml-field']:
        dump = '\n'.join(x['list-of-things-you-want'])
        pieces = x['desc'].split("-")
        # Assumption: the first segment of the desc field is the display name for the table.
        name = pieces[0]

        table.append(BeautifulSoup(f'''
                                <tr>
                                  <td>{name}</td>
                                  <td>{x['role']}</td>
                                  <td>{x['assignment']}</td>
                                  <td style="white-space:pre-wrap; word-wrap:break-word">{dump}</td>
                                </tr>''', 'html.parser'))
    
    body = str(soup)
    update = confluence.update_page(page_id, page_title, body, parent_id=None, type='page', representation='storage', minor_edit=False, full_width=True)
    
    print(update)

def main(request):
    if confluence_password is None:
        print("There was an issue accessing the secret.")
    get_file_from_github(url, git_username, git_token)
    update_confluence(path, page_id, page_title, original_html)
    return "Confluence is updated"

Some things to note:

  • obviously the YAML parsing depends on the file you are going to parse
  • The Confluence Page ID is most easily grabbed from the URL in Confluence when you make the page. You can get instructions on how to grab the Page ID here.
  • I recommend making the Confluence page first, grabbing the ID and then running it as an update.
  • I'm running logging through a different engine.
  • The github token should be a read-only token scoped to just the repo you need. Don't make a large token.

The deployment process on GCP couldn't have been easier.  Put your secrets in the GCP secret manager and then run:

gcloud functions deploy confluence_updater --entry-point main --runtime python310 --trigger-http --allow-unauthenticated --region=us-central1 --service-account serverless-function-service-account@gcp-project-name.iam.gserviceaccount.com --set-secrets 'GIT-TOKEN=confluence_git_token:1,CONFLUENCE-PASSWORD=confluence_password:1'
  • I have --allow-unauthenticated just for testing purposes. You'll want to put it behind auth.
  • The --set-secrets flag loads them as environment variables.

There you go! You'll have a free function you can use forever to parse YAML or any other file format from GitHub and push to Confluence as HTML for non-technical users to consume.

The requirements.txt I used is below:

atlassian-python-api==3.34.0
beautifulsoup4==4.11.2
functions-framework==3.3.0
install==1.3.5
html5lib==1.1

Problems? Hit me up on Mastodon: https://c.im/@matdevdug


TIL How to write a Python CLI tool that writes Terraform YAML

I'm trying to use more YAML in my Terraform as a source of truth, instead of endlessly repeating the creation of resources, and to write CLIs that automate generating that YAML. One area where I've had a lot of luck with this is GCP IAM. This is due to a limitation in GCP that doesn't allow you to combine pre-existing IAM roles into custom roles, which is annoying. I end up needing to assign people the same permissions across many different projects and wanted to come up with an easier way to do this.

I did run into one small problem. When attempting to write out the YAML file, PyYAML was inserting strange YAML tags into the output that looked like this: !!python/tuple.

It turns out this is intended behavior: when PyYAML serializes arbitrary Python objects (here, the tuple that click produces for a multiple=True option), it inserts deserialization hint tags. This breaks Terraform's yamldecode, which can't understand the inserted tags. The breaking code looks as follows.

with open(path,'r') as yamlfile:
    current_yaml = yaml.safe_load(yamlfile)
    current_yaml['iam_roles'].append(permissions)

if current_yaml:
    with open(path,'w') as yamlfile:
        yaml.encoding = None
        yaml.dump(current_yaml, yamlfile, indent=4, sort_keys=False)

I ended up stumbling across a custom Emitter setting to fix this issue for Terraform. This is probably not a safe option to enable, but it does seem to work for me and does what I would expect.

The flag is: yaml.emitter.Emitter.prepare_tag = lambda self, tag: ''

So the whole thing, including the click elements looks as follows.

import click
import yaml

@click.command()
@click.option('--desc', prompt='For what is this role for? Example: analytics-developer, devops, etc', help='Grouping to assign in yaml for searching')
@click.option('--role', prompt='What GCP role do you want to assign?', help="All GCP premade roles can be found here: https://cloud.google.com/iam/docs/understanding-roles#basic")
@click.option('--assignment', prompt="Who is this role assigned to?", help="This needs the syntax group:, serviceAccount: or user: before the string. Example: group:[email protected] or serviceAccount:[email protected]")
@click.option('--path', prompt="Enter the relative path to the yaml you want to modify.", help="This is the relative path from this script to the yaml file you wish to append to", default='project-roles.yaml')
@click.option('--projects', multiple=True, type=click.Choice(['test', 'example1', 'example2', 'example3']))
def iam_augmenter(path, desc, role, assignment, projects):
    permissions = {}
    permissions["desc"] = desc
    permissions["role"] = role
    permissions["assignment"] = assignment
    permissions["projects"] = projects

    with open(path,'r') as yamlfile:
        current_yaml = yaml.safe_load(yamlfile)
        current_yaml['iam_roles'].append(permissions)

    if current_yaml:
        with open(path,'w') as yamlfile:
            yaml.emitter.Emitter.prepare_tag = lambda self, tag: ''
            yaml.encoding = None
            yaml.dump(current_yaml, yamlfile, indent=4, sort_keys=False)

if __name__ == '__main__':
    iam_augmenter()

This worked as intended, allowing me to easily append to an existing YAML file with the following format:

iam_roles:
  - desc: analytics-reader-bigquery-data-viewer
    role: roles/bigquery.dataViewer
    assignment: group:[email protected]
    projects:
    - example1
    - example2
    - example3

This allowed me to easily add the whole thing to automation that can be called from a variety of locations, meaning we can keep using the YAML file as the source of truth but quickly append to it from different sources. Figured I would share as this took me an hour to figure out and maybe it'll save you some time.

The Terraform that parses the file looks like this:

locals {
  all_iam_roles = yamldecode(file("project-roles.yaml"))["iam_roles"]


  stock_roles = flatten([for iam_role in local.all_iam_roles :
    {
      "description" = iam_role.desc
      "role"        = iam_role.role
      "member"      = iam_role.assignment
      "project"     = iam_role.projects
    }
  ])
  
  # Shortname for projects to full names
  test          = "test-dev"
  example1      = "example1-dev"
  example2      = "example2-dev"
  example3      = "example3-dev"
}

resource "google_project_iam_member" "test-dev" {
  for_each = {
    for x in local.stock_roles : x.description => x
    if contains(x.project, local.test) == true
  }
  project = local.test
  role    = each.value.role
  member  = each.value.member
}

resource "google_project_iam_member" "example1-dev" {
  for_each = {
    for x in local.stock_roles : x.description => x
    if contains(x.project, local.example1) == true
  }
  project = local.example1
  role    = each.value.role
  member  = each.value.member
}

Hopefully this provides someone out there in GCP land some help with handling large numbers of IAM permissions. I've found it to be much easier to wrangle as a Python CLI that I can hook up to different sources.

Did I miss something or do you have questions I didn't address? Hit me up on Mastodon: https://c.im/@matdevdug


Layoffs are Cruel and Don't Work

Imagine you had a dog. You got the dog when it was young, trained and raised it. This animal was a part of your family and you gave it little collars and cute little clothes with your family name on it. The dog came to special events and soon thought of this place as its home and you all as loved ones. Then one day, with no warning, you locked the dog out of the house. You and the other adults in the house had decided that getting rid of a random dog was important to the bank that owned your house, so you locked the door. Eventually it wandered off, unsure of why you had done this, still wearing the sad little collar and t-shirt with your name.

If Americans saw this in a movie, people would warn each other that it was "too hard to watch". In real life, this is an experience a huge percentage of people working in tech will go through. It is a jarring thing to watch: former coworkers finding out they don't work there anymore when their badges are deactivated and they can't swipe through the door. I had an older coworker, who we'll call Bob, who upon learning it was layoffs took off for home. "I can't watch this again," he said as he quickly shoved his stuff into his bag and ran out the door.

In that moment all illusion vanishes. This place isn't your home, these people aren't your friends and your executive leadership would run you over with their cars if you stood between them and revenue growth. Your relationship to work changes forever. You will never again believe that you are "critical" to the company or that the company is interested in you as a person. I used to think before the layoffs that Bob was a cynic, never volunteering for things, always double-checking the fine print of any promise made by leadership. I was wrong and he was right.

Layoffs don't work

Let us set aside the morality of layoffs for a moment. Do layoffs work? Are these companies better positioned to compete after they terminate some large percentage of their people? The answer appears to be no:

The current study investigated the financial effects of downsizing in Fortune 1000 Companies during a five-year period characterized by continuous economic growth. Return on assets, profit margin, earnings per share, revenue growth, and market capitalization were measured each year between 2003 and 2007. In general, the study found that both downsized and nondownsized companies reported positive financial outcomes during this period. The downsized companies, however, were outperformed consistently by the nondownsized ones during the initial two years following the downsizing. By the third year, these differences became statistically nonsignificant. Consequently, although many companies appear to conduct downsizing because the firm is in dire financial trouble, the results of this study clearly indicated that downsizing does not enhance companies' financial competitiveness in the near-term. The authors discuss the theoretical and practical implications of these findings.

Source

In all my searching I wasn't able to find any hard data suggesting layoffs either enable a company to compete better or improve earnings in the long term. The logic executives employ seems to make sense on its face: you eliminate employees and departments, which lets you redirect that spending to more profitable areas of the business. You are scaling to meet demand, so you don't have employees churning away at something they don't need to be working on. Finally, you are eliminating low-performing employees.

It’s about the triumph of short-termism, says Wharton management professor Adam Cobb. “For most firms, labor represents a fairly significant cost. So, if you think profit is not where you want it to be, you say, ‘I can pull this lever and the costs will go down.’ There was a time when social norms around laying off workers when the firm is performing relatively well would have made it harder. Now it’s fairly normal activity.”

This all tracks until you start getting into the details. Think about it strictly from a financial perspective. Firms hire during boom periods, paying a premium for talent. Then they lay people off, taking the institutional hit of losing all of that knowledge and experience. The next time they need to hire, they're paying that premium again. It is classic buying high and selling low. In retail and customer-facing channels, this results in a worse customer experience, meaning the move designed to save money costs more in the long term. Investors don't even reliably reward you for doing it, even though they ask for it.

Among the current tech companies this logic makes even less sense. Meta, Alphabet, PayPal and others are profitable companies, so this isn't even a desperate bid to stay alive. These companies are laying people off in response to investor demand and imitative behavior. After decades of research, executives know layoffs don't do what it says on the box, but their boards are asking why they aren't considering layoffs, so they proceed anyway.

Low-performing Employees

A common argument I've heard is "well, ok, maybe layoffs don't help the company directly, but it is an opportunity to get rid of dead weight". Sure, except presumably at-will employers could have done that at any time if they had hard data suggesting this pool of employees wasn't working out.

Recently, we asked 30 North American human resource executives about their experiences conducting white-collar layoffs not based on seniority — and found that many believed their organizations had made some serious mistakes. More than one-third of the executives we interviewed thought that their companies should have let more people go, and almost one-third thought they should have laid off fewer people. In addition, nearly one-third of the executives thought their companies terminated the wrong person at least 20% of the time, and approximately an additional quarter indicated that their companies made the wrong decision 10% of the time. More than one-quarter of the respondents indicated that their biggest error was terminating someone who should have been retained, while more than 70% reported that their biggest error was retaining someone who should have been terminated.

Source

Coming up with a scientific way of determining who is doing a good job and who is doing a bad job is extremely hard. If your organization wasn't able to identify those people before layoffs, you can't do it at layoff time. My experience with layoffs is it is less a measure of quality and more an opportunity for leadership to purge employees who are expensive, sick or aren't friends with their bosses.

All in all we know layoffs don't do the following:

  • They don't reliably increase stock price (American Express post layoffs)
  • Layoffs don't increase productivity or employee engagement (link)
  • It doesn't keep the people you have. For example, layoffs targeting just 1% of the workforce preceded, on average, a 31% increase in turnover. Source
  • It doesn't help you innovate or reliably get rid of low-performance employees.

Human Cost

Layoffs also kill people. Not in the spiritual sense, but in the real physical sense. In the light beach book "MORTALITY, MASS-LAYOFFS, AND CAREER OUTCOMES: AN ANALYSIS USING ADMINISTRATIVE DATA", which you can download here, we see some heavy human costs for this process.

We find that job displacement leads to a 15-20% increase in death rates during the following 20 years. If such increases were sustained beyond this period, they would imply a loss in life expectancy of about 1.5 years for a worker displaced at age 40.

The impact isn't just on the people you lay off, but on the people who have to lay them off and the employees who remain. It is a massive trickle-down effect which destroys morale at a critical juncture for your company. Your middle management is going to be more stressed and less capable. The employees you have left are going to be less efficient and capable as well.

This isn't a trivial amount of damage being done here. Whatever goodwill an employer has built with their employees is burned to the ground. The people you have left are going to trust you less, not work as hard, be more stressed and resent you more, all at a time when you are asking more of the remaining teams, feeding into that increase in turnover.

If you were having trouble executing before, there is no way in hell it gets better after this.

Alternatives

“Companies often attempt to move out of an unattractive game and into an attractive one through acquisition. Unfortunately, it rarely works. A company that is unable to strategize its way out of a current challenging game will not necessarily excel at a different one—not without a thoughtful approach to building a strategy in both industries. Most often, an acquisition adds complexity to an already scattered and fragmented strategy, making it even harder to win overall.”

So if layoffs don't work, what are the options? SAS Institute has always been presented as a fascinating outlier in this area as a software company that bucks the trends. One example I kept seeing was SAS Institute has never done layoffs, instead hiring during downturns as a way to pick up talent for cheap.  You can read about it here.

Now in reality, SAS Institute has done small rounds of layoffs, so this often-repeated story isn't as true as it sounds. Here they are laying off 100 people. These folks were in charge of a lot of office operations during a time when nobody was going to the office, but it still counts. However, the logic behind not doing mass layoffs still holds true, despite the oft-repeated lie that SAS Institute never does them.

Steve Jobs also bucked this trend somewhat famously.

"We've had one of these before, when the dot-com bubble burst. What I told our company was that we were just going to invest our way through the downturn, that we weren't going to lay off people, that we'd taken a tremendous amount of effort to get them into Apple in the first place -- the last thing we were going to do is lay them off. And we were going to keep funding. In fact we were going to up our R&D budget so that we would be ahead of our competitors when the downturn was over. And that's exactly what we did. And it worked. And that's exactly what we'll do this time."

If you truly measure the amount of work it takes to onboard employees, get them familiar with your procedures and expectations, and retain them during the boom times, it really stops making sense to jettison them during survivable downturns. These panic layoffs that aren't based on any sort of hard science or logic are amazing opportunities for companies that are willing to weather some bad times and emerge intact with a motivated workforce.

It's not altruism at work. Rather, executives at no-layoff companies argue that maintaining their ranks even in terrible times breeds fierce loyalty, higher productivity, and the innovation needed to enable them to snap back once the economy recovers.

So if you work for any company, especially in tech, and leadership starts discussing layoffs, you should know a few things. They know it doesn't do what they say it does. They don't care that it is going to cause actual physical harm to some of the people they are doing it to. These execs are also aware it isn't going to be a reliable way of getting rid of low-performing employees or retaining high-performing ones.

If you choose to stay after a round of layoffs, you are going to be asked to do more with less. The people you work with are going to be uninterested in their jobs or careers and likely less helpful and productive than ever before. Any loyalty or allegiance to the company is dead and buried, so expect to see more politics and manipulation as managers attempt to give leadership what they want in order to survive.

On the plus side you'll never have the same attitude towards work again.


Why are passwords a user's problem?

In light of GoTo admitting their breach was worse than initially reported, I have found myself both discussing passwords with people more than ever before and directing a metric ton of business towards 1Password. However, it has raised an obvious question for me: why are users involved with passwords at all? Why is this still something I have to talk to my grandparents about?

Let us discuss your password storage system again

All the major browsers have password managers that sync across devices. These stores are (as far as I can tell) reasonably secure. Access to the device would reveal them, but excluding physical access to an unlocked computer they seem fine. There is a common API, the Credential Management API (docs here), that allows a website to query the password store inside the browser for a login, even allowing for federated logins and different (or the same) logins for subdomains as part of the spec. This makes for a truly effortless login experience for users without needing them to do anything. These browsers already have syncing with a master password concept across mobile/desktop and can generate passwords upon request.

If the browser can: generate a password, store a password, sync a password and return the password when asked, why am I telling people to download another tool that does the exact same thing? A tool made by people who didn't make the browser and most of whom haven't been independently vetted by anybody.

Surely it can't be that easy

So when doing some searching about the Credential Management API, one of the sites you run across a lot is this demo site: https://credential-management-sample.appspot.com/. This allows you to register an account, log out and then see the login auto-filled by the browser when you get back to it. The concept seems to work as expected on Chrome.

Bummer

Alright, so it doesn't work on Firefox and Safari, but honestly, neither do 10% of the websites I go to. Covering 88% of all the users in the world still isn't bad, so I'm not willing to throw the idea out entirely.

Diving into how the process works, again, it seems pretty straightforward.

var signin = document.querySelector('#signin');
signin.addEventListener('click', (e) => {
  if (window.PasswordCredential || window.FederatedCredential) {
    navigator.credentials
      .get({
        password: true,
        federated: {
          providers: ['https://accounts.google.com'],
        },
        mediation: 'optional',
      })
      .then((c) => {
        if (c) {
          switch (c.type) {
            case 'password':
              return sendRequest(c);
            case 'federated':
              return gSignIn(c);
          }
        } else {
          return Promise.resolve();
        }
      })
      .then((profile) => {
        if (profile) {
          updateUI(profile);
        } else {
          location.href = '/signin';
        }
      })
      .catch((error) => {
        location.href = '/signin';
      });
  }
});

If the user has a login then get it. It supports federated logins or passwords and falls back to redirecting to the sign-in page if you cannot locate a login. I tried the samples available here and they seemed to mostly be plug and play. In fact in my testing this seemed to be a far superior user experience to using traditional password managers with browser extensions.

Also remember that even for browsers that don't support it, I'm just falling back to the normal password storage system. So for websites that support it, the experience is magical on Chrome and the same as using a password manager with every other browser. It doesn't cost anything, it isn't complicated and it is a better experience.

I know someone out there is gearing up 

Are Password Managers Better?

One common theme when you search for this stuff is an often-repeated opinion that browser password managers are trash and dedicated password managers are better. Looking more into how they work, this seems to come with some pretty big asterisks. Most password managers seem to use some JavaScript from their CDN to insert their interface into the login form fields.

This is a little nerve-racking because websites could interact with that element, but the communication between the password manager's extension and its local application is also a potential source of problems. Communication to a local HTTP target seems to make sense, but this can be a source of problems (and has been in the past). Example Example Example

So, at a minimum, you'd need the tool you chose to meet these requirements just to match or exceed the security of the browser's built-in manager.

  • The add-on runs in a sandboxed background page
  • Communication between the password manager and the page isn't happening in the DOM
  • Any password element would need to be an iframe or something else that stops the site from interacting with the content
  • CSP is set up flawlessly
  • Communication between the extension and anything outside of the extension is secure and involves some verification step
  • Code validation in pretty much every direction: is the browser non-modified, is the server process valid, is the extension good, etc

This isn't even getting to the actual meat of the encryption on the secrets or the security of the syncing. We're just talking about whether the thing that interacts with the secrets and jams them into the page is itself secure.

To make a product that does this and does it well and consistently across releases isn't an easy problem. Monitoring for regressions and breaches would be critical, disclosures would be super important to end users and you would need to get your stack vetted by an outside firm kind of a lot. I trust the developers of my web browser in part because I have to and because, over the years, Mozilla has been pretty good to me. The entire browser stack is under constant attack because it has effectively become the new OS we all run.

Well these companies are really good at all that

Are they? Frankly, in my research I wasn't really blown away by the amount of technical auditing most of these companies seem to do or produce any evidence of. The only exceptions to this were 1Password and Bitwarden.

1Password

I love that they have a whitepaper available here but nobody finished writing it.

No rush I guess

However, they do have actual independent audits of their software: recent audits done by reputable firms and available for review. You can see all of these here. For the record, this should be on every single one of these companies' websites for public review.

Keeper Password Manager

I found what they call a whitepaper but it's 17 pages and basically says "We're ISO certified". That's great I guess, but not the level of detail I would expect at all. You can read it here. This doesn't mean you are doing things correctly, just that you have generated enough documentation to get ISO certified.

Not only do we implement the most secure levels of encryption, we also adhere to very strict internal practices that are continually audited by third parties to help ensure that we continue to develop secure software and provide the world’s most secure cybersecurity platform.

Great, can I read these audits?

Dropbox Password

Nothing seems to exist discussing this product's technical merits at all. I don't know how it works. I can look into it more if someone can point me to something else, but it seems to be an encrypted file that lives in your Dropbox folder, secured with a key generated by Dropbox and returned to you upon enrollment.

Dashlane

I found a great security assessment from 2016 that seemed to suggest the service was doing pretty well. You can get that here. I wasn't able to find one more recent. Reading their whitepaper here they actually do go into a lot of detail and explain more about how the service works, which is great and I commend them for that.

It's not sufficient though. I'm glad you understand how the process should work but I have no idea if that is still happening or if this is more of an aspirational document. I often understand the ideal way software should work but the real skill of the thing is getting it to work that way.

Bitwarden

They absolutely kill it in this department. Everything about them is out in the open like it should be. However sometimes they discover issues, which is good for the project but underscores what I was talking about above. It is hard to write a service that attempts to handle your most sensitive data and inject that data into random websites.

These products introduce a lot of complexity and failure points into the secret management game. All of them, with the exception of 1Password, seem to really bet the farm on the solo Master/Primary Password concept. This is great if your user picks a good password, but statistically this idea seems super flawed to me. This is a password they're going to enter all the time; won't they pick a crap one? Even with 100,000 iterations of key derivation on that password, it's pretty dangerous.
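
To make the math concrete, here is a minimal sketch of that kind of key stretching, assuming PBKDF2-SHA256 (the actual KDF, salt handling, and iteration counts vary by product, and the password below is obviously made up):

import hashlib
import os

# The kind of password a person will happily type fifty times a day.
master_password = b"Fluffy2015!"
salt = os.urandom(16)

# Stretch it into a vault key. 100,000 iterations makes every guess roughly
# 100,000x more expensive for an attacker...
vault_key = hashlib.pbkdf2_hmac("sha256", master_password, salt, 100_000)

# ...but it adds zero entropy. An attacker working through the few million most
# common password patterns still lands on a weak Master Password quickly; the
# iteration count buys a constant factor, not safety.
print(vault_key.hex())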

Plus if you are going to rely on the concept of "well if the Master/Primary Password is good then the account is secure" then we're certainly not justifying the extra work here. It's as good as the Firefox password manager and not as good as the Safari password manager. Download Firefox and set a good Primary Password.

Can we be honest with each other?

I want you to go to this website and I want you to type in your parents' password. You know the one, the one they use for everything? The one that's been shouted through the halls and texted/emailed/written on so many post-it notes that any concept of security has long since left the building.

That's the password they're gonna use to secure the vault. They shouldn't, but they're gonna. Now I want you to continue on this trust exercise with me. If someone got read/write access to a random cross-section of your coworkers' computers, are passwords really the thing that is gonna destroy your life? Not an errant PDF, an Excel document of customer data or an unsecured AWS API key?

I get it, security stuff is fun to read. "How many super computers will it take to break in" feels very sci-fi.

Well but my family/coworkers/lovers all share passwords

I'm not saying there is zero value to a product where there is a concept of sharing and organizing passwords with a nice UI, but there's also no default universal way of doing it. If all the password managers made a spec that they held to that allowed for secure bidirectional sharing between these services, I'd say "yeah the cost/benefit is likely worth it". However chances are if we're in a rush and sharing passwords, I'm going to send you the password through an insecure system anyway.

Plus the concept of sharing introduces ANOTHER huge layer of possible problems. Permission mistakes, associating the secret with the wrong user, and the user copying the secret into their personal vault and not getting updates when the shared secret changes are all weird issues I've seen at workplaces. To add insult to injury, the process of getting someone added to a shared folder they need is often so time-consuming that people will just bypass it and copy/paste the secret anyway.

Also let's be honest among ourselves here. Creating one shared login for a bunch of employees to use was always a bad idea. We all knew it was a bad idea and you knew it while you were doing it. Somewhere in the back of your mind you were like "boy it'll suck if someone decides to quit and steals these".

I think we can all agree on this

I know, "users will do it anyway". Sure but you don't have to make it institutional policy. The argument of "well users are gonna share passwords so we should pay a service to allow them to do it easier" doesn't make a lot of sense. I also know sometimes you can't avoid it, but for those values, if they're that sensitive, it might not make sense to share them across all employees in a department. Might make more sense to set them up with a local tool like pass.

Browsers don't prompt the user to make a Master/Primary Password

That is true, and perhaps the biggest point in the category of "you should use a password manager". The way the different browsers do this is weird. Chrome effectively uses the user's login as the key: on Windows it calls a Windows API that encrypts the SQLite database and decrypts it when the user logs in. On the Mac there is a login keychain entry with a random value that seems to serve the same function. If the user is logged in, the SQLite database is accessible. If they aren't, it isn't.

On Firefox there is a Primary Password you can set that effectively works like most of the password managers we saw. Unlike those password managers, this isn't synced, so you would set a different Primary Password on every Firefox device. The Firefox account still controls what syncs where; this just ensures that someone who takes the database of usernames and passwords would need this key to decrypt it.

So for Chrome, if your user is logged in, the entire password database is available. On macOS they can get access to the decryption key through the login keychain, and on Firefox the values are encrypted in the file, with the Primary Password adding extra protection and stopping random users from poking at them through the browser. There is a great write-up of how local browser password stores work here.

There are more steps than Chrome, but it allows for a Primary Password

Is that a sufficient level of security?

Honestly? Yeah I think so. The browser prompts the user to generate a secure value, stores the value, syncs the value securely and then, for 88% of the users on the web, the site can use a well-documented API to automatically fill in that value in the future. I'd love to see Chrome add a few more security levels, some concept of Primary Password so that I can lock the local password storage to something that isn't just me being logged into my user account.

However, we're also rapidly reaching a point where the common wisdom is that everything important needs 2FA. So if we're already going to treat authentication as a tiered approach, I think a pretty good argument could be made that it is safer for a user to store their passwords in the browser store (understanding that the password was always something a malicious actor with access to their user account could grab through keyloggers, clipboard theft, etc.) and keep the 2FA on a phone, as compared to what a lot of people do, which is keep the 2FA and the password inside the same third-party password manager.

TOTPs are just password x2

When you scan that QR code, you are getting back a string that looks something like this:

otpauth://totp/example:user@example.com?algorithm=SHA1&digits=6&issuer=mywebsite&period=30&secret=CelwNEjn3l7SWIW5SCJT

This, combined with the time, gets you your 6-digit code. The value of this approach is twofold: it checks whether the user possesses another source of authentication, and it introduces a secret which we know is randomly generated and which effectively serves as a second password. This secret isn't exposed to normal users as a string, so we don't need to worry about that.

If I have the secret value, I can make the same code. If we remove the second-device component, like we do when the TOTP lives in the password manager, what we're saying is "TOTP is just another random password". If we had a truly random password to begin with, I'm not adding much to the security model by adding 2FA but sticking it in the same place.
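
To illustrate, here is a minimal sketch of the TOTP math (RFC 6238 with the usual SHA1/30-second/6-digit defaults): anyone holding the base32 secret from that otpauth:// URI can mint the same codes, which is why storing the secret next to the password collapses the two factors back into one.

import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, period: int = 30, digits: int = 6) -> str:
    # Decode the base32 secret from the otpauth:// URI (padding is optional in URIs)
    secret_b32 = secret_b32.upper()
    key = base64.b32decode(secret_b32 + "=" * (-len(secret_b32) % 8))
    # Number of completed time steps since the Unix epoch
    counter = int(time.time()) // period
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the last nibble picks a 4-byte window
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)

# Feed it the throwaway secret from the example URI above and you get the same
# 6-digit code your authenticator app would show right now.
print(totp("CelwNEjn3l7SWIW5SCJT"))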

What if they break into my phone

On iOS, even without a Primary Password set in Firefox, it prompts for Face ID authentication before allowing someone access to the list of stored passwords. So that's already a pretty intense level of security. Add in a Primary Password and we've reached the same level of security as 1Password. Chrome is the same story.

It's the same level of security with Android. Attempt to open the saved passwords and you get a PIN or biometric check, depending on the phone. That's pretty good! Extra worried about it? Use a TOTP app that requires biometrics before it reveals the code. Here is one for iOS.

Even if someone steals your phone and attempts to break into your accounts, there are some non-trivial security measures in their way with the standard browser password storage combined with a free TOTP app that checks identity.

I use my password manager for more than passwords

Sure, and so do I, but that doesn't really matter to my point. The common wisdom that all users would benefit from the use of a dedicated password manager is iffy at best. We've now seen a commonly recommended one become so catastrophically breached that anything stored there now needs to be considered leaked. This isn't the first credential leak or the 10th or the 100th; there is now just a constant, never-ending parade of password leaks and cracks.

So if that is true, and a single password cannot ever truly serve as the single step of authentication for important resources, then we're always going to be relying on adding another factor. Therefore, the value a normal user gets out of a password manager vs the browser they're already using is minimal. With passkeys and the Credential Management API, the era of exposing the user to the actual values being used in the authentication step is coming to a close anyway. Keys synced by the browser vendor will become the default authentication step for users.

In the light of that reality, it doesn't really make sense to bother users with the additional work and hassle of running a new program to manage secrets.

Summary of my rant

  • Normal users don't need to worry about password managers and would be better served by using the passwords the browser generates and investing that effort into adding 2FA using a code app on their phone or a YubiKey.
  • In the face of new APIs and standards, the process of attempting to manage secrets with an external manager will become exceedingly challenging. It is going to be much, much easier to pick one browser and commit to it everywhere vs attempting to use a tool to inject all these secrets.
  • With the frequency of breaches, we've already accepted that passwords are, at best, part of a complete auth story. The best solution we have right now is "2 passwords".
  • Many of the tools users rely on to manage all their secrets aren't frequently audited, or if they are, any security assessment of their stack isn't being published.
  • For more technical users looking to store a lot of secrets for work, using something like pass will likely fulfill that need with a smaller, less complicated, and less error-prone technical implementation. It does less, so less stuff can fail.
  • If you are going to use a password manager, there are only two options: 1Password and Bitwarden. 1Password is the only one that doesn't rely exclusively on the user-supplied password, so if you are dealing with very important secrets this is the right option.
  • It is better to tell users "shared credentials are terrible and please only use them if you absolutely have no choice at all" than to set up a giant business-wide tool of shared credentials which are never rotated.

My hope is that with passkeys and the Credential Management API this isn't a forever problem. Users won't be able to export private keys, so nobody is going to be sharing accounts. The Credential Management UI and flow are so easy for developers and users that it becomes the obvious choice for any new service. My suspicion is we'll still be telling users to set up 2FA well after its practical lifespan has ended, but all we're doing is replicating the same flow as the browser password storage.

Like it or not you are gonna start to rely on the browser password manager a lot soon, so might as well get started now.

Wanna send me angry messages? What else is the internet good for! https://c.im/@matdevdug


Upgrading Kubernetes - A Practical Guide

One common question I see on Mastodon and Reddit is "I've inherited a cluster, how do I safely upgrade it". It's surprising that this still isn't a better understood process given the widespread adoption of k8s, but I've had to take over legacy clusters a few times and figured I would write up some of the tips and tricks I've found over the years to make the process easier.

A very common theme in these questions is "the version of Kubernetes is very old, what do I do". Often this question is asked with shame, but don't feel bad. K8s is better at the long-term maintenance story than it was a few years ago, but it is still a massive amount of work to keep a cluster upgraded and patched. Organizations start to fall behind almost immediately, and teams are hesitant to touch a working cluster to run the upgrades.

NOTE: A lot of this doesn't apply if you are using hosted Kubernetes. In that case, the upgrade process is documented through the provider and is quite a bit less complicated.

How often do I need to upgrade Kubernetes?

This is something people new to Kubernetes seem to miss a lot, so I figured I would touch on it. Unlike a lot of legacy infrastructure projects, k8s moves very quickly in terms of versions. Upgrading can't be treated like switching to a new Linux distro LTS release; you need to plan to do it all the time.

To be fair to the Kubernetes team, they've done a lot to help make this process less horrible. They have a support policy of N-2, meaning that the 3 most recent minor versions receive security and bug fixes. So you have time to get a cluster stood up and start the process of planning upgrades, but it needs to be in your initial cluster design document. You cannot wait until you are almost EOL to start thinking "how are we going to upgrade". Every release gets patched for 14 months, which seems like a lot, but chances are you aren't going to be installing the absolute latest release.

Current support timeline

So the answer to "how often do you need to be rolling out upgrades to Kubernetes" is often. They are targeting 3 releases a year, down from the previous 4 releases a year. You can read the projects release goals here. However in order to vet k8s releases for your org, you'll likely need to manage several different versions at the same time in different environments. I typically try to let a minor version "bake" for at least 2 weeks in a dev environment and same for stage/sandbox whatever you call the next step. Prod version upgrades should ideally have a month of good data behind them suggesting the org won't run into problems.

My staggered layout

  1. Dev cluster should be as close to bleeding edge as possible. A lot of this has to do with establishing SLAs for the dev environment, but the internal communication should look something like "we upgrade dev often during such and such a time and rely on it to surface early problems". My experience is you'll often hit some sort of serious issue almost immediately when you try to do this, which is good. You have time to fix it and know the maximum version you can safely upgrade to as of the day of testing.
  2. Staging is typically a minor release behind dev. "Doesn't this mean you can get into a situation where you have incompatible YAMLs?" It can but it is common practice at this point to use per-environment YAMLs. Typically folks are much more cost-aware in dev environments and so some of the resource requests/limits are going to change. If you are looking to implement per-environment configuration check out Kustomize.
  3. Production I try to keep as close to staging as possible. I want to keep my developers' lives as easy as possible, so I don't want to split the versions endlessly. My experience with Kubernetes patch releases has been that they're pretty conservative with changes and I rarely encounter problems. My release cadence for patches on the same minor version is two weeks in staging and then out to production.
  4. IMPORTANT. Don't upgrade the minor version until it hits patch .2 AT LEAST. What does this mean?

Right now the latest version of Kubernetes is 1.26.0. I don't consider this release ready for a dev release until it hits 1.26.2. Then I start the timer on rolling from dev -> stage -> production. By the time I get the dev upgrade done and roll to staging, we're likely at the .3 release (depending on the time of year).

That's too slow. Maybe, but I've been burned quite a few times in the past by jumping too early. It's nearly impossible for the k8s team to account for every use-case and guard against every regression, and by the time we hit .2, there tends to be wide enough testing that most issues have been discovered. A lot of people wait until .5, which is very slow (but also the safest path).

In practice this workflow looks like this:

  • Put in the calendar when releases reach EOL which can be found here.
  • Keep track of the upcoming releases and put them in the calendar as well. You can see that whole list in their repo here.
  • You also need to do this with patch releases, which typically come out monthly.
  • If you prefer to keep track of this in RSS, good news! If you add .atom to the end of the release URL, you can add it to a reader. Example: https://github.com/kubernetes/kubernetes/releases.atom. This makes it pretty easy to keep a list of all releases. You can also just subscribe in GitHub, but I find the RSS method to be a bit easier (plus it's super simple to script, which I'll publish later; a rough sketch follows this list).
  • As new releases come out, roll latest to dev once it hits .2. I typically do this as a new cluster, leaving the old cluster there in case of serious problems. Then I'll cut over deployments to the new cluster and monitor for issues. In case of massive problems, switch back to the old cluster and start the documentation process for what went wrong.
  • When I bump the dev environment, I then circle around and bump the stage environment to one minor release below that. I don't typically do a new cluster for stage (although you certainly can). There's a lot of debate in the k8s community over "should you upgrade existing vs make new". I do it for dev because I would rather upgrade often with fewer checks and have the option to fall back.
  • Finally we bump prod. For prod I rarely make a new cluster. This is a matter of personal choice and there are good arguments for starting fresh often, but I like to maintain the history in etcd and I find that with proper planning a rolling upgrade is safe.
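
As a rough illustration of how simple that release-tracking script can be, here is a sketch using feedparser (the same library the Mastodon bot later in this post uses); it assumes the atom entry titles are the release tags, which is how the GitHub feed looks today:

import feedparser

RELEASES_FEED = "https://github.com/kubernetes/kubernetes/releases.atom"

def recent_releases(limit: int = 15) -> list[str]:
    """Return the most recent Kubernetes release titles from the atom feed."""
    feed = feedparser.parse(RELEASES_FEED)
    return [entry.title for entry in feed.entries[:limit]]

if __name__ == "__main__":
    # Eyeball which minor versions have reached the .2 patch release yet
    for title in recent_releases():
        print(title)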

This feels like a giant pain in the ass.

I know. Thankfully, cloud providers tend to maintain their own versions, which buy you a lot more time, and that is typically how people are going to be running Kubernetes anyway. But I know a lot of people like to run their own clusters end to end, or just need to for various reasons. It is, however, a pain to do this all the time.

Is there an LTS version?

So there was a Kubernetes working group set up to discuss this, and their conclusion was that it didn't make sense to do one. I don't agree with this assessment, but it has been discussed.

My dream for Kubernetes would be to add a 2 year LTS version and say "at the end of two years there isn't a path to upgrade". I make a new cluster with the LTS version, push new patches as they come out and then at the end of two years know I need to make a new cluster with the new LTS version. Maybe the community comes up with some happy path to upgrade, but logistically it would be easier to plan a new cluster every 2 years vs a somewhat constant pace of pushing out and testing upgrades.

How do I upgrade Kubernetes?

  1. See if you can upgrade safely against API paths. I use Pluto. This will check to see if you are calling deprecated or removed API paths in your configuration or helm charts. Run Pluto against local files with: pluto detect-files -d. You can also check Helm with: pluto detect-helm -owide. Adding all of this to CI is also pretty trivial and something I recommend for people managing many clusters.

  2. Check your Helm releases for upgrades. Since typically things like the CNI and other dependencies like CoreDNS are installed with Helm, this is often the fastest way to make sure you are running the latest version (check patch notes to ensure they support the version you are targeting). I use Nova for this.

  3. Get a snapshot of etcd. You'll want to make sure you have a copy of the data in your production cluster in the case of a loss of all master nodes. You should be doing this anyway.

  4. Start the upgrade process. The steps to do this are outlined here.

If you are using managed Kubernetes

This process is much easier. Follow 1 + 2, set a pod disruption budget to allow for node upgrades and then follow the upgrade steps of your managed provider.

I messed up and waited too long, what do I do?

Don't feel bad, it happens ALL the time. Kubernetes is often set up by a team that is passionate about it, then that team is disbanded and maintenance becomes a secondary concern. Folks who inherit working clusters are (understandably) hesitant to break something that is working.

With k8s you need to go from minor -> minor in order, not jumping releases. So you need to basically (slowly) bump versions as you go. If you don't want to do that, your other option is to make a new cluster and migrate to it. I find for solo operators or small teams the upgrade path is typically easier but more time consuming.

The big things you need to anticipate are as follows:

  • Ingress. You need to really understand how traffic is coming into the cluster and through what systems.
  • Service mesh. Are you using one, what does it do and what version is it set at? Istio can be a BEAR to upgrade, so if you can switch to Linkerd you'll likely be much happier in the long term. However understanding what controls access to what namespaces and pods is critical to a happy upgrade.
  • CSI drivers. Do you have them, do they need to be upgraded, what are they doing?
  • CNI. Which one are you using, is it still supported, what is involved in upgrading it.
  • Certificates. By default they expire after a year. You get fresh ones with every upgrade, but you can also trigger a manual refresh whenever you like with kubeadm certs renew. If you are running an old cluster, PLEASE check the expiration dates of your client certificates now with: kubeadm certs check-expiration.
  • Do you have stateful deployments? Are they storing something, where are they storing it and how do you manage them? This would be databases, redis, message queues, applications that hold state. These are often the hardest to move or interact with during an upgrade. You can review the options for moving those here. The biggest thing is to set the pod disruption budget so that there is some minimum available during the upgrade process, as shown here (a small sketch follows this list).
  • Are you upgrading etcd? Etcd supports restoring from snapshots that are taken from an etcd process of the major.minor version, so be aware if you are going to be jumping more than a patch. Restoring might not be an option.
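
If it helps to see the pod disruption budget piece spelled out, here is a small sketch using the official kubernetes Python client; a plain YAML manifest applied with kubectl does the same thing, and the app label, name, and namespace here are made up for the example:

from kubernetes import client, config

config.load_kube_config()

# Keep at least one replica of the hypothetical "postgres" app available
# while nodes are cordoned and drained during the upgrade.
pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="postgres-pdb", namespace="default"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available=1,
        selector=client.V1LabelSelector(match_labels={"app": "postgres"}),
    ),
)

client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="default", body=pdb
)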

Otherwise follow the steps above along with the official guide and you should be ok. The good news is that once you bite the bullet and get up to a current version, maintenance is easier. The bad news is that the initial EOL -> Supported path is soul-sucking and incredibly nerve-racking. I'm sorry.

I'm running a version older than 1.21 (January 2023)

So you need to do all the steps shown above to check that you can upgrade, but my guiding rule is if the version is more than 2 EOL versions ago, it's often easier to make a new cluster. You CAN still upgrade, but typically this means nodes have been running for a long time and are likely due for OS upgrades anyway. You'll likely have a more positive experience standing up a new cluster and slowly migrating over.

You'll start with fresh certificates, helm charts, node OS versions and everything else. Switching over at the load balancer level shouldn't be too bad and it can be a good opportunity to review permissions and access controls to ensure you are following the best procedures.

I hate that advice

I know. It's not my favorite thing to tell people. I'm sorry. I don't make the rules.

Note on Node OS choices

A common trend I see in organizations is to select whatever Linux distro they use for VMs as their Node OS: Debian, Ubuntu, Rocky, etc. I don't recommend this. You shouldn't think of Nodes as VMs that you SSH into on a regular basis and do things in. They're just platforms to run k8s on. I've had a lot of success with Flatcar Linux here. Upgrading the nodes is as easy as rebooting, and you can easily define things like SSH access with a nice configuration system, shown here.

With the Node OS, I would much rather get security updates more quickly and know that I have to reboot the node on a regular basis, as opposed to keeping track of traditional package upgrades and the EOL for different Linux distros and then tracking whether reboots are required. Often folks will combine Flatcar Linux with Rancher Kubernetes Engine for a super simple and reliable k8s standup process. You can see more about that here. This is a GREAT option if you are making a new cluster and want to make your life as easy as possible in the future. Check out those docs here.

If you are going to use a traditional OS, check out kured. This allows you to monitor the reboot-required flag at /var/run/reboot-required and schedule the automatic cordoning, draining, and uncordoning of the node. It also ensures only one node is touched at a time. This is the thing almost everyone forgets to do with Kubernetes: maintain the Node itself.

Conclusion

I hope this was helpful. The process of keeping Kubernetes upgraded is less terrible the more often you do it, but the key thing is to try to get as much baking time as possible for each minor release in your lower environments. If you stay on a regular schedule, the process of upgrading clusters is pretty painless and idiot-proof as long as you do some checking.

If you are reading this and think "I really want to run my own cluster but this seems like a giant nightmare" I strongly recommend checking out Rancher Kubernetes Engine with Flatcar Linux. It's tooling designed to be idiot-proof and can be easily run by a single operator or a pair. If you want to stick with kubeadm it is doable, but requires more work.

Stuck? Think I missed something obvious? Hit me up here: https://c.im/@matdevdug


Make a Mastodon Bot on AWS Free Tier

With the recent exodus from Twitter due to Elon being a deranged sociopath, many folks have found themselves moving over to Mastodon. I won't go into Mastodon except to say I've moved over there as well (@matdevdug@c.im) and have really enjoyed myself. It's a super nice community and I have a lot of hope for the ActivityPub model.

However, when I got on Mastodon I found a lot of abandoned bot accounts. These accounts, for folks who don't know, tend to do things like scrape RSS feeds and pump that information into Twitter so you can have everything in one pane of glass. Finding a derelict Ars Technica bot, I figured why not take this opportunity to make a bot of my own. While this would be very easy to do with SQLite, I wanted it to be an AWS Lambda so it wouldn't rely on some Raspberry Pi being functional (or me remembering that it was running on some instance and then accidentally terminating it because I love to delete servers).

Criteria for the project

  • Pretty idiot-proof
  • Runs entirely within the free tier of AWS
  • Set and forget

Step 1 - DynamoDB

I've never used DynamoDB before, so I figured this could be a fun challenge. I'm still not entirely sure I used it correctly. To be honest, I ran into more problems than I was expecting, given its reputation as an idiot-proof database.

You can see the simple table structure I made here.

Some things to keep in mind. Because of how DynamoDB stores numbers, the type of the number is Decimal, not int or float. This can cause some strange errors when attempting to store and retrieve ID values. You can read the conversation about it here. I ended up storing the ID as a string, which is probably not optimal for performance but did make the error go away.

When using DynamoDB, it is vital not to use scan. Query is what I ended up using for all my requests, since then I get to make lookups on my secondary tables with the key. The difference in speed during load testing, when I generated a lot of fake URLs, was pretty dramatic: hundreds of milliseconds vs tens of seconds.

Source
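
As a rough sketch of the query-versus-scan difference (boto3, with a made-up table and key name; the real table structure is behind the link earlier in this section):

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("rss-bot-posts")  # hypothetical table name

# Query: a targeted lookup against the partition key. Returns in milliseconds
# and only bills for the items it actually reads. Note the ID is a string.
resp = table.query(KeyConditionExpression=Key("id").eq("abc123"))
posted_items = resp["Items"]

# Scan: walks (and bills for) the entire table and filters afterwards.
# Fine with ten rows, painful once the table filled up with fake URLs.
resp = table.scan(FilterExpression=Attr("id").eq("abc123"))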

Now that I've spent some time playing around with DynamoDB, I do see the appeal. The free tier is surprisingly generous. I've allocated 5 provisioned read and write capacity units, but honestly the bot needs a tiny fraction of that.

Step 2 - Write the Lambda

You can see my Python lambda here.

NOTE: This is not production-grade Python. This is hobby-level Python. Were this a work project, I would have changed some things about its design. Before you ping me about collisions, I calculate that with that big a range of random IDs to pull from, it would take ~6 thousand years of work to have a 1% probability of at least one collision. So please, for the love of all things holy, don't ping me. Wanna use UUIDs? Go for it.

For those who haven't deployed to AWS Lambda before, it's pretty easy.

  • Make sure you have Python 3.9 installed (since AWS doesn't support 3.10)
  • Copy that snippet to a directory and call it lambda_function.py
  • Change the rss_feed = to be whatever feed you want to make a bot of.
  • run python3.9 -m venv venv
  • run source venv/bin/activate
  • Then you need to install the dependencies:
    - pip install --target ./package feedparser
    - pip install --target ./package Mastodon.py
    - pip install --target ./package python-dotenv
  • You'll want to cd into the package directory and then run zip -r ../my-deployment-package.zip . to bundle the dependencies together.
  • Finally, take the actual Python file you want to run and add it to the zip archive: zip my-deployment-package.zip lambda_function.py

You can also use serverless or AWS SAM to do all of this, but I find the ZIP file is pretty idiot-proof. Then you just upload it through the AWS web interface, but hold off on doing that. Now that we have the Python environment set up, we can generate the credentials.

Step 3 - Mastodon Credentials

Now, go back into the Python virtual environment we made before, in the same directory.

  1. Run source venv/bin/activate
  2. Start the Python 3.9 REPL
  3. Run from mastodon import Mastodon
  4. Run: Mastodon.create_app('your-app-name', scopes=['read', 'write'], api_base_url="https://c.im") (note I'm using c.im but you can use any server you normally use)
  5. Follow the steps outlined here.
  6. You'll get back three values by the end: CLIENT_ID and CLIENT_SECRET from when you registered the bot with the server, and finally an ACCESS_TOKEN after you make an account for the bot and pass the email/password. (A rough sketch of this whole flow follows these steps.)

7. Copy these values to a .env file in the same directory as the lambda_function.py file from before.

CLIENT_ID=cff45dc4cdae1bd4342079c83155ce0a001a030739aa49ab45038cd2dd739ce
CLIENT_SECRET=d228d1b0571f880c0dc865522855a07a3f31f1dbd95ad81d34163e99fee
ACCESS_TOKEN=Ihisuhdiuhdsifh-OIJosdfgojsdu-RUhVgx6zCows
Example of the .env file alongside the lambda_function.py

8. Run: zip my-deployment-package.zip .env to copy the secret into the zip archive.

You can also store these as environment variables in the Lambda, but I prefer to manage them like this. Make sure the .env file is not committed to your git repo.
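
Here is a rough sketch of steps 3 through 6 in one place, assuming the standard Mastodon.py create_app/log_in flow; the app name, email, and password are placeholders:

from mastodon import Mastodon

# Step 4: register the app with the instance; returns (client_id, client_secret)
client_id, client_secret = Mastodon.create_app(
    "my-rss-bot",                      # hypothetical app name
    scopes=["read", "write"],
    api_base_url="https://c.im",
)

# Steps 5-6: log in as the bot account to obtain the ACCESS_TOKEN
mastodon = Mastodon(
    client_id=client_id,
    client_secret=client_secret,
    api_base_url="https://c.im",
)
access_token = mastodon.log_in(
    "bot@example.com",                 # the bot account's email (placeholder)
    "a-long-unique-password",          # and its password (placeholder)
    scopes=["read", "write"],
)

# These are the three values that go into the .env file
print(client_id, client_secret, access_token)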

Step 4 - Deploy

  1. Make a new AWS Lambda function with whatever name you like and ensure it has the ability to access our DynamoDB table. You can get instructions on how to do that here.
  2. Upload the ZIP by just uploading it through the web interface. It's 2 MB total so should be fine.
  3. Set up an EventBridge cron job to trigger the Lambda by following the instructions here (or see the sketch after this list).
  4. Watch as your Lambda triggers on a regular interval.
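
If you would rather wire the schedule up from code instead of the console, here is a hedged sketch with boto3; the rule name, function name, and ARN are placeholders, and the linked console instructions accomplish the same thing:

import boto3

FUNCTION_NAME = "mastodon-rss-bot"  # hypothetical Lambda name
FUNCTION_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:mastodon-rss-bot"  # placeholder

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Fire the Lambda every 30 minutes
rule = events.put_rule(Name="mastodon-bot-schedule", ScheduleExpression="rate(30 minutes)")

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId="eventbridge-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# Point the rule at the Lambda
events.put_targets(
    Rule="mastodon-bot-schedule",
    Targets=[{"Id": "mastodon-rss-bot", "Arn": FUNCTION_ARN}],
)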

Step 5 - Cleanup

  1. Inside of the Mastodon bot account there are a few things you'll want to check. First you want to make sure that the following two options are selected under "Profile"

2. You'll probably want to add an alert for failures under CloudWatch Alarms. AWS has docs on how to do that here.

Conclusion

Hopefully this is a fun way of adding a simple bot to Mastodon. I've had a lot of fun interacting with the Mastodon.py library. You can see the bot I ended up making here.

If you run into problems please let me know: https://c.im/@matdevdug