Series: Rebuilding Remote Access, Part 3 of 5

The gateway: self-hosted NetBird behind Caddy, with a relay

Now the WireGuard control plane. This part stands up self-hosted NetBird on a DMZ gateway, fronts it with Caddy issuing TLS via DNS-01, points it at our SSO for auth, and adds a coturn relay for the peers that can't connect directly.

Real work, anonymized — generic domains/IPs throughout.

The identity spine can now answer “who are you?” with MFA. This part builds the thing that asks the question: the WireGuard control plane. We’re self-hosting NetBird — management, signal, and dashboard — on a gateway VM in the DMZ, fronted by Caddy for TLS, authenticating every peer through the Keycloak from Part 2.

Where I first met NetBird

I’d already used NetBird in a personal capacity — including wiring secure remote access into my own HPC homelab, where it was the front door to a login node (a separate series, in progress). That was one peer joining someone else’s control plane. This is the other half: running the entire control plane yourself, as the company’s primary access layer. Same tool, completely different responsibility — and a good example of a lab teaching you something you then do for real.

The mental model: a control plane, not a concentrator

OpenVPN is a concentrator — every packet flows through one server. WireGuard via NetBird is a mesh: peers try to talk directly, and the self-hosted services mostly just broker the introductions:

  • management — the brain: who’s enrolled, what groups, which routes and policies.
  • signal — helps two peers find each other and NAT-hole-punch a direct tunnel.
  • dashboard — the web UI, an OIDC client of our Keycloak.
  • coturn (relay) — the fallback for peers that can’t punch a direct path; their traffic relays through here instead of failing.
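
One way to see the direct-versus-relay split on a running peer is the client's detailed status. A minimal sketch, assuming the NetBird CLI is installed and that netbird status -d prints a "Connection type:" line per peer with a value of P2P or Relayed (the exact output may differ between client versions). The count_connection_types helper is hypothetical, not part of NetBird.

```shell
# Hypothetical helper: tally direct vs relayed peers from `netbird status -d` output on stdin.
# Assumes each connected peer prints "Connection type: P2P" or "Connection type: Relayed".
count_connection_types() {
  awk -F': *' '
    /Connection type:/ { n[$2]++ }
    END { printf "p2p=%d relayed=%d\n", n["P2P"], n["Relayed"] }
  '
}

# Usage on an enrolled peer (needs the NetBird client running):
#   netbird status -d | count_connection_types
```

A high relayed count is not a failure, since the relay exists for exactly those peers, but it is a useful signal when sizing the relay port range later.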

Why self-host instead of the SaaS

The hosted offering is genuinely good. We self-host because the directory, the policies, and the audit trail are things this company wants to own and keep on-prem — and because the same gateway already lives in our DMZ. Self-hosting also means our identity (Keycloak → directory) is the only place accounts exist; nothing about “who works here” leaves the building.

TLS first: Caddy with DNS-01

Everything public-facing terminates TLS at Caddy on the gateway. The wrinkle: these services live in the DMZ and I don’t want to open inbound HTTP just to solve ACME’s HTTP challenge. So Caddy uses the DNS-01 challenge — it proves domain control by writing a DNS record via the provider’s API, and never needs port 80 reachable from outside:

# gateway VM: Caddyfile (global ACME via DNS-01; staging first, then prod; needs a Caddy build with the Cloudflare DNS module)
{
    acme_dns cloudflare {env.CLOUDFLARE_API_TOKEN}
}
login.example.com {
    reverse_proxy idp.int.example.com:8080   # Keycloak
}
mesh.example.com {
    reverse_proxy localhost:80               # NetBird dashboard (+ /api, /signal blocks)
}
connect.example.com {
    root * /srv/landing                      # the user landing page
    file_server
}

Why DNS-01 (and staging first)

DNS-01 issues certs, including wildcards, without any inbound HTTP, which is exactly right for DMZ services behind a firewall. I also point ACME at Let’s Encrypt staging (in Caddy, the acme_ca global option set to the staging directory URL) until the whole chain works, because the production CA enforces rate limits you can burn through fast while debugging. Staging certs are not trusted by browsers or curl, so expect validation errors until the switch. Flip to the production CA only once staging certs issue cleanly.

curl -I https://login.example.com

Once ACME was flipped from staging to the production CA, this came back from Caddy with a valid certificate chain — TLS terminating cleanly at the gateway for a service running plain HTTP behind it.
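
To confirm which environment actually issued the certificate, it helps to read the issuer rather than trust the config. A minimal sketch, assuming openssl is available on the machine running the check and that Let's Encrypt staging intermediates carry a "(STAGING)" marker in their names. Both function names are hypothetical.

```shell
# Hypothetical helpers for checking which Let's Encrypt environment issued a cert.

# Print the issuer line of the certificate served by host $1 on port 443.
cert_issuer() {
  echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | openssl x509 -noout -issuer
}

# Classify an issuer string. Assumes staging issuers contain "(STAGING)".
classify_issuer() {
  case "$1" in
    *"(STAGING)"*)      echo staging ;;
    *"Let's Encrypt"*)  echo production ;;
    *)                  echo unknown ;;
  esac
}

# Usage:
#   classify_issuer "$(cert_issuer login.example.com)"
```

If this still reports staging after the flip, Caddy is most likely serving a cached staging certificate from its data directory.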

OIDC clients: teaching NetBird to use our SSO

NetBird doesn’t store passwords — it delegates to Keycloak. That needs two OIDC clients in the realm: a public PKCE client for the dashboard, and a device-authorization client for the CLI/app login on machines without a browser.

Why a device-flow client for the CLI

When you run the NetBird client on a headless box or a phone, there’s no neat browser redirect. The OAuth 2.0 Device Authorization Grant is the “go to this URL and enter this code” flow — the user authenticates (with MFA) on a device that does have a browser, and the client gets its token. It’s the same pattern smart TVs use to log you in.
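
The two steps of that flow can be sketched with curl. This assumes Keycloak's standard endpoint layout under the realm URL and a client with the device grant enabled. The issuer and client ID arguments are placeholders, and the function names are hypothetical. The NetBird client does all of this itself, so the sketch is only for testing the Keycloak side by hand.

```shell
# Sketch of the OAuth 2.0 Device Authorization Grant (RFC 8628) against Keycloak.
# Endpoint paths assume Keycloak's standard layout under the realm (issuer) URL.

device_endpoint() { printf '%s/protocol/openid-connect/auth/device\n' "$1"; }
token_endpoint()  { printf '%s/protocol/openid-connect/token\n' "$1"; }

# Step 1: request a device code. The JSON reply carries device_code, user_code
# and verification_uri; the user opens the URI in a browser and enters the code.
request_device_code() {   # $1 = issuer URL, $2 = client id
  curl -fsS -X POST "$(device_endpoint "$1")" \
    -d "client_id=$2" -d "scope=openid"
}

# Step 2: exchange the device code for tokens. Until the user approves, the
# server answers with an authorization_pending error, so -f is left off here
# to keep that error body visible to the caller.
poll_token() {            # $1 = issuer URL, $2 = client id, $3 = device_code
  curl -sS -X POST "$(token_endpoint "$1")" \
    -d "grant_type=urn:ietf:params:oauth:grant-type:device_code" \
    -d "client_id=$2" -d "device_code=$3"
}
```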

Record the issuer (https://login.example.com/realms/<realm>) and client IDs, then feed them to NetBird’s management.json and dashboard environment so every enrollment authenticates through Keycloak.
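
Before pasting values into NetBird, it is worth checking what the realm itself advertises, since the issuer configured in NetBird has to match the one Keycloak publishes. A minimal sketch, assuming curl is available. The json_field helper is hypothetical and deliberately crude: it only handles flat string fields, and a real JSON parser such as jq would be the more robust choice.

```shell
# Hypothetical helper: pull one string field out of a flat JSON document on stdin.
# A crude sed match to avoid extra dependencies; not a general JSON parser.
json_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Fetch the OIDC discovery document for an issuer URL ($1).
discovery() {
  curl -fsS "$1/.well-known/openid-configuration"
}

# Usage (ISSUER is the realm URL recorded above):
#   doc=$(discovery "$ISSUER")
#   printf '%s\n' "$doc" | json_field issuer
#   printf '%s\n' "$doc" | json_field device_authorization_endpoint
```

An empty result for device_authorization_endpoint would suggest the discovery document was not what was expected, which is cheaper to find here than during a failed enrollment.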

NetBird + coturn, and opening exactly three ports

The services come up via compose; coturn gets the realm, a static auth secret, and a bounded relay port range:

docker compose up -d        # management, signal, dashboard, coturn
docker compose ps           # all healthy, logs clean

docker compose ps showed management, signal, dashboard, and coturn all up and healthy, with clean logs — the control plane was live.
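
The same check can be scripted so it fails loudly when a service is missing. A minimal sketch, assuming Docker Compose v2 and that the compose service names match the four listed here (they are assumptions; adjust them to the names in your compose file). The missing_services helper is hypothetical.

```shell
# Expected service names are assumptions; match them to your compose file.
EXPECTED="management signal dashboard coturn"

# Hypothetical helper: read running service names on stdin (one per line)
# and print every expected service that is absent.
missing_services() {
  running=$(cat)
  for svc in $EXPECTED; do
    printf '%s\n' "$running" | grep -qx "$svc" || printf '%s\n' "$svc"
  done
}

# Usage, from the directory holding the compose file:
#   docker compose ps --services --filter status=running | missing_services
```

Empty output means every expected service is running. Note that running is weaker than healthy, so the logs still deserve a look.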

Then the only inbound traffic the edge firewall forwards to the gateway: 443/tcp (dashboard, API, and signal, all behind Caddy’s TLS), 3478/udp (coturn STUN/TURN), and the relay UDP range. The existing OpenVPN rule (1194/udp) is left completely alone: additive, remember.

Bound the relay port range, and leave the old rule

coturn needs a range of UDP ports for relayed media/data — define a small, explicit range (e.g. a 100-port window) rather than opening a wide swath, and forward exactly that. And do not touch the OpenVPN forward: the entire migration depends on it staying live as the fallback.
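
A small sanity check helps keep the coturn range and the firewall forward in agreement. A minimal sketch: port_window_size is a hypothetical helper, and the port numbers in the usage line are illustrative only. The real bounds are whatever coturn's min-port and max-port options are set to.

```shell
# Hypothetical helper: size of an inclusive UDP port range,
# e.g. the one set by coturn's min-port / max-port options.
port_window_size() {   # $1 = first port, $2 = last port
  if [ "$1" -gt "$2" ] || [ "$1" -lt 1 ] || [ "$2" -gt 65535 ]; then
    echo "invalid range: $1-$2" >&2
    return 1
  fi
  echo $(( $2 - $1 + 1 ))
}

# Illustrative values only; use the same bounds in coturn and on the firewall.
#   port_window_size 49152 49251
```

Because both ends are inclusive, a 100-port window starting at 49152 ends at 49251, not 49252. That is an easy off-by-one to carry into a firewall rule.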

What’s next

The control plane is up, TLS is valid, and peers can authenticate through our SSO with MFA. But an authenticated peer that can’t reach anything internal is useless. Part 4 turns the gateway into a routing peer, advertises the internal subnets, pushes internal DNS, and writes the group-based, default-deny access policies that decide who can reach what.