Why we're tearing out a VPN that works — Abellana

The old OpenVPN setup wasn't broken — it was a pile of per-user certificates and a manual onboarding ritual. Here's the case for replacing it with a self-hosted WireGuard mesh tied to our identity provider, and the one rule that makes the swap safe.

A note on this series: this is real work, anonymized. Company, hostnames, IPs, and domains are stand-ins (example.com, 10.0.0.x); the architecture, decisions, and lessons are exactly as they happened.

The remote-access VPN at work isn’t broken. People connect, they reach internal systems, it’s been fine for years. So why am I about to replace the whole thing? Because “it works” and “it’s a good idea to keep running” are different claims — and the way it works is a slow tax I pay on every new hire and every offboarding.

This series is the build log for swapping a classic OpenVPN setup for a self-hosted WireGuard mesh (NetBird) fronted by single sign-on. Part 1 is the part worth getting right before touching anything: why, and the one discipline that keeps a migration like this from becoming an outage.

The mental model: identity should come from your directory, not a pile of certs

Here’s what the old world looks like. OpenVPN authenticates each user with a client certificate, delivered as a .ovpn file with an embedded key. To onboard someone, a script SSHes to the VPN box, runs the EasyRSA “new client” command, generates a .ovpn file, and hands it to the user. Offboarding means remembering to revoke that cert.

Why certificate-per-user quietly rots

Every certificate is a long-lived secret living in someone’s Downloads folder, synced to who-knows-where. There’s no MFA in front of it — possession of the file is the auth. Access isn’t tied to whether the person still has a job; it’s tied to whether their cert is still valid. The source of truth for “who can connect” drifts away from your directory and into a sprawl of files. None of this is dangerous today; all of it is debt.
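To make the drift concrete, here is a minimal sketch of the audit you end up needing in the certificate-per-user world. The function name and the sample data are hypothetical; in practice the two lists would come from the CA's index of unrevoked client certs and from the directory's active accounts.

```python
def stale_certificates(valid_cert_names, active_directory_users):
    """Return cert holders who can still connect but are no longer active in the directory."""
    return sorted(set(valid_cert_names) - set(active_directory_users))


# Hypothetical sample data, not taken from any real system.
valid_cert_names = ["ana", "ben", "chi", "dev"]  # common names on unrevoked client certs
active_directory_users = ["ana", "chi"]          # accounts still active in the directory

print(stale_certificates(valid_cert_names, active_directory_users))  # ['ben', 'dev']
```

Every name in that output is a person whose access outlived their directory entry, which is the debt described above.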

The target world inverts that: the directory is the source of truth. You already have an identity system — for us, FreeIPA — that knows who works here and what teams they’re on. Access should be a property of that, gated by MFA, granted and revoked centrally. The VPN client should just ask “who are you?” and get an answer from the identity provider. Lose your laptop, get offboarded, change teams — the directory changes and access follows, with no file to chase down.
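As a rough illustration of "access is a property of the directory", the sketch below derives the connect decision from directory state alone. The `DirectoryEntry` fields, the `may_connect` helper, and the `vpn-users` group name are all assumptions made for this example; they are not Keycloak, NetBird, or FreeIPA APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryEntry:
    """Hypothetical snapshot of one person's record as the identity provider sees it."""
    username: str
    active: bool                      # False once the person is offboarded
    groups: frozenset = frozenset()   # team memberships
    mfa_verified: bool = False        # set by the identity provider at login


def may_connect(entry, required_group):
    """Decide access from directory state only; no per-user file is consulted."""
    return entry.active and entry.mfa_verified and required_group in entry.groups


alice = DirectoryEntry("alice", active=True,
                       groups=frozenset({"vpn-users"}), mfa_verified=True)
print(may_connect(alice, "vpn-users"))  # True

# Offboarding is a single directory change; there is no certificate to chase down.
offboarded = DirectoryEntry("alice", active=False,
                            groups=frozenset({"vpn-users"}), mfa_verified=True)
print(may_connect(offboarded, "vpn-users"))  # False
```

The point of the design is that nothing in `may_connect` depends on an artifact held by the user, so revocation cannot be forgotten separately from offboarding.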

What we’re building

        remote user (laptop / phone)
              │  WireGuard, authenticated by SSO + MFA
              ▼
   ┌─────────────────────────┐        ┌──────────────────────────┐
   │   gateway VM (DMZ+LAN)   │        │     identity VM (LAN)     │
   │  Caddy (TLS, DNS-01)     │  OIDC  │  Keycloak  ── LDAPS ──▶   │
   │  NetBird mgmt/signal     │◀──────▶│  (federates the directory)│
   │  coturn relay            │        │  Postgres                 │
   └────────────┬────────────┘        └─────────────┬────────────┘
                │ advertises internal routes          │ read-only bind
                ▼                                      ▼
        internal network 10.0.0.0/24            FreeIPA directory

Three moving parts, each with one job:

Keycloak (on a LAN-only identity VM) federates the existing directory read-only and becomes the single sign-on front door, where MFA is enforced.
NetBird (self-hosted on a gateway VM) is the WireGuard control plane — it authenticates peers via Keycloak and wires up the mesh.
Caddy terminates TLS for all of it, getting certs automatically via DNS-01.

The old OpenVPN box doesn’t get touched yet. That’s not an accident — it’s the whole strategy.

The one rule: the new stack is additive; the old VPN stays live

Never cut over by deletion

The new mesh is built alongside the working VPN, not on top of it. OpenVPN keeps running on its own port the entire time. Users migrate in waves; each wave stops using OpenVPN only after its users are confirmed working on the mesh. At every phase the new stack can be torn down (docker compose down, drop one firewall rule) with zero impact on the thing people rely on today. The VPN is the fallback until it isn’t needed, and even then it stays up for site-to-site backups.
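The wave rule can be sketched as a simple gate. This is an illustration only: the `Path` enum, the helper names, and the usernames are hypothetical, and how "confirmed working" is actually recorded is left open here.

```python
from enum import Enum


class Path(Enum):
    OPENVPN = "openvpn"
    MESH = "mesh"


def primary_path(user, confirmed_on_mesh):
    """A user moves to the mesh only after being confirmed working there."""
    return Path.MESH if user in confirmed_on_mesh else Path.OPENVPN


def wave_status(wave, confirmed_on_mesh):
    """Map every user in a wave to the path they should be using right now."""
    return {user: primary_path(user, confirmed_on_mesh) for user in wave}


wave_1 = ["ana", "ben", "chi"]   # hypothetical usernames
confirmed = {"ana", "chi"}       # users verified end to end on the mesh

status = wave_status(wave_1, confirmed)
still_on_openvpn = [user for user, path in status.items() if path is Path.OPENVPN]
print(still_on_openvpn)  # ['ben']
```

Note what the sketch never does: it has no step that removes OpenVPN. Unconfirmed users simply stay where they already work, which is what makes the new stack additive.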

The corollary I committed to up front: the proof-of-concept is production. No throwaway test domains, no “we’ll redo it properly later.” Real DNS names, real certs, real identity federation from day one — because the gap between a POC and prod is exactly where migrations die. Test like you fly.

Honest limits / what I’m deliberately not doing yet

Not decommissioning OpenVPN. It keeps carrying a site-to-site backup tunnel indefinitely. This project only moves the interactive road-warrior users.
MFA is staged. I start with Keycloak-native MFA (passkeys/TOTP) because it’s under my control and unblocks the build; moving MFA authority to the cloud IdP (Entra Conditional Access) is a deliberately later phase, so I don’t block the whole project on a licensing conversation.
One known weak spot: a remote office on the far side of the world sees ~250 ms RTT. That’s a transport/geography problem a VPN swap won’t fix — the honest answer is regional compute, tracked separately. I’m not going to pretend the mesh makes physics faster.

What’s next

The decision is made and the architecture is drawn. Part 2 builds the identity spine: standing up Keycloak on its own VM and federating the existing directory over LDAPS, read-only, so the mesh can authenticate people against the system that already knows them.