Series: Rebuilding Remote Access, Part 5 of 5

Cutover without downtime: migrating users while the old VPN stays up

The new mesh works — now move real people onto it without a bad day. This part is the discipline that makes a migration boring: a parallel pilot, simplified onboarding, wave-by-wave cutover with the old VPN as a live safety net, and backups you've actually restored.

Real work, anonymized — generic domains/hosts throughout.

By Part 4 the new stack is complete: SSO with MFA, a WireGuard mesh, internal routes and DNS, group-scoped access. But “the stack works” and “the company is migrated” are very different milestones, and the gap between them is where careful projects still cause outages. This part is about making the cutover boring — which is the highest compliment you can pay a migration.

The mental model: additive, reversible, and proven before trusted

One principle has governed this whole series, and it comes due now:

Why the old VPN stays up through the entire cutover

Every step so far has been additive: the new stack went up beside OpenVPN, never on top of it. That means cutover isn’t a switch you flip; it’s people moving across a bridge that’s already load-tested, while the old bridge stays open behind them. Nobody stops using OpenVPN until they’re confirmed working on the mesh. If a wave goes sideways, its users are still on the old VPN. There is no rollback procedure to execute, because nothing was torn down. The new thing has to earn each user’s trust before it gets it.

First: a real pilot, not a demo

Pick two or three friendly users, and deliberately include one on a high-latency link (for us, an office ~250 ms away) to sanity-check the transport under the worst conditions, not the best. Each pilot follows the same path: install the app → log in with directory creds + MFA → confirm internal access and DNS work → confirm their OpenVPN still works in parallel.
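The pilot checklist above can be partly scripted. What follows is a minimal sketch, not the tooling we actually ran: the names `PilotResult`, `check_pilot`, and the host `wiki.internal.example.com` are hypothetical, and the "OpenVPN still works" step is assumed to be confirmed by hand and passed in as a boolean.

```python
import socket
from dataclasses import dataclass
from typing import Callable


@dataclass
class PilotResult:
    user: str
    dns_ok: bool
    internal_ok: bool
    openvpn_ok: bool

    @property
    def passed(self) -> bool:
        # The mesh must work AND the old VPN must still work in parallel.
        return self.dns_ok and self.internal_ok and self.openvpn_ok


def resolves(hostname: str) -> bool:
    """True if the name resolves with the client's current DNS settings."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection opens within the timeout.

    The timeout is generous on purpose so a ~250 ms link is not
    reported as a failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_pilot(
    user: str,
    internal_host: str,          # hypothetical internal name to probe
    internal_port: int,
    openvpn_still_works: bool,   # assumption: confirmed manually by the pilot
    resolve: Callable[[str], bool] = resolves,
    connect: Callable[[str, int], bool] = tcp_reachable,
) -> PilotResult:
    dns_ok = resolve(internal_host)
    # Skip the connection attempt if the name does not even resolve.
    internal_ok = dns_ok and connect(internal_host, internal_port)
    return PilotResult(user, dns_ok, internal_ok, openvpn_still_works)
```

Run on a pilot's machine with only the mesh connected, something like `check_pilot("pilot-a", "wiki.internal.example.com", 443, openvpn_still_works=True).passed` gives one yes/no per person. The `resolve` and `connect` parameters are injectable so the logic can be exercised without a real network.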

Every pilot ran the same path cleanly: install the app, log in with directory credentials and MFA, reach internal systems — and their OpenVPN kept working in parallel the whole time, so there was never a moment they were stranded.

Backups online before anyone depends on this

The moment real users rely on the new stack, it’s production — so backups come first, not later: VM snapshots, a Postgres dump of Keycloak, the mesh control-plane data directory, all shipped off the host and on a schedule. And critically: do a test restore once, before the pilot ends. A backup you haven’t restored is a hope, not a backup.
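A scheduled backup is only useful if someone notices when it silently stops. Here is a rough sketch of a freshness check, assuming the off-host copies land in a single directory; the function name `find_backup_problems`, the file patterns, and the 26-hour threshold are all illustrative, not what our setup uses. It catches missing, empty, or stale artifacts, and it does not replace the test restore.

```python
import time
from pathlib import Path
from typing import List, Optional


def find_backup_problems(
    backup_dir: Path,
    patterns: List[str],
    max_age_hours: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return one message per expected artifact that is missing, empty, or stale."""
    now = time.time() if now is None else now
    problems = []
    for pattern in patterns:
        matches = [p for p in backup_dir.glob(pattern) if p.is_file()]
        if not matches:
            problems.append(f"{pattern}: no backup found")
            continue
        # Only the most recent file for each pattern is judged.
        newest = max(matches, key=lambda p: p.stat().st_mtime)
        info = newest.stat()
        age_hours = (now - info.st_mtime) / 3600
        if info.st_size == 0:
            problems.append(f"{newest.name}: empty file")
        elif age_hours > max_age_hours:
            problems.append(f"{newest.name}: {age_hours:.1f} h old")
    return problems
```

A nightly job could call it with hypothetical patterns such as `["keycloak-*.sql.gz", "mesh-data-*.tar.gz"]` and alert whenever the returned list is non-empty.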

Simplify onboarding while you’re in there

The old onboarding script SSHed to the VPN box, minted a client cert, and emailed a .ovpn file. Migrating is the natural moment to delete that whole ritual:

Why fix onboarding now instead of later

A new hire onboarded the old way gets a .ovpn you’ll just have to migrate again. Cut the OpenVPN block out of the script now — keep the directory user/group/home creation, drop the cert dance — and new users land on the mesh by default: create the directory account → Keycloak federates it automatically → the user self-enrolls from a one-page guide. One fewer artifact, one fewer manual step, and it preps the next phase (matching directory identities to the cloud IdP by email).

The end-to-end check passed with zero .ovpn involvement: create the directory account → Keycloak federates it automatically → the user self-enrolls from the one-page guide.

Cut over in waves, behind a landing page

Build a simple landing page (connect.example.com): per-OS download links, a “Log in” button, three steps, and a plain line — “the old VPN still works until [date].” Then migrate by team: send the wave its link, confirm each person is connected on the mesh, then have them stop using OpenVPN. Keep the old VPN as fallback until the wave is confirmed stable, and only then move the next one.
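The gating rule in that paragraph (confirm every person, declare the wave stable, only then start the next) is simple enough to encode so nobody skips a step. This is a sketch under assumptions: the `Wave` class, `next_wave`, and the team and user names are made up for illustration, and "stable" is a human decision recorded by calling `mark_stable`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Wave:
    team: str
    members: Set[str]
    on_mesh: Set[str] = field(default_factory=set)
    stable: bool = False

    def confirm(self, user: str) -> None:
        """Record that a user has been verified as connected on the mesh."""
        if user not in self.members:
            raise ValueError(f"{user} is not in wave {self.team}")
        self.on_mesh.add(user)

    @property
    def pending(self) -> Set[str]:
        # Members who have not yet been confirmed on the mesh.
        return self.members - self.on_mesh

    def mark_stable(self) -> None:
        """Declare the wave stable; refused while anyone is unconfirmed."""
        if self.pending:
            raise RuntimeError(f"still pending: {sorted(self.pending)}")
        self.stable = True


def next_wave(waves: List[Wave]) -> Optional[Wave]:
    """Return the first wave not yet stable; later waves stay blocked behind it."""
    for wave in waves:
        if not wave.stable:
            return wave
    return None
```

With `waves = [Wave("finance", {"ana", "bo"}), Wave("eng", {"cy"})]`, `next_wave(waves)` keeps returning the finance wave until both members are confirmed and someone calls `mark_stable()`. Nothing in the sketch touches OpenVPN, which matches the point of the process: the old VPN stays available whatever state a wave is in.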

When the last interactive user is across, the old VPN doesn’t get deleted — it drops to carrying site-to-site backup tunnels only. You can revoke individual client certs, but the server stays. (This is also where the broader maintenance window comes in: the retired pieces — old single-purpose VMs that this project replaced — get cleaned up after everyone’s confirmed on the new stack, never before.)

The rollback that’s always available

At every wave, the escape hatch is the same: the user falls back to OpenVPN, which never stopped working. If the whole stack ever needs to vanish, that takes docker compose down and removing one firewall rule; the old VPN is untouched underneath. Reversibility isn’t a feature you add at the end; it’s the property you preserve at every step.

Honest limits / what’s deliberately next

  • OpenVPN isn’t gone, and won’t be — it keeps the site-to-site tunnels. This was never about killing OpenVPN, only about getting people off per-user certs.
  • MFA still lives in Keycloak. The planned next phase federates the cloud IdP (Entra) into Keycloak and moves MFA authority to Conditional Access — and then closes the loop with an onboarding form that provisions identity from a single source, reconciling it down to the directory’s POSIX accounts. That’s a whole project of its own, and the right call was to not block this migration on it.
  • The far-office latency is unchanged. ~250 ms is physics; the durable fix is regional compute, tracked separately. The mesh didn’t make it worse — but I won’t pretend it made it better.

The one takeaway

If there’s a single idea worth keeping from this series, it’s the one that made every part safe: build the new thing alongside the old, keep it reversible, and make it earn trust one wave at a time. The flashy part is the WireGuard mesh and the SSO. The part that actually matters — the reason nobody had a bad day — is that the old VPN was still humming the entire time. Boring cutovers are the goal. Boring is hard-won.