How to run a maintenance window without losing sleep


The window was scheduled for Wednesday June 17. The client’s POC called Tuesday morning and said go ahead and do it today. So we did.

Eight VMs decommissioned, four engineering teams consolidated onto dedicated infrastructure, one new Windows server stood up from scratch, one real incident mid-window, and a post-window verify script that passed clean apart from one known false positive. This is the full story, prep included, because the reason the incident got caught and fixed in ten minutes was work done before a single VM was touched.

What we were doing

A client runs a small private compute cluster: about fifteen VMs spread across four bare-metal hypervisors, running KVM/libvirt on Linux. Engineering teams had been running on general-purpose VMs that accumulated over years, each one a special snowflake. The consolidation gave each team one Linux VM and (where needed) one Windows VM, both sized for their actual workload. Eight VMs got turned off permanently. Four got kept. One new Windows VM was created from scratch.

Why consolidate in the first place?

The old layout had teams sharing VMs with no clear ownership, which made it impossible to set resource limits, enforce access boundaries, or know who to call when something broke. Giving each team a dedicated VM fixes all three problems. The bare-metal hosts become admin-only infrastructure; team VMs become the thing teams actually log into. Simple mental model, much easier to support.

The scope in one table:

| VM | Action | Team |
|---|---|---|
| shannon, ada, feynman, dirac | Decommission | (none, roles absorbed) |
| centos7, rhel8, rocky8 | Decommission | (OS upgrades long overdue) |
| marconi | Decommission | (Packaging migrates to fermi) |
| turing | Keep, resize | Optics |
| fermi | Keep | Packaging |
| faraday | Keep | CMOS/SoC |
| nyquist | Keep, resize (rescued at the last minute) | Systems (new team) |
| cwinserv | Keep | Optics Windows |
| pwinserv | Create | Packaging Windows |
| cserver, gserver, fserver, mserver | Restrict to admins | Infrastructure |

That last row is the subtle part. The bare-metal hosts stayed on, but they stopped being general login targets. You access your team’s VM; if you need something at the hypervisor level, you call IT.

Step one: map the blast radius before you touch anything

The first thing I did was enumerate what’s actually running on every VM scheduled for decommission. Not what the documentation says is running. What’s actually there.

# For each VM, before you touch it:
virsh dominfo <vmname>          # state, memory, vCPUs
virsh domblklist <vmname>       # disk images and their paths
virsh dumpxml <vmname>          # full spec: networks, PCI passthrough, everything

Why not trust the docs?

Documentation drifts. A VM that was “just a test box” in 2023 might have a cron job that three people depend on now. The actual running state is authoritative; everything else is a guess. On this client’s fleet, two VMs I thought were idle had active SSH sessions when I checked.
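
The per-VM audit is easy to loop so that every decommission target gets the same treatment and the output is kept for later comparison. This is a minimal sketch, not the script used in this window: the function names `audit_vm` and `audit_all`, the output directory, and the file naming are assumptions for illustration. Only the virsh subcommands come from the post.

```bash
# Capture the live libvirt view of one VM into a directory for later reference.
# audit_vm and the file layout are hypothetical; the virsh subcommands are real.
audit_vm() {
    vm="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1
    virsh dominfo "$vm"    > "$outdir/$vm.dominfo.txt"    || return 1
    virsh domblklist "$vm" > "$outdir/$vm.domblklist.txt" || return 1
    virsh dumpxml "$vm"    > "$outdir/$vm.xml"            || return 1
}

# Audit a list of VMs. Reports each VM that could not be audited and
# returns nonzero if any of them failed.
audit_all() {
    outdir="$1"; shift
    failed=0
    for vm in "$@"; do
        if ! audit_vm "$vm" "$outdir"; then
            echo "audit failed: $vm" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

A call such as `audit_all /root/audit-$(date +%F) shannon ada feynman dirac` would leave three files per VM. The saved XML is also what you would need to redefine a VM if a decommission had to be reversed.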

After the audit: IPA (FreeIPA, the directory server) had host entries for all nine decommission targets. DNS and DHCP were served by dnsmasq keyed to MAC addresses. Neither of those is self-correcting – you have to explicitly remove the entries or they’ll conflict later.

The pre-window checklist:

- Remove decommission targets from dnsmasq (host entries and DHCP leases)
- Remove them from IPA (host entries, netgroups, HBAC rules)
- Update /etc/hosts on any machine that has static entries pointing to them
- Check cron jobs on the hypervisors for any jobs referencing decommissioned hosts

That last one is easy to miss. A backup cron that tries to reach a VM that no longer exists will fail silently for weeks before anyone notices.
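
Hunting for leftover references is mostly a grep problem. The sketch below assumes GNU grep and uses a hypothetical helper, `find_stale_refs`, that prints every line mentioning a retired host as a whole word. The file locations in the usage note are typical paths, not the client's actual layout.

```bash
# Print every line in the given files that mentions a retired host as a whole word.
# find_stale_refs is a hypothetical helper. grep -w keeps "ada" from matching "faraday".
find_stale_refs() {
    hosts="$1"; shift      # space-separated host names
    status=1               # 1 = nothing found, mirroring grep's convention
    for h in $hosts; do
        for f in "$@"; do
            [ -r "$f" ] || continue
            if grep -Hnw -- "$h" "$f"; then
                status=0
            fi
        done
    done
    return "$status"
}
```

Something like `find_stale_refs "shannon ada marconi" /etc/hosts /etc/dnsmasq.d/*.conf /etc/cron.d/*` run on each hypervisor would surface the backup cron case before it starts failing quietly.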

The backup strategy: recourse first, then speed

This is the part most maintenance windows get wrong. People back up data. The right thing to back up before a major infrastructure change is recoverability – the specific things that let you undo the window if it goes badly.

This client runs rsnapshot for file-level backups and virsh backup-begin for VM image snapshots. The full picture:

ELI5: what rsnapshot does

rsnapshot is a backup tool that uses hard links to store many versions of a directory without multiplying the disk space. It keeps, say, seven daily copies and four weekly copies, but if a file hasn’t changed between runs, both copies point to the same data on disk. You get the browsing experience of “full backup each time” at the storage cost of “only store what changed.” The downside is it’s file-level – it can’t take a consistent snapshot of a running database mid-write. For databases you need application-level dumps in addition.
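
The hard-link idea can be seen with plain coreutils. This sketch is not rsnapshot itself; the directory names only imitate its snapshot layout, and the files are throwaway examples in a temp directory.

```bash
# Demonstrate the hard-link idea behind rsnapshot with plain coreutils.
# This is not rsnapshot; the directory names only imitate its layout.
demo=$(mktemp -d)
mkdir "$demo/daily.1" "$demo/daily.0"
echo "unchanged report" > "$demo/daily.1/report.txt"

# An unchanged file is hard-linked into the newer snapshot, not copied.
ln "$demo/daily.1/report.txt" "$demo/daily.0/report.txt"

# A changed file gets its own data in the newer snapshot.
echo "old notes" > "$demo/daily.1/notes.txt"
echo "new notes" > "$demo/daily.0/notes.txt"

# -ef is true when both paths refer to the same inode (same data on disk).
if [ "$demo/daily.0/report.txt" -ef "$demo/daily.1/report.txt" ]; then
    echo "report.txt: shared between snapshots"
fi
if ! [ "$demo/daily.0/notes.txt" -ef "$demo/daily.1/notes.txt" ]; then
    echo "notes.txt: stored separately"
fi
```

Both snapshot directories look complete when browsed, but only notes.txt costs extra space.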

ELI5: crash-consistent vs. application-consistent backups

A crash-consistent backup is like pulling the power cord and then copying the disk. The data is all there, but anything mid-write is in an ambiguous state, the same way your laptop would be if it lost power mid-save. Most VMs recover fine from this, the same way most applications recover from a crash. An application-consistent backup coordinates with the running software (quiescing writes, flushing caches) before snapping the image. It’s safer, slower, and requires guest agent support. For the decommission VMs, crash-consistent is fine – we’re not restoring them, just keeping them as a reference.
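
In libvirt terms the difference can come down to whether the guest's filesystems are frozen around the start of the backup job. The sketch below uses a hypothetical wrapper, `backup_vm`, and rests on two assumptions: that the guest runs qemu-guest-agent, and that the backup job captures disk state as of the moment it starts, so the guest can be thawed right away. A filesystem freeze is also only part of application consistency; a database may still need its own dump.

```bash
# Sketch: start a libvirt backup job with or without quiescing the guest first.
# backup_vm is a hypothetical wrapper. Freezing needs qemu-guest-agent in the guest.
backup_vm() {
    vm="$1"
    mode="$2"    # "app" = freeze guest filesystems first, "crash" = just start the job

    if [ "$mode" = "app" ]; then
        virsh domfsfreeze "$vm" || return 1
    fi

    # With no XML argument, backup-begin starts a full backup to local files.
    virsh backup-begin "$vm"
    rc=$?

    if [ "$mode" = "app" ]; then
        # Always thaw, even if the backup job failed to start.
        virsh domfsthaw "$vm" || rc=1
    fi
    return "$rc"
}
```

`backup_vm turing app` would suit a keeper, and `backup_vm ada crash` a VM that is only being kept for reference.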

Before the window, the priority backup isn’t the decommission targets. It’s the things we can’t recreate from scratch: the Gitea instance (source code repository), the IPA server (directory, all user accounts), and the NFS home and project stores.

On this client’s setup, the Gitea data turned out to be safer than expected: the repository data directory resolves via symlink to an NFS export on the primary file server, which is already covered by the daily rsnapshot job. Losing the Gitea VM would lose the application layer (the web UI, the service) but not the actual code. That’s a two-hour recovery, not a crisis. Still, I added a gitea dump cron that runs nightly so the database schema and config are also portable.

The recourse backup script ran in two phases before the window opened:

  1. KEEPERS first – consistent images of every VM we were keeping, taken live. If the window went completely sideways and we needed to roll back to “how it was at 5:58 PM,” we had it.
  2. DECOM set second – crash-consistent images of the eight VMs being retired. Retention is 30 days. The “I can’t believe we actually need these” insurance.
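
The ordering matters more than the backup mechanism, so here is a sketch of just the ordering. The names `run_phase`, `recourse_backup`, and `backup_one` are hypothetical, and `backup_one` stands in for whatever actually images a VM. The real recourse script is not reproduced in this post.

```bash
# Stand-in for the real imaging step. Assumption: a plain libvirt backup job.
backup_one() {
    virsh backup-begin "$1"
}

# Back up every VM in one phase and report per-VM results.
# Succeeds only if every VM in the phase succeeded.
run_phase() {
    phase="$1"; shift
    phase_failed=0
    for vm in "$@"; do
        if backup_one "$vm"; then
            echo "$phase: $vm ok"
        else
            echo "$phase: $vm FAILED" >&2
            phase_failed=$((phase_failed + 1))
        fi
    done
    [ "$phase_failed" -eq 0 ]
}

# Keepers first. The decommission set is only attempted once the keepers are safe.
recourse_backup() {
    keepers="$1"
    decom="$2"
    # Word splitting on the unquoted lists is intentional: one VM name per word.
    run_phase KEEPERS $keepers || { echo "keeper backup incomplete, stopping" >&2; return 1; }
    run_phase DECOM $decom
}
```

Stopping after a failed keeper phase is a design choice: the keepers are the rollback path, so there is little point spending drive space on the retired set until they are covered.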

The whole thing lands on a 17T USB drive mounted at /mnt/backup via filesystem label.

Why a USB drive?

Because it’s physically separable from the server. If the hypervisor has a hardware failure during the window, you want your backups on a device you can walk out of the building with. Network-attached storage is convenient but it shares the failure domain. For a client this size, a labeled USB drive that auto-mounts by label (so any compatible drive labeled backup just works) is the right call. Larger shops use tape or geographically separate storage; the principle is the same.
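
Mounting by label is what makes the drive swappable. A minimal sketch, assuming util-linux tools (`mountpoint`, `blkid`) are available: the helper name `ensure_backup_mounted` is hypothetical, and only the label and mount point come from the post.

```bash
# Sketch: make sure the drive labeled "backup" is mounted before writing to it.
# ensure_backup_mounted is a hypothetical helper; label and mount point follow
# the post, everything else is an assumption.
ensure_backup_mounted() {
    label="${1:-backup}"
    target="${2:-/mnt/backup}"

    # mountpoint -q succeeds if something is already mounted at the target.
    if mountpoint -q "$target"; then
        return 0
    fi
    # blkid -L resolves a filesystem label to its device node, if one is attached.
    dev=$(blkid -L "$label") || { echo "no device labeled $label" >&2; return 1; }
    mount "$dev" "$target"
}
```

Calling this at the top of a backup script avoids the classic failure where the drive is unplugged and the backup quietly fills the root filesystem under /mnt/backup instead.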

The runbook: writing it forces the thinking

A runbook is a sequential list of steps with expected outcomes. The act of writing it finds the gaps.

The ones I found writing this one:

- The new Windows VM (pwinserv) needs a DHCP reservation before it boots, or it’ll get a random IP and IPA enrollment will fail.
- The IPA client enrollment command needs to run before the VM is rebooted, not after. Rebooting first means the hostname doesn’t resolve yet, enrollment fails, and you’re debugging SSSD in a window where you have three other things running in parallel.
- The welcome page (a landing page served from an internal web server) needs its routing table updated. I scripted this but the script assumes the old VMs are still responding. Run it before the decom, not after.
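
An ordering assumption like the first one can be turned into a guard in the runbook itself. The sketch below checks a dnsmasq config for a `dhcp-host` line that names the VM before anything boots. The helper name `has_dhcp_reservation` and the config path in the usage note are hypothetical.

```bash
# Sketch: succeed only if the dnsmasq config has an active dhcp-host line naming the VM.
# has_dhcp_reservation is a hypothetical helper; the config path is passed in.
has_dhcp_reservation() {
    name="$1"
    conf="$2"
    # Match dhcp-host lines where the name is a complete comma-separated field.
    # Commented-out lines do not match because the pattern is anchored at line start.
    grep -Eq "^dhcp-host=([^#]*,)?${name}(,|\$)" "$conf"
}
```

In a runbook step this might read `has_dhcp_reservation pwinserv /etc/dnsmasq.d/hosts.conf && virsh start pwinserv`, so the boot simply does not happen until the reservation exists.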

None of these are obvious until you write the steps out in order and ask “what does this step assume?”

The runbook lives in the repo at /root/etc/consolidation/RUNBOOK.md on the hypervisor. Everything the window needs is staged there: the backup script, the dnsmasq config diff, the IPA cleanup commands, the welcome page swap script.

The verification harness: prove it worked

After I finished the runbook, I wrote a test script. Two modes:

verify-consolidation.sh pre – run it before the window opens. Checks that every keeper VM is reachable via SSH, that IPA can authenticate a test user on each, that NFS mounts resolve on the team VMs, that the new Windows server template exists and is ready, and that /mnt/backup has the space needed. Pass/fail per check, written to a timestamped log. Exit code is the number of failures. If it exits nonzero, you don’t start the window.

verify-consolidation.sh post – run it before you leave. Checks that decommissioned targets no longer respond to ping, that team VMs answer SSH with IPA credentials, that the new Windows server has a DHCP lease and responds to RDP, and that dnsmasq is serving the right addresses. Same log format. That log is the evidence the window succeeded.
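
The core of a harness like this is small. What follows is a minimal sketch in the spirit of verify-consolidation.sh, not the script itself: the `check` function, the log format, and the two example checks are assumptions. The ping flags assume Linux iputils.

```bash
# Minimal pass/fail harness sketch. Names and log format are assumptions.
LOG="${LOG:-verify-$(date +%Y%m%d-%H%M%S).log}"
FAILURES=0

# check "description" command [args...]
# Runs the command and appends one PASS or FAIL line with the command and its output.
check() {
    desc="$1"; shift
    if output=$("$@" 2>&1); then
        result=PASS
    else
        result=FAIL
        FAILURES=$((FAILURES + 1))
    fi
    # Flatten multi-line output so the log stays one line per check.
    one_line=$(printf '%s' "$output" | tr '\n' ' ')
    printf '%s %s %s | cmd: %s | out: %s\n' \
        "$(date +%FT%T)" "$result" "$desc" "$*" "$one_line" >> "$LOG"
}

# Example checks for "post" mode, using host names from the scope table.
run_post_checks() {
    check "ada no longer answers ping" sh -c '! ping -c 1 -W 2 ada'
    check "turing answers SSH" ssh -o BatchMode=yes -o ConnectTimeout=5 turing true
}

# A real script would finish with: exit "$FAILURES"
```

Because every check goes through one function, adding a check is a one-line change, and the failure count falls out for free as the exit code.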

Why automate the verification?

Because at 9:30 PM after four hours of live infrastructure changes, you will miss things. The test script won’t. Writing it also forces you to define what “success” actually means in specific, checkable terms. “Everything seems fine” is not a success criterion. “All team VMs respond to SSH with IPA credentials and faraday’s NFS mounts resolve” is.

The thing I didn’t expect: this is already a cluster

While auditing the post-consolidation layout, I noticed something. Three team VMs (Optics, Packaging, CMOS), each on dedicated hardware, each with priority access to their own iron. Sound familiar?

That’s the condo model. The same scheduling concept that makes Sherlock work: groups own nodes, get guaranteed priority, idle capacity gets shared. This client runs it informally – an engineer either has access to their team’s VM or they don’t, and there’s no way to burst onto another team’s idle resources. Add Slurm and a license-aware GRES plugin (this client runs FlexLM for EDA tooling), and you have a proper scheduler that can fill idle cycles across teams automatically.

That’s a future post. For this window, the goal was just getting the new layout stable.

How it went

The window ran June 16, a day ahead of schedule.

The work itself went smoothly. Eight VMs shut down cleanly. nyquist got its vCPU resize. All four team VMs came out of the window at spec: SSH working, NFS mounts resolving, IPA authentication clean. Root disks grew from 24G to 52G on every team VM without a reboot. Bare-metal hypervisors locked down. The recourse backup (680G) landed before the first VM was touched.

One real incident mid-window. After applying the new dnsmasq config and restarting the service, NFS access broke for all four team VMs simultaneously.

Root cause: dnsmasq uses the first name in a host-record entry as the canonical name for reverse DNS (PTR records). The staged config had entries written short-name-first:

host-record=db01,db01.internal,10.x.x.x

When dnsmasq restarted, PTR lookups started returning bare short names (db01.) instead of FQDNs (db01.internal.). The NFS server authorizes clients by matching their reverse-DNS lookup against a netgroup in FreeIPA. Short names didn’t match, so every team VM got access denied on mount.

The fix is purely a matter of ordering:

host-record=db01.internal,db01,10.x.x.x

FQDN first, short name second. After rewriting all entries, restarting dnsmasq, and flushing the NFS server’s export cache and the SSSD netgroup cache on each client, mounts came back on all four VMs.
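
This class of mistake is cheap to lint for before the config is ever applied. The sketch below flags any `host-record` line whose first name has no dot in it. The helper name `lint_host_records` is hypothetical, and "contains a dot" is only a rough test for an FQDN.

```bash
# Sketch: flag dnsmasq host-record lines whose first name is not an FQDN.
# lint_host_records is a hypothetical helper; "has a dot" is a rough FQDN test.
lint_host_records() {
    awk -F'[=,]' '
        /^host-record=/ {
            # Field 1 is "host-record", field 2 is the first name on the line.
            if ($2 !~ /\./) {
                printf "%s:%d: short name first: %s\n", FILENAME, FNR, $0
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$@"
}
```

Run against the staged config, it exits nonzero and prints the offending lines. The post doesn't say which commands were used for the restart and cache flushes; commonly that would be `systemctl restart dnsmasq`, `exportfs -f` on the NFS server, and `sss_cache -N` on each client.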

ELI5: why does PTR record order matter for NFS?

When your laptop connects to a file server, the server looks up your IP address to find your hostname – that’s a PTR (reverse DNS) lookup. It then checks whether that hostname is in an authorized list. If the PTR returns a short name (db01) but the authorized list uses FQDNs (db01.internal), the comparison fails and you get access denied, even though you’re the right machine. The NFS server isn’t being paranoid; it just can’t tell that db01 and db01.internal refer to the same host without doing extra lookups it doesn’t do by default.
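
The same lookup the server performs can be checked by hand. A minimal sketch, assuming `dig` is installed: the helper name `ptr_matches` is hypothetical, and the address and name in the usage note are placeholders in the same style as the config lines above.

```bash
# Sketch: confirm that reverse DNS for an address returns the expected FQDN.
# ptr_matches is a hypothetical helper built on dig's reverse-lookup option (-x).
ptr_matches() {
    ip="$1"
    want="$2"                       # expected FQDN, without the trailing dot
    got=$(dig +short -x "$ip" | head -n 1)
    got="${got%.}"                  # dig prints names with a trailing dot
    if [ "$got" = "$want" ]; then
        return 0
    fi
    echo "PTR for $ip is '${got}', expected '${want}'" >&2
    return 1
}
```

`ptr_matches 10.0.0.5 db01.internal` would have failed immediately after the bad restart, which makes it a natural candidate for the verify script.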

The verify script. After finishing the remaining backlog items (Gitea nightly cron, backup list cleanup, disk-guard on the Gitea server, adding the new team VM to the remote access portal), verify-consolidation.sh post ran with one failure: a substring check that uses LIKE '%ada%' to confirm a retired VM’s connections were removed from the remote access portal. The check also matches a kept VM whose name contains those three letters. Known false positive, documented in the runbook. Every other check passed.
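
The false positive is easy to reproduce outside the database. In the scope table, for instance, faraday contains the letters ada. The sketch below contrasts a substring match with a stricter one. The function names are hypothetical, and the portal's real schema and query are not shown in this post.

```bash
# Sketch: a substring match on "ada" also hits longer names; a stricter match does not.
# The name lists are illustrative only.
substring_hits() {   # like SQL: name LIKE '%needle%'
    needle="$1"; shift
    for n in "$@"; do
        case "$n" in *"$needle"*) echo "$n" ;; esac
    done
}

exact_hits() {       # like SQL: name = 'needle' OR name LIKE 'needle.%'
    needle="$1"; shift
    for n in "$@"; do
        case "$n" in "$needle"|"$needle".*) echo "$n" ;; esac
    done
}
```

`substring_hits ada ada faraday turing` prints both ada and faraday, while `exact_hits ada ada faraday turing` prints only ada. Tightening the check the same way would retire the documented false positive.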

That log file is the paper trail. Timestamped, one line per check, PASS or FAIL with the exact command and output that produced it. If anyone asks six months from now whether the window was verified clean, the answer is yes and there’s a file to prove it.

What’s still pending. One Windows VM (for the Packaging team) is standing but not yet configured – it’s usable infrastructure, just needs a few manual Windows-side steps before that team gets handed the keys. Disk expansion on the Optics Windows server is also pending, waiting on a confirmed target size. Neither blocks anyone right now.

The one thing worth taking from this

The incident (broken NFS from a dnsmasq config mistake) was caught in about ten minutes because we knew exactly what to check. The verify script ran after every major step, not just at the end. When NFS broke, the symptom was immediate and the blast radius was clear: all four team VMs, same failure mode, same timestamp. That points directly at dnsmasq.

Without the harness, the same incident still gets fixed eventually. It just takes longer, and longer at 8 PM during a maintenance window is not fun.


The tools used in this post: KVM/libvirt, FreeIPA, dnsmasq, rsnapshot, virsh. All running on Linux. The specifics (host counts, team names, IP ranges) are generalized to protect the client.