Warewulf 4 and the art of treating servers as cattle
A stateless node has no OS on its disk. It PXE-boots one golden image into RAM. This is how big clusters stay sane. Getting there on a brand-new EL10 stack took a pile of debugging I didn't expect. Here's all of it.
- Stateless provisioning: the one table you need
- How Warewulf 4 thinks: containers and overlays
- Building the compute image
- Overlays: the right file at the wrong time
- Registering the nodes
- Booting the VMs, and the panic that stopped me cold
- Getting Slurm to actually talk
- The payoff
- Honest limits
- What’s next
This is the part the whole series has been building toward. Back in Part 1 I claimed the single most important idea in HPC ops is that compute nodes are cattle, not pets. Warewulf is where that stops being a slogan and becomes a working mechanism. By the end of this post, two diskless VMs boot from nothing, pull one golden image over the network, and come up as identical compute nodes.
I also picked the cutting-edge stack on purpose (OpenHPC 4 on Rocky 10, which means Warewulf 4), and that’s where I hit most of my surprises. There were a lot of them. They’re all here, because the surprises are the actual education.
Stateless provisioning: the one table you need
Two ways to run a node:
| | Stateful node (a “pet”) | Stateless node (a “cow”) |
|---|---|---|
| OS lives on | local disk | one shared image, in RAM |
| After a reboot | whatever drifted onto it persists | guaranteed identical to every node |
| Patch the fleet | touch every machine | update one image, reboot |
| Disk dies | reinstall, reconfigure | irrelevant, there’s no OS disk |
| “Is c573 like c574?” | unanswerable | true by construction |
Why stateless is the only thing that scales
A stateful node installs the OS on local disk. With thousands of nodes that’s unmanageable: they drift apart as patches land unevenly, a dead disk means a reinstall, and “is c573 configured like c574?” becomes genuinely unanswerable. A stateless node has no OS on disk. It PXE-boots a single golden image into RAM and runs from there. Reboot a misbehaving node and it comes back provably identical to every other node. Drift is impossible by construction. You update the cluster by updating one image. This is how large sites stay sane, and it’s the model Sherlock uses. Warewulf builds that image, serves DHCP/TFTP/PXE, and tracks which node gets which image.
How Warewulf 4 thinks: containers and overlays
Warewulf 4 is a clean break from the Warewulf 3 most older guides describe. Worth knowing the shape before driving it:
| | Warewulf 3 (older guides) | Warewulf 4 (this build) |
|---|---|---|
| Node image | VNFS chroot tarball | OCI container (same format as Docker) |
| CLI | wwsh, wwvnfs, wwbootstrap | a single wwctl |
| Services | httpd + dhcpd + tftp | one warewulfd (+ dnsmasq) |
| Node database | MariaDB | plain files (nodes.conf) |
| Per-node config | “files” | overlays (templated, two kinds) |
The two ideas that matter most in v4 are container images and overlays:
Containers as node images, and the two kinds of overlay
The base image is literally an OCI container you import, customize with wwctl container
exec, then build into the bootable form. Every node boots that same image, so anything
a node needs has to be baked into the container (there’s no disk to install onto later).
The per-node bits (its IP, the shared Munge key, slurm.conf) ride in overlays, which
come in two flavors that turn out to matter enormously: system overlays are applied in
the initramfs before the OS starts, and runtime overlays are applied after boot by
a small wwclient agent and refreshed periodically. Put a file in the wrong one and it
arrives at the wrong time. I learned that the hard way (twice).
Building the compute image
Import a stock Rocky 10 container and confirm it landed:
$ wwctl container import docker://quay.io/rockylinux/rockylinux:10 rockylinux-10
Copying blob 530d6b37ba46 done
Writing manifest to image destination
info unpack layer: sha256:530d6b37ba46a527ac6dfd8fa14e3b44a6abd963d7ba147d3751a1650febf4b6
$ wwctl container list
IMAGE NAME
----------
rockylinux-10
Then install everything a compute node needs into the container. The base image only has Rocky’s repos, so EPEL, CRB, and OpenHPC have to go in first, same as on the head node:
wwctl container exec rockylinux-10 -- /bin/bash -c '
dnf install -y dnf-plugins-core epel-release
dnf config-manager --set-enabled crb
dnf install -y http://repos.openhpc.community/OpenHPC/4/EL_10/x86_64/ohpc-release-4-1.el10.x86_64.rpm
dnf install -y ohpc-base-compute ohpc-slurm-client openssh-server chrony munge \
kernel iproute NetworkManager nfs-utils
'
Why “into the container” and not “onto the node”
There is no node to install onto yet. The node is diskless. Everything a compute node runs must be baked into the image first. This is the cattle-not-pets discipline made concrete: you never SSH into a node to fix it, you change the image and rebuild.
That dnf line already has two of my surprises hidden in it.
Surprise 1: the kernel installs to the wrong place. Warewulf needs a kernel in the
image’s /boot to serve over PXE. On EL10 the kernel package drops vmlinuz into
/lib/modules, and the /boot-copy step that normally happens via a kernel-install plugin
doesn’t run inside a chroot. So /boot had the initramfs but no kernel:
$ ls /boot # inside the container
initramfs-6.12.0-211.22.1.el10_2.x86_64.img
symvers-6.12.0-211.22.1.el10_2.x86_64.xz
$ find / -xdev -name 'vmlinuz*'
/usr/lib/modules/6.12.0-211.22.1.el10_2.x86_64/vmlinuz
The fix is a one-line copy, done in the container:
wwctl container exec rockylinux-10 -- /bin/bash -c \
'KVER=$(ls /lib/modules); cp /lib/modules/$KVER/vmlinuz /boot/vmlinuz-$KVER'
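If the image ever ends up holding more than one kernel, `$(ls /lib/modules)` expands to several names and the one-liner stops working. A slightly more defensive sketch loops over every module directory instead. `copy_kernels_to_boot` and its optional root argument are hypothetical names of mine, not part of Warewulf:

```shell
# Copy every kernel found under <root>/lib/modules into <root>/boot.
# copy_kernels_to_boot is a hypothetical helper; <root> defaults to /
# so the same function can run inside the container.
copy_kernels_to_boot() {
    root="${1:-/}"
    root="${root%/}"                 # "/" becomes "", "/tmp/x/" becomes "/tmp/x"
    mkdir -p "$root/boot"
    for moddir in "$root"/lib/modules/*/; do
        kver=$(basename "$moddir")
        src="${moddir}vmlinuz"
        dst="$root/boot/vmlinuz-$kver"
        # Skip module trees that carry no kernel at all.
        [ -f "$src" ] || continue
        # Copy only if /boot does not already have this kernel.
        [ -f "$dst" ] || cp "$src" "$dst"
        echo "$dst"
    done
}
```

Run it in the container's shell with no argument, or point it at an unpacked image tree to check the result from outside.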
Surprise 2: nfs-utils isn’t pulled in, which I didn’t discover until much later when
/home refused to mount with a baffling NFS: mount program didn't pass remote address,
even when I mounted by IP. The cause was simply that the image had no mount.nfs helper.
It’s in the package list above now because I learned to put it there.
There’s also a Munge detail that’s pure cattle-cluster plumbing:
Make the node’s Munge identity match the head node’s
The shared Munge key only works if the munge user owns it with the same UID on every
machine. The head node’s munge UID is assigned dynamically (995 here), and a fresh
minimal container picks a different one (998). Left alone, the node rejects the key as
“owned by the wrong UID.” The fix is to pin the container’s munge user to the head
node’s UID and re-own its directories, including /var/log/munge, which I missed the
first time and which kept munged from even starting. (The slurm user needs no such
fix. OpenHPC pins it to a fixed UID, so it matches automatically. Only munge drifts.)
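The UID pin can be sketched as a small wrapper around `wwctl container exec`. This assumes munge is already installed in the image. The helper name `pin_munge_uid` is hypothetical, and the directory list is my assumption based on the usual munge package layout (only /var/log/munge is named above):

```shell
# Pin the image's munge UID to the head node's and re-own Munge's directories.
# pin_munge_uid is a hypothetical helper; the directory list is an assumption.
pin_munge_uid() {
    image="$1"
    uid="$2"
    # Double quotes on purpose: $uid is expanded here, on the head node.
    wwctl container exec "$image" -- /bin/bash -c "
        usermod -u $uid munge
        chown -R munge:munge /etc/munge /var/lib/munge /var/log/munge
    "
}

# Usage on the head node, reading the UID instead of hard-coding 995:
#   pin_munge_uid rockylinux-10 "$(id -u munge)"
```

Reading the UID with `id -u munge` keeps the image correct even if the head node is rebuilt and the dynamic UID changes.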
With the kernel in place, Munge aligned, the slurmd spool directory created, and a couple of
tmpfiles.d entries added, you build the image into its bootable form:
wwctl container build rockylinux-10
You can even check what the nodes will actually receive by peeking inside the built image, which is a cpio archive:
$ cpio -t < /srv/warewulf/provision/images/rockylinux-10.img | grep -E 'boot/(vmlinuz|initramfs)'
boot/initramfs-6.12.0-211.22.1.el10_2.x86_64.img
boot/vmlinuz-6.12.0-211.22.1.el10_2.x86_64
Both halves present. That habit (verify the artifact, not the source) saved me more than once.
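The same habit extends to anything else the node cannot live without. A minimal sketch, assuming the built image is a plain cpio archive as above: `require_paths` is a hypothetical helper that reads the listing on stdin and complains about every pattern it cannot find.

```shell
# Read a cpio listing on stdin and report any required path that is missing.
# require_paths is a hypothetical helper; each argument is a grep -E pattern.
require_paths() {
    listing=$(cat)
    missing=0
    for pattern in "$@"; do
        if ! printf '%s\n' "$listing" | grep -Eq "$pattern"; then
            echo "MISSING: $pattern"
            missing=1
        fi
    done
    # Non-zero exit status when at least one pattern was not found.
    return "$missing"
}

# Usage against the built image:
#   cpio -t < /srv/warewulf/provision/images/rockylinux-10.img |
#       require_paths '(^|/)boot/vmlinuz-' '(^|/)boot/initramfs-' 'sbin/mount\.nfs$'
```

Adding the mount.nfs pattern would have caught the missing nfs-utils long before /home refused to mount.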
Overlays: the right file at the wrong time
The shared Munge key and slurm.conf go into an overlay, and the node’s user accounts get
synced from the head node by a built-in syncuser overlay so UIDs line up. Two lessons here
cost me real time.
Rebuild the image AND the overlays, image first
The syncuser overlay reads the image’s /etc/passwd at overlay-build time. I changed
the Munge UID inside the image, rebuilt the image, and the nodes still came up with the
old UID. The culprit: I hadn’t rebuilt the overlays, so they were still carrying the old
snapshot and clobbering the fresh image on every boot. The rule that fixes a whole class
of “I changed it but nothing changed” confusion: wwctl container build, then
wwctl overlay build. Two layers, both need committing, image first.
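Since forgetting the second step is so easy, the ordering can live in one wrapper. A minimal sketch; the function name `rebuild_image_and_overlays` is hypothetical, and the two commands are exactly the ones named above:

```shell
# Rebuild the image first, then the overlays; stop if the image build fails.
# rebuild_image_and_overlays is a hypothetical wrapper, not a Warewulf command.
rebuild_image_and_overlays() {
    image="$1"
    wwctl container build "$image" && wwctl overlay build
}

# Usage:
#   rebuild_image_and_overlays rockylinux-10
```

The `&&` matters: there is no point rebuilding overlays against an image build that failed.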
The other overlay lesson is about timing, and it bit both slurmd and the /home mount at
boot. A file that has to exist before early boot (a tmpfiles.d rule, for instance) cannot
live in a runtime overlay, because wwclient applies those after the OS is already up. It
has to be in the image or a system overlay. Knowing which layer runs when is half of
operating Warewulf 4.
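For a file that has to be there early, one option is a small custom overlay that you then attach as a system overlay. This is a sketch under assumptions: the overlay name and the helper `stage_early_boot_rule` are hypothetical, and while `wwctl overlay create`, `mkdir`, `import`, and `build` are subcommands I know of, their exact behavior can differ between Warewulf 4 releases, so check `wwctl overlay --help` on yours.

```shell
# Stage a tmpfiles.d rule in a custom overlay and rebuild the overlays.
# stage_early_boot_rule is a hypothetical helper; the overlay name is
# whatever you choose, and rule_file is a local file on the head node.
stage_early_boot_rule() {
    overlay="$1"
    rule_file="$2"
    wwctl overlay create "$overlay" &&
    wwctl overlay mkdir "$overlay" /etc/tmpfiles.d &&
    wwctl overlay import "$overlay" "$rule_file" "/etc/tmpfiles.d/$(basename "$rule_file")" &&
    wwctl overlay build
}

# Usage (file name is illustrative):
#   stage_early_boot_rule early-boot ./slurmd-spool.conf
```

The overlay still has to be listed among the node's or profile's system overlays, not its runtime overlays. The flag for that has been renamed across Warewulf 4 versions, so I'm deliberately not guessing it here.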
Registering the nodes
Nodes are registered by MAC, then assigned the image, overlays, and kernel arguments:
wwctl node add c1 --ipaddr=10.0.0.2 --netmask=255.255.255.0 --hwaddr=52:54:00:aa:00:01
wwctl node add c2 --ipaddr=10.0.0.3 --netmask=255.255.255.0 --hwaddr=52:54:00:aa:00:02
wwctl node set c1,c2 --image=rockylinux-10
Why registration is MAC-based, and don’t forget the netmask
PXE happens before a node has any identity: no hostname, no IP, nothing on disk. The only
thing it announces is its MAC in the DHCP request. Warewulf matches that MAC and replies
“you are c1, here’s your IP and your boot image.” That’s how an anonymous booting VM
becomes a known cluster member. And pass --netmask. I left it off once, and Warewulf
generated a NetworkManager profile with a blank address= (no prefix means no valid
CIDR), so the node booted with no IP at all and nothing could reach it. A blank netmask is
a silent, infuriating failure.
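One way to make that failure loud is to validate the netmask before it ever reaches `wwctl node add`. A minimal sketch: `mask_to_prefix` is a hypothetical helper that turns a dotted netmask into the CIDR prefix length and refuses blank or non-contiguous masks.

```shell
# Convert a dotted netmask to a CIDR prefix length; fail on blank or
# non-contiguous masks. mask_to_prefix is a hypothetical helper.
mask_to_prefix() {
    mask="$1"
    [ -n "$mask" ] || { echo "netmask is blank" >&2; return 1; }
    prefix=0
    seen_zero=0
    # Split the mask on dots into the positional parameters.
    old_ifs=$IFS; IFS=.
    set -- $mask
    IFS=$old_ifs
    [ "$#" -eq 4 ] || { echo "bad netmask: $mask" >&2; return 1; }
    for octet in "$@"; do
        case "$octet" in
            255) bits=8 ;; 254) bits=7 ;; 252) bits=6 ;; 248) bits=5 ;;
            240) bits=4 ;; 224) bits=3 ;; 192) bits=2 ;; 128) bits=1 ;;
            0)   bits=0 ;;
            *) echo "bad netmask: $mask" >&2; return 1 ;;
        esac
        # Once an octet is not all ones, every later octet must be zero.
        if [ "$seen_zero" -eq 1 ] && [ "$bits" -ne 0 ]; then
            echo "bad netmask: $mask" >&2; return 1
        fi
        [ "$bits" -eq 8 ] || seen_zero=1
        prefix=$((prefix + bits))
    done
    echo "$prefix"
}

# Usage: only register the node if the mask is sane.
#   prefix=$(mask_to_prefix 255.255.255.0) && echo "c1 will be 10.0.0.2/$prefix"
```

For the lab's 255.255.255.0 this yields 24, the prefix the NetworkManager profile needs after the address.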
Booting the VMs, and the panic that stopped me cold
The compute VMs are diskless with PXE as the only boot path. They cannot boot any other way, exactly like a real diskless node. The first time I booted them, both panicked five seconds in:
[ 4.922296] Initramfs unpacking failed: write error
[ 5.325850] Kernel panic - not syncing: VFS: Unable to mount root fs on "" or unknown-block(0,0)
Why a diskless node needs surprising amounts of RAM
Initramfs unpacking failed: write error is the kernel running out of memory while
unpacking the image. Remember the model: a stateless node loads the entire OS image
into RAM and runs from it. My nodes had 3 GiB, and the image plus the unpack overhead
didn’t fit. Bumping them to 6 GiB fixed it instantly. This is the “why 6 GiB on a diskless
node?” promise from Part 1, paid off with a crash log. On real hardware with hundreds of
gigabytes it’s a non-issue, but in a RAM-constrained lab it’s the kind of limit you only
learn by hitting it.
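A back-of-the-envelope check before sizing a VM: while the kernel unpacks the initramfs, the compressed archive and its unpacked contents both occupy memory. The sketch below adds the two. `min_ram_mib` is a hypothetical helper and the result is my rough lower bound, not a number Warewulf reports; the real requirement is higher once the kernel, overlays, and the workload are counted.

```shell
# Rough lower bound on node RAM in MiB: compressed image + unpacked contents.
# min_ram_mib is a hypothetical helper; treat the result as a floor, not a target.
min_ram_mib() {
    image_gz="$1"
    compressed=$(wc -c < "$image_gz")
    unpacked=$(gzip -dc "$image_gz" | wc -c)
    # Round up to the next whole MiB.
    echo $(( ($compressed + $unpacked + 1048575) / 1048576 ))
}

# Usage:
#   min_ram_mib /srv/warewulf/provision/images/rockylinux-10.img.gz
```

If the number it prints is anywhere near the VM's memory, expect the write error above.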
A couple of smaller VM-creation gotchas rode along: the e1000 NIC needed its PXE ROM
symlinked where QEMU looks for it, virt-install --print-xml 1 (with the step number) avoids
emitting two XML documents that virsh define chokes on, and --pxe quietly sets the VM to
destroy itself on reboot, which you patch back to restart or the node powers off the
first time it reboots.
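The last two gotchas combine into one pipeline: print a single XML document, patch the reboot behavior, then define the VM. A minimal sketch; `fix_on_reboot` is a hypothetical helper, and the VM name, vCPU count, network name, and `--osinfo` setting in the usage comment are illustrative assumptions rather than the exact command I ran.

```shell
# Rewrite <on_reboot>destroy</on_reboot> to restart in domain XML read on stdin.
# fix_on_reboot is a hypothetical helper.
fix_on_reboot() {
    sed 's|<on_reboot>destroy</on_reboot>|<on_reboot>restart</on_reboot>|'
}

# Usage (names and sizes are illustrative):
#   virt-install --name c1 --memory 6144 --vcpus 2 --disk none \
#       --network network=cluster,mac=52:54:00:aa:00:01,model=e1000 \
#       --pxe --osinfo detect=on,require=off --print-xml 1 |
#       fix_on_reboot > c1.xml
#   virsh define c1.xml
```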
With 6 GiB, the boot runs the full Warewulf 4 sequence, every stage visible in the
warewulfd log:
request from hwaddr:52:54:00:aa:00:01 ipaddr:10.0.0.2 | stage:ipxe
send /etc/warewulf/ipxe/default.ipxe -> c1
request from hwaddr:52:54:00:aa:00:01 ipaddr:10.0.0.2 | stage:kernel
send .../boot/vmlinuz-6.12.0-211.22.1.el10_2.x86_64 -> c1
request from hwaddr:52:54:00:aa:00:01 ipaddr:10.0.0.2 | stage:image
send /srv/warewulf/provision/images/rockylinux-10.img.gz -> c1
request from hwaddr:52:54:00:aa:00:01 ipaddr:10.0.0.2 | stage:system
send .../overlays/c1/__SYSTEM__.img.gz -> c1
request from hwaddr:52:54:00:aa:00:01 ipaddr:10.0.0.2 | stage:runtime
send .../overlays/c1/__RUNTIME__.img.gz -> c1
iPXE config, then kernel, then the image, then the system and runtime overlays. That’s the whole stateless boot in one screen, served over HTTP to a VM with no disk.
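When a node hangs partway, it helps to see how far it got. A small sketch that pulls the stage sequence for one MAC out of the log, assuming the line format shown above; `boot_stages` is a hypothetical helper and where the warewulfd log lives depends on your install.

```shell
# Print, in order, the stages a given MAC requested, from log text on stdin.
# boot_stages is a hypothetical helper; it assumes "hwaddr:<mac> ... stage:<name>".
boot_stages() {
    mac="$1"
    grep -F "hwaddr:$mac" |
        sed -n 's/.*stage:\([a-z]*\).*/\1/p' |
        paste -s -d ' ' -
}

# Usage (journalctl is an assumption; use a log file if that is where yours goes):
#   journalctl -u warewulfd | boot_stages 52:54:00:aa:00:01
```

A healthy boot should list ipxe, kernel, image, system, and runtime in that order. A list that stops early tells you which stage to look at.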
Getting Slurm to actually talk
The nodes booted, but slurmd would not start. Its log showed a strange error:
error: _xgetaddrinfo: getaddrinfo((null):6818) failed: Name or service not known
fatal: Unable to bind listen port (6818): Address family not supported by protocol
When a node can’t resolve its own name
slurmd figures out which node it is by matching its hostname to a NodeName in
slurm.conf, then resolves that name to an address to bind. On EL10, nss-myhostname
answers a lookup of the node’s own name with its IPv6 link-local address, which slurmd
can’t bind, so it ends up with nothing. The clean fix is to stop it resolving a name at
all: give each node an explicit NodeAddr in slurm.conf (NodeAddr=10.0.0.[2-3]), the
literal IP. This is standard production practice anyway, and it sidesteps the whole trap.
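A quick lint for this trap is to list any node definition that still relies on name resolution. A minimal sketch; `nodes_missing_nodeaddr` is a hypothetical helper, and it only looks at single-line NodeName entries.

```shell
# List NodeName lines in a slurm.conf that lack an explicit NodeAddr.
# nodes_missing_nodeaddr is a hypothetical helper; prints nothing when all is well.
nodes_missing_nodeaddr() {
    conf="$1"
    # Slurm keywords are case-insensitive, so match them that way.
    grep -iE '^[[:space:]]*NodeName=' "$conf" | grep -iv 'NodeAddr=' || true
}

# Usage:
#   nodes_missing_nodeaddr /etc/slurm/slurm.conf
# To see what a node's own name resolves to, run this on the node:
#   getent ahosts "$(hostname)"
```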
Two more in the same family. The head node’s firewall was silently dropping Slurm’s control
ports until I put the cluster NIC in the trusted zone (the symptom was a wonderfully
misleading “malformed RPC” error). And because slurmd and the /home mount can start
before the network is fully up, the clean cold-boot fix was to gate them on
network-online.target and set ReturnToService=2 in slurm.conf, so a node that briefly went
down during the boot race returns to service on its own, with no manual scontrol resume.
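The gating itself is an ordinary systemd drop-in. A minimal sketch of writing one into an image tree or overlay directory; `write_wait_online_dropin` and the file name wait-online.conf are hypothetical, and the target only delays anything if a wait-online service (NetworkManager-wait-online.service on this stack) is enabled.

```shell
# Write a drop-in that orders a unit after network-online.target.
# write_wait_online_dropin is a hypothetical helper; <root> is the tree the
# file should land in (an image chroot or an overlay directory).
write_wait_online_dropin() {
    root="${1%/}"
    unit="$2"
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir"
    cat > "$dir/wait-online.conf" <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
}

# Usage (the root path is illustrative):
#   write_wait_online_dropin /path/to/image/rootfs slurmd.service
```

Because this file has to exist before the unit starts, it belongs in the image or a system overlay, for the timing reasons covered in the overlays section.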
The payoff
After all of that, both nodes come up on their own and Slurm sees them:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 2:00:00 2 idle c[1-2]
owners up 1-00:00:00 2 idle c[1-2]
$ srun -N2 -l hostname
0: c1
1: c2
Two machines with no operating system on their disks, booted entirely from one image over the network, running a command the scheduler placed on both of them at once. That’s the cattle-not-pets model working end to end. Reboot either node and it comes back identical. Change the image, rebuild, reboot, and every node gets the update. There’s no drift possible because there’s nothing to drift.
Honest limits
A few things this lab fakes that a real cluster does for real:
- No Lustre. Shared storage here is NFS exported from the head node. At real scale NFS melts when hundreds of nodes write at once. Lustre spreads that load across many storage servers. The concepts are clear even without running it.
- No InfiniBand. VMs use virtio Ethernet. MPI will run but not fast.
- Single image for all nodes. A real cluster often has multiple images for different node types (CPU vs GPU). The mechanism is identical, you just point nodes at different images.
And the honest meta-point: I counted more than a dozen distinct fixes between “import a container” and “both nodes idle.” Most guides show the happy path. The happy path is not where the learning is.
What’s next
Two nodes, stateless, running identical images, and the scheduler can place work on them. The cluster exists. Now it needs to enforce a policy. Part 5 is the centerpiece: Slurm’s condo scheduling model, and the money shot, watching a high-priority owner job preempt a running guest job, live.