Series: HPC Lab, Part 4 of 5

Warewulf 4 and the art of treating servers as cattle

A stateless node has no OS on its disk. It PXE-boots one golden image into RAM. This is how big clusters stay sane. Getting there on a brand-new EL10 stack took a pile of debugging I didn't expect. Here's all of it.

This is the part the whole series has been building toward. Back in Part 1 I claimed the single most important idea in HPC ops is that compute nodes are cattle, not pets. Warewulf is where that stops being a slogan and becomes a working mechanism. By the end of this post, two diskless VMs boot from nothing, pull one golden image over the network, and come up as identical compute nodes.

I also picked the cutting-edge stack on purpose (OpenHPC 4 on Rocky 10, which means Warewulf 4), and that’s where I hit most of my surprises. There were a lot of them. They’re all here, because the surprises are the actual education.

Stateless provisioning: the one table you need

Two ways to run a node:

|                      | Stateful node (a “pet”)            | Stateless node (a “cow”)             |
|----------------------|------------------------------------|--------------------------------------|
| OS lives             | on local disk                      | one shared image, in RAM             |
| After a reboot       | whatever drifted onto it persists  | guaranteed identical to every node   |
| Patch the fleet      | touch every machine                | update one image, reboot             |
| Disk dies            | reinstall, reconfigure             | irrelevant, there’s no OS disk       |
| “Is c573 like c574?” | unanswerable                       | true by construction                 |

Why stateless is the only thing that scales

A stateful node installs the OS on local disk. With thousands of nodes that’s unmanageable: they drift apart as patches land unevenly, a dead disk means a reinstall, and “is c573 configured like c574?” becomes genuinely unanswerable. A stateless node has no OS on disk. It PXE-boots a single golden image into RAM and runs from there. Reboot a misbehaving node and it comes back provably identical to every other node. Drift is impossible by construction. You update the cluster by updating one image. This is how large sites stay sane, and it’s the model Sherlock uses. Warewulf builds that image, serves DHCP/TFTP/PXE, and tracks which node gets which image.

How Warewulf 4 thinks: containers and overlays

Warewulf 4 is a clean break from the Warewulf 3 most older guides describe. Worth knowing the shape before driving it:

|                 | Warewulf 3 (older guides)  | Warewulf 4 (this build)               |
|-----------------|----------------------------|---------------------------------------|
| Node image      | VNFS chroot tarball        | OCI container (same format as Docker) |
| CLI             | wwsh, wwvnfs, wwbootstrap  | a single wwctl                        |
| Services        | httpd + dhcpd + tftp       | one warewulfd (+ dnsmasq)             |
| Node database   | MariaDB                    | plain files (nodes.conf)              |
| Per-node config | “files”                    | overlays (templated, two kinds)       |

The two ideas that matter most in v4 are container images and overlays:

Containers as node images, and the two kinds of overlay

The base image is literally an OCI container you import, customize with wwctl container exec, then build into the bootable form. Every node boots that same image, so anything a node needs has to be baked into the container (there’s no disk to install onto later). The per-node bits (its IP, the shared Munge key, slurm.conf) ride in overlays, which come in two flavors that turn out to matter enormously: system overlays are applied in the initramfs before the OS starts, and runtime overlays are applied after boot by a small wwclient agent and refreshed periodically. Put a file in the wrong one and it arrives at the wrong time. I learned that the hard way (twice).

Building the compute image

Import a stock Rocky 10 container and confirm it landed:

$ wwctl container import docker://quay.io/rockylinux/rockylinux:10 rockylinux-10
Copying blob 530d6b37ba46 done
Writing manifest to image destination
info unpack layer: sha256:530d6b37ba46a527ac6dfd8fa14e3b44a6abd963d7ba147d3751a1650febf4b6
$ wwctl container list
IMAGE NAME
----------
rockylinux-10

Then install everything a compute node needs into the container. The base image only has Rocky’s repos, so EPEL, CRB, and OpenHPC have to go in first, same as on the head node:

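wwctl container exec rockylinux-10 -- /bin/bash -c '
  # Stop at the first failed step so a broken repo setup is not masked by later installs
  set -e
  # Repos first: EPEL, CRB, then the OpenHPC release package
  dnf install -y dnf-plugins-core epel-release
  dnf config-manager --set-enabled crb
  dnf install -y http://repos.openhpc.community/OpenHPC/4/EL_10/x86_64/ohpc-release-4-1.el10.x86_64.rpm
  # Everything a compute node runs has to be in the image
  dnf install -y ohpc-base-compute ohpc-slurm-client openssh-server chrony munge \
                 kernel iproute NetworkManager nfs-utils
'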

Why “into the container” and not “onto the node”

There is no node to install onto yet. The node is diskless. Everything a compute node runs must be baked into the image first. This is the cattle-not-pets discipline made concrete: you never SSH into a node to fix it, you change the image and rebuild.

That dnf line already has two of my surprises hidden in it.

Surprise 1: the kernel installs to the wrong place. Warewulf needs a kernel in the image’s /boot to serve over PXE. On EL10 the kernel package drops vmlinuz into /lib/modules, and the /boot-copy step that normally happens via a kernel-install plugin doesn’t run inside a chroot. So /boot had the initramfs but no kernel:

$ ls /boot      # inside the container
initramfs-6.12.0-211.22.1.el10_2.x86_64.img
symvers-6.12.0-211.22.1.el10_2.x86_64.xz
$ find / -xdev -name 'vmlinuz*'
/usr/lib/modules/6.12.0-211.22.1.el10_2.x86_64/vmlinuz

The fix is a one-line copy, done in the container:

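# Pick the newest kernel directory (in case more than one is installed) and copy its vmlinuz into /boot
wwctl container exec rockylinux-10 -- /bin/bash -c \
  'KVER=$(ls /lib/modules | sort -V | tail -n 1); cp /lib/modules/$KVER/vmlinuz /boot/vmlinuz-$KVER'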

Surprise 2: nfs-utils isn’t pulled in. I didn’t discover that until much later, when /home refused to mount with a baffling “NFS: mount program didn't pass remote address” error, even when I mounted by IP. The cause was simply that the image had no mount.nfs helper. It’s in the package list above now because I learned to put it there.

There’s also a Munge detail that’s pure cattle-cluster plumbing:

Make the node’s Munge identity match the head node’s

The shared Munge key only works if the munge user owns it with the same UID on every machine. The head node’s munge UID is assigned dynamically (995 here), and a fresh minimal container picks a different one (998). Left alone, the node rejects the key as “owned by the wrong UID.” The fix is to pin the container’s munge user to the head node’s UID and re-own its directories, including /var/log/munge, which I missed the first time and which kept munged from even starting. (The slurm user needs no such fix. OpenHPC pins it to a fixed UID, so it matches automatically. Only munge drifts.)
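A quick way to catch this kind of UID drift before booting anything is to compare the two passwd files directly. This is a minimal sketch, not a Warewulf feature: the function names are made up for illustration, and the image rootfs path in the usage note is an assumption about where the chroot lives on the head node.

```shell
# Hypothetical helpers (not part of Warewulf or Munge).

# Print the UID of a user from a passwd-format file; fail if the user is absent.
uid_in_passwd() {
    user="$1"
    passwd_file="$2"
    awk -F: -v u="$user" '$1 == u { print $3; found = 1; exit } END { exit !found }' "$passwd_file"
}

# Succeed only if the user exists in both files with the same UID.
same_uid() {
    a=$(uid_in_passwd "$1" "$2") || return 1
    b=$(uid_in_passwd "$1" "$3") || return 1
    [ "$a" = "$b" ]
}
```

Usage would look something like `same_uid munge /etc/passwd /var/lib/warewulf/chroots/rockylinux-10/rootfs/etc/passwd || echo "munge UID drift"`, where the second path is assumed and may differ on your install.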

With the kernel in place, Munge aligned, the slurmd spool directory created, and a couple of tmpfiles.d entries added, you build the image into its bootable form:

wwctl container build rockylinux-10

You can even check what the nodes will actually receive by peeking inside the built image, which is a cpio archive:

$ cpio -t < /srv/warewulf/provision/images/rockylinux-10.img | grep -E 'boot/(vmlinuz|initramfs)'
boot/initramfs-6.12.0-211.22.1.el10_2.x86_64.img
boot/vmlinuz-6.12.0-211.22.1.el10_2.x86_64

Both halves present. That habit (verify the artifact, not the source) saved me more than once.
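If you want that check to fail loudly instead of relying on eyeballs, it can be wrapped in a small function. A minimal sketch: the function name is hypothetical, and it only looks at a cpio listing on stdin, so it makes no assumptions about where the image lives.

```shell
# Hypothetical helper (not part of Warewulf): succeeds only if the cpio
# listing on stdin contains both a kernel and an initramfs under boot/.
image_has_boot_pair() {
    listing=$(cat)
    printf '%s\n' "$listing" | grep -Eq '^(\./)?boot/vmlinuz-' || return 1
    printf '%s\n' "$listing" | grep -Eq '^(\./)?boot/initramfs-' || return 1
}
```

Fed from the same command as above: `cpio -t < /srv/warewulf/provision/images/rockylinux-10.img | image_has_boot_pair || echo "image is missing a kernel or initramfs"`.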

Overlays: the right file at the wrong time

The shared Munge key and slurm.conf go into an overlay, and the node’s user accounts get synced from the head node by a built-in syncuser overlay so UIDs line up. Two lessons here cost me real time.

Rebuild the image AND the overlays, image first

The syncuser overlay reads the image’s /etc/passwd at overlay-build time. I changed the Munge UID inside the image, rebuilt the image, and the nodes still came up with the old UID. The culprit: I hadn’t rebuilt the overlays, so they were still carrying the old snapshot and clobbering the fresh image on every boot. The rule that fixes a whole class of “I changed it but nothing changed” confusion: wwctl container build, then wwctl overlay build. Two layers, both need committing, image first.
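The ordering rule is simple enough to pin down in a wrapper so it can’t be run backwards. A minimal sketch: the function name and the WWCTL variable are made up here (WWCTL exists only so you can swap in echo for a dry run); the two wwctl subcommands are the ones used in this post.

```shell
# Assumption: WWCTL is a hypothetical override for dry runs; it defaults to wwctl.
WWCTL="${WWCTL:-wwctl}"

# Rebuild the image first, then the overlays. Stop if the image build fails,
# so overlays are never rebuilt against a stale or broken image.
rebuild_image_and_overlays() {
    image="$1"
    "$WWCTL" container build "$image" || return 1
    "$WWCTL" overlay build
}
```

Running `WWCTL=echo; rebuild_image_and_overlays rockylinux-10` prints the two commands in order without touching anything.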

The other overlay lesson is about timing, and it bit both slurmd and the /home mount at boot. A file that has to be in place during early boot (a tmpfiles.d rule, for instance) cannot live in a runtime overlay, because wwclient applies those after the OS is already up. It has to be in the image or a system overlay. Knowing which layer runs when is half of operating Warewulf 4.

Registering the nodes

Nodes are registered by MAC, then assigned the image, overlays, and kernel arguments:

wwctl node add c1 --ipaddr=10.0.0.2 --netmask=255.255.255.0 --hwaddr=52:54:00:aa:00:01
wwctl node add c2 --ipaddr=10.0.0.3 --netmask=255.255.255.0 --hwaddr=52:54:00:aa:00:02
wwctl node set c1,c2 --image=rockylinux-10

Why registration is MAC-based, and don’t forget the netmask

PXE happens before a node has any identity: no hostname, no IP, nothing on disk. The only thing it announces is its MAC in the DHCP request. Warewulf matches that MAC and replies “you are c1, here’s your IP and your boot image.” That’s how an anonymous booting VM becomes a known cluster member. And pass --netmask. I left it off once, and Warewulf generated a NetworkManager profile with a blank address= (no prefix means no valid CIDR), so the node booted with no IP at all and nothing could reach it. A blank netmask is a silent, infuriating failure.
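To make the “no prefix means no valid CIDR” point concrete, here is the arithmetic that turns a dotted netmask into the prefix length a NetworkManager address= line needs. This is only an illustration of the mapping, with a hypothetical function name; it is not how Warewulf computes it internally.

```shell
# Hypothetical helper: convert a dotted netmask to a CIDR prefix length.
# Prints nothing and returns 1 for an empty, malformed, or non-contiguous mask.
netmask_to_prefix() {
    mask="$1"
    [ -n "$mask" ] || return 1
    prefix=0
    seen_partial=0
    # Split the mask on dots into the function's positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $mask
    IFS=$old_ifs
    [ "$#" -eq 4 ] || return 1
    for octet in "$@"; do
        case "$octet" in
            255) bits=8 ;; 254) bits=7 ;; 252) bits=6 ;; 248) bits=5 ;;
            240) bits=4 ;; 224) bits=3 ;; 192) bits=2 ;; 128) bits=1 ;;
            0) bits=0 ;;
            *) return 1 ;;
        esac
        # Once an octet is not fully set, every later octet must be 0.
        if [ "$seen_partial" -eq 1 ] && [ "$bits" -ne 0 ]; then
            return 1
        fi
        [ "$bits" -eq 8 ] || seen_partial=1
        prefix=$((prefix + bits))
    done
    printf '%s\n' "$prefix"
}
```

With the values from this lab, `netmask_to_prefix 255.255.255.0` gives the 24 in 10.0.0.2/24. With an empty netmask there is nothing to compute, which matches the blank address= described above.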

Booting the VMs, and the panic that stopped me cold

The compute VMs are diskless with PXE as the only boot path. They cannot boot any other way, exactly like a real diskless node. The first time I booted them, both panicked five seconds in:

[    4.922296] Initramfs unpacking failed: write error
[    5.325850] Kernel panic - not syncing: VFS: Unable to mount root fs on "" or unknown-block(0,0)

Why a diskless node needs surprising amounts of RAM

Initramfs unpacking failed: write error is the kernel running out of memory while unpacking the image. Remember the model: a stateless node loads the entire OS image into RAM and runs from it. My nodes had 3 GiB, and the image plus the unpack overhead didn’t fit. Bumping them to 6 GiB fixed it instantly. This is the “why 6 GiB on a diskless node?” promise from Part 1, paid off with a crash log. On real hardware with hundreds of gigabytes it’s a non-issue, but in a RAM-constrained lab it’s the kind of limit you only learn by hitting it.
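A back-of-the-envelope way to size a diskless VM before it panics: while the initramfs unpacks, RAM has to hold both the compressed archive and the files being unpacked from it, plus whatever the kernel itself needs. The sketch below just adds those up. Treat it as a rough rule of thumb under that assumption, not a guarantee; other limits can apply, and the function name and the 512 MiB default headroom are made up for illustration.

```shell
# Rough estimate (an assumption, not a kernel guarantee): minimum RAM in MiB
# is compressed image + unpacked image + headroom for the kernel and services.
min_ram_mib() {
    compressed_mib="$1"
    unpacked_mib="$2"
    headroom_mib="${3:-512}"   # hypothetical default headroom
    echo $((compressed_mib + unpacked_mib + headroom_mib))
}

# To get real inputs, measure the built artifacts on the head node, e.g.:
#   wc -c < /srv/warewulf/provision/images/rockylinux-10.img.gz   # compressed bytes
#   wc -c < /srv/warewulf/provision/images/rockylinux-10.img      # unpacked bytes
# then divide by 1048576 to get MiB.
```

Compare the result against the VM’s memory before the first boot rather than after the first panic.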

A few smaller VM-creation gotchas rode along:

  • The e1000 NIC needed its PXE ROM symlinked where QEMU looks for it.
  • virt-install --print-xml 1 (with the step number) avoids emitting two XML documents, which virsh define chokes on.
  • --pxe quietly sets the VM to destroy itself on reboot. Patch that back to restart, or the node powers off the first time it reboots.
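The reboot-behavior patch is a one-line text substitution on the domain XML. A minimal sketch, assuming the XML that virt-install prints contains a literal `<on_reboot>destroy</on_reboot>` element; the function name is hypothetical.

```shell
# Sketch: rewrite on_reboot from destroy to restart in libvirt domain XML
# read from stdin, leaving everything else untouched.
fix_on_reboot() {
    sed 's#<on_reboot>destroy</on_reboot>#<on_reboot>restart</on_reboot>#'
}
```

In practice this would sit between the two tools: pipe the output of virt-install (with your own options plus --pxe --print-xml 1) through fix_on_reboot into a file, then hand that file to virsh define.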

With 6 GiB, the boot runs the full Warewulf 4 sequence, every stage visible in the warewulfd log:

request from hwaddr:52:54:00:aa:00:01 ipaddr:10.0.0.2 | stage:ipxe
send /etc/warewulf/ipxe/default.ipxe -> c1
request from hwaddr:52:54:00:aa:00:01 ipaddr:10.0.0.2 | stage:kernel
send .../boot/vmlinuz-6.12.0-211.22.1.el10_2.x86_64 -> c1
request from hwaddr:52:54:00:aa:00:01 ipaddr:10.0.0.2 | stage:image
send /srv/warewulf/provision/images/rockylinux-10.img.gz -> c1
request from hwaddr:52:54:00:aa:00:01 ipaddr:10.0.0.2 | stage:system
send .../overlays/c1/__SYSTEM__.img.gz -> c1
request from hwaddr:52:54:00:aa:00:01 ipaddr:10.0.0.2 | stage:runtime
send .../overlays/c1/__RUNTIME__.img.gz -> c1

iPXE config, then kernel, then the image, then the system and runtime overlays. That’s the whole stateless boot in one screen, served over HTTP to a VM with no disk.
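When a node stalls partway through, it helps to see how far it got. Here’s a minimal sketch that pulls the stage sequence for one MAC out of log lines shaped like the ones above. The function name is made up, and it reads from stdin so it assumes nothing about where your warewulfd log actually lives.

```shell
# Hypothetical helper: print, one per line and in order, the provisioning
# stages a given MAC address requested, from warewulfd log lines on stdin.
boot_stages() {
    mac="$1"
    grep -F "hwaddr:$mac " | sed -n 's/.*stage:\([a-z]*\).*/\1/p'
}
```

A healthy boot of c1 should list ipxe, kernel, image, system, runtime. A list that stops early tells you which stage to go look at.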

Getting Slurm to actually talk

The nodes booted, but slurmd would not start. Its log showed a strange one:

error: _xgetaddrinfo: getaddrinfo((null):6818) failed: Name or service not known
fatal: Unable to bind listen port (6818): Address family not supported by protocol

When a node can’t resolve its own name

slurmd figures out which node it is by matching its hostname to a NodeName in slurm.conf, then resolves that name to an address to bind. On EL10, nss-myhostname answers a lookup of the node’s own name with its IPv6 link-local address, which slurmd can’t bind, so it ends up with nothing. The clean fix is to stop it resolving a name at all: give each node an explicit NodeAddr in slurm.conf (NodeAddr=10.0.0.[2-3]), the literal IP. This is standard production practice anyway, and it sidesteps the whole trap.
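Since a missing NodeAddr fails at node boot rather than at config time, a small lint on slurm.conf can catch it earlier. A minimal sketch with a hypothetical function name: it prints every NodeName line that has no NodeAddr, skipping NodeName=DEFAULT lines, and it does not follow Include directives or line continuations.

```shell
# Hypothetical lint: print NodeName lines in a slurm.conf that lack NodeAddr.
# Keys are matched case-insensitively; NodeName=DEFAULT lines are ignored.
nodes_missing_nodeaddr() {
    conf="$1"
    awk '{ l = tolower($0) }
         l ~ /^[ \t]*nodename=/ &&
         l !~ /^[ \t]*nodename=default([ \t]|$)/ &&
         l !~ /nodeaddr=/ { print }' "$conf"
}
```

Empty output means every node definition carries an explicit address, such as NodeAddr=10.0.0.[2-3] in this lab.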

Two more in the same family. The head node’s firewall was silently dropping Slurm’s control ports until I put the cluster NIC in the trusted zone (the symptom was a wonderfully misleading “malformed RPC” error). And because slurmd and the /home mount can start before the network is fully up, the clean cold-boot fix was to gate them on network-online.target and set ReturnToService=2 so a node that briefly went down during the boot race returns to service on its own, no manual scontrol resume.
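One common way to express the network-online gate is a systemd drop-in for the unit. The sketch below is an assumption about how that can be done, not the exact files from this build: the function name and the drop-in file name are hypothetical. On a diskless node the drop-in has to land in the image (or a system overlay), so the function takes the target directory as an argument.

```shell
# Sketch: write a systemd drop-in that makes a unit want, and order itself
# after, network-online.target. unit_dir is the unit's drop-in directory,
# e.g. <image rootfs>/etc/systemd/system/slurmd.service.d
write_network_online_dropin() {
    unit_dir="$1"
    mkdir -p "$unit_dir" || return 1
    # The file name is arbitrary; 10-network-online.conf is just a readable choice.
    cat > "$unit_dir/10-network-online.conf" <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
}
```

Note that network-online.target only delays anything if a wait-online service (for NetworkManager, NetworkManager-wait-online.service) is enabled in the image. ReturnToService=2 is a separate setting that lives in slurm.conf.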

The payoff

After all of that, both nodes come up on their own and Slurm sees them:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up    2:00:00      2   idle c[1-2]
owners       up 1-00:00:00      2   idle c[1-2]
$ srun -N2 -l hostname
0: c1
1: c2

Two machines with no operating system on their disks, booted entirely from one image over the network, running a command the scheduler placed on both of them at once. That’s the cattle-not-pets model working end to end. Reboot either node and it comes back identical. Change the image, rebuild, reboot, and every node gets the update. There’s no drift possible because there’s nothing to drift.

Honest limits

A few things this lab fakes that a real cluster does for real:

  • No Lustre. Shared storage here is NFS exported from the head node. At real scale NFS melts when hundreds of nodes write at once. Lustre spreads that load across many storage servers. The concepts are clear even without running it.
  • No InfiniBand. VMs use virtio Ethernet. MPI will run but not fast.
  • Single image for all nodes. A real cluster often has multiple images for different node types (CPU vs GPU). The mechanism is identical, you just point nodes at different images.

And the honest meta-point: I counted more than a dozen distinct fixes between “import a container” and “both nodes idle.” Most guides show the happy path. The happy path is not where the learning is.

What’s next

Two nodes, stateless, running identical images, and the scheduler can place work on them. The cluster exists. Now it needs to enforce a policy. Part 5 is the centerpiece: Slurm’s condo scheduling model, and the money shot, watching a high-priority owner job preempt a running guest job, live.