Series: HPC Lab · Part 2 of 5

The two networks bare-metal guides skip, and booting the head node

Almost every "my node won't PXE boot" problem is a network problem. Here's the two-network design, including the DHCP-less network that makes stateless booting possible, and how to stand up the head node on it.

In Part 1 we settled the mental model: a cluster is five roles on a private network, and compute nodes are cattle. This part is about that private network, the piece every bare-metal guide assumes you already have, and the piece that, when it’s wrong, produces the single most common cluster failure: a node that won’t PXE boot.

If you get one thing right in this whole build, get the networks right.

Why two networks, and why one of them must not run DHCP

OpenHPC’s official guide assumes two physical networks already exist. On a single box we fake both as libvirt virtual networks:

   default (NAT, 192.168.122.0/24)        hpc-prov (isolated, 10.0.0.0/24)
   ───────────────────────────────        ───────────────────────────────
   head node → internet for packages      provisioning, DHCP, PXE, NFS,
   compute nodes never touch it            MPI, Slurm all ride here

The first network is boring: NAT, so the head node can reach the internet to download packages, with nothing exposed inbound. The second is where all the interesting failures live, and it has one non-negotiable rule:

Why the cluster network must NOT run DHCP

Warewulf is going to be the DHCP server on 10.0.0.0/24. It hands each compute node a specific boot file based on its MAC address. If libvirt also runs DHCP on that network, you have two DHCP servers racing to answer the same broadcast, and nodes randomly fail to boot. libvirt’s dnsmasq only serves DHCP if you give it a <dhcp> range, so we deliberately omit one and leave UDP/67 free for Warewulf.

That single omission is the difference between “PXE works first try” and an afternoon of confused debugging.
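
A quick way to confirm the omission actually holds is to look for a <dhcp> element in the network's XML. This is a minimal sketch, not a real XML parser: net_has_dhcp is a name I made up, and it assumes any XML comments sit on a single line (it strips them first so a comment that merely mentions <dhcp> doesn't count).

```shell
# Returns 0 (true) if the libvirt network XML on stdin contains a <dhcp> element.
# Assumption: comments are single-line; they are removed before matching so that
# a comment mentioning "<dhcp>" is not mistaken for the real element.
net_has_dhcp() {
  sed 's/<!--.*-->//g' | grep -q '<dhcp[ />]'
}

# Typical use on the host (needs libvirt, so it is left commented out here):
#   virsh net-dumpxml hpc-prov | net_has_dhcp && echo "libvirt DHCP is ON: fix this before PXE"
```

If the check fires on hpc-prov, libvirt's dnsmasq will compete with Warewulf for UDP/67, which is exactly the race described above.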

Network 1: the NAT path to the internet

This one usually already exists. Just make sure it’s running and survives reboot:

virsh net-list --all
virsh net-start default      2>/dev/null || true
virsh net-autostart default  2>/dev/null || true

If it’s missing, define it from XML. It’s a NAT forward, a bridge, and a DHCP range, and here libvirt’s DHCP is fine because nothing PXE-boots on this network:

<network>
  <name>default</name>
  <forward mode='nat'/>
  <bridge name='virbr0' stp='on' delay='0'/>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp><range start='192.168.122.2' end='192.168.122.254'/></dhcp>
  </ip>
</network>

Network 2: the isolated provisioning network (no DHCP)

Here’s the important one. Note what’s missing: there is no <dhcp> block, and there is no <forward> element at all. Leaving <forward> out is what makes the network fully isolated, with no route off it. (There’s no such thing as forward mode='none' in libvirt, which is its own small trap. You make a network isolated by omitting the element, not by setting it to none.)

<network>
  <name>hpc-prov</name>
  <bridge name='virbr-hpc' stp='on' delay='0'/>
  <!-- no <forward> element at all = isolated: no NAT, no route off this network -->
  <ip address='10.0.0.254' netmask='255.255.255.0'>
    <!-- deliberately NO <dhcp> block: Warewulf owns DHCP here -->
  </ip>
</network>

Save that as /tmp/hpc-prov.xml, then define it, start it, and mark it to autostart:

virsh net-define /tmp/hpc-prov.xml
virsh net-start hpc-prov && virsh net-autostart hpc-prov

Three deliberate choices, each with a reason:

  • Isolated keeps PXE/DHCP broadcast traffic off your real LAN. You do not want a rogue DHCP server answering your office network.
  • No libvirt DHCP keeps port 67 free for Warewulf (the whole point above).
  • The host keeps 10.0.0.254 on the bridge purely so you can reach the nodes from the host for debugging. The head node itself will own 10.0.0.1.
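
The isolation choice can be checked the same way as the DHCP one: read the network XML back and see whether a <forward> element is present. A rough sketch under the same assumptions as any grep-on-XML approach (single-line comments, single-quoted attributes as in the XML above); net_forward_mode is a hypothetical helper name, not a libvirt command.

```shell
# Prints the forward mode found in the libvirt network XML on stdin, or
# "isolated" when there is no <forward> element at all.
# Prints nothing if a <forward> element exists but carries no mode attribute.
net_forward_mode() {
  xml=$(sed 's/<!--.*-->//g')   # drop single-line comments first
  if ! printf '%s\n' "$xml" | grep -q '<forward[ />]'; then
    echo isolated
    return
  fi
  printf '%s\n' "$xml" | sed -n "s/.*<forward[^>]*mode='\([a-z]*\)'.*/\1/p" | head -n 1
}

# Typical use on the host (commented out because it needs libvirt):
#   virsh net-dumpxml default  | net_forward_mode
#   virsh net-dumpxml hpc-prov | net_forward_mode
```

For the two networks in this post I would expect nat for default and isolated for hpc-prov.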

A DOWN bridge is not a broken bridge

Right after you create it, ip -br addr shows virbr-hpc as DOWN even though it already has the right 10.0.0.254 address. That’s normal and worth internalizing now so you don’t chase it later. A Linux bridge reports NO-CARRIER/DOWN until something is actually plugged into it, and nothing is yet. The moment the head node’s NIC attaches, it flips to UP. A DOWN bridge with a correct IP is fine.
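
If you want a check that encodes "correct IP, ignore link state", something like the following works on ip -br addr output. It's a small sketch: bridge_has_addr is my own helper name, and it deliberately never looks at the state column.

```shell
# Reads `ip -br addr` output on stdin.
# Prints "ok" if interface $1 carries address $2 (CIDR form), "missing" otherwise.
# The link state (UP/DOWN/UNKNOWN) in column 2 is intentionally ignored.
bridge_has_addr() {
  awk -v ifc="$1" -v addr="$2" '
    $1 == ifc { for (i = 3; i <= NF; i++) if ($i == addr) found = 1 }
    END { print (found ? "ok" : "missing") }'
}

# Typical use on the host:
#   ip -br addr | bridge_has_addr virbr-hpc 10.0.0.254/24
```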

Track A shortcut

If you’re doing the manual “Slurm-first” track before Warewulf, you can keep this network exactly as-is and just assign static node IPs by hand. The no-DHCP rule only starts to matter once Warewulf takes over provisioning.

Building the head node, the headless way

The head node (sms) has two NICs, one on each network, and I built it from the Rocky 10 minimal ISO.

Why the minimal ISO

A head node should run only what it needs. Starting from minimal forces you to learn each service as you add it, instead of inheriting a pile of distro defaults you don’t understand and can’t audit.

My first instinct was the obvious virt-install --cdrom ... --graphics vnc and clicking through the graphical installer. That works fine on a laptop. It does not work on a headless server you only reach over SSH: there’s no display to attach to, and the installer just sits there forever waiting for input it will never get.

That dead end turned out to be a gift, because the fix is the more professional approach anyway: an unattended kickstart install. A kickstart file answers every installer question up front, so the machine installs itself with zero interaction. It’s the same technique real sites use to provision servers reproducibly, and it’s the stateful-install counterpart to the stateless Warewulf nodes coming later in the series.

Here’s the kickstart that defines the entire head node. The two network lines are the interesting part:

text
lang en_US.UTF-8
timezone America/Los_Angeles --utc

rootpw --plaintext rockylab
user --name=bobby --groups=wheel --plaintext --password=rockylab

clearpart --all --initlabel
autopart --type=plain
bootloader --location=mbr --append="console=ttyS0,115200n8 console=tty0"

# Pin each NIC by MAC so the assignment is deterministic
network --device=52:54:00:9c:d0:9c --bootproto=dhcp --onboot=on --activate
network --device=52:54:00:53:c0:f0 --bootproto=static --ip=10.0.0.1 --netmask=255.255.255.0 --onboot=on --activate --nodefroute --nodns
network --hostname=sms

firewall --enabled --service=ssh
selinux --permissive
firstboot --disable
reboot

%packages
@^minimal-environment
chrony
openssh-server
%end

Rocky 10’s installer rejected my first kickstart

My first attempt carried an ignoredisk --only-use=vda line (and a matching --boot-drive=vda) out of habit. Rocky 10 ships Anaconda 40, which evaluates that disk name at parse time and aborts with “Disk ‘vda’ given in ignoredisk command does not exist” before the disk is even visible. The VM has exactly one disk, so the fix was to drop both lines and let clearpart --all and autopart do their thing. A small thing, but the kind of version-specific surprise that’s invisible until you hit it.

Why pin NICs by MAC, and why --nodefroute

Interface names can reorder between boots, and on a cluster a swapped NIC means every later config points at the wrong network. Pinning each NIC by its MAC makes the assignment deterministic. And --nodefroute on the cluster NIC is how you enforce “no gateway on the isolated network”: the default route stays on the internet NIC, and the cluster NIC carries only local 10.0.0.0/24 traffic.
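
Because the MAC is the stable identifier, it's handy to be able to go from a MAC back to whatever name the kernel gave the NIC. Here is one way to do that from sysfs, as a sketch: ifname_for_mac is a hypothetical helper, and the optional second argument (an alternate directory to scan) exists only so the function can be exercised without real interfaces.

```shell
# Prints the interface name whose MAC address equals $1, or returns 1 if none.
# $2 is an optional directory to scan; it defaults to /sys/class/net.
ifname_for_mac() {
  want=$(printf '%s' "$1" | tr 'A-F' 'a-f')   # sysfs reports MACs in lowercase
  root=${2:-/sys/class/net}
  for dev in "$root"/*; do
    [ -r "$dev/address" ] || continue
    if [ "$(cat "$dev/address")" = "$want" ]; then
      basename "$dev"
      return 0
    fi
  done
  return 1
}

# Typical use on the head node:
#   ifname_for_mac 52:54:00:53:c0:f0
```

On the head node built below, that MAC is the cluster NIC, so later configuration can look the name up instead of hard-coding it.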

Why static on the cluster NIC

The entire cluster refers to the head node as 10.0.0.1: Slurm config, NFS exports, Warewulf, /etc/hosts, all of it. If that address floated via DHCP, every service would break the first time the head node rebooted and got a different lease.

You hand the kickstart to virt-install by injecting it into the installer’s initrd and pointing the installer kernel at it. --graphics none plus a serial console means the whole install runs over SSH, with no VNC anywhere:

virt-install \
  --name sms \
  --vcpus 2 --memory 4096 \
  --disk path=/var/lib/libvirt/images/sms.qcow2,size=40,format=qcow2 \
  --location ~/isos/Rocky-10-latest-x86_64-minimal.iso \
  --initrd-inject ~/sms-ks.cfg \
  --extra-args "inst.ks=file:/sms-ks.cfg console=ttyS0,115200n8" \
  --osinfo detect=on,require=off \
  --network network=default,mac=52:54:00:9c:d0:9c \
  --network network=hpc-prov,mac=52:54:00:53:c0:f0 \
  --graphics none --noautoconsole

Then watch it install over the serial console (detach with Ctrl + ]):

virsh console sms

qcow2 is thin-provisioned

size=40 means 40 GiB, and it is a ceiling, not an upfront allocation: the image file only grows as the guest writes data. The head node will store the compute image it serves later, so give it room.

A couple of small surprises showed up, the kind a clean tutorial usually hides. After the install rebooted, libvirt left the domain powered off, so it needed a virsh start sms before I could reconnect. And the DHCP NIC came up named ksdev0 instead of a normal ens/enp name, a quirk of pinning it by MAC in kickstart. Both are cosmetic, and the NIC that actually matters, the cluster one, came up as a clean ens3.
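
The powered-off surprise is easy to script around: poll the domain state and start it once the installer has shut it down. A sketch, assuming virsh domstate reports "shut off" at that point, as it did for me; start_when_off is a made-up helper, and the 5-second interval and 120-try default are arbitrary.

```shell
# Polls `virsh domstate` for domain $1 and starts it once it reports "shut off".
# $2 is the maximum number of polls (default 120, about 10 minutes at 5 s each).
start_when_off() {
  dom=$1
  tries=${2:-120}
  while [ "$tries" -gt 0 ]; do
    state=$(virsh domstate "$dom" 2>/dev/null)
    if [ "$state" = "shut off" ]; then
      virsh start "$dom"
      return
    fi
    tries=$((tries - 1))
    sleep 5
  done
  echo "timed out waiting for $dom to shut off" >&2
  return 1
}

# Typical use, right after the virt-install command returns:
#   start_when_off sms
```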

The payoff is logging in and seeing the networking land exactly as designed:

[bobby@sms ~]$ hostnamectl
   Static hostname: sms
     Virtualization: kvm
 Operating System: Rocky Linux 10.2 (Red Quartz)
     Architecture: x86-64
[bobby@sms ~]$ ip -br addr
lo               UNKNOWN        127.0.0.1/8 ::1/128
ksdev0           UP             192.168.122.108/24 fe80::5054:ff:fe9c:d09c/64
ens3             UP             10.0.0.1/24 fe80::5054:ff:fe53:c0f0/64
[bobby@sms ~]$ ip route
default via 192.168.122.1 dev ksdev0 proto dhcp src 192.168.122.108 metric 101
10.0.0.0/24 dev ens3 proto kernel scope link src 10.0.0.1 metric 100
192.168.122.0/24 dev ksdev0 proto kernel scope link src 192.168.122.108 metric 101

Internet NIC on a DHCP lease, cluster NIC nailed to 10.0.0.1, and the default route leaving through the internet side. That last detail is the one to check: anything bound for the outside world goes out ksdev0, while ens3 carries cluster traffic only. Exactly what --nodefroute bought us.
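
That check is simple enough to automate. The sketch below pulls the device out of the default route in ip route output; default_route_dev is just a name I picked, and it only looks at the first default route it finds.

```shell
# Reads `ip route` output on stdin and prints the device of the first default route.
default_route_dev() {
  awk '$1 == "default" {
         for (i = 2; i < NF; i++) if ($i == "dev") { print $(i + 1); exit }
       }'
}

# Typical use on the head node:
#   ip route | default_route_dev
```

If this ever prints the cluster NIC's name, the --nodefroute setting did not take effect and outbound traffic is heading into the isolated network.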

Reaching the cluster like a real one: NetBird

On a real cluster you don’t sit at the console. You reach the login node over a VPN. The same workflow fits here with NetBird (a WireGuard mesh): enroll sms as a mesh peer and SSH to it over its overlay IP from anywhere, with no port-forwarding off the host.

The same tool, at work

Using NetBird here as a single peer is what got me comfortable enough to run it for real. I’m now migrating a company’s entire remote access onto a self-hosted NetBird control plane with SSO. That’s a separate series, Rebuilding Remote Access.

  your laptop ──NetBird (100.x overlay)──► sms (login node)
                                              │  libvirt L2 bridge 10.0.0.0/24
                                              ├── c1   (PXE/DHCP/Slurm, must be real L2)
                                              └── c2
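
Install the NetBird client on sms and enroll it with a setup key:

curl -fsSL https://pkgs.netbird.io/install.sh | sh
sudo netbird up --setup-key <YOUR_SETUP_KEY>   # a setup key enrolls the peer with no interactive login
netbird status                                 # expect "Management: Connected" and an overlay IP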

The rule that keeps this from breaking the cluster:

NetBird is the front door, not the fabric

NetBird goes on the head node only. It does not replace the hpc-prov network. WireGuard is Layer-3, point-to-point, NOARP, and carries no DHCP/PXE broadcast, and a diskless node has no WireGuard agent at power-on anyway. So compute nodes keep talking over the libvirt L2 bridge, while NetBird is purely the access layer. Different layers, different jobs. Mixing them up is how people accidentally break PXE.

This mirrors production exactly: users reach the cluster through a controlled access layer to the login node, and the compute nodes stay unreachable from the outside, which is correct. (If you later want your laptop to reach c1/c2 directly, you make sms a NetBird routing peer advertising 10.0.0.0/24 and enable net.ipv4.ip_forward. Forget the forwarding and you get the classic “the route exists but nothing reaches the subnet” gotcha.)
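
For the forwarding half of that gotcha, here is a small guard you could run on sms before blaming the route. It's a sketch: forwarding_enabled is a hypothetical helper, and its optional path argument exists only so it can be pointed at a test file instead of the live kernel setting.

```shell
# Returns 0 (true) if IPv4 forwarding is enabled.
# $1 is an optional path to read; it defaults to the live kernel setting.
forwarding_enabled() {
  [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}")" = "1" ]
}

# Typical use on sms (commented out because it changes a kernel setting):
#   forwarding_enabled || sudo sysctl -w net.ipv4.ip_forward=1
```

sysctl -w only lasts until the next reboot. To keep it, put the same setting in a file under /etc/sysctl.d/.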

What’s next

We have two networks and a head node sitting on both, reachable like a real login node. Part 3 wires up the cluster’s brain: OpenHPC’s repos, Munge for authentication, and the Slurm controller, the daemons that turn a couple of VMs into something that can actually schedule work.