What an HPC cluster actually is (and how to fake one on a single box)

Strip away the marketing and a supercomputer is five roles wired to a private network. Here's the mental model, and the groundwork for rebuilding a Sherlock-style HPC cluster on one Linux box.

This one’s a personal homelab project, built on my own hardware. I’m writing it up as I go, so each part only covers what I’ve actually finished.

I wanted to learn how a real HPC cluster is operated. Not the brochure version, but the actual day-to-day of the person who keeps a cluster like Sherlock running. The catch is that I don’t have a server room. So I’m building one as a virtual cluster on a single Ubuntu box: Slurm scheduling, stateless provisioning, a shared filesystem, the whole stack. This series is the build log, written in real time.

This first part covers the thing most guides skip past, which is what a cluster even is, plus the groundwork of getting the host ready to run one. The fun stuff comes later, but it all rests on this.

First, the plain-English version

If you’re not a systems person, here’s the whole thing in a paragraph. A “supercomputer” isn’t one giant magic machine. It’s a roomful of ordinary computers plus software that does two jobs: let lots of researchers share those computers fairly, and keep every computer identical so the place doesn’t collapse into a thousand one-off snowflakes nobody can fix. I wanted to learn how the people who run those rooms actually do it, so I built a tiny version at home on a single machine: a couple of pretend “worker” computers, one “front desk” computer that people log into, and the traffic-cop software that decides whose work runs where and when. This series is me building that, piece by piece, writing down why each piece exists. Whenever a section gets too technical, the one idea underneath it all stays simple: treat the worker machines as interchangeable and disposable, and running a thousand of them gets no harder than running one.

A supercomputer is five roles, not one big computer

Strip away the marketing and an HPC cluster is a handful of distinct roles wired together on a private network. That’s really all it is. Each role has one job:

Role	Job	Sherlock	This lab
Login / head node	Where users SSH in, edit code, submit jobs. Also runs management.	`login.sherlock.stanford.edu`	`sms` VM
Management / provisioning	Builds and serves the OS image to every compute node; owns DHCP/TFTP/PXE.	dedicated mgmt nodes	folded into `sms`
Scheduler / controller	Decides which job runs on which node, when.	Slurm `slurmctld`	`slurmctld` on `sms`
Compute nodes	Where jobs actually run. Disposable, identical, many.	thousands	`c1`, `c2` VMs
Shared storage	One filesystem every node sees, so a job finds its data anywhere.	Lustre, NFS	NFS from `sms`

A supercomputer is just these five roles, scaled up and wired well. In this lab the first three collapse onto one VM (sms, short for System Management Server), which also exports the shared storage over NFS, and the compute nodes are a couple of disposable VMs. The shape is the same as the real thing. Only the scale is different.

The one idea: compute nodes are cattle, not pets

This is the single most important idea in HPC operations, and it’s what makes everything downstream make sense.

Cattle, not pets

You don’t SSH into a compute node and apt install things. You define one OS image, and every node boots that image. A node that misbehaves doesn’t get nursed back to health. It gets rebooted and reprovisioned from the canonical image in minutes. When you have thousands of nodes, hand-tending each one (treating them as pets) is impossible. Treating them as interchangeable, disposable cattle is the only thing that scales.

That principle is the reason a cluster needs a provisioning role at all, and it’s why a whole part of this series will go to Warewulf, the tool that serves one image to every node over the network. Hold onto it. It’s the thread running through the entire build.

What I’m building

Here’s the whole lab on one diagram: one head node on two networks, and two diskless compute nodes that PXE-boot an image they never store on disk.

                    Internet
                       │
                 ┌─────┴─────┐  NAT network "default" (192.168.122.0/24)
                 │  ksdev0   │  ← head node reaches the internet for packages
        ┌────────┴───────────┴────────┐
        │            sms               │   "sms" = System Management Server
        │  (head / login / scheduler)  │     - Slurm controller (slurmctld)
        │  ens3 = 10.0.0.1             │     - Warewulf provisioning (DHCP/TFTP/HTTP)
        │  Rocky Linux 10, 2 vCPU/4GB  │     - NFS server (/home, /opt)
        └──────────────┬───────────────┘
                       │  isolated provisioning network "hpc-prov" (10.0.0.0/24)
          ┌────────────┼────────────┐
     ┌────┴────┐  ┌────┴────┐   (add c3, c4… later)
     │   c1    │  │   c2    │   diskless / stateless compute nodes
     │10.0.0.2 │  │10.0.0.3 │   PXE-boot the image Warewulf serves
     │2vCPU/6GB│  │2vCPU/6GB│   Rocky Linux 10, no local OS install
     └─────────┘  └─────────┘
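
The addressing in the diagram follows a simple pattern: sms holds 10.0.0.1 and compute node N sits at 10.0.0.(N+1). A minimal sketch of that mapping, assuming the pattern simply continues for c3, c4 and beyond. The `node_ip` helper is hypothetical, not part of Warewulf or libvirt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a compute node name (c1, c2, ...) to its address
# on the 10.0.0.0/24 provisioning network. Assumes the diagram's pattern
# (c1 -> .2, c2 -> .3) keeps going for nodes added later.
node_ip() {
  local name="$1"
  local n="${name#c}"                 # strip the leading "c"
  case "$n" in
    ''|0*|*[!0-9]*)                   # empty, leading zero, or non-numeric
      echo "bad node name: $name" >&2
      return 1
      ;;
  esac
  # .1 belongs to sms, so node N gets .(N+1); stop before the broadcast address.
  if [ "$n" -gt 253 ]; then
    echo "node index out of range: $name" >&2
    return 1
  fi
  echo "10.0.0.$((n + 1))"
}

for node in c1 c2 c3 c4; do
  printf '%s %s\n' "$node" "$(node_ip "$node")"
done
```

Keeping the name-to-address rule in one place is a small instance of the cattle idea: a new node is just the next index.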

The whole thing fits comfortably on a 16-vCPU, 31 GiB host:

sms: 2 vCPU, 4 GiB, 40 GiB disk
c1 and c2: 2 vCPU, 6 GiB each, with no OS disk at all (stateless)
That’s 6 vCPU and 16 GiB of RAM in total, which leaves headroom to add more compute nodes later by copying a VM definition. That “add a node by copying a definition” move is the cattle principle paying off.
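
The arithmetic is easy to script if you want to see how far the host stretches. A minimal sketch using the sizes from this page. The 2 GiB host reserve is an assumption for illustration, not a measured figure, and the function names are made up here.

```shell
#!/usr/bin/env bash
# Sizes taken from this page. HOST_RESERVE_GIB is an assumed allowance for
# the host OS itself, not something measured.
HOST_VCPU=16; HOST_GIB=31
SMS_VCPU=2;   SMS_GIB=4
NODE_VCPU=2;  NODE_GIB=6
HOST_RESERVE_GIB=2

# Print "vcpu gib" consumed by sms plus N compute nodes.
lab_usage() {
  local nodes="$1"
  echo "$((SMS_VCPU + nodes * NODE_VCPU)) $((SMS_GIB + nodes * NODE_GIB))"
}

# Largest compute-node count that fits both budgets without overcommitting.
max_nodes() {
  local by_cpu=$(( (HOST_VCPU - SMS_VCPU) / NODE_VCPU ))
  local by_ram=$(( (HOST_GIB - HOST_RESERVE_GIB - SMS_GIB) / NODE_GIB ))
  echo $(( by_cpu < by_ram ? by_cpu : by_ram ))
}

echo "two nodes use (vcpu gib): $(lab_usage 2)"
echo "nodes that fit: $(max_nodes)"
```

With these numbers memory runs out first: four compute nodes fit, a fifth would not. vCPUs can be overcommitted if you accept contention, but RAM for a stateless node cannot, since the image lives there.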

Why 6 GiB on a diskless node?

It looks like a lot for a node with no disk. The reason is the whole point of stateless booting: the node has no local OS, so it loads the entire compute image into RAM and runs from there. The image plus the running system has to fit in memory with room to spare. Skimp on this and the node panics mid-boot trying to unpack the image. I learned that number the hard way, and Part 4 has the crash log to prove it.

Why Rocky on the guests but Ubuntu on the host?

The host just needs KVM, and any Linux will do for that. The guests run Rocky Linux 10 because OpenHPC officially targets Enterprise Linux (RHEL, Rocky, Alma), and Sherlock itself runs a RHEL-family distro. Matching the guest OS to the real environment is half the point of a learning lab. (OpenHPC’s current 4.x release targets Enterprise Linux 10 specifically, which is why this is Rocky 10 and not 9.)

Two tracks, in order

I’m doing this in two passes on purpose.

Track A, Slurm-first (about half a day). Build the head node and compute nodes by hand, wire up Slurm, and get jobs running. It’s the fastest path to a working scheduler you understand end to end.
Track B, Sherlock-real (a day or two). Replace the hand-built compute nodes with OpenHPC and Warewulf stateless provisioning, so nodes netboot an image you control. This is the showcase, and the part that mirrors how Sherlock is actually run.

Doing A first means I always have a working cluster to fall back on. And when Warewulf automates the node build in B, I’ll understand what it’s automating because I did it by hand once. That contrast is the whole point.

The groundwork: confirm KVM, then get out of `sudo`

This is as far as I’ve actually gotten, and it’s the right place to be careful. Before any VMs exist, the host has to be able to run them at full speed, and my user needs to drive libvirt without typing sudo in front of every command.

First, confirm the CPU can do hardware virtualization and that /dev/kvm is present:

$ grep -Ec '(vmx|svm)' /proc/cpuinfo
32
$ ls -l /dev/kvm
crw-rw----+ 1 root kvm 10, 232 Jun 10 06:26 /dev/kvm

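A non-zero count means the virtualization extensions are there, and /dev/kvm exists and is group-accessible. The 32 is a count of matching lines, not of CPUs: recent kernels on Intel print a separate “vmx flags” line alongside “flags” for each hardware thread, which is most likely why this 16-thread host reports 32.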
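
One wrinkle with a raw `grep -c`: it counts matching lines, so any extra line that mentions vmx (newer kernels print a separate “vmx flags” line on Intel) inflates the number. If you want a figure that maps to logical CPUs, here is a minimal sketch that looks only at the “flags” lines. The function name and its optional file argument are hypothetical, added so it can be pointed at a saved copy of cpuinfo.

```shell
#!/usr/bin/env bash
# Count logical CPUs whose "flags" line advertises vmx (Intel) or svm (AMD).
# Anchoring on "^flags" skips other lines that also mention vmx, such as the
# "vmx flags" line some kernels print. -w matches whole words only, so a
# flag like "svm_lock" does not count on its own.
count_virt_cpus() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  # grep -c exits non-zero on a zero count; "|| true" keeps that from
  # looking like a failure to the caller.
  grep -E '^flags[[:space:]]*:' "$cpuinfo" | grep -Ewc 'vmx|svm' || true
}

echo "logical CPUs with virtualization extensions: $(count_virt_cpus)"
```

Either count answers the question that matters here, which is only whether the number is zero.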

Why check for /dev/kvm first

KVM needs the CPU’s virtualization extensions. Without /dev/kvm, libvirt quietly falls back to pure-software emulation (TCG), and the “cluster” would crawl. One ls now saves you from chasing mysterious slowness later.
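
If you would rather script the check than eyeball `ls`, a minimal sketch follows. It tests that the path is a character device the current user can read and write. The `kvm_usable` name is hypothetical, and before the group change further down takes effect the access test may fail even on a perfectly good host.

```shell
#!/usr/bin/env bash
# Succeed only if the given path is a character device that the current
# user can both read and write. Defaults to the standard KVM device node.
kvm_usable() {
  local dev="${1:-/dev/kvm}"
  [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]
}

if kvm_usable; then
  echo "/dev/kvm is usable: guests can run hardware-accelerated"
else
  echo "/dev/kvm is missing or not accessible: expect slow emulation" >&2
fi
```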

Then install the virtualization stack and add yourself to the groups that let you skip sudo:

sudo apt install -y qemu-kvm libvirt-daemon-system libvirt-clients \
  virtinst virt-manager bridge-utils libosinfo-bin

sudo usermod -aG libvirt,kvm "$USER"   # then log out and back in, or run: newgrp libvirt

The proof that it worked is being in both groups and getting a clean, sudo-free answer from virsh:

$ id -nG | tr ' ' '\n' | grep -E 'libvirt|kvm'
kvm
libvirt
$ virsh list --all
 Id   Name   State
--------------------

An empty table, no sudo, no error. That’s the green light. The host can run VMs, and I’m talking to libvirt as myself. There are no VMs yet, which is exactly right at this stage.
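
The `grep -E 'libvirt|kvm'` above matches substrings, so a group such as libvirt-qemu would also show up. For an exact check, here is a minimal sketch that compares whole group names. `missing_groups` is a hypothetical helper, not a libvirt tool.

```shell
#!/usr/bin/env bash
# Print each required group that is absent from a space-separated group list.
# Usage: missing_groups "<space-separated groups>" group...
missing_groups() {
  local have=" $1 "        # pad with spaces so every name is delimited
  shift
  local g
  for g in "$@"; do
    case "$have" in
      *" $g "*) ;;         # exact name present, nothing to report
      *) echo "$g" ;;      # absent
    esac
  done
}

missing=$(missing_groups "$(id -nG)" libvirt kvm)
if [ -z "$missing" ]; then
  echo "group membership OK"
else
  # Left unquoted on purpose so the names print on one line.
  echo "not yet in:" $missing "(log out and back in after usermod)"
fi
```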

‘libvirtd is inactive’ is not a bug

On Ubuntu 24.04 libvirt is socket-activated, so libvirtd reads as “inactive” until the first virsh call wakes it. Don’t go chasing a dead service. It’s working as designed.
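
To tell “idle but socket-activated” apart from “actually down”, look at the socket unit as well as the service. A minimal sketch, assuming the unit names libvirtd.service and libvirtd.socket. Other distros or packagings may name or split these differently, and `libvirt_verdict` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Interpret the pair (service state, socket state) as reported by
# `systemctl is-active`. Unit names used below are an assumption about
# the packaging; adjust them if your system differs.
libvirt_verdict() {
  local service="$1" socket="$2"
  if [ "$service" = active ]; then
    echo "running"
  elif [ "$socket" = active ]; then
    echo "idle, socket-activated (fine)"
  else
    echo "down: neither service nor socket is active"
  fi
}

if command -v systemctl >/dev/null 2>&1; then
  libvirt_verdict "$(systemctl is-active libvirtd.service 2>/dev/null)" \
                  "$(systemctl is-active libvirtd.socket 2>/dev/null)"
fi
```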

Honest limits (the part that makes it credible)

A single-box lab can’t be a real supercomputer, and pretending otherwise would defeat the point. Three things this lab deliberately won’t do:

No InfiniBand or RDMA. Real Sherlock nodes talk over a low-latency fabric, while these VMs use virtio Ethernet. MPI still runs, it just won’t be fast. You learn the concepts of a fabric and topology-aware scheduling by reading, not by reproducing them.
No real Lustre. Standing up Lustre is a project of its own, so I’m using NFS as the shared-filesystem stand-in. Because NFS will visibly bottleneck under parallel load, you get to feel exactly why parallel filesystems exist.
One physical host. Every “node” shares one CPU and one disk, so this proves I can operate the stack, not benchmark it. That’s the right scope for what I’m after.

Saying these out loud is a feature, not an apology. The goal was never a fast cluster. It was to operate the same stack a real one runs, from end to end.

What’s next

The mental model is in place and the host can run VMs. Part 2 builds the networks that bare-metal guides tend to gloss over, including the isolated, DHCP-less provisioning network that makes stateless booting possible, and then stands up the head node itself. I’ll publish it once it’s actually built.