Series: HPC Lab, Part 5 of 5

The condo model and the money shot: watching Slurm preempt a job

Sherlock's condo model: research groups buy nodes and get guaranteed priority; everyone else uses idle cycles and gets bumped when an owner shows up. Here we configure it on two VMs and watch a high-priority job preempt a running one, live.

Everything so far (the networks, the shared trust, the stateless nodes) existed to get us here: a working cluster that can make decisions. This final part configures the thing Sherlock is actually known for among its users, the condo model, and ends with the demo worth recording: a high-priority job evicting a running one, in real time.

The mental model: a condo, not a hotel

Sherlock runs an owner-based (“condo”) model, and it’s a genuinely elegant piece of social engineering encoded in a scheduler:

What the condo model actually is

Research groups buy nodes and get guaranteed priority on that hardware, like owning a condo. But idle owned hardware is wasteful, so everyone else can run on it on a low-priority, preemptible basis, and gets bumped the instant an owner needs their nodes back. Owners get the guarantee they paid for, the cluster stays busy with guest work in the gaps, and nobody’s idle hardware goes to waste. Two partitions and a preemption rule express the entire deal.

We replicate it with our two compute nodes: an owners partition that preempts, and a normal partition that yields.

The core slurm.conf

This is the config that earns its keep. The interesting lines aren’t the node definitions, they’re the scheduling knobs:

ClusterName=lab
SlurmctldHost=sms
ReturnToService=2                      # a node that briefly went down rejoins on its own
AuthType=auth/munge

# Scheduling: cores/memory placement + QOS-driven preemption
SchedulerType=sched/backfill
SelectType=select/cons_tres            # schedule by cores/memory, not whole nodes
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/basic
PreemptType=preempt/qos                # a high-priority QOS can preempt a low one
PreemptMode=REQUEUE                    # preempted jobs are requeued, not cancelled

AccountingStorageType=accounting_storage/slurmdbd

# NodeAddr pins the literal IP so slurmd never has to resolve its own name (see Part 4)
NodeName=c[1-2] NodeAddr=10.0.0.[2-3] CPUs=2 RealMemory=2900 State=UNKNOWN

# Partitions = queues. This is the condo model in two lines.
PartitionName=normal Nodes=c[1-2] Default=YES MaxTime=02:00:00 State=UP PriorityTier=1
PartitionName=owners Nodes=c[1-2]              MaxTime=24:00:00 State=UP PriorityTier=100

Each non-obvious knob is doing real work:

  • cons_tres + CR_Core_Memory schedule at the core/memory level so two small jobs can share a node. Whole-node scheduling would waste a 128-core machine on a 1-core job.
  • sched/backfill lets small jobs jump ahead to fill gaps as long as they don’t delay the big job at the front of the queue. It’s the trick that keeps utilization high, and the part of scheduling people most often misunderstand.
  • preempt/qos + PreemptMode=REQUEUE are the heart of the condo deal: a higher-priority QOS can evict a lower one, and the bumped guest goes back in the queue rather than the bin. Guests get free cycles, and a preempted job is rerun later instead of being cancelled (it starts over from the beginning unless it checkpoints its own progress).
  • ReturnToService=2 is a small one I earned in Part 4. When a node briefly drops out during a boot race, this lets it rejoin on its own once slurmd re-registers, instead of sitting down until I manually resume it.
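A quick way to confirm these knobs are actually in effect is to compare them against what the config (or the running controller) reports. The following is a minimal sketch, not part of the original build: `check_condo_knobs` is a hypothetical helper name, and it only checks the four scheduling values used in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "Key=Value" lines on stdin (slurm.conf style,
# spaces around "=" tolerated) and reports any scheduling knob that differs
# from the values used in this post.
check_condo_knobs() {
  awk -F'=' '
    BEGIN {
      want["SchedulerType"] = "sched/backfill"
      want["SelectType"]    = "select/cons_tres"
      want["PreemptType"]   = "preempt/qos"
      want["PreemptMode"]   = "REQUEUE"
    }
    /^[ \t]*#/ { next }                    # skip comment lines
    {
      key = $1; val = $2
      gsub(/[ \t]/, "", key)               # strip padding around the key
      sub(/[ \t]*#.*/, "", val)            # drop a trailing comment
      gsub(/[ \t]/, "", val)
      if (key in want) seen[key] = val
    }
    END {
      bad = 0
      for (k in want) {
        if (!(k in seen) || seen[k] != want[k]) {
          printf "MISMATCH %s: want %s, got %s\n", k, want[k], ((k in seen) ? seen[k] : "(unset)")
          bad = 1
        }
      }
      if (!bad) print "OK"
      exit bad
    }'
}
```

On the head node this could be fed from the file (`check_condo_knobs < /etc/slurm/slurm.conf`) or from `scontrol show config`, assuming its output keeps the `Key = Value` layout on your Slurm version.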

Declare memory under the node’s real RAM

RealMemory=2900 is deliberately conservative, well under the node’s physical memory. If you declare every last megabyte, the OS has no headroom, nodes OOM, and Slurm marks them DRAINED. Leaving a margin is the difference between a stable cluster and one that mysteriously sheds nodes under load.
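One way to pick that margin is to start from what `slurmd -C` detects on the node and subtract a reserve for the OS. This is a sketch under assumptions: `suggest_realmemory` is a hypothetical helper, and the 1024 MB default reserve is an illustrative number, not a Slurm recommendation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `slurmd -C` output on stdin, finds the detected
# RealMemory (in MB), and prints a value with $1 MB held back for the OS.
# The 1024 MB default is an assumption for this lab.
suggest_realmemory() {
  local reserve_mb=${1:-1024}
  awk -v reserve="$reserve_mb" '
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^RealMemory=/) { split($i, kv, "="); detected = kv[2] }
    }
    END {
      # Fail if the field is missing or the reserve eats all the memory.
      if (detected == "" || detected - reserve < 1) exit 1
      printf "RealMemory=%d\n", detected - reserve
    }'
}
```

On a compute node, `slurmd -C | suggest_realmemory 1024` would print a line you can paste into the NodeName definition.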

Bring it up, and the three-suspect debugging rule

The same slurm.conf goes on the head node and into the nodes’ overlay. Rebuild the overlay, then restart the controller:

cp /etc/slurm/slurm.conf /srv/warewulf/overlays/generic/rootfs/etc/slurm/slurm.conf
wwctl overlay build
systemctl restart slurmctld
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up    2:00:00      2   idle c[1-2]
owners       up 1-00:00:00      2   idle c[1-2]

Both partitions, both nodes idle. That’s the green light.
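If you want the same green light in a script rather than by eye, a small filter over sinfo's node-oriented output is enough. This is a sketch: `nodes_not_idle` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "node state" pairs on stdin and prints every
# node that is not idle, once each. No output means all nodes are idle.
nodes_not_idle() {
  awk '$2 != "idle" { print $1, $2 }' | sort -u
}
```

Typical use would be `sinfo -N -h -o '%N %T' | nodes_not_idle`. Each node appears once per partition in that listing, which is why the helper de-duplicates.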

If nodes show DOWN/UNKNOWN, it’s almost always one of three things

In my experience nearly every “node won’t join” case is one of: (a) clock skew, (b) a mismatched Munge key, or (c) a name that doesn’t resolve. Every one traces back to a specific earlier step in this series, which is exactly why we did trust and naming first. On EL10 there’s a sneakier flavor of (c): the node resolving its own name to an IPv6 link-local address. That’s what NodeAddr in the config above quietly defends against.
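The three suspects can each be checked with one short command. The helpers below are a hedged sketch: the function names and the 1-second skew threshold are assumptions for this lab, and the underlying tools (`chronyc tracking`, `munge -n`, `unmunge`, `getent ahosts`) are the standard ones rather than anything Slurm-specific.

```shell
#!/usr/bin/env bash
# Hypothetical triage helpers, one per suspect. The function names and the
# 1-second skew threshold are assumptions for this lab, not Slurm settings.

# (a) Clock skew: reads `chronyc tracking` output on stdin and succeeds if
# the "System time" offset is at most $1 seconds (default 1).
clock_offset_ok() {
  awk -v max="${1:-1}" '
    /^System time/ { found = 1; if ($4 + 0 > max + 0) bad = 1 }
    END { exit ((found && !bad) ? 0 : 1) }'
}

# (b) Munge key: encode a credential here and decode it on node $1.
# A mismatched key makes unmunge fail on the remote side.
munge_roundtrip_ok() {
  munge -n | ssh "$1" unmunge > /dev/null
}

# (c) Naming: succeeds if the address is IPv6 link-local (fe80::/10).
is_link_local_v6() {
  case "$1" in
    [fF][eE][89aAbB][0-9a-fA-F]:*) return 0 ;;
    *) return 1 ;;
  esac
}

# (c, continued) Succeeds if any address that name $1 resolves to is
# link-local, which is the EL10 failure mode described above.
resolves_to_link_local() {
  local addr
  for addr in $(getent ahosts "$1" | awk '{ print $1 }'); do
    is_link_local_v6 "$addr" && return 0
  done
  return 1
}
```

In practice: `chronyc tracking | clock_offset_ok` on both ends, `munge_roundtrip_ok c1` from the head node, and `resolves_to_link_local c1` on the node itself.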

QOS: where the owner class gets its power

preempt/qos means the preemption rule lives in the QOS layer, not the partitions. Two QOS, two accounts, and a user who can submit to both:

sacctmgr -i add qos normal
sacctmgr -i add qos owners
sacctmgr -i modify qos normal set priority=10
sacctmgr -i modify qos owners set priority=1000
sacctmgr -i modify qos owners set preempt=normal     # owners QOS evicts normal QOS

sacctmgr -i add account physics Description="Node owners (condo)"
sacctmgr -i add account public  Description="Guest users"
sacctmgr -i add user bobby account=physics qos=normal,owners defaultqos=owners

That preempt=normal line is the whole condo contract in one assignment: an owners job is allowed to evict a normal job.
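Before running the demo it is worth confirming the contract landed in the accounting database. A minimal sketch, assuming sacctmgr's parsable output (`-n` for no header, `-P` for pipe-delimited rows); `qos_can_preempt` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads pipe-delimited QOS rows "name|priority|preempt"
# on stdin and succeeds if QOS $1 lists QOS $2 among those it may preempt.
qos_can_preempt() {
  awk -F'|' -v hi="$1" -v lo="$2" '
    $1 == hi {
      n = split($3, targets, ",")          # preempt list is comma-separated
      for (i = 1; i <= n; i++) if (targets[i] == lo) ok = 1
    }
    END { exit (ok ? 0 : 1) }'
}
```

Fed from the live database, that would be `sacctmgr -nP show qos format=name,priority,preempt | qos_can_preempt owners normal`.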

Fairshare is the next layer, and the part most people miss

I kept scheduling priority simple here (priority/basic) and let QOS do the preemption, because preemption is the demo. The layer Sherlock adds on top is fairshare (priority/multifactor): it tracks how much each account has used recently and lowers the priority of heavy users so that lighter ones catch up and the queue self-balances over time. Stack it with the condo model and you get the full picture: owners get guaranteed hardware, and fairshare keeps the split of idle cycles among guests equitable.
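The "recently used" part is a half-life decay: old usage counts for less and less. The snippet below only illustrates that decay idea with made-up numbers. It is not Slurm's actual fairshare formula, and `decayed_usage` is a hypothetical name; in a real config the half-life is set with PriorityDecayHalfLife in slurm.conf.

```shell
#!/usr/bin/env bash
# Illustration only: how much a past chunk of usage still "weighs" after
# age_days, given a half-life in days. Not Slurm's full fairshare formula.
# Args: raw_usage age_days half_life_days
decayed_usage() {
  awk -v u="$1" -v age="$2" -v hl="$3" \
    'BEGIN { printf "%.1f\n", u * 0.5 ^ (age / hl) }'
}
```

With a 7-day half-life, `decayed_usage 1000 7 7` prints 500.0 and `decayed_usage 1000 14 7` prints 250.0: a heavy week of usage stops dragging an account's priority down after a few half-lives.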

The money shot: preemption, live

Now the demo, run as a regular user (bobby), not root, the way a real user would. First, fill both nodes with a long, low-priority guest job and confirm it’s running:

$ sbatch -p normal --qos=normal -J guest -N2 --exclusive --wrap='sleep 600'
$ squeue -o '%.6i %.8j %.8u %.8q %.2t %R'
 JOBID     NAME     USER      QOS ST NODELIST(REASON)
    10    guest    bobby   normal  R c[1-2]

Running on both nodes. Now an owner arrives needing the same hardware, and watch what happens:

$ sbatch -p owners --qos=owners -J owner -N2 --exclusive --wrap='hostname; sleep 40'
$ squeue -o '%.6i %.8j %.8u %.8q %.2t %R'
 JOBID     NAME     USER      QOS ST NODELIST(REASON)
    10    guest    bobby   normal PD (BeginTime)
    11    owner    bobby   owners  R c[1-2]

There it is. The guest job flips from R to PD (pending, requeued) and the owner job starts immediately on the same nodes. No human intervened. The scheduler enforced the condo deal on its own. And when the owner job finishes, the guest comes straight back:

$ squeue -o '%.6i %.8j %.8u %.8q %.2t %R'
 JOBID     NAME     USER      QOS ST NODELIST(REASON)
    10    guest    bobby   normal  R c[1-2]

Run, preempted, requeued, restarted, all automatic. (A requeued job starts again from the beginning of its script; it does not resume mid-run.)
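Because a requeued job starts from the top, a guest job that wants to keep its progress has to record it itself. The following is a minimal sketch of that pattern, not something from the original lab: the checkpoint file name, the step loop, and the helper names are all hypothetical, while `--requeue`, `SLURM_JOB_ID`, and `SLURM_RESTART_COUNT` are standard sbatch features.

```shell
#!/usr/bin/env bash
#SBATCH --requeue                 # allow Slurm to requeue this job
# Hypothetical checkpoint pattern: remember the last finished step in a file
# on shared storage so a requeued run can skip what is already done.
# The file name below is an assumption.
CKPT=${CKPT:-$HOME/guest.ckpt}

# Print the first step that still needs to run.
next_step() {
  if [ -f "$CKPT" ]; then
    echo $(( $(cat "$CKPT") + 1 ))
  else
    echo 1
  fi
}

# Run steps up to $1, recording each one only after it completes.
run_steps() {
  local total=$1 step
  step=$(next_step)
  echo "restart count: ${SLURM_RESTART_COUNT:-0}, starting at step $step"
  while [ "$step" -le "$total" ]; do
    sleep "${STEP_SECONDS:-60}"        # stand-in for one unit of real work
    echo "$step" > "$CKPT"
    step=$(( step + 1 ))
  done
}

# Only start the work when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  run_steps 10
fi
```

Submitted with `sbatch -p normal --qos=normal script.sh`, a preempted run would pick up at the step after the last one written to the checkpoint file instead of repeating everything.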

Why this is the one thing to screen-record

In three commands you’ve demonstrated the exact behavior the whole system exists for: owners get instant access to their hardware, guests get idle cycles but yield gracefully, and their work is requeued rather than destroyed. If you record one thing from a build like this, make it this moment.

Honest limits (what this lab is not)

The series opened with scope honesty, so it should close with it too. This proves I can operate the stack, not that I’ve reproduced a supercomputer:

  • No low-latency fabric. Virtio Ethernet, not InfiniBand/RDMA. MPI runs but isn’t fast, and topology-aware scheduling is something I’ve read about, not reproduced.
  • NFS, not Lustre. A real parallel filesystem (MDS/OSS/OST) is a lab of its own. I used NFS precisely so the bottleneck is visible rather than hidden. It does work, though: a job running as bobby on both nodes wrote to the same $HOME over NFS, which is the whole shared-storage contract a real job depends on.
  • One physical host. Every “node” shares one CPU and disk, so this is an operations and correctness exercise, not a benchmark. The 6 GiB-per-node RAM ceiling from Part 4 is a symptom of exactly that.

The honest next layers (InfiniBand and topology-aware scheduling, Lustre internals, GPU scheduling with gres, cgroup job isolation) are things I’d learn on real hardware. Being able to say “I built X end to end, and I know Y is the next layer” is a stronger, more honest position than pretending a single box is a cluster.

How it all maps to Sherlock

Everything in this series has a direct counterpart on the real cluster:

What I built here | What it is on Sherlock
owners vs normal + preemption | The condo model: groups buy nodes, others use idle cycles
Slurm fairshare + QOS | Keeps thousands of users equitable
Warewulf 4 stateless boot | Thousands of identical, drift-free nodes; a flaky node just re-images
OpenHPC curated stack | How the software environment avoids drift
Lmod modules | How users select among conflicting software versions
Munge + chrony | Cluster auth, and the clock-skew gotcha behind most “won’t join” bugs
NFS vs Lustre | Why $HOME is NFS but $SCRATCH is a parallel filesystem

The one takeaway

If you remember a single thing from this whole series, make it the idea from Part 1: compute nodes are cattle, not pets. Every design choice here (stateless images, MAC-based provisioning, a curated stack, partitions instead of hand-tuned machines) falls out of that one principle. Internalize it and an HPC cluster stops looking like a thousand servers and starts looking like one system you can actually reason about.

That was the goal. Not to own a supercomputer, but to learn to think like the person who runs one.