The condo model and the money shot: watching Slurm preempt a job
Sherlock's condo model: research groups buy nodes and get guaranteed priority; everyone else uses idle cycles and gets bumped when an owner shows up. Here we configure it on two VMs and watch a high-priority job preempt a running one, live.
Everything so far (the networks, the shared trust, the stateless nodes) existed to get us here: a working cluster that can make decisions. This final part configures the thing Sherlock is actually known for among its users, the condo model, and ends with the demo worth recording: a high-priority job evicting a running one, in real time.
The mental model: a condo, not a hotel
Sherlock runs an owner-based (“condo”) model, and it’s a genuinely elegant piece of social engineering encoded in a scheduler:
What the condo model actually is
Research groups buy nodes and get guaranteed priority on that hardware, like owning a condo. But idle owned hardware is wasteful, so everyone else can run on it on a low-priority, preemptible basis, and gets bumped the instant an owner needs their nodes back. Owners get the guarantee they paid for, the cluster stays busy with guest work in the gaps, and nobody’s idle hardware goes to waste. Two partitions and a preemption rule express the entire deal.
We replicate it with our two compute nodes: an owners partition that preempts, and a
normal partition that yields.
The core slurm.conf
This is the config that earns its keep. The interesting lines aren’t the node definitions, they’re the scheduling knobs:
ClusterName=lab
SlurmctldHost=sms
ReturnToService=2 # a node that briefly went down rejoins on its own
AuthType=auth/munge
# Scheduling: cores/memory placement + QOS-driven preemption
SchedulerType=sched/backfill
SelectType=select/cons_tres # schedule by cores/memory, not whole nodes
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/basic
PreemptType=preempt/qos # a high-priority QOS can preempt a low one
PreemptMode=REQUEUE # preempted jobs go back in the queue instead of being cancelled
AccountingStorageType=accounting_storage/slurmdbd
# NodeAddr pins the literal IP so slurmd never has to resolve its own name (see Part 4)
NodeName=c[1-2] NodeAddr=10.0.0.[2-3] CPUs=2 RealMemory=2900 State=UNKNOWN
# Partitions = queues. This is the condo model in two lines.
PartitionName=normal Nodes=c[1-2] Default=YES MaxTime=02:00:00 State=UP PriorityTier=1
PartitionName=owners Nodes=c[1-2] MaxTime=24:00:00 State=UP PriorityTier=100
Each non-obvious knob is doing real work:
- cons_tres + CR_Core_Memory schedule at the core/memory level so two small jobs can share a node. Whole-node scheduling would waste a 128-core machine on a 1-core job.
- sched/backfill lets small jobs jump ahead to fill gaps as long as they don’t delay the big job at the front of the queue. It’s the trick that keeps utilization high, and the part of scheduling people most often misunderstand.
- preempt/qos + PreemptMode=REQUEUE are the heart of the condo deal: a higher-priority QOS can evict a lower one, and the bumped guest goes back in the queue rather than the bin. Guests get free cycles without permanently losing work.
- ReturnToService=2 is a small one I earned in Part 4. When a node briefly drops out during a boot race, this lets it rejoin on its own once slurmd re-registers, instead of sitting in the down state until I manually resume it.
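One way to confirm the running controller actually loaded those knobs is to compare them against what `scontrol show config` reports. This is a minimal sketch: `check_knobs` is a hypothetical helper of my own naming, not a Slurm tool, and it assumes the usual `Key = value` layout of that command's output, which may differ slightly between Slurm versions.

```shell
# check_knobs: read `scontrol show config`-style "Key = value" lines on stdin
# and compare the four scheduling knobs against the values set in slurm.conf.
# (Hypothetical helper; the expected values mirror the config in this post.)
check_knobs() {
  awk -F'[ \t]*=[ \t]*' '
    BEGIN {
      want["SchedulerType"] = "sched/backfill"
      want["SelectType"]    = "select/cons_tres"
      want["PreemptType"]   = "preempt/qos"
      want["PreemptMode"]   = "REQUEUE"
    }
    ($1 in want) {
      seen[$1] = 1
      sub(/[ \t]+$/, "", $2)            # drop trailing whitespace on the value
      if ($2 != want[$1]) {
        printf "MISMATCH %s: got %s, want %s\n", $1, $2, want[$1]
        bad = 1
      }
    }
    END {
      for (k in want) if (!(k in seen)) { printf "MISSING %s\n", k; bad = 1 }
      if (!bad) print "OK"
      exit bad + 0
    }'
}
```

On the head node it would be used as `scontrol show config | check_knobs`. It prints `OK` when all four values match and exits non-zero otherwise.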
Declare memory under the node’s real RAM
RealMemory=2900 is deliberately conservative, well under the node’s physical memory. If
you declare every last megabyte, the OS has no headroom, nodes OOM, and Slurm marks them
DRAINED. Leaving a margin is the difference between a stable cluster and one that
mysteriously sheds nodes under load.
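The arithmetic behind a conservative RealMemory is just "physical memory minus a reserve for the OS". A small sketch, where both helper names and the size of the reserve are my own illustrative assumptions (the right reserve depends on what else runs on the node):

```shell
# suggest_realmemory TOTAL_MIB RESERVE_MIB
# Prints a RealMemory value that leaves RESERVE_MIB for the OS.
# (Hypothetical helper; the reserve you pick is a judgment call.)
suggest_realmemory() {
  local total_mib=$1 reserve_mib=$2
  if [ "$reserve_mib" -ge "$total_mib" ]; then
    echo "reserve must be smaller than total memory" >&2
    return 1
  fi
  echo $(( total_mib - reserve_mib ))
}

# node_total_mib: physical memory of the current host in MiB.
# MemTotal in /proc/meminfo is reported in kB.
node_total_mib() {
  awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}
```

Run on a compute node, `suggest_realmemory "$(node_total_mib)" 1024` would print a value with 1 GiB of headroom. `slurmd -C` prints the hardware line Slurm detects for the node, which is a useful cross-check for the upper bound.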
Bring it up, and the three-suspect debugging rule
The same slurm.conf goes on the head node and into the nodes’ overlay. Copy it, rebuild
the overlay, then restart the controller:
cp /etc/slurm/slurm.conf /srv/warewulf/overlays/generic/rootfs/etc/slurm/slurm.conf
wwctl overlay build
systemctl restart slurmctld
sinfo
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 2:00:00 2 idle c[1-2]
owners up 1-00:00:00 2 idle c[1-2]
Both partitions, both nodes idle. That’s the green light.
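For a scripted version of that green light, the node states can be filtered rather than eyeballed. A sketch, assuming node-oriented `sinfo` output in `name state` form; `unhealthy_nodes` is a hypothetical helper name:

```shell
# unhealthy_nodes: read "node state" pairs on stdin and print every node
# whose state is not one of the normal working states.
# A trailing "*" (node not responding) also counts as unhealthy here.
unhealthy_nodes() {
  awk '$2 !~ /^(idle|mixed|allocated|completing)$/ { print $1 " " $2 }' | sort -u
}
```

Usage would look like `sinfo -h -N -o '%N %T' | unhealthy_nodes`. Empty output means every node is in a working state; for anything it prints, `scontrol show node <name>` shows the recorded reason.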
If nodes show DOWN/UNKNOWN, it’s almost always one of three things
In my experience 99% of “node won’t join” cases are: (a) clock skew, (b) a
mismatched Munge key, or (c) a name that doesn’t resolve. Every one traces back to a
specific earlier step in this series, which is exactly why we did trust and naming
first. On EL10 there’s also a sneakier variant of (c): the node resolving its own name
to an IPv6 link-local address. That’s what NodeAddr in the config above quietly
defends against.
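The three suspects can be checked in order from the head node. This is a sketch under assumptions: it presumes passwordless ssh from the head node to the compute node, and `triage_node` and `is_link_local_v6` are hypothetical helper names, not part of Slurm, Munge or chrony.

```shell
# True if the address looks like IPv6 link-local
# (fe80::/10; in practice these addresses start with "fe80:").
is_link_local_v6() {
  case "$1" in
    [fF][eE]80:*) return 0 ;;
    *)            return 1 ;;
  esac
}

# triage_node NODE: run from the head node, e.g. `triage_node c1`.
# Assumes passwordless ssh to NODE.
triage_node() {
  local node=$1 addr
  echo "== (a) clock: this host's offset from NTP =="
  chronyc tracking | grep -E 'System time|Leap status'
  echo "== (b) munge: a credential minted here must decode on $node =="
  munge -n | ssh "$node" unmunge | grep STATUS
  echo "== (c) naming: what $node resolves its own name to =="
  for addr in $(ssh "$node" getent ahosts "$node" | awk '{ print $1 }' | sort -u); do
    if is_link_local_v6 "$addr"; then
      echo "WARNING: $node resolves itself to link-local $addr"
    else
      echo "ok: $addr"
    fi
  done
}
```

Running the same `chronyc tracking` on the node itself completes check (a), since skew is about the difference between the two clocks.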
QOS: where the owner class gets its power
preempt/qos means the preemption rule lives in the QOS layer, not the partitions. Two QOS,
two accounts, and a user who can submit to both:
sacctmgr -i add qos normal
sacctmgr -i add qos owners
sacctmgr -i modify qos normal set priority=10
sacctmgr -i modify qos owners set priority=1000
sacctmgr -i modify qos owners set preempt=normal # owners QOS evicts normal QOS
sacctmgr -i add account physics Description="Node owners (condo)"
sacctmgr -i add account public Description="Guest users"
sacctmgr -i add user bobby account=physics qos=normal,owners defaultqos=owners
That preempt=normal line is the whole condo contract in one assignment: an owners job is
allowed to evict a normal job.
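It’s worth verifying that the preempt relationship landed in the accounting database the way it was typed. A sketch that parses pipe-delimited `sacctmgr` output; `qos_preempts` is a hypothetical helper name, and it assumes the three fields arrive in the order name, priority, preempt:

```shell
# qos_preempts HIGH LOW
# Reads "name|priority|preempt" lines on stdin and succeeds only if the
# HIGH QOS lists LOW in its (comma-separated) preempt field.
qos_preempts() {
  awk -F'|' -v hi="$1" -v lo="$2" '
    $1 == hi {
      n = split($3, list, ",")
      for (i = 1; i <= n; i++) if (list[i] == lo) found = 1
    }
    END { exit !found }'
}
```

Used as `sacctmgr -nP show qos format=name,priority,preempt | qos_preempts owners normal && echo "owners can evict normal"`. The reverse check, `qos_preempts normal owners`, should fail.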
Fairshare is the next layer, and the part most people miss
I kept scheduling priority simple here (priority/basic) and let QOS do the preemption,
because preemption is the demo. The layer Sherlock adds on top is fairshare
(priority/multifactor): it tracks how much each account has recently used and lowers
the priority of heavy users so lighter ones catch up, so the queue self-balances over
time. Stack it with the condo model and you get the full picture: owners get guaranteed
hardware, and fairshare keeps the guests sharing idle cycles equitable among themselves.
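To make "lowers the priority of heavy users" concrete, here is the classic fairshare formula from Slurm’s documentation, F = 2^(-U/S), where U is an account’s share of recent usage and S is its share of the allocation. This is only an illustration: current Slurm releases default to the Fair Tree algorithm instead, which algorithm Sherlock uses isn’t something this series establishes, and `fairshare_factor` is a hypothetical helper name.

```shell
# fairshare_factor USAGE SHARE
# Classic formula F = 2^(-USAGE/SHARE), with both arguments as fractions
# between 0 and 1 (SHARE must be non-zero). Prints F to three decimals.
fairshare_factor() {
  awk -v u="$1" -v s="$2" 'BEGIN { printf "%.3f\n", 2 ^ (-(u / s)) }'
}
```

An account that has used exactly its share gets `fairshare_factor 0.5 0.5`, which prints 0.500. One that has used nothing gets 1.000, and one that has used double its share drops to 0.250.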
The money shot: preemption, live
Now the demo, run as a regular user (bobby), not root, the way a real user would. First,
fill both nodes with a long, low-priority guest job and confirm it’s running:
$ sbatch -J guest -p normal --qos=normal -N2 --exclusive --wrap='sleep 600'
$ squeue -o '%.6i %.8j %.8u %.8q %.2t %R'
JOBID NAME USER QOS ST NODELIST(REASON)
10 guest bobby normal R c[1-2]
Running on both nodes. Now an owner arrives needing the same hardware, and watch what happens:
$ sbatch -J owner -p owners --qos=owners -N2 --exclusive --wrap='hostname; sleep 40'
$ squeue -o '%.6i %.8j %.8u %.8q %.2t %R'
JOBID NAME USER QOS ST NODELIST(REASON)
10 guest bobby normal PD (BeginTime)
11 owner bobby owners R c[1-2]
There it is. The guest job flips from R to PD (pending, requeued) and the owner job
starts immediately on the same nodes. No human intervened. The scheduler enforced the condo
deal on its own. And when the owner job finishes, the guest comes straight back:
$ squeue -o '%.6i %.8j %.8u %.8q %.2t %R'
JOBID NAME USER QOS ST NODELIST(REASON)
10 guest bobby normal R c[1-2]
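Run, preempted, requeued, restarted, all automatic. Note that a requeued job starts its script again from the beginning, so it only keeps progress it has checkpointed itself.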
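The middle state of that sequence can also be detected from a script, which is handy if you want to assert the behavior rather than watch for it. A sketch, assuming pipe-delimited `squeue` output in the order job ID, QOS, long-form state; `preemption_visible` is a hypothetical helper name:

```shell
# preemption_visible GUEST_QOS OWNER_QOS
# Reads "jobid|qos|state" lines on stdin and succeeds only if a job in
# GUEST_QOS is PENDING while a job in OWNER_QOS is RUNNING.
preemption_visible() {
  awk -F'|' -v g="$1" -v o="$2" '
    $2 == g && $3 == "PENDING" { guest_waiting = 1 }
    $2 == o && $3 == "RUNNING" { owner_running = 1 }
    END { exit !(guest_waiting && owner_running) }'
}
```

Right after the owner job is submitted, `squeue -h -o '%i|%q|%T' | preemption_visible normal owners && echo "guest was preempted"` should print the message. Before the owner arrives and after it finishes, the check fails because the guest is running again.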
Why this is the one thing to screen-record
In three commands you’ve demonstrated the exact behavior the whole system exists for: owners get instant access to their hardware, guests get idle cycles but yield gracefully, and their work is requeued rather than destroyed. If you record one thing from a build like this, make it this moment.
Honest limits (what this lab is not)
The series opened with scope honesty, so it should close with it too. This proves I can operate the stack, not that I’ve reproduced a supercomputer:
- No low-latency fabric. Virtio Ethernet, not InfiniBand/RDMA. MPI runs but isn’t fast, and topology-aware scheduling is something I’ve read about, not reproduced.
- NFS, not Lustre. A real parallel filesystem (MDS/OSS/OST) is a lab of its own. I used NFS precisely so the bottleneck is visible rather than hidden. It does work, though: a job running as bobby on both nodes wrote to the same $HOME over NFS, which is the whole shared-storage contract a real job depends on.
- One physical host. Every “node” shares one CPU and disk, so this is an operations and correctness exercise, not a benchmark. The 6 GiB-per-node RAM ceiling from Part 4 is a symptom of exactly that.
The honest next layers (InfiniBand and topology-aware scheduling, Lustre internals, GPU
scheduling with gres, cgroup job isolation) are things I’d learn on real hardware. Being
able to say “I built X end to end, and I know Y is the next layer” is a stronger, more honest
position than pretending a single box is a cluster.
How it all maps to Sherlock
Everything in this series has a direct counterpart on the real cluster:
| What I built here | What it is on Sherlock |
|---|---|
| owners vs normal + preemption | The condo model: groups buy nodes, others use idle cycles |
| Slurm fairshare + QOS | Keeps thousands of users equitable |
| Warewulf 4 stateless boot | Thousands of identical, drift-free nodes; a flaky node just re-images |
| OpenHPC curated stack | How the software environment avoids drift |
| Lmod modules | How users select among conflicting software versions |
| Munge + chrony | Cluster auth, and the clock-skew gotcha behind most “won’t join” bugs |
| NFS vs Lustre | Why $HOME is NFS but $SCRATCH is a parallel filesystem |
The one takeaway
If you remember a single thing from this whole series, make it the idea from Part 1: compute nodes are cattle, not pets. Every design choice here (stateless images, MAC-based provisioning, a curated stack, partitions instead of hand-tuned machines) falls out of that one principle. Internalize it and an HPC cluster stops looking like a thousand servers and starts looking like one system you can actually reason about.
That was the goal. Not to own a supercomputer, but to learn to think like the person who runs one.