# manage_slurm
The `manage_slurm` command can be used to manage user-initiated SLURM clusters. It may also be used by an administrator to manage the ECCO-level SLURM cluster.
## General help

```bash
manage_slurm --help
```
## Help for subcommands

```bash
manage_slurm [command] --help
```
## Show available clusters

```bash
manage_slurm list
```
## Create a cluster

```bash
manage_slurm new machine1,machine2,...
```

where the first node becomes the “head node”. A reservation (see `biohpc_res`) is needed on all nodes.
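For example, to create a two-node cluster (the node names here are placeholders; use machines you have actually reserved):

```bash
# cbsueccosl01 becomes the head node of the new cluster
manage_slurm new cbsueccosl01,cbsuecco02
```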
## Add a node

```bash
manage_slurm addNode masterNode machine
```

will add `machine` to the cluster that has `masterNode` as the head node. Again, a reservation is needed on `machine`, and all users currently with cluster access also need to be added to that reservation.
## Example (complete)

Adjust these:

```bash
node=cbsuecco10
account=2556
headnode=cbsueccosl01
```
then run this:

```bash
biohpc_res new $account --server $node --hours 72
biohpc_res list
```
Now grab the reservation number from the `biohpc_res list` output:

```bash
reservation=12345
```
then hide the reservation, grant the ECCO group access to it, and add the node to the cluster:

```bash
biohpc_res edit $reservation --hide
biohpc_res edit $reservation --add ECCO  # adds the entire ECCO group to the reservation
manage_slurm addNode $headnode $node
```
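Afterwards, you can verify that the node joined, using the same commands shown elsewhere on this page:

```bash
manage_slurm list   # the cluster should now include $node
sinfo               # $node should appear in the partition
```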
## Debugging SLURM

Sometimes, things go wrong. For instance,

```
> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
regular*     up   infinite      2  drain cbsuecco02,cbsueccosl04
```
What does `drain` mean, and how do we put the nodes back into service?
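For a quick overview before digging into individual nodes, standard SLURM's `sinfo -R` prints the reason, user, and timestamp for every node that is down or drained:

```bash
sinfo -R   # one line per drained/down node, including the Reason field
```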
### Find the reason

```bash
scontrol show nodes=cbsueccosl04 | grep Reason
```
which might return
```
Reason=Kill task failed [root@2025-02-21T18:19:11]
```
This means a job crashed or timed out, leaving open files behind; that made it hard for SLURM to kill the job, and a timeout kicked in. If you notice something like this, you (as an ‘operator’ of your cluster) can re-activate the node by running

```bash
scontrol update nodename=cbsueccosl04 state=resume
```
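If several nodes are drained, a small loop can resume them all at once. This is a sketch built on standard SLURM options (`-t drain` selects drained nodes, `-h` suppresses the header, `-o "%n"` prints one hostname per line); check each node’s Reason before resuming it.

```bash
# resume every node currently in the "drain" state
for node in $(sinfo -t drain -h -o "%n"); do
    scontrol update nodename=$node state=resume
done
```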