I imagine it is normally supposed to take over a machine, with no unrelated workloads usually running.
It wouldn’t run quietly in a corner, taking control only from time to time.
Slurm is responsible for the accounting and scheduling of the resources needed to process queued jobs (HPC workloads), and it would be using nearly all system resources at all times.
slurmctld is responsible for the scheduling and accounting of machine resources for the nodes that belong to the Slurm cluster. It does so by communicating with the slurmd units to acquire their resource usage, and with slurmdbd to commit resource usage and cluster metrics.
slurmdbd is responsible for transacting with the database.
slurmd is the compute daemon that executes the slurmstepd process, which is the program responsible for performing the work to get the job done, i.e. running the computation.
In a Slurm deployment, each component, slurmdbd included, runs on its own server. There are generally one or two slurmctld components (an active controller and a backup controller), one or two slurmdbd components (active and backup), and N slurmd nodes. The slurmd and, subsequently, the slurmstepd processes are what use the resources of the node to carry out the computation.
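To make the layout above concrete, here is a minimal sketch in Python of a deployment description with a sanity check. This is a toy model, not a Slurm API or config format: the class name `SlurmDeployment`, its fields, and the host names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SlurmDeployment:
    """Toy model of the host layout described above (hypothetical, not a Slurm API)."""

    slurmctld_hosts: list[str]  # active controller, optionally a backup
    slurmdbd_hosts: list[str]   # active database daemon, optionally a backup
    slurmd_hosts: list[str]     # the N compute nodes

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the layout matches the description."""
        problems = []
        if not 1 <= len(self.slurmctld_hosts) <= 2:
            problems.append("expected one or two slurmctld hosts")
        if not 1 <= len(self.slurmdbd_hosts) <= 2:
            problems.append("expected one or two slurmdbd hosts")
        if not self.slurmd_hosts:
            problems.append("expected at least one slurmd node")
        # Each component runs on its own server, so no host may appear twice.
        all_hosts = self.slurmctld_hosts + self.slurmdbd_hosts + self.slurmd_hosts
        if len(set(all_hosts)) != len(all_hosts):
            problems.append("a host is assigned to more than one component")
        return problems
```

For example, `SlurmDeployment(["ctl1", "ctl2"], ["dbd1"], ["n1", "n2", "n3"]).validate()` returns an empty list, while putting a controller and the database daemon on the same host is reported as a problem.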
Hopefully this helps clear things up.
- In particular, except for developer machines, we wouldn’t expect this to run on a desktop?
- We don’t expect casual users to install this at all?
Correct. Unless they want to run a real HPC workload on their local box in dev mode (what is achieved by setting the
It can’t be confined because the process needs to run with the effective UID of the user running the workload.
Imagine you have thousands of users trying to run workloads on a large cluster, where all of the users are members of an Active Directory realm. User home directories, scratch space, and long-term storage are all supplied as network filesystems mounted on every node in the cluster. In this way an Active Directory user can have private and shared filesystem space on every node in the cluster. The same users are running HPC workloads that need access to their user space in the Active Directory controlled filesystem(s). For this to be possible, the compute daemon process of Slurm, slurmstepd, needs to execute under the effective UID of the Active Directory user, both so that Slurm can account for the resources used by the job and, more importantly, so that slurmstepd can access files owned by the Active Directory user in the network filesystems.
Slurm uses seteuid() and setegid() to drop the slurmstepd compute process's privileges to the effective UID and GID of the user executing the process, so that the process can access resources owned by the user.
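As a rough illustration of this kind of privilege drop, here is a minimal Python sketch built on the standard `os.setegid()` and `os.seteuid()` wrappers. It is not Slurm's code: the function name `run_as_user` is hypothetical, restoring the original IDs afterwards is my own choice so the sketch can be reused, and it ignores supplementary groups, which a real daemon would also need to set.

```python
import os


def run_as_user(uid: int, gid: int, work):
    """Run work() with the effective GID/UID switched to the given user, then restore.

    Hypothetical helper for illustration only; supplementary groups are not handled.
    """
    saved_euid, saved_egid = os.geteuid(), os.getegid()
    # Change the group first: once the effective UID is no longer root,
    # the process may no longer be allowed to change its GID.
    os.setegid(gid)
    os.seteuid(uid)
    try:
        return work()
    finally:
        # Restore in reverse order: regain the original UID first so that
        # switching the GID back is permitted.
        os.seteuid(saved_euid)
        os.setegid(saved_egid)
```

When this is called as root with a real user's UID and GID, any file opened inside `work()` is checked against that user's permissions, which is the behaviour the network filesystem case above relies on. Called by an unprivileged user, it only succeeds for that user's own IDs.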
An excerpt from the Slurm code that drops privileges