Request for Classic confinement: Slurm

@egeeirl - I’m not saying you should continue to press on making snap_daemon work, since based on other comments it won’t fit the needs for --user, but for posterity: initgroups() uses setgroups() under the hood in a non-sandbox-compliant manner (so you’d need to either patch it or use the LD_PRELOAD technique). I’ve updated system-usernames to mention this.
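
For reference, here is a minimal sketch of that sequence (illustrative only, not snapd or Slurm source; the drop_to_user helper and its error handling are my own), showing where initgroups() pulls setgroups() in:

```c
/* Minimal sketch (not snapd or Slurm source): the conventional pre-exec
 * privilege-drop sequence for running work as a target user.
 * initgroups() calls setgroups() internally, and that is the call a
 * strictly confined snap's sandbox restricts. */
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <unistd.h>

static int drop_to_user(const char *username)
{
    struct passwd *pw = getpwnam(username);
    if (pw == NULL) {
        fprintf(stderr, "unknown user: %s\n", username);
        return -1;
    }

    /* Initialize supplementary groups; under the hood this is setgroups(). */
    if (initgroups(pw->pw_name, pw->pw_gid) != 0) {
        perror("initgroups");
        return -1;
    }
    /* Then drop the primary gid and uid before exec'ing the work. */
    if (setgid(pw->pw_gid) != 0 || setuid(pw->pw_uid) != 0) {
        perror("setgid/setuid");
        return -1;
    }
    return 0;
}
```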

@egeeirl and @jamesbeedy - sorry for the delay, this is a new use case that needed to be investigated and thought through.

@pedronis - this seems to be a new use case for classic. Some discussion happened in: Can a confined Snap run as a different uid and or guid?.

IME, slurm is an orchestration tool for HPC. AIUI, slurm can and does utilize the snap_daemon user for certain actions, but in certain environments the slurmd process itself needs to run (arbitrary) commands as the effective uid/gid of the user in order to access resources for that user (@jamesbeedy referred to this as the ‘active directory user’, which seems to be tied to slurm’s concept of ‘partitions’; see the overview url). Other use cases are to run commands as other users for accounting purposes (AIUI, the current design of slurm is such that it performs resource accounting by tracking the uid that the process ran as, as opposed to something like having users log in with an account, executing all commands as the same user, and performing tracking via use of the account).

The slurm website states “Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters”. So, like flowbot, this is similar to juju. The closest use case in our processes for classic is ‘management snaps’, which we’ve identified as an unsupported use case.

In my limited understanding of slurm, we could perhaps require that slurm be modified to run anything that should be non-root as snap_daemon with slurm then being further modified to perform accounting with this in mind, but IME that would change how users expect to use slurm (ie, they’d have to ‘login’ in some manner). Even if that change was made, slurm’s functionality wrt partitions would be limited.

While orchestration tools like slurm (and juju, and potentially the recent flowbot request) do ‘manage’ systems, these systems are not for managing fleets of devices/laptops/servers/etc which have login users; they are instead about scale-out computing on systems primarily without login users. In that light, I think we should consider the ‘orchestration snaps’ use case as something different from ‘management snaps’.

@pedronis - thoughts?

@jdstrand @pedronis slurm is a resource scheduler for HPC applications. It is closer in comparison to Apache Spark than to juju. For example, we use juju to facilitate the slurm lifecycle; we can scale our slurm clusters using juju to meet the resource needs of individual HPC jobs that users may run on slurm.

The HPC space is predominantly an enterprise and academic space, in which heightened security and extended red tape are the norm. Active directory realms, access tracking, and resource accounting on a per-user basis are required, with no exceptions.

An example:
A user will have files in a location in an active directory controlled filesystem that the slurmd (compute daemon) process needs to access on all nodes in the cluster.

When the user executes a slurm job from a central location using srun/sbatch, slurmd executes under the effective uid of the user that kicked off the job.
slurmd needs to execute as the effective uid of the active directory user that kicked off the process so that it can access files in that user’s space on each of the compute nodes.

We have tried many alternatives to support these conventions outside of producing a classic snap.

In conclusion, we have determined that we need to use a classically confined snap to support these use cases.

Hope this helps. Thanks!

Slurm isn’t managing or even really orchestrating anything. Depending on the use case, it is effectively middleware. Take this use case for example:

An organization wants to use StarCCM to simulate some heavy-duty fluid dynamics. In this case, Slurm is simply a tool that StarCCM leverages to complete the task.

Since this is a massively intense computational workload, it is extremely useful to delegate it out to multiple compute nodes. That’s where Slurm comes in.

StarCCM, MPI, and Slurm (and other components) work together to slice the workload into computational chunks that can be resolved across N compute nodes via Slurm. As such, Slurm isn’t necessarily managing a host in the traditional sense. Slurm also isn’t really orchestrating or provisioning anything, either.

In the current use case, Juju is used to deploy and provision Slurm across several clusters.

That is a fair distinction and perhaps we’ll want to define this as a different use case from orchestration. The main point, though, is that slurm isn’t about managing the compute nodes’ OS, user, etc configuration (ie, like puppet or chef might do); it is about putting workloads on them (which in my mind is orchestrating the computation). Semantics aside, ‘management snaps’ is not a supported use case for classic and I’m putting forth that slurm is not a ‘management snap’ but rather something else, which IME is an important distinction when considering slurm for classic confinement.

Semantics aside, ‘management snaps’ is not a supported use case for classic and I’m putting forth that slurm is not a ‘management snap’ but rather something else, which IME is an important distinction when considering slurm for classic confinement.

Sure, that seems fine. So is this something that you Snapcrafters will discuss on your side and then get back to me, or is there anything else I need to do?

@egeeirl - classic snaps run unrestricted on the systems they are installed on, and for this reason we treat reviews of classic snaps differently from other sorts of reviews. For use cases for classic snaps that aren’t listed in our current processes, reviewers ask for snapd architect involvement, which I did when I asked @pedronis to comment. That discussion will happen in this topic, and for the moment there is nothing else you need to do, though he may have follow-up questions for you/other reviewers.

For use cases for classic snaps that aren’t listed in our current processes, reviewers ask for snapd architect involvement

Perfect, thank you :+1:

@jdstrand @pedronis nudge, bump

@jdstrand the ways this seems to be different from a management snap are:

  • I imagine it is normally supposed to take over a machine with no workloads not related to it usually running
  • It wouldn’t run quietly in a corner taking control only from time to time
  • In particular, except for developer machines, we wouldn’t expect this to run on a desktop?
  • We don’t expect casual users to install this at all?

These same arguments, though, make it a bit unclear why it couldn’t be confined at some point?

@pedronis

  • I imagine it is normally supposed to take over a machine with no workloads not related to it usually running

    Correct.

  • It wouldn’t run quietly in a corner taking control only from time to time

    Correct.

Slurm is responsible for the accounting and scheduling of resources needed to process queued jobs (hpc workloads) and would be using nearly all system resources at all times.

(EDIT) Correction/Expansion:
slurmctld is responsible for the scheduling and accounting of machine resources for nodes that belong to the slurm cluster. It does so by communicating with the slurmd units to acquire their resource usage, and with slurmdbd to commit resource usage and cluster metrics. slurmdbd is responsible for transacting with the database. slurmd is the compute daemon that executes the slurmstepd process, which is the program responsible for performing the work to get the job done/running the computation. In a slurm deployment, each component (slurmd, slurmctld, slurmdbd) runs on its own server. There are generally one or two slurmctld components (an active controller and a backup controller), one or two slurmdbd (active and backup), and N slurmd nodes. The slurmd, and subsequently the slurmstepd process, are what use the resources of the node to carry out the computation.
Hopefully this helps clear things up.
(END EDIT)

  • In particular, except for developer machines, we wouldn’t expect this to run on a desktop?
    Correct.
  • We don’t expect casual users to install this at all?
    Correct. Unless they want to run a real HPC workload on their local box in dev mode (which is achieved by setting snap.mode=all).

The reason it can’t be confined is that the process needs to run as the effective uid of the user running the workload.

Imagine you have 1000s of users trying to run workloads on a large cluster where all of the users are members of an active directory realm. User home, scratch space, and long-term storage are all supplied as mounted network filesystems to all nodes in the cluster. In this way an active directory user can have private and shared filesystem space on every node in the cluster. The same users are running HPC workloads that need access to their user space in the active directory controlled filesystem(s). For this to be possible, the compute daemon process of slurm, slurmstepd, needs to execute under the effective uid of the active directory user in order for slurm to account for the resources used by the job and, more importantly, so that slurmstepd can access files owned by the active directory user on the network filesystems.

Slurm runs seteuid() and setegid() to drop the slurmstepd compute process’s privileges to the effective uid/gid of the user executing the job so that the process can access resources owned by that user.
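
As a rough illustration of that pattern (a minimal sketch only, not the Slurm excerpt referenced below; the run_step_as_user helper is hypothetical):

```c
/* Minimal sketch, not the actual Slurm source: temporarily switch the
 * daemon's effective uid/gid to the submitting user, do the work, then
 * restore root for the daemon's own bookkeeping. */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int run_step_as_user(uid_t job_uid, gid_t job_gid)
{
    /* Change the effective gid first, while still privileged. */
    if (setegid(job_gid) != 0) {
        perror("setegid");
        return -1;
    }
    if (seteuid(job_uid) != 0) {
        perror("seteuid");
        return -1;
    }

    /* ... access the user's files / launch the job step here ... */

    /* Restore the daemon's original (root) identity. */
    if (seteuid(0) != 0 || setegid(0) != 0) {
        perror("restore privileges");
        return -1;
    }
    return 0;
}
```

Under strict confinement, these uid/gid changes to arbitrary users (along with the setgroups() path noted earlier in the thread) are what the sandbox currently blocks, which is the limitation driving this request.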

An excerpt from slurm code that drops privs

Right, but that is a current limitation of the sandbox. It is plausible that it could be extended in ways that would allow your snap to work under strict confinement. One of the considerations for classic that we ask ourselves is whether something could ever be made strict, and @pedronis and I believe that to be ‘yes’.

@jamesbeedy - thank you again for the additional information. I believe this is quite close to a decision now.

I believe this is a good summary. As @jamesbeedy mentioned, the answers to all of these are ‘correct’ (yes). Based on this distillation, IME, this is a new supported use case for classic. If you agree, can you confirm so we can proceed with the request?

Yes, it seems a new supported case (not sure it will be easy to distill it for re-application though), with a possible path to confinement at some point, so I agree.

The requirements for classic are understood. @advocacy, can you please perform the vetting?

Vetting done. +1 from advocacy.

Granting use of classic. This is now live.

@jdstrand @popey @pedronis @alexmurray @egeeirl thank you!

Thanks, I’ll write something up for you to review.