Request for Classic confinement: Slurm

egeeirl · June 18, 2020, 8:41pm

This request is a follow up to an issue/limitation I encountered which requires us to support a use-case with Classic.

There are common scenarios in which an operator wishes to run a Slurm command as another user, oftentimes for accounting purposes. A simple command that triggers the issue:

slurm.srun --uid 1000 -N1 -l uname -r

The command above tries to run as UID 1000 but cannot because of the confinement mode. As such, it leads me to this request.

Strict confinement is still appropriate, especially for testing and development for Slurm clusters so this request shouldn’t replace our existing Snap tracks.

jamesbeedy · June 19, 2020, 10:02pm

@jdstrand @alexmurray what do you need from us to move forward here?

jdstrand · June 22, 2020, 6:13pm

What are representative use cases where the user might specify --uid (eg, especially but not limited to “accounting purposes”)? If it is just to run something as non-root, we have system-usernames that would allow someone to run slurm.srun --uid 584788 -N1 -l uname -r.

egeeirl · June 22, 2020, 9:23pm

we have system-usernames that would allow someone to run slurm.srun --uid 584788 -N1 -l uname -r

I tried this and ran into permissions issues because of how Slurm uses system calls to switch user contexts. Let me test it once more and report back with any errors I encounter.

jdstrand · June 22, 2020, 9:57pm

Use of snap_daemon requires some care so if running into trouble, be sure to see ‘Usage considerations’ in system-usernames.

egeeirl · June 23, 2020, 9:59pm

Ran a pair of test cases today and immediately hit a wall. I have the daemons running as snap_daemon but user switching, even to the snap_daemon user, is not permitted. Here are the tests and logs:

root@slurm-test:/tmp# srun --uid 1000 -l uname -a
srun: error: initgroups: Operation not permitted
srun: fatal: Unable to assume uid=1000

dmesg log:

audit: type=1326 audit(1592949033.121:584): auid=0 uid=0 gid=0 ses=31 pid=855358 comm="srun" exe="/snap/slurm/255/slurm-bins/srun" sig=0 arch=c000003e syscall=116 compat=0 ip=0x7fe0c1516d9d code=0x50000
audit: type=1326 audit(1592949033.121:585): auid=0 uid=0 gid=0 ses=31 pid=855358 comm="srun" exe="/snap/slurm/255/slurm-bins/srun" sig=0 arch=c000003e syscall=116 compat=0 ip=0x7fe0c1517f10 code=0x50000

And trying to run as the snap_daemon user:

root@slurm-test:/tmp# srun --uid 584788 -l uname -a
srun: error: initgroups: Operation not permitted
srun: fatal: Unable to assume uid=584788

dmesg log:

audit: type=1326 audit(1592949216.864:600): auid=0 uid=0 gid=0 ses=31 pid=856524 comm="srun" exe="/snap/slurm/255/slurm-bins/srun" sig=0 arch=c000003e syscall=116 compat=0 ip=0x7f789d71ad9d code=0x50000
audit: type=1326 audit(1592949216.864:601): auid=0 uid=0 gid=0 ses=31 pid=856524 comm="srun" exe="/snap/slurm/255/slurm-bins/srun" sig=0 arch=c000003e syscall=116 compat=0 ip=0x7f789d71bf10 code=0x50000
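
The audit lines above can be decoded mechanically. The following is a minimal sketch, not part of snapd or Slurm, and the helper name decode_audit_line is hypothetical. It assumes the standard Linux constants: audit record type 1326 is a seccomp event, arch c000003e is x86_64, syscall 116 on x86_64 is setgroups, and action code 0x50000 is SECCOMP_RET_ERRNO. The syscall table in the sketch is deliberately a small excerpt.

```python
import re

AUDIT_SECCOMP = 1326            # audit record type for seccomp events
AUDIT_ARCH_X86_64 = 0xC000003E  # value shown as arch=c000003e

# Small excerpt of the x86_64 syscall table; not exhaustive.
X86_64_SYSCALLS = {105: "setuid", 106: "setgid", 116: "setgroups"}

# Seccomp filter return actions (upper 16 bits of the return value).
SECCOMP_ACTIONS = {0x00050000: "SECCOMP_RET_ERRNO"}


def decode_audit_line(line):
    """Return comm, syscall name and seccomp action for a type=1326 line.

    Hypothetical helper for illustration; returns None for other record types.
    """
    # Collect key=value pairs; values are either quoted or whitespace-free.
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    if int(fields.get("type", 0)) != AUDIT_SECCOMP:
        return None

    arch = int(fields["arch"], 16)
    number = int(fields["syscall"])
    code = int(fields["code"], 16)

    if arch == AUDIT_ARCH_X86_64:
        name = X86_64_SYSCALLS.get(number, f"syscall_{number}")
    else:
        name = f"syscall_{number}"  # other architectures are not mapped here

    action = SECCOMP_ACTIONS.get(code & 0xFFFF0000, hex(code))
    return {"comm": fields["comm"].strip('"'), "syscall": name, "action": action}
```

Fed any of the four lines above, the helper reports that srun called setgroups and that the seccomp filter made the call fail with an error, which matches the "initgroups: Operation not permitted" message printed by srun.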

For reference, the srun --uid command provides the following explanation of how the command works with users:

--uid=<user>

Attempt to submit and/or run a job as user instead of the invoking user id. The invoking user’s credentials will be used to check access permissions for the target partition. User root may use this option to run jobs as a normal user in a RootOnly partition for example. If run as root, srun will drop its permissions to the uid specified after node allocation is successful. user may be the user name or numerical user ID. This option applies to job and step allocations.

jamesbeedy · June 23, 2020, 10:15pm

hey @jdstrand, the main reason we need to execute jobs under the uid/gid of the user is that the slurmd process that executes the job needs to run as the effective uid/gid of the active directory user in order to access filesystem resources owned by that user. Also, as @egeeirl mentioned, slurmdbd accounts for the cluster resources used by each job run by each user. If we can’t execute the job under the uid/gid of the user, then we have no way to pair up the resource accounting with the user that ran the job.

jamesbeedy · June 24, 2020, 10:12pm

hey @jdstrand, does that work for you, or is there something else you need to move forward with this?

jamesbeedy · June 30, 2020, 4:11pm

@jdstrand how is it going? Is there a way for us to move forward on this?

jdstrand · July 1, 2020, 1:42pm

@egeeirl - I’m not saying that you should continue to press on making snap_daemon work since, based on other comments, it won’t fit the needs for --uid, but for posterity, initgroups() uses setgroups() under the hood in a non-sandbox compliant manner (so you’d need to either patch or use the LD_PRELOAD technique). I’ve updated system-usernames to mention this.
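
For readers unfamiliar with the call sequence, the usual order in which a root process switches to another user can be sketched as follows. This is an illustration in Python, not Slurm's actual C code; the function name drop_privileges and its os_mod/pwd_mod parameters are hypothetical and exist only so the sequence can be exercised without root.

```python
import os
import pwd


def drop_privileges(uid, os_mod=os, pwd_mod=pwd):
    """Switch the current process to `uid` (illustrative sketch only).

    The order matters: supplementary groups and the gid must be changed
    while the process is still root, and setuid() comes last.
    """
    entry = pwd_mod.getpwuid(uid)

    # 1. Reset supplementary groups. initgroups() ends in a setgroups()
    #    call, which is the step that failed in the srun output earlier
    #    in this topic.
    os_mod.initgroups(entry.pw_name, entry.pw_gid)

    # 2. Change the primary group while still privileged.
    os_mod.setgid(entry.pw_gid)

    # 3. Finally give up root by changing the uid.
    os_mod.setuid(uid)
```

Run as root outside confinement, drop_privileges(1000) would leave the process running as UID 1000. When the first step is denied, nothing after it runs, so the process never reaches setuid().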

jdstrand · July 1, 2020, 2:19pm

@egeeirl and @jamesbeedy - sorry for the delay, this is a new use case that needed to be investigated and thought through.

@pedronis - this seems to be a new use case for classic. Some discussion happened in: Can a confined Snap run as a different uid and or guid?.

IME, slurm is an orchestration tool for HPC. AIUI, slurm can and does utilize snap_daemon for certain actions, but in certain environments the slurmd process itself needs to run (arbitrary) commands as the effective uid/gid of the user in order to access resources for the user (@jamesbeedy referred to this as the ‘active directory user’, which seems to be tied to slurm’s concept of ‘partitions’ (see the overview url)). Other use cases are to run commands as other users for accounting purposes (AIUI, the current design of slurm is such that it performs resource accounting by tracking the uid that the process ran as, as opposed to something like having users log in with an account, executing all commands as the same user, and performing tracking via use of the account).

The slurm website states “Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters”. So, like flowbot, this is similar to juju. The closest use case in our processes for classic is ‘management snaps’ which we’ve identified as an unsupported use case.

In my limited understanding of slurm, we could perhaps require that slurm be modified to run anything that should be non-root as snap_daemon with slurm then being further modified to perform accounting with this in mind, but IME that would change how users expect to use slurm (ie, they’d have to ‘login’ in some manner). Even if that change was made, slurm’s functionality wrt partitions would be limited.

While orchestration tools like slurm (and juju, and potentially the recent flowbot request) do ‘manage’ systems, these systems are not for managing fleets of devices/laptops/servers/etc which have login users and are instead about scale out computing on systems primarily without login users. In that light, I think we should consider the ‘orchestration snaps’ use case as something different from ‘management snaps’.

@pedronis - thoughts?

jamesbeedy · July 1, 2020, 4:50pm

@jdstrand @pedronis slurm is a resource scheduler for HPC applications. It is closer in comparison to apache spark than to juju. E.g. we use juju to facilitate the slurm lifecycle; we can scale our slurm clusters using juju to meet the resource needs of individual hpc jobs that users may run on slurm.

The hpc space is majorly an enterprise and academic space, in which heightened security and extended red tape are the norm. Active directory realms, access tracking and resource accounting on a per user basis are required with no exceptions.

An example:
A user will have files in a location in an active directory controlled filesystem that the slurmd (compute daemon) process needs to access, on all nodes in the cluster.

When the user executes a slurm job from a central location using srun/sbatch, slurmd executes under the effective uid of the user that kicked off the job.
The slurmd needs to execute as the effective userid of the active directory user that kicked off the process so that it can access files in the user space on each of the compute nodes.
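
The file access requirement can be sketched with the classic Unix permission check. This is a simplified illustration only: the function name may_read is hypothetical, and it ignores ACLs, capabilities, root's override and any network filesystem specifics, which a real active directory backed filesystem may add.

```python
import stat


def may_read(mode, owner_uid, owner_gid, uid, gids):
    """Classic owner/group/other read check (simplified, illustrative).

    mode       permission bits of the file, e.g. 0o600
    owner_uid  uid that owns the file
    owner_gid  gid that owns the file
    uid        effective uid of the process trying to read
    gids       set of group ids the process belongs to
    """
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)   # owner bits apply
    if owner_gid in gids:
        return bool(mode & stat.S_IRGRP)   # group bits apply
    return bool(mode & stat.S_IROTH)       # everyone else


# A private input file owned by the submitting user (UID 1000 here):
# readable when the job runs as that user, not when it runs as a
# shared service account such as 584788.
as_user = may_read(0o600, 1000, 1000, uid=1000, gids={1000})
as_service = may_read(0o600, 1000, 1000, uid=584788, gids={584788})
```

With a file mode of 0o600, as_user is True and as_service is False, which is why running every job under one shared account does not work for files in the user's own space.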

We have tried many alternatives to support these conventions outside of producing a classic snap.

In conclusion, we have determined that we need to use a classically confined snap to support these use cases.

Hope this helps. Thanks!

egeeirl · July 1, 2020, 9:16pm

Slurm isn’t managing or even really orchestrating anything. Depending on the use case, it is effectively a middleware. Take this use-case for example:

An organization wants to use StarCCM to simulate some heavy-duty fluid dynamics. In this case, Slurm is simply a tool that StarCCM leverages to complete the task.

Since this is a massively intense computational workload, it would be extremely useful to delegate it out to multiple compute nodes. That’s where Slurm comes in.

StarCCM, MPI, and Slurm (and other components) work together to slice the workload into computational chunks that can be resolved through N number of compute nodes via Slurm. As such, Slurm isn’t necessarily managing a host in a traditional sense. Slurm also isn’t really orchestrating or provisioning anything, either.

In the current use case, Juju is used to deploy and provision Slurm across several clusters.

jdstrand · July 2, 2020, 8:50pm

That is a fair distinction and perhaps we’ll want to define this as a different use case from orchestration. The main point though is that slurm isn’t about managing the compute nodes’ OS, user, etc configuration (ie like puppet or chef might do), it is about putting workloads on them (which in my mind is orchestrating the computation). Semantics aside, ‘management snaps’ is not a supported use case for classic and I’m putting forth that slurm is not a ‘management snap’ but rather something else, which IME is an important distinction when considering slurm for classic confinement.

egeeirl · July 2, 2020, 9:12pm

Semantics aside, ‘management snaps’ is not a supported use case for classic and I’m putting forth that slurm is not a ‘management snap’ but rather something else, which IME is an important distinction when considering slurm for classic confinement.

Sure, that seems fine. So this is something that you Snapcrafters will discuss on your side and get back to me or is there anything else I need to do?

jdstrand · July 2, 2020, 10:02pm

@egeeirl - classic snaps run unrestricted on the systems they are installed on and for this reason we treat reviews of classic snaps differently than other sorts of reviews. For use cases for classic snaps that aren’t listed in our current processes, reviewers ask for snapd architect involvement, which I did when I asked for @pedronis to comment. That discussion will happen in this topic and for the moment there is nothing else you need to do, though he may have followup questions for you/other reviewers.

egeeirl · July 3, 2020, 12:33am

For use cases for classic snaps that aren’t listed in our current processes, reviewers ask for snapd architect involvement

Perfect, thank you

jamesbeedy · July 12, 2020, 6:37pm

@jdstrand @pedronis nudge, bump

pedronis · July 15, 2020, 3:53pm

@jdstrand the ways this seems to be different from a management snap are:

I imagine it is normally supposed to take over a machine with no workloads not related to it usually running
It wouldn’t run quietly in a corner taking control only from time to time
In particular except for developer machines we wouldn’t expect this to run on a desktop?
We don’t expect casual users to install this at all?

pedronis · July 15, 2020, 3:56pm

These same arguments, though, make it a bit unclear why it couldn’t be confined at some point?