Require some snaps to have classic confinement

Hello, greetings from the Omnivector team.

We have recently made a push to publish our software to the Snap Store; however, we identified that some of our snaps will need classic confinement. The snaps are:

The technical detail we identified is that these agents interact with the Slurm binaries, more specifically scontrol, sbatch, sacctmgr and squeue. These binaries are commonly located in /usr/bin, but it is up to the HPC admin of a cluster to determine exactly where they live, so we need to allow end users to customize the binaries’ paths in the agents. For this reason, we need the snaps to have classic confinement.
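
To make this concrete, here is a rough Python sketch of the pattern the agents follow; the SLURM_BIN_DIR variable and the wrapper function are illustrative names, not the actual agent code:

```python
# Minimal sketch, not the actual agent code: SLURM_BIN_DIR and run_slurm are
# illustrative names showing how the binary path is made configurable.
import os
import shutil
import subprocess

# The HPC admin tells the agent where the Slurm client binaries live;
# /usr/bin is only the common default.
SLURM_BIN_DIR = os.environ.get("SLURM_BIN_DIR", "/usr/bin")


def run_slurm(command: str, *args: str) -> str:
    """Run a Slurm client command (scontrol, sbatch, sacctmgr, squeue) from the host."""
    binary = os.path.join(SLURM_BIN_DIR, command)
    if not shutil.which(binary):
        raise FileNotFoundError(f"{command} not found in {SLURM_BIN_DIR}")
    result = subprocess.run([binary, *args], capture_output=True, text=True, check=True)
    return result.stdout


# Example: list the existing cluster's queue through the host's squeue.
print(run_slurm("squeue"))
```

Under strict confinement, executing the host’s binaries from /usr/bin (or wherever the admin installed Slurm) would be blocked, which is what drives the classic request.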

Hey @matheushent, according to Process for reviewing classic confinement snaps, classic requests should fall under at least one of the supported categories. Could you please clarify if these snaps fit within any of the supported categories? Thanks!

Hey @cav, the snaps fit this category:

  • HPC or orchestration agents/software for running workloads on systems without traditional users where the systems are otherwise managed outside of the agent (ie, the software simply orchestrates running workloads on systems and doesn’t manage the systems themselves).

Essentially, the snaps install agents whose purpose is to assist end users in managing licenses and submitting workloads.

Hey @matheushent

I’m not familiar with Slurm, so is there a reason why the system binaries must be used rather than staging the Slurm libraries into the snap?

Thanks

Hey @jslarraz. While it is possible to pack the Slurm binaries in the snap (check out slurm-snap, to which @nuccitheboss has contributed), the agents are meant to work with existing Slurm clusters, which means the binaries and configuration files will already exist for most use cases.

Consider the scenario where Slurm is already running on the system. This means the *.conf files are all in place and the Slurm daemons (slurmctld, slurmdbd, slurmd and possibly slurmrestd) are already running. If the agents stage the Slurm binaries, they will not be able to communicate with the existing cluster, and therefore there’s no value in installing them.

I may be missing something, but in the scenario you are proposing, you could stage the Slurm libraries inside the snap and use the personal-files and/or system-files interfaces to access the existing *.conf files from the host system. Then the agents (running from the snap) could communicate with the daemons running on the host system using the configuration from the host’s *.conf files via the usual means (IPv4, according to this documentation). Shouldn’t that work?

Not necessarily. The documentation you provided is about Slurm components communicating with each other. As an end user, for example, I will not be able to access Slurm resources by calling some endpoint at port 6817 (slurmctld) directly; I could do some reverse engineering and wrap the RPC calls the Slurm commands make, but that seems like a lot of overhead. I could communicate with slurmrestd at port 6820, but that would be a totally different approach. Even so, the agent would require access to the binary at /usr/bin/scontrol to issue a JWT token and authenticate against the slurmrestd API.
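
For illustration, this is roughly what the slurmrestd route would look like; note it still depends on the host’s scontrol to mint the JWT. The API version, port and endpoint below are assumptions about a typical deployment, and JWT auth is assumed to be configured on the cluster:

```python
# Sketch only: shows that even the REST route needs the host's scontrol.
# The API version (v0.0.39), port 6820 and endpoint are assumptions about a
# typical deployment with JWT authentication enabled.
import getpass
import subprocess
import urllib.request

# `scontrol token` prints a line like "SLURM_JWT=<token>"; it has to come from
# the host's scontrol so the token is signed with the cluster's key.
out = subprocess.run(
    ["/usr/bin/scontrol", "token"], capture_output=True, text=True, check=True
)
token = out.stdout.strip().split("=", 1)[1]

req = urllib.request.Request(
    "http://localhost:6820/slurm/v0.0.39/jobs",
    headers={
        "X-SLURM-USER-NAME": getpass.getuser(),
        "X-SLURM-USER-TOKEN": token,
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.read()[:200])
```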

Consider the slurmdbd daemon, for example. This daemon would need to access a database outside of the snap confinement. I’m not sure how it would work to have two Slurm clusters accessing the same *.conf files but living under different confinements on the OS.

Also, consider the case where a user changes some configuration in the slurm.conf file. To apply this change, the user needs to run scontrol reconfigure. If the agents stage the Slurm binaries, the user will also need to run a one-shot daemon for each agent so that the Slurm configuration is applied to each agent’s copy of Slurm as well. That doesn’t seem like a good experience for the snap to provide, IMO.
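
A rough sketch of the duplicated step that staging would force on users; the agent snap and service names below are purely hypothetical:

```python
# Illustrative only: the snap/service names below are made up for this example.
import subprocess

# Today, a single command makes the whole cluster re-read slurm.conf.
subprocess.run(["scontrol", "reconfigure"], check=True)

# With staged binaries, every agent snap would also need its own one-shot
# "reconfigure" service so its private copy of the Slurm tooling picks up
# the same change.
for agent in ["agent-a", "agent-b"]:  # hypothetical snap names
    subprocess.run(["snap", "restart", f"{agent}.reconfigure"], check=True)
```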

The other point I see is the CPU and RAM consumed by each process. Although the daemons themselves are not “heavy”, we (Omnivector) have seen folks battle to free up CPU and RAM from non-workload processes.

As an end user, for example, I will not be able to access Slurm resources by calling some endpoint at port 6817 (slurmctld) directly

In general you need to look at the means used by the applications (scontrol, sbatch, sacctmgr and squeue) to communicate with the daemons (slurmctld, slurmdbd, slurmd). It should be possible to use most IPC alternatives (Linux sockets, D-Bus, etc.).

Consider the slurmdbd daemon, for example. This daemon would need to access a database outside of the snap confinement. I’m not sure how it would work to have two Slurm clusters accessing the same *.conf files but living under different confinements on the OS.

I’m not proposing to deploy a complete new cluster within the snap. My proposal here was to pack the agent as you have already done and stage the required applications (scontrol, sbatch, sacctmgr and squeue). Then, the agent could communicate with the cluster daemons (slurmctld, slurmdbd, slurmd) that are running on the host system using the staged applications and the cluster configuration (the *.conf files from the host can be accessed using personal-files|system-files).

Most probably I’m missing something here, so please let me know 🙂

Thanks!

Hopefully, I can add some missing context here. The Slurm binaries have to match what’s running on the cluster. We cannot be responsible for producing a different build of each of our agents for every single Slurm version that exists. Our agents should use the binaries provided by the Slurm installation that lives on the system. Hope this helps!
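
To illustrate the constraint (a Python sketch, not part of the agents; output parsing may vary between Slurm releases), the host client version can be compared with what the controller reports:

```python
# Sketch: check that the host's Slurm client tools match the running controller.
# Paths and parsing are illustrative; output formats can differ between releases.
import subprocess


def client_version(sbatch: str = "/usr/bin/sbatch") -> str:
    # `sbatch --version` prints something like "slurm 23.02.7"
    out = subprocess.run([sbatch, "--version"], capture_output=True, text=True, check=True)
    return out.stdout.split()[-1]


def controller_version(scontrol: str = "/usr/bin/scontrol") -> str:
    # `scontrol show config` includes a "SLURM_VERSION = 23.02.7" line
    out = subprocess.run([scontrol, "show", "config"], capture_output=True, text=True, check=True)
    for line in out.stdout.splitlines():
        if line.startswith("SLURM_VERSION"):
            return line.split("=", 1)[1].strip()
    raise RuntimeError("SLURM_VERSION not reported by scontrol show config")


if client_version() != controller_version():
    raise SystemExit("Slurm client tools do not match the cluster version")
```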

Oh, that’s a very good point @jamesbeedy. The technical requirement is now clear to me: the slurm agent needs to run the Slurm binaries from the host system (matching the daemon version), which may be placed at an arbitrary location (decided by the cluster admin). It also falls under the supported HPC category.

I’ll begin the publisher vetting

Thanks!

Do you have an estimate of when it will get approved for publishing?

@jslarraz thanks! I think we need to be an approved publisher first; can you help get this pushed through, please?

Thanks