Request for classic confinement: ondemand

Greetings Store Team :wave:

I am requesting classic confinement for Open OnDemand, which is registered as ondemand on the Snap Store. Open OnDemand is an interactive, web-based portal for using supercomputing resources over the internet. Rather than requiring supercomputing cluster users to have a terminal-centric workflow, Open OnDemand provides integrations that let users start XFCE or MATE VDI sessions, request web applications like Jupyter or VSCode, manage jobs on their favorite workload scheduler, read the latest announcements from cluster administrators, and more. You can view my current work on the ondemand snap on GitHub to audit the source code: https://github.com/charmed-hpc/ondemand-snap.

You can verify my legitimacy from other snaps that I have published, such as Spack, Slurm, and Marktext. Please let me know if you have any questions about Open OnDemand or the snap package itself!

Why I need classic confinement for Open OnDemand

The ondemand snap qualifies for classic confinement under the following categories:

  • IDEs
  • HPC or orchestration agents/software for running workloads on systems without traditional users where the systems are otherwise managed outside of the agent (ie, the software simply orchestrates running workloads on systems and doesn’t manage the systems themselves). Note: many HPC/orchestration applications can run with strict confinement and classic should only be granted if snapd does not support the specific use case (eg, the need for user accounting)

As an IDE, Open OnDemand is used for developing workloads intended to be run on supercomputers; it provides VDI and interactive sessions that enable research software engineers to get direct access to their advanced computational resources. Open OnDemand also enables users to manage development workspaces via a web-based terminal and graphical file explorer.

As an HPC orchestration agent for running workloads, Open OnDemand uses nginx under the hood as a reverse proxy for accessing custom interactive applications written in Node, Ruby, or Python. These interactive applications are requested through Open OnDemand and typically started by a workload scheduler such as Slurm. Open OnDemand then hands the resulting endpoint back to the user, who can log into their interactive session.
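For a concrete picture, here is a minimal sketch of that reverse-proxy step in Python. This is my own illustration, not nginx_stage's real template; the `/node/<host>/<port>/` path scheme and the proxy headers are assumptions:

```python
# Hypothetical illustration of the reverse-proxy idea: map a per-user
# URL path to the host/port where the scheduler started the app.
PROXY_TEMPLATE = """\
location /node/{host}/{port}/ {{
    proxy_pass http://{host}:{port}/;
    proxy_set_header Host $host;
    proxy_set_header Upgrade $http_upgrade;    # allow websocket upgrades
    proxy_set_header Connection "upgrade";
}}
"""

def render_proxy_block(host: str, port: int) -> str:
    """Return an nginx location block that forwards the user's traffic
    to the interactive application running on a compute node."""
    return PROXY_TEMPLATE.format(host=host, port=port)

print(render_proxy_block("node042", 5901))
```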

Where things get interesting is that, as an orchestrator, Open OnDemand behaves much like Slurm when scheduling workloads: it assigns the effective uid and gid of the requesting user to the nginx process so that no one else can interfere with it. The nginx process is mapped to the user so that only they can manage it or elect to share their interactive session with a collaborator. This need to drop privileges to the effective uid and gid of the requesting user is the same reason the Slurm snap was granted classic confinement.
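A minimal sketch of that privilege-drop pattern in Python, assuming root launches the process (this is not nginx_stage's actual implementation; the command line is illustrative):

```python
import os
import pwd
import subprocess

def spawn_nginx_as(username: str, config_path: str) -> subprocess.Popen:
    """Launch an nginx process that runs with the requesting user's
    uid/gid, so only that user (and root) can manage it."""
    entry = pwd.getpwnam(username)

    def drop_privileges() -> None:
        # Order matters: supplementary groups and the gid must be set
        # while we are still root; the uid is dropped last.
        os.initgroups(username, entry.pw_gid)
        os.setgid(entry.pw_gid)
        os.setuid(entry.pw_uid)

    # preexec_fn runs in the child between fork() and exec().
    return subprocess.Popen(
        ["nginx", "-c", config_path, "-g", "daemon off;"],
        preexec_fn=drop_privileges,
    )
```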


Hey @nuccitheboss

Thanks for your request! I’m not familiar with ondemand, so could you please clarify whether your snap is intended to be installed on every compute node (and is itself responsible for starting jobs on those nodes), or whether it only needs to be installed on one management instance and communicate over the network with job schedulers (such as Slurm)?

Thanks!


Hi there @jslarraz :wave:

Open OnDemand is typically installed on a central management node and then uses each scheduler's CLI to communicate with the configured job schedulers. For example, in the case of Slurm, Open OnDemand uses the Slurm CLI commands and the Slurm cluster configuration file to communicate with the workload scheduler: https://osc.github.io/ood-documentation/latest/installation/resource-manager/slurm.html
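As an illustration of what "using the Slurm CLI" means in practice, here is a minimal Python sketch, assuming sbatch and squeue are on the PATH (my own sketch, not Open OnDemand's code):

```python
import subprocess

def submit(script: str) -> str:
    """Submit a batch script with sbatch and return the job id."""
    out = subprocess.run(
        ["sbatch", "--parsable", script],
        capture_output=True, text=True, check=True,
    ).stdout
    # --parsable prints "jobid" (or "jobid;cluster"); take the first field.
    return out.strip().split(";")[0]

def state(job_id: str) -> str:
    """Ask squeue for the job's current state (e.g. PENDING, RUNNING)."""
    out = subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%T"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.strip() or "COMPLETED"  # empty output: job left the queue
```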

Open OnDemand will run interactive jobs on compute nodes. For example, the typical flow for requesting an interactive desktop session is the following (see the sketch after this list):

  1. The user authenticates through Open OnDemand, typically via an identity provider with OpenID Connect enabled.
  2. If it’s the first time the user is logging in, start an nginx session specific to that user.
  3. Start the dashboard interactive application.
  4. Request computational resources through a workload manager like Slurm.
  5. Generate an nginx configuration that proxies traffic to that interactive session.
  6. Start the interactive job on the compute resources. For an interactive desktop, this typically means starting a TurboVNC session on the compute node.
  7. Start the nginx proxy. The proxy will route traffic directly to the VNC session.
  8. Use Slurm to kill the interactive session once the user is done.
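Here is a rough Python sketch of steps 4 through 8, reusing the submit(), state(), and render_proxy_block() helpers from my earlier sketches; the VNC port and config path are made-up placeholders, and there is no error handling:

```python
import subprocess
import time

def allocated_node(job_id: str) -> str:
    """Return the hostname the scheduler allocated to the job (%N)."""
    return subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%N"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def run_desktop_session(vnc_script: str, user_conf: str) -> None:
    job_id = submit(vnc_script)            # steps 4+6: allocate and start TurboVNC
    while state(job_id) != "RUNNING":      # wait for the allocation (no error handling)
        time.sleep(5)
    host = allocated_node(job_id)
    with open(user_conf, "a") as conf:     # step 5: proxy config for this session
        conf.write(render_proxy_block(host, 5901))
    subprocess.run(["nginx", "-s", "reload"], check=True)  # step 7: route traffic
    input("Press Enter to end the session...")
    subprocess.run(["scancel", job_id], check=True)        # step 8: tear it down
```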

Here’s the documentation for the top-level architecture of Open OnDemand to help visualize how everything works in the backend: https://osc.github.io/ood-documentation/latest/architecture.html. Also, here’s a gif from the upstream Open OnDemand project that demonstrates Open OnDemand in action: https://github.com/OSC/ondemand/blob/master/docs/imgs/open_ondemand_demo.gif

Please let me know if you have any further questions! :smiley:

@nuccitheboss thanks a lot for all the information!

Maybe I’m misunderstanding something, but from the container context diagram in https://osc.github.io/ood-documentation/latest/architecture.html, it looks like ondemand does not start jobs itself but contacts the scheduler/compute node via different network protocols (http, https, ssh…).

In such a scenario I wonder whether it could be made to work under strict confinement, as it should not need to execute arbitrary commands (but rather instruct the scheduler to run them on the compute node). Have you tried to make this snap strictly confined?

Open OnDemand schedules and manages its own kind of tasks, alongside integrating with various workload schedulers, using a CLI utility named nginx_stage that wraps nginx.

Open OnDemand is capable of requesting resources via configured clusters (e.g. using Slurm to start a Jupyter Server session on a distributed environment with 48 GPUs allocated to it), but Open OnDemand also manages processes locally with nginx_stage, which isn’t captured by the container context diagram. For example, Open OnDemand will schedule a dashboard session locally that is namespaced to each logged-in user: the nginx process is run as the effective uid and gid of the logged-in user. Other applications, such as a file explorer and a web-based terminal, also run locally. nginx_stage needs elevated privileges to manage the different nginx processes running for each user.
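To illustrate why, here is a hypothetical Python sketch of per-user nginx management; the pid-file layout under /run/ondemand-nginx is an assumption on my part, not nginx_stage's real layout:

```python
import os
import signal
from pathlib import Path

# Hypothetical layout: one nginx master per user, tracked by a per-user
# pid file. This only shows why managing every user's process needs
# elevated privileges; nginx_stage's real paths and logic differ.
RUN_DIR = Path("/run/ondemand-nginx")

def user_nginx_pid(username: str) -> int | None:
    """Read the pid of a user's nginx master, if one is running."""
    pid_file = RUN_DIR / username / "passenger.pid"
    if not pid_file.exists():
        return None
    return int(pid_file.read_text().strip())

def stop_user_nginx(username: str) -> None:
    """Gracefully stop one user's nginx (QUIT is nginx's graceful-stop
    signal). Signaling a process owned by another uid requires root."""
    pid = user_nginx_pid(username)
    if pid is not None:
        os.kill(pid, signal.SIGQUIT)
```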

I have tried, but I cannot use strict confinement, as Open OnDemand is as much an IDE as it is an HPC workload orchestration agent.

Open OnDemand will run interactive applications locally via nginx_stage, and some of those applications need generous access to the underlying host. For example, the “File Explorer” application is used by many Open OnDemand users to manage their files on the supercomputing cluster, and the “File Editor” application is used to edit files on the cluster. The home, personal-files, and system-files interfaces are not sufficient here, as Open OnDemand users commonly access files outside their home directory due to limited storage quotas; it would be too onerous to map every possible directory or file that users would want to access on the cluster. There are many different ways a supercomputing facility may choose to set up the cluster file system, and users’ requirements are heterogeneous, so they expect a wide degree of flexibility. This is a similar reason to why Spack needed classic confinement: Request for classic confinement: Spack.

Open OnDemand is an IDE similar to PyCharm or VSCode in that there’s a built-in file explorer and editor, but rather than having a dedicated window on the desktop, you access its various tools and applications through the web browser. Let me know if you have any further questions :smiley:

Thank you for your explanation!

Similarly to Spack, and because ondemand fits within more than one of the supported categories for classic confinement as per Process for reviewing classic confinement snaps, including “HPC orchestration agent for running workloads”, the requirements for classic are understood.

I have vetted the publisher. This is now live.
