SLURM auto-connect for network-control [Was: SLURM Snap (transfer ownership)]

Hello,

I am requesting to be the maintainer of the slurm* snaps. I have been building the snapped SLURM stack for some time now and am ready to start releasing my snaps to various channels via the Snap Store.

For this to happen I need the “omnivector-solutions” user to be the authoritative owner of the snap name “slurm”.

(Similar to munge https://snapcraft.io/munge)

Could someone assist me in granting the omnivector-solutions user ownership of these names in the Snap Store?

Thanks!

Sure thing! However, can I trouble you to register the slurm snap name as you would any other? (https://snapcraft.io/account/register-snap).

If you get a “this name is reserved” message, you should also see a “file a dispute / request anyway” option, which will send your request to our review queue, where I can then approve it.
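For reference, the same registration can also be done from the command line (a quick sketch, assuming a recent snapcraft):

# Authenticate against the store, then claim the name
snapcraft login
snapcraft register slurm

If the name is reserved, snapcraft should tell you so; the dispute itself still goes through the web form above.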

I’ll keep an eye out for your request!

  • Daniel

@roadmr done and done. Thanks!

@roadmr we were able to get our slurm snap working in strict mode using the drop-user/system-usernames technique alongside the addition of the physical-memory-control and network-control plugs. We are now trying to set up our release process and get an initial build released to the Snap Store. Is it possible to get a review started on our snap at this point so that we can proceed with the release process?
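For context, the relevant bits of our snapcraft.yaml look roughly like this (a trimmed sketch; the command, wrapper name, and any plugs beyond the two named above are illustrative, not our exact recipe):

name: slurm
confinement: strict

# drop-user technique: snapd provisions the snap_daemon user and
# the services drop privileges to it on startup
system-usernames:
  snap_daemon: shared

apps:
  slurmrestd:
    command: bin/slurmrestd-wrapper   # illustrative wrapper name
    daemon: simple
    plugs:
      - network-bind                  # illustrative; the REST daemon listens on a socket
      - network-control
      - physical-memory-control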

@jamesbeedy Can you please respond to @jdstrand’s question above? Without this information this request cannot proceed.

@alexmurray yes. I assume from your response that it isn’t possible to enable auto-connect for both the physical-memory-control plug and the network-control plug? If so, enabling auto-connect for the network-control plug and the ability to publish the snap would be very much appreciated. Thanks!
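For clarity, without auto-connect every user would have to wire this up by hand after install, e.g.:

# manual interface connection, needed on every fresh install
sudo snap connect slurm:network-control

# verify the plug is wired up
snap connections slurm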

@alexmurray @roadmr do you need anything else from me to be able to move this along?

Could you please explain how the two interfaces are used by slurm?

@zyga-snapd I was led to believe that physical-memory-control is needed by slurmrestd when it initializes and creates its endpoints.

We can see it failing here:

May 10 18:39:54 ubuntu-dev systemd[1]: Started Service for snap application slurm.slurmrestd.
May 10 18:39:54 ubuntu-dev slurm.slurmrestd[19097]: <----------------------------- Starting SLURMRESTD ------------------------------>
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /slurm/v0.0.35/diag/ to 0x55ddc3bff143
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /slurm/v0.0.35/diag/ with tag 0
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /slurm/v0.0.35/ping/ to 0x55ddc3bff801
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /slurm/v0.0.35/ping/ with tag 1
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /slurm/v0.0.35/jobs/ to 0x55ddc3c04172
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /slurm/v0.0.35/jobs/ with tag 2
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /slurm/v0.0.35/job/{job_id} to 0x55ddc3c0336c
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /slurm/v0.0.35/job/{job_id} with tag 3
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /slurm/v0.0.35/job/submit to 0x55ddc3c00f5b
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /slurm/v0.0.35/job/submit with tag 4
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /slurm/v0.0.35/nodes/ to 0x55ddc3c044d6
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /slurm/v0.0.35/nodes/ with tag 5
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /slurm/v0.0.35/node/{node_name} to 0x55ddc3c044d6
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /slurm/v0.0.35/node/{node_name} with tag 6
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /slurm/v0.0.35/partitions/ to 0x55ddc3c04be3
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /slurm/v0.0.35/partitions/ with tag 7
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /slurm/v0.0.35/partition/{partition_name} to 0x55ddc3c04be3
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /slurm/v0.0.35/partition/{partition_name} with tag 8
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /openapi.yaml to 0x55ddc3bff096
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /openapi.yaml with tag 9
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /openapi.json to 0x55ddc3bff096
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug4: bind_operation_handler: new path /openapi.json with tag 10
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: debug3: bind_operation_handler: binding /openapi to 0x55ddc3bff096
May 10 18:39:54 ubuntu-dev slurmrestd[19097]: fatal: bind_operation_handler: failure registering OpenAPI for path: /openapi
May 10 18:39:54 ubuntu-dev systemd[1]: snap.slurm.slurmrestd.service: Main process exited, code=dumped, status=6/ABRT
May 10 18:39:54 ubuntu-dev systemd[1]: snap.slurm.slurmrestd.service: Failed with result 'core-dump'.
May 10 18:39:54 ubuntu-dev systemd[1]: snap.slurm.slurmrestd.service: Service hold-off time over, scheduling restart.
May 10 18:39:54 ubuntu-dev systemd[1]: snap.slurm.slurmrestd.service: Scheduled restart job, restart counter is at 6.
May 10 18:39:54 ubuntu-dev systemd[1]: Stopped Service for snap application slurm.slurmrestd.
May 10 18:39:54 ubuntu-dev systemd[1]: snap.slurm.slurmrestd.service: Start request repeated too quickly.
May 10 18:39:54 ubuntu-dev systemd[1]: snap.slurm.slurmrestd.service: Failed with result 'core-dump'.
May 10 18:39:54 ubuntu-dev systemd[1]: Failed to start Service for snap application slurm.slurmrestd.

We ended up going down this path because we traced the slurmrestd process with strace and were able to see where it failed: allocating memory for the endpoint above.

Adding the physical-memory-control plug was the only way (other than using classic confinement) we could get past slurmrestd failing to allocate memory for its endpoints; the strace output showed the allocations returning -1 when slurmrestd attempted the bind operations shown above.
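For reference, the tracing was along these lines (reconstructed from memory, so treat the exact flags as approximate):

# run the confined service under strace via snapd
# (requires strace to be available on the host)
sudo snap run --strace slurm.slurmrestd

# or attach to a running instance and watch memory-related syscalls
sudo strace -f -e trace=memory -p <slurmrestd-pid>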

I’m having a hard time reproducing the error that led us to adding the physical-memory-control plug, and can’t find the history where it was previously happening.

Now, I seem to be able to run slurmrestd just fine without physical-memory-control … I’m totally beat here. I guess we don’t need the plug after all.

I’ll remove the physical-memory-control plug now.

Thanks for entertaining this conversation.

This doesn’t seem to indicate what is failing. The interface in question allows writing to /dev/mem; is that explicitly what fails? I would prefer to see an AppArmor DENIED message, as application messages are usually not precise enough to explain what was attempted and how it failed.
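If there is a denial, the kernel log will show it; something like this should surface any DENIED lines:

# kernel log, filtered for AppArmor denials
sudo journalctl -k | grep 'apparmor="DENIED"'

# or let snappy-debug watch and interpret denials live
sudo snap install snappy-debug
sudo snappy-debug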

Totally. Let me dig at this a bit more and get an actual repro on what we were seeing earlier. I’ve removed the physical-memory-control plug for now as things seem to work without it.


I have manually rejected revision 77, and subsequent revisions that use classic will be auto-rejected. Your new uploads that don’t use classic should then pass automated review.

@jdstrand This is awesome. Thank you! Thank you everyone!

@jdstrand Is it possible to have auto-connect for our network-control plug?

There isn’t enough detail to process this. From the snap’s description: “Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.” Can you describe in detail why network-control is needed by your snap?

Since you responded in Can a confined Snap run as a different uid and or guid? that you may be pursuing classic, we’ll not process this just yet.

@jdstrand Right, it looks like we may be pursuing classic confinement. Let’s hold off on this for now. Thanks!