Everything (almost) seems to work, but performance is pretty bad and I see a lot of messages in the log triggered by seccomp auditing, for example:
Note that syscall=321 corresponds to sys_bpf() on x86_64, which scx_rustland uses heavily.
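To see which syscalls account for the audit flood, one option is to tally the `syscall=` field across the records. The following is a minimal sketch, not a tool from the thread: the sample records are hypothetical and abbreviated, and the only number-to-name mapping included is the 321 → bpf one mentioned above.

```python
import re
from collections import Counter

# x86_64 syscall numbers; 321 -> bpf is the mapping mentioned in the thread.
SYSCALL_NAMES = {321: "bpf"}

SYSCALL_RE = re.compile(r"\bsyscall=(\d+)\b")


def count_seccomp_syscalls(lines):
    """Count records per syscall number; lines without syscall= are skipped."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def describe(counts):
    """Return (name, count) pairs, most frequent first."""
    return [(SYSCALL_NAMES.get(nr, f"syscall_{nr}"), n)
            for nr, n in counts.most_common()]


# Hypothetical, abbreviated records for illustration only (not a real log).
SAMPLE = [
    'audit: type=1326 comm="scx_rustland" arch=c000003e syscall=321 compat=0',
    'audit: type=1326 comm="scx_rustland" arch=c000003e syscall=321 compat=0',
    'audit: type=1326 comm="scx_rustland" arch=c000003e syscall=144 compat=0',
    'sched_ext: some unrelated line',
]

if __name__ == "__main__":
    for name, n in describe(count_seccomp_syscalls(SAMPLE)):
        print(name, n)
```

Feeding it the real kernel log instead of `SAMPLE` should show quickly whether bpf() dominates or other syscalls are being audited too.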
The overhead is so big that I’m even triggering the global sched-ext watchdog and the scheduler is kicked out:
[ 452.746587] sched_ext: BPF scheduler "rustland" errored, disabling
[ 452.746595] sched_ext: runnable task stall (kauditd[66] failed to run for 6.986s)
At the moment, as a workaround, I’m booting the kernel with audit=0, and with this option everything works perfectly fine. But I was wondering if there is a more fine-grained way to silence these messages (instead of disabling audit logging system-wide), e.g., using some special config / keyword in my snapcraft.yaml or something similar.
Well, ignoring a few other pleasantries that come to mind ;).
The interface description suggests this would likely require a somewhat stricter review than the standard process (i.e., proof from upstream that they’re aware of the snap and are happy for you to maintain it), but given the nature of your snap and the fact that the overhead is so high that it becomes unusable without it, I’d say you have a good chance.
I’m unfamiliar with BPF, and especially with this application of it, but it might also be worth making a new interface entirely if this is a new class of snap that could fit the snap model well (I might be wrong, but it certainly feels like a new class to me!). Ultimately this would go through a completely separate security review and would actually need implementing, which would take weeks/months, but if there’s demand for it, it might be worth doing.
It’s not urgent. At the moment I’m simply experimenting with the idea of providing multiple hot-pluggable Linux schedulers as snaps, because I think it’d be cool. But they still require a sched-ext enabled kernel, which we don’t officially support yet (I’m testing this with my “unofficial” sched-ext kernel from ppa:arighi/sched-ext).
As for introducing a new class, right now everything seems to work fine for me with just system-trace (in the end, to interact with BPF we simply need to be allowed to use the bpf() syscall). Maybe in the future, if more BPF technologies are used to replace “kernel components”, we may want a more generic bpf-interface, or similar, to better represent this type of program.
Just tested with network-control and it seems to be enough, so I’ll switch to that for now, even if it sounds a bit counter-intuitive for a scheduler to require network control… in the end I just need to allow the bpf() syscall and read files from procfs.
Where can I find a detailed list of capabilities/syscalls that are allowed by these interfaces?
Perfect, this is exactly what I needed! And now I know that I actually need the process-control interface, because scx-rustland may also call sched_setscheduler(), so network-control isn’t enough…
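A quick way to check from inside the confinement whether sched_setscheduler() is permitted is to re-apply the policy the process already has, which changes nothing when it succeeds. This is only a sketch using Python's `os` wrappers; the helper name `reapply_current_policy` is made up for illustration, and an error result does not say which layer (seccomp, capabilities, or something else) refused the call.

```python
import errno
import os


def reapply_current_policy(pid=0):
    """Re-apply the scheduling policy the process already has.

    Returns "ok", "unsupported", or the errno name of the failure.
    """
    if not hasattr(os, "sched_setscheduler"):
        return "unsupported"  # platform without the sched_* wrappers
    try:
        policy = os.sched_getscheduler(pid)
        param = os.sched_getparam(pid)
        # Same policy, same priority: a no-op unless the call is blocked.
        os.sched_setscheduler(pid, policy, param)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


if __name__ == "__main__":
    print("sched_setscheduler:", reapply_current_policy())
```

Running it once with only network-control plugged and once with process-control should make the difference between the two interfaces visible.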
Sounds like a new interface then. Feel free to propose it in the forum or open a PR to the snapd repository.
The reason we do not have an interface that allows just the bpf() syscall is that the syscall by itself isn’t that useful. You usually need additional syscalls, path objects, or network access to act upon in order to attach your BPF objects. The seccomp log, in turn, is supposed to let you know that the program is trying to do something it isn’t allowed to do under the currently active policy.
Sure, I’ll send a PR to snapd. With more technologies potentially relying on BPF struct_ops in the future, we may have more tools/apps that just need access to the bpf() syscall and /sys/kernel/btf/vmlinux. In that case enabling a whole network-control or process-control interface sounds like overkill, so I think it’d be useful to have a more specific “bpf” interface.
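The two requirements named here (the bpf() syscall and read access to /sys/kernel/btf/vmlinux) can be probed from inside a confined process. The sketch below is only an assumption about how one might do that: it issues bpf() with a deliberately invalid command and looks at the errno. The helper names, the verdict strings, and the aarch64 number (280) are additions for illustration; 321 for x86_64 comes from the audit records earlier in the thread. An EPERM result cannot tell seccomp apart from other kernel policy.

```python
import ctypes
import errno
import os
import platform

# bpf() syscall numbers per architecture. 321 on x86_64 is the value seen in
# this thread; 280 is the asm-generic number used by aarch64 (an addition).
BPF_SYSCALL_NR = {"x86_64": 321, "aarch64": 280}

BTF_PATH = "/sys/kernel/btf/vmlinux"
INVALID_CMD = 0xFFFF  # deliberately not a real bpf command


def interpret(err):
    """Map the errno of the probe to a rough, non-authoritative verdict."""
    if err == errno.EINVAL:
        return "reachable"    # the kernel saw the call and rejected the command
    if err in (errno.EPERM, errno.EACCES):
        return "denied"       # seccomp, an LSM, or kernel policy refused it
    if err == errno.ENOSYS:
        return "unavailable"  # no bpf() syscall on this kernel
    return "unknown"


def probe_bpf():
    """Call bpf() with an invalid command; return a verdict, or None if unsupported."""
    nr = BPF_SYSCALL_NR.get(platform.machine())
    if nr is None or platform.system() != "Linux":
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    attr = ctypes.create_string_buffer(128)  # zeroed stand-in for union bpf_attr
    ret = libc.syscall(ctypes.c_long(nr), ctypes.c_int(INVALID_CMD),
                       ctypes.byref(attr), ctypes.c_uint(ctypes.sizeof(attr)))
    if ret >= 0:
        return "unknown"  # not expected for an invalid command
    return interpret(ctypes.get_errno())


def btf_readable(path=BTF_PATH):
    """True if the kernel BTF blob can be read by this process."""
    return os.access(path, os.R_OK)


if __name__ == "__main__":
    print("bpf():", probe_bpf())
    print("BTF readable:", btf_readable())
```

A "reachable" verdict together with a readable BTF file would suggest that the active policy already covers what a struct_ops-based tool needs at minimum, which is roughly the scope a dedicated “bpf” interface would have to grant.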