I just finished packaging an application, Volatility, a Python framework for memory analysis (forensics).
I expected performance close to native, but I am quite disappointed:
% time volatility-phocean.volatility -f ~/Temp/intrusion/dump.vmem imageinfo
Volatility Foundation Volatility Framework 2.6.1
INFO : volatility.debug : Determining profile based on KDBG search...
Suggested Profile(s) : Win7SP1x64, Win7SP0x64, Win2008R2SP0x64, Win2008R2SP1x64_24000, Win2008R2SP1x64_23418, Win2008R2SP1x64, Win7SP1x64_24000, Win7SP1x64_23418
AS Layer1 : WindowsAMD64PagedMemory (Kernel AS)
AS Layer2 : FileAddressSpace (/home/phocean/Temp/intrusion/dump.vmem)
PAE type : No PAE
DTB : 0x187000L
KDBG : 0xf80002c530a0L
Number of Processors : 1
Image Type (Service Pack) : 1
KPCR for CPU 0 : 0xfffff80002c54d00L
KUSER_SHARED_DATA : 0xfffff78000000000L
Image date and time : 2014-11-14 13:13:49 UTC+0000
Image local date and time : 2014-11-14 14:13:49 +0100
volatility-phocean.volatility -f ~/Temp/intrusion/dump.vmem imageinfo 29,61s user 2,36s system 99% cpu 31,965 total
So, ~30 sec.
To compare with my virtualenv:
(vol) % time vol.py -f ~/Temp/intrusion/dump.vmem imageinfo
Volatility Foundation Volatility Framework 2.6.1
INFO : volatility.debug : Determining profile based on KDBG search...
Suggested Profile(s) : Win7SP1x64, Win7SP0x64, Win2008R2SP0x64, Win2008R2SP1x64_24000, Win2008R2SP1x64_23418, Win2008R2SP1x64, Win7SP1x64_24000, Win7SP1x64_23418
AS Layer1 : WindowsAMD64PagedMemory (Kernel AS)
AS Layer2 : FileAddressSpace (/home/phocean/Temp/intrusion/dump.vmem)
PAE type : No PAE
DTB : 0x187000L
KDBG : 0xf80002c530a0L
Number of Processors : 1
Image Type (Service Pack) : 1
KPCR for CPU 0 : 0xfffff80002c54d00L
KUSER_SHARED_DATA : 0xfffff78000000000L
Image date and time : 2014-11-14 13:13:49 UTC+0000
Image local date and time : 2014-11-14 14:13:49 +0100
vol.py -f ~/Temp/intrusion/dump.vmem imageinfo 21,15s user 1,42s system 99% cpu 22,574 total
So, around 20 sec.
That’s almost a 30% gap. It will have a huge impact on large memory dumps and complex plugins.
Is such an overhead to be expected? Are there any steps I could take in my snapcraft build to improve it?
Thanks for your message. This kind of overhead is not expected. Do you see this on subsequent runs too or is the slowness only happening on the first run? I wonder if there is anything bad happening with the pyc files for example.
Thanks! I was able to reproduce this with snapcraft 2.43 (from the deb) and 3.6 (from the snap). In both cases the “pyc” files were missing. After manually adding “python -m compileall .” during the snapcraft build, things got better. I still see a gap, though, that is worth investigating.
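To make this concrete, the byte-compilation workaround can be wired into the build step itself. A minimal sketch of the relevant part in snapcraft.yaml (the part name and plugin are illustrative, not taken from the actual snap; “snapcraftctl build” runs the default build step first, and SNAPCRAFT_PART_INSTALL is snapcraft’s standard install-directory variable):

```yaml
parts:
  volatility:            # illustrative part name
    plugin: python
    override-build: |
      snapcraftctl build
      # Byte-compile the installed sources so the .pyc files ship in the snap
      python -m compileall -q $SNAPCRAFT_PART_INSTALL
```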
This is an interesting snap - it looks like running the Python code directly is still significantly faster than the snap, even when I use “snap try --devmode ./prime”. I did an “strace -c” of the snap and noticed it performs a lot of syscalls (lseek, mmap) - this points at seccomp overhead. And indeed - when I switch the seccomp devmode profile from “@complain” to “@unrestricted”, the gap goes away.
In summary - this needs closer investigation into whether we can do anything about how we write our bpf profile, or whether the kernel can be optimized so that very syscall-heavy apps (like this one) suffer less of a performance impact.
Fwiw - the numbers I have on my machine with a random example memory image are: ~14s when running directly from the virtualenv (and likewise when using @unrestricted in the seccomp profile), and ~20s with the regular snap. I see about 1134304 syscalls in total, mostly lseek/read (from strace -c).
Take mmap(), for example: it takes ~14 instructions before the filter exits with ALLOW. Keep in mind that this is cBPF, so there is no fancy JIT or anything like that. Note that syscalls are scanned linearly, so the higher the syscall number, the worse it gets.
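To illustrate the linear-scan behaviour, here is a toy Python model of such a filter. The prologue/epilogue sizes and the allow-list are invented for illustration; the real profile snapd generates will differ, but the shape of the cost is the same:

```python
# Toy model of a seccomp cBPF allow-list matched by a linear scan
# (an illustration only, not snapd's actual profile). The filter loads
# the syscall number, then compares it against each allowed number in
# turn until it matches.

PROLOGUE = 2  # assumed: load/validate arch, load syscall number
EPILOGUE = 1  # the final "ret ALLOW"

def linear_cost(allowed, nr):
    """Instructions executed before the linear filter returns ALLOW."""
    for scanned, candidate in enumerate(allowed, start=1):
        if candidate == nr:
            return PROLOGUE + scanned + EPILOGUE  # one jeq per scanned entry
    raise ValueError("syscall not in allow-list")

allowed = list(range(0, 330, 3))  # a made-up 110-entry allow-list
print(linear_cost(allowed, 9))    # a low syscall number matches early
print(linear_cost(allowed, 327))  # a high one pays for the full scan
```

The cost grows linearly with the position of the syscall in the list, which is why high-numbered (or late-ordered) syscalls are hit hardest.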
Turns out Oracle has some patches for libseccomp that generate rules using a binary search lookup rather than a linear scan, see https://github.com/seccomp/libseccomp/issues/116. There is even a blog post where they mention getppid() (syscall number 110), see https://blogs.oracle.com/linux/seccomp%3A-safe-and-secure-and-slow-no-more. However, the gain for low-numbered syscalls is probably smaller (say, ~4-5 prologue instructions + a log2(<num>) lookup + 1 epilogue instruction). As a rough estimate for mmap() (syscall number 9), the 14 instructions could potentially become 7-8, i.e. ~50-60% of the original cost. Compare that with their getppid() case: treating the static prologue/epilogue as insignificant, we would get log2(110)/110, i.e. ~6% of the original cost (which checks out with their “Time to execute getppid()” numbers).
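The getppid() estimate can be checked with a line of arithmetic (assuming syscall number 110 and ignoring the small fixed prologue/epilogue, as above):

```python
import math

# With a binary-search lookup, matching a syscall takes ~log2(N) compares
# instead of ~N with a linear scan. For getppid() (syscall number 110 on
# x86-64), the relative cost versus the linear scan is roughly:
relative_cost = math.log2(110) / 110
print(round(relative_cost * 100, 1))  # percent of the linear-scan cost
```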
Maybe we could collaborate with upstream and look into getting the PR to land?
It is known that the bpf ordering can impact performance and that this was an area of improvement for snapd. As it turns out, it is also an area of improvement for libseccomp: in 2.4 they made bpf optimizations that should improve things. I am curious whether this snap is faster with libseccomp 2.4.1 on the system at the time snap-seccomp is run. Also note that with libseccomp 2.4, libseccomp unconditionally orders the bpf however it wants, and snapd is not in a position to adjust that. AIUI, upstream libseccomp wants to do more optimization and would welcome contributions (see the aforementioned comment from @mborzecki).
The results obtained running volatility (without compiled Python modules) on the same image as @mvo used are rather disappointing:
@unrestricted ~18s
2.4.1 ~20.53s
bintree ~20.25s
The gains from the binary tree are negligible for this particular snap. I then proceeded to hack a bit on the BPF simulator built into libseccomp, adding a count of the instructions executed before reaching a “ret <action>” instruction. The patch is included below.
BPF simulator patch
diff --git a/tools/scmp_bpf_sim.c b/tools/scmp_bpf_sim.c
index 73d056b..2eebcec 100644
--- a/tools/scmp_bpf_sim.c
+++ b/tools/scmp_bpf_sim.c
@@ -155,7 +155,7 @@ static void end_action(uint32_t action, unsigned int line)
 static void bpf_execute(const struct bpf_program *prg,
 			const struct seccomp_data *sys_data)
 {
-	unsigned int ip, ip_c;
+	unsigned int ip, ip_c, inst_cnt;
 	struct sim_state state;
 	bpf_instr_raw *bpf;
 	unsigned char *sys_data_b = (unsigned char *)sys_data;
@@ -167,9 +167,11 @@ static void bpf_execute(const struct bpf_program *prg,
 	/* initialize the machine state */
 	ip_c = 0;
 	ip = 0;
+	inst_cnt = 0;
 	memset(&state, 0, sizeof(state));

 	while (ip < prg->i_cnt) {
+		inst_cnt++;
 		/* get the instruction and bump the ip */
 		ip_c = ip;
 		bpf = &prg->i[ip++];
@@ -215,6 +217,9 @@ static void bpf_execute(const struct bpf_program *prg,
 			ip += jf;
 			break;
 		case BPF_RET+BPF_K:
+			if (opt_verbose) {
+				fprintf(stderr, "instruction count: %d\n", inst_cnt);
+			}
 			end_action(k, ip_c);
 			break;
 		default:
That kind of explains why the gain is so small with the experimental branch. Strace data shows that lseek() and read() account for 97% of all the syscalls executed by the benchmarked command. With lseek() costing essentially the same in both profiles, the results for volatility are influenced by the other syscalls. For instance, openat() (probably due to Python compiling modules?) pops up high on my list, and it also exhibits a significantly shorter instruction count with the bintree profile. By the looks of it, the experimental branch seems capable of delivering better average performance.
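One way to see why the average gain is small: the effective filter cost is a frequency-weighted sum, and the two hot syscalls dominate it. A sketch with invented per-syscall instruction counts (only the ~97% lseek()/read() share reflects the strace data above; everything else is made up for illustration):

```python
# If the hot syscalls are already matched early by the linear profile,
# flattening the lookup to ~log2(N) for everything moves the weighted
# average very little (and can even hurt the hot path slightly).

freq = {"lseek": 0.74, "read": 0.23, "openat": 0.02, "other": 0.01}
linear_cost = {"lseek": 6, "read": 7, "openat": 40, "other": 60}    # invented
bintree_cost = {"lseek": 10, "read": 10, "openat": 10, "other": 10}  # ~log2(N)

def expected_cost(cost):
    """Frequency-weighted average instructions per filtered syscall."""
    return sum(freq[s] * cost[s] for s in freq)

print(expected_cost(linear_cost))   # dominated by the two cheap hot calls
print(expected_cost(bintree_cost))
```

Under these made-up numbers the linear profile even comes out slightly ahead on average, which is consistent with the measured 2.4.1 vs bintree timings being a wash for this workload.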
I somewhat misspoke here. Prior to 2.4, you could influence the bpf sort order through the order in which rules were added to the filter, but as of 2.4 you cannot do that anymore. However, libseccomp provides seccomp_syscall_priority(), which allows tuning the priority. @mborzecki did some instrumenting here and I’ll let him comment with details, but giving the highest priority to volatility’s most used syscalls only brought a small gain. For volatility, it appears that the act of filtering at all is what introduces the overhead, since it is calling lseek() nearly 1 million times in @mborzecki’s test runs.
UPDATE: @mborzecki updated these values with averages and I have edited this comment to account for that for clarity. The updated values are:
@unrestricted 18.18s
2.4.1 ~20.26s
bintree ~20.40s
bintree with prioritized ~20.25s
With 2.4.1 there is a ~10% slowdown (18.18s/20.26s) and with the upcoming bintree an ~11% slowdown (18.18s/20.40s) for volatility. Using seccomp_syscall_priority() on top of bintree to give lseek() and read() (and a few others) the highest priority, this can be reduced back to ~10% for volatility (18.18s/20.25s).
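For reference, the quoted percentages follow directly from the averaged timings, taking slowdown = 1 - unrestricted/measured:

```python
# Reproducing the slowdown percentages from the averaged timings above.
base = 18.18  # @unrestricted average, in seconds
measured = {"2.4.1": 20.26, "bintree": 20.40, "bintree+prio": 20.25}
slowdown = {name: round((1 - base / t) * 100, 1) for name, t in measured.items()}
print(slowdown)
```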
We’ve always known that there is an overhead to using seccomp filtering; this topic and @mborzecki’s work demonstrate how to measure it. volatility itself is somewhat pathological in that it calls two syscalls over 1.1 million times (lseek() almost a million times and read() 135k times). @mborzecki’s measurements show that sort order isn’t the only factor and that there is kernel overhead in mediating the syscall beyond searching for it, but all considered it isn’t terrible even for pathological cases like volatility. There are also things snapd can do, without any architectural changes or changes to the kernel, to bring the overhead as low as ~10% for this pathological case, and likely much better for standard cases where bintree would shine.
Today, volatility can compile and ship the pyc files to cut the slowdown by more than half. We’ll pick up bintree in the fullness of time with libseccomp 2.5 to bring an additional small improvement. If warranted, we can in the future explore how to expose seccomp_syscall_priority(), or even the choice of sort algorithm, in snapd to bring this down even further (e.g., perhaps by letting snap developers specify the priority of syscalls in their snap.yaml when the defaults aren’t good enough, or by adding an ‘optimize’ command to snapcraft/etc. to measure usage and make this easier for the developer).
Thank you @mborzecki and @mvo for the measurements. It seems clear to me we should add some benchmarking in this area which includes non-pathological cases, and then see what the next steps are.
I tried to collect more samples. The variance across runs is quite annoying and it’s hard to take measurements reliably. Unsurprisingly, collecting more samples changes the picture a little. We’ll need a better benchmark if we are to investigate this further.
For some totally unscientific results from running volatility across 10 runs:
“2.4.1”, “bintree” and “prioritized” indicate profiles generated with libseccomp 2.4.1, the bintree patches, and snap-seccomp syscall priority tweaks, respectively (a disassembly of the prioritized profile is here: https://paste.ubuntu.com/p/DHV5tfChq6/)
unrestricted is basically seccomp filtering being disabled
The prioritized sample had the following syscall priorities assigned, from 255 down, in this order: "lseek", "read", "munmap", "fstat", "mmap", "openat", "stat". The remaining syscalls were left at their default priorities.
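In other words, the assignment was the following (sketched here in Python; in C this corresponds to calls like seccomp_syscall_priority(ctx, SCMP_SYS(lseek), 255), where 255 is libseccomp’s highest priority):

```python
# The priority tweak described above: the hottest syscalls get the highest
# libseccomp priorities, assigned in descending order from 255.
hot_syscalls = ["lseek", "read", "munmap", "fstat", "mmap", "openat", "stat"]
priority = {name: 255 - i for i, name in enumerate(hot_syscalls)}
print(priority["lseek"], priority["stat"])  # highest and lowest of the tweaked set
```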
Now that we have investigated this a bit, we can prepare a better, more representative benchmark. This one is heavily skewed towards read() and lseek(), which are already handled early in the profile. If the tool used recvmsg()/sendmsg(), I would expect a higher impact from the bintree patch.