Concerns about performance

I just finished packaging an application, Volatility, which is a Python framework for memory analysis (forensics).

I expected performance close to native, but I am quite disappointed:

% time volatility-phocean.volatility -f ~/Temp/intrusion/dump.vmem imageinfo
Volatility Foundation Volatility Framework 2.6.1
INFO    : volatility.debug    : Determining profile based on KDBG search...
          Suggested Profile(s) : Win7SP1x64, Win7SP0x64, Win2008R2SP0x64, Win2008R2SP1x64_24000, Win2008R2SP1x64_23418, Win2008R2SP1x64, Win7SP1x64_24000, Win7SP1x64_23418
                     AS Layer1 : WindowsAMD64PagedMemory (Kernel AS)
                     AS Layer2 : FileAddressSpace (/home/phocean/Temp/intrusion/dump.vmem)
                      PAE type : No PAE
                           DTB : 0x187000L
                          KDBG : 0xf80002c530a0L
          Number of Processors : 1
     Image Type (Service Pack) : 1
                KPCR for CPU 0 : 0xfffff80002c54d00L
             KUSER_SHARED_DATA : 0xfffff78000000000L
           Image date and time : 2014-11-14 13:13:49 UTC+0000
     Image local date and time : 2014-11-14 14:13:49 +0100
volatility-phocean.volatility -f ~/Temp/intrusion/dump.vmem imageinfo  29,61s user 2,36s system 99% cpu 31,965 total

So, ~30 sec.

For comparison, from my virtualenv:

(vol) % time vol.py -f ~/Temp/intrusion/dump.vmem imageinfo
Volatility Foundation Volatility Framework 2.6.1
INFO    : volatility.debug    : Determining profile based on KDBG search...
          Suggested Profile(s) : Win7SP1x64, Win7SP0x64, Win2008R2SP0x64, Win2008R2SP1x64_24000, Win2008R2SP1x64_23418, Win2008R2SP1x64, Win7SP1x64_24000, Win7SP1x64_23418
                     AS Layer1 : WindowsAMD64PagedMemory (Kernel AS)
                     AS Layer2 : FileAddressSpace (/home/phocean/Temp/intrusion/dump.vmem)
                      PAE type : No PAE
                           DTB : 0x187000L
                          KDBG : 0xf80002c530a0L
          Number of Processors : 1
     Image Type (Service Pack) : 1
                KPCR for CPU 0 : 0xfffff80002c54d00L
             KUSER_SHARED_DATA : 0xfffff78000000000L
           Image date and time : 2014-11-14 13:13:49 UTC+0000
     Image local date and time : 2014-11-14 14:13:49 +0100
vol.py -f ~/Temp/intrusion/dump.vmem imageinfo  21,15s user 1,42s system 99% cpu 22,574 total

So, around 20 sec.

That’s almost a 30% gap. It will have a huge impact on large memory files and complex plugins.

Is such an overhead to be expected? Is there anything I could do in my snapcraft build to improve it?

Source: https://github.com/phocean/volatility-snap

Thanks for your message. This kind of overhead is not expected. Do you see this on subsequent runs too, or is the slowness only happening on the first run? I wonder if there is anything bad happening with the pyc files, for example.

Yes, it persists on subsequent runs, unfortunately.

Thanks! I was able to reproduce this with snapcraft 2.43 (from the deb) and 3.6 (from the snap). In both cases the “pyc” files were missing. After manually adding python -m compileall . to the snapcraft build, things got better. I still see a gap that is worth investigating, though.
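For anyone wanting to try the same workaround, here is a sketch of where such a byte-compile step could go in snapcraft.yaml. The part name, plugin, and source below are placeholders, not taken from the actual snap:

```yaml
parts:
  volatility:
    plugin: python
    source: .
    override-build: |
      snapcraftctl build
      # Byte-compile the installed sources so .pyc files ship inside the snap
      python -m compileall -q "$SNAPCRAFT_PART_INSTALL"
```

Note that override-build and snapcraftctl build are the snapcraft 3.x mechanism; snapcraft 2.x used a different scriptlet syntax, so the exact spelling depends on which snapcraft you build with.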


This is an interesting snap - it looks like running the Python code directly is still significantly faster than the snap, even when I use “snap try --devmode ./prime”. I did an strace -c of the snap and noticed it is performing a lot of syscalls (lseek, mmap) - this indicates seccomp overhead. And indeed - when I switch from “@complain” mode in the seccomp devmode profile to “@unrestricted”, the gap goes away.

In summary - this needs closer investigation into whether we can do anything about our bpf profile writing, or whether the kernel can be optimized so that very syscall-heavy apps (like this one) see less of a performance impact.

Fwiw - the numbers I have on my machine with a random example memory image are: ~14s when running directly from the virtualenv or when using @unrestricted in the seccomp profile, and ~20s with the regular snap. I see about 1134304 syscalls in total, mostly lseek/read (from strace -c).


Took a quick look at the profiles generated. Example profile for test-snapd-sh:

$ ./scmp_bpf_disasm < /var/lib/snapd/seccomp/bpf/snap.test-snapd-sh.test-snapd-sh.bin
 line  OP   JT   JF   K                                                                
=================================                                                      
 0000: 0x20 0x00 0x00 0x00000004   ld  $data[4]                       
 0001: 0x15 0x01 0x00 0xc000003e   jeq 3221225534 true:0003 false:0002
 0002: 0x05 0x00 0x00 0x00000173   jmp 0374                           
 0003: 0x20 0x00 0x00 0x00000000   ld  $data[0]                       
 0004: 0x35 0x00 0x02 0x40000000   jge 1073741824 true:0005 false:0007
 0005: 0x15 0x01 0x00 0xffffffff   jeq 4294967295 true:0007 false:0006
 0006: 0x06 0x00 0x00 0x00000000   ret KILL                           
 0007: 0x15 0xee 0x00 0x00000000   jeq 0    true:0246 false:0008      
 0008: 0x15 0xed 0x00 0x00000001   jeq 1    true:0246 false:0009      
 0009: 0x15 0xec 0x00 0x00000002   jeq 2    true:0246 false:0010      
 0010: 0x15 0xeb 0x00 0x00000003   jeq 3    true:0246 false:0011      
 0011: 0x15 0xea 0x00 0x00000004   jeq 4    true:0246 false:0012      
 0012: 0x15 0xe9 0x00 0x00000005   jeq 5    true:0246 false:0013      
 0013: 0x15 0xe8 0x00 0x00000006   jeq 6    true:0246 false:0014      
 0014: 0x15 0xe7 0x00 0x00000007   jeq 7    true:0246 false:0015      
 0015: 0x15 0xe6 0x00 0x00000008   jeq 8    true:0246 false:0016      
 0016: 0x15 0xe5 0x00 0x00000009   jeq 9    true:0246 false:0017      
 0017: 0x15 0xe4 0x00 0x0000000a   jeq 10   true:0246 false:0018      
 0018: 0x15 0xe3 0x00 0x0000000b   jeq 11   true:0246 false:0019      
 0019: 0x15 0xe2 0x00 0x0000000c   jeq 12   true:0246 false:0020
...

On amd64, the syscall numbers are:

  • read() 0
  • lseek() 8
  • mmap() 9

Say for mmap(), it’d take ~14 instructions to get out with ALLOW. Keep in mind that this is cBPF, so no fancy JIT or anything like that. Note that syscalls are scanned linearly, so the higher the syscall number, the worse it gets.

Turns out Oracle has some patches for libseccomp to generate rules using a binary search lookup rather than a linear scan, see https://github.com/seccomp/libseccomp/issues/116. There’s even a blog post where they mention getppid() (syscall number 110), see https://blogs.oracle.com/linux/seccomp%3A-safe-and-secure-and-slow-no-more. However, the gain for low-numbered syscalls is probably smaller (say, ~4-5 prologue instructions + a log2(<num>) lookup + 1 epilogue). As a rough estimate for mmap() (number 9), 14 instructions could potentially become 7-8 instructions, ~50-60% of the original cost. Compare that with their getppid() case: there we can probably treat the static prologue/epilogue as insignificant, and we’d get log2(110)/110, ~6% of the original cost (which checks out with their “Time to execute getppid()” numbers).
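To make the arithmetic concrete, here is a back-of-the-envelope model. The prologue/epilogue instruction counts are rough assumptions matching the estimates above, not measured values:

```python
import math

PROLOGUE = 4  # assumed: arch check + loading the syscall number
EPILOGUE = 1  # assumed: the final ret ALLOW

def linear_cost(nr):
    """Linear chain of jeq tests: one comparison per syscall number."""
    return PROLOGUE + nr + EPILOGUE

def bintree_cost(nr):
    """Binary-search lookup: ~log2 comparisons instead of a linear walk."""
    return PROLOGUE + math.ceil(math.log2(max(nr, 2))) + EPILOGUE

for name, nr in [("read", 0), ("mmap", 9), ("getppid", 110)]:
    print(f"{name:8} nr={nr:3}  linear ~{linear_cost(nr)}  bintree ~{bintree_cost(nr)}")
```

For mmap() this gives ~14 vs ~9 instructions and for getppid() ~115 vs ~12, roughly in line with the estimates above; a real filter also contains argument checks, so treat these as orders of magnitude only.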
Maybe we could collaborate with upstream and look into getting the PR to land?


It is known that the bpf ordering can impact performance and that this was an area of improvement for snapd. As it turns out, this is also an area of improvement for libseccomp, and in 2.4 they made bpf optimizations that should improve things. I’m curious whether this snap is faster with libseccomp 2.4.1 on the system at the time snap-seccomp is run. Also note that with libseccomp 2.4, libseccomp unconditionally orders the bpf however it wants, and snapd is not in a position to adjust that. AIUI, upstream libseccomp wants to do more optimization and would welcome contributions (see the aforementioned comment from @mborzecki).

I played a little with different libseccomp versions, including the experimental bintree branch, and wanted to share some findings.

I’ve rebuilt 2.3.3, 2.4.1, master and a bintree branch from https://github.com/drakenclimber/libseccomp/tree/rfc-bintree-v2 , then proceeded to compile BPF profiles with each version.

The disassembly of the profile generated with 2.4.1 version can be found here https://paste.ubuntu.com/p/wRTDmq39Sw/ and a bintree one is right here: https://paste.ubuntu.com/p/Q9VYyhchRR/.

The results obtained running volatility (without compiled Python modules) against the same image as @mvo are rather disappointing:

  • @unrestricted ~18s
  • 2.4.1 ~20.53s
  • bintree ~20.25s

The gains of the binary tree are negligible for this particular snap. I then proceeded to hack a bit on the BPF simulator built into libseccomp and added a count of the instructions executed before reaching a ret <action> instruction. The patch is included below.

BPF simulator patch
diff --git a/tools/scmp_bpf_sim.c b/tools/scmp_bpf_sim.c
index 73d056b..2eebcec 100644
--- a/tools/scmp_bpf_sim.c
+++ b/tools/scmp_bpf_sim.c
@@ -155,7 +155,7 @@ static void end_action(uint32_t action, unsigned int line)
 static void bpf_execute(const struct bpf_program *prg,
 			const struct seccomp_data *sys_data)
 {
-	unsigned int ip, ip_c;
+	unsigned int ip, ip_c, inst_cnt;
 	struct sim_state state;
 	bpf_instr_raw *bpf;
 	unsigned char *sys_data_b = (unsigned char *)sys_data;
@@ -167,9 +167,11 @@ static void bpf_execute(const struct bpf_program *prg,
 	/* initialize the machine state */
 	ip_c = 0;
 	ip = 0;
+	inst_cnt = 0;
 	memset(&state, 0, sizeof(state));
 
 	while (ip < prg->i_cnt) {
+		inst_cnt++;
 		/* get the instruction and bump the ip */
 		ip_c = ip;
 		bpf = &prg->i[ip++];
@@ -215,6 +217,9 @@ static void bpf_execute(const struct bpf_program *prg,
 				ip += jf;
 			break;
 		case BPF_RET+BPF_K:
+			if (opt_verbose) {
+				fprintf(stderr, "instruction count: %d\n", inst_cnt);
+			}
 			end_action(k, ip_c);
 			break;
 		default:

The results are summarized in the table:

| syscall      | <nr> | 2.4.1 #inst | bintree  #inst |
|--------------+------+-------------+----------------|
| read()       |    0 |           6 |             14 |
| lseek()      |    8 |          14 |             14 |
| mmap()       |    9 |          15 |             14 |
| getppid()    |  110 |         108 |             16 | (Oracle's blog post)
| epoll_wait() |  232 |         192 |             16 |
| openat()     |  257 |         204 |             16 |

That kind of explains why the gain is so small with the experimental branch. Strace data shows that lseek() and read() account for 97% of all syscalls executed by the benchmarked command. With lseek() costing basically the same, the results for volatility are influenced by other syscalls. For instance, openat() (probably due to Python compiling modules?) pops up high on my list, and it also exhibits a significantly shorter instruction count with the bintree profile. By the looks of it, the experimental branch seems capable of delivering better average performance.
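As a sanity check, here is a quick weighted-average instruction count over the syscalls that appear in both the instruction-count table above and the strace summary posted later in this thread (call counts are from strace; syscalls absent from the table are ignored):

```python
# (call count, instructions with 2.4.1, instructions with bintree),
# taken from the strace summary and the instruction-count table in this thread
syscalls = {
    "lseek":  (991820,  14, 14),
    "read":   (135093,   6, 14),
    "mmap":   (  9882,  15, 14),
    "openat": (  6687, 204, 16),
}

total = sum(calls for calls, _, _ in syscalls.values())
avg_241 = sum(calls * i241 for calls, i241, _ in syscalls.values()) / total
avg_bin = sum(calls * ibin for calls, _, ibin in syscalls.values()) / total
print(f"avg instructions/syscall: 2.4.1 ~{avg_241:.1f}, bintree ~{avg_bin:.1f}")
```

Because the dominant lseek() costs the same 14 instructions in both profiles, the average barely moves (~14.2 vs ~14.0), which is consistent with the near-identical timings above.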


I somewhat misspoke here. Prior to 2.4, you could influence the bpf sort order based on the order in which rules were added to the filter, but with 2.4 you cannot do that anymore. However, libseccomp provides seccomp_syscall_priority(), which allows tuning the priority. @mborzecki did some instrumenting here and I’ll let him comment with details, but giving the highest priority to volatility’s most-used syscalls only brought a small gain. For volatility, it appears that the act of filtering at all is what introduces the overhead, since it is calling lseek() nearly 1 million times in @mborzecki’s test runs.

UPDATE: @mborzecki updated these values with averages and I have edited this comment to account for that for clarity. The updated values are:

  • @unrestricted 18.18s
  • 2.4.1 ~20.26s
  • bintree ~20.40s
  • bintree with prioritized ~20.25s

With 2.4.1 there is a ~10% slowdown (18.18s vs 20.26s) and with the upcoming bintree an ~11% slowdown (18.18s vs 20.40s) for volatility. Using seccomp_syscall_priority() with bintree to give lseek() and read() (and a few others) the highest priority, this can be reduced back to ~10% for volatility (18.18s vs 20.25s).

We’ve always known that there was an overhead to using seccomp filtering. This topic and @mborzecki’s measurements demonstrate how to measure it. volatility itself is somewhat pathological in that it calls two syscalls 1.2 million times (lseek() almost a million times and read() 135k times). @mborzecki’s measurements show that sort order isn’t the only factor and that there is kernel overhead in mediating the syscall beyond searching for it, but all considered it isn’t terrible even for pathological cases like volatility. There are also things snapd can do, without any architectural changes or changes to the kernel, to make the overhead as low as 10% for this pathological case, and likely much better for standard cases where bintree would shine.

Today, volatility can compile and ship the pyc files to cut the slowdown by more than half. We’ll pick up bintree in the fullness of time with libseccomp 2.5 to bring an additional small improvement. If warranted, we can in the future explore how to expose seccomp_syscall_priority(), or even which sort algorithm to use, in snapd to bring this down even further (e.g., perhaps by letting snap developers specify the priority of syscalls in their snap.yaml when the defaults aren’t good enough, or by adding an ‘optimize’ command to snapcraft/etc to measure usage and make this easier for the developer).

Thank you @mborzecki and @mvo for the measurements. It seems clear to me that we should add some benchmarking in this area, including non-pathological cases, and then see what the next steps are.


Strace syscall summary for volatility:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 84.46    2.653077           2    991820        22 lseek
 12.42    0.390072           2    135093           read
  0.95    0.029716           3      8359           munmap
  0.56    0.017636           3      5646           fstat
  0.44    0.013712           1      9882           mmap
  0.44    0.013693           2      6687      5260 openat
  0.26    0.008121           3      2531      1175 stat
  0.17    0.005387           4      1241           brk
  0.11    0.003491           2      1432           close
  0.03    0.001013          63        16           futex
  0.03    0.000915          12        76           getdents

I tried to collect more samples. The variance between runs is quite annoying, and it’s hard to take any measurements reliably. Unsurprisingly, collecting more samples changes the picture a little. We’ll need a better benchmark if we are to investigate this further.

For some totally unscientific results from running volatility across 10 runs:

|     |  2.4.1 | bintree | prioritized | unrestricted |
|-----+--------+---------+-------------+--------------|
|     |  20.02 |   20.54 |       20.23 |        17.96 |
|     |  20.24 |   20.47 |       20.09 |         18.1 |
|     |  20.22 |   20.34 |       20.23 |        18.26 |
|     |  20.28 |   20.26 |       19.94 |        18.08 |
|     |  20.08 |   20.38 |       20.07 |        18.18 |
|     |  20.42 |   20.34 |       20.34 |           18 |
|     |  20.13 |   20.46 |       20.31 |        18.33 |
|     |   20.3 |   20.41 |       20.46 |        18.54 |
|     |  20.62 |   20.39 |       20.35 |        18.22 |
|     |  20.26 |   20.35 |       20.52 |        18.14 |
|-----+--------+---------+-------------+--------------|
| avg | 20.257 |  20.394 |      20.254 |       18.181 |
| dev | 0.1719 |  0.0800 |      0.1805 |       0.1696 |
  • 2.4.1, bintree, prioritized indicate profiles generated with 2.4.1, bintree patches and snap-seccomp syscall priority tweaks respectively (disassembly of the prioritized profile is here: https://paste.ubuntu.com/p/DHV5tfChq6/)
  • unrestricted is basically seccomp filtering being disabled

The prioritized sample had the following syscall priorities assigned, from 255 downwards, in this order: "lseek", "read", "munmap", "fstat", "mmap", "openat", "stat". The remaining syscalls were left at their default priorities.

Now that we have investigated this a bit, we can prepare a better, more representative benchmark. This one is heavily skewed towards read() and lseek(), which are already handled early in the filter. If the tool used recvmsg()/sendmsg(), I would expect a higher impact from the bintree patch.
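As a strawman for such a benchmark, here is a minimal Python sketch that just hammers lseek() in a tight loop (the iteration count is arbitrary); timing it inside and outside a snap should isolate the per-syscall filtering overhead without volatility’s other noise:

```python
import os
import tempfile
import time

def lseek_benchmark(iterations=100_000):
    """Time a tight loop of lseek() calls on a throwaway file."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 4096)
        start = time.perf_counter()
        for _ in range(iterations):
            os.lseek(fd, 0, os.SEEK_SET)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed / iterations  # seconds per lseek() call

print(f"~{lseek_benchmark() * 1e9:.0f} ns per lseek()")
```

A variant doing recvmsg()/sendmsg() over a socketpair would probe the high-numbered syscalls where bintree should shine.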