Performance problem on Pi 4 vs Pi 3

Hello, we used to run core18 and our app on an RPi 3B V1.2. This works properly, but we need to add more and more algorithms, so we decided to move to the RPi 4B. We expected a performance gain, but the results disappoint us: sysbench shows a large improvement between the RPi 3B and the RPi 4B, yet our applications do not show the same speedup.
So maybe our app doesn’t take advantage of the Cortex-A72 architecture?
Do you have any hints on optimisation flags for GCC?
We just use -O3.
Is the pi-kernel optimized for the architecture? The pi-kernel snap info shows that there is a kernel dedicated to the Pi 4. We tried to rebuild the OS targeting this kernel, but the system gets stuck at the splash screen.

Removing "quiet splash" shows the following: first the GPU boots, then the rainbow screen is shown, then four raspberries appear in the upper-left corner of the screen, and then the screen goes black with no activity, just the green LED heartbeat.

Any ideas? Has anyone succeeded in using the pi4-kernel (version 77)?

to make full use of the cortex-a72 specifics, you need to use the arm64 architecture (for your image and your app), not armhf … note though that this will make applications use significantly more memory (on arm64, pointers and long values take 64 bits instead of 32).

i don't think there is any pi4 specific kernel in any stable channel, there are some pre-release bits in specific candidate and beta channels that were used for bringing up the architecture but you should not use them in production … the valid pi-kernel to use on a core18 based pi4 arm64 image is 18-pi/stable (currently version 5.3.0-1036.38)

Hello ogra, and thanks for your answer. We already use arm64 on both platforms (RPi 3 and RPi 4), and you’re right, there’s no stable pi4 kernel; that’s why we chose the candidate one "by default" to test, but it seems to fail at the ARM boot step. What are the differences between pi-kernel and pi4-kernel? Is there any source available?

How can we explain such poor performance of our application (between 0 and 10% improvement), and also at the IRQ handling level? We use SPI0 for the CAN bus through an MCP2515, and the irq/42-mcp251x process has approximately the same CPU load on both platforms. This suggests to me that we are not taking advantage of the A72 architecture.
uname -a returns:
5.3.0-1035-raspi2 #37-Ubuntu SMP Mon Sep 28 18:15:02 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
lscpu shows:

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
BogoMIPS:            108.00
Flags:               fp asimd evtstrm crc32 cpuid
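
Given the 600 to 1500 MHz range in the lscpu output above, one thing worth ruling out is that the cores sit at a low clock while the application runs. A minimal sketch, assuming the kernel exposes the standard cpufreq sysfs files under /sys/devices/system/cpu (the function name read_cpufreq is only for illustration):

```python
from pathlib import Path


def read_cpufreq(base="/sys/devices/system/cpu"):
    """Return {cpu_name: {"governor", "cur_khz", "max_khz"}} for each core.

    Assumes the usual cpufreq sysfs layout; cores without a cpufreq
    directory are skipped rather than treated as an error.
    """
    result = {}
    for cpu_dir in sorted(Path(base).glob("cpu[0-9]*")):
        freq_dir = cpu_dir / "cpufreq"
        if not freq_dir.is_dir():
            continue  # cpufreq not exposed for this core
        result[cpu_dir.name] = {
            "governor": (freq_dir / "scaling_governor").read_text().strip(),
            "cur_khz": int((freq_dir / "scaling_cur_freq").read_text()),
            "max_khz": int((freq_dir / "cpuinfo_max_freq").read_text()),
        }
    return result
```

Calling read_cpufreq() on the board while the application is under load returns one entry per core. If cur_khz stays well below max_khz, the governor or thermal throttling would be a more likely suspect than the compiler flags.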

well, there is no such package as pi4-kernel, and the code from test channels like the 18-pi4/candidate track has long been merged into the 18-pi tracks. 18-pi is what you should be using, and it should make use of the cortex-a72 features of the SoC …

whether the peripherals like SPI or CAN will benefit from that i can't tell, but typically the cortex-a72 features only apply to the SoC itself …
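
To compare the two boards on equal terms for the MCP2515, it may help to relate the irq thread's CPU load to the interrupt rate rather than looking at the load alone. The sketch below sums the per-CPU counters of every /proc/interrupts line containing a given name and turns two snapshots into a rate. It assumes the line carries the name mcp251x, as the irq/42-mcp251x thread name suggests; the helper names are hypothetical.

```python
def irq_counts(interrupts_text, name):
    """Sum the per-CPU counts of all /proc/interrupts lines mentioning `name`."""
    lines = interrupts_text.splitlines()
    if not lines:
        return 0
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    total = 0
    for line in lines[1:]:
        if name not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ label (e.g. "42:"); the next n_cpus are counts
        total += sum(int(f) for f in fields[1:1 + n_cpus])
    return total


def irq_rate(before_text, after_text, name, seconds):
    """Interrupts per second between two snapshots taken `seconds` apart."""
    delta = irq_counts(after_text, name) - irq_counts(before_text, name)
    return delta / seconds
```

Taking two snapshots of /proc/interrupts a few seconds apart on each board and passing them to irq_rate gives interrupts per second. If that rate and the SPI clock are the same on both boards, a similar CPU load would not be surprising, since much of the handler's time may go into waiting on the SPI transfer rather than computing.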

if you see any slowness with peripherals, a look at the devicetree overlays might make sense; if you see slowness of apps, there might be ways to set specific compile time options (neon and thumb2 come to mind here)
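
On an arm64 build, Advanced SIMD (NEON) is already enabled by default and Thumb-2 only exists for 32-bit ARM, so the CPU selection options are probably the more relevant lever there. A rough way to check whether they matter is to build the same benchmark with several flag sets and time each one. The sketch below assumes a GCC that accepts -mcpu=cortex-a72 and -mtune=cortex-a72 (both are documented AArch64 options); bench.c stands in for a representative slice of the application, and the helper names are hypothetical.

```python
import subprocess
import time

# Candidate flag sets. Whether any of them helps this particular
# application is untested here; measure before adopting one.
FLAG_SETS = {
    "baseline": ["-O3"],
    "tune-a72": ["-O3", "-mtune=cortex-a72"],  # scheduling only
    "cpu-a72": ["-O3", "-mcpu=cortex-a72"],    # instruction set + tuning
}


def gcc_command(source, output, flags, compiler="gcc"):
    """Build the argument list for one compile."""
    return [compiler, *flags, "-o", output, source]


def build_and_time(source, flags, output="./bench.out", runs=3):
    """Compile `source` with `flags`, run it `runs` times, return best time in s."""
    subprocess.run(gcc_command(source, output, flags), check=True)
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([output], check=True)
        best = min(best, time.perf_counter() - start)
    return best
```

Looping over FLAG_SETS and calling build_and_time("bench.c", flags) for each entry on the Pi 4 gives a like-for-like comparison. Taking the best of several runs reduces noise from frequency scaling and other processes.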