Discussion: CFLAGS on Raspberry Pi3 in 32bit mode

dr_wulsen · Last edited by dr_wulsen on Wed Aug 30, 2017 9:02 am; edited 3 times in total

Welcome to a discussion aimed at finding the optimum architecture-specific CFLAGS for a Raspberry Pi3 in 32-bit mode.
Some options are a little contradictory in themselves (even the very basic -march), and I would like to shed some light on what to use and what to avoid.
I obtain the flags GCC uses with: gcc -<options> -v -Q --help=target
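A minimal sketch of how that query can be narrowed to the options discussed in this thread. The helper name pick_arch_opts is made up for illustration, and the awk patterns assume the two-column layout (option, then value) that --help=target prints.

```shell
#!/bin/sh
# pick_arch_opts (hypothetical helper): read `gcc ... -Q --help=target`
# output on stdin and keep only the options discussed in this thread.
pick_arch_opts() {
    awk '
        # options that carry a value, e.g. "-march=   armv7-a"
        $1 ~ /^-m(arch|cpu|tune|fpu|float-abi|fp16-format|tls-dialect)=$/ { print $1 $2 }
        # on/off options, e.g. "-mrestrict-it   [enabled]"
        $1 ~ /^-m(fix-cortex-m3-ldrd|restrict-it|vectorize-with-neon-(double|quad))$/ { print $1, $2 }
    '
}

# Usage on the Pi:
#   gcc -march=native -Q --help=target | pick_arch_opts
```

Running it once per candidate flag set and diffing the results makes it easier to see what a single option actually changes.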

Architecture specific flags that I would like to discuss:

-march
Basic as it is, there is confusion. A more recent GCC version that I patched to not segfault with "native" reported an armv8-a+crc.
In /proc/cpuinfo the RPi3 identifies itself as armv7-a and GCC-5.4 will use armv7-a with the "native" flag.
Code generated with armv8-a+crc runs fine on the RPi3 in 32-bit mode.
I tried to "benchmark" this by FLAC-encoding a 10-minute WAV file, but there was no notable difference.
The floating-point benchmarks referenced later don't show a difference either.
Since there is no difference, I will stay with what GCC reports, which is armv7-a.
Any reasons to use armv8-a+crc?
Choice: armv7-a

-mcpu
The GCC manual states that -mcpu (if given without -march or -mtune) is used to determine -march and -mtune.
GCC-5.4 will set -march to armv2 when -mcpu=cortex-a53 is set. This seems less than desirable.
As -mtune in conjunction with -march can do the same (and they override -mcpu if all three are set), I would leave it empty.
Choice: unset

-mtune
Invoking GCC-5.4 with -mtune=native reports back -mtune=[default], which I guess is much the same as "unset".
Since the RPi3 is a Cortex-A53, setting it accordingly seems reasonable.
Choice: cortex-a53

-mfpu
-march=native will give you the vfpv3-d16 FPU. /proc/cpuinfo lists support for neon and vfpv4, but it gets more interesting.
Wikipedia says:
* VFPv3: 32 64-bit FPU registers as standard
* VFPv3-D16: As above, but with only 16 64-bit FPU registers
* VFPv4: Has 32 64-bit FPU registers as standard, adds both half-precision support as a storage format and fused multiply-accumulate instructions to the features of VFPv3
So at least setting it to neon-vfpv4 should be reasonable. However, my patched GCC reported crypto-neon-fp-armv8.
I tested with a single-precision Whetstone benchmark, trying every available -mfpu option in conjunction with -march=armv7-a and -march=armv8-a+crc.
Results can be found here.
vfpv3-* gives the best performance (best tried with gcc maybe?) but does not allow NEON to be used as the FP unit.
So for once I will (almost) stick with the default.
UPDATE: Found information that "neon" or "neon-fp16" actually enables "neon-vfpv3" and that explains why in my benchmarks "neon" was always on par with "vfpv3".
This allows us to use NEON FP when required (and with auto-vectorization) and still have vfpv3.
Since we support the half extension (see my next post) we can use "neon-fp16".
To enable the "half" extension, you must also set -mfp16-format to "ieee" or "alternative".
Choice: neon-fp16 (neon-vfpv4 likely with more recent GCC versions beyond 6.4.0)
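One way to check what a given -mfpu / -mfp16-format combination really switches on is to look at the macros the compiler predefines. The sketch below assumes GCC's usual ACLE-style macro names (__ARM_NEON, __ARM_FP16_FORMAT_IEEE and so on); the helper name arm_fp_macros is made up, and which macros appear will depend on the GCC version.

```shell
#!/bin/sh
# arm_fp_macros (hypothetical helper): filter a `gcc -dM -E` macro dump
# read on stdin down to the ARM FPU/NEON feature macros.
arm_fp_macros() {
    grep -E '^#define __ARM_(NEON|NEON_FP|FP|FP16_FORMAT_(IEEE|ALTERNATIVE)|FEATURE_FMA|ARCH) ' | sort
}

# Usage on the Pi, with the flags chosen in this thread:
#   gcc -march=armv7-a -mfloat-abi=hard -mfpu=neon-fp16 -mfp16-format=ieee \
#       -dM -E - </dev/null | arm_fp_macros
```

If no __ARM_FP16_FORMAT_* macro shows up, the half-precision format has not been selected for that invocation.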

-mfp16-format
According to here it is required to be set to either "ieee" or "alternative" to enable the half precision extension.
I have selected "alternative" (ARM format) as it covers a bigger range.
Choice: alternative
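The "bigger range" can be worked out by hand. Both formats have a 10-bit mantissa, but the alternative format gives up infinities and NaNs, so its top exponent code is available for normal numbers. A small sketch of the arithmetic (the function name half_max is only for illustration):

```shell
#!/bin/sh
# Largest finite half-precision value for a given maximum unbiased exponent.
# The mantissa 1.1111111111b equals 2047/1024, so max = 2047 * 2^(exp - 10).
half_max() {
    echo $(( ((1 << 11) - 1) << ($1 - 10) ))
}

echo "ieee max:        $(half_max 15)"   # top exponent code is reserved for Inf/NaN
echo "alternative max: $(half_max 16)"   # top exponent code holds normal values
```

This prints 65504 for "ieee" and 131008 for "alternative", i.e. one extra power of two at the price of having no Inf/NaN encodings.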

-funsafe-math-optimizations
Not architecture-specific, but according to the GCC manual it is required for auto-vectorization to generate NEON floating-point code.
I will try to compile my whole system with it and don't expect any issues - I'm not number crunching or doing anything terribly scientific.
Choice: -funsafe-math-optimizations
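The effect can be checked on a tiny loop. The sketch below writes a float "saxpy" function to a scratch file and prints two compile commands to compare on the Pi. The file name and the loop are made up for illustration. Note the assumption in the flags: -ftree-vectorize is added explicitly because in GCC 5 and 6 the loop vectorizer is not part of -O2 (it is switched on at -O3), and -fopt-info-vec asks GCC to report the loops it vectorized.

```shell
#!/bin/sh
# Write a small float loop that the vectorizer could turn into NEON code.
src=${TMPDIR:-/tmp}/saxpy_vec_test.c
cat > "$src" <<'EOF'
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
EOF

# Flags from this thread, plus -ftree-vectorize (assumption: needed at -O2).
base="-O2 -ftree-vectorize -march=armv7-a -mtune=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp16"

# Print the two commands to run on the Pi: without and with the unsafe-math flag.
for extra in "" "-funsafe-math-optimizations"; do
    echo "gcc $base $extra -fopt-info-vec -c $src -o /dev/null"
done
```

If the manual's statement holds, only the second command should report the loop as vectorized, because NEON treats denormals as zero and is therefore not fully IEEE 754 compliant.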

-mfloat-abi
-march=native will give you hard, which is good.
Again, the GCC docs: ‘hard’ allows generation of floating-point instructions and uses FPU-specific calling conventions.
This is the closest to the hardware we can get here, so it is desirable.
Choice: hard

-mfix-cortex-m3-ldrd
-march=native will enable it by default.
GCC docs say: Some Cortex-M3 cores can cause data corruption when ldrd instructions with overlapping destination and base registers are used.
This option avoids generating these instructions. This option is enabled by default when -mcpu=cortex-m3 is specified.
As our core is not a Cortex-M3, I think we can disable this. My guess is that performance won't decrease, and if it increases, the gain will be too small to measure.
Choice: disabled

-mrestrict-it
-march=native will enable it by default.
GCC docs: Restricts generation of IT blocks to conform to the rules of ARMv8. IT blocks can only contain a single 16-bit instruction from a select set of instructions.
This option is on by default for ARMv8 Thumb mode.
Quick info: an IT block is an "IF-THEN" block in ARM Thumb mode.
ARM Infocenter says:
This 16-bit Thumb instruction is available in ARMv6T2 and above.
In ARM code, IT is a pseudo-instruction that does not generate any code.
So we can safely leave it on and it won't affect us here.
Choice: default (enabled)

-mtls-dialect
-march=native will set it to "gnu"
GCC docs: The ‘gnu2’ dialect selects the GNU descriptor scheme, which provides better performance for shared libraries.
The GNU descriptor scheme is compatible with the original scheme, but does require new assembler, linker and library support.
I have not benchmarked anything, but gnu2 works on my system. If something does not compile with it, we can add an exception for it in package.env.
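A minimal sketch of such a package.env exception. Everything named here is an assumption for illustration: the file name tls-gnu.conf, the placeholder package app-misc/some-package, and the scratch directory, which is used so that nothing under /etc/portage is touched until the result has been reviewed.

```shell
#!/bin/sh
# Scratch copy of the Portage config tree; copy to /etc/portage once reviewed.
conf_root=${CONF_ROOT:-./portage-sketch}
mkdir -p "$conf_root/env"

# Per-package environment file: appending -mtls-dialect=gnu overrides the
# global gnu2 setting, because the last occurrence of the option wins.
cat > "$conf_root/env/tls-gnu.conf" <<'EOF'
CFLAGS="${CFLAGS} -mtls-dialect=gnu"
CXXFLAGS="${CXXFLAGS} -mtls-dialect=gnu"
EOF

# Map a (hypothetical) package that fails with gnu2 to that environment file.
echo "app-misc/some-package tls-gnu.conf" >> "$conf_root/package.env"
```

The same pattern works for any other flag from this thread that turns out to break a single package.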
Choice: gnu2

-mvectorize-with-neon-double default: disabled; -mvectorize-with-neon-quad default: enabled
ARM Infocenter says:
GCC 4.4 does not support vectorization with varying vector sizes. By default, it vectorizes for doubleword registers only.
You can instruct gcc to vectorize for quadword registers instead by specifying -mvectorize-with-neon-quad on the command line.
I believe that the default is what to stick to, so I left it unchanged.
Choice: default

This all leads me to the following CFLAGS: -O2 -pipe -fomit-frame-pointer -march=armv7-a -mtune=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp16 -mno-fix-cortex-m3-ldrd -mtls-dialect=gnu2 -funsafe-math-optimizations
_________________
There's no stupid questions, only stupid answers.

NeddySeagoon · Posted: Mon Aug 28, 2017 11:02 pm

dr_wulsen,

For benchmarks see this post and others by the same author.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

dr_wulsen · Posted: Wed Aug 30, 2017 7:54 am

Hi Neddy,

thank you for pointing me there. However, I am interested in the effects of the different CFLAGS and in clarity about what to use and why.

UPDATE: I've tested the whetstone-DP performance with GCC-6.4.0.
It has improved overall, but the pattern itself remains the same - vfpv3 and neon are still the same.

Now I found some information on Stack Overflow which fits my data well.
If "neon" enables vfpv3 and neon, as stated there - that explains my measurements.
It gives the same (highest) performance as "vfpv3" but still we can use NEON code for FP operations if desired.
And as the Cortex-A53 supports the half precision extension we can use the "neon-fp16" as FPU setting.

That was the one rated highest in my benchmarks, and it allows us to use NEON, so this will be my choice unless someone proves me wrong.

NeddySeagoon · Posted: Wed Aug 30, 2017 7:58 am

dr_wulsen,

The benchmark software is the same regardless.

You build the benchmark suite with one set of CFLAGS and test, then rinse and repeat with different settings.
Benchmarks tell you how good you are at running benchmarks. They say little about real world performance.
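The "rinse and repeat" loop can be sketched as a dry run that only prints one build command per flag combination. The function name bench_matrix, the output naming scheme and the source file whetstone.c are assumptions, and the -mfpu list is just the subset named in this thread, not every value GCC accepts.

```shell
#!/bin/sh
# bench_matrix (hypothetical helper): print one build command per
# -march / -mfpu pair, so each binary can be built and timed in turn.
bench_matrix() {
    src=$1   # benchmark source file that you supply
    for march in armv7-a armv8-a+crc; do
        for fpu in vfpv3-d16 vfpv3 neon neon-fp16 neon-vfpv4 crypto-neon-fp-armv8; do
            out="bench-$march-$fpu"
            echo "gcc -O2 -march=$march -mtune=cortex-a53 -mfloat-abi=hard -mfpu=$fpu $src -lm -o $out"
        done
    done
}

bench_matrix whetstone.c
```

Piping the output to sh on the Pi builds all twelve binaries; timing each one several times and comparing the spread gives a better picture than a single run.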

dr_wulsen · Posted: Wed Aug 30, 2017 10:32 am

Hi Neddy,

I'm aware of that. I've built Whetstone with all possible -mfpu settings and learned a lot about the FPU/NEON stuff. I've also built lbzip2 with the most promising flags and the ones I expected to perform best, but neon-fp16 still beats neon-fp-armv8 by three seconds when compressing 200 MB of random data (the same random data, placed in a file in RAM).
The results are linked in my original post. I was surprised that vfpv3/neon beats vfpv4/neon and armv8/neon there.
It was also interesting that gcc doesn't use the "half" extension unless told to do so.
I'm hoping that over time people share their experiences with different architecture-specific flags here (ideally in real-world scenarios) so we can max out what the Pi has to offer.

I've switched to gcc-6.4.0 as it showed improved performance which (also according to the changelog) is due to "improved code generation for cortex-a53". For that purpose it was nice to run the benchmark and see that with safe CFLAGS (-O2) the performance improved for all implementations, pushing neon-fp16 even higher.

NeddySeagoon · Posted: Wed Aug 30, 2017 1:51 pm

dr_wulsen,

gcc can't tell when the loss of precision due to using the "half" extension is acceptable and when it's not.
It avoids other tradeoffs like that on other arches too.

-Os may be faster on the Pi than -O2. It has a tiny cache and -Os may make better use of it than -O2.
Like all optimisations, no matter what you use, it's not optimum for everything.