Everybody wants the best compiler flags (cflags) to put in their make.conf for speed and optimization, so after days of tests on recent processors I found some interesting news.
Everything said here applies to the Athlon 64, Athlon 64 dual core and Opteron. No other processors were tested.
Toolchain used: gcc-4.1.2, glibc-2.6.1.
Profile: amd64 no-multilib (that means a pure 64-bit system, but that doesn't affect the results).
So, for the impatient, here is the result of two nights and days of performance testing:
######################################################
Best cflags: -march=athlon64 -O3 -pipe -fomit-frame-pointer -finline-functions
######################################################
Shocked??? They look so normal...
No -ffast-math? No -ftracer or other exotic flags?
And, just a note: -finline-functions is already enabled by -O3, and on amd64 gcc omits the frame pointer by default anyway, so the two extra flags are technically redundant. I put them in because some ebuilds replace -O3 with -O2, and the hope is that the two flags are kept alongside the -O2. And -pipe is not a real optimization flag; it just speeds up compilation. So the result boils down to -O3.
Please don't argue with me about these results; I won't answer or go in depth anyway. If you want to know a bit more, read ahead, but then don't argue with me anyway.
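For reference, this is roughly how the result would look in /etc/make.conf (mirroring CFLAGS into CXXFLAGS is the usual Gentoo convention; the MAKEOPTS value is just an example and has nothing to do with optimization):

```shell
# /etc/make.conf -- sketch; only the CFLAGS line comes from the tests above
CFLAGS="-march=athlon64 -O3 -pipe -fomit-frame-pointer -finline-functions"
CXXFLAGS="${CFLAGS}"
MAKEOPTS="-j2"   # example: parallel compile jobs, not an optimization flag
```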
--
For those who want to know how I arrived at that result.
The hardware:
Two similar Asus-based machines, one with an Athlon 64 and one with an Athlon 64 dual core, plus a Supermicro server board (dual-CPU capable) with a single Opteron.
How I performed the tests:
- I started from acovea (simply put, an evolutionary compiler-flag tester), running all tests twice on every machine and noting the results: acovea's best flags, optimistic flags, and pessimistic flags. Every single run gave me different flags, but I had to start somewhere. (Note: acovea's tests are each very specific to a narrow range of operations; they test one sector of the CPU at a time.)
- I care about daily usage, and in daily usage I will never sacrifice one range of CPU operations for another, so I did a cross-strip of cflags: I deleted from acovea's best and optimistic lists any flag appearing in any of the pessimistic lists. From the remaining flags I built 4 groups out of the common best and common optimistic flags, plus I added some "commonly used" groups of best flags (ahem... well... at least "thought to be best" flags). Total: 7 groups of cflags to test.
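The cross-strip step can be sketched with plain coreutils; the flag lists below are invented placeholders, not acovea's actual output:

```shell
#!/bin/sh
# Cross-strip sketch: drop from the "best" list every flag that also
# appears in the "pessimistic" list. The flag lists here are made up.
export LC_ALL=C

printf '%s\n' -O3 -ftracer -funroll-loops | sort > best.txt
printf '%s\n' -ftracer -fno-inline | sort > pessimistic.txt

# comm -23 keeps only the lines unique to the first (sorted) file
comm -23 best.txt pessimistic.txt > stripped.txt
cat stripped.txt   # -O3 and -funroll-loops survive; -ftracer is dropped
```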
- Then I needed programs to compile. Question: will I gain productivity from a 7% speed increase in processing OpenOffice documents? Not me. My typing speed is still slower than my processor.
- Every program was compiled on each machine with each group of flags and benchmarked the same way every time. Yes, that was a whole weekend of testing.
- On the fastest machine I added some runs using exactly acovea's best flags.
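The test loop itself can be sketched like this; bench.c is a toy stand-in for the real programs, and the three flag groups are placeholders for the 7 groups actually tested:

```shell
#!/bin/sh
# Benchmark-loop sketch: compile one workload with each flag group and
# time it. bench.c below is a toy stand-in, not one of the real tests.
command -v gcc >/dev/null 2>&1 || { echo "gcc not found, skipping"; exit 0; }

cat > bench.c <<'EOF'
#include <stdio.h>
int main(void) {
    double s = 0.0;
    for (long i = 1; i < 20000000; i++) s += 1.0 / (double)i;
    printf("%f\n", s);  /* print the sum so the loop is not optimized away */
    return 0;
}
EOF

for flags in "-O2" "-O3" "-O3 -ffast-math"; do
    gcc $flags -o bench bench.c || exit 1
    t0=$(date +%s)
    ./bench > /dev/null
    t1=$(date +%s)
    echo "$flags -> $((t1 - t0)) s"   # coarse 1-second resolution
done
```

A real run would of course use a longer workload and a finer timer than date +%s, but the structure is the same: one binary per flag group, identical input every time.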
Conclusions:
Best cflags: -march=athlon64 -O3 -pipe -fomit-frame-pointer -finline-functions
So let gcc do its job. If you need more raw processor speed, don't waste your time and buy a faster one.
Not much else to say.
Some considerations (before somebody posts something already said or verified elsewhere):
- Recent versions of gcc do a very good job of optimizing. Some flags made a noticeable difference back in the 3.x series; that's over now.
- There are well over a hundred cflags. Some of them are pessimistic on their own but helpful when combined with others, and some are harmful in certain tasks while, in contrast, improving performance in others.
- Modern programs exercise a wide range of tasks, from floating-point calculation to memory-block manipulation; you cannot find one combination of flags that is best for everything.
- The -march=athlon64 flag is the first one to consider. Compared to this flag, nothing else is relevant.
- A well-programmed application can gain much more than any "best" combination of compiler flags. Programmers: review your slow routines.
- A $20 difference when buying a processor can mean a 30% speed increase when rendering an image with Blender. Tweaking flags, maybe 5%. And you won't see the difference surfing with Konqueror or downloading mail.
- Acovea's flags sometimes give a 40% performance increase on... acovea's own tests. That's because they are very specific tests performing one very specific task at a time. On everyday applications they are worse. That said, remember that acovea is a good and sometimes very useful program (see below).
- If you have one very specific program performing one very specific, time-consuming task (say you are a researcher and you wrote a C program to solve the field equations of GR on Riemannian manifolds using a subdivision approach) that can run for days on your machine, then I suggest taking the heaviest functions and testing them in acovea to find the best-performing cflags. There you may really gain speed.
As a last topic, let's nuke some legends:
- "You will not gain from 64-bit compiled programs on a modern 64-bit Athlon (or better) system." False. You WILL gain from 64-bit compiled programs, in some cases even 20%.
- "You will not gain from a dual core over a single core processor." Well, there is some truth in this statement if you use gcc. Applications still aren't optimized for dual processors/dual cores, and some benchmarks found on the internet say you can actually lose performance in some cases. I have no direct experience of this (the single core I used is slower than the dual core anyway, so I can't compare the results), but my guess is that the processor wastes more time splitting threads and tasks than it would spend doing everything on a single core. Using icc seems to increase performance a lot, and Yafray seems to gain much more than other programs when compiled on the dual-core machine.
- "-Os (or -O) is better because apps load faster." Are you serious? Modern programs are made of a bunch of libraries; they are loaded and run as needed, and they usually don't exceed a few megabytes. The kwrite launcher is a few kilobytes, and so are its libraries. They are loaded in sequence, and application startup wastes more time initializing and executing libraries than loading them. So this is simply false in most cases (of course I'm not considering old PCs running short of RAM).
- "-O1 -fomit-frame-pointer -finline-functions is comparable to -O2." False, and the difference is noticeable.
- "-O2 is better than -O3." False, but the difference is often not noticeable.
- "-fomit-frame-pointer and -finline-functions are the first cflags to consider." True. The difference between -O2 and -O3 is pretty much annihilated when you add the two flags to -O2, with -fomit-frame-pointer making the bigger contribution.
- "-ffast-math, -funsafe-math-optimizations, -mfpmath=387, or any combination of the three produces faster code." I wonder how many posts I've read from people claiming to be absolutely sure of this (bundling -funsafe-math-optimizations with -mfpmath=387 is especially common; there are guys out there claiming an impressive 50% increase in some applications!)... WTF!!!! Did they really test it, or did they all read about somebody else who read about it? This is absolutely false. On amd64 gcc uses the SSE unit for floating point by default, so forcing the old x87 unit with -mfpmath=387 is a step backwards: every floating-point-processor-stressing task performs slower (and not only those). A lot slower.
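If you want to verify which optimizations a given -O level really enables instead of trusting forum lore, newer gcc releases (4.3 and later, so not the 4.1.2 used here) can report it directly:

```shell
#!/bin/sh
# Ask gcc which optimizer flags each -O level enables (needs gcc >= 4.3;
# on the gcc-4.1.x used in this post you'd have to read the manual instead).
command -v gcc >/dev/null 2>&1 || { echo "gcc not found, skipping"; exit 0; }

echo "-O2:"; gcc -Q -O2 --help=optimizers | grep -E -- '-finline-functions[[:space:]]'
echo "-O3:"; gcc -Q -O3 --help=optimizers | grep -E -- '-finline-functions[[:space:]]'
```

Each line of output shows the flag name followed by [enabled] or [disabled] for that -O level.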
--
I won't post the raw results; I really have a dozen handwritten sheets on my desk and I don't think I'll ever find the time to reorder and post them, so please don't ask me for them.
For the most curious, I include a set of povray tests performed on the fastest machine (the other results are consistent). Same scene rendered in every test; the scene includes transparency and reflections, with radiosity calculated. Pure -O1 and -O2 are omitted; they are of no interest since they are worse anyway. I just want to highlight the interesting and most-discussed differences.
Flags -> rendering time (lower is better)
- Acovea's common best flags (stripped) -> 1:13
- Acovea's common optimistic flags (stripped) -> 1:11
- Any of acovea's raw best sets -> always above 1:17 (I even got 1:21 in one case)
-Os -> 1:16
-O1 -fomit-frame-pointer -finline-functions -> 1:14
-O2 -fomit-frame-pointer -finline-functions -> 1:10
-O3 -> 1:09 (!)
-O3 -mfpmath=387 -> 1:12
-O3 -ffast-math -> 1:16
-O3 -ffast-math -mfpmath=387 -> 1:17
-O3 -funsafe-math-optimizations -> 1:18
-O3 -funsafe-math-optimizations -mfpmath=387 -> 1:18