karp n00b

Joined: 20 May 2002 Posts: 38 Location: Champaign, IL
|
Posted: Fri Nov 17, 2006 1:15 am Post subject: Good CFLAGS for Intel Core 2 |
|
|
I'm trying to figure out how to produce good code with gcc on an Intel Core 2 CPU. Since there isn't anything official in gcc that applies to Core 2, I thought I would do some experiments with custom CFLAGS. (gcc currently doesn't have an architecture flag for the Core 2, and there won't be one in gcc 4.2.0 either.) I'm running a strictly 64-bit system, so that's the mode I'm testing. I'm using Acovea to sort out which flags are important.
First off, the march/mtune flags. The only ones that gcc will accept in 64-bit mode are 'opteron' and 'nocona'; all others produce "CPU you selected does not support x86-64 instruction set". Neither one is really appropriate for Core 2... in fact, I've found it's actually best to leave march/mtune unspecified. MMX, SSE, and SSE2 are built into the 64-bit specification, so they don't need to be specified.
Secondly, optimization level. Supposedly Apple does all their compilation using -Os (optimizing for small size), but I have yet to personally see -Os beating -O1 or -O2, and quite often -Os lags by 10-20%. My guess is that this is because Core 2 processors have pretty large caches, so cache misses aren't as much of a concern as they would be on other CPUs. -O2 is a tad better than -O1, though I haven't done enough testing to say for sure. -O3 doesn't seem to provide much benefit, takes more time to compile, makes much larger binaries, and sometimes produces incorrect results, so I'm not using it.
Then there are the flags that determine how floating-point math is done: -mfpmath={sse|sse,387|387}. For some reason, it's fastest to leave this unspecified, which is strange because the documentation says -mfpmath=sse is the default for x86-64, yet the actual default is faster than -mfpmath=sse. !?
One interesting thing I've found is that you can actually get a slight speedup from disabling MMX: -mno-mmx. Perhaps disabling MMX allows programs to avoid overhead? Or maybe it's a gcc bug?
Flags I won't be testing: -ffast-math and friends. Not suitable for system-wide CFLAGS, because they break several algorithms.
So, for my own machine, the CFLAGS are set to "-O1 -pipe".
I'll be working more on getting some hard numbers from Acovea, and I'll post the results here. |
 |
lplatypus n00b

Joined: 26 Mar 2004 Posts: 16
|
Posted: Sun Nov 19, 2006 6:50 pm Post subject: Re: Good CFLAGS for Intel Core 2 |
|
|
| karp wrote: | | MMX, SSE, and SSE2 are built into the 64-bit specification, so they don't need to be specified. |
What about -msse3 ? |
 |
karp n00b

Joined: 20 May 2002 Posts: 38 Location: Champaign, IL
|
Posted: Sun Nov 19, 2006 10:44 pm Post subject: |
|
|
SSE3 isn't enabled by default. But gcc doesn't seem to take advantage of it very well, because the Acovea results I'm getting indicate that enabling it somehow hurts performance.
Last I heard, SSSE3, which Core 2 supports, won't make it into gcc until 4.3:
http://gcc.gnu.org/ml/gcc-patches/2006-09/msg01285.html
Oh, and since my first post I have come across several benchmarks that benefit from -O2 over -O1, and none that are hurt. So my current CFLAGS is "-O2 -pipe". |
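One way to check what a particular gcc actually enables by default, with and without -msse3, is to dump its predefined macros using `gcc -dM -E` and look for `__MMX__`, `__SSE__`, `__SSE2__` and `__SSE3__`. The following is only a minimal Python sketch of that check: the helper names (`parse_defined_macros`, `simd_macros_for`) are made up for illustration, it assumes a `gcc` binary on the PATH, and the output will depend on the gcc version and target.

```python
import subprocess

# Macros gcc predefines when the corresponding instruction set is enabled.
SIMD_MACROS = ("__MMX__", "__SSE__", "__SSE2__", "__SSE3__", "__SSSE3__")


def parse_defined_macros(preprocessor_output):
    """Return the set of macro names found in '#define NAME value' lines."""
    names = set()
    for line in preprocessor_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "#define":
            names.add(parts[1])
    return names


def simd_macros_for(flags=(), compiler="gcc"):
    """Ask the compiler which SIMD macros are predefined with the given flags."""
    cmd = [compiler, "-dM", "-E", "-x", "c", "/dev/null", *flags]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    defined = parse_defined_macros(result.stdout)
    return [name for name in SIMD_MACROS if name in defined]


if __name__ == "__main__":
    try:
        print("default :", simd_macros_for())
        print("-msse3  :", simd_macros_for(["-msse3"]))
    except (OSError, subprocess.CalledProcessError) as exc:
        # gcc missing, or the target does not accept -msse3
        print("could not run gcc:", exc)
```

If `__SSE3__` only shows up in the second line, that matches the statement above that SSE3 has to be requested explicitly.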
 |
karp n00b

Joined: 20 May 2002 Posts: 38 Location: Champaign, IL
|
Posted: Mon Nov 20, 2006 12:40 am Post subject: |
|
|
Okay, here's my first result. I tweaked the gcc 4.0 Opteron config file from the Acovea website, and added gcc 4.1 and Core 2 flags. The benchmark is based on a simple raytracer from http://www.ffconsultancy.com/free/ray_tracer/languages.html. I chose a raytracer because it has a good mix of indirection, mathematics, and branching.
| Code: | Optimistic options:
-mno-push-args (1.988)
-fno-tree-copyrename (1.692)
-fno-strict-aliasing (1.938)
-finline-functions (2.825)
-funroll-all-loops (2.48)
Pessimistic options:
-mtune=opteron (-1.707)
-fno-guess-branch-probability (-1.757)
-fno-tree-dce (-1.954)
-fno-reorder-blocks (-1.658)
-fno-inline (-2.102)
-fno-rename-registers (-1.658)
-funroll-loops (-1.855)
-fbranch-target-load-optimize (-1.954)
Acovea's Best-of-the-Best:
g++ -lrt -lm -O2 -mtune=nocona -mno-mmx -momit-leaf-frame-pointer -mno-push-args -fno-defer-pop -fno-if-conversion2 -fno-tree-dominator-opts -fno-tree-dse -fno-tree-ter -fno-tree-sra -fno-tree-copyrename -fno-merge-constants -fno-cse-follow-jumps -fno-cse-skip-blocks -fno-strength-reduce -fno-caller-saves -fno-sched-interblock -fno-regmove -fno-delete-null-pointer-checks -fno-inline-functions-called-once -fno-tree-pre -finline-functions -fgcse-after-reload -fno-tree-vect-loop-version -fno-early-inlining -finline-limit=700 -fno-zero-initialized-in-bss -fgcse-sm -fgcse-las -fgcse-after-reload -ftree-loop-linear -fno-peephole -ftracer -funroll-all-loops -o /tmp/ACOVEA6FD7F1B7 raytrace.cpp
Acovea's Common Options:
g++ -lrt -lm -O2 -mtune=nocona -mno-mmx -mno-push-args -fno-if-conversion2 -fno-tree-ter -fno-delete-null-pointer-checks -finline-functions -funroll-all-loops -o /tmp/ACOVEAB05D5D16 raytrace.cpp
-O1:
g++ -lrt -lm -O1 -o /tmp/ACOVEAF66682E3 raytrace.cpp
-O2:
g++ -lrt -lm -O2 -o /tmp/ACOVEA2ECD22C2 raytrace.cpp
-O3:
g++ -lrt -lm -O3 -o /tmp/ACOVEADCAA04ED raytrace.cpp
-Os:
g++ -lrt -lm -Os -o /tmp/ACOVEA90194FC2 raytrace.cpp
A relative graph of fitnesses:
Acovea's Best-of-the-Best: ************************ (0.519285)
Acovea's Common Options: ************************ (0.509968)
-O1: ************************************ (0.771686)
-O2: ********************************** (0.717607)
-O3: *************************** (0.586796)
-Os: ************************************************** (1.05193)
|
|
 |
karp n00b

Joined: 20 May 2002 Posts: 38 Location: Champaign, IL
|
Posted: Tue Nov 21, 2006 1:23 am Post subject: |
|
|
The Huffman encoding benchmark included with Acovea:
| Code: | Optimistic options:
-fno-tree-copyrename (1.975)
-fno-strength-reduce (1.921)
-fno-sched-interblock (1.813)
Pessimistic options:
-mtune=nocona (-2.561)
-mno-sse2 (-2.183)
-floop-optimize2 (-2.129)
-fno-if-conversion (-2.561)
-fno-tree-lrs (-2.129)
-fno-tree-ch (-2.237)
-fno-expensive-optimizations (-1.535)
-fno-align-functions (-1.697)
-frename-registers (-1.643)
-funroll-loops (-1.967)
-fbranch-target-load-optimize (-1.535)
Acovea's Best-of-the-Best:
gcc -lrt -lm -std=gnu99 -O2 -mieee-fp -momit-leaf-frame-pointer -mno-push-args -maccumulate-outgoing-args -mno-align-stringops -minline-all-stringops -fno-delayed-branch -fno-thread-jumps -fno-guess-branch-probability -fno-loop-optimize -fno-if-conversion2 -fno-tree-ccp -fno-tree-dse -fno-tree-ter -fno-tree-sra -fno-tree-copyrename -fno-tree-fre -fno-thread-jumps -fno-crossjumping -fno-optimize-sibling-calls -fno-cse-follow-jumps -fno-strength-reduce -fno-peephole2 -fno-schedule-insns -fno-schedule-insns2 -fno-sched-interblock -fno-regmove -fno-strict-aliasing -fno-delete-null-pointer-checks -fno-reorder-functions -fno-unit-at-a-time -fno-inline-functions-called-once -fno-tree-pre -fgcse-after-reload -fno-tree-vect-loop-version -fno-defer-pop -fno-inline -finline-limit=700 -fmodulo-sched -fgcse-sm -fgcse-las -fgcse-after-reload -ftree-loop-im -ftree-loop-ivcanon -fivopts -fno-split-ivs-in-unroller -fprefetch-loop-arrays -fno-peephole -fno-web -ftracer -fpeel-loops -o /tmp/ACOVEAD78DAA47 /usr/share/libacovea/benchmarks/huffbench.c
Acovea's Common Options:
gcc -lrt -lm -std=gnu99 -O2 -maccumulate-outgoing-args -mno-align-stringops -fno-strength-reduce -fno-schedule-insns -fno-sched-interblock -fno-unit-at-a-time -fno-defer-pop -o /tmp/ACOVEAF636505B /usr/share/libacovea/benchmarks/huffbench.c
A relative graph of fitnesses:
Acovea's Best-of-the-Best: ********************************** (1.42823)
Acovea's Common Options: *************************************** (1.63152)
-O1: *************************************** (1.62643)
-O2: ************************************** (1.58268)
-O3: ************************************** (1.58807)
-Os: ************************************************** (2.07581)
|
|
 |
schmobag Tux's lil' helper

Joined: 17 Feb 2004 Posts: 91 Location: Los Angeles
|
Posted: Tue Nov 21, 2006 12:00 pm Post subject: |
|
|
I'm a little confused about -Os.
Your original post says that you've never seen -Os beat -O1.
The gcc man page defines -Os like this:
| Code: |
-Os Optimize for size. -Os enables all -O2 optimizations that do not typically increase code
size. It also performs further optimizations designed to reduce code size.
-Os disables the following optimization flags: -falign-functions -falign-jumps
-falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition
-fprefetch-loop-arrays -ftree-vect-loop-version
|
So it seems that generally, -Os is -O1 plus all the -O2 optimizations that don't increase code size. If that was all, then it would seem logically impossible for -O1 to beat -Os. But then the definition says that -Os "also performs further optimizations designed to reduce code size." Mysterious. Are those the optimizations that slow -Os down so that -O1 beats it?
I confess to having no idea how to read your Acovea benchmarks. In the raytracer one, -Os is much lower than the others. In the Huffman one, -Os is a bit higher. Does that mean that -Os is the worst for the raytracer, and the best for the Huffman, or is it the reverse? |
 |
batistuta Veteran


Joined: 29 Jul 2005 Posts: 1384 Location: Aachen
|
Posted: Tue Nov 21, 2006 12:39 pm Post subject: |
|
|
| schmobag wrote: | | In the raytracer one, -Os is much lower than the others. In the Huffman one, -Os is a bit higher. Does that mean that -Os is the worst for the raytracer, and the best for the Huffman, or is it the reverse? |
First of all, thanks Karp for the work. Great useful thread.
schmobag, I'm not sure if we are looking at the same graphs. In the ones that I see, -Os is always the slowest: in Huffman by a bit, in the raytracer by a bit more. The way I interpret the results is that since -Os skips optimizations that increase code size, things like inlining of functions that are probably called a lot in the raytracer and in Huffman are omitted. This will have an impact on performance. The same goes for loop unrolling and the like. And matrix operations have lots of loops that could get optimized.
So it looks to me that this particular test shows -Os as a non-optimal flag. But other applications could show very different results. For instance, I would expect a Web browser not to show these differences as much.
At the end of the day, it looks like the rule of thumb still prevails: choose your architecture, and -O2 for a good compromise between compilation speed, performance, and stability. Then Acovea, as the website suggests, is great for tuning particular applications that could benefit from optimization (like raytracing).
Nevertheless, these results are very interesting. Thanks for sharing the information! |
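To put numbers on that reading of the graphs: assuming the fitness values karp posted are run times, so that lower is better (which is how they are being read here), a small Python sketch can express each line relative to plain -O2. The figures are copied from the two Acovea listings earlier in the thread; the helper name `relative_to` is made up for illustration.

```python
# Fitness values copied from karp's two Acovea listings (assumed: lower is better).
RAYTRACER = {
    "Best-of-the-Best": 0.519285,
    "Common Options": 0.509968,
    "-O1": 0.771686,
    "-O2": 0.717607,
    "-O3": 0.586796,
    "-Os": 1.05193,
}
HUFFBENCH = {
    "Best-of-the-Best": 1.42823,
    "Common Options": 1.63152,
    "-O1": 1.62643,
    "-O2": 1.58268,
    "-O3": 1.58807,
    "-Os": 2.07581,
}


def relative_to(baseline, fitness):
    """Percent change in fitness versus the baseline entry (positive = slower)."""
    base = fitness[baseline]
    return {name: (value / base - 1.0) * 100.0 for name, value in fitness.items()}


if __name__ == "__main__":
    for title, table in (("raytracer", RAYTRACER), ("huffbench", HUFFBENCH)):
        print(title)
        for name, pct in relative_to("-O2", table).items():
            print(f"  {name:<18}{pct:+6.1f}% vs -O2")
```

On these numbers -Os comes out roughly 47% slower than -O2 on the raytracer and roughly 31% slower on huffbench, while -O3 takes about 18% less time than -O2 on the raytracer and is essentially level with it on huffbench.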
 |
schmobag Tux's lil' helper

Joined: 17 Feb 2004 Posts: 91 Location: Los Angeles
|
Posted: Tue Nov 21, 2006 12:55 pm Post subject: |
|
|
| batistuta wrote: | | schmobag, I'm not sure if we are looking at the same graphs. |
We were looking at the same ones, but I misread the first one. The number on the -Os line starts with 1.xxx, but I read that as .1xxxx. Quite a difference. |
 |
karp n00b

Joined: 20 May 2002 Posts: 38 Location: Champaign, IL
|
Posted: Tue Nov 21, 2006 2:41 pm Post subject: |
|
|
The best test for -Os would probably be to recompile an entire system with it, and then measure overall system performance compared to an entirely -O2 system. Having only a benchmark give up some code size for The Good of the Cache doesn't mean much when the kernel, glibc, etc. are still large. But if they *all* made a sacrifice, it could be worth it.
That said, I don't know if I feel up to recompiling my entire system multiple times to test this. Besides, I'm not even sure if there is a good overall-system-performance benchmark... maybe kernel-compilation? |
 |
batistuta Veteran


Joined: 29 Jul 2005 Posts: 1384 Location: Aachen
|
Posted: Tue Nov 21, 2006 4:13 pm Post subject: |
|
|
Kernel compilation is always a special case. So special that it has its own set of flags, and some flags are ignored even if you pass them. So I don't think the kernel is a good candidate for it... |
 |
karp n00b

Joined: 20 May 2002 Posts: 38 Location: Champaign, IL
|
Posted: Tue Nov 21, 2006 4:57 pm Post subject: |
|
|
I think you misunderstand... I wouldn't be changing the flags used inside of the kernel build process. I'd be comparing how toolchains compiled with various settings perform on a standard task: compiling the kernel. |
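A comparison like that boils down to timing the same command several times under each toolchain and comparing the results. Below is a minimal Python sketch of such a timing harness. It is not something used in this thread: the function names are hypothetical, and the workload in the example is a placeholder so that the sketch runs anywhere.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report the median, which is less sensitive than the mean to one noisy run."""
    return {"median": statistics.median(samples),
            "min": min(samples), "max": max(samples)}


if __name__ == "__main__":
    # Placeholder workload. On a real test box this would be the kernel build,
    # e.g. ["make", "-C", "/usr/src/linux", "bzImage"] (path is an assumption),
    # with a "make clean" before each run so every run does the same work.
    workload = [sys.executable, "-c", "sum(range(10**6))"]
    print(summarize(time_command(workload)))
```

The same script would be run once on the system whose toolchain was built with -Os and once on the -O2 one, and the two medians compared.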
 |
karp n00b

Joined: 20 May 2002 Posts: 38 Location: Champaign, IL
|
Posted: Tue Nov 21, 2006 5:04 pm Post subject: |
|
|
| batistuta wrote: |
At the end of the day, it looks like the rule of thumb still prevails: choose your architecture, and -O2 for a good compromise between compilation speed, performance, and stability. |
Well, actually, I found that not specifying the architecture is best, in this particular situation. |
 |
batistuta Veteran


Joined: 29 Jul 2005 Posts: 1384 Location: Aachen
|
Posted: Tue Nov 21, 2006 5:13 pm Post subject: |
|
|
| karp wrote: | | I'd be comparing how toolchains compiled with various settings perform on a standard task: compiling the kernel. |
OK, I understand.
| Quote: | | Well, actually, I found that not specifying the architecture is best, in this particular situation. |
This might be because your architecture is not fully supported. And once again, this depends on the particular application. |
 |
anli Tux's lil' helper

Joined: 08 Sep 2006 Posts: 78
|
Posted: Mon Dec 18, 2006 8:26 am Post subject: |
|
|
karp,
Thanks for your information!
I just want to clarify one point. After these months, which arch setting do you consider the best: 'nocona', or no arch flag at all? |
 |
peaceful Apprentice


Joined: 06 Jun 2003 Posts: 285 Location: Utah
|
Posted: Thu Jan 04, 2007 11:50 am Post subject: Re: Good CFLAGS for Intel Core 2 |
|
|
| karp wrote: |
Supposedly Apple does all their compilation using -Os (optimizing for small size), but I have yet to personally see -Os beating -O1 or -O2, and quite often -Os lags by 10-20%. |
I did extensive compiler flag testing of a project called ANNEvolve (Artificial Neural Network Evolution software) in December, 2004. I tested most relevant combinations of 15-20 different compiler flags on OS X on a G4 and G5, Linux on a POWER4 (cousin to both the G4 and G5), and Linux on a Pentium 4. I found that a simple "-Os" resulted in the fastest code on the G4, G5, and POWER4 processors, while -O3 resulted in the fastest code on the Pentium 4. None of the other dozens of flags I tried had any positive effect on runtime speed.
So, I'm not surprised that Apple would use -Os for all their PowerPC-based code. I would hope they don't blindly use it for their Intel code as well.
Bear in mind that this was one specific project written in C doing artificial neural network simulation, and this on GCC 3.something. The task at hand was heavy in floating-point calculations. |
 |
Element Dave n00b

Joined: 10 Nov 2006 Posts: 54
|
Posted: Tue Jan 23, 2007 5:56 pm Post subject: Re: Good CFLAGS for Intel Core 2 |
|
|
| karp wrote: | | Secondly, optimization level. Supposedly Apple does all their compilation using -Os (optimizing for small size), but I have yet to personally see -Os beating -O1 or -O2, and quite often -Os lags by 10-20%. |
Define "beating" in this context. I am not at all surprised that Apple uses "-Os" by default. Many people keep failing to recognize the simple and incontrovertible fact that smaller code will ALWAYS perform better when CPU utilization is at less than maximum. The CPU is almost never the bottleneck on a modern desktop. Only if CPU utilization is constantly near its limit does it make any sense to use "-O2" or "-O3" for a system default. Actually, I don't think it's a very good idea to have "-Oanything" defined globally, but that's another matter. There is also another complication: gcc sometimes produces bigger (!) code with "-Os" compared to "-O2". Unfortunately, it does not seem uncommon for this to happen. For example, cairo is much larger when compiled with "-Os" than it is with "-O2", using the gcc currently in the stable x86 branch. It would be interesting to compare the size of every compiled file on otherwise identical systems compiled with "-Os" and "-O2". |
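The size comparison suggested at the end of that post could be scripted roughly as follows. This is only a sketch under stated assumptions: it supposes the two builds have been installed into two separate directory trees with matching relative paths, it uses on-disk file size as a rough proxy for code size (the binutils `size` tool would give per-section numbers instead), and the function names are made up for illustration.

```python
import os
import sys


def file_sizes(root):
    """Map each regular file's path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def larger_with_os(small_root, fast_root):
    """Return (path, size_Os, size_O2) for files that are bigger in the -Os tree."""
    small, fast = file_sizes(small_root), file_sizes(fast_root)
    common = sorted(set(small) & set(fast))
    return [(p, small[p], fast[p]) for p in common if small[p] > fast[p]]


if __name__ == "__main__":
    # Usage: python compare_sizes.py /path/to/Os-tree /path/to/O2-tree
    if len(sys.argv) == 3:
        for path, s_os, s_o2 in larger_with_os(sys.argv[1], sys.argv[2]):
            print(f"{path}: -Os {s_os} bytes, -O2 {s_o2} bytes")
```

Any file it prints is a case like the cairo one described above, where the -Os build ended up bigger than the -O2 build.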
 |
nxsty Veteran


Joined: 23 Jun 2004 Posts: 1556 Location: .se
|
Posted: Sat Feb 03, 2007 2:06 pm Post subject: Re: Good CFLAGS for Intel Core 2 |
|
|
| karp wrote: | | Secondly, optimization level. Supposedly Apple does all their compilation using -Os (optimizing for small size), but I have yet to personally see -Os beating -O1 or -O2, and quite often -Os lags by 10-20%. My guess is that this is because Core 2 processors have pretty large caches, so cache misses aren't as much of a concern as they would be on other CPUs. -O2 is a tad better than -O1, though I haven't done enough testing to say for sure. -O3 doesn't seem to provide much benefit, takes more time to compile, makes much larger binaries, and sometimes produces incorrect results, so I'm not using it. |
That's not really comparable. Apple's gcc works differently from FSF's gcc at some optimization levels. -Os in Apple's gcc is just like -O2 except that it disables some flags, unlike -Os in FSF's gcc, which also tunes some optimizations for size rather than speed. The Apple-specific -Oz optimization level is the same as -Os in FSF's gcc. |
 |
likewhoa Guru

Joined: 04 Oct 2006 Posts: 461 Location: Brooklyn, New York
|
Posted: Sat Feb 03, 2007 4:45 pm Post subject: |
|
|
I've got an Opteron on amd64; I will do some Acovea runs. |
 |
yaneurabeya Veteran


Joined: 13 May 2004 Posts: 1754 Location: Silicon Valley
|
Posted: Sun Mar 25, 2007 12:15 am Post subject: |
|
|
Interesting thread. It may be because of my CLI resolution compared to the size of the FB resolution on the SUSE LiveCD, but it appears that things are trolling along a bit slower on my Core 2 Duo box. I only have 2GB of RAM in the box currently and need to add the other 2GB (running it in 64-bit mode).
Current CFLAGS: -O2 -pipe -fno-strict-aliasing -funroll-loops -msse -msse2 -msse3 <- guess I don't need -msse3 any more, do I?
If there's anything I can help test out, I'd be more than happy to. |