Gentoo Forums
Good CFLAGS for Intel Core 2
karp
n00b

Joined: 20 May 2002
Posts: 38
Location: Champaign, IL

Posted: Fri Nov 17, 2006 6:15 am    Post subject: Good CFLAGS for Intel Core 2

I'm trying to figure out how to produce good code with gcc on an Intel Core 2 CPU. Since there isn't anything official in gcc that applies to Core 2, I thought I would do some experiments with custom CFLAGS. (gcc currently doesn't have an architecture flag for the Core 2, and there won't be one in gcc 4.2.0 either.) I'm running a strictly 64-bit system, so that's the mode I'm testing. I'm using Acovea to sort out which flags are important.

First off, the march/mtune flags. The only ones that gcc will accept in 64-bit mode are 'opteron' and 'nocona'; all others produce "CPU you selected does not support x86-64 instruction set". Neither one is really appropriate for Core 2... in fact, I've found it's actually best to leave march/mtune unspecified. MMX, SSE, and SSE2 are built into the 64-bit specification, so they don't need to be specified.

Secondly, optimization level. Supposedly Apple does all their compilation using -Os (optimizing for small size), but I have yet to personally see -Os beating -O1 or -O2, and quite often -Os lags by 10-20%. My guess is that this is because Core 2 processors have pretty large caches, so cache misses aren't as much of a concern as they would be on other CPUs. -O2 is a tad better than -O1, though I haven't done enough testing to say for sure. -O3 doesn't seem to provide much benefit, takes more time to compile, makes much larger binaries, and sometimes produces incorrect results, so I'm not using it.

Then there are the flags that determine how floating-point math is done: -mfpmath={sse|sse,387|387}. For some reason, it's fastest to leave this unspecified, which is strange because the documentation says -mfpmath=sse is the default for x86-64, yet the actual default is faster than -mfpmath=sse. !?

One interesting thing I've found is that you can actually get a slight speedup from disabling MMX: -mno-mmx. Perhaps disabling MMX allows programs to avoid overhead? Or maybe it's a gcc bug?

Flags I won't be testing: -ffast-math and friends. Not suitable for system-wide CFLAGS, because they break several algorithms.

So, for my own machine, the CFLAGS are set to "-O1 -pipe".

I'll be working more on getting some hard numbers from Acovea, and I'll post the results here.
desultory
Administrator

Joined: 04 Nov 2005
Posts: 7059

Posted: Fri Nov 17, 2006 7:25 am

Actual testing, nice. Interesting results too.

Some relevant links:
The Intel Core 2 Solo/Duo (Allendale, Conroe, Merom) entry in the Safe CFLAGS Wiki article.
Topic Core 2 Duo - Merom.
kernelOfTruth
Watchman

Joined: 20 Dec 2005
Posts: 5345
Location: Vienna, Austria; Germany; hello world :)

Posted: Sat Nov 18, 2006 9:06 pm

Interesting thread, karp.

Please keep us updated! :)
_________________
Unofficial minimal livecd x86/amd64 w/reiser4+truecrypt (by Neo2)
2.6.37.2_plus_v1: BFS, CFS,THP,compaction, zcache or TOI
Hardcore Linux user since 2004 :D
lplatypus
n00b

Joined: 26 Mar 2004
Posts: 16

Posted: Sun Nov 19, 2006 11:50 pm    Post subject: Re: Good CFLAGS for Intel Core 2

karp wrote:
MMX, SSE, and SSE2 are built into the 64-bit specification, so they don't need to be specified.

What about -msse3 ?
karp
n00b

Joined: 20 May 2002
Posts: 38
Location: Champaign, IL

Posted: Mon Nov 20, 2006 3:44 am

SSE3 isn't enabled by default. gcc doesn't seem to take advantage of it very well, though: the Acovea results I'm getting indicate that enabling it somehow hurts performance.

Last I heard, support for SSSE3, which Core 2 has, won't make it into gcc until 4.3:
http://gcc.gnu.org/ml/gcc-patches/2006-09/msg01285.html

Oh, and since my first post I have come across several benchmarks that benefit from -O2 over -O1, and none that are hurt. So my current CFLAGS are "-O2 -pipe".
karp
n00b

Joined: 20 May 2002
Posts: 38
Location: Champaign, IL

Posted: Mon Nov 20, 2006 5:40 am

Okay, here's my first result. I tweaked the gcc 4.0 Opteron config file from the Acovea website, and added gcc 4.1 and Core 2 flags. The benchmark is based on a simple raytracer from http://www.ffconsultancy.com/free/ray_tracer/languages.html. I chose a raytracer because it has a good mix of indirection, mathematics, and branching.

Code:
Optimistic options:

                          -mno-push-args  (1.988)
                    -fno-tree-copyrename  (1.692)
                    -fno-strict-aliasing  (1.938)
                      -finline-functions  (2.825)
                      -funroll-all-loops  (2.48)

Pessimistic options:

                          -mtune=opteron  (-1.707)
           -fno-guess-branch-probability  (-1.757)
                           -fno-tree-dce  (-1.954)
                     -fno-reorder-blocks  (-1.658)
                             -fno-inline  (-2.102)
                   -fno-rename-registers  (-1.658)
                          -funroll-loops  (-1.855)
           -fbranch-target-load-optimize  (-1.954)

Acovea's Best-of-the-Best:
g++ -lrt -lm -O2 -mtune=nocona -mno-mmx -momit-leaf-frame-pointer -mno-push-args -fno-defer-pop -fno-if-conversion2 -fno-tree-dominator-opts -fno-tree-dse -fno-tree-ter -fno-tree-sra -fno-tree-copyrename -fno-merge-constants -fno-cse-follow-jumps -fno-cse-skip-blocks -fno-strength-reduce -fno-caller-saves -fno-sched-interblock -fno-regmove -fno-delete-null-pointer-checks -fno-inline-functions-called-once -fno-tree-pre -finline-functions -fgcse-after-reload -fno-tree-vect-loop-version -fno-early-inlining -finline-limit=700 -fno-zero-initialized-in-bss -fgcse-sm -fgcse-las -fgcse-after-reload -ftree-loop-linear -fno-peephole -ftracer -funroll-all-loops -o /tmp/ACOVEA6FD7F1B7 raytrace.cpp

Acovea's Common Options:
g++ -lrt -lm -O2 -mtune=nocona -mno-mmx -mno-push-args -fno-if-conversion2 -fno-tree-ter -fno-delete-null-pointer-checks -finline-functions -funroll-all-loops -o /tmp/ACOVEAB05D5D16 raytrace.cpp

-O1:
g++ -lrt -lm -O1 -o /tmp/ACOVEAF66682E3 raytrace.cpp

-O2:
g++ -lrt -lm -O2 -o /tmp/ACOVEA2ECD22C2 raytrace.cpp

-O3:
g++ -lrt -lm -O3 -o /tmp/ACOVEADCAA04ED raytrace.cpp

-Os:
g++ -lrt -lm -Os -o /tmp/ACOVEA90194FC2 raytrace.cpp


A relative graph of fitnesses:

     Acovea's Best-of-the-Best: ************************                              (0.519285)
       Acovea's Common Options: ************************                              (0.509968)
                           -O1: ************************************                  (0.771686)
                           -O2: **********************************                    (0.717607)
                           -O3: ***************************                           (0.586796)
                           -Os: **************************************************    (1.05193)
karp
n00b

Joined: 20 May 2002
Posts: 38
Location: Champaign, IL

Posted: Tue Nov 21, 2006 6:23 am

The Huffman encoding benchmark included with Acovea:
Code:
Optimistic options:

                    -fno-tree-copyrename  (1.975)
                    -fno-strength-reduce  (1.921)
                   -fno-sched-interblock  (1.813)

Pessimistic options:

                           -mtune=nocona  (-2.561)
                               -mno-sse2  (-2.183)
                        -floop-optimize2  (-2.129)
                      -fno-if-conversion  (-2.561)
                           -fno-tree-lrs  (-2.129)
                            -fno-tree-ch  (-2.237)
            -fno-expensive-optimizations  (-1.535)
                    -fno-align-functions  (-1.697)
                      -frename-registers  (-1.643)
                          -funroll-loops  (-1.967)
           -fbranch-target-load-optimize  (-1.535)

Acovea's Best-of-the-Best:
gcc -lrt -lm -std=gnu99 -O2 -mieee-fp -momit-leaf-frame-pointer -mno-push-args -maccumulate-outgoing-args -mno-align-stringops -minline-all-stringops -fno-delayed-branch -fno-thread-jumps -fno-guess-branch-probability -fno-loop-optimize -fno-if-conversion2 -fno-tree-ccp -fno-tree-dse -fno-tree-ter -fno-tree-sra -fno-tree-copyrename -fno-tree-fre -fno-thread-jumps -fno-crossjumping -fno-optimize-sibling-calls -fno-cse-follow-jumps -fno-strength-reduce -fno-peephole2 -fno-schedule-insns -fno-schedule-insns2 -fno-sched-interblock -fno-regmove -fno-strict-aliasing -fno-delete-null-pointer-checks -fno-reorder-functions -fno-unit-at-a-time -fno-inline-functions-called-once -fno-tree-pre -fgcse-after-reload -fno-tree-vect-loop-version -fno-defer-pop -fno-inline -finline-limit=700 -fmodulo-sched -fgcse-sm -fgcse-las -fgcse-after-reload -ftree-loop-im -ftree-loop-ivcanon -fivopts -fno-split-ivs-in-unroller -fprefetch-loop-arrays -fno-peephole -fno-web -ftracer -fpeel-loops -o /tmp/ACOVEAD78DAA47 /usr/share/libacovea/benchmarks/huffbench.c

Acovea's Common Options:
gcc -lrt -lm -std=gnu99 -O2 -maccumulate-outgoing-args -mno-align-stringops -fno-strength-reduce -fno-schedule-insns -fno-sched-interblock -fno-unit-at-a-time -fno-defer-pop -o /tmp/ACOVEAF636505B /usr/share/libacovea/benchmarks/huffbench.c

A relative graph of fitnesses:

     Acovea's Best-of-the-Best: **********************************                    (1.42823)
       Acovea's Common Options: ***************************************               (1.63152)
                           -O1: ***************************************               (1.62643)
                           -O2: **************************************                (1.58268)
                           -O3: **************************************                (1.58807)
                           -Os: **************************************************    (2.07581)
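
For anyone who finds the star graphs hard to read, here is a small sketch that turns the fitness numbers from the raytracer and huffbench runs into percentages relative to -O2. It assumes that Acovea's fitness is the benchmark's run time, so a smaller number (shorter bar) means faster code. The names relative_to and report are made up for this example; the numbers are copied from the two graphs.

```python
# Fitness values copied from the two "relative graph of fitnesses" listings.
# Assumption: fitness is the benchmark's run time, so smaller is faster.
RAYTRACER = {
    "best-of-the-best": 0.519285,
    "common options": 0.509968,
    "-O1": 0.771686,
    "-O2": 0.717607,
    "-O3": 0.586796,
    "-Os": 1.05193,
}
HUFFBENCH = {
    "best-of-the-best": 1.42823,
    "common options": 1.63152,
    "-O1": 1.62643,
    "-O2": 1.58268,
    "-O3": 1.58807,
    "-Os": 2.07581,
}


def relative_to(fitness, baseline="-O2"):
    """Return each configuration's fitness as a percentage change from the baseline."""
    base = fitness[baseline]
    return {name: (value / base - 1.0) * 100.0 for name, value in fitness.items()}


def report(title, fitness):
    """Print configurations from fastest to slowest with their change versus -O2."""
    print(title)
    changes = relative_to(fitness)
    for name in sorted(fitness, key=fitness.get):
        print(f"  {name:>18}: {fitness[name]:.3f}  ({changes[name]:+.1f}% vs -O2)")


if __name__ == "__main__":
    report("raytracer", RAYTRACER)
    report("huffbench", HUFFBENCH)
```

Read this way, -Os comes out about 47% slower than -O2 on the raytracer and about 31% slower on huffbench, while -O1, -O2, and -O3 are within a few percent of each other on huffbench.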
schmobag
Tux's lil' helper

Joined: 17 Feb 2004
Posts: 91
Location: Los Angeles

Posted: Tue Nov 21, 2006 5:00 pm

I'm a little confused about -Os.

Your original post says that you've never seen -Os beat -O1.

The gcc man page defines -Os like this:

Code:

-Os Optimize for size.  -Os enables all -O2 optimizations that do not typically increase code
           size.  It also performs further optimizations designed to reduce code size.

           -Os disables the following optimization flags: -falign-functions  -falign-jumps
           -falign-loops -falign-labels  -freorder-blocks  -freorder-blocks-and-partition
           -fprefetch-loop-arrays  -ftree-vect-loop-version


So it seems that generally, -Os is -O1 plus all the -O2 optimizations that don't increase code size. If that was all, then it would seem logically impossible for -O1 to beat -Os. But then the definition says that -Os "also performs further optimizations designed to reduce code size." Mysterious. Are those the optimizations that slow -Os down so that -O1 beats it?

I confess to having no idea how to read your Acovea benchmarks. In the raytracer one, -Os is much lower than the others. In the huffman one, -Os is a bit higher. Does that mean that -Os is the worst for the raytracer, and the best for the huffman, or is it the reverse?
batistuta
Veteran

Joined: 29 Jul 2005
Posts: 1384
Location: Aachen

Posted: Tue Nov 21, 2006 5:39 pm

schmobag wrote:
In the raytracer one, -Os is much lower than the others. In the huffman one, -Os is a bit higher. Does that mean that -Os is the worst for the raytracer, and the best for the huffman, or is it the reverse?

First of all, thanks Karp for the work. Great useful thread.

Schmobag, I'm not sure if we are looking at the same graphs. In the ones that I see, -Os is always the slowest: in Huffman by a bit, in the raytracer by a bit more. The way I interpret the results is that since -Os skips optimizations that increase code size, things like inlining of functions that are probably called a lot in the raytracer and in Huffman are omitted. This has an impact on performance. The same goes for other things like loop unrolling, and matrix operations have lots of loops that could get optimized.
So it looks to me like this particular test shows -Os as a non-optimal flag. But other applications could show very different results. For instance, I would expect a Web browser not to show these differences as much.

At the end of the day, it looks like the rule of thumb still prevails: choose your architecture, and -O2 for a good compromise between compilation speed, performance, and stability. Then Acovea, as the website suggests, is great for tuning particular applications that could benefit from optimization (like raytracing).

Nevertheless, these results are very interesting. Thanks for sharing the information!
schmobag
Tux's lil' helper

Joined: 17 Feb 2004
Posts: 91
Location: Los Angeles

Posted: Tue Nov 21, 2006 5:55 pm

batistuta wrote:
Schmobag, I'm not sure if we are looking at the same graphs.


We were looking at the same ones, but I misread the first one. The number on the -Os line starts with 1.xxx, but I read that as .1xxxx. Quite a difference.
karp
n00b

Joined: 20 May 2002
Posts: 38
Location: Champaign, IL

Posted: Tue Nov 21, 2006 7:41 pm

The best test for -Os would probably be to recompile an entire system with it, and then measure overall system performance compared to an entirely -O2 system. Having only a benchmark give up some code size for The Good of the Cache doesn't mean much when the kernel, glibc, etc are still large. But if they *all* made a sacrifice, it could be worth it.

That said, I don't know if I feel up to recompiling my entire system multiple times to test this. Besides, I'm not even sure if there is a good overall-system-performance benchmark... maybe kernel-compilation?
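
Whatever workload ends up being used, the timing itself could be scripted along these lines. This is only a sketch: time_command is a made-up helper, and the workload in the example is a trivial placeholder rather than a real kernel build. Each configuration would be timed with the same command on an otherwise idle machine, with the build tree cleaned between runs.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails,
        # so a broken build is not mistaken for a fast one.
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Placeholder workload; replace with the real task (for example a build command).
    workload = [sys.executable, "-c", "sum(range(100000))"]
    samples = time_command(workload, runs=3)
    print(f"median {statistics.median(samples):.3f}s, min {min(samples):.3f}s")
```

Reporting the median of several runs rather than a single run helps keep disk-cache effects and background noise from dominating the comparison.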
batistuta
Veteran

Joined: 29 Jul 2005
Posts: 1384
Location: Aachen

Posted: Tue Nov 21, 2006 9:13 pm

Kernel compilation is always a special case: so special that it has its own set of flags, and some flags are ignored even if you pass them. So I don't think the kernel is a good candidate for it... :roll:
karp
n00b

Joined: 20 May 2002
Posts: 38
Location: Champaign, IL

Posted: Tue Nov 21, 2006 9:57 pm

I think you misunderstand... I wouldn't be changing the flags used inside of the kernel build process. I'd be comparing how toolchains compiled with various settings perform on a standard task: compiling the kernel.
karp
n00b

Joined: 20 May 2002
Posts: 38
Location: Champaign, IL

Posted: Tue Nov 21, 2006 10:04 pm

batistuta wrote:

At the end of the day, it looks like the rule of thumb still prevails: choose your architecture, and -O2 for a good compromise between compilation speed, performance, and stability.


Well, actually, I found that not specifying the architecture is best, in this particular situation.
batistuta
Veteran

Joined: 29 Jul 2005
Posts: 1384
Location: Aachen

Posted: Tue Nov 21, 2006 10:13 pm

karp wrote:
I'd be comparing how toolchains compiled with various settings perform on a standard task: compiling the kernel.

ok, I understand

Quote:
Well, actually, I found that not specifying the architecture is best, in this particular situation.

This might be because your architecture is not fully supported. And once again, this depends on the particular application.
anli
Tux's lil' helper

Joined: 08 Sep 2006
Posts: 80

Posted: Mon Dec 18, 2006 1:26 pm

karp,

Thanks for your information!

I just want to clarify one point. After these months, which arch setting do you consider the best: 'nocona', or no arch flag at all?
Keruskerfuerst
Veteran

Joined: 01 Feb 2006
Posts: 1722

Posted: Wed Dec 20, 2006 6:42 pm

Has anyone done a CPU test for AMD Opteron or AMD Athlon64 as described here: http://gentoo-wiki.com/TIP_Acovea
peaceful
Apprentice

Joined: 06 Jun 2003
Posts: 287
Location: Utah

Posted: Thu Jan 04, 2007 4:50 pm    Post subject: Re: Good CFLAGS for Intel Core 2

karp wrote:

Supposedly Apple does all their compilation using -Os (optimizing for small size), but I have yet to personally see -Os beating -O1 or -O2, and quite often -Os lags by 10-20%.


I did extensive compiler flag testing of a project called ANNEvolve (Artificial Neural Network Evolution software) in December, 2004. I tested most relevant combinations of 15-20 different compiler flags on OS X on a G4 and G5, Linux on a POWER4 (cousin to both the G4 and G5), and Linux on a Pentium 4. I found that a simple "-Os" resulted in the fastest code on the G4, G5, and POWER4 processors, while -O3 resulted in the fastest code on the Pentium 4. None of the other dozens of flags I tried had any positive effect on runtime speed.

So, I'm not surprised that Apple would use -Os for all their PowerPC-based code. I would hope they don't blindly use it for their Intel code as well.

Bear in mind that this was one specific project written in C doing artificial neural network simulation, and this on GCC 3.something. The task at hand was heavy in floating-point calculations.
Element Dave
n00b

Joined: 10 Nov 2006
Posts: 74

Posted: Tue Jan 23, 2007 10:56 pm    Post subject: Re: Good CFLAGS for Intel Core 2

karp wrote:
Secondly, optimization level. Supposedly Apple does all their compilation using -Os (optimizing for small size), but I have yet to personally see -Os beating -O1 or -O2, and quite often -Os lags by 10-20%.


Define "beating" in this context. I am not at all surprised that Apple uses "-Os" by default. Many people keep failing to recognize the simple and incontrovertible fact that smaller code will ALWAYS perform better when CPU utilization is at less than maximum. The CPU is almost never the bottleneck on a modern desktop. Only if CPU utilization is constantly near its limit does it make any sense to use "-O2" or "-O3" for a system default. Actually, I don't think it's a very good idea to have "-Oanything" defined globally, but that's another matter. There is also another complication: gcc sometimes produces bigger (!) code with "-Os" compared to "-O2". Unfortunately, it does not seem uncommon for this to happen. For example, cairo is much larger when compiled with "-Os" than it is with "-O2", using the gcc currently in the stable x86 branch. It would be interesting to compare the size of every compiled file on otherwise identical systems compiled with "-Os" and "-O2".
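
A rough sketch of how such a size comparison could be done, assuming the two systems (or two install images) are available as directory trees with the same layout. The helper names file_sizes and compare_trees are made up for this example, and the script only compares raw file sizes; it does not separate code from debug info or data.

```python
import os


def file_sizes(root):
    """Map each file's path relative to root to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # skip symlinks so library aliases are not counted twice
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def compare_trees(root_os, root_o2):
    """Return (path, size_os, size_o2) for files in both trees, largest -Os growth first."""
    sizes_os = file_sizes(root_os)
    sizes_o2 = file_sizes(root_o2)
    common = sorted(set(sizes_os) & set(sizes_o2))
    rows = [(path, sizes_os[path], sizes_o2[path]) for path in common]
    # Sort by how much bigger the -Os file is than the -O2 file.
    rows.sort(key=lambda row: row[1] - row[2], reverse=True)
    return rows
```

Calling compare_trees on the -Os root and the -O2 root and printing the rows where the second value exceeds the third would list exactly the files where -Os produced the bigger result.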
nxsty
Veteran

Joined: 23 Jun 2004
Posts: 1556
Location: .se

Posted: Sat Feb 03, 2007 7:06 pm    Post subject: Re: Good CFLAGS for Intel Core 2

karp wrote:
Secondly, optimization level. Supposedly Apple does all their compilation using -Os (optimizing for small size), but I have yet to personally see -Os beating -O1 or -O2, and quite often -Os lags by 10-20%. My guess is that this is because Core 2 processors have pretty large caches, so cache misses aren't as much of a concern as they would be on other CPUs. -O2 is a tad better than -O1, though I haven't done enough testing to say for sure. -O3 doesn't seem to provide much benefit, takes more time to compile, makes much larger binaries, and sometimes produces incorrect results, so I'm not using it.


That's not really comparable. Apple's gcc works differently from FSF's gcc at some optimization levels. -Os in Apple's gcc is just like -O2 except that it disables some flags, unlike -Os in FSF's gcc, which also tunes some optimizations for size rather than speed. The Apple-specific -Oz optimization level is the same as -Os in FSF's gcc.
likewhoa
l33t

Joined: 04 Oct 2006
Posts: 695
Location: Brooklyn, New York

Posted: Sat Feb 03, 2007 9:45 pm

Keruskerfuerst wrote:
Has anyone done a CPU test for AMD Opteron or AMD Athlon64 as described here: http://gentoo-wiki.com/TIP_Acovea


I've got an Opteron on amd64; I will do some Acovea runs :)
yaneurabeya
Veteran

Joined: 13 May 2004
Posts: 1754
Location: Silicon Valley

Posted: Sun Mar 25, 2007 5:15 am

Interesting thread. It may be because of my CLI resolution compared to the size of the FB resolution on the SUSE LiveCD, but it appears that things are trolling along a bit slower on my Core 2 Duo box. I only have 2 GB of RAM in the box currently and need to add the other 2 (running it in 64-bit mode).

Current CFLAGS: -O2 -pipe -fno-strict-aliasing -funroll-loops -msse -msse2 -msse3 <- guess I don't need -msse3 any more, do I?
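
Going by karp's earlier posts in this thread (MMX, SSE, and SSE2 come with the 64-bit specification, while SSE3 is not enabled by default), the flags that look redundant on a 64-bit system are -msse and -msse2 rather than -msse3. A small sketch of that reasoning, with the set of implied flags taken from those posts as an assumption and strip_implied_flags as a made-up helper name:

```python
# Assumption, taken from the first post in this thread: MMX, SSE and SSE2 are part
# of the x86-64 baseline, so gcc enables them by default when targeting 64-bit.
IMPLIED_BY_X86_64 = {"-mmmx", "-msse", "-msse2"}


def strip_implied_flags(cflags):
    """Drop flags that the x86-64 baseline already implies, keeping the original order."""
    return " ".join(flag for flag in cflags.split() if flag not in IMPLIED_BY_X86_64)


if __name__ == "__main__":
    current = "-O2 -pipe -fno-strict-aliasing -funroll-loops -msse -msse2 -msse3"
    print(strip_implied_flags(current))
```

Whether -msse3 is worth keeping is a separate question, since karp's Acovea runs suggested it hurt performance.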

If there's anything I can help test out, I'd be more than happy to.
All times are GMT