Twist Guru
Joined: 03 Jan 2003 Posts: 414 Location: San Diego
Posted: Sun Dec 26, 2004 12:55 pm Post subject: Acovea analysis results against real world programs
Well, it's not very good. I have been testing my Acovea flag results (posted here) against more traditional "optimized" CFLAGS. The results have not argued strongly in favor of using Acovea-based recommendations.
My system is as follows:
Athlon64 3400+ w/1GB memory
Gentoo 2004.3 stable, with exceptions noted
gcc-3.4.3, glibc-2.3.4.20040808-r1
For each test, I would run the given app against sample data three times with my "normal" CFLAGS, then recompile and run it three times with the Acovea CFLAGS, averaging the results. No other significant load existed on the machine at the time. No window system was running (GDM was, and therefore xorg, as were my standard services like NFS and Samba, but they weren't actively doing anything). The actual tests were performed from an SSH session from another machine.
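The three-runs-and-average procedure above is easy to script. Here is a minimal sketch; the helper name avg_time and the use of bash's TIMEFORMAT are my own choices, not taken from the original post:

```shell
#!/bin/bash
export LC_ALL=C  # ensure `time` prints a dot as the decimal separator

# avg_time: run a command N times and print the mean wall-clock seconds.
avg_time() {
    local runs=$1; shift
    local total=0 t
    for ((i = 0; i < runs; i++)); do
        # TIMEFORMAT=%R makes bash's built-in `time` print only elapsed seconds
        t=$( { TIMEFORMAT=%R; time "$@" > /dev/null 2>&1; } 2>&1 )
        total=$(awk -v a="$total" -v b="$t" 'BEGIN { print a + b }')
    done
    awk -v s="$total" -v n="$runs" 'BEGIN { printf "%.3f\n", s / n }'
}

# Example with a stand-in workload:
avg_time 3 sleep 0.1
```

Timing the real encoders would then just be, e.g., `avg_time 3 flac --best input.wav` (file name hypothetical).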
My original acovea results:
Code: |
Score | So? | Switch (annotation)
------------------------------------------------------------------------------
35.8 | Yes | -minline-all-stringops
32.6 | Yes | -mno-push-args
31.8 | Maybe | -finline-functions (-O3)
31.8 | Yes | -fexpensive-optimizations (-O2)
30.4 | Maybe | -fschedule-insns (-O2)
30.3 | Maybe | -fpeel-loops
30.1 | Yes | -fno-if-conversion2 (! -O1)
29.8 | Yes | -fno-defer-pop (! -O1)
29.7 | Yes | -fcse-skip-blocks (-O2)
29.1 | Maybe | -frerun-loop-opt (-O2)
28.3 | Yes | -fsched-interblock (-O2 GCC 3.3)
28.2 | Yes | -foptimize-sibling-calls (-O2)
27.4 | Yes | -falign-jumps (-O2 GCC 3.3)
27.4 | Maybe | -fstrict-aliasing (-O2)
26.9 | Maybe | -fno-merge-constants (! -O1)
26.5 | Maybe | -finline-limit
26.1 | Maybe | -falign-functions
25.7 | Maybe | -fno-delayed-branch (! -O1)
25.4 | Maybe | -fpeephole2 (-O2)
25.4 | Maybe | -freorder-functions (-O2 GCC 3.3)
25.0 | Maybe | -fno-signaling-nans (fast math)
25.0 | Maybe | -freorder-blocks (-O2)
24.7 | No | -fstrength-reduce (-O2)
24.4 | Maybe | -frerun-cse-after-loop (-O2)
24.3 | Yes | -fmove-all-movables
24.2 | Maybe | -fcse-follow-jumps (-O2)
23.6 | Maybe | -fschedule-insns2 (-O2)
23.2 | Maybe | -fno-math-errno (fast math)
22.8 | Yes | -fsched-spec (-O2 GCC 3.3)
22.8 | Maybe | -maccumulate-outgoing-args
22.5 | Maybe | -fdelete-null-pointer-checks (-O2)
22.5 | Maybe | -falign-labels (-O2 GCC 3.3)
22.4 | Maybe | -fno-thread-jumps (! -O1)
22.3 | Maybe | -mieee-fp
22.2 | Maybe | -ftracer
22.0 | Maybe | -mno-align-stringops
21.4 | Maybe | -fno-crossjumping (! -O1)
21.3 | Maybe | -fno-cprop-registers (! -O1)
21.3 | Yes | -funit-at-a-time
21.1 | Maybe | -frename-registers (-O3)
20.9 | Maybe | -ffinite-math-only (fast math)
20.8 | Maybe | -fno-trapping-math (fast math)
20.6 | Maybe | -funswitch-loops
20.4 | No | -fweb
20.2 | Maybe | -fcaller-saves (-O2)
20.1 | No | -falign-loops (-O2 GCC 3.3)
19.9 | No | -fgcse (-O2)
19.1 | No | -fno-omit-frame-pointer (! -O1)
17.3 | No | -funsafe-math-optimizations (fast math)
17.1 | No | -fno-if-conversion (! -O1)
15.6 | No | -fregmove (-O2)
15.4 | Maybe | -fbranch-target-load-optimize
15.1 | No | -fprefetch-loop-arrays
13.6 | No | -fnew-ra
13.4 | No | -fno-inline
12.2 | No | -freduce-all-givs
12.2 | No | -funroll-all-loops
11.5 | No | -fforce-mem (-O2)
8.7 | No | -funroll-loops
5.2 | No | -fno-loop-optimize (! -O1)
4.6 | No | -ffloat-store
0.0 | No | -fno-guess-branch-probability (! -O1)
0.0 | No | -fbranch-target-load-optimize2
0.0 | No | -mfpmath=387
0.0 | No | -mfpmath=sse
0.0 | No | -mfpmath=sse,387
|
My "normal" optimized CFLAGS:
Code: |
CFLAGS="-O3 -march=athlon64 -mtune=athlon64 -ftracer -pipe"
|
CFLAGS recommended by acovea, see note below:
Code: |
CFLAGS="-O? -march=athlon64 -mtune=athlon64 -minline-all-stringops -mno-push-args -fexpensive-optimizations -fno-if-conversion2 -fno-defer-pop -fcse-skip-blocks -fsched-interblock -foptimize-sibling-calls -falign-jumps -fno-strength-reduce -fmove-all-movables -fsched-spec -funit-at-a-time -fno-web -fno-align-loops -fno-gcse -fomit-frame-pointer -fno-unsafe-math-optimizations -fif-conversion -fno-regmove -fno-prefetch-loop-arrays -fno-new-ra -finline -fno-reduce-all-givs -fno-unroll-all-loops -fno-force-mem -fno-unroll-loops -floop-optimize -fno-float-store -fguess-branch-probability -fno-branch-target-load-optimize2"
|
Acovea "alt" set:
Code: |
CFLAGS="-O3 -march=athlon64 -mtune=athlon64 -minline-all-stringops -mno-push-args -fno-if-conversion2 -fno-defer-pop -fno-strength-reduce -fmove-all-movables -funit-at-a-time -fno-align-loops -fno-gcse -fno-regmove -fno-force-mem -pipe"
|
Note: I am aware -march normally implies -mtune. I leave -mtune present in case -march is filtered for some reason. For the Acovea flags, I used the following methodology: I explicitly include all flags marked "Yes", explicitly exclude all flags marked "No", and then vary from -O1 to -O2 and finally -O3. For the Acovea "alt" set I use -O3 and only explicitly include the "Yes" indications, some of which, it should be noted, are logical NOT conditions against compilation methods.
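The include-every-"Yes" selection rule can be mechanized against the score table. A sketch, assuming the table is saved verbatim in the pipe-separated layout shown above (the function name is mine):

```shell
#!/bin/bash
# Pull the flags marked "Yes" out of an Acovea score table on stdin.
# Row format matches the listing above: Score | So? | Switch (annotation)
yes_flags() {
    awk -F'|' '$2 ~ /Yes/ { split($3, f, " "); print f[1] }'
}

# Small sample of the table (the full table is shown above):
yes_flags <<'EOF'
35.8 | Yes | -minline-all-stringops
31.8 | Maybe | -finline-functions (-O3)
24.7 | No | -fstrength-reduce (-O2)
32.6 | Yes | -mno-push-args
EOF
```

Piping the real table through this (and the inverted rule for the "No" rows) reproduces the flag lists below.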
TESTS
Test for flac-1.1.1
In this test I encoded Tchaikovsky's 1812 Overture using the "--best" flag option for flac.
Results:
Code: |
ACOVEA -O1:
real 2m3.003s
user 2m2.616s
sys 0m0.313s
ACOVEA -O2:
real 2m4.853s
user 2m4.430s
sys 0m0.333s
ACOVEA -O3:
real 2m4.395s
user 2m3.971s
sys 0m0.348s
ACOVEA alt:
real 1m2.734s
user 1m2.348s
sys 0m0.323s
REGULAR:
real 1m9.937s
user 1m9.545s
sys 0m0.326s
|
Test for lame-3.96.1
In this test I encoded the above 1812 Overture from raw .wav to mp3 using no special options.
Code: |
ACOVEA -O1:
real 1m12.179s
user 1m11.916s
sys 0m0.210s
ACOVEA -O2:
real 1m10.361s
user 1m10.109s
sys 0m0.203s
ACOVEA -O3:
FAILED - Segmentation fault (compiled twice to make sure)
ACOVEA alt:
FAILED - Segmentation fault (compiled twice to make sure)
REGULAR:
real 1m6.611s
user 1m6.354s
sys 0m0.189s
|
Test for bzip2-1.0.2-r3
In this test I compressed the raw .WAV of the previously used Tchaikovsky's 1812 Overture. The file is fairly large, with a size of 166368764 bytes. No flags for bzip2 were used.
Results:
Code: |
ACOVEA -O1:
real 0m50.877s
user 0m50.321s
sys 0m0.475s
ACOVEA -O2:
real 0m48.955s
user 0m48.435s
sys 0m0.447s
ACOVEA -O3:
real 0m46.516s
user 0m45.972s
sys 0m0.471s
ACOVEA alt:
real 0m42.366s
user 0m41.845s
sys 0m0.460s
REGULAR:
real 0m43.687s
user 0m43.162s
sys 0m0.450s
|
Conclusions
I am aware my test cases are drawn from a specific class of programs, that being encode/decode style logic. This is the easiest case to find reproducible results with; if others want to try more complex types of programs with 100% reproducible data sets, by all means please do!
In the examples given, Acovea-based results can't really be recommended. It's true that in one case they resulted in an approximately 11% performance increase for the flac encoding, but in the other tests they either performed worse, much worse, or failed to execute compared to "normal" optimizing CFLAGS. The interaction of the recommended flags appears highly situational and largely just noise when compared with the GCC "meta" flags of -O settings.
I would hazard a guess that acovea's default benchmarks are simply not indicative of the programs I used to test, and therefore made little if any headway in optimizing. Short of running an acovea style analysis of each program individually, I'm not sure how this would be fixed.
In the meantime, I'm sticking with my default CFLAGS =)
-Twist
ebrostig Bodhisattva
Joined: 20 Jul 2002 Posts: 3152 Location: Orlando, Fl
Posted: Mon Dec 27, 2004 1:11 am Post subject:
It is difficult to set individual flags that will give an overall improvement in speed. It all depends on what the program you want to run does and how it does it internally. In order to optimize a specific program you will have to perform the type of tests that you have done and adjust flags individually. That is not desirable in general.
The gcc suite internally sets many flags based on the -O? flag; they are all documented in the gcc man pages.
I have done numerous tests myself on my AMD64 3200+ and have come up with a set of flags that overall gives the most optimal performance and stability. The latter is not the least important, as you found out with some programs that segfaulted when run.
In general, it is best to stick with a minimal amount of flags and use the ones recommended for each platform.
I think you have done a great job and I applaud you for your persistence in testing the various combinations. Great write-up!
Erik _________________ 'Yes, Firefox is indeed greater than women. Can women block pops up for you? No. Can Firefox show you naked women? Yes.'
georgz Tux's lil' helper
Joined: 06 Dec 2002 Posts: 137 Location: Munich, Germany
Posted: Wed Dec 29, 2004 12:30 pm Post subject:
Quote: | I have done numerous tests myself on my AMD64 3200+ and have come up with a set of flags that overall gives the most optimal performance and stability. |
Which flags do you use? Are different flags suggested/recommended for 64bit or 32bit installations with Athlon64?
smokeslikeapoet Tux's lil' helper
Joined: 03 Apr 2003 Posts: 96 Location: Cordova, TN USA
Posted: Wed Dec 29, 2004 12:36 pm Post subject:
Instead of using Acovea I benchmarked my system in much the same way. I used LAME and some default optimizations. I md5-summed all of the resulting mp3s. -O3 gave me the best time. Then I started adding other combinations of CFLAGS until I noticed speed improvements. Again I md5-summed the resulting mp3s. I threw out the CFLAGS that gave me different md5 sums, most notably -ffast-math. Then I started taking out the CFLAGS that gave me no significant improvement in encoding time, until I was left with the minimal CFLAGS that reduced my encoding time by 40%. In case you were wondering, here are my CFLAGS for my Athlon 1800+ on an Epox Via 8HKA+.
Code: | CFLAGS="-march=athlon-xp -mtune=athlon-xp -O3 -pipe -fomit-frame-pointer -fforce-addr -falign-functions=16 -falign-jumps=16 -falign-loops=16 -falign-labels=1 -fprefetch-loop-arrays -maccumulate-outgoing-args" |
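The md5 regression check described above generalizes to any flag experiment: encode the same input under each candidate flag set and only trust the timing numbers if the outputs are bit-identical. A sketch (the helper name and file names are mine, not from the post):

```shell
#!/bin/bash
# same_output: compare md5 checksums of two files; succeeds (exit 0)
# only when the files are bit-identical, i.e. the flag change did not
# alter the program's actual results.
same_output() {
    local a b
    a=$(md5sum "$1" | cut -d' ' -f1)
    b=$(md5sum "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

# Usage (placeholder names): after encoding with each flag set,
#   same_output baseline.mp3 candidate.mp3 || echo "flag set changed output -- reject"
```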
I doubt acovea would give me any significant improvement. _________________ -SmokesLikeaPoet
Folding@Home
MighMoS Guru
Joined: 24 Apr 2003 Posts: 416 Location: @ ~
Posted: Wed Dec 29, 2004 5:15 pm Post subject:
I'm curious about people using -O3, given that most tests agree that inlining functions slows down code on modern processors. The same goes for redundant CFLAGS such as specifying -fomit-frame-pointer at -O2 and above, because the GCC man page states that this is already implied.
Not to start another rant, but actually reading the man (or info) pages can help a lot too, and save time. _________________ jabber: MighMoS@jabber.org
localhost # export HOME=`which heart`
Twist Guru
Joined: 03 Jan 2003 Posts: 414 Location: San Diego
Posted: Wed Dec 29, 2004 8:48 pm Post subject:
Quote: | I'm curious as to people using -O3, due to the fact that most tests agree that inlining functions slow down code on modern processors. |
Qualify "most test results". I think that's probably "some test results I read", as I find that is most often the case, and then people generalize. Not trying to knock you; it's just been my very common experience.
The answer is I don't trust any of them as a generalization and try to test it myself to see. GCC has evolved recently at a very fast pace and its level of support for different processors varies considerably. What is true for one class of processor with a specific cycle rate, cache, and instruction set may be completely different for another. Thus, I test it myself.
Quote: | As well as redundant CFLAGS such as specifying -fomit-frame-pointer on -O2 and above, |
For a very simple reason, and yes, many of them have RTFM'd. If you RTFM the portage manual, you will realize that occasionally portage will filter some flags without telling you at the ebuild level. It's therefore valid to string individual flags after your "meta" optimization flag, in the hope that if the ebuild filters, say, -O3, you will still retain some optimization behaviors. In fairness, however, anything that filters -O2 would most likely filter all flags, so not much point there.
The specific combination you point out, "-O2 -fomit-frame-pointer", is not the default behavior for Intel class processors. From the gcc man page:
Quote: | "-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging. |
Since omitting the frame pointer interferes with stack unwinding on Intel-class processors, GCC does not do this until explicitly told to on those systems. So hopefully you didn't give your pet peeve advice to anybody running an Intel-class system =)
-Twist
MighMoS Guru
Joined: 24 Apr 2003 Posts: 416 Location: @ ~
Posted: Wed Dec 29, 2004 9:37 pm Post subject:
Twist wrote: | Since omitting the frame pointer is destructive to rewinding on Intel class processors, GCC does not do this until explicitly indicated on those systems. So hopefully you didn't give your pet peeve advice to anybody running an Intel class system =)
-Twist | Actually, I haven't, because I just read up on it the other day. Sorry about the small rant there, and you are right about "most test results". *backs away slowly* _________________ jabber: MighMoS@jabber.org
localhost # export HOME=`which heart`
ciaranm Retired Dev
Joined: 19 Jul 2003 Posts: 1719 Location: In Hiding
Posted: Wed Dec 29, 2004 10:29 pm Post subject:
MighMoS wrote: | I'm curious as to people using -O3, due to the fact that most tests agree that inlining functions slow down code on modern processors. |
Because most of the people you see who post their CFLAGS are the sort who don't have a clue what they're doing, and who just assume that bigger numbers and longer CFLAGS lines equate to faster code.
rhill Retired Dev
Joined: 22 Oct 2004 Posts: 1629 Location: sk.ca
Posted: Wed Dec 29, 2004 11:21 pm Post subject:
thanks twist, i was getting all set to go into MythBusters mode, but you ranted for me.
seriously there needs to be a GCC Myths FAQ
Quote: | -O2 does not include -fomit-frame-pointer on Intel archs |
Quote: | -mfpmath=sse,387 is BROKEN in any current release and will eat your children |
Quote: | -mmmx and -msse -msse2 are a waste of time and also BROKEN |
stuff like that, but written by someone who knows what they are talking about.
--de. _________________ by design, by neglect
for a fact or just for effect
ciaranm Retired Dev
Joined: 19 Jul 2003 Posts: 1719 Location: In Hiding
Posted: Wed Dec 29, 2004 11:29 pm Post subject:
dirtyepic wrote: | stuff like that, but written by someone who knows what they are talking about. |
I used to have one of those, but I got too much abuse from lovech^W clueless ricers over it, so I got rid of it.
Seriously though, I'm trying to get the following in as official policy on how we handle CFLAGS:
Quote: |
Guidelines for Flag Filtering
If a package breaks with any reasonable CFLAGS, it is best to filter the problematic flag if a bug report is received. Reasonable CFLAGS are -march=, -mcpu=, -mtune= (depending upon arch), -O2, -Os and -fomit-frame-pointer. Note that -Os should usually be replaced with -O2 rather than being stripped entirely. The -fstack-protector flag should probably be in this group too, although our hardened team claim that this flag never ever breaks anything...
If a package breaks with other CFLAGS, it is perfectly ok to close the bug with a WONTFIX suggesting that the user picks more sensible global CFLAGS. Similarly, if a bug report is received and is determined or suspected to be caused by daft CFLAGS, an INVALID resolution is appropriate.
|
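For reference, the filtering this policy talks about is what ebuilds do with the flag-o-matic eclass; filter-flags and replace-flags are real flag-o-matic functions, but the package context below is invented for illustration, not from any actual ebuild:

```shell
# Hypothetical ebuild fragment: sanitize user CFLAGS per the policy above.
inherit flag-o-matic

src_compile() {
    # -Os miscompiles this (invented) package: replace rather than strip,
    # per the guideline that -Os should become -O2
    replace-flags -Os -O2
    # a flag known to break this particular package: drop it outright
    filter-flags -ftracer
    econf || die "configure failed"
    emake || die "make failed"
}
```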
Take from that what you will about what you should have in make.conf...
Twist Guru
Joined: 03 Jan 2003 Posts: 414 Location: San Diego
Posted: Thu Dec 30, 2004 12:06 am Post subject:
Quote: | Actually, I haven't, because I just read up on it the other day. Sorry about the small rant there, and you are right about "most test results". *backs away slowly* |
LOL ok I guess I came across a bit too strong there. I was honestly just trying to convey the idea that -fomit-frame-pointer was not automatic with -O or above on Intel arch machines.
As for the 'most test results' thing, it's a common problem that I fall into myself, even as a coder and somebody who is very conversant with compilers and their behavior. This is why conceptually I like Acovea; it seems that it's either flawed somewhat in implementation (not enough breadth to the example benchmark code) or simply that GCC is prone to many contradictory behaviors that can't be generalized across an architecture, but must be taken in context to a specific set of code. I tend to favor the latter myself, but again it means nothing without more extensive testing =)
Quote: | I used to have one of those, but I got too much abuse from lovech^W clueless ricers over it, so I got rid of it.
Seriously though, I'm trying to get the following in as official policy on how we handle CFLAGS: |
I think that is an ok set of rules for the general case, sure. While it's annoying to get non-bugs submitted by Gentoo users who are doing unreasonable things with the compiler, it sort of comes with the territory and is part of the Gentoo flexibility/experience, so I would urge you not to turn to the dark side of bitterness on this issue =). I think the "stable" keyword ebuilds should all be responsible for handling any set of input CFLAGS to retain stable behavior (note that this most likely means rejecting almost all of them) and that your proposed policy would get us there.
If wishes were fishes though...I'd love to use the participatory nature of the Gentoo community to get definitive on some of this stuff. For instance, while we can label -fomit-frame-pointer as "safe" in that it doesn't break any known ebuilds, it would be great if we had a bug-buddy like facility to actually KNOW that for sure as part of the base install. Except maybe not as cumbersome and ugly as bug-buddy =). Something like -ftracer with the newer GCC releases, which (according to the GCC mailing list) should be entirely safe and improve the ability of other optimizations. -funit-at-a-time should also be safe, short of consuming extra memory for compiles, but I honestly don't have a feel at all for whether it breaks anything as I don't use it. It would be great if we could poll and consolidate results with some of these flag variants automatically.
Ah well. In the meantime, don't try this at home! Experienced coder here, attempting compilations on a closed course with appropriate safety gear. The sponsors remind you not to exceed your ability or that of your gear by sticking with stable keywords and not overriding ebuild behavior. Thank you, drive through.
-Twist
ciaranm Retired Dev
Joined: 19 Jul 2003 Posts: 1719 Location: In Hiding
Posted: Thu Dec 30, 2004 12:12 am Post subject:
If you want stable, don't set CFLAGS at all in make.conf. Just rely upon the profile-provided settings. Gentoo developers are not here to correct every single possible stupid thing you can do with make.conf.
rhill Retired Dev
Joined: 22 Oct 2004 Posts: 1629 Location: sk.ca
Posted: Thu Dec 30, 2004 1:19 am Post subject:
that kinda throws the whole 'freedom of choice' philosophy out the window though. sorry, just poking your buttons. i do appreciate all the work you do here for us and gentoo in general.
seriously though, i was surprised that "-pipe" isn't on that whitelist. are there actually situations where -pipe needs to be filtered or has caused problems (just curious). _________________ by design, by neglect
for a fact or just for effect
Last edited by rhill on Thu Dec 30, 2004 1:22 am; edited 1 time in total
ciaranm Retired Dev
Joined: 19 Jul 2003 Posts: 1719 Location: In Hiding
Posted: Thu Dec 30, 2004 1:21 am Post subject:
|
dirtyepic wrote: | that kinda throws the whole 'freedom of choice' philosophy out the window though. sorry, just poking your buttons. |
Oh, you're free to use other flags, and developers are free to ignore any bugs you submit if you do.
Quote: | seriously though, i was surprised that "-pipe" isn't on that whitelist. are there actually situations where -pipe needs to be filtered or has caused problems (just curious). |
-pipe doesn't count; it's not an optimisation flag and it doesn't alter the code produced. No problems with it though, guess I could explicitly say so...
rhill Retired Dev
Joined: 22 Oct 2004 Posts: 1629 Location: sk.ca
Posted: Thu Dec 30, 2004 1:25 am Post subject:
ciaranm wrote: | Oh, you're free to use other flags, and developers are free to ignore any bugs you submit if you do. |
yeah, definitely. no argument there.
Quote: | -pipe doesn't count, it's not an optimisation flag and it doesn't alter the code produced. No problems with it though, guess I could explicitly say so... |
oh ok. it is a CFLAG however, and the guideline didn't limit itself to optimization flags. i'm unfamiliar with how the filtering works of course, so perhaps the mistake was mine.
cheers. _________________ by design, by neglect
for a fact or just for effect
Hypnos Advocate
Joined: 18 Jul 2002 Posts: 2889 Location: Omnipresent
Posted: Thu Dec 30, 2004 4:54 am Post subject:
Twist,
Thanks for your work -- I'm glad someone has done something useful with my reporting scripts.
Comments:
* It seems that, apart from compilation problems, your Acovea "alt" CFLAGS did pretty well. This suggests that Acovea, for the algorithms you have chosen, has more reliably found negatives than affirmatives (apparently, the "maybe"'s from -O3 provided a big performance boost).
* The algorithms you have chosen are far more complex and heuristic than those employed by Acovea as benchmarks. For the former, this means that memory-intensive optimizations might be beneficial, since you are moving a lot of data and burning a lot of cycles anyway. For the latter, I'm not knowledgeable enough to infer how this would affect the performance of specific switches ....
* Is not GCC optimization for AMD notoriously bad? As you say in another post, the cross-dependencies of the various switches might be too extensive for even Acovea to dissect with its evolution.
* Here are my CFLAGS for my P4-Mobile:
Code: | CFLAGS="-pipe -Wall -O2 -march=pentium4 -mcpu=pentium4 -maccumulate-outgoing-args -minline-all-stringops -fmove-all-movables -fno-if-conversion2 -fno-crossjumping -fno-delayed-branch -fno-omit-frame-pointer -fno-merge-constants -fno-thread-jumps" |
I can't say one way or the other on performance movements (apart from placebo), but these flags have been prodigiously stable. _________________ Personal overlay | Simple backup scheme
Twist Guru
Joined: 03 Jan 2003 Posts: 414 Location: San Diego
Posted: Thu Dec 30, 2004 6:01 am Post subject:
Hypnos,
BTW, before anything else, I wanted to thank you for your ebuild and test scripts for Acovea. Fine work that I was too lazy to do myself.
Quote: | It seems that, apart from compilation problems, your Acovea "alt" CFLAGS did pretty well. |
Yes - I would hazard to guess that GCC is decent about deciding on its own when a method is negative (probably based on total instruction/tick count) and simply doesn't use it. So although those options came out as "no" according to Acovea, in real use GCC might benefit from them occasionally.
Quote: | The algorithms you have chosen are far more complex and heuristic than those employed by Acovea as benchmarks. |
The biggest fault I can find with my "real world" examples is that they are all memory intensive. They all pump a lot of data in total, they all want to do lots of fairly wide address space lookups and compares, etc. However, it's the nature of the beast that these types of apps are not only good demonstrations but also where I tend to spend a lot of wait time in real life. For purely algorithmic benchmarks I could have used nbench or the like, and for heavy mathematics, xfractint or celestia on a complex solution, I suppose. Might still go back and do that.
Quote: | Is not GCC optimization for AMD notoriously bad? |
AMD themselves are actively helping the GCC crew get their instruction scheduling up to par, and it is reportedly vastly improved in the later versions. Since I tested with 3.4.3, I figured that was good enough. It's definitely true that the GCC 2.9 series was simply awful with AMD procs, and the early 3 series (aside from general brokenness and stability issues) wasn't renowned either. I could and probably will run the same kind of comparison on one of my P4 machines, I just haven't gotten around to it yet.
-Twist
moocha Watchman
Joined: 21 Oct 2003 Posts: 5722
Posted: Thu Dec 30, 2004 6:24 am Post subject:
Twist wrote: | Something like -ftracer with the newer GCC releases, which (according to the GCC mailing list) should be entirely safe and improve the ability of other optimizations. |
Which only goes to show that the GCC mailing list can't be entirely trusted, since -ftracer breaks teTeX in a very weird fashion (executables don't crash but weirdly duplicate the file name they get passed, which of course causes the file not to be found). For details see https://bugs.gentoo.org/show_bug.cgi?id=50417 (the ebuild *still* doesn't filter that flag, and I'm pretty peeved about it... I even begged nicely)
As far as I'm aware, teTeX is the only package broken by -ftracer though. I use a bashrc-based filtering so teTeX doesn't get passed -ftracer but the rest do.
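The bashrc-based filtering mentioned here works because portage sources /etc/portage/bashrc for every build with the package name exported in ${PN}. This sketch is my guess at what such a hook looks like, not moocha's actual file:

```shell
# /etc/portage/bashrc -- sourced by portage during every ebuild phase.
# Strip -ftracer from the flags only when building tetex (see bug #50417);
# every other package keeps it.
if [ "${PN}" = "tetex" ]; then
    export CFLAGS="${CFLAGS/-ftracer/}"
    export CXXFLAGS="${CXXFLAGS/-ftracer/}"
fi
```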
My own flags (development desktop, dual P3, lots of L2 cache): Code: | CFLAGS="-march=pentium3 -mtune=pentium3 -O2 -pipe \
-fno-ident -fomit-frame-pointer -momit-leaf-frame-pointer -ftracer \
-fweb -frename-registers -finline-functions -finline-limit=280" |
The last line actually takes -O2 to -O3 - it's there because many ebuilds filter -O3. I chose to ignore that, but then that's my choice, and I wholeheartedly agree with the default restrictive filtering.
As to your Acovea findings - it's hardly surprising. The best optimizations for any software are, in this order:
(a) Having a good design from the start and not as an afterthought
(b) Using algorithms that are best suited for the task
(c) Using the compiler's profiling facilities to identify bottlenecks
.
.
.
(somewhere around letter m) Compiler flags
_________________ Military Commissions Act of 2006: http://tinyurl.com/jrcto
"Those who would give up essential liberty to purchase a little temporary safety deserve neither liberty nor safety."
-- attributed to Benjamin Franklin
Hypnos Advocate
Joined: 18 Jul 2002 Posts: 2889 Location: Omnipresent
Posted: Thu Dec 30, 2004 7:00 am Post subject:
moocha wrote: | As to your Acovea findings - it's hardly surprising. The best optimizations for any software are, in this order:
(a) Having a good design from the start and not as an afterthought
(b) Using algorithms that are best suited for the task
(c) Using the compiler's profiling facilities to identify bottlenecks
.
.
.
(somewhere around letter m) Compiler flags
|
Ah, but as Twist shows above, compiler flags can certainly be deleterious! _________________ Personal overlay | Simple backup scheme
dberkholz Retired Dev
Joined: 18 Mar 2003 Posts: 1008 Location: Minneapolis, MN, USA
Posted: Thu Dec 30, 2004 9:17 pm Post subject:
moocha wrote: | As far as I'm aware, teTeX is the only package broken by -ftracer though. I use a bashrc-based filtering so teTeX doesn't get passed -ftracer but the rest do. |
-ftracer also broke gtk+ last time I tried it. That was a lot of fun to track down, since the problem resulted in a mysterious collection of broken apps that used gtk+.
mbalino n00b
Joined: 09 Aug 2004 Posts: 30 Location: Edmonton
Posted: Thu Dec 30, 2004 10:30 pm Post subject:
Code: | CFLAGS="-march=athlon-xp -m3dnow -msse -mfpmath=sse -mmmx -O3 -pipe -fforce-addr -fomit-frame-pointer -funroll-loops -frerun-cse-after-loop -frerun-loop-opt -falign-functions=4 -maccumulate-outgoing-args -ffast-math -fprefetch-loop-arrays" |
These are my flags for a Barton 3000+ w/1024MB DDR400, SATA150 80GB disk, KT600/VT8237.
The whole system has been functional since 15/11/2004 without any problem.
Kernels 2.6.9-ac12 and 2.6.10-ck1 have been tested.
hq4ever Apprentice
Joined: 15 Aug 2004 Posts: 167
Twist Guru
Joined: 03 Jan 2003 Posts: 414 Location: San Diego
Posted: Fri Dec 31, 2004 6:31 pm Post subject:
Quote: | I'm sorry for this newb question, but where does the "m" in front of these flags come from?
Shouldn't it be "-3dnow -sse -fpmath=sse -mmx" like here http://gentoo-portage.com/USE ? |
USE flags are specific to Gentoo and indicate a system-level interest (or not) in the application/feature indicated by the flag.
Compile flags are switches to indicate to GCC particular code generation behavior. In this case, -f indicates an "option", whereas -m indicates a "machine option". Most commonly -m is something that is specific to the processor type that is the compile target.
It is correct to use -m to specify fpmath, sse, and mmx switches. All are particular to the processor, not to code generation in general.
-Twist
procyon112 n00b
Joined: 28 Apr 2005 Posts: 16 Location: Seattle, Washington, USA
Posted: Sat Apr 30, 2005 1:34 am Post subject: Invalid test
This test is invalid. Because you are evolving compile flags independently for each test, then accepting the ones that on average give you the best performance, the test is not even as good as:
1) Start with no optimizations and run each program, taking a reading.
2) Turn on an optimization, test, take a reading.
3) Turn on a different optimization and test.
4) Use the optimizations that give benefits; drop the others.
The genetic algorithm is probably worse, because it does not do a comprehensive test, and takes MUCH longer. The GA test is supposed to show which flags work best IN TANDEM, so taking the best average results will probably result in worse performance than -O2 or -O3, which the gcc team has probably already tested for best average performance independently. What you need to do is:
1) Only include in the list of flags to test those which you will have no qualms using in your final system build, e.g., leave out -malign-double.
2) For each generation of the GA, *ALL* benchmarks are run and a rating is given to that "set" of flags as the GA fitness function.
3) Run the GA until you are satisfied with the overall results (since the set of flags is rather small as far as GAs are concerned, 20 generations should be good with a population of 50-100).
4) Use ALL the flags of the winning GA on your system, because what you are testing is not "flag -fomg-fast is beneficial" but rather "flags -fsometimes-good -falmost-never -fduh-use-me-always and -mim-a-typewriter, when used in tandem, beat -O3 on average".
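Point 2 of that list -- rate a whole flag set by one fitness score aggregated over *all* benchmarks -- can be sketched as a tiny fitness function. The benchmark names and timings below are fake placeholders, not real Acovea output; in a real run each line would come from timing one benchmark compiled with the candidate flag set:

```shell
#!/bin/bash
# fitness: read "benchmark seconds" pairs on stdin and print the mean --
# one aggregate score for a whole flag set (lower is better).
fitness() {
    awk '{ sum += $2; n++ } END { printf "%.3f\n", sum / n }'
}

# Fake timings for one candidate flag set:
fitness <<'EOF'
bench1 12.4
bench2 8.9
bench3 10.1
EOF
```

The GA then selects and breeds flag sets by this single score, rather than evolving each benchmark independently.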
Basically, what I am saying is that if you run six independent GAs and then take the average results, your data is completely meaningless and you're better off sticking with the tried and true "-O2 -pipe". Rewrite this GA if you want to get real data out of it.
Hypnos Advocate
Joined: 18 Jul 2002 Posts: 2889 Location: Omnipresent
Posted: Sat Apr 30, 2005 4:16 am Post subject: Re: Invalid test
procyon112 wrote: | This test is invalid. Because you are evolving compile flags independently for each test, then accepting the ones that on average give you the best performance, the test is not even as good as:
1) Start with no optimizations and run each program, taking a reading.
2) Turn on an optimization, test, take a reading.
3) Turn on a different optimization and test.
4) Use the optimizations that give benefits; drop the others. |
Yes, except that you lose information about poor interactions altogether. By picking out the best average flags, you are not just extracting the switches which are beneficial over a variety of algorithms, but also those that "play nice" with others. This varies from machine to machine, it seems.
Quote: | What you need to do is:
1) Only include in the list of flags to test those which you will have no qualms using in your final system build, e.g., leave out -malign-double.
2) For each generation of the GA, *ALL* benchmarks are run and a rating is given to that "set" of flags as the GA fitness function.
3) Run the GA until you are satisfied with the overall results (since the set of flags is rather small as far as GAs are concerned, 20 generations should be good with a population of 50-100).
4) Use ALL the flags of the winning GA on your system, because what you are testing is not "flag -fomg-fast is beneficial" but rather "flags -fsometimes-good -falmost-never -fduh-use-me-always and -mim-a-typewriter, when used in tandem, beat -O3 on average" |
This is not too different from now, except for step 3. The danger here is that you overoptimize to this particular aggregate situation, which is only a rough mapping to the space of all apps you will be compiling. By testing each algorithm separately, you have a larger base of variegated populations whose best traits you can extract statistically.
The bottom line is that I'm testing for "nice" flags; you are trying to find an optimum. In the case that interactions are very important to performance (i.e., strong correlation), as you contend, there's no way that the small Acovea tests can predict the performance of real world apps, so the discussion is moot -- every app would have to be optimized separately anyway. If the optimizing interactions are weak but the interactions that cause breakage are strong (as I contend), then you want to draw "valuable" traits from a broad base of organisms. (*)
This is all borne out by the reports on the old thread (mostly anecdotal): programs aren't any faster, but programs build more reliably and execute with far more stability than the canonical -O2 or -O3.
One good suggestion you make is to diligently weed out flags that you would never use anyway, like "-malign-double", from the set of available flags -- they might cause bad interactions with certain flags that are otherwise valuable.
(*) It should be noted that the intended purpose of Acovea is to test compilers against the different supplied benchmarks, or a specific algorithm against a specific compiler. My scripts generate the inference I describe. _________________ Personal overlay | Simple backup scheme