Everybody wants the best compiler flags (cflags) to put in their make.conf for speed and optimization, so after days of tests on recent processors I found some interesting news.
Everything said here applies to the Athlon 64, Athlon 64 dual core and Opteron. No other processors were tested.
Toolchain used: gcc-4.1.2, glibc-2.6.1.
Profile: amd64 no-multilib (that means a pure 64-bit system, but that doesn't affect the results).
So, for the impatient, here is the result of two nights and days of performance testing:
######################################################
Best cflags: -march=athlon64 -O3 -pipe -fomit-frame-pointer -finline-functions
######################################################
Shocked??? They look so normal...
No -ffast-math? No -ftracer or other exotic flags?
And, just a note: -finline-functions is already enabled by -O3, and on amd64 gcc omits the frame pointer by default anyway, so the two extra flags are technically redundant. I put them in because some ebuilds replace -O3 with -O2, and the hope is that the two flags are kept alongside the -O2. And -pipe is not a real optimization flag; it just speeds up compilation. So the result boils down to -O3.
Please don't argue with me about these results; I won't answer or go in depth anyway. If you want to know a bit more, read ahead, but then don't argue with me anyway.
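For reference, this is roughly how the result would look in /etc/make.conf (mirroring CFLAGS into CXXFLAGS is the usual Gentoo convention; the MAKEOPTS value is just an example and has nothing to do with optimization):

```shell
# /etc/make.conf -- sketch; only the CFLAGS line comes from the tests above
CFLAGS="-march=athlon64 -O3 -pipe -fomit-frame-pointer -finline-functions"
CXXFLAGS="${CFLAGS}"
MAKEOPTS="-j2"   # example: parallel compile jobs, not an optimization flag
```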
--
For those who want to know how I arrived at that result.
The hardware:
Two similar Asus-based machines, one with an Athlon 64 and one with an Athlon 64 dual core, plus a Supermicro server board (dual-CPU capable) with a single Opteron.
How I performed the tests:
- I started from acovea (simply put, an evolutionary compiler-flag tester), running all tests twice on every machine and noting the results: acovea's best flags, optimistic flags, and pessimistic flags. Every single run gave me different flags, but I had to start somewhere. (Note: acovea's tests are each very specific to a narrow range of operations; they test one sector of the CPU at a time.)
- I care about daily usage, and in daily usage I will never sacrifice one range of CPU operations for another, so I did a cross-strip of cflags: I deleted from acovea's best and optimistic lists any flag appearing in any of the pessimistic lists. From the remaining flags I built 4 groups out of the common best and common optimistic flags, plus I added some "commonly used" groups of best flags (ahem... well... at least "thought to be best" flags). Total: 7 groups of cflags to test.
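The cross-strip step can be sketched with plain coreutils; the flag lists below are invented placeholders, not acovea's actual output:

```shell
#!/bin/sh
# Cross-strip sketch: drop from the "best" list every flag that also
# appears in the "pessimistic" list. The flag lists here are made up.
export LC_ALL=C

printf '%s\n' -O3 -ftracer -funroll-loops | sort > best.txt
printf '%s\n' -ftracer -fno-inline | sort > pessimistic.txt

# comm -23 keeps only the lines unique to the first (sorted) file
comm -23 best.txt pessimistic.txt > stripped.txt
cat stripped.txt   # -O3 and -funroll-loops survive; -ftracer is dropped
```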
- Then I needed programs to compile. Question: will I gain productivity from a 7% speed increase in processing OpenOffice documents? Not me. My typing speed is still slower than my processor.
- Every program was compiled on each machine with each group of flags and benchmarked the same way every time. Yes, that was a whole weekend of testing.
- On the fastest machine I added some runs using exactly acovea's best flags.
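The test loop itself can be sketched like this; bench.c is a toy stand-in for the real programs, and the three flag groups are placeholders for the 7 groups actually tested:

```shell
#!/bin/sh
# Benchmark-loop sketch: compile one workload with each flag group and
# time it. bench.c below is a toy stand-in, not one of the real tests.
command -v gcc >/dev/null 2>&1 || { echo "gcc not found, skipping"; exit 0; }

cat > bench.c <<'EOF'
#include <stdio.h>
int main(void) {
    double s = 0.0;
    for (long i = 1; i < 20000000; i++) s += 1.0 / (double)i;
    printf("%f\n", s);  /* print the sum so the loop is not optimized away */
    return 0;
}
EOF

for flags in "-O2" "-O3" "-O3 -ffast-math"; do
    gcc $flags -o bench bench.c || exit 1
    t0=$(date +%s)
    ./bench > /dev/null
    t1=$(date +%s)
    echo "$flags -> $((t1 - t0)) s"   # coarse 1-second resolution
done
```

A real run would of course use a longer workload and a finer timer than date +%s, but the structure is the same: one binary per flag group, identical input every time.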
Conclusions:
Best cflags: -march=athlon64 -O3 -pipe -fomit-frame-pointer -finline-functions
So let gcc do its job. If you need more raw processor speed, don't waste your time and buy a faster one.
Not much else to say.
Some considerations (before somebody posts something already said or verified elsewhere):
- Recent versions of gcc do a very good job of optimizing. Some flags made a noticeable difference back in the 3.x series; that's over now.
- There are well over a hundred cflags. Some of them are pessimistic on their own but helpful when combined with others, and some are harmful in certain tasks while, in contrast, improving performance in others.
- Modern programs exercise a wide range of tasks, from floating-point calculation to memory-block manipulation; you cannot find one combination of flags that is best for everything.
- The -march=athlon64 flag is the first one to consider. Compared to this flag, nothing else is relevant.
- A well-programmed application can gain much more than any "best" combination of compiler flags. Programmers: review your slow routines.
- A $20 difference when buying a processor can mean a 30% speed increase when rendering an image with Blender. Tweaking flags, maybe 5%. And you won't see the difference surfing with Konqueror or downloading mail.
- Acovea's flags sometimes give a 40% performance increase on... acovea's own tests. That's because they are very specific tests performing one very specific task at a time. On everyday applications they are worse. That said, remember that acovea is a good and sometimes very useful program (see below).
- If you have one very specific program performing one very specific, time-consuming task (say you are a researcher and you wrote a C program to solve the field equations of GR on Riemannian manifolds using a subdivision approach) that can run for days on your machine, then I suggest taking the heaviest functions and testing them in acovea to find the best-performing cflags. There you may really gain speed.
As a last topic, let's nuke some legends:
- "You will not gain from 64-bit compiled programs on a modern 64-bit Athlon (or better) system." False. You WILL gain from 64-bit compiled programs, in some cases even 20%.
- "You will not gain from a dual core over a single core processor." Well, there is some truth in this statement if you use gcc. Applications still aren't optimized for dual processors/dual cores, and some benchmarks found on the internet say you can actually lose performance in some cases. I have no direct experience of this (the single core I used is slower than the dual core anyway, so I can't compare the results), but my guess is that the processor wastes more time splitting threads and tasks than it would spend doing everything on a single core. Using icc seems to increase performance a lot, and Yafray seems to gain much more than other programs when compiled on the dual-core machine.
- "-Os (or -O) is better because apps load faster." Are you serious? Modern programs are made of a bunch of libraries; they are loaded and run as needed, and they usually don't exceed a few megabytes. The kwrite launcher is a few kilobytes, and so are its libraries. They are loaded in sequence, and application startup wastes more time initializing and executing libraries than loading them. So this is simply false in most cases (of course I'm not considering old PCs running short of RAM).
- "-O1 -fomit-frame-pointer -finline-functions is comparable to -O2." False, and the difference is noticeable.
- "-O2 is better than -O3." False, but the difference is often not noticeable.
- "-fomit-frame-pointer and -finline-functions are the first cflags to consider." True. The difference between -O2 and -O3 is pretty much annihilated when you add the two flags to -O2, with -fomit-frame-pointer making the bigger contribution.
- "-ffast-math, -funsafe-math-optimizations, -mfpmath=387, or any combination of the three produces faster code." I wonder how many posts I've read from people claiming to be absolutely sure of this (bundling -funsafe-math-optimizations with -mfpmath=387 is especially common; there are guys out there claiming an impressive 50% increase in some applications!)... WTF!!!! Did they really test it, or did they all read about somebody else who read about it? This is absolutely false. On amd64 gcc uses the SSE unit for floating point by default, so forcing the old x87 unit with -mfpmath=387 is a step backwards: every floating-point-processor-stressing task performs slower (and not only those). A lot slower.
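If you want to verify which optimizations a given -O level really enables instead of trusting forum lore, newer gcc releases (4.3 and later, so not the 4.1.2 used here) can report it directly:

```shell
#!/bin/sh
# Ask gcc which optimizer flags each -O level enables (needs gcc >= 4.3;
# on the gcc-4.1.x used in this post you'd have to read the manual instead).
command -v gcc >/dev/null 2>&1 || { echo "gcc not found, skipping"; exit 0; }

echo "-O2:"; gcc -Q -O2 --help=optimizers | grep -E -- '-finline-functions[[:space:]]'
echo "-O3:"; gcc -Q -O3 --help=optimizers | grep -E -- '-finline-functions[[:space:]]'
```

Each line of output shows the flag name followed by [enabled] or [disabled] for that -O level.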
--
I won't post the raw results; I really have a dozen handwritten sheets on my desk and I don't think I'll ever find the time to reorder and post them, so please don't ask me for them.
For the most curious, I include a set of povray tests performed on the fastest machine (the other results are consistent). Same scene rendered in every test; the scene includes transparency and reflections, with radiosity calculated. Pure -O1 and -O2 are omitted; they are of no interest since they are worse anyway. I just want to highlight the interesting and most-discussed differences.
Flags -> rendering time (lower is better)
- Acovea's common best flags (stripped) -> 1:13
- Acovea's common optimistic flags (stripped) -> 1:11
- Any of acovea's raw best sets -> always above 1:17 (I even got 1:21 in one case)
-Os -> 1:16
-O1 -fomit-frame-pointer -finline-functions -> 1:14
-O2 -fomit-frame-pointer -finline-functions -> 1:10
-O3 -> 1:09 (!)
-O3 -mfpmath=387 -> 1:12
-O3 -ffast-math -> 1:16
-O3 -ffast-math -mfpmath=387 -> 1:17
-O3 -funsafe-math-optimizations -> 1:18
-O3 -funsafe-math-optimizations -mfpmath=387 -> 1:18