Making full use of cpu registers in CFLAGS

TheCoop · Posted: Mon Apr 07, 2003 6:38 am Post subject:

add 'sse' to the cflags

Gnufsh · Guru Joined: 28 Dec 2002 Posts: 400 Location: Portland, OR

kappax · Posted: Mon Apr 07, 2003 2:16 pm Post subject:

Gnufsh · Guru Joined: 28 Dec 2002 Posts: 400 Location: Portland, OR

Yeah, you should put sse, mmx, and 3dnow in your USE="..."; for some reason xfree only builds with those if they're in the USE variable.

xaviorm · n00b Joined: 07 Mar 2003 Posts: 24

I'm now completely confused. What should my CFLAGS be? My cpu info flags are:

flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm

Also, should I be enabling acpi in the kernel? Is there any place I can go to read up on what all these flags are and how to utilize them?

defconfoo · n00b Joined: 11 Feb 2003 Posts: 19

Actually, that's incorrect. All you need is -march=<proc> OR -msse -mmmx, etc., for the compiler to "allow" for those types of instructions. Also, you need to set -mfpmath=sse for those instructions to be created automatically.

If you want to test me, go for it. :)

Compile a program with some C code and just use the following flags:

-march=pentium3 OR pentium4 OR athlon*
-mfpmath=sse
-O2

You don't need to specify -msse or -mmmx.
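A minimal sketch of such a test program, assuming a made-up file name dot.c and a made-up function dot. Compiling it with something like gcc -O2 -march=pentium3 -mfpmath=sse -S dot.c and reading the assembly should show scalar SSE instructions (mulss, addss) in the loop; on a 32-bit x86 target without -mfpmath=sse you would expect x87 instructions (the fmul/fadd family) instead.

```c
#include <stddef.h>

/* Single-precision dot product: a small floating-point kernel whose
 * generated assembly is easy to read. */
float dot(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];   /* one multiply and one add per element */
    }
    return sum;
}
```

The function is deliberately tiny so that the multiply and add are easy to spot in the compiler output.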

defconfoo · n00b Joined: 11 Feb 2003 Posts: 19

Also, a lot of people put optimizations in their cflags that aren't necessary at all.

-march=<proc> (BTW, this is the minimum processor the compiled code will run on) This activates the flags your processor supports, INCLUDING -mmmx, -msse, -msse2, -maltivec, etc. They may not show up in gcc -v -Q tests, or you may even see something like -mmmx -mno-mmx, but they are nevertheless still active and code is still generated for those instructions.

-mcpu=<proc> (BTW, this option optimizes for this processor but still supports execution on other processors) This would most likely NOT activate flags such as -mmmx, -msse, -msse2, -maltivec, etc., unless the processor given to -march specifies so...

-mfpmath=sse or sse,387
The former generates sse code for floating point (note that either -march OR -msse is also required). The latter generates instructions for both functional units, but I've analyzed the code thoroughly, and it sometimes behaves funny. Trust me, if you are using a pentium3 or 4, you want to stay as far away from the ordinary 387 fpu as possible. Leave it for specialty instructions, because it just doesn't compare. For athlons, I'd use sse,387 instead.
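One way to see which floating-point unit the compiler is targeting, as a rough sketch: C99's <float.h> defines FLT_EVAL_METHOD, which on x86 is normally 0 when math goes through SSE and 2 when it goes through the 387. The function name fp_eval_description is made up for this example, and the wording of the strings is only a rule of thumb.

```c
#include <float.h>

/* Describes how floating-point expressions are evaluated, based on the
 * C99 macro FLT_EVAL_METHOD (may be missing in pre-C99 modes). */
const char *fp_eval_description(void)
{
#if !defined(FLT_EVAL_METHOD)
    return "unknown (FLT_EVAL_METHOD not defined)";
#elif FLT_EVAL_METHOD == 0
    return "each type evaluated in its own precision (typical for SSE math)";
#elif FLT_EVAL_METHOD == 2
    return "everything evaluated as long double (typical for 387 math)";
#else
    return "other or indeterminate";
#endif
}
```

Printing the result from a small main, once with -mfpmath=sse and once with -mfpmath=387, is a quick sanity check that the flag took effect.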

-malign-double or -m128bit-long-double
Stay away from these options please! They break code left and right. Ever get errors such as every file being reported as the same size, in TERABYTES!? It's most likely due to one of these flags...

-mno-push-args
You could specify this. It's not going to make that much of a difference in speed. It reduces dependencies as opposed to a series of push instructions, BUT it greatly decreases decoding bandwidth. Personally, I'd stick with the pushes. Why? Because on the pentium3, the 4-1-1 decoding rule makes a series of complex mov instructions prohibitive. It'll only decode one per clock cycle, so that shoots that right there. Second, on the pentium4, the lack of specialized address generation units means that all such instructions are decomposed into micro-ops anyway, and the way in which the p4 breaks down such a mov as opposed to a push makes the push more efficient.

-maccumulate-outgoing-args
On all the x86 systems I've tested, this doesn't do a damn thing. :p It doesn't matter, -fdefer-pop will accomplish a similar thing, but it's automatically enabled with -O2, -O3, and -Os...soooo.... ;)

-mpreferred-stack-boundary
Leave this option alone. The -O series of flags take care of it.

I'm gonna do another post on the -f series of options...

I hope this helps. :)

defconfoo · n00b Joined: 11 Feb 2003 Posts: 19

First and foremost, avoid SSA optimizations. They're not ready for system building yet. Also, you don't need to specify -fmove-all-movables or -freduce-all-givs. These options perform an optimization similar to -fstrength-reduce, but the latter is much better and is automatically enabled at -O2, -O3, and -Os (for a reason ;D).

As far as the other -f flags go, only a few are not activated by the -O series of optimizations.

-Os enables all important optimizations, plus performs an extra pass to replace certain groups of instructions with smaller instructions that perform the same task. This is actually a very good flag. It optimizes well, and I'd use it, especially on very large libraries/kernel.

-O2 enables the same optimizations as -Os, but does not perform the extra instruction-compact pass. Most importantly, -O2 enables alignment. This includes stack alignment, function alignment, jump alignment, loop alignment, and label alignment (Thanks Bedeox). That means specifying the alignment for these flags is unnecessary. -O2 and -O3 will automatically align the aforementioned to their defaults (which are very good, and tuned to the cache-line-length of the processor, among other smaller details, so do NOT change these unless you know what you're doing).

-O3 enables everything in -O2 in addition to -frename-registers and -finline-functions. The latter inlines ALL functions that meet certain heuristically defined criteria. (Note that -O2 also inlines functions, but only those that have the inline keyword in their prototype.) Use -finline-limit to control the amount of inlining (I'd stick with the default, it's there for a reason). :)
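To make the inlining difference concrete, here is a small sketch with made-up function names. Going by the description above, square is marked inline and so is a candidate at -O2, while cube is not marked and would only become a candidate with -finline-functions (-O3). Newer GCC releases may decide differently, so treat this as an illustration rather than a guarantee.

```c
/* Explicitly marked inline: a candidate for inlining at -O2. */
static inline int square(int x)
{
    return x * x;
}

/* Not marked inline: per the post, only considered with -finline-functions. */
static int cube(int x)
{
    return x * x * x;
}

/* Calls both helpers in a loop so the call sites show up in the assembly. */
int sum_of_powers(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++) {
        total += square(i) + cube(i);
    }
    return total;
}
```

Comparing the output of gcc -O2 -S and gcc -O3 -S for this file shows whether a call to cube survives in the loop.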

-fomit-frame-pointer
For x86, you need this, because -Ox doesn't enable it by default.

-ffast-math
For x86, you *might* want this, because -Ox doesn't enable it by default. For non-critical applications, go for it. :) BTW, this option enables 3 other -f optimizations. If you use -ffast-math, you don't need 'em.
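Why "non-critical" matters, as a small sketch: floating-point addition is not associative, and the -ffast-math family of options allows GCC to regroup operations as if it were. The two made-up functions below differ only in grouping. Under strict IEEE double arithmetic, calling them with 1e16, -1e16, 1.0 gives 1.0 for the first and 0.0 for the second, because the 1.0 is lost when it is added to -1e16 first.

```c
/* (a + b) + c: the grouping as written in most source code. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* a + (b + c): same operands, different grouping. With fast-math style
 * options the compiler is allowed to treat both forms as equivalent. */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

For code that sums pixels or audio samples this rarely matters; for anything that depends on exact rounding it can.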

-fprefetch-loop-arrays
This speeds up execution somewhat for large arrays on platforms that support it. I'm not entirely sure, but I'm almost positive it only works on machines with SSE support. (P3/4, Athlon4/XP)
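Roughly what the flag aims for can be written by hand with GCC's __builtin_prefetch. This is only a sketch of the idea: the function name and PREFETCH_DISTANCE are made up, the distance is not tuned, and the code the compiler actually generates will look different.

```c
#include <stddef.h>

/* How many elements ahead to prefetch; an arbitrary illustrative value. */
#define PREFETCH_DISTANCE 16

/* Sums an array while hinting the cache about elements needed soon. */
long sum_with_prefetch(const int *array, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + PREFETCH_DISTANCE < n) {
            /* GCC builtin: ask for the cache line of a future element. */
            __builtin_prefetch(&array[i + PREFETCH_DISTANCE]);
        }
#endif
        total += array[i];
    }
    return total;
}
```

On compilers without the builtin the function falls back to a plain summation loop, so the result is the same either way.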

-fmerge-all-constants
This reduces the size of your data and text segments by a very small amount, but it helps, so why not? It eliminates redundancy, but it is not ANSI C compliant. Don't worry about that, turn it on if you want it. :p
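The non-compliance comes from the rule that distinct objects must have distinct addresses. A minimal sketch with made-up names: the two tables below hold the same data, and a conforming build keeps them apart, so the function returns 1. With -fmerge-all-constants that is no longer guaranteed, which only matters to code that compares such addresses.

```c
/* Two constant tables with identical contents. */
static const int table_a[4] = {1, 2, 3, 4};
static const int table_b[4] = {1, 2, 3, 4};

/* Returns 1 if the tables live at different addresses, as ISO C
 * requires for distinct objects, and 0 if they were merged. */
int tables_are_distinct(void)
{
    return (const void *)table_a != (const void *)table_b;
}
```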

As far as all the alignment options go, let the -Ox flags control it. They know the best values to use. If you're curious about other optimizations, trust me, the chances are higher that they'll break some package in your system than that they'll increase overall system speed by more than 0.5-1%.

defconfoo · n00b Joined: 11 Feb 2003 Posts: 19

Oh yeah, what the guy above said about putting sse, mmx, and/or 3dnow in your use flags, you should. Some packages have specific, highly optimized routines which utilize these instructions that are only enabled at compile time but are not generated automatically by the compiler.

But I rreeaaally got to run now...I'm late for class. :-p

Lovechild · Posted: Thu Apr 10, 2003 6:15 pm Post subject:

defconfoo... that was an awesome walkthrough..

by any chance, do your studies have anything to do with compiler design?

defconfoo · n00b Joined: 11 Feb 2003 Posts: 19

Hehe...thanks. :)

Actually, I'm into computer architecture.

Oh, I forgot to write about an important flag because I was in a rush:

-funroll-loops or -funroll-all-loops
These flags are overrated. I've studied what the gcc compiler does, and it in no way unrolls loops in an efficient manner. For instance, if you take the following loop:

int total = 0, *array....
for (int i = 0; i < yadayada; i++) {
    total += array[i];
}

GCC will strength reduce this and result in a very well optimized, tight loop, but if you enable unrolling, all it will do is the equivalent of this in assembly (if yadayada is not known :p):

for (int i = 0; i < yadayada; i++) {
    total += array[i++];
    if (i >= yadayada) break;
    total += array[i];
}

This offers absolutely NO speed up. In fact, it would slow things down because the loop would take up more space in the cache. Sometimes, unrolling loops is beneficial...like if the number of iterations is known AND is small (which, most of the time, isn't the case). GCC has a tendency to unroll the entire loop and not change it into a larger loop without as many iterations...for instance, this would yield a higher degree of parallelism:

for (int i = 0; i < yadayada; i++) {
    total1 += array[i++];
    total2 += array[i];
}
total = total1 + total2;

It doesn't do that, and as far as I know, it isn't capable of doing that. This is a really silly example, but I think it makes the point. :\ I think that's why they put the warning in the manual about it slowing down code.
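A self-contained version of the two-accumulator idea, as a sketch: the function name is made up, and unlike the shorthand above it handles an odd element count explicitly so it never reads past the end of the array.

```c
#include <stddef.h>

/* Sums an array with two independent accumulators, so consecutive
 * additions do not have to wait on each other. */
int sum_two_accumulators(const int *array, size_t n)
{
    int total1 = 0, total2 = 0;
    size_t i = 0;

    /* Main loop: two elements per iteration. */
    for (; i + 1 < n; i += 2) {
        total1 += array[i];
        total2 += array[i + 1];
    }
    /* Leftover element when n is odd. */
    if (i < n) {
        total1 += array[i];
    }
    return total1 + total2;
}
```

Whether this is actually faster than the plain loop depends on the processor and the compiler, so it is worth timing before relying on it.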

For a stupid story, one time I built a system using -O3 -finline-limit=1200 -funroll-all-loops. Just booting into X, running GAIM and Mozilla took up ~180 megabytes of memory :p. It wasn't pretty...

Alrighty, I'm tired...too many all-nighters in a row are killing me. :p

Got to nap... :D

Bedeox · n00b Joined: 23 Mar 2003 Posts: 2

Welcome all!

@defconfoo: -O2 enables label alignment - from GCC manpage: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-Os isn't so good for libs if you have >=256 MB ram and reasonably fast HDD
even if you run lots of apps simultaneously (most of the code will be shared)

@kappax: why do you specify -falign* flags? They are properly set (read fastest) using -mcpu and/or -march defaults.

defconfoo · n00b Joined: 11 Feb 2003 Posts: 19

Oh...ok.

On my system, when I compile with -O2 or above, functions and loops are aligned to 16 bytes (which is the best for x86), but in the assembler output, all jumps related to if's and gotos are not aligned. Maybe -O2 enables alignment, but for x86, the alignment default is 1?
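Function alignment can also be checked at run time instead of reading the assembler output. A rough sketch, with made-up names throughout: alignment_of reports the largest power of two (capped at 4096) that divides an address, and sample_function is just a stand-in for whatever function you want to inspect. It says nothing about jump or label alignment inside a function.

```c
#include <stdint.h>

/* Largest power of two, capped at 4096, that divides addr. */
unsigned alignment_of(uintptr_t addr)
{
    unsigned align = 1;
    while (align < 4096 && (addr % (align * 2)) == 0) {
        align *= 2;
    }
    return align;
}

/* Hypothetical sample function whose placement we inspect. */
int sample_function(int x)
{
    return x + 1;
}

/* Alignment of sample_function's entry point as seen at run time. */
unsigned sample_function_alignment(void)
{
    return alignment_of((uintptr_t)&sample_function);
}
```

Building once with plain -O2 and once with -falign-functions=64 and comparing the reported values shows whether the flag changes anything for that function.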

floam · Posted: Sat Apr 12, 2003 3:43 am Post subject:

defconfoo: That walkthrough was excellent, you should post it up somewhere on the web where more people can read it.

ERW1N · n00b Joined: 09 Mar 2003 Posts: 35 Location: Singapore

great info defconfoo

so, which one do you think is better? -O2 or -O3 ?
since -O3 only turns on 2 flags: -finline-functions and -frename-registers, and from above post somebody said that -finline-functions is good and -frename-registers is not suitable for x86 arch....

and how bout
-falign-jumps
-falign-loops
-falign-functions ?

Bedeox · n00b Joined: 23 Mar 2003 Posts: 2

-O3 makes compilation much more memory intensive
and it might create a problem with some apps
(due to inlining, counter it with -fno-inline-functions)

It is faster on CPUs with large cache,
but might be slower on the ones with smaller cache

-frename-registers won't do much on x86, but it will help performance

@defconfoo: Which march/mcpu are you using?

defconfoo · n00b Joined: 11 Feb 2003 Posts: 19

-march=pentium3 for my home computer
-march=athlon-xp (it's actually an Athlon 4) for my laptop

But I didn't compile everything from scratch on my laptop. I gave up after 4 days. =)

Gnufsh · Guru Joined: 28 Dec 2002 Posts: 400 Location: Portland, OR

So functions are aligned to 16 by default on x86, is that the optimum for the athlonxp, with its big L1 cache? Should they be aligned to 64byte boundaries?

taskara · Posted: Sun May 04, 2003 10:14 am Post subject:

ghetto · Guru Joined: 10 Jul 2002 Posts: 369 Location: BC, Canada

taskara · Posted: Mon May 05, 2003 5:26 am Post subject:

that should go for all athlons because they all have 64kb level 1 cache

I think even durons have that

ghetto · Guru Joined: 10 Jul 2002 Posts: 369 Location: BC, Canada

Ok thanks, but I have one more question.. I know that at the beginning of this thread the idea of adding flags like -mmmx -m3dnow etc etc was HIGHLY encouraged.

Is that still the case? Or has it been established that doing so is not really necessary?

Here are my current: CFLAGS="-march=athlon -O2 -mmmx -m3dnow -falign-functions=64 -pipe"

taskara · Posted: Mon May 05, 2003 7:00 am Post subject:

I'm not sure.. I was under the impression that -march=athlon-xp automatically entered all those flags.

however someone posted that it doesn't.

but then someone said that putting -march=athlon-xp -mmmx -m3dnow -msse actually disabled them because it was already enabled in -march=athlon-xp ...

so the short answer?

I'm still confused.

I leave them out, but put them in my USE flagset

it would be GREAT if a dev could confirm this. ..


ghetto · Guru Joined: 10 Jul 2002 Posts: 369 Location: BC, Canada

Whoa.. ok so EITHER they are already in and it doesn't do anything OR putting those flags in actually disables those registers?!?! Eeep!

Ok I'm removing those flags now.. dang that sucks.

Gnufsh · Guru Joined: 28 Dec 2002 Posts: 400 Location: Portland, OR

If I compile with -march=athlon-xp, sse, 3dnow, and mmx are enabled (through the -D__athlon_sse__ -D__tune_athlon__ -D__tune_athlon_sse__ -D__SSE__ -D__MMX__ -D__3dNOW__ -D__3dNOW_A__ macros). When I add, for example, -mmmx, -mno-mmx appears after -mmmx in the "options enabled" list in the output of gcc -Q -v -march=athlon-xp -mmmx. However, -D__MMX__ doesn't go away, so MMX is still used. In short, -mmmx, -msse, and -m3dnow are unnecessary, but they don't hurt.
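The same check can be done from inside a program, which avoids reading the gcc -Q -v output. A small sketch, assuming the macro names listed above plus __SSE2__; the function name enabled_simd_macros is made up for this example.

```c
#include <string.h>

/* Writes a space-separated list of the SIMD feature macros the compiler
 * predefined for this build into buf (capacity len). Returns buf. */
char *enabled_simd_macros(char *buf, size_t len)
{
    if (len == 0) {
        return buf;
    }
    buf[0] = '\0';
#ifdef __MMX__
    strncat(buf, "__MMX__ ", len - strlen(buf) - 1);
#endif
#ifdef __SSE__
    strncat(buf, "__SSE__ ", len - strlen(buf) - 1);
#endif
#ifdef __SSE2__
    strncat(buf, "__SSE2__ ", len - strlen(buf) - 1);
#endif
#ifdef __3dNOW__
    strncat(buf, "__3dNOW__ ", len - strlen(buf) - 1);
#endif
#ifdef __3dNOW_A__
    strncat(buf, "__3dNOW_A__ ", len - strlen(buf) - 1);
#endif
    return buf;
}
```

Printing the result from builds with -march=athlon-xp alone and with -march=athlon-xp -mmmx should give the same list if the extra flag really is redundant.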