Gentoo Forums
Making full use of cpu registers in CFLAGS
TheCoop
Veteran
Joined: 15 Jun 2002
Posts: 1814
Location: Where you least expect it

Posted: Mon Apr 07, 2003 6:38 am

add 'sse' to the cflags
_________________
95% of all computer errors occur between chair and keyboard (TM)

"One World, One web, One program" - Microsoft Promo ad.
"Ein Volk, Ein Reich, Ein Führer" - Adolf Hitler

Change the world - move a rock
Gnufsh
Guru
Joined: 28 Dec 2002
Posts: 400
Location: Portland, OR

Posted: Mon Apr 07, 2003 7:11 am

Quote:

CFLAGS="-march=athlon-xp -O3 -fomit-frame-pointer -pipe -ffast-math -fprefetch-loop-arrays -funroll-loops -finline-functions -falign-jumps=4 -falign-loops=4 -falign-functions=64 -fforce-addr -mmmx -msse -m3dnow -mfpmath=sse,387"
should be rather fast. As we've pointed out, -march=athlon-xp enables mmx, sse, and 3dnow, so there is no point in specifying -mmmx, -msse, and -m3dnow. I don't think they do any harm, though. -ffast-math might cause problems for anything that needs accurate math. I just recompiled with: CFLAGS="-march=athlon-xp -O3 -fomit-frame-pointer -pipe -fprefetch-loop-arrays -funroll-loops -falign-jumps=4 -falign-loops=4 -falign-functions=5 -fforce-addr" and everything seems fine so far.


pagal: -march=pentium3 -O3 -fomit-frame-pointer -pipe is a start; I don't know how your processor will fare with the other settings. You may even drop back to -O2 because of the smaller L1 cache (as compared to the athlon, which benefits more from function inlining). -fprefetch-loop-arrays will probably help (possibly more than it does on the athlon). I just checked, and -march=pentium3 enables -D__SSE__ and -D__MMX__, so your sse and mmx instructions should get used without any extra flags (other than -march=pentium3).

wrc1944: I think you have it backwards. -mcpu=athlon-xp will generate code optimized for an athlon-xp, but still able to run on an i386. -march=athlon-xp implies -mcpu (according to both the docs and my testing), while also enabling features that break support for other CPUs (mmx, sse, 3dnow, etc.).

edit: for some reason [/quote] magically appeared at the end of my message. Why? My quote is closed. Where did it come from? What does it want?
kappax
Apprentice
Joined: 30 Aug 2002
Posts: 273
Location: The Moon

Posted: Mon Apr 07, 2003 2:16 pm


Whee, I dropped the redundant flags, so now I have:

Code:

CFLAGS="-march=athlon-xp -O3 -fomit-frame-pointer -pipe -ffast-math -fprefetch-loop-arrays -funroll-loops -finline-functions -falign-jumps=4 -falign-loops=4 -falign-functions=64  -fforce-addr -mfpmath=sse,387"


Oh, and I was reading up on USE; it seemed that X was not using sse or mmx, but now it does:

Code:

USE="-3dfx 3dnow mmx sse alsa cups kde gnome opengl samba"

_________________
My Box
glxgears - 4083.400 FPS
OS: GNU/Linux
Distro: Gentoo
kernel: 2.6.0-test9-mm2
----------------------
vi makes me :wq in word pad :(
Gnufsh
Guru
Joined: 28 Dec 2002
Posts: 400
Location: Portland, OR

Posted: Wed Apr 09, 2003 4:12 pm

Yeah, you should put sse, mmx, and 3dnow in your USE="...". For some reason, xfree only builds with those if they're in the USE variable.
xaviorm
n00b
Joined: 07 Mar 2003
Posts: 24

Posted: Thu Apr 10, 2003 3:24 pm    Post subject: So what should my CFLAGS be?

I'm now completely confused. What should my CFLAGS be? My cpuinfo flags are:

flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm

Also, should I be enabling acpi in the kernel? Is there any place I can go to read up on what all these flags are and how to utilize them?
defconfoo
n00b
Joined: 11 Feb 2003
Posts: 19

Posted: Thu Apr 10, 2003 5:29 pm

Actually, that's incorrect. All you need is -march=<proc> OR -msse -mmmx, etc., for the compiler to "allow" those types of instructions. Also, you need to set -mfpmath=sse for SSE instructions to be generated automatically for floating-point code.

If you want to test me, go for it. :)

Compile a program with some C code and just use the following flags:

-march=pentium3 OR pentium4 OR athlon*
-mfpmath=sse
-O2

You don't need to specify -msse or -mmmx.
defconfoo
n00b
Joined: 11 Feb 2003
Posts: 19

Posted: Thu Apr 10, 2003 5:44 pm

Also, a lot of people put optimizations in their cflags that aren't necessary at all.

-march=<proc> (BTW, this is the minimum processor the compiled code will run on). This activates the instruction-set flags your processor supports, INCLUDING -mmmx, -msse, -msse2, -maltivec, etc. They may not show up in gcc -v -Q tests, or you may even see something like -mmmx -mno-mmx, but they are nevertheless active and code is still generated for those instructions.

-mcpu=<proc> (BTW, this option optimizes for this processor but still supports execution on other processors). This would most likely NOT activate flags such as -mmmx, -msse, -msse2, -maltivec, etc., unless the processor given to -march specifies so...

-mfpmath=sse or sse,387
The former generates SSE code for floating-point math (note that either -march or -msse is also required). The latter generates instructions for both functional units, but I've analyzed the code thoroughly, and it sometimes behaves funny. Trust me, if you are using a pentium3 or 4, you want to stay as far away from the ordinary 387 FPU as possible. Leave it for specialty instructions, because it just doesn't compare. For athlons, I'd use sse,387 instead.

-malign-double or -m128bit-long-double
Stay away from these options, please! They break code left and right. Ever get errors such as every file appearing to be the same size in TERABYTES!? It's most likely due to one of these flags...

-mno-push-args
You could specify this. It's not going to make that much of a difference in speed. It reduces dependencies compared to a series of push instructions, BUT it greatly decreases decoding bandwidth. Personally, I'd stick with the pushes. Why? First, on the pentium3, the 4-1-1 decoding rule makes a series of complex mov instructions prohibitive. It'll only decode one per clock cycle, so that shoots that right there. Second, on the pentium4, the lack of specialized address generation units means that all such instructions are decomposed into micro-ops anyway, and the way in which the p4 breaks down such a mov as opposed to a push makes the push more efficient.

-maccumulate-outgoing-args
On all the x86 systems I've tested, this doesn't do a damn thing. :p It doesn't matter; -fdefer-pop accomplishes a similar thing, and it's automatically enabled with -O2, -O3, and -Os...soooo.... ;)

-mpreferred-stack-boundary
Leave this option alone. The -O series of flags take care of it.

I'm gonna do another post on the -f series of options...

I hope this helps. :)
defconfoo
n00b
Joined: 11 Feb 2003
Posts: 19

Posted: Thu Apr 10, 2003 6:03 pm

First and foremost, avoid SSA optimizations. They're not ready for system building yet. Also, you don't need to specify -fmove-all-movables or -freduce-all-givs. These options perform a similar optimization to -fstrength-reduce, but the latter is much better and is automatically enabled at -O2, -O3, and -Os (for a reason ;D).

As far as the other -f flags go, only a few are not activated by the -O series of optimizations.

-Os enables all important optimizations, plus performs an extra pass to replace certain groups of instructions with smaller instructions that perform the same task. This is actually a very good flag. It optimizes well, and I'd use it, especially on very large libraries or the kernel.

-O2 enables the same optimizations as -Os, but does not perform the extra instruction-compact pass. Most importantly, -O2 enables alignment. This includes stack alignment, function alignment, jump alignment, loop alignment, and label alignment (Thanks Bedeox). That means specifying the alignment for these flags is unnecessary. -O2 and -O3 will automatically align the aforementioned to their defaults (which are very good, and tuned to the cache-line-length of the processor, among other smaller details, so do NOT change these unless you know what you're doing).

-O3 enables everything in -O2 in addition to -frename-registers and -finline-functions. This inlines ALL functions that reach certain heuristically defined criteria. (Note that -O2 also inlines functions, but only those that have the inline keyword in their prototype) Use -finline-limit to control the amount of inlining (I'd stick with default, it's there for a reason). :)

-fomit-frame-pointer
For x86, you need this, because -Ox doesn't enable it by default.

-ffast-math
For x86, you *might* want this, because -Ox doesn't enable it by default. For non-critical applications, go for it. :) BTW, this option enables 3 other -f optimizations. If you use -ffast-math, you don't need 'em.

-fprefetch-loop-arrays
This speeds up execution somewhat for large arrays on platforms that support it. I'm not entirely sure, but I'm almost positive it only works on machines with SSE support. (P3/4, Athlon4/XP)

-fmerge-all-constants
This reduces the size of your data and text segments by a very small amount, but it helps, so why not? It eliminates redundancy, but it is not ANSI C compliant. Don't worry about that; turn it on if you want it. :p

As far as all the alignment options go, let the -Ox flags control them. They know the best values to use. If you're curious about other optimizations, trust me: the chances are higher that one will break some package in your system than that it will increase overall system speed by more than 0.5-1%.


Last edited by defconfoo on Mon Apr 14, 2003 3:01 am; edited 1 time in total
defconfoo
n00b
Joined: 11 Feb 2003
Posts: 19

Posted: Thu Apr 10, 2003 6:05 pm

Oh yeah, what the guy above said about putting sse, mmx, and/or 3dnow in your use flags, you should. Some packages have specific, highly optimized routines which utilize these instructions that are only enabled at compile time but are not generated automatically by the compiler.

But I rreeaaally got to run now...I'm late for class. :-p
Lovechild
Advocate
Joined: 17 May 2002
Posts: 2858
Location: Århus, Denmark

Posted: Thu Apr 10, 2003 6:15 pm

defconfoo... that was an awesome walkthrough..

by any chance do your studies have anything to do with compiler design ;)
defconfoo
n00b
Joined: 11 Feb 2003
Posts: 19

Posted: Thu Apr 10, 2003 8:02 pm

Hehe...thanks. :)

Actually, I'm into computer architecture.

Oh, I forgot to write about an important flag because I was in a rush:

-funroll-loops or -funroll-all-loops
These flags are overrated. I've studied what the gcc compiler does, and it in no way unrolls loops in an efficient manner. For instance, if you take the following loop:

int total = 0, *array....
for (int i = 0; i < yadayada; i++) {
    total += array[i];
}

GCC will strength reduce this and result in a very well optimized, tight loop, but if you enable unrolling, all it will do is the equivalent of this in assembly (if yadayada is not known :p):

for (int i = 0; i < yadayada; i++) {
    total += array[i++];
    if (i >= yadayada) break;
    total += array[i];
}

This offers absolutely NO speed up. In fact, it would slow things down because the loop would take up more space in the cache. Sometimes, unrolling loops is beneficial...like if the number of iterations is known AND is small (which most of the time, this isn't the case). GCC has a tendency to unroll the entire loop and not change it into a larger loop without as many iterations...for instance, this would yield a higher degree of parallelism

int total1 = 0, total2 = 0; /* assumes yadayada is even */
for (int i = 0; i < yadayada; i++) {
    total1 += array[i++];
    total2 += array[i];
}
total = total1 + total2;

It doesn't do that, and as far as I know, it isn't capable of doing that. This is a really silly example, but I think it makes the point. :\ I think that's why they put the warning in the manual about it slowing down code.

For a stupid story, one time I built a system using -O3 -finline-limit=1200 -funroll-all-loops. Just booting into X and running GAIM and Mozilla took up ~180 megabytes of memory :p. It wasn't pretty...

Alrighty, I'm tired...too many all-nighters in a row are killing me. :p

Got to nap... :D
Bedeox
n00b
Joined: 23 Mar 2003
Posts: 2

Posted: Fri Apr 11, 2003 2:36 pm

Welcome all!

@defconfoo: -O2 enables label alignment - from GCC manpage: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-Os isn't so good for libs if you have >=256 MB ram and reasonably fast HDD
even if you run lots of apps simultaneously (most of the code will be shared)

@kappax: why do you specify -falign* flags? They are properly set (read fastest) using -mcpu and/or -march defaults.
defconfoo
n00b
Joined: 11 Feb 2003
Posts: 19

Posted: Fri Apr 11, 2003 10:18 pm

Oh...ok.

On my system, when I compile with -O2 or above, functions and loops are aligned to 16 bytes (which is the best for x86), but in the assembler output, the jumps related to ifs and gotos are not aligned. Maybe -O2 enables alignment, but for x86 the jump alignment default is 1?
floam
Veteran
Joined: 27 Oct 2002
Posts: 1067
Location: Vancouver, WA USA

Posted: Sat Apr 12, 2003 3:43 am

defconfoo: That walkthrough was excellent; you should post it somewhere on the web where more people can read it.
ERW1N
n00b
Joined: 09 Mar 2003
Posts: 35
Location: Singapore

Posted: Sat Apr 12, 2003 4:04 am

great info defconfoo ;)

so, which one do you think is better? -O2 or -O3 ?
since -O3 only turns on 2 flags, -finline-functions and -frename-registers, and somebody in a post above said that -finline-functions is good and -frename-registers is not suitable for the x86 arch....

and how about
-falign-jumps
-falign-loops
-falign-functions ?
_________________
AthlonXP 2100+ :: 512 MB DDR :: Radeon 8500 (128mb) :: Sound Blaster Audigy
2.4.20-gentoo-r5 :: XFree 4.3.0-r2 :: Gnome 2.2.1
Bedeox
n00b
Joined: 23 Mar 2003
Posts: 2

Posted: Sun Apr 13, 2003 4:51 pm

-O3 makes compilation much more memory intensive
and it might create a problem with some apps
(due to inlining, counter it with -fno-inline-functions)

It is faster on CPUs with large cache,
but might be slower on the ones with smaller cache

-frename-registers won't do much on x86, but it will help performance

@defconfoo: Which march/mcpu are you using?
defconfoo
n00b
Joined: 11 Feb 2003
Posts: 19

Posted: Mon Apr 14, 2003 2:57 am

-march=pentium3 for my home computer
-march=athlon-xp (it's actually an Athlon 4) for my laptop

But I didn't compile everything from scratch on my laptop. I gave up after 4 days. =)
Gnufsh
Guru
Joined: 28 Dec 2002
Posts: 400
Location: Portland, OR

Posted: Mon Apr 14, 2003 7:01 pm

So functions are aligned to 16 by default on x86, is that the optimum for the athlonxp, with its big L1 cache? Should they be aligned to 64byte boundaries?
taskara
Advocate
Joined: 10 Apr 2002
Posts: 3763
Location: Australia

Posted: Sun May 04, 2003 10:14 am

Gnufsh wrote:
So functions are aligned to 16 by default on x86, is that the optimum for the athlonxp, with its big L1 cache? Should they be aligned to 64byte boundaries?


if that is the case, then surely it should be set to 64 bytes!
so therefore amd users should add the CFLAG
Code:
 -falign-functions=64
to make.conf

agreed ?
_________________
Kororaa install method - have Gentoo up and running quickly and easily, fully automated with an installer!
ghetto
Guru
Joined: 10 Jul 2002
Posts: 369
Location: BC, Canada

Posted: Mon May 05, 2003 4:50 am

taskara wrote:
if that is the case, then surely it should be set to 64 bytes!
so therefore amd users should add the CFLAG
Code:
 -falign-functions=64
to make.conf

agreed ?


Does that go for all AMD users? Or just amd-xp.
I have just a plain amd athlon (not thunderbird) ..what would I set it to?


cat /proc/cpuinfo
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1009.000
cache size : 256 KB
_________________
Blizzard you suck.
taskara
Advocate
Joined: 10 Apr 2002
Posts: 3763
Location: Australia

Posted: Mon May 05, 2003 5:26 am

that should go for all athlons because they all have 64kb level 1 cache :)

I think even durons have that
_________________
Kororaa install method - have Gentoo up and running quickly and easily, fully automated with an installer!
ghetto
Guru
Joined: 10 Jul 2002
Posts: 369
Location: BC, Canada

Posted: Mon May 05, 2003 6:39 am

Ok thanks, but I have one more question.. I know that at the beginning of this thread the idea of adding flags like -mmmx -m3dnow etc. was HIGHLY encouraged.

Is that still the case? Or has it been established that doing so is not really necessary?

Here are my current: CFLAGS="-march=athlon -O2 -mmmx -m3dnow -falign-functions=64 -pipe"
_________________
Blizzard you suck.
taskara
Advocate
Joined: 10 Apr 2002
Posts: 3763
Location: Australia

Posted: Mon May 05, 2003 7:00 am

I'm not sure.. I was under the impression that -march=athlon-xp automatically entered all those flags.

however someone posted that it doesn't.

but then someone said that putting -march=athlon-xp -mmmx -m3dnow -msse actually disabled them because it was already enabled in -march=athlon-xp ...

so the short answer?

I'm still confused.

I leave them out, but put them in my USE flagset


it would be GREAT if a dev could confirm this. .. ;)
_________________
Kororaa install method - have Gentoo up and running quickly and easily, fully automated with an installer!
ghetto
Guru
Joined: 10 Jul 2002
Posts: 369
Location: BC, Canada

Posted: Mon May 05, 2003 6:04 pm

Whoa.. ok, so EITHER they are already in and it doesn't do anything, OR putting those flags in actually disables those registers?!?! Eeep! 8O

Ok, I'm removing those flags now.. dang, that sucks.
_________________
Blizzard you suck.
Gnufsh
Guru
Joined: 28 Dec 2002
Posts: 400
Location: Portland, OR

Posted: Mon May 05, 2003 7:39 pm

If I compile with -march=athlon-xp, sse, 3dnow, and mmx are enabled (through the -D__athlon_sse__ -D__tune_athlon__ -D__tune_athlon_sse__ -D__SSE__ -D__MMX__ -D__3dNOW__ -D__3dNOW_A__ macros). When I add, for example, -mmmx, -mno-mmx appears after -mmmx in the "options enabled" list in the output of gcc -Q -v -march=athlon-xp -mmmx. However, -D__MMX__ doesn't go away, so MMX is still used. In short, -mmmx, -msse, and -m3dnow are unnecessary, but they don't hurt.
All times are GMT
Page 3 of 7