Gentoo Forums
Making full use of cpu registers in CFLAGS

ph317
n00b

Joined: 02 Jun 2002
Posts: 43

Posted: Thu Jul 10, 2003 4:11 am    Post subject: Re: cache and alignment

odegard wrote:
ph317 wrote:
A few corrections to some misinfo above:
First off, L1 and L2 caches are separate, even on athlons.


Actually, *only* on athlons.

However, I was thinking, what is the bottleneck on modern computers? I/O. So why don't we optimize the code for a smaller footprint rather than for faster execution? Let's be utterly simplistic and say that there are two variables: LOAD and EXECUTE. LOAD is far bigger than EXECUTE, so in order to get a total boost, get LOAD down, even though the code may take longer to EXECUTE.

Agree/disagree?


L1 and L2 are separate on all processors that have both such things. They are entirely different types of memory: the L1 is much faster than the L2, and therefore much more expensive per byte and much smaller. Being different kinds of memory, attached at totally different places electrically, they are different. If the L1 and L2 of a processor were the same, there would be no point in calling them L1 and L2 to begin with; you would just say you had a huge slow L1 or a small fast L2 or something.

On the I/O point: yes, normal tasks on a desktop system these days are more I/O bound than CPU bound, but they're bound by things like disks, network cards, the net itself, and of course your keyboard and mouse speed. You wouldn't believe how much time the average PC spends twiddling its thumbs waiting on the end user. In terms of the instruction optimizations this thread is talking about, going from a widely-aligned, loop-unrolled, fat set of optimizations to -Os with alignments set to zero doesn't really make a difference by lowering I/O load per se. If it helps, it helps because smaller, tighter code keeps more references local to the L1 and/or L2 cache instead of taking a cache miss and going out to slow main memory. There are definitely some tradeoffs involved, of course. On a Xeon with a couple of megs of L2 cache it's probably not worth going -Os, but if whatever x86 clone you're using has 128k or less of L2, it could very well help. Benchmarking your own CPU on the tasks you generally run is the best way to tell.

odegard
Guru

Joined: 08 Mar 2003
Posts: 324
Location: Trondheim, NO

Posted: Sat Jul 12, 2003 5:58 pm    Post subject: Re: cache and alignment

ph317 wrote:
odegard wrote:
ph317 wrote:
A few corrections to some misinfo above:
First off, L1 and L2 caches are separate, even on athlons.


Actually, *only* on athlons.


L1 and L2 are separate on all processors that have both such things.


Yes, they are separate entities physically. What I meant was that in a P4 the caches are INCLUSIVE, meaning that everything contained in the L1 cache is duplicated in the L2 cache (actually, the P4 has two kinds of L1 cache, but that's a different story). In an Athlon, however, they are EXCLUSIVE. Now perhaps my reply makes more sense. I was talking about separate entities FUNCTIONALLY, while I guess you meant physically...

Anyway, nothing to argue about.

Gandalf_Grey_
Apprentice

Joined: 19 Apr 2003
Posts: 151

Posted: Tue Jul 15, 2003 3:18 am

I have an athlon tbird @1.33 ghz. cat /proc/cpuinfo returns this
Code:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon(tm) Processor
stepping        : 4
cpu MHz         : 1343.062
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 2680.42


and my current flags are

-march=athlon-tbird -O3 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -fmerge-all-constants -m3dnow -mmmx -falign-functions=128

Does anyone see any blatant errors with this, or places I could improve? I have successfully compiled the gimp and there was a noticeable improvement in start time. However, would this be sufficient to compile something as picky as OpenOffice?

higman
n00b

Joined: 12 Jun 2002
Posts: 7
Location: Langley, B.C., Canada

Posted: Wed Jul 16, 2003 3:48 pm

I have a tbird @ 1.4, runs well, also doubles as a space heater!
Gandalf_Grey_ wrote:
-march=athlon-tbird -O3 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -fmerge-all-constants -m3dnow -mmmx -falign-functions=128

Does anyone see any blatant errors with this, or places I could improve? I have successfully compiled the gimp and there was a noticeable improvement in start time. However, would this be sufficient to compile something as picky as OpenOffice?

I'm using: -march=athlon-tbird -O3 -pipe -fomit-frame-pointer

What flags were you using before and which ones did you add to get this boost? As for your flags... after reading this entire thread and investigating a little on my own...

-m3dnow and -mmmx are redundant (-march=athlon-tbird implies them)
-falign-functions=128 is insignificant and/or dangerous; to the best of my knowledge, the compiler has good defaults for the different CPUs (presumably tuned by the developers?)
-ffast-math relaxes the IEEE/ISO floating-point rules, so code that depends on exact results can fail
-fmerge-all-constants reduces size by a small amount with no other gain.

-O3 -pipe -fomit-frame-pointer looks good to me. I don't know anything about -mno-push-args though, are the tbirds not stack friendly?
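
To make the "redundant" point above concrete, here is a minimal sketch in plain POSIX sh that lists the flags in a CFLAGS string which its own -march= value already switches on. The function names and the lookup table are hypothetical; the athlon-tbird row comes from this post, and the other rows are an assumption based on the GCC manual's description of each -march value, so check them against your own GCC version.

```shell
# Hand-written table (an assumption, not queried from GCC): which -m flags
# each -march value already turns on.
implied_by_march() {
    case $1 in
        athlon|athlon-tbird) printf '%s\n' "-mmmx -m3dnow" ;;
        athlon-xp)           printf '%s\n' "-mmmx -m3dnow -msse" ;;
        pentium3)            printf '%s\n' "-mmmx -msse" ;;
        pentium4)            printf '%s\n' "-mmmx -msse -msse2" ;;
        *)                   printf '\n' ;;
    esac
}

# Print the flags in "$1" that its own -march= setting already implies.
redundant_flags() {
    march=""
    for f in $1; do
        case $f in
            -march=*) march=${f#-march=} ;;
        esac
    done
    out=""
    for f in $1; do
        for i in $(implied_by_march "$march"); do
            if [ "$f" = "$i" ]; then
                out="${out:+$out }$f"
            fi
        done
    done
    printf '%s\n' "$out"
}

# The flag set quoted above.
redundant_flags "-march=athlon-tbird -O3 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -fmerge-all-constants -m3dnow -mmmx -falign-functions=128"
```

For the flags quoted above this prints `-m3dnow -mmmx`. Dropping them changes nothing in the generated code; it only shortens the command line.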

TeeHee
n00b

Joined: 24 Jun 2003
Posts: 7

Posted: Sat Jul 19, 2003 6:51 pm

'nother question here.

Trying to use openmosix on two machines using different cflags.

Anyone had success? Problems? Anything?

aardvark
Guru

Joined: 30 Jun 2002
Posts: 576

Posted: Sat Jul 19, 2003 7:00 pm

elektrohirn wrote:
hey i just compiled even openoffice 1.1beta2 with the above cflags. that really surprises me, because the ebuild tells you that openoffice is very fragile with aggressive cflags ... but openoffice is so stunningly fast now!


Doesn't the openoffice ebuild filter out most flags though?

higman
n00b

Joined: 12 Jun 2002
Posts: 7
Location: Langley, B.C., Canada

Posted: Sat Jul 19, 2003 10:10 pm

aardvark wrote:
Doesn't the openoffice ebuild filter out most flags though?

yes, it does, here's a segment from /usr/portage/app-office/openoffice/openoffice-1.1_beta2-r1.ebuild:
Code:
inherit flag-o-matic eutils
# Compile problems with these ...
filter-flags "-funroll-loops"
filter-flags "-fomit-frame-pointer"
replace-flags "-O3" "-O2"
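
For anyone wondering what those three lines do to your CFLAGS, here is a rough sketch in plain POSIX sh. It is not Portage's flag-o-matic code: `filter_flag` and `replace_flag` are hypothetical stand-ins that only do exact word matches, and the starting CFLAGS value is just an example.

```shell
# Example starting value (an assumption, not taken from any real make.conf).
CFLAGS="-march=athlon-tbird -O3 -pipe -fomit-frame-pointer -funroll-loops"

# Drop every word of CFLAGS that exactly matches $1.
filter_flag() {
    out=""
    for f in $CFLAGS; do
        if [ "$f" != "$1" ]; then
            out="${out:+$out }$f"
        fi
    done
    CFLAGS=$out
}

# Swap every word of CFLAGS that exactly matches $1 for $2.
replace_flag() {
    out=""
    for f in $CFLAGS; do
        if [ "$f" = "$1" ]; then
            f=$2
        fi
        out="${out:+$out }$f"
    done
    CFLAGS=$out
}

# Same sequence as the ebuild segment above.
filter_flag "-funroll-loops"
filter_flag "-fomit-frame-pointer"
replace_flag "-O3" "-O2"
printf '%s\n' "$CFLAGS"
```

With that starting value the result is `-march=athlon-tbird -O2 -pipe`, so whatever aggressive set you put in make.conf, openoffice ends up being built with something much tamer.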

T2
n00b

Joined: 01 Jun 2002
Posts: 67
Location: Slovenia

Posted: Wed Jul 23, 2003 5:50 pm

I've read the whole thread; it's really informative (and confusing at moments).
I'm staying with the tried and trusted CFLAGS="-march=athlon-tbird -O3 -pipe"
for my tbird 1.33ghz.
IMHO critical packages such as the kernel (and mplayer :lol: ) do their own CPU optimisations, which are satisfactory. However, I'm tempted to try some aggressive gcc compile flags to overcome openoffice's slowness.
regards

Gandalf_Grey_
Apprentice

Joined: 19 Apr 2003
Posts: 151

Posted: Thu Jul 24, 2003 1:21 am

The cflags I mentioned above compiled openoffice fine, and it feels noticeably more responsive than the binary install. Before I changed my flags I had

-march=athlon-tbird -O3 -pipe

I did some research and it seems my current ones (mentioned above) are about as aggressive as I can get without breaking compiles left and right

FastTurtle
Guru

Joined: 03 Sep 2002
Posts: 477
Location: Flakey Shake & Bake Caliornia, USA

Posted: Thu Jul 24, 2003 2:38 pm

I've got an XP1800 and these are the flags I'm using.

-march=athlon -m3dnow -mmmx -msse -O3 -pipe.

:cry: Because my last build went south with more aggressive flags, I'm sticking with stability over speed right now because I've got a full gig of ram. Speed isn't a problem that I've noticed except with Open Office taking forever to load. :?

As far as this thread goes, I'm real happy to have read the entire thing. Maybe I will begin testing some of the optimizations and seeing what speeds things up, especially KDE/Office 1.03 and other large apps.

Gandalf_Grey_
Apprentice

Joined: 19 Apr 2003
Posts: 151

Posted: Thu Jul 24, 2003 7:54 pm

FastTurtle wrote:
I've got an XP1800 and these are the flags I'm using.

-march=athlon -m3dnow -mmmx -msse -O3 -pipe.

:cry: Because my last build went south with more aggressive flags, I'm sticking with stability over speed right now because I've got a full gig of ram. Speed isn't a problem that I've noticed except with Open Office taking forever to load. :?

As far as this thread goes, I'm real happy to have read the entire thing. Maybe I will begin testing some of the optimizations and seeing what speeds things up, especially KDE/Office 1.03 and other large apps.


If you have an athlon XP I hardly think that using the athlon-xp cflag is being aggressive.

Forge
Tux's lil' helper

Joined: 20 Jun 2002
Posts: 125
Location: KOP, PA, USA

Posted: Fri Jul 25, 2003 10:22 am

OK, here's my semi-definitive Pentium/Athlon features guide and cache lecture.. I hope lynx doesn't barf.
(These are only cflag-relevant features, but I won't go into cache line sizes, etc.)
486: Not much. FPU.... Usually.
Pentium non-MMX: Same as 486, but i586.
Pentium MMX: adds MMX. Duh.
Pentium 2: Same as Pentium MMX, now i686.
Pentium 3: Adds SSE.
Pentium 4: Adds SSE2.

Athlon: Pentium 2, plus Advanced (aka Athlon) 3Dnow. Same cflags as any K6-* as far as 3Dnow goes.
Athlon Tbird (on-die L2, socketed Athlon): Same as Athlon.
Athlon XP: Adds SSE, known as '3Dnow Professional' for marketing reasons. 3Dnow Pro actually includes new 3Dnow instructions, as well as finishing out SSE support (Athlons with MMX and 3Dnow had *some* of the SSE instructions, but not enough to use it as SSE)
Athlon XP (Barton): Goes to 512K L2 (instead of the 256K on Tbird through Athlon XP).
Athlon64/Opteron: Adds SSE2, 1MB (1024KB) L2.

Celeron '1' (266MHz through 533MHz): Pentium 2, with 128K L2 instead of 512K/256K. The 266, 300 non-A, and 333 non-A versions actually have NO L2 whatsoever. These are fairly rare, though, and slot-only, FWIR.
Celeron '2' (533A MHz through 1.4GHz): Same as a Pentium 3; SSE is added to the basic '1' Celeron. Early versions had 128K L2; a little past 1GHz, they moved to 256K L2.
Celeron 'P4' (1.6-2.4 or so): Same as a Pentium 4 (MMX, SSE, SSE2), only 128K L2 cache, though.
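
If you'd rather check your own box than trust a table like the one above, a small sketch along these lines picks the SIMD features out of a /proc/cpuinfo "flags" line. `simd_features` is a made-up helper name, and it only looks for the four features discussed here.

```shell
# List which of mmx, sse, sse2 and 3dnow appear in a cpuinfo "flags" line.
# Matching is on whole words, so "mmxext" and "3dnowext" are not counted.
simd_features() {
    found=""
    for flag in $1; do
        case $flag in
            mmx|sse|sse2|3dnow) found="${found:+$found }$flag" ;;
        esac
    done
    printf '%s\n' "$found"
}

# Flags line copied from the Athlon Tbird /proc/cpuinfo posted earlier in this thread.
simd_features "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow"
```

For that Tbird it prints `mmx 3dnow`, which matches the "Athlon Tbird: same as Athlon" row: no SSE. On a live system you could feed it the real line, e.g. `simd_features "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"`.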

Now, as for cache sizes: Pretty much all of the Pentiums (P2 through P3 for sure) had 32K L1. This is divided into 16K of 'instructions' and 16K of 'data' cache. L1 cache and L2 cache are 'inclusive'. This means that any data that is in L1 MUST be in L2 also. Therefore a Pentium 2 with 32K of L1 and a 512K L2 has a TOTAL usable cache of only 512K. The Pentium 1's and MMXes had variable amounts of L2, sometimes 512K, sometimes 1MB, sometimes 2MB, always on the motherboard. Pentium 2's have 512K of L2 cache on the CPU card, but not on the core; it runs at half the speed of the CPU itself. The Pentium 3 had the same arrangement at first, 512K on card. Later Pentium 3's (Coppermine core) had 256K of L2 cache on the CPU core, running at full CPU speed. All Celerons that have L2 have it on-die at full speed.

The Pentium 4 is the odd duck out... It has '12k micro ops' of L1 instruction cache... This is figured to be roughly 8KB. There is also 8K of L1 data cache. This is inclusive. The first Pentium 4s had 256K of on-die cache. Later models (Northwood core), starting at 1.6A through 3.2GHz, have 512K L2. Still inclusive. 512K total CPU cache.

Athlons, on the other hand, have *exclusive* L1/L2 caches. This means that data can be in L1 or L2, without the need to be in both. It's a minor boost in most things, since the data only has to be copied to the CPU once, and it allows more thorough utilization of the caches. This is much more important to Athlons than Pentiums, though, since Athlons (all of them, Athlon slot up through Barton and even the Opteron/Athlon64) have 128K of L1 cache. The original slot Athlon (Athlon Classic) had 128K of full-speed, on-cpu L1 cache, and 512K of L2 cache on the CPU card. This ran at 1/2, 2/5, or 1/3 of the CPU clock speed, depending on the CPU speed. (500MHz Athlons were 1/2, 750s were 2/5, 900+ were 1/3, IIRC). The Athlon 'Tbird' (Thunderbird core) changed this. It's a socketed CPU, so the L2 cache moved onto the CPU, changed to full CPU speed, and shrank from 512K to 256K. This stayed the same for every Athlon from the Tbird through the Athlon XP, finally changing with the recent Barton core, which finally has 512K of full-cpu-speed L2. The Athlon64/Opteron have 1MB L2s.

Now, since the caches don't have to hold the same info, marketing types often refer to the dual 64K L1s and the 256K L2 as '384K CPU cache'. This is technically correct. Since the Barton has 128K+512K, it technically has 640K total CPU cache. The Opteron/Athlon64 have 128K+1024K, 1152K total cache. Typically only marketing types refer to the caches this way, though. The Durons have always had 128K L1 and 64K L2. On a Pentium this wouldn't work at all, but since the Athlon series have exclusive caches, it gives the Duron 192K total cache... On an equivalent Pentium, it'd backfire, since only 64K of the L1 could be in L2 and thus used... Funny, eh?
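
The inclusive/exclusive arithmetic above boils down to a couple of lines. This is only a sketch of the bookkeeping (the `usable_cache` helper is a made-up name), using the sizes quoted in this post:

```shell
# usable_cache POLICY L1_KB L2_KB -> KB of distinct data the caches can hold
usable_cache() {
    case $1 in
        inclusive) echo "$3" ;;            # everything in L1 is duplicated in L2
        exclusive) echo $(( $2 + $3 )) ;;  # L1 and L2 hold different data
        *) echo "unknown policy: $1" >&2; return 1 ;;
    esac
}

usable_cache inclusive 32 512     # Pentium 2
usable_cache exclusive 128 256    # Athlon Tbird / XP
usable_cache exclusive 128 512    # Barton
usable_cache exclusive 128 64     # Duron
```

That prints 512, 384, 640 and 192, the same totals as in the text.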

Hope this cleared up more than it obscured, let me know if not.

pr0t0type
n00b

Joined: 28 Jul 2003
Posts: 9

Posted: Wed Jul 30, 2003 11:39 am

Wow, great info guys. Thanks for all the good explanations :)

Just done an emerge world with these cflags and added 3dnow, mmx and sse to my use flags

Code:

-march=athlon-xp -O3 -pipe -fomit-frame-pointer -mfpmath=sse,387 -falign-functions=4 -fprefetch-loop-arrays -fmerge-all-constants -mmmx -msse -m3dnow


Anyone see any stupid mistakes here?

Should find out how it runs in an hour or so. Also, am I right in thinking that the kernel doesn't use these flags, and uses its own from /usr/src/linux/Makefile? If so, am I wise to leave it alone, or should I put the optimized flags in there too?

Thanks

Gnufsh
Guru

Joined: 28 Dec 2002
Posts: 400
Location: Portland, OR

Posted: Thu Jul 31, 2003 11:12 pm

1) leave the kernel flags alone

2) -mfpmath=sse,387 is usually slower than the default, and so is -mfpmath=sse, at least on AMD machines, which I sure hope yours is, since you're using 3dnow.
Back to top
View user's profile Send private message
T2
n00b
n00b


Joined: 01 Jun 2002
Posts: 67
Location: Slovenia

PostPosted: Fri Aug 01, 2003 6:22 am    Post subject: Reply with quote

Just for info: I've installed the openoffice 1.1 rc2 binary package from the official site and it's much faster and more responsive than openoffice 1.01. So there's probably no great need for recompiling here.

LinuxDolt
Tux's lil' helper

Joined: 05 May 2003
Posts: 104

Posted: Fri Aug 01, 2003 6:45 am

I've got a p3 coppermine 933 MHz... what would be the optimal (read: as aggressive as I can get without having too many compile probs) cflags for me?

byns
n00b

Joined: 01 May 2003
Posts: 29

Posted: Fri Aug 01, 2003 7:48 pm    Post subject: My flags

OK, I've got a P3 Mobile. After copying and pasting from all the posts in this thread, I put together these CFLAGS to squeeze the most optimization out of my CPU (without breaking exact math, btw). The machine is really slow (933 MHz on AC), so I desperately need more speed.

Code:

CFLAGS="-march=pentium3 -O3 -pipe -fomit-frame-pointer -mmmx -msse -mfpmath=sse -fthread-jumps -fmerge-all-constants -mno-push-args -mno-align-stringops -frename-registers -fforce-addr -frerun-cse-after-loop -frerun-loop-opt -fprefetch-loop-arrays -falign-loops=4 -falign-functions=4 -falign-jumps=4"


I didn't emerge -e world yet. Any comments? Redundant stuff and the likes?
_________________
-----------------------------------------
It's easier to get forgiveness for being wrong than forgiveness for being
right.

guard0
Tux's lil' helper

Joined: 26 Jun 2003
Posts: 96

Posted: Sat Aug 02, 2003 9:37 am

here's mine
they work FINE, been using them since 1.4rc1

CFLAGS="-march=athlon-xp -O3 -pipe -msse -ffast-math -fomit-frame-pointer -mmmx -m3dnow -mfpmath=sse -Wall -fexpensive-optimizations -funroll-loops -frerun-loop-opt -fforce-addr -frerun-cse-after-loop -falign-functions=16 -falign-labels=1 -foptimize-sibling-calls -fstrength-reduce -fprefetch-loop-arrays"

i don't remember where i got some of those flags
but they are stable and fast, and i haven't noticed any loss of data or accuracy as a result of using them...

odegard
Guru

Joined: 08 Mar 2003
Posts: 324
Location: Trondheim, NO

Posted: Sat Aug 02, 2003 10:04 am

Hate to be a spoilsport, but can't too many optimizations actually ruin performance?

dalcorta
n00b

Joined: 01 Nov 2003
Posts: 36

Posted: Tue Mar 02, 2004 9:54 am    Post subject: Pentium-M cflags?

So could anyone tell me which are the best cflags for a Centrino notebook? I searched the forums (keywords centrino or pentium-m) and read that it should be treated as either a PIII or a PIV. So which is best?

c4Ff3In3 4ddiC+
Tux's lil' helper

Joined: 16 Aug 2003
Posts: 110

Posted: Tue Mar 02, 2004 5:28 pm

odegard wrote:
Hate to be a spoilsport, but can't too many optimizations actually ruin performance?

If you read the info pages for gcc concerning optimization flags, you'll see that even the gcc team acknowledges cases where certain optimizations may result in code that is actually slower. -funroll-loops is one optimization that has a tendency to slow some code down.

Now, for my personal experience, I've found that if I use gzip as a benchmark (yeah, I know, it is not very scientific), I will get slightly slower compression times using -march=pentium4 -O3 than if I use -march=pentium3 -O3. Also, I've found that with gzip, -march=pentium4 -O3 is slower than -march=pentium4 -O2.

Note: The differences are on the order of ~0.5 seconds when using the following command:
Code:
dd if=/dev/zero bs=1M count=1000 | gzip -c >/dev/null
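
Since differences of ~0.5 seconds are easily within run-to-run noise, it may be worth repeating the measurement a few times per gzip build before drawing conclusions. A rough sketch, assuming GNU date (for `%N`) and a dd that accepts `bs=1M`; `bench_gzip` is a made-up helper, and the gzip path passed to it is whichever binary you built with the flags under test:

```shell
# bench_gzip GZIP_BIN MEGABYTES RUNS
# Pipe MEGABYTES of zeros through GZIP_BIN, RUNS times, and print the
# wall-clock milliseconds of each run. Needs GNU date for %N (nanoseconds).
bench_gzip() {
    gz=$1
    mb=$2
    runs=$3
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s%N)
        dd if=/dev/zero bs=1M count="$mb" 2>/dev/null | "$gz" -c >/dev/null
        end=$(date +%s%N)
        echo "run $i: $(( (end - start) / 1000000 )) ms"
        i=$(( i + 1 ))
    done
}
```

Something like `bench_gzip /usr/bin/gzip 1000 5` repeats the command above five times. Keep in mind that a stream of zeros is about the most compressible input there is, so a real file you care about is probably a fairer test.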

irf2003
Veteran

Joined: 10 Sep 2003
Posts: 1078

Posted: Wed Mar 03, 2004 9:24 pm

magnet wrote:
I use the -mfpmath=sse,387 thingy.
let's recompile the whole system, I'll post what happens.
should I benchmark it before/after ? with glxgears maybe ?

I have not gone through the whole of this thread, but "-mfpmath=sse,387" is very dangerous: according to the gcc docs, the option is still experimental because the register allocator does not model the separate floating point units well. Until the gcc developers say otherwise, one should avoid "-mfpmath=sse,387"; "-mfpmath=sse" should do for now.
hth

Daagar
Tux's lil' helper

Joined: 14 Mar 2003
Posts: 78

Posted: Fri Mar 05, 2004 8:52 pm

Is there a replacement for the freehackers.org site, which seemed to keep a nice list of CFLAGS based on arch? freehackers.org seems to have disappeared :(

seppe
Guru

Joined: 01 Sep 2003
Posts: 431
Location: Hove, Antwerp, Belgium

Posted: Sun Mar 07, 2004 2:57 pm

Hi, I'm rather new to CFLAGS, but after reading some threads and freehackers.org I'm now using these:

Code:

CFLAGS="-march=pentium3 -O3 -pipe -fomit-frame-pointer -mmmx -msse -mfpmath=sse -fforce-addr -falign-functions=4 -fprefetch-loop-arrays"


This is my /proc/cpuinfo:
Code:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 3
cpu MHz         : 800.265
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1568.76


Can anyone verify that these are the best CFLAGS for my 800 MHz Pentium 3, please? Thanks a lot :)

Oh, and I once did an 'emerge -e world' after I changed my CFLAGS, but it broke everything (I couldn't log in anymore, etc.), so now I'm going to recompile just the most important packages (xfree, gnome, mozilla-firefox, evolution, gaim, openoffice, abiword, ...)
_________________
nitro-sources, because between stable and experimental there exists only speed

Latest release I made: 2.6.13.2-nitro1

FireBurn
Apprentice

Joined: 19 Sep 2004
Posts: 170
Location: Edinburgh, UK

Posted: Sun Sep 26, 2004 10:56 pm    Post subject: Using GCC 3.4.2-r2

Can I just check whether anyone is using the latest GCC on gentoo, GCC 3.4.2? And can they please confirm what CFLAGS they're using, especially if they're on an athlon-xp?

I've broken my system so many times today it's unbelievable!

Mike

All times are GMT