ph317 n00b
Joined: 02 Jun 2002 Posts: 43
|
Posted: Thu Jul 10, 2003 4:11 am Post subject: Re: cache and alignment |
|
|
odegard wrote: | ph317 wrote: | A few corrections to some misinfo above:
First off, L1 and L2 caches are separate, even on Athlons. |
Actually, *only* on Athlons.
However, I was thinking, what is the bottleneck on modern computers? I/O. So why don't we optimize the code for a smaller footprint rather than for faster execution? Let's be utterly simplistic and say that there are two variables: LOAD and EXECUTE. LOAD is far bigger than EXECUTE, so in order to get a total boost, get LOAD down, even though it may take longer to EXECUTE.
Agree/disagree? |
L1 and L2 are separate on all processors that have both. They are entirely different types of memory: the L1 is much faster than the L2, and therefore much more expensive per byte and much smaller. Being different kinds of memory, attached at totally different places electrically, they are different. If the L1 and L2 of a processor were the same, there would be no point in calling them L1 and L2 to begin with; you would just say you had a huge slow L1 or a small fast L2 or something.
On the I/O point, well yes, normal tasks on a desktop system these days are more I/O bound than CPU bound - but they're bound by things like disks, network cards, the net itself, and your keyboard and mouse speed of course - you wouldn't believe how much time the average PC spends twiddling its thumbs waiting on the end user. In terms of the instruction optimizations that this thread is talking about, going from a widely-aligned, loop-unrolled, fat set of optimizations to -Os with alignments set to zero isn't really making a difference by lowering I/O load per se: if it helps, it helps because smaller, tighter code keeps more references local to L1 and/or L2 cache instead of taking a cache miss and going out to slow main memory. There are definitely some tradeoffs involved of course. On a Xeon with a couple megs of L2 cache it's probably not worth it to go -Os, but if whatever x86 clone you're using has like 128k or less of L2, it could very well help. Benchmarking your own CPU on the tasks you generally run is the best way to tell. |
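To make the cache-size rule of thumb above a bit more concrete, here is a minimal sketch in POSIX shell. The function names cache_kb and os_hint are made up for illustration, the thresholds (128 KB and 2048 KB) are just the two data points mentioned in the post, and it assumes the "cache size" line in /proc/cpuinfo reports the L2 size, which is what these CPUs show but is not guaranteed everywhere.

```shell
# cache_kb: hypothetical helper. Reads /proc/cpuinfo-style text on stdin and
# prints the number from the first "cache size" line (assumed to be L2, in KB).
cache_kb() {
    awk -F: '/^cache size/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# os_hint: hypothetical helper that encodes only the two data points from the
# post. It suggests what to benchmark; it does not decide anything.
os_hint() {
    if [ "$1" -le 128 ]; then
        echo "small L2: worth benchmarking -Os"
    elif [ "$1" -ge 2048 ]; then
        echo "large L2: -Os probably not worth it"
    else
        echo "in between: benchmark both"
    fi
}
```

On a real box this could be run as os_hint "$(cache_kb < /proc/cpuinfo)". Whatever it prints, the benchmark on your own workload still has the last word.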
|
odegard Guru
Joined: 08 Mar 2003 Posts: 324 Location: Trondheim, NO
|
Posted: Sat Jul 12, 2003 5:58 pm Post subject: Re: cache and alignment |
|
|
ph317 wrote: | odegard wrote: | ph317 wrote: | A few corrections to some misinfo above:
First off, L1 and L2 caches are separate, even on Athlons. |
Actually, *only* on Athlons.
|
L1 and L2 are separate on all processors that have both. |
Yes, they are separate entities physically. What I meant was that in a P4 the caches are INCLUSIVE, meaning that everything contained in the L1 cache is duplicated in the L2 cache (actually, the P4 has two kinds of L1 cache, but that's a different story). In an Athlon, however, they are EXCLUSIVE. Now perhaps my reply makes more sense. I was talking about separate entities FUNCTIONALLY, while I guess you meant physically...
Anyway, nothing to argue about. |
|
Gandalf_Grey_ Apprentice
Joined: 19 Apr 2003 Posts: 151
|
Posted: Tue Jul 15, 2003 3:18 am Post subject: |
|
|
I have an Athlon Tbird @ 1.33 GHz. cat /proc/cpuinfo returns this:
Code: |
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 4
cpu MHz : 1343.062
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 2680.42 |
and my current flags are
-march=athlon-tbird -O3 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -fmerge-all-constants -m3dnow -mmmx -falign-functions=128
Does anyone see any blatant errors with this, or places I could improve? I have successfully compiled the GIMP and there was a noticeable improvement in start time. However, would this be sufficient to compile something as picky as OpenOffice? |
|
higman n00b
Joined: 12 Jun 2002 Posts: 7 Location: Langley, B.C., Canada
|
Posted: Wed Jul 16, 2003 3:48 pm Post subject: |
|
|
I have a tbird @ 1.4, runs well, also doubles as a space heater!
Gandalf_Grey_ wrote: | -march=athlon-tbird -O3 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -fmerge-all-constants -m3dnow -mmmx -falign-functions=128
Does anyone see any blatant errors with this, or places I could improve? I have successfully compiled the GIMP and there was a noticeable improvement in start time. However, would this be sufficient to compile something as picky as OpenOffice? |
I'm using: -march=athlon-tbird -O3 -pipe -fomit-frame-pointer
What flags were you using before and which ones did you add to get this boost? As for your flags... after reading this entire thread and investigating a little on my own...
-m3dnow and -mmmx are redundant (-march=athlon-tbird implies them)
-falign-functions=128 is insignificant and/or dangerous; to the best of my knowledge, the compiler has good defaults for the different CPUs (presumably tuned by the developers?)
-ffast-math lets the compiler break strict IEEE/ISO floating-point rules, so code that depends on precise calculations can fail
-fmerge-all-constants reduces size by a small amount with no other gain.
-O3 -pipe -fomit-frame-pointer looks good to me. I don't know anything about -mno-push-args though, are the tbirds not stack friendly? |
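The "redundant flags" point above can be sketched as a small POSIX shell filter. The names implied_flags and drop_implied are hypothetical, and the table of what each -march value implies is my reading of the GCC manual's descriptions of these CPU types, so treat it as an assumption and check it against your own GCC version.

```shell
# implied_flags: hypothetical lookup of the -m flags that a given -march value
# already turns on (assumed from the GCC manual's CPU type descriptions).
implied_flags() {
    case "$1" in
        -march=athlon|-march=athlon-tbird) echo "-mmmx -m3dnow" ;;
        -march=athlon-xp|-march=athlon-4|-march=athlon-mp) echo "-mmmx -m3dnow -msse" ;;
        -march=pentium3) echo "-mmmx -msse" ;;
        -march=pentium4) echo "-mmmx -msse -msse2" ;;
        *) echo "" ;;
    esac
}

# drop_implied: prints its arguments minus any flag already implied by the
# -march argument found among them. Everything else passes through untouched.
drop_implied() {
    implied=""
    for f in "$@"; do
        case "$f" in -march=*) implied=$(implied_flags "$f") ;; esac
    done
    out=""
    for f in "$@"; do
        case " $implied " in
            *" $f "*) ;;                      # implied by -march, skip it
            *) out="$out${out:+ }$f" ;;
        esac
    done
    printf '%s\n' "$out"
}
```

For example, drop_implied -march=athlon-tbird -O3 -pipe -m3dnow -mmmx prints -march=athlon-tbird -O3 -pipe. Leaving the redundant flags in is harmless; this only shows which ones add nothing.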
|
TeeHee n00b
Joined: 24 Jun 2003 Posts: 7
|
Posted: Sat Jul 19, 2003 6:51 pm Post subject: |
|
|
Another question here.
I'm trying to use openMosix on two machines built with different CFLAGS.
Anyone had success? Problems? Anything? |
|
aardvark Guru
Joined: 30 Jun 2002 Posts: 576
|
Posted: Sat Jul 19, 2003 7:00 pm Post subject: |
|
|
elektrohirn wrote: | hey i just even compiled openoffice 1.1beta2 with the above cflags. that really surprises me because the ebuild tells you that openoffice is very fragile about aggressive cflags ... but openoffice is so stunningly fast now! |
Doesn't the openoffice ebuild filter out most flags though? |
|
higman n00b
Joined: 12 Jun 2002 Posts: 7 Location: Langley, B.C., Canada
|
Posted: Sat Jul 19, 2003 10:10 pm Post subject: |
|
|
aardvark wrote: | Doesn't the openoffice ebuild filter out most flags though? |
yes, it does, here's a segment from /usr/portage/app-office/openoffice/openoffice-1.1_beta2-r1.ebuild:
Code: | inherit flag-o-matic eutils
# Compile problems with these ...
filter-flags "-funroll-loops"
filter-flags "-fomit-frame-pointer"
replace-flags "-O3" "-O2" |
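To see what that filtering does to a real set of flags, here is a rough POSIX shell imitation. filter_flag and replace_flag are simplified stand-ins written for this post, not the actual flag-o-matic implementation, and only the three rules quoted above are applied (the full ebuild may do more). The input is the CFLAGS line posted earlier in the thread.

```shell
# Simplified stand-ins for flag-o-matic's filter-flags / replace-flags.
# They only edit the CFLAGS variable of this script (word splitting on
# $CFLAGS is intentional).
filter_flag() {
    out=""
    for f in $CFLAGS; do
        [ "$f" = "$1" ] || out="$out${out:+ }$f"
    done
    CFLAGS=$out
}

replace_flag() {
    out=""
    for f in $CFLAGS; do
        [ "$f" = "$1" ] && f=$2
        out="$out${out:+ }$f"
    done
    CFLAGS=$out
}

CFLAGS="-march=athlon-tbird -O3 -pipe -fomit-frame-pointer -mno-push-args -ffast-math -fmerge-all-constants -m3dnow -mmmx -falign-functions=128"

# The three rules from the ebuild segment above.
filter_flag "-funroll-loops"
filter_flag "-fomit-frame-pointer"
replace_flag "-O3" "-O2"

printf '%s\n' "$CFLAGS"
```

This prints -march=athlon-tbird -O2 -pipe -mno-push-args -ffast-math -fmerge-all-constants -m3dnow -mmmx -falign-functions=128, so under these three rules alone the build would have run at -O2 without -fomit-frame-pointer while the other flags passed through.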
|
T2 n00b
Joined: 01 Jun 2002 Posts: 67 Location: Slovenia
|
Posted: Wed Jul 23, 2003 5:50 pm Post subject: |
|
|
I've read the whole thread; it's really informative (and confusing at moments).
I'm staying with the tried and trusted CFLAGS="-march=athlon-tbird -O3 -pipe"
for my Tbird 1.33 GHz.
IMHO critical packages such as the kernel (and mplayer) do their own CPU optimisations, which are satisfactory. However, I'm tempted to try some aggressive gcc compile flags to overcome OpenOffice's laziness.
regards |
|
Gandalf_Grey_ Apprentice
Joined: 19 Apr 2003 Posts: 151
|
Posted: Thu Jul 24, 2003 1:21 am Post subject: |
|
|
The cflags I mentioned above compiled openoffice fine, and it feels noticeably more responsive than the binary install. Before I changed my flags I had
-march=athlon-tbird -O3 -pipe
I did some research and it seems my current ones (mentioned above) are about as aggressive as I can get without breaking compiles left and right |
|
FastTurtle Guru
Joined: 03 Sep 2002 Posts: 477 Location: Flakey Shake & Bake Caliornia, USA
|
Posted: Thu Jul 24, 2003 2:38 pm Post subject: |
|
|
I've got an XP1800 and these are the flags I'm using.
-march=athlon -m3dnow -mmmx -msse -O3 -pipe.
Because my last build went south with more aggressive flags, I'm sticking with stability over speed right now because I've got a full gig of ram. Speed isn't a problem that I've noticed except with Open Office taking forever to load.
As far as this thread goes, I'm real happy to have read the entire thing. Maybe I will begin testing some of the optimizations and seeing what speeds things up, especially KDE/Office 1.03 and other large apps. |
|
Gandalf_Grey_ Apprentice
Joined: 19 Apr 2003 Posts: 151
|
Posted: Thu Jul 24, 2003 7:54 pm Post subject: |
|
|
FastTurtle wrote: | I've got an XP1800 and these are the flags I'm using.
-march=athlon -m3dnow -mmmx -msse -O3 -pipe.
Because my last build went south with more aggressive flags, I'm sticking with stability over speed right now because I've got a full gig of ram. Speed isn't a problem that I've noticed except with Open Office taking forever to load.
As far as this thread goes, I'm real happy to have read the entire thing. Maybe I will begin testing some of the optimizations and seeing what speeds things up, especially KDE/Office 1.03 and other large apps. |
If you have an athlon XP I hardly think that using the athlon-xp cflag is being aggressive. |
|
Forge Tux's lil' helper
Joined: 20 Jun 2002 Posts: 125 Location: KOP, PA, USA
|
Posted: Fri Jul 25, 2003 10:22 am Post subject: |
|
|
OK, here's my semi-definitive Pentium/Athlon features guide and cache lecture.. I hope lynx doesn't barf.
(These are only cflag-relevant features, but I won't go into cache line sizes, etc.)
486: Not much. FPU.... Usually.
Pentium non-MMX: Same as 486, but i586.
Pentium MMX: adds MMX. Duh.
Pentium 2: Same as Pentium MMX, now i686.
Pentium 3: Adds SSE.
Pentium 4: Adds SSE2.
Athlon: Pentium 2, plus Advanced (aka Athlon) 3Dnow. Same cflags as any K6-* as far as 3Dnow goes.
Athlon Tbird (on-die L2, socketed Athlon): Same as Athlon.
Athlon XP: Adds SSE, known as '3Dnow Professional' for marketing reasons. 3Dnow Pro actually includes new 3Dnow instructions, as well as finishing out SSE support (Athlons with MMX and 3Dnow had *some* of the SSE instructions, but not enough to use it as SSE)
Athlon XP (Barton): Goes to 512K L2 (instead of the 256K on Tbird through Athlon XP)
Athlon64/Opteron: Adds SSE2, 1MB (1024KB) L2.
Celeron '1' (266MHz through 533MHz): Pentium 2, with 128K L2 instead of 512K/256K. The 266, 300 non-A, and 333 non-A versions actually have NO L2 whatsoever. These are fairly rare, though, and slot-only, FWIR.
Celeron '2' (533A MHz through 1.4GHz): Same as a Pentium3, SSE is added to the basic '1' Celeron. Early versions had 128K L2; a little past 1GHz, they moved to 256K L2.
Celeron 'P4' (1.6-2.4 or so): Same as a Pentium 4 (MMX, SSE, SSE2), only 128K L2 cache, though.
Now, as for cache sizes: Pretty much all of the Pentiums (P2 through P3 for sure) had 32K L1. This is divided into 16K of 'instructions' and 16K of 'data' cache. L1 cache and L2 cache are 'inclusive'. This means that any data that is in L1 MUST be in L2 also. Therefore a Pentium 2 with 32K of L1 and a 512K L2 has a TOTAL usable cache of only 512K. The Pentium 1's and MMXes had variable amounts of L2, sometimes 512K, sometimes 1MB, sometimes 2MB, always on the motherboard. Pentium 2's have 512K of L2 cache on the CPU card, but not on the core; it runs at half the speed of the CPU itself. The Pentium 3 had the same arrangement at first, 512K on card. Later Pentium3's (Coppermine core) had 256K of L2 cache on the CPU core, running at full CPU speed. All Celerons have on-die, full-speed L2. The Pentium 4 is the odd duck out... It has '12k micro ops' of L1 instruction cache... This is figured to be roughly 8KB. There is also 8K of L1 data cache. This is inclusive. The first Pentium4s had 256K of on-die cache. Later models (Northwood core), starting at 1.6A through 3.2GHz, have 512K L2. Still inclusive. 512K total CPU cache.
Athlons, on the other hand, have *exclusive* L1/L2 caches. This means that data can be in L1 or L2, without the need to be in both. It's a minor boost in most things, since the data only has to be copied to the CPU once, and it allows more thorough utilization of the caches. This is much more important to Athlons than Pentiums, though, since Athlons (all of them, Athlon slot up through Barton and even the Opteron/Athlon64) have 128K of L1 cache. The original slot Athlon (Athlon Classic) had 128K of full-speed, on-cpu L1 cache, and 512K of L2 cache on the CPU card. This ran at 1/2, 2/5, or 1/3 of the CPU clock speed, depending on the CPU speed. (500MHz Athlons were 1/2, 750s were 2/5, 900+ were 1/3, IIRC). The Athlon 'Tbird' (Thunderbird core) changed this. It's a socketed CPU, so the L2 cache moved onto the CPU, changed to full CPU speed, and shrunk from 512K to 256K. This stayed the same for every Athlon from the Tbird through the Athlon XP, finally changing with the recent Barton core, which has 512K of full-cpu-speed L2. The Athlon64/Opteron have 1MB L2s. Now, since the caches don't have to hold the same info, marketing types often refer to the dual 64K L1s and the 256K L2 as '384K CPU cache'. This is technically correct. Since the Barton has 128K+512K, it technically has 640K total CPU cache. The Opteron/Athlon64 have 128K+1024K, 1152K total cache. Typically only marketing types refer to the caches this way, though. The Durons have always had 128K L1 and 64K L2. On a Pentium this wouldn't work at all, but since the Athlon series have exclusive caches, it gives the Duron 192K total cache... On an equivalent Pentium, it'd backfire, since only 64K of the L1 could be in L2 and thus used... Funny, eh?
Hope this cleared up more than it obscured, let me know if not. |
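The inclusive/exclusive arithmetic in the lecture above boils down to one small rule, sketched here in POSIX shell. The function name effective_cache_kb is made up for illustration, and the sizes fed to it are the figures quoted in the post, not independent measurements.

```shell
# effective_cache_kb POLICY L1_KB L2_KB
# Exclusive caches hold different data, so their sizes add up.
# Inclusive caches duplicate L1 inside L2, so only L2 counts.
effective_cache_kb() {
    policy=$1
    l1=$2
    l2=$3
    case "$policy" in
        exclusive) echo $(($l1 + $l2)) ;;
        inclusive) echo "$l2" ;;
        *) echo "unknown policy: $policy" >&2; return 1 ;;
    esac
}
```

With the numbers from the post: effective_cache_kb exclusive 128 256 gives 384 (Tbird), exclusive 128 512 gives 640 (Barton), exclusive 128 64 gives 192 (Duron), and inclusive 32 512 gives 512 (Pentium 2).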
|
pr0t0type n00b
Joined: 28 Jul 2003 Posts: 9
|
Posted: Wed Jul 30, 2003 11:39 am Post subject: |
|
|
Wow, great info guys. Thanks for all the good explanations
Just done an emerge world with these cflags and added 3dnow, mmx and sse to my use flags
Code: |
-march=athlon-xp -O3 -pipe -fomit-frame-pointer -mfpmath=sse,387 -falign-functions=4 -fprefetch-loop-arrays -fmerge-all-constants -mmmx -msse -m3dnow
|
Anyone see any stupid mistakes here?
Should find out how it runs in an hour or so. Also, am I right in thinking that the kernel doesn't use these flags and uses its own in /usr/src/linux/Makefile? If so, am I wise to leave it, or to put in the optimized flags too?
Thanks |
|
Gnufsh Guru
Joined: 28 Dec 2002 Posts: 400 Location: Portland, OR
|
Posted: Thu Jul 31, 2003 11:12 pm Post subject: |
|
|
1) leave the kernel flags alone
2) -mfpmath=sse,387 is usually slower than the default, and so is -mfpmath=sse, at least on AMD machines, which I sure hope yours is, since you're using 3dnow. |
|
T2 n00b
Joined: 01 Jun 2002 Posts: 67 Location: Slovenia
|
Posted: Fri Aug 01, 2003 6:22 am Post subject: |
|
|
Just for info: I've installed the openoffice 1.1 rc2 binary package from the official site and it's way speedier and more responsive than openoffice 1.01. So there's probably no real need for recompiling here. |
|
LinuxDolt Tux's lil' helper
Joined: 05 May 2003 Posts: 104
|
Posted: Fri Aug 01, 2003 6:45 am Post subject: |
|
|
i've got a p3 coppermine 933 MHz... what would be the most optimal (read as aggressive as i can get without having too many compile probs) cflags for me? |
|
byns n00b
Joined: 01 May 2003 Posts: 29
|
Posted: Fri Aug 01, 2003 7:48 pm Post subject: My flags |
|
|
OK, I've got a P3 Mobile. After copying and pasting from all the posts in this thread, I made these CFLAGS to squeeze the most optimization out of my CPU (without breaking exact math, btw). The machine is really slow (933 MHz on AC), so I desperately need more speed.
Code: |
CFLAGS="-march=pentium3 -O3 -pipe -fomit-frame-pointer -mmmx -msse -mfpmath=sse -fthread-jumps -fmerge-all-constants -mno-push-args -mno-align-stringops -frename-registers -fforce-addr -frerun-cse-after-loop -frerun-loop-opt -fprefetch-loop-arrays -falign-loops=4 -falign-functions=4 -falign-jumps=4"
|
I didn't emerge -e world yet. Any comments? Redundant stuff and the like? _________________ -----------------------------------------
It's easier to get forgiveness for being wrong than forgiveness for being
right. |
|
guard0 Tux's lil' helper
Joined: 26 Jun 2003 Posts: 96
|
Posted: Sat Aug 02, 2003 9:37 am Post subject: |
|
|
here's mine
they work FINE, been using them since 1.4rc1
CFLAGS="-march=athlon-xp -O3 -pipe -msse -ffast-math -fomit-frame-pointer -mmmx -m3dnow -mfpmath=sse -Wall -fexpensive-optimizations -funroll-loops -frerun-loop-opt -fforce-addr -frerun-cse-after-loop -falign-functions=16 -falign-labels=1 -foptimize-sibling-calls -fstrength-reduce -fprefetch-loop-arrays"
I don't remember where I got some of those flags,
but they are stable and fast; I haven't noticed any loss of data or accuracy as a result of using them... |
|
odegard Guru
Joined: 08 Mar 2003 Posts: 324 Location: Trondheim, NO
|
Posted: Sat Aug 02, 2003 10:04 am Post subject: |
|
|
Hate to be a spoilsport, but can't too many optimizations actually ruin performance? |
|
dalcorta n00b
Joined: 01 Nov 2003 Posts: 36
|
Posted: Tue Mar 02, 2004 9:54 am Post subject: Pentium-M cflags? |
|
|
So could anyone tell me which are the best cflags for a Centrino notebook? I searched the forums (keywords centrino or pentium-m) and read that it should be treated as either a PIII or a PIV. So which is best? |
|
c4Ff3In3 4ddiC+ Tux's lil' helper
Joined: 16 Aug 2003 Posts: 110
|
Posted: Tue Mar 02, 2004 5:28 pm Post subject: |
|
|
odegard wrote: | Hate to be a spoilsport, but can't too many optimizations actually ruin performance? |
If you read the info pages for gcc concerning optimization flags, you'll see that even the gcc team acknowledges cases where certain optimizations may result in code that is actually slower. -funroll-loops is one optimization that has a tendency to slow some code down.
Now, for my personal experience, I've found that if I use gzip as a benchmark (yeah, I know, it is not very scientific), I will get slightly slower compression times using -march=pentium4 -O3 than if I use -march=pentium3 -O3. Also, I've found that with gzip, -march=pentium4 -O3 is slower than -march=pentium4 -O2.
Note: The differences are on the order of ~0.5 seconds when using the following command:
Code: | dd if=/dev/zero bs=1M count=1000 | gzip -c >/dev/null |
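Since differences of around half a second are easy to lose in run-to-run noise, one hedged way to read such a benchmark is to repeat it several times per flag set and compare the mean against the spread. Below is a minimal POSIX shell sketch of that bookkeeping; summarize_times is a made-up helper name, and it only does the arithmetic on timings you have already collected (one wall-clock time in seconds per line).

```shell
# summarize_times: hypothetical helper. Reads one timing (seconds) per line on
# stdin and prints the mean, the min-to-max spread, and the number of runs.
summarize_times() {
    awk '
        NR == 1 { min = $1; max = $1 }
        { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
        END { if (NR) printf "mean=%.2f spread=%.2f n=%d\n", sum / NR, max - min, NR }
    '
}
```

If the spread within one flag set is about as large as the gap between the means of two flag sets, the gap probably says little about the flags themselves.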
|
irf2003 Veteran
Joined: 10 Sep 2003 Posts: 1078
|
Posted: Wed Mar 03, 2004 9:24 pm Post subject: |
|
|
magnet wrote: | I use the -mfpmath=sse,387 thingy.
Let's recompile the whole system, I'll post what happens.
Should I benchmark it before/after? With glxgears maybe? |
I have not gone through the whole of this thread, but "-mfpmath=sse,387" is very dangerous: according to the gcc docs, the register allocator cannot deal well with separate floating point units. Until the gcc developers say otherwise, one should avoid "-mfpmath=sse,387"; "-mfpmath=sse" should do for now.
hth |
|
Daagar Tux's lil' helper
Joined: 14 Mar 2003 Posts: 78
|
Posted: Fri Mar 05, 2004 8:52 pm Post subject: |
|
|
Is there a replacement for the freehackers.org site, which seemed to keep a nice list of CFLAGS based on arch? freehackers.org seems to have disappeared :( |
|
seppe Guru
Joined: 01 Sep 2003 Posts: 431 Location: Hove, Antwerp, Belgium
|
Posted: Sun Mar 07, 2004 2:57 pm Post subject: |
|
|
Hi, I'm rather new to CFLAGS, but after reading some threads and freehackers.org I'm now using these:
Code: |
CFLAGS="-march=pentium3 -O3 -pipe -fomit-frame-pointer -mmmx -msse -mfpmath=sse -fforce-addr -falign-functions=4 -fprefetch-loop-arrays"
|
This is my /proc/cpuinfo:
Code: |
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 3
cpu MHz : 800.265
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1568.76
|
Can anyone verify that these are the best CFLAGS for my 800 MHz Pentium 3, please? Thanks a lot
Oh, and I once did an 'emerge -e world' after I changed my CFLAGS but it broke everything (I couldn't log in anymore etc ..), so now I'm going to just recompile the most important packages (xfree, gnome, mozilla-firefox, evolution, gaim, openoffice, abiword, ..) _________________ nitro-sources, because between stable and experimental there exists only speed
Latest release I made: 2.6.13.2-nitro1 |
|
FireBurn Apprentice
Joined: 19 Sep 2004 Posts: 170 Location: Edinburgh, UK
|
Posted: Sun Sep 26, 2004 10:56 pm Post subject: Using GCC 3.4.2-r2 |
|
|
Can I just check if anyone is using the latest GCC on Gentoo, GCC 3.4.2? And can they please confirm what CFLAGS they're using, especially if they're on an athlon-xp.
I've broken my system so many times today it's unbelievable!
Mike |