
Morpheouss wrote:
Hi,
I have an Athlon64 3000+ Venice (SSE SSE2 SSE3 MMX 3DNOW) and currently have -march set to athlon64.
I wonder if I can change it to -march=opteron, since tests compiled with this flag are much faster?
I wonder if optimizing software for opteron instead of athlon64 may cause any trouble?
THANKS!
mod edit: Merged here from Unsupported Software --Earthwings

They shouldn't, because it's the same architecture. AMD released Socket 939 Opterons last year, so there shouldn't be any problems.
steveL wrote: -falign-jumps=4 -falign-labels=4 -falign-loops=4
Shouldn't these numbers be 3, so that 0, 1, 2 or 3 NOPs can be inserted to reach an address divisible by 4?
"Shouldn't these numbers be 3 [...]"
This is a good reference on optimization flags: http://gcc.gnu.org/onlinedocs/gcc-4.1.2 ... tions.html
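To make the arithmetic in the question concrete, here is a minimal C sketch, assuming a power-of-two boundary. The helper name padding_to_align is made up for this illustration and is not part of gcc; it only counts filler bytes and says nothing about how gcc itself interprets the N in -falign-*=N.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a gcc API): number of filler bytes needed
   so that addr becomes a multiple of align. align must be a power of two. */
unsigned padding_to_align(uintptr_t addr, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* Distance to the next boundary; the outer mask turns "align" into 0
       when addr is already aligned. */
    return (unsigned)((align - (addr & (align - 1))) & (align - 1));
}
```

For a 4-byte boundary the result is always 0, 1, 2 or 3; for example padding_to_align(0x1001, 4) gives 3 and padding_to_align(0x1000, 4) gives 0.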
Akkara wrote: I *think* the gcc defaults are something like N=8 for these, and 16 or even 32 for functions. But it has been a while since I looked (at the source).
Thanks for the information. So the defaults are not processor-dependent and thus certainly not processor-optimized (as I would have expected). So it seems that SteveL's suggestion (if you choose 3 or 7, respectively) has only advantages and no disadvantages.
thx
processor : 0
vendor_id : CyrixInstead
cpu family : 6
model : 1
model name : 6x86MX 2.5x Core/Bus Clock
stepping : 4
cpu MHz : 167.054
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : yes
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu de tsc msr cx8 pge cmov mmx cyrix_arr
bogomips : 335.01
mv wrote: "if N is unspecified or 0 use a machine-dependent default".
This information was correct, at least for gcc-4.1.2. If you are interested in the defaults, they can be found in gcc/config/i386/i386.c in the processor_target_table. Roughly speaking, on Athlon chips a "bad" alignment is apparently less expensive than on Intel chips (usually no more than align_*=7 is used by default for Athlon; i486 and pentiumpro use the highest value, 15, for align_loop).
"So the defaults are not processor-dependent"
The defaults are processor-dependent. Sorry I wasn't clear about that in the previous reply.
mv wrote: SteveL, do you have more information than just your intuition why a smaller choice should be preferable?
Nah, it's just experience of assembler. A NOP is a NOP is a total waste of time and space, unless you happen to be timing something (aka busy-waiting), or you're aligning. Loop-unrolling, for instance, is from the days when CPUs didn't have caches at all (I'm reliably informed). So yeah, what Akkara said about keeping small code in the caches when you can, but no hard evidence. Anecdotal, yeah, in that KDE runs much faster on Gentoo than anything else (ofc) and everyone else in gentoo-land complains about it, but it's easily fast enough for us to code on older machines, while still having full desktop installs with apache, php and mysql on board.
steveL wrote: Loop-unrolling, for instance, is from the days when CPUs didn't have caches at all (I'm reliably informed.)
I experienced an enormous speed increase with loop unrolling on amd64. But perhaps it is really true that AMD chips usually have less cache and smaller pipelines, so that only on these chips does this have a positive effect.
mv wrote: I experienced an enormous speed increase with loop unrolling on amd64. But perhaps it is really true that AMD chips usually have less cache and smaller pipelines, so that only on these chips does this have a positive effect.
Hmm, it'll always be faster, if you ignore caching issues, since there is no dec and conditional branch. Since we're effectively going for -Os type smaller code, however, it's not something we do (since it always leads to larger code). I can't see the NOPs being much use, even if the chip is supposed to pipeline better etc, since those NOPs will always be executed. But again I have no hard figures.
I'd imagine alignment of loops is where it'd be the most use, since loops are usually executed several times, followed by functions (which we allow since the alignment won't put NOPs in the execution path). The most common branch (jump to label) is on an IF (eg to avoid the else part) and is usually executed only once in that block. It may well be in a loop, but in the absence of profiling or the coder telling the compiler which branch is likely, there's no way to know which branch will be run more often. If it is aligned, the label to which we branch will have NOPs before it, and whenever the CPU goes down the other path it will waste a few cycles (as well as the code being larger).
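For readers who have not seen it written out, this is roughly what the unrolling trade-off discussed here looks like by hand. It is only a sketch: the function names are invented for the example, and gcc's own unrolling (-funroll-loops) works on its internal representation and will not produce exactly this.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: one counter update and conditional branch per element. */
long sum_simple(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Hand-unrolled by 4: one loop test per four elements, at the cost of
   a larger loop body plus a clean-up loop for the leftover elements. */
long sum_unrolled4(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    for (; i < n; i++)      /* remaining 0..3 elements */
        total += data[i];
    return total;
}
```

Both functions return the same sum; whether the unrolled one is actually faster depends on the CPU, the cache situation and what the compiler already does at the chosen -O level, which is exactly the point under dispute above.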
"I still find it hard to believe that a processor trips up on word boundaries"
What happens is something like this:
Code: Select all
       0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
000x |   |   |add b,c|   |   |   |   |   |   |   |   |add a,b| dec c |
     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
001x |cmp c,0|bne $C |   |   |   |   |   |   |   |   |   |   |   |   |
     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
     . . .
Code: Select all
area52 ~ # cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 75
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping : 2
cpu MHz : 1000.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips : 2011.43
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 75
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
stepping : 2
cpu MHz : 1000.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips : 2011.43
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
Code: Select all
CFLAGS="-O2 -march=k8 -pipe"
CXXFLAGS="${CFLAGS}"
# This should not be changed unless you know exactly what you are doing. You
# should probably be using a different stage, instead.
CHOST="x86_64-pc-linux-gnu"
MAKEOPTS="-j3"
CFLAGS="-march=athlon64 -O2 -msse3 -mfpmath=sse,387 -pipe -ffast-math -m64"
should I add some of these settings?
snIP3r wrote:
these are my settings in /etc/make.conf with the system running stable:
Code: Select all
CFLAGS="-O2 -march=k8 -pipe"
CXXFLAGS="${CFLAGS}"
# This should not be changed unless you know exactly what you are doing. You
# should probably be using a different stage, instead.
CHOST="x86_64-pc-linux-gnu"
MAKEOPTS="-j3"
i read a post (i think here in this forum) with these suggested settings for CFLAGS with the same cpu like mine:
CFLAGS="-march=athlon64 -O2 -msse3 -mfpmath=sse,387 -pipe -ffast-math -m64"
should i add some of these settings?
Code: Select all
CFLAGS="-O2 -march=athlon64 -pipe -fomit-frame-pointer"
Akkara wrote: What I'd like to be able to say, is "align to 16, but only if you can skip 4 or less". But there doesn't seem to be any way of saying it.
My guess is that, as seems to happen rather often, the description on the gcc manpage is not correct:
Akkara wrote:
This is a hypothetical processor since I don't know the details of x86 nor x86_64 at this level.
Code: Select all
       0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
000x |   |   |add b,c|   |   |   |   |   |   |   |   |add a,b| dec c |
     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
001x |cmp c,0|bne $C |   |   |   |   |   |   |   |   |   |   |   |   |
     +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
     . . .
Let's say a loop starts with the "add a,b" instruction at address 0x000C. When the processor first jumps here, if the line isn't in cache it'll have to issue memory reads to fill the cache line (here illustrated as 16 bytes long, but modern processors have it longer: 32 I think, maybe even 64). So that's a delay. And barely two instructions get down the pipe before there's another delay fetching the next cache line.
Earlier processors don't have as much in terms of cache, but there's often a one-line buffer of sorts ahead of the instruction-decode logic that can execute a loop when it is entirely contained, without having to go out to memory at all.
That is why it can be advantageous to align things. Depends on the specifics of the micro-architecture tho. What's good for one can be bad for another.

Thanks for a nice explanation. The thing that gets me is that you're only talking about a problem on cache misses; to my mind the cache line isn't so significant here as the locality of code. It might need 3 or 10 lines for the function, but it's more about whether the function as a whole is in cache as opposed to the lines. Smaller code leads to less cache usage in the first place, and additionally there are no NOPs slowing things down. I can see the case for loops within a cache line as you explain though.
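The cache-line argument can be checked with a little arithmetic. The C sketch below uses the numbers from the hypothetical diagram (16-byte lines, four 2-byte instructions, loop starting at 0x000C); the helper lines_touched is invented for this illustration and models nothing about a real x86 front end.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many cache lines does a block of code that
   occupies the bytes [start, start + len) touch, for a given line size? */
unsigned lines_touched(uintptr_t start, unsigned len, unsigned line_size)
{
    assert(len > 0 && line_size > 0);
    uintptr_t first = start / line_size;             /* line holding the first byte */
    uintptr_t last  = (start + len - 1) / line_size; /* line holding the last byte  */
    return (unsigned)(last - first + 1);
}
```

With the diagram's layout, lines_touched(0x000C, 8, 16) is 2, because the loop straddles the boundary at 0x0010. Padding the loop start up to 0x0010 gives lines_touched(0x0010, 8, 16) == 1, at the price of four filler bytes in front of it.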
"Eg a simple if (c) code; the compiler will implement that as if (!c) jp endif; code; endif:"
I guess that depends on how often condition c is true. It seems that, most of the time, forward branches like that one are almost always taken. For example, if((g = getchar()) == EOF). That if will be true at most once per file read. Or the very common if(do_something(...) == FAILED) { recovery(); }. Almost all of those are forward branches around conditions that are rarely true.
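On "the coder telling the compiler which branch is likely": with gcc this can be done with the __builtin_expect builtin. The sketch below is an assumption-laden illustration, not a recommendation: the UNLIKELY macro name, the OK/FAILED constants and the stand-in do_something() are all invented here, and whether the hint changes the generated code at all depends on the gcc version and target.

```c
#include <assert.h>
#include <stddef.h>

/* Wrapper around the gcc builtin; the macro name is this sketch's choice. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

enum { OK = 0, FAILED = -1 };

/* Hypothetical stand-in for do_something(): fails on negative input. */
int do_something(int value)
{
    return value < 0 ? FAILED : OK;
}

/* Processes all values and returns how many needed the rare recovery path. */
size_t process_all(const int *values, size_t n)
{
    assert(values != NULL || n == 0);
    size_t recovered = 0;
    for (size_t i = 0; i < n; i++) {
        /* Tell the compiler the failure branch is expected to be rare. */
        if (UNLIKELY(do_something(values[i]) == FAILED))
            recovered++;
    }
    return recovered;
}
```

The hint does not change what the program computes; it only gives the compiler a reason to lay out the common path as the fall-through one.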
"The thing that gets me is that you're only talking about a problem on cache misses; to my mind the cache line isn't so significant here as the locality of code."
It is more than just cache, although I wasn't clear about it (in part because I'm not an expert at this level of detail).



Code: Select all
CFLAGS="-march=i686 -maccumulate-outgoing-args -pipe -O2 -frtl-abstract-sequences -funsafe-loop-optimizations -Wunsafe-loop-optimizations -fstrict-aliasing -fno-trapping-math -fno-ident"
CXXFLAGS="-march=i686 -maccumulate-outgoing-args -pipe -O2 -frtl-abstract-sequences -funsafe-loop-optimizations -Wunsafe-loop-optimizations -fstrict-aliasing -fno-trapping-math -fno-ident -fvisibility-inlines-hidden"