nxsty wrote:
sigmalll: True, but the problem is that there are no magic CFLAGS that will make your system faster. So when choosing global CFLAGS they should always be "-O2 -march= -pipe" and perhaps also -fomit-frame-pointer if you're on x86. It doesn't matter if -O3 is shown to be faster in some benchmarks; it shouldn't be used globally anyway, because 99% of the packages won't benefit from the higher optimizations and will in fact often run slower because of the extra bloat it causes.

I understand the argument that -O3 -fblah flags are a bit pointless for a lot of software; who gives a monkey's if 'less' is 7% faster (likewise, does 7% slower really matter either?). But setting a high level of optimisation globally does guarantee that all applications that may benefit, do. I don't think anybody in their right mind would want to have optimisations on an application-by-application basis, especially if they're running a media-intensive desktop (and I don't expect the devs to start adding performance-based flags to ebuilds).

Code: Select all
# grep CFLAGS /etc/make.conf
CFLAGS="-O2 -march=k8 -pipe"
CXXFLAGS="${CFLAGS}"

Code: Select all
CFLAGS="-march=athlon64 -O2 -ffast-math -fforce-addr -fmove-all-movables -fno-ident -fomit-frame-pointer -fpeel-loops -fprefetch-loop-arrays -frename-registers -ftracer -funroll-loops -funswitch-loops -fweb -pipe"

Code: Select all
CFLAGS="-march=athlon64 -O2 -fforce-addr -fno-ident -ftracer -fweb -pipe"

Code: Select all
CFLAGS="-march=k8 -O3 -pipe -fomit-frame-pointer"

sigmalll wrote:
I understand the argument that -O3 -fblah flags are a bit pointless for a lot of software; who gives a monkey's if 'less' is 7% faster (likewise, does 7% slower really matter either?). But setting a high level of optimisation globally does guarantee that all applications that may benefit, do. I don't think anybody in their right mind would want to have optimisations on an application-by-application basis, especially if they're running a media-intensive desktop (and I don't expect the devs to start adding performance-based flags to ebuilds).

There is always a tradeoff between speed and size when the compiler is optimizing. -O2 is usually a good balance: it turns on most optimizations but still doesn't bloat the code much. Higher optimizations like -O3, -funroll-loops and friends have the side effect of making binaries larger. Larger binaries mean more disc reads, more memory usage, slower execution and a larger chance of cache misses. This is acceptable for the specific applications that actually benefit from the higher optimizations, but for most things it's just unnecessary bloat that is bad for performance. In fact -Os is usually the best option, since most applications don't benefit from the extra optimizations but they do benefit from the smaller binary size. Compiler optimizations are only a small part of performance; most of it comes from well-written code.
But the compiler is where the most performance gains can be obtained, and good CFLAGS really do make parts of your system run faster.
(I do have to add that this is only really an issue because GCC's focus in the past has always been portability rather than performance. In some cases this actually makes Linux applications slower than their Windows counterparts.)
Code: Select all
-O2 -march=XX -pipe

Code: Select all
category/packagename -fwhatever-you-want

HacTek wrote:
Forgive my naivety, but is there a way to specify CFLAGS on a package-by-package basis? Something similar to the way USE flags can be set with the package.use file.

There is a way, it's just not officially endorsed, see http://forums.gentoo.org/viewtopic-t-28 ... art-0.html
I have an Athlon 3200+ Winchester. I compiled the test program quoted below and I get the output shown after it.

energyman76b wrote:
-msse3 is only 'safe' if you know for sure that your CPU supports it (Venice AMD64).

tnt wrote:
It's a Sempron 2800+ 'BX' Palermo core and it has the 'PNI' flag, so it should have SSE3.
Anyway, thank you for the '-funroll-all-loops' tip - a very useful one!

energyman76b wrote:
I read some weeks ago that some CPUs report the PNI flag without having SSE3.
Try to run this:
cat test_pni.c
#include <stdint.h>
#include <stdio.h>   /* printf */
#include <string.h>  /* memset */

uint8_t __attribute__((aligned(64))) current[64];
uint8_t previous[64];

int main(void)
{
    int i;
    uint64_t result;
    uint32_t _eax, _ebx, _ecx, _edx;
    uint8_t _cpuid[13];
    uint32_t *_cpuid0 = (uint32_t*) _cpuid;
    uint32_t *_cpuid1 = (uint32_t*) ( _cpuid + 4 );
    uint32_t *_cpuid2 = (uint32_t*) ( _cpuid + 8 );
    uint8_t *ptr0 = current;
    uint8_t *ptr1 = previous;

    /* cpuid(0): highest supported leaf in eax, vendor string in ebx, edx, ecx */
    __asm__ __volatile__ (
        "cpuid\n"
        : "=a" (_eax),
          "=b" (*_cpuid0), "=d" (*_cpuid1), "=c" (*_cpuid2)
        : "a" (0) );
    _cpuid[12] = 0;
    printf( "cpuid(0) returns %u (%s)\n", (unsigned) _eax, (char*) _cpuid );

    /* cpuid(1): family/model in eax, feature flags in ecx and edx */
    __asm__ __volatile__ (
        "cpuid\n"
        : "=a" (_eax), "=b" (_ebx), "=c" (_ecx), "=d" (_edx)
        : "a" (1) );
    printf( "cpuid(1) returns %08x %08x %08x %08x\n",
            (unsigned) _eax, (unsigned) _ebx, (unsigned) _ecx, (unsigned) _edx );

    memset( current, 0xaa, 64 );
    memset( previous, 0x55, 64 );

    /* haddps is an SSE3 instruction: on a CPU without SSE3 it faults here.
       xmm2 is never cleared, so the printed result is not meaningful;
       the only point is whether the instruction executes at all. */
    for( i = 0; i < 4; i ++ ) {
        __asm__ __volatile__ (
            "movdqa %0, %%xmm0\n"
            "movdqu %1, %%xmm1\n"
            "psadbw %%xmm1, %%xmm0\n"
            "paddw %%xmm0, %%xmm2\n"
            "haddps %%xmm2, %%xmm2\n"
            "haddps %%xmm2, %%xmm2\n"
            : : "m" (*ptr0),
                "m" (*ptr1) : "xmm0", "xmm1", "xmm2" );
        ptr0 += 16;
        ptr1 += 16;
    }

    /* copy the low 64 bits of xmm2 out so there is something to print */
    __asm__ __volatile__ (
        "movq %%xmm2, %0\n"
        : "=m" (result) );
    printf( "Result is %llu\n", (unsigned long long) result );
    return 0;
}
Save it as test_pni.c, compile and run it.
If it dies with an "Illegal instruction" error, you do not have SSE3.
If not, you have SSE3 and everything is fine.
Code: Select all
./test_pni
cpuid(0) returns 1 (AuthenticAMD)
cpuid(1) returns 00020ff0 00000800 00000001 078bfbff
Result is 496498219533200