Ultimate GCC optimization: -mtune=i386

Post by **no_hope** » Sat Aug 09, 2008 9:16 pm

I AM NOT TIMING COMPILING TIME!

Q: What is the best CFLAGS?*
A: CFLAGS="-mtune=i386 -O2"

* -- assuming emerge is a good indicator of overall performance

I tested various gcc optimizations using emerge (i.e. using python) and it turns out that -mtune=i386 produces the fastest code. I also did some X benchmarks and -mtune=i386 also came out on top. I don't have hard data for this test using -Os, but some quick tests indicate that it really sucks.

Software: gcc-4.2.4 glibc-2.7-r2 vanilla-sources 2.6.25.13 x86_64
Hardware: Core2 Duo 6400 @ 3.2 GHz, 5 GB RAM at 800 MHz

I AM NOT TIMING COMPILING TIME!

1. recompile python and portage with the new CFLAGS
2. emerge -pevt world (untimed dry run to get stuff into memory)
3. 20 times: time emerge -pevt world &> /dev/null
4. goto 1 with the next set of CFLAGS
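
For anyone who wants to repeat steps 2 and 3 without a shell loop, here is a rough Python sketch. It is not the script used for the numbers in this thread; the function names `time_command` and `summarize` are made up for illustration, and it records wall-clock time only. It does an untimed warm-up run, then times the command repeatedly and reports the mean and sample standard deviation.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=20, warmup=1):
    """Run cmd `warmup` times untimed, then `runs` times timed.

    Returns a list of wall-clock durations in seconds, one per timed run.
    """
    for _ in range(warmup):
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (mean, sample standard deviation) of the timings."""
    return statistics.mean(samples), statistics.stdev(samples)


# Placeholder command so the sketch runs anywhere; on a Gentoo box it
# would be ["emerge", "-pevt", "world"] instead.
PLACEHOLDER_CMD = [sys.executable, "-c", "pass"]
```

Something like `mean, sd = summarize(time_command(["emerge", "-pevt", "world"]))` would then give one data point per set of CFLAGS, with `sd` as the 1 standard deviation error bar.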

I AM NOT TIMING COMPILING TIME!

The surprising results:

(1 standard deviation error bar)

0 -mtune=i386 -O2
1 -mtune=generic -O2
2 -march=nocona -O2 -ftree-loop-im -funswitch-loops
3 -march=nocona -O2
4 -march=nocona -O2 -ftree-loop-linear -funroll-loops -ftree-loop-ivcanon
5 -mtune=i686 -O2
6 -march=nocona -O2 -ftree-loop-linear -ftree-loop-im -funswitch-loops
7 -march=nocona -O3
8 -march=nocona -O2 -ftree-loop-ivcanon -funroll-loops
9 -march=nocona -O2 -fvariable-expansion-in-unroller -funroll-loops
10 -march=nocona -Os
11 -march=nocona -O2 -ftree-loop-linear -funroll-loops -fvariable-expansion-in-unroller
12 -march=nocona -O2 -ftree-loop-linear
13 -march=nocona -O2 -funroll-loops
14 -march=nocona -O2 -ftree-loop-linear -funroll-loops

I AM NOT TIMING COMPILING TIME!

Post by **Sadako** » Sat Aug 09, 2008 10:19 pm

All this proves is that things compile faster with -mtune=i386; it gives you no indication of how well the resulting code will perform. That makes perfect sense: i386 is more generic, hence less optimized, and optimizations do increase compile time.

If you really want to compare the resulting code, try playing with encoding and/or compression, e.g. encode an mp3 with lame compiled with each of those CFLAGS, or run bzip compression.

You should make sure all mmx and/or sse USE flags are disabled, and ideally glibc should be compiled with the same CFLAGS as the encoder/compressor in each case. Also, seeing as you obviously have the RAM, you should work with all the files in a tmpfs.
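
A minimal sketch of that kind of test, assuming the compressor reads stdin and writes stdout (as `bzip2 -c` does). Feeding the data from memory over a pipe plays the same role as the tmpfs: the disk stays out of the measurement. `make_payload` and `time_filter` are hypothetical helper names, and the generated text is only a stand-in for a real test file.

```python
import random
import subprocess
import time


def make_payload(size=1_000_000, seed=0):
    """Build reproducible, compressible pseudo-text of at least `size` bytes."""
    rng = random.Random(seed)
    words = [b"emerge", b"portage", b"python", b"gcc", b"glibc", b"cflags"]
    chunks = []
    total = 0
    while total < size:
        word = rng.choice(words) + b" "
        chunks.append(word)
        total += len(word)
    return b"".join(chunks)


def time_filter(cmd, payload, runs=5):
    """Pipe payload into cmd, discard its output.

    Returns a list of wall-clock durations in seconds, one per run.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, input=payload,
                       stdout=subprocess.DEVNULL, check=True)
        samples.append(time.perf_counter() - start)
    return samples
```

With a compressor rebuilt for each set of CFLAGS, `time_filter(["bzip2", "-c"], make_payload())` would be run once per build and the timings compared.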

Post by **no_hope** » Sat Aug 09, 2008 10:53 pm

Hopeless wrote: All this proves is that things compile faster with -mtune=i386; it gives you no indication of how well the resulting code will perform. That makes perfect sense: i386 is more generic, hence less optimized, and optimizations do increase compile time.

I am not measuring compile time. I am measuring how quickly portage can calculate the system package dependency tree and format the output.

Hopeless wrote: If you really want to compare the resulting code, try playing with encoding and/or compression, e.g. encode an mp3 with lame compiled with each of those CFLAGS, or run bzip compression.

I think benchmarks like those are very artificial and hard to translate to real-life experience. Most of the time I spend impatiently waiting for something doesn't involve decoding or compression. I do run many CPU-bound python scripts though.

Post by **StifflerStealth** » Sat Aug 09, 2008 11:13 pm

Soooo .... all this test shows is how fast python runs?

Not a good test, imho. You should test ICC on your Core 2 Duo. That will be even faster.

Cheers.

He who runs i386 on a Core 2 Duo is a dummy.

Post by **Akkara** » Sun Aug 10, 2008 12:13 am

Taking a guess at some theories:

- A lot of emerge -p time is spent opening and reading the thousands of small files in the tree. Perhaps this is as much a measure of how well python and portage avoid stomping on the parts of the CPU cache that the kernel likes to use as it is of the speed of the app itself.

- Modern x86s have a more RISC-like internal core, and the more complex instructions are translated by the fetch unit into a sequence of micro-ops. Perhaps the simpler i386 ones really do run faster, or perhaps they have fewer interlocks with other instructions around them because they use fewer execution resources.
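
One cheap way to probe the first theory is to see how a run's CPU time splits between user space (python and portage) and the kernel (opening and reading files). This is only a sketch: `cpu_split` is a made-up helper, it relies on `os.times()` reporting child-process times (which works on Linux and other Unix systems), and it says nothing about cache behaviour directly.

```python
import os
import subprocess


def cpu_split(cmd):
    """Run cmd to completion; return (user_seconds, system_seconds) it used.

    The values are the growth in this process's accumulated child CPU
    times, so they cover the command and any children it waited for.
    """
    before = os.times()
    subprocess.run(cmd, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=True)
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return user, system
```

Calling `cpu_split(["emerge", "-pevt", "world"])` for each build would show whether the differences sit mostly in user time or in system time. A large system share would point at the kernel side rather than at the compiled Python code.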

Post by **StringCheesian** » Sun Aug 10, 2008 1:40 am

EDIT: Please disregard this - I completely misunderstood.

Two problems here:

For a fair test it should recompile world with different CFLAGS per competitor, and then set the same CFLAGS on all competitors before timing them. That way all competitors have the same amount of work to do (-O3 is more work for gcc than -O2).

You should also time them compiling a set of packages not including the toolchain (emerge, python, bash, gcc, glibc, etc). As it is you are replacing the subject of the test halfway through. The result will be a mix of the speed of the new toolchain (running with the CFLAGS you intended to measure) with the speed of the old toolchain (running with some other CFLAGS...).

Post by **no_hope** » Mon Aug 11, 2008 4:23 pm

Akkara wrote: Taking a guess at some theories:

- A lot of emerge -p time is spent opening and reading the thousands of small files in the tree. Perhaps this is as much a measure of how well python and portage avoid stomping on the parts of the CPU cache that the kernel likes to use as it is of the speed of the app itself.

- Modern x86s have a more RISC-like internal core, and the more complex instructions are translated by the fetch unit into a sequence of micro-ops. Perhaps the simpler i386 ones really do run faster, or perhaps they have fewer interlocks with other instructions around them because they use fewer execution resources.

I think the first theory is the more likely one. I did a similar benchmark using Python's pybench suite, and it seems that, at least for artificial workloads (e.g. running an empty loop a million times), vanilla -O2 -march=nocona outperforms everything else, with i386 performing very poorly.

So it seems that the Python interpreter itself is not faster when compiled for i386, but emerge is.
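
For reference, the "empty loop a million times" kind of workload can be approximated with the standard `timeit` module. This is a stand-in, not pybench itself, and `empty_loop` and `best_of` are illustrative names. Running it under interpreters built with different CFLAGS would show raw interpreter speed without any portage file I/O.

```python
import timeit


def empty_loop(n=1_000_000):
    """Iterate n times doing nothing; exercises only the interpreter loop."""
    for _ in range(n):
        pass


def best_of(func, repeat=5):
    """Call func once per trial; return the fastest trial in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))
```

`print(best_of(empty_loop))` gives a single number per interpreter build. Taking the minimum rather than the mean is a common choice for microbenchmarks, since the slower trials mostly reflect interference from other processes.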

PS: I AM NOT TIMING COMPILING TIME!

Post by **StringCheesian** » Tue Aug 12, 2008 8:30 am

no_hope wrote: PS: I AM NOT TIMING COMPILING TIME!

Ooops. I didn't notice the "-p" in the code block. Sorry.

I just sort of assumed after I saw "I tested various gcc optimizations using emerge (i.e. using python)". Maybe "using emerge -p" would be more foolproof.