Ultimate GCC optimization: -mtune=i386

Post by no_hope » Sat Aug 09, 2008 9:16 pm

I AM NOT TIMING COMPILING TIME!

Q: What is the best CFLAGS?*
A: CFLAGS="-mtune=i386 -O2"

* -- assuming emerge is a good indicator of overall performance :)

I tested various gcc optimizations using emerge (i.e. using python) and it turns out that -mtune=i386 produces the fastest code. I also did some X benchmarks and -mtune=i386 also came out on top. I don't have hard data for this test using -Os, but some quick tests indicate that it really sucks.

Software: gcc-4.2.4 glibc-2.7-r2 vanilla-sources 2.6.25.13 x86_64
Hardware: Core2 Duo 6400 @ 3.2 GHz, 5 GB of 800 MHz RAM

I AM NOT TIMING COMPILING TIME!

Code:

1. recompile python and portage with the new CFLAGS
2. emerge -pevt world (untimed dry run to get stuff into memory)
3. 20 times: time emerge -pevt world &> /dev/null
4. go to 1 with the next set of CFLAGS
I AM NOT TIMING COMPILING TIME!
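
A minimal sketch of the timing loop in steps 2 and 3, assuming Python 3. The names `time_runs` and `summarize` are made up for illustration and are not part of portage. It runs a command untimed first to warm the caches, then times it repeatedly and reports the mean and sample standard deviation of the wall-clock times. Recompiling python and portage (step 1) is not covered.

```python
import statistics
import subprocess
import sys
import time


def time_runs(cmd, runs=20, warmup=1):
    """Run cmd `warmup` times untimed, then `runs` times timed.

    Returns a list of wall-clock durations in seconds.
    """
    for _ in range(warmup):
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (mean, sample standard deviation) of the timings."""
    return statistics.mean(samples), statistics.stdev(samples)
```

For the test described here, the call would presumably be `summarize(time_runs(["emerge", "-pevt", "world"]))`, repeated once per set of CFLAGS.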

The surprising results:
[Image: results chart, with error bars of 1 standard deviation]


0 -mtune=i386 -O2
1 -mtune=generic -O2
2 -march=nocona -O2 -ftree-loop-im -funswitch-loops
3 -march=nocona -O2
4 -march=nocona -O2 -ftree-loop-linear -funroll-loops -ftree-loop-ivcanon
5 -mtune=i686 -O2
6 -march=nocona -O2 -ftree-loop-linear -ftree-loop-im -funswitch-loops
7 -march=nocona -O3
8 -march=nocona -O2 -ftree-loop-ivcanon -funroll-loops
9 -march=nocona -O2 -fvariable-expansion-in-unroller -funroll-loops
10 -march=nocona -Os
11 -march=nocona -O2 -ftree-loop-linear -funroll-loops -fvariable-expansion-in-unroller
12 -march=nocona -O2 -ftree-loop-linear
13 -march=nocona -O2 -funroll-loops
14 -march=nocona -O2 -ftree-loop-linear -funroll-loops
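
With 20 timings per flag set, one way to judge whether two sets really differ is sketched below, assuming Python 3. The helper names `welch_t` and `bars_overlap` are hypothetical, and Welch's t statistic is my choice here, not something the original test used. `bars_overlap` checks whether the 1 standard deviation error bars of two sets overlap; `welch_t` gives a statistic that also accounts for the number of runs.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples of timings."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def bars_overlap(a, b):
    """True if the mean +/- 1 standard deviation intervals overlap."""
    mean_a, sd_a = statistics.mean(a), statistics.stdev(a)
    mean_b, sd_b = statistics.mean(b), statistics.stdev(b)
    return abs(mean_a - mean_b) <= sd_a + sd_b
```

Both functions expect two lists of run times in seconds, each with at least two entries and with some variation between runs (otherwise the standard error is zero).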

I AM NOT TIMING COMPILING TIME!
Last edited by no_hope on Mon Aug 11, 2008 4:17 pm, edited 3 times in total.

Post by Sadako » Sat Aug 09, 2008 10:19 pm

All this proves is that things compile faster with -mtune=i386; it gives you no indication of how well the resulting code will perform.
That makes perfect sense: i386 is more generic, hence less optimized, and optimizations do increase compile time.

If you really want to compare the resulting code, try playing with encoding and/or compression, i.e. encode an mp3 with lame compiled with each of those CFLAGS, or try bzip2 compression.
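
A rough sketch of such a compression timing, assuming Python 3 and its standard `bz2` module. Note the assumption: this exercises libbz2 as built on the system through Python, not the bzip2 command-line tool, so it only approximates the test suggested above. The names `time_bz2` and `make_payload` are made up for illustration. The data stays in memory, so disk speed does not enter the measurement.

```python
import bz2
import time


def time_bz2(data, repeats=5):
    """Compress `data` in memory `repeats` times; return the best time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        compressed = bz2.compress(data)
        best = min(best, time.perf_counter() - start)
    # Sanity check so a broken build is not mistaken for a fast one.
    assert bz2.decompress(compressed) == data
    return best


def make_payload(size=1 << 20):
    """Build a repeatable, highly compressible payload (illustrative only)."""
    block = bytes(range(256)) * 16
    return (block * (size // len(block) + 1))[:size]
```

A real comparison would use the same real-world input file for every set of CFLAGS rather than this synthetic payload.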

You should make sure all mmx and/or sse USE flags are disabled, and ideally glibc should be compiled with the same CFLAGS as the encoder/compressor in each case. Also, seeing as you obviously have the RAM, you should work with all the files in a tmpfs.

Post by no_hope » Sat Aug 09, 2008 10:53 pm

Hopeless wrote:
"All this proves is that things compile faster with -mtune=i386; it gives you no indication of how well the resulting code will perform. That makes perfect sense: i386 is more generic, hence less optimized, and optimizations do increase compile time."

I am not measuring compile time. I am measuring how quickly portage can calculate the system package dependency tree and format the output :)

Hopeless wrote:
"If you really want to compare the resulting code, try playing with encoding and/or compression, i.e. encode an mp3 with lame compiled with each of those CFLAGS, or try bzip2 compression."

I think benchmarks like those are very artificial and hard to translate to real-life experience. Most of the time I spend impatiently waiting for something doesn't involve decoding or compression. I do run many CPU-bound python scripts though.

Post by StifflerStealth » Sat Aug 09, 2008 11:13 pm

Soooo .... all this test shows is how fast python runs? :? Not a good test, imho. You should test ICC on your Core 2 Duo. That will be even faster. ;)

Cheers.

He who runs i386 on a Core 2 Duo is a dummy.

Post by Akkara » Sun Aug 10, 2008 12:13 am

Taking a guess at some theories:

- A lot of emerge -p time is spent opening and reading the thousands of small files in the tree. Perhaps this is as much a measure of how well python and portage avoid stomping on the parts of the CPU cache that the kernel likes to use as it is of the speed of the app itself.

- Modern x86s have a more RISC-like internal core, and the more complex instructions are translated by the decoders into a sequence of micro-ops. Perhaps the simpler i386 ones really do run faster, or perhaps they have fewer interlocks with other instructions around them due to using fewer execution resources.
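
A rough way to probe the first theory, sketched in Python 3 under my own assumptions: time how long it takes just to open and read every file under a directory, then compare that with the total emerge -p run time. The function name `read_tree` is hypothetical and this is not how portage itself reads the tree.

```python
import os
import time


def read_tree(root):
    """Open and read every regular file under root.

    Returns (file_count, total_bytes, elapsed_seconds).
    """
    count = 0
    total = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    total += len(handle.read())
                count += 1
            except OSError:
                continue  # skip unreadable files or dangling symlinks
    return count, total, time.perf_counter() - start
```

Pointing it at the portage tree (assumed here to live under /usr/portage) with warm caches would show how much of the run is plain file access, which the CFLAGS used for python can only partly influence.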

Post by StringCheesian » Sun Aug 10, 2008 1:40 am

EDIT: Please disregard this - I completely misunderstood.

Two problems here:

For a fair test it should recompile world with different CFLAGS per competitor, and then set the same CFLAGS on all competitors before timing them. That way all competitors have the same amount of work to do (-O3 is more work for gcc than -O2).

You should also time them compiling a set of packages not including the toolchain (emerge, python, bash, gcc, glibc, etc). As it is you are replacing the subject of the test halfway through. The result will be a mix of the speed of the new toolchain (running with the CFLAGS you intended to measure) with the speed of the old toolchain (running with some other CFLAGS...).
Last edited by StringCheesian on Tue Aug 12, 2008 8:36 am, edited 1 time in total.

Post by no_hope » Mon Aug 11, 2008 4:23 pm

Akkara wrote:
"Taking a guess at some theories:

- A lot of emerge -p time is spent opening and reading the thousands of small files in the tree. Perhaps this is as much a measure of how well python and portage avoid stomping on the parts of the CPU cache that the kernel likes to use as it is of the speed of the app itself.

- Modern x86s have a more RISC-like internal core, and the more complex instructions are translated by the decoders into a sequence of micro-ops. Perhaps the simpler i386 ones really do run faster, or perhaps they have fewer interlocks with other instructions around them due to using fewer execution resources."

I think the first theory is the most likely one. I did a similar benchmark using Python's pybench suite, and it seems that, at least for artificial workloads (e.g. running an empty loop a million times), vanilla -O2 -march=nocona outperforms everything else, with i386 performing very poorly.

So it seems that the Python interpreter itself is not faster when compiled for i386, but emerge is.
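
For reference, the kind of artificial workload mentioned above can be sketched with the standard `timeit` module, assuming Python 3. This is a stand-in, not pybench itself, and the names `empty_loop` and `best_time` are made up for illustration.

```python
import timeit


def empty_loop(n=1_000_000):
    """Run an empty loop n times; exercises only the interpreter's loop overhead."""
    for _ in range(n):
        pass


def best_time(func, repeats=5):
    """Return the fastest of `repeats` single calls to func, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeats))
```

Running `best_time(empty_loop)` under a python built with each set of CFLAGS would give a comparison of raw interpreter speed, separate from the file access that emerge does.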

PS: I AM NOT TIMING COMPILING TIME!

Post by StringCheesian » Tue Aug 12, 2008 8:30 am

no_hope wrote:
"PS: I AM NOT TIMING COMPILING TIME!"

Ooops. I didn't notice the "-p" in the code block. Sorry. :oops:

I just sort of assumed after I saw "I tested various gcc optimizations using emerge (i.e. using python)". Maybe "using emerge -p" would be more foolproof :oops: