nelsonwcf Tux's lil' helper
Joined: 31 Oct 2012 Posts: 112
Posted: Thu Mar 16, 2017 3:48 am Post subject: -O3 Optimizations - Implications on portage
Hi,
I'm looking for a list of packages that do and don't work with -O3 in the Gentoo documentation, but I couldn't find anything. The only reference I could find was in the Gentoo Handbook, which mentions that using -O3 is not a good idea because some packages will have problems. However, on my ARM Banana Pi my CFLAGS have included -O3 for more than a year without any problems. Are there any information sources on this subject specific to Gentoo?
Thank you,
Nelson
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9691 Location: almost Mile High in the USA
Posted: Thu Mar 16, 2017 5:34 am
Technically, a program that compiles incorrectly with -O3 is hitting a gcc bug.
However, since many of the -O3 optimizations are experimental, their behavior can change from version to version; once they prove stable, they get promoted to -O2.
I'd treat using -O3 as the equivalent of running ~arch: assumed unstable, but likely to work. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
Akkara Bodhisattva
Joined: 28 Mar 2006 Posts: 6702 Location: &akkara
Posted: Thu Mar 16, 2017 8:10 am
I don't recommend -O3 globally.
It often causes immense code bloat: it unrolls every loop it can and vectorizes everything it can, while at the same time keeping a copy of the scalar code and dynamically picking which version to use on each pass through, because it generally can't prove that the vectors will be aligned properly or that the iteration count will be an even multiple of the vector length.
... and to top it off, usually the only loops that get hyper-optimized this way are initialization loops. Those tend to be the only ones simple enough for its heuristics to find anything.
In the end, it often makes things slower, because the reduced effectiveness of the cache swamps any performance benefit it might otherwise have achieved.
However, it can be an excellent flag to use on a per-file basis: after you've profiled the code and found the hot spots; improved the algorithms as much as possible; reduced the data inter-dependencies as much as possible; peppered your argument lists with "restrict" to indicate which pointers never alias any others; attached __attribute__((...)) annotations to indicate buffer alignments (and allocated the buffers for maximally friendly alignment); and sprinkled in whatever #pragmas further convey your intentions ... after all that, -O3, used when compiling the files thus blessed (and only those files), can be invaluable for getting a nice performance boost.
Try it for yourself: use 'objdump' to look at the '.o' files after the compilation stage has finished, or, better yet, pass -S as part of CFLAGS and inspect the generated assembler, comparing the -O, -O2, and -O3 versions. (But don't expect emerge to complete successfully if you add -S to make.conf.) _________________ Many think that Dilbert is a comic. Unfortunately it is a documentary.
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54306 Location: 56N 3W
Posted: Thu Mar 16, 2017 9:40 am
nelsonwcf,
-O3, if it works, can produce slower code than -O2 or -Os.
The compiler makes the code bigger in order to eliminate instructions, particularly branch instructions that add nothing to solving the problem.
This bigger code no longer fits into the CPU cache, which increases cache evictions, cache misses, and fetches from much slower main memory.
As a result of this 'cache thrashing', execution slows down.
ARM CPUs are not noted for huge caches, so a global -O3 is probably counterproductive.
A few apps may benefit but the only way to find out is to compare -O2, -O3 and -Os.
_________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
nelsonwcf Tux's lil' helper
Joined: 31 Oct 2012 Posts: 112
Posted: Thu Mar 16, 2017 12:47 pm
Very nice answers, guys. Thank you very much. I'll try changing -O3 to -O2 on my ARMv7.
As an additional but related question: is it worth changing from gcc to icc in Gentoo (not for ARM, obviously)? Since the main benefit of Gentoo is having packages optimized for your system, I'm guessing it might be possible to get an additional boost by using icc. Is my assumption correct?
Thank you again!
Yamakuzure Advocate
Joined: 21 Jun 2006 Posts: 2285 Location: Adendorf, Germany
nelsonwcf Tux's lil' helper
Joined: 31 Oct 2012 Posts: 112
Posted: Thu Mar 16, 2017 6:50 pm
Hi Yamakuzure,
I've seen those posts as well, but they are old and only cover individual applications. I'm looking for insight on more current versions, and on using icc as the general compiler in Gentoo's portage. Obviously, not all packages can be compiled with icc due to the different "dialects" of C (this is especially true for glibc and gcc).
In fact, if it were possible to set icc as the general compiler but force portage to use gcc on a per-package basis (or the other way around), that would be a great solution. However, I don't know whether there is a simple way to do that in Gentoo, which is why I'm looking for an up-to-date list of Gentoo packages known to work (or not) with icc. Newer benchmarks would also be useful, but I couldn't find any.
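For what it's worth, Portage does support per-package overrides through /etc/portage/package.env, which is the usual way to pin different flags (or a different compiler) for individual packages. A sketch, with illustrative file and package names; whether a given package's build system actually honors $CC is a separate question:

```shell
# /etc/portage/env/icc.conf -- environment for selected packages
CC=icc
CXX=icpc
CFLAGS="-O2 -pipe"
CXXFLAGS="${CFLAGS}"

# /etc/portage/package.env -- map packages to that environment
app-foo/bar icc.conf
```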
Thank you again.
Drone4four Apprentice
Joined: 09 May 2006 Posts: 247
Posted: Sun Mar 19, 2017 2:04 am
Now if only we had a compiler which used GPUs instead of CPUs; then we could put the 3584 CUDA cores potentially at our disposal to good use. I wonder how long a GPU-based compiler would take to build the linux kernel or the GNOME DE.
What an unrealistic fantasy! har har _________________ My rig:
IBM Personal System/2 Model 30-286 - - Intel 80286 (16 bit) 10 MHz - - 1MB DRAM - - Integrated VGA display adapter
1.44MB capacity floppy disk - - PS/2 keyboard (no mouse)
axl Veteran
Joined: 11 Oct 2002 Posts: 1144 Location: Romania
Posted: Sun Mar 19, 2017 2:35 am
-O3 is not as experimental as it used to be back when gcc was at version 2.
These stories are funny. It's like the Chinese whispers game; in my country it's called "the telephone without the wire".
https://en.wikipedia.org/wiki/Chinese_whispers
Some packages will not compile with -O3. Right now, chromium is the only one I know of.
And yes, binaries will be phatter. Bigger. It's a really bad idea for an ARM platform.
It only makes sense when you have fast / or /usr storage, like an M.2 SSD.
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9691 Location: almost Mile High in the USA
Posted: Sun Mar 19, 2017 6:44 am
Well, if compilation (or runtime speed) breaks with -O3, wouldn't that mean it's not quite ready for prime time, and thus "experimental"? Beyond the cache-size hit, -O3 may also generate very slow code sequences for x86; there's no way to tell without trying (or knowing what your code is and what gcc does with it).
Until the day gcc can automatically tell which optimizations are best during static code analysis and always generate the fastest/smallest code compatible with anything, the optimizations in -O3 are just that: experimental. Experiment with them; it could go either way, including badly.
To be safe in most cases, simply use -O2, where the gcc developers deem that the optimizations tend not to cause worst-case behavior. YMMV. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
Akkara Bodhisattva
Joined: 28 Mar 2006 Posts: 6702 Location: &akkara
Posted: Sun Mar 19, 2017 8:06 am
Drone4four wrote: | Now if only we add a compiler which used GPUs instead of CPUs, then we could put to good use the 3584 cuda cores potentially at our disposal. I wonder how long a GPU based compiler would build the linux kernel or the Gnome DE. |
I don't think it will make much of a difference. Compilation isn't usually the bottleneck.
At work, I compile on a 16-core (32-thread) Xeon-based server with more than enough RAM, and it just doesn't seem to make much of a difference for most packages compared to a 4-core laptop (with a strong fan blowing at it during the deed). A factor of 2 faster... maybe.
Some things are fast. Kernels take 30-35 seconds give or take. Emerging GCC itself runs in about 15-20 minutes, if I recall.
But for most packages, the vast majority of the time seems to be spent in autoconf and related tools. And those are woefully serial:
Code: | Checking for fabs... ok
Checking for fstat... ok
Checking that fstat works... ok
... |
It goes on and on and on, multiple hundreds of such questions, asked and answered at a rate of a handful per second. Every package asks nearly the same set of questions, and they all (hopefully!) receive the same set of answers. Then there's a blip of compilation, a moment later that's finished, then libtool starts up doing its thing, serially, followed by emerge itself, serially installing what has been built. (This last one likely needs to be serial.)
I've even tried giving preposterous --jobs= numbers to emerge. I ran one with --jobs=300 or similar silliness not that long ago, trying to accelerate an emerge -e @world. It starts off well, doing ~30 packages in parallel, but it soon hits long strings of serial dependencies and it's back to one at a time again, with an occasional break where it might find 2 or 3 to do at once.
I've often wondered whether there's some way of caching that. Not like compiler-cache, but an autoconf-cache, one that works across packages. It won't be easy: it needs to be smart enough to know to clear out and re-do the checks for things provided by the package that was just merged. But we'd need to somehow break the autoconf bottleneck before there's a serious reduction in end-to-end merge time.
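Autoconf does ship a partial answer to this: `configure -C` caches results in `config.cache`, and the `CONFIG_SITE` hook lets a site-wide file pre-seed answers across packages. A sketch (paths illustrative; deciding when a shared cache is safe to reuse is exactly the invalidation problem described above):

```shell
# Within one source tree: cache check results between re-runs.
./configure -C            # writes config.cache; the next run reads it

# Across packages: a site file pre-seeds known answers.
export CONFIG_SITE=/etc/config.site
cat > /etc/config.site <<'EOF'
ac_cv_func_fabs=yes
ac_cv_func_fstat=yes
EOF
```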
eccerr0r wrote: | Well, if the compilation (or runtime speed) breaks with -O3, wouldn't that mean it's not quite prime time and thus "experimental"? Not only the cache size hit, -O3 may generate very slow code sequences for x86 too; no way to tell without trying (or knowing what your code is and what gcc does with the code). |
I don't think experimental is the right word. -O3 generally works and does what it says in the manual. It just happens that what it does isn't usually applicable or appropriate to apply without thinking. It is an excellent flag to use within a package's makefile, where the developer has measured and peppered it in just the right places. It is a bad flag to use globally, because 90+% of the code out there is either run-once initialization or debugging printfs, both of which benefit more from space optimization than from speed.
What you're asking is for the compiler to somehow know what transformations to apply where. It would be nice if it could. Maybe it gets there someday. With automated profiling and similar tools it might be possible to come up with something. But even then, it'll be up to you to give it relevant test cases, so that it profiles and optimizes the things that actually matter. And is coming up with relevant test cases significantly easier than picking flags according to your intuition and seeing how they do? Not an easy problem. _________________ Many think that Dilbert is a comic. Unfortunately it is a documentary.
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
Posted: Sun Mar 19, 2017 6:24 pm
Autoconf is pretty awful. Is confcache still maintained nowadays? I don't see it in portage any more.
Mind you, Portage itself can be just as bad at times... that "resolving dependencies" spinner is often 50% of the time spent installing single packages for me.
Roman_Gruber Advocate
Joined: 03 Oct 2006 Posts: 3846 Location: Austro Bavaria
Posted: Sun Mar 19, 2017 6:35 pm
Don't some ebuilds already filter out bad optimizations?
I've used this for quite a while:
Quote: | CFLAGS="-march=native -O2 -pipe -fomit-frame-pointer"
|
In the old days, playing with those flags mattered to me.
Since moving to an Ivy Bridge i7 + SSD + 16GB RAM + tmpfs for building, it doesn't really matter anymore.
Shaving 2 minutes off a libreoffice build is not worth the tinkering.
What matters these days: smaller and regular full system backups
frostschutz Advocate
Joined: 22 Feb 2005 Posts: 2977 Location: Germany
Posted: Sun Mar 19, 2017 8:37 pm
-march=native pretty much eliminated the need for custom CFLAGS.
It used to be that you had to look up the correct "safe CFLAGS" for your processor; now the compiler does it for you. Yay.
-O3 makes for slower binaries, sometimes. I once had a broken system like that.