Author |
Message |
chw62 n00b
Joined: 15 Aug 2013 Posts: 3
|
Posted: Thu Aug 15, 2013 8:09 pm Post subject: Cross-system performance of GCC march native |
|
|
Everything I build is -march=native -O2 by default, with the exception of a select few packages, such as graphicsmagick, that I benchmarked and function-tested with -march=native -Ofast.*
The question pertains to everything built with -march=native -O2.
I don't bother with looking past -march=native too much because it was built by people much smarter than myself who have spent far more time working with GCC. I trust -march=native -O2 to generally produce the best code for the host system without too many sacrifices or problems at the edge.
My desktop / main test box is powered by an Intel i5-4670, a Haswell processor with 4 cores, no hyperthreading, a 3.4GHz (3.8GHz turbo) clock rate, and 6MB of L3 cache. Some report the proper march for this processor to be core-avx2, others say corei7-avx2 even for Haswell i5.
My servers are all identical Sandy Bridge chips, so I just compile for them on the utility server.
Here's the problem: The new servers will all be Haswell Xeons, except for the utility server, which is staying as-is. They're quad-core, but with hyperthreading, 8MB of L3 cache, etc. Ideally, I'd like to build everything on my desktop.
Here's the question: I know that -march=native uses processor-specific settings. I don't know which ones. Is the 2MB less L3 and lack of hyperthreading (and probably a few other differences) going to result in an -march=native build on my desktop producing code which isn't optimal? How specific does -march=native get? If it's specific enough for desktop-built packages to be not optimal on servers, how bad will it be, and how much of that can be fixed by swapping my desktop's CPU for an i7 with four cores, hyperthreading, and 8MB of L3?
*(Note for those searching for -Ofast: Stay away from this option unless you know what you're doing. -Ofast and -O3 can slow things down, bloat files, and even break some programs. General benchmarks are not enough. They cannot be generic workload benchmarks and they must be run on your build on your system - that means your configuration options, your march=, and your linked libraries. With graphicsmagick in general, -Ofast slows down a few functions, gives a zero to slight gain in most, but significantly optimizes some. -Ofast is only worth the trouble if you've benchmarked the same source in -O2 and found a worthwhile composite speedup in what you're using it for, and it's only worth the trouble if you're making massive constant use of a prog/library or if you need a certain level of responsiveness in an interactive/networked usage model.) |
|
|
|
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9645 Location: almost Mile High in the USA
|
|
|
|
chw62 n00b
Joined: 15 Aug 2013 Posts: 3
|
Posted: Thu Aug 15, 2013 11:04 pm Post subject: |
|
|
The minor trouble is I don't have a Haswell Xeon to reference.
I don't know if -march=native simply generates a list of flags and each flag is applied in the exact same manner on each invocation of gcc, regardless of the system, or if gcc assumes that native means "this system" and optimizes accordingly (even if it's a silent override of flag behavior). |
|
|
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21490
|
Posted: Fri Aug 16, 2013 1:29 am Post subject: |
|
|
What would be the point of a -march=native switch that picked flags without regard to the system on which it ran? All the other -march switches exist for that purpose, and can be used to build code for a CPU other than the one you have. As I understand it, all the flags that are set by -march=native can be set individually, but using -march=native instructs gcc to pick the correct values for those flags so that you do not need to check your CPU capabilities personally. |
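One way to check this on a given machine is to ask gcc what it resolved -march=native to. The following is a minimal sketch, assuming a gcc that supports -Q --help=target; the function name resolved_native is made up for illustration, and the exact layout of the --help=target output can vary between gcc versions.

```shell
#!/bin/sh
# Hypothetical helper: print the -march/-mtune values that -march=native
# resolves to on this host. Prints nothing if gcc is missing or refuses
# -march=native on this platform.
resolved_native() {
    gcc -march=native -Q --help=target 2>/dev/null |
        awk '$1 == "-march=" || $1 == "-mtune=" { print $1 $2 }'
}

resolved_native
```

The values it prints are the ones you could then pass explicitly when building on a different machine.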
|
|
|
chw62 n00b
Joined: 15 Aug 2013 Posts: 3
|
Posted: Thu Aug 22, 2013 2:23 pm Post subject: |
|
|
Well, there are three ways this could conceivably work:
1. Optimization at its core is based on instruction sets (-m64 -msse -msse2 ....) with the indicated flags telling gcc which instruction sets are available, and possibly, which sub-optimal things it need not do based on the minimum features/capabilities it can determine must be present in CPUs supporting the specified sets.
2. Optimization at its core looks not only at what instructions it can send to the processor but also what other features/assumptions are based on that family and above when we specify an architecture, such as "corei7-avx".
3. Optimization at its core considers all factors specific to a processor, from the instructions it supports, to how it's best able to execute code.
The gcc manual tells us that, "-march=cpu-type Generate instructions for the machine type cpu-type. In contrast to -mtune=cpu-type, which merely tunes the generated code for the specified cpu-type, -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated."
Because the code "may not run at all" elsewhere, we can infer at least one of two things: the generated code uses instructions from the specified set without providing fallback paths for older CPUs, and/or other legacy cruft has been stripped out.
To make this easy, what's the difference between -march=native and -march=corei7-avx on a system which IDs as corei7-avx, if any? What is the difference between -march=corei7-avx and -m64 -mmmx -msse -msse2 -msse3 -msse4.1 -msse4.2 -mavx -maes -mpclmul, if any?
Would specifying -march=corei7-avx2 cause the compiler to do the exact same thing as specifying -march=corei7-avx -mfsgsbase -mrdrand -mfma -mbmi -mbmi2 -mf16c?
If -march=native is merely a shorthand for -march=<manually specified architecture>, which is itself a shorthand for a long list of flags, then gcc isn't being too smart.
This is why icc does so well. icc considers just about everything specific to one chip or everything in common shared by a range of chips.
At one point I had heard that benchmarks of some real-world usage scenarios on icc binaries beat the results from gcc binaries by about 30%. I also heard that since then gcc has closed the gap.
What information does gcc need to do its best - a list of available instruction sets, a named target architecture, or a hint that it can look at every last little thing about the host CPU when determining what works best? |
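These "what's the difference" questions can be tested directly rather than argued. The sketch below, assuming a gcc that supports -Q --help=target, dumps the target options for two -march values and diffs them; target_opts and compare_march are made-up names for illustration. Note that this view only covers -m options, so any --param values the driver adds for -march=native would have to be checked separately in the gcc -v output.

```shell
#!/bin/sh
# Hypothetical helper: list the target options gcc enables for one -march
# value, as "option value" pairs, sorted so two runs can be diffed.
target_opts() {
    gcc -march="$1" -Q --help=target 2>/dev/null |
        awk '$1 ~ /^-m/ { print $1, $2 }' | sort
}

# Hypothetical helper: diff the option sets of two -march values.
# No output means gcc reports identical target options for both.
compare_march() {
    a=$(mktemp) b=$(mktemp)
    target_opts "$1" > "$a"
    target_opts "$2" > "$b"
    diff "$a" "$b"
    rm -f "$a" "$b"
}
```

For example, compare_march native corei7-avx on the Sandy Bridge box would show whether the named architecture and native disagree on any -m option there.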
|
|
|
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Fri Aug 23, 2013 1:04 am Post subject: |
|
|
See for yourself what GCC does on a given CPU:
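Code:
#!/bin/sh
# Print the target flags and --param values that -march=native expands to.
# -E stops after preprocessing, so nothing is compiled or linked.
echo $(
gcc -v -E -march=native -x c /dev/null 2>&1 |
grep -- '-march' |
grep -E -o -- '-+(m|param )[-_=.a-zA-Z0-9]+' |
sort -u
)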
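To answer the desktop-versus-server question, the same pipeline can be run on each machine and the results compared. This is a rough sketch, assuming each machine's flags were saved to a file with one flag per line (that is, the pipeline without the outer echo); flag_diff and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved flag lists (one flag per line).
# Flags found only in the first file are prefixed "< ",
# flags found only in the second file are prefixed "> ".
flag_diff() {
    a=$(mktemp) b=$(mktemp)
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -23 "$a" "$b" | sed 's/^/< /'
    comm -13 "$a" "$b" | sed 's/^/> /'
    rm -f "$a" "$b"
}
```

Usage would look like flag_diff desktop-flags.txt server-flags.txt. An empty result suggests gcc treats the two hosts identically under -march=native; any lines that do appear are the settings worth pinning by hand when building on the desktop.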
|
|
|
|
|
|
|
|