Gentoo Forums
LLVM/Clang 8 coming with BDVER2 optimizations
NTU
Apprentice

Joined: 17 Jul 2015
Posts: 164

Posted: Tue Feb 26, 2019 8:15 pm    Post subject: LLVM/Clang 8 coming with BDVER2 optimizations

LLVM/Clang 8 is coming out soon with optimizations for BDVER2, so I'll be building it (along with compiler-rt, clang-runtime, etc.) with -O3 -ftree-vectorize -funsafe-math-optimizations -march=bdver2. As I understand it, -funsafe-math-optimizations is more compliant than -ffast-math but still allows floating-point code to be vectorized. I have a bunch of older APUs lying around, as they were dirt cheap (around 40-60 bucks; A10s were about $120, which is still not bad). Is rsbench a good test for this kind of thing?

https://github.com/darktable-org/rawspeed/tree/develop/src/utilities/rsbench

I know that -O3 and those other flags can slow things down or introduce bugs which is why I want to make sure I get some good tests going.
Naib
Watchman

Joined: 21 May 2004
Posts: 5689
Location: Removed by Neddy

Posted: Tue Feb 26, 2019 8:51 pm

"more compliant than -ffast-math": you must not like your math results then
_________________
The best argument against democracy is a five-minute conversation with the average voter
Great Britain is a republic, with a hereditary president, while the United States is a monarchy with an elective king
Akkara
Administrator

Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

Posted: Tue Feb 26, 2019 9:47 pm

I do not recommend using -O3 unless you have personally inspected the resulting ".s" assembler output and concluded that the optimization level is doing what you want.

Usually what happens is that the "easy" stuff gets unrolled and expanded out to squeeze every last cycle out of it, while the more complex stuff isn't much different from what -O2 gives you. The problem is that the "easy" stuff tends to be initializations and run-once code, where speed doesn't matter. Worse, the now-bigger code size pushes other things out of cache and can make things slower. This is especially true when combined with -ftree-vectorize. Unless you've gone in and peppered your code with __restrict__ and __attribute__((aligned(16))), the compiler doesn't know whether the pointers are aligned or what shortcuts it can take. Instead it generates two copies of the code, one vectorized on the assumption that the data is aligned, the other simply unrolled, and tests and jumps to the appropriate one.

Meanwhile your heavy loops don't benefit much unless you've been through several dozen iterations of hand-tweaking to reduce inter-statement data dependencies so that -O3 has something interesting to chew on. If you've done that, inspected the assembler output to check that the compiler is "seeing" what you have in mind, and benchmarked to make sure it even matters, then -O3 could help. But even then, only on the files you've paid this kind of attention to.

Personally, I find that -Os gives better overall performance when I'm not tweaking by hand. And many packages where it matters will already have custom build flags for the critical sections.

Regarding -funsafe-math-optimizations, read up on all the -f*math* flags. Then pick the ones you actually care about. I often use -fno-math-errno in my own code to reduce data dependencies across mathlib calls, along with -fno-trapping-math -fno-signaling-nans in embedded systems where there's no place to trap to. I don't touch the general packages.
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.
Zucca
Veteran

Joined: 14 Jun 2007
Posts: 1560
Location: KUUSANKOSKI, Finland

Posted: Tue Feb 26, 2019 11:39 pm

Akkara wrote:
I do not recommend using -O3 unless you have personally inspected the resulting ".s" assembler output and concluded that optimization level is doing what you want.

This article, while not very extensive, suggests that -O3 is quite a safe bet, which was somewhat of a surprise to me.
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
Akkara
Administrator

Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

Posted: Wed Feb 27, 2019 6:22 am

Zucca wrote:
Akkara wrote:
I do not recommend using -O3 unless you have personally inspected the resulting ".s" assembler output and concluded that optimization level is doing what you want.

This article, while not very extensive, suggests that -O3 is quite a safe bet, which was somewhat of a surprise to me.

Interesting, and good to know. Thanks. My comments are mostly based on experience with gcc-5 and 6, with some cursory checking of 7 (I should have remembered to mention that), and with battles fought in trying to get some inner loops vectorized, and generally failing to achieve what I was looking for, finally resorting to calling the intrinsics directly (non-portably). They are benchmarking with gcc-9 so there's hope things are getting better!

Looking at the benchmarks, there are some where -O3 clearly helps, such as ray-tracing. But the average results on the last page show only a 4% improvement overall. I don't know how statistically significant that is. It does mean that -O3, at worst, probably won't hurt too much. The biggest positive effect (after going from no optimization to some) seems to come from -march=native. It would be interesting to see how the numbers for -O2 -march=native and for -Os -march=native compare, and how the code size is affected. Benchmarks that run well in isolation sometimes run more poorly alongside other programs, when they have to contend for cache.
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.
Zucca
Veteran

Joined: 14 Jun 2007
Posts: 1560
Location: KUUSANKOSKI, Finland

Posted: Wed Feb 27, 2019 11:43 am

I also tried to find another article in which some programs compiled with -O2 -fsome-lto-flags-enabled were somewhat (not significantly, but noticeably) faster than with -O3 -fsome-flags.

Then there's also the -Ofast... I wonder if it's any better than -O3, even in the best scenarios.

EDIT: There's this 2016 test.
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
NTU
Apprentice

Joined: 17 Jul 2015
Posts: 164

Posted: Fri Mar 01, 2019 2:27 am

Using -march flags with -ftree-vectorize only really makes sense with -funsafe-math-optimizations, though, as certain loop constructs can only be vectorized if GCC is allowed to change the order of math operations. -march with -ftree-vectorize can actually increase code size without benefit, as the compiled code will not have those SIMD instructions properly aligned. That is where -ffast-math or -funsafe-math-optimizations comes into play. -march on its own is the least efficient way to use those extra CPU instructions because of the lack of vectorization. Maybe I have this wrong, but I thought -march only makes sense with -ftree-vectorize and either -funsafe-math-optimizations or -ffast-math.

Last edited by NTU on Fri Mar 01, 2019 7:38 am; edited 1 time in total
Akkara
Administrator

Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

Posted: Fri Mar 01, 2019 2:50 am

That's true if the heavy part of your code is doing floating-point.

There are also integer vector opcodes. Those don't need special math flags to vectorize well, because integer math is fully commutative and associative. (Unlike floating-point, where ((a + b) + c) doesn't always give the same answer as (a + (b + c)).)
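As a rough illustration of the floating-point point above, here is a small Python sketch (Python floats are IEEE 754 doubles on common platforms). The helper names sum_sequential and sum_in_lanes are made up for this example; the second one only imitates the per-lane partial sums a vectorized reduction would keep, it is not what any particular compiler emits.

```python
# Floating-point addition is not associative: regrouping changes the rounding.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # a and b cancel exactly first, then c is added
right = a + (b + c)   # c is rounded away when it is added to b


def sum_sequential(values):
    """Add the values strictly in order, like the scalar loop as written."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_in_lanes(values, lanes=2):
    """Keep one partial sum per lane, then combine them (SIMD-style reduction).

    Hypothetical helper: it only mimics the reordering, not real SIMD code.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sum_sequential(partial)


values = [1e16, 1.0, -1e16, 1.0]
print(left, right)                                   # 1.0 0.0
print(sum_sequential(values), sum_in_lanes(values))  # 1.0 2.0
```

The two orderings give different answers for the same inputs, which is why a compiler needs something like -funsafe-math-optimizations before it is allowed to reorder a floating-point reduction this way.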

If you're going to use -ftree-vectorize, avoid code bloat by telling the compiler your buffers are aligned. (And make sure they actually are aligned!) Read up on __attribute__((aligned(N))), which works for both gcc and clang.

Also try to reduce data dependencies. Statements such as y[n] = a * y[n-1] + stuff are hard to vectorize because you can't calculate several y's in parallel without knowing the previous result. Remember, too, that the compiler can't know that two pointers don't happen to point to the same data. So even an innocuous-looking statement such as y[n] = x[n] + x[n-1] leads to code bloat: the compiler emits a check that x and y don't overlap and jumps to the vectorized version if they don't, otherwise it uses the one-iteration-at-a-time version. Use __restrict__ on pointer declarations to state that there are no data dependencies between them.
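To see why the overlap check matters, here is a small Python sketch that models the two pointers as memoryview slices of one buffer. The function names and buffer contents are invented for illustration, and vector_style only imitates "load everything, then store"; it is not real compiler output.

```python
def scalar_loop(x, y):
    """y[n] = x[n] + x[n-1], one element at a time, in source order."""
    for n in range(1, len(x)):
        y[n] = x[n] + x[n - 1]


def vector_style(x, y):
    """Same statement, but every load happens before any store (one wide vector)."""
    loaded = [x[n] + x[n - 1] for n in range(1, len(x))]
    for n, value in enumerate(loaded, start=1):
        y[n] = value


def run(loop, overlap):
    buf = bytearray([1, 2, 3, 4, 5, 0, 0, 0, 0, 0])
    x = memoryview(buf)[0:5]
    # With overlap, y[n] is the same byte as x[n + 1]; otherwise y is separate.
    y = memoryview(buf)[1:6] if overlap else memoryview(buf)[5:10]
    loop(x, y)
    return list(buf)


print(run(scalar_loop, overlap=False) == run(vector_style, overlap=False))  # True
print(run(scalar_loop, overlap=True))   # [1, 2, 3, 5, 8, 13, 0, 0, 0, 0]
print(run(vector_style, overlap=True))  # [1, 2, 3, 5, 7, 9, 0, 0, 0, 0]
```

With separate buffers both versions agree. Once the buffers overlap, the scalar loop feeds each store into the next iteration while the vector-style version does not, so the results differ. Without a promise like __restrict__, the compiler has to keep both versions and pick one at run time.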
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.
NTU
Apprentice

Joined: 17 Jul 2015
Posts: 164

Posted: Fri Mar 01, 2019 7:28 am

Haha, I'm not going to be digging through the large LLVM code base. I'd rather just benchmark LLVM+Clang compiled with -march=bdver2 -O3 -ftree-vectorize -funsafe-math-optimizations against a build with plain -O2, and do some _basic_ sanity checking. It's not practical to go through the entire tree and analyze the compiled assembly of LLVM and Clang themselves. Any suggestions on an easy way to do some simple performance/compliance tests?

-ffast-math pushes IEEE 754 non-compliance a bit too far: MPFR, MPC and GMP have quite a few tests that fail with it. With -funsafe-math-optimizations there's only an error here and there in each package's test suite. I forget exactly what the error rate was, but I know that -ffast-math performed much worse in `make check` for each package.
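For the simple performance side of that question, one low-effort option is to time the same workload under each build and compare medians. Below is a minimal Python sketch; time_command is a made-up helper, and the workload shown is only a placeholder so the script runs anywhere. The clang paths in the comment are hypothetical.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Placeholder workload. Substitute something like
    # ["/path/to/clang-O2/bin/clang", "-O2", "-c", "big.c", "-o", "/dev/null"]
    # and the same command for the -O3 build (both paths are hypothetical).
    workload = [sys.executable, "-c", "sum(range(100000))"]
    print(f"median: {time_command(workload):.4f} s")
```

The median is used rather than the mean so that one slow run (cold cache, background job) doesn't dominate. For the compliance side, running each package's own test suite, as with `make check` above, is probably still the most direct check.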
PrSo
Tux's lil' helper

Joined: 01 Jun 2017
Posts: 129

Posted: Fri Mar 01, 2019 6:04 pm

NTU wrote:
Haha, I'm not going to be digging through the large LLVM code base. I'd rather just do some benchmarking with compiling LLVM+Clang with -march=bdver2 -O3 -ftree-vectorize -funsafe-math-optimizations compared to simply just -O2, and do some _basic_ sanity checking.


IIRC -ftree-vectorize in GCC (-ftree-loop-vectorize and -ftree-slp-vectorize) is enabled by default at -O3. And why not -march=native?

Regards,
Przemek.
NTU
Apprentice

Joined: 17 Jul 2015
Posts: 164

Posted: Fri Mar 01, 2019 6:58 pm

PrSo wrote:
NTU wrote:
Haha, I'm not going to be digging through the large LLVM code base. I'd rather just do some benchmarking with compiling LLVM+Clang with -march=bdver2 -O3 -ftree-vectorize -funsafe-math-optimizations compared to simply just -O2, and do some _basic_ sanity checking.


IIRC -ftree-vectorize in GCC (-ftree-loop-vectorize and -ftree-slp-vectorize) is enabled by default at -O3. And why not -march=native?

Regards,
Przemek.

-O3 enables -ftree-loop-vectorize and -ftree-slp-vectorize, which perform loop and basic-block vectorization on trees, respectively. -ftree-vectorize itself is not listed at any level, though the GCC manual describes it as the switch that turns on exactly those two.

https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html -- "Vectorization is enabled by the flag -ftree-vectorize and by default at -O3." Yes, but only -ftree-loop-vectorize and -ftree-slp-vectorize.

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-O3 enables:
Code:
-fgcse-after-reload
-finline-functions
-fipa-cp-clone
-floop-interchange
-floop-unroll-and-jam
-fpeel-loops
-fpredictive-commoning
-fsplit-paths
-ftree-loop-distribute-patterns
-ftree-loop-distribution
-ftree-loop-vectorize
-ftree-partial-pre
-ftree-slp-vectorize
-funswitch-loops
-fvect-cost-model
-fversion-loops-for-strides

Anyway back to the question, so there is this: https://llvm.org/docs/TestSuiteGuide.html

In the "Displaying and Analyzing Results" section, what exactly do these numbers represent?
Code:
Metric: exec_time

Program                                         baseline

INT2006/456.hmmer/456.hmmer                   1222.90
INT2006/464.h264ref/464.h264ref               928.70
...
             baseline
count  506.000000
mean   20.563098
std    111.423325
min    0.003400
25%    0.011200
50%    0.339450
75%    4.067200
max    1222.896800

What does the baseline value 1222.90 mean in this context? And count 506.000000: what is it counting?
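As far as I can tell, the first table lists each program's exec_time in seconds for the run named "baseline", and the second table is a pandas-style summary over all of those rows, so count would be the number of programs that reported a value. The Python sketch below rebuilds that kind of summary under that assumption. Only the first two timings come from the listing above; the other three are invented so the example has something to summarize.

```python
import statistics

# Per-program exec_time in seconds. The two INT2006 values are from the
# listing above; the "hypothetical/..." entries are made up for illustration.
exec_time = {
    "INT2006/456.hmmer/456.hmmer": 1222.90,
    "INT2006/464.h264ref/464.h264ref": 928.70,
    "hypothetical/small-a": 0.0034,
    "hypothetical/small-b": 0.0112,
    "hypothetical/medium": 4.0672,
}

times = sorted(exec_time.values())
q1, q2, q3 = statistics.quantiles(times, n=4, method="inclusive")
summary = {
    "count": len(times),              # number of programs measured
    "mean": statistics.mean(times),   # average run time across programs
    "std": statistics.stdev(times),   # sample standard deviation
    "min": times[0],                  # fastest program
    "25%": q1,
    "50%": q2,                        # median program
    "75%": q3,
    "max": times[-1],                 # slowest program
}
for name, value in summary.items():
    print(f"{name:<6}{value:.6f}")
```

Read this way, max 1222.896800 in the original output is the same 456.hmmer run shown as 1222.90 in the first table, and the large gap between the median and the mean just says that a few long-running programs dominate the total.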
Zucca
Veteran

Joined: 14 Jun 2007
Posts: 1560
Location: KUUSANKOSKI, Finland

Posted: Fri Mar 01, 2019 7:19 pm

Oh btw... a little off-topic but... I've enabled -mvzeroupper on my system. It should increase performance on Bulldozers. However, I have some packages for which I have set a custom env to compile them with clang (in hopes of faster compiling). But I also needed to change CFLAGS, because clang wouldn't accept -mvzeroupper.
I wonder if that flag is actually worth anything...
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
NTU
Apprentice

Joined: 17 Jul 2015
Posts: 164

Posted: Mon Mar 11, 2019 7:56 pm

Zucca wrote:
Oh btw... a little off-topic but... I've enabled -mvzeroupper on my system. It should increase performance on Bulldozers. However, I have some packages for which I have set a custom env to compile them with clang (in hopes of faster compiling). But I also needed to change CFLAGS, because clang wouldn't accept -mvzeroupper.
I wonder if that flag is actually worth anything...

Do some Blender, imagemagick/graphicsmagick benchmarks?
All times are GMT