Gentoo Forums
Segfaults during compilation on AMD Ryzen.

Portage & Programming
likewhoa
l33t


Joined: 04 Oct 2006
Posts: 777
Location: Brooklyn, New York

Posted: Sun May 28, 2017 5:39 am

Naib wrote:

Like you, I used the XMP RAM settings. I noticed all my RAM timings were off until I enabled XMP, which then set them correctly. I still think those running into random segfaults have RAM-related issues; it has also been shown that Ryzen is very RAM-dependent due to the CCX.
I should put up a poll to see whether people are running their RAM at the stick's rated settings OR at what the BIOS thinks it should be. I am 99% sure this is the issue (assuming the hardware is good).

AGESA 1.0.0.6 is due out within the next two weeks, so that should sort out more RAM issues (i.e. defaults as well as overclocking).


Yes, most issues are probably bad RAM timings. I am currently stable at 3.9GHz with 1.3626V CPU, 1.0750V NB and 1.35V DRAM; anything higher would require more than I can handle on air, so 3.9GHz will stay for now \o/
I can probably also assume that gcc-7, a Linux kernel >= 4.9 and proper RAM timings according to your kit's specifications should give users a stable setup. XMP is not enabled by default, and most people ignore BIOS settings and expect things to just work™, but with Ryzen you do need to pay attention, like with anything good, such as Gentoo :D
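One way to check from Linux whether XMP actually took effect is to compare each DIMM's rated speed with the speed it is configured to run at, as reported by 'dmidecode -t 17'. A minimal sketch, assuming the field names noted in the comments (they differ between dmidecode versions); 'dimm_speeds' is a hypothetical helper name, not an existing tool:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each DIMM's rated speed next to the speed it is
# currently configured to run at, reading `dmidecode -t 17` output on stdin.
# Assumption: the fields are named "Locator:", "Speed:" and either
# "Configured Clock Speed:" (older dmidecode) or "Configured Memory Speed:".
dimm_speeds() {
    awk -F': *' '
        /^[[:space:]]*Locator:/                         { loc = $2 }
        /^[[:space:]]*Speed:/                           { rated = $2 }
        /^[[:space:]]*Configured (Clock|Memory) Speed:/ { print loc ": rated " rated ", running " $2 }
    '
}

# Usage (needs root):
#   dmidecode -t 17 | dimm_speeds
```

If the "running" value stays at the JEDEC default while "rated" shows the kit's speed, the XMP profile is not active.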

Can you post the output of 'for i in gcc libreoffice llvm chromium; do genlop -t $i|tail -3;done'? I'd like to compare with what others have.

Code:
for i in gcc libreoffice llvm chromium; do genlop -t $i|tail -3;done
     Fri May 26 00:31:35 2017 >>> sys-devel/gcc-7.1.0-r1
       merge time: 15 minutes and 49 seconds.

     Sun May 28 01:00:56 2017 >>> app-office/libreoffice-5.3.3.2
       merge time: 34 minutes and 1 second.

     Thu May 25 16:38:00 2017 >>> sys-devel/llvm-4.0.0-r2
       merge time: 10 minutes and 23 seconds.

     Fri Apr 28 08:29:01 2017 >>> www-client/chromium-58.0.3029.81
       merge time: 57 minutes and 32 seconds.


Enjoy your new rig, man! -j16 FTW. Oh, have you tried 'MAKEOPTS="-j -l" emerge ceph'? :D
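To compare the genlop merge times posted in this thread numerically rather than by eye, the 'merge time' lines can be converted to seconds. A minimal sketch, assuming genlop always phrases durations as in the output above ('N hours', 'N minutes', 'N seconds'); 'genlop_seconds' is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn `genlop -t` output (stdin) into "<atom> <seconds>"
# lines so merge times from different machines can be compared directly.
# Assumption: durations look like "<n> hour(s)", "<n> minute(s)", "<n> second(s)".
genlop_seconds() {
    awk '
        />>>/ { atom = $NF; next }      # e.g. "... >>> sys-devel/gcc-7.1.0-r1"
        /merge time:/ {
            total = 0
            for (i = 1; i < NF; i++) {
                if ($i !~ /^[0-9]+$/) continue
                unit = $(i + 1)
                if (unit ~ /^hour/)        total += $i * 3600
                else if (unit ~ /^minute/) total += $i * 60
                else if (unit ~ /^second/) total += $i
            }
            print atom, total
        }
    '
}

# Usage:
#   for i in gcc libreoffice llvm chromium; do genlop -t $i | tail -3; done | genlop_seconds
```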
Naib
Advocate


Joined: 21 May 2004
Posts: 4978
Location: Removed by Neddy

Posted: Sun May 28, 2017 8:36 am

Code:

for i in gcc libreoffice llvm chromium; do genlop -t $i|tail -3;done
     Sun May 28 08:39:21 2017 >>> sys-devel/gcc-7.1.0-r1
       merge time: 26 minutes and 19 seconds.

     Wed May 17 21:34:22 2017 >>> app-office/libreoffice-5.2.7.2
       merge time: 37 minutes and 58 seconds.

     Sun May 28 08:56:21 2017 >>> sys-devel/llvm-4.0.0-r2
       merge time: 17 minutes.

!!! Error: no merge found for 'chromium'

_________________
The best argument against democracy is a five-minute conversation with the average voter
Great Britain is a republic, with a hereditary president, while the United States is a monarchy with an elective king
drizzt
Guru


Joined: 21 Jul 2002
Posts: 402

Posted: Sun May 28, 2017 9:38 am

trippels wrote:
drizzt wrote:
ruby crashed rather nastily during emerge:
./template/verconf.h.tmpl:3: [BUG] Segmentation fault at 0x00000000000000
ruby 2.3.4p301 (2017-03-30 revision 58214) [x86_64-linux]
Update:
Hurray, looks like there is another source of trouble coming along:
I was able to build the above ruby version (dev-lang/ruby-2.3.4-r2) flawlessly with gcc-6. gcc-7 always triggers the error shown. *sigh*


It is a known ruby bug (undefined behavior due to misaligned loads). This has nothing to do with Ryzen.
See: https://bugs.ruby-lang.org/issues/11831
(It also could be the following GC bug: https://bugs.ruby-lang.org/issues/13150)
Both issues are fixed in 2.4, so I would suggest simply upgrading ruby.


Thank you for the information.
_________________
People don't have to earn my respect. I offer my respect to them, but be careful to lose my respect...
alfonsor
n00b


Joined: 13 Oct 2007
Posts: 16

Posted: Tue May 30, 2017 4:05 pm

nitm wrote:
If someone at AMD is interested in the core files, I can gather them.
But I doubt they'll be of much use - it is totally random.
And I've never seen an ICE in GCC, all my crashes are in bash and head.


Here, 99% of the time dmesg shows "error 6 in bash[400000+a7000]".

Error 6 means a user-mode write to a non-present page, so I suspect the CPU cache has a problem, whether in the silicon or something architectural.
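The number after "error" in these kernel segfault lines is the x86 page-fault error code, and its low bits can be decoded mechanically: bit 0 clear means the page was not present, bit 1 set means a write, bit 2 set means the access came from user mode, and bit 4 set means an instruction fetch. A small sketch; 'decode_pf_error' is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: decode the low bits of the x86 page-fault error code
# printed as "error N" in kernel segfault lines. The kernel prints N in hex,
# so pass values above 9 with a 0x prefix (bash arithmetic accepts it).
#   bit 0: 0 = page not present, 1 = protection violation
#   bit 1: 0 = read,             1 = write
#   bit 2: 0 = kernel mode,      1 = user mode
#   bit 4: 1 = instruction fetch
decode_pf_error() {
    local code=$(( $1 ))
    local mode=kernel access=read cause="non-present page"
    (( code & 0x4 ))  && mode=user
    (( code & 0x2 ))  && access=write
    (( code & 0x10 )) && access="instruction fetch"
    (( code & 0x1 ))  && cause="protection violation"
    echo "${mode}-mode ${access}, ${cause}"
}

decode_pf_error 6    # prints: user-mode write, non-present page
```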
boudin
n00b


Joined: 15 May 2017
Posts: 4

Posted: Tue May 30, 2017 9:01 pm

I've installed the latest AGESA (1006) through this beta BIOS: http://forum.gigabyte.us/thread/886/am4-beta-bios-thread?page=1

The situation is much better, but I still had a few compilation errors (weird ones, though). I'm now running with the XMP profile of my DDR4 without changing the voltage, Cool'n'Quiet on, and -j13 in MAKEOPTS. Before, I wasn't able to build anything in this configuration, and my whole system would become unstable once segfaults started to appear.
Since then I've rebuilt GCC and mesa a few times, plus some other stuff. I did have two compilation errors with mesa, but overall it's much more stable. I'm going to start toying with my RAM timings and voltage again to see if I can reach a stable point.
alfonsor
n00b


Joined: 13 Oct 2007
Posts: 16

Posted: Wed May 31, 2017 7:46 am

boudin wrote:
I've installed the latest AGESA (1006) through this beta bios http://forum.gigabyte.us/thread/886/am4-beta-bios-thread?page=1



AGESA 1006 didn't change a thing here.
mark_lagace
n00b


Joined: 19 Nov 2002
Posts: 72
Location: Ottawa, Canada

Posted: Wed May 31, 2017 8:35 pm

TL;DR: Does anyone have a stable Ryzen 7 system?

I've been pulling my hair out trying to get the system stable for compiling and every option mentioned in this thread has so far not been successful.

My system:
    CPU: Ryzen 7 1700 (stock cooler - wraith spire)
    RAM: (CMK16GX4M2B3200C16R): 2x8GB DDR4 3200 (running at 2933)
    MB: MSI X370 Gaming Pro Carbon, latest BIOS (7A32v15)
    Fresh install of Gentoo from a recent stage3 tar (kernel 4.9 gcc 5.4)
    Initially used "-O2 -pipe -march=native", but recompiled the system with "-O2 -pipe -march=bdver4 -mno-fma4 -mno-tbm -mno-xop -mno-lwp" after first running into problems and reading that gcc 5.4 uses bdver4 for Ryzen processors.
    MAKEOPTS is set to -j16

I've tried:
    RAM settings - JEDEC (2133MHz, 1.2V; CL 15); XMP profile 1 (2933 MHz, 1.35V; CL16); XMP profile 2 (3200 MHz, 1.35V; CL16)
    Pulling one stick of RAM and running on 1x8GB (only at 2133), trying each stick in turn in case one is bad.
    Turning SMT on or off
    OP Codes - no options in my BIOS to change
    CPU frequency governor on Performance vs ondemand
    Compiling GCC 6.3 and using -march=znver1
    Reducing makeopts to -j13 or -j8

In all cases, I get random segfaults during compiles. They happen more frequently with large packages and higher levels of threading (i.e. -j16 segfaults more than -j8), but even with memory at 2133, SMT off and -j8, compiles will segfault. I've just compiled kernel 4.11.3 and will see if that makes any difference, but I'm not holding my breath.

Having gone through all of this, my question at this point is whether ANYONE has a stable, Ryzen 7 system?

FWIW, memtest86+ runs through multiple passes with no issues. Prime95 under Win10 in stress test mode with 16 helpers running also doesn't crash (at least for the 2-3 hours that I let it run).
alfonsor
n00b


Joined: 13 Oct 2007
Posts: 16

Posted: Wed May 31, 2017 8:49 pm

mark_lagace wrote:
TL;DR: Does anyone have a stable Ryzen 7 system?


It's a lottery: people have stable systems with various Ryzens, and people have unstable systems with various CPUs. It is in the water.

PS: kernel 4.11.3 makes no difference.
trippels
Tux's lil' helper


Joined: 24 Nov 2010
Posts: 131
Location: Berlin

Posted: Thu Jun 01, 2017 4:31 am

It should also be noted that the crashing compiler is just a noisy symptom of a silent CPU corruption.
I have also seen git projects get corrupted on Ryzen (SHA checksum mismatches).

So the point is you cannot trust your data on this CPU. It might be OK for a pure gaming PC,
but for everything else it is utterly unacceptable.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 38366
Location: 56N 3W

Posted: Thu Jun 01, 2017 7:33 am

trippels,

The only separation between code and data in a PC is context.
Why any PC CPU operate correctly executing code but not when processing data?

I have seen it on more exotic architectures that have separate buses and memory arrangements for CPU instructions and data, but there the separation is at the hardware level.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
trippels
Tux's lil' helper


Joined: 24 Nov 2010
Posts: 131
Location: Berlin

Posted: Thu Jun 01, 2017 8:39 am

NeddySeagoon wrote:
trippels,

The only separation between code and data in a PC is context.
Why any PC CPU operate correctly executing code but not when processing data?


I cannot parse this question. The CPU apparently flips random bits in memory/cache.
If the memory area contains code the process may crash. If it contains only data
the user may never notice.
Naib
Advocate


Joined: 21 May 2004
Posts: 4978
Location: Removed by Neddy

Posted: Thu Jun 01, 2017 8:53 am

Harvard vs von Neumann architecture
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 38366
Location: 56N 3W

Posted: Thu Jun 01, 2017 9:00 am

trippels,

Quote:
The CPU apparently flips random bits in memory/cache.
If the memory area contains code the process may crash.

Hold that thought...
I can't reconcile it with your earlier statement,
Quote:
It might be OK for a pure gaming PC

The inference being that it's OK for games to crash.

I'm not aware of segfaults in gcc. It's usually in bash, during a build, not gcc itself.

We only know that AM4 systems containing Ryzen processors can generate segfaults under load.
It's a feature of the system, not the CPU. At least, it's not been demonstrated that it's the CPU.
AMD may know more, but they don't have a fix yet.
Tweaking the CPU may fix the system problem, but that does not imply the CPU was the root cause.

My money is on the Vcore or Vram PSU transient response behaviour causing brownouts.
They are the hardest bits of a PC to get right, and issues there only appear as system load changes.
trippels
Tux's lil' helper


Joined: 24 Nov 2010
Posts: 131
Location: Berlin

Posted: Thu Jun 01, 2017 9:16 am

NeddySeagoon wrote:
trippels,

Quote:
The CPU apparently flips random bits in memory/cache.
If the memory area contains code the process may crash.

Hold that thought ..
I can't reconcile it with your earlier statement,
Quote:
It might be OK for a pure gaming PC

The inference being that its OK for games to crash.

I'm not aware of segfaults in gcc. its usually in bash, during a build, not gcc itself.


No. I've seen several non-reproducible gcc segfaults myself.
Games don't use all cores at 100% like compiling with -j16 does.
This is a CPU bug. I wish AMD would officially acknowledge it.
alfonsor
n00b


Joined: 13 Oct 2007
Posts: 16

Posted: Thu Jun 01, 2017 10:44 am

NeddySeagoon wrote:

My money is on the Vcore or Vram PSU transient response behaviour causing brownouts.
They are the hardest bits of a PC to get right and issues there only appear as system load changes.


Many BIOS options have been cited in this thread to alleviate the problem, and none worked for me. Then I discovered that setting LLC (load-line calibration) to "high" in the BIOS reduces the 10-20 errors I get during an emerge -e @world to about 5.

So your money could be well placed, but I don't like the consequences: if so, there could be no solution at all for this line of CPUs. The problem is that AMD is selling very appealing multi-core/multi-thread CPUs that fail at exactly the job that makes them appealing.
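To make before/after comparisons like the 10-20 versus about 5 errors above easier to reproduce, one could count the kernel's "segfault at" messages per program after each test run. A sketch, assuming the usual '<name>[<pid>]: segfault at ...' kernel log format; 'segfault_counts' is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: count kernel "segfault at" messages per program name,
# e.g. before and after changing a BIOS setting such as LLC.
# Assumption: lines look like "<comm>[<pid>]: segfault at ..."; program names
# containing spaces are not handled.
segfault_counts() {
    awk 'match($0, /[^ ]+\[[0-9]+\]: segfault at/) {
             name = substr($0, RSTART, RLENGTH)
             sub(/\[.*/, "", name)         # keep only the program name
             count[name]++
         }
         END { for (n in count) print count[n], n }' | sort -k1,1nr -k2,2
}

# Usage:
#   dmesg | segfault_counts
```

Clearing the log between runs (or noting the last timestamp) keeps the counts from mixing different settings.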
roarinelk
Guru


Joined: 04 Mar 2004
Posts: 487

Posted: Thu Jun 01, 2017 10:48 am

trippels wrote:
No. I've seen several non-reproducible gcc segfaults myself.
Games don't use all cores at 100% like compiling with -j16 does.
This is a CPU bug. I wish AMD would officially acknowledge it.


I see random MCEs on a core ("u-op cache tag parity error") when all 16 threads are fully taxed for a few hours (e.g. encoding a DVD with x264 with all options set to max). The latest round of BIOS updates has reduced them significantly, though.
Reducing the memory frequency to the minimum also cut down on random segfaults a lot.

EDIT: Oh, and there's a far more annoying bug: sometimes processes just stop. They don't crash or anything, but they don't make any progress either (i.e. RIP doesn't move), yet they can be killed easily.

EDIT:
Code:

kernel: [Hardware Error]: Corrected error, no action required.
kernel: [Hardware Error]: CPU:5 (17:1:1) MC3_STATUS[-|CE|MiscV|-|-|-|-|SyndV|-]: 0x9820000000000150
kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000002a000503
kernel: [Hardware Error]: Decode Unit Extended Error Code: 0
kernel: [Hardware Error]: Decode Unit Error: uop cache tag parity error.
kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD
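The MC3_STATUS value in this log can be decoded by hand to see where the kernel's summary comes from. A sketch, assuming the bit positions (63 Val, 62 Over, 61 UC, 60 En, 59 MiscV, 58 AddrV, 57 PCC, 53 SyndV) and the cache level/transaction/memory-transaction name tables used by the kernel's AMD MCE decoder (drivers/edac/mce_amd.c); only the generic fields are decoded, not the bank-specific meaning of the extended error code. 'decode_mca_status' is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: decode the generic fields of an AMD MCA status value
# such as "MC3_STATUS[...]: 0x9820000000000150".
# Assumption: bit positions and the LL/TT/RRRR tables mirror the kernel's AMD
# decoder (drivers/edac/mce_amd.c); check them against your kernel source.
decode_mca_status() {
    local hex=${1#0x}
    while (( ${#hex} < 16 )); do hex=0$hex; done
    # Split into 32-bit halves so bash's signed 64-bit arithmetic never overflows.
    local hi=$(( 16#${hex:0:8} ))
    local lo=$(( 16#${hex:8:8} ))

    # Bits 63..57 of the register are bits 31..25 of the high half.
    local names=(Val Over UC En MiscV AddrV PCC)
    local flags="" bit
    for bit in 0 1 2 3 4 5 6; do
        (( hi >> (31 - bit) & 1 )) && flags+=" ${names[bit]}"
    done
    (( hi >> 21 & 1 )) && flags+=" SyndV"    # bit 53: syndrome valid
    local severity=CE
    (( hi >> 29 & 1 )) && severity=UE        # bit 61: uncorrected error

    local ec=$(( lo & 0xffff ))              # bits 15..0: MCA error code
    local xec=$(( lo >> 16 & 0x3f ))         # bits 21..16: extended error code
    echo "flags:${flags} (${severity})"
    printf 'error code: 0x%04x, extended error code: %d\n' "$ec" "$xec"

    # Memory-type error codes have the form 0000 0001 RRRR TTLL.
    if (( (ec & 0xff00) == 0x0100 )); then
        local ll=(RESV L1 L2 L3/GEN)
        local tt=(INSN DATA GEN RESV)
        local rrrr=(GEN RD WR DRD DWR IRD PRF EV SNP)
        local li=$(( ec & 3 )) ti=$(( ec >> 2 & 3 )) ri=$(( ec >> 4 & 0xf ))
        local rmsg="reserved"
        (( ri < ${#rrrr[@]} )) && rmsg=${rrrr[ri]}
        echo "cache level: ${ll[li]}, tx: ${tt[ti]}, mem-tx: ${rmsg}"
    fi
}

decode_mca_status 0x9820000000000150
# flags: Val En MiscV SyndV (CE)
# error code: 0x0150, extended error code: 0
# cache level: RESV, tx: INSN, mem-tx: IRD
```

The last line matches the kernel's own "cache level: RESV, tx: INSN, mem-tx: IRD" for this value, and the clear UC bit is why the kernel reports it as a corrected error.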


Last edited by roarinelk on Sat Jun 03, 2017 10:21 am; edited 3 times in total
krinn
Watchman


Joined: 02 May 2003
Posts: 5740

Posted: Thu Jun 01, 2017 12:00 pm

Isn't it just that the early Ryzens are all defective? Have a look (sorry guys, you will have to click here and there, zillions of ads! They deserve an award for the most awful website)
alfonsor
n00b


Joined: 13 Oct 2007
Posts: 16

Posted: Fri Jun 02, 2017 8:44 am

Here is a gcc segfault, with no dmesg info; it happened while compiling gcc 6.3.0 itself (with mesa emerging in parallel):

../../gcc-6.3.0/lto-plugin/lto-plugin.c: In function ‘process_symtab’:
../../gcc-6.3.0/lto-plugin/lto-plugin.c:945:1: internal compiler error: Segmentation fault
mark_lagace
n00b


Joined: 19 Nov 2002
Posts: 72
Location: Ottawa, Canada

Posted: Fri Jun 02, 2017 5:28 pm

alfonsor wrote:
here it is a gcc segfault, with no dmesg info; it happened during compilation of gcc 6.3.0 itself (with a parallel mesa emerging)

../../gcc-6.3.0/lto-plugin/lto-plugin.c: In function ‘process_symtab’:
../../gcc-6.3.0/lto-plugin/lto-plugin.c:945:1: internal compiler error: Segmentation fault


I can confirm that I have had segfaults in the compiler as well - though far less frequently than bash segfaults.
tuggbuss
Tux's lil' helper


Joined: 20 Mar 2017
Posts: 93

Posted: Fri Jun 02, 2017 11:40 pm

Phoronix is reporting on this issue:

http://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Compiler-Issues
trippels
Tux's lil' helper


Joined: 24 Nov 2010
Posts: 131
Location: Berlin

Posted: Sat Jun 03, 2017 4:57 am

Looking at the https://community.amd.com/message/2796982 thread,
it seems to be an issue with the new micro-op cache. The AMD guy recommends
disabling it in the BIOS (OPCache Control).

roarinelk also reported "u-op cache crc mismatch" MCEs.
Bigfoot77
n00b


Joined: 15 Dec 2006
Posts: 16

Posted: Sat Jun 03, 2017 5:42 am

Hmm, well I haven't read through every post, but I've gotten through at least half of them. Not sure if anyone else has mentioned this themselves yet, but I figured I'd at least throw it out there in case it's useful:

I built my Ryzen 7 1800X machine a week after launch day. Nothing was overclocked; everything was running at stock speeds. My mobo's BIOS version at the time of the initial install was v1.0. I (very sporadically) ran into the same segfaults when building larger packages on that initial install (identical to the first post in this thread), but for the most part everything seemed to build without issue and I didn't think much of the few segfaults. Not long after, I upgraded to BIOS v1.3 for my mobo, which included the updated AGESA 1.0.0.4a code.

While running v1.3 of the BIOS, I needed to re-emerge world, and during that re-emerge I started seeing the segfaults constantly. They happened more and more often as I got through the package list. By the end, I could no longer build mesa without segfaulting, which became my standard test for the segfaults. *Sometimes* gcc would build without segfaulting, but pretty much no large package could make it through. This went on for a few weeks. I was about to RMA the CPU, and as a last-ditch effort I decided to try a fresh install on a new partition. Interestingly, that install worked flawlessly and built every package (mesa included). For the past 1.5-2 months, that install has continued to work without issue and is what I'm still using. I update almost every day and haven't had any segfaults, no matter what gets built.
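The "build mesa until it segfaults" test above can be turned into a number that is comparable across BIOS versions or settings by repeating the same build a fixed number of times and counting failures. A minimal sketch; 'count_failures' is a hypothetical helper name, and a failed build is not necessarily a segfault, so the build logs or dmesg still need checking:

```bash
#!/usr/bin/env bash
# Hypothetical helper: run the same command N times and report how many runs
# failed (non-zero exit status). Output of each run is discarded.
count_failures() {
    local runs=$1
    shift
    local fails=0 i
    for (( i = 1; i <= runs; i++ )); do
        "$@" >/dev/null 2>&1 || fails=$(( fails + 1 ))
    done
    echo "${fails}/${runs} runs failed"
}

# Usage (as root):
#   count_failures 10 emerge --oneshot media-libs/mesa
```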

Has anyone tried a second install on a brand new BIOS to see what happens? I don't know the inner workings of GCC all that well, but is it possible that the early buggy BIOS releases caused something (GCC, libtool, something else?) to be built incorrectly, and that this then propagated to everything else built with it? The only difference between my initial install when I first built the machine and my current install is the BIOS version. No other hardware has changed. The BIOSes before v1.3 were pretty flaky for me and I didn't use the machine much, so I can't say whether the segfaults were actually happening a lot during that time. I did see the few segfaults with v1.0 initially, so at least something was going on from the beginning.

For reference, my system specs are:

Ryzen 7 1800X
MSI Tomahawk B350 mobo
Kingston HyperX Fury 2400 MHz RAM
gcc 6.3.0 (CFLAGS=-O2 -pipe -march=native -fno-stack-protector)

Please let me know if anyone wants any other hardware/software info about my setup. I know this isn't a ton to go on, but it's worth mentioning (I sure hope I didn't jinx myself :)
alfonsor
n00b


Joined: 13 Oct 2007
Posts: 16

Posted: Sat Jun 03, 2017 9:25 am

Bigfoot77, yes, I've tried a complete Gentoo installation 3 times and the problem is still there. But your story made me realize I always reused the same kernel configuration from my main installation... Did you change your kernel config when reinstalling?
Naib
Advocate


Joined: 21 May 2004
Posts: 4978
Location: Removed by Neddy

Posted: Sat Jun 03, 2017 9:28 am

alfonsor wrote:
Bigfoot77, yes I tried 3 times a complete gentoo installation and the problem is still there. But your story made me think I always used the same kernel configuration from my main installation... Did you change your kernel config during re-installing?

When I did my install, I made a completely fresh start: fresh install, fresh make.conf (kept USE) and a freshly configured kernel.
alfonsor
n00b


Joined: 13 Oct 2007
Posts: 16

Posted: Sat Jun 03, 2017 9:52 am

Yup, everything was "fresh" here except the kernel configuration.

I am pretty sure it is not a kernel problem; my fixation is cache coherence. Anyway, let's give it a try; what could be better than spending the weekend hacking Ryzen segfaults? :P
All times are GMT
Page 7 of 10

 