El_Goretto Moderator
Joined: 29 May 2004 Posts: 3169 Location: Paris
|
Posted: Mon Jun 26, 2017 2:46 pm Post subject: |
|
|
AGESA 1.0.0.6 does not solve the problem.
I used sat's script to test building mesa and to help get a reproducible test case.
On my Gentoo system, I have pretty much the same versions and settings as mrostu; the only differences are that I use -j 16 and my kernel is not configured by genkernel.
I have had the random segfaults while compiling with 2 motherboard/RAM combinations:
- Asus B350-MA + 2x8 Crucial @2133 (this RAM wasn't able to boot @2666): BIOS with AGESA 1.0.0.4 (I don't have this one anymore)
- Asus B350-plus + 2x8 Geil @2933: latest BIOS with AGESA 1.0.0.6 (the one I currently have)
- CPU is a 1700X at stock frequencies (I wouldn't OC anything until the setup is 100% stable anyway)
No problem detected by memtest86+ or prime95 (under Windows).
I tried switching some RAM settings from 'auto' to 'manual' in the BIOS, such as the timings (obviously), but from Windows 10, CPU-Z reports that they are not applied correctly: 15-17-17-70 ends up as 16-17-17-70, and the command rate I set to 2T (I don't know the correct value for my RAM kit, so I tried "the worst") shows as 1T. So I can't rule out some RAM misconfiguration by the motherboard for now. _________________ -TrueNAS & jails: µ-serv Gen8 E3-1260L, 16Go ECC + µ-serv N40L, 10Go ECC
-Réseau: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT) |
|
|
|
thigobr n00b
Joined: 31 Aug 2007 Posts: 31
|
Posted: Mon Jul 03, 2017 2:10 pm Post subject: |
|
|
My system has been showing these segfaults during package updates. The most recent one was when updating systemd:
Code: | sys-apps/systemd-233-r3:0/2::gentoo [233-r1:0/2::gentoo] |
After disabling ASLR, emerging this package went as expected (with the system overclocked). Before trying this, I reverted my overclock (cleared CMOS and loaded BIOS defaults) just to be sure it wasn't affecting stability in some way. The segmentation fault still happened every time I tried to emerge systemd with the system at stock clocks.
My config:
Ryzen 1700 (@3.8GHz)
Asus Prime X370-PRO BIOS 0805 final (AGESA 1.0.6a)
G.Skill Trident Z 2x8GB @3200MHz CL14 stock clock (Samsung B-die)
Plextor M8pe 256GB NVME
Corsair AX760 Platinum |
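For anyone wanting to repeat the ASLR experiment: the post does not say how ASLR was disabled, so what follows is only a minimal sketch, assuming a Linux system that exposes the setting as /proc/sys/kernel/randomize_va_space. The helper names describe_aslr and current_aslr are made up for illustration. It only reports the current mode, so you can record which state a given test run was made in.

```python
from pathlib import Path

# Linux sysctl that controls address space layout randomization.
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of the three values documented for this sysctl.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and vDSO randomized",
    2: "full randomization (mode 1 plus the heap)",
}


def describe_aslr(raw: str) -> str:
    """Turn the raw sysctl text (for example "2\n") into a readable description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown mode {value}")


def current_aslr() -> str:
    """Read the sysctl of the running kernel and describe it."""
    return describe_aslr(ASLR_SYSCTL.read_text())


if __name__ == "__main__":
    if ASLR_SYSCTL.exists():
        print("ASLR:", current_aslr())
    else:
        print("randomize_va_space not found (not a Linux system?)")
```

Writing 0 to that file as root turns ASLR off system-wide until it is set back or the machine reboots; setarch -R from util-linux can disable it for a single command instead.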
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Tue Jul 04, 2017 2:57 pm Post subject: |
|
|
I would interject in this long thread that I received notice this morning that gcc 6.4 is out as a bug-fix version. I would advise anyone having problems and running 6.3 to update as soon as 6.4 hits the tree. And for the truly adventurous, try porting it yourself!
EDIT: Spelling errors
Last edited by Tony0945 on Tue Jul 04, 2017 4:49 pm; edited 1 time in total |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54099 Location: 56N 3W
|
Posted: Tue Jul 04, 2017 4:09 pm Post subject: |
|
|
Tony0945,
I think the Ryzen issues are hardware related.
We already know that changing major versions of gcc doesn't clear the issue.
gcc-7.2 is promised 'soon' too. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Tue Jul 04, 2017 4:47 pm Post subject: |
|
|
NeddySeagoon wrote: | I think the Ryzen issues are hardware related. |
I tend to agree, however, it seems (just an impression) that tuning for default amd64 is safer than tuning for Zen. This could be because:
1. Specific Zen features are broken in the hardware, and because they are not used by plain vanilla tuning, the problems are avoided.
or (and?)
2. The initial Zen compiler tunings have bugs in gcc.
#2 is quite likely and it costs nothing but time to check out.
#1 while more likely means that the investment in Zen is shot and the repercussions will likely destroy AMD.
#2 also explains why Windows doesn't show the problems. I'd hate to think that AMD is explaining fixes to Microsoft that they won't tell to GNU. |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54099 Location: 56N 3W
|
Posted: Tue Jul 04, 2017 6:16 pm Post subject: |
|
|
Tony0945,
For most users Ryzen works well.
I don't think it's likely to destroy AMD any more than the Pentium FDIV bug destroyed Intel.
I'm confident that AMD and motherboard partners will fix it between them but it may take a silicon revision to do it. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
ozhdfw n00b
Joined: 21 Jun 2017 Posts: 7
|
Posted: Fri Jul 07, 2017 6:55 am Post subject: |
|
|
I am trying to gather more data to help narrow down the segfaults on Ryzen. How many of you who are having segfaults are running single-rank RAM? If you are not familiar with the rank of your RAM, I hope you did not build your Ryzen system. Considering this is a memory-related issue, I just wanted to rule out this possibility. Also, I would not rely on the posted QVL lists; they are not always reliable. https://community.amd.com/servlet/JiveServlet/showImage/2-2784271-117947/sm_add-slide-2_800.png |
|
|
|
thigobr n00b
Joined: 31 Aug 2007 Posts: 31
|
Posted: Fri Jul 07, 2017 12:57 pm Post subject: |
|
|
ozhdfw wrote: | I am trying to gather more data to help narrow down the segfaults on Ryzen. How many of you who are having segfaults are running single-rank RAM? If you are not familiar with the rank of your RAM, I hope you did not build your Ryzen system. Considering this is a memory-related issue, I just wanted to rule out this possibility. Also, I would not rely on the posted QVL lists; they are not always reliable. https://community.amd.com/servlet/JiveServlet/showImage/2-2784271-117947/sm_add-slide-2_800.png |
I am running single-rank DIMMs, but the frequency, voltage, and timings have no effect on how often the segfaults occur, at least when the machine is stable (a couple of hours of GSAT, Prime95 and y-cruncher). |
|
|
|
Ggtgp n00b
Joined: 09 Jul 2017 Posts: 1
|
Posted: Sun Jul 09, 2017 7:59 am Post subject: |
|
|
The PlayStation 2 dev units had an almost identical bug when accessing the extra Rambus RAM that the dev units had. The CPU would fault and a register would hold an impossible value, but if you checked the address 64 megabytes lower you would find that value.
The bug would happen when another accessor (the GPU) had opened that lower page, and the access to the higher page would get the lower cache line instead of the one at the address 64 megabytes higher.
It may have also involved a tight timing situation, both accesses happening in sequence, etc.
This would be why turning off SMT or ASLR reduces the frequency of crashes: fewer chances of identical low address bits on close accesses.
If this is the problem one could write a tool to cause reproducible crashes. |
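
The check such a tool would have to make can be sketched roughly as follows. This is only an illustration of the hypothesis, not a working reproducer: single-threaded Python will not create the tight timing between competing accesses described above, the 64 MiB distance is taken from the PlayStation 2 anecdote and nothing in this thread shows it applies to Ryzen, and the function names are made up. The idea is to store each offset's own value at that offset, then read everything back and report any place that returns another offset's data.

```python
def fill_with_own_offset(buf, stride, word=8):
    """Store each probed offset, as a little-endian integer, at that offset."""
    for off in range(0, len(buf) - word + 1, stride):
        buf[off:off + word] = off.to_bytes(word, "little")


def find_aliases(buf, stride, word=8):
    """Return (offset, value_found) pairs where a read gave another offset's data."""
    mismatches = []
    for off in range(0, len(buf) - word + 1, stride):
        found = int.from_bytes(buf[off:off + word], "little")
        if found != off:
            mismatches.append((off, found))
    return mismatches


if __name__ == "__main__":
    # Assumed distance, borrowed from the anecdote (64 megabytes).
    ALIAS_DISTANCE = 64 * 1024 * 1024
    buf = bytearray(2 * ALIAS_DISTANCE)  # covers both halves of a suspect pair
    fill_with_own_offset(buf, stride=4096)
    for off, found in find_aliases(buf, stride=4096):
        print(f"offset {off:#x} returned the value written at {found:#x}")
    print("scan finished")
```

A mismatch whose found value is exactly 64 MiB below its offset would match the pattern described; a real tool would need several threads hammering both halves at once, probably in C.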
|
|
|
El_Goretto Moderator
Joined: 29 May 2004 Posts: 3169 Location: Paris
|
Posted: Tue Jul 11, 2017 3:38 pm Post subject: |
|
|
Tony0945 wrote: | NeddySeagoon wrote: | I think the Ryzen issues are hardware related. |
I tend to agree, however, it seems (just an impression) that tuning for default amd64 is safer than tuning for Zen. This could be because:
1. Specific Zen features are broken in the hardware, and because they are not used by plain vanilla tuning, the problems are avoided.
or (and?)
2. The initial Zen compiler tunings have bugs in gcc. |
Tunings & gcc versions have no impact on the segfault issue.
Facts: I tested both gcc 5.x and 6.x, starting with a generic march (i.e. no tuning for the whole world, since I migrated from an Intel CPU), and only tried march=native on 6.3 for fun. _________________ -TrueNAS & jails: µ-serv Gen8 E3-1260L, 16Go ECC + µ-serv N40L, 10Go ECC
-Réseau: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT) |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Tue Jul 11, 2017 4:04 pm Post subject: |
|
|
El_Goretto wrote: | Tunings & gcc versions have no impact on the segfault issue.
Facts: I tested both gcc 5.x and 6.x, starting with a generic march (i.e. no tuning for the whole world, since I migrated from an Intel CPU), and only tried march=native on 6.3 for fun. | Good info. Thank you. |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54099 Location: 56N 3W
|
Posted: Tue Jul 11, 2017 5:16 pm Post subject: |
|
|
El_Goretto,
I think we can discount El_Goretto wrote: | 2. The initial Zen compiler tunings have bugs in gcc. |
Here's why.
A segfault is a runtime thing. It means an application has tried to access memory it's not allowed to.
Were it something gcc was planting in the output binaries, the affected binaries would always fail. That's not what is observed. Sometimes there is a segfault, sometimes not.
This does not say that gcc is correct, just that it is not the cause of intermittent segfaults.
It rather reminds me of a silicon problem in some early 64-bit PPC processors. When they were operated in 64-bit mode, within a certain temperature range, and lots of address bits changed at the same time, the die could not keep the address bus voltage within limits and address errors occurred. The fix was either new silicon or not using 64-bit mode, so that no more than 32 bits of the address bus changed at any time. In our embedded system, 32 bits was enough, so we forced the CPU into 32-bit mode at startup.
Not every sample was affected. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
El_Goretto Moderator
Joined: 29 May 2004 Posts: 3169 Location: Paris
|
Posted: Thu Jul 13, 2017 12:24 pm Post subject: |
|
|
@NeddySeagoon: exactly my thinking.
Causes != symptoms
(same way that solutions != workaround, /me looks at AMD...) _________________ -TrueNAS & jails: µ-serv Gen8 E3-1260L, 16Go ECC + µ-serv N40L, 10Go ECC
-Réseau: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT) |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54099 Location: 56N 3W
|
Posted: Thu Jul 13, 2017 1:06 pm Post subject: |
|
|
El_Goretto,
I think AMD are firmly in the frame for this. The segfaults have been observed on boards from different motherboard vendors.
The common elements are all down to AMD. AMD chipsets, AMD processors and AMD specifications used as the basis for the (independent) motherboard designs.
AMD won't publish the root cause until they also have a tried and tested fix/workaround. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
ozhdfw n00b
Joined: 21 Jun 2017 Posts: 7
|
Posted: Sun Jul 16, 2017 10:41 pm Post subject: Ryzen segfault bug |
|
|
Now would be a good time to RMA, considering that the just-released Pro 1700X may be using an updated stepping, so the memory controller may have been updated. https://www.amd.com/en/ryzen-pro#
Good luck, I look forward to future feedback and findings on this issue. |
|
|
|
trippels Tux's lil' helper
Joined: 24 Nov 2010 Posts: 137 Location: Berlin
|
Posted: Mon Jul 17, 2017 10:39 am Post subject: Re: Ryzen segfault bug |
|
|
ozhdfw wrote: | Now would be a good time to RMA, considering that the just-released Pro 1700X may be using an updated stepping, so the memory controller may have been updated. https://www.amd.com/en/ryzen-pro#
Good luck, I look forward to future feedback and findings on this issue. |
No, all Zen related products (Ryzen, Threadripper, Epyc) use exactly the same die.
I'm afraid it looks like AMD simply bins the dies too aggressively.
There are users that see no segfaults after RMA. Others still see the same issue after several RMAs.
So it is pure lottery whether you get a good CPU... |
|
|
|
El_Goretto Moderator
Joined: 29 May 2004 Posts: 3169 Location: Paris
|
Posted: Mon Jul 17, 2017 2:14 pm Post subject: |
|
|
Indeed.
AMD support asked me to RMA the chip, to take the "faulty chip" possibility out of the equation. They first asked for details about my setup, then asked me to test after increasing the CPU voltage from 1.30 to 1.425 V (that was unsuccessful; it still segfaulted randomly when emerging mesa). Anyway, they are doing kinda great on the support side (it's not every day/everywhere that you have a responsive tech guy to speak to). I'll keep you informed once the RMA request is accepted and I've been able to test the "new" chip. _________________ -TrueNAS & jails: µ-serv Gen8 E3-1260L, 16Go ECC + µ-serv N40L, 10Go ECC
-Réseau: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT) |
|
|
|
ozhdfw n00b
Joined: 21 Jun 2017 Posts: 7
|
Posted: Mon Jul 17, 2017 11:00 pm Post subject: Re: Ryzen segfault bug |
|
|
trippels wrote: | ozhdfw wrote: | Now would be a good time to RMA, considering that the just-released Pro 1700X may be using an updated stepping, so the memory controller may have been updated. https://www.amd.com/en/ryzen-pro#
Good luck, I look forward to future feedback and findings on this issue. |
No, all Zen related products (Ryzen, Threadripper, Epyc) use exactly the same die.
I'm afraid it looks like AMD simply bins the dies too aggressively.
There are users that see no segfaults after RMA. Others still see the same issue after several RMAs.
So it is pure lottery whether you get a good CPU... |
I couldn't imagine this bug being present on their Pro line of CPUs. I would assume AMD has corrected this issue by now, with a microcode update at the very least, if they are not going to be using an updated stepping. If this is a poor yield / binning issue, then it truly would be a lottery and a microcode or stepping update may not be as applicable, but the Pro chips are better binned and may still be worth looking into as a replacement for that reason.
Last edited by ozhdfw on Tue Jul 18, 2017 2:09 am; edited 2 times in total |
|
|
|
mark_lagace Tux's lil' helper
Joined: 19 Nov 2002 Posts: 77 Location: Ottawa, Canada
|
Posted: Mon Jul 17, 2017 11:10 pm Post subject: |
|
|
I have also had very good support from AMD, but no resolution to the problem. I've RMA'd my chip (R7-1700) and unfortunately the replacement also segfaults. I have tried 2 different motherboards from different manufacturers, two different sets of RAM, and at this point 3 processors. While I haven't tried every combination and permutation, I have yet to find one that works. Some people have mentioned success after getting an RMA'd processor (or have simply stated that they don't experience this problem), so maybe I'm just unlucky, but I honestly don't know what to do at this point.
As for whether it's specific to GCC (6.3 or otherwise), I have used 4.9, 5.4, 6.3 and 7.1, as well as Clang 3.9 and 4.0, with no success (although Clang takes longer to segfault). I have tried no optimization, march=x86-64, march=znver1, and others, again with no success. I have tried kernels from 4.4 through 4.12, low-latency or regular, and various other kernel settings. I have tried adjusting voltages, LLC, RAM speeds, and any BIOS settings available; nothing works other than disabling SMT (at which point I can go at least 20 hours without a segfault).
Note that someone has written test code to reproduce the SEGV that does not involve simply compiling for hours. It's found here: https://github.com/hayamdk/ryzen_segv_test. At least for me, this code will throw a segfault within 5-10 minutes of starting it with 16 threads going. Aside from the variety of different compilers and versions I have tested, this strongly suggests to me that it is not a compiler bug. I'm going to try to compile it under Windows and see if it crashes, or if the problem is restricted to Linux. I'm not sure what I'm hoping for...
EDIT: I don't know what to think anymore. The ryzen_segv_test ran for a few hours without any errors under Windows. Having read that gcc 6.4 is out and some people reported stable compilation with Ryzen using it, I thought I would test it out in Gentoo. I compiled gcc 6.4 from source and used that to compile the ryzen_segv_test. It ran without segfault for a few hours, so I thought this might be a solution. I let it run overnight, however, and it did segfault once (out of approx 350K "OK" results). So now I'm going to have to try the windows version for much longer to see if it crashes on a long run. Oddly enough, I went back to trying the ryzen_segv_test compiled with gcc 7.1 and it ran for over 2 hours without a segfault as opposed to the 5-10 minutes I experienced earlier.
Last edited by mark_lagace on Tue Jul 18, 2017 1:54 pm; edited 1 time in total |
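
One segfault in roughly 350K "OK" results works out to about 3 failures per million iterations, so short runs say very little either way. A small harness that reruns a command and counts how often it dies with SIGSEGV makes such comparisons less anecdotal. This is a minimal sketch, not part of ryzen_segv_test; the function names and the default of 100 runs are arbitrary choices for illustration.

```python
import signal
import subprocess
import sys


def run_once(cmd):
    """Run cmd once; return the number of the signal that killed it, or None."""
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # On POSIX, a negative return code -N means "terminated by signal N".
    return -proc.returncode if proc.returncode < 0 else None


def count_segfaults(cmd, runs):
    """Run cmd `runs` times and count the runs that were killed by SIGSEGV."""
    return sum(1 for _ in range(runs) if run_once(cmd) == signal.SIGSEGV)


if __name__ == "__main__":
    cmd = sys.argv[1:]
    if cmd:
        runs = 100  # arbitrary; rare failures need far more runs
        crashes = count_segfaults(cmd, runs)
        print(f"{crashes} segfault(s) in {runs} runs ({crashes / runs:.2%})")
    else:
        print("usage: pass the command to test as arguments")
```

Note that this only sees the command itself being killed. When a child process crashes inside make or emerge, the parent normally just exits with a non-zero status, so for build jobs you would have to search the build log instead.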
|
|
|
El_Goretto Moderator
Joined: 29 May 2004 Posts: 3169 Location: Paris
|
Posted: Tue Jul 18, 2017 9:42 am Post subject: |
|
|
mark_lagace wrote: | Note that someone has written test code to reproduce the SEGV that does not involve simply compiling for hours. It's found here: https://github.com/hayamdk/ryzen_segv_test. At least for me, this code will throw a segfault within 5-10 minutes of starting it with 16 threads going. Aside from the variety of different compilers and versions I have tested, this strongly suggests to me that it is not a compiler bug. I'm going to try to compile it under Windows and see if it crashes, or if the problem is restricted to Linux. I'm not sure what I'm hoping for... |
I barely launched it yesterday (spent too much time modifying the shell scripts that come with it). Anyway, there is a Windows binary available in the "releases" section (I don't have Visual Studio).
I'll really try it soon too. _________________ -TrueNAS & jails: µ-serv Gen8 E3-1260L, 16Go ECC + µ-serv N40L, 10Go ECC
-Réseau: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT) |
|
|
|
Naib Watchman
Joined: 21 May 2004 Posts: 6051 Location: Removed by Neddy
|
Posted: Tue Jul 18, 2017 7:46 pm Post subject: |
|
|
El_Goretto wrote: | mark_lagace wrote: | Note that someone has written test code to reproduce the SEGV that does not involve simply compiling for hours. It's found here: https://github.com/hayamdk/ryzen_segv_test. At least for me, this code will throw a segfault within 5-10 minutes of starting it with 16 threads going. Aside from the variety of different compilers and versions I have tested, this strongly suggests to me that it is not a compiler bug. I'm going to try to compile it under Windows and see if it crashes, or if the problem is restricted to Linux. I'm not sure what I'm hoping for... |
I barely launched it yesterday (spent too much time modifying the shell scripts that come with it). Anyway, there is a Windows binary available in the "releases" section (I don't have Visual Studio).
I'll really try it soon too. |
Modifying why?
I read the .sh and the .c to check there was nothing obviously dodgy, then ran it:
Code: | ~/Downloads/ryzen_segv_test-master] $ make
cc -O2 -Wall -c ryzen_segv_test.c -o ryzen_segv_test.o
cc -pthread -o ryzen_segv_test ryzen_segv_test.o
$ ./run.sh 12 2500000
|
It's been running for about 20 min without any segfaults or any other issues. I might rebuild with march=native just to check... but so far fine... _________________
Quote: | Removed by Chiitoo |
|
|
|
|
soulsource n00b
Joined: 25 Jan 2014 Posts: 26
|
Posted: Tue Jul 18, 2017 9:21 pm Post subject: |
|
|
ozhdfw wrote: | I am trying to gather more data to help narrow down the segfaults on Ryzen. How many of you who are having segfaults are running single-rank RAM? If you are not familiar with the rank of your RAM, I hope you did not build your Ryzen system. Considering this is a memory-related issue, I just wanted to rule out this possibility. Also, I would not rely on the posted QVL lists; they are not always reliable. https://community.amd.com/servlet/JiveServlet/showImage/2-2784271-117947/sm_add-slide-2_800.png |
I'm running single-rank RAM as well (Corsair CMK16GX4M2B3200C16). The segfaults appear to happen independently of RAM clock rate, timings, CPU voltage, memory controller voltage, load line calibration, whatever...
I've tried three configurations: every setting at auto, which means DDR4-2133 at rather conservative timings of 16-18-18-36; clocking the RAM as DDR4-2666 with the same timings, which are still far slower than what the RAM should be able to deliver at this clock rate; and clocking the RAM at its specified rate of DDR4-3200 with its specified timings, which are again 16-18-18-36. Stability was the same in all three cases (namely: a segfault at the first or second attempt to compile mesa). Then I tried raising the CPU and memory controller voltages above what my mainboard picked at auto, but the only result was that, in addition to the segfaults, the system started to have occasional freezes. I've tried various values for Load Line Calibration, but the segfaults still occurred during the first or second attempt to compile mesa.
I've also tried disabling ASLR and setting CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. With these settings I managed to compile mesa three times without a crash, but the fourth attempt segfaulted again, so I wouldn't consider this a big improvement or workaround.
Long story short, I'm meanwhile rather certain that it's a hardware issue, and I'm hoping that AMD gets a workaround or fix out for it soon, as it's really annoying to have longer builds crash regularly. |
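
To double-check that a running kernel really has the two RCU options mentioned above, something like the following can help. It is a minimal sketch, assuming the kernel exposes its configuration as /proc/config.gz (which requires CONFIG_IKCONFIG_PROC); the helper name kconfig_state is made up for illustration. It relies only on the usual .config syntax, where an enabled option appears as NAME=value and a disabled one as "# NAME is not set".

```python
import gzip
from pathlib import Path

# The two options named in the post.
OPTIONS = ("CONFIG_RCU_NOCB_CPU", "CONFIG_RCU_NOCB_CPU_ALL")


def kconfig_state(config_text, option):
    """Return the option's value ('y', 'm', ...), 'not set', or 'absent'."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == f"# {option} is not set":
            return "not set"
    return "absent"


if __name__ == "__main__":
    proc_config = Path("/proc/config.gz")
    if proc_config.exists():
        with gzip.open(proc_config, "rt") as fh:
            text = fh.read()
        for opt in OPTIONS:
            print(opt, "->", kconfig_state(text, opt))
    else:
        print("/proc/config.gz not available; check the .config of your kernel build instead")
```

An "absent" result means the option does not exist in that kernel version's configuration at all, which is different from it being switched off.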
|
|
|
El_Goretto Moderator
Joined: 29 May 2004 Posts: 3169 Location: Paris
|
Posted: Wed Jul 19, 2017 12:26 pm Post subject: |
|
|
Naib wrote: | El_Goretto wrote: | I barely launched it yesterday (spent too much time modifying the shell scripts coming with it). |
modifying why?
|
Because it seems to me that there is no results reporting (+ I tried to fix a non-existent issue ).
Naib wrote: | its been running for like 20min without any segfaults or any other issues. I might remake with march=native just to check... but so far fine... |
Same here: I compiled it with gcc 6.3 without a Zen-specific march setting and let it run for a couple of hours without any segfault (a couple of minutes with a version compiled with gcc 5.4 didn't show anything either).
At least in my case, it's not effective to reproduce the segfault issue. I'll stick with compiling mesa. _________________ -TrueNAS & jails: µ-serv Gen8 E3-1260L, 16Go ECC + µ-serv N40L, 10Go ECC
-Réseau: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT) |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54099 Location: 56N 3W
|
Posted: Wed Jul 19, 2017 1:20 pm Post subject: |
|
|
El_Goretto,
Maybe that indicates that there are several causes of segfaults in Ryzen systems? _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
mir3x Guru
Joined: 02 Jun 2012 Posts: 455
|
Posted: Thu Jul 20, 2017 10:46 am Post subject: |
|
|
I got my lovely Ryzen 7 1700X with 64 GB of Kingston FURY BLACK 2.4 and an Asus B350Plus (jumped from a Core 2 Duo with 4 GB RAM :D), compiling stuff in tmpfs...
I don't have segfaults showing in dmesg, but I had 3 internal compiler errors; after another emerge, the package somehow compiled successfully.
I had them on the Core 2 Duo too; then I used an older compiler and everything was OK.
(I used march=znver1.)
So how is that all Ryzen's fault, and not e.g. bad flags or a gcc bug? _________________ Sent from Windows |
|
|
|
|