Gentoo Forums
Segfaults during compilation on AMD Ryzen.

El_Goretto
Moderator

Joined: 29 May 2004
Posts: 3166
Location: Paris

Posted: Thu Jul 20, 2017 12:03 pm

NeddySeagoon wrote:
El_Goretto,

Maybe that indicates that there are several causes of segfaults in Ryzen systems ?

Maybe. I'm not sure we can conclude one way or the other.

Given how random the segfaults are, we use test cases that raise the chances of provoking them (or of provoking "one" of them, if I follow your idea).
Compiling mesa is a kind of "brute force"/statistical test case: it draws a lot of power, does a lot of computation, and is likely to provoke a failure on any unstable system in general.
By contrast, the hayamdk/ryzen_segv_test tool seems to focus on one very specific hypothetical root cause. And I'm not skilled enough to understand what its author is trying to do and why.

The only thing I can say: compiling mesa can draw more than 200W (total at the plug), while ryzen_segv_test draws around 150W.
_________________
-TrueNAS & jails: µ-serv Gen8 E3-1260L, 16 GB ECC + µ-serv N40L, 10 GB ECC
-Network: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT)

Tony0945
Watchman

Joined: 25 Jul 2006
Posts: 5127
Location: Illinois, USA

Posted: Thu Jul 20, 2017 2:24 pm

mir3x,
What exact compiler version are you running, and would you please post your complete CFLAGS?

mir3x
Guru

Joined: 02 Jun 2012
Posts: 455

Posted: Thu Jul 20, 2017 3:18 pm

I just used what the wiki said, only adding -pipe:
CFLAGS="-march=znver1 -O2 -pipe"
for gcc 6.3.
I updated the BIOS from within the BIOS itself, and it updated to some version from June. I haven't seen any AGESA version.

I used the newest kernel, 4.12.2, and I have modified almost nothing.
That's cool:
qlop -t libreoffice:
libreoffice: 1716 seconds average for 1 merges
(I last checked a few years ago on a Core 2 Duo and it took more than 12h, so I gave up on office then.)

I'll switch to ck-sources soon.
_________________
Sent from Windows

soulsource
n00b

Joined: 25 Jan 2014
Posts: 26

Posted: Thu Jul 20, 2017 5:41 pm

While an internal compiler error might be just a bug, it can also be a symptom of data corruption. I've seen both on my Ryzen machine: segfaults (probably data corruption that caused the compiler to access data outside of its process memory) and internal compiler errors (probably data corruption that made the compiler access incorrect data within its process memory). On my old FX-6350 I can't remember ever seeing such issues.

Naib
Watchman

Joined: 21 May 2004
Posts: 6050
Location: Removed by Neddy

Posted: Sat Aug 05, 2017 10:58 pm

Could people experiencing the crash try these steps?

https://fujii.github.io/2017/06/23/how-to-reproduce-the-segmentation-faluts-on-ryzen/
_________________
Quote:
Removed by Chiitoo

mrostu
n00b

Joined: 26 Jun 2017
Posts: 2
Location: Moscow

Posted: Sun Aug 06, 2017 10:05 pm

Like many others here, I was premature with my conclusion. Segfaults still arise, but more rarely. Now I can't compile webkit-gtk.

Naib
Watchman

Joined: 21 May 2004
Posts: 6050
Location: Removed by Neddy

Posted: Sun Aug 06, 2017 11:35 pm

Do you have a Ryzen 7 or a Ryzen 5?
Are you using GCC 5, 6, or 7?
What -march do you have?
What is your BIOS AGESA version?
Are your RAM timings set as per your sticks, not what the BIOS thinks they should be?
Is the RAM voltage what is required for DDR4?

El_Goretto
Moderator

Joined: 29 May 2004
Posts: 3166
Location: Paris

Posted: Mon Aug 07, 2017 3:35 pm

Hello, just a quick update: after a couple of weeks interacting with AMD support, they asked me to RMA my CPU. They even offered to test the replacement CPU with the same mobo/RAM setup I have before shipping it (which I gladly accepted). Now I'm waiting.

Anyway, I'm crossing my fingers in the hope that my story ends up pretty much like this dude's.

Flittchen
n00b

Joined: 07 Aug 2017
Posts: 1

Posted: Mon Aug 07, 2017 4:10 pm    Post subject: Don't get it

RMA Folks,

Here is what I don't get: several people in this thread report that they got rid of the segfaults by replacing their RAM with more compatible modules (Samsung B ICs) or by avoiding 4 modules (4x8 replaced by 2x16).

So if that fixes the problem, why do you hope that an RMA of the CPU (which is likely not the cause of the problem in the first place) is going to help you out?

You saved roughly $500 over buying an Intel-based X299 system. Do you really need to save an additional $30 on cheap RAM?

Simply don't take the cheapest Hynix- or Micron-based RAM; take Samsung B instead and live happily ever after!

To prove myself wrong: are there any FlareX owners around who can reproduce this problem under default UEFI settings (CPU not overclocked and RAM @ XMP or SPD; n>=1)?

Best,

Flittchen

Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3309
Location: Rasi, Finland

Posted: Mon Aug 07, 2017 10:14 pm

Looks like AMD has finally admitted there is a problem.
_________________
..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote:
I am NaN! I am a man!

Naib
Watchman

Joined: 21 May 2004
Posts: 6050
Location: Removed by Neddy

Posted: Mon Aug 07, 2017 10:40 pm

A better link is this: https://community.amd.com/message/2816382#comment-2816382#2816382

Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3309
Location: Rasi, Finland

Posted: Mon Aug 07, 2017 11:00 pm

Naib wrote:
A better link is this: https://community.amd.com/message/2816382#comment-2816382#2816382
Thanks.
For some reason my Firefox doesn't follow the anchor on that link. I found the post anyway.

El_Goretto
Moderator

Joined: 29 May 2004
Posts: 3166
Location: Paris

Posted: Tue Aug 08, 2017 2:26 pm

El_Goretto wrote:
Hello, just a quick update: after a couple of weeks interacting with AMD support, they asked me to RMA my CPU. They even offered to test the replacement CPU with the same mobo/RAM setup I have before shipping it (which I gladly accepted). Now I'm waiting.

Anyway, I'm crossing my fingers in the hope that my story ends up pretty much like this dude's.

OK... so I got my 1700X replacement (timestamped "1725", so supposedly made 20 weeks after my defective one), tested it for more than 1 hour, and it didn't segfault once in 30 rounds of mesa compilation (10 minutes was enough to trigger it previously).
I won't be able to test it properly (for at least 12 hours) before I come back from holidays in 2 weeks, but I'm pretty confident.

Anyway, it's good news that you can get a 100% working CPU if you go through the official RMA procedure, now that the AMD tech support guys test replacement CPUs before shipping them.

El_Goretto
Moderator

Joined: 29 May 2004
Posts: 3166
Location: Paris

Posted: Wed Aug 23, 2017 9:13 am

El_Goretto wrote:

OK... so I got my 1700X replacement (timestamped "1725", so supposedly made 20 weeks after my defective one), tested it for more than 1 hour, and it didn't segfault once in 30 rounds of mesa compilation (10 minutes was enough to trigger it previously).
I won't be able to test it properly (for at least 12 hours) before I come back from holidays in 2 weeks, but I'm pretty confident.

Quick update: I was able to "serial-compile" Mesa for a whole night, and after more than 200 iterations, no segfault.
So RMA is the solution, and there definitely was a manufacturing issue that AMD QA was unaware of. That is the glass-half-full version :roll: ; the glass-half-empty version is already written all over the AMD forum by angry customers (whom I understand too).
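
For anyone wanting to repeat this kind of "serial-compile" test, here is a minimal sketch of a loop that runs one command many times and records which iterations failed. It is not the script used in this thread: `run_repeatedly` and `build_cmd` are hypothetical names, and the placeholder command below does nothing, so swap in whatever actually rebuilds your test package.

```python
import subprocess
import sys


def run_repeatedly(cmd, iterations):
    """Run cmd `iterations` times and return the 1-based indices of failed runs."""
    failures = []
    for i in range(1, iterations + 1):
        # Discard build output; only the exit status matters for this test.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures.append(i)
    return failures


if __name__ == "__main__":
    # Hypothetical placeholder: replace with your real build command.
    build_cmd = [sys.executable, "-c", "pass"]
    failed = run_repeatedly(build_cmd, iterations=5)
    print(f"{len(failed)} failed iteration(s): {failed}")
```

A non-zero exit status only says the build failed, not why, so check the build log for "Segmentation fault" or "internal compiler error" before blaming the CPU.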

Last edited by El_Goretto on Wed Aug 23, 2017 4:43 pm; edited 2 times in total

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54028
Location: 56N 3W

Posted: Wed Aug 23, 2017 9:45 am

El_Goretto,

The cynic in me says that the manufacturing issue is still there but AMD keep a pool of screened devices to deal with RMAs.
It will be interesting to see how Threadripper gets on in the real world, since it's two Ryzen dies in the same package.

It must be a big problem, getting 185A (approx) into and out of that package.
No wonder it needs all those pins.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3309
Location: Rasi, Finland

Posted: Thu Aug 24, 2017 10:55 am

NeddySeagoon wrote:
It must be a big problem, getting 185A (approx) into and out of that package.
Now that you mention it... At the voltages current CPUs run at, it's a huge current. I never really gave it a thought. Now it makes much more sense why those VRMs tend to get so hot.

Naib
Watchman

Joined: 21 May 2004
Posts: 6050
Location: Removed by Neddy

Posted: Thu Aug 24, 2017 11:07 am

The Threadripper has a TDP of 180W and a core voltage of 1.3V.
Typical loaded current draw is thus about 138A (180W / 1.3V). It could easily peak as high as 185A.
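
The arithmetic can be sketched as below. It assumes, as a simplification, that the whole 180W is delivered on a single 1.3V rail; `current_amps` is just an illustrative helper name.

```python
def current_amps(power_watts, voltage_volts):
    """I = P / V for a single supply rail."""
    return power_watts / voltage_volts


tdp_current = current_amps(180.0, 1.3)
print(f"Current at TDP: {tdp_current:.1f} A")  # 138.5 A

# Conversely, the power that a 185 A draw would imply at the same voltage.
power_at_185a = 185.0 * 1.3
print(f"Power at 185 A: {power_at_185a:.1f} W")  # 240.5 W
```

So a 185A peak would mean roughly 240W at 1.3V, about a third above the stated TDP.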

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54028
Location: 56N 3W

Posted: Thu Aug 24, 2017 11:21 am

Zucca,

The power loss is I^2R. (That's your heating)
The voltage droop is IR. (Plays havoc with the dynamic regulation)

I would like to think that the VRM is divided into several zones and that there is voltage feedback from the silicon inside the package, so that's the regulation point, not the PCB power planes on the motherboard. Then you have to live with the IR loss due to the package.
As Naib says, it's 138A peak (each way).
It's easy to see how you get a few mV of ripple on the silicon that takes it out of spec, especially when the load changes from not a lot to 138A.

It's left as an exercise for the reader to determine R for a 10mV change in operating voltage at the silicon.
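
One way to work the exercise, as a rough sketch: treat the path as a single lumped resistance and use the 138A and 10mV figures from above. `resistance_ohms` is an illustrative name, and a real package is not a single resistor.

```python
def resistance_ohms(delta_v_volts, current_amps):
    """R = V / I for a given voltage drop at a given current."""
    return delta_v_volts / current_amps


r = resistance_ohms(0.010, 138.0)
print(f"R = {r * 1e6:.1f} micro-ohms")  # 72.5 micro-ohms

# Heat dissipated in that resistance at the same current: P = I^2 * R.
heat_watts = 138.0 ** 2 * r
print(f"I^2R loss = {heat_watts:.2f} W")  # 1.38 W
```

In other words, about 72 micro-ohms of path resistance is already enough to move the voltage at the silicon by 10mV at that current.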

Naib
Watchman

Joined: 21 May 2004
Posts: 6050
Location: Removed by Neddy

Posted: Thu Aug 24, 2017 1:19 pm

The thing about TDP is that it is an arbitrary operating point meant to mimic real-world usage. It isn't light loading and it isn't fully loaded... How close the stated TDP is to the continuously loaded thermal power is why some say it is a useless measure... It's a measure nonetheless.

It wouldn't surprise me if 180A is pulled by the CPU while using Gentoo and say compiling chromium with -j32 (or whatever it is)

mark_lagace
Tux's lil' helper

Joined: 19 Nov 2002
Posts: 77
Location: Ottawa, Canada

Posted: Mon Aug 28, 2017 1:15 am    Post subject: Re: Don't get it

Flittchen wrote:
RMA Folks,

Here is what I don't get: several people in this thread report that they got rid of the segfaults by replacing their RAM with more compatible modules (Samsung B ICs) or by avoiding 4 modules (4x8 replaced by 2x16).

So if that fixes the problem, why do you hope that an RMA of the CPU (which is likely not the cause of the problem in the first place) is going to help you out?

You saved roughly $500 over buying an Intel-based X299 system. Do you really need to save an additional $30 on cheap RAM?

Simply don't take the cheapest Hynix- or Micron-based RAM; take Samsung B instead and live happily ever after!

To prove myself wrong: are there any FlareX owners around who can reproduce this problem under default UEFI settings (CPU not overclocked and RAM @ XMP or SPD; n>=1)?

Best,

Flittchen


I'm a little late to respond to this post, but I can confirm that the segmentation faults do happen with default UEFI settings and G.Skill FlareX memory, since that's what I had. After RMA'ing the CPU, the new CPU does not segfault in the same system with the same settings.

If your system is otherwise stable (e.g. can run mprime / prime95 / stressapptest for 12+ hours), but throws segfaults while compiling software, then I recommend contacting AMD support to RMA the chip.

rich0
Developer

Joined: 15 Sep 2002
Posts: 161

Posted: Fri Sep 08, 2017 12:23 am

I've been going back and forth with their RMA dept for the last week and am almost through. If you're looking for a much faster test than the gcc-build scripts try building systemd. I find it fails around 50% of the time or more on this CPU. For good measure I usually build gcc and glibc at the same time, but I'm not sure this is really necessary.

I've also seen MAME suggested online as a useful test package that fails more often than gcc.

Has anybody had any luck getting AMD to ship a replacement BEFORE returning the original one? So far they're offering to ship while my package is in transit.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54028
Location: 56N 3W

Posted: Fri Sep 08, 2017 10:01 am

rich0,

Try telling them you want to reuse their packaging to make sure your part gets to them safely.

HDD companies ship before receipt of your dud against a credit card.
You get 30 days before the card is charged, or your failed unit is returned.

rich0
Developer

Joined: 15 Sep 2002
Posts: 161

Posted: Fri Sep 08, 2017 10:35 am

NeddySeagoon wrote:
Try telling them you want to reuse their packaging to make sure your part gets to them safely.

HDD companies ship before receipt of your dud against a credit card.
You get 30 days before the card is charged, or your failed unit is returned.


You're trying to suggest logic here. We'll see how it goes. It takes a day to get a reply to every email I send, and I just learned that they're shipping their CPUs out of Florida, so I can only imagine how that will go for the next few weeks.

But, sure, I've done this with HDD replacements. I've also done this with Intel the last time I had to RMA a CPU from them. In the case of HDD replacements the return shipping is usually at customer expense (which I think is silly), but they will offer to print you a return label at their rates if you pay them.

If I had to, I'd happily pay the $7 for shipping. What I don't want to do is be without a CPU for a week, or however long it takes them to clean up from the hurricane. That would end up costing me $50+ to buy an AM4 CPU to use in the interim. I'm not about to go replacing my entire motherboard with the old one, and then rebuilding all my packages for its instruction set in advance.

thigobr
n00b

Joined: 31 Aug 2007
Posts: 31

Posted: Mon Sep 18, 2017 7:35 pm

It took a few weeks to receive the replacement, but I can say that the new one is "segfault free"! I tried every voltage setting with the old CPU (batch 1707PGT) but none of them worked. The new CPU (batch 1730SUS) is really good and can even run memory at 3200MHz at a lower VSOC voltage.
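
A rough sketch for comparing batch codes like these. It assumes the first four digits are a two-digit year followed by a two-digit week number (the "1725" stamp earlier in the thread was read that way, but AMD's exact scheme is not confirmed here). `parse_batch` and `weeks_apart` are illustrative names, and the 52-weeks-per-year figure is an approximation.

```python
def parse_batch(code):
    """Split a code such as '1707PGT' into (year, week), assuming a YYWW prefix."""
    year = 2000 + int(code[:2])
    week = int(code[2:4])
    return year, week


def weeks_apart(old_code, new_code):
    """Approximate number of weeks between two batch codes (52 weeks per year)."""
    old_year, old_week = parse_batch(old_code)
    new_year, new_week = parse_batch(new_code)
    return (new_year - old_year) * 52 + (new_week - old_week)


print(parse_batch("1707PGT"))             # (2017, 7)
print(weeks_apart("1707PGT", "1730SUS"))  # 23
```

Under that assumption, the replacement was made about 23 weeks after the original part.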

rich0
Developer

Joined: 15 Sep 2002
Posts: 161

Posted: Mon Sep 18, 2017 7:51 pm

thigobr wrote:
It took a few weeks to receive the replacement, but I can say that the new one is "segfault free"! I tried every voltage setting with the old CPU (batch 1707PGT) but none of them worked. The new CPU (batch 1730SUS) is really good and can even run memory at 3200MHz at a lower VSOC voltage.


I literally got my replacement today. The stamp indicates that it was made after the defect was fixed, and so far it works fine in testing. It took a week from when they said they would ship it - looks like mine came direct from Canada (and properly done - not like the cheap Amazon stuff that gets stamped gift).

Page 10 of 11. All times are GMT.