El_Goretto Moderator
Joined: 29 May 2004 Posts: 3169 Location: Paris
|
Posted: Thu Jul 20, 2017 12:03 pm Post subject: |
|
|
NeddySeagoon wrote: | El_Goretto,
Maybe that indicates that there are several causes of segfaults in Ryzen systems ? |
Maybe. I'm not sure we can conclude one way or the other.
Given how random the segfaults are, we use test cases that raise the chances of provoking them (or "one of them", if I follow your idea).
Compiling mesa is a kind of "brute force"/statistical test case: it draws a lot of power, does a lot of computation, and is likely to provoke a failure on any unstable system in general.
In contrast, the hayamdk/ryzen_segv_test tool seems to focus on one very specific hypothetical root cause, and I'm not skilled enough to understand what its author is trying to do and why.
The only thing I can say: compiling mesa can draw more than 200W (total at the plug), whereas ryzen_segv_test draws around 150W. _________________ -TrueNAS & jails: µ-serv Gen8 E3-1260L, 16Go ECC + µ-serv N40L, 10Go ECC
-Réseau: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT) |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Thu Jul 20, 2017 2:24 pm Post subject: |
|
|
mir3x,
What exact compiler version are you running, and would you please post your complete CFLAGS? |
|
|
|
mir3x Guru
Joined: 02 Jun 2012 Posts: 455
|
Posted: Thu Jul 20, 2017 3:18 pm Post subject: |
|
|
I just used what the wiki said and only added -pipe:
CFLAGS="-march=znver1 -O2 -pipe"
for gcc 6.3.
I updated the BIOS from within the BIOS itself, and it updated to some version from June. I haven't seen any AGESA version.
I'm using the newest kernel, 4.12.2, and have modified almost nothing.
That's cool:
qlop -t libreoffice:
libreoffice: 1716 seconds average for 1 merges
(I last checked a few years ago on a Core 2 Duo and it took over 12h, so I gave up on LibreOffice then.)
I'll switch to ck-sources soon. _________________ Sent from Windows |
|
|
|
soulsource n00b
Joined: 25 Jan 2014 Posts: 26
|
Posted: Thu Jul 20, 2017 5:41 pm Post subject: |
|
|
While an internal compiler error might be just a bug, it can also be a symptom of data corruption. I've seen both on my Ryzen machine, segfaults (probably data corruption that caused the compiler to access data outside of its process memory), and internal compiler errors (probably data corruption that made the compiler access incorrect data within its process memory), while on my old FX-6350 I can't remember ever seeing such issues. |
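To tell the two symptoms apart across many build logs, something like the following could be used. This is only a minimal sketch: the function name `classify_build_log` and the two search patterns are assumptions about typical message wording, not an exhaustive or official list.

```python
import re
from collections import Counter

# Assumed wording of the two symptoms in a build log; adjust to your own logs.
PATTERNS = {
    "internal_compiler_error": re.compile(r"internal compiler error", re.IGNORECASE),
    "segfault": re.compile(r"segmentation fault|segfault", re.IGNORECASE),
}


def classify_build_log(lines):
    """Count how many log lines match each symptom pattern.

    A single line can match both patterns (for example an internal
    compiler error that itself reports a segmentation fault), in which
    case it is counted once under each label.
    """
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Feeding it the lines of a saved build log gives a quick tally per symptom, which makes it easier to compare runs before and after a hardware or settings change.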
|
|
|
Naib Watchman
Joined: 21 May 2004 Posts: 6051 Location: Removed by Neddy
|
|
|
|
mrostu n00b
Joined: 26 Jun 2017 Posts: 2 Location: Moscow
|
Posted: Sun Aug 06, 2017 10:05 pm Post subject: |
|
|
Like many others here, I was premature with my conclusion. Segfaults still occur, but more rarely. Now I can't compile webkit-gtk. |
|
|
|
Naib Watchman
Joined: 21 May 2004 Posts: 6051 Location: Removed by Neddy
|
Posted: Sun Aug 06, 2017 11:35 pm Post subject: |
|
|
Do you have a Ryzen 7 or a Ryzen 5?
Are you using GCC 5, 6 or 7?
What -march do you have?
What is your BIOS AGESA version?
Are your RAM timings set as per your sticks, not what the BIOS thinks they should be?
Is the RAM voltage what is required for DDR4? _________________
Quote: | Removed by Chiitoo |
|
|
|
|
El_Goretto Moderator
Joined: 29 May 2004 Posts: 3169 Location: Paris
|
Posted: Mon Aug 07, 2017 3:35 pm Post subject: |
|
|
Hello, just a quick update: after a couple of weeks interacting with AMD support, they asked me to RMA my CPU. They even offered to test the replacement CPU with the same mobo/RAM setup I have before shipping it (which I gladly accepted). Now I'm waiting.
Anyway, I'm crossing my fingers that my story ends up pretty much like this dude's. _________________ -TrueNAS & jails: µ-serv Gen8 E3-1260L, 16Go ECC + µ-serv N40L, 10Go ECC
-Réseau: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT) |
|
|
|
Flittchen n00b
Joined: 07 Aug 2017 Posts: 1
|
Posted: Mon Aug 07, 2017 4:10 pm Post subject: Don't get it |
|
|
RMA Folks,
Here is what I do not get: there are several people in this thread reporting that they got rid of the segfaults by replacing their RAM with more compatible versions (Samsung B ICs) or by avoiding 4 modules (4x8 replaced by 2x16).
So if that fixes the problem, why do you hope that an RMA of the CPU (which likely is not the cause of the problem in the first place) is going to help you out?
You saved roughly $500 over buying an Intel-based X299 system. Do you really need to save an additional $30 on cheap RAM?
Simply don't take the cheapest Hynix or Micron based RAM but Samsung B instead and live happily ever after!
To prove myself wrong: any FlareX owners around that can reproduce this problem under default UEFI settings (CPU not overclocked and RAM @ XMP or SPD; n>=1)?
Best,
Flittchen |
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3311 Location: Rasi, Finland
|
|
|
|
Naib Watchman
Joined: 21 May 2004 Posts: 6051 Location: Removed by Neddy
|
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3311 Location: Rasi, Finland
|
Posted: Mon Aug 07, 2017 11:00 pm Post subject: |
|
|
Thanks.
For some reason my Firefox doesn't follow the anchor on that link. I found the post anyway. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
|
|
|
|
El_Goretto Moderator
Joined: 29 May 2004 Posts: 3169 Location: Paris
|
Posted: Tue Aug 08, 2017 2:26 pm Post subject: |
|
|
El_Goretto wrote: | Hello, just a quick update: after a couple of weeks interacting with AMD support, they asked me to RMA my CPU. They even offered to test the replacement CPU with the same mobo/RAM setup I have before shipping it (which I gladly accepted). Now I'm waiting.
Anyway, I'm crossing my fingers that my story ends up pretty much like this dude's. |
OK... so I got my 1700X replacement (timestamped "1725", so supposedly made 20 weeks after my defective one), tested it for more than 1 hour, and it didn't segfault once in 30 rounds of mesa compilation (10 minutes was enough to trigger it previously).
I won't be able to test it properly (for at least 12 hours) until I come back from holidays in 2 weeks, but I'm pretty confident.
Anyway, it's good news that you can get a 100% working CPU if you go through the official RMA procedure, now that AMD tech support guys test replacement CPUs before shipping them. _________________ -TrueNAS & jails: µ-serv Gen8 E3-1260L, 16Go ECC + µ-serv N40L, 10Go ECC
-Réseau: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT) |
|
|
|
El_Goretto Moderator
Joined: 29 May 2004 Posts: 3169 Location: Paris
|
Posted: Wed Aug 23, 2017 9:13 am Post subject: |
|
|
El_Goretto wrote: |
OK... so I got my 1700X replacement (timestamped "1725", so supposedly made 20 weeks after my defective one), tested it for more than 1 hour, and it didn't segfault once in 30 rounds of mesa compilation (10 minutes was enough to trigger it previously).
I won't be able to test it properly (for at least 12 hours) until I come back from holidays in 2 weeks, but I'm pretty confident. |
Quick update: I was able to "serial-compile" Mesa for a whole night, and after more than 200 iterations, no segfault.
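For anyone wanting to repeat this kind of "serial compile" test, a loop along the following lines would do. It is only a sketch under assumptions: the helper name `run_until_failure` is made up for illustration, and the build command you pass in (for example `["emerge", "--oneshot", "media-libs/mesa"]`) is just one possible choice, not necessarily what was used here.

```python
import subprocess
import sys


def run_until_failure(command, max_runs, log=sys.stderr):
    """Run `command` up to `max_runs` times and stop at the first failure.

    Returns the 1-based number of the first failing run, or None if every
    run exited with status 0. A process killed by a signal (such as a
    segfault) has a negative return code, so it counts as a failure too.
    """
    for run in range(1, max_runs + 1):
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if completed.returncode != 0:
            print(f"run {run}/{max_runs} failed with exit code "
                  f"{completed.returncode}", file=log)
            return run
    print(f"all {max_runs} runs succeeded", file=log)
    return None


# Hypothetical usage for an overnight test (needs root for emerge):
# run_until_failure(["emerge", "--oneshot", "media-libs/mesa"], 200)
```

Build output is discarded here to keep the loop simple; in practice you would probably redirect it to a per-run log file so the failing run can be inspected afterwards.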
So RMA is the solution, and there definitely was (glass-half-full version) a manufacturing issue that AMD QA was unaware of (the glass-half-empty version is already written all over the AMD forum by angry customers, whom I understand too). _________________ -TrueNAS & jails: µ-serv Gen8 E3-1260L, 16Go ECC + µ-serv N40L, 10Go ECC
-Réseau: APU2C4 (OpenWRT) + GS726Tv3 + 2x GS108Tv2 + Archer C5v1 (OpenWRT)
Last edited by El_Goretto on Wed Aug 23, 2017 4:43 pm; edited 2 times in total |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54099 Location: 56N 3W
|
Posted: Wed Aug 23, 2017 9:45 am Post subject: |
|
|
El_Goretto,
The cynic in me says that the manufacturing issue is still there but AMD keeps a pool of screened devices to deal with RMAs.
It will be interesting to see how Threadripper gets on in the real world, since it's two Ryzen dies in the same package.
It must be a big problem, getting 185A (approx) into and out of that package.
No wonder it needs all those pins. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3311 Location: Rasi, Finland
|
Posted: Thu Aug 24, 2017 10:55 am Post subject: |
|
|
NeddySeagoon wrote: | It must be a big problem, getting 185A (approx) into and out of that package. | Now that you said it... At the voltages current CPUs run at, it's a huge current. I never really gave it a thought. Now it makes much more sense why those VRMs tend to get so hot. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
|
|
|
|
Naib Watchman
Joined: 21 May 2004 Posts: 6051 Location: Removed by Neddy
|
Posted: Thu Aug 24, 2017 11:07 am Post subject: |
|
|
The Threadripper has a TDP of 180W and a core voltage of 1.3V.
Typical loaded current draw is thus 138A. It could easily peak that high _________________
Quote: | Removed by Chiitoo |
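The arithmetic behind the 138A figure is simply I = P / V. A quick sketch, treating the 180W TDP as if it were the electrical power at the 1.3V core rail (a simplification, since TDP is a thermal design figure and not a measured draw); the function name is made up for illustration:

```python
def supply_current_amps(power_watts, core_voltage):
    """Return the current I = P / V needed to deliver power_watts at core_voltage."""
    return power_watts / core_voltage


# Figures from the post above: 180 W TDP at a 1.3 V core voltage.
current = supply_current_amps(180.0, 1.3)
print(f"{current:.1f} A")  # prints "138.5 A"
```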
|
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54099 Location: 56N 3W
|
Posted: Thu Aug 24, 2017 11:21 am Post subject: |
|
|
Zucca,
The power loss is I^2R. (That's your heating)
The voltage droop is IR. (Plays havoc with the dynamic regulation)
I would like to think that the VRM is divided into several zones and there is voltage feedback from the silicon inside the package, so that's the regulation point, not the PCB power planes on the motherboard. Then you have to live with the IR loss due to the package.
As Naib says, it's 138A peak (each way).
It's easy to see how you get a few mV of ripple on the silicon that takes it out of spec, especially when the load changes from not a lot to 138A.
It's left as an exercise for the reader to determine R for a 10mV change in operating voltage at the silicon. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
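One way to work the exercise above, as a sketch: with the droop given by V = IR, the resistance for a 10mV change at 138A is R = V / I, and the heat dissipated in that resistance is I^2R. The function names are made up for illustration, and the whole thing assumes a single lumped resistance carrying the full current.

```python
def resistance_for_droop(delta_v, current):
    """Resistance R = dV / I that produces a droop of delta_v volts at `current` amps."""
    return delta_v / current


def resistive_loss_watts(current, resistance):
    """Power dissipated in the resistance, P = I^2 * R."""
    return current ** 2 * resistance


# Figures from the posts above: 10 mV droop at 138 A.
r = resistance_for_droop(0.010, 138.0)
print(f"R = {r * 1e6:.1f} micro-ohm")                    # prints "R = 72.5 micro-ohm"
print(f"loss = {resistive_loss_watts(138.0, r):.2f} W")  # prints "loss = 1.38 W"
```

So only about 72 micro-ohms of path resistance is enough to shift the voltage at the silicon by 10mV at that current.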
|
|
|
Naib Watchman
Joined: 21 May 2004 Posts: 6051 Location: Removed by Neddy
|
Posted: Thu Aug 24, 2017 1:19 pm Post subject: |
|
|
The thing about TDP is that it is an arbitrary operating point meant to mimic real-world usage. It isn't light loading and it isn't full load... How close the stated TDP is to the continuously loaded thermal power is why some say it is a useless measure... It's a measure nonetheless.
It wouldn't surprise me if 180A is pulled by the CPU while using Gentoo and, say, compiling chromium with -j32 (or whatever it is) _________________
Quote: | Removed by Chiitoo |
|
|
|
|
mark_lagace Tux's lil' helper
Joined: 19 Nov 2002 Posts: 77 Location: Ottawa, Canada
|
Posted: Mon Aug 28, 2017 1:15 am Post subject: Re: Don't get it |
|
|
Flittchen wrote: | RMA Folks,
Here is what I do not get: there are several people in this thread reporting that they got rid of the segfaults by replacing their RAM with more compatible versions (Samsung B ICs) or by avoiding 4 modules (4x8 replaced by 2x16).
So if that fixes the problem, why do you hope that an RMA of the CPU (which likely is not the cause of the problem in the first place) is going to help you out?
You saved roughly $500 over buying an Intel-based X299 system. Do you really need to save an additional $30 on cheap RAM?
Simply don't take the cheapest Hynix or Micron based RAM but Samsung B instead and live happily ever after!
To prove myself wrong: any FlareX owners around that can reproduce this problem under default UEFI settings (CPU not overclocked and RAM @ XMP or SPD; n>=1)?
Best,
Flittchen |
I'm a little late to respond to this post, but I can confirm that the segmentation faults do happen with default UEFI settings and G.Skill FlareX memory, since that's what I had. After RMA'ing the CPU, the new CPU does not segfault in the same system with the same settings.
If your system is otherwise stable (e.g. can run mprime / prime95 / stressapptest for 12+ hours), but throws segfaults while compiling software, then I recommend contacting AMD support to RMA the chip. |
|
|
|
rich0 Developer
Joined: 15 Sep 2002 Posts: 161
|
Posted: Fri Sep 08, 2017 12:23 am Post subject: |
|
|
I've been going back and forth with their RMA dept for the last week and am almost through. If you're looking for a much faster test than the gcc-build scripts try building systemd. I find it fails around 50% of the time or more on this CPU. For good measure I usually build gcc and glibc at the same time, but I'm not sure this is really necessary.
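To put the "fails around 50% of the time" figure in perspective: if each build fails independently with probability p on a bad CPU, the chance of seeing at least one failure in n builds is 1 - (1 - p)^n. The sketch below assumes that independence (which real builds may not satisfy) and uses made-up function names.

```python
import math


def detection_probability(p_fail, runs):
    """Chance of at least one failure in `runs` independent builds."""
    return 1.0 - (1.0 - p_fail) ** runs


def runs_needed(p_fail, confidence):
    """Smallest number of builds giving at least `confidence` chance of a failure.

    Requires 0 < p_fail < 1 and 0 < confidence < 1.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


# With a 50% per-build failure rate, 7 builds already give better than
# a 99% chance of hitting the problem at least once.
print(runs_needed(0.5, 0.99))                    # prints 7
print(f"{detection_probability(0.5, 7):.4f}")    # prints "0.9922"
```

The flip side is that a handful of clean builds on a replacement CPU is reasonably strong evidence only if the test really does fail that often on the defective part.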
I've also seen MAME suggested online as a useful test package that fails more often than gcc.
Has anybody had any luck getting AMD to ship a replacement BEFORE returning the original one? So far they're offering to ship while my package is in transit. |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54099 Location: 56N 3W
|
Posted: Fri Sep 08, 2017 10:01 am Post subject: |
|
|
rich0,
Try telling them you want to reuse their packaging to make sure your part gets to them safely.
HDD companies ship before receipt of your dud against a credit card.
You get 30 days before the card is charged, or your failed unit is returned. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
rich0 Developer
Joined: 15 Sep 2002 Posts: 161
|
Posted: Fri Sep 08, 2017 10:35 am Post subject: |
|
|
NeddySeagoon wrote: | Try telling them you want to reuse their packaging to make sure your part gets to them safely.
HDD companies ship before receipt of your dud against a credit card.
You get 30 days before the card is charged, or your failed unit is returned. |
You're trying to suggest logic here. We'll see how it goes. It takes a day to get a reply to every email I send, and I just learned that they're shipping their CPUs out of Florida, so I can only imagine how that will go for the next few weeks.
But, sure, I've done this with HDD replacements. I've also done this with Intel the last time I had to RMA a CPU from them. In the case of HDD replacements the return shipping is usually at customer expense (which I think is silly), but they will offer to print you a return label at their rates if you pay them.
If I had to, I'd happily pay the $7 for the shipping. What I don't want to do is be without a CPU for a week, or however long it takes them to clean up from the hurricane. That would end up costing me $50+ to buy an AM4 CPU to use in the interim. I'm not about to go replacing my entire motherboard with the old one, and then rebuilding all my packages for its instruction set in advance. |
|
|
|
thigobr n00b
Joined: 31 Aug 2007 Posts: 31
|
Posted: Mon Sep 18, 2017 7:35 pm Post subject: |
|
|
It took a few weeks to receive the replacement, but I can say that the new one is "segfault free"! I tried every voltage setting with the old CPU (batch 1707PGT) but none of them worked. The new CPU (batch 1730SUS) is really good and can even run memory at 3200MHz at a lower VSOC voltage. |
|
|
|
rich0 Developer
Joined: 15 Sep 2002 Posts: 161
|
Posted: Mon Sep 18, 2017 7:51 pm Post subject: |
|
|
thigobr wrote: | It took a few weeks to receive the replacement, but I can say that the new one is "segfault free"! I tried every voltage setting with the old CPU (batch 1707PGT) but none of them worked. The new CPU (batch 1730SUS) is really good and can even run memory at 3200MHz at a lower VSOC voltage. |
I literally got my replacement today. The stamp indicates that it was made after the defect was fixed, and so far it works fine in testing. It took a week from when they said they would ship it - looks like mine came direct from Canada (and properly done - not like the cheap Amazon stuff that gets stamped gift). |
|
|
|
|