Author |
Message |
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Thu May 26, 2016 10:58 pm Post subject: |
|
|
Neddy, Tony ...
Where to begin ... I seriously doubt it's hardware. I have a friend with the exact same machine and we have swapped out HDs, power supplies, and batteries, and for a week I was using his machine and he mine (with a different OS). His machine also crashed similarly (with the same kernel). I also have another machine with similar hardware, and this likewise locks up (same kernel, mostly same config). If it were hardware then I would expect OS X to trigger it, and it doesn't. Also, both this laptop and the secondary laptop seem to lock up for no particular reason, i.e., this morning I booted the secondary laptop and after idling at the (console) login prompt for about 5 minutes the machine was completely locked and required a hard reset. This is a 3.12.58 kernel, but I've had similar issues with 3.14.x, 3.18.x, 4.4.x, and 4.5.x (all recent kernels from these series). As I said, I've lost count of the number of kernels I've built (or attempted to build) in the past year or more, and none of them has been what I would call stable, i.e., they crash, and/or lock the machine up, or are for other reasons unusable (e.g., not able to provide a framebuffer).
As for other obvious issues, the machine has been completely dismantled and cleaned, heat sink paste replaced (w/ Arctic Silver), memtest run successfully, HD swapped, badblocks run, the entire system rebuilt. The only common factor throughout is the kernel ... 3.13.11 is fine, anything released more recently (including recent 3.12.x) will hang (eventually ... normally within the week).
I could hypothesise why this is the case: Linux development is directed at those who are either paying developers, or those with the money to buy new machines (and so are providing capital to pay developers) ... everything else exists in the mythological realm of "supported hardware" and if it works then good for you (but otherwise, too bad).
best ... khay |
|
|
|
pilla Bodhisattva
Joined: 07 Aug 2002 Posts: 7729 Location: Underworld
|
Posted: Thu May 26, 2016 11:00 pm Post subject: |
|
|
We had a bunch of bad capacitors back when I was in grad school, circa 2003-2004. Their caps started to pop out as they got bloated by problems in the composition of their internals.
They made some nice shooting sounds when the caps hit the computer's case. _________________ "I'm just very selective about the reality I choose to accept." -- Calvin |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Fri May 27, 2016 12:12 am Post subject: |
|
|
pilla wrote: | We had a bunch of bad capacitors back when I was in grad school, circa 2003-2004. Their caps started to pop out as they got bloated by problems in the composition of their internals.
They made some nice shooting sounds when the caps hit the computer's case. |
I've heard of those exploding caps. Luckily I've only experienced bulging caps. |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Fri May 27, 2016 12:16 am Post subject: |
|
|
Khay, the only one I had to block was gentoo-sources-4.2.0
These are all AMD machines. What's your CPU & mobo & PSU? It doesn't hurt to check. |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Fri May 27, 2016 1:14 pm Post subject: |
|
|
Tony0945 wrote: | These are all AMD machines. What's your CPU & mobo & PSU? It doesn't hurt to check. |
Tony ... both machines are x86 and both have a Core Duo (not to be confused with Core 2 Duo), so Intel T2500 and T2400 processors. The motherboards are Apple and Dell, similarly with the PSUs. The Dell had been running win7 until about a month ago (when it was given to me) and had no issues, and the PSU is practically new. On the Apple I'd swapped the PSU, and the one in use is approximately 16 months old (and an official Apple unit).
If it were hardware, then when the entire machine (sans HD/OS) was swapped you would not expect the issue to be transferred (unless it were software). The HD was also swapped out at one point, and were it the HD then why would it similarly affect the Dell (which has a brand new HD)? All this points to the OS, and is due to the fact that this is "old" hardware and so doesn't receive anything like the level of focus that x64 receives. I know for a fact that there are numerous kernels in the tree (most of sys-kernel/tuxonice-sources) that do not compile on x86, and that the patch isn't tested on x86 by upstream (the most recent patch(es) having either the same issue, or a fix which causes TOI to fail suspending).
The irony in all this is that the Dell was given to me as a solution to my problem, but once linux was installed the machine now shows the same symptoms ... the friend who gave it to me can't help but laugh as both OS X and win7 run fine on the macbook (as this was the case when we swapped machines) and win7 runs fine on the Dell ... but linux, no.
best ... khay |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54216 Location: 56N 3W
|
Posted: Fri May 27, 2016 3:43 pm Post subject: |
|
|
khayyam,
It looks like you have done some thorough systems level checks, the result of which is to confirm a systems level issue.
You have swapped some bits around, which at first sight appears to have eliminated those bits, as the problem did not go with the swapped parts.
I'm still of the opinion that it's a hardware systems level tolerancing problem aggravated by the software.
More cynically, there is a bug in kernel 3.13.11 that allows it to work, rather than a bug in later kernels. We don't know that the cause is the kernel, just that there appears to be some correlation.
On the transient PSU issue. Here's something to try, if you haven't already.
Try using either the powersave or the performance governor. The idea is to minimise transient stresses on the PSU chain.
Powersave runs the CPU at minimum frequency all the time and performance runs it at max all the time. Neither will eliminate transients but both will reduce them.
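For anyone wanting to try this, here is a minimal sketch of switching governors through the cpufreq sysfs files. It assumes the kernel has cpufreq support and the chosen governor built in; the function names and the optional second argument (an alternative root directory, only useful for dry runs against a scratch tree) are made up for illustration.

```shell
# Write a governor name to every CPU's scaling_governor file.
# $1 = governor (e.g. powersave, performance)
# $2 = optional root directory; defaults to the real sysfs location.
set_governor() {
    governor=$1
    cpu_root=${2:-/sys/devices/system/cpu}
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue          # no cpufreq files: nothing to do
        printf '%s\n' "$governor" > "$f" || return 1
    done
}

# Print the governor currently selected for each CPU.
show_governor() {
    cpu_root=${1:-/sys/devices/system/cpu}
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] && printf '%s: %s\n' "$f" "$(cat "$f")"
    done
    return 0
}
```

Run as root, something like `set_governor powersave` followed by `show_governor` should confirm the change; the setting does not survive a reboot unless the init scripts reapply it.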
If you can put up with the performance hit, powersave will be the best, as it will also increase your static PSU load headroom. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Fri May 27, 2016 5:27 pm Post subject: |
|
|
NeddySeagoon wrote: | I'm still of the opinion that it's a hardware systems level tolerancing problem aggravated by the software. |
Neddy ... a problem which affects both the Apple and Dell, and which occurs on two macbooks of the exact same type but doesn't affect that same machine if running Mac OS X or win7? Occam's razor ...
NeddySeagoon wrote: | On the transient PSU issue. Here's something to try, if you haven't already.
Try using either the powersave or the performance governor. The idea is to minimise transient stresses on the PSU chain. |
I'm already using 'ondemand' on both machines, so the CPU is at 1000MHz unless some CPU-intensive task occurs, and as I noted above the lock-up has invariably occurred when the machines are idle.
best ... khay |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54216 Location: 56N 3W
|
Posted: Fri May 27, 2016 5:49 pm Post subject: |
|
|
khayyam,
Ahh, I missed the "when idle". |
|
|
|
proteusx Guru
Joined: 21 Jan 2008 Posts: 338
|
Posted: Fri May 27, 2016 7:38 pm Post subject: |
|
|
khayyam,
One of my 32 bit machines is this old (2003) Athlon:
Code: | Portage 2.3.0_rc1 (python 3.4.3-final-0, default/linux/x86/13.0, gcc-4.9.3, glibc-2.22-r4, 4.0.5-gentoo-amd i686)
=================================================================
System uname: Linux-4.0.5-gentoo-amd-i686-AMD_Athlon-tm-_64_Processor_3200+-with-gentoo-2.2
KiB Mem: 3113928 total, 2524820 free |
The kernel it is running at the moment must be really crappy (not even in the tree).
It is on 24/7 and never suffered any lock ups.
I rarely maintain it but I use only stable kernels.
If your problem is not with the hardware I would have a closer look at your software configuration
Last edited by proteusx on Sat May 28, 2016 1:08 am; edited 1 time in total |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Sat May 28, 2016 12:40 am Post subject: |
|
|
Khay, I am assuming that you are running gentoo-sources and configuring by hand. Try a build with vanilla-sources or vice versa if my assumption is wrong.
Have you tried emerging with the "experimental" use flag and selecting march=native? Which gives me another thought, which gcc are you running because some versions have trouble with older CPUs. My K6-3 is still running well but gcc is no longer tested against it. Some years ago (3.x series) I was sending them reports and I always got a thank you e-mail that thanked me for testing as no one else was testing against k6. I'll have to fire up my k6 machine and tell you the kernel version and gcc version.
I was about to ask about memory but going into outer space while idling sounds more like an illegal instruction which relates back to the gcc parameters the kernel was built with. |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Sat May 28, 2016 3:38 am Post subject: |
|
|
Tony0945 wrote: | Khay, I am assuming that you are running gentoo-sources and configuring by hand. Try a build with vanilla-sources or vice versa if my assumption is wrong. |
Tony0945 ... I use a self-maintained ck-sources to which I apply the tuxonice patch. Those are not the problem, because in the course of my testing I've reverted these patches, and with only the patches from genpatches applied I've had the same issue.
Tony0945 wrote: | Have you tried emerging with the "experimental" use flag and selecting march=native? |
This patch is only available with > 3.14.x (as I remember), and yes, I've selected -march=native when available (for the core duo this sets -march=pentium-m).
Tony0945 wrote: | Which gives me another thought, which gcc are you running because some versions have trouble with older CPUs. |
Currently 4.9.3, but I seriously doubt this is the cause.
Tony0945 wrote: | I was about to ask about memory but going into outer space while idling sounds more like an illegal instruction which relates back to the gcc parameters the kernel was built with. |
echo $(gcc -v -march=native -x c /dev/null |& grep /dev/null | egrep -o -- '-+(m|param )\S+'): | -march=pentium-m -mmmx -mno-3dnow -msse -msse2 -msse3 -mno-ssse3 -mno-sse4a -mno-cx16 -mno-sahf -mno-movbe -mno-aes -mno-sha -mno-pclmul -mno-popcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mno-avx -mno-avx2 -mno-sse4.2 -mno-sse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mno-xsave -mno-xsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=2048 -mtune=generic |
... that would be a common factor between both machines but I'm more inclined toward it being a kernel issue.
proteusx wrote: | If your problem is not with the hardware I would have a closer look at your software configuration |
proteusx ... I have, and the most probable cause is the kernel (at least those greater than 3.13.11). If it were somewhere else in the stack it would also occur with 3.13.11.
thanks & best ... khay |
|
|
|
proteusx Guru
Joined: 21 Jan 2008 Posts: 338
|
Posted: Sat May 28, 2016 2:44 pm Post subject: |
|
|
khayyam wrote: | Code: | egrep -o -- '-+(m|param )\S+' |
|
I am curious as to the meaning of the '--' in your egrep.
I could not find anything about it in the documentation of grep.
What does it do?
I would have done it like this: Code: | egrep -o '\-+(m|param )\S+' |
|
|
|
|
pilla Bodhisattva
Joined: 07 Aug 2002 Posts: 7729 Location: Underworld
|
Posted: Sat May 28, 2016 3:11 pm Post subject: |
|
|
proteusx wrote: | khayyam wrote: | Code: | egrep -o -- '-+(m|param )\S+' |
|
I am curious as to the meaning of the '--' in your egrep.
I could not find anything about it in the documentation of grep.
What does it do?
I would have done it like this: Code: | egrep -o '\-+(m|param )\S+' |
|
It is common in utilities, and it is used to prevent arguments beginning with '-' that come after the '--' from being interpreted as option flags.
For example, '-1' is a parameter flag '1', while '-- -1' is interpreted as negative one. |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Sat May 28, 2016 3:25 pm Post subject: |
|
|
proteusx wrote: | I am curious as to the meaning of the '--' in your egrep. |
proteusx ... to clarify what pilla says above ... it basically says "stop processing options", or "everything that follows isn't an option".
man bash wrote: | -- A -- signals the end of options and disables further option processing. Any arguments after the -- are treated as filenames and arguments. An argument of - is equivalent to --. |
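A minimal sketch of the difference, assuming a grep that supports -o (GNU and BSD grep both do). The sample flag string and the function names are made up for illustration, and [^ ] stands in for \S to stay portable.

```shell
# Hypothetical sample standing in for the compiler's flag output.
flags='-march=pentium-m -mmmx --param l1-cache-size=32 -O2'

# The pattern starts with "-", so without "--" grep would try to parse
# it as a bundle of options. "--" ends option processing, so the next
# argument is taken as the pattern.
extract_flags() {
    printf '%s\n' "$1" | grep -E -o -- '-+(m|param )[^ ]+'
}

# Alternative: -e explicitly marks the next argument as the pattern.
extract_flags_e() {
    printf '%s\n' "$1" | grep -E -o -e '-+(m|param )[^ ]+'
}

extract_flags "$flags"
```

Both functions should print the same matches, one per line. Escaping the leading dash, as in proteusx's version, also works because the argument then no longer begins with "-".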
best ... khay |
|
|
|
proteusx Guru
Joined: 21 Jan 2008 Posts: 338
|
Posted: Sat May 28, 2016 3:38 pm Post subject: |
|
|
@pilla and khayyam.
Thank you for the explanation. |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Sat May 28, 2016 7:56 pm Post subject: |
|
|
Well well. I find I'm actually running 3.18.12 on the old AT box. I'm emerging gentoo-sources-4.4.6 as we speak. There may be a problem with x86 on some of the kernels.
If you can find the sources of 3.18.12 try building without the ck-sources patches using this config https://bpaste.net/show/09857f6af085
It's for a k6-3 so you will have to change it to your Intel processor and change the frequency controller as well as your other hardware changes. I see this is the only box without the "experimental" use flag. That will be remedied with 4.4.6 |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54216 Location: 56N 3W
|
Posted: Sat May 28, 2016 8:01 pm Post subject: |
|
|
Tony0945,
The sources will be on kernel.org and genpatches will be on dev.gentoo.org somewhere.
I'll find them if you want them. |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Sat May 28, 2016 8:47 pm Post subject: |
|
|
Tony ... I appreciate your trying to help but as I said, it's not the BFS/ck patch that is at issue, and 3.18.12 will have other issues (specifically inteldrmfb). I've been around the block and back in terms of the config, patches applied, etc, but I haven't been able to isolate it (other than to kernels after 3.13.11). The only thing I haven't as yet considered is BFQ (which seems to be a likely candidate from what I've been reading). So, when I find time I'll probably try a more recent kernel and the new BFQ patches.
best ... khay |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Sat May 28, 2016 10:31 pm Post subject: |
|
|
NeddySeagoon wrote: | Tony0945,
The sources will be on kernel.org and genpatches will be on dev.gentoo.org somewhere.
I'll find them if you want them. |
Thanks, but I have them. I wanted Khayyam to try them. |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Mon May 30, 2016 12:13 am Post subject: |
|
|
Installed gentoo-sources 4.4.6 on my k6 box. Took all the defaults for oldconfig except for some that defaulted to Y or M for hardware. This is an AT box. The hardware isn't going to change. I also see I still have udev, might as well convert to mdev or static dev for the same reason. Wasted a few hours trying to fix booting into my 32bit faux k6 Gentoo on my Phenom II box. Wound up building everything native. Kernel took 6 hours and seven minutes (gcc 4.8.5). Rebooted into the new kernel OK and running emerge -auvND @system. It's about 16 hours into 89 of 90 emerges (udev-225). I don't run X on that box. No crashes but you said yours crashed while idling.
EDIT: Total time for emerge -uvND @system was 16 hours 39 minutes
Last edited by Tony0945 on Mon May 30, 2016 1:02 pm; edited 1 time in total |
|
|
|
dol-sen Retired Dev
Joined: 30 Jun 2002 Posts: 2805 Location: Richmond, BC, Canada
|
Posted: Mon May 30, 2016 6:25 am Post subject: |
|
|
Khayyam, I too had intermittent lockup issues on my new (to me, used) workstation. I did a number of things, including fixing a few slightly bent CPU socket tabs, which fixed a few correctable RAM errors at bootup (ECC mem), but I still had the occasional lockup. So I disabled xscreensaver and haven't had another lockup in months. The screen still blanks, but never runs those screensaver graphics and images. So, disable the screensavers, it is a likely cause of your trouble.
BTW, it would lock up on me when idle but not always when it was in screensaver mode. _________________ Brian
Porthole, the Portage GUI frontend irc@freenode: #gentoo-guis, #porthole, Blog
layman, gentoolkit, CoreBuilder, esearch... |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Mon May 30, 2016 2:52 pm Post subject: |
|
|
dol-sen wrote: | So, disable the screensavers, it is a likely cause of your trouble. |
dol-sen ... I'm not using xscreensaver, and as I mentioned above I've had it lock up when idle at getty/login (and so console).
Tony0945 wrote: | No crashes but you said yours crashed while idling. |
Sometimes, yes, but not always.
I had a little time to look at the BFQ patches; they are basically the same as those from 3.13.11, which suggests this isn't the cause. I'm building 3.12.60 right now, and will probably build 4.4.11. As I've had various issues with inteldrmfb/KMS over the course of this past year (including a number of segfaults with ... I forget ... 4.4.x or 4.5.x), I wonder if this isn't related.
best ... khay |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Mon May 30, 2016 8:26 pm Post subject: |
|
|
Finished emerge -uvND @world, an additional 3 hours 30 minutes. Rebooted. Can't log in with SSH and have no terminal. Something's wrong. I get the prompt, give the password, just get the prompt again. Is this similar to your problem? |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Mon May 30, 2016 8:53 pm Post subject: |
|
|
Tony0945 wrote: | Finished emerge -uvND @world, an additional 3 hours 30 minutes. Rebooted. Can't log in with SSH and have no terminal. Something's wrong. I get the prompt, give the password, just get the prompt again. Is this similar to your problem? |
Tony ... no, that sounds like you've updated openssh and are affected by the setting of 'PermitRootLogin no' (perhaps you didn't look closely at the changes made to /etc/ssh/sshd_config, or have eu_automerge="yes" in etc-update.conf). If not then you should pass -v, -vv, or -vvv which should provide more info.
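A quick way to check the setting, as a rough sketch: the function name is made up, and it only looks at uncommented PermitRootLogin lines in a single file, so it does not account for Match blocks or anything else sshd takes into consideration.

```shell
# Print the uncommented PermitRootLogin line(s) from an sshd config file.
# $1 = optional path; defaults to the usual /etc/ssh/sshd_config.
root_login_setting() {
    conf=${1:-/etc/ssh/sshd_config}
    # Lines starting with "#" do not match, so commented-out defaults
    # are skipped; the keyword is matched case-insensitively.
    grep -E -i -- '^[[:space:]]*PermitRootLogin[[:space:]]' "$conf"
}
```

No output means the keyword isn't set explicitly and the compiled-in default applies. From the client side, something like `ssh -vvv root@host` (with "host" standing in for the machine's name) shows which authentication methods are offered and where the login is rejected.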
best ... khay |
|
|
|
Tony0945 Watchman
Joined: 25 Jul 2006 Posts: 5127 Location: Illinois, USA
|
Posted: Mon May 30, 2016 9:40 pm Post subject: |
|
|
khayyam wrote: | Tony ... no, that sounds like you've updated openssh and are affected by the setting of 'PermitRootLogin no' (perhaps you didn't look closely at the changes made to /etc/ssh/sshd_config, or have eu_automerge="yes" in etc-update.conf). If not then you should pass -v, -vv, or -vvv which should provide more info.
best ... khay |
I remember having that problem on other machines, but they had a terminal. I don't think I updated sshd_config but perhaps I hit the wrong key. Now I have to decide which is easier.
1. Unplug a terminal and carry it to the machine.
2. Carry the tower to another location that has a terminal.
3. Remove the hard drive, carry it to another machine that has an IDE port and mount the drive on /mnt/gentoo to manually edit the sshd config files.
Booting with a CD is no help because the machine is headless. (Chinese capacitors killed the monitor) |
|
|
|
|