Author |
Message |
MarkCu n00b
Joined: 28 Nov 2012 Posts: 19
|
Posted: Fri Nov 30, 2012 5:38 pm Post subject: System instabilities - General Protection faults |
|
|
I've got a new homebuilt system based on an ASUS Mini-ITX motherboard. Nothing else has been installed on it (OS wise). The system is intended as a headless backup server: RAID/SSH/rsync/NFS.
The install went smoothly and things seemed to work fine, until I started using the system for its intended purpose, doing backups. Then I started getting frequent crashes.
Narrowing down / searching these forums, I see most folks with similar issues end up finding hardware issues. So, to start I focus there.
Memtest86+ runs for 36 hours. No issues.
CpuBurn for 4 hours. No issues.
Check cooling - looks ok - CPU temp never goes above 65 C.
Underclock the system by 5% (both CPU and memory). Failure mode doesn't seem to change.
Pull 1 DIMM (I have two 4 GB DIMMs). No change.
Swap, using the other DIMM. No change.
So, I'm thinking HW looks fairly reasonable.
So I focus more on SW. Trying to narrow my scope, I can usually get a crash just by doing a dd on the server itself:
dd if=/dev/md127 of=/dev/null
The failures aren't identical, but they look similar to the one below:
Code: |
[35101.826124] general protection fault: 0000 [#1] SMP
[35101.826153] CPU 0
[35101.826161] Modules linked in: k10temp
[35101.826179]
[35101.826190] Pid: 568, comm: kswapd0 Not tainted 3.4.9-gentoo #9 System manufacturer System Product Name/C60M1-I
[35101.826224] RIP: 0010:[<ffffffff81145fb8>] [<ffffffff81145fb8>] drop_buffers+0x28/0xb0
[35101.826260] RSP: 0018:ffff880234eff9a0 EFLAGS: 00010206
[35101.826276] RAX: 0000000000000000 RBX: ffffea00048dab40 RCX: 0000000000000000
[35101.826295] RDX: 0000000000000000 RSI: ffff880234eff9d8 RDI: ffbf8801332bdf08
[35101.826314] RBP: ffff880234eff9c0 R08: dead000000200200 R09: dead000000100100
[35101.826333] R10: ffff880234effbb8 R11: ffff880234effbc0 R12: ffff8802365e55a0
[35101.826352] R13: ffff8801332bdf08 R14: ffff880234eff9d8 R15: 0000000000000001
[35101.826373] FS: 00007fb00bc26700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[35101.826395] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[35101.826411] CR2: 000000000065dd00 CR3: 000000022b4fd000 CR4: 00000000000007f0
[35101.826464] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[35101.826516] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[35101.826569] Process kswapd0 (pid: 568, threadinfo ffff880234efe000, task ffff88023597be70)
[35101.826655] Stack:
[35101.826696] ffffea00048dab40 ffff8802365e55a0 0000000000000000 ffffea00048dab40
[35101.826789] ffff880234effa00 ffffffff81146090 ffff880234eff9f0 0000000000000000
[35101.826882] ffff8802365e55a0 ffff880234effd90 ffff880234effba0 ffffea00048dab60
[35101.826975] Call Trace:
[35101.827023] [<ffffffff81146090>] try_to_free_buffers+0x50/0xb0
[35101.827076] [<ffffffff8114cd9d>] blkdev_releasepage+0x3d/0x50
[35101.827130] [<ffffffff810cc29d>] try_to_release_page+0x2d/0x40
[35101.827185] [<ffffffff810df2e2>] shrink_page_list+0x762/0x910
[35101.827239] [<ffffffff810e8464>] ? __mod_zone_page_state+0x44/0x50
[35101.827293] [<ffffffff810dd134>] ? update_isolated_counts.clone.55+0x114/0x130
[35101.827383] [<ffffffff810df974>] shrink_inactive_list+0x244/0x4c0
[35101.827437] [<ffffffff810e0304>] shrink_mem_cgroup_zone+0x3b4/0x4f0
[35101.827491] [<ffffffff8111bbe2>] ? prune_super+0x192/0x1b0
[35101.827545] [<ffffffff810e10f2>] balance_pgdat+0x542/0x730
[35101.827598] [<ffffffff810e1449>] kswapd+0x169/0x3c0
[35101.827649] [<ffffffff81058be0>] ? wake_up_bit+0x40/0x40
[35101.827701] [<ffffffff810e12e0>] ? balance_pgdat+0x730/0x730
[35101.827752] [<ffffffff81058466>] kthread+0x96/0xa0
[35101.827804] [<ffffffff816ae2d4>] kernel_thread_helper+0x4/0x10
[35101.827856] [<ffffffff810583d0>] ? flush_kthread_worker+0xb0/0xb0
[35101.827909] [<ffffffff816ae2d0>] ? gs_change+0xb/0xb
[35101.827955] Code: 00 00 00 55 48 89 e5 41 56 49 89 f6 41 55 41 54 53 48 8b 07 48 89 fb f6 c4 08 0f 84 8e 00 00 00 4c 8b 6f 30 4c 89 ef 0f 1f 40 00 <48> 8b 07 f6 c4 08 74 0e 48 8b 43 08 48 85 c0 74 05 f0 80 48 7b
[35101.828240] RIP [<ffffffff81145fb8>] drop_buffers+0x28/0xb0
[35101.828294] RSP <ffff880234eff9a0>
[35101.828669] ---[ end trace 7979c35d1c9be633 ]---
[35161.769980] INFO: rcu_sched self-detected stall on CPU { 1} (t=60000 jiffies)
[35161.770288] Pid: 3339, comm: dd Tainted: G D 3.4.9-gentoo #9
[35161.771507] Call Trace:
[35161.771588] <IRQ> [<ffffffff810a7d26>] __rcu_pending+0x206/0x490
[35161.771735] [<ffffffff810a8470>] rcu_check_callbacks+0xb0/0x170
[35161.771828] [<ffffffff810472b3>] update_process_times+0x43/0x80
[35161.771918] [<ffffffff8107a6bf>] tick_sched_timer+0x5f/0xb0
[35161.772008] [<ffffffff8105c7a8>] __run_hrtimer+0x78/0x1c0
[35161.772097] [<ffffffff8107a660>] ? tick_nohz_handler+0xe0/0xe0
[35161.772187] [<ffffffff8105cfe6>] hrtimer_interrupt+0xf6/0x240
[35161.772278] [<ffffffff810212e4>] smp_apic_timer_interrupt+0x64/0xa0
[35161.772371] [<ffffffff816ada87>] apic_timer_interrupt+0x67/0x70
[35161.772458] <EOI> [<ffffffff816ac7ca>] ? _raw_spin_lock+0x1a/0x30
[35161.772597] [<ffffffff8114734b>] create_empty_buffers+0x4b/0xd0
[35161.772689] [<ffffffff811485a8>] block_read_full_page+0x2c8/0x390
[35161.772780] [<ffffffff8114c280>] ? I_BDEV+0x10/0x10
[35161.772869] [<ffffffff810e8cae>] ? __inc_zone_page_state+0x2e/0x30
[35161.772961] [<ffffffff810ccfeb>] ? add_to_page_cache_locked+0x8b/0xe0
[35161.773052] [<ffffffff8114ce23>] blkdev_readpage+0x13/0x20
[35161.773142] [<ffffffff810d81c9>] __do_page_cache_readahead+0x1d9/0x260
[35161.773234] [<ffffffff810d857c>] ra_submit+0x1c/0x20
[35161.773321] [<ffffffff810d868d>] ondemand_readahead+0x10d/0x230
[35161.773413] [<ffffffff812d283d>] ? copy_user_generic_string+0x2d/0x40
[35161.773503] [<ffffffff810d8830>] page_cache_async_readahead+0x80/0xa0
[35161.773596] [<ffffffff810ce86b>] generic_file_aio_read+0x48b/0x780
[35161.773688] [<ffffffff811183c2>] do_sync_read+0xe2/0x120
[35161.773778] [<ffffffff8127db83>] ? security_file_permission+0x93/0xb0
[35161.773869] [<ffffffff81118c93>] vfs_read+0xc3/0x170
[35161.773956] [<ffffffff81118d8c>] sys_read+0x4c/0x90
[35161.774044] [<ffffffff816acfca>] ? system_call_after_swapgs+0x17/0x59
[35161.774135] [<ffffffff816ad022>] system_call_fastpath+0x16/0x1b
|
I understand folks don't want to debug processes that are "Tainted". The log above shows one process (568) as "Not tainted", the other (3339) as "Tainted". Is "dd" really tainted? Or am I just interpreting this wrong?
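On the "Tainted" question: the taint state belongs to the running kernel as a whole, not to dd. In `Tainted: G D`, `D` records that the kernel has already hit an oops or BUG (here the kswapd0 fault logged about a minute earlier), and `G` only says that no proprietary module is loaded. A minimal sketch for decoding the number found in /proc/sys/kernel/tainted follows; `decode_taint` is a made-up helper name and it deliberately covers only a few of the documented bits.

```shell
# decode_taint VALUE: print the letters for a few well-known kernel taint bits.
# Hypothetical helper; the bit list is intentionally incomplete.
decode_taint() {
    value=$1
    out=""
    [ $(( value & 1 ))   -ne 0 ] && out="${out}P"   # proprietary module loaded
    [ $(( value & 2 ))   -ne 0 ] && out="${out}F"   # module was force-loaded
    [ $(( value & 16 ))  -ne 0 ] && out="${out}M"   # machine check exception seen
    [ $(( value & 32 ))  -ne 0 ] && out="${out}B"   # bad page state seen
    [ $(( value & 128 )) -ne 0 ] && out="${out}D"   # kernel died recently (oops or BUG)
    [ $(( value & 512 )) -ne 0 ] && out="${out}W"   # a kernel warning was issued
    echo "${out:-clean}"
}
```

Typical use would be `decode_taint "$(cat /proc/sys/kernel/tainted)"`. A freshly booted box should report `clean`; after the first crash it would be expected to show `D`.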
Full dmesg:
http://pastebin.com/raw.php?i=04Y9Kx75
Full kernel .config:
http://pastebin.com/raw.php?i=kUqbhM7z
Any help appreciated.
Thanks,
Mark |
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54421 Location: 56N 3W
|
Posted: Fri Nov 30, 2012 10:26 pm Post subject: |
|
|
MarkCu,
CONFIG_HZ_1000=y is known to cause problems on some hardware. It's not needed on a headless system either.
Try 100Hz instead.
You also have several debug options on in your kernel, I did not check them all. Debug options always cause logspam and sometimes interfere with normal operation.
Debug options should only be on if you are debugging that part of the kernel.
While you are fixing your kernel timer, turn off all the debug stuff too. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
MarkCu n00b
Joined: 28 Nov 2012 Posts: 19
|
Posted: Sat Dec 01, 2012 1:08 am Post subject: |
|
|
Thanks for the advice.
Recompiled my kernel with the suggested changes.
Still crashing.
Worth mentioning (don't know if it matters or not): I'm running without swap. Figured 8G of memory should be plenty for this config.
Same command:
dd if=/dev/md127 of=/dev/null bs=1024
dmesg Result:
Code: |
[ 2111.906328] general protection fault: 0000 [#1] PREEMPT SMP
[ 2111.906358] CPU 0
[ 2111.906365] Modules linked in: k10temp
[ 2111.906381]
[ 2111.906391] Pid: 565, comm: kswapd0 Not tainted 3.4.9-gentoo #11 System manufacturer System Product Name/C60M1-I
[ 2111.906421] RIP: 0010:[<ffffffff8114a658>] [<ffffffff8114a658>] drop_buffers+0x28/0xc0
[ 2111.906453] RSP: 0018:ffff880234fb79c0 EFLAGS: 00010206
[ 2111.906467] RAX: 0000000000000000 RBX: ffffea0004883ec0 RCX: 0000000000000000
[ 2111.906484] RDX: 0000000000000000 RSI: ffff880234fb79f8 RDI: ffbf8801326bdf08
[ 2111.906501] RBP: ffff8802365e05e8 R08: 0000000000000003 R09: ffff880234fb6000
[ 2111.906518] R10: ffff880234fb7fd8 R11: ffff880234fb7bb0 R12: ffff8801326bdf08
[ 2111.906536] R13: ffff880234fb79f8 R14: 0000000000000001 R15: ffff880234fb7af0
[ 2111.906554] FS: 00007fdcadc1b700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[ 2111.906574] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2111.906588] CR2: 00007f9bae9e8000 CR3: 00000002340be000 CR4: 00000000000007f0
[ 2111.906606] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2111.906623] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2111.906642] Process kswapd0 (pid: 565, threadinfo ffff880234fb6000, task ffff880235a4e200)
[ 2111.906724] Stack:
[ 2111.906762] ffff8802365e0560 ffffea0004883ec0 ffff8802365e05e8 0000000000000000
[ 2111.906857] ffffea0004883ec0 ffffffff8114a73f ffff880234fb7b90 0000000000000000
[ 2111.906951] ffff880234fb7d80 ffff880234fb7b90 ffffea0004883ee0 ffffffff810e2060
[ 2111.907046] Call Trace:
[ 2111.907096] [<ffffffff8114a73f>] ? try_to_free_buffers+0x4f/0xc0
[ 2111.907153] [<ffffffff810e2060>] ? shrink_page_list+0x790/0x970
[ 2111.907209] [<ffffffff810eb60f>] ? __mod_zone_page_state+0x3f/0x50
[ 2111.907265] [<ffffffff810e080b>] ? update_isolated_counts.clone.56+0x13b/0x170
[ 2111.907356] [<ffffffff810e2713>] ? shrink_inactive_list+0x233/0x4d0
[ 2111.907413] [<ffffffff810e3092>] ? shrink_mem_cgroup_zone+0x392/0x4d0
[ 2111.907471] [<ffffffff810e3e9a>] ? balance_pgdat+0x4ea/0x6b0
[ 2111.907526] [<ffffffff810e41dc>] ? kswapd+0x17c/0x430
[ 2111.907579] [<ffffffff816b129c>] ? __schedule+0x27c/0x5e0
[ 2111.907632] [<ffffffff81059790>] ? wake_up_bit+0x40/0x40
[ 2111.907685] [<ffffffff810e4060>] ? balance_pgdat+0x6b0/0x6b0
[ 2111.907738] [<ffffffff810e4060>] ? balance_pgdat+0x6b0/0x6b0
[ 2111.907791] [<ffffffff81058fee>] ? kthread+0x9e/0xb0
[ 2111.907844] [<ffffffff816b42d4>] ? kernel_thread_helper+0x4/0x10
[ 2111.907900] [<ffffffff81058f50>] ? flush_kthread_worker+0xc0/0xc0
[ 2111.907955] [<ffffffff816b42d0>] ? gs_change+0xb/0xb
[ 2111.908003] Code: 00 00 00 41 55 49 89 f5 41 54 55 53 48 89 fb 48 83 ec 08 48 8b 07 f6 c4 08 0f 84 99 00 00 00 4c 8b 67 30 4c 89 e7 0f 1f 44 00 00 <48> 8b 07 f6 c4 08 74 0e 48 8b 43 08 48 85 c0 74 05 f0 80 48 7b
[ 2111.908304] RIP [<ffffffff8114a658>] drop_buffers+0x28/0xc0
[ 2111.908360] RSP <ffff880234fb79c0>
[ 2111.908707] ---[ end trace c23f7d9f938c612f ]---
[ 2111.908797] note: kswapd0[565] exited with preempt_count 1
|
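One detail in the register dumps may be worth a closer look. In the first trace RDI is ffbf8801332bdf08 while R13 holds ffff8801332bdf08; in this one RDI is ffbf8801326bdf08 and R12 holds ffff8801326bdf08. A rough sketch for comparing such values is below. `bits_changed` and `count_bits` are made-up helpers, and both assume exactly 16 hex digits without a 0x prefix; the work is done in 32-bit halves so that signed 64-bit shell arithmetic does not get in the way.

```shell
# bits_changed A B: print the XOR of two 16-digit hex values,
# i.e. a mask of the bits in which they differ.
bits_changed() {
    hi=$(( 0x${1%????????} ^ 0x${2%????????} ))   # upper 32 bits
    lo=$(( 0x${1#????????} ^ 0x${2#????????} ))   # lower 32 bits
    printf '%08x%08x\n' "$hi" "$lo"
}

# count_bits MASK: number of set bits in a 16-digit hex mask.
count_bits() {
    n=0
    for half in "${1%????????}" "${1#????????}"; do
        v=$(( 0x$half ))
        while [ "$v" -ne 0 ]; do
            n=$(( n + (v & 1) ))
            v=$(( v >> 1 ))
        done
    done
    echo "$n"
}
```

For both pairs, `bits_changed` gives 0040000000000000, so the values differ in one bit only, and it is the same bit (bit 54) in both crashes. An address whose upper bits are not all copies of bit 47 is non-canonical on x86-64, and dereferencing one raises a general protection fault rather than a page fault, which would fit these traces. This does not say where the bit went missing, but the same bit being lost in two separate runs may be a useful clue when weighing hardware against software.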
dmesg:
http://pastebin.com/raw.php?i=LxmU9QKX
.config:
http://pastebin.com/raw.php?i=jV3pQ8a5
Any other ideas?
Thanks,
Mark |
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54421 Location: 56N 3W
|
Posted: Sat Dec 01, 2012 3:05 pm Post subject: |
|
|
MarkCu,
Code: |
CONFIG_SLUB_DEBUG=y
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_HWMON_DEBUG_CHIP=y
CONFIG_DEBUG_FS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y
|
Use the search (press /) in make menuconfig to find the above options and turn them off.
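To see which debug options are still set after a rebuild, the .config can also be filtered directly. A rough sketch; `list_debug_options` is a made-up name, and /usr/src/linux/.config in the usage line is only the usual Gentoo location, not something stated in this thread.

```shell
# list_debug_options FILE: print every built-in (=y) option whose name
# contains DEBUG. Commented-out "is not set" lines are skipped.
list_debug_options() {
    grep -E '^CONFIG_[A-Za-z0-9_]*DEBUG[A-Za-z0-9_]*=y$' "$1"
}
```

Usage would be something like `list_debug_options /usr/src/linux/.config`. Anything it still prints after the cleanup is either selected by another option or has no menu entry of its own.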
Not having swap does not stop the kernel swapping; it just robs the kernel of the ability to move dynamically allocated RAM to disk.
The kernel will still swap by discarding from RAM data or code that has a permanent home on disk, then reloading it when it's needed again.
Unless you are running a diskless node, a small swap, say 512 MB, is a good thing.
You can make a swap file if you want to test your swap theory, but I agree, no swap is unlikely to be the problem. |
|
|
MarkCu n00b
Joined: 28 Nov 2012 Posts: 19
|
Posted: Sun Dec 02, 2012 8:13 pm Post subject: |
|
|
OK, managed to get most of those other DEBUG kernel options off.
One I couldn't figure out how to disable:
CONFIG_X86_DEBUGCTLMSR=y
The help doesn't show the dependencies, where it lives, or much else, and I can't find it.
Anyway, similar results:
Code: |
[58460.314388] ------------[ cut here ]------------
[58460.314414] Kernel BUG at ffffffff81116ca6 [verbose debug info unavailable]
[58460.314433] invalid opcode: 0000 [#1] PREEMPT SMP
[58460.314452] CPU 0
[58460.314458] Modules linked in: k10temp
[58460.314474]
[58460.314483] Pid: 562, comm: kswapd0 Not tainted 3.4.9-gentoo #12 System manufacturer System Product Name/C60M1-I
[58460.314513] RIP: 0010:[<ffffffff81116ca6>] [<ffffffff81116ca6>] free_buffer_head+0x66/0x80
[58460.314543] RSP: 0018:ffff880234f6b9f0 EFLAGS: 00010287
[58460.314558] RAX: ffff880124837ce0 RBX: ffff880124837c98 RCX: 0000000000000000
[58460.314575] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff880124837c98
[58460.314592] RBP: ffff88023646d968 R08: 0000000000000003 R09: ffff880234f6a000
[58460.314609] R10: ffff880234f6bfd8 R11: ffff880234f6bbd0 R12: 0000000000000001
[58460.314626] R13: ffffea00048f8c40 R14: 0000000000000001 R15: ffff880234f6bb00
[58460.314644] FS: 00007f2118045700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[58460.314696] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[58460.314742] CR2: 00007fd7644c1000 CR3: 00000002317dd000 CR4: 00000000000007f0
[58460.314792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58460.314841] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[58460.314891] Process kswapd0 (pid: 562, threadinfo ffff880234f6a000, task ffff880235433600)
[58460.314973] Stack:
[58460.315011] ffff88023646d968 ffff880124837c98 ffff88023646d968 ffffffff81116eec
[58460.315099] ffff880234f6bbb0 ffff880124837c98 ffff880234f6bda0 ffff880234f6bbb0
[58460.315191] ffffea00048f8c60 ffffffff810b8e00 0000000000010dc0 ffff880234f6bac0
[58460.315286] Call Trace:
[58460.315335] [<ffffffff81116eec>] ? try_to_free_buffers+0x7c/0xc0
[58460.315392] [<ffffffff810b8e00>] ? shrink_page_list+0x740/0x8c0
[58460.315447] [<ffffffff810c01df>] ? __mod_zone_page_state+0x3f/0x50
[58460.315502] [<ffffffff810b794b>] ? update_isolated_counts.clone.53+0x13b/0x170
[58460.315591] [<ffffffff810b94a6>] ? shrink_inactive_list+0x286/0x470
[58460.315641] [<ffffffff810b9d82>] ? shrink_mem_cgroup_zone+0x3a2/0x4e0
[58460.315693] [<ffffffff810eefbe>] ? grab_super_passive+0x3e/0x90
[58460.315742] [<ffffffff810baa9a>] ? balance_pgdat+0x4fa/0x6c0
[58460.315792] [<ffffffff810badf6>] ? kswapd+0x196/0x300
[58460.315840] [<ffffffff81051b20>] ? wake_up_bit+0x40/0x40
[58460.315887] [<ffffffff810bac60>] ? balance_pgdat+0x6c0/0x6c0
[58460.315936] [<ffffffff810bac60>] ? balance_pgdat+0x6c0/0x6c0
[58460.315983] [<ffffffff8105144e>] ? kthread+0x9e/0xb0
[58460.316032] [<ffffffff8163e314>] ? kernel_thread_helper+0x4/0x10
[58460.316081] [<ffffffff810513b0>] ? flush_kthread_worker+0xc0/0xc0
[58460.316130] [<ffffffff8163e310>] ? gs_change+0xb/0xb
[58460.316174] Code: 65 ff 0c 25 60 e2 00 00 e8 38 ff ff ff 83 6b 1c 01 48 8b 85 38 e0 ff ff a8 08 75 11 48 8b 5c 24 08 48 8b 6c 24 10 48 83 c4 18 c3 <0f> 0b 48 8b 5c 24 08 48 8b 6c 24 10 48 83 c4 18 e9 15 4a 52 00
[58460.316434] RIP [<ffffffff81116ca6>] free_buffer_head+0x66/0x80
[58460.316483] RSP <ffff880234f6b9f0>
[58460.319321] ---[ end trace fa6084efedc140cf ]---
|
dmesg:
http://pastebin.com/raw.php?i=rXvLKJ5w
.config
http://pastebin.com/raw.php?i=cHPrW85H
Also tried adding some swap - no difference.
Thanks
Mark |
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54421 Location: 56N 3W
|
Posted: Sun Dec 02, 2012 8:31 pm Post subject: |
|
|
MarkCu,
[58460.314414] Kernel BUG at ffffffff81116ca6 [verbose debug info unavailable]
[58460.314433] invalid opcode: 0000 [#1] PREEMPT SMP
Invalid opcode means the system tried to execute an instruction that the CPU does not understand.
If it's in the kernel, you have set the wrong CPU type in the kernel.
If it's in a program, your CFLAGS or USE flags do not match your CPU.
Please post your emerge --info output and your /proc/cpuinfo.
If you have anything in /etc/portage/package.use, post all of that too. |
|
|
MarkCu n00b
Joined: 28 Nov 2012 Posts: 19
|
Posted: Sun Dec 02, 2012 8:46 pm Post subject: |
|
|
Kernel CPU is just x86_64. Same for USE
Code: |
% emerge --info
Portage 2.1.11.9 (default/linux/amd64/10.0, gcc-4.5.4, glibc-2.15-r2, 3.4.9-gentoo x86_64)
=================================================================
System uname: Linux-3.4.9-gentoo-x86_64-AMD_C-60_APU_with_Radeon-tm-_HD_Graphics-with-gentoo-2.1
Timestamp of tree: Wed, 10 Oct 2012 00:45:01 +0000
app-shells/bash: 4.2_p37
dev-lang/python: 2.7.3-r2, 3.2.3
dev-util/cmake: 2.8.9
dev-util/pkgconfig: 0.27.1
sys-apps/baselayout: 2.1-r1
sys-apps/openrc: 0.9.8.4
sys-apps/sandbox: 2.5
sys-devel/autoconf: 2.13, 2.68
sys-devel/automake: 1.11.6
sys-devel/binutils: 2.22-r1
sys-devel/gcc: 4.5.4
sys-devel/gcc-config: 1.7.3
sys-devel/libtool: 2.4-r1
sys-devel/make: 3.82-r3
sys-kernel/linux-headers: 3.4-r2 (virtual/os-headers)
sys-libs/glibc: 2.15-r2
Repositories: gentoo
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -march=x86-64"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-O2 -march=x86-64"
DISTDIR="/usr/portage/distfiles"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles news parallel-fetch parse-eapi-ebuild-head protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="ftp://ftp.ucsb.edu/pub/mirrors/linux/gentoo/"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY=""
SYNC="rsync://rsync.us.gentoo.org/gentoo-portage"
USE="X acl amd64 apng berkdb bluray bzip2 cddb cli consolekit cracklib crypt cups cxx dbus dri embedded examples exif ffmpeg fortran gdbm gif gpm gudev hwdb iconv imap inotify ipv6 javascrip javascript jpeg jpeg2k lm_sensors lzma midi minizip mmx modules mp3 mp4 mpeg mudflap multilib ncurses nls nptl ogg openmp pam pcre perl png policykit ppds pppd python readline session sse sse2 ssl svg taglib tcpd thumbnail tiff unicode vorbis x264 zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" PHP_TARGETS="php5-3" PYTHON_TARGETS="python3_2 python2_7" RUBY_TARGETS="ruby18 ruby19" USERLAND="GNU" VIDEO_CARDS="fbdev glint intel mach64 mga 
neomagic nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset: CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LINGUAS, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHON
|
Code: |
% cat /etc/portage/make.conf
# These settings were set by the catalyst build script that automatically
# built this stage.
# Please consult /usr/share/portage/config/make.conf.example for a more
# detailed example.
CFLAGS="-O2 -march=x86-64"
CXXFLAGS="${CFLAGS}"
# WARNING: Changing your CHOST is not something that should be done lightly.
# Please consult http://www.gentoo.org/doc/en/change-chost.xml before changing.
CHOST="x86_64-pc-linux-gnu"
# These are the USE flags that were used in addition to what is provided by the
# profile used for building.
USE="mmx sse sse2 python png X gif jpeg mp3 mp4 mpeg jpeg2k tiff apng ppds ssl dbus gudev policykit embedded consolekit ogg vorbis hwdb midi readline imap -gnome -kde minizip examples lzma perl bluray x264 svg cddb exif ffmpeg inotify javascrip javascript taglib thumbnail lm_sensors"
GENTOO_MIRRORS="ftp://ftp.ucsb.edu/pub/mirrors/linux/gentoo/"
SYNC="rsync://rsync.us.gentoo.org/gentoo-portage"
MAKEOPTS="-j2"
|
Code: |
% cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 20
model : 2
model name : AMD C-60 APU with Radeon(tm) HD Graphics
stepping : 0
microcode : 0x500010d
cpu MHz : 1000.010
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt arat cpb hw_pstate npt lbrv svm_lock nrip_save pausefilter
bogomips : 2000.02
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate cpb
processor : 1
vendor_id : AuthenticAMD
cpu family : 20
model : 2
model name : AMD C-60 APU with Radeon(tm) HD Graphics
stepping : 0
microcode : 0x500010d
cpu MHz : 1000.010
cache size : 512 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt arat cpb hw_pstate npt lbrv svm_lock nrip_save pausefilter
bogomips : 2000.02
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate cpb
|
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54421 Location: 56N 3W
|
Posted: Sun Dec 02, 2012 9:36 pm Post subject: |
|
|
MarkCu,
These flags Code: | mmx sse sse2 mmxext ssse3 sse4a | from your cpuinfo indicate optional instruction set extensions that are present.
Usually, AMD CPUs have 3DNow and 3DNowExt too.
Your CFLAGS should be safe.
mmx, sse and sse2 are in your USE flags, but that's OK as your CPU has those flags too.
As you say, your kernel has CONFIG_GENERIC_CPU=y, so that's OK too.
So, it's all OK, it just doesn't work :(
Unfortunately, I'm out of ideas. Your kernel is Code: | Linux version 3.4.9-gentoo | so you could try the testing gentoo-sources in case it really is a kernel bug and it's now fixed.
You could also try rolling your own kernel with the help of kernel-seeds.org. |
|
|
MarkCu n00b
Joined: 28 Nov 2012 Posts: 19
|
Posted: Sun Dec 02, 2012 9:41 pm Post subject: |
|
|
Oh, and package.use
Code: |
% cat /etc/portage/package.use
media-video/vlc dvd ffmpeg mpeg mad wxwindows aac dts a52 ogg flac theora oggvorbis matroska freetype bidi xv svga gnutls stream vlm httpd cdda vcd cdio live lua truetype debug
net-misc/ntp caps
app-portage/layman git subversion
app-admin/gkrellm X lm_sensors
|
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54421 Location: 56N 3W
|
Posted: Sun Dec 02, 2012 9:58 pm Post subject: |
|
|
MarkCu,
package.use is all good. |
|
|
MarkCu n00b
Joined: 28 Nov 2012 Posts: 19
|
Posted: Wed Dec 05, 2012 4:08 pm Post subject: |
|
|
FYI, for those following - compiled a newer "testing" kernel - gentoo-sources-3.4.11.
Still same crashes. Going to try 3.6.8 next. |
|
|
MarkCu n00b
Joined: 28 Nov 2012 Posts: 19
|
Posted: Thu Dec 06, 2012 4:54 am Post subject: |
|
|
gentoo-sources-3.6.8 doesn't help either. Still crashing.
Neddy indicates the fault is showing an illegal opcode. Can I tell from the
dmesg log what the opcode is that's causing the problem? Or, is there a
debug switch I can turn on that's more verbose?
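The `Code:` line at the end of each oops holds the raw instruction bytes around the fault, with the byte at RIP wrapped in `<...>`. A small sketch for pulling out that byte and the one after it (`fault_bytes` is a made-up helper name, and it assumes the lowercase hex layout seen in the traces above):

```shell
# fault_bytes: read oops text on stdin and, for each "Code:" line, print the
# byte marked <xx> plus the byte that follows it.
fault_bytes() {
    sed -n 's/.*Code:.*<\([0-9a-f]\{2\}\)> \([0-9a-f]\{2\}\).*/\1 \2/p'
}
```

Something like `dmesg | fault_bytes` would do. For the free_buffer_head trace the marked bytes are `0f 0b`, which is the x86 `ud2` instruction. The kernel's BUG() macro plants `ud2` on purpose, which is why that log starts with "Kernel BUG at", so the "invalid opcode" there is most likely a failed sanity check firing rather than the CPU meeting an instruction it lacks. In the two earlier traces the marked bytes are `48 8b`, the start of an ordinary `mov`.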
Any pointers are appreciated. I can google around about kernel debugging, but it's a big subject.
Thanks,
Mark |
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54421 Location: 56N 3W
|
Posted: Thu Dec 06, 2012 6:49 pm Post subject: |
|
|
MarkCu,
The opcode is 0000 - that was in one of your posts but it doesn't help.
If it was really a kernel bug, lots of users would see it and it would be all over Google. It's not.
That points to your hardware somewhere.
If you have several sticks of RAM, remove them all except one. Now what happens?
Try them in turn, one at a time.
Can you try other binary distros or the Gentoo liveDVD? That would prove it's not something gone wrong with your builds, as you would be running code built elsewhere. |
|
|
toralf Developer
Joined: 01 Feb 2004 Posts: 3925 Location: Hamburg
|
|
|
krinn Watchman
Joined: 02 May 2003 Posts: 7470
|
|
|
MarkCu n00b
Joined: 28 Nov 2012 Posts: 19
|
Posted: Fri Dec 07, 2012 12:22 am Post subject: |
|
|
toralf,
Thanks for the pointer. I went and installed git-sources-3.7_rc8
Configured/make/installed kernel.
Thought that might have done it. My test ran about 45 minutes and no crash. (Previously, it'd always crash in less than 10 minutes).
Still crashed in the end.
Similar crash report. Whole thing's here:
http://pastebin.com/raw.php?i=PeGmLcDg
.config:
http://pastebin.com/raw.php?i=KUgSNQnf
Neddy - testing/swapping memory was one of the first things I did. No change. Plus memtest86+ ran for 36 hours with no reported issues. I've got two DIMMS, tried running with just one or the other. No changes. I've been swinging back and forth over HW vs SW problems. I've eliminated just about all I can HW wise - all that's left is the CPU and power supply.
I can try some sort of other live DVD - although I need to go through the hoops to make it work from a USB stick - no CD/DVD/etc drive installed.
Krinn - thanks for the pointer. I had -march=native before. My intent was to un-optimize it even more - just generic x86-64. The way I understand it native could be using SSE/ etc... I was trying to remove even these usages to pare down my problem.
I'll read up more to see what the appropriate march is - probably generic?
Thanks all so far for all the pointers. Still digging. |
|
|
wcg Guru
Joined: 06 Jan 2009 Posts: 588
|
Posted: Fri Dec 07, 2012 11:59 am Post subject: |
|
|
What is this?
Code: |
Modules linked in: k10temp
|
_________________ TIA |
|
|
MarkCu n00b
Joined: 28 Nov 2012 Posts: 19
|
Posted: Fri Dec 07, 2012 3:55 pm Post subject: |
|
|
Code: |
Modules linked in: k10temp
|
One of the lm_sensors, I think. Used to monitor temperature. Crash was happening before I installed this, but can take it out. |
|
|
wcg Guru
Joined: 06 Jan 2009 Posts: 588
|
Posted: Sat Dec 08, 2012 5:23 am Post subject: |
|
|
Without having examined everything in detail, a bad opcode
is almost always a result of an inappropriate CFLAG. (Bad
assembly code that uses an opcode not supported by the cpu,
a binutils bug, or a gcc bug would be possible, too, if less common.)
The kernel pretty much sets its own CFLAGS, though, so if you
have the correct architecture and use a stable gcc, inappropriate
CFLAGS would be pretty rare in kernel compiles. I would look for
something in "Processor Type and Features" in the kernel .config.
(I use K8 for K10 cpus. Seems to work.) |
|
|
kondor6c n00b
Joined: 12 Jul 2007 Posts: 9
|
Posted: Thu Dec 13, 2012 5:37 pm Post subject: |
|
|
I had an issue with my ASUS P8Z77-V: it had seemingly random segfaults. I eventually tracked it down to my RAM not being on their approved memory vendor list. |
|
|
BitJam Advocate
Joined: 12 Aug 2003 Posts: 2508 Location: Silver City, NM
|
Posted: Thu Dec 13, 2012 6:57 pm Post subject: |
|
|
A software bug should have been much more repeatable, therefore despite passing of your hardware tests I think it must be a hardware bug. The problem is that something complicated is triggering the hardware bug so it is not caught by the simpler tests.
You have established that it is not a heat related issue or a RAM issue. I suspect the problem is related to the disk drive subsystem since that is being exercised during your failures but was not exercised in any of your hardware tests. It is also possible the problem is the CPU.
Unfortunately, the next level of tests involves swapping either the CPU or the motherboard. I suggest you report this as a defective product and try to get a refund or replacement. |
|
|
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Thu Dec 13, 2012 9:29 pm Post subject: |
|
|
For the C/E-series CPUs, don't use -march=k8 in the kernel or make.conf. That sets 3DNow flags which the newer CPUs don't have. |
|
|
wcg Guru
Joined: 06 Jan 2009 Posts: 588
|
Posted: Fri Dec 14, 2012 9:25 pm Post subject: |
|
|
Quote: | For the C/E-series CPUs, don't use -march=k8 in the kernel or make.conf. |
I was wondering why someone would use CONFIG_GENERIC_CPU with a K10 architecture chip. So these AMD APUs are not K10s (perhaps some features in common, but not drop-in replacements that will necessarily run the same compiled code). While that module may not be the cause of the error, one wonders if the lm_sensors k10temp module actually works with the AMD HSA (Fusion) architectures.
dmesg:
Code: |
CPU0: AMD C-60 APU with Radeon(tm) HD Graphics stepping 00
|
( http://en.wikipedia.org/wiki/AMD_Fusion )
kernel .config:
Code: |
# CONFIG_MK8 is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
|
Could the box need more low memory protection than 64K?
Code: |
CONFIG_X86_RESERVE_LOW=64
|
This can be set as high as 640. _________________ TIA |
|
|
|
MarkCu n00b
Joined: 28 Nov 2012 Posts: 19
|
Posted: Tue Dec 18, 2012 12:03 am Post subject: |
|
|
Quote: |
For the C/E-series CPUs, don't use -march=k8 in the kernel or make.conf. That sets 3DNow flags which the newer CPUs don't have.
|
Info like this is what I'm looking for. Thanks.
I've recompiled "world" with new Gentoo CFLAGS:
CFLAGS="-O2 -march=athlon64 -mtune=generic"
For the kernel, is just this sufficient?
Code: |
# CONFIG_MK8 is not set
CONFIG_GENERIC_CPU=y
|
Still no change in behavior.
I'll take the k10temp module out next time I recompile. I only added it to check the temps when I started noticing the crashes. It was crashing without it.
Quote: |
A software bug should be much more repeatable, so even though your hardware tests passed, I think it must be a hardware bug. The problem is that something complicated is triggering the hardware bug, so it is not caught by the simpler tests.
|
The test is very repeatable. The dd command above fails EVERY time - it never completes successfully and always ends in a kernel crash. Sometimes it takes a little longer, but it always crashes.
I'm trying to dig deeper to convince myself it's hardware. I think getting an RMA for this might be trouble, since I'm not even convinced myself it's HW. I'd have to replace the whole motherboard and CPU: it's a combo (the CPU is BGA-soldered to the board). The thing works fine for just about everything else until I start hitting it hard with the backups.
Code: |
CONFIG_X86_RESERVE_LOW=64
|
I'll try upping this next...
I'm also going to try and change my test to read from the raw drive(s) instead of the raw RAID device. Those results may be informative. |
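The raw-drive variant of the dd test could be sketched roughly as below. This is only an illustration, not what was actually run: `read_through` is a hypothetical name, and the device paths are whatever member disks the array really has. It reads a path start to finish, throws the data away (like of=/dev/null), and returns the byte count.

```python
def read_through(path, block_size=1024 * 1024):
    """Read 'path' sequentially, discard the data, return total bytes read.

    Roughly equivalent in spirit to: dd if=<path> of=/dev/null bs=1M
    """
    total = 0
    # buffering=0 gives an unbuffered raw file object, so each read()
    # goes to the kernel rather than to a Python-level buffer.
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(block_size)
            if not chunk:  # empty bytes object means end of device/file
                break
            total += len(chunk)
    return total
```

Run as root against each member disk in turn, then against /dev/md127. If the crash only shows up when reading through the md device, that points at the RAID layer; if a single raw disk also triggers it, the controller or that disk becomes the suspect.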
|
|
|
BitJam Advocate
Joined: 12 Aug 2003 Posts: 2508 Location: Silver City, NM
|
Posted: Tue Dec 18, 2012 1:30 am Post subject: |
|
|
MarkCu wrote: | The test is very repeatable. The dd command above fails EVERY time - it never completes successfully and always ends in a kernel crash. Sometimes it takes a little longer, but it always crashes. | Is the crash always in the same place in the code? When I first installed Gentoo, I had a hardware issue where I could consistently crash my machine by doing big compiles on ReiserFS but not on ext2. But no two crashes were identical.
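To answer the "same place in the code?" question from saved logs, one could tally the function named on the RIP line of each oops. This is a rough sketch only: it assumes the RIP line format shown in the first post (3.4-era kernels), and `crash_sites` is a made-up helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   RIP: 0010:[<ffffffff81145fb8>] [<ffffffff81145fb8>] drop_buffers+0x28/0xb0
# and captures the symbol name (here "drop_buffers").
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:\[<[0-9a-f]+>\]\s+\[<[0-9a-f]+>\]\s+"
    r"(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def crash_sites(log_text):
    """Count how often each function appears as the faulting RIP."""
    return Counter(m.group(1) for m in RIP_RE.finditer(log_text))
```

Feed it the concatenated dmesg or netconsole captures from several crashes. One function dominating the count suggests a specific code path; faults scattered across unrelated functions fit the flaky-hardware picture better.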
Quote: | I'm trying to dig deeper to convince myself it's hardware. I think getting an RMA for this might be trouble, since I'm not even convinced myself it's HW. I'd have to replace the whole motherboard and CPU: it's a combo (the CPU is BGA-soldered to the board). The thing works fine for just about everything else until I start hitting it hard with the backups. | Usually there is a time limit on an RMA and the big question is who will pay for shipping. You don't have to ship it back immediately, but I don't want you to miss out on options while you are trying to diagnose the problem. I think you should get the RMA process in motion. Ideally, you could discuss it with someone and they would give you more time for further testing before you have to send it back.
I really do think you have a hardware problem that is difficult to diagnose. I've run into a few of these over the years and they can suck up a tremendous amount of time and energy. At some point you need to treat it like it's a hardware problem even if you can't prove (even to yourself) that the problem is hardware. It is now extremely unlikely that the problem is a bad instruction in the code. If it were, you'd be much closer to pinpointing where in the codebase the problem is.
There is no way mis-tuning CONFIG_X86_RESERVE_LOW could cause the problems you have if they are due to software. If it could, the kernel would be garbage, and I know it is not garbage. If you want to play around with things to see if you can work around the bug, you could try turning off multi-core support. If a non-SMP kernel did work, that would be further evidence of a hardware problem, although it would not constitute proof. |
|
|
|
|