Gentoo Forums
System instabilities - General Protection faults
MarkCu
n00b
Joined: 28 Nov 2012
Posts: 15

Posted: Fri Nov 30, 2012 5:38 pm    Post subject: System instabilities - General Protection faults

I've got a new homebuilt system based on an ASUS Mini-ITX motherboard. Nothing else has been installed on it (OS-wise). The system is intended as a headless backup server: RAID/SSH/rsync/NFS.

The install went smoothly and things seemed to work fine, until I started using the system for its intended purpose - doing backups. I started getting frequent crashes.

Narrowing down / searching these forums, I see most folks with similar issues end up finding hardware issues. So, to start I focus there.

Memtest86+ runs for 36 hours. No issues.
CpuBurn for 4 hours. No issues.
Check cooling - looks OK - CPU temp never goes above 65 C.

Underclock the system by 5% (both CPU and memory). The failure modes don't seem to change.

Pull one DIMM (I have two 4 GB DIMMs). No change.
Swap and use the other DIMM. No change.

So, I'm thinking HW looks fairly reasonable.

So focus more on SW. Trying to narrow my scope, I can usually
get a crash just by doing a dd on the server itself:

dd if=/dev/md127 of=/dev/null

The failures aren't identical, but seem similar to the one below:
Code:

[35101.826124] general protection fault: 0000 [#1] SMP
[35101.826153] CPU 0
[35101.826161] Modules linked in: k10temp
[35101.826179]
[35101.826190] Pid: 568, comm: kswapd0 Not tainted 3.4.9-gentoo #9 System manufacturer System Product Name/C60M1-I
[35101.826224] RIP: 0010:[<ffffffff81145fb8>]  [<ffffffff81145fb8>] drop_buffers+0x28/0xb0
[35101.826260] RSP: 0018:ffff880234eff9a0  EFLAGS: 00010206
[35101.826276] RAX: 0000000000000000 RBX: ffffea00048dab40 RCX: 0000000000000000
[35101.826295] RDX: 0000000000000000 RSI: ffff880234eff9d8 RDI: ffbf8801332bdf08
[35101.826314] RBP: ffff880234eff9c0 R08: dead000000200200 R09: dead000000100100
[35101.826333] R10: ffff880234effbb8 R11: ffff880234effbc0 R12: ffff8802365e55a0
[35101.826352] R13: ffff8801332bdf08 R14: ffff880234eff9d8 R15: 0000000000000001
[35101.826373] FS:  00007fb00bc26700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[35101.826395] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[35101.826411] CR2: 000000000065dd00 CR3: 000000022b4fd000 CR4: 00000000000007f0
[35101.826464] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[35101.826516] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[35101.826569] Process kswapd0 (pid: 568, threadinfo ffff880234efe000, task ffff88023597be70)
[35101.826655] Stack:
[35101.826696]  ffffea00048dab40 ffff8802365e55a0 0000000000000000 ffffea00048dab40
[35101.826789]  ffff880234effa00 ffffffff81146090 ffff880234eff9f0 0000000000000000
[35101.826882]  ffff8802365e55a0 ffff880234effd90 ffff880234effba0 ffffea00048dab60
[35101.826975] Call Trace:
[35101.827023]  [<ffffffff81146090>] try_to_free_buffers+0x50/0xb0
[35101.827076]  [<ffffffff8114cd9d>] blkdev_releasepage+0x3d/0x50
[35101.827130]  [<ffffffff810cc29d>] try_to_release_page+0x2d/0x40
[35101.827185]  [<ffffffff810df2e2>] shrink_page_list+0x762/0x910
[35101.827239]  [<ffffffff810e8464>] ? __mod_zone_page_state+0x44/0x50
[35101.827293]  [<ffffffff810dd134>] ? update_isolated_counts.clone.55+0x114/0x130
[35101.827383]  [<ffffffff810df974>] shrink_inactive_list+0x244/0x4c0
[35101.827437]  [<ffffffff810e0304>] shrink_mem_cgroup_zone+0x3b4/0x4f0
[35101.827491]  [<ffffffff8111bbe2>] ? prune_super+0x192/0x1b0
[35101.827545]  [<ffffffff810e10f2>] balance_pgdat+0x542/0x730
[35101.827598]  [<ffffffff810e1449>] kswapd+0x169/0x3c0
[35101.827649]  [<ffffffff81058be0>] ? wake_up_bit+0x40/0x40
[35101.827701]  [<ffffffff810e12e0>] ? balance_pgdat+0x730/0x730
[35101.827752]  [<ffffffff81058466>] kthread+0x96/0xa0
[35101.827804]  [<ffffffff816ae2d4>] kernel_thread_helper+0x4/0x10
[35101.827856]  [<ffffffff810583d0>] ? flush_kthread_worker+0xb0/0xb0
[35101.827909]  [<ffffffff816ae2d0>] ? gs_change+0xb/0xb
[35101.827955] Code: 00 00 00 55 48 89 e5 41 56 49 89 f6 41 55 41 54 53 48 8b 07 48 89 fb f6 c4 08 0f 84 8e 00 00 00 4c 8b 6f 30 4c 89 ef 0f 1f 40 00 <48> 8b 07 f6 c4 08 74 0e 48 8b 43 08 48 85 c0 74 05 f0 80 48 7b
[35101.828240] RIP  [<ffffffff81145fb8>] drop_buffers+0x28/0xb0
[35101.828294]  RSP <ffff880234eff9a0>
[35101.828669] ---[ end trace 7979c35d1c9be633 ]---
[35161.769980] INFO: rcu_sched self-detected stall on CPU { 1}  (t=60000 jiffies)
[35161.770288] Pid: 3339, comm: dd Tainted: G      D      3.4.9-gentoo #9
[35161.771507] Call Trace:
[35161.771588]  <IRQ>  [<ffffffff810a7d26>] __rcu_pending+0x206/0x490
[35161.771735]  [<ffffffff810a8470>] rcu_check_callbacks+0xb0/0x170
[35161.771828]  [<ffffffff810472b3>] update_process_times+0x43/0x80
[35161.771918]  [<ffffffff8107a6bf>] tick_sched_timer+0x5f/0xb0
[35161.772008]  [<ffffffff8105c7a8>] __run_hrtimer+0x78/0x1c0
[35161.772097]  [<ffffffff8107a660>] ? tick_nohz_handler+0xe0/0xe0
[35161.772187]  [<ffffffff8105cfe6>] hrtimer_interrupt+0xf6/0x240
[35161.772278]  [<ffffffff810212e4>] smp_apic_timer_interrupt+0x64/0xa0
[35161.772371]  [<ffffffff816ada87>] apic_timer_interrupt+0x67/0x70
[35161.772458]  <EOI>  [<ffffffff816ac7ca>] ? _raw_spin_lock+0x1a/0x30
[35161.772597]  [<ffffffff8114734b>] create_empty_buffers+0x4b/0xd0
[35161.772689]  [<ffffffff811485a8>] block_read_full_page+0x2c8/0x390
[35161.772780]  [<ffffffff8114c280>] ? I_BDEV+0x10/0x10
[35161.772869]  [<ffffffff810e8cae>] ? __inc_zone_page_state+0x2e/0x30
[35161.772961]  [<ffffffff810ccfeb>] ? add_to_page_cache_locked+0x8b/0xe0
[35161.773052]  [<ffffffff8114ce23>] blkdev_readpage+0x13/0x20
[35161.773142]  [<ffffffff810d81c9>] __do_page_cache_readahead+0x1d9/0x260
[35161.773234]  [<ffffffff810d857c>] ra_submit+0x1c/0x20
[35161.773321]  [<ffffffff810d868d>] ondemand_readahead+0x10d/0x230
[35161.773413]  [<ffffffff812d283d>] ? copy_user_generic_string+0x2d/0x40
[35161.773503]  [<ffffffff810d8830>] page_cache_async_readahead+0x80/0xa0
[35161.773596]  [<ffffffff810ce86b>] generic_file_aio_read+0x48b/0x780
[35161.773688]  [<ffffffff811183c2>] do_sync_read+0xe2/0x120
[35161.773778]  [<ffffffff8127db83>] ? security_file_permission+0x93/0xb0
[35161.773869]  [<ffffffff81118c93>] vfs_read+0xc3/0x170
[35161.773956]  [<ffffffff81118d8c>] sys_read+0x4c/0x90
[35161.774044]  [<ffffffff816acfca>] ? system_call_after_swapgs+0x17/0x59
[35161.774135]  [<ffffffff816ad022>] system_call_fastpath+0x16/0x1b

I understand folks don't want to debug processes that are "Tainted". The log above shows one process (568) as "Not Tainted", the other (3339) is "Tainted". Really "dd" is Tainted? Or I'm just interpreting this wrong?
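
On the taint question: the taint state belongs to the running kernel, not to the process named in the report. In "Tainted: G      D" the D means the kernel has already died once (the general protection fault just above it), so every later report carries that flag; G only says that no proprietary modules are loaded. The same state can be read as a bitmask from /proc/sys/kernel/tainted. A minimal sketch in Python for decoding that mask, assuming the flag letters documented for 3.x kernels (the name decode_taint is just for illustration):

```python
# Taint flag letters as printed in oops headers (3.x era kernels), keyed by bit number.
TAINT_FLAGS = {
    0: "P",   # proprietary module loaded
    1: "F",   # module force-loaded
    2: "S",   # SMP kernel on hardware not certified for SMP
    3: "R",   # module force-unloaded
    4: "M",   # machine check exception occurred
    5: "B",   # bad page reference
    6: "U",   # taint requested by userspace
    7: "D",   # kernel died recently (an oops or BUG happened)
    8: "A",   # ACPI table overridden
    9: "W",   # a kernel warning was issued
    10: "C",  # staging driver loaded
    11: "I",  # firmware workaround applied
    12: "O",  # out-of-tree module loaded
}


def decode_taint(mask):
    """Return the taint letters set in the integer read from /proc/sys/kernel/tainted."""
    return [TAINT_FLAGS[bit] for bit in sorted(TAINT_FLAGS) if mask & (1 << bit)]
```

A mask with only bit 7 set (the value 128) decodes to just D, which is what the dd report would be expected to show after the earlier fault.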

Full dmesg:
http://pastebin.com/raw.php?i=04Y9Kx75

Full kernel .config:
http://pastebin.com/raw.php?i=kUqbhM7z

Any help appreciated.

Thanks,
Mark

NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 32170
Location: 56N 3W

Posted: Fri Nov 30, 2012 10:26 pm

MarkCu,

CONFIG_HZ_1000=y is known to cause problems on some hardware. It's not needed on a headless system either.
Try 100Hz instead.

You also have several debug options on in your kernel, I did not check them all. Debug options always cause logspam and sometimes interfere with normal operation.
Debug options should only be on if you are debugging that part of the kernel.

While you are fixing your kernel timer, turn off all the debug stuff too.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

MarkCu
Posted: Sat Dec 01, 2012 1:08 am

Thanks for the advice.

Recompiled my kernel with suggested changes.

Still crashing:

Worth mentioning - don't know if it matters or not - but I'm
running without swap. Figured 8G memory should be plenty
for this config.

Same command:
dd if=/dev/md127 of=/dev/null bs=1024

dmesg Result:
Code:

[ 2111.906328] general protection fault: 0000 [#1] PREEMPT SMP
[ 2111.906358] CPU 0
[ 2111.906365] Modules linked in: k10temp
[ 2111.906381]
[ 2111.906391] Pid: 565, comm: kswapd0 Not tainted 3.4.9-gentoo #11 System manufacturer System Product Name/C60M1-I
[ 2111.906421] RIP: 0010:[<ffffffff8114a658>]  [<ffffffff8114a658>] drop_buffers+0x28/0xc0
[ 2111.906453] RSP: 0018:ffff880234fb79c0  EFLAGS: 00010206
[ 2111.906467] RAX: 0000000000000000 RBX: ffffea0004883ec0 RCX: 0000000000000000
[ 2111.906484] RDX: 0000000000000000 RSI: ffff880234fb79f8 RDI: ffbf8801326bdf08
[ 2111.906501] RBP: ffff8802365e05e8 R08: 0000000000000003 R09: ffff880234fb6000
[ 2111.906518] R10: ffff880234fb7fd8 R11: ffff880234fb7bb0 R12: ffff8801326bdf08
[ 2111.906536] R13: ffff880234fb79f8 R14: 0000000000000001 R15: ffff880234fb7af0
[ 2111.906554] FS:  00007fdcadc1b700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[ 2111.906574] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2111.906588] CR2: 00007f9bae9e8000 CR3: 00000002340be000 CR4: 00000000000007f0
[ 2111.906606] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2111.906623] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2111.906642] Process kswapd0 (pid: 565, threadinfo ffff880234fb6000, task ffff880235a4e200)
[ 2111.906724] Stack:
[ 2111.906762]  ffff8802365e0560 ffffea0004883ec0 ffff8802365e05e8 0000000000000000
[ 2111.906857]  ffffea0004883ec0 ffffffff8114a73f ffff880234fb7b90 0000000000000000
[ 2111.906951]  ffff880234fb7d80 ffff880234fb7b90 ffffea0004883ee0 ffffffff810e2060
[ 2111.907046] Call Trace:
[ 2111.907096]  [<ffffffff8114a73f>] ? try_to_free_buffers+0x4f/0xc0
[ 2111.907153]  [<ffffffff810e2060>] ? shrink_page_list+0x790/0x970
[ 2111.907209]  [<ffffffff810eb60f>] ? __mod_zone_page_state+0x3f/0x50
[ 2111.907265]  [<ffffffff810e080b>] ? update_isolated_counts.clone.56+0x13b/0x170
[ 2111.907356]  [<ffffffff810e2713>] ? shrink_inactive_list+0x233/0x4d0
[ 2111.907413]  [<ffffffff810e3092>] ? shrink_mem_cgroup_zone+0x392/0x4d0
[ 2111.907471]  [<ffffffff810e3e9a>] ? balance_pgdat+0x4ea/0x6b0
[ 2111.907526]  [<ffffffff810e41dc>] ? kswapd+0x17c/0x430
[ 2111.907579]  [<ffffffff816b129c>] ? __schedule+0x27c/0x5e0
[ 2111.907632]  [<ffffffff81059790>] ? wake_up_bit+0x40/0x40
[ 2111.907685]  [<ffffffff810e4060>] ? balance_pgdat+0x6b0/0x6b0
[ 2111.907738]  [<ffffffff810e4060>] ? balance_pgdat+0x6b0/0x6b0
[ 2111.907791]  [<ffffffff81058fee>] ? kthread+0x9e/0xb0
[ 2111.907844]  [<ffffffff816b42d4>] ? kernel_thread_helper+0x4/0x10
[ 2111.907900]  [<ffffffff81058f50>] ? flush_kthread_worker+0xc0/0xc0
[ 2111.907955]  [<ffffffff816b42d0>] ? gs_change+0xb/0xb
[ 2111.908003] Code: 00 00 00 41 55 49 89 f5 41 54 55 53 48 89 fb 48 83 ec 08 48 8b 07 f6 c4 08 0f 84 99 00 00 00 4c 8b 67 30 4c 89 e7 0f 1f 44 00 00 <48> 8b 07 f6 c4 08 74 0e 48 8b 43 08 48 85 c0 74 05 f0 80 48 7b
[ 2111.908304] RIP  [<ffffffff8114a658>] drop_buffers+0x28/0xc0
[ 2111.908360]  RSP <ffff880234fb79c0>
[ 2111.908707] ---[ end trace c23f7d9f938c612f ]---
[ 2111.908797] note: kswapd0[565] exited with preempt_count 1
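
One detail stands out in both traces posted so far. In this one RDI is ffbf8801326bdf08 while R12 is ffff8801326bdf08; in the first one RDI was ffbf8801332bdf08 and R13 was ffff8801332bdf08. Reading the Code: bytes, RDI appears to be copied from that other register right before the faulting load, so the two would be expected to match. An address with those top bits is non-canonical on x86-64, and touching one raises a general protection fault. A small Python sketch to compare the values (differing_bits is a made-up helper, not a kernel tool):

```python
def differing_bits(a, b):
    """Return the bit positions at which two 64-bit values differ."""
    diff = a ^ b
    return [bit for bit in range(64) if diff & (1 << bit)]


# Register values copied from the two traces in this thread.
first = differing_bits(0xffbf8801332bdf08, 0xffff8801332bdf08)   # RDI vs R13
second = differing_bits(0xffbf8801326bdf08, 0xffff8801326bdf08)  # RDI vs R12
print(first, second)  # prints: [54] [54]
```

Both pairs differ only in bit 54. That is only a reading of two traces, not proof, but a single flipped bit between two registers that should hold the same pointer is hard to pin on a software bug.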


dmesg:
http://pastebin.com/raw.php?i=LxmU9QKX

.config:
http://pastebin.com/raw.php?i=jV3pQ8a5

Any other ideas?

Thanks,

Mark

NeddySeagoon
Posted: Sat Dec 01, 2012 3:05 pm

MarkCu,

Code:
CONFIG_SLUB_DEBUG=y
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_HWMON_DEBUG_CHIP=y
CONFIG_DEBUG_FS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y


Use the search (press /) in
Code:
make menuconfig
to find the above options and turn them off.
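
For a quick look at which options with DEBUG in the name are still enabled, the .config can also be scanned directly. A minimal Python sketch (enabled_debug_options is a hypothetical helper; not every option it lists is necessarily harmful, or even selectable in menuconfig):

```python
def enabled_debug_options(config_text):
    """List CONFIG_* names containing DEBUG that are set to =y in a kernel .config."""
    found = []
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments ("# CONFIG_FOO is not set") and are skipped.
        if line.startswith("CONFIG_") and line.endswith("=y"):
            name = line[:-2]
            if "DEBUG" in name:
                found.append(name)
    return found
```

It would typically be fed the contents of /usr/src/linux/.config, read with open(...).read().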

Not having swap does not stop the kernel swapping, it just robs the kernel of the ability to move dynamically allocated RAM to disk.
The kernel will still swap by discarding from RAM any data or code that has a permanent home on disk, then reloading it when it's needed again.
Unless you are running a diskless node, a small swap, say 512 MB, is a good thing.

You can make a swap file if you want to test your swap theory but I agree, no swap is unlikely to be the problem.

MarkCu
Posted: Sun Dec 02, 2012 8:13 pm

OK, managed to get most of those other DEBUG kernel options off.
One, I couldn't figure out how to disable:

CONFIG_X86_DEBUGCTLMSR=y

The help doesn't show the dependencies, nor where it is, nor much else, and I can't
find it.

Anyway, similar results:
Code:

[58460.314388] ------------[ cut here ]------------
[58460.314414] Kernel BUG at ffffffff81116ca6 [verbose debug info unavailable]
[58460.314433] invalid opcode: 0000 [#1] PREEMPT SMP
[58460.314452] CPU 0
[58460.314458] Modules linked in: k10temp
[58460.314474]
[58460.314483] Pid: 562, comm: kswapd0 Not tainted 3.4.9-gentoo #12 System manufacturer System Product Name/C60M1-I
[58460.314513] RIP: 0010:[<ffffffff81116ca6>]  [<ffffffff81116ca6>] free_buffer_head+0x66/0x80
[58460.314543] RSP: 0018:ffff880234f6b9f0  EFLAGS: 00010287
[58460.314558] RAX: ffff880124837ce0 RBX: ffff880124837c98 RCX: 0000000000000000
[58460.314575] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff880124837c98
[58460.314592] RBP: ffff88023646d968 R08: 0000000000000003 R09: ffff880234f6a000
[58460.314609] R10: ffff880234f6bfd8 R11: ffff880234f6bbd0 R12: 0000000000000001
[58460.314626] R13: ffffea00048f8c40 R14: 0000000000000001 R15: ffff880234f6bb00
[58460.314644] FS:  00007f2118045700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[58460.314696] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[58460.314742] CR2: 00007fd7644c1000 CR3: 00000002317dd000 CR4: 00000000000007f0
[58460.314792] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58460.314841] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[58460.314891] Process kswapd0 (pid: 562, threadinfo ffff880234f6a000, task ffff880235433600)
[58460.314973] Stack:
[58460.315011]  ffff88023646d968 ffff880124837c98 ffff88023646d968 ffffffff81116eec
[58460.315099]  ffff880234f6bbb0 ffff880124837c98 ffff880234f6bda0 ffff880234f6bbb0
[58460.315191]  ffffea00048f8c60 ffffffff810b8e00 0000000000010dc0 ffff880234f6bac0
[58460.315286] Call Trace:
[58460.315335]  [<ffffffff81116eec>] ? try_to_free_buffers+0x7c/0xc0
[58460.315392]  [<ffffffff810b8e00>] ? shrink_page_list+0x740/0x8c0
[58460.315447]  [<ffffffff810c01df>] ? __mod_zone_page_state+0x3f/0x50
[58460.315502]  [<ffffffff810b794b>] ? update_isolated_counts.clone.53+0x13b/0x170
[58460.315591]  [<ffffffff810b94a6>] ? shrink_inactive_list+0x286/0x470
[58460.315641]  [<ffffffff810b9d82>] ? shrink_mem_cgroup_zone+0x3a2/0x4e0
[58460.315693]  [<ffffffff810eefbe>] ? grab_super_passive+0x3e/0x90
[58460.315742]  [<ffffffff810baa9a>] ? balance_pgdat+0x4fa/0x6c0
[58460.315792]  [<ffffffff810badf6>] ? kswapd+0x196/0x300
[58460.315840]  [<ffffffff81051b20>] ? wake_up_bit+0x40/0x40
[58460.315887]  [<ffffffff810bac60>] ? balance_pgdat+0x6c0/0x6c0
[58460.315936]  [<ffffffff810bac60>] ? balance_pgdat+0x6c0/0x6c0
[58460.315983]  [<ffffffff8105144e>] ? kthread+0x9e/0xb0
[58460.316032]  [<ffffffff8163e314>] ? kernel_thread_helper+0x4/0x10
[58460.316081]  [<ffffffff810513b0>] ? flush_kthread_worker+0xc0/0xc0
[58460.316130]  [<ffffffff8163e310>] ? gs_change+0xb/0xb
[58460.316174] Code: 65 ff 0c 25 60 e2 00 00 e8 38 ff ff ff 83 6b 1c 01 48 8b 85 38 e0 ff ff a8 08 75 11 48 8b 5c 24 08 48 8b 6c 24 10 48 83 c4 18 c3 <0f> 0b 48 8b 5c 24 08 48 8b 6c 24 10 48 83 c4 18 e9 15 4a 52 00
[58460.316434] RIP  [<ffffffff81116ca6>] free_buffer_head+0x66/0x80
[58460.316483]  RSP <ffff880234f6b9f0>
[58460.319321] ---[ end trace fa6084efedc140cf ]---


dmesg:
http://pastebin.com/raw.php?i=rXvLKJ5w

.config
http://pastebin.com/raw.php?i=cHPrW85H

Also tried adding some swap - no difference.

Thanks

Mark

NeddySeagoon
Posted: Sun Dec 02, 2012 8:31 pm

MarkCu,

[58460.314414] Kernel BUG at ffffffff81116ca6 [verbose debug info unavailable]
[58460.314433] invalid opcode: 0000 [#1] PREEMPT SMP

Invalid opcode means the system tried to execute an instruction that the CPU does not understand.
If it's in the kernel, you have set the wrong CPU type in the kernel.

If it's in a program, your CFLAGS or USE flags do not match your CPU.

Please post your emerge --info output and your /proc/cpuinfo.
If you have anything in /etc/portage/package.use ... all of that too.

MarkCu
Posted: Sun Dec 02, 2012 8:46 pm

Kernel CPU is just x86_64. Same for USE

Code:

% emerge --info
Portage 2.1.11.9 (default/linux/amd64/10.0, gcc-4.5.4, glibc-2.15-r2, 3.4.9-gentoo x86_64)
=================================================================
System uname: Linux-3.4.9-gentoo-x86_64-AMD_C-60_APU_with_Radeon-tm-_HD_Graphics-with-gentoo-2.1
Timestamp of tree: Wed, 10 Oct 2012 00:45:01 +0000
app-shells/bash:          4.2_p37
dev-lang/python:          2.7.3-r2, 3.2.3
dev-util/cmake:           2.8.9
dev-util/pkgconfig:       0.27.1
sys-apps/baselayout:      2.1-r1
sys-apps/openrc:          0.9.8.4
sys-apps/sandbox:         2.5
sys-devel/autoconf:       2.13, 2.68
sys-devel/automake:       1.11.6
sys-devel/binutils:       2.22-r1
sys-devel/gcc:            4.5.4
sys-devel/gcc-config:     1.7.3
sys-devel/libtool:        2.4-r1
sys-devel/make:           3.82-r3
sys-kernel/linux-headers: 3.4-r2 (virtual/os-headers)
sys-libs/glibc:           2.15-r2
Repositories: gentoo
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -march=x86-64"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-O2 -march=x86-64"
DISTDIR="/usr/portage/distfiles"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles news parallel-fetch parse-eapi-ebuild-head protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="ftp://ftp.ucsb.edu/pub/mirrors/linux/gentoo/"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY=""
SYNC="rsync://rsync.us.gentoo.org/gentoo-portage"
USE="X acl amd64 apng berkdb bluray bzip2 cddb cli consolekit cracklib crypt cups cxx dbus dri embedded examples exif ffmpeg fortran gdbm gif gpm gudev hwdb iconv imap inotify ipv6 javascrip javascript jpeg jpeg2k lm_sensors lzma midi minizip mmx modules mp3 mp4 mpeg mudflap multilib ncurses nls nptl ogg openmp pam pcre perl png policykit ppds pppd python readline session sse sse2 ssl svg taglib tcpd thumbnail tiff unicode vorbis x264 zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" PHP_TARGETS="php5-3" PYTHON_TARGETS="python3_2 python2_7" RUBY_TARGETS="ruby18 ruby19" USERLAND="GNU" VIDEO_CARDS="fbdev glint intel mach64 mga 
neomagic nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LINGUAS, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHON


Code:

% cat /etc/portage/make.conf
# These settings were set by the catalyst build script that automatically
# built this stage.
# Please consult /usr/share/portage/config/make.conf.example for a more
# detailed example.
CFLAGS="-O2 -march=x86-64"
CXXFLAGS="${CFLAGS}"
# WARNING: Changing your CHOST is not something that should be done lightly.
# Please consult http://www.gentoo.org/doc/en/change-chost.xml before changing.
CHOST="x86_64-pc-linux-gnu"
# These are the USE flags that were used in addition to what is provided by the
# profile used for building.
USE="mmx sse sse2 python png X gif jpeg mp3 mp4 mpeg jpeg2k tiff apng ppds ssl dbus gudev policykit embedded consolekit ogg vorbis hwdb midi readline imap -gnome -kde minizip examples lzma perl bluray x264 svg cddb exif ffmpeg inotify javascrip javascript taglib thumbnail lm_sensors"

GENTOO_MIRRORS="ftp://ftp.ucsb.edu/pub/mirrors/linux/gentoo/"

SYNC="rsync://rsync.us.gentoo.org/gentoo-portage"
MAKEOPTS="-j2"


Code:

% cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 20
model           : 2
model name      : AMD C-60 APU with Radeon(tm) HD Graphics
stepping        : 0
microcode       : 0x500010d
cpu MHz         : 1000.010
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt arat cpb hw_pstate npt lbrv svm_lock nrip_save pausefilter
bogomips        : 2000.02
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate cpb

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 20
model           : 2
model name      : AMD C-60 APU with Radeon(tm) HD Graphics
stepping        : 0
microcode       : 0x500010d
cpu MHz         : 1000.010
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 6
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt arat cpb hw_pstate npt lbrv svm_lock nrip_save pausefilter
bogomips        : 2000.02
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate cpb

NeddySeagoon
Posted: Sun Dec 02, 2012 9:36 pm

MarkCu,
These flags
Code:
mmx sse sse2 mmxext ssse3 sse4a
from your cpuinfo indicate optional instruction set extensions that are present.
Usually, AMD CPUs have 3Dnow and 3Dnowext too.

Code:
-march=x86-64
should be safe.

Code:
mmx sse sse2
are in your USE flags but that's OK as your CPU has those flags too.
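
That comparison can be done mechanically. A rough Python sketch, assuming the USE flag names match the /proc/cpuinfo names exactly (true for mmx, sse and sse2, but not for every flag); missing_cpu_flags is a made-up name:

```python
def missing_cpu_flags(use_flags, cpuinfo_flags_line):
    """Return the USE flags that do not appear in a 'flags : ...' line from /proc/cpuinfo."""
    # Everything after the first colon is the space-separated flag list.
    _, _, flag_text = cpuinfo_flags_line.partition(":")
    cpu_flags = set(flag_text.split())
    return [flag for flag in use_flags if flag not in cpu_flags]
```

With the flags line posted above and the USE flags mmx, sse and sse2, the result should come back empty, which matches the check done by eye.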

As you say, your kernel has CONFIG_GENERIC_CPU=y, so that's OK too.

So, it's all OK, it just doesn't work :(

Unfortunately, I'm out of ideas. Your kernel is
Code:
 Linux version 3.4.9-gentoo
You could try the testing gentoo-sources in case it really is a kernel bug and it's now fixed.
You could also try rolling your own kernel with the help of kernel-seeds.org.

MarkCu
Posted: Sun Dec 02, 2012 9:41 pm

Oh, and package.use

Code:

% cat /etc/portage/package.use
media-video/vlc dvd ffmpeg mpeg mad wxwindows aac dts a52 ogg flac theora oggvorbis matroska freetype bidi xv svga gnutls stream vlm httpd cdda vcd cdio live lua truetype debug
net-misc/ntp caps
app-portage/layman git subversion
app-admin/gkrellm X lm_sensors

NeddySeagoon
Posted: Sun Dec 02, 2012 9:58 pm

MarkCu,

package.use is all good.

MarkCu
Posted: Wed Dec 05, 2012 4:08 pm

FYI, for those following - compiled a newer "testing" kernel - gentoo-sources-3.4.11.

Still same crashes. Going to try 3.6.8 next.

MarkCu
Posted: Thu Dec 06, 2012 4:54 am

gentoo-sources-3.6.8 doesn't help either. Still crashing.

Neddy indicates the fault is showing an illegal opcode. Can I tell from the
dmesg log what the opcode is that's causing the problem? Or, is there a
debug switch I can turn on that's more verbose?

Any pointers are appreciated. I can google around about kernel
debugging, but it's a big subject.

Thanks,

Mark

NeddySeagoon
Posted: Thu Dec 06, 2012 6:49 pm

MarkCu,

The 0000 after "invalid opcode:" is the error code for the trap rather than the opcode itself - that was in one of your posts but it doesn't help.
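
The instruction bytes themselves are in the Code: line at the end of each report, with the byte where the fault hit wrapped in <>. A minimal Python sketch for pulling them out of a pasted line (faulting_bytes is a hypothetical helper name):

```python
def faulting_bytes(code_line, count=2):
    """Return `count` hex bytes starting at the one marked <xx> in an oops 'Code:' line."""
    # Drop the timestamp and the "Code:" label, then split into byte tokens.
    _, _, rest = code_line.partition("Code:")
    tokens = rest.split()
    for index, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            picked = [token.strip("<>")] + tokens[index + 1:index + count]
            return " ".join(picked)
    return None  # no marked byte found
```

In the free_buffer_head report the marked bytes are 0f 0b, which is the x86 ud2 instruction that the kernel's BUG() macro plants on purpose. That fits the "Kernel BUG at" line above it: the invalid opcode there looks like a failed sanity check in the kernel, not an instruction the CPU lacks.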

If it was really a kernel bug, lots of users would see it and it would be all over Google. It's not.
That points to your hardware somewhere.

If you have several sticks of RAM, remove them all except one. Now what happens?
Try them in turn, one at a time.

Can you try other binary distros or the Gentoo liveDVD? That would prove it's not something gone wrong with your builds, as you would be running code built elsewhere.

toralf
Advocate
Joined: 01 Feb 2004
Posts: 2717
Location: Hamburg/Germany

Posted: Thu Dec 06, 2012 7:00 pm

Because I see kswapd in the trace, and Linus delayed the current kernel due to a kswapd issue, and GKH added this patch to the stable kernel tree a few minutes ago: https://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob;f=queue-3.6/mm-vmscan-fix-endless-loop-in-kswapd-balancing.patch;h=8b0b3bee5d8824bc999adf8bcc2edbc555bc173d;hb=a86ab94abf14d4c06752f5f7363fe72a97f3e372 - it is probably worth testing that particular patch?

krinn
Advocate
Joined: 02 May 2003
Posts: 4339

Posted: Thu Dec 06, 2012 10:37 pm

I've never been an expert on the amd64 arch, but x86-64 is not listed as a -march value in the gcc 4.5.4 documentation, so your -march= setting might give unpredictable results.
Switch to native (generic is only accepted for -mtune, not for -march).
http://gcc.gnu.org/onlinedocs/gcc-4.5.4/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options

MarkCu
Posted: Fri Dec 07, 2012 12:22 am

toralf,

Thanks for the pointer. I went and installed git-sources-3.7_rc8

Configured/make/installed kernel.

Thought that might have done it. My test ran about 45 minutes and no crash. (Previously, it'd always crash in less than 10 minutes).

Still crashed in the end.

Similar crash report. Whole thing's here:
http://pastebin.com/raw.php?i=PeGmLcDg

.config:
http://pastebin.com/raw.php?i=KUgSNQnf

Neddy - testing/swapping memory was one of the first things I did. No change. Plus memtest86+ ran for 36 hours with no reported issues. I've got two DIMMs, and tried running with just one or the other. No change. I've been swinging back and forth between HW and SW problems. I've eliminated just about all I can HW-wise - all that's left is the CPU and power supply.

I can try some sort of other live DVD - although I need to go through the hoops to make it work from a USB stick - no CD/DVD/etc drive installed.

Krinn - thanks for the pointer. I had -march=native before. My intent was to un-optimize it even more - just generic x86-64. The way I understand it native could be using SSE/ etc... I was trying to remove even these usages to pare down my problem.

I'll read up more to see what the appropriate march is - probably generic?

Thanks all so far for all the pointers. Still digging.

wcg
Guru
Joined: 06 Jan 2009
Posts: 588

Posted: Fri Dec 07, 2012 11:59 am

What is this?
Code:

Modules linked in: k10temp

_________________
TIA

MarkCu
Posted: Fri Dec 07, 2012 3:55 pm

Quote:

What is this?

Code:

Modules linked in: k10temp


One of the lm_sensors modules, I think. It's used to monitor temperature. The crash was happening before I installed it, but I can take it out.
wcg
Guru


Joined: 06 Jan 2009
Posts: 588

PostPosted: Sat Dec 08, 2012 5:23 am    Post subject: Reply with quote

Without having examined everything in detail, a bad opcode
is almost always a result of an inappropriate CFLAG. (Bad
assembly code that uses an opcode not supported by the cpu,
a binutils bug, or a gcc bug would be possible, too, if less common.)

The kernel pretty much sets its own CFLAGS, though, so if you
have the correct architecture and use a stable gcc, inappropriate
CFLAGS would be pretty rare in kernel compiles. I would look for
something in "Processor Type and Features" in the kernel .config.

(I use K8 for K10 cpus. Seems to work.)
_________________
TIA
kondor6c
n00b


Joined: 12 Jul 2007
Posts: 9

PostPosted: Thu Dec 13, 2012 5:37 pm    Post subject: Reply with quote

I had an issue with my ASUS P8Z77-V: it had seemingly random segfaults. I eventually tracked it down to my RAM not being on their approved memory vendor list.
BitJam
Advocate


Joined: 12 Aug 2003
Posts: 2454
Location: Silver City, NM

PostPosted: Thu Dec 13, 2012 6:57 pm    Post subject: Reply with quote

A software bug should have been much more repeatable, so even though your hardware tests passed I think it must be a hardware bug. The problem is that something complicated is triggering the hardware bug, so it is not caught by the simpler tests.

You have established that it is not a heat related issue or a RAM issue. I suspect the problem is related to the disk drive subsystem since that is being exercised during your failures but was not exercised in any of your hardware tests. It is also possible the problem is the CPU.

Unfortunately, the next level of tests involve swapping either the CPU or the motherboard. I suggest you report this as defective product and try to get a refund or replacement.
Ant P.
Advocate


Joined: 18 Apr 2009
Posts: 2433
Location: UK

PostPosted: Thu Dec 13, 2012 9:29 pm    Post subject: Reply with quote

For the C/E-series CPUs, don't use -march=k8 in the kernel or make.conf. That sets 3DNow flags which the newer CPUs don't have.
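
Whether a given CPU really lacks 3DNow! can be checked from the flags the kernel reports. A minimal sketch, assuming a Linux /proc/cpuinfo with a "flags" line; the helper name has_cpu_flag is made up for illustration:

```shell
#!/bin/sh
# has_cpu_flag FLAG [FILE]: succeed if FLAG is listed on the "flags" line of
# FILE (defaults to /proc/cpuinfo). The function name is hypothetical.
has_cpu_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    # -w matches whole words only, so "3dnow" does not match "3dnowext".
    grep -m1 '^flags' "$file" | grep -qw -- "$flag"
}

for f in 3dnow 3dnowext; do
    if has_cpu_flag "$f"; then
        echo "$f: present"
    else
        echo "$f: absent"
    fi
done
```

If both come back absent, any -march setting that enables 3DNow! is a candidate for illegal-instruction trouble on that machine.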
wcg
Guru


Joined: 06 Jan 2009
Posts: 588

PostPosted: Fri Dec 14, 2012 9:25 pm    Post subject: Reply with quote

Quote:
For the C/E-series CPUs, don't use -march=k8 in the kernel or make.conf.


I was wondering why someone would use CONFIG_GENERIC_CPU with a K10
architecture chip. So these AMD APUs are not K10s (perhaps some features
in common, but not drop-in replacements that will necessarily run
the same compiled code). While that module may not be the cause of the error,
one wonders if the lm_sensors k10temp module actually works with the AMD HSA
(Fusion) architectures.

dmesg:
Code:

CPU0: AMD C-60 APU with Radeon(tm) HD Graphics stepping 00


( http://en.wikipedia.org/wiki/AMD_Fusion )

kernel .config:
Code:

# CONFIG_MK8 is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64


Could the box need more low memory protection than 64K?
Code:

CONFIG_X86_RESERVE_LOW=64


This can be set as high as 640.
_________________
TIA
MarkCu
n00b


Joined: 28 Nov 2012
Posts: 15

PostPosted: Tue Dec 18, 2012 12:03 am    Post subject: Reply with quote

Quote:

For the C/E-series CPUs, don't use -march=k8 in the kernel or make.conf. That sets 3DNow flags which the newer CPUs don't have.


Info like this is what I'm looking for. Thanks.

I've recompiled "world" with new gentoo CFLAGS:
CFLAGS="-O2 -march=athlon64 -mtune=generic"

For the kernel, just this is sufficient, right?
Code:

# CONFIG_MK8 is not set
CONFIG_GENERIC_CPU=y


Still no change in behavior.

I'll take the k10temp module out next time I recompile. I only added it to check the temps when I started noticing the crashes. It was crashing without it.

Quote:

A software bug should have been much more repeatable, so even though your hardware tests passed I think it must be a hardware bug. The problem is that something complicated is triggering the hardware bug, so it is not caught by the simpler tests.


The test is very repeatable. The dd command above fails EVERY time - it never completes successfully, and I always get a kernel crash. Sometimes it takes a little longer, but it always crashes.

I'm trying to dig more to convince myself it's hardware. Think it might be trouble getting an RMA for this - I'm not even convinced myself it's HW. Gotta replace the whole motherboard and CPU - it's a combo (CPU is BGA soldered to the board). The thing works fine for just about everything else till I start hitting it hard with the backups.

Code:

CONFIG_X86_RESERVE_LOW=64


I'll try upping this next...

I'm also going to try and change my test to read from the raw drive(s) instead of the raw RAID device. Those results may be informative.
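
A minimal sketch of that per-drive read test, assuming plain dd and placeholder device names (the real array members are listed in /proc/mdstat); the helper name read_test is made up for illustration:

```shell
#!/bin/sh
# read_test PATH...: read each path end to end with dd and report the result.
# The function name is hypothetical. dd exits non-zero on a read error.
read_test() {
    for dev in "$@"; do
        # 1 MiB blocks, written as a plain number for portability.
        if dd if="$dev" of=/dev/null bs=1048576 2>/dev/null; then
            echo "$dev: read OK"
        else
            echo "$dev: read FAILED"
        fi
    done
}

# Example call; /dev/sda and /dev/sdb are assumptions, not the actual members:
# read_test /dev/sda /dev/sdb
```

Note that "read FAILED" only catches ordinary I/O errors. If the kernel oopses mid-read, the useful signal is which device was being read when the box went down, so running one drive at a time keeps the result unambiguous.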
BitJam
Advocate


Joined: 12 Aug 2003
Posts: 2454
Location: Silver City, NM

PostPosted: Tue Dec 18, 2012 1:30 am    Post subject: Reply with quote

MarkCu wrote:
The test is very repeatable. The dd command above fails EVERY time - it never completes successfully, and I always get a kernel crash. Sometimes it takes a little longer, but it always crashes.
Is the crash always in the same place in the code? When I first installed Gentoo, I had a hardware issue where I could consistently get my machine to crash when I was doing big compiles when using a ReiserFS but not ext2. But no two crashes were identical.
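
One way to answer that question is to tally the faulting symbol across saved crash logs. A minimal sketch, assuming each oops was saved to a text file and that its "RIP:" line has the same layout as the one in the first post (address in brackets followed by symbol+offset/size); the helper name rip_summary is made up for illustration:

```shell
#!/bin/sh
# rip_summary FILE...: count how often each faulting symbol appears on the
# "RIP: " lines of saved oops logs. The function name is hypothetical.
rip_summary() {
    grep -h 'RIP: ' "$@" |
        # Keep only the symbol name from "... [<addr>] symbol+0xOFF/0xSIZE".
        sed -n 's/.*\] \([A-Za-z_][A-Za-z0-9_.]*\)+0x[0-9a-f]*\/0x[0-9a-f]*.*/\1/p' |
        sort | uniq -c | sort -rn
}

# Example call; the log file names are assumptions:
# rip_summary crash-1.log crash-2.log crash-3.log
```

A single symbol dominating the tally would point toward one code path, while a spread of unrelated symbols fits the "no two crashes identical" pattern described here.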
Quote:
I'm trying to dig more to convince myself it's hardware. Think it might be trouble getting an RMA for this - I'm not even convinced myself it's HW. Gotta replace the whole motherboard and CPU - it's a combo (CPU is BGA soldered to the board). The thing works fine for just about everything else till I start hitting it hard with the backups.
Usually there is a time limit on an RMA and the big question is who will pay for shipping. You don't have to ship it back immediately but I don't want you to miss out on options while you are trying to diagnose the problem. I think you should get the RMA process in motion. Ideally, you could discuss it with someone and they would give you more time for further testing before you have to send it back.

I really do think you have a hardware problem that is difficult to diagnose. I've run into a few of these over the years and they can suck up a tremendous amount of time and energy. At some point you need to treat it like it's a hardware problem even if you can't prove (even to yourself) that the problem is hardware. It is now extremely unlikely the problem is a bad instruction in the code. If it were, you'd be much closer to pinpointing where in the codebase the problem is.

There is no way mis-tuning CONFIG_X86_RESERVE_LOW could cause the problems you have if they are due to software. If so, then the kernel is garbage and I know it is not garbage. If you want to play around with things to see if you can work around the bug then you could try turning off multi-core support. If a non-smp kernel did work then that would be further evidence of a hardware problem although it would not constitute proof.
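
A rough approximation of the single-core test that avoids a rebuild is to boot the existing SMP kernel with maxcpus=1 or nosmp on the kernel command line. That is not identical to a kernel built without CONFIG_SMP, so treat it as a first pass only. A minimal sketch for confirming how the box actually booted, assuming Linux's /proc/cmdline; the helper name cmdline_has is made up for illustration:

```shell
#!/bin/sh
# cmdline_has PARAM [FILE]: succeed if the kernel command line in FILE
# (defaults to /proc/cmdline) contains PARAM as a whole word.
# The function name is hypothetical.
cmdline_has() {
    param=$1
    file=${2:-/proc/cmdline}
    # Split the command line into one parameter per line, then match exactly.
    tr ' ' '\n' < "$file" | grep -qx -- "$param"
}

if cmdline_has maxcpus=1 || cmdline_has nosmp; then
    echo "booted with SMP restricted"
else
    echo "booted with normal SMP settings"
fi
echo "online CPUs: $(getconf _NPROCESSORS_ONLN)"
```

If the dd test survives with one CPU online, repeating it with a real CONFIG_SMP=n kernel would be the stronger follow-up.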