Gentoo Forums
Kernel 3.1.6 AMD64 inexplicable slow down with heavy use
WvR
Tux's lil' helper


Joined: 03 Mar 2011
Posts: 135
Location: Tsuruga, Japan

Posted: Thu Mar 15, 2012 1:45 pm    Post subject: Kernel 3.1.6 AMD64 inexplicable slow down with heavy use

I hope that some gurus on the forum can shed some light on the following issue.

We have a Dell 7500 Cn workstation with 24 CPUs (six quad-core processors); each CPU has 4 GB of memory, for a total of 96 GB. The system has one 1500 GB hard disk, which is connected to a RAID card. We use Gentoo stable, with the exception of Gnome 3. This system has been working to our full satisfaction for about a year.

We mainly use this workstation for heavy number crunching. We use simulation software written in a special flavor of FORTRAN. This software has some 32-bit / 64-bit issues, but we managed to compile everything with gfortran and the software works. The system has been running for months, sometimes with 30 to 40 simultaneous calculations, each of which uses several GB of memory. Until today, we had never had any performance issues.

In January 2012 I upgraded the machine to kernel 3.1.6. I use genkernel. I rebooted the machine on January 27 and it was running fine until today (15 March 2012).

Today we noticed that routine calculations which normally take about an hour suddenly did not finish after running more than 8 hours. A look in "top" confirms: the system shows a high load (higher than can be expected from the number of processes), and a large % of "waiting" (up to 30%). The calculations are given the status "D" which means "uninterruptible sleep (usually connected to IO)". The machine is nearly unresponsive. After killing all the calculations, the system goes back to normal.
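To put a number on the "waiting" figure that top shows, here is a rough sketch that computes the iowait share from two samples of the aggregate "cpu" line in /proc/stat. It assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name iowait_percent is made up for this example.

```shell
# Print the percentage of CPU time spent in iowait between two samples
# of the aggregate "cpu" line from /proc/stat.
# Assumed field order: cpu user nice system idle iowait irq softirq steal ...
iowait_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal only; the guest fields are already counted in user/nice.
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            t[NR] = total
            w[NR] = $6          # iowait is the 5th counter, i.e. awk field 6
        }
        END {
            dt = t[2] - t[1]
            if (dt > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / dt
            else print "0.0"
        }'
}

# Usage: take two samples a few seconds apart.
#   first=$(grep '^cpu ' /proc/stat); sleep 5; second=$(grep '^cpu ' /proc/stat)
#   iowait_percent "$first" "$second"
```

Logging this once a minute while the calculations run would show when the waiting starts to climb.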

We have tried to identify the problem: running just one calculation is no problem. If we run about 10 simultaneous calculations, after about 10 minutes the system seems to go into a state of shock: the % waiting shoots up, the processes get "D" status and the load goes up; after about 30 minutes the fans run at their maximum. The hard disk light is on continuously, but "top" shows no processes that seem to be linked to this disk activity. The system becomes unresponsive. The calculations do not need to access the HDD; all data is in RAM. This kind of calculation requires a small amount of data but very many iterations.
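A small sketch for seeing what the "D" processes are actually waiting on: ps can print the kernel function a task is sleeping in (the WCHAN column), and a filter keeps only the rows in uninterruptible sleep. The helper name filter_d_state is made up here, and the column layout assumes procps ps.

```shell
# Keep the header line plus every row whose STAT column starts with "D"
# (uninterruptible sleep). Expects input in the format produced by:
#   ps -eo pid,stat,wchan:32,comm
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage:
#   ps -eo pid,stat,wchan:32,comm | filter_d_state
```

If the WCHAN column names the same kernel function for all stuck calculations, that is a hint as to which subsystem they are blocked in.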

In a final move of desperation we rebooted. That did not solve the problem. Then we rebooted into kernel 2.6.36, and guess what: the problem went away with the older kernel.....

The machine was running fine until March 12 and 13. I updated ("emerge --sync" followed by "emerge -auvDN world", "revdep-rebuild") on March 13. No errors were reported. Apparently the update of March 13 did something that causes the kernel to do very weird things. I googled the "D" status. In many posts it is related to disk activity, USB keys and things like that and one post suggested that the "D" status is actually a problem in the kernel in that a program waits for something but does not clear the CPU. Note that with 10 calculations, each calculation takes 2 GB, we are using 20 GB of memory out of 96 GB and 10 out of 24 CPUs. In other words, for this workstation that should not be any problem at all.

The update included "gvfs"; can this be the cause? I also noticed that a set of emul-linux libs was updated; can this be the problem? I thought that maybe there was an update of gcc and glibc, so we recompiled the software with the newest gfortran, but that did not solve the problem.
Dont Panic
Guru


Joined: 20 Jun 2007
Posts: 320
Location: SouthEast U.S.A.

Posted: Thu Mar 15, 2012 2:05 pm

If your kernel is compiled with "CONFIG_MAGIC_SYSRQ=y", you can press <ALT><SysRq>-W to dump a traceback of the blocked ("D" state) tasks to your dmesg log.

The output of the traceback is difficult to read, but it may give you an idea if it is disk I/O, or USB, or something else.
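If the machine is only reachable over ssh, the same dump can be requested by writing "w" to /proc/sysrq-trigger as root. A minimal sketch; the function name dump_blocked_tasks is made up, and the path argument exists only so the function can be tried against a scratch file.

```shell
# Ask the kernel to log stack traces of all blocked ("D" state) tasks,
# the same action as Alt+SysRq+W. Needs root and CONFIG_MAGIC_SYSRQ=y.
# $1 (optional): trigger file, defaults to /proc/sysrq-trigger.
dump_blocked_tasks() {
    trigger="${1:-/proc/sysrq-trigger}"
    if [ ! -w "$trigger" ]; then
        echo "cannot write to $trigger (not root, or SysRq not built in?)" >&2
        return 1
    fi
    echo w > "$trigger"
}

# Usage (as root), then read the traces from the kernel log:
#   dump_blocked_tasks && dmesg | tail -n 200
```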
CkoTuHa
n00b


Joined: 27 Mar 2009
Posts: 74

Posted: Sun Mar 18, 2012 5:23 am    Post subject: not a guru at all

First.
When I run into problems, I always start by looking for a newer plain vanilla kernel release.
As of today the current one is 3.2.11, and the changelog shows that it only reverts one commit from 3.2.10.
So why not try the newest version rather than what is marked stable?

Second.
The best thing I discovered for myself is using what works for the big boys. Their kernel releases are in much better shape and include a lot of fixes, even compared to upstream.
For instance, get and run the latest kernel release from Fedora here:
http://kojipkgs.fedoraproject.org/packages/kernel/3.2.10/3.fc16/

It is the plain vanilla 3.2.0 kernel plus the upstream 3.2.10 patch, which makes it equivalent to upstream version 3.2.10. On top of that, and this is the best part, they include a lot of patches that are soon going to end up in the upstream stable kernel anyway. They cherry-pick patches from the newest development kernel and backport them to their releases. For instance, I use compat-wireless, which is also included, plus the patches from the newest upstream.
The Red Hat guys contribute a lot to kernel.org :)


The actual rpm archive is available here:
http://kojipkgs.fedoraproject.org/packages/kernel/3.2.10/3.fc16/src/kernel-3.2.10-3.fc16.src.rpm

Unpack it with rpmunpack to see the contents. It has the compressed 3.2.0 tarball plus the 3.2.10 patch. Apply that, then look at the kernel.spec file.
It has a changelog that says which patch does what, and the unpacked directory will be full of *.patch files.
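A quick sketch for getting an overview of the patches the spec file declares, once the source RPM has been unpacked into the current directory. The function name list_spec_patches is made up; it only reads the "PatchNNN:" declarations and does not check whether the spec applies a given patch conditionally.

```shell
# Print the patch filenames declared in an RPM spec file, one per line.
# Lines look like "Patch02: git-linus.diff"; the "PatchNN:" prefix and
# any whitespace after the colon are stripped.
list_spec_patches() {
    sed -n 's/^Patch[0-9]*:[[:space:]]*//p' "$1"
}

# Usage, in the directory holding the unpacked source RPM:
#   list_spec_patches kernel.spec > all-patches.txt
```

The resulting list can then be trimmed by hand to the patches you actually want.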

I just apply all of them except the media patches (those that have to do with v4l), then compat-wireless, and on top of that the upstreamed patches, and voilà, it all works nicely.


So I have the list of patches I am going to apply in main.patch. Its contents are:

Code:

Patch02: git-linus.diff

Patch04: linux-2.6-compile-fixes.patch

Patch05: linux-2.6-makefile-after_link.patch

Patch09: linux-2.6-upstream-reverts.patch


Patch100: taint-vbox.patch
Patch160: linux-2.6-32bit-mmap-exec-randomization.patch
Patch161: linux-2.6-i386-nx-emulation.patch

Patch383: linux-2.6-defaults-aspm.patch

Patch390: linux-2.6-defaults-acpi-video.patch
Patch391: linux-2.6-acpi-video-dos.patch
Patch394: linux-2.6-acpi-debug-infinite-loop.patch
Patch395: acpi-ensure-thermal-limits-match-cpu-freq.patch
Patch396: acpi-sony-nonvs-blacklist.patch

Patch450: linux-2.6-input-kill-stupid-messages.patch
Patch452: linux-2.6.30-no-pcspkr-modalias.patch

Patch460: linux-2.6-serial-460800.patch

Patch470: die-floppy-die.patch
Patch471: floppy-drop-disable_hlt-warning.patch

Patch510: linux-2.6-silence-noise.patch
Patch520: quite-apm.patch
Patch530: linux-2.6-silence-fbcon-logo.patch
Patch540: modpost-add-option-to-allow-external-modules-to-avoi.patch

Patch700: linux-2.6-e1000-ich9-montevina.patch

Patch800: linux-2.6-crash-driver.patch

Patch1500: fix_xen_guest_on_old_EC2.patch

Patch1824: drm-intel-next.patch

Patch1826: drm-i915-fbc-stfu.patch

Patch1900: linux-2.6-intel-iommu-igfx.patch

Patch2802: linux-2.6-silence-acpi-blacklist.patch


Patch3500: jbd-jbd2-validate-sb-s_first-in-journal_get_superblo.patch

Patch4000: NFSv4-Reduce-the-footprint-of-the-idmapper.patch
Patch4001: NFSv4-Further-reduce-the-footprint-of-the-idmapper.patch


Patch12016: disable-i8042-check-on-apple-mac.patch

Patch12303: dmar-disable-when-ricoh-multifunction.patch

Patch13002: revert-efi-rtclock.patch
Patch13003: efi-dont-map-boot-services-on-32bit.patch

Patch14000: hibernate-freeze-filesystems.patch

Patch14010: lis3-improve-handling-of-null-rate.patch

Patch20000: utrace.patch

Patch21000: arm-omap-dt-compat.patch
Patch21001: arm-smsc-support-reading-mac-address-from-device-tree.patch

Patch21045: nfs-client-freezer.patch

Patch21050: alps.patch

Patch21070: ext4-Support-check-none-nocheck-mount-options.patch
Patch21071: ext4-Fix-error-handling-on-inode-bitmap-corruption.patch

Patch21073: KVM-x86-extend-struct-x86_emulate_ops-with-get_cpuid.patch
Patch21074: KVM-x86-fix-missing-checks-in-syscall-emulation.patch

Patch21076: rtl8192cu-Fix-WARNING-on-suspend-resume.patch

Patch21080: sysfs-msi-irq-per-device.patch

Patch21082: procfs-parse-mount-options.patch
Patch21083: procfs-add-hidepid-and-gid-mount-options.patch
Patch21084: proc-fix-null-pointer-deref-in-proc_pid_permission.patch

Patch22100: msi-irq-sysfs-warning.patch

Patch21101: hpsa-add-irqf-shared.patch

Patch21226: pci-crs-blacklist.patch

Patch21232: rt2x00_fix_MCU_request_failures.patch

Patch21233: jbd2-clear-BH_Delay-and-BH_Unwritten-in-journal_unmap_buf.patch

Patch21234: e1000e-Avoid-wrong-check-on-TX-hang.patch

Patch21235: scsi-fix-sd_revalidate_disk-oops.patch

Patch21240: ACPICA-Fix-regression-in-FADT-revision-checks.patch

Patch21242: sony-laptop-Enable-keyboard-backlight-by-default.patch

Patch21243: disable-threading-in-compression-for-hibernate.patch

Patch21244: mm-thp-fix-pmd_bad-triggering.patch

Patch21300: unhandled-irqs-switch-to-polling.patch

Patch21350: x86-ioapic-add-register-checks-for-bogus-io-apic-entries.patch

Patch22000: weird-root-dentry-name-debug.patch


Then I have the compat-wireless patches in compat.patch:
Code:

Patch50000: compat-wireless-config-fixups.patch
Patch50001: compat-wireless-pr_fmt-warning-avoidance.patch
Patch50002: compat-wireless-integrated-build.patch
Patch50100: compat-wireless-rtl8192cu-Fix-WARNING-on-suspend-resume.patch

Patch50101: mac80211-fix-debugfs-key-station-symlink.patch
Patch50102: brcmsmac-fix-tx-queue-flush-infinite-loop.patch
Patch50103: mac80211-Use-the-right-headroom-size-for-mesh-mgmt-f.patch
Patch50105: b43-add-option-to-avoid-duplicating-device-support-w.patch
Patch50106: mac80211-update-oper_channel-on-ibss-join.patch
Patch50107: mac80211-set-bss_conf.idle-when-vif-is-connected.patch
Patch50108: iwlwifi-fix-PCI-E-transport-inta-race.patch
Patch50109: bcma-Fix-mem-leak-in-bcma_bus_scan.patch
Patch50110: rt2800lib-fix-wrong-128dBm-when-signal-is-stronger-t.patch
Patch50111: iwlwifi-make-Tx-aggregation-enabled-on-ra-be-at-DEBU.patch
Patch50112: ssb-fix-cardbus-slot-in-hostmode.patch
Patch50113: iwlwifi-don-t-mess-up-QoS-counters-with-non-QoS-fram.patch
Patch50114: mac80211-timeout-a-single-frame-in-the-rx-reorder-bu.patch
Patch50115: ath9k-use-WARN_ON_ONCE-in-ath_rc_get_highest_rix.patch
Patch50116: mwifiex-handle-association-failure-case-correctly.patch
Patch50117: ath9k-Fix-kernel-panic-during-driver-initilization.patch
Patch50118: mwifiex-add-NULL-checks-in-driver-unload-path.patch
Patch50119: ath9k-fix-a-WEP-crypto-related-regression.patch
Patch50120: ath9k_hw-fix-a-RTS-CTS-timeout-regression.patch
Patch50121: bcma-don-t-fail-for-bad-SPROM-CRC.patch
Patch50122: zd1211rw-firmware-needs-duration_id-set-to-zero-for-.patch
Patch50123: mac80211-Fix-a-rwlock-bad-magic-bug.patch
Patch50124: rtlwifi-Modify-rtl_pci_init-to-return-0-on-success.patch
Patch50125: mac80211-call-rate-control-only-after-init.patch
Patch50126: mac80211-do-not-call-rate-control-.tx_status-before-.patch
Patch50127: mwifiex-clear-previous-security-setting-during-assoc.patch
Patch50128: ath9k-stop-on-rates-with-idx-1-in-ath9k-rate-control.patch
Patch50129: ath9k_hw-prevent-writes-to-const-data-on-AR9160.patch
Patch50130: rt2x00-fix-a-possible-NULL-pointer-dereference.patch
Patch50131: iwlwifi-fix-key-removal.patch
Patch50132: mac80211-zero-initialize-count-field-in-ieee80211_tx.patch
Patch50133: mac80211-Fix-a-warning-on-changing-to-monitor-mode-f.patch
Patch50134: brcm80211-smac-fix-endless-retry-of-A-MPDU-transmiss.patch
Patch50135: brcm80211-smac-only-print-block-ack-timeout-message-.patch




And I run this one-liner
Code:
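# Strip the "PatchNNN: " prefix, skip blank lines, and stop at the first patch that fails to apply.
sed -n 's/^Patch[0-9]*: *//p' ../main.patch | while read -r p; do patch -p1 < "../$p" || break; done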

to apply all those patches inside the patched kernel directory (3.2.0 + 3.2.10 mainline patched)

And it all works nicely afterwards. Try it, it might be the best thing you have done to your kernel issues.
aCOSwt
Moderator


Joined: 19 Oct 2007
Posts: 2379
Location: Hilbert space

Posted: Sun Mar 18, 2012 8:48 am

To start with, many of us have faced performance problems since the 3.0 series kernels.
My system's latency increased by 25% compared to 2.6.38, irrespective of the load.

You might be interested in reading : http://www.linuxquestions.org/questions/linux-kernel-70/extremely-sluggish-system-with-newest-kernels-under-high-memory-load-933664/

as well as : http://forums.gentoo.org/viewtopic-t-908954-highlight-.html

For my problem, I suspected the changes made to CFS (the Completely Fair Scheduler) and tested the ck-sources 3.2.6 kernel, which proved to perform much better.
(Of course, my hardware is far less powerful than yours)