Joined: 03 Mar 2011
Location: Tsuruga, Japan
Posted: Thu Mar 15, 2012 1:45 pm Post subject: Kernel 3.1.6 AMD64 inexplicable slow down with heavy use
I hope that some gurus on the forum can shed some light on the following issue.
We have a Dell 7500 Cn workstation, with 24 CPUs (6 times quad core CPU); each CPU has 4 GB of memory for a total of 96 GB. The system has one 1500 GB hard disk, which is connected to some RAID card. We use Gentoo stable with the exception of Gnome 3. This system has been working to our full satisfaction for about a year.
We mainly use this workstation to do heavy number crunching. We use simulation software written in a special flavor of FORTRAN. This software has some 32-bit/64-bit issues, but we managed to compile everything with gfortran and the software works. The system has been running for months with sometimes 30 to 40 simultaneous calculations, each of which uses several GB of memory. Until now, we have never had any performance issues. Until today, unfortunately...
In January 2012 I upgraded the machine to kernel 3.1.6. I use genkernel. I rebooted the machine on January 27 and it was running fine until today (15 March 2012).
Today we noticed that routine calculations which normally take about an hour suddenly did not finish after running more than 8 hours. A look in "top" confirms: the system shows a high load (higher than can be expected from the number of processes), and a large % of "waiting" (up to 30%). The calculations are given the status "D" which means "uninterruptible sleep (usually connected to IO)". The machine is nearly unresponsive. After killing all the calculations, the system goes back to normal.
We have tried to identify the problem: just running one calculation is no problem. If we run about 10 simultaneous calculations, after about 10 minutes the system seems to go into a state of shock: the % waiting shoots up, the processes get "D" status, and the load goes up; after about 30 minutes the fans run at their maximum. The hard disk light is on continuously, but in "top" there are no processes which seem to be linked to this disk activity. The system becomes unresponsive. The calculations do not need to access the HDD; all data is in RAM. This kind of calculation requires a small amount of data but a great many iterations.
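Since "top" only shows the "D" flag and not what the tasks are waiting for, it may help to list the blocked tasks together with the kernel function they are sleeping in. The following is a minimal sketch, assuming a procps-style `ps` that supports the `stat` and `wchan` output fields; the function names `filter_d_state` and `list_d_state` are made up for illustration.

```shell
#!/bin/sh
# Keep the header line plus every row whose STAT column (field 2)
# starts with "D" (uninterruptible sleep).
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Show PID, state, kernel wait channel and command name of blocked tasks.
# Assumes procps ps; "wchan:32" only widens the WCHAN column.
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}
```

Running `list_d_state` a few times while the system is stuck should show whether the calculations all sit in the same wait channel, which can hint at whether the wait is related to disk I/O, memory, or something else.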
In a final act of desperation we rebooted. That did not solve the problem. Then we rebooted into kernel 2.6.36, and guess what: the problem went away with the older kernel...
The machine was running fine until March 12 and 13. I updated ("emerge --sync" followed by "emerge -auvDN world", "revdep-rebuild") on March 13. No errors were reported. Apparently the update of March 13 did something that causes the kernel to do very weird things. I googled the "D" status. In many posts it is related to disk activity, USB keys and the like; one post suggested that the "D" status is actually a problem in the kernel, in that a program waits for something but does not release the CPU. Note that with 10 calculations at 2 GB each, we are using 20 GB of memory out of 96 GB and 10 out of 24 CPUs. In other words, for this workstation that should not be any problem at all.
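One way to check the assumption that everything stays in RAM is to watch whether any swap is in use while the calculations run. This is only a sketch, assuming a Linux-style `/proc/meminfo`; the function name `swap_used_kb` is hypothetical.

```shell
#!/bin/sh
# swap_used_kb [FILE]
# Print SwapTotal minus SwapFree, in kB, from a meminfo-style file.
# FILE defaults to /proc/meminfo.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { print total - free }' "${1:-/proc/meminfo}"
}
```

Calling `swap_used_kb` every minute or so during a run gives a rough picture: a value that keeps growing during the slowdown would point towards swapping, while a steady 0 would make swapping an unlikely explanation for the disk activity.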
I updated "gvfs"; can this be the cause? I also noticed that a set of emul-linux libs was updated; can this be the problem? I thought that maybe there was an update of gcc and glibc, so we recompiled the software with the newest gfortran, but that did not solve the problem.
Joined: 20 Jun 2007
Location: SouthEast U.S.A.
Posted: Thu Mar 15, 2012 2:05 pm Post subject:
If your kernel is compiled with "CONFIG_MAGIC_SYSRQ=y", you can type <ALT><SysRq>-W to dump a traceback of the delayed ("D" state) tasks to your dmesg log.
The output of the traceback is difficult to read, but it may give you an idea if it is disk I/O, or USB, or something else.
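If the machine is only reachable over ssh, or the keyboard combination is awkward, the same dump can be requested by writing the letter "w" to `/proc/sysrq-trigger` as root. Below is a small sketch of that; the function name `dump_blocked_tasks` and its optional path argument are assumptions added for illustration, not part of any tool.

```shell
#!/bin/sh
# dump_blocked_tasks [TRIGGER]
# Write "w" to the SysRq trigger file, which asks the kernel to log
# the blocked ("D" state) tasks. TRIGGER defaults to /proc/sysrq-trigger;
# the argument exists only so the function can be tried on a plain file.
dump_blocked_tasks() {
    trigger=${1:-/proc/sysrq-trigger}
    if [ ! -w "$trigger" ]; then
        echo "cannot write to $trigger (not root, or CONFIG_MAGIC_SYSRQ off?)" >&2
        return 1
    fi
    echo w > "$trigger"
}
```

A typical use would be `dump_blocked_tasks && dmesg | tail -n 200`, run as root while the calculations are stuck.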
Joined: 27 Mar 2009
Posted: Sun Mar 18, 2012 5:23 am Post subject:
Not a guru at all, but I always look at the newest plain vanilla kernel releases first when I run into problems.
As of today the current one is 3.2.11, and from the changelog it is clear that it only reverted one commit from 3.2.10.
So why not try the newest version, rather than what is marked stable?
The best thing I discovered for myself is using what works for the big boys. Their kernel releases are in much better shape and include a lot of fixes, even compared to upstream.
For instance, get and run the latest kernel release from Fedora here:
It is the plain vanilla 3.2.0 kernel plus the upstream 3.2.10 patch, which brings it to the equivalent of upstream 3.2.10. On top of that, and this is the best part, they include a lot of patches that are soon going to end up in the upstream stable kernel anyway. They cherry-pick patches from the newest development kernel and backport them to their releases. For instance, I use compat-wireless, which is also included, plus the patches from newest upstream.
The Red Hat guys contribute a lot to kernel.org.
The actual rpm archive is available here:
rpmunpack it to see the contents. It has the compressed 3.2.0 tarball plus the 3.2.10 patch. Apply that, then look at the kernel.spec file.
It has a changelog, where they say which patch does what. The rpmunpacked directory will be full of *.patch files.
I just apply all of them, excluding the media patches (the ones that have to do with v4l), and voilà. Add compat-wireless and, on top of that, the upstreamed patches, and it all works nicely.
So I have the list of patches I am going to apply in main.patch. Its contents are:
Then I have the compat-wireless patches in compat.patch:
and I run this loop:
for i in $(sed 's/Patch[0-9]*: //' ../main.patch); do patch -p1 < "../$i"; done
to apply all those patches inside the patched kernel directory (3.2.0 + 3.2.10 mainline patched)
And it all works nicely afterwards. Try it; it might be the best thing you have done for your kernel issues.
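A slightly more cautious variant of the loop above is sketched below. It assumes GNU patch (for `--dry-run`) and a list file whose lines look like `Patch123: some-fix.patch`, which is only a guess based on the sed expression in the loop; the function names `patch_names` and `apply_series` are made up for illustration.

```shell
#!/bin/sh
# patch_names LIST
# Print the file names from lines of the form "PatchNN: name.patch".
# Lines that do not match this (assumed) layout are skipped.
patch_names() {
    sed -n 's/^Patch[0-9]*: *//p' "$1"
}

# apply_series LIST PATCHDIR
# Run from the top of the kernel tree. Each patch is tried with
# --dry-run first, so a patch that does not apply leaves the tree
# untouched and the series stops there.
apply_series() {
    list=$1
    patchdir=$2
    for name in $(patch_names "$list"); do
        if ! patch -p1 --dry-run < "$patchdir/$name" > /dev/null; then
            echo "does not apply cleanly: $name" >&2
            return 1
        fi
        patch -p1 < "$patchdir/$name" > /dev/null || return 1
        echo "applied: $name"
    done
}
```

From inside the patched kernel directory this would be called as `apply_series ../main.patch ..`, and again with `../compat.patch` for the compat-wireless set. Stopping at the first failure makes it easier to see which patch conflicts than letting the loop run on.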