AMD64 system slow/unresponsive during disk access...

engineermdr · Post by **engineermdr** » Mon Feb 05, 2007 2:46 pm

devsk wrote: Is it possible for people who are experiencing this issue to try disabling the network during the course of the experiment? I have a feeling that the culprit is not the disk but the network. I can reproduce the temporary freezes (mouse hiccups) while transferring a large file over cifs/smbfs. Otherwise, I can 'cat' as big a file as I want to /dev/null and overload the system, and IO specifically, in other ways, but I can't reproduce the issue.

So, if it's isolated to network-related IO, we can probably get closer to the issue.

Another angle I wanted to cover is whether the freeze is just input/output related (i.e. the freeze is only seen for the mouse, scrolling text in the terminal, keys pressed but not appearing on screen, etc.) and not CPU scheduler related (i.e. a sample program doing just plain CPU-intensive calculations sees a hiccup as well).

I think you're on to something here. I also agree that everything is fine while the network is idle. I previously reported that I had problems while unrar'ing large volumes, but I was also downloading in the background. Things like starting a new konqueror or opening iconified windows are normal while doing heavy I/O without the network. I'm using the forcedeth driver with my NV4 chipset.
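
devsk's question about whether a purely CPU-bound program also sees hiccups could be checked with something like the sketch below. It is only an illustration: the function names, the amount of work per iteration, and the "5 times the median" threshold are arbitrary choices, not anything used in this thread.

```python
import time


def busy_work(n=200_000):
    """Do a fixed amount of pure CPU work (no disk or network access)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def measure_hiccups(iterations=50, factor=5.0):
    """Time identical CPU-bound iterations and return (median, hiccups).

    An iteration counts as a hiccup when it takes more than `factor`
    times the median duration. `factor` is an arbitrary assumption.
    """
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        busy_work()
        durations.append(time.perf_counter() - start)
    median = sorted(durations)[len(durations) // 2]
    hiccups = [(i, d) for i, d in enumerate(durations) if d > factor * median]
    return median, hiccups
```

Running `measure_hiccups(iterations=2000)` once on an idle system and once during the network transfer would show whether the CPU-bound loop is stalled too, or whether only input and display are affected.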

moesasji · Post by **moesasji** » Mon Feb 05, 2007 4:19 pm

Here, switching off the network by running /etc/init.d/net.eth0 stop seems to make no difference.

Calculating an MD5 sum for a DVD iso effectively means that I can't open any other applications, both with and without the network. Note that something as simple as a terminal should be cached in memory... even that won't show up until the CPU/IO intensive task is finished.

JanR · Post by **JanR** » Mon Feb 05, 2007 4:22 pm

Hi,

moesasji wrote: Calculating an MD5 sum for a DVD iso effectively means that I can't open any other applications, both with and without the network. Note that something as simple as a terminal should be cached in memory... even that won't show up until the CPU/IO intensive task is finished.

Does this change if you set swappiness to 5?

Greetings,

Jan
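
For anyone wanting to try JanR's suggestion, the swappiness setting lives in /proc/sys/vm/swappiness. The sketch below reads and writes it; the function names are made up for illustration, the 0 to 100 range check is an assumption about the kernels discussed here, and writing the real file requires root.

```python
SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current swappiness value as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def write_swappiness(value, path=SWAPPINESS_PATH):
    """Set swappiness. Needs root when `path` is the real /proc file.

    The 0..100 check is an assumption; adjust it if your kernel differs.
    """
    if not 0 <= value <= 100:
        raise ValueError("swappiness expected in the range 0..100")
    with open(path, "w") as f:
        f.write(f"{value}\n")
```

The change does not survive a reboot, so it is safe to experiment with: `write_swappiness(5)` before the test and the old value afterwards.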

devsk · Post by **devsk** » Mon Feb 05, 2007 4:32 pm

mdr wrote: I'm using the forcedeth driver with my NV4 chipset.

I have the same. So at least I know that my problem is now limited to the forcedeth driver.

I think there are two unrelated things happening here:

1. One is network driver related.
2. The other is excessive swapping.

I say this because I have not experienced the freezes doing just disk io. I have swappiness at 20.

JanR · Post by **JanR** » Mon Feb 05, 2007 4:39 pm

Hi,

devsk wrote: I think there are two unrelated things happening here:

I agree.

The slowdown (not freeze, mouse still moves) is IMO swap caused.

The freezes have something to do with heavy use of both net and disk.

devsk wrote: I have the same. So at least I know that my problem is now limited to the forcedeth driver.

Not really, I guess. I also have freezes (see other postings), but I run e100. The nvidia interface has been disabled in the BIOS since I experienced IRQ problems in the first days after installation (15 months ago). I then switched to sk98lin for the Yukon Gigabit card, which is also onboard the A8N Premium, but this also led to problems, so in the summer I changed to an old PCI e100, which I use now. The problems seem to be reduced, but heavy NFS in parallel with disk IO still sometimes gives a 10 s freeze (see other post). Therefore, I think it is not related to the forcedeth driver but rather to the network subsystem.

Greetings,

Jan

moesasji · Post by **moesasji** » Mon Feb 05, 2007 4:55 pm

JanR wrote:Hi,
Does this change if you set swappiness to 5?

No, setting the swappiness as low as 5 makes no difference.
Until something like 5% into the iso I can still open something.
If the percentage becomes larger the application only opens after K3B is finished.

For completeness I tested both with and without network enabled.
No difference as far as I can see.

Bornio · Post by **Bornio** » Mon Feb 05, 2007 4:59 pm

Guys, please remember this is an IO-WAIT issue.
It is not limited to particular hardware, a particular CPU, or a particular system.

I suspect that this is a kernel issue with the way it handles IO.

devsk · Post by **devsk** » Mon Feb 05, 2007 5:19 pm

Bornio wrote:Guys, please remember this is an IO-WAIT issue.

Actually, for me it's not. I can saturate my disk IO paths as much as I want but can't reproduce the freezes. As soon as I saturate the network, the freezes start. If by IO you mean network AND disk IO, you are right. But the high IO wait shown by top is not the cause of this. During my experiment, where I cat three >2 GB files to /dev/null, I see high IO wait in top (understandably) but no freezes. Try disabling the network and catting three >2 GB files on the same disk to /dev/null: if that leads to freezes, the problem is disk IO related; otherwise it's likely related to the network. You can even construct a small 'dd' test case for this.
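
A minimal sketch of such a disk-only test, written in Python rather than with cat or dd: it reads a file sequentially, throws the data away, and records how long the slowest single read took. The function name and the 1 MiB chunk size are assumptions for illustration.

```python
import time


def read_to_null(path, chunk_size=1 << 20):
    """Read `path` sequentially and discard the data, like `cat path > /dev/null`.

    Returns (total_bytes, elapsed_seconds, slowest_read_seconds).
    """
    total = 0
    slowest = 0.0
    start = time.perf_counter()
    # buffering=0 gives an unbuffered file object, so each read() is one request
    with open(path, "rb", buffering=0) as f:
        while True:
            t0 = time.perf_counter()
            chunk = f.read(chunk_size)
            dt = time.perf_counter() - t0
            if not chunk:
                break
            total += len(chunk)
            slowest = max(slowest, dt)
    elapsed = time.perf_counter() - start
    return total, elapsed, slowest
```

As in devsk's test, the files should be larger than RAM, otherwise a second run is served from the page cache and no longer exercises the disk. Running several copies in parallel with the network stopped, then again with a network transfer active, separates the two cases.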

Bornio · Post by **Bornio** » Mon Feb 05, 2007 5:25 pm

dstat -afc --output <filename>
Reproduce the "freeze" issue,
then carefully analyze by elimination. That means that if, for example, your iowait shoots up whenever a freeze happens, check what else acts abnormally.
Then take that other thing and set it aside. See if anything else is acting weird, until you have worked through each abnormal activity.

let me know what you find out.
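
The elimination described above can also be scripted instead of done by eye. The sketch below is only one way to do it and makes two assumptions: the saved file has been trimmed to a single header row followed by numeric rows, and "abnormal" means more than three times the column's median. Both the function names and that threshold are invented for illustration.

```python
import csv
import io
import statistics


def load_samples(text):
    """Parse CSV text with one header row followed by numeric rows."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: float(v) for k, v in row.items()} for row in reader]


def abnormal_columns(samples, index, factor=3.0):
    """Return (name, value, median) for every column whose value at row
    `index` exceeds `factor` times that column's median over all rows."""
    flagged = []
    for name in samples[index]:
        median = statistics.median(s[name] for s in samples)
        value = samples[index][name]
        if median > 0 and value > factor * median:
            flagged.append((name, value, median))
    return flagged
```

Call `abnormal_columns(samples, i)` for the row `i` recorded at the moment of a freeze; the columns it returns are the candidates to look at first, and the ones it does not return can be set aside.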

moesasji · Post by **moesasji** » Mon Feb 05, 2007 6:13 pm

Code: Select all

  4   0  95   1   0   0: 59  12   0  28   0   1|  69M    0 |   0     0 >
  4   0  95   1   0   0: 53  12   0  32   1   2|  65M    0 |4306B    0 >
  0   1  99   0   0   0: 53  12   0  34   0   1|  59M    0 |   0     0 >
  2   0  98   0   0   0: 61  13   0  24   1   1|  68M    0 |   0     0 >
  3   0  97   0   0   0: 44   6   0  49   0   1|  45M    0 |   0     0 >
  3   1  95   1   0   0: 56   8   0  34   0   2|  62M    0 |   0     0 >
-------cpu0-usage--------------cpu1-usage------ --dsk/sda-- --net/eth0->
usr sys idl wai hiq siq:usr sys idl wai hiq siq|_read _writ|_recv _send>
  3   0  94   3   0   0: 54  12   0  33   0   1|  61M    0 |   0     0 >
  3   1  95   1   0   0: 62  13   0  25   0   0|  73M    0 |   0     0 >
  3   2  95   0   0   0: 58   9   0  31   0   2|  68M    0 |   0     0 >
  3   0  96   1   0   0: 55  14   0  30   0   1|  66M    0 |   0     0 >
  2   0  91   7   0   0: 22   3   1  73   0   1|  13M   28k| 444B  370B>
  0   0  98   2   0   0: 24   1   2  72   0   1|5488k    0 | 140B  938B>
  0   0  99   1   0   0:  9   2   1  88   0   0|5936k    0 |  13k  594B>
  5   1  84  10   0   0: 11   4  16  68   1   1|1448k    0 |  29k 1972B>
  5   1  94   0   0   0: 20   2  62  15   0   1|  88k    0 |2445B 7161B>
  0   0 100   0   0   0:  3   1  95   0   0   1|   0    64k|  66B   54B>
  1   0  99   0   0   0:  2   1  96   0   0   1|   0     0 |   0     0 >
  0   0 100   0   0   0:  1   1  98   0   0   0|   0     0 |   0     0 >

Above is the output of dstat -afc when I calculate an MD5 sum on a DVD iso.
(The last 4 lines are from after K3B has finished.)

The only strange thing that I notice is that all the load is handled by one of the cores.
The other one is doing basically nothing, but that could simply be the way K3B is programmed.
For the rest I don't see anything strange, but still calculating this checksum blocks opening any other program.

So if this brings up any thoughts or leads let me know.

Bornio · Post by **Bornio** » Mon Feb 05, 2007 6:17 pm

This is not the entire output of dstat -afc, but only what fits on your screen.

Please output it to a file (it will be written in CSV format for easy parsing) and do the elimination I suggested in the last post, using a spreadsheet (like that of OOo).

You should have far more columns than you pasted.

moesasji · Post by **moesasji** » Mon Feb 05, 2007 6:26 pm

Sorry, my mistake:

Code: Select all

   7   2  90   1   0   0: 62  10   0  28   0   0|  67M    0 |   0     0 |   0     0 |1354  1116 |  7   2  90   1   0   0: 62  10   0  28   0   0
  3   0  96   1   0   0: 65  11   0  23   0   1|  68M    0 |   0     0 |   0     0 |1245   551 |  3   0  96   1   0   0: 65  11   0  23   0   1
  4   1  95   0   0   0: 61  13   0  25   0   1|  67M    0 |   0     0 |   0     0 |1264   613 |  4   1  95   0   0   0: 61  13   0  25   0   1
-------cpu0-usage--------------cpu1-usage------ --dsk/sda-- --net/eth0- ---paging-- ---system-- -------cpu0-usage--------------cpu1-usage------
usr sys idl wai hiq siq:usr sys idl wai hiq siq|_read _writ|_recv _send|__in_ _out_|_int_ _csw_|usr sys idl wai hiq siq:usr sys idl wai hiq siq
  5   5  87   3   0   0: 58  11   0  30   0   1|  62M    0 |   0     0 |   0     0 |1127   561 |  5   5  87   3   0   0: 58  11   0  30   0   1
  3   0  95   2   0   0: 42   8   0  49   1   1|  46M   20k|4306B    0 |   0     0 | 992   515 |  3   0  95   2   0   0: 42   8   0  49   1   1
  3   1  83  13   0   0: 57   7   0  35   0   1|  59M    0 |   0     0 |   0     0 |1195   685 |  3   1  83  13   0   0: 57   7   0  35   0   1
  5   0  94   1   0   0: 66  11   0  23   0   0|  68M 4096B|   0     0 |   0     0 |1297   528 |  5   0  94   1   0   0: 66  11   0  23   0   0
  4   0  95   1   0   0: 61  12   0  25   1   1|  69M    0 |   0     0 |   0     0 |1282   509 |  4   0  95   1   0   0: 61  12   0  25   1   1
  5   0  91   4   0   0: 62   9   0  28   0   1|  66M    0 |   0     0 |   0     0 |1249   558 |  5   0  91   4   0   0: 62   9   0  28   0   1
  3   2  93   2   0   0: 58   7   0  33   1   1|  60M   24k|   0     0 |   0     0 |1123   663 |  3   2  93   2   0   0: 58   7   0  33   1   1
  3   1  93   3   0   0: 60  10   0  29   0   1|  64M    0 |   0     0 |   0     0 |1199   577 |  3   1  93   3   0   0: 60  10   0  29   0   1
  3   1  96   0   0   0: 50   9   0  40   0   2|  53M    0 |   0     0 |   0     0 |1061   440 |  3   1  96   0   0   0: 50   9   0  40   0   2
  3   3  94   0   0   0: 57   9   0  34   0   0|  62M    0 |   0     0 |   0     0 |1116   584 |  3   3  94   0   0   0: 57   9   0  34   0   0
  7   2  82   9   0   0: 61  13   0  24   0   2|  63M    0 |   0     0 |   0     0 |1271  1414 |  7   2  82   9   0   0: 61  13   0  24   0   2
  2   1  93   4   0   0: 76  11   0  12   0   1|  73M   20k|   0     0 |   0     0 |1453   863 |  2   1  93   4   0   0: 76  11   0  12   0   1
  2   1  97   0   0   0: 61  10   0  29   0   0|  68M    0 |   0     0 |   0     0 |1263   670 |  2   1  97   0   0   0: 61  10   0  29   0   0
  4   1  94   1   0   0: 61  13   0  23   0   3|  70M    0 |   0     0 |   0     0 |1278   543 |  4   1  94   1   0   0: 61  13   0  23   0   3
  3   1  96   0   0   0: 20   2   0  77   0   1|  21M    0 |   0     0 |   0     0 | 771   557 |  3   1  96   0   0   0: 20   2   0  77   0   1
  2   1  77  20   0   0: 27   4   6  63   0   0|3220k   16k|1072B  963B|   0     0 | 826  1755 |  2   1  77  20   0   0: 27   4   6  63   0   0
  0   0  94   6   0   0:  6   2  73  18   0   1| 136k    0 | 104B    0 |   0     0 | 467   823 |  0   0  94   6   0   0:  6   2  73  18   0   1
  3   0  97   0   0   0:  5   4  89   0   1   1|   0     0 |   0     0 |   0     0 | 368   862 |  3   0  97   0   0   0:  5   4  89   0   1   1
  0   1  99   0   0   0:  4   2  94   0   0   0|   0     0 |   0     0 |   0     0 | 324   349 |  0   1  99   0   0   0:  4   2  94   0   0   0

But still the same, I don't see anything strange.

Bornio · Post by **Bornio** » Mon Feb 05, 2007 6:40 pm

I am sorry, I made a typo
it should be:

dstat -afv --output <filename>
(and not -afc)

make sure to actually dump it into a file and parse ...

devsk · Post by **devsk** » Mon Feb 05, 2007 6:45 pm

@moesasji: yours seems like a totally different problem. The maximum IO wait you see is 77%; I have seen waits of 95% without any freezes. Your swap and network do not come into the picture for the duration.

Is your mouse frozen during the time k3b is running md5sum? are you not able to open any KDE menu? I think yours might be a k3b/kde issue. What versions of k3b and kde are you running?

moesasji · Post by **moesasji** » Mon Feb 05, 2007 6:58 pm

@devsk: I've seen IO waits shoot up to 95% as well, but that depends on what my system is doing.
Running klibido in the background is a good way to make that happen, for example.
This time the system was doing nothing other than calculating the checksum.

The symptoms happen with a lot of programs: everything that is intensive for both IO and CPU, such as rar, copying files, par2repair, etc. So it is definitely not K3B specific. I use the checksum as a test case because it clearly shows that the seek time of the HD does not play a role. It simply starts at the beginning of the ISO file and should then give maximum throughput, which it does.
(The ISO itself comes from a rar file that is extracted to a drive with plenty of free space.)

The symptom during these kinds of processes is that I cannot start any other program.
For example, when opening a terminal, the terminal only appears after the intensive process has finished.
Opening a new tab in my browser also becomes a nightmare.
My best guess is a problem with the kernel scheduler. And indeed I don't see a freezing mouse.
That one still moves...

The only other weird thing related to it is that during this heavy load the system itself often freezes for a very brief time. This is very noticeable when typing text in an edit box, as I am doing now. While typing, suddenly a couple of letters don't appear for one or two seconds, and then the backlog appears. This also happens during normal operation, but then it is less pronounced.

ps) Bornio, I will try....but I can parse little if I don't know what to look for.

moesasji · Post by **moesasji** » Mon Feb 05, 2007 7:38 pm

@Bornio: Indeed, the complete output does show more... I removed columns that appeared twice.
And it also shows something strange...

Code: Select all

-------cpu0-usage--------------cpu1-usage------ --dsk/sda-- --net/eth0- ---paging-- ---system-- ---procs--- ------memory-usage
usr sys idl wai hiq siq:usr sys idl wai hiq siq|_read _writ|_recv _send|__in_ _out_|_int_ _csw_|run blk new|_used _buff _cach _free
  1   0  91   8   0   0: 57  10   0  32   0   1|  58M   36k|   0     0 |   0     0 |1137   614 |  1   3   0| 312M 1908k  679M 9072k
  8   1  89   2   0   0: 62  15   0  22   0   1|  70M    0 |  60B  108B|   0     0 |1292   459 |  2   2   0| 313M 1776k  678M  9.9M
  9   6   2  83   0   0: 65  10   0  24   0   1|  69M    0 |   0     0 |   0     0 |1305   511 |  2   2   0| 313M 1784k  678M  9.9M|  
  8   1   0  91   0   0: 63  10   0  25   1   1|  69M    0 |4306B    0 |   0     0 |1272   469 |  1   2   0| 312M 1788k  678M   10M|
  6   1  84   9   0   0: 58  12   0  29   0   1|  63M    0 |1236B 1071B|   0     0 |1233   560 |  1   2   0| 313M 1788k  678M   10M|
  9   1  62  28   0   0: 50  16   0  33   0   1|  60M   20k|   0     0 |   0     0 |1172   716 |  0   2   0| 312M 1812k  679M 9012k|  
  7   0  90   3   0   0: 64  12   0  23   0   1|  67M    0 |   0     0 |   0     0 |1240   536 |  1   1   2| 313M 1764k  679M 9232k|   
  8   2  57  33   0   0: 60  12   0  26   0   2|  68M    0 |   0     0 |   0     0 |1331   589 |  1   1   0| 313M 1776k  679M 9144k|  
  7   2   0  91   0   0: 61  13   2  22   1   1|  72M    0 |   0     0 |   0     0 |1341  1716 |  1   2   1| 313M 1788k  678M  9.9M|   
  9   0  49  43   0   0: 53  12   2  31   1   1|  58M    0 |3305B 1227B|   0     0 |1228   693 |  0   3   0| 312M 1788k  679M 9328k|
  1   0  83  16   0   0: 53  10   0  36   0   1|1290M   40k|  14k  759B|   0     0 |  26k   14k|  2   3   6| 313M 2020k  678M  9.9M| 
  7   3  55  35   0   0: 57  11   0  31   1   0|  62M   28k|   0     0 |   0     0 |1291   880 |  2   2   0| 313M 2044k  678M  9.8M|   
 12   2  84   2   0   0: 66   8   0  24   0   2|  69M    0 |   0     0 |   0     0 |1308   779 |  2   2   0| 313M 2004k  678M 9984k|   
  7   2  89   2   0   0: 58  14   0  26   1   1|  64M    0 |   0     0 |   0     0 |1275   784 |  3   3   0| 313M 2012k  677M  9.8M|   
 11   8  79   2   0   0: 54  13   0  32   0   1|  58M 8192B| 420B 1265B|   0     0 |1097  1769 |  2   5   1| 313M 1996k  678M 9904k|
  6   5  89   0   0   0: 56  12   0  31   0   1|  64M   40k| 866B  722B|   0     0 |1232   638 |  0   4   0| 313M 2004k  679M 9064k|

If you look at the lines where a process is about to be started, you see that there are a large number of blocked processes.
That does not happen when I start an application without the heavy load.
It makes sense, as I try to open applications during the calculation of the checksum, but they don't appear.
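
The kernel itself reports running and blocked process counts in /proc/stat, on the `procs_running` and `procs_blocked` lines, so the same figure can be sampled without dstat. The helper below is a hypothetical sketch that parses the text of that file; whether dstat's "blk" column is derived from exactly this field is not confirmed here.

```python
def blocked_and_running(stat_text):
    """Extract procs_running and procs_blocked from /proc/stat contents."""
    counts = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("procs_running", "procs_blocked"):
            counts[parts[0]] = int(parts[1])
    return counts
```

For example, `blocked_and_running(open("/proc/stat").read())` called once per second during the checksum run would show whether the blocked count rises exactly when a newly started application fails to appear.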

But what is really strange is one of the lines. Roughly in the middle there is a line where 1290M is read from the hard disk.
That is exactly the point where the whole system very noticeably stops for a moment. The output from dstat also halts.
The large numbers are an artefact: dstat simply adds up all the reads that occurred over a much longer time (20 seconds, in fact).
The same is visible in the network and system columns. Divide those numbers by 20 and they match the rest.
(People who see the network going berserk might misinterpret it for this same reason.)
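
The arithmetic behind that observation, as a small worked example. The three input values are taken from the stalled dstat line above (with 26k and 14k read as 26000 and 14000); the dictionary keys are just labels chosen here.

```python
def per_second(accumulated, stall_seconds):
    """Convert a value accumulated over a stall into a per-second rate."""
    return accumulated / stall_seconds


# Values from the stalled sample: 1290M read, 26k interrupts, 14k context switches
stalled = {"read_MB": 1290, "interrupts": 26_000, "context_switches": 14_000}
rates = {name: per_second(value, 20) for name, value in stalled.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per second")
```

This gives about 64.5 MB/s, 1300 interrupts/s and 700 context switches/s, which sit in the same range as the neighbouring one-second samples (58M to 72M read, roughly 1100 to 1340 interrupts).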

Clear proof that something strange is going on. However, I have no clue what it means.
The only other strange thing is that exactly at that point 6 new processes are created.

devsk · Post by **devsk** » Mon Feb 05, 2007 8:02 pm

Code: Select all

  1   0  83  16   0   0: 53  10   0  36   0   1|1290M   40k|  14k  759B|   0     0 |  26k   14k|  2   3   6| 313M 2020k  678M  9.9M|

This indeed means that there is a problem with the process scheduling code path, not the IO scheduling. Can you please attach this data to the bug report at: http://bugzilla.kernel.org/show_bug.cgi?id=7372

I again want to say that my freeze symptoms are totally different from this, but I can attach a dstat report only when I reach home.

moesasji · Post by **moesasji** » Mon Feb 05, 2007 9:00 pm

I've attached the output of dstat to the kernel-bug report with the same explanation as above.
Hopefully it is of some use to solve this problem.

devsk · Post by **devsk** » Tue Feb 06, 2007 4:01 am

I can produce mouse hiccups with a large file copy over cifs. Whether I run a k3b md5sum on a large ISO is not relevant. So here is the dstat output when doing a large file copy over cifs (hda) and k3b doing an md5sum on a large ISO (sdd) at the same time. Look at the middle part, where the disk IO throughput falls to around 2-5 MB/s although they are separate disks. This is also the time when the mouse freezes. What triggers the fall in throughput and the freezes is unknown.

Code: Select all

  0   1   0| 21   7  38  31   1   3|  24M    0 :  36M    0 |  98k 3983k|   0     0 |5696  5799 
  1   0   0| 17   5  40  36   1   2|  40M    0 :  17M    0 |  79k 3276k|   0     0 |5196  5509 
  1   1   0| 14   5  57  22   1   2|  21M    0 :  14M    0 |  98k 3922k|   0     0 |5504  5739 
  1   1   0| 15   4  56  22   0   2|  17M    0 :  18M    0 |  85k 3247k|   0     0 |5014  5329 
  2   0   3| 13   4  74   9   1   1|4612k    0 :8192k    0 |  51k 2033k|   0     0 |3624  4317 
  1   0   0| 29   6  37  23   1   3|  16M    0 :  43M    0 |  71k 2862k|   0     0 |4861  5110 
  1   0   0| 10   3  67  19   1   2|  18M    0 :1280k 1024B|  95k 3661k|   0     0 |5272  5183 
  1   1   0| 10   2  79   6   0   2|3328k    0 :1536k  512B| 235k 6794k|   0     0 |8383  9802 
  0   0   0| 10   3  76   9   1   3| 388k    0 :4608k  512B| 290k 8065k|   0     0 |9587    11k
  2   0   1|  9   3  76  10   1   2|2560k    0 :2304k    0 | 263k 7790k|   0     0 |9067    10k
  2   0   2|  9   4  78   4   1   3|1540k    0 :2176k  512B| 268k 7643k|   0     0 |9187    11k
  2   0   0|  9   2  77  10   1   2|3200k    0 :1664k  512B| 217k 6062k|   0     0 |7842  8655 
  4   0   0|  8   3  75  12   1   2|2052k    0 :1792k  512B| 220k 6073k|   0     0 |7792  8995 
  1   0   0| 12   2  71  10   1   2|2176k    0 :3968k    0 | 304k 8555k|   0     0 |  10k   12k
  0   0   3| 12   5  75   5   1   3|2180k    0 :3584k 1024B| 279k 6939k|   0     0 |9017    11k
  2   0   0| 11   3  79   4   1   3|2564k    0 :2176k  512B| 306k 8574k|   0     0 |  10k   12k
  1   0   0|  8   2  75  11   1   2|5764k    0 : 128k  512B| 233k 7139k|   0     0 |8507  9439 
---procs--- ----total-cpu-usage---- --dsk/hda-----dsk/sdd-- --net/eth0- ---paging-- ---system-- 
run blk new|usr sys idl wai hiq siq|_read _writ:_read _writ|_recv _send|__in_ _out_|_int_ _csw_
  0   0   0| 13   3  67  14   1   3|3588k    0 :5504k  512B| 293k 9293k|   0     0 |  10k   11k
  3   2   0| 14   3  70  10   1   3|5764k    0 :6912k  512B| 249k 6919k|   0     0 |8817  9976 
  2   0   4| 16   6  65  11   1   3|6792k    0 :8704k    0 | 193k 6027k|   0     0 |7718  8670 
  0   0   0| 12   3  70  10   1   3|4996k    0 :6144k 1024B| 278k 7880k|   0     0 |9660    11k
  1   0   0| 12   3  80   3   1   3|2180k    0 :3968k  512B| 223k 7039k|   0     0 |8452  8953

Bornio · Post by **Bornio** » Tue Feb 06, 2007 6:51 am

BTW, may I suggest trying the new 2.6.20 kernel (it just hit gentoo-sources too)?
It might help, now that it has a somewhat new IO manager...

moesasji · Post by **moesasji** » Thu Feb 08, 2007 8:35 pm

I'm currently trying the vanilla-2.6.20 kernel.
I must say that it feels way more responsive than the 2.6.18 kernel on AMD64.
For me at least there appears to be some progress.

Unfortunately the specific problem I experience is still in this new kernel.
I did manage to track down the commit that is the origin for the stalls I experience.
Hopefully something is done with it....but response on open kernel-bugs seems to be pretty slow.

evilben · Post by **evilben** » Mon Feb 12, 2007 6:44 am

TinheadNed wrote:I won't rush into it, as new vanilla kernels normally seem to have little hiccoughs in them when they first come out, but I'd be interested if anyone else has seen a difference on sata_nv. I assume this will extend to my nforce570 chipset.

TinheadNed and Phenax,

I have an nForce 590 motherboard and just tried out 2.6.20 tonight. ADMA is disabled by default on MCP51-61 (see http://lwn.net/Articles/203532/), but I tried it out just to be sure--as suggested, I changed the MCP55 lines in sata_nv.c from "GENERIC" to "ADMA" and recompiled. I got a kernel panic, and didn't investigate any further.

Since the patch that NVidia supplied didn't enable it, my hopes are not very high.

Zi7 · Post by **Zi7** » Sun Feb 25, 2007 7:04 am

Similar problem here with different hardware (Intel/Promise). It definitely looks like a kernel bug unrelated to any specific SATA hardware/driver.

What I have here is 2 SATA HDs that show the same behaviour as previously reported (persistent kernel hangs on massive file copies, with numerous [pdflush] threads when it happens).
Interestingly enough, I use different sorts of RAID on these disks:

RAID 0 works perfectly (/ and /var), even when emerging sync or world.
RAID 1 and LINEAR mode show the hanging behaviour.

It has to be noted, though, that RAID 1 works OK on a small partition (/boot, for that matter), consistent with the fact that kernel hangs only seem to happen when the total amount of data to be written is large enough.
On my system this threshold is roughly a few hundred MB, and is most probably related to the amount of RAM available (1 GB here). Above this limit, problems related to pdflush buffering start to appear.

Edit:
Forgot to mention: hangs happen with all kinds of file systems (Ext3, ReiserFS, XFS).
Disabling all kinds of I/O schedulers as well as all kinds of kernel preemption doesn't change a thing.
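
One way to check whether the hangs coincide with a large backlog of unwritten data is to watch the `Dirty` and `Writeback` counters in /proc/meminfo during a big copy. The sketch below parses the contents of that file; the function name is made up, and the link between these counters and the hangs described above is only a hypothesis to test, not an established cause.

```python
def dirty_and_writeback_kb(meminfo_text):
    """Return the Dirty and Writeback values (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            # Lines look like "Dirty:   156 kB"; the first field is the number
            result[name] = int(rest.split()[0])
    return result
```

Sampling `dirty_and_writeback_kb(open("/proc/meminfo").read())` once per second while copying to the RAID 1 volume would show whether the hang starts once Dirty reaches a few hundred MB.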

moesasji · Post by **moesasji** » Sun Feb 25, 2007 8:48 am

@Zi7: Earlier I managed to find the kernel commit that caused the problem on my system. It would be good if you (or somebody else) could confirm that the problem lies in that commit and report that in the corresponding kernel bug report. I've reported the outcome of the bisect here

You can test whether this is the problem on your system as well by using bisect. A howto is given here

I guess you should do the following (not tested):

Code: Select all

$ cg-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linuxdir
# (this can take a long time)
$ cd linuxdir
# (it makes sense to first copy your .config to this new directory)
$ git-bisect start

# then say the current version doesn't work:
$ git-bisect bad

# (or mark an earlier version as bad instead, in my case 2.6.18:)
$ git-bisect bad v2.6.18

$ git checkout master
$ git revert <bad-commit-id>

In this case the bad-commit-id would be: 6edad161cd4dfe1df772e7a74ab63cab53b5e8c1

Then do the normal make && make modules_install from that directory and reboot and see if the problem still exists or is solved.

ps) probably it is best to use the v2.6.18 kernel for testing as the bad commit is introduced between 2.6.17 and 2.6.18. I don't know how many changes later on make use of the features that were introduced with these patches.

Zi7 · Post by **Zi7** » Sun Feb 25, 2007 5:17 pm

I could definitely try that.
However, I guess I should start by knowing which is the last kernel that worked (none has worked since I bought those SATA disks) and which is the first one that did not. Do you know these kernel versions (vanilla or gentoo sources)?
Meanwhile I'll test older kernels, going backward from around 2.6.18, when I have some spare time...