High Network Utilization crash?

eccerr0r · Posted: Tue Nov 08, 2011 8:51 pm Post subject: High Network Utilization crash?

Anyone seen this happening? Just looking for some "me too's" at least, even if it's not solved...

When I try to dump a whole bunch of data (a network computer-to-computer copy of data off another HD) through Gbit ethernet using Linux-3.0.6-gentoo, it completely crashes the box. However, if I switch to 2.6.35-r4 (gentoo-sources as well), the problem goes away and I can complete the copy. Weird! Linux-3.0.3-vanilla also crashes.

I need to determine whether writing to disk or just the network activity is causing the problem, but I have ruled out NFS as the culprit: using NFS crashes shortly after starting to dump the data, and I was also able to trigger the crash by using 'netcat' to do the same machine-to-machine copy.

Crashing system (haven't gotten any debug information from it because the machine seems to simply hang with no debug data):
Linux-3.0.6-gentoo
destination disks: MDRAID RAID5 on SATA ICH
x86 (Core2 Quad, 32 bit mode)
r8110s based Gbit ethernet on a Gigabyte EP43-UD3L board
ATI RadeonHD 5770 (FGLRX)
4GB RAM/64G PAE

The machine I was copying from
Linux-2.6.21-Custom
source disk: plain single disk SATA on ICH
x86 (Core2 Duo, 32 bit mode)
Marvell Gbit Ethernet on a Foxconn G965MA board
G965 chipset
4GB RAM/64G PAE

Both machines are connected via a Gbit ethernet switch.

Weird...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

Hu · Moderator Joined: 06 Mar 2007 Posts: 21635

What if you have the receiving netcat write everything to /dev/null? This will allow you to reproduce the network load without involving the receiving disk. Similarly, you could try using dd bs=1M if=/dev/zero of=foo on the "receiving" system to generate a substantial disk load with no network involvement.
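For anyone who would rather script the two isolation tests than juggle netcat variants, here is a minimal Python sketch of the same idea. It is an illustration only, not a replacement for the netcat/dd commands above: the function names (drain, serve_discard, write_zeros) and the 1 MiB block size are assumptions made for this example, and the port and output path are left to the caller.

```python
import os
import socket

BLOCK = 1024 * 1024  # 1 MiB per read/write, mirroring dd's bs=1M


def drain(conn: socket.socket) -> int:
    """Read from conn until EOF, discard everything, return the byte count."""
    total = 0
    while True:
        chunk = conn.recv(BLOCK)
        if not chunk:
            return total
        total += len(chunk)


def serve_discard(port: int) -> int:
    """Network-only test: accept one TCP connection and throw its data away."""
    with socket.create_server(("", port)) as srv:
        conn, _addr = srv.accept()
        with conn:
            return drain(conn)


def write_zeros(path: str, blocks: int) -> int:
    """Disk-only test: write `blocks` 1 MiB blocks of zeros, then fsync."""
    buf = bytes(BLOCK)
    written = 0
    with open(path, "wb") as f:
        for _ in range(blocks):
            written += f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # make sure the data really reaches the disk
    return written
```

Running serve_discard on the receiving box while the sender pushes data at it exercises only the network path; running write_zeros with a large block count exercises only the disk path. If neither hangs the machine on its own, the combination (or NFS itself) becomes the more likely suspect.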

eccerr0r · Posted: Wed Nov 09, 2011 5:36 am Post subject:

The machine is otherwise stable... looks like network netcat works (114MB/sec), as well as dumping to disk (140MB/sec... blah, crappy raid...) individually.

I think I have some more clues now though, this might still be NFS after all. I have a feeling what's crashing is the file locks. I think I had the exports read only, and then it would work fine, but if it was exported read-write then it would crash. Of course this still isn't expected behavior...

hmm..need more testing.

depontius · Advocate Joined: 05 May 2004 Posts: 3509

I've seen something like this, a year or two back. I've since tweaked aspects of my installation to make-it-not-happen, though it wasn't out-and-out crashes, rather occasional nearly-minute-long hangs.

My nfs server is an 800MHz P-III, and the problem could be tripped by a single Athlon 64 client. This was back in the timeframe when Firefox had just started using sqlite to store its instance information and filesystem delays/problems were emerging with the fsync operations. It seemed to me that the sqlite fsync on a much faster client against a slower server was overwhelming the server. I moved .firefox over to local disk and symlinked it back to nfs-mounted /home. The problems went away.
_________________
.sigs waste space and bandwidth

krinn · Watchman Joined: 02 May 2003 Posts: 7470

Did you check dmesg for trouble with IRQs? Under heavy load, many motherboards with buggy parts show failures, and the bad IRQ is then thrown out of the IRQ table. The funny part is that any device using it is not reset to grab another one but is left as-is in that bad state. If it happens to the IRQ that your HDD controller is using, you can expect slowdowns, freezes and crashes; the same goes for the network card. It can happen just because too many devices use the same IRQ, or just because some device doesn't like sharing its IRQ with anyone.

But I suppose it might not be that, as your dmesg should already have reported it.
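As a rough illustration of what to look for, the sketch below scans saved kernel log text for the two messages the kernel typically prints when it gives up on an interrupt line ("irq N: nobody cared" and "Disabling IRQ #N"). The function name disabled_irqs is made up for this example, and other IRQ-related warnings are not covered.

```python
import re

# Messages printed when the kernel stops servicing an interrupt line.
IRQ_PATTERNS = [
    re.compile(r"irq (\d+): nobody cared"),
    re.compile(r"Disabling IRQ #(\d+)"),
]


def disabled_irqs(log_text: str) -> set:
    """Return the set of IRQ numbers mentioned in 'gave up' messages."""
    found = set()
    for line in log_text.splitlines():
        for pattern in IRQ_PATTERNS:
            match = pattern.search(line)
            if match:
                found.add(int(match.group(1)))
    return found
```

Feed it the output of dmesg (or a saved copy of the kernel log) and compare any numbers it returns against /proc/interrupts to see which devices sit on that line.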

depontius · Advocate Joined: 05 May 2004 Posts: 3509

As I said, these problems were a year or two back. I don't remember all of the steps I took at the time to try to diagnose this. I also remember that at that time there were known problems with nfs under heavy load, so I had a heavy presumption that that was the problem, rather than anything more fundamental.

It would be fairly easy to "go back" into the trouble realm. I would just have to remove the .firefox symlink from /home, and "cp -a" the .firefox from local space back to /home. There may be enough other things changed, that I wouldn't get back into trouble. One of those other changes I made was to start using cachefilesd to cache my nfsv4 /home, but AFAIK that doesn't cache writes, so if that's the problem, it should still exist. It is possible however that the nfsv4 write path is changed sufficiently by cachefilesd that that alone would ease my old problems.

eccerr0r · Posted: Wed Nov 09, 2011 3:53 pm Post subject:

More interesting is that it seems to die at around the same place each time when I start the particular copy.

Also, not much data needed to transfer before it crashed. I think only a few KB of this multi-GB transfer got across before the crash occurred, which suggests that the locking mechanism needed at the beginning of the transfer could be at play. I don't think the amount of data matters; a specific packet seems to be sufficient to hang the server.

Unfortunately I destroyed the source copy of the data in question, so this will probably remain a bit of a mystery that won't be solved soon unless I build another data set that can repeat this... I strongly suspect this to be a software issue rather than a hardware one, though I can't pin it on one or the other yet.

(I was copying everything off that one disk because I wanted to convert that core2duo to a 64-bit install!)

depontius · Advocate Joined: 05 May 2004 Posts: 3509

Come to think of it, I recently moved something over 5G onto NFS with no problem. The photos had been pulled off of SD cards onto my laptop. I don't yet have an SD reader for my deskside machine, and my laptop isn't set up for my NFS. (It's a work laptop.) I used scp to copy the photos from the laptop onto the deskside - in NFS space, which meant that they were being copied from the laptop through the deskside to the NFS server.

This is over a full-duplex 100Mbit LAN. Even though it's really 2 one-way problems, I suspect that there's enough handshaking overhead that neither transfer got the full 100Mbit rate. Either my network is in better shape several years later, or that little bit of double-transfer degradation made some difference, or it's simply that I started the transfer and walked away, only occasionally checking to see how it was going. OTOH, nothing crashed.

HeissFuss · Guru Joined: 11 Jan 2005 Posts: 414

I haven't heard of this issue crashing an entire system, but THP was added in 2.6.38 and is known to cause application slowness/crashes if you have a lot of filesystem writes pending, or are otherwise low on free memory. Did you enable transparent huge pages in your 3.0 kernels?
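A quick way to answer that question on a running kernel is to read /sys/kernel/mm/transparent_hugepage/enabled, where the active mode is the word in square brackets (for example "[always] madvise never"). A minimal sketch, with function names invented for this example:

```python
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text):
    """Return the bracketed choice from e.g. '[always] madvise never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_thp_mode(path=THP_ENABLED):
    """Return the active THP mode, or None if the sysfs file is absent."""
    try:
        with open(path) as f:
            return active_thp_mode(f.read())
    except FileNotFoundError:
        # No such file: the kernel was presumably built without THP support.
        return None
```

A result of None or "never" would suggest THP is not in play on that machine.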

loopx · Posted: Fri Dec 23, 2011 1:06 am Post subject:

depontius · Advocate Joined: 05 May 2004 Posts: 3509

Does this have to do with THP on the NFS server, client, or both?

I have an i686 machine serving NFSv4, and several amd64 clients. The server is running some level of 2.6.39, but since it's i686 it doesn't have THP. I believe the clients are all running THP. I had a bout of performance problems a bit over a year ago, but they had gone by the wayside. About a week or two back, I noticed another "temporary hang" reminiscent of the bad old days. I'm wondering if I'm having the THP problem.

eccerr0r · Posted: Fri Dec 23, 2011 4:33 pm Post subject:

The server is the machine crashing for me, so I suppose that's the machine that needs to have attention...

Unfortunately I don't have transparent huge pages enabled (but regular huge pages are enabled)...

loopx · Posted: Fri Dec 23, 2011 10:06 pm Post subject:

In my case, the server is a Synology DS411+II, which has no problems. The client (Gentoo) had the problem, but with the new kernel configuration there are no more problems.

Yes, NFS is like a "slow device" over a 100Mbit network. I think it was filling memory with data waiting to be written to NFS; now that this is limited, there are no more hangs at all. I was experiencing random hangs every 1-2 minutes, each lasting ... 1 or 2 minutes ... :-/

Now it works like a charm.
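One way to check whether pending writes really are piling up in memory during an NFS copy is to watch the Dirty and Writeback lines in /proc/meminfo. The sketch below only parses those two fields; it does not reproduce the kernel configuration change mentioned above (which is not spelled out in this thread), and the function name is an assumption for illustration.

```python
def dirty_writeback_kib(meminfo_text):
    """Extract the Dirty and Writeback values (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # Lines look like "Dirty:              1234 kB"
            result[key] = int(rest.split()[0])
    return result
```

Calling it every second or so on the contents of /proc/meminfo while the copy runs shows whether the dirty total climbs steadily before a hang.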

_________________
Mon MediaWiki perso : http://pix-mania.dyndns.org

loopx · Posted: Fri Dec 23, 2011 10:09 pm Post subject:

eccerr0r · Posted: Mon Feb 27, 2012 1:28 am Post subject:

Hmm... I guess I'm seeing this pop up again, sort of...

Once again it's the same two machines - a c2q with i686 3.2.1-gentoo-r2 and a c2d x86-64 machine also running 3.2.1-gentoo-r2.

I mounted the c2q with the c2d, and when I tried to copy a bunch of files from the c2q to the c2d, the copy process would hang in D state, basically requiring a client-side reboot to clear it up.
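For reference, processes stuck in D state (uninterruptible sleep) can be listed by reading the state field of /proc/<pid>/stat. The following is a small sketch under that assumption; the helper names are made up for this example, and it only reports which processes are stuck, not why.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the LAST ')' rather than on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]


def d_state_processes(proc="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    hung = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited while we were looking
        if state == "D":
            hung.append((pid, comm))
    return hung
```

If the copy process shows up in that list and stays there, it is blocked inside the kernel, which fits an NFS request that never completes.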

Still no signs of what's going on - no dmesg messages...

Now what's weird: copying from/to my athlonxp machine through nfs works perfectly fine!!!

Ugh...