Gentoo Forums
High Network Utilization crash?
eccerr0r
Watchman
Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Tue Nov 08, 2011 8:51 pm    Post subject: High Network Utilization crash?

Anyone seen this happening? Just looking for some "me too's" at least, even if it's not solved...

I try to dump a whole bunch of data (a network computer-to-computer copy of data off another HD) through Gbit ethernet using Linux-3.0.6-gentoo, and it completely crashes the box. However, if I switch to 2.6.35-r4 (gentoo-sources as well), the problem goes away and I can complete the copy. Weird! Linux-3.0.3-vanilla also crashes.

I need to determine whether writing to disk or just network activity is causing the problem, but I've ruled out NFS as the culprit: using NFS crashes after starting to dump the data, but I was also able to trigger the crash using 'netcat' to do the same machine-to-machine copy.

Crashing system (haven't gotten any debug information from it because the machine seems to simply hang with no debug data):
Linux-3.0.6-gentoo
destination disks: MDRAID RAID5 on SATA ICH
x86 (Core2 Quad, 32 bit mode)
r8110s based Gbit ethernet on a Gigabyte EP43-UD3L board
ATI RadeonHD 5770 (FGLRX)
4GB RAM/64G PAE

The machine I was copying from:
Linux-2.6.21-Custom
source disk: plain single disk SATA on ICH
x86 (Core2 Duo, 32 bit mode)
Marvell Gbit Ethernet on a Foxconn G965MA board
G965 chipset
4GB RAM/64G PAE

Both machines are connected via a Gbit ethernet switch.

Weird...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

Hu
Moderator
Joined: 06 Mar 2007
Posts: 21635

Posted: Tue Nov 08, 2011 10:21 pm

What if you have the receiving netcat write everything to /dev/null? This will allow you to reproduce the network load without involving the receiving disk. Similarly, you could try using dd bs=1M if=/dev/zero of=foo on the "receiving" system to generate a substantial disk load with no network involvement.
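
The two isolation tests above could be sketched as shell functions like the ones below. This is only a sketch under assumptions: the function names, the port, and any paths are made up for illustration, `bs=1M` assumes GNU dd, and the listen syntax differs between netcat variants (traditional netcat wants `nc -l -p PORT`, the OpenBSD variant wants `nc -l PORT`).

```shell
#!/bin/sh
# Disk-only load: write zeros to a scratch file; no network involved.
#   $1 = scratch file on the array under test (hypothetical path)
#   $2 = number of 1 MiB blocks to write
disk_only_load() {
    dd bs=1M if=/dev/zero of="$1" count="$2" 2>/dev/null
}

# Network-only load, receiving side: discard everything that arrives,
# so the receiving disks are never touched.
#   $1 = TCP port to listen on (arbitrary choice)
# Assumes traditional netcat; the OpenBSD variant uses "nc -l PORT".
net_only_sink() {
    nc -l -p "$1" > /dev/null
}

# Network-only load, sending side: stream zeros to the sink.
#   $1 = receiving host, $2 = port, $3 = number of 1 MiB blocks
net_only_source() {
    dd bs=1M if=/dev/zero count="$3" 2>/dev/null | nc "$1" "$2"
}
```

For example, running `net_only_sink 5000` on the receiving box and `net_only_source receiver 5000 10000` on the sender (host name and port are placeholders) exercises only the network, while `disk_only_load /path/on/raid/foo 10000` exercises only the disks. If just one of the two hangs the box, that narrows things down.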

eccerr0r
Watchman
Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Wed Nov 09, 2011 5:36 am

The machine is otherwise stable... looks like netcat over the network works (114MB/sec), and so does dumping to disk (140MB/sec... blah, crappy raid...), each on its own.

I think I have some more clues now though, this might still be NFS after all. I have a feeling what's crashing is the file locks. I think I had the exports read only, and then it would work fine, but if it was exported read-write then it would crash. Of course this still isn't expected behavior...

hmm..need more testing.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

depontius
Advocate
Joined: 05 May 2004
Posts: 3509

Posted: Wed Nov 09, 2011 1:04 pm

I've seen something like this, a year or two back. I've since tweaked aspects of my installation to make-it-not-happen, though it wasn't out-and-out crashes, rather occasional nearly-minute-long hangs.

My nfs server is an 800MHz P-III, and the problem could be tripped by a single Athlon 64 client. This was back in the timeframe when Firefox had just started using sqlite to store its instance information and filesystem delays/problems were emerging with the fsync operations. It seemed to me that the sqlite fsync on a much faster client against a slower server was overwhelming the server. I moved .firefox over to local disk and symlinked it back to nfs-mounted /home. The problems went away.
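
The move-to-local-disk-and-symlink workaround described above might look roughly like this. It is a sketch, not the exact steps that were used: the function name and the example paths are hypothetical, and the application should be closed before running it.

```shell
#!/bin/sh
# Move a directory off NFS onto local disk and leave a symlink behind.
#   $1 = directory currently on the NFS mount (e.g. "$HOME/.firefox")
#   $2 = new location on local disk (hypothetical path, must not exist yet)
relocate_to_local() {
    src=$1
    dst=$2
    [ -d "$src" ] && [ ! -L "$src" ] || return 1   # must be a real directory
    [ ! -e "$dst" ] || return 1                      # refuse to overwrite
    cp -a "$src" "$dst" || return 1                  # copy, preserving attributes
    mv "$src" "$src.on-nfs" || return 1              # keep the NFS copy for rollback
    ln -s "$dst" "$src"                              # old path now points at local copy
}
```

Something like `relocate_to_local "$HOME/.firefox" /var/local/firefox-profile` would then keep the sqlite fsync traffic off the NFS server; the leftover `.on-nfs` directory can be deleted once everything is confirmed working.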
_________________
.sigs waste space and bandwidth

krinn
Watchman
Joined: 02 May 2003
Posts: 7470

Posted: Wed Nov 09, 2011 3:30 pm

Did you check dmesg for trouble with IRQs? Under heavy load, many motherboards with buggy parts show failures, and the bad IRQ is then thrown out of the IRQ table. The funny part is that any device using it is not reset to grab another one, but is left as-is in that bad state. If it happens to the IRQ that your HDD controller is using, you can expect slowdowns, freezes and crashes; same for the network card. It can happen just because too many devices use the same IRQ, or just because some device doesn't like sharing its IRQ with anyone.

But I suppose it might not be that, as your dmesg should already have reported it.
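
A quick way to look for this in the kernel log is to filter for the messages the kernel prints when it gives up on an interrupt line ("irq N: nobody cared" followed by "Disabling IRQ #N"). A minimal sketch, with a made-up function name:

```shell
#!/bin/sh
# Filter kernel log text (on stdin) for signs that an IRQ was shut off.
# "nobody cared" and "Disabling IRQ" are the messages the kernel prints
# when it stops servicing an interrupt line.
irq_trouble() {
    grep -i -E 'nobody cared|disabling irq'
}
```

Typical use would be `dmesg | irq_trouble`, followed by `cat /proc/interrupts` to see which devices share the affected line. No output from the filter does not prove the IRQs are fine, only that the kernel did not log a disabled line.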

depontius
Advocate
Joined: 05 May 2004
Posts: 3509

Posted: Wed Nov 09, 2011 3:36 pm

As I said, these problems were a year or two back. I don't remember all of the steps I took at the time to try to diagnose this. I also remember that at that time there were known problems with nfs under heavy load, so I had a heavy presumption that that was the problem, rather than anything more fundamental.

It would be fairly easy to "go back" into the trouble realm. I would just have to remove the .firefox symlink from /home, and "cp -a" the .firefox from local space back to /home. There may be enough other things changed, that I wouldn't get back into trouble. One of those other changes I made was to start using cachefilesd to cache my nfsv4 /home, but AFAIK that doesn't cache writes, so if that's the problem, it should still exist. It is possible however that the nfsv4 write path is changed sufficiently by cachefilesd that that alone would ease my old problems.
_________________
.sigs waste space and bandwidth

eccerr0r
Watchman
Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Wed Nov 09, 2011 3:53 pm

More interesting is that it seems to die at around the same place each time when I start the particular copy.

Also, not much data needed to be transferred before it crashed. I think it got a few KB into this multi-GB transfer before the crash occurred, which suggests that the locking mechanism needed at the beginning of the transfer could be at play. I don't think it's random luck or that the amount of data matters; a specific packet seems to be sufficient to hang the server.

Unfortunately I destroyed the source copy of the data in question, so this will probably remain a bit of a mystery that won't be solved soon unless I build another data set that can repeat this... But I deeply suspect this to be a software issue rather than a hardware one, though I can't pin it on one or the other yet.

(I was copying everything off that one disk because I wanted to convert that core2duo to a 64-bit install!)
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

depontius
Advocate
Joined: 05 May 2004
Posts: 3509

Posted: Wed Nov 09, 2011 4:35 pm

Come to think of it, I recently moved something over 5G onto NFS with no problem. The photos had been pulled off of SD cards onto my laptop. I don't yet have an SD reader for my deskside machine, and my laptop isn't set up for my NFS. (It's a work laptop.) I used scp to copy the photos from the laptop onto the deskside - in NFS space, which meant that they were being copied from the laptop through the deskside to the NFS server.

This is over a full-duplex 100Mbit LAN. Even though it's really 2 one-way problems, I suspect that there's enough handshaking overhead that neither transfer got the full 100Mbit rate. Either my network is in better shape several years later, or that little bit of double-transfer degradation made some difference, or the fact is, I started the transfer and walked away - only occasionally checking to see how it was going. OTOH, nothing crashed.
_________________
.sigs waste space and bandwidth

HeissFuss
Guru
Joined: 11 Jan 2005
Posts: 414

Posted: Tue Dec 13, 2011 9:09 am

I haven't heard of this issue crashing an entire system, but THP was added in 2.6.38 and is known to cause application slowness/crashes if you have a lot of filesystem writes pending, or are otherwise low on free memory. Did you enable transparent huge pages in your 3.0 kernels?
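
To answer that question on a running system, the current THP mode can be read from sysfs, where the active choice is shown in brackets (for example `always [madvise] never`). A small sketch, with hypothetical function names:

```shell
#!/bin/sh
# Print the active choice from a sysfs-style line such as
# "always [madvise] never" (the bracketed word is the one in effect).
thp_active() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Report the current THP mode, or say so if the kernel has no THP support.
thp_status() {
    f=/sys/kernel/mm/transparent_hugepage/enabled
    if [ -r "$f" ]; then
        thp_active < "$f"
    else
        echo "no THP support"
    fi
}
```

Calling `thp_status` prints the mode in effect. If the sysfs file is missing, the kernel was most likely built without CONFIG_TRANSPARENT_HUGEPAGE.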

loopx
Advocate
Joined: 01 Apr 2005
Posts: 2787
Location: Belgium / Liège

Posted: Fri Dec 23, 2011 1:06 am

HeissFuss wrote:
I haven't heard of this issue crashing an entire system, but THP was added in 2.6.38 and is known to cause application slowness/crashes if you have a lot of filesystem writes pending, or are otherwise low on free memory. Did you enable transparent huge pages in your 3.0 kernels?


Wooooow, thank you very much. I was thinking that NFS was big s***, but in fact it's the problem you pointed out. So now I hope to fix it one day, because it's very annoying to be stuck for 1 minute when copying over NFS while trying to google why it's hanging ...


EDIT: check this: http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt

I've done this:
Code:

echo "none" > /sys/kernel/mm/transparent_hugepage/defrag


and the problem seems to be gone :)
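
For reference, the transhuge.txt document linked above lists `always`, `madvise` and `never` as the values this file accepts in kernels of that era, so `none` as posted may be a typo for `never` (that is an assumption, not something confirmed here). A hedged sketch that checks the value before writing it, with made-up helper names, to be run as root:

```shell
#!/bin/sh
# Succeed only if $1 is one of the values documented for 3.0-era kernels.
thp_value_ok() {
    case $1 in
        always|madvise|never) return 0 ;;
        *) return 1 ;;
    esac
}

# Write $1 to the THP defrag knob after validating it (needs root).
set_thp_defrag() {
    thp_value_ok "$1" || { echo "unsupported value: $1" >&2; return 1; }
    echo "$1" > /sys/kernel/mm/transparent_hugepage/defrag
}
```

With this, `set_thp_defrag never` writes the setting, while `set_thp_defrag none` is rejected up front instead of relying on the write error from the kernel. The setting does not survive a reboot unless it is reapplied at boot.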


I will now try rebuilding the kernel with this new setting:
Code:

 .config - Linux/x86_64 3.0.6-gentoo Kernel Configuration
 ── madvise ──
 CONFIG_TRANSPARENT_HUGEPAGE_MADVISE:

 Enabling Transparent Hugepage madvise, will only provide a
 performance improvement benefit to the applications using
 madvise(MADV_HUGEPAGE) but it won't risk to increase the
 memory footprint of applications without a guaranteed
 benefit.
 Symbol: TRANSPARENT_HUGEPAGE_MADVISE [=y]
 Type  : boolean
 Prompt: madvise
   Defined at mm/Kconfig:333
   Depends on: <choice>
   Location:
     -> Processor type and features
       -> Transparent Hugepage Support (TRANSPARENT_HUGEPAGE [=y])
         -> Transparent Hugepage Support sysfs defaults (<choice> [=y])


;-)


EDIT2: I confirm : it works very well now :)
_________________
Mon MediaWiki perso : http://pix-mania.dyndns.org

depontius
Advocate
Joined: 05 May 2004
Posts: 3509

Posted: Fri Dec 23, 2011 1:31 pm

Does this have to do with THP on the NFS server, client, or both?

I have an i686 machine serving NFSV4, and several amd64 clients. The server is running some level of 2.6.39, but since it's i686 doesn't have THP. I believe the clients are all running THP. I had a bout of performance problems a bit over a year ago, but they had gone by the wayside. About a week or two back, I noticed another "temporary hang" reminiscent of the bad old days. I'm wondering if I'm having the THP problem.
_________________
.sigs waste space and bandwidth

eccerr0r
Watchman
Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Fri Dec 23, 2011 4:33 pm

The server is the machine crashing for me, so I suppose that's the machine that needs to have attention...

Unfortunately I don't have transparent huge pages enabled (but regular huge pages are enabled)...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

loopx
Advocate
Joined: 01 Apr 2005
Posts: 2787
Location: Belgium / Liège

Posted: Fri Dec 23, 2011 10:06 pm

In my case, the server is a Synology DS411+II, which has no problem. The client (Gentoo) had the problem, but with the new kernel configuration, no more problem ;).

Yes, NFS is like a "slow device" over a 100Mbit network. I think it was filling memory with data waiting to be written to NFS; now that's limited, so there are no more hangs at all. I was experiencing random hangs every 1-2 minutes, each lasting ... 1 or 2 minutes ... :-/


now it works like a charm :)
_________________
Mon MediaWiki perso : http://pix-mania.dyndns.org

loopx
Advocate
Joined: 01 Apr 2005
Posts: 2787
Location: Belgium / Liège

Posted: Fri Dec 23, 2011 10:09 pm

eccerr0r wrote:
The server is the machine crashing for me, so I suppose that's the machine that needs to have attention...

Unfortunately I don't have transparent huge pages enabled (but regular huge pages are enabled)...



At my workplace, we have one server running EXT4 and NFS for VMware ESXi (used as backup for virtual HDDs + thin provisioning). The server has high nice time and transfers are not as fast as over FTP, but I think this is normal with NFS. I haven't checked the THP settings ...
_________________
Mon MediaWiki perso : http://pix-mania.dyndns.org

eccerr0r
Watchman
Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Mon Feb 27, 2012 1:28 am

Hmm... I guess I'm seeing this pop up again, sort of...

Once again it's the same two machines - a c2q with i686 3.2.1-gentoo-r2 and a c2d x86-64 machine also running 3.2.1-gentoo-r2.

I mounted the c2q from the c2d, and when I started copying a bunch of files from the c2q to the c2d through that mount, the copy process would hang in D state, basically requiring a client-side reboot to clear it up.

Still no signs of what's going on - no dmesg messages...
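
When dmesg stays silent, one thing that can still be captured during such a hang is which processes are in D state and what kernel function they are sleeping in. A minimal sketch, assuming the procps version of `ps` (the `wchan:32` column width syntax is procps-specific) and with made-up function names:

```shell
#!/bin/sh
# Keep the header row plus rows whose STAT column (2nd field) starts
# with "D" (uninterruptible sleep).
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List D-state processes with the kernel function they are waiting in.
list_d_state() {
    ps -eo pid,stat,wchan:32,args | d_state_only
}
```

Running `list_d_state` from a second terminal while the copy is stuck shows the wait channel of the hung process. If magic SysRq is enabled, `echo w > /proc/sysrq-trigger` as root additionally dumps the stacks of blocked tasks into the kernel log.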

Now what's weird: copying from/to my athlonxp machine through nfs works perfectly fine!!!

Ugh...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

All times are GMT