rsync pull messing up network connection after 45 secs

fsavigny · n00b Joined: 08 Sep 2012 Posts: 24

Sorry for this accusatory headline, but the limited space did not
allow for niceties. Of course I do not know that it is the rsync pull;
it just looks like that. But the behaviour is absolutely reliable and
reproducible.

I am a long-term internet user, but not a networking person, which is
why I have tried different scenarios to somehow pinpoint the
problem. But many of my observations might be useless, or just normal
and expected behaviour. I hope that at least something useful is among
them.

So let's go:

I am trying to sync files between my big ("home") laptop and my small
netbook (which I take to work) using rsync. I have a router supplied
by my telephone company, which I use to access the internet, and which
assigns IP addresses using DHCP. When I plug in both computers to the
router, they become automatically mutually visible under their host
names (let's call them 'laptop' and 'netbook'), and provided 'laptop'
is running rsyncd and I have defined a module in /etc/rsyncd.conf, I
can start pulling files from it using

Hu · Moderator Joined: 06 Mar 2007 Posts: 21607

Please lay off the enter key. Your post is much more readable without jumping to a new line every 70 characters. The forum will natively wrap regular text at a width that works well for the reader's browser.

The "weird" IP address is from the APIPA range. It is assigned when your DHCP client is required to configure something and does not get a valid response from a DHCP server. This could happen if your network card is so confused that it cannot talk to the DHCP server. The dmesg entry about "DMA Status error. Resetting chip." looks bad. The last line of dmesg seems to claim the device recovered, but I am doubtful about that since you say it continues not to work and stays broken even across a reboot.
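
For reference, the APIPA (IPv4 link-local) range is 169.254.0.0/16. A minimal sketch for checking whether an address falls in it, using Python's standard ipaddress module; the helper name is_apipa and the sample addresses are made up for illustration and are not taken from this thread:

```python
import ipaddress

def is_apipa(address):
    """Return True if `address` is an IPv4 link-local (APIPA) address."""
    ip = ipaddress.ip_address(address)
    # is_link_local covers 169.254.0.0/16 for IPv4 addresses.
    return ip.version == 4 and ip.is_link_local

# Hypothetical sample addresses, for illustration only.
print(is_apipa("169.254.17.3"))   # True
print(is_apipa("192.168.1.20"))   # False
```

If the interface shows an address for which this returns True, the DHCP exchange most likely failed and the client fell back to self-assignment.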

What kernel version are you using? Has this ever worked? Does it happen if you stream an equivalent volume of data quickly between the two machines using some other protocol? I assume when you refer to a reboot that you mean a warm reboot, where the machine remains powered but you start the OS again. Please try a cold reboot instead, where you tell Linux to halt and turn off power. Wait 10-15 seconds, then turn power on again. The magic recovery after trying to use the network after a reboot sounds very odd. Based on what you have said, I would blame either the kernel driver or the NIC firmware on the machine which loses connectivity.

fsavigny · n00b Joined: 08 Sep 2012 Posts: 24

Hi Hu,

Thank you for responding, and so quickly! Sorry about the line breaks in my original posting. My editor added them: I had to collect the information across several reboots, so I started writing in an editor. No need for that now. (But if there is any more, I will switch off the line-breaking behaviour.)

Hu · Moderator Joined: 06 Mar 2007 Posts: 21607

By default, ping uses small packets, as you noted. The ping flood test you did is interesting, since it seems to suggest that large volumes of small packets are fine, but moderate volumes of large packets are a problem. Using ping -f can be hard on a network, but if the test involves only your machines, that is fine. According to the documentation, ping -f sends packets as fast as they come back or one hundred times per second, whichever is more. Therefore, your test may not have had much data in flight concurrently, which is another possible explanation for why it did not fail. Can you try a test using chargen as a data source? That should produce more concurrent traffic and may also use larger packets, more like how rsync is behaving.
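
A rough sketch of the receiving side of such a test, assuming a chargen-style TCP service is reachable on the other machine. It reads for a fixed time and reports how many bytes arrived, so a stall shows up as a low or zero count. The function name, the 2-second stall timeout, and the host name in the usage comment are assumptions for illustration:

```python
import socket
import time

def measure_tcp_read(host, port, seconds=5.0, chunk=65536):
    """Read from a TCP stream for up to `seconds`; return (bytes_received, elapsed)."""
    total = 0
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.settimeout(2.0)  # treat 2 s without data as a stall (arbitrary choice)
        while time.monotonic() - start < seconds:
            try:
                data = sock.recv(chunk)
            except socket.timeout:
                break  # no data arrived within the stall timeout
            if not data:
                break  # the peer closed the connection
            total += len(data)
    return total, time.monotonic() - start

# Usage (hypothetical host name; 19 is the standard chargen port):
# received, elapsed = measure_tcp_read("laptop", 19, seconds=120)
# print(f"{received} bytes in {elapsed:.1f} s")
```

Running it for a couple of minutes, well past the 45-second mark from the thread title, should show whether plain TCP traffic triggers the same failure as rsync.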

The behaviour with cold reboot and with the router cycled is just weird. With either of those reports alone, I would blame the other component. With both those reports together, I do not know what is at fault. I still lean toward a problem with the NIC firmware, but a cold reboot ought to have prevented that. I suppose the reboot might not be truly cold if the NIC remained powered up using laptop battery power. Could you repeat the test with the battery disconnected and wall power removed during the 5-10 second window after the system halts?

fsavigny · n00b Joined: 08 Sep 2012 Posts: 24

Hi Hu,

I am very sorry for not having replied earlier: I'm a very busy teacher, and simply did not have the time. But it does feel like that was rude of me.

Hu · Moderator Joined: 06 Mar 2007 Posts: 21607

Two days does not seem long to me. No need to apologize.

If you installed xinetd, you have the ability to run chargen, but it may not be configured yet. Edit /etc/xinetd.d/chargen-stream to configure it. See man xinetd.conf for details on the configuration language, or ask here if that is insufficient.

Larger ping packets might help, but I would rather reproduce this with TCP so that we get the same flow characteristics.

I meant: halt the system, unplug power, remove the battery, plug wall power back in, and run your test until the connection fails. When it fails, halt the system, unplug wall power, let it sit for 5+ seconds, restore wall power, then turn it on. The goal is to ensure that the relevant hardware is temporarily completely unpowered (including lack of battery backup), since that gives the best chance that it will be initialized from a good default state on boot.

fsavigny · n00b Joined: 08 Sep 2012 Posts: 24

Hu · Moderator Joined: 06 Mar 2007 Posts: 21607

Right, we want to see whether the hardware can be made pristine again at all. After that, we can try to narrow down why it needs that help.

Yes, it was intended, since I never included instructions on when to put the battery in, either. I intended for you to remove it once and leave it out for the duration.

You probably have the xinetd listener bound to localhost, rather than wildcard. Change the bind directive to listen on 0.0.0.0, restart xinetd, and try again.
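
To illustrate the difference between a localhost and a wildcard bind, here is a small stand-in for a chargen-style service written in Python. It is not xinetd and not its configuration, just a sketch of the same idea: a listener bound to 127.0.0.1 accepts connections only from the same machine, while one bound to 0.0.0.0 accepts them on every IPv4 interface. The port 1919 and all function names are assumptions; the 72-character rotating pattern follows the one RFC 864 describes as popular.

```python
import socket
import threading

PRINTABLE = bytes(range(32, 127))  # the 95 printable ASCII characters

def chargen_lines():
    """Yield 72-character lines in a rotating printable-ASCII pattern."""
    offset = 0
    while True:
        line = bytes(PRINTABLE[(offset + i) % 95] for i in range(72))
        yield line + b"\r\n"
        offset = (offset + 1) % 95

def handle(conn):
    """Stream the pattern to one client until it disconnects."""
    with conn:
        try:
            for line in chargen_lines():
                conn.sendall(line)
        except OSError:
            pass  # client went away

def make_listener(bind_addr="0.0.0.0", port=1919):
    """Create a listening socket; "127.0.0.1" would restrict it to local clients."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_addr, port))
    srv.listen(5)
    return srv

def serve_forever(srv):
    """Accept clients and serve each one in its own thread."""
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

# Usage: serve_forever(make_listener("0.0.0.0", 1919))
```

With the wildcard address the other machine can connect; with "127.0.0.1" it would be refused even though a local test succeeds, which matches the symptom described here.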

fsavigny · n00b Joined: 08 Sep 2012 Posts: 24

Hu · Moderator Joined: 06 Mar 2007 Posts: 21607

The results with removing power make me fairly confident that the problem lies in the laptop, not in the router. The original messages about DMA lead me to believe this is somehow related to having a large volume of traffic in flight concurrently. Large contiguous files are more likely to satisfy this than a group of scattered small files, especially if the total size exceeds system caching. Working chargen would be nice for making this easier to reproduce, but is not required, since it seems like you can reproduce this quite readily with the rsync test. We still do not know whether the problem is card firmware or the kernel driver, but I am inclined to believe that it is at least in part a firmware bug. The kernel ought to initialize the card the same way on every boot, so if the firmware gets into a state where sometimes the kernel probe initializes it correctly and sometimes it does not, then in my opinion, that is a firmware bug.
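
If a suitably large file is not at hand for the rsync test, one can be generated. The following is a minimal sketch that writes a single contiguous file of random bytes, which should not compress well in transit; the function name, path, and size in the usage comment are assumptions, so pick a size that exceeds the RAM available for caching on the sending machine:

```python
import os

def write_test_file(path, size_bytes, chunk=1024 * 1024):
    """Write `size_bytes` of random data to `path`; return the resulting file size."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))  # write in chunks to keep memory use small
            remaining -= n
    return os.path.getsize(path)

# Usage (hypothetical path and size): a 2 GiB file inside the rsync module directory
# write_test_file("/srv/rsync/bigfile.bin", 2 * 1024**3)
```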

You probably need to set the bind option on the individual services, specifically chargen-stream. That is why I previously said:

fsavigny · n00b Joined: 08 Sep 2012 Posts: 24

Hu · Moderator Joined: 06 Mar 2007 Posts: 21607

A firmware update is usually a last resort, due both to the dangers you mentioned and to the fact that some manufacturers release firmware updates only for the most critical problems, such as issues that would otherwise warrant a product return/recall.

Yes, I think we should set aside the xinetd/chargen tests for now. I had suggested that because I thought it would be easier to do that than to ask you to re-run the rsync tests, but it is proving to cost you more time than it would save.

You can get a more recent kernel through Portage by adding sys-kernel/vanilla-sources to /etc/portage/package.accept_keywords, so that it accepts testing versions of that package, then running emerge --oneshot sys-kernel/vanilla-sources. You can optionally specify a version. For this purpose, I would suggest using the latest 3.12 series, which is currently the newest stable kernel.