Gentoo Forums
Very bad network performance [solved]
jesnow
l33t

Joined: 26 Apr 2006
Posts: 856

Posted: Mon Feb 06, 2023 5:00 am    Post subject: Very bad network performance [solved]

I apologize if this plea for insight is in the wrong forum. And for its length.

For the past couple of months I've been fighting with file access throughput to my professional server (which I moved to my home during COVID).

The basic problem is that uploads of all kinds from work (not NAT but behind a firewall) to home (behind a NAT firewall, but with an ssh pinhole) have been hideously slow. Real life transfer speeds of <10Mb/s, despite the entire route being wired 1GBE or faster. From home to work by contrast, the connection runs as fast as you could hope for, ~500-600Mbps real life transfer speed, often more. The ping in both directions is a respectable 30ms.

Hardware: (Home) HP Elite 7000 from ~2011, Core i7 860, 12GB RAM, Realtek RTL8111/8168/8411 PCIe onboard. Router is an AT&T Arris model BGW210, using an AT&T FTTH connection. It should be super fast. Work is a similar era Dell with a Cisco switch. All machines are running Gentoo or Calculate. The Gentoo machines use the r8169 driver built into the kernel, the Calculate boxes use it as a module.

This has been very hard to track down because of the many layers and network pieces in between. There are many things that can quietly go wrong. And it's difficult to test because both firewalls block ping. iperf has been a godsend, and so has speedtest-cli.

1. On campus: throughput between my two work machines is perfect, both ways. I spent a long time proving that. My IT guy insisted it was the older work machine slowing everything down.
2. Routing from campus to home is apparently multipath. The route in the up direction (to home) is really *really* different from the down route from the work ISP to my home ISP. That threw us off for a long time once we discovered it, but it was a red herring.
3. On the home side I get perfect 950Mb/s between machines, though slower to the server machine. Still, >30x faster than the miserable throughput from the outside world.

BUT who's to say whether it's samba, ssh, NFS, wireguard or any of the myriad things I've tried that could have been misconfigured. I've had at least two threads in this group trying to solve the throughput issue through the network stack at various levels. And maybe (I thought) somebody is throttling me, like my ISP or the campus ISP doing some traffic shaping to punish me, who knows, for my choice in music or something. I really thought it must be a glitch at work.

Two key observations the past couple of days:

4. I opened a port in my home router firewall just for iperf. When I tested the upload speed without any encryption layer (like ssh or wireguard) and without samba/NFS/sshfs, nothing changed: still ~10Mb/s in the upload (i.e. to home) direction, often not even that. So ssh, samba, NFS, wireguard, et al. are now all off the hook.

5. I spun up an AWS-EC2 server just to test this issue. And guess what: I got the same rotten throughput in the to-home direction as I had gotten from campus. And perfect throughput to and from campus. That lets the entire campus side of it off the hook. None of the problems are with IT services, they are blameless. My problem is inside the house (cue spooky music).

6. I redirected the home router iperf pinhole -- instead of port forwarding it to my main server (above) like everything else, I sent it to an even older core i3 machine I keep around just for playing with python on devices. THAT machine got perfect throughput up and down! Aha. The game is afoot! I swapped cables with the server machine, and the fault stayed on the server.

7. So it seems simple, right? Sounds like a bad nic on the home server. Even though everything works, that's maybe because it's mostly very fault tolerant. There were a few dropped packets in ifconfig in the RX direction. But not 90%, or anything close.

8. I got good but not perfect performance between the three wired ethernet machines on my router (one of which is the server). Always slowest in the upload to the server direction, but still plenty fast enough if that was the only problem, ~300Mb/s. The other connections were >900Mb/s, as they should be.

9. It was a real AHA moment when my mac laptop (using wifi) *and* the gentoo running in the virtualbox on the mac both got the same bad performance on the upload side to the server. So that means *my* ISP is now off the hook. The problem is really local to my office here. Any packet that is routed to my server goes at 1% performance, except for the ones coming from machines on the wired ethernet ports; those get only 30% performance. You would shrug *that* off if it wasn't for the catastrophic speed of the routed packets to the server.

10. So taken together, the evidence seems to be that the problem is a combination of *that* nic with *that* router, not just one component. I have a new nic arriving Monday, which should solve at least one part of the problem. The idea of a new router, though, gives me shortness of breath, because then I'll have to re-authenticate every single gadget in the house from the tv to the toaster.
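
The direction-by-direction tests in the list above can be sketched as a small helper that prints the two client commands to run against one iperf server. This is only an illustration: it assumes iperf3 (the post just says "iperf"), the function name `iperf_plan` is made up here, and 5201 is simply iperf3's default port.

```shell
#!/bin/sh
# Hypothetical helper: print the two iperf3 client commands that exercise
# both directions of a path. Nothing is executed; the output can be piped to sh.
iperf_plan() {
    # $1 = server host, $2 = port (defaults to 5201), $3 = test length in seconds
    host="$1"
    port="${2:-5201}"
    secs="${3:-10}"
    # Normal mode: the client sends, so this measures the path toward the server.
    echo "iperf3 -c $host -p $port -t $secs"
    # Reverse mode (-R): the server sends, measuring the path back to the client.
    echo "iperf3 -c $host -p $port -t $secs -R"
}
```

With `iperf3 -s -p 5201` running on the machine behind the pinhole, `iperf_plan 192.168.1.105 | sh` would run both directions back to back, so a slow "to the server" number can be compared directly against the reverse one.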

So my "what have I missed" questions:

Q1: Is there some tuning/troubleshooting of the r8169 driver one can do? It's well and good to see dropped packets, but I don't see anything I can tune to try to fix the problem. This is an extremely mature driver, so I'd sooner blame the card.

Q2: I bought a new 2.5GBE PCIE nic, figuring I'll move all my main machines to 2.5GBE someday anyway. Am I opening a new can of worms? Maybe I should have gotten a 1GBE NIC. Maybe I should have gone with a USB3 nic. I don't know what the simple/fast/bombproof solution is.

Q3: How often does the internal routing in a telco router go bad (without failing entirely)? This must be a rare case.

Q4: Is there a particular "gold standard" router that could replace my AT&T one and do the things it does, only better?

Many thanks if you've read this far, and I welcome any comments.

Cheers,
Jon.


Last edited by jesnow on Sat Feb 11, 2023 9:44 pm; edited 2 times in total

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54096
Location: 56N 3W

Posted: Mon Feb 06, 2023 11:22 am

jesnow,

Quote:
Q1: Is there some tuning/troubleshooting of the r8168 driver one can do?

The r8168 driver should not be used if you can possibly avoid it. It's the Realtek open source driver, not in the kernel.
If it's a typo for r8169, that's the in-kernel driver. Use that if you can.

Look in dmesg to see if the NIC driver is trying to load firmware and failing. It works without it, but the firmware contains bug fixes.
Provide it. I don't know what it does.

Not all 2.5G NICs play nicely with a 1G LAN. Good luck. It may add a new problem.

For testing, a cheap USB3/1G NIC from Amazon Warehouse is nice and portable. You probably want one in your parts bin anyway.
There are a lot of fakes out there USB2/100Mb, marked USB3/1G. I've got a few Amazon Basics ones.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

jesnow
l33t

Joined: 26 Apr 2006
Posts: 856

Posted: Mon Feb 06, 2023 1:42 pm

As always thanks!

Here's what dmesg has to say:

Code:

merckx /home/jesnow # dmesg | grep eth
[    0.499512] r8169 0000:02:00.0 eth0: RTL8168d/8111d, 40:61:86:0d:a3:e6, XID 281, IRQ 32
[    0.499629] r8169 0000:02:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[   17.603666] r8169 0000:02:00.0 enp2s0: renamed from eth0





NeddySeagoon wrote:
jesnow,

Quote:
Q1: Is there some tuning/troubleshooting of the r8168 driver one can do?

The r8168 driver should not be used if you can possibly avoid it. It's the Realtek open source driver, not in the kernel.
If it's a typo for r8169, that's the in-kernel driver. Use that if you can.


It looks like that's what I've got, I will edit my post.

Quote:


Look in dmesg to see if the NIC driver is trying to load firmware and failing. It works without it, but the firmware contains bug fixes.
Provide it. I don't know what it does.


Code:

merckx /home/jesnow # equery list firmware -f
 * Searching for firmware ...
[IP-] [  ] sys-kernel/linux-firmware-20230117:0




Quote:


Not all 2.5G NICs play nicely with a 1G LAN. Good luck. It may add a new problem.

For testing, a cheap USB3/1G NIC from Amazon Warehouse is nice and portable. You probably want one in your parts bin anyway.
There are a lot of fakes out there USB2/100Mb, marked USB3/1G. I've got a few Amazon Basics ones.


I think I actually have one, now that you mention it. "In a box somewhere".

Thanks again!
Cheers,
Jon.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54096
Location: 56N 3W

Posted: Mon Feb 06, 2023 1:56 pm

jesnow,

Code:
dmesg | grep -i failed
is better at spotting firmware loading failures.
The line may not mention eth.

sys-kernel/linux-firmware is required to provide the firmware, but it's only sufficient if r8169 is a loadable module.
If it's built into the kernel binary, the firmware must be built in too.
At
Code:
[    0.499512] r8169 ...
less than 0.5 sec into booting I suspect that r8169 is built in.
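
The built-in versus module distinction above can be checked from the kernel .config. The sketch below assumes the firmware would be built in through the kernel's `CONFIG_EXTRA_FIRMWARE` option and that the file lives under `rtl_nic/` in linux-firmware; the function name `r8169_firmware_status` is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: report how r8169 is configured in a kernel .config and
# whether any rtl_nic firmware is listed for building into the kernel image.
r8169_firmware_status() {
    # $1 = path to a kernel .config file
    config="$1"
    if grep -q '^CONFIG_R8169=y' "$config"; then
        # Built-in driver: look for an rtl_nic entry in CONFIG_EXTRA_FIRMWARE.
        if grep '^CONFIG_EXTRA_FIRMWARE=' "$config" | grep -q 'rtl_nic/'; then
            echo "built in, rtl_nic firmware built in"
        else
            echo "built in, rtl_nic firmware NOT built in"
        fi
    elif grep -q '^CONFIG_R8169=m' "$config"; then
        # Module: firmware is read from /lib/firmware when the module needs it.
        echo "module, firmware comes from /lib/firmware"
    else
        echo "r8169 not enabled"
    fi
}
```

For example, `r8169_firmware_status /usr/src/linux/.config` on the affected machine, followed by `dmesg | grep -i firmware` to see whether a load was attempted at all.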

Long shot ... is one end of the link getting the MTU wrong?
That can result in lots of fragmentation, which is bad for throughput.
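
One way to test the MTU idea on the LAN side, where ping is not blocked, is to send pings with fragmentation forbidden at exactly the size the MTU should allow. The sketch assumes IPv4 and the iputils `ping` (its `-M do` option sets the don't-fragment bit); `ping_payload` is just a name chosen here.

```shell
#!/bin/sh
# Hypothetical helper: largest ICMP echo payload that fits in one IPv4 packet
# for a given MTU (20 bytes IPv4 header + 8 bytes ICMP header are subtracted).
ping_payload() {
    # $1 = MTU in bytes, e.g. 1500
    echo $(( $1 - 28 ))
}

# Example use (not run here):
#   ping -M do -c 3 -s "$(ping_payload 1500)" 192.168.1.105
# If this works but one byte more fails with "message too long",
# the path MTU to that host really is 1500.
```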
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

jesnow
l33t

Joined: 26 Apr 2006
Posts: 856

Posted: Wed Feb 08, 2023 2:57 pm

Still no solution. One heartening thing is that after much googlery I found someone in my exact situation:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1880076

I did all the same tests they tried in that thread, with all the same head-scratchy results. I even have an almost identical machine (Core i3 version of the same MB) sitting right next to my "broken" one.

They traced the poor receive performance to missed RX packets which indeed I have on my broken machine:

Code:

merckx /usr/src/linux-5.15.10-gentoo # ifconfig
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.105  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 2600:1700:a90:1b20:2fe7:d7c9:577f:b148  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::889b:b025:e1a2:d7b0  prefixlen 64  scopeid 0x20<link>
        ether 40:61:86:0d:a3:e6  txqueuelen 1000  (Ethernet)
        RX packets 2825732  bytes 2013896062 (1.8 GiB)
        RX errors 0  dropped 16993  overruns 0  frame 0
        TX packets 3492224  bytes 4449519440 (4.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


and not on my "good machine"

Code:

armstrong linux # ifconfig
enp1s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.106  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::6e62:6dff:fe58:fb7b  prefixlen 64  scopeid 0x20<link>
        inet6 2600:1700:a90:1b20:6e62:6dff:fe58:fb7b  prefixlen 64  scopeid 0x0<global>
        ether 6c:62:6d:58:fb:7b  txqueuelen 1000  (Ethernet)
        RX packets 6059455  bytes 8951734354 (8.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2310118  bytes 124969812 (119.1 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
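
To put the two listings above side by side, the dropped counter can be expressed as a percentage of received packets. This is plain arithmetic on the numbers ifconfig prints; `drop_pct` is a name made up for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: RX drops as a percentage of RX packets.
drop_pct() {
    # $1 = dropped (or missed) RX packets, $2 = total RX packets
    awk -v d="$1" -v t="$2" 'BEGIN {
        if (t <= 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * d / t
    }'
}

drop_pct 16993 2825732   # merckx, the "broken" machine: prints 0.60
drop_pct 0 6059455       # armstrong, the "good" machine: prints 0.00
```

A loss rate around half a percent looks small, but it is counted on every stream into the machine, which is consistent with the remark earlier in the thread that the drops are "not 90%, or anything close" while throughput is still poor.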



I did a lot of checks using ethtool and lshw -vv, and found no, zero, zip nada difference between the drivers/hardware/anything.

Except that armstrong is using the r8169 driver as a module, and merckx is using it as a built-in. Also a slightly different kernel version. Maybe that's it? I have to leave the server up the rest of the day, but tonight I can take it down and rebuild the kernel with identical settings.


Again the "broken" machine:

Code:

merckx /usr/src/linux-5.15.10-gentoo # grep R8169 .config
CONFIG_R8169=y
merckx /usr/src/linux-5.15.10-gentoo # grep REALT .config
CONFIG_NET_VENDOR_REALTEK=y
CONFIG_REALTEK_PHY=y
CONFIG_SND_HDA_CODEC_REALTEK=y
# CONFIG_USB_STORAGE_REALTEK is not set
merckx /usr/src/linux-5.15.10-gentoo # lshw | grep rtl
                configuration: autonegotiation=on broadcast=yes driver=r8169 driverversion=5.15.10-gentoo duplex=full firmware=rtl_nic/rtl8168d-1.fw ip=192.168.1.105 latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
merckx /usr/src/linux-5.15.10-gentoo # dmesg | grep r816
[    0.480037] r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
[    0.484406] libphy: r8169: probed
[    0.486531] r8169 0000:02:00.0 eth0: RTL8168d/8111d, 40:61:86:0d:a3:e6, XID 281, IRQ 32
[    0.486649] r8169 0000:02:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[   15.363292] r8169 0000:02:00.0 enp2s0: renamed from eth0
[   16.404434] RTL8211B Gigabit Ethernet r8169-0-200:00: attached PHY driver (mii_bus:phy_addr=r8169-0-200:00, irq=MAC)
[   16.571601] r8169 0000:02:00.0 enp2s0: Link is Down
[   19.328196] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off
[ 1974.319047] r8169 0000:02:00.0 enp2s0: Link is Down
[ 1981.212917] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off
[13376.226128] r8169 0000:02:00.0 enp2s0: Link is Down
[13385.568968] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off
[13418.307528] r8169 0000:02:00.0 enp2s0: Link is Down
[13441.783841] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off
[13449.360890] r8169 0000:02:00.0 enp2s0: Link is Down
[13452.124273] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off
[120095.139243] r8169 0000:02:00.0: invalid VPD tag 0xe1 (size 46370) at offset 26691


The broken machine link seems to go up and down a lot. I'm pretty sure it's over about the same time frame.
...and the "good" machine:

Code:

armstrong linux # grep R8169 .config
CONFIG_R8169=m
armstrong linux # grep REALT .config
CONFIG_NET_DSA_REALTEK_SMI=m
CONFIG_NET_VENDOR_REALTEK=y
CONFIG_REALTEK_PHY=m
CONFIG_WLAN_VENDOR_REALTEK=y
CONFIG_SND_HDA_CODEC_REALTEK=m
CONFIG_USB_STORAGE_REALTEK=m
CONFIG_REALTEK_AUTOPM=y
CONFIG_MMC_REALTEK_PCI=m
CONFIG_MMC_REALTEK_USB=m
CONFIG_MEMSTICK_REALTEK_PCI=m
CONFIG_MEMSTICK_REALTEK_USB=m
armstrong linux # uname -a
Linux armstrong 5.15.82-calculate #1 SMP PREEMPT Tue Dec 13 09:40:40 UTC 2022 x86_64 Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz GenuineIntel GNU/Linux
armstrong linux # lshw | grep rtl
                configuration: autonegotiation=on broadcast=yes driver=r8169 driverversion=5.15.82-calculate duplex=full firmware=rtl_nic/rtl8168d-1.fw ip=192.168.1.106 latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
armstrong linux # dmesg | grep r816
[    4.391407] r8169 0000:01:00.0: can't disable ASPM; OS doesn't have ASPM control
[    4.400576] r8169 0000:01:00.0 eth0: RTL8168d/8111d, 6c:62:6d:58:fb:7b, XID 281, IRQ 28
[    4.400585] r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[    4.619792] r8169 0000:01:00.0 enp1s0: renamed from eth0
[    9.324491] RTL8211B Gigabit Ethernet r8169-0-100:00: attached PHY driver (mii_bus:phy_addr=r8169-0-100:00, irq=MAC)
[    9.456405] r8169 0000:01:00.0 enp1s0: Link is Down
[   12.239166] r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control off
[47332.767508] r8169 0000:01:00.0: invalid VPD tag 0xb2 (size 46435) at offset 26708
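
The link flapping visible in the merckx log can be counted rather than eyeballed. A minimal sketch, assuming the "Link is Down" wording shown in the dmesg output above; `count_link_down` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: count "Link is Down" events for one interface
# in kernel log text supplied on standard input.
count_link_down() {
    # $1 = interface name as it appears in the log, e.g. enp2s0
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c "$1: Link is Down" || true
}

# Example use (not run here):
#   dmesg | count_link_down enp2s0
```

One "Link is Down" line right after boot is normal while the link negotiates, so the merckx log above would count as four flaps after boot and the armstrong log as none.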


Any thoughts? What is the "invalid VPD tag" message that both machines show? From googling, it seems to be a showstopper on other OSes but not here.

Cheers,
Jon.

jesnow
l33t

Joined: 26 Apr 2006
Posts: 856

Posted: Thu Feb 09, 2023 1:54 am

Things are getting clearer about what the symptoms are:

I get dropped RX packets shown in ifconfig:

Code:


enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.1.105  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::889b:b025:e1a2:d7b0  prefixlen 64  scopeid 0x20<link>
        inet6 2600:1700:a90:1b20:2fe7:d7c9:577f:b148  prefixlen 64  scopeid 0x0<global>
        ether 40:61:86:0d:a3:e6  txqueuelen 1000  (Ethernet)
        RX packets 1764813  bytes 2364484783 (2.2 GiB)
        RX errors 0  dropped 44110  overruns 0  frame 0
        TX packets 2195670  bytes 2268436095 (2.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0



that ip -s link shows as "missed"

Code:


3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 40:61:86:0d:a3:e6 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast           
    2364497477 1764921      0       2   44108    6706
    TX:  bytes packets errors dropped carrier collsns           
    2268440521 2195694      0       0       0       0
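
For watching that counter from a script, the "missed" column can be pulled out of the `ip -s link` output. The sketch assumes the column layout shown in the listing above (older iproute2 versions label the columns differently, in which case nothing is printed); `rx_missed` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the RX "missed" value from `ip -s link show dev IF`
# output read on standard input.
rx_missed() {
    awk '
        $1 == "RX:" {
            # Find which header column is "missed"; the values on the next
            # line are shifted one field left because "RX:" has no value.
            for (i = 2; i <= NF; i++) if ($i == "missed") col = i - 1
            getline
            if (col) print $col
            exit
        }
    '
}

# Example use (not run here):
#   ip -s link show dev enp2s0 | rx_missed
```

The same counter is also exposed by the kernel as `/sys/class/net/enp2s0/statistics/rx_missed_errors`, which may be simpler to poll in a loop while iperf runs.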



I can sit here watching the number spool upwards when I run iperf.
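
A rough way to see why a small missed-packet rate can hurt so much is the well-known Mathis approximation for loss-limited TCP, throughput ≈ (MSS / RTT) · C / sqrt(p) with C = sqrt(3/2). The sketch below is only an estimate under stated assumptions: random loss, Reno-style congestion control, an MSS of 1460 bytes, the 30 ms ping mentioned in the first post, and a loss fraction of 0.006 taken from the earlier ifconfig counters. The function name `mathis_mbps` is made up.

```shell
#!/bin/sh
# Hypothetical helper: Mathis et al. upper-bound estimate of TCP throughput
# in Mb/s for a given MSS, round-trip time and packet loss fraction.
mathis_mbps() {
    # $1 = MSS in bytes, $2 = round-trip time in ms, $3 = loss fraction (0.006 = 0.6%)
    awk -v mss="$1" -v rtt_ms="$2" -v p="$3" 'BEGIN {
        c = sqrt(1.5)                                   # constant from the model
        bps = (mss * 8) / (rtt_ms / 1000) * c / sqrt(p) # bits per second
        printf "%.1f\n", bps / 1e6
    }'
}

mathis_mbps 1460 30 0.006   # 30 ms WAN path: prints 6.2
```

Because the estimate scales with 1/RTT, the same loss rate at a 0.5 ms LAN round trip gives a ceiling sixty times higher, which would fit the pattern of tolerable speeds from wired LAN machines and very poor speeds from anything farther away. Treat it as a plausibility check, not a diagnosis.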
Other people have had this issue, even to the point (like I do) of having a very similar machine with a similar chip sitting right next to it. But nobody seems to know the cause. Here for example is someone with exactly my problem:

https://bugzilla.redhat.com/show_bug.cgi?id=1671958

I recompiled the kernel to make the r8169 driver a module like it is on the other machine. No luck.

Any help would be greatly appreciated...

Cheers,
Jon.

jesnow
l33t

Joined: 26 Apr 2006
Posts: 856

Posted: Thu Feb 09, 2023 3:37 am

I tried the realtek r8168 driver that was in portage. Same behavior.

Code:

jesnow@merckx ~ $ dmesg | grep r8
[    0.041866] percpu: Embedded 43 pages/cpu s137048 r8192 d30888 u262144
[    0.041873] pcpu-alloc: s137048 r8192 d30888 u262144 alloc=1*2097152
[   14.842585] r8168: loading out-of-tree module taints kernel.
[   14.843396] r8168 Gigabit Ethernet driver 8.051.02-NAPI loaded
[   14.846079] r8168: This product is covered by one or more of the following patents: US6,570,884, US6,115,776, and US6,327,625.
[   14.846099] r8168  Copyright (C) 2022 Realtek NIC software team <nicfae@realtek.com>
[   14.955835] r8168 0000:02:00.0 enp2s0: renamed from eth0
[   15.067263] r8152 2-1.7:1.0 (unnamed net_device) (uninitialized): netif_napi_add() called with weight 256
[   15.067535] r8152 2-1.7:1.0 eth0: v2.16.3 (2022/07/06)
[   15.067538] r8152 2-1.7:1.0 eth0: This product is covered by one or more of the following patents:
[   15.067560] usbcore: registered new interface driver r8152
[   15.073892] r8152 2-1.7:1.0 enp0s29u1u7: renamed from eth0
[   19.095191] r8168: enp2s0: link up


The r8152 works and coexists with r8168 and r8169, BUT it isn't so great on a USB 1.0 port. I have a USB 3.2 gen2 card on the way; maybe that will end up being the workaround.

Cheers,
Jon.

jesnow
l33t

Joined: 26 Apr 2006
Posts: 856

Posted: Sat Feb 11, 2023 9:43 pm

OK finally an end to this saga.

Yes, it was a bad onboard NIC in my old machine.

BUT the new Realtek r8125 2.5GBE card would not work no matter what I did. I rebuilt the kernel a couple of different ways and tried different driver versions. I ended up with the Realtek-supplied driver.

The USB 2.5GBE adapter did work, with the r8152 driver. But in a usb 1.1 port, it was veeeery sloooow. I just installed a pcie usb3.2 gen2 card, and will test that when I get around to it.

As I never got the system to recognize the PCIE nic, or get the link light to come on I decided this new nic must be bad. As I was putting it back in the bag to send it back, a small oscillator fell out of the bag. When I put it under a microscope I could see that the only way it lined up with the solder pads was if it was upside down, which seemed strange. But so I soldered it back on upside down, put it in the slot, rebooted, and lo and behold, everything works. Hooray.

My takeaway: Most cheap 2.5GBE adapters seem to use the r8125 chip, which did not work for me with the r8168 driver or with r8169, whether built in or as a module. The Realtek-supplied drivers, which you install from a tarball via a script, seem to work fine. There are older ones in the kernel I did not test. Of course it's not running at 2.5GBE yet, because I don't have any other 2.5GBE equipment. In due course.
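
When juggling r8169, r8168, r8125 and r8152 like this, it helps to confirm which driver is actually bound to an interface. A minimal sketch using the sysfs `device/driver` symlink; the function name `nic_driver` and the optional second argument (an alternate sysfs root, only useful for testing) are inventions of this example.

```shell
#!/bin/sh
# Hypothetical helper: print the name of the kernel driver bound to a network
# interface by resolving its sysfs "driver" symlink.
nic_driver() {
    # $1 = interface name, $2 = sysfs root (defaults to /sys)
    iface="$1"
    sysroot="${2:-/sys}"
    link="$sysroot/class/net/$iface/device/driver"
    if [ ! -e "$link" ]; then
        echo "unknown"
        return 1
    fi
    # The symlink points at .../drivers/<name>; keep only the last component.
    basename "$(readlink -f "$link")"
}

# Example use (not run here):
#   nic_driver enp2s0
```

`ethtool -i enp2s0` and `lspci -k` report the same information, together with the driver version.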

Thanks to everyone who provided feedback!