Gentoo Forums
InfiniBand - a cheap way to _fast_ network, PC to PC?
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 41417
Location: 56N 3W

PostPosted: Tue May 23, 2017 10:16 pm    Post subject: Reply with quote

Zucca,

Put the switch back in the hole, just so you can't accidentally poke a finger in and touch something nasty.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Back to top
View user's profile Send private message
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Tue May 23, 2017 10:24 pm    Post subject: Reply with quote

NeddySeagoon wrote:
Zucca,

Put the switch back in the hole, just so you can't accidentally poke a finger in and touch something nasty.
That won't happen. ;) The server is inside a ventilated 19" rack cabinet. Thanks for mentioning it anyway.
I could however put something there... Maybe I'll carefully take the switch apart (I want to see the damage inside) and then put the hollowed-out switch back in. I'll think about it again tomorrow. Thanks. :)
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Sun May 28, 2017 12:24 pm    Post subject: Reply with quote

NFS over RDMA is working here.
But I have serious problems on the client side with unmounting the shares before putting the system to sleep. I previously just unloaded all ib and rdma modules before sleep, but now that isn't possible because of the shares: modprobe just reports that the modules are in use. So I disabled the systemd service that unloads those modules before going to sleep. Everything was fine, but after waking up I had lost all ib connectivity. The IP-over-IB network interface was still up, but nothing went through. At that point I ran the service which unloads and then reloads the modules... It froze systemd too. I managed to escape from that lockup by enabling magic SysRq and doing REISUB. No filesystem corruption afterwards. Yay!

I now need to think of a way to first unmount all the NFS shares, then kill all programs using libibumad (and maybe rdma too). After that the modules should unload freely before sleeping.

Note that this seems to be an issue only on Mellanox hardware, and maybe just on the older cards. AFAIK the newer hardware uses a different driver and does not suffer from this.
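The pre-sleep cleanup I have in mind would look roughly like this. Only a sketch: the module list matches my lsmod output for mthca hardware, and the umad device path is an assumption.

```shell
#!/bin/sh
# Sketch of a pre-sleep hook: unmount NFS shares, stop whatever holds
# the umad devices open, then unload the InfiniBand/RDMA modules.

# 1. Unmount every NFS mount (fall back to a lazy unmount if one hangs)
awk '$3 ~ /^nfs/ {print $2}' /proc/mounts | while read -r mnt; do
    umount "$mnt" 2>/dev/null || umount -l "$mnt"
done

# 2. Kill processes with the umad devices open (assumed device path)
for pid in $(lsof -t /dev/infiniband/umad* 2>/dev/null); do
    kill "$pid"
done

# 3. Unload the modules, most dependent first (list from my lsmod output)
for mod in rpcrdma ib_ipoib ib_umad rdma_cm ib_cm ib_mthca ib_core; do
    modprobe -r "$mod" 2>/dev/null || true
done
```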
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Thu Jun 01, 2017 8:52 pm    Post subject: Reply with quote

Okay.
After several shots in the dark I got my NFS properly working... but via IP, so no RDMA. :( Well... it works over RDMA too, but if I unmount the share I cannot mount it again. I tried manually unmounting and then remounting, but the problem is the same. So if you keep your NFS mounts mounted all the time and don't use any sleep functions, it works.
The good news is that via IP I get around 300 MiB/s transfer speed, which is enough for my current usage here.

These are the errors I get when trying to mount the share for the second time:
log:
kernel: rpcrdma: 'frwr' mode is not supported by device mthca0
kernel: rpcrdma: 'frwr' mode is not supported by device mthca0
kernel: rpcrdma: connection to 10.0.11.1:20049 on mthca0 rejected: stale conn
kernel: rpcrdma: connection to 10.0.11.1:20049 closed (-11)
... the messages about the 'frwr' mode also appear when successfully mounting the share.

As a last resort I can start poking different module parameters:
shellcmd: lsmod | awk '$1 ~ /^ib_|rdma/ {print $1}' | while read mod; do echo -e "\n== $mod =="; modinfo -p "$mod"; done :
== rpcrdma ==

== rdma_cm ==

== ib_umad ==

== ib_ipoib ==
max_nonsrq_conn_qp:Max number of connected-mode QPs per interface (applied only if shared receive queue is not available) (int)
mcast_debug_level:Enable multicast debug tracing if > 0 (int)
send_queue_size:Number of descriptors in send queue (int)
recv_queue_size:Number of descriptors in receive queue (int)
debug_level:Enable debug tracing if > 0 (int)

== ib_cm ==

== ib_mthca ==
catas_reset_disable:disable reset on catastrophic event if nonzero (int)
fw_cmd_doorbell:post FW commands through doorbell page if nonzero (and supported by FW) (int)
debug_level:Enable debug tracing if > 0 (int)
msi_x:attempt to use MSI-X if nonzero (int)
tune_pci:increase PCI burst from the default set by BIOS if nonzero (int)
num_qp:maximum number of QPs per HCA (int)
rdb_per_qp:number of RDB buffers per QP (int)
num_cq:maximum number of CQs per HCA (int)
num_mcg:maximum number of multicast groups per HCA (int)
num_mpt:maximum number of memory protection table entries per HCA (int)
num_mtt:maximum number of memory translation table segments per HCA (int)
num_udav:maximum number of UD address vectors per HCA (int)
fmr_reserved_mtts:number of memory translation table segments reserved for FMR (int)
log_mtts_per_seg:Log2 number of MTT entries per segment (1-5) (int)

== ib_core ==
send_queue_size:Size of send queue in number of work requests (int)
recv_queue_size:Size of receive queue in number of work requests (int)
force_mr:Force usage of MRs for RDMA READ/WRITE operations (bool)
If any of you has a guess as to which parameter to change (other than debug_level)... please speak now. :)


EDIT01: My brain is starting to melt right now. I put debug_level=7 for ib_mthca in the modprobe configuration and did several mounts and umounts. Every single time I managed to mount the share and view its contents. What on earth is going on here?
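For completeness, that modprobe configuration is just a one-liner; the file name here is made up:

```shell
# /etc/modprobe.d/ib_mthca.conf (hypothetical file name)
# Turn on verbose debug tracing in the mthca driver; set back to 0 to disable
options ib_mthca debug_level=7
```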
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Fri Jun 02, 2017 7:06 pm    Post subject: Reply with quote

I wonder, what options do you use to mount NFS?
It seems -o soft makes quite a difference with unstable links, and suspend does look like a case of an unstable link to me.

edit: lol, typo. Yes, "soft", of course.
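To spell it out, I mean something like this (server address, export path and mountpoint are made up):

```shell
# "soft" lets NFS operations fail with an error once the retries run out,
# instead of blocking forever when the server goes away (e.g. over suspend).
mount -t nfs -o soft,timeo=100,retrans=3 10.0.11.1:/export /mnt/nfs
```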


Last edited by szatox on Sun Jun 04, 2017 2:51 pm; edited 1 time in total
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Fri Jun 02, 2017 8:13 pm    Post subject: Reply with quote

szatox wrote:
I wonder, what options do you use for mount ntfs?
It seems -o sotf makes quite a difference with unstable links, and suspend does look like a case of unstable link to me.
Thanks for the advice.
So far my uptime is closing in on the 24h mark, and with nine hibernations in a row everything seems to be working. If it's indeed the debug_level that "solved" the issue, then there might be errors caused by a race condition. I'll test your tip if something fails in the coming days.

Oh. At first I read "soft". Then while writing this I saw that you actually spelled "sotf", but I'll assume you meant "soft" after all. :)
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Mon Jun 05, 2017 9:54 am    Post subject: Reply with quote

Ok. So after about 48h of uptime, hibernating didn't succeed. Unfortunately I wasn't watching where it got stuck. Usually it gets stuck after the "first" freeze (when hibernating, the system actually first goes into some kind of sleep state, from which it wakes only to write the hibernation image and then power off) and does not continue.
After a SysRq REISUB I couldn't get any information on what had actually happened, since systemd-journald had lost all the logs from just after initiating the hibernation (a constant problem with journald).

I have now added the "soft" mount option. I'll hope for the best.

Also: for those who are thinking of buying used InfiniBand hardware - try to avoid the Mellanox cards that use the mthca kernel driver if you need to put your computer to sleep. A fix isn't likely to come, since the hardware using that driver is quite old. I learnt this the hard way. ;)
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
Talis9
n00b


Joined: 10 Feb 2018
Posts: 2

PostPosted: Sat Feb 10, 2018 1:08 am    Post subject: Reply with quote

Looking over the IB perf achieved in Gentoo with limited (prehistoric) servers inspires me to offer a small cluster, either sent out (1U racks) if you're in the US, or via remote admin, to use in benchmarks. With 40G VPI cards (ConnectX-2 or X3), thus avoiding the mthca drivers, to prove in-mem DB loading to a head pair of HTTP transaction servers. Real-world 24/7 grind, thus no hibernate complications, using the mlx4 drivers, if anyone has interest. Your choice of switches, 6036 or 5030, both managed. Custom kernel compile on Gentoo or straight OFED on CentOS 7.3. Would suit the finance-oriented. All hardware provided. Keen to see a Gentoo kernel perform rings around the BSD distros. This post is a good place to start.
Ant P.
Watchman


Joined: 18 Apr 2009
Posts: 5234

PostPosted: Sat Feb 10, 2018 6:07 am    Post subject: Reply with quote

Zucca wrote:
Also: For those who are thinking of buying used InfiniBand hardware - Try to avoid Mellanox ones that use mthca kernel driver if you need to put your computer to sleep. The fix isn't likely to come since the hardware using the driver is quite old. I learnt from this. ;)

Since this topic's been dug up, I might as well add something... that sounds a lot like an ancient PCI sound card I had. I ended up using suspend/resume hooks that su to my user to run pasuspender (I needed pulseaudio so the next step would work), rmmod the driver (this hangs anything using the device!), suspend, and then do the whole operation in reverse.

For NFS, I guess autofs is a good way to achieve the same effect: send it a SIGUSR1 to immediately unmount everything, do the hibernate thing, then after wakeup those mounts will reapply automatically.
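A sleep hook along those lines might look like the sketch below. Whether your sleep framework passes "pre"/"post" as the first argument this way is an assumption about the setup; the SIGUSR1 behaviour (unmount all unbusy autofs mounts) is per the autofs documentation.

```shell
#!/bin/sh
# Hypothetical suspend hook built around autofs.
case "$1" in
    pre)
        pkill -USR1 -x automount   # drop the automounted NFS shares now
        sleep 2                    # give the unmounts a moment to finish
        ;;
    post)
        : # nothing to do; shares remount automatically on next access
        ;;
esac
```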
Talis9
n00b


Joined: 10 Feb 2018
Posts: 2

PostPosted: Tue Feb 13, 2018 9:52 am    Post subject: Reply with quote

Realizing only a seasoned operator would want 3 1U servers whining in the basement: a minimal single HTTP head, IB-linked to 2 Redis in-mem DBs... that's a minimal working deployment without a switch, as dual-port VPI cards keep it cable-linked. WHERE ARE the transaction-curious for a remote INSTALL / ADMIN? Too busy? Given all that time spent on hibernate (always a flaky business in any OS), would it not please Gentoo users to lay out IB management in Paludis?

Oh, if I may ask: Gentoo has no PF, as in the fine-grained packet-filter functionality of pf? Is iptables the preferred package for wire-speed 1Gb/s HTTP? Nope.

netfilter/iptables is at 1.6.2 but offers no bugfix lists and no life-of-a-packet on the docs link .. http://www.netfilter.org/documentation/index.html#documentation-other

The https://home.regit.org/netfilter-en/secure-use-of-helpers/ link does work (at least), only there is NO benchmarking for port listeners .. things are so incomplete..

What time to check a port against the FIB? Clock cycles is asking too much, I know, but from kernel state in µs .. helpful to evaluate whether it is in a live ballpark..

There must be real joy in coders offering an endless-loop experience to newcomers with TODAY'S wire-speed requirements .. iptables seems a home server / client thing ..

" .. With iptables, all of your packets pass through all of your rules. This can really slow things down, especially if you have complicated rulesets. If you use all sorts of crazy iptables modules, that will slow it down pretty heavily too. And if you pass the packet into userspace for further processing, it will slow it down even more. "

Example: how many interface connections (clients) per second may run with a standard spoofing filter, at 10% resources of a dual Xeon 2690?
(Of course an annoying aspect is that most code contributors are running UNIX on old resurrected dumpster CPUs.)

iptables -A PREROUTING -t raw -i eth0 -s $NET_ETH1 -j DROP
iptables -A PREROUTING -t raw -i eth0 -s $ROUTED_VIA_ETH1 -j DROP
iptables -A PREROUTING -t raw -i eth1 -s $NET_ETH1 -j ACCEPT
iptables -A PREROUTING -t raw -i eth1 -s $ROUTED_VIA_ETH1 -j ACCEPT
iptables -A PREROUTING -t raw -i eth1 -j DROP


.. this Gentoo link is down : I thought pf was ported from BSD ..


pf (local USE flag)

Packages describing “pf” as a local USE flag:

sys-freebsd/freebsd-sbin: "Build tools to administer the PF firewall."

Can a BSD pf package compile on a Gentoo kernel? It appears so... so why is the link down?

Is pfSense available in Gentoo , or is it a BSD (UNIX) only package ?
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Tue Feb 13, 2018 2:43 pm    Post subject: Reply with quote

Ant P. wrote:
For NFS, I guess autofs is a good way to achieve the same effect: send it a SIGUSR1 to immediately unmount everything, do the hibernate thing, then after wakeup those mounts will reapply automatically.
A lot has happened since. Firstly, I switched from systemd to OpenRC on my desktop. systemd was unable to unmount certain filesystems when needed, and sometimes that resulted in systemd itself becoming a zombie process. ... yeah... that was 'fun'.
I've been using autofs since then, but I've left RDMA disabled on my NFS mounts. Now that you've said using autofs might work, I may try it again. I just need to set up the hibernation settings in elogind.

Talis9 wrote:
Is pfSense available in Gentoo , or is it a BSD (UNIX) only package ?
Iirc, pfSense only works on top of FreeBSD kernel.
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Thu Apr 26, 2018 2:55 pm    Post subject: Reply with quote

Bumping this topic back up as I noticed some strange behaviour.

I have had a udev rule on my desktop PC which automatically sets the InfiniBand network interface mode to connected and increases the MTU to the maximum.
I did some network-related reconfiguration on my server and found out that the server didn't have those mode and MTU settings, so I created the rule there too. But some time after that, my NFS shares to the desktop started to time out. There was nothing in the server-side logs. The client side had these kinds of log entries repeated:
Code:
udevd[5340]: Validate module index
udevd[5340]: seq 4426 queued, 'add' 'module'
udevd[5340]: seq 4426 forked new worker [8079]
udevd[8079]: seq 4426 running
udevd[8079]: no db file to read /run/udev/data/+module:rpcsec_gss_krb5: No such file or directory
udevd[8079]: passed device to netlink monitor 0x55b6243a32b0
udevd[8079]: seq 4426 processed
udevd[5340]: seq 4427 queued, 'add' 'bdi'
udevd[5340]: passed 139 byte device to netlink monitor 0x55b624384b90
udevd[8079]: seq 4427 running

After removing the udev rule on the server side, things started to work again. I still get those same messages maybe once or twice a day (maybe every time autofs mounts the NFS share), but nonetheless the connection is working.

Can someone "decrypt" those messages for me? I'd like to know if I still have something "not quite right" on my InfiniBand connection.
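(For reference, the rule in question is roughly the following. The file name is made up, and 65520 is the connected-mode MTU ceiling on my hardware:)

```shell
# /etc/udev/rules.d/90-ipoib.rules (hypothetical file name)
# When an IPoIB interface appears, switch it to connected mode, raise the MTU
ACTION=="add", SUBSYSTEM=="net", KERNEL=="ib*", \
  RUN+="/bin/sh -c 'echo connected > /sys/class/net/%k/mode; ip link set %k mtu 65520'"
```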

Also szatox, did you ever build your InfiniBand setup?
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Sat Apr 28, 2018 2:19 pm    Post subject: Reply with quote

Zucca: No. PayPal hates me and those guys in China wouldn't accept anything else (and I'm not paying for wires from the USA; they would be more expensive than literally anything else in my li'l lab), and then I got busy with other things.
Prompted by your message I took a quick glance at the market, and I see it has definitely changed, so maybe it's time to give it another shot.


Quote:
udevd[8079]: no db file to read /run/udev/data/+module:rpcsec_gss_krb5: No such file or directory

It's NFS4, isn't it? Does the same problem exist with NFS3 in tcp mode?
Do you have anything else that uses kerberos?

Perhaps it's a missing kernel module?
Quote:
zcat /proc/config.gz | grep -i krb
CONFIG_RPCSEC_GSS_KRB5=m
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Sat Apr 28, 2018 3:17 pm    Post subject: Reply with quote

szatox wrote:
Perhaps it's a missing kernel module?
Yup. It's missing from server side. Thanks. :)
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Thu May 10, 2018 6:17 pm    Post subject: Reply with quote

Here we go. Prompted by your previous message, I had a look around and found some patch cords on sale, around $7 each. Still expensive for a (damn short) wire, but much more reasonable than the ~$70 basically everyone else asks just for the plugs alone.
I've just got a handful of those delivered, so it's time for some fun.

Edit: one step at a time. Gotta upgrade quite a lot of stuff first.


Last edited by szatox on Thu May 10, 2018 10:06 pm; edited 1 time in total
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Thu May 10, 2018 6:32 pm    Post subject: Reply with quote

I think I paid 30€ for a 7-meter 4x (non-optical) cable. Same price as one card. :P
Going 8x or 12x will empty your wallet pretty quickly. Also you'd need some strength to bend a 12x cable around a tight turn. :D
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Sun Jun 03, 2018 1:06 pm    Post subject: Reply with quote

Just a quick note. Since I haven't finished rebuilding my system images yet (busy life), and I forgot to install opensm in the old one, I went to find out how those InfiniBand cards do in Ethernet mode. Since they're DDR x4, they should provide 16Gbps InfiniBand (12.8Gbps of data after 8b/10b encoding) or 10Gbps Ethernet.

Poor man's iperf was made of a 10000 MiB (~9.8 GiB) file on a ramdisk, nc, and the "time" keyword from bash.
Code:

# (--half-close is necessary for nc to exit after transfer)
# for i in {1..5}; do time nc --half-close 10.1.0.3 8080 < /tmp/test ; done
nc: using stream socket

real    0m11.160s
user    0m0.650s
sys     0m9.860s
nc: using stream socket

real    0m11.422s
user    0m0.790s
sys     0m10.510s
nc: using stream socket

real    0m11.260s
user    0m0.800s
sys     0m9.870s
nc: using stream socket

real    0m11.129s
user    0m0.660s
sys     0m9.960s
nc: unable to connect to address 10.1.0.3, service 8080

real    0m0.001s
user    0m0.000s
sys     0m0.000s

Which (excluding the connection error: nc on the receiving end apparently didn't restart in time) gives us roughly 7.5 Gbps net payload, with one CPU core fully loaded (so the network itself probably is NOT the bottleneck in this case). Gotta retry with multiple threads.
An attempt to use dd piped to nc for the measurement resulted in transfer times over 60 seconds with 2 cores loaded around 70-80%. Pipes aren't good for performance.
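(Back-of-the-envelope check of that figure, using the four successful "real" times above; decimal gigabits assumed:)

```shell
# Net bandwidth of the nc transfer: 10000 MiB moved in ~11.2 s of wall time
awk 'BEGIN {
    size = 10000 * 1024 * 1024                      # test file, bytes
    t = (11.160 + 11.422 + 11.260 + 11.129) / 4     # mean of the four runs
    printf "%.2f Gbit/s\n", size * 8 / t / 1e9      # payload bits per second
}'
```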


Quote:
Also you'd need some strength to be able to bend the 12x cable to a tight turn.
If they are as stiff as the x4 ones, and twice as thick on top of that, you could use them as snooker cues. Lesson learnt: get optical fiber whenever possible.
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Mon Jun 04, 2018 4:42 am    Post subject: Reply with quote

Have you tried qperf? Also how about enabling RDMA with NFS?
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Sun Jun 24, 2018 1:17 am    Post subject: Reply with quote

I've finally rebuilt the image from scratch (and also switched to ~arch). The most basic InfiniBand stuff does work: opensm configured the link for me, and I launched iperf and got transfer speeds (one direction) in the range of 11.5-11.9 Gbps out of the theoretically possible
Code:
16/10*8
12.80

Not bad.

Obviously, it was going way too well, so something had to burst my bubble. Let's check this RDMA thing out.
* Installed qperf. Launched the rc_bi_bw test. It failed to start because of "no userspace driver".
* A quick question to my dear internets, installed the missing libs. qperf failed because some path in /sys is missing.
Shit happens, I'll deal with it later.
* Let's start opensm as a service this time. opensm failed to start, complaining about openib not being started.
* Another question to my dear internets, emerge ofed, rc-service start... It failed to start, complaining that it can't load modules.
Of course you can't load modules; my image is kernel-agnostic and my kernel is self-contained. There is nothing to load.
At least I found the variable that makes ofed pull in the missing libs.
* Inspected the init script, grepped /etc for variables, commented out everything ending with =yes. Crash. I had missed some modprobes outside of the "if" statements.
* Commented them out, tried again. Service started. qperf failed. The path in /sys is still missing.
* Inspected the openib init script again. Spotted a comment about udev loading modules, entered udevadm trigger -c add. qperf failed to connect.
Right... I have to update the boot image with all those recent changes and restart the other machine. And probably rewrite that init script. Whoever modified it last asked in a comment for a "second opinion". Here we go, buddy: it sucks.
* Wondered if the driver works in the first place. Running ibv_devices.... libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_ver0
Apparently not. I'm out for now, will try again after I've had some sleep.
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Sun Jun 24, 2018 4:09 pm    Post subject: Reply with quote

You said your cards can operate in Ethernet mode too, right?
Maybe you forgot to change the mode to InfiniBand?
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Sun Jun 24, 2018 9:32 pm    Post subject: Reply with quote

Both ports start in InfiniBand mode. I have to explicitly switch them to Ethernet if I want to use them that way.
It turned out I had installed the userspace driver for _the_other_ Mellanox HCA. The openib init script sucks nonetheless; I'm not taking that bit back. I spent a good part of today rebuilding GCC (6 and 7) 2 or 3 times and sorting out the python upgrade: emerge refused to perform the clean action due to broken dependencies, at the same time didn't bother upgrading anything "because there's nothing to do" :roll:, and in the end probably forgot to actually unmerge one of them.
Now I have to figure out why shutdown doesn't work. This is really puzzling. And kinda irritating, since "reboot" blocks the terminal until a timeout.


Anyway, back to InfiniBand. I've run a few tests; they are short, so I ran each several times to average the results (and also see how consistent they are). Here we go:

Unidirectional RDMA. Flat like a table top.
Code:
vhost-88 ~ # qperf 10.1.0.1 ud_bw
ud_bw:
    send_bw  =  1.47 GB/sec
    recv_bw  =     0 bytes/sec
vhost-88 ~ # qperf 10.1.0.1 ud_bw
ud_bw:
    send_bw  =  1.47 GB/sec
    recv_bw  =     0 bytes/sec
vhost-88 ~ # qperf 10.1.0.1 ud_bw
ud_bw:
    send_bw  =  1.47 GB/sec
    recv_bw  =     0 bytes/sec
vhost-88 ~ # qperf 10.1.0.1 ud_bw
ud_bw:
    send_bw  =  1.47 GB/sec
    recv_bw  =     0 bytes/sec
vhost-88 ~ # qperf 10.1.0.1 ud_bw
ud_bw:
    send_bw  =  1.47 GB/sec
    recv_bw  =     0 bytes/sec
vhost-88 ~ # qperf 10.1.0.1 ud_bw
ud_bw:
    send_bw  =  1.47 GB/sec
    recv_bw  =     0 bytes/sec
vhost-88 ~ #


TCP bandwidth. A bit more spiky
Code:

vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.22 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.05 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.05 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.05 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.05 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.22 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.05 GB/sec
vhost-88 ~ #

UDP bandwidth (InfiniBand is supposed to be reliable, which makes UDP better):
Code:

vhost-88 ~ # qperf 10.1.0.1 udp_bw
udp_bw:
    send_bw  =  1.46 GB/sec
    recv_bw  =  1.46 GB/sec
vhost-88 ~ # qperf 10.1.0.1 udp_bw
udp_bw:
    send_bw  =  1.46 GB/sec
    recv_bw  =  1.46 GB/sec
vhost-88 ~ # qperf 10.1.0.1 udp_bw
udp_bw:
    send_bw  =  1.48 GB/sec
    recv_bw  =  1.47 GB/sec
vhost-88 ~ # qperf 10.1.0.1 udp_bw
udp_bw:
    send_bw  =  1.46 GB/sec
    recv_bw  =  1.46 GB/sec
vhost-88 ~ # qperf 10.1.0.1 udp_bw
udp_bw:
    send_bw  =  1.46 GB/sec
    recv_bw  =  1.46 GB/sec
vhost-88 ~ # qperf 10.1.0.1 udp_bw
udp_bw:
    send_bw  =  1.46 GB/sec
    recv_bw  =  1.46 GB/sec
vhost-88 ~ #


Bidirectional RDMA. Not sure what to think about this. How does it work for you?
Code:

vhost-88 ~ # qperf 10.1.0.1 rc_bi_bw
rc_bi_bw:
    bw  =  0 bytes/sec


Anyway, the results for RDMA and UDP are in line with previous iperf tests on 3 streams (total bandwidth maxed out); the results for TCP are roughly in line with a single stream in iperf.

Bonus: latency
Code:

vhost-88 ~ # qperf 10.1.0.1 udp_lat
udp_lat:
    latency  =  14.2 us
vhost-88 ~ # qperf 10.1.0.1 udp_lat
udp_lat:
    latency  =  14.4 us
vhost-88 ~ # qperf 10.1.0.1 udp_lat
udp_lat:
    latency  =  14.4 us
vhost-88 ~ # qperf 10.1.0.1 tcp_lat
tcp_lat:
    latency  =  16.1 us
vhost-88 ~ # qperf 10.1.0.1 tcp_lat
tcp_lat:
    latency  =  18 us
vhost-88 ~ # qperf 10.1.0.1 tcp_lat
tcp_lat:
    latency  =  17.7 us
vhost-88 ~ # qperf 10.1.0.1 tcp_lat
tcp_lat:
    latency  =  18.9 us
vhost-88 ~ # qperf 10.1.0.1 tcp_lat
tcp_lat:
    latency  =  19.2 us
vhost-88 ~ # qperf 10.1.0.1 tcp_lat
tcp_lat:
    latency  =  17.5 us
vhost-88 ~ # qperf 10.1.0.1 tcp_lat
tcp_lat:
    latency  =  17.3 us

Good, old-fashioned ping in infiniband mode:
Code:
vhost-88 ~ # ping 10.1.0.1
PING 10.1.0.1 (10.1.0.1) 56(84) bytes of data.
64 bytes from 10.1.0.1: icmp_seq=1 ttl=64 time=0.064 ms
64 bytes from 10.1.0.1: icmp_seq=2 ttl=64 time=0.061 ms
64 bytes from 10.1.0.1: icmp_seq=3 ttl=64 time=0.056 ms
64 bytes from 10.1.0.1: icmp_seq=4 ttl=64 time=0.053 ms
64 bytes from 10.1.0.1: icmp_seq=5 ttl=64 time=0.052 ms
64 bytes from 10.1.0.1: icmp_seq=6 ttl=64 time=0.053 ms
64 bytes from 10.1.0.1: icmp_seq=7 ttl=64 time=0.053 ms
64 bytes from 10.1.0.1: icmp_seq=8 ttl=64 time=0.052 ms
^C
--- 10.1.0.1 ping statistics ---
8 packets transmitted, 8 received, 0% packet loss, time 7268ms
rtt min/avg/max/mdev = 0.052/0.055/0.064/0.008 ms

The very same ping in ethernet mode:
Code:

vhost-88 ~ # ping 10.1.0.1
PING 10.1.0.1 (10.1.0.1) 56(84) bytes of data.
64 bytes from 10.1.0.1: icmp_seq=1 ttl=64 time=0.204 ms
64 bytes from 10.1.0.1: icmp_seq=2 ttl=64 time=0.105 ms
64 bytes from 10.1.0.1: icmp_seq=3 ttl=64 time=0.070 ms
64 bytes from 10.1.0.1: icmp_seq=4 ttl=64 time=0.070 ms
64 bytes from 10.1.0.1: icmp_seq=5 ttl=64 time=0.067 ms
64 bytes from 10.1.0.1: icmp_seq=6 ttl=64 time=0.068 ms
64 bytes from 10.1.0.1: icmp_seq=7 ttl=64 time=0.068 ms
64 bytes from 10.1.0.1: icmp_seq=8 ttl=64 time=0.067 ms
64 bytes from 10.1.0.1: icmp_seq=9 ttl=64 time=0.066 ms
64 bytes from 10.1.0.1: icmp_seq=10 ttl=64 time=0.066 ms
64 bytes from 10.1.0.1: icmp_seq=11 ttl=64 time=0.066 ms
64 bytes from 10.1.0.1: icmp_seq=12 ttl=64 time=0.067 ms
^C
--- 10.1.0.1 ping statistics ---
12 packets transmitted, 12 received, 0% packet loss, time 11418ms
rtt min/avg/max/mdev = 0.066/0.082/0.204/0.038 ms


Actually, since I had switched to Ethernet mode, I thought I could just as well run the qperf tests again for reference.
Code:

# rdma failed. No surprise here
vhost-88 ~ # qperf 10.1.0.1 rc_bi_bw
rc_bi_bw:
failed to modify QP to RTR: Network is unreachable
# TCP bandwidth
vhost-88 ~ # qperf 10.1.0.1 tcp_bw 
tcp_bw:
    bw  =  1.17 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.17 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.17 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.09 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.17 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.17 GB/sec
vhost-88 ~ # qperf 10.1.0.1 tcp_bw
tcp_bw:
    bw  =  1.17 GB/sec

# UDP bandwidth. This one is actually weird.
vhost-88 ~ # qperf 10.1.0.1 udp_bw
udp_bw:
    send_bw  =  884 MB/sec
    recv_bw  =  884 MB/sec
vhost-88 ~ # qperf 10.1.0.1 udp_bw
udp_bw:
    send_bw  =  887 MB/sec
    recv_bw  =  763 MB/sec
vhost-88 ~ # qperf 10.1.0.1 udp_bw
udp_bw:
    send_bw  =  879 MB/sec
    recv_bw  =  800 MB/sec
vhost-88 ~ # qperf 10.1.0.1 udp_bw
udp_bw:
    send_bw  =  889 MB/sec
    recv_bw  =  889 MB/sec
vhost-88 ~ #



Edit: I know why shutdown doesn't work. I don't know the technical reasons behind it, but it's closely related to silly me pointing the kernel at a console device that does not exist in the initramfs.
Gotta rebuild that part so I can actually get some boot output presented to me over serial.
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Tue Jun 26, 2018 8:50 pm    Post subject: Reply with quote

I tried NFS with RDMA vs NFS in TCP mode (RAM -> RAM and RAM -> /dev/null).

Code:
vhost-88 ~ # ls -l /mnt/nfs
total 9893204
-rw-r--r-- 1 root root 10130640896 06-26 20:29 testfile
vhost-88 ~ # ls -lh /opt/
total 9,5G
-rw-r--r-- 1 root root 9,5G 06-26 20:29 testfile
vhost-88 ~ # time for c in {1..20}; do { for i in {0..38}; do S=$(($i*4)); dd if=/mnt/nfs-10.1.0.1/testfile  of=/dev/null iflag=fullblock conv=notrunc bs=64M seek=$S skip=$S count=4 2>/dev/null &  done; wait && echo done ;}; done

Both machines are NUMA with 2*4 cores, single threaded
Code:
Result for TCP @server load ~35
real    6m16,152s
user    0m0,550s
sys     5m7,016s

505 MB/s
Code:

Result for RDMA @server load ~0.35
real    2m11,697s
user    0m0,541s
sys     8m57,432s

1.44 GB/s :!:
Now, that's quite impressive.

Client load was somewhere around 35 in both cases.


edit: Yes, I did verify that this bunch of dds running in parallel actually copies the whole file. And compared checksums before and after transfer. And yes, they did match.
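(Sanity check of those figures from the file size and the "real" times; reading the units as binary MiB/GiB is my assumption, which is why the numbers land a few percent off the ones quoted above:)

```shell
# Throughput check: testfile size from ls -l, wall times from the two runs
awk 'BEGIN {
    total = 10130640896 * 20                 # bytes moved in 20 rounds
    tcp   = 6*60 + 16.152                    # seconds, TCP mount
    rdma  = 2*60 + 11.697                    # seconds, RDMA mount
    printf "TCP:  %.0f MiB/s\n", total / tcp  / 2^20
    printf "RDMA: %.2f GiB/s\n", total / rdma / 2^30
}'
```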
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Mon Sep 03, 2018 5:41 pm    Post subject: Reply with quote

Used FDR-class InfiniBand hardware is getting pretty cheap. With two ports, one could bond the two IP-over-InfiniBand interfaces together... I don't know if that's possible with ipoib interfaces, BUT if it is, then this might be the cheapest way to a ~100Gbps network between two PCs. :P Not that I could utilize such speed anywhere...
Also the optical cables might cost more than the card itself.
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1356
Location: KUUSANKOSKI, Finland

PostPosted: Sun Sep 30, 2018 4:23 pm    Post subject: Reply with quote

And pretty soon after my last post, LTT made a video about 100Gbps InfiniBand. Older InfiniBand hardware should now be even cheaper; a 20-40Gbps network between two PCs should be quite affordable, I think.
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...