Gentoo Forums

InfiniBand - a cheap way to _fast_ network, PC to PC?

Gentoo Forums Forum Index :: Gentoo Chat
Author Message
R0b0t1
Apprentice


Joined: 05 Jun 2008
Posts: 255

PostPosted: Wed Apr 19, 2017 5:45 pm    Post subject: Reply with quote

1clue wrote:
Zucca wrote:
As I'm not an expert (but learning every day) I'll just leave this here...


This page reminds me yet again how much overhead there is on the term "reliable delivery."


To be fair, the explanation given seems to imply that the major difference is that with IB there is no such thing as multipath, and link redundancy is how you obtain reliability. There is still the possibility of mesh networks, but routes are preselected and can't change easily. The argument is that this allows a less memory-intensive implementation of the routing hardware, and the prices seem to bear that out.

I'm still very interested in its failure modes, especially in the case of a noisy channel - but that seems to have been all but eliminated as communication speeds have increased.
Back to top
View user's profile Send private message
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Wed Apr 19, 2017 6:05 pm    Post subject: Reply with quote

Zucca, you seem to be limited by CPU speed. yes is not the most efficient spam generator: $RANDOM is only expanded once, and it will usually give you 4-5 digits per line versus the single character yes serves by default.
Try copying pre-processed data from RAM instead, or make the argument to yes much bigger.

Also, pipes are slow too. Not an issue in real life, but you may need something bigger if you want to push IB to its limits. Creating a large file on ramdisk and stuffing it into cat directly with < could do better. You can also background it and start another instance. And another. How many CPUs do you have there? :)
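As a sketch of that idea (10.0.10.1 and port 5001 are placeholders; something like `nc -l 5001 > /dev/null` would need to be listening on the other end):

```shell
# Pull zeroes from the kernel in 1 MiB blocks instead of piping tiny writes.
# Background one instance per CPU core to push the link harder.
dd if=/dev/zero bs=1M count=10240 | nc 10.0.10.1 5001 &
dd if=/dev/zero bs=1M count=10240 | nc 10.0.10.1 5001 &
wait
```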
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Wed Apr 19, 2017 8:01 pm    Post subject: Reply with quote

szatox wrote:
Zucca, you seem to be limited by CPU speed.
I'll check that.
szatox wrote:
yes is not the most efficient spam generator: $RANDOM is only expanded once, and it will usually give you 4-5 digits per line versus the single character yes serves by default.
I figured that out by just echoing $RANDOM a few times. It still doesn't make sense why a series of "y"s moves more slowly than a series of 4-5 digit numbers.
szatox wrote:
Try copying pre-processed data from RAM instead, or make the argument to yes much bigger.
I thought about passing a much bigger string to yes, but then I just ran qperf instead.
szatox wrote:
Also, pipes are slow too. Not an issue in real life, but you may need something bigger if you want to push IB to its limits. Creating a large file on ramdisk and stuffing it into cat directly with < could do better. You can also background it and start another instance. And another. How many CPUs do you have there? :)
The server has an Opteron 3380, a low-power 8-core, not particularly fast.
The desktop has an FX-8350 8-core, not so fast nowadays.
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10130
Location: Somewhere over Atlanta, Georgia

PostPosted: Wed Apr 19, 2017 8:12 pm    Post subject: Reply with quote

Allegedly, a larger MTU has a positive effect on throughput, and InfiniBand allows an MTU of up to 65520. The MTU needs to be kept the same for all interfaces on a particular subnet, though.
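For reference, the IPoIB interface only accepts the large MTU in connected mode (a sketch; `ib0` is assumed to be the interface name):

```shell
echo connected > /sys/class/net/ib0/mode   # datagram mode caps the MTU at 2044
ip link set dev ib0 mtu 65520              # maximum IPoIB MTU in connected mode
ip link show dev ib0                       # verify the new MTU took effect
```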

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Wed Apr 19, 2017 9:44 pm    Post subject: Reply with quote

John R. Graham wrote:
Allegedly, a larger MTU has a positive effect on throughput, and InfiniBand allows an MTU of up to 65520. The MTU needs to be kept the same for all interfaces on a particular subnet, though.
Well... I have only two HCAs... So no problem. :)
And oh boy! Indeed it helped. :o I just wasn't expecting this much...
qperf 10.0.10.1 tcp_bw tcp_lat udp_bw udp_lat:
tcp_bw:
    bw  =  692 MB/sec
tcp_lat:
    latency  =  48.3 us
udp_bw:
    send_bw  =  731 MB/sec
    recv_bw  =  666 MB/sec
udp_lat:
    latency  =  42.2 us


@szatox: Indeed the CPUs on both PCs are a bottleneck, at least at lower MTUs. At the maximum MTU, netcat takes 100% of one core on the slower server CPU.
R0b0t1
Apprentice


Joined: 05 Jun 2008
Posts: 255

PostPosted: Thu Apr 20, 2017 1:41 am    Post subject: Reply with quote

Zucca wrote:
szatox wrote:
yes is not the most efficient spam generator: $RANDOM is only expanded once, and it will usually give you 4-5 digits per line versus the single character yes serves by default.
I figured that out by just echoing $RANDOM a few times. It still doesn't make sense why a series of "y"s moves more slowly than a series of 4-5 digit numbers.


Your writes probably aren't buffered at all, so there's syscall overhead between each byte in the case of `yes`. You might need to write a simple C program which calls write() with some ridiculously large buffer. There was an additional API that I can't recall at the moment which seems to expose the DMA functionality of the memory controller as well.
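The syscall-overhead effect is easy to demonstrate without writing C: moving the same amount of data as single-byte writes versus one large write shows the difference (timings are machine-dependent):

```shell
# 1 MB moved as a million 1-byte read()/write() pairs: dominated by syscall cost
time dd if=/dev/zero of=/dev/null bs=1 count=1000000
# The same 1 MB as a single 1 MiB transfer: effectively instant
time dd if=/dev/zero of=/dev/null bs=1M count=1
```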
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Thu Apr 20, 2017 9:06 am    Post subject: Reply with quote

Just a thing I noticed while tinkering...
I had just rebooted my desktop. I manually loaded the ib_ipoib module to gain network access (I had unplugged my ethernet cable). Then I changed the mode from datagram to connected
as root:
echo connected > /sys/class/net/ib0/mode
Next I was about to change the MTU, but to my surprise it was already at 65520.
Is something smart happening that makes (IP) network interfaces sync their MTUs?

I noticed this because I'm currently planning a service that will safely stop all InfiniBand stuff before poweroff/reboot/sleep and also bring it up at boot or wakeup.

I'd also need something simpler (in terms of overhead, at least) than sshfs between these two computers...
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Tue Apr 25, 2017 9:40 pm    Post subject: Reply with quote

Ok.
I've got the systemd side of loading and unloading the InfiniBand kernel modules done. I hope it solves the problems I'm now having with hibernating the system.

Next up is setting up udev rules for the IP-over-InfiniBand network interface:
/etc/udev/rules.d/2-InfiniBand.rules:
KERNEL=="ib[0-9]*", SUBSYSTEM=="net", ATTR{mode}="connected"
KERNEL=="ib[0-9]*", SUBSYSTEM=="net", ATTR{mtu}="65520"

However, I can't get udev to apply these settings. I should have all the right settings. I also tried the above on just one line, but the result was the same.
shellcmd: udevadm info -a -p /sys/class/net/ib0 | grep -E "{(mode|mtu)}" :
    ATTR{mode}=="datagram"
    ATTR{mtu}=="2044"
shellcmd: ifconfig ib0 :
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 10.0.10.2  netmask 255.255.255.240  broadcast 10.0.10.15
        infiniband 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 1462  bytes 828125 (808.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1523  bytes 227528 (222.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

... and the commands above report the same.

So if anyone knows what's wrong with my udev or the rules...

EDIT:
shellcmd: udevadm test -a add /sys/class/net/ib0 2>&1 | grep -E "(mode|mtu)" :
ATTR '/sys/devices/pci0000:00/0000:00:04.0/0000:02:00.0/net/ib0/mode' writing 'connected' /etc/udev/rules.d/2-InfiniBand.rules:1
ATTR '/sys/devices/pci0000:00/0000:00:04.0/0000:02:00.0/net/ib0/mtu' writing '65520' /etc/udev/rules.d/2-InfiniBand.rules:2
... At least it tries... I wonder if this is actually a bug.
I could go the systemd way on this, but I want this to work on my server too.
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Wed Apr 26, 2017 8:53 pm    Post subject: Reply with quote

I'm building a totally monolithic kernel (no modules at all), and I think I'll go with hand-crafted init scripts if I have to change any config options. Talking to the driver via sysfs works well enough.
The funny thing is that the adapters I got are in fact IB/Eth hybrids. They start as InfiniBand, but can operate in 10G ethernet mode too.
Going further, each port is toggled independently (so one dual-port HCA can serve both ethernet and InfiniBand connectivity). The toggling order does affect the names of the ports, though (toggling ib to eth changes ibX to the first unoccupied ethX, and back again to the first unoccupied ibX).

MTU also changes to 65520 as soon as I demand connected mode for IB.

BTW, if you don't like sshfs, NFS does a pretty good job for me. Even more so since I'm also using it for PXE boot.
And a pro tip on NFS: stage3 comes without nfs-utils, but does have busybox. You can invoke '/bin/busybox mount' instead of '/bin/mount', which would require a non-existent mount.nfs helper :)
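Spelled out, the trick looks something like this (the server address and export path are placeholders):

```shell
# busybox implements NFS mounting internally, so no mount.nfs helper is needed
/bin/busybox mount -t nfs -o nolock 10.0.10.1:/export /mnt/gentoo
```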

Now I'm gonna have a break. Gotta gather the missing hardware parts before I can do anything serious :lol:
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Wed Apr 26, 2017 10:10 pm    Post subject: Reply with quote

szatox wrote:
Funny thing is the adapters I got are in fact IB/Eth hybrids. They start as infiniband, but can operate in ethernet 10G mode too.
Going further, each port is toggled independently (so one dual-port HCA can serve both, ethernet and infiniband connectivity). Toggling order does affect names of the ports though (togling ib to eth changes ibX to the first unoccupied ethX and then to the first unoccupied ibX)
Wow. Interesting. How do you change the mode? Via sysfs?

szatox wrote:
MTU also changes to 65520 as soon as I demand connected mode for IB.
My cards seem to remember the MTU value once I've set it (in connected mode, that is). So my cards behave the same here.

szatox wrote:
BTW, if you don't like sshfs, NFS does pretty good job for me.
I've already set up NFS. I think I had NFS in use 7-10 years ago.
But guess what I found out?


As for the udev rules... I hacked them to work:
*snip*:
ACTION=="add", KERNEL=="ib[0-9]*", SUBSYSTEM=="net", RUN+="/bin/sh -c 'echo connected > /sys/class/net/%k/mode && echo 65520 > /sys/class/net/%k/mtu'"
I have no idea why the "normal" way didn't work...
I tried to debug it, but passing --debug to systemd-udevd did nothing. Which, at this point, does not surprise me. I already had systemd lock up on me while doing this. I couldn't change targets. Finally, giving --force to reboot resulted in a corrupted filesystem, which btrfs successfully repaired. *sigh*
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Thu Apr 27, 2017 6:41 pm    Post subject: Reply with quote

Code:
echo ib > /sys/devices/pci0000:00/0000:00:0a.0/0000:06:00.0/mlx4_port1
echo eth > /sys/devices/pci0000:00/0000:00:0a.0/0000:06:00.0/mlx4_port1
echo ib > /sys/devices/pci0000:00/0000:00:0a.0/0000:06:00.0/mlx4_port1
echo connected > /sys/devices/pci0000:00/0000:00:0a.0/0000:06:00.0/net/ib0/mode
echo datagram  > /sys/devices/pci0000:00/0000:00:0a.0/0000:06:00.0/net/ib0/mode


I spotted a prettier path, but since I was using find to discover the correct files, "directories" were preferred over symbolic links.
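The discovery step with find might look like this (the mlx4_port names come from szatox's Mellanox hardware; other drivers expose different names):

```shell
# Port protocol switches (ib/eth) exposed by the mlx4 driver
find /sys/devices -name 'mlx4_port[0-9]' 2>/dev/null
# IPoIB mode files (datagram/connected) for all ib interfaces
ls /sys/class/net/ib*/mode 2>/dev/null
```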
Good stuff with that NFS. Gotta try it out some day :D
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Tue May 02, 2017 7:29 pm    Post subject: Reply with quote

I ran some "stress tests" over TCP/IP to see if either of my IB cards might slow down under load (due to heating):
shellcmd: iperf3 -c 10.0.10.1 -n 256G -i 60 :
Connecting to host 10.0.10.1, port 5201
[  4] local 10.0.10.2 port 37670 connected to 10.0.10.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-60.00  sec  40.2 GBytes  5.76 Gbits/sec    0   3.25 MBytes
[  4]  60.00-120.00 sec  40.2 GBytes  5.75 Gbits/sec    0   3.25 MBytes
[  4] 120.00-180.00 sec  40.2 GBytes  5.75 Gbits/sec    0   3.25 MBytes
[  4] 180.00-240.00 sec  40.2 GBytes  5.75 Gbits/sec    0   3.25 MBytes
[  4] 240.00-300.00 sec  40.2 GBytes  5.76 Gbits/sec    0   3.25 MBytes
[  4] 300.00-360.00 sec  40.2 GBytes  5.76 Gbits/sec    0   3.25 MBytes
[  4] 360.00-382.07 sec  14.8 GBytes  5.75 Gbits/sec    0   3.25 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-382.07 sec   256 GBytes  5.76 Gbits/sec    0             sender
[  4]   0.00-382.07 sec   256 GBytes  5.76 Gbits/sec                  receiver

iperf Done.
Apparently they work OK. Although the one in the 4x slot will "automatically" throttle. ;)
I haven't been able to run RDMA tests using qperf. I get a "failed to find any InfiniBand devices: Function not implemented" error when I try.

I'll try to set up NFS-over-RDMA next and run some performance tests on it.
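For anyone following along, the RDMA transport for NFS is requested roughly like this (20049 is the conventional NFS-over-RDMA port; the address, export, and mount point are placeholders, and this is a sketch rather than a tested recipe):

```shell
# Server: tell the kernel nfsd to listen for RDMA connections
echo rdma 20049 > /proc/fs/nfsd/portlist
# Client: mount with the RDMA transport instead of TCP
mount -t nfs -o rdma,port=20049 10.0.10.1:/export /mnt/nfs
```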

szatox, have you already set up your IB network? What's the performance there?
szatox
Veteran


Joined: 27 Aug 2013
Posts: 1707

PostPosted: Tue May 02, 2017 9:43 pm    Post subject: Reply with quote

Nope. I was a bit busy in real life recently and didn't give much attention to this matter. Anyway, I'm back on track with my PhD in economic law, so hopefully I'll get the missing pieces soon™
No, seriously, the procedures for shopping abroad are all shit and myths, but since this whole thing is kind of a research project I'd rather learn the options and gain new possibilities. I'll let you know when I finally have everything in place.

I'm doing some tricks with software in the meantime: trying to design the final environment, tune the kernel, set up some iSCSI targets (there are options for iSCSI with RDMA in the kernel too), make it autoconfigure at runtime, and get some software that could run on top of it (a bunch of VMs? Blender with OpenMP?)
I know, excuses :lol:
Anyway, quite a few bits can be done with ethernet, which it seems I'll have to keep as a management network regardless of InfiniBand, so I can just as well make some use of it for now.
Bragging time: a netboot-oriented initramfs with a dynamic init script was a really good idea. Not something I'd recommend to everybody, but it makes testing new versions a bit faster and much more convenient.
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Wed May 03, 2017 6:31 pm    Post subject: Reply with quote

Well, I managed to solve the problem with the RDMA tests in qperf: the rdma-related modules weren't loaded.
The error message pointed me in a direction where I started looking for different compile-time options rather than a missing kernel module. *sigh*
Finally I just went with trial and error until I found the solution.
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Sat May 06, 2017 1:06 pm    Post subject: Reply with quote

Dang... I should not have rushed my acquisition of the HCAs. :P Just look at this. Instead of DDR it's QDR. Although... I don't know if the bandwidth is split 50%/50% between those two ports, which would make it the same speed as my DDR cards when connecting only two nodes together.

Also here's the most complete qperf tests so far:
shellcmd: qperf 10.0.10.1 ud_lat ud_bw rc_rdma_read_bw rc_rdma_write_bw uc_rdma_write_bw tcp_bw tcp_lat udp_bw udp_lat :
ud_lat:
    latency  =  27.6 us
ud_bw:
    send_bw  =  676 MB/sec
    recv_bw  =  660 MB/sec
rc_rdma_read_bw:
    bw  =  855 MB/sec
rc_rdma_write_bw:
    bw  =  738 MB/sec
uc_rdma_write_bw:
    send_bw  =  768 MB/sec
    recv_bw  =  735 MB/sec
tcp_bw:
    bw  =  656 MB/sec
tcp_lat:
    latency  =  51.2 us
udp_bw:
    send_bw  =  719 MB/sec
    recv_bw  =  626 MB/sec
udp_lat:
    latency  =  49.2 us
Yay! I like this. :)
I could finally try to set up NFS over RDMA. If I succeed in setting that up, I'll put a small guide into the wiki.

EDIT: Post count reached 666. \,,/
(A VERY important thing to mention. Right?)
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...


Last edited by Zucca on Sat May 06, 2017 1:31 pm; edited 1 time in total
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10130
Location: Somewhere over Atlanta, Georgia

PostPosted: Sat May 06, 2017 1:11 pm    Post subject: Reply with quote

I'm planning to pick up some cards that are (at least) QDR just to see how much of what you're experiencing is saturation on the motherboard / CPU side. I have two machines with free 8 lane slots and one older dual Xeon monster with PCI-X slots. Interesting results, though. :)

- John
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Sat May 06, 2017 2:42 pm    Post subject: Reply with quote

John R. Graham wrote:
I'm planning to pick up some cards that are (at least) QDR
Hm... Wikipedia states that from FDR(10) onwards the encoding is 64b/66b (~3.03% overhead) rather than 8b/10b (20% overhead). I wonder if it would yield even higher real-world bandwidth... Although there is still PCIe in the path, which has 8b/10b encoding... right?
John R. Graham
Administrator


Joined: 08 Mar 2005
Posts: 10130
Location: Somewhere over Atlanta, Georgia

PostPosted: Sun May 07, 2017 12:03 am    Post subject: Reply with quote

Right, but presumably any given card will have enough PCIe lanes to keep itself properly fed.

- John
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Sun May 07, 2017 6:25 pm    Post subject: Reply with quote

Looks like 12x cables pop up for sale from time to time, and not at a bad price. Although I think it's still more affordable to just buy QDR or FDR IB cards and a 4x cable.

Does anyone know any programs which can utilize RDMA? NFS excluded. :)
I wonder if OpenCL could utilize it... It would kind of make sense. In a bigger network it would be convenient to be able to utilize all the GPUs and CPUs of each node when, say, rendering an animation in Blender.

EDIT01: The keyword seems to be GPUDirect. Unfortunately, that may require professional-grade GPUs.

EDIT02: And it's CUDA only. No dice for me, since I'm apparently too much of an AMD fanboi.

EDIT03: This is something to watch for AMD GPU users.
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Thu May 18, 2017 8:53 pm    Post subject: NFS over RDMA Reply with quote

I've now set up NFS over RDMA here between the two nodes.
A preliminary test showed a transfer speed of 808 MiB/s. I transferred a file that had recently been opened on the server side. On the client side I dropped all disk read caches before initiating the transfer.
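The measurement method, roughly (assuming the share is mounted at the placeholder path /mnt/nfs and running as root):

```shell
sync
echo 3 > /proc/sys/vm/drop_caches           # drop the client's page/dentry/inode caches
dd if=/mnt/nfs/testfile of=/dev/null bs=1M  # dd reports MB/s when it finishes
```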

I think my goal of building a faster network between these two computers has now been reached. :)

If this setup continues to work without any hiccups, I'll flesh out the InfiniBand wiki article and push an NFS-over-RDMA section into the NFS wiki article too.
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Fri May 19, 2017 2:56 pm    Post subject: Reply with quote

Aaaaand karma strikes!
The power switch on my server's PSU fried. About one in ten tries makes sparky noises, and the LED inside the button does not light.

So... don't hold your breath for any updates anytime soon.

I don't think I have a proper (better) switch stashed anywhere... :(
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 41417
Location: 56N 3W

PostPosted: Fri May 19, 2017 3:45 pm    Post subject: Reply with quote

Zucca,

Solder some wires across the switch so it's always on.
You do pull the plug out too, when you need it off, don't you?
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
1clue
Advocate


Joined: 05 Feb 2006
Posts: 2395

PostPosted: Fri May 19, 2017 3:46 pm    Post subject: Reply with quote

What a drag.

I've been contemplating your project for quite a while now, and I was looking forward to a success story.
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Sat May 20, 2017 1:05 pm    Post subject: Reply with quote

NeddySeagoon wrote:
Zucca,

Solder some wires across the switch so its always on.
You do pull the plug out too, when you need it off, don't you?
I'll do that if I don't get my hands on a new (more robust) switch soon.

1clue wrote:
What a drag.

I've been contemplating your project for quite awhile now, I was looking forward to a success story.
I only hope I get things running quickly after I've fixed the PSU. This, indeed, was quite a setback. :(
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1358
Location: KUUSANKOSKI, Finland

PostPosted: Tue May 23, 2017 9:56 pm    Post subject: Reply with quote

Comrades!

I've done it! I didn't find a proper replacement switch, so I "just" soldered the wires together. I actually thought of using some crimp wire connectors (I have adjustable pliers specifically for that job), but went with soldering because it should give a larger contact area, making the connection more reliable.

Tomorrow I'll install the resurrected PSU. Fingers crossed.

Oh... the btrfs scrub will take a little over 9 hours. Luckily it's an online process, meaning I can still use the filesystem.
Page 3 of 4
All times are GMT