The network bandwidth in the alltoall_perf test failed to meet expectations · Issue #209 · NVIDIA/nccl-tests
RoCE bond network bandwidth can reach 180+ GB/s per NIC (mlx5_bond_x) when using the ib_write_bw tool.
With four devices (servers, two ranks each), the alltoall results were as expected, but with three devices the bus bandwidth dropped from about 23 GB/s to about 8 GB/s, less than half of what I expected.
Have you seen this behavior before?
What could cause it? I look forward to your reply.
The nccl-tests command and results are as follows:
mpirun --allow-run-as-root --host xxxx -x UCX_NET_DEVICES=mlx5_bond_0:1 -x UCX_IB_GID_INDEX=3 -x LD_LIBRARY_PATH=/root/nccl-bond/build/lib:$LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME==bond0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=136 -x NCCL_IB_HCA==mlx5_bond_0 -x NCCL_P2P_DISABLE=1 -x NCCL_SHM_DISABLE=1 /home/test/nccl-tests/build/alltoall_perf -b 2M -e 4096M -f 2 -g 2 -n 20
Test results of four devices:
# nThread 1 nGpus 1 minBytes 67108864 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 2682161 on server1 device 0 [0x23] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 2682162 on server1 device 2 [0x52] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 3139299 on server2 device 0 [0x23] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 3139300 on server2 device 2 [0x52] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 50064 on server3 device 0 [0x23] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 50065 on server3 device 2 [0x52] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 2672680 on server4 device 0 [0x23] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 2672681 on server4 device 2 [0x52] NVIDIA A100-SXM4-80GB
NCCL version 2.18.3+cuda12.2
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
67108864 2097152 float none -1 2665.2 25.18 22.03 0 2662.0 25.21 22.06 N/A
134217728 4194304 float none -1 5224.1 25.69 22.48 0 5264.2 25.50 22.31 N/A
268435456 8388608 float none -1 10289 26.09 22.83 0 10334 25.97 22.73 N/A
536870912 16777216 float none -1 20513 26.17 22.90 0 20585 26.08 22.82 N/A
1073741824 33554432 float none -1 40882 26.26 22.98 0 41022 26.17 22.90 N/A
2147483648 67108864 float none -1 81711 26.28 23.00 0 81959 26.20 22.93 N/A
4294967296 134217728 float none -1 163115 26.33 23.04 0 163963 26.19 22.92 N/A
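As a sanity check on how to read these columns, the sketch below recomputes `algbw` and `busbw` from the `size` and `time` columns of two of the rows above. It assumes that `algbw` is simply size divided by time and that alltoall_perf scales it by `(n - 1) / n` to get `busbw`, where `n` is the number of ranks; both assumptions reproduce the printed values here, but the helper name `alltoall_bw` is only illustrative and is not part of nccl-tests.

```python
# Recompute algbw/busbw for alltoall_perf rows from the size and time columns.
# Assumption (not taken from the log itself): busbw = algbw * (n - 1) / n.

def alltoall_bw(size_bytes: int, time_us: float, nranks: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for one result row."""
    # 1 byte/us = 1e6 bytes/s = 1e-3 GB/s
    algbw = size_bytes / time_us / 1e3
    busbw = algbw * (nranks - 1) / nranks
    return algbw, busbw

# (size in bytes, out-of-place time in us) copied from the four-device table
rows = [(67108864, 2665.2), (4294967296, 163115.0)]

for size, time_us in rows:
    algbw, busbw = alltoall_bw(size, time_us, nranks=8)
    print(f"{size:>11d} B  algbw={algbw:6.2f} GB/s  busbw={busbw:6.2f} GB/s")
```

With 8 ranks the factor is 7/8, so the gap between `algbw` and `busbw` in this table comes from the rank count rather than from the network.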
Test results of three devices:
# nThread 1 nGpus 1 minBytes 67108864 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 2617867 on server1 device 0 [0x23] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 2617868 on server1 device 2 [0x52] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 3103671 on server2 device 0 [0x23] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 3103672 on server2 device 0 [0x23] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 2637126 on server3 device 0 [0x23] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 2637127 on server3 device 2 [0x52] NVIDIA A100-SXM4-80GB
NCCL version 2.18.3+cuda12.2
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
67108848 2796202 float none -1 6499.2 10.33 8.60 0 6136.2 10.94 9.11 N/A
134217720 5592405 float none -1 14519 9.24 7.70 0 13511 9.93 8.28 N/A
268435440 11184810 float none -1 26193 10.25 8.54 0 23691 11.33 9.44 N/A
536870904 22369621 float none -1 58246 9.22 7.68 0 54668 9.82 8.18 N/A
1073741808 44739242 float none -1 105248 10.20 8.50 0 93663 11.46 9.55 N/A
2147483640 89478485 float none -1 233191 9.21 7.67 0 221382 9.70 8.08 N/A
4294967280 178956970 float none -1 420496 10.21 8.51 0 395454 10.86 9.05 N/A
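To put a number on the gap between the two runs, here is a small sketch that averages the out-of-place `busbw` column of each table and takes the ratio. The values are copied from the results above; the variable names are only illustrative, and a plain mean over all message sizes is an arbitrary choice for this comparison.

```python
# Out-of-place busbw (GB/s) copied from the two result tables above.
busbw_four_devices = [22.03, 22.48, 22.83, 22.90, 22.98, 23.00, 23.04]
busbw_three_devices = [8.60, 7.70, 8.54, 7.68, 8.50, 7.67, 8.51]

def mean(values: list[float]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

mean_four = mean(busbw_four_devices)
mean_three = mean(busbw_three_devices)
ratio = mean_three / mean_four

print(f"four devices:  {mean_four:.2f} GB/s")
print(f"three devices: {mean_three:.2f} GB/s")
print(f"three / four:  {ratio:.2f}")
```

On these numbers the three-device run averages a bit over a third of the four-device bus bandwidth, so the drop is somewhat larger than a factor of two.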