RDMA network setup
https://developer.aliyun.com/article/664961
Installing RDMA
https://developer.nvidia.com/blog/deploying-gpudirect-rdma-on-egx-stack-with-the-network-operator
- Understanding RDMA and GPUs: https://zhuanlan.zhihu.com/p/664712789
NCCL testing
https://github.com/coreweave/nccl-tests
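A common way to exercise the RDMA fabric end to end is the all_reduce_perf binary from the nccl-tests repo above. A minimal sketch, assuming the fork builds with plain make like upstream nccl-tests, that CUDA/NCCL are installed, and that the node has 8 GPUs (sizes and flags are illustrative):
# Build the benchmarks, then sweep message sizes from 8 B to 128 MB (doubling each step) across 8 GPUs
git clone https://github.com/coreweave/nccl-tests.git && cd nccl-tests
make
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8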
RDMA network installation
- Confirm that the RDMA network driver is installed on the physical machine by running the following command:
# Check whether the IB NICs are installed and working
ibstatus
Infiniband device 'mlx5_0' port 1 status:
    default gid:  fe80:0000:0000:0000:a088:c203:002c:a86c
    base lid:     0x188
    sm lid:       0x1
    state:        4: ACTIVE
    phys state:   5: LinkUp
    rate:         200 Gb/sec (4X HDR)
    link_layer:   InfiniBand

Infiniband device 'mlx5_1' port 1 status:
    default gid:  fe80:0000:0000:0000:a088:c203:002c:b1d8
    base lid:     0x187
    sm lid:       0x1
    state:        4: ACTIVE
    phys state:   5: LinkUp
    rate:         200 Gb/sec (4X HDR)
    link_layer:   InfiniBand

Infiniband device 'mlx5_2' port 1 status:
    default gid:  fe80:0000:0000:0000:eaeb:d3ff:fe8b:31d6
    base lid:     0x0
    sm lid:       0x0
    state:        4: ACTIVE
    phys state:   5: LinkUp
    rate:         25 Gb/sec (1X EDR)
    link_layer:   Ethernet

Infiniband device 'mlx5_3' port 1 status:
    default gid:  fe80:0000:0000:0000:eaeb:d3ff:fe8b:31d7
    base lid:     0x0
    sm lid:       0x0
    state:        1: DOWN
    phys state:   3: Disabled
    rate:         40 Gb/sec (4X QDR)
    link_layer:   Ethernet

Infiniband device 'mlx5_4' port 1 status:
    default gid:  fe80:0000:0000:0000:a088:c203:002c:b034
    base lid:     0x189
    sm lid:       0x1
    state:        4: ACTIVE
    phys state:   5: LinkUp
    rate:         200 Gb/sec (4X HDR)
    link_layer:   InfiniBand

Infiniband device 'mlx5_5' port 1 status:
    default gid:  fe80:0000:0000:0000:a088:c203:002c:b050
    base lid:     0x18a
    sm lid:       0x1
    state:        4: ACTIVE
    phys state:   5: LinkUp
    rate:         200 Gb/sec (4X HDR)
    link_layer:   InfiniBand
The output above is what a correctly installed driver looks like. There are six cards in total, and five of them are ACTIVE.
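If ibstatus is missing or lists no devices, the driver stack is probably not installed. On machines using MLNX_OFED, the installed version can be checked as a quick sanity test:
# Print the installed MLNX_OFED version (this command only exists when MLNX_OFED is installed)
ofed_info -s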
- Run the following command to see which logical network interface each physical NIC is bound to; these names will be used in the k8s RDMA plugin configuration below.
ibdev2netdev
mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)
mlx5_2 port 1 ==> eth0 (Up)
mlx5_3 port 1 ==> eth1 (Down)
mlx5_4 port 1 ==> ib2 (Up)
mlx5_5 port 1 ==> ib3 (Up)
As shown above, mlx5_2 ==> eth0 is only 25 Gb/s, so it does not need to be included in the RDMA plugin configuration; selecting ib0, ib1, ib2 and ib3 is sufficient.
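When there are many ports, a small helper like the sketch below (an illustrative snippet, not part of any official tooling) lists only the Up interfaces whose link layer is InfiniBand, i.e. the ones worth putting into the plugin's devices list:
# Keep only the Up interfaces whose link layer is InfiniBand
for dev in $(ibdev2netdev | awk '$NF == "(Up)" {print $1}'); do
  ibstatus "$dev" | grep -q 'link_layer:.*InfiniBand' && ibdev2netdev | grep "^$dev "
done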
- Install the k8s RDMA plugin (the plugin can be found at this location) and modify its configuration as follows:
{
    "periodicUpdateInterval": 300,
    "configList": [{
        "resourceName": "hca",
        "resourcePrefix": "rdma",
        "rdmaHcaMax": 1000,
        "devices": ["ib0", "ib1", "ib2", "ib3"]
    }]
}
Deploy it to the k8s cluster by following the plugin's deployment guide.
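After the plugin is deployed, each configured node should advertise the rdma/hca resource; a quick check (the node name is a placeholder) looks like this:
# Confirm the node advertises the rdma/hca resource exposed by the plugin
kubectl describe node <node-name> | grep rdma/hca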
- Verify that the deployment succeeded. Use the manifest below to create two pods on different nodes; remember to adjust the nodeSelector (and nodeName) for each.
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod2
  namespace: default
spec:
  containers:
    - name: mofed-test-ctr
      image: mellanox/rping-test
      command:
        - sh
        - '-c'
        - |
          ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
          sleep 1000000
      resources:
        limits:
          rdma/hca: '1'
        requests:
          rdma/hca: '1'
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: Always
      securityContext:
        capabilities:
          add:
            - IPC_LOCK
  restartPolicy: OnFailure
  terminationGracePeriodSeconds: 30
  dnsPolicy: ClusterFirst
  nodeSelector:
    k3s.io/hostname: tj5-ant-ai-studio01.kscn
  serviceAccountName: default
  nodeName: tj5-ant-ai-studio01.kscn
  securityContext: {}
  schedulerName: default-scheduler
  priority: 0
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority
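For reference, the pods can be created and entered like this (the file names and the first pod's name are placeholders; only mofed-test-pod2 appears in the manifest above):
# Create both test pods and check which nodes they landed on
kubectl apply -f mofed-test-pod1.yaml -f mofed-test-pod2.yaml
kubectl get pods -o wide
# Open a shell inside one pod to run the bandwidth test
kubectl exec -it mofed-test-pod2 -- sh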
Once both pods are up, exec into pod A and run ib_read_bw to start the server side of the RDMA bandwidth test; it will wait for a client to connect. Then, in pod B, run ib_read_bw ${pod_a_ip}. The output looks like this:
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x188 QPN 0x047f PSN 0x472502 OUT 0x10 RKey 0x02edae VAddr 0x007f2e0eb1d000
remote address: LID 0x185 QPN 0x2824 PSN 0x833d4b OUT 0x10 RKey 0x07c181 VAddr 0x007f43d5930000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 23481.28 23476.67 0.375627
---------------------------------------------------------------------------------------
The average bandwidth in the result above is 23476.67 MB/s, close to the limit of a 200 Gb/s NIC, which shows that the plugin is installed and configured correctly.
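To check the other HCAs as well, the device can be selected explicitly on both sides using standard perftest flags (the device name and IP are placeholders):
# Server side (pod A): bind the test to one HCA and report bandwidth in Gb/s
ib_read_bw -d mlx5_4 -F --report_gbits
# Client side (pod B): connect to pod A using the matching device
ib_read_bw -d mlx5_4 -F --report_gbits ${pod_a_ip}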