RDMA network setup

https://developer.aliyun.com/article/664961

Installing RDMA

https://developer.nvidia.com/blog/deploying-gpudirect-rdma-on-egx-stack-with-the-network-operator

  1. Understanding RDMA and GPUs: https://zhuanlan.zhihu.com/p/664712789

NCCL tests

https://github.com/coreweave/nccl-tests

RDMA network installation

  1. Confirm that the RDMA network driver is installed on the physical machine by running the following command:
# Check whether the IB NICs are present and the driver is working
ibstatus 

Infiniband device 'mlx5_0' port 1 status:
    default gid:     fe80:0000:0000:0000:a088:c203:002c:a86c
    base lid:     0x188
    sm lid:         0x1
    state:         4: ACTIVE
    phys state:     5: LinkUp
    rate:         200 Gb/sec (4X HDR)
    link_layer:     InfiniBand

Infiniband device 'mlx5_1' port 1 status:
    default gid:     fe80:0000:0000:0000:a088:c203:002c:b1d8
    base lid:     0x187
    sm lid:         0x1
    state:         4: ACTIVE
    phys state:     5: LinkUp
    rate:         200 Gb/sec (4X HDR)
    link_layer:     InfiniBand

Infiniband device 'mlx5_2' port 1 status:
    default gid:     fe80:0000:0000:0000:eaeb:d3ff:fe8b:31d6
    base lid:     0x0
    sm lid:         0x0
    state:         4: ACTIVE
    phys state:     5: LinkUp
    rate:         25 Gb/sec (1X EDR)
    link_layer:     Ethernet

Infiniband device 'mlx5_3' port 1 status:
    default gid:     fe80:0000:0000:0000:eaeb:d3ff:fe8b:31d7
    base lid:     0x0
    sm lid:         0x0
    state:         1: DOWN
    phys state:     3: Disabled
    rate:         40 Gb/sec (4X QDR)
    link_layer:     Ethernet

Infiniband device 'mlx5_4' port 1 status:
    default gid:     fe80:0000:0000:0000:a088:c203:002c:b034
    base lid:     0x189
    sm lid:         0x1
    state:         4: ACTIVE
    phys state:     5: LinkUp
    rate:         200 Gb/sec (4X HDR)
    link_layer:     InfiniBand

Infiniband device 'mlx5_5' port 1 status:
    default gid:     fe80:0000:0000:0000:a088:c203:002c:b050
    base lid:     0x18a
    sm lid:         0x1
    state:         4: ACTIVE
    phys state:     5: LinkUp
    rate:         200 Gb/sec (4X HDR)
    link_layer:     InfiniBand    

The output above is what a correctly installed driver looks like: six devices in total, five of them ACTIVE (four 200 Gb/sec InfiniBand ports plus one 25 Gb/sec Ethernet port).
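If you only want a quick sanity check instead of reading the full ibstatus output, something like the following helps (a minimal sketch: ofed_info -s assumes the MLNX_OFED driver stack is installed, and the grep simply counts ports reported as ACTIVE):

# Print the installed MLNX_OFED version
ofed_info -s

# Count how many ports ibstatus reports as ACTIVE (5 on the machine above)
ibstatus | grep -c ACTIVE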

  2. Run the following command to see which logical (netdev) interface each physical device is bound to; these names are used in the k8s RDMA plugin configuration in the next step:
ibdev2netdev

mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)
mlx5_2 port 1 ==> eth0 (Up)
mlx5_3 port 1 ==> eth1 (Down)
mlx5_4 port 1 ==> ib2 (Up)
mlx5_5 port 1 ==> ib3 (Up)

As shown above, mlx5_2 ==> eth0 is only the 25 Gb Ethernet port, so it does not need to appear in the RDMA plugin configuration. Selecting ib0, ib1, ib2 and ib3 is enough.
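To pick out only the devices whose link layer is InfiniBand instead of reading the output by hand, a small loop over sysfs works; this is just a convenience sketch, and the manual selection above is equally fine:

# Print the ibdev2netdev line for every device that reports an InfiniBand link layer
for dev in /sys/class/infiniband/mlx5_*; do
    if grep -q InfiniBand "$dev/ports/1/link_layer"; then
        ibdev2netdev | grep "$(basename "$dev") "
    fi
done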

  3. Install the k8s RDMA plugin. The plugin is at this location; modify the plugin configuration as follows:
{
  "periodicUpdateInterval": 300,
  "configList": [{
      "resourceName": "hca",
      "resourcePrefix": "rdma",
      "rdmaHcaMax": 1000,
      "devices": ["ib0", "ib1","ib2","ib3"]
    }
  ]
}

Deploy it into the k8s cluster by following the plugin's deployment guide.
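Once the plugin is running, every RDMA-capable node should advertise the rdma/hca extended resource. A quick way to confirm this (the node name is a placeholder):

# The node's allocatable resources should now include rdma/hca
kubectl describe node <node-name> | grep rdma/hca

# Expected to show something like:
#   rdma/hca:  1000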

  4. Verify that the deployment succeeded. Use the following manifest to create two pods on different nodes (apply it once per node), taking care to adjust the nodeSelector and nodeName for each:
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod2
  namespace: default
spec:
  containers:
    - name: mofed-test-ctr
      image: mellanox/rping-test
      command:
        - sh
        - '-c'
        - |
          ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
          sleep 1000000
      resources:
        limits:
          rdma/hca: '1'
        requests:
          rdma/hca: '1'
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: Always
      securityContext:
        capabilities:
          add:
            - IPC_LOCK
  restartPolicy: OnFailure
  terminationGracePeriodSeconds: 30
  dnsPolicy: ClusterFirst
  nodeSelector:
    k3s.io/hostname: tj5-ant-ai-studio01.kscn
  serviceAccountName: default
  nodeName: tj5-ant-ai-studio01.kscn
  securityContext: {}
  schedulerName: default-scheduler
  priority: 0
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority
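Before running the bandwidth test, it is worth confirming that each pod actually sees the RDMA devices. The container's startup command already lists /dev/infiniband, so the pod log is enough (pod names follow the manifest above):

# The ls output from the container command shows the injected RDMA devices
kubectl logs mofed-test-pod2

# Exec into the pod for the ib_read_bw test below
kubectl exec -it mofed-test-pod2 -- sh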

Once both pods are running, exec into pod A and start ib_read_bw as the server for the RDMA bandwidth test between the two pods; it will wait for the client to connect. Then, in pod B, run ib_read_bw ${pod_a_ip}. The output looks like this:

---------------------------------------------------------------------------------------
                    RDMA_Read BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Outstand reads  : 16
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x188 QPN 0x047f PSN 0x472502 OUT 0x10 RKey 0x02edae VAddr 0x007f2e0eb1d000
 remote address: LID 0x185 QPN 0x2824 PSN 0x833d4b OUT 0x10 RKey 0x07c181 VAddr 0x007f43d5930000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             23481.28            23476.67                  0.375627
---------------------------------------------------------------------------------------

The result above shows an average bandwidth of 23476.67 MB/s, i.e. roughly 188 Gb/s, close to the line rate of the 200 Gb NIC. This confirms that the plugin is installed and configured correctly.
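With the RDMA path verified at the verbs level, the nccl-tests repository linked at the top of this page can be used to validate the GPU side as well. A rough sketch, assuming nccl-tests has been built with MPI support and that the four 200 Gb HCAs found earlier should be used (host names and process counts are placeholders):

# Restrict NCCL to the four InfiniBand HCAs and run an all-reduce across 2 nodes x 8 GPUs
mpirun -np 16 -H nodeA:8,nodeB:8 \
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_4,mlx5_5 \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1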