[Part01] Getting Started with Red Hat OpenShift with NVIDIA

Table of Contents

  1. Introduction
  2. Remote Direct Memory Access (RDMA)
    • Introduction to RDMA
    • RDMA Protocols and Network Technologies
    • Verifying RDMA Capability in OpenShift
    • RDMA Configuration Options in OpenShift
    • RDMA Network Configuration
    • Testing and Verification
    • Performance Optimization
    • Common Issues and Troubleshooting
    • Integration with Storage and GPU Workloads
  3. NVIDIA GPU Architecture
    • Introduction to GPU Concurrency and Sharing Mechanisms
    • GPU Sharing Technologies
    • Deployment Considerations for Different OpenShift Scenarios
    • Implementation Guidelines
    • Performance Optimization
    • Integration with RDMA for High-Performance Computing
    • Troubleshooting
  4. Conclusion

Introduction

This guide provides detailed guidance for architects, consultants, and practitioners implementing Red Hat OpenShift with NVIDIA networking hardware and GPU technologies. It offers methodologies, best practices, and configuration examples to help organizations leverage NVIDIA technologies in OpenShift environments for high-performance computing, AI/ML workloads, and other latency-sensitive applications.

NVIDIA technologies, when integrated with Red Hat OpenShift, provide high-bandwidth, low-latency connectivity and powerful GPU acceleration essential for modern data-intensive workloads. This guide covers both RDMA (Remote Direct Memory Access) networking and NVIDIA GPU architecture to provide a complete reference for implementation.

Remote Direct Memory Access (RDMA)

Introduction to RDMA

Remote Direct Memory Access (RDMA) is a technology that enables direct memory access from the memory of one computer to the memory of another without involving either computer’s operating system, CPU, or cache. RDMA provides high-throughput, low-latency networking by bypassing traditional networking stacks and reducing CPU overhead, making it ideal for data-intensive workloads in OpenShift environments.

Key benefits of RDMA include:

  • Reduced Latency: By bypassing the OS kernel and CPU, RDMA significantly reduces communication latency
  • Higher Bandwidth: Enables near line-rate data transfer speeds
  • Lower CPU Utilization: Offloads data transfer operations from the CPU to the network adapter
  • Zero-Copy Networking: Data is transferred directly between application memory spaces without intermediate copies
  • Kernel Bypass: Communication bypasses the operating system kernel, reducing context switches and interrupts

RDMA is particularly valuable for OpenShift deployments running high-performance computing (HPC) workloads, AI/ML training and inference, database applications, and storage systems that require high-bandwidth, low-latency communication.

RDMA Protocols and Network Technologies

RDMA can be implemented using several protocols and network technologies:

InfiniBand

InfiniBand is a specialized high-performance network technology designed specifically for high-throughput, low-latency communications. It provides native support for RDMA and is commonly used in HPC environments.

Key characteristics:

  • Purpose-built for high reliability, high bandwidth, and low latency
  • Uses cut-through forwarding based on 16-bit local identifiers (LIDs) for fast switching
  • Provides end-to-end flow control for lossless networking
  • Built-in software-defined networking with subnet manager
  • Requires specialized hardware and infrastructure

Note: OpenShift installation requires Ethernet connectivity for the cluster API traffic. InfiniBand can only be used as a secondary network for application traffic after the cluster is installed.

RDMA over Converged Ethernet (RoCE)

RoCE enables RDMA functionality over standard Ethernet networks, making it more accessible and cost-effective than InfiniBand while still providing many of the performance benefits.

Key characteristics:

  • RoCEv1: Layer 2 protocol that works within a single broadcast domain
  • RoCEv2: Routable protocol that runs on top of UDP/IP (IPv4 or IPv6)
  • Widely supported by modern network adapters
  • Currently the most popular protocol for implementing RDMA
  • Compatible with existing Ethernet infrastructure

Internet Wide Area RDMA Protocol (iWARP)

iWARP implements RDMA over TCP/IP networks, providing RDMA capabilities over standard TCP connections.

Key characteristics:

  • Leverages TCP or SCTP for reliable transport
  • Works over standard TCP/IP networks without specialized hardware
  • Generally has higher latency than RoCE or InfiniBand
  • More tolerant of packet loss and congestion
  • Easier to deploy in existing TCP/IP networks

Verifying RDMA Capability in OpenShift

Before implementing RDMA in your OpenShift environment, you need to verify that your nodes have RDMA-capable hardware and that it’s properly recognized by the system.

Using Node Feature Discovery (NFD)

Node Feature Discovery automatically detects and labels nodes with RDMA capabilities. To verify RDMA capability:

  1. Ensure NFD is installed and running in your cluster
  2. Check for RDMA-related labels on your nodes:
oc get node -o json | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))'

Look for the following labels which indicate RDMA capability:

  • feature.node.kubernetes.io/rdma.available: "true"
  • feature.node.kubernetes.io/rdma.capable: "true"
  • feature.node.kubernetes.io/pci-15b3.present: "true" (for Mellanox/NVIDIA NICs)
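
With NFD in place, you can also list RDMA-capable nodes directly by label; a quick check assuming the rdma.available label shown above is present:

oc get nodes -l feature.node.kubernetes.io/rdma.available=true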

Manual Verification

You can also manually verify RDMA capability on your nodes:

  1. Check for RDMA devices:
rdma link
  2. For Mellanox/NVIDIA NICs, verify the presence of InfiniBand devices:
lspci -nn | grep Infiniband
ibstat | grep "Link layer"
  3. Check the RDMA subsystem mode:
rdma system

RDMA Configuration Options in OpenShift

OpenShift with NVIDIA networking supports three primary RDMA configuration methods, each with different characteristics and use cases.

1. RDMA Shared Device

The RDMA Shared Device configuration allows multiple pods on a worker node to share the same RDMA device. This method is suitable for development environments or applications where maximum performance is not critical.

Configuration Example:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/cloud-native
    version: v1.5.3

Key Parameters:

  • rdmaHcaMax: Maximum number of pods that can share the device
  • selectors.ifNames: Network interface names to be used for RDMA

Use Cases:

  • Development and testing environments
  • Applications where multiple pods need RDMA functionality but not maximum performance
  • Environments with limited hardware resources

Limitations:

  • All pods sharing the device compete for bandwidth and resources
  • No isolation between pods using the same device
  • Performance may degrade as more pods use the device
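
For reference, a pod consumes the shared device through an extended resource request. The following is a minimal sketch, assuming the plugin's default rdma/ resource prefix, the rdma_shared_device_eth resource defined above, and a hypothetical secondary network attachment:

apiVersion: v1
kind: Pod
metadata:
  name: rdma-shared-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-shared-network  # hypothetical NetworkAttachmentDefinition
spec:
  containers:
  - name: app
    image: mellanox/rping-test
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]  # allows the RDMA library to pin memory for registration
    resources:
      limits:
        rdma/rdma_shared_device_eth: 1
      requests:
        rdma/rdma_shared_device_eth: 1
    command: ["sh", "-c", "sleep infinity"]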

2. RDMA SR-IOV Legacy Device

The SR-IOV (Single Root I/O Virtualization) configuration segments a network device at the hardware layer, creating multiple virtual functions (VFs) that can be assigned to different pods. This provides better isolation and performance compared to the shared device method.

Configuration Example:

# NicClusterPolicy for OFED driver
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30

# SR-IOV Network Node Policy
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens8f0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy

Key Parameters:

  • numVfs: Number of virtual functions to create
  • pfNames: Physical function names to use for SR-IOV
  • isRdma: Enable RDMA capability for the VFs

Use Cases:

  • Production environments requiring high performance
  • Workloads sensitive to latency and bandwidth
  • Applications requiring isolation between network resources

Limitations:

  • Limited by the maximum number of VFs supported by the hardware
  • Requires SR-IOV capable network adapters
  • May require system reboot when changing configuration
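
Note that the node policy only creates and advertises the virtual functions; pods attach to them through a SriovNetwork object that references the resource name and generates the corresponding network attachment. A minimal sketch, assuming the sriovlegacy resource above and an illustrative whereabouts IPAM range:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-legacy-network
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriovlegacy
  networkNamespace: default  # namespace where the NetworkAttachmentDefinition is created
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.100.0/24"
    }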

3. RDMA Host Device

The Host Device configuration passes the entire physical network device from the host to a pod. This provides maximum performance but limits the device to a single pod at a time.

Configuration Example:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }

Key Parameters:

  • resourcePrefix: Prefix for the resource name
  • resourceName: Name of the resource to be exposed
  • selectors.vendors: Vendor IDs to match (15b3 for Mellanox/NVIDIA)

Use Cases:

  • Workloads requiring maximum performance
  • Systems where SR-IOV is not supported
  • Applications needing features only available in the physical function driver

Limitations:

  • Device is exclusive to a single pod
  • Limited scalability as each device can only be used by one pod at a time
  • May not be suitable for environments with many pods requiring RDMA
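
To expose the host device to pods as a secondary network, the NVIDIA Network Operator provides a HostDeviceNetwork resource that pairs the nvidia.com/hostdev resource with a network attachment. A minimal sketch, assuming the hostdev resource defined above and an illustrative whereabouts IPAM range:

apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: default
  resourceName: hostdev
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.0/24"
    }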

RDMA Network Configuration

InfiniBand Network Configuration

For InfiniBand networks, additional configuration is required:

  1. Ensure the host has an InfiniBand card installed and the driver is properly installed
  2. Verify the RDMA subsystem mode:
   rdma system

For exclusive mode (recommended for production):

   rdma system set netns exclusive
   echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf
   reboot
  3. Configure SR-IOV for InfiniBand:
   apiVersion: sriovnetwork.openshift.io/v1
   kind: SriovNetworkNodePolicy
   metadata:
     name: ib-sriov
     namespace: kube-system
   spec:
     nodeSelector:
       kubernetes.io/os: "linux"
     resourceName: mellanoxibsriov
     priority: 99
     numVfs: 12
     nicSelector:
         deviceID: "1017"
         rootDevices:
         - 0000:86:00.0
         vendor: "15b3"
     deviceType: netdevice
     isRdma: true
  4. Create a network attachment definition:
   apiVersion: spiderpool.spidernet.io/v2beta1
   kind: SpiderMultusConfig
   metadata:
     name: ib-sriov
     namespace: kube-system
   spec:
     cniType: ib-sriov
     ibsriov:
       resourceName: spidernet.io/mellanoxibsriov
       ippools:
         ipv4: ["v4-91"]

RoCE Network Configuration

For RoCE networks, ensure:

  1. The network adapters support RoCE (typically Mellanox ConnectX-4 or newer)
  2. Priority Flow Control (PFC) is configured on the switches
  3. Explicit Congestion Notification (ECN) is enabled
  4. Appropriate QoS settings are configured
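
On the hosts themselves, PFC, trust mode, and the default RoCE version are typically set with the Mellanox OFED tooling. The values below are illustrative only and assume the mlnx_qos and cma_roce_mode utilities are available on the node:

# Enable PFC on priority 3 and trust DSCP markings on the RoCE interface
mlnx_qos -i ens8f0np0 --trust dscp --pfc 0,0,0,1,0,0,0,0

# Set the default RoCE mode to RoCEv2 for the device
cma_roce_mode -d mlx5_0 -p 1 -m 2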

Testing and Verification

Verifying RDMA Functionality

To verify RDMA functionality between pods:

  1. Deploy test pods with RDMA capabilities:
   apiVersion: v1
   kind: Pod
   metadata:
     name: rdma-test-pod-1
     annotations:
       k8s.v1.cni.cncf.io/networks: rdma-network
   spec:
     containers:
     - name: rdma-test-container
       image: mellanox/rping-test
       securityContext:
         capabilities:
           add: ["IPC_LOCK"]
       resources:
         limits:
           nvidia.com/hostdev: 1
         requests:
           nvidia.com/hostdev: 1
       command:
       - sh
       - -c
       - sleep infinity
  2. Verify RDMA devices in the pods:
   oc exec -it rdma-test-pod-1 -- rdma link
  3. Run RDMA performance tests:
   # In pod 1 (server)
   oc exec -it rdma-test-pod-1 -- ib_read_lat

   # In pod 2 (client)
   oc exec -it rdma-test-pod-2 -- ib_read_lat <server-ip>

Performance Optimization

Network Tuning

  1. MTU Size: Configure jumbo frames (MTU 9000) for improved throughput:
   apiVersion: sriovnetwork.openshift.io/v1
   kind: SriovNetworkNodePolicy
   metadata:
     name: sriov-policy
   spec:
     mtu: 9000
     # other parameters...
  2. NUMA Alignment: Ensure RDMA devices are aligned with CPU and memory resources (see the Topology Manager sketch after this list):
   apiVersion: v1
   kind: Pod
   metadata:
     name: rdma-numa-aligned-pod
   spec:
     containers:
     - name: rdma-container
       # ...
     nodeSelector:
       kubernetes.io/hostname: node-with-aligned-resources
     topologySpreadConstraints:
     - maxSkew: 1
       topologyKey: kubernetes.io/hostname
       whenUnsatisfiable: DoNotSchedule
       labelSelector:
         matchLabels:
           app: rdma-app
  3. IRQ Affinity: Configure IRQ affinity for RDMA devices to specific CPU cores:
   # On the host
   set_irq_affinity.sh <interface_name>
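
NUMA alignment ultimately depends on the kubelet's Topology Manager rather than scheduling hints alone. A hedged sketch of a KubeletConfig that enables single-NUMA-node placement for guaranteed pods on worker nodes (pool selector and name are illustrative):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: numa-aligned-kubelet
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node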

Application Tuning

  1. Buffer Sizes and Resource Limits: Increase RDMA resource limits, such as the number of queue pairs, where your workload requires it:
   # Example: raise the mlx4 queue-pair limit at module load time (takes effect after the driver reloads)
   echo "options mlx4_core log_num_qp=20" >> /etc/modprobe.d/mlx4_core.conf
  2. Transport Selection: Choose the appropriate RDMA transport based on your network:
    • InfiniBand: Use native InfiniBand transport for lowest latency
    • RoCE: Use RoCEv2 for routable RDMA over Ethernet
    • iWARP: Use for compatibility with standard TCP/IP networks

Common Issues and Troubleshooting

RDMA Device Not Visible in Pod

  1. Verify the OFED driver is installed:
   oc get pods -n nvidia-network-operator | grep ofed
  2. Check RDMA device allocation:
   oc describe pod <pod-name>
   # Look for resource allocation in the Events section
  3. Verify RDMA capability is enabled:
   oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml
   # Check for isRdma: true

Performance Issues

  1. Check for network congestion:
   # On the host
   perfquery -r
  2. Verify PFC is working:
   # On the switch
   show priority-flow-control
  3. Monitor RDMA statistics:
   # On the host
   rdma statistic show
   # Or read per-port counters directly from sysfs
   cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data

Connectivity Issues

  1. Verify subnet manager is running (for InfiniBand):
   # On the host
   sminfo
  2. Check link state:
   # On the host
   ibv_devinfo
  3. Test basic connectivity:
   # In the pod
   ping <remote-ip>

Integration with Storage and GPU Workloads

NVMe over Fabrics (NVMe-oF)

RDMA is a key transport for NVMe over Fabrics, providing high-performance access to NVMe storage devices over a network:

  1. Configure NVMe-oF target:
   # Example configuration
   nvmetcli
  2. Connect to NVMe-oF using RDMA:
   # In the pod
   nvme connect -t rdma -a <target-ip> -s 4420 -n <subsystem-nqn>
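
After the connect command succeeds, the remote namespace should show up as a local NVMe block device:

# In the pod
nvme list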

GPUDirect RDMA

GPUDirect RDMA enables direct data transfer between GPU memory and network adapters, bypassing CPU and system memory:

  1. Ensure NVIDIA GPU Operator is installed
  2. Enable GPUDirect RDMA in the GPU Operator ClusterPolicy so that the driver container loads the nvidia-peermem module:
   apiVersion: nvidia.com/v1
   kind: ClusterPolicy
   metadata:
     name: gpu-cluster-policy
   spec:
     driver:
       rdma:
         enabled: true
  3. Verify GPUDirect RDMA functionality:
   # On the node (or in the driver pod)
   lsmod | grep nvidia_peermem
   # In the pod, inspect GPU-to-NIC topology; a short PCIe path (PIX or PXB) between the GPU and the mlx5 device is ideal
   nvidia-smi topo -m

NVIDIA GPU Architecture

Introduction to GPU Concurrency and Sharing Mechanisms

In enterprise-level OpenShift environments, applications typically have varying compute requirements that can leave GPUs underutilized. Providing the right amount of compute resources for each workload is critical to reduce deployment costs and maximize GPU utilization. Red Hat and NVIDIA have developed GPU concurrency and sharing mechanisms to simplify GPU-accelerated computing on OpenShift clusters.

GPU concurrency mechanisms for improving utilization range from programming model APIs to system software and hardware partitioning, including virtualization. These mechanisms allow multiple workloads to share GPU resources efficiently, improving overall utilization and reducing costs.

GPU Sharing Technologies

CUDA Streams

Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model developed by NVIDIA for general computing on GPUs. CUDA streams provide a mechanism for parallel execution of operations on the GPU.

Key characteristics:

  • A stream is a sequence of operations that executes in issue-order on the GPU
  • CUDA commands are typically executed sequentially in a default stream
  • Asynchronous processing across different streams allows for parallel execution
  • Tasks in different streams can run before, during, or after each other
  • Enables the GPU to run multiple tasks simultaneously in no prescribed order

Use Cases:

  • Applications with multiple independent tasks that can be executed in parallel
  • Workloads that can benefit from overlapping data transfers and computations
  • Scenarios where multiple small kernels need to be executed concurrently

Time-Slicing

GPU time-slicing interleaves workloads scheduled on overloaded GPUs when running multiple CUDA applications. This approach allows for better utilization of GPU resources without requiring hardware-level partitioning.

Key characteristics:

  • Enables sharing of GPUs by defining a set of replicas for a GPU
  • Each replica can be independently distributed to a pod
  • No memory or fault isolation between replicas
  • Uses GPU time-slicing to multiplex workloads from replicas
  • Can be applied cluster-wide or to specific nodes

Configuration Example (the sharing settings live in a ConfigMap that the ClusterPolicy's device plugin references):

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any-gpu-time-slicing: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
          renameByDefault: false
          failRequestsGreaterThanOne: false
---
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  devicePlugin:
    config:
      name: time-slicing-config
      default: any-gpu-time-slicing

Use Cases:

  • Older NVIDIA cards with no MIG support on bare metal
  • Workloads that don’t require strict isolation
  • Development and testing environments

CUDA Multi-Process Service (MPS)

CUDA Multi-Process Service (MPS) allows multiple CUDA processes to share a single GPU. Kernels from different processes run in parallel on the GPU, so its compute resources are utilized more fully instead of idling between contexts.

Key characteristics:

  • Enables concurrent execution of kernel operations from different processes
  • Allows overlapping of memory copying from different processes
  • Enhances GPU utilization by enabling multiple processes to share the GPU
  • Provides a server process that manages access to the GPU

Use Cases:

  • HPC workloads with multiple MPI ranks
  • Applications with multiple small CUDA kernels
  • Scenarios where multiple processes need to share a single GPU
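
Recent NVIDIA device plugin and GPU Operator releases can also expose MPS through the same sharing configuration mechanism used for time-slicing. The following is a hedged sketch, assuming a device plugin version with MPS support and illustrative names:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-config
  namespace: gpu-operator
data:
  mps-any-gpu: |
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # each physical GPU is shared by up to 4 MPS clients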

Multi-Instance GPU (MIG)

Multi-Instance GPU (MIG) is a feature of the NVIDIA Ampere architecture that enables splitting GPU compute units and memory into multiple MIG instances. Each instance represents a standalone GPU device from a system perspective.

Key characteristics:

  • Each MIG instance appears as an individual GPU to the system
  • Provides hardware-level isolation between instances
  • Supported on NVIDIA Ampere data center GPUs such as the A100 and A30, as well as newer architectures
  • Can support up to seven independent CUDA applications
  • Offers complete isolation with dedicated hardware resources

Configuration Example:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  mig:
    strategy: single
  migManager:
    enabled: true
    # MIG profiles (for example 1g.5gb or 2g.10gb) are selected per node
    # via the nvidia.com/mig.config label, as shown below
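
With the strategy in place, the MIG manager selects the actual partitioning per node from the nvidia.com/mig.config label; for example, using a profile name from the operator's default MIG configuration:

oc label node <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite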

Use Cases:

  • Production environments requiring strict isolation
  • Mixed workloads with different resource requirements
  • Bare metal deployments with MIG-enabled cards

Virtualization with vGPU

Virtual machines (VMs) can directly access a single physical GPU using NVIDIA vGPU. This capability combines the power of GPU performance with the management and security benefits provided by virtualization.

Key characteristics:

  • Creates virtual GPUs that can be shared by VMs across the enterprise
  • Provides management and monitoring for VM environments
  • Enables workload balancing for mixed VDI and compute workloads
  • Allows resource sharing across multiple VMs
  • Offers proactive management capabilities

Use Cases:

  • VM environments requiring GPU acceleration
  • OpenShift Virtualization deployments
  • Mixed VDI and compute workloads

Deployment Considerations for Different OpenShift Scenarios

When implementing GPU sharing in OpenShift, consider the following recommendations for different scenarios:

Bare Metal Deployments

For bare metal OpenShift deployments:

  • vGPU is not available
  • Consider using MIG-enabled cards (A100, A30) for hardware-level partitioning
  • If using older NVIDIA cards without MIG support, consider time-slicing
  • For maximum performance, use direct GPU assignment without sharing

Virtual Machine Deployments

For OpenShift deployments on virtual machines:

  • vGPU is the best choice for sharing GPU resources
  • Consider using separate VMs when you need both passthrough and vGPU
  • Ensure the hypervisor supports GPU passthrough or vGPU

Mixed Environments with OKD Virtualization

For bare metal with OKD Virtualization and multiple GPUs:

  • Consider using pass-through for hosted VMs
  • Use time-slicing for containers
  • Align NUMA topology for optimal performance

Implementation Guidelines

Enabling Time-Slicing

To enable time-slicing of GPUs on Kubernetes:

  1. Create a ConfigMap with the time-slicing configuration:
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: time-slicing-config
     namespace: gpu-operator
   data:
     config.yaml: |
       version: v1
       sharing:
         timeSlicing:
           resources:
           - name: nvidia.com/gpu
             replicas: 4
  2. Apply the configuration to the NVIDIA GPU Operator by referencing the ConfigMap and its key:
   apiVersion: nvidia.com/v1
   kind: ClusterPolicy
   metadata:
     name: gpu-cluster-policy
   spec:
     devicePlugin:
       config:
         name: time-slicing-config
         default: config.yaml
  3. Verify the configuration:
   oc get nodes -o json | jq '.items[].status.allocatable | select(has("nvidia.com/gpu"))'

Configuring MIG

To configure Multi-Instance GPU (MIG) on OpenShift:

  1. Ensure you have MIG-capable GPUs (A100, A30)

  2. Create a MIG strategy configuration:

   apiVersion: nvidia.com/v1
   kind: ClusterPolicy
   metadata:
     name: gpu-cluster-policy
   spec:
     mig:
       strategy: single
  3. Apply the configuration and wait for the GPU Operator to configure MIG

  4. Verify MIG instances:

   oc exec -it -n nvidia-gpu-operator nvidia-device-plugin-daemonset-xyz -- nvidia-smi mig -lgi

Setting Up vGPU

To set up vGPU in an OpenShift environment:

  1. Install the NVIDIA vGPU software on the hypervisor

  2. Configure vGPU profiles for your VMs

  3. Install the GPU Operator in the OpenShift cluster:

   helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=false
  4. Verify vGPU detection:
   oc get nodes -o json | jq '.items[].status.allocatable | select(has("nvidia.com/gpu"))'

Performance Optimization

NUMA Alignment

For optimal GPU performance, ensure NUMA alignment between GPUs, NICs, and CPU cores:

  1. Identify NUMA topology:
   oc debug node/<node-name> -- chroot /host nvidia-smi topo -m
  2. Configure pod placement to respect NUMA boundaries:
   apiVersion: v1
   kind: Pod
   metadata:
     name: gpu-numa-aligned-pod
   spec:
     containers:
     - name: gpu-container
       # ...
     nodeSelector:
       nvidia.com/gpu.present: "true"
     topologySpreadConstraints:
     - maxSkew: 1
       topologyKey: kubernetes.io/hostname
       whenUnsatisfiable: DoNotSchedule
       labelSelector:
         matchLabels:
           app: gpu-app
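
Topology-aware placement also requires the pod to run in the Guaranteed QoS class. A minimal sketch (image tag illustrative) where requests equal limits so that an enabled Topology Manager with a single-numa-node policy can co-locate CPU, memory, and the GPU:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-guaranteed-pod
spec:
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9  # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        cpu: "8"
        memory: 32Gi
        nvidia.com/gpu: 1
      requests:
        cpu: "8"
        memory: 32Gi
        nvidia.com/gpu: 1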

GPU Monitoring

Monitor GPU utilization to identify opportunities for optimization:

  1. Deploy NVIDIA DCGM-Exporter:
   oc apply -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/deployment/openshift/dcgm-exporter.yaml
  2. Configure Prometheus to scrape metrics:
   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: dcgm-exporter
     namespace: nvidia-gpu-operator
   spec:
     endpoints:
     - port: metrics
       path: /metrics
       interval: 15s
     selector:
       matchLabels:
         app: dcgm-exporter
  3. Create Grafana dashboards to visualize GPU metrics
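
As a starting point for a dashboard panel, a hedged PromQL example using a standard dcgm-exporter metric name:

# Average GPU utilization per GPU and node
avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL)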

Integration with RDMA for High-Performance Computing

GPUDirect RDMA

GPUDirect RDMA enables direct data transfer between GPU memory and network adapters, bypassing CPU and system memory. This integration is particularly valuable for high-performance computing workloads.

To configure GPUDirect RDMA:

  1. Ensure both NVIDIA GPU Operator and NVIDIA Network Operator are installed

  2. Enable GPUDirect RDMA in the GPU Operator ClusterPolicy so that the driver container loads the nvidia-peermem module:

    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec:
      driver:
        rdma:
          enabled: true
  3. Verify GPUDirect RDMA functionality:
   # On the node (or in the driver pod)
   lsmod | grep nvidia_peermem
   # In the pod, inspect GPU-to-NIC topology; a short PCIe path (PIX or PXB) between the GPU and the mlx5 device is ideal
   nvidia-smi topo -m

Performance Considerations

When using GPUDirect RDMA with GPU sharing mechanisms:

  1. MIG instances can use GPUDirect RDMA independently

  2. Time-slicing may introduce additional latency for RDMA operations

  3. vGPU requires SR-IOV network adapters for optimal RDMA performance

  4. Align NUMA topology for GPUs and network adapters to minimize PCIe traffic across NUMA nodes

Troubleshooting

Common Issues

  1. GPU Not Detected:
   oc get nodes -o json | jq '.items[].status.allocatable | select(has("nvidia.com/gpu"))'
   oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-xyz
  2. MIG Configuration Failures:
   oc logs -n nvidia-gpu-operator nvidia-mig-manager-xyz
   oc exec -it -n nvidia-gpu-operator nvidia-device-plugin-daemonset-xyz -- nvidia-smi -mig 0
  3. Time-Slicing Issues:
   oc describe configmap -n gpu-operator time-slicing-config
   oc logs -n nvidia-gpu-operator nvidia-device-plugin-daemonset-xyz

Debugging GPU Workloads

  1. Check GPU allocation:
   oc describe pod <pod-name>
   # Look for resource allocation in the Events section
  2. Verify GPU visibility in the pod:
   oc exec -it <pod-name> -- nvidia-smi
  3. Check GPU utilization:
   oc exec -it <pod-name> -- nvidia-smi dmon

Conclusion

This comprehensive guide has covered both RDMA networking and NVIDIA GPU architecture in OpenShift environments. By understanding and implementing these technologies, organizations can significantly improve the performance and efficiency of their data-intensive workloads.

RDMA Benefits and Best Practices

RDMA provides significant performance benefits for high-throughput, low-latency applications in OpenShift environments. When implementing RDMA in OpenShift:

  • Choose the appropriate RDMA configuration based on your performance and isolation requirements
  • Ensure proper network configuration for optimal performance
  • Align RDMA resources with CPU, memory, and GPU resources for best results
  • Monitor and tune your RDMA deployment to maintain peak performance

GPU Sharing Benefits and Best Practices

NVIDIA GPU architecture in OpenShift provides flexible options for sharing and utilizing GPU resources efficiently. When implementing GPU sharing in OpenShift:

  • Select the appropriate sharing mechanism based on your hardware capabilities and isolation requirements
  • Consider the deployment scenario (bare metal, VMs, or mixed) when choosing a sharing approach
  • Optimize performance through proper NUMA alignment and monitoring
  • Integrate with RDMA for high-performance computing workloads

By combining RDMA networking with GPU acceleration and implementing the appropriate sharing mechanisms, organizations can build high-performance, cost-effective OpenShift environments for AI/ML, HPC, and other data-intensive workloads.
