[Part01] Getting Started with Red Hat OpenShift with NVIDIA

Table of Contents

  1. Introduction
  2. Remote Direct Memory Access (RDMA)
    • Introduction to RDMA
    • RDMA Protocols and Network Technologies
    • Verifying RDMA Capability in OpenShift
    • RDMA Configuration Options in OpenShift
    • RDMA Network Configuration
    • Testing and Verification
    • Performance Optimization
    • Common Issues and Troubleshooting
    • Integration with Storage and GPU Workloads
  3. NVIDIA GPU Architecture
    • Introduction to GPU Concurrency and Sharing Mechanisms
    • GPU Sharing Technologies
    • Deployment Considerations for Different OpenShift Scenarios
    • Implementation Guidelines
    • Performance Optimization
    • Integration with RDMA for High-Performance Computing
    • Troubleshooting
  4. Conclusion

Introduction

This guide provides detailed guidance for architects, consultants, and practitioners implementing Red Hat OpenShift with NVIDIA networking hardware and GPU technologies. It offers methodologies, best practices, and configuration examples to help organizations leverage NVIDIA technologies in OpenShift environments for high-performance computing, AI/ML workloads, and other latency-sensitive applications.

NVIDIA technologies, when integrated with Red Hat OpenShift, provide high-bandwidth, low-latency connectivity and powerful GPU acceleration essential for modern data-intensive workloads. This guide covers both RDMA (Remote Direct Memory Access) networking and NVIDIA GPU architecture to provide a complete reference for implementation.

Remote Direct Memory Access (RDMA)

Introduction to RDMA

Remote Direct Memory Access (RDMA) is a technology that enables direct memory access from the memory of one computer to the memory of another without involving either computer’s operating system, CPU, or cache. RDMA provides high-throughput, low-latency networking by bypassing traditional networking stacks and reducing CPU overhead, making it ideal for data-intensive workloads in OpenShift environments.

Key benefits of RDMA include:

  • Reduced Latency: By bypassing the OS kernel and CPU, RDMA significantly reduces communication latency
  • Higher Bandwidth: Enables near line-rate data transfer speeds
  • Lower CPU Utilization: Offloads data transfer operations from the CPU to the network adapter
  • Zero-Copy Networking: Data is transferred directly between application memory spaces without intermediate copies
  • Kernel Bypass: Communication bypasses the operating system kernel, reducing context switches and interrupts

RDMA is particularly valuable for OpenShift deployments running high-performance computing (HPC) workloads, AI/ML training and inference, database applications, and storage systems that require high-bandwidth, low-latency communication.

RDMA Protocols and Network Technologies

RDMA can be implemented using several protocols and network technologies:

InfiniBand

InfiniBand is a specialized high-performance network technology designed specifically for high-throughput, low-latency communications. It provides native support for RDMA and is commonly used in HPC environments.

Key characteristics:

  • Purpose-built for high reliability, high bandwidth, and low latency
  • Uses cut-through forwarding based on 16-bit local identifiers (LIDs) for fast switching
  • Provides end-to-end flow control for lossless networking
  • Built-in software-defined networking with subnet manager
  • Requires specialized hardware and infrastructure

Note: OpenShift installation requires Ethernet connectivity for the cluster API traffic. InfiniBand can only be used as a secondary network for application traffic after the cluster is installed.

RDMA over Converged Ethernet (RoCE)

RoCE enables RDMA functionality over standard Ethernet networks, making it more accessible and cost-effective than InfiniBand while still providing many of the performance benefits.

Key characteristics:

  • RoCEv1: Layer 2 protocol that works within a single broadcast domain
  • RoCEv2: Routable protocol that runs on top of UDP/IP (IPv4 or IPv6)
  • Widely supported by modern network adapters
  • Currently the most popular protocol for implementing RDMA
  • Compatible with existing Ethernet infrastructure

Internet Wide Area RDMA Protocol (iWARP)

iWARP implements RDMA over TCP/IP networks, providing RDMA capabilities over standard TCP connections.

Key characteristics:

  • Leverages TCP or SCTP for reliable transport
  • Works over standard TCP/IP networks without specialized hardware
  • Generally has higher latency than RoCE or InfiniBand
  • More tolerant of packet loss and congestion
  • Easier to deploy in existing TCP/IP networks

Verifying RDMA Capability in OpenShift

Before implementing RDMA in your OpenShift environment, you need to verify that your nodes have RDMA-capable hardware and that it’s properly recognized by the system.

Using Node Feature Discovery (NFD)

Node Feature Discovery automatically detects and labels nodes with RDMA capabilities. To verify RDMA capability:

  1. Ensure NFD is installed and running in your cluster
  2. Check for RDMA-related labels on your nodes:
oc get node -o json | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))'

Look for the following labels which indicate RDMA capability:

  • feature.node.kubernetes.io/rdma.available: "true"
  • feature.node.kubernetes.io/rdma.capable: "true"
  • feature.node.kubernetes.io/pci-15b3.present: "true" (for Mellanox/NVIDIA NICs)
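
With NFD in place, you can also list RDMA-capable nodes directly by label; a quick check assuming the rdma.available label shown above is present:

oc get nodes -l feature.node.kubernetes.io/rdma.available=true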

Manual Verification

You can also manually verify RDMA capability on your nodes:

  1. Check for RDMA devices:
rdma link
  2. For Mellanox/NVIDIA NICs, verify the presence of InfiniBand devices:
lspci -nn | grep Infiniband
ibstat | grep "Link layer"
  3. Check the RDMA subsystem mode:
rdma system

RDMA Configuration Options in OpenShift

OpenShift with NVIDIA networking supports three primary RDMA configuration methods, each with different characteristics and use cases.

1. RDMA Shared Device

The RDMA Shared Device configuration allows multiple pods on a worker node to share the same RDMA device. This method is suitable for development environments or applications where maximum performance is not critical.

Configuration Example:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/cloud-native
    version: v1.5.3

Key Parameters:

  • rdmaHcaMax: Maximum number of pods that can share the device
  • selectors.ifNames: Network interface names to be used for RDMA

Use Cases:

  • Development and testing environments
  • Applications where multiple pods need RDMA functionality but not maximum performance
  • Environments with limited hardware resources

Limitations:

  • All pods sharing the device compete for bandwidth and resources
  • No isolation between pods using the same device
  • Performance may degrade as more pods use the device
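
For reference, a pod consumes the shared device through an extended resource request. The following is a minimal sketch, assuming the plugin's default rdma/ resource prefix, the rdma_shared_device_eth resource defined above, and a hypothetical secondary network attachment:

apiVersion: v1
kind: Pod
metadata:
  name: rdma-shared-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-shared-network  # hypothetical NetworkAttachmentDefinition
spec:
  containers:
  - name: app
    image: mellanox/rping-test
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]  # allows the RDMA library to pin memory for registration
    resources:
      limits:
        rdma/rdma_shared_device_eth: 1
      requests:
        rdma/rdma_shared_device_eth: 1
    command: ["sh", "-c", "sleep infinity"]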

2. RDMA SR-IOV Legacy Device

The SR-IOV (Single Root I/O Virtualization) configuration segments a network device at the hardware layer, creating multiple virtual functions (VFs) that can be assigned to different pods. This provides better isolation and performance compared to the shared device method.

Configuration Example:

# NicClusterPolicy for OFED driver
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30

# SR-IOV Network Node Policy
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens8f0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy

Key Parameters:

  • numVfs: Number of virtual functions to create
  • pfNames: Physical function names to use for SR-IOV
  • isRdma: Enable RDMA capability for the VFs

Use Cases:

  • Production environments requiring high performance
  • Workloads sensitive to latency and bandwidth
  • Applications requiring isolation between network resources

Limitations:

  • Limited by the maximum number of VFs supported by the hardware
  • Requires SR-IOV capable network adapters
  • May require system reboot when changing configuration
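
Note that the node policy only creates and advertises the virtual functions; pods attach to them through a SriovNetwork object that references the resource name and generates the corresponding network attachment. A minimal sketch, assuming the sriovlegacy resource above and an illustrative whereabouts IPAM range:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-legacy-network
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriovlegacy
  networkNamespace: default  # namespace where the NetworkAttachmentDefinition is created
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.100.0/24"
    }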

3. RDMA Host Device

The Host Device configuration passes the entire physical network device from the host to a pod. This provides maximum performance but limits the device to a single pod at a time.

Configuration Example:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }

Key Parameters:

  • resourcePrefix: Prefix for the resource name
  • resourceName: Name of the resource to be exposed
  • selectors.vendors: Vendor IDs to match (15b3 for Mellanox/NVIDIA)

Use Cases:

  • Workloads requiring maximum performance
  • Systems where SR-IOV is not supported
  • Applications needing features only available in the physical function driver

Limitations:

  • Device is exclusive to a single pod
  • Limited scalability as each device can only be used by one pod at a time
  • May not be suitable for environments with many pods requiring RDMA
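
To expose the host device to pods as a secondary network, the NVIDIA Network Operator provides a HostDeviceNetwork resource that pairs the nvidia.com/hostdev resource with a network attachment. A minimal sketch, assuming the hostdev resource defined above and an illustrative whereabouts IPAM range:

apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: default
  resourceName: hostdev
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.0/24"
    }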

RDMA Network Configuration

InfiniBand Network Configuration

For InfiniBand networks, additional configuration is required:

  1. Ensure the host has an InfiniBand card installed and the driver is properly installed
  2. Verify the RDMA subsystem mode:
   rdma system

For exclusive mode (recommended for production):

   rdma system set netns exclusive
   echo "options ib_core netns_mode=0" >> /etc/modprobe.d/ib_core.conf
   reboot
  3. Configure SR-IOV for InfiniBand:
   apiVersion: sriovnetwork.openshift.io/v1
   kind: SriovNetworkNodePolicy
   metadata:
     name: ib-sriov
     namespace: kube-system
   spec:
     nodeSelector:
       kubernetes.io/os: "linux"
     resourceName: mellanoxibsriov
     priority: 99
     numVfs: 12
     nicSelector:
         deviceID: "1017"
         rootDevices:
         - 0000:86:00.0
         vendor: "15b3"
     deviceType: netdevice
     isRdma: true
  4. Create a network attachment definition:
   apiVersion: spiderpool.spidernet.io/v2beta1
   kind: SpiderMultusConfig
   metadata:
     name: ib-sriov
     namespace: kube-system
   spec:
     cniType: ib-sriov
     ibsriov:
       resourceName: spidernet.io/mellanoxibsriov
       ippools:
         ipv4: ["v4-91"]

RoCE Network Configuration

For RoCE networks, ensure:

  1. The network adapters support RoCE (typically Mellanox ConnectX-4 or newer)
  2. Priority Flow Control (PFC) is configured on the switches
  3. Explicit Congestion Notification (ECN) is enabled
  4. Appropriate QoS settings are configured
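
On the hosts themselves, PFC, trust mode, and the default RoCE version are typically set with the Mellanox OFED tooling. The values below are illustrative only and assume the mlnx_qos and cma_roce_mode utilities are available on the node:

# Enable PFC on priority 3 and trust DSCP markings on the RoCE interface
mlnx_qos -i ens8f0np0 --trust dscp --pfc 0,0,0,1,0,0,0,0

# Set the default RoCE mode to RoCEv2 for the device
cma_roce_mode -d mlx5_0 -p 1 -m 2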

Testing and Verification

Verifying RDMA Functionality

To verify RDMA functionality between pods:

  1. Deploy test pods with RDMA capabilities:
   apiVersion: v1
   kind: Pod
   metadata:
     name: rdma-test-pod-1
     annotations:
       k8s.v1.cni.cncf.io/networks: rdma-network
   spec:
     containers:
     - name: rdma-test-container
       image: mellanox/rping-test
       securityContext:
         capabilities:
           add: ["IPC_LOCK"]
       resources:
         limits:
           nvidia.com/hostdev: 1
         requests:
           nvidia.com/hostdev: 1
       command:
       - sh
       - -c
       - sleep infinity
  2. Verify RDMA devices in the pods:
   oc exec -it rdma-test-pod-1 -- rdma link
  3. Run RDMA performance tests:
   # In pod 1 (server)
   oc exec -it rdma-test-pod-1 -- ib_read_lat

   # In pod 2 (client)
   oc exec -it rdma-test-pod-2 -- ib_read_lat <server-ip>

Performance Optimization

Network Tuning

  1. MTU Size: Configure jumbo frames (MTU 9000) for improved throughput:
   apiVersion: sriovnetwork.openshift.io/v1
   kind: SriovNetworkNodePolicy
   metadata:
     name: sriov-policy
   spec:
     mtu: 9000
     # other parameters...
  2. NUMA Alignment: Ensure RDMA devices are aligned with CPU and memory resources (see the Topology Manager sketch after this list):
   apiVersion: v1
   kind: Pod
   metadata:
     name: rdma-numa-aligned-pod
   spec:
     containers:
     - name: rdma-container
       # ...
     nodeSelector:
       kubernetes.io/hostname: node-with-aligned-resources
     topologySpreadConstraints:
     - maxSkew: 1
       topologyKey: kubernetes.io/hostname
       whenUnsatisfiable: DoNotSchedule
       labelSelector:
         matchLabels:
           app: rdma-app
  3. IRQ Affinity: Configure IRQ affinity for RDMA devices to specific CPU cores:
   # On the host
   set_irq_affinity.sh <interface_name>
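
NUMA alignment ultimately depends on the kubelet's Topology Manager rather than scheduling hints alone. A hedged sketch of a KubeletConfig that enables single-NUMA-node placement for guaranteed pods on worker nodes (pool selector and name are illustrative):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: numa-aligned-kubelet
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node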

Application Tuning

  1. Buffer Sizes and Resource Limits: Increase RDMA resource limits, such as the number of queue pairs, where your workload requires it:
   # Example: raise the mlx4 queue-pair limit at module load time (takes effect after the driver reloads)
   echo "options mlx4_core log_num_qp=20" >> /etc/modprobe.d/mlx4_core.conf
  2. Transport Selection: Choose the appropriate RDMA transport based on your network:
    • InfiniBand: Use native InfiniBand transport for lowest latency
    • RoCE: Use RoCEv2 for routable RDMA over Ethernet
    • iWARP: Use for compatibility with standard TCP/IP networks

Common Issues and Troubleshooting

RDMA Device Not Visible in Pod

  1. Verify the OFED driver is installed:
   oc get pods -n nvidia-network-operator | grep ofed
  2. Check RDMA device allocation:
   oc describe pod <pod-name>
   # Look for resource allocation in the Events section
  3. Verify RDMA capability is enabled:
   oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml
   # Check for isRdma: true

Performance Issues

  1. Check for network congestion:
   # On the host
   perfquery -r
  2. Verify PFC is working:
   # On the switch
   show priority-flow-control
  3. Monitor RDMA statistics:
   # On the host
   rdma statistic show
   # Or read per-port counters directly from sysfs
   cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data

Connectivity Issues

  1. Verify subnet manager is running (for InfiniBand):
   # On the host
   sminfo
  2. Check link state:
   # On the host
   ibv_devinfo
  3. Test basic connectivity:
   # In the pod
   ping <remote-ip>

Integration with Storage and GPU Workloads

NVMe over Fabrics (NVMe-oF)

RDMA is a key transport for NVMe over Fabrics, providing high-performance access to NVMe storage devices over a network:

  1. Configure NVMe-oF target:
   # Example configuration
   nvmetcli
  2. Connect to NVMe-oF using RDMA:
   # In the pod
   nvme connect -t rdma -a <target-ip> -s 4420 -n <subsystem-nqn>
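
After the connect command succeeds, the remote namespace should show up as a local NVMe block device:

# In the pod
nvme list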

GPUDirect RDMA

GPUDirect RDMA enables direct data transfer between GPU memory and network adapters, bypassing CPU and system memory:

  1. Ensure NVIDIA GPU Operator is installed
  2. Enable GPUDirect RDMA in the GPU Operator ClusterPolicy so that the driver container loads the nvidia-peermem module:
   apiVersion: nvidia.com/v1
   kind: ClusterPolicy
   metadata:
     name: gpu-cluster-policy
   spec:
     driver:
       rdma:
         enabled: true
  3. Verify GPUDirect RDMA functionality:
   # On the node (or in the driver pod)
   lsmod | grep nvidia_peermem
   # In the pod, inspect GPU-to-NIC topology; a short PCIe path (PIX or PXB) between the GPU and the mlx5 device is ideal
   nvidia-smi topo -m

NVIDIA GPU Architecture

Introduction to GPU Concurrency and Sharing Mechanisms

In enterprise-level OpenShift environments, applications typically have varying compute requirements that can leave GPUs underutilized. Providing the right amount of compute resources for each workload is critical to reduce deployment costs and maximize GPU utilization. Red Hat and NVIDIA have developed GPU concurrency and sharing mechanisms to simplify GPU-accelerated computing on OpenShift clusters.

GPU concurrency mechanisms for improving utilization range from programming model APIs to system software and hardware partitioning, including virtualization. These mechanisms allow multiple workloads to share GPU resources efficiently, improving overall utilization and reducing costs.

GPU Sharing Technologies

CUDA Streams

Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model developed by NVIDIA for general computing on GPUs. CUDA streams provide a mechanism for parallel execution of operations on the GPU.

Key characteristics:

  • A stream is a sequence of operations that executes in issue-order on the GPU
  • CUDA commands are typically executed sequentially in a default stream
  • Asynchronous processing across different streams allows for parallel execution
  • Tasks in different streams can run before, during, or after each other
  • Enables the GPU to run multiple tasks simultaneously in no prescribed order

Use Cases:

  • Applications with multiple independent tasks that can be executed in parallel
  • Workloads that can benefit from overlapping data transfers and computations
  • Scenarios where multiple small kernels need to be executed concurrently

Time-Slicing

GPU time-slicing interleaves workloads scheduled on overloaded GPUs when running multiple CUDA applications. This approach allows for better utilization of GPU resources without requiring hardware-level partitioning.

Key characteristics:

  • Enables sharing of GPUs by defining a set of replicas for a GPU
  • Each replica can be independently distributed to a pod
  • No memory or fault isolation between replicas
  • Uses GPU time-slicing to multiplex workloads from replicas
  • Can be applied cluster-wide or to specific nodes

Configuration Example (the sharing settings live in a ConfigMap that the ClusterPolicy's device plugin references):

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any-gpu-time-slicing: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
          renameByDefault: false
          failRequestsGreaterThanOne: false
---
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  devicePlugin:
    config:
      name: time-slicing-config
      default: any-gpu-time-slicing

Use Cases:

  • Older NVIDIA cards with no MIG support on bare metal
  • Workloads that don’t require strict isolation
  • Development and testing environments

CUDA Multi-Process Service (MPS)

CUDA Multi-Process Service (MPS) allows multiple CUDA processes to share a single GPU. Kernels from different processes run in parallel on the GPU, so its compute resources are utilized more fully instead of idling between contexts.

Key characteristics:

  • Enables concurrent execution of kernel operations from different processes
  • Allows overlapping of memory copying from different processes
  • Enhances GPU utilization by enabling multiple processes to share the GPU
  • Provides a server process that manages access to the GPU

Use Cases:

  • HPC workloads with multiple MPI ranks
  • Applications with multiple small CUDA kernels
  • Scenarios where multiple processes need to share a single GPU
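
Recent NVIDIA device plugin and GPU Operator releases can also expose MPS through the same sharing configuration mechanism used for time-slicing. The following is a hedged sketch, assuming a device plugin version with MPS support and illustrative names:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-config
  namespace: gpu-operator
data:
  mps-any-gpu: |
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # each physical GPU is shared by up to 4 MPS clients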

Multi-Instance GPU (MIG)

Multi-Instance GPU (MIG) is a feature of the NVIDIA Ampere architecture that enables splitting GPU compute units and memory into multiple MIG instances. Each instance represents a standalone GPU device from a system perspective.

Key characteristics:

  • Each MIG instance appears as an individual GPU to the system
  • Provides hardware-level isolation between instances
  • Supported on NVIDIA Ampere data center GPUs such as the A100 and A30, as well as newer architectures
  • Can support up to seven independent CUDA applications
  • Offers complete isolation with dedicated hardware resources

Configuration Example:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  mig:
    strategy: single
  migManager:
    enabled: true
    # MIG profiles (for example 1g.5gb or 2g.10gb) are selected per node
    # via the nvidia.com/mig.config label, as shown below
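
With the strategy in place, the MIG manager selects the actual partitioning per node from the nvidia.com/mig.config label; for example, using a profile name from the operator's default MIG configuration:

oc label node <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite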

Use Cases:

  • Production environments requiring strict isolation
  • Mixed workloads with different resource requirements
  • Bare metal deployments with MIG-enabled cards

Virtualization with vGPU

Virtual machines (VMs) can directly access a single physical GPU using NVIDIA vGPU. This capability combines the power of GPU performance with the management and security benefits provided by virtualization.

Key characteristics:

  • Creates virtual GPUs that can be shared by VMs across the enterprise
  • Provides management and monitoring for VM environments
  • Enables workload balancing for mixed VDI and compute workloads
  • Allows resource sharing across multiple VMs
  • Offers proactive management capabilities

Use Cases:

  • VM environments requiring GPU acceleration
  • OpenShift Virtualization deployments
  • Mixed VDI and compute workloads

Deployment Considerations for Different OpenShift Scenarios

When implementing GPU sharing in OpenShift, consider the following recommendations for different scenarios:

Bare Metal Deployments

For bare metal OpenShift deployments:

  • vGPU is not available
  • Consider using MIG-enabled cards (A100, A30) for hardware-level partitioning
  • If using older NVIDIA cards without MIG support, consider time-slicing
  • For maximum performance, use direct GPU assignment without sharing

Virtual Machine Deployments

For OpenShift deployments on virtual machines:

  • vGPU is the best choice for sharing GPU resources
  • Consider using separate VMs when you need both passthrough and vGPU
  • Ensure the hypervisor supports GPU passthrough or vGPU

Mixed Environments with OKD Virtualization

For bare metal with OKD Virtualization and multiple GPUs:

  • Consider using pass-through for hosted VMs
  • Use time-slicing for containers
  • Align NUMA topology for optimal performance

Implementation Guidelines

Enabling Time-Slicing

To enable time-slicing of GPUs on Kubernetes:

  1. Create a ConfigMap with the time-slicing configuration:
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: time-slicing-config
     namespace: gpu-operator
   data:
     config.yaml: |
       version: v1
       sharing:
         timeSlicing:
           resources:
           - name: nvidia.com/gpu
             replicas: 4
  2. Apply the configuration to the NVIDIA GPU Operator by referencing the ConfigMap and its key:
   apiVersion: nvidia.com/v1
   kind: ClusterPolicy
   metadata:
     name: gpu-cluster-policy
   spec:
     devicePlugin:
       config:
         name: time-slicing-config
         default: config.yaml
  3. Verify the configuration:
   oc get nodes -o json | jq '.items[].status.allocatable | select(has("nvidia.com/gpu"))'

Configuring MIG

To configure Multi-Instance GPU (MIG) on OpenShift:

  1. Ensure you have MIG-capable GPUs (A100, A30)

  2. Create a MIG strategy configuration:

   apiVersion: nvidia.com/v1
   kind: ClusterPolicy
   metadata:
     name: gpu-cluster-policy
   spec:
     mig:
       strategy: single
  3. Apply the configuration and wait for the GPU Operator to configure MIG

  4. Verify MIG instances:

   oc exec -it -n nvidia-gpu-operator nvidia-device-plugin-daemonset-xyz -- nvidia-smi mig -lgi

Setting Up vGPU

To set up vGPU in an OpenShift environment:

  1. Install the NVIDIA vGPU software on the hypervisor

  2. Configure vGPU profiles for your VMs

  3. Install the GPU Operator in the OpenShift cluster:

   helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=false
  4. Verify vGPU detection:
   oc get nodes -o json | jq '.items[].status.allocatable | select(has("nvidia.com/gpu"))'

Performance Optimization

NUMA Alignment

For optimal GPU performance, ensure NUMA alignment between GPUs, NICs, and CPU cores:

  1. Identify NUMA topology:
   oc debug node/<node-name> -- chroot /host nvidia-smi topo -m
  2. Configure pod placement to respect NUMA boundaries:
   apiVersion: v1
   kind: Pod
   metadata:
     name: gpu-numa-aligned-pod
   spec:
     containers:
     - name: gpu-container
       # ...
     nodeSelector:
       nvidia.com/gpu.present: "true"
     topologySpreadConstraints:
     - maxSkew: 1
       topologyKey: kubernetes.io/hostname
       whenUnsatisfiable: DoNotSchedule
       labelSelector:
         matchLabels:
           app: gpu-app
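
Topology-aware placement also requires the pod to run in the Guaranteed QoS class. A minimal sketch (image tag illustrative) where requests equal limits so that an enabled Topology Manager with a single-numa-node policy can co-locate CPU, memory, and the GPU:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-guaranteed-pod
spec:
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9  # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        cpu: "8"
        memory: 32Gi
        nvidia.com/gpu: 1
      requests:
        cpu: "8"
        memory: 32Gi
        nvidia.com/gpu: 1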

GPU Monitoring

Monitor GPU utilization to identify opportunities for optimization:

  1. Deploy NVIDIA DCGM-Exporter:
   oc apply -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/deployment/openshift/dcgm-exporter.yaml
  2. Configure Prometheus to scrape metrics:
   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: dcgm-exporter
     namespace: nvidia-gpu-operator
   spec:
     endpoints:
     - port: metrics
       path: /metrics
       interval: 15s
     selector:
       matchLabels:
         app: dcgm-exporter
  3. Create Grafana dashboards to visualize GPU metrics
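
As a starting point for a dashboard panel, a hedged PromQL example using a standard dcgm-exporter metric name:

# Average GPU utilization per GPU and node
avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL)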

Integration with RDMA for High-Performance Computing

GPUDirect RDMA

GPUDirect RDMA enables direct data transfer between GPU memory and network adapters, bypassing CPU and system memory. This integration is particularly valuable for high-performance computing workloads.

To configure GPUDirect RDMA:

  1. Ensure both NVIDIA GPU Operator and NVIDIA Network Operator are installed

  2. Enable GPUDirect RDMA in the GPU Operator ClusterPolicy so that the driver container loads the nvidia-peermem module:

    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec:
      driver:
        rdma:
          enabled: true
  3. Verify GPUDirect RDMA functionality:
   # On the node (or in the driver pod)
   lsmod | grep nvidia_peermem
   # In the pod, inspect GPU-to-NIC topology; a short PCIe path (PIX or PXB) between the GPU and the mlx5 device is ideal
   nvidia-smi topo -m

Performance Considerations

When using GPUDirect RDMA with GPU sharing mechanisms:

  1. MIG instances can use GPUDirect RDMA independently

  2. Time-slicing may introduce additional latency for RDMA operations

  3. vGPU requires SR-IOV network adapters for optimal RDMA performance

  4. Align NUMA topology for GPUs and network adapters to minimize PCIe traffic across NUMA nodes

Troubleshooting

Common Issues

  1. GPU Not Detected:
   oc get nodes -o json | jq '.items[].status.allocatable | select(has("nvidia.com/gpu"))'
   oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-xyz
  2. MIG Configuration Failures:
   oc logs -n nvidia-gpu-operator nvidia-mig-manager-xyz
   oc exec -it -n nvidia-gpu-operator nvidia-device-plugin-daemonset-xyz -- nvidia-smi -mig 0
  3. Time-Slicing Issues:
   oc describe configmap -n gpu-operator time-slicing-config
   oc logs -n nvidia-gpu-operator nvidia-device-plugin-daemonset-xyz

Debugging GPU Workloads

  1. Check GPU allocation:
   oc describe pod <pod-name>
   # Look for resource allocation in the Events section
  2. Verify GPU visibility in the pod:
   oc exec -it <pod-name> -- nvidia-smi
  3. Check GPU utilization:
   oc exec -it <pod-name> -- nvidia-smi dmon

Conclusion

This comprehensive guide has covered both RDMA networking and NVIDIA GPU architecture in OpenShift environments. By understanding and implementing these technologies, organizations can significantly improve the performance and efficiency of their data-intensive workloads.

RDMA Benefits and Best Practices

RDMA provides significant performance benefits for high-throughput, low-latency applications in OpenShift environments. When implementing RDMA in OpenShift:

  • Choose the appropriate RDMA configuration based on your performance and isolation requirements
  • Ensure proper network configuration for optimal performance
  • Align RDMA resources with CPU, memory, and GPU resources for best results
  • Monitor and tune your RDMA deployment to maintain peak performance

GPU Sharing Benefits and Best Practices

NVIDIA GPU architecture in OpenShift provides flexible options for sharing and utilizing GPU resources efficiently. When implementing GPU sharing in OpenShift:

  • Select the appropriate sharing mechanism based on your hardware capabilities and isolation requirements
  • Consider the deployment scenario (bare metal, VMs, or mixed) when choosing a sharing approach
  • Optimize performance through proper NUMA alignment and monitoring
  • Integrate with RDMA for high-performance computing workloads

By combining RDMA networking with GPU acceleration and implementing the appropriate sharing mechanisms, organizations can build high-performance, cost-effective OpenShift environments for AI/ML, HPC, and other data-intensive workloads.
