Kubernetes 1.35 and GPUs: Orchestrating Your ML/AI Workloads at Scale

The explosion of ML/AI workloads in production is turning Kubernetes into an indispensable GPU orchestration platform. Kubernetes 1.35, released in January 2026, brings a major change: Dynamic Resource Allocation (DRA), which finally enables intelligent GPU scheduling with MIG (Multi-Instance GPU) partitioning, time-sharing, and fractional allocation.

With 78% of ML teams using Kubernetes for training and inference (CNCF Survey 2026), mastering GPU orchestration has become a critical skill. The stakes: maximizing GPU utilization (an H100 costs $2-4/h in the cloud), reducing ML job queue times through efficient scheduling, and elastically scaling inference workloads.

This technical article walks you through a complete Kubernetes GPU setup: NVIDIA/AMD driver installation, Device Plugins, DRA with partitioning, advanced scheduling, Prometheus monitoring, and production multi-tenant architectures.

Why Kubernetes for GPU Workloads?

The Problem: Under-Utilized, Expensive GPUs

Without orchestration:

  • ML jobs run on dedicated servers → 40-60% idle time
  • No GPU sharing between teams → costly waste
  • Slow manual scaling → production bottlenecks
  • No centralized monitoring → no visibility

2026 cloud GPU costs:

GPU          Type                Performance          Cost/hour (AWS)  Cost/month (24/7)
-----------  ------------------  -------------------  ---------------  -----------------
NVIDIA A100  Training            312 TFLOPS (FP16)    $3.06            $2,200
NVIDIA H100  Training/Inference  756 TFLOPS (FP16)    $4.10            $3,000
NVIDIA L4    Inference           242 TFLOPS (FP16)    $0.88            $635
AMD MI300X   Training            1300 TFLOPS (FP16)   $3.50            $2,520

With Kubernetes + GPU scheduling:

  • GPU sharing across multiple pods (MIG, time-sharing)
  • Autoscaling based on GPU load
  • Intelligent scheduling (bin packing, GPU topology)
  • Unified monitoring (utilization, memory, temperature)
  • Secure multi-tenancy with quotas

Measured gains:

  • +60-80% GPU utilization vs dedicated servers
  • -40% in costs thanks to sharing
  • -70% in ML job queue time (efficient scheduling)
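
The utilization and cost figures above can be sanity-checked with a back-of-the-envelope model (all numbers are illustrative, taken from the table above, not a pricing reference):

```python
import math

HOURS_PER_MONTH = 730  # average hours in a month

def gpus_needed(gpu_hours: float, utilization: float) -> int:
    """GPUs required to serve `gpu_hours` of monthly compute
    at a given average utilization."""
    return math.ceil(gpu_hours / (HOURS_PER_MONTH * utilization))

A100_PRICE = 3.06  # $/hour on-demand (AWS figure from the table)
WORKLOAD = 2000    # actual GPU-hours of compute needed per month

dedicated = gpus_needed(WORKLOAD, 0.35)  # ~35% utilization on dedicated servers
shared = gpus_needed(WORKLOAD, 0.75)     # ~75% with Kubernetes GPU sharing

print(dedicated, shared)  # 8 4
print(f"${(dedicated - shared) * A100_PRICE * HOURS_PER_MONTH:,.0f}/month saved")
```

Doubling average utilization halves the fleet: here the same workload drops from 8 to 4 A100s, roughly $8,935/month at on-demand rates.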

Kubernetes 1.35: Dynamic Resource Allocation (DRA)

The killer feature of K8s 1.35:

Before (Device Plugin API, limited):

resources:
  limits:
    nvidia.com/gpu: 1  # 🔴 A whole GPU or nothing

Now (DRA, flexible; the claim parameters below are a simplified illustration):

resources:
  claims:
  - name: gpu-partition
    request: gpu.nvidia.com
    # ✅ Request a GPU fraction, a MIG slice, or a specific configuration
    parameters:
      memory: "20Gi"      # only 20GB of VRAM
      compute: "3.5"      # at least 3.5 TFLOPS
      mig-profile: "1g.5gb"  # specific MIG partition

DRA advantages:

  • Native GPU partitioning (no workarounds needed)
  • Intelligent time-sharing between pods
  • Scheduling driven by actual needs (memory, compute)
  • Multi-GPU and GPU topology support
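
For reference, the DRA API maturing in the `resource.k8s.io` group separates the claim from the pod. A minimal sketch (the API version and the `gpu.nvidia.com` device class name are assumptions; the vendor's DRA driver publishes the actual class in your cluster):

```yaml
# A standalone claim for one GPU, matched by a vendor DeviceClass
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
---
# The pod references the claim instead of an nvidia.com/gpu limit
apiVersion: v1
kind: Pod
metadata:
  name: dra-consumer
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu
  containers:
  - name: app
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu
```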

Installing and Configuring GPUs on Kubernetes

1. GPU Node Prerequisites

Install the NVIDIA drivers:

# Ubuntu 22.04/24.04
sudo apt update
sudo apt install -y nvidia-driver-550 nvidia-utils-550

# Verify
nvidia-smi

Install the NVIDIA Container Toolkit:

# Add repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install
sudo apt update
sudo apt install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Configure containerd (if used):

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

2. Deploy the NVIDIA Device Plugin

Method 1: Helm chart (recommended):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system --version 0.15.0

Time-slicing configuration:

# time-slicing-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        replicas: 5  # 5 pods can share 1 GPU
        renameByDefault: false
        failRequestsGreaterThanOne: false

Method 2: YAML manifest:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin
        args:
        - --pass-device-specs=true
        - --fail-on-init-error=false
        - --mig-strategy=mixed
        - --device-list-strategy=envvar
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Verify the installation:

kubectl get nodes -o json | jq '.items[].status.capacity'

3. MIG (Multi-Instance GPU) Configuration

On A100 and H100 GPUs, MIG partitions a single GPU into isolated instances:

Enable MIG on the node:

# On the GPU server
sudo nvidia-smi -mig 1

# Create MIG GPU instances (A100 40GB example)
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C  # 7 instances of 1g.5gb (profile ID 19)

# Verify
nvidia-smi -L

Configure the Device Plugin for MIG:

# mig-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7
      all-2g.10gb:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "2g.10gb": 3
      mixed:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "1g.5gb": 2
            "2g.10gb": 1
            "3g.20gb": 1

Label the nodes:

kubectl label nodes gpu-node-1 nvidia.com/mig.config=all-1g.5gb

GPU Scheduling: Advanced Strategies

1. Basic GPU Scheduling

apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # Requests 1 full GPU
---
# With MIG
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test-mig
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # Requests one 1g.5gb MIG partition

2. Intelligent Scheduling with Node Affinity

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Only on nodes with A100 or H100
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - NVIDIA-A100-SXM4-40GB
                - NVIDIA-H100-SXM5-80GB
              # With at least 2 GPUs
              - key: nvidia.com/gpu.count
                operator: Gt
                values: ["1"]
      containers:
      - name: trainer
        image: pytorch/pytorch:2.2.0-cuda12.1
        resources:
          limits:
            nvidia.com/gpu: 2  # 2 GPUs for distributed training
        env:
        - name: NCCL_DEBUG
          value: "INFO"

3. Topology-Aware Scheduling (Multi-GPU)

For distributed training, keep the GPUs on a single node with an optimal NVLink topology:

apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-training
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 8  # All GPUs of a DGX node
    env:
    - name: NCCL_SOCKET_IFNAME
      value: "eth0"
    - name: NCCL_DEBUG
      value: "INFO"
  # A pod's 8 GPUs always come from a single node (NVLink); the anti-affinity
  # below additionally keeps concurrent replicas of this job on separate nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: job-name
            operator: In
            values: [multi-gpu-training]
        topologyKey: kubernetes.io/hostname

4. Priority Classes for ML Jobs

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-high-priority
value: 1000000
globalDefault: false
description: "High priority for critical ML workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-batch-priority
value: 100
globalDefault: false
description: "Low priority for batch training jobs"
---
# Usage in a pod
apiVersion: v1
kind: Pod
metadata:
  name: critical-inference
spec:
  priorityClassName: ml-high-priority  # Preempts batch jobs if needed
  containers:
  - name: inference
    image: mymodel:latest
    resources:
      limits:
        nvidia.com/gpu: 1

ML/AI Workloads on Kubernetes

1. Training Job with PyTorch

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
  namespace: ml-training
spec:
  parallelism: 1
  completions: 1
  backoffLimit: 3
  template:
    metadata:
      labels:
        app: pytorch-training
        job-name: pytorch-training
    spec:
      restartPolicy: OnFailure
      containers:
      - name: pytorch
        image: pytorch/pytorch:2.2.0-cuda12.1
        command:
        - python
        - train.py
        - --epochs=100
        - --batch-size=64
        - --lr=0.001
        resources:
          limits:
            nvidia.com/gpu: 2
            memory: "32Gi"
            cpu: "16"
          requests:
            nvidia.com/gpu: 2
            memory: "16Gi"
            cpu: "8"
        env:
        - name: NCCL_DEBUG
          value: "WARN"
        - name: MASTER_ADDR
          value: "localhost"
        - name: MASTER_PORT
          value: "29500"
        volumeMounts:
        - name: training-data
          mountPath: /data
          readOnly: true
        - name: model-output
          mountPath: /output
      volumes:
      - name: training-data
        persistentVolumeClaim:
          claimName: training-dataset-pvc
      - name: model-output
        persistentVolumeClaim:
          claimName: model-artifacts-pvc

2. Inference with TensorFlow Serving

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving-gpu
  namespace: ml-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
      - name: tensorflow-serving
        image: tensorflow/serving:2.15.0-gpu
        ports:
        - containerPort: 8501
          name: http
        - containerPort: 8500
          name: grpc
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "4"
          requests:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "2"
        env:
        - name: MODEL_NAME
          value: "resnet50"
        - name: TF_CUDA_COMPUTE_CAPABILITIES
          value: "7.0,8.0"
        volumeMounts:
        - name: model-volume
          mountPath: /models
        livenessProbe:
          httpGet:
            path: /v1/models/resnet50
            port: 8501
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /v1/models/resnet50
            port: 8501
          initialDelaySeconds: 20
          periodSeconds: 5
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-storage-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: tf-serving-service
  namespace: ml-inference
spec:
  type: LoadBalancer
  ports:
  - port: 8501
    targetPort: 8501
    protocol: TCP
    name: http
  selector:
    app: tf-serving

3. Autoscaling GPU Workloads (HPA + KEDA)

# HPA driven by a custom GPU metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
  namespace: ml-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-serving-gpu
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: nvidia_gpu_duty_cycle  # Custom Prometheus metric
      target:
        type: AverageValue
        averageValue: "80"  # Scale out when GPU > 80%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
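
The heading also mentions KEDA; with the KEDA operator installed, the same Deployment can be scaled directly from the DCGM metrics in Prometheus. A sketch (the Prometheus address and the pod-matching label in the query are assumptions about your setup):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-gpu-scaler
  namespace: ml-inference
spec:
  scaleTargetRef:
    name: tf-serving-gpu        # the Deployment defined above
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: avg(DCGM_FI_DEV_GPU_UTIL{exported_pod=~"tf-serving-gpu.*"})
      threshold: "80"           # target average GPU utilization (%)
```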

GPU Monitoring with Prometheus

1. Install the NVIDIA DCGM Exporter

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      labels:
        app: nvidia-dcgm-exporter
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
        ports:
        - containerPort: 9400
          name: metrics
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          privileged: true
        volumeMounts:
        - name: pod-gpu-resources
          mountPath: /var/lib/kubelet/pod-resources
          readOnly: true
      volumes:
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: Service
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
  labels:
    app: nvidia-dcgm-exporter
spec:
  ports:
  - port: 9400
    targetPort: 9400
    protocol: TCP
    name: metrics
  selector:
    app: nvidia-dcgm-exporter

2. Prometheus Configuration

# ServiceMonitor for the Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics

3. Key GPU Metrics

# GPU utilization (%)
DCGM_FI_DEV_GPU_UTIL

# GPU memory bandwidth (copy engine) utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL

# Temperature (°C)
DCGM_FI_DEV_GPU_TEMP

# Power usage (Watts)
DCGM_FI_DEV_POWER_USAGE

# Framebuffer memory used (MiB)
DCGM_FI_DEV_FB_USED

# Useful PromQL queries
# Average GPU utilization per node
avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname)

# Under-utilized GPUs (<30%)
DCGM_FI_DEV_GPU_UTIL < 30

# Top 5 GPU-consuming pods
topk(5, sum(DCGM_FI_DEV_GPU_UTIL) by (pod))
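
These queries translate naturally into alerts; a sketch using the Prometheus Operator's PrometheusRule CRD (the thresholds are starting points to tune, not recommendations):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
spec:
  groups:
  - name: gpu
    rules:
    - alert: GPUHighTemperature
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU above 85°C on {{ $labels.Hostname }}"
    - alert: GPUUnderUtilized
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 30
      for: 2h
      labels:
        severity: info
      annotations:
        summary: "GPU under 30% for 2h on {{ $labels.Hostname }}, a sharing candidate"
```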

4. Grafana Dashboard

{
  "dashboard": {
    "title": "Kubernetes GPU Monitoring",
    "panels": [
      {
        "title": "GPU Utilization by Node",
        "targets": [
          {
            "expr": "avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname)"
          }
        ]
      },
      {
        "title": "GPU Memory Usage",
        "targets": [
          {
            "expr": "DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100"
          }
        ]
      },
      {
        "title": "GPU Temperature",
        "targets": [
          {
            "expr": "DCGM_FI_DEV_GPU_TEMP"
          }
        ]
      }
    ]
  }
}

Multi-Tenancy and GPU Isolation

1. Resource Quotas per Namespace

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-team-alpha
spec:
  hard:
    requests.nvidia.com/gpu: "10"  # Max 10 concurrent GPUs (quotas on extended resources only support the requests. prefix)
    requests.memory: "200Gi"
    limits.memory: "400Gi"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: ml-limits
  namespace: ml-team-alpha
spec:
  limits:
  - max:
      nvidia.com/gpu: "4"  # Max 4 GPUs per container
      memory: "64Gi"
    min:
      nvidia.com/gpu: "1"
      memory: "1Gi"
    type: Container

2. Network Policies (Isolating GPU Pods)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gpu-pod-isolation
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      gpu-workload: "true"
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: monitoring
    ports:
    - protocol: TCP
      port: 9400  # Metrics only
  egress:
  - to:
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 6379  # Redis cache
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53  # DNS (UDP and TCP)
    - protocol: TCP
      port: 53

Optimizing GPU Costs

1. Spot Instances with Node Autoscaling

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot-provisioner
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]  # Spot instances (about 70% cheaper)
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - p3.2xlarge    # V100
    - g5.xlarge     # A10G
    - g5.2xlarge    # A10G
  - key: kubernetes.io/arch
    operator: In
    values: ["amd64"]
  - key: nvidia.com/gpu
    operator: Exists
  limits:
    resources:
      nvidia.com/gpu: 50  # Max 50 spot GPUs
  ttlSecondsAfterEmpty: 30  # Terminate nodes 30s after they become empty
  ttlSecondsUntilExpired: 604800  # 7 days max
  providerRef:
    name: default
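
Spot capacity can disappear with little warning, so batch jobs pointed at these nodes should tolerate interruption. A sketch combining the `ml-batch-priority` class defined earlier with checkpoint resume (the `checkpoints-pvc` claim and the `--resume-from` flag are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-batch-training
spec:
  backoffLimit: 10                         # re-run after spot interruptions
  template:
    spec:
      restartPolicy: OnFailure
      priorityClassName: ml-batch-priority # preemptible, defined earlier
      nodeSelector:
        karpenter.sh/capacity-type: spot   # node label set by Karpenter
      containers:
      - name: trainer
        image: pytorch/pytorch:2.2.0-cuda12.1
        command: ["python", "train.py", "--resume-from=/ckpt/latest"]
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ckpt
          mountPath: /ckpt
      volumes:
      - name: ckpt
        persistentVolumeClaim:
          claimName: checkpoints-pvc       # hypothetical PVC for checkpoints
```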

2. GPU Time-Sharing for Inference

# Enable time-sharing (5 pods / GPU)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        replicas: 5
---
# Lightweight inference pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lightweight-inference
spec:
  replicas: 10  # 10 pods on 2 physical GPUs
  selector:
    matchLabels:
      app: lightweight-inference
  template:
    metadata:
      labels:
        app: lightweight-inference
    spec:
      containers:
      - name: inference
        image: lightweight-model:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Shared (time-sliced) GPU

3. Cost Monitoring

# Python script to estimate GPU costs per namespace
import prometheus_api_client
from datetime import datetime, timedelta

prom = prometheus_api_client.PrometheusConnect(url="http://prometheus:9090")

# Query: GPU-hours consumed per namespace (metric name depends on your exporter)
query = '''
sum(
  rate(container_gpu_allocation[1h]) * 3600
) by (namespace)
'''

result = prom.custom_query(query=query)

# GPU price per hour (A100 = $3.06/h)
GPU_PRICE = 3.06

for metric in result:
    namespace = metric['metric']['namespace']
    gpu_hours = float(metric['value'][1])
    cost = gpu_hours * GPU_PRICE
    print(f"{namespace}: {gpu_hours:.2f} GPU-hours → ${cost:.2f}")

Production Multi-GPU Architecture

# Kubernetes cluster with heterogeneous GPU pools
# (labels shown for illustration; normally set by GPU Feature Discovery)
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-training-1
  labels:
    node-role: gpu-training
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
    nvidia.com/gpu.count: "8"
    topology.kubernetes.io/zone: us-east-1a
---
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-inference-1
  labels:
    node-role: gpu-inference
    nvidia.com/gpu.product: NVIDIA-L4
    nvidia.com/gpu.count: "4"
    topology.kubernetes.io/zone: us-east-1b
---
# Distributed training job on the training nodes
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  template:
    spec:
      nodeSelector:
        node-role: gpu-training  # Pin to the training nodes
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 8  # 8x A100
---
# Inference deployment on the inference nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 12
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      nodeSelector:
        node-role: gpu-inference  # Pin to the inference nodes
      containers:
      - name: server
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 1  # 1 L4 per pod
Conclusion: Kubernetes GPU in 2026

Kubernetes 1.35 with Dynamic Resource Allocation marks a turning point for ML/AI GPU orchestration. Organizations can finally maximize GPU utilization (60-80% vs 30-40% without orchestration), cut costs by 40% through intelligent sharing, and elastically scale inference workloads.

Key takeaways:

  • DRA enables flexible GPU partitioning (MIG, time-sharing)
  • Intelligent, topology-aware scheduling
  • Unified monitoring with Prometheus + DCGM Exporter
  • Secure multi-tenancy with quotas
  • Autoscaling driven by custom GPU metrics

Recommended production architecture:

  • Training nodes: A100/H100 (8 GPUs, NVLink)
  • Inference nodes: L4/T4 (4 GPUs, cost-optimized)
  • Time-sharing on inference nodes (5 pods/GPU)
  • Spot instances for batch training (about 70% cheaper)
  • Prometheus + Grafana monitoring

The next frontier: multi-cluster GPU orchestration with Karmada/Submariner for geographically distributed workloads. To dig deeper into the Kubernetes ML ecosystem, see our guides on Kubernetes for developers (the fundamentals), OpenTelemetry observability (advanced monitoring), and Edge AI in production (deploying models at the edge with lightweight Kubernetes).

Kubernetes 1.35 + DRA + GPU Operator = the ultimate ML/AI platform for 2026.