Kubernetes 1.35 and GPUs: Orchestrating Your ML/AI Workloads at Scale
The explosion of ML/AI workloads in production is turning Kubernetes into an indispensable GPU orchestration platform. Kubernetes 1.35, released in January 2026, brings a major shift: Dynamic Resource Allocation (DRA), finally enabling intelligent GPU scheduling with MIG (Multi-Instance GPU) partitioning, time-sharing, and fractional allocation.
With 78% of ML teams using Kubernetes for training and inference (CNCF Survey 2026), mastering GPU orchestration has become a critical skill. The stakes: maximizing GPU utilization (H100 cost: $2-4/h in the cloud), reducing ML job wait times (efficient scheduling), and elastically scaling inference workloads.
This technical article walks you through a complete Kubernetes GPU setup: NVIDIA/AMD driver installation, Device Plugins, DRA with partitioning, advanced scheduling, Prometheus monitoring, and production multi-tenant architectures.
Why Kubernetes for GPU Workloads?
The Problem: Underutilized, Expensive GPUs
Without orchestration:
- ML jobs run on dedicated servers → 40-60% idle time
- No GPU sharing between teams → costly waste
- Slow manual scaling → production bottlenecks
- No centralized monitoring → no visibility
Cloud GPU costs in 2026:
| GPU | Type | Performance | Cost/hour (AWS) | Cost/month (24/7) |
|-----|------|-------------|-----------------|-------------------|
| NVIDIA A100 | Training | 312 TFLOPS (FP16) | $3.06 | $2,200 |
| NVIDIA H100 | Training/Inference | 756 TFLOPS (FP16) | $4.10 | $3,000 |
| NVIDIA L4 | Inference | 242 TFLOPS (FP16) | $0.88 | $635 |
| AMD MI300X | Training | 1300 TFLOPS (FP16) | $3.50 | $2,520 |
With Kubernetes + GPU scheduling:
- ✅ GPU sharing across multiple pods (MIG, time-sharing)
- ✅ Autoscaling driven by GPU load
- ✅ Intelligent scheduling (bin packing, GPU topology)
- ✅ Unified monitoring (utilization, memory, temperature)
- ✅ Secure multi-tenancy with quotas
Measured gains:
- +60-80% GPU utilization vs dedicated servers
- -40% cost thanks to sharing
- -70% ML job wait time (efficient scheduling)
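The arithmetic behind these numbers is straightforward. A rough sketch — the $3.06/h A100 rate comes from the table above, the utilization figures from the bullets; everything else is plain arithmetic:

```python
# Back-of-envelope: effect of GPU utilization on the cost of one hour of
# *useful* GPU work. Rates and utilization figures are the article's.
HOURLY_RATE = 3.06  # A100 on-demand, $/h

def cost_per_useful_hour(hourly_rate: float, utilization: float) -> float:
    """Effective price of one hour of actual GPU work."""
    return hourly_rate / utilization

dedicated = cost_per_useful_hour(HOURLY_RATE, 0.40)     # ~40% busy, dedicated server
orchestrated = cost_per_useful_hour(HOURLY_RATE, 0.80)  # ~80% busy with K8s scheduling

print(f"dedicated:    ${dedicated:.2f} per useful GPU-hour")
print(f"orchestrated: ${orchestrated:.2f} per useful GPU-hour")
print(f"savings:      {1 - orchestrated / dedicated:.0%}")
```

Doubling utilization halves the effective price of GPU work, which is where the "-40%" class of savings comes from once real-world overheads are factored in.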
Kubernetes 1.35: Dynamic Resource Allocation (DRA)
The killer feature of K8s 1.35:
Before (Device Plugin API - limited):
resources:
  limits:
    nvidia.com/gpu: 1  # 🔴 A whole GPU or nothing
Now (DRA - flexible; simplified, illustrative syntax):
resources:
  claims:
  - name: gpu-partition
    request: gpu.nvidia.com
    # ✅ Request a GPU fraction, a MIG slice, or a specific configuration
    parameters:
      memory: "20Gi"         # Only 20GB of VRAM
      compute: "3.5"         # At least 3.5 TFLOPS
      mig-profile: "1g.5gb"  # A specific MIG partition
DRA advantages:
- Native GPU partitioning (no workarounds needed)
- Intelligent time-sharing between pods
- Scheduling based on actual needs (memory, compute)
- Multi-GPU and GPU topology support
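To make the scheduling idea concrete, here is a hypothetical sketch of the packing decision a DRA-aware scheduler has to make: fitting fractional claims (VRAM, compute) onto physical GPUs. The class and function names are invented for illustration; this is not a real Kubernetes API.

```python
# Hypothetical first-fit packing of fractional GPU claims. GPU specs and
# claim shapes are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Gpu:
    name: str
    free_mem_gib: float
    free_tflops: float
    claims: list = field(default_factory=list)

def first_fit(gpus, claim_name, mem_gib, tflops):
    """Assign a claim to the first GPU with enough free memory and compute."""
    for gpu in gpus:
        if gpu.free_mem_gib >= mem_gib and gpu.free_tflops >= tflops:
            gpu.free_mem_gib -= mem_gib
            gpu.free_tflops -= tflops
            gpu.claims.append(claim_name)
            return gpu.name
    return None  # unschedulable: no GPU satisfies the claim

gpus = [Gpu("a100-0", 40.0, 312.0), Gpu("a100-1", 40.0, 312.0)]
print(first_fit(gpus, "llm-inference", mem_gib=20, tflops=3.5))  # a100-0
print(first_fit(gpus, "embedding", mem_gib=20, tflops=3.5))      # a100-0 (shares it)
print(first_fit(gpus, "training", mem_gib=40, tflops=100))       # a100-1
```

With the Device Plugin API, all three claims would each have consumed a whole GPU; with resource-aware claims, two workloads share the first card.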
Installing and Configuring GPUs on Kubernetes
1. GPU Node Prerequisites
Install the NVIDIA drivers:
# Ubuntu 22.04/24.04
sudo apt update
sudo apt install -y nvidia-driver-550 nvidia-utils-550
# Verify
nvidia-smi
Install the NVIDIA Container Toolkit:
# Add repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install
sudo apt update
sudo apt install -y nvidia-container-toolkit
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Configure containerd (if used):
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
2. Deploy the NVIDIA Device Plugin
Method 1: Helm chart (recommended):
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system --version 0.15.0
Time-slicing configuration:
# time-slicing-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 5  # 5 pods can share 1 GPU
Method 2: YAML manifest:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin
        args:
        - --pass-device-specs=true
        - --fail-on-init-error=false
        - --mig-strategy=mixed
        - --device-list-strategy=envvar
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
Verify the installation:
kubectl get nodes -o json | jq '.items[].status.capacity'
3. MIG (Multi-Instance GPU) Configuration
On A100 and H100 GPUs, MIG partitions a single GPU into fully isolated instances:
Enable MIG on the node:
# On the GPU server
sudo nvidia-smi -mig 1
# Create MIG partitions (example: A100 40GB; profile ID 19 = 1g.5gb)
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C  # 7 instances of 1g.5gb
# Verify
nvidia-smi -L
Configure the Device Plugin for MIG:
# mig-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
      - devices: [0]
        mig-enabled: true
        mig-devices:
          "1g.5gb": 7
      all-2g.10gb:
      - devices: [0]
        mig-enabled: true
        mig-devices:
          "2g.10gb": 3
      mixed:
      - devices: [0]
        mig-enabled: true
        mig-devices:
          "1g.5gb": 2
          "2g.10gb": 1
          "3g.20gb": 1
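A quick way to sanity-check such a layout: by NVIDIA's profile naming convention, the leading digit of a profile (1g.5gb, 2g.10gb, ...) is the number of compute slices it consumes, and an A100 exposes 7 of them. A small checker, assuming only that convention:

```python
# Sanity-check a MIG layout against the A100's 7 compute slices.
# The leading digit of each profile name is its compute-slice count.
A100_COMPUTE_SLICES = 7

def slices_used(mig_devices: dict) -> int:
    return sum(int(profile.split("g")[0]) * count
               for profile, count in mig_devices.items())

def fits(mig_devices: dict) -> bool:
    return slices_used(mig_devices) <= A100_COMPUTE_SLICES

print(fits({"1g.5gb": 7}))                              # True: 7 slices
print(fits({"2g.10gb": 3}))                             # True: 6 slices
print(fits({"1g.5gb": 2, "2g.10gb": 1, "3g.20gb": 1}))  # True: 2+2+3 = 7
print(fits({"3g.20gb": 3}))                             # False: 9 > 7
```

All three configs in the ConfigMap above pass this check; memory-slice accounting (8 slices of ~5GB on a 40GB A100) adds a second constraint not modeled here.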
Label the nodes:
kubectl label nodes gpu-node-1 nvidia.com/mig.config=all-1g.5gb
GPU Scheduling: Advanced Strategies
1. Basic GPU Scheduling
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # Requests 1 full GPU
---
# With MIG
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test-mig
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # Requests 1 MIG 1g.5gb partition
2. Intelligent Scheduling with Node Affinity
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  template:
    spec:
      restartPolicy: OnFailure
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Only on nodes with an A100 or H100
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - NVIDIA-A100-SXM4-40GB
                - NVIDIA-H100-SXM5-80GB
              # With at least 2 GPUs
              - key: nvidia.com/gpu.count
                operator: Gt
                values: ["1"]
      containers:
      - name: trainer
        image: pytorch/pytorch:2.2.0-cuda12.1
        resources:
          limits:
            nvidia.com/gpu: 2  # 2 GPUs for distributed training
        env:
        - name: NCCL_DEBUG
          value: "INFO"
3. Topology-Aware Scheduling (Multi-GPU)
For distributed training, place GPUs on the same node with optimal NVLink connectivity:
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-training
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 8  # All the GPUs of a DGX node
    env:
    - name: NCCL_SOCKET_IFNAME
      value: "eth0"
    - name: NCCL_DEBUG
      value: "INFO"
  # Spread pods of this job across nodes so each pod owns a full node
  # (NVLink traffic stays intra-node)
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: job-name
            operator: In
            values: [multi-gpu-training]
        topologyKey: kubernetes.io/hostname
4. Priority Classes for ML Jobs
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-high-priority
value: 1000000
globalDefault: false
description: "High priority for critical ML workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-batch-priority
value: 100
globalDefault: false
description: "Low priority for batch training jobs"
---
# Usage in a pod
apiVersion: v1
kind: Pod
metadata:
  name: critical-inference
spec:
  priorityClassName: ml-high-priority  # Preempts batch jobs if needed
  containers:
  - name: inference
    image: mymodel:latest
    resources:
      limits:
        nvidia.com/gpu: 1
ML/AI Workloads on Kubernetes
1. Training Job with PyTorch
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
  namespace: ml-training
spec:
  parallelism: 1
  completions: 1
  backoffLimit: 3
  template:
    metadata:
      labels:
        app: pytorch-training
        job-name: pytorch-training
    spec:
      restartPolicy: OnFailure
      containers:
      - name: pytorch
        image: pytorch/pytorch:2.2.0-cuda12.1
        command:
        - python
        - train.py
        - --epochs=100
        - --batch-size=64
        - --lr=0.001
        resources:
          limits:
            nvidia.com/gpu: 2
            memory: "32Gi"
            cpu: "16"
          requests:
            nvidia.com/gpu: 2
            memory: "16Gi"
            cpu: "8"
        env:
        - name: NCCL_DEBUG
          value: "WARN"
        - name: MASTER_ADDR
          value: "localhost"
        - name: MASTER_PORT
          value: "29500"
        volumeMounts:
        - name: training-data
          mountPath: /data
          readOnly: true
        - name: model-output
          mountPath: /output
      volumes:
      - name: training-data
        persistentVolumeClaim:
          claimName: training-dataset-pvc
      - name: model-output
        persistentVolumeClaim:
          claimName: model-artifacts-pvc
2. Inference with TensorFlow Serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving-gpu
  namespace: ml-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
      - name: tensorflow-serving
        image: tensorflow/serving:2.15.0-gpu
        ports:
        - containerPort: 8501
          name: http
        - containerPort: 8500
          name: grpc
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "4"
          requests:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "2"
        env:
        - name: MODEL_NAME
          value: "resnet50"
        - name: TF_CUDA_COMPUTE_CAPABILITIES
          value: "7.0,8.0"
        volumeMounts:
        - name: model-volume
          mountPath: /models
        livenessProbe:
          httpGet:
            path: /v1/models/resnet50
            port: 8501
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /v1/models/resnet50
            port: 8501
          initialDelaySeconds: 20
          periodSeconds: 5
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-storage-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: tf-serving-service
  namespace: ml-inference
spec:
  type: LoadBalancer
  ports:
  - port: 8501
    targetPort: 8501
    protocol: TCP
    name: http
  selector:
    app: tf-serving
3. Autoscaling GPU Workloads (HPA + KEDA)
# HPA based on custom GPU metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
  namespace: ml-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-serving-gpu
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: nvidia_gpu_duty_cycle  # Custom Prometheus metric
      target:
        type: AverageValue
        averageValue: "80"  # Scale out when GPU > 80%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
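The replica count the HPA converges to follows the formula documented for the autoscaler, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. Applied to the duty-cycle target of 80 above:

```python
# The HPA's core formula (from the Kubernetes docs), with the min/max
# clamping from the manifest above (minReplicas: 2, maxReplicas: 20).
import math

def hpa_desired(current_replicas, current_metric, target_metric,
                min_replicas=2, max_replicas=20):
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

print(hpa_desired(4, current_metric=95, target_metric=80))    # 5: GPUs are hot, scale out
print(hpa_desired(4, current_metric=40, target_metric=80))    # 2: scale in, floored at minReplicas
print(hpa_desired(18, current_metric=160, target_metric=80))  # 20: capped at maxReplicas
```

The scaleUp/scaleDown policies then rate-limit how fast the controller may move toward this desired count.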
GPU Monitoring with Prometheus
1. Install the NVIDIA DCGM Exporter
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      labels:
        app: nvidia-dcgm-exporter
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
        ports:
        - containerPort: 9400
          name: metrics
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          privileged: true
        volumeMounts:
        - name: pod-gpu-resources
          mountPath: /var/lib/kubelet/pod-resources
          readOnly: true
      volumes:
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: Service
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
  labels:
    app: nvidia-dcgm-exporter
spec:
  ports:
  - port: 9400
    targetPort: 9400
    protocol: TCP
    name: metrics
  selector:
    app: nvidia-dcgm-exporter
2. Prometheus Configuration
# ServiceMonitor for the Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
3. Key GPU Metrics
# GPU utilization (%)
DCGM_FI_DEV_GPU_UTIL
# GPU memory copy-engine utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL
# Temperature (°C)
DCGM_FI_DEV_GPU_TEMP
# Power usage (Watts)
DCGM_FI_DEV_POWER_USAGE
# Framebuffer memory used (MB)
DCGM_FI_DEV_FB_USED
# Useful PromQL queries
# Average GPU utilization per node
avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname)
# Underutilized GPUs (<30%)
DCGM_FI_DEV_GPU_UTIL < 30
# Top 5 pods by GPU consumption
topk(5, sum(DCGM_FI_DEV_GPU_UTIL) by (pod))
4. Grafana Dashboard
{
  "dashboard": {
    "title": "Kubernetes GPU Monitoring",
    "panels": [
      {
        "title": "GPU Utilization by Node",
        "targets": [
          { "expr": "avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname)" }
        ]
      },
      {
        "title": "GPU Memory Usage",
        "targets": [
          { "expr": "DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100" }
        ]
      },
      {
        "title": "GPU Temperature",
        "targets": [
          { "expr": "DCGM_FI_DEV_GPU_TEMP" }
        ]
      }
    ]
  }
}
Multi-Tenancy and GPU Isolation
1. Resource Quotas per Namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-team-alpha
spec:
  hard:
    requests.nvidia.com/gpu: "10"  # Max 10 GPUs in use at once
    limits.nvidia.com/gpu: "10"
    requests.memory: "200Gi"
    limits.memory: "400Gi"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: ml-limits
  namespace: ml-team-alpha
spec:
  limits:
  - max:
      nvidia.com/gpu: "4"  # Max 4 GPUs per container
      memory: "64Gi"
    min:
      nvidia.com/gpu: "1"
      memory: "1Gi"
    type: Container
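The quota check itself reduces to simple accounting at admission time. A minimal sketch of that decision (function and variable names are illustrative, not a Kubernetes API):

```python
# Sketch of the check a ResourceQuota performs at admission for
# requests.nvidia.com/gpu: would this pod exceed the namespace's hard cap?
def admit_pod(used_gpus: int, pod_gpu_request: int, hard_limit: int = 10) -> bool:
    """True if the pod fits under the namespace's GPU quota."""
    return used_gpus + pod_gpu_request <= hard_limit

print(admit_pod(used_gpus=8, pod_gpu_request=2))  # True: lands exactly at the cap
print(admit_pod(used_gpus=8, pod_gpu_request=4))  # False: 12 > 10, pod is rejected
```

A rejected pod fails at creation time with a `exceeded quota` error rather than pending forever, which makes per-team GPU budgets enforceable.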
2. Network Policies (GPU Pod Isolation)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gpu-pod-isolation
  namespace: ml-training
spec:
  podSelector:
    matchLabels:
      gpu-workload: "true"
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: monitoring
    ports:
    - protocol: TCP
      port: 9400  # Metrics only
  egress:
  - to:
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 6379  # Redis cache
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 53  # DNS
    - protocol: UDP
      port: 53  # DNS (most resolvers use UDP)
GPU Cost Optimization
1. Spot Instances with Node Autoscaling
# Note: karpenter.sh/v1alpha5 Provisioner is the legacy API; newer
# Karpenter releases use NodePool instead.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot-provisioner
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]  # Spot instances (-70% cost)
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - p3.2xlarge   # V100
    - g5.xlarge    # A10G
    - g5.2xlarge   # A10G
  - key: kubernetes.io/arch
    operator: In
    values: ["amd64"]
  - key: nvidia.com/gpu
    operator: Exists
  limits:
    resources:
      nvidia.com/gpu: 50  # Max 50 spot GPUs
  ttlSecondsAfterEmpty: 30        # Terminated 30s after the node empties
  ttlSecondsUntilExpired: 604800  # 7 days max
  providerRef:
    name: default
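What the spot discount means for a batch-training pool, using the article's figures (A100 at $3.06/h, spot ≈ -70%) — pure arithmetic, assuming 730 hours per month:

```python
# Monthly cost of a 10-GPU batch-training pool, on-demand vs. spot.
HOURS_PER_MONTH = 730
ON_DEMAND_RATE = 3.06   # A100, $/h (article's figure)
SPOT_DISCOUNT = 0.70    # ≈ -70% vs on-demand

def monthly_cost(gpu_count: int, rate: float, hours: float = HOURS_PER_MONTH) -> float:
    return gpu_count * rate * hours

on_demand = monthly_cost(10, ON_DEMAND_RATE)
spot = monthly_cost(10, ON_DEMAND_RATE * (1 - SPOT_DISCOUNT))
print(f"on-demand: ${on_demand:,.0f}/month")
print(f"spot:      ${spot:,.0f}/month")
```

The trade-off is interruption risk, which is why spot pools suit checkpointed batch training rather than latency-sensitive inference.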
2. GPU Time-Sharing for Inference
# Enable time-sharing (5 pods / GPU)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 5
---
# Lightweight inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lightweight-inference
spec:
  replicas: 10  # 10 pods on 2 physical GPUs
  selector:
    matchLabels:
      app: lightweight-inference
  template:
    metadata:
      labels:
        app: lightweight-inference
    spec:
      containers:
      - name: inference
        image: lightweight-model:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # Shared GPU slice
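Capacity check for this setup: with `replicas: 5` in the time-slicing config, each physical GPU is advertised to the scheduler as 5 allocatable `nvidia.com/gpu` resources:

```python
# Time-slicing capacity: advertised GPU slots vs. desired Deployment replicas.
def schedulable_pods(physical_gpus: int, time_slice_replicas: int) -> int:
    return physical_gpus * time_slice_replicas

capacity = schedulable_pods(physical_gpus=2, time_slice_replicas=5)
desired = 10  # replicas in the Deployment above
print(capacity, "slots;", "fits" if desired <= capacity else "does not fit")
```

Unlike MIG, time-slicing provides no memory isolation: all 5 pods see the GPU's full VRAM, so it only suits small models that coexist comfortably.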
3. Cost Monitoring
# Python script to estimate GPU costs per namespace
import prometheus_api_client

prom = prometheus_api_client.PrometheusConnect(url="http://prometheus:9090")

# Query: GPU-hours consumed per namespace
query = '''
sum(
  rate(container_gpu_allocation[1h]) * 3600
) by (namespace)
'''
result = prom.custom_query(query=query)

# GPU price per hour (A100 = $3.06/h)
GPU_PRICE = 3.06

for metric in result:
    namespace = metric['metric']['namespace']
    gpu_hours = float(metric['value'][1])
    cost = gpu_hours * GPU_PRICE
    print(f"{namespace}: {gpu_hours:.2f} GPU-hours → ${cost:.2f}")
Production Multi-GPU Architecture
# Kubernetes cluster with heterogeneous GPU pools
# (Node objects shown for illustration; in practice these labels are set
#  by GPU Feature Discovery or with `kubectl label node`)
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-training-1
  labels:
    node-role: gpu-training
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
    nvidia.com/gpu.count: "8"
    topology.kubernetes.io/zone: us-east-1a
---
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-inference-1
  labels:
    node-role: gpu-inference
    nvidia.com/gpu.product: NVIDIA-L4
    nvidia.com/gpu.count: "4"
    topology.kubernetes.io/zone: us-east-1b
---
# Distributed training Job pinned to the training nodes
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        node-role: gpu-training  # Pin to training nodes
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 8  # 8x A100
---
# Inference Deployment pinned to the inference nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 12
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      nodeSelector:
        node-role: gpu-inference  # Pin to inference nodes
      containers:
      - name: server
        image: triton-inference-server:latest
        resources:
          limits:
            nvidia.com/gpu: 1  # 1 L4 per pod
Conclusion: Kubernetes GPUs in 2026
Kubernetes 1.35 with Dynamic Resource Allocation marks a turning point for ML/AI GPU orchestration. Organizations can finally maximize GPU utilization (60-80% vs 30-40% without orchestration), cut costs by 40% through intelligent sharing, and elastically scale inference workloads.
Key takeaways:
- ✅ DRA enables flexible GPU partitioning (MIG, time-sharing)
- ✅ Intelligent scheduling with topology awareness
- ✅ Unified monitoring with Prometheus + DCGM Exporter
- ✅ Secure multi-tenancy with quotas
- ✅ Autoscaling driven by custom GPU metrics
Recommended production architecture:
- Training nodes: A100/H100 (8 GPUs, NVLink)
- Inference nodes: L4/T4 (4 GPUs, cost-optimized)
- Time-sharing on inference nodes (5 pods/GPU)
- Spot instances for batch training (-70% cost)
- Prometheus + Grafana monitoring
The next frontier: multi-cluster GPU orchestration with Karmada/Submariner for geographically distributed workloads. To dig deeper into the Kubernetes ML ecosystem, see our guides on Kubernetes for developers for the fundamentals, OpenTelemetry observability for advanced monitoring, and Edge AI in production for deploying models at the edge with lightweight Kubernetes.
Kubernetes 1.35 + DRA + GPU Operator = the ultimate ML/AI platform for 2026.