Observability in production: logs, metrics, and traces with OpenTelemetry
A practical OpenTelemetry guide: instrumenting logs, metrics, and traces, plus an observability architecture with Prometheus, Grafana, and Jaeger for production.
Your application is deployed in production. But do you really know what is happening inside it? Observability goes beyond simple monitoring: it lets you understand the internal behavior of your system from its outputs. OpenTelemetry standardizes how that data is collected.
The three pillars of observability
Observability rests on three complementary types of data:
| Pillar | Question | Example |
|---|---|---|
| Logs | What happened? | "User 123 logged in at 10:30" |
| Metrics | How many? How fast? | "95th percentile latency: 45ms" |
| Traces | How did the request flow through the system? | "Request → API → DB → Cache → Response" |
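The "95th percentile latency" in the table is simply a quantile over request durations. A plain-Java nearest-rank sketch to build intuition (illustrative only — Prometheus's `histogram_quantile` estimates the same thing by interpolating within histogram buckets):

```java
import java.util.Arrays;

public class PercentileDemo {
    // Nearest-rank percentile: the smallest sample value such that at least
    // a fraction q of all observations are <= it.
    static double percentile(double[] latenciesMs, double q) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] sample = {12, 15, 18, 20, 22, 25, 30, 35, 45, 250}; // one slow outlier
        System.out.println("p50 = " + percentile(sample, 0.50)); // p50 = 22.0
        System.out.println("p95 = " + percentile(sample, 0.95)); // p95 = 250.0
    }
}
```

Note how a single outlier dominates the tail: that is why p95/p99 are better signals for user-facing latency than averages, which hide it.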
Why OpenTelemetry?
OpenTelemetry (OTel) is the CNCF standard for instrumentation. Its advantages:
- Vendor-neutral: switch backends without touching your code
- Auto-instrumentation: popular frameworks are instrumented automatically
- Correlation: tie logs, metrics, and traces together with the same trace ID
- Standard: supported by every major observability tool
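The correlation point deserves emphasis: every log line and span can carry the same trace ID, propagated between services in the W3C `traceparent` header. A stdlib-only sketch of that header's shape — the OTel SDK generates and propagates this for you; it is shown here only to make the mechanism concrete:

```java
import java.security.SecureRandom;
import java.util.HexFormat;

public class TraceparentDemo {
    // W3C Trace Context: "00" version, 16-byte trace ID, 8-byte span ID, flags.
    static String newTraceparent(SecureRandom rnd) {
        byte[] traceId = new byte[16];
        byte[] spanId = new byte[8];
        rnd.nextBytes(traceId);
        rnd.nextBytes(spanId);
        HexFormat hex = HexFormat.of();
        return "00-" + hex.formatHex(traceId) + "-" + hex.formatHex(spanId) + "-01"; // 01 = sampled
    }

    public static void main(String[] args) {
        System.out.println(newTraceparent(new SecureRandom()));
        // e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    }
}
```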
Target architecture
┌──────────────┐      ┌─────────────────┐      ┌──────────────┐
│ Application  │─────▶│ OTel Collector  │─────▶│   Backends   │
│  (OTel SDK)  │      │ (Agent/Gateway) │      │              │
└──────────────┘      └─────────────────┘      │ - Prometheus │
                                               │ - Jaeger     │
                                               │ - Loki       │
                                               └──────────────┘
                                                      │
                                                      ▼
                                               ┌──────────────┐
                                               │   Grafana    │
                                               │ (Dashboards) │
                                               └──────────────┘

Java/Spring Boot instrumentation
Dependencies
<!-- pom.xml -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.opentelemetry</groupId>
      <artifactId>opentelemetry-bom</artifactId>
      <version>1.35.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
  </dependency>
  <dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
  </dependency>
  <dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
  </dependency>
  <dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-otlp</artifactId>
  </dependency>
</dependencies>
Configuration
# application.yaml
otel:
  exporter:
    otlp:
      endpoint: http://otel-collector:4317
      protocol: grpc
  resource:
    attributes:
      service.name: mon-api
      service.version: ${APP_VERSION:1.0.0}
      deployment.environment: ${ENVIRONMENT:production}
  traces:
    sampler: parentbased_traceidratio
    sampler.arg: 0.1  # 10% of traces in prod
  logs:
    exporter: otlp
  metrics:
    exporter: otlp

# Spring Boot Actuator
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: ${spring.application.name}
    distribution:
      percentiles-histogram:
        http.server.requests: true

Auto-instrumentation vs agent
Two approaches to instrumenting your application:
Option 1: Java agent (recommended)
FROM eclipse-temurin:21-jre-alpine

# Download the OpenTelemetry agent
ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar /opt/otel-agent.jar

ENTRYPOINT ["java", \
  "-javaagent:/opt/otel-agent.jar", \
  "-Dotel.service.name=mon-api", \
  "-Dotel.exporter.otlp.endpoint=http://otel-collector:4317", \
  "-jar", "app.jar"]

The agent automatically instruments:
- HTTP clients (RestTemplate, WebClient, OkHttp)
- Databases (JDBC, Hibernate)
- Messaging (Kafka, RabbitMQ)
- Caches (Redis, Caffeine)
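On every instrumented outgoing call, the agent also injects the current trace context as a `traceparent` header so the downstream service joins the same trace. A manual stdlib equivalent, with a hypothetical downstream URL and the example IDs from the W3C Trace Context spec:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ManualPropagation {
    // What the agent does for you automatically: attach the current trace
    // context to the outgoing request (IDs hard-coded here for illustration).
    static HttpRequest withTraceContext(String url, String traceId, String spanId) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("traceparent", "00-" + traceId + "-" + spanId + "-01")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = withTraceContext(
                "http://inventory-service/api/stock",  // hypothetical downstream service
                "4bf92f3577b34da6a3ce929d0e0e4736",    // 32-hex trace ID
                "00f067aa0ba902b7");                   // 16-hex span ID
        System.out.println(req.headers().firstValue("traceparent").orElseThrow());
        // prints 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    }
}
```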
Option 2: SDK in your code
@Configuration
public class OtelConfig {

    @Bean
    public OpenTelemetry openTelemetry() {
        Resource resource = Resource.getDefault()
            .merge(Resource.create(Attributes.of(
                ResourceAttributes.SERVICE_NAME, "mon-api",
                ResourceAttributes.SERVICE_VERSION, "1.0.0"
            )));

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(
                OtlpGrpcSpanExporter.builder()
                    .setEndpoint("http://otel-collector:4317")
                    .build()
            ).build())
            .setResource(resource)
            .build();

        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .buildAndRegisterGlobal();
    }
}

Custom instrumentation
Creating custom spans
@Service
public class PaymentService {

    private final Tracer tracer;

    public PaymentService(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer("payment-service");
    }

    public PaymentResult processPayment(PaymentRequest request) {
        Span span = tracer.spanBuilder("process-payment")
            .setSpanKind(SpanKind.INTERNAL)
            .setAttribute("payment.amount", request.getAmount())
            .setAttribute("payment.currency", request.getCurrency())
            .setAttribute("payment.method", request.getMethod())
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            // Validation gets its own child span
            Span validationSpan = tracer.spanBuilder("validate-payment")
                .startSpan();
            try {
                validatePayment(request);
            } finally {
                validationSpan.end();
            }

            // Processing
            PaymentResult result = executePayment(request);
            span.setAttribute("payment.transaction_id", result.getTransactionId());
            span.setStatus(StatusCode.OK);
            return result;
        } catch (PaymentException e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}

Simplified annotations
@Service
public class OrderService {

    @WithSpan("create-order")
    public Order createOrder(
            @SpanAttribute("order.customer_id") String customerId,
            @SpanAttribute("order.items_count") int itemsCount) {
        // The span is created and ended automatically
        return processOrder(customerId, itemsCount);
    }
}

Metrics with Micrometer
Custom metrics
@Service
public class OrderMetrics {

    private final Counter ordersCreated;
    private final Timer orderProcessingTime;
    private final Gauge activeOrders;
    private final AtomicInteger activeOrderCount = new AtomicInteger(0);

    public OrderMetrics(MeterRegistry registry) {
        this.ordersCreated = Counter.builder("orders.created")
            .description("Number of orders created")
            .tags("service", "order-service")
            .register(registry);

        this.orderProcessingTime = Timer.builder("orders.processing.time")
            .description("Time to process an order")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);

        this.activeOrders = Gauge.builder("orders.active", activeOrderCount, AtomicInteger::get)
            .description("Number of orders being processed")
            .register(registry);
    }

    public void recordOrderCreated() {
        ordersCreated.increment();
    }

    public Timer.Sample startProcessing() {
        activeOrderCount.incrementAndGet();
        return Timer.start();
    }

    public void endProcessing(Timer.Sample sample) {
        sample.stop(orderProcessingTime);
        activeOrderCount.decrementAndGet();
    }
}

Business metrics
@Component
public class BusinessMetrics {

    private final MeterRegistry registry;

    public BusinessMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    @EventListener
    public void onPaymentCompleted(PaymentCompletedEvent event) {
        registry.counter("payments.completed",
            "method", event.getMethod(),
            "currency", event.getCurrency()
        ).increment();

        registry.summary("payments.amount",
            "currency", event.getCurrency()
        ).record(event.getAmount());
    }
}

Structured logs with correlation
Logback configuration
<!-- logback-spring.xml -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <includeMdcKeyName>trace_id</includeMdcKeyName>
      <includeMdcKeyName>span_id</includeMdcKeyName>
      <customFields>{"service":"mon-api","environment":"${ENVIRONMENT:-dev}"}</customFields>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>
Trace/log correlation
@Aspect
@Component
public class LoggingAspect {

    private static final Logger log = LoggerFactory.getLogger(LoggingAspect.class);

    @Around("@annotation(org.springframework.web.bind.annotation.RequestMapping)")
    public Object logRequest(ProceedingJoinPoint joinPoint) throws Throwable {
        Span currentSpan = Span.current();
        String traceId = currentSpan.getSpanContext().getTraceId();
        String spanId = currentSpan.getSpanContext().getSpanId();

        MDC.put("trace_id", traceId);
        MDC.put("span_id", spanId);
        try {
            log.info("Request started: {}", joinPoint.getSignature().getName());
            Object result = joinPoint.proceed();
            log.info("Request completed successfully");
            return result;
        } catch (Exception e) {
            log.error("Request failed: {}", e.getMessage());
            throw e;
        } finally {
            MDC.clear();
        }
    }
}

Example log output
{
  "@timestamp": "2026-01-15T10:30:00.000Z",
  "level": "INFO",
  "message": "Payment processed successfully",
  "service": "mon-api",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "payment_id": "PAY-001",
  "amount": 99.99
}

OpenTelemetry Collector
Collector configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  # Traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Metrics to Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel
  # Logs to Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  # Debug (dev only)
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Kubernetes deployment
# otel-collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.95.0
          args:
            - --config=/conf/otel-collector-config.yaml
          ports:
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
            - containerPort: 8889  # Prometheus metrics
          volumeMounts:
            - name: config
              mountPath: /conf
          resources:
            requests:
              memory: 256Mi
              cpu: 100m
            limits:
              memory: 512Mi
              cpu: 500m
      volumes:
        - name: config
          configMap:
            name: otel-collector-config

Full stack: Prometheus + Grafana + Jaeger
Docker Compose for development
# docker-compose.observability.yaml
version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.95.0
    command: --config=/etc/otel-collector-config.yaml
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
      - "8889:8889"

  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  jaeger:
    image: jaegertracing/all-in-one:1.54
    ports:
      - "16686:16686"  # UI
      - "4317"         # OTLP gRPC

  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning

Prometheus configuration
# prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['mon-api:8080']

Grafana dashboards
Essential metrics
# P95 latency
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))

# Error rate
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) /
sum(rate(http_server_requests_seconds_count[5m])) * 100

# Requests per second
sum(rate(http_server_requests_seconds_count[1m])) by (uri)

# JVM heap used
jvm_memory_used_bytes{area="heap"}

# Live threads
jvm_threads_live_threads

Recommended alerts
# prometheus-alerts.yaml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) /
          sum(rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      - alert: PodMemoryHigh
        expr: |
          container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod memory usage high"

Best practices
Smart sampling
In production, capturing 100% of traces is expensive. Configure sampling that fits your traffic:
@Bean
public Sampler sampler() {
    return Sampler.parentBased(
        Sampler.traceIdRatioBased(0.1)  // 10% under normal load
    );
}

// Or adaptive sampling
@Bean
public Sampler adaptiveSampler() {
    return new Sampler() {
        @Override
        public SamplingResult shouldSample(Context context, String traceId,
                                           String name, SpanKind spanKind,
                                           Attributes attributes,
                                           List<LinkData> links) {
            // Always sample errors
            if (Boolean.TRUE.equals(attributes.get(AttributeKey.booleanKey("error")))) {
                return SamplingResult.recordAndSample();
            }
            // Always sample business-critical transactions
            if (name.contains("payment") || name.contains("checkout")) {
                return SamplingResult.recordAndSample();
            }
            // 10% for everything else
            return Math.random() < 0.1
                ? SamplingResult.recordAndSample()
                : SamplingResult.drop();
        }

        @Override
        public String getDescription() {
            return "adaptive-sampler";
        }
    };
}

Standard attributes
Use the OpenTelemetry semantic conventions:
span.setAttribute(SemanticAttributes.HTTP_METHOD, "POST");
span.setAttribute(SemanticAttributes.HTTP_URL, "/api/orders");
span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, 200);
span.setAttribute(SemanticAttributes.DB_SYSTEM, "postgresql");
span.setAttribute(SemanticAttributes.DB_STATEMENT, "SELECT * FROM orders");

Metric cardinality
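Each distinct combination of label values becomes its own time series in the backend, so series counts multiply across labels. A quick stdlib illustration (cardinalities are hypothetical):

```java
public class CardinalityDemo {
    // Number of time series for one metric = product of per-label cardinalities.
    static long seriesCount(long... labelCardinalities) {
        long total = 1;
        for (long c : labelCardinalities) {
            total *= c;
        }
        return total;
    }

    public static void main(String[] args) {
        // user_type (3 values) x status (3 values): bounded
        System.out.println(seriesCount(3, 3));          // 9
        // user_id (5M users) x status (3 values): unbounded growth
        System.out.println(seriesCount(5_000_000, 3));  // 15000000
    }
}
```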
Avoid high-cardinality labels:
// BAD: user_id can take millions of distinct values
registry.counter("logins", "user_id", userId).increment();

// GOOD: user_type has only a handful of possible values
registry.counter("logins", "user_type", userType).increment();

Observability checklist
Before going to production:
- [ ] OpenTelemetry SDK or agent configured
- [ ] Traces exported to Jaeger/Tempo
- [ ] RED metrics (Rate, Errors, Duration) exposed
- [ ] Structured JSON logs with trace_id
- [ ] Sampling configured (not 100% in prod)
- [ ] Grafana dashboards created
- [ ] Alerts configured (latency, errors)
- [ ] Collector deployed for high availability
- [ ] Data retention policy defined
Conclusion
Observability with OpenTelemetry transforms your ability to understand and debug applications in production. The three pillars (logs, metrics, traces), combined through trace_id correlation, let you follow a request end to end.
Start with auto-instrumentation via the Java agent, then add custom spans for your critical business logic. The initial investment pays for itself at the first production incident.
---
For Kubernetes fundamentals: Kubernetes for developers: what you really need to master
For the RAG side: RAG in production: a simple architecture that actually works