Observability in production: logs, metrics, and traces with OpenTelemetry

A practical OpenTelemetry guide: instrumenting logs, metrics, and traces, with an observability architecture built on Prometheus, Grafana, and Jaeger for production.

Your application is deployed in production. But do you really know what is happening inside it? Observability goes beyond plain monitoring: it lets you understand the internal behavior of your system from its outputs. OpenTelemetry standardizes how this data is collected.

The three pillars of observability

Observability rests on three complementary types of data:

Pillar    | Question                                        | Example
Logs      | What happened?                                  | "User 123 logged in at 10:30"
Metrics   | How many? How fast?                             | "95th percentile latency: 45ms"
Traces    | How did the request travel through the system?  | "Request → API → DB → Cache → Response"
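
To make this concrete, here is a minimal sketch of a single business operation emitting all three signals. It assumes SLF4J, Micrometer, and the OpenTelemetry API are available (all three are wired up later in this guide); the "checkout" names are illustrative.

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckoutTelemetry {

    private static final Logger log = LoggerFactory.getLogger(CheckoutTelemetry.class);

    private final Counter checkouts;
    private final Tracer tracer;

    public CheckoutTelemetry(MeterRegistry registry, OpenTelemetry openTelemetry) {
        this.checkouts = Counter.builder("checkouts.started").register(registry);
        this.tracer = openTelemetry.getTracer("checkout");
    }

    public void checkout(String userId) {
        Span span = tracer.spanBuilder("checkout").startSpan();   // trace: the request's path
        try (Scope scope = span.makeCurrent()) {
            log.info("Checkout started for user {}", userId);    // log: what happened
            checkouts.increment();                                // metric: how many, how fast
        } finally {
            span.end();
        }
    }
}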

Why OpenTelemetry?

OpenTelemetry (OTel) is the CNCF standard for instrumentation. Its advantages:

  • Vendor-neutral: switch backends without changing your code
  • Auto-instrumentation: frameworks are instrumented automatically
  • Correlation: link logs, metrics, and traces through the same trace ID
  • Standard: supported by every major tool

Target architecture

┌──────────────┐     ┌─────────────────┐     ┌──────────────┐
│  Application │────▶│  OTel Collector │────▶│   Backends   │
│  (SDK OTel)  │     │  (Agent/Gateway)│     │              │
└──────────────┘     └─────────────────┘     │ - Prometheus │
                                              │ - Jaeger     │
                                              │ - Loki       │
                                              └──────────────┘
                                                     │
                                                     ▼
                                              ┌──────────────┐
                                              │   Grafana    │
                                              │ (Dashboards) │
                                              └──────────────┘

Java/Spring Boot instrumentation

Dependencies

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>io.opentelemetry</groupId>
            <artifactId>opentelemetry-bom</artifactId>
            <version>1.35.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <!-- OpenTelemetry API -->
    <dependency>
        <groupId>io.opentelemetry</groupId>
        <artifactId>opentelemetry-api</artifactId>
    </dependency>

    <!-- Spring Boot starter -->
    <dependency>
        <groupId>io.opentelemetry.instrumentation</groupId>
        <artifactId>opentelemetry-spring-boot-starter</artifactId>
    </dependency>

    <!-- OTLP exporter -->
    <dependency>
        <groupId>io.opentelemetry</groupId>
        <artifactId>opentelemetry-exporter-otlp</artifactId>
    </dependency>

    <!-- Micrometer OTLP registry -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-otlp</artifactId>
    </dependency>
</dependencies>

Configuration

# application.yaml
otel:
  exporter:
    otlp:
      endpoint: http://otel-collector:4317
      protocol: grpc
  resource:
    attributes:
      service.name: mon-api
      service.version: ${APP_VERSION:1.0.0}
      deployment.environment: ${ENVIRONMENT:production}
  traces:
    sampler: parentbased_traceidratio
    sampler.arg: 0.1  # 10% of traces in production
  logs:
    exporter: otlp
  metrics:
    exporter: otlp

# Spring Boot Actuator
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: ${spring.application.name}
    distribution:
      percentiles-histogram:
        http.server.requests: true

Java agent vs SDK

Two approaches to instrumenting your application:

Option 1: Java agent (recommended)

FROM eclipse-temurin:21-jre-alpine

# Download the OpenTelemetry agent
ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar /opt/otel-agent.jar

ENTRYPOINT ["java", \
    "-javaagent:/opt/otel-agent.jar", \
    "-Dotel.service.name=mon-api", \
    "-Dotel.exporter.otlp.endpoint=http://otel-collector:4317", \
    "-jar", "app.jar"]

The agent automatically instruments:

  • HTTP clients (RestTemplate, WebClient, OkHttp)
  • Databases (JDBC, Hibernate)
  • Messaging (Kafka, RabbitMQ)
  • Cache (Redis, Caffeine)
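
Even with the agent, you can enrich the spans it creates from your own code through the OpenTelemetry API. A minimal sketch (OrderRepository, Order and the order.id attribute are illustrative names):

import io.opentelemetry.api.trace.Span;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {

    private final OrderRepository orderRepository;

    public OrderController(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    @GetMapping("/orders/{id}")
    public Order getOrder(@PathVariable String id) {
        // The agent has already opened a SERVER span for this HTTP request;
        // Span.current() returns it so business attributes can be attached.
        Span.current().setAttribute("order.id", id);
        return orderRepository.findById(id);
    }
}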

Option 2: SDK in code

@Configuration
public class OtelConfig {

    @Bean
    public OpenTelemetry openTelemetry() {
        Resource resource = Resource.getDefault()
            .merge(Resource.create(Attributes.of(
                ResourceAttributes.SERVICE_NAME, "mon-api",
                ResourceAttributes.SERVICE_VERSION, "1.0.0"
            )));

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(
                OtlpGrpcSpanExporter.builder()
                    .setEndpoint("http://otel-collector:4317")
                    .build()
            ).build())
            .setResource(resource)
            .build();

        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .buildAndRegisterGlobal();
    }
}
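
The bean above only wires the tracer provider. On the SDK route, metrics (and logs) need their own providers as well; here is a minimal sketch of the metric side, assuming the OTLP gRPC metric exporter and reusing the resource and tracerProvider defined above:

import java.time.Duration;

import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;

// Inside the same openTelemetry() bean, before buildAndRegisterGlobal():
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
    .registerMetricReader(PeriodicMetricReader.builder(
            OtlpGrpcMetricExporter.builder()
                .setEndpoint("http://otel-collector:4317")
                .build())
        .setInterval(Duration.ofSeconds(60))
        .build())
    .setResource(resource)
    .build();

return OpenTelemetrySdk.builder()
    .setTracerProvider(tracerProvider)
    .setMeterProvider(meterProvider)
    .buildAndRegisterGlobal();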

Custom instrumentation

Creating custom spans

@Service
public class PaymentService {

    private final Tracer tracer;

    public PaymentService(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer("payment-service");
    }

    public PaymentResult processPayment(PaymentRequest request) {
        Span span = tracer.spanBuilder("process-payment")
            .setSpanKind(SpanKind.INTERNAL)
            .setAttribute("payment.amount", request.getAmount())
            .setAttribute("payment.currency", request.getCurrency())
            .setAttribute("payment.method", request.getMethod())
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            // Validation
            Span validationSpan = tracer.spanBuilder("validate-payment")
                .startSpan();
            try {
                validatePayment(request);
            } finally {
                validationSpan.end();
            }

            // Processing
            PaymentResult result = executePayment(request);

            span.setAttribute("payment.transaction_id", result.getTransactionId());
            span.setStatus(StatusCode.OK);

            return result;

        } catch (PaymentException e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}

Simplified annotations

@Service
public class OrderService {

    @WithSpan("create-order")
    public Order createOrder(
            @SpanAttribute("order.customer_id") String customerId,
            @SpanAttribute("order.items_count") int itemsCount) {

        // The span is created automatically
        return processOrder(customerId, itemsCount);
    }
}
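
The @WithSpan and @SpanAttribute annotations come from the opentelemetry-instrumentation-annotations artifact and are picked up by the Java agent as well as the Spring Boot starter. With annotations you do not manage the span lifecycle yourself, but the current span can still be enriched or marked as failed from inside the method; a sketch building on the same service (the order.id attribute is illustrative):

@Service
public class OrderService {

    @WithSpan("create-order")
    public Order createOrder(
            @SpanAttribute("order.customer_id") String customerId,
            @SpanAttribute("order.items_count") int itemsCount) {
        try {
            Order order = processOrder(customerId, itemsCount);
            // Enrich the span opened by @WithSpan with values known only after processing
            Span.current().setAttribute("order.id", order.getId());
            return order;
        } catch (Exception e) {
            // Flag the annotated span as failed and attach the exception
            Span.current().setStatus(StatusCode.ERROR, e.getMessage());
            Span.current().recordException(e);
            throw e;
        }
    }
}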

Metrics with Micrometer

Custom metrics

@Service
public class OrderMetrics {

    private final Counter ordersCreated;
    private final Timer orderProcessingTime;
    private final Gauge activeOrders;
    private final AtomicInteger activeOrderCount = new AtomicInteger(0);

    public OrderMetrics(MeterRegistry registry) {
        this.ordersCreated = Counter.builder("orders.created")
            .description("Number of orders created")
            .tags("service", "order-service")
            .register(registry);

        this.orderProcessingTime = Timer.builder("orders.processing.time")
            .description("Time to process an order")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);

        this.activeOrders = Gauge.builder("orders.active", activeOrderCount, AtomicInteger::get)
            .description("Number of orders being processed")
            .register(registry);
    }

    public void recordOrderCreated() {
        ordersCreated.increment();
    }

    public Timer.Sample startProcessing() {
        activeOrderCount.incrementAndGet();
        return Timer.start();
    }

    public void endProcessing(Timer.Sample sample) {
        sample.stop(orderProcessingTime);
        activeOrderCount.decrementAndGet();
    }
}
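
A sketch of how a service might use these helpers around order processing (OrderRequest, Order and doProcess are placeholder names):

import io.micrometer.core.instrument.Timer;

@Service
public class OrderProcessingService {

    private final OrderMetrics metrics;

    public OrderProcessingService(OrderMetrics metrics) {
        this.metrics = metrics;
    }

    public Order process(OrderRequest request) {
        Timer.Sample sample = metrics.startProcessing();  // starts the timer and bumps the active gauge
        try {
            Order order = doProcess(request);
            metrics.recordOrderCreated();
            return order;
        } finally {
            metrics.endProcessing(sample);                // records the duration and decrements the gauge
        }
    }

    private Order doProcess(OrderRequest request) {
        // placeholder for the actual business logic
        return new Order();
    }
}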

Business metrics

@Component
public class BusinessMetrics {

    private final MeterRegistry registry;

    public BusinessMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    @EventListener
    public void onPaymentCompleted(PaymentCompletedEvent event) {
        registry.counter("payments.completed",
            "method", event.getMethod(),
            "currency", event.getCurrency()
        ).increment();

        registry.summary("payments.amount",
            "currency", event.getCurrency()
        ).record(event.getAmount());
    }
}

Structured logs with correlation

Logback configuration

<!-- logback-spring.xml (assumes the logstash-logback-encoder dependency for JSON output) -->
<configuration>
    <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <includeMdcKeyName>trace_id</includeMdcKeyName>
            <includeMdcKeyName>span_id</includeMdcKeyName>
            <customFields>{"service":"mon-api","environment":"${ENVIRONMENT:-dev}"}</customFields>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="JSON"/>
    </root>
</configuration>

Trace/log correlation

@Aspect
@Component
public class LoggingAspect {

    private static final Logger log = LoggerFactory.getLogger(LoggingAspect.class);

    @Around("@annotation(org.springframework.web.bind.annotation.RequestMapping)")
    public Object logRequest(ProceedingJoinPoint joinPoint) throws Throwable {
        Span currentSpan = Span.current();
        String traceId = currentSpan.getSpanContext().getTraceId();
        String spanId = currentSpan.getSpanContext().getSpanId();

        MDC.put("trace_id", traceId);
        MDC.put("span_id", spanId);

        try {
            log.info("Request started: {}", joinPoint.getSignature().getName());
            Object result = joinPoint.proceed();
            log.info("Request completed successfully");
            return result;
        } catch (Exception e) {
            log.error("Request failed: {}", e.getMessage());
            throw e;
        } finally {
            MDC.clear();
        }
    }
}

Log output

{
  "@timestamp": "2026-01-15T10:30:00.000Z",
  "level": "INFO",
  "message": "Payment processed successfully",
  "service": "mon-api",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "payment_id": "PAY-001",
  "amount": 99.99
}
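
The business fields in this output (payment_id, amount) do not have to be concatenated into the message: with logstash-logback-encoder they can be passed as structured arguments, which the encoder turns into top-level JSON fields next to the MDC-injected trace_id. A minimal sketch (the payment object is illustrative):

import static net.logstash.logback.argument.StructuredArguments.kv;

// Each kv() pair becomes its own JSON field in the log line
log.info("Payment processed successfully",
    kv("payment_id", payment.getId()),
    kv("amount", payment.getAmount()));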

OpenTelemetry Collector

Collector configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000

  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  # Traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Metrics to Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel

  # Logs to Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

  # Debug (dev only)
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/jaeger]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]

Kubernetes deployment

# otel-collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.95.0
          args:
            - --config=/conf/otel-collector-config.yaml
          ports:
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
            - containerPort: 8889  # Prometheus metrics
          volumeMounts:
            - name: config
              mountPath: /conf
          resources:
            requests:
              memory: 256Mi
              cpu: 100m
            limits:
              memory: 512Mi
              cpu: 500m
      volumes:
        - name: config
          configMap:
            name: otel-collector-config

Complete stack: Prometheus + Grafana + Jaeger

Docker Compose for development

# docker-compose.observability.yaml
version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.95.0
    command: --config=/etc/otel-collector-config.yaml
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
      - "8889:8889"

  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  jaeger:
    image: jaegertracing/all-in-one:1.54
    ports:
      - "16686:16686"  # UI
      - "4317"         # OTLP gRPC

  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning

Prometheus configuration

# prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['mon-api:8080']

Grafana dashboards

Essential metrics

# P95 latency
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))

# Error rate
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) /
sum(rate(http_server_requests_seconds_count[5m])) * 100

# Requests per second
sum(rate(http_server_requests_seconds_count[1m])) by (uri)

# JVM heap used
jvm_memory_used_bytes{area="heap"}

# Live threads
jvm_threads_live_threads

Recommended alerts

# prometheus-alerts.yaml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) /
          sum(rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      - alert: PodMemoryHigh
        expr: |
          container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod memory usage high"

Best practices

Smart sampling

In production, capturing 100% of traces is expensive. Configure sampling that matches your traffic:

@Bean
public Sampler sampler() {
    return Sampler.parentBased(
        Sampler.traceIdRatioBased(0.1)  // 10% under normal conditions
    );
}

// Or adaptive sampling
@Bean
public Sampler adaptiveSampler() {
    return new Sampler() {
        @Override
        public SamplingResult shouldSample(Context parentContext, String traceId,
                                           String name, SpanKind spanKind,
                                           Attributes attributes,
                                           List<LinkData> parentLinks) {
            // Always sample errors
            if (Boolean.TRUE.equals(attributes.get(AttributeKey.booleanKey("error")))) {
                return SamplingResult.recordAndSample();
            }
            // Always sample important transactions
            if (name.contains("payment") || name.contains("checkout")) {
                return SamplingResult.recordAndSample();
            }
            // 10% for everything else
            return Math.random() < 0.1 ?
                SamplingResult.recordAndSample() :
                SamplingResult.drop();
        }

        @Override
        public String getDescription() {
            return "adaptive-sampler";
        }
    };
}

Standard attributes

Use the OpenTelemetry semantic conventions:

span.setAttribute(SemanticAttributes.HTTP_METHOD, "POST");
span.setAttribute(SemanticAttributes.HTTP_URL, "/api/orders");
span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, 200);
span.setAttribute(SemanticAttributes.DB_SYSTEM, "postgresql");
span.setAttribute(SemanticAttributes.DB_STATEMENT, "SELECT * FROM orders");

Metric cardinality

Avoid high-cardinality labels (the metric name below is just an example):

// BAD: user_id potentially has millions of distinct values
registry.counter("user.logins", "user_id", userId).increment();

// GOOD: user_type only has a handful of possible values
registry.counter("user.logins", "user_type", userType).increment();

Observability checklist

Before going to production:

  • [ ] OpenTelemetry SDK or agent configured
  • [ ] Traces exported to Jaeger/Tempo
  • [ ] RED metrics (Rate, Errors, Duration) exposed
  • [ ] Structured JSON logs with trace_id
  • [ ] Sampling configured (not 100% in production)
  • [ ] Grafana dashboards created
  • [ ] Alerts configured (latency, errors)
  • [ ] Collector running in high availability
  • [ ] Data retention policy defined

Conclusion

Observability with OpenTelemetry transforms your ability to understand and debug your applications in production. The three pillars (logs, metrics, traces), combined with correlation through the trace_id, let you follow a request end to end.

Start with auto-instrumentation using the Java agent, then add custom spans for your critical business logic. The initial investment pays for itself at the first production incident.

---

For Kubernetes fundamentals: Kubernetes pour développeurs : ce qu'il faut vraiment maîtriser

For the RAG side: RAG en production : architecture simple qui fonctionne vraiment