Infrastructure

OpenTelemetry distributed tracing for microservices.

February 25, 2026 · 13 min read · 47Network Engineering

Metrics tell you something is wrong. Logs tell you what was happening at a specific time. Traces tell you why a specific request was slow or failed: which service, which database query, which downstream call. Without distributed tracing, debugging latency across microservices means correlating timestamps across log files from different services and guessing. With tracing, you click on the slow request and see every span: the Redis call that took 8ms, the Postgres query that took 340ms, the downstream HTTP call that timed out. OpenTelemetry is the vendor-neutral SDK that instruments your code; Grafana Tempo (or Jaeger, or Zipkin) stores and queries the traces. This post covers Node.js instrumentation, useful span attributes, log-trace correlation, and the collector configuration that gets data from code to dashboard.

The OpenTelemetry architecture

Three components work together: the SDK in your application creates spans and propagates context across service boundaries; the OTel Collector (an optional but recommended sidecar or agent) receives OTLP-formatted telemetry, batches it, and forwards it to your backend; the backend (Tempo, Jaeger) stores and indexes traces for querying. The SDK → Collector → Backend pattern decouples instrumentation from storage: you can swap backends without touching application code.
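Context propagation works by injecting a W3C `traceparent` header into outgoing requests; the next service reads it and continues the same trace. To illustrate the wire format, here is a hypothetical parser sketch (not part of the SDK, which handles this for you automatically):

```typescript
// W3C traceparent format: version-traceId-spanId-flags, lowercase hex,
// e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
function parseTraceparent(
  header: string,
): { traceId: string; spanId: string; sampled: boolean } | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null; // malformed header: the receiver starts a new trace
  const [, , traceId, spanId, flags] = m;
  // The last byte carries the flags; bit 0 is the "sampled" decision
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

The sampled flag is how downstream services honor a head-based sampling decision made at the edge instead of re-rolling the dice per hop.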

Node.js auto-instrumentation

OpenTelemetry has auto-instrumentation packages that patch common libraries (HTTP, Express, gRPC, PostgreSQL, Redis, MongoDB) without any code changes in your application logic. Load them before your app starts:

// otel.ts - load this FIRST with node -r ./otel.js app.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]:    process.env.SERVICE_NAME || 'unknown-service',
    [SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION || '0.0.0',
    'deployment.environment':      process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
      '@opentelemetry/instrumentation-http': {
        // Add useful attributes to every HTTP span
        requestHook: (span, request) => {
          // `request` may be a ClientRequest without .headers, so guard the lookup
          const requestId = 'headers' in request ? request.headers['x-request-id'] : undefined;
          if (typeof requestId === 'string') span.setAttribute('http.request.id', requestId);
        },
      },
    }),
  ],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

// package.json start script:
// "start": "node -r ./dist/otel.js dist/app.js"

Manual spans for business-critical operations

Auto-instrumentation covers library calls. For your own business logic, such as a PDF generation step, a pricing calculation, or a background job, add manual spans to make it visible in traces:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payments-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  // Create a span for this operation
  return tracer.startActiveSpan('payment.process', async (span) => {
    try {
      // Add attributes that help debug failures
      span.setAttributes({
        'payment.order_id':    orderId,
        'payment.amount_cents': amount,
        'payment.currency':    'EUR',
      });

      const result = await chargeCard(orderId, amount);

      // Record the outcome
      span.setAttributes({
        'payment.provider_reference': result.transactionId,
        'payment.status':             result.status,
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;

    } catch (error) {
      // Record the error: it shows up as a red span in Grafana
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Log-trace correlation: linking logs to spans

The killer feature of distributed tracing is correlating log lines with the trace that generated them. Inject the current trace ID and span ID into every log line so you can jump from a Loki log query to the corresponding Grafana Tempo trace in one click:

// logger.ts - Winston with OTel context injection
import winston from 'winston';
import { trace } from '@opentelemetry/api';

const otelFormat = winston.format((info) => {
  const activeSpan = trace.getActiveSpan();
  if (activeSpan) {
    const { traceId, spanId, traceFlags } = activeSpan.spanContext();
    info['trace_id'] = traceId;
    info['span_id']  = spanId;
    info['trace_flags'] = traceFlags.toString(16).padStart(2, '0');
  }
  return info;
});

export const logger = winston.createLogger({
  format: winston.format.combine(
    otelFormat(),
    winston.format.json(),
  ),
  transports: [new winston.transports.Console()],
});

// Log output now includes trace_id:
// {"level":"info","message":"Payment processed","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}
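On the Grafana side, a derived field on the Loki data source turns that trace_id into a clickable link to the matching Tempo trace. A provisioning sketch (the file path and the `tempo` data source UID are assumptions for this example):

```yaml
# grafana/provisioning/datasources/loki.yml (path assumed)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'   # pulls the ID out of the JSON log line
          datasourceUid: tempo                 # UID of your Tempo data source
          url: '$${__value.raw}'               # $$ escapes Grafana's env interpolation
```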

OTel Collector configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http: { endpoint: "0.0.0.0:4318" }
      grpc: { endpoint: "0.0.0.0:4317" }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s          # Required; how often memory usage is checked
    limit_mib: 256
    spike_limit_mib: 64

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls: { insecure: true }   # Use proper TLS in production
  prometheusremotewrite:       # For span metrics; needs a metrics pipeline
    endpoint: "http://prometheus:9090/api/v1/write"  # (e.g. fed by the spanmetrics connector)

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [otlp/tempo]

In 47Network products, OpenTelemetry traces are used extensively in Sven โ€” the AI agent platform. Long-running agent tasks involve dozens of sequential LLM calls, tool invocations, and database writes. Tracing makes it possible to see exactly which step in a multi-turn agent conversation took 4 seconds, whether it was the embedding lookup, the context retrieval, or the LLM API call. The Sven agent design post covers how observability is built into the agent architecture from the start.

Sampling: don't trace everything

In production, tracing every request is expensive. A high-throughput service producing 1,000 RPS would generate millions of spans per minute. Sampling controls which traces you actually store. There are two approaches:

  • Head-based sampling: the decision is made at the start of the trace, before any spans are created. Simple to implement (just set a rate), but you'll miss rare slow requests if they fall outside your sample rate.
  • Tail-based sampling: the decision is made at the end of the trace, after all spans are collected. Lets you always sample traces that had errors or exceeded a latency threshold. Requires a collector (like the OpenTelemetry Collector) to buffer and evaluate complete traces.
// Head-based sampling - keep 10% of traces
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  sampler: new TraceIdRatioBasedSampler(0.1),  // 10% sample rate
});

// ParentBased sampler - respect the sampling decision from upstream services.
// If the caller sampled this trace, we sample it too:
const sdkParentBased = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),   // 10% for new traces we originate
  }),
});
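Tail-based sampling, by contrast, is configured in the collector rather than the SDK. A sketch of the `tail_sampling` processor (contrib collector image; policy names and thresholds here are illustrative) that always keeps errors and slow traces plus a 10% baseline:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s               # buffer spans this long before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
```

Reference it in the traces pipeline (e.g. `processors: [tail_sampling, batch]`). Because it buffers complete traces in memory until `decision_wait` expires, size the collector's memory accordingly.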

Exporting to Grafana Tempo

Grafana Tempo is the trace backend that integrates naturally with the Prometheus/Grafana stack. It stores traces efficiently, correlates them with Loki logs via trace IDs, and links from Grafana dashboards directly into trace views:

# docker-compose.yml - Tempo + OpenTelemetry Collector
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-collector-config.yml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  tempo:
    image: grafana/tempo:latest
    volumes:
      - ./tempo-config.yml:/etc/tempo.yaml
      - tempo-data:/var/tempo
    ports:
      - "3200:3200"   # Tempo query API (Grafana data source)

# otel-collector-config.yml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]

Once Tempo is running, add it as a Grafana data source (type: Tempo, URL: http://tempo:3200). In your Grafana dashboards, the exemplars feature lets you click any spike in a Prometheus metric graph and jump directly to a trace from that moment, closing the loop between metrics, logs, and traces.
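If you provision Grafana declaratively, the data source can be defined in a file instead of the UI. A sketch (the file path and the `loki` UID are assumptions); `tracesToLogsV2` adds the reverse link, from a span to its Loki logs:

```yaml
# grafana/provisioning/datasources/tempo.yml (path assumed)
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki          # UID of your Loki data source
        filterByTraceID: true        # query logs by the span's trace_id
```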


โ† Back to Blog Grafana Guide โ†’