Metrics tell you something is wrong. Logs tell you what was happening at a specific time. Traces tell you why a specific request was slow or failed: which service, which database query, which downstream call. Without distributed tracing, debugging latency across microservices means correlating timestamps across log files from different services and guessing. With tracing, you click on the slow request and see every span: the Redis call that took 8ms, the Postgres query that took 340ms, the downstream HTTP call that timed out. OpenTelemetry is the vendor-neutral SDK that instruments your code; Grafana Tempo (or Jaeger, or Zipkin) stores and queries the traces. This post covers Node.js instrumentation, useful span attributes, log-trace correlation, and the collector configuration that gets data from code to dashboard.
The OpenTelemetry architecture
Three components work together: the SDK in your application creates spans and propagates context across service boundaries; the OTel Collector (an optional but recommended sidecar or agent) receives OTLP-formatted telemetry, batches it, and forwards it to your backend; the backend (Tempo, Jaeger) stores and indexes traces for querying. The SDK → Collector → Backend pattern decouples instrumentation from storage: you can swap backends without touching application code.
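Context propagation is what ties spans from different services into one trace: the SDK injects a W3C traceparent header into outgoing HTTP requests, and the receiving service's SDK reads it back. A minimal sketch of that header's format (the parser here is illustrative, not the SDK's own propagator):

```typescript
// W3C traceparent header, as injected by the SDK into outgoing requests:
//   version-traceid-spanid-flags
//   e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
interface TraceParent {
  version: string;
  traceId: string;  // 32 hex chars, shared by every span in the trace
  spanId: string;   // 16 hex chars, the caller's span (our parent)
  sampled: boolean; // Bit 0 of the flags byte
}

// Illustrative parser; in real services the SDK's W3C propagator handles this.
function parseTraceparent(header: string): TraceParent | null {
  const parts = header.split('-');
  if (parts.length !== 4) return null;
  const [version, traceId, spanId, flags] = parts;
  if (traceId.length !== 32 || spanId.length !== 16) return null;
  return {
    version,
    traceId,
    spanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}
```

Every service that receives this header continues the same trace ID and uses the incoming span ID as its parent, which is how the backend stitches spans from separate processes into one tree.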
Node.js auto-instrumentation
OpenTelemetry has auto-instrumentation packages that patch common libraries (HTTP, Express, gRPC, PostgreSQL, Redis, MongoDB) without any code changes in your application logic. Load them before your app starts:
// otel.ts: load this FIRST with node -r ./otel.js app.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown-service',
[SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION || '0.0.0',
'deployment.environment': process.env.NODE_ENV || 'development',
}),
  traceExporter: new OTLPTraceExporter({
    // OTEL_EXPORTER_OTLP_TRACES_ENDPOINT includes the full /v1/traces path;
    // the generic OTEL_EXPORTER_OTLP_ENDPOINT is a base URL without it
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://otel-collector:4318/v1/traces',
  }),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
'@opentelemetry/instrumentation-http': {
// Add useful attributes to every HTTP span
        requestHook: (span, request) => {
          // Guard: the header may be missing, and outgoing ClientRequest objects have no .headers
          const requestId = 'headers' in request ? request.headers['x-request-id'] : undefined;
          if (typeof requestId === 'string') span.setAttribute('http.request.id', requestId);
        },
},
}),
],
});
sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
// package.json start script:
// "start": "node -r ./dist/otel.js dist/app.js"
Manual spans for business-critical operations
Auto-instrumentation covers library calls. For your own business logic (a PDF generation step, a pricing calculation, a background job), add manual spans to make them visible in traces:
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('payments-service', '1.0.0');
async function processPayment(orderId: string, amount: number) {
// Create a span for this operation
return tracer.startActiveSpan('payment.process', async (span) => {
try {
// Add attributes that help debug failures
span.setAttributes({
'payment.order_id': orderId,
'payment.amount_cents': amount,
'payment.currency': 'EUR',
});
const result = await chargeCard(orderId, amount);
// Record the outcome
span.setAttributes({
'payment.provider_reference': result.transactionId,
'payment.status': result.status,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
      // Record the error: this shows up as a red span in Grafana
span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
}
Log-trace correlation: linking logs to spans
The killer feature of distributed tracing is correlating log lines with the trace that generated them. Inject the current trace ID and span ID into every log line so you can jump from a Loki log query to the corresponding Grafana Tempo trace in one click:
// logger.ts: Winston with OTel context injection
import winston from 'winston';
import { context, trace } from '@opentelemetry/api';
const otelFormat = winston.format((info) => {
const activeSpan = trace.getActiveSpan();
if (activeSpan) {
const { traceId, spanId, traceFlags } = activeSpan.spanContext();
info['trace_id'] = traceId;
info['span_id'] = spanId;
info['trace_flags'] = traceFlags.toString(16).padStart(2, '0');
}
return info;
});
export const logger = winston.createLogger({
format: winston.format.combine(
otelFormat(),
winston.format.json(),
),
transports: [new winston.transports.Console()],
});
// Log output now includes trace_id:
// {"level":"info","message":"Payment processed","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}
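On the Grafana side, a Loki data source can turn that trace_id field into a clickable link into Tempo via derived fields. A provisioning sketch (the data source names and the tempo UID are assumptions for this stack; Grafana expands environment variables in provisioning files, hence the doubled $$):

```yaml
# grafana/provisioning/datasources/loki.yaml (sketch; names and UIDs are examples)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo # Must match the Tempo data source UID
```

With this in place, every log line that carries a trace_id renders a button that opens the corresponding trace.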
OTel Collector configuration
# otel-collector-config.yaml
receivers:
otlp:
protocols:
http: { endpoint: "0.0.0.0:4318" }
grpc: { endpoint: "0.0.0.0:4317" }
processors:
batch:
timeout: 5s
send_batch_size: 1024
  memory_limiter: # check_interval is required; the limiter samples memory on this cadence
    check_interval: 1s
    limit_mib: 256
    spike_limit_mib: 64
exporters:
otlp/tempo:
endpoint: "tempo:4317"
tls: { insecure: true } # Use proper TLS in production
  prometheusremotewrite: # For span-derived metrics; needs a spanmetrics connector and a metrics pipeline (not shown)
    endpoint: "http://prometheus:9090/api/v1/write"
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/tempo]
In 47Network products, OpenTelemetry traces are used extensively in Sven, the AI agent platform. Long-running agent tasks involve dozens of sequential LLM calls, tool invocations, and database writes. Tracing makes it possible to see exactly which step in a multi-turn agent conversation took 4 seconds, whether it was the embedding lookup, the context retrieval, or the LLM API call. The Sven agent design post covers how observability is built into the agent architecture from the start.
Sampling: don't trace everything
In production, tracing every request is expensive. A high-throughput service producing 1,000 RPS would generate millions of spans per minute. Sampling controls which traces you actually store. There are two approaches:
- Head-based sampling: the decision is made at the start of the trace, before any spans are created. Simple to implement (just set a rate), but you'll miss rare slow requests if they fall outside your sample rate.
- Tail-based sampling: the decision is made at the end of the trace, after all spans are collected. Lets you always sample traces that had errors or exceeded a latency threshold. Requires a collector (like the OpenTelemetry Collector) to buffer and evaluate complete traces.
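Tail-based sampling lives in the collector, not the SDK. A sketch using the tail_sampling processor from the collector-contrib distribution (the thresholds and percentages here are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s # Buffer spans this long before evaluating the complete trace
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow-traces
        type: latency
        latency: { threshold_ms: 500 }
      - name: sample-the-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
```

Policies are OR-ed: a trace is kept if any policy matches, so errors and slow traces always survive while normal traffic is downsampled.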
// Head-based sampling: keep 10% of traces
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  sampler: new TraceIdRatioBasedSampler(0.1), // 10% sample rate
});
// ParentBased sampler: respect the sampling decision from upstream services.
// If the caller sampled this trace, we sample it too.
const parentAwareSdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // 10% for new traces we originate
  }),
});
Exporting to Grafana Tempo
Grafana Tempo is the trace backend that integrates naturally with the Prometheus/Grafana stack. It stores traces efficiently, correlates them with Loki logs via trace IDs, and links from Grafana dashboards directly into trace views:
# docker-compose.yml: Tempo + OpenTelemetry Collector
services:
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
volumes:
- ./otel-collector-config.yml:/etc/otelcol-contrib/config.yaml
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
tempo:
image: grafana/tempo:latest
volumes:
- ./tempo-config.yml:/etc/tempo.yaml
- tempo-data:/var/tempo
ports:
      - "3200:3200" # Tempo query API (Grafana data source)
volumes:
  tempo-data: # Named volumes must be declared at the top level
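The compose file mounts a tempo-config.yml that isn't shown above. A minimal single-binary sketch (paths and ports assumed to match the compose setup; check the Tempo docs for your version's exact schema):

```yaml
# tempo-config.yml (minimal sketch)
server:
  http_listen_port: 3200 # Query API that Grafana talks to
distributor:
  receivers:
    otlp:
      protocols:
        grpc: # Accepts spans from the collector on :4317 (defaults)
storage:
  trace:
    backend: local # Filesystem storage; use object storage (S3/GCS) in production
    local:
      path: /var/tempo/traces
```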
# otel-collector-config.yml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
exporters:
otlp:
endpoint: tempo:4317
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
exporters: [otlp]
Once Tempo is running, add it as a Grafana data source (type: Tempo, URL: http://tempo:3200). In your Grafana dashboards, the exemplars feature lets you click any spike in a Prometheus metric graph and jump directly to a trace from that moment, closing the loop between metrics, logs, and traces.
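Adding the data source can also be provisioned instead of clicked through the UI. A sketch (the UIDs and the Loki link are assumptions; tracesToLogsV2 enables the jump from a trace back to its logs):

```yaml
# grafana/provisioning/datasources/tempo.yaml (sketch)
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki # Jump from a span to the log lines it emitted
        filterByTraceID: true
```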