myougaTheAxo

OpenTelemetry Integration: Distributed Tracing, Metrics, and Logging for AI Systems

What is OpenTelemetry and Why It Matters Now

OpenTelemetry (OTel) is an open, vendor-neutral standard for observability in distributed systems. It covers the three pillars of telemetry (traces, metrics, and logs) through a unified set of APIs and SDKs.

In microservice and AI-agent systems, it is hard to answer questions like "where is this request slow?" and "which LLM call is eating costs?" OpenTelemetry exists to make those questions answerable.

Backend options:

  • Jaeger: OSS distributed tracing (self-hosted)
  • Grafana Tempo + Prometheus: Metrics + traces integration
  • Datadog / Honeycomb: Managed services
  • Signoz: OSS full-stack observability

Basic Setup: Instrumentation for Node.js

// src/instrumentation.ts - must be loaded before any application code,
// e.g. node --require ./dist/instrumentation.js dist/index.js
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";

const resource = new Resource({
  "service.name": "my-ai-service",
  "service.version": process.env.npm_package_version ?? "0.0.0",
  "deployment.environment": process.env.NODE_ENV ?? "development",
});

export const sdk = new NodeSDK({
  resource,
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_ENDPOINT ?? "http://localhost:4318/v1/traces",
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 15_000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
process.on("SIGTERM", () => sdk.shutdown().finally(() => process.exit(0)));

Custom Traces: Instrumenting LLM Calls

import { trace, SpanStatusCode, SpanKind } from "@opentelemetry/api";
import Anthropic from "@anthropic-ai/sdk";

const tracer = trace.getTracer("llm-service", "1.0.0");

async function tracedLLMCall(prompt: string, model = "claude-sonnet-4-5"): Promise<string> {
  return tracer.startActiveSpan("llm.call", {
    kind: SpanKind.CLIENT,
    attributes: { "llm.model": model, "llm.prompt_length": prompt.length, "llm.provider": "anthropic" },
  }, async (span) => {
    try {
      const client = new Anthropic();
      const startTime = Date.now();

      const response = await client.messages.create({
        model, max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      });

      span.setAttributes({
        "llm.input_tokens": response.usage.input_tokens,
        "llm.output_tokens": response.usage.output_tokens,
        "llm.latency_ms": Date.now() - startTime,
      });

      span.setStatus({ code: SpanStatusCode.OK });
      // content blocks are a union type; only "text" blocks carry .text
      const block = response.content[0];
      return block.type === "text" ? block.text : "";
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}
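The "which LLM call is eating costs" question usually reduces to arithmetic over the token counts the span already records. A minimal sketch, with illustrative placeholder prices and a hypothetical `estimateCostUSD` helper (neither comes from the Anthropic SDK):

```typescript
// Rough cost estimate from the token usage already attached to the span.
// PRICE_PER_MTOK values are illustrative placeholders, not real pricing.
const PRICE_PER_MTOK = { input: 3.0, output: 15.0 }; // USD per million tokens

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * PRICE_PER_MTOK.input +
    (outputTokens / 1_000_000) * PRICE_PER_MTOK.output
  );
}

// The result can be recorded as one more span attribute, e.g.:
// span.setAttribute("llm.cost_usd", estimateCostUSD(inTokens, outTokens));
```

Attaching an estimate like this lets the tracing backend aggregate cost per endpoint, per user, or per agent step.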

Custom Metrics: Measuring Business KPIs

import { metrics, ValueType } from "@opentelemetry/api";

const meter = metrics.getMeter("ai-service", "1.0.0");

const requestCounter = meter.createCounter("api.requests.total", {
  description: "Total number of API requests",
});

const latencyHistogram = meter.createHistogram("api.latency.ms", {
  description: "API request latency in milliseconds",
  unit: "ms",
  advice: {
    explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
  },
});

// "pool" is assumed to be an existing connection pool (e.g. pg.Pool); it is not defined here
const activeConnectionsGauge = meter.createObservableGauge("db.connections.active");
activeConnectionsGauge.addCallback((result) => {
  result.observe(pool.totalCount - pool.idleCount, { db: "primary" });
});
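To choose good `explicitBucketBoundaries`, it helps to see how a measurement maps to a bucket: it lands in the first bucket whose upper bound it does not exceed. A conceptual sketch of that mapping (the SDK does this internally; `bucketIndex` is not a real API):

```typescript
// Conceptual model of explicit bucket boundaries: a value falls into the
// first bucket whose upper bound is >= the value; values above the last
// boundary go into an overflow bucket.
const BOUNDARIES = [10, 25, 50, 100, 250, 500, 1000, 2500, 5000];

function bucketIndex(valueMs: number, boundaries: number[] = BOUNDARIES): number {
  const i = boundaries.findIndex((b) => valueMs <= b);
  return i === -1 ? boundaries.length : i; // boundaries.length = overflow bucket
}
```

The practical consequence: boundaries should bracket the latencies you actually care about, since anything past the last boundary collapses into a single overflow bucket.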

Correlated Logs and Traces

import { trace } from "@opentelemetry/api";
import pino from "pino";

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return { traceId: ctx.traceId, spanId: ctx.spanId };
  },
});

// Now logs and traces are correlated
// Search by trace ID in Jaeger to find corresponding logs
logger.info({ userId: 123 }, "User logged in");
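Across service boundaries, the same trace and span ids travel in the W3C `traceparent` HTTP header; auto-instrumentation injects and extracts it for you, but knowing the layout helps when debugging correlation. A sketch with a hypothetical `toTraceparent` helper (not an SDK function):

```typescript
// W3C trace context header layout: version-traceId-spanId-flags
// (auto-instrumentation propagates this automatically on outgoing HTTP calls;
// this helper only illustrates the format)
function toTraceparent(traceId: string, spanId: string, sampled = true): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}
```

If the `traceId` in your logs matches the middle segment of an incoming `traceparent`, the log line and the distributed trace belong to the same request.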

Docker Compose: OTel Collector + Jaeger

version: "3.8"

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"

  jaeger:
    image: jaegertracing/all-in-one:1.55
    ports:
      - "16686:16686"  # Jaeger UI

  prometheus:
    image: prom/prometheus:v2.50.0
    # assumes a prometheus.yml (not shown) that scrapes the collector's metrics endpoint
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
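The compose file mounts `otel-collector-config.yaml` but its contents were not shown. A minimal sketch of what it might contain, assuming the service names and ports from the compose file above (exporter and pipeline names are my own choices):

```yaml
# otel-collector-config.yaml - minimal sketch, not a production config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317      # Jaeger all-in-one accepts OTLP here
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889     # Prometheus scrapes this port

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```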

Implementing OpenTelemetry makes "what is slow" and "what is eating costs" visible. Instrumenting LLM calls in particular is a high-leverage investment that pays off directly in AI system optimization.


This article is from the Claude Code Complete Guide (7 chapters) on note.com.
myouga (@myougatheaxo) - VTuber axolotl. Sharing practical AI development tips.
