Why OpenTelemetry?
When you build a monitoring system, the same thing always happens:
- You pick a vendor (Datadog, New Relic, Prometheus)
- You instrument your code against their API
- 2 years later you switch to a different vendor
- Now you need to rewrite ALL your instrumentation code
That's the definition of vendor lock-in.
OpenTelemetry (OTel) solves this: A single standard for observability that exports to any vendor.
Your code instruments against OTel, not against Datadog or Prometheus. Then you configure an "exporter" to decide where your data goes — Prometheus, Grafana Loki, Datadog, or OpenSearch.
Standards instead of vendor lock-in.
What Is OpenTelemetry?
OpenTelemetry is an open-source project from the Cloud Native Computing Foundation. It has three pillars:
1. Traces (Distributed Tracing)
Follows a single request through your system, millisecond by millisecond:

User Request comes in
├── API Gateway (5ms)
├── Auth Service (12ms)
│   └── Database Call (8ms)
├── Business Logic (45ms)
│   ├── Cache Check (2ms)
│   └── Database Query (30ms)
└── Response (2ms)

TOTAL: 64ms

You see not just that the request took 64ms, but where the time went (Auth at 12ms is fine; the 30ms database query could be optimized).
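How does such a tree come about? Every span opened inside another span's context becomes its child. A minimal Python sketch (span names are illustrative, and it assumes a configured SDK like the one set up later in this post):

from opentelemetry import trace

tracer = trace.get_tracer("demo")

with tracer.start_as_current_span("user-request"):          # root span
    with tracer.start_as_current_span("auth-service"):      # child
        with tracer.start_as_current_span("database-call"):
            pass  # verify credentials
    with tracer.start_as_current_span("business-logic"):    # sibling child
        with tracer.start_as_current_span("database-query"):
            pass  # fetch data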
2. Metrics (Quantitative Data)
Numerical measurements over time:
- Request rate (per second)
- Error rate (%)
- Latency (P50, P95, P99)
- Memory usage
- CPU usage
- Custom metrics (orders per minute, conversions, etc.)
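Recording a custom metric takes a few lines with the OTel metrics API. A hedged Python sketch (the meter and metric names are made up):

from opentelemetry import metrics

meter = metrics.get_meter("shop-backend")
orders_counter = meter.create_counter(
    "orders_processed",
    description="Number of processed orders",
)

# In your order handler:
orders_counter.add(1, {"payment.method": "credit_card"})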
3. Logs (Qualitative Data)
Textual data that provides context:
- Error stack traces
- Business events ("User registered")
- Debug information
OTel unifies all three in a single framework.
Installation in 10 Minutes
Step 1: Deploy the OpenTelemetry Collector
The OTel Collector is a lightweight agent that collects observability data from your services and forwards it.
# Add the chart repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

# Install the Collector (newer chart versions require an explicit mode and image)
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  --set mode=deployment \
  --set image.repository=otel/opentelemetry-collector-k8s
That's it. The Collector is now running in your cluster.
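A quick check that the pod came up:

kubectl get pods -n observability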
Step 2: Configure the Collector (YAML)
The Collector needs to know where your data should go. Create a values.yaml:
mode: deployment

config:
  # Receives data from your applications
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    # Optional: scrape Prometheus metrics
    prometheus:
      config:
        scrape_configs:
          - job_name: 'kubernetes-pods'
            kubernetes_sd_configs:
              - role: pod

  # Process the data (optional)
  processors:
    batch:
      timeout: 10s
      send_batch_size: 1024

  # Export to backends
  exporters:
    # Expose metrics for Prometheus to scrape
    # (8889 avoids the Collector's own telemetry port, 8888)
    prometheus:
      endpoint: "0.0.0.0:8889"
    # Send traces to an OTLP-capable trace backend such as Tempo or Jaeger
    # (adjust the host/port to your setup)
    otlp:
      endpoint: tempo.observability:4317
      tls:
        insecure: true

  # Wire everything together
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [otlp]
      metrics:
        receivers: [otlp, prometheus]
        processors: [batch]
        exporters: [prometheus]
Deploy:
helm upgrade otel-collector open-telemetry/opentelemetry-collector \
  -f values.yaml --namespace observability
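To confirm the pipelines started, check the Collector logs (the workload name is derived from the release and chart names, so adjust if yours differs):

kubectl logs -n observability deploy/otel-collector-opentelemetry-collector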
Step 3: Instrument Your Application
In your Node.js/Python/Go/Java app:
Node.js Example:
const opentelemetry = require("@opentelemetry/api");
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-grpc");

// Auto-instrumentation (traces, metrics for Express, DB, HTTP)
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4317"
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});
sdk.start();

// That's it! Every request is now traced.
// Note: start the SDK before requiring express, or the
// auto-instrumentation can't patch the module.
const express = require("express");
const app = express();

app.get("/api/users/:id", (req, res) => {
  // OpenTelemetry automatically traces:
  // - the incoming request
  // - database queries (if using Prisma/Sequelize)
  // - the outgoing response
  res.json({ id: req.params.id, name: "John" });
});

app.listen(3000);
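One production note: call sdk.shutdown() on SIGTERM so buffered spans are flushed before the process exits.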
Python Example:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

# Setup
otlp_exporter = OTLPSpanExporter(
    endpoint="otel-collector:4317",
    insecure=True
)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

# Use it in your Flask app (FastAPI has its own instrumentor)
from flask import Flask

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # without this, requests are NOT traced

@app.route("/api/users/<id>")
def get_user(id):
    # Automatically traced
    return {"id": id, "name": "John"}
Step 4: Visualize Metrics
Connect Prometheus or Grafana to the Collector:
# In your Prometheus config
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']  # the Collector's prometheus exporter port
Open Grafana → Add Data Source → Prometheus → Create queries:
# Request rate per second (metric names vary by SDK and semconv version)
rate(http_server_request_count[1m])

# P95 latency (histogram_quantile needs a rate over the buckets)
histogram_quantile(0.95, sum by (le) (rate(http_server_request_duration_seconds_bucket[5m])))

# Error rate
rate(http_server_request_count{status_code=~"5.."}[1m])
Done.
What You Now Have
After 10 minutes:
- Traces across your entire system
- Metrics in Prometheus
- Visualizations in Grafana
- Vendor-agnostic (any exporter works)
Common Use Cases
1. Finding a Slow Endpoint
Grafana shows: /api/checkout takes 2 seconds
You open Jaeger/Grafana Traces
You see:
- Payment Service takes 1.8s
- Database query takes 1.7s
Instantly clear where the problem is.
2. Discovering Service Dependencies
Don't know which services depend on each other?
OTel traces show: Service A calls Service B
Service B calls Database
Service C is unused (can be deleted)
3. Correlation IDs for Debugging
User says: "My order 12345 was not processed"
You search the logs for the order ID
The OTel trace follows the request: API → Database → Queue → Worker
You see: the Worker crashed at step 3
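This workflow gets much easier when every log line carries the trace ID. The opentelemetry-instrumentation-logging package (a separate install) injects it into Python's standard logging output; a minimal sketch:

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Adds trace_id/span_id fields to every standard log record
LoggingInstrumentor().instrument(set_logging_format=True)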
Best Practices
1. Use Auto-Instrumentation
Don't create every span by hand; auto-instrumentation covers the vast majority of the work. In Python, for example, zero-code instrumentation is a couple of commands:

# Installs the distro plus instrumentations matching your installed packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run your app with instrumentation injected
opentelemetry-instrument python app.py
2. Sampling for High Volume
If you have 100,000 requests/second, you can't trace everything. Use sampling:
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # only sample 10% (add it to the traces pipeline alongside batch)
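If you'd rather drop data before it ever leaves the application, the SDKs ship head-sampling too; a Python sketch using the built-in ratio sampler:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Keep roughly 10% of traces, decided at the root span
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))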
3. Custom Spans for Business Logic
const tracer = opentelemetry.trace.getTracer("my-app");

const span = tracer.startSpan("process-order");
span.setAttributes({
  "order.id": orderId,
  "order.amount": amount,
  "order.currency": "EUR"
});
// ... your business logic
span.end();
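The Python equivalent is even shorter with a context manager, which ends the span for you and records uncaught exceptions on it by default (assumes a tracer from the earlier setup):

tracer = trace.get_tracer("my-app")

with tracer.start_as_current_span("process-order") as span:
    span.set_attributes({
        "order.id": order_id,
        "order.amount": amount,
        "order.currency": "EUR",
    })
    # ... your business logic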
Integration with KI-Ops
KI-Ops uses OTel traces and metrics to automatically diagnose problems:
# KI-Ops reads traces from your OTel setup
ki-ops analyze --include-traces --include-metrics
# Output:
# Problem: /api/checkout takes 5 seconds (should be <1s)
# Root Cause (based on traces): Payment Service is 3s slow
# Why: Query without index on transactions table
# Fix: CREATE INDEX idx_payment_status ON transactions(status, created_at)
With KI-Ops Pro, the AI doesn't just diagnose — it generates a validated PR with the database migration, Helm values update, or YAML fix.
The Takeaway
OpenTelemetry isn't complicated:
- Deploy the OTel Collector (5 min)
- Instrument your apps (3 min)
- Visualize in Grafana (2 min)
Then you have complete visibility across your system — without vendor lock-in.
And that makes your incident response dramatically faster.
Try it out:
git clone https://github.com/open-telemetry/opentelemetry-demo
cd opentelemetry-demo
docker compose up
# Open http://localhost:8080 — done