Why OpenTelemetry?
When you build a monitoring system, the same thing always happens:
- You pick a vendor (Datadog, New Relic, Prometheus)
- You instrument your code against their API
- 2 years later you switch to a different vendor
- Now you need to rewrite ALL your instrumentation code
That's the definition of vendor lock-in.
OpenTelemetry (OTel) solves this: A single standard for observability that exports to any vendor.
Your code instruments against OTel, not against Datadog or Prometheus. Then you configure an "exporter" to decide where your data goes — Prometheus, Grafana Loki, Datadog, or OpenSearch.
Standards instead of vendor lock-in.
What Is OpenTelemetry?
OpenTelemetry is an open-source project from the Cloud Native Computing Foundation. It has three pillars:
1. Traces (Distributed Tracing)
Follows a single request through your system, millisecond by millisecond:

User Request comes in
├── API Gateway (5ms)
├── Auth Service (12ms)
│   └── Database Call (8ms)
├── Business Logic (45ms)
│   ├── Cache Check (2ms)
│   └── Database Query (30ms)
└── Response (2ms)

TOTAL: 64ms

You see not just that the request took 64ms, but where the time went (Auth at 12ms is fine; the 30ms database query could be optimized).
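How does such a tree come about? Every span opened inside another span's context becomes its child. A minimal Python sketch (span names are illustrative, and it assumes a configured SDK like the one set up later in this post):

from opentelemetry import trace

tracer = trace.get_tracer("demo")

with tracer.start_as_current_span("user-request"):          # root span
    with tracer.start_as_current_span("auth-service"):      # child
        with tracer.start_as_current_span("database-call"):
            pass  # verify credentials
    with tracer.start_as_current_span("business-logic"):    # sibling child
        with tracer.start_as_current_span("database-query"):
            pass  # fetch data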
2. Metrics (Quantitative Data)
Numerical measurements over time:
- Request rate (per second)
- Error rate (%)
- Latency (P50, P95, P99)
- Memory usage
- CPU usage
- Custom metrics (orders per minute, conversions, etc.)
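Recording a custom metric takes a few lines with the OTel metrics API. A hedged Python sketch (the meter and metric names are made up):

from opentelemetry import metrics

meter = metrics.get_meter("shop-backend")
orders_counter = meter.create_counter(
    "orders_processed",
    description="Number of processed orders",
)

# In your order handler:
orders_counter.add(1, {"payment.method": "credit_card"})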
3. Logs (Qualitative Data)
Textual data that provides context:
- Error stack traces
- Business events ("User registered")
- Debug information
OTel unifies all three in a single framework.
Installation in 10 Minutes
Step 1: Deploy the OpenTelemetry Collector
The OTel Collector is a lightweight agent that collects observability data from your services and forwards it.
# Add the chart repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

# Install the Collector (newer chart versions require an explicit mode and image)
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  --set mode=deployment \
  --set image.repository=otel/opentelemetry-collector-k8s
That's it. The Collector is now running in your cluster.
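A quick check that the pod came up:

kubectl get pods -n observability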
Step 2: Configure the Collector (YAML)
The Collector needs to know where your data should go. Create a values.yaml:
mode: deployment

config:
  # Receives data from your applications
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    # Optional: scrape Prometheus metrics
    prometheus:
      config:
        scrape_configs:
          - job_name: 'kubernetes-pods'
            kubernetes_sd_configs:
              - role: pod

  # Process the data (optional)
  processors:
    batch:
      timeout: 10s
      send_batch_size: 1024

  # Export to backends
  exporters:
    # Expose metrics for Prometheus to scrape
    # (8889 avoids the Collector's own telemetry port, 8888)
    prometheus:
      endpoint: "0.0.0.0:8889"
    # Send traces to an OTLP-capable trace backend such as Tempo or Jaeger
    # (adjust the host/port to your setup)
    otlp:
      endpoint: tempo.observability:4317
      tls:
        insecure: true

  # Wire everything together
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [otlp]
      metrics:
        receivers: [otlp, prometheus]
        processors: [batch]
        exporters: [prometheus]
Deploy:
helm upgrade otel-collector open-telemetry/opentelemetry-collector \
  -f values.yaml --namespace observability
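To confirm the pipelines started, check the Collector logs (the workload name is derived from the release and chart names, so adjust if yours differs):

kubectl logs -n observability deploy/otel-collector-opentelemetry-collector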
Step 3: Instrument Your Application
In your Node.js/Python/Go/Java app:
Node.js Example:
const opentelemetry = require("@opentelemetry/api");
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-grpc");

// Auto-instrumentation (traces, metrics for Express, DB, HTTP)
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4317"
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});
sdk.start();

// That's it! Every request is now traced.
// Note: start the SDK before requiring express, or the
// auto-instrumentation can't patch the module.
const express = require("express");
const app = express();

app.get("/api/users/:id", (req, res) => {
  // OpenTelemetry automatically traces:
  // - the incoming request
  // - database queries (if using Prisma/Sequelize)
  // - the outgoing response
  res.json({ id: req.params.id, name: "John" });
});

app.listen(3000);
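One production note: call sdk.shutdown() on SIGTERM so buffered spans are flushed before the process exits.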
Python Example:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

# Setup
otlp_exporter = OTLPSpanExporter(
    endpoint="otel-collector:4317",
    insecure=True
)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

# Use it in your Flask app (FastAPI has its own instrumentor)
from flask import Flask

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # without this, requests are NOT traced

@app.route("/api/users/<id>")
def get_user(id):
    # Automatically traced
    return {"id": id, "name": "John"}
Step 4: Visualize Metrics
Connect Prometheus or Grafana to the Collector:
# In your Prometheus config
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']  # the Collector's prometheus exporter port
Open Grafana → Add Data Source → Prometheus → Create queries:
# Request rate per second (metric names vary by SDK and semconv version)
rate(http_server_request_count[1m])

# P95 latency (histogram_quantile needs a rate over the buckets)
histogram_quantile(0.95, sum by (le) (rate(http_server_request_duration_seconds_bucket[5m])))

# Error rate
rate(http_server_request_count{status_code=~"5.."}[1m])
Done.
What You Now Have
After 10 minutes:
- Traces across your entire system
- Metrics in Prometheus
- Visualizations in Grafana
- Vendor-agnostic (any exporter works)
Common Use Cases
1. Finding a Slow Endpoint
Grafana shows: /api/checkout takes 2 seconds
You open Jaeger/Grafana Traces
You see:
- Payment Service takes 1.8s
- Database query takes 1.7s
Instantly clear where the problem is.
2. Discovering Service Dependencies
Don't know which services depend on each other?
OTel traces show: Service A calls Service B
Service B calls Database
Service C is unused (can be deleted)
3. Correlation IDs for Debugging
User says: "My order 12345 was not processed"
You search the logs for the order ID
The OTel trace follows the request: API → Database → Queue → Worker
You see: the Worker crashed at step 3
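This workflow gets much easier when every log line carries the trace ID. The opentelemetry-instrumentation-logging package (a separate install) injects it into Python's standard logging output; a minimal sketch:

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Adds trace_id/span_id fields to every standard log record
LoggingInstrumentor().instrument(set_logging_format=True)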
Best Practices
1. Use Auto-Instrumentation
Don't create every span by hand; auto-instrumentation covers the vast majority of the work. In Python, for example, zero-code instrumentation is a couple of commands:

# Installs the distro plus instrumentations matching your installed packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run your app with instrumentation injected
opentelemetry-instrument python app.py
2. Sampling for High Volume
If you have 100,000 requests/second, you can't trace everything. Use sampling:
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # only sample 10% (add it to the traces pipeline alongside batch)
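If you'd rather drop data before it ever leaves the application, the SDKs ship head-sampling too; a Python sketch using the built-in ratio sampler:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Keep roughly 10% of traces, decided at the root span
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))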
3. Custom Spans for Business Logic
const tracer = opentelemetry.trace.getTracer("my-app");

const span = tracer.startSpan("process-order");
span.setAttributes({
  "order.id": orderId,
  "order.amount": amount,
  "order.currency": "EUR"
});
// ... your business logic
span.end();
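The Python equivalent is even shorter with a context manager, which ends the span for you and records uncaught exceptions on it by default (assumes a tracer from the earlier setup):

tracer = trace.get_tracer("my-app")

with tracer.start_as_current_span("process-order") as span:
    span.set_attributes({
        "order.id": order_id,
        "order.amount": amount,
        "order.currency": "EUR",
    })
    # ... your business logic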
Integration with KI-Ops
KI-Ops uses OTel traces and metrics to automatically diagnose problems:
# KI-Ops reads traces from your OTel setup
ki-ops analyze --include-traces --include-metrics
# Output:
# Problem: /api/checkout takes 5 seconds (should be <1s)
# Root Cause (based on traces): Payment Service is 3s slow
# Why: Query without index on transactions table
# Fix: CREATE INDEX idx_payment_status ON transactions(status, created_at)
With KI-Ops Pro, the AI doesn't just diagnose — it generates a validated PR with the database migration, Helm values update, or YAML fix.
The Takeaway
OpenTelemetry isn't complicated:
- Deploy the OTel Collector (5 min)
- Instrument your apps (3 min)
- Visualize in Grafana (2 min)
Then you have complete visibility across your system — without vendor lock-in.
And that makes your incident response dramatically faster.
Try it out:
git clone https://github.com/open-telemetry/opentelemetry-demo
cd opentelemetry-demo
docker compose up
# Open http://localhost:8080 — done