LLM Application Observability: Complete Guide to Logging, Monitoring, and Debugging#
When your agent calls Claude 4, GPT-5, and Gemini 2.5 Pro at 3 AM to work through a multi-step reasoning task and still returns the wrong answer, an error log alone won't help: you need a complete observability system.
Why LLM Applications Need Specialized Observability#
Traditional web application observability revolves around request-response cycles, database queries, and CPU/memory metrics. LLM applications introduce entirely new dimensions of complexity:
- Non-deterministic outputs: The same input can produce different results every time
- Expensive operations: A single API call can cost several dollars
- Multi-model orchestration: One user request may chain 3-5 model calls across providers
- Quality is hard to quantify: The line between “correct” and “hallucination” is blurry
- Wild latency variance: Response times can range from 200ms to 30s+
In 2026, with models like Claude 4 Opus, GPT-5, Gemini 2.5 Pro, Llama 4, and DeepSeek-V3 deployed at production scale, observability has evolved from “nice-to-have” to “absolutely essential.”
The Three Pillars of Observability for LLM Applications#
1. Structured Logging for LLM Calls#
LLM call logging is not just print(response). You need to capture the full context of every call.
Core Field Design#
import json
import time
import uuid
from dataclasses import dataclass, asdict
from typing import Optional
@dataclass
class LLMCallLog:
request_id: str
trace_id: str
timestamp: str
model: str # e.g. "claude-4-opus", "gpt-5"
provider: str # e.g. "anthropic", "openai"
prompt_tokens: int
completion_tokens: int
total_tokens: int
latency_ms: float
cost_usd: float
status: str # "success" | "error" | "timeout"
error_type: Optional[str]
temperature: float
max_tokens: int
user_id: Optional[str]
session_id: Optional[str]
prompt_hash: str # For dedup/clustering, never store raw
response_hash: str
metadata: dict # Custom fields
class LLMLogger:
def __init__(self, log_path: str = "/var/log/llm/calls.jsonl"):
self.log_path = log_path
self.token_prices = {
"claude-4-opus": {"input": 15.0, "output": 75.0},
"claude-4-sonnet": {"input": 3.0, "output": 15.0},
"gpt-5": {"input": 10.0, "output": 30.0},
"gpt-5-mini": {"input": 1.5, "output": 6.0},
"gemini-2.5-pro": {"input": 7.0, "output": 21.0},
"deepseek-v3": {"input": 0.27, "output": 1.10},
"llama-4-maverick": {"input": 0.20, "output": 0.60},
}
def calculate_cost(self, model: str, prompt_tokens: int,
completion_tokens: int) -> float:
prices = self.token_prices.get(model, {"input": 0, "output": 0})
return (prompt_tokens * prices["input"] +
completion_tokens * prices["output"]) / 1_000_000
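    @staticmethod
    def hash_text(text: str) -> str:
        # Helper sketch (not in the original class): a stable digest for
        # filling prompt_hash / response_hash so raw content never reaches logs.
        import hashlib
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]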
def log_call(self, log_entry: LLMCallLog):
with open(self.log_path, "a") as f:
            f.write(json.dumps(asdict(log_entry), ensure_ascii=False) + "\n")
Log Context Propagation#
In async Python applications, use contextvars to propagate trace IDs:
import contextvars
trace_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
'trace_id', default=''
)
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
'request_id', default=''
)
def get_current_trace_id() -> str:
return trace_id_var.get() or str(uuid.uuid4())
# Set at the entry point
async def handle_request(request):
trace_id = str(uuid.uuid4())
trace_id_var.set(trace_id)
request_id_var.set(str(uuid.uuid4()))
    # ... handle request
2. Metrics: Latency, Tokens, Cost, Error Rate#
Key Metrics Matrix#
| Category | Metric Name | Type | Description |
|---|---|---|---|
| Latency | llm_request_duration_seconds | Histogram | End-to-end request latency |
| Latency | llm_time_to_first_token_seconds | Histogram | TTFT for streaming |
| Throughput | llm_requests_total | Counter | Total request count |
| Tokens | llm_tokens_total | Counter | Total tokens consumed |
| Cost | llm_cost_usd_total | Counter | Cumulative cost |
| Errors | llm_errors_total | Counter | Error count by type |
| Quality | llm_quality_score | Histogram | Quality evaluation score |
| Cache | llm_cache_hit_ratio | Gauge | Cache hit rate |
Prometheus Metric Definitions#
from prometheus_client import Histogram, Counter, Gauge
# Request latency
LLM_REQUEST_DURATION = Histogram(
'llm_request_duration_seconds',
'LLM API request duration in seconds',
['model', 'provider', 'operation', 'status'],
buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0]
)
# Time to First Token
LLM_TTFT = Histogram(
'llm_time_to_first_token_seconds',
'Time to first token for streaming requests',
['model', 'provider'],
buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0]
)
# Token consumption
LLM_TOKENS = Counter(
'llm_tokens_total',
'Total tokens consumed',
['model', 'provider', 'token_type'] # token_type: input/output
)
# Request cost
LLM_COST = Counter(
'llm_cost_usd_total',
'Total cost in USD',
['model', 'provider']
)
# Error counter
LLM_ERRORS = Counter(
'llm_errors_total',
'Total LLM errors',
['model', 'provider', 'error_type']
)
# Active requests
LLM_ACTIVE_REQUESTS = Gauge(
'llm_active_requests',
'Currently active LLM requests',
['model', 'provider']
)
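# Request counter (not defined in the original snippet, but the metrics table,
# the error-rate dashboard panel, and the alert rules all reference
# llm_requests_total; the label set here is an assumption)
LLM_REQUESTS = Counter(
    'llm_requests_total',
    'Total LLM requests',
    ['model', 'provider', 'status']
)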
# Quality scores
LLM_QUALITY_SCORE = Histogram(
'llm_quality_score',
'LLM response quality score (0-1)',
['model', 'evaluator'],
buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)
Auto-Instrumentation Middleware#
import asyncio
from functools import wraps
def llm_instrumented(model: str, provider: str, operation: str = "chat"):
"""Decorator: automatically instrument LLM call metrics"""
def decorator(func):
@wraps(func)
async def wrapper(*args, **kwargs):
LLM_ACTIVE_REQUESTS.labels(model=model, provider=provider).inc()
start_time = time.time()
status = "success"
error_type = None
try:
result = await func(*args, **kwargs)
# Record tokens
LLM_TOKENS.labels(
model=model, provider=provider, token_type="input"
).inc(result.prompt_tokens)
LLM_TOKENS.labels(
model=model, provider=provider, token_type="output"
).inc(result.completion_tokens)
# Record cost
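                # calculate_cost here is assumed to be a module-level helper;
                # reuse the price-table logic from LLMLogger.calculate_cost above.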
cost = calculate_cost(model, result.prompt_tokens,
result.completion_tokens)
LLM_COST.labels(model=model, provider=provider).inc(cost)
return result
except Exception as e:
status = "error"
error_type = type(e).__name__
LLM_ERRORS.labels(
model=model, provider=provider, error_type=error_type
).inc()
raise
finally:
duration = time.time() - start_time
LLM_REQUEST_DURATION.labels(
model=model, provider=provider,
operation=operation, status=status
).observe(duration)
LLM_ACTIVE_REQUESTS.labels(
model=model, provider=provider
).dec()
return wrapper
return decorator
# Usage
@llm_instrumented(model="gpt-5", provider="openai", operation="chat")
async def call_gpt5(prompt: str):
return await openai_client.chat.completions.create(
model="gpt-5",
messages=[{"role": "user", "content": prompt}]
    )
Grafana Dashboard Configuration#
{
"dashboard": {
"title": "LLM Observability - 2026",
"panels": [
{
"title": "Request Latency Distribution (P50/P95/P99)",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(llm_request_duration_seconds_bucket[5m]))",
"legendFormat": "P50"
},
{
"expr": "histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))",
"legendFormat": "P95"
},
{
"expr": "histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))",
"legendFormat": "P99"
}
]
},
{
"title": "Token Consumption Rate by Model",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(llm_tokens_total[5m])) by (model)",
"legendFormat": "{{model}}"
}
]
},
{
"title": "Hourly Cost",
"type": "stat",
"targets": [
{
"expr": "sum(increase(llm_cost_usd_total[1h]))",
"legendFormat": "Cost/hour"
}
]
},
{
"title": "Error Rate",
"type": "timeseries",
"targets": [
{
"expr": "rate(llm_errors_total[5m]) / rate(llm_requests_total[5m]) * 100",
"legendFormat": "Error % ({{model}})"
}
]
}
]
}
}
3. Distributed Tracing Across Multi-Model Calls#
Multi-agent and multi-model orchestration is the standard architecture in 2026 LLM applications. A single user request might traverse:
User Request → Router Agent
├─ Claude 4 Opus (complex reasoning)
├─ GPT-5 (code generation)
├─ Gemini 2.5 Pro (multimodal understanding)
├─ Llama 4 (fast local classification)
└─ DeepSeek-V3 (data extraction)
OpenTelemetry Integration#
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
OTLPSpanExporter
)
from opentelemetry.sdk.resources import Resource
# Initialize Tracer
resource = Resource.create({
"service.name": "llm-agent-service",
"service.version": "2.0.0",
"deployment.environment": "production",
})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-observability")
async def traced_llm_call(
model: str,
messages: list,
parent_span: trace.Span = None
):
"""LLM call with distributed tracing"""
with tracer.start_as_current_span(
f"llm.call.{model}",
kind=trace.SpanKind.CLIENT,
attributes={
"llm.model": model,
"llm.provider": get_provider(model),
"llm.request.type": "chat",
"llm.prompt.length": sum(len(m["content"]) for m in messages),
}
) as span:
try:
response = await call_model(model, messages)
span.set_attribute("llm.response.tokens.prompt",
response.usage.prompt_tokens)
span.set_attribute("llm.response.tokens.completion",
response.usage.completion_tokens)
span.set_attribute("llm.response.tokens.total",
response.usage.total_tokens)
span.set_attribute("llm.response.finish_reason",
response.choices[0].finish_reason)
span.set_status(trace.Status(trace.StatusCode.OK))
return response
except Exception as e:
span.set_status(
trace.Status(trace.StatusCode.ERROR, str(e))
)
span.record_exception(e)
raise
# Multi-model orchestration tracing
async def multi_model_agent(user_query: str):
with tracer.start_as_current_span("agent.multi_model_pipeline") as root:
root.set_attribute("user.query.length", len(user_query))
# Parallel model calls
with tracer.start_as_current_span("parallel.model_calls"):
results = await asyncio.gather(
traced_llm_call("claude-4-opus", complex_reasoning_prompt),
traced_llm_call("gpt-5", code_generation_prompt),
traced_llm_call("gemini-2.5-pro", multimodal_prompt),
)
# Synthesize results
with tracer.start_as_current_span("agent.synthesize"):
final = await traced_llm_call(
"claude-4-opus",
synthesize_prompt(results)
)
        return final
4. Prompt/Response Logging with PII Redaction#
Recording raw prompts and responses is critical for debugging, but sensitive information must be handled properly.
PII Redaction Solution#
import re
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
class PIIRedactor:
"""PII redactor for LLM requests/responses"""
def __init__(self):
self.analyzer = AnalyzerEngine()
self.anonymizer = AnonymizerEngine()
# Custom patterns
self.custom_patterns = {
"api_key": re.compile(
r'(sk-[a-zA-Z0-9]{20,}|AIza[a-zA-Z0-9_-]{35})'
),
"phone_cn": re.compile(r'1[3-9]\d{9}'),
"ssn": re.compile(r'\d{3}-\d{2}-\d{4}'),
}
def redact(self, text: str, language: str = "en") -> str:
# Use Presidio for PII detection
results = self.analyzer.analyze(
text=text,
entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
"CREDIT_CARD", "IP_ADDRESS"],
language=language,
)
anonymized = self.anonymizer.anonymize(
text=text, analyzer_results=results
)
# Apply custom regex
result = anonymized.text
for name, pattern in self.custom_patterns.items():
result = pattern.sub(f"[REDACTED_{name.upper()}]", result)
return result
def safe_log_prompt(self, messages: list) -> list:
"""Safely log prompts with PII redaction"""
return [
{**msg, "content": self.redact(msg["content"])}
for msg in messages
]
# Usage
from datetime import datetime, timezone
redactor = PIIRedactor()
def safe_log_llm_call(request, response):
safe_log = {
"request_id": str(uuid.uuid4()),
"timestamp": datetime.utcnow().isoformat(),
"model": request.model,
"messages": redactor.safe_log_prompt(request.messages),
"response": redactor.redact(response.content),
"metadata": {
"prompt_tokens": response.usage.prompt_tokens,
"completion_tokens": response.usage.completion_tokens,
}
}
    logger.info(json.dumps(safe_log))
5. Quality Monitoring & Hallucination Detection#
Quality monitoring in 2026 goes far beyond simple human evaluation.
Automated Hallucination Detection#
class HallucinationDetector:
"""Multi-strategy hallucination detector"""
def __init__(self):
self.fact_checker_model = "claude-4-sonnet"
self.fact_checker = LiteLLMClient(model=self.fact_checker_model)
async def detect(
self,
query: str,
response: str,
context: list[str] = None
) -> dict:
scores = {}
# Strategy 1: Context-based faithfulness check
if context:
scores["context_faithfulness"] = await self._check_faithfulness(
response, context
)
# Strategy 2: Self-consistency check (multiple sampling)
scores["self_consistency"] = await self._check_self_consistency(
query, response
)
# Strategy 3: Fact verification
scores["fact_check"] = await self._fact_check(response)
# Strategy 4: Citation verification
scores["citation_accuracy"] = await self._verify_citations(
response, context
)
# Composite score
weights = {
"context_faithfulness": 0.35,
"self_consistency": 0.25,
"fact_check": 0.25,
"citation_accuracy": 0.15
}
composite = sum(
scores.get(k, 0) * v for k, v in weights.items()
)
return {
"hallucination_score": 1.0 - composite,
"detail_scores": scores,
"is_hallucination": composite < 0.6,
"confidence": self._calculate_confidence(scores),
}
async def _check_faithfulness(
self, response: str, context: list[str]
) -> float:
prompt = f"""Evaluate whether the following answer is faithful to the provided context.
Score based only on context information, 0=completely unfaithful, 1=fully faithful.
Context: {chr(10).join(context)}
Answer: {response}
Output a number between 0-1."""
result = await self.fact_checker.complete(prompt)
try:
return float(result.strip())
except ValueError:
return 0.5
async def _check_self_consistency(
self, query: str, response: str
) -> float:
"""Multi-sample consistency check"""
samples = []
for _ in range(3):
sample = await self.fact_checker.complete(
f"Answer the following question: {query}"
)
samples.append(sample)
# Simplified consistency: compare key information points
agreements = 0
total = 0
response_claims = self._extract_claims(response)
for sample in samples:
sample_claims = self._extract_claims(sample)
for claim in response_claims:
if any(self._claims_match(claim, sc)
for sc in sample_claims):
agreements += 1
total += 1
return agreements / total if total > 0 else 0.5
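    def _extract_claims(self, text: str) -> list[str]:
        # Minimal placeholder (referenced above but never defined in the
        # article): treat each sentence as one claim; swap in a proper
        # claim-extraction model for production use.
        sentences = text.replace("!", ".").replace("?", ".").split(".")
        return [s.strip() for s in sentences if s.strip()]
    def _claims_match(self, claim_a: str, claim_b: str) -> bool:
        # Minimal placeholder: crude token-overlap similarity between claims.
        a, b = set(claim_a.lower().split()), set(claim_b.lower().split())
        return len(a & b) / max(len(a | b), 1) > 0.5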
# Quality metrics reporting
async def evaluate_and_report(
query: str, response: str, model: str
):
detector = HallucinationDetector()
result = await detector.detect(query, response)
# Report to Prometheus
LLM_QUALITY_SCORE.labels(
model=model, evaluator="hallucination"
).observe(1.0 - result["hallucination_score"])
if result["is_hallucination"]:
logger.warning(
f"Potential hallucination detected",
extra={
"model": model,
"hallucination_score": result["hallucination_score"],
"detail_scores": result["detail_scores"],
}
)
    return result
6. Cost Dashboards and Alerts#
Cost Tracking & Budget Alerts#
import asyncio
# Cost budget alert rules (Prometheus AlertManager)
ALERT_RULES = """
groups:
- name: llm_cost_alerts
rules:
- alert: LLMHourlyCostHigh
expr: sum(increase(llm_cost_usd_total[1h])) > 50
for: 5m
labels:
severity: warning
annotations:
summary: "LLM hourly cost exceeds $50"
description: "Current hourly cost: {{ $value | humanize }} USD"
- alert: LLMDailyCostCritical
expr: sum(increase(llm_cost_usd_total[24h])) > 500
for: 10m
labels:
severity: critical
annotations:
summary: "LLM daily cost exceeds $500"
description: "Current daily cost: {{ $value | humanize }} USD"
- alert: LLMTokenRateAnomaly
expr: rate(llm_tokens_total[5m]) > 3 * rate(llm_tokens_total[1h] offset 1d)
for: 15m
labels:
severity: warning
annotations:
summary: "Token consumption rate anomaly detected"
description: "Current rate is 3x above the same period yesterday"
- alert: LLMErrorRateHigh
expr: rate(llm_errors_total[5m]) / rate(llm_requests_total[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "LLM error rate exceeds 10%"
"""
# Dynamic cost budget management
class CostBudgetManager:
def __init__(self, daily_limit: float = 100.0,
hourly_limit: float = 20.0):
self.daily_limit = daily_limit
self.hourly_limit = hourly_limit
        self.daily_remaining = Gauge('llm_budget_daily_remaining_usd',
                                     'Remaining daily budget')
        self.hourly_remaining = Gauge('llm_budget_hourly_remaining_usd',
                                      'Remaining hourly budget')
async def check_budget(self, model: str,
estimated_cost: float) -> bool:
"""Check budget before making a call"""
remaining = await self._get_remaining_budget()
if estimated_cost > remaining["hourly"]:
logger.warning(
f"Budget exceeded: estimated ${estimated_cost:.4f}, "
f"hourly remaining ${remaining['hourly']:.4f}"
)
return False
return True
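    def estimate_cost(self, model: str, prompt_tokens: int,
                      expected_output_tokens: int) -> float:
        # Sketch (not part of the original class): derive the estimated_cost
        # argument for check_budget from the per-million-token price table
        # defined on LLMLogger in the structured-logging section.
        return LLMLogger().calculate_cost(model, prompt_tokens,
                                          expected_output_tokens)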
async def _get_remaining_budget(self) -> dict:
# Query current spend from Prometheus
        pass
7. Debugging Tools and Techniques#
Common Issue Diagnostic Checklist#
from statistics import mean
class LLMDebugger:
"""LLM call diagnostic tool"""
def diagnose(self, call_log: dict) -> list[str]:
issues = []
# 1. Latency anomaly
if call_log["latency_ms"] > 10000:
issues.append(
f"⚠️ High latency: {call_log['latency_ms']}ms "
f"(model: {call_log['model']})"
)
# 2. Token efficiency
ratio = (call_log["completion_tokens"] /
max(call_log["prompt_tokens"], 1))
if ratio > 10:
issues.append(
f"⚠️ Output/Input ratio too high: {ratio:.1f}x, "
f"consider optimizing your prompt"
)
# 3. Cost spike
expected_cost = self._get_expected_cost(call_log["model"])
if call_log["cost_usd"] > expected_cost * 2:
issues.append(
f"⚠️ Cost anomaly: ${call_log['cost_usd']:.4f} "
f"(expected: ${expected_cost:.4f})"
)
# 4. Frequent retries
if call_log.get("retry_count", 0) > 2:
issues.append(
f"⚠️ Frequent retries: {call_log['retry_count']} attempts, "
f"error type: {call_log.get('error_type')}"
)
# 5. Truncation detection
if call_log.get("finish_reason") == "length":
issues.append(
"⚠️ Output truncated (max_tokens too low)"
)
return issues
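    def _get_expected_cost(self, model: str) -> float:
        # Rough placeholder baselines, an assumption (the original never
        # defines this helper); in practice derive the expected per-call cost
        # from the historical p50 for the same model and route.
        baselines = {"claude-4-opus": 0.30, "gpt-5": 0.15, "gemini-2.5-pro": 0.10}
        return baselines.get(model, 0.05)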
def compare_models(
self, logs: list[dict], models: list[str]
) -> dict:
"""Compare different models on the same request set"""
comparison = {}
for model in models:
model_logs = [l for l in logs if l["model"] == model]
if model_logs:
comparison[model] = {
"avg_latency_ms": mean(
[l["latency_ms"] for l in model_logs]
),
"avg_cost_usd": mean(
[l["cost_usd"] for l in model_logs]
),
"success_rate": (
len([l for l in model_logs
if l["status"] == "success"])
/ len(model_logs)
),
"avg_quality_score": mean(
[l.get("quality_score", 0)
for l in model_logs]
),
}
        return comparison
Interactive Debug Session#
class LLMDebugSession:
"""Interactive debug session for replaying requests step by step"""
def __init__(self, trace_id: str):
self.trace_id = trace_id
self.calls = self._load_trace(trace_id)
def _load_trace(self, trace_id: str) -> list[dict]:
# Load complete trace from log storage
pass
def timeline(self):
"""Display call timeline"""
for i, call in enumerate(self.calls):
bar = "█" * int(call["latency_ms"] / 100)
print(f"[{i}] {call['model']:25s} | "
f"{call['latency_ms']:8.0f}ms | "
f"{bar}")
def replay_call(self, index: int, model: str = None):
"""Replay a single call with a different model"""
original = self.calls[index]
target_model = model or original["model"]
print(f"Replaying with {target_model}...")
# Replay logic
pass
def export_for_evaluation(self) -> dict:
"""Export trace data for quality evaluation"""
return {
"trace_id": self.trace_id,
"calls": self.calls,
"total_cost": sum(c["cost_usd"] for c in self.calls),
"total_latency_ms": sum(c["latency_ms"] for c in self.calls),
"models_used": list(set(c["model"] for c in self.calls)),
        }
8. Popular Tools: LangSmith, Helicone, Lunary & Custom Solutions#
The LLM observability tool ecosystem is mature in 2026. Here’s a comparison of the major players.
LangSmith#
The official LangChain platform with deep LangChain/LangGraph integration.
from langsmith import traceable
@traceable(
name="my_agent",
run_type="chain",
metadata={"version": "2.0"}
)
async def my_agent(query: str):
# LangSmith auto-records input/output, latency, token usage
result = await chain.ainvoke({"query": query})
    return result
Strengths: Seamless LangChain ecosystem integration, powerful Prompt Hub, built-in evaluation framework.
Helicone#
Proxy-based logging with zero code changes.
from openai import OpenAI
# Just change the base_url
client = OpenAI(
base_url="https://oai.helicone.ai/v1",
default_headers={
"Helicone-Auth": "Bearer YOUR_HELICONE_KEY",
"Helicone-User-Id": "user-123",
}
)
Strengths: Zero instrumentation, caching support, cost analysis dashboard.
Lunary#
Open-source full-stack observability platform.
import lunary
lunary.init(app_id="your-app-id")
@lunary.track()
async def chat_handler(message: str):
# Lunary auto-captures call data
response = await client.chat.completions.create(...)
    return response
Strengths: Fully open-source, built-in user feedback collection, multi-model comparison.
Tool Comparison#
| Feature | LangSmith | Helicone | Lunary | Custom |
|---|---|---|---|---|
| Open Source | ❌ | ❌ | ✅ | ✅ |
| Proxy Mode | ❌ | ✅ | ❌ | N/A |
| PII Redaction | ✅ | ✅ | ✅ | Custom |
| Cost Tracking | ✅ | ✅ | ✅ | Custom |
| Tracing | ✅ | Limited | ✅ | Custom |
| Eval Framework | ✅ | ❌ | ✅ | Custom |
| Pricing | From $39/mo | Free tier | Free tier | Infra cost |
XiDao API Gateway: Out-of-the-Box LLM Observability#
If you’re using XiDao API Gateway, you already have a powerful observability foundation.
Core Features#
1. Unified Request Logging
XiDao Gateway automatically logs all LLM calls passing through it, with no application code changes needed:
# xidao-gateway configuration
observability:
logging:
enabled: true
format: json
include_request_body: true
include_response_body: true
pii_redaction:
enabled: true
patterns:
- email
- phone
- credit_card
- api_key
storage:
type: elasticsearch
endpoint: "https://es.example.com:9200"
index: "llm-logs-{yyyy.MM.dd}"2. Real-time Metrics Exposure
observability:
metrics:
enabled: true
endpoint: /metrics
format: prometheus
custom_labels:
- team
- environment
      - cost_center
XiDao auto-generates standard metrics like llm_request_duration_seconds and llm_tokens_total, ready for Grafana integration.
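Before wiring up Grafana panels, a quick sanity check is to scrape the gateway's /metrics endpoint and confirm the metric families are exposed. A minimal sketch, assuming a reachable gateway host (the URL below is a placeholder for your deployment):
import requests
# Hypothetical gateway metrics address; replace with your deployment's endpoint.
METRICS_URL = "http://xidao-gateway:9090/metrics"
resp = requests.get(METRICS_URL, timeout=5)
resp.raise_for_status()
for family in ("llm_request_duration_seconds", "llm_tokens_total", "llm_cost_usd_total"):
    present = any(line.startswith(family) for line in resp.text.splitlines())
    print(f"{family}: {'present' if present else 'MISSING'}")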
3. Distributed Tracing Injection
observability:
tracing:
enabled: true
exporter: otlp
endpoint: "http://jaeger-collector:4317"
sample_rate: 0.1 # 10% sampling in production
    propagation: w3c
4. Cost Dashboard
XiDao has built-in cost tracking with team, user, and project-level analysis:
# View cost distribution for the past 24 hours
xidao cost report --period 24h --group-by team
# Set budget alerts
xidao cost alert set \
--team=engineering \
--daily-limit=200 \
--hourly-limit=30 \
  --webhook=https://hooks.slack.com/xxx
5. Multi-Model A/B Testing Tracing
routing:
ab_tests:
- name: "model-comparison-q2-2026"
variants:
- model: claude-4-opus
weight: 30
- model: gpt-5
weight: 40
- model: gemini-2.5-pro
weight: 30
metrics:
- latency_p95
- quality_score
        - cost_per_request
Best Practices Summary#
Layered Observability Architecture#
┌─────────────────────────────────────────────────┐
│ Application Layer │
│ Structured Logs │ Business Metrics │ Quality │
├─────────────────────────────────────────────────┤
│ Collection Layer │
│ XiDao Gateway │ OpenTelemetry Collector │
├─────────────────────────────────────────────────┤
│ Storage Layer │
│ Elasticsearch │ Prometheus │ ClickHouse │
├─────────────────────────────────────────────────┤
│ Visualization Layer │
│ Grafana │ LangSmith │ Custom Dashboard │
├─────────────────────────────────────────────────┤
│ Alerting Layer │
│ AlertManager │ PagerDuty │ Slack Webhook │
└─────────────────────────────────────────────────┘
Key Recommendations#
- Start logging from day one: Log schema is hard to change later — design it carefully upfront
- trace_id through the entire chain: Every step from user request to final response must carry it
- PII redaction is non-negotiable: When in doubt, redact more, not less
- Cost monitoring must be real-time: LLM costs can spiral out of control in minutes
- Automate quality monitoring: Human evaluation doesn’t scale — build automated evaluation pipelines
- Use XiDao Gateway to simplify infrastructure: Let the gateway handle log collection and metrics exposure while your app focuses on business logic
Conclusion#
LLM applications in 2026 are no longer simple API calls — they are complex multi-model orchestration systems. Observability is not optional; it’s a fundamental requirement for surviving in production.
Start with structured logging, then progressively add metrics, distributed tracing, quality monitoring, and cost alerting. Use XiDao API Gateway as your observability entry point to make building the entire system simple and efficient.
Remember: You can’t optimize what you can’t see.
Author: XiDao Team | May 2026
Want to learn more about LLM observability practices? Visit XiDao Docs or join our community discussions.