LLM Application Observability: Complete Guide to Logging, Monitoring, and Debugging#
When your agent calls Claude 4, GPT-5, and Gemini 2.5 Pro at 3 AM to work through a multi-step reasoning task and still returns the wrong answer, an error log alone won't help: you need a complete observability system.
Why LLM Applications Need Specialized Observability#
Traditional web application observability revolves around request-response cycles, database queries, and CPU/memory metrics. LLM applications introduce entirely new dimensions of complexity:
- Non-deterministic outputs: The same input can produce different results every time
- Expensive operations: A single API call can cost several dollars
- Multi-model orchestration: One user request may chain 3-5 model calls across providers
- Quality is hard to quantify: The line between “correct” and “hallucination” is blurry
- Wild latency variance: Response times can range from 200ms to 30s+
In 2026, with models like Claude 4 Opus, GPT-5, Gemini 2.5 Pro, Llama 4, and DeepSeek-V3 deployed at production scale, observability has evolved from “nice-to-have” to “absolutely essential.”
The Three Pillars of Observability for LLM Applications#
1. Structured Logging for LLM Calls#
LLM call logging is not just print(response). You need to capture the full context of every call.
Core Field Design#
import json
import time
import uuid
from dataclasses import dataclass, asdict
from typing import Optional
@dataclass
class LLMCallLog:
request_id: str
trace_id: str
timestamp: str
model: str # e.g. "claude-4-opus", "gpt-5"
provider: str # e.g. "anthropic", "openai"
prompt_tokens: int
completion_tokens: int
total_tokens: int
latency_ms: float
cost_usd: float
status: str # "success" | "error" | "timeout"
error_type: Optional[str]
temperature: float
max_tokens: int
user_id: Optional[str]
session_id: Optional[str]
prompt_hash: str # For dedup/clustering, never store raw
response_hash: str
metadata: dict # Custom fields
class LLMLogger:
def __init__(self, log_path: str = "/var/log/llm/calls.jsonl"):
self.log_path = log_path
self.token_prices = {
"claude-4-opus": {"input": 15.0, "output": 75.0},
"claude-4-sonnet": {"input": 3.0, "output": 15.0},
"gpt-5": {"input": 10.0, "output": 30.0},
"gpt-5-mini": {"input": 1.5, "output": 6.0},
"gemini-2.5-pro": {"input": 7.0, "output": 21.0},
"deepseek-v3": {"input": 0.27, "output": 1.10},
"llama-4-maverick": {"input": 0.20, "output": 0.60},
}
def calculate_cost(self, model: str, prompt_tokens: int,
completion_tokens: int) -> float:
prices = self.token_prices.get(model, {"input": 0, "output": 0})
return (prompt_tokens * prices["input"] +
completion_tokens * prices["output"]) / 1_000_000
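    @staticmethod
    def hash_text(text: str) -> str:
        # Helper sketch (not in the original class): a stable digest for
        # filling prompt_hash / response_hash so raw content never reaches logs.
        import hashlib
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]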
def log_call(self, log_entry: LLMCallLog):
with open(self.log_path, "a") as f:
            f.write(json.dumps(asdict(log_entry), ensure_ascii=False) + "\n")
Log Context Propagation#
In async Python applications, use contextvars to propagate trace IDs:
import contextvars
trace_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
'trace_id', default=''
)
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
'request_id', default=''
)
def get_current_trace_id() -> str:
return trace_id_var.get() or str(uuid.uuid4())
# Set at the entry point
async def handle_request(request):
trace_id = str(uuid.uuid4())
trace_id_var.set(trace_id)
request_id_var.set(str(uuid.uuid4()))
    # ... handle request
2. Metrics: Latency, Tokens, Cost, Error Rate#
Key Metrics Matrix#
| Category | Metric Name | Type | Description |
|---|---|---|---|
| Latency | llm_request_duration_seconds | Histogram | End-to-end request latency |
| Latency | llm_time_to_first_token_seconds | Histogram | TTFT for streaming |
| Throughput | llm_requests_total | Counter | Total request count |
| Tokens | llm_tokens_total | Counter | Total tokens consumed |
| Cost | llm_cost_usd_total | Counter | Cumulative cost |
| Errors | llm_errors_total | Counter | Error count by type |
| Quality | llm_quality_score | Histogram | Quality evaluation score |
| Cache | llm_cache_hit_ratio | Gauge | Cache hit rate |
Prometheus Metric Definitions#
from prometheus_client import Histogram, Counter, Gauge
# Request latency
LLM_REQUEST_DURATION = Histogram(
'llm_request_duration_seconds',
'LLM API request duration in seconds',
['model', 'provider', 'operation', 'status'],
buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0]
)
# Time to First Token
LLM_TTFT = Histogram(
'llm_time_to_first_token_seconds',
'Time to first token for streaming requests',
['model', 'provider'],
buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0]
)
# Token consumption
LLM_TOKENS = Counter(
'llm_tokens_total',
'Total tokens consumed',
['model', 'provider', 'token_type'] # token_type: input/output
)
# Request cost
LLM_COST = Counter(
'llm_cost_usd_total',
'Total cost in USD',
['model', 'provider']
)
# Error counter
LLM_ERRORS = Counter(
'llm_errors_total',
'Total LLM errors',
['model', 'provider', 'error_type']
)
# Active requests
LLM_ACTIVE_REQUESTS = Gauge(
'llm_active_requests',
'Currently active LLM requests',
['model', 'provider']
)
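# Request counter (not defined in the original snippet, but the metrics table,
# the error-rate dashboard panel, and the alert rules all reference
# llm_requests_total; the label set here is an assumption)
LLM_REQUESTS = Counter(
    'llm_requests_total',
    'Total LLM requests',
    ['model', 'provider', 'status']
)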
# Quality scores
LLM_QUALITY_SCORE = Histogram(
'llm_quality_score',
'LLM response quality score (0-1)',
['model', 'evaluator'],
buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)
Auto-Instrumentation Middleware#
import asyncio
from functools import wraps
def llm_instrumented(model: str, provider: str, operation: str = "chat"):
"""Decorator: automatically instrument LLM call metrics"""
def decorator(func):
@wraps(func)
async def wrapper(*args, **kwargs):
LLM_ACTIVE_REQUESTS.labels(model=model, provider=provider).inc()
start_time = time.time()
status = "success"
error_type = None
try:
result = await func(*args, **kwargs)
# Record tokens
LLM_TOKENS.labels(
model=model, provider=provider, token_type="input"
).inc(result.prompt_tokens)
LLM_TOKENS.labels(
model=model, provider=provider, token_type="output"
).inc(result.completion_tokens)
# Record cost
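                # calculate_cost here is assumed to be a module-level helper;
                # reuse the price-table logic from LLMLogger.calculate_cost above.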
cost = calculate_cost(model, result.prompt_tokens,
result.completion_tokens)
LLM_COST.labels(model=model, provider=provider).inc(cost)
return result
except Exception as e:
status = "error"
error_type = type(e).__name__
LLM_ERRORS.labels(
model=model, provider=provider, error_type=error_type
).inc()
raise
finally:
duration = time.time() - start_time
LLM_REQUEST_DURATION.labels(
model=model, provider=provider,
operation=operation, status=status
).observe(duration)
LLM_ACTIVE_REQUESTS.labels(
model=model, provider=provider
).dec()
return wrapper
return decorator
# Usage
@llm_instrumented(model="gpt-5", provider="openai", operation="chat")
async def call_gpt5(prompt: str):
return await openai_client.chat.completions.create(
model="gpt-5",
messages=[{"role": "user", "content": prompt}]
    )
Grafana Dashboard Configuration#
{
"dashboard": {
"title": "LLM Observability - 2026",
"panels": [
{
"title": "Request Latency Distribution (P50/P95/P99)",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(llm_request_duration_seconds_bucket[5m]))",
"legendFormat": "P50"
},
{
"expr": "histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))",
"legendFormat": "P95"
},
{
"expr": "histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))",
"legendFormat": "P99"
}
]
},
{
"title": "Token Consumption Rate by Model",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(llm_tokens_total[5m])) by (model)",
"legendFormat": "{{model}}"
}
]
},
{
"title": "Hourly Cost",
"type": "stat",
"targets": [
{
"expr": "sum(increase(llm_cost_usd_total[1h]))",
"legendFormat": "Cost/hour"
}
]
},
{
"title": "Error Rate",
"type": "timeseries",
"targets": [
{
"expr": "rate(llm_errors_total[5m]) / rate(llm_requests_total[5m]) * 100",
"legendFormat": "Error % ({{model}})"
}
]
}
]
}
}
3. Distributed Tracing Across Multi-Model Calls#
Multi-agent and multi-model orchestration is the standard architecture in 2026 LLM applications. A single user request might traverse:
User Request → Router Agent
├─ Claude 4 Opus (complex reasoning)
├─ GPT-5 (code generation)
├─ Gemini 2.5 Pro (multimodal understanding)
├─ Llama 4 (fast local classification)
└─ DeepSeek-V3 (data extraction)
OpenTelemetry Integration#
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
OTLPSpanExporter
)
from opentelemetry.sdk.resources import Resource
# Initialize Tracer
resource = Resource.create({
"service.name": "llm-agent-service",
"service.version": "2.0.0",
"deployment.environment": "production",
})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-observability")
async def traced_llm_call(
model: str,
messages: list,
parent_span: trace.Span = None
):
"""LLM call with distributed tracing"""
with tracer.start_as_current_span(
f"llm.call.{model}",
kind=trace.SpanKind.CLIENT,
attributes={
"llm.model": model,
"llm.provider": get_provider(model),
"llm.request.type": "chat",
"llm.prompt.length": sum(len(m["content"]) for m in messages),
}
) as span:
try:
response = await call_model(model, messages)
span.set_attribute("llm.response.tokens.prompt",
response.usage.prompt_tokens)
span.set_attribute("llm.response.tokens.completion",
response.usage.completion_tokens)
span.set_attribute("llm.response.tokens.total",
response.usage.total_tokens)
span.set_attribute("llm.response.finish_reason",
response.choices[0].finish_reason)
span.set_status(trace.Status(trace.StatusCode.OK))
return response
except Exception as e:
span.set_status(
trace.Status(trace.StatusCode.ERROR, str(e))
)
span.record_exception(e)
raise
# Multi-model orchestration tracing
async def multi_model_agent(user_query: str):
with tracer.start_as_current_span("agent.multi_model_pipeline") as root:
root.set_attribute("user.query.length", len(user_query))
# Parallel model calls
with tracer.start_as_current_span("parallel.model_calls"):
results = await asyncio.gather(
traced_llm_call("claude-4-opus", complex_reasoning_prompt),
traced_llm_call("gpt-5", code_generation_prompt),
traced_llm_call("gemini-2.5-pro", multimodal_prompt),
)
# Synthesize results
with tracer.start_as_current_span("agent.synthesize"):
final = await traced_llm_call(
"claude-4-opus",
synthesize_prompt(results)
)
        return final
4. Prompt/Response Logging with PII Redaction#
Recording raw prompts and responses is critical for debugging, but sensitive information must be handled properly.
PII Redaction Solution#
import re
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
class PIIRedactor:
"""PII redactor for LLM requests/responses"""
def __init__(self):
self.analyzer = AnalyzerEngine()
self.anonymizer = AnonymizerEngine()
# Custom patterns
self.custom_patterns = {
"api_key": re.compile(
r'(sk-[a-zA-Z0-9]{20,}|AIza[a-zA-Z0-9_-]{35})'
),
"phone_cn": re.compile(r'1[3-9]\d{9}'),
"ssn": re.compile(r'\d{3}-\d{2}-\d{4}'),
}
def redact(self, text: str, language: str = "en") -> str:
# Use Presidio for PII detection
results = self.analyzer.analyze(
text=text,
entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
"CREDIT_CARD", "IP_ADDRESS"],
language=language,
)
anonymized = self.anonymizer.anonymize(
text=text, analyzer_results=results
)
# Apply custom regex
result = anonymized.text
for name, pattern in self.custom_patterns.items():
result = pattern.sub(f"[REDACTED_{name.upper()}]", result)
return result
def safe_log_prompt(self, messages: list) -> list:
"""Safely log prompts with PII redaction"""
return [
{**msg, "content": self.redact(msg["content"])}
for msg in messages
]
# Usage
from datetime import datetime, timezone
redactor = PIIRedactor()
def safe_log_llm_call(request, response):
safe_log = {
"request_id": str(uuid.uuid4()),
"timestamp": datetime.utcnow().isoformat(),
"model": request.model,
"messages": redactor.safe_log_prompt(request.messages),
"response": redactor.redact(response.content),
"metadata": {
"prompt_tokens": response.usage.prompt_tokens,
"completion_tokens": response.usage.completion_tokens,
}
}
    logger.info(json.dumps(safe_log))
5. Quality Monitoring & Hallucination Detection#
Quality monitoring in 2026 goes far beyond simple human evaluation.
Automated Hallucination Detection#
class HallucinationDetector:
"""Multi-strategy hallucination detector"""
def __init__(self):
self.fact_checker_model = "claude-4-sonnet"
self.fact_checker = LiteLLMClient(model=self.fact_checker_model)
async def detect(
self,
query: str,
response: str,
context: list[str] = None
) -> dict:
scores = {}
# Strategy 1: Context-based faithfulness check
if context:
scores["context_faithfulness"] = await self._check_faithfulness(
response, context
)
# Strategy 2: Self-consistency check (multiple sampling)
scores["self_consistency"] = await self._check_self_consistency(
query, response
)
# Strategy 3: Fact verification
scores["fact_check"] = await self._fact_check(response)
# Strategy 4: Citation verification
scores["citation_accuracy"] = await self._verify_citations(
response, context
)
# Composite score
weights = {
"context_faithfulness": 0.35,
"self_consistency": 0.25,
"fact_check": 0.25,
"citation_accuracy": 0.15
}
composite = sum(
scores.get(k, 0) * v for k, v in weights.items()
)
return {
"hallucination_score": 1.0 - composite,
"detail_scores": scores,
"is_hallucination": composite < 0.6,
"confidence": self._calculate_confidence(scores),
}
async def _check_faithfulness(
self, response: str, context: list[str]
) -> float:
prompt = f"""Evaluate whether the following answer is faithful to the provided context.
Score based only on context information, 0=completely unfaithful, 1=fully faithful.
Context: {chr(10).join(context)}
Answer: {response}
Output a number between 0-1."""
result = await self.fact_checker.complete(prompt)
try:
return float(result.strip())
except ValueError:
return 0.5
async def _check_self_consistency(
self, query: str, response: str
) -> float:
"""Multi-sample consistency check"""
samples = []
for _ in range(3):
sample = await self.fact_checker.complete(
f"Answer the following question: {query}"
)
samples.append(sample)
# Simplified consistency: compare key information points
agreements = 0
total = 0
response_claims = self._extract_claims(response)
for sample in samples:
sample_claims = self._extract_claims(sample)
for claim in response_claims:
if any(self._claims_match(claim, sc)
for sc in sample_claims):
agreements += 1
total += 1
return agreements / total if total > 0 else 0.5
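    def _extract_claims(self, text: str) -> list[str]:
        # Minimal placeholder (referenced above but never defined in the
        # article): treat each sentence as one claim; swap in a proper
        # claim-extraction model for production use.
        sentences = text.replace("!", ".").replace("?", ".").split(".")
        return [s.strip() for s in sentences if s.strip()]
    def _claims_match(self, claim_a: str, claim_b: str) -> bool:
        # Minimal placeholder: crude token-overlap similarity between claims.
        a, b = set(claim_a.lower().split()), set(claim_b.lower().split())
        return len(a & b) / max(len(a | b), 1) > 0.5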
# Quality metrics reporting
async def evaluate_and_report(
query: str, response: str, model: str
):
detector = HallucinationDetector()
result = await detector.detect(query, response)
# Report to Prometheus
LLM_QUALITY_SCORE.labels(
model=model, evaluator="hallucination"
).observe(1.0 - result["hallucination_score"])
if result["is_hallucination"]:
logger.warning(
f"Potential hallucination detected",
extra={
"model": model,
"hallucination_score": result["hallucination_score"],
"detail_scores": result["detail_scores"],
}
)
    return result
6. Cost Dashboards and Alerts#
Cost Tracking & Budget Alerts#
import asyncio
# Cost budget alert rules (Prometheus AlertManager)
ALERT_RULES = """
groups:
- name: llm_cost_alerts
rules:
- alert: LLMHourlyCostHigh
expr: sum(increase(llm_cost_usd_total[1h])) > 50
for: 5m
labels:
severity: warning
annotations:
summary: "LLM hourly cost exceeds $50"
description: "Current hourly cost: {{ $value | humanize }} USD"
- alert: LLMDailyCostCritical
expr: sum(increase(llm_cost_usd_total[24h])) > 500
for: 10m
labels:
severity: critical
annotations:
summary: "LLM daily cost exceeds $500"
description: "Current daily cost: {{ $value | humanize }} USD"
- alert: LLMTokenRateAnomaly
expr: rate(llm_tokens_total[5m]) > 3 * rate(llm_tokens_total[1h] offset 1d)
for: 15m
labels:
severity: warning
annotations:
summary: "Token consumption rate anomaly detected"
description: "Current rate is 3x above the same period yesterday"
- alert: LLMErrorRateHigh
expr: rate(llm_errors_total[5m]) / rate(llm_requests_total[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "LLM error rate exceeds 10%"
"""
# Dynamic cost budget management
class CostBudgetManager:
def __init__(self, daily_limit: float = 100.0,
hourly_limit: float = 20.0):
self.daily_limit = daily_limit
self.hourly_limit = hourly_limit
        self.daily_remaining = Gauge('llm_budget_daily_remaining_usd',
                                     'Remaining daily budget')
        self.hourly_remaining = Gauge('llm_budget_hourly_remaining_usd',
                                      'Remaining hourly budget')
async def check_budget(self, model: str,
estimated_cost: float) -> bool:
"""Check budget before making a call"""
remaining = await self._get_remaining_budget()
if estimated_cost > remaining["hourly"]:
logger.warning(
f"Budget exceeded: estimated ${estimated_cost:.4f}, "
f"hourly remaining ${remaining['hourly']:.4f}"
)
return False
return True
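    def estimate_cost(self, model: str, prompt_tokens: int,
                      expected_output_tokens: int) -> float:
        # Sketch (not part of the original class): derive the estimated_cost
        # argument for check_budget from the per-million-token price table
        # defined on LLMLogger in the structured-logging section.
        return LLMLogger().calculate_cost(model, prompt_tokens,
                                          expected_output_tokens)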
async def _get_remaining_budget(self) -> dict:
# Query current spend from Prometheus
        pass
7. Debugging Tools and Techniques#
Common Issue Diagnostic Checklist#
from statistics import mean
class LLMDebugger:
"""LLM call diagnostic tool"""
def diagnose(self, call_log: dict) -> list[str]:
issues = []
# 1. Latency anomaly
if call_log["latency_ms"] > 10000:
issues.append(
f"⚠️ High latency: {call_log['latency_ms']}ms "
f"(model: {call_log['model']})"
)
# 2. Token efficiency
ratio = (call_log["completion_tokens"] /
max(call_log["prompt_tokens"], 1))
if ratio > 10:
issues.append(
f"⚠️ Output/Input ratio too high: {ratio:.1f}x, "
f"consider optimizing your prompt"
)
# 3. Cost spike
expected_cost = self._get_expected_cost(call_log["model"])
if call_log["cost_usd"] > expected_cost * 2:
issues.append(
f"⚠️ Cost anomaly: ${call_log['cost_usd']:.4f} "
f"(expected: ${expected_cost:.4f})"
)
# 4. Frequent retries
if call_log.get("retry_count", 0) > 2:
issues.append(
f"⚠️ Frequent retries: {call_log['retry_count']} attempts, "
f"error type: {call_log.get('error_type')}"
)
# 5. Truncation detection
if call_log.get("finish_reason") == "length":
issues.append(
"⚠️ Output truncated (max_tokens too low)"
)
return issues
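    def _get_expected_cost(self, model: str) -> float:
        # Rough placeholder baselines, an assumption (the original never
        # defines this helper); in practice derive the expected per-call cost
        # from the historical p50 for the same model and route.
        baselines = {"claude-4-opus": 0.30, "gpt-5": 0.15, "gemini-2.5-pro": 0.10}
        return baselines.get(model, 0.05)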
def compare_models(
self, logs: list[dict], models: list[str]
) -> dict:
"""Compare different models on the same request set"""
comparison = {}
for model in models:
model_logs = [l for l in logs if l["model"] == model]
if model_logs:
comparison[model] = {
"avg_latency_ms": mean(
[l["latency_ms"] for l in model_logs]
),
"avg_cost_usd": mean(
[l["cost_usd"] for l in model_logs]
),
"success_rate": (
len([l for l in model_logs
if l["status"] == "success"])
/ len(model_logs)
),
"avg_quality_score": mean(
[l.get("quality_score", 0)
for l in model_logs]
),
}
        return comparison
Interactive Debug Session#
class LLMDebugSession:
"""Interactive debug session for replaying requests step by step"""
def __init__(self, trace_id: str):
self.trace_id = trace_id
self.calls = self._load_trace(trace_id)
def _load_trace(self, trace_id: str) -> list[dict]:
# Load complete trace from log storage
pass
def timeline(self):
"""Display call timeline"""
for i, call in enumerate(self.calls):
bar = "█" * int(call["latency_ms"] / 100)
print(f"[{i}] {call['model']:25s} | "
f"{call['latency_ms']:8.0f}ms | "
f"{bar}")
def replay_call(self, index: int, model: str = None):
"""Replay a single call with a different model"""
original = self.calls[index]
target_model = model or original["model"]
print(f"Replaying with {target_model}...")
# Replay logic
pass
def export_for_evaluation(self) -> dict:
"""Export trace data for quality evaluation"""
return {
"trace_id": self.trace_id,
"calls": self.calls,
"total_cost": sum(c["cost_usd"] for c in self.calls),
"total_latency_ms": sum(c["latency_ms"] for c in self.calls),
"models_used": list(set(c["model"] for c in self.calls)),
        }
8. Popular Tools: LangSmith, Helicone, Lunary & Custom Solutions#
The LLM observability tool ecosystem is mature in 2026. Here’s a comparison of the major players.
LangSmith#
The official LangChain platform with deep LangChain/LangGraph integration.
from langsmith import traceable
@traceable(
name="my_agent",
run_type="chain",
metadata={"version": "2.0"}
)
async def my_agent(query: str):
# LangSmith auto-records input/output, latency, token usage
result = await chain.ainvoke({"query": query})
    return result
Strengths: Seamless LangChain ecosystem integration, powerful Prompt Hub, built-in evaluation framework.
Helicone#
Proxy-based logging with zero code changes.
from openai import OpenAI
# Just change the base_url
client = OpenAI(
base_url="https://oai.helicone.ai/v1",
default_headers={
"Helicone-Auth": "Bearer YOUR_HELICONE_KEY",
"Helicone-User-Id": "user-123",
}
)
Strengths: Zero instrumentation, caching support, cost analysis dashboard.
Lunary#
Open-source full-stack observability platform.
import lunary
lunary.init(app_id="your-app-id")
@lunary.track()
async def chat_handler(message: str):
# Lunary auto-captures call data
response = await client.chat.completions.create(...)
    return response
Strengths: Fully open-source, built-in user feedback collection, multi-model comparison.
Tool Comparison#
| Feature | LangSmith | Helicone | Lunary | Custom |
|---|---|---|---|---|
| Open Source | ❌ | ❌ | ✅ | ✅ |
| Proxy Mode | ❌ | ✅ | ❌ | N/A |
| PII Redaction | ✅ | ✅ | ✅ | Custom |
| Cost Tracking | ✅ | ✅ | ✅ | Custom |
| Tracing | ✅ | Limited | ✅ | Custom |
| Eval Framework | ✅ | ❌ | ✅ | Custom |
| Pricing | From $39/mo | Free tier | Free tier | Infra cost |
XiDao API Gateway: Out-of-the-Box LLM Observability#
If you’re using XiDao API Gateway, you already have a powerful observability foundation.
Core Features#
1. Unified Request Logging
XiDao Gateway automatically logs all LLM calls passing through it, with no application code changes needed:
# xidao-gateway configuration
observability:
logging:
enabled: true
format: json
include_request_body: true
include_response_body: true
pii_redaction:
enabled: true
patterns:
- email
- phone
- credit_card
- api_key
storage:
type: elasticsearch
endpoint: "https://es.example.com:9200"
index: "llm-logs-{yyyy.MM.dd}"2. Real-time Metrics Exposure
observability:
metrics:
enabled: true
endpoint: /metrics
format: prometheus
custom_labels:
- team
- environment
      - cost_center
XiDao auto-generates standard metrics like llm_request_duration_seconds and llm_tokens_total, ready for Grafana integration.
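Before wiring up Grafana panels, a quick sanity check is to scrape the gateway's /metrics endpoint and confirm the metric families are exposed. A minimal sketch, assuming a reachable gateway host (the URL below is a placeholder for your deployment):
import requests
# Hypothetical gateway metrics address; replace with your deployment's endpoint.
METRICS_URL = "http://xidao-gateway:9090/metrics"
resp = requests.get(METRICS_URL, timeout=5)
resp.raise_for_status()
for family in ("llm_request_duration_seconds", "llm_tokens_total", "llm_cost_usd_total"):
    present = any(line.startswith(family) for line in resp.text.splitlines())
    print(f"{family}: {'present' if present else 'MISSING'}")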
3. Distributed Tracing Injection
observability:
tracing:
enabled: true
exporter: otlp
endpoint: "http://jaeger-collector:4317"
sample_rate: 0.1 # 10% sampling in production
    propagation: w3c
4. Cost Dashboard
XiDao has built-in cost tracking with team, user, and project-level analysis:
# View cost distribution for the past 24 hours
xidao cost report --period 24h --group-by team
# Set budget alerts
xidao cost alert set \
--team=engineering \
--daily-limit=200 \
--hourly-limit=30 \
  --webhook=https://hooks.slack.com/xxx
5. Multi-Model A/B Testing Tracing
routing:
ab_tests:
- name: "model-comparison-q2-2026"
variants:
- model: claude-4-opus
weight: 30
- model: gpt-5
weight: 40
- model: gemini-2.5-pro
weight: 30
metrics:
- latency_p95
- quality_score
        - cost_per_request
Best Practices Summary#
Layered Observability Architecture#
┌─────────────────────────────────────────────────┐
│ Application Layer │
│ Structured Logs │ Business Metrics │ Quality │
├─────────────────────────────────────────────────┤
│ Collection Layer │
│ XiDao Gateway │ OpenTelemetry Collector │
├─────────────────────────────────────────────────┤
│ Storage Layer │
│ Elasticsearch │ Prometheus │ ClickHouse │
├─────────────────────────────────────────────────┤
│ Visualization Layer │
│ Grafana │ LangSmith │ Custom Dashboard │
├─────────────────────────────────────────────────┤
│ Alerting Layer │
│ AlertManager │ PagerDuty │ Slack Webhook │
└─────────────────────────────────────────────────┘
Key Recommendations#
- Start logging from day one: Log schema is hard to change later — design it carefully upfront
- trace_id through the entire chain: Every step from user request to final response must carry it
- PII redaction is non-negotiable: When in doubt, redact more, not less
- Cost monitoring must be real-time: LLM costs can spiral out of control in minutes
- Automate quality monitoring: Human evaluation doesn’t scale — build automated evaluation pipelines
- Use XiDao Gateway to simplify infrastructure: Let the gateway handle log collection and metrics exposure while your app focuses on business logic
Conclusion#
LLM applications in 2026 are no longer simple API calls — they are complex multi-model orchestration systems. Observability is not optional; it’s a fundamental requirement for surviving in production.
Start with structured logging, then progressively add metrics, distributed tracing, quality monitoring, and cost alerting. Use XiDao API Gateway as your observability entry point to make building the entire system simple and efficient.
Remember: You can’t optimize what you can’t see.
Author: XiDao Team | May 2026
Want to learn more about LLM observability practices? Visit XiDao Docs or join our community discussions.