
LLM Application Observability: Complete Guide to Logging, Monitoring, and Debugging
#

When your Agent calls Claude 4, GPT-5, and Gemini 2.5 Pro at 3 AM to complete a multi-step reasoning task and returns a wrong answer, you don’t just need an error log — you need a complete observability system.

Why LLM Applications Need Specialized Observability
#

Traditional web application observability revolves around request-response cycles, database queries, and CPU/memory metrics. LLM applications introduce entirely new dimensions of complexity:

  • Non-deterministic outputs: The same input can produce different results every time
  • Expensive operations: A single API call can cost several dollars
  • Multi-model orchestration: One user request may chain 3-5 model calls across providers
  • Quality is hard to quantify: The line between “correct” and “hallucination” is blurry
  • Wild latency variance: Response times can range from 200ms to 30s+

In 2026, with models like Claude 4 Opus, GPT-5, Gemini 2.5 Pro, Llama 4, and DeepSeek-V3 deployed at production scale, observability has evolved from “nice-to-have” to “absolutely essential.”

The Pillars of Observability for LLM Applications
#

1. Structured Logging for LLM Calls
#

LLM call logging is not just print(response). You need to capture the full context of every call.

Core Field Design
#

import json
import time
import uuid
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class LLMCallLog:
    request_id: str
    trace_id: str
    timestamp: str
    model: str                    # e.g. "claude-4-opus", "gpt-5"
    provider: str                 # e.g. "anthropic", "openai"
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    latency_ms: float
    cost_usd: float
    status: str                   # "success" | "error" | "timeout"
    error_type: Optional[str]
    temperature: float
    max_tokens: int
    user_id: Optional[str]
    session_id: Optional[str]
    prompt_hash: str              # For dedup/clustering, never store raw
    response_hash: str
    metadata: dict                # Custom fields

class LLMLogger:
    def __init__(self, log_path: str = "/var/log/llm/calls.jsonl"):
        self.log_path = log_path
        self.token_prices = {
            "claude-4-opus": {"input": 15.0, "output": 75.0},
            "claude-4-sonnet": {"input": 3.0, "output": 15.0},
            "gpt-5": {"input": 10.0, "output": 30.0},
            "gpt-5-mini": {"input": 1.5, "output": 6.0},
            "gemini-2.5-pro": {"input": 7.0, "output": 21.0},
            "deepseek-v3": {"input": 0.27, "output": 1.10},
            "llama-4-maverick": {"input": 0.20, "output": 0.60},
        }

    def calculate_cost(self, model: str, prompt_tokens: int,
                       completion_tokens: int) -> float:
        prices = self.token_prices.get(model, {"input": 0, "output": 0})
        return (prompt_tokens * prices["input"] +
                completion_tokens * prices["output"]) / 1_000_000

    def log_call(self, log_entry: LLMCallLog):
        with open(self.log_path, "a") as f:
            f.write(json.dumps(asdict(log_entry), ensure_ascii=False) + "\n")
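
For illustration, here is a minimal usage sketch tying the dataclass and the logger together (the prompt, response, token counts, and IDs below are made-up values):

import hashlib
from datetime import datetime, timezone

llm_logger = LLMLogger()

def sha256_short(text: str) -> str:
    # Store a hash for dedup/clustering instead of the raw text
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

prompt_text = "Summarize the attached incident report."  # hypothetical input
response_text = "The outage was caused by ..."           # hypothetical output

entry = LLMCallLog(
    request_id=str(uuid.uuid4()),
    trace_id=str(uuid.uuid4()),
    timestamp=datetime.now(timezone.utc).isoformat(),
    model="claude-4-sonnet",
    provider="anthropic",
    prompt_tokens=812,
    completion_tokens=245,
    total_tokens=1057,
    latency_ms=1830.5,
    cost_usd=llm_logger.calculate_cost("claude-4-sonnet", 812, 245),
    status="success",
    error_type=None,
    temperature=0.2,
    max_tokens=1024,
    user_id="user-123",
    session_id="sess-456",
    prompt_hash=sha256_short(prompt_text),
    response_hash=sha256_short(response_text),
    metadata={"feature": "incident-summary"},
)
llm_logger.log_call(entry)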

Log Context Propagation
#

In async Python applications, use contextvars to propagate trace IDs:

import contextvars

trace_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    'trace_id', default=''
)
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    'request_id', default=''
)

def get_current_trace_id() -> str:
    return trace_id_var.get() or str(uuid.uuid4())

# Set at the entry point
async def handle_request(request):
    trace_id = str(uuid.uuid4())
    trace_id_var.set(trace_id)
    request_id_var.set(str(uuid.uuid4()))
    # ... handle request
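
The payoff is that code deep in the call stack can pick these IDs up without threading them through every function signature. A small sketch (build_log_entry is a hypothetical helper; the remaining LLMCallLog fields arrive via **fields):

def build_log_entry(**fields) -> LLMCallLog:
    # No explicit trace/request parameters needed: the ContextVars carry them
    return LLMCallLog(
        request_id=request_id_var.get(),
        trace_id=get_current_trace_id(),
        **fields,
    )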

2. Metrics: Latency, Tokens, Cost, Error Rate
#

Key Metrics Matrix
#

Category      Metric Name                        Type        Description
Latency       llm_request_duration_seconds       Histogram   End-to-end request latency
Latency       llm_time_to_first_token_seconds    Histogram   TTFT for streaming
Throughput    llm_requests_total                 Counter     Total request count
Tokens        llm_tokens_total                   Counter     Total tokens consumed
Cost          llm_cost_usd_total                 Counter     Cumulative cost
Errors        llm_errors_total                   Counter     Error count by type
Quality       llm_quality_score                  Histogram   Quality evaluation score
Cache         llm_cache_hit_ratio                Gauge       Cache hit rate

Prometheus Metric Definitions
#

from prometheus_client import Histogram, Counter, Gauge

# Request latency
LLM_REQUEST_DURATION = Histogram(
    'llm_request_duration_seconds',
    'LLM API request duration in seconds',
    ['model', 'provider', 'operation', 'status'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0]
)

# Time to First Token
LLM_TTFT = Histogram(
    'llm_time_to_first_token_seconds',
    'Time to first token for streaming requests',
    ['model', 'provider'],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0]
)

# Token consumption
LLM_TOKENS = Counter(
    'llm_tokens_total',
    'Total tokens consumed',
    ['model', 'provider', 'token_type']  # token_type: input/output
)

# Request cost
LLM_COST = Counter(
    'llm_cost_usd_total',
    'Total cost in USD',
    ['model', 'provider']
)

# Error counter
LLM_ERRORS = Counter(
    'llm_errors_total',
    'Total LLM errors',
    ['model', 'provider', 'error_type']
)

# Active requests
LLM_ACTIVE_REQUESTS = Gauge(
    'llm_active_requests',
    'Currently active LLM requests',
    ['model', 'provider']
)

# Quality scores
LLM_QUALITY_SCORE = Histogram(
    'llm_quality_score',
    'LLM response quality score (0-1)',
    ['model', 'evaluator'],
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)
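
The metrics matrix above also lists llm_requests_total and llm_cache_hit_ratio, which the definitions so far leave out. A minimal sketch of both (the label choices are an assumption):

# Request counter - the denominator for error-rate queries
LLM_REQUESTS = Counter(
    'llm_requests_total',
    'Total LLM requests',
    ['model', 'provider', 'status']
)

# Cache hit ratio, updated by whatever caching layer sits in front of the models
LLM_CACHE_HIT_RATIO = Gauge(
    'llm_cache_hit_ratio',
    'Cache hit rate (0-1)',
    ['model']
)

For the error-rate panels and alerts later in this guide to have a denominator, the instrumentation decorator below would also need to increment llm_requests_total, for example in its finally block next to the duration observation.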

Auto-Instrumentation Middleware
#

import asyncio
from functools import wraps

def llm_instrumented(model: str, provider: str, operation: str = "chat"):
    """Decorator: automatically instrument LLM call metrics"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            LLM_ACTIVE_REQUESTS.labels(model=model, provider=provider).inc()
            start_time = time.time()
            status = "success"
            error_type = None
            try:
                result = await func(*args, **kwargs)
                usage = result.usage  # OpenAI-style responses expose token counts on .usage
                # Record tokens
                LLM_TOKENS.labels(
                    model=model, provider=provider, token_type="input"
                ).inc(usage.prompt_tokens)
                LLM_TOKENS.labels(
                    model=model, provider=provider, token_type="output"
                ).inc(usage.completion_tokens)
                # Record cost
                cost = calculate_cost(model, usage.prompt_tokens,
                                      usage.completion_tokens)
                LLM_COST.labels(model=model, provider=provider).inc(cost)
                return result
            except Exception as e:
                status = "error"
                error_type = type(e).__name__
                LLM_ERRORS.labels(
                    model=model, provider=provider, error_type=error_type
                ).inc()
                raise
            finally:
                duration = time.time() - start_time
                LLM_REQUEST_DURATION.labels(
                    model=model, provider=provider,
                    operation=operation, status=status
                ).observe(duration)
                LLM_ACTIVE_REQUESTS.labels(
                    model=model, provider=provider
                ).dec()
        return wrapper
    return decorator

# Usage
@llm_instrumented(model="gpt-5", provider="openai", operation="chat")
async def call_gpt5(prompt: str):
    return await openai_client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}]
    )

Grafana Dashboard Configuration
#

{
  "dashboard": {
    "title": "LLM Observability - 2026",
    "panels": [
      {
        "title": "Request Latency Distribution (P50/P95/P99)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(llm_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(llm_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "Token Consumption Rate by Model",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(llm_tokens_total[5m])) by (model)",
            "legendFormat": "{{model}}"
          }
        ]
      },
      {
        "title": "Hourly Cost",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(increase(llm_cost_usd_total[1h]))",
            "legendFormat": "Cost/hour"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(llm_errors_total[5m]) / rate(llm_requests_total[5m]) * 100",
            "legendFormat": "Error % ({{model}})"
          }
        ]
      }
    ]
  }
}

3. Distributed Tracing Across Multi-Model Calls
#

Multi-agent and multi-model orchestration is the standard architecture in 2026 LLM applications. A single user request might traverse:

User Request → Router Agent
  ├─ Claude 4 Opus (complex reasoning)
  ├─ GPT-5 (code generation)
  └─ Gemini 2.5 Pro (multimodal understanding)
     └─ Llama 4 (fast local classification)
        └─ DeepSeek-V3 (data extraction)

OpenTelemetry Integration
#

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter
)
from opentelemetry.sdk.resources import Resource

# Initialize Tracer
resource = Resource.create({
    "service.name": "llm-agent-service",
    "service.version": "2.0.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-observability")

async def traced_llm_call(
    model: str,
    messages: list,
    parent_span: trace.Span = None
):
    """LLM call with distributed tracing"""
    with tracer.start_as_current_span(
        f"llm.call.{model}",
        kind=trace.SpanKind.CLIENT,
        attributes={
            "llm.model": model,
            "llm.provider": get_provider(model),
            "llm.request.type": "chat",
            "llm.prompt.length": sum(len(m["content"]) for m in messages),
        }
    ) as span:
        try:
            response = await call_model(model, messages)

            span.set_attribute("llm.response.tokens.prompt",
                               response.usage.prompt_tokens)
            span.set_attribute("llm.response.tokens.completion",
                               response.usage.completion_tokens)
            span.set_attribute("llm.response.tokens.total",
                               response.usage.total_tokens)
            span.set_attribute("llm.response.finish_reason",
                               response.choices[0].finish_reason)
            span.set_status(trace.Status(trace.StatusCode.OK))
            return response

        except Exception as e:
            span.set_status(
                trace.Status(trace.StatusCode.ERROR, str(e))
            )
            span.record_exception(e)
            raise

# Multi-model orchestration tracing
async def multi_model_agent(user_query: str):
    with tracer.start_as_current_span("agent.multi_model_pipeline") as root:
        root.set_attribute("user.query.length", len(user_query))

        # Parallel model calls
        with tracer.start_as_current_span("parallel.model_calls"):
            results = await asyncio.gather(
                traced_llm_call("claude-4-opus", complex_reasoning_prompt),
                traced_llm_call("gpt-5", code_generation_prompt),
                traced_llm_call("gemini-2.5-pro", multimodal_prompt),
            )

        # Synthesize results
        with tracer.start_as_current_span("agent.synthesize"):
            final = await traced_llm_call(
                "claude-4-opus",
                synthesize_prompt(results)
            )
            return final

4. Prompt/Response Logging with PII Redaction
#

Recording raw prompts and responses is critical for debugging, but sensitive information must be handled properly.

PII Redaction Solution
#

import re
import logging
from datetime import datetime

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

logger = logging.getLogger("llm-observability")

class PIIRedactor:
    """PII redactor for LLM requests/responses"""

    def __init__(self):
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()
        # Custom patterns
        self.custom_patterns = {
            "api_key": re.compile(
                r'(sk-[a-zA-Z0-9]{20,}|AIza[a-zA-Z0-9_-]{35})'
            ),
            "phone_cn": re.compile(r'1[3-9]\d{9}'),
            "ssn": re.compile(r'\d{3}-\d{2}-\d{4}'),
        }

    def redact(self, text: str, language: str = "en") -> str:
        # Use Presidio for PII detection
        results = self.analyzer.analyze(
            text=text,
            entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
                       "CREDIT_CARD", "IP_ADDRESS"],
            language=language,
        )
        anonymized = self.anonymizer.anonymize(
            text=text, analyzer_results=results
        )

        # Apply custom regex
        result = anonymized.text
        for name, pattern in self.custom_patterns.items():
            result = pattern.sub(f"[REDACTED_{name.upper()}]", result)

        return result

    def safe_log_prompt(self, messages: list) -> list:
        """Safely log prompts with PII redaction"""
        return [
            {**msg, "content": self.redact(msg["content"])}
            for msg in messages
        ]

# Usage
redactor = PIIRedactor()

def safe_log_llm_call(request, response):
    safe_log = {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.utcnow().isoformat(),
        "model": request.model,
        "messages": redactor.safe_log_prompt(request.messages),
        "response": redactor.redact(response.content),
        "metadata": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
        }
    }
    logger.info(json.dumps(safe_log))

5. Quality Monitoring & Hallucination Detection
#

Quality monitoring in 2026 goes far beyond simple human evaluation.

Automated Hallucination Detection
#

class HallucinationDetector:
    """Multi-strategy hallucination detector"""

    def __init__(self):
        self.fact_checker_model = "claude-4-sonnet"
        self.fact_checker = LiteLLMClient(model=self.fact_checker_model)

    async def detect(
        self,
        query: str,
        response: str,
        context: list[str] = None
    ) -> dict:
        scores = {}

        # Strategy 1: Context-based faithfulness check
        if context:
            scores["context_faithfulness"] = await self._check_faithfulness(
                response, context
            )

        # Strategy 2: Self-consistency check (multiple sampling)
        scores["self_consistency"] = await self._check_self_consistency(
            query, response
        )

        # Strategy 3: Fact verification
        scores["fact_check"] = await self._fact_check(response)

        # Strategy 4: Citation verification
        scores["citation_accuracy"] = await self._verify_citations(
            response, context
        )

        # Composite score
        weights = {
            "context_faithfulness": 0.35,
            "self_consistency": 0.25,
            "fact_check": 0.25,
            "citation_accuracy": 0.15
        }
        composite = sum(
            scores.get(k, 0) * v for k, v in weights.items()
        )

        return {
            "hallucination_score": 1.0 - composite,
            "detail_scores": scores,
            "is_hallucination": composite < 0.6,
            "confidence": self._calculate_confidence(scores),
        }

    async def _check_faithfulness(
        self, response: str, context: list[str]
    ) -> float:
        prompt = f"""Evaluate whether the following answer is faithful to the provided context.
Score based only on context information, 0=completely unfaithful, 1=fully faithful.

Context: {chr(10).join(context)}
Answer: {response}

Output a number between 0-1."""

        result = await self.fact_checker.complete(prompt)
        try:
            return float(result.strip())
        except ValueError:
            return 0.5

    async def _check_self_consistency(
        self, query: str, response: str
    ) -> float:
        """Multi-sample consistency check"""
        samples = []
        for _ in range(3):
            sample = await self.fact_checker.complete(
                f"Answer the following question: {query}"
            )
            samples.append(sample)

        # Simplified consistency: compare key information points
        agreements = 0
        total = 0
        response_claims = self._extract_claims(response)
        for sample in samples:
            sample_claims = self._extract_claims(sample)
            for claim in response_claims:
                if any(self._claims_match(claim, sc)
                       for sc in sample_claims):
                    agreements += 1
                total += 1

        return agreements / total if total > 0 else 0.5
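
    # _extract_claims and _claims_match are used above but never defined; a
    # minimal, assumption-heavy stand-in: split the answer into sentences and
    # treat high word overlap as agreement. A production system would use an
    # LLM judge or an NLI model instead.
    def _extract_claims(self, text: str) -> list[str]:
        return [s.strip() for s in re.split(r"[.!?]\s+", text) if s.strip()]

    def _claims_match(self, claim_a: str, claim_b: str) -> bool:
        a = set(claim_a.lower().split())
        b = set(claim_b.lower().split())
        if not a or not b:
            return False
        # Jaccard overlap as a cheap proxy for semantic agreement
        return len(a & b) / len(a | b) >= 0.5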

# Quality metrics reporting
async def evaluate_and_report(
    query: str, response: str, model: str
):
    detector = HallucinationDetector()
    result = await detector.detect(query, response)

    # Report to Prometheus
    LLM_QUALITY_SCORE.labels(
        model=model, evaluator="hallucination"
    ).observe(1.0 - result["hallucination_score"])

    if result["is_hallucination"]:
        logger.warning(
            f"Potential hallucination detected",
            extra={
                "model": model,
                "hallucination_score": result["hallucination_score"],
                "detail_scores": result["detail_scores"],
            }
        )

    return result

6. Cost Dashboards and Alerts
#

Cost Tracking & Budget Alerts
#

import asyncio

# Cost budget alert rules (Prometheus AlertManager)
ALERT_RULES = """
groups:
  - name: llm_cost_alerts
    rules:
      - alert: LLMHourlyCostHigh
        expr: sum(increase(llm_cost_usd_total[1h])) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM hourly cost exceeds $50"
          description: "Current hourly cost: {{ $value | humanize }} USD"

      - alert: LLMDailyCostCritical
        expr: sum(increase(llm_cost_usd_total[24h])) > 500
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "LLM daily cost exceeds $500"
          description: "Current daily cost: {{ $value | humanize }} USD"

      - alert: LLMTokenRateAnomaly
        expr: rate(llm_tokens_total[5m]) > 3 * rate(llm_tokens_total[1h] offset 1d)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Token consumption rate anomaly detected"
          description: "Current rate is 3x above the same period yesterday"

      - alert: LLMErrorRateHigh
        expr: sum(rate(llm_errors_total[5m])) / sum(rate(llm_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate exceeds 10%"
"""

# Dynamic cost budget management
class CostBudgetManager:
    def __init__(self, daily_limit: float = 100.0,
                 hourly_limit: float = 20.0):
        self.daily_limit = daily_limit
        self.hourly_limit = hourly_limit
        self.daily_remaining = Gauge('llm_budget_daily_remaining_usd',
                                     'Remaining daily budget')
        self.hourly_remaining = Gauge('llm_budget_hourly_remaining_usd',
                                      'Remaining hourly budget')

    async def check_budget(self, model: str,
                           estimated_cost: float) -> bool:
        """Check budget before making a call"""
        remaining = await self._get_remaining_budget()
        if estimated_cost > remaining["hourly"]:
            logger.warning(
                f"Budget exceeded: estimated ${estimated_cost:.4f}, "
                f"hourly remaining ${remaining['hourly']:.4f}"
            )
            return False
        return True

    async def _get_remaining_budget(self) -> dict:
        # Query current spend from Prometheus
        pass
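
The _get_remaining_budget stub can be backed by the Prometheus HTTP API. A sketch under assumptions (server reachable at http://prometheus:9090, counters as defined earlier) that the method could delegate to:

import httpx

PROMETHEUS_URL = "http://prometheus:9090"  # assumed endpoint

async def query_prometheus(promql: str) -> float:
    """Run an instant query and return the first sample value (0.0 if empty)."""
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{PROMETHEUS_URL}/api/v1/query",
                                params={"query": promql})
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0

async def get_remaining_budget(daily_limit: float, hourly_limit: float) -> dict:
    hourly_spend = await query_prometheus("sum(increase(llm_cost_usd_total[1h]))")
    daily_spend = await query_prometheus("sum(increase(llm_cost_usd_total[24h]))")
    return {
        "hourly": max(hourly_limit - hourly_spend, 0.0),
        "daily": max(daily_limit - daily_spend, 0.0),
    }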

7. Debugging Tools and Techniques
#

Common Issue Diagnostic Checklist
#

from statistics import mean

class LLMDebugger:
    """LLM call diagnostic tool"""

    def diagnose(self, call_log: dict) -> list[str]:
        issues = []

        # 1. Latency anomaly
        if call_log["latency_ms"] > 10000:
            issues.append(
                f"⚠️ High latency: {call_log['latency_ms']}ms "
                f"(model: {call_log['model']})"
            )

        # 2. Token efficiency
        ratio = (call_log["completion_tokens"] /
                 max(call_log["prompt_tokens"], 1))
        if ratio > 10:
            issues.append(
                f"⚠️ Output/Input ratio too high: {ratio:.1f}x, "
                f"consider optimizing your prompt"
            )

        # 3. Cost spike
        expected_cost = self._get_expected_cost(call_log["model"])
        if call_log["cost_usd"] > expected_cost * 2:
            issues.append(
                f"⚠️ Cost anomaly: ${call_log['cost_usd']:.4f} "
                f"(expected: ${expected_cost:.4f})"
            )

        # 4. Frequent retries
        if call_log.get("retry_count", 0) > 2:
            issues.append(
                f"⚠️ Frequent retries: {call_log['retry_count']} attempts, "
                f"error type: {call_log.get('error_type')}"
            )

        # 5. Truncation detection
        if call_log.get("finish_reason") == "length":
            issues.append(
                "⚠️ Output truncated (max_tokens too low)"
            )

        return issues
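
    # _get_expected_cost is referenced above but never defined; a rough,
    # assumption-based stand-in that prices a "typical" call of ~1K input /
    # 500 output tokens using the LLMLogger price table from section 1.
    def _get_expected_cost(self, model: str) -> float:
        prices = LLMLogger().token_prices.get(model, {"input": 0, "output": 0})
        return (1_000 * prices["input"] + 500 * prices["output"]) / 1_000_000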

    def compare_models(
        self, logs: list[dict], models: list[str]
    ) -> dict:
        """Compare different models on the same request set"""
        comparison = {}
        for model in models:
            model_logs = [l for l in logs if l["model"] == model]
            if model_logs:
                comparison[model] = {
                    "avg_latency_ms": mean(
                        [l["latency_ms"] for l in model_logs]
                    ),
                    "avg_cost_usd": mean(
                        [l["cost_usd"] for l in model_logs]
                    ),
                    "success_rate": (
                        len([l for l in model_logs
                             if l["status"] == "success"])
                        / len(model_logs)
                    ),
                    "avg_quality_score": mean(
                        [l.get("quality_score", 0)
                         for l in model_logs]
                    ),
                }
        return comparison

Interactive Debug Session
#

class LLMDebugSession:
    """Interactive debug session for replaying requests step by step"""

    def __init__(self, trace_id: str):
        self.trace_id = trace_id
        self.calls = self._load_trace(trace_id)

    def _load_trace(self, trace_id: str) -> list[dict]:
        # Load complete trace from log storage
        pass

    def timeline(self):
        """Display call timeline"""
        for i, call in enumerate(self.calls):
            bar = "█" * int(call["latency_ms"] / 100)
            print(f"[{i}] {call['model']:25s} | "
                  f"{call['latency_ms']:8.0f}ms | "
                  f"{bar}")

    def replay_call(self, index: int, model: str = None):
        """Replay a single call with a different model"""
        original = self.calls[index]
        target_model = model or original["model"]
        print(f"Replaying with {target_model}...")
        # Replay logic
        pass

    def export_for_evaluation(self) -> dict:
        """Export trace data for quality evaluation"""
        return {
            "trace_id": self.trace_id,
            "calls": self.calls,
            "total_cost": sum(c["cost_usd"] for c in self.calls),
            "total_latency_ms": sum(c["latency_ms"] for c in self.calls),
            "models_used": list(set(c["model"] for c in self.calls)),
        }
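
The _load_trace stub can be backed directly by the JSONL file written by LLMLogger in section 1. A minimal sketch (path and field names follow the LLMCallLog schema above):

def load_trace_from_jsonl(trace_id: str,
                          log_path: str = "/var/log/llm/calls.jsonl") -> list[dict]:
    calls = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("trace_id") == trace_id:
                calls.append(record)
    # Sort chronologically so timeline() renders calls in order
    return sorted(calls, key=lambda c: c["timestamp"])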

8. Popular Tools: LangSmith, Helicone, Lunary & Custom Solutions
#

The LLM observability tool ecosystem is mature in 2026. Here’s a comparison of the major players.

LangSmith
#

The official LangChain platform with deep LangChain/LangGraph integration.

from langsmith import traceable

@traceable(
    name="my_agent",
    run_type="chain",
    metadata={"version": "2.0"}
)
async def my_agent(query: str):
    # LangSmith auto-records input/output, latency, token usage
    result = await chain.ainvoke({"query": query})
    return result

Strengths: Seamless LangChain ecosystem integration, powerful Prompt Hub, built-in evaluation framework.

Helicone
#

Proxy-based logging with zero code changes.

# Just change the base_url
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer YOUR_HELICONE_KEY",
        "Helicone-User-Id": "user-123",
    }
)

Strengths: Zero instrumentation, caching support, cost analysis dashboard.

Lunary
#

Open-source full-stack observability platform.

import lunary

lunary.init(app_id="your-app-id")

@lunary.track()
async def chat_handler(message: str):
    # Lunary auto-captures call data
    response = await client.chat.completions.create(...)
    return response

Strengths: Fully open-source, built-in user feedback collection, multi-model comparison.

Tool Comparison
#

The main dimensions to weigh are open-source availability, proxy-mode support, PII redaction, cost tracking, distributed tracing (depth varies between the hosted tools), and a built-in evaluation framework; with a custom solution, every one of these features has to be built and maintained in-house. On pricing, LangSmith starts at $39/month, Helicone and Lunary both offer free tiers, and a custom stack costs only its underlying infrastructure.

XiDao API Gateway: Out-of-the-Box LLM Observability
#

If you’re using XiDao API Gateway, you already have a powerful observability foundation.

Core Features
#

1. Unified Request Logging

XiDao Gateway automatically logs all LLM calls passing through it, with no application code changes needed:

# xidao-gateway configuration
observability:
  logging:
    enabled: true
    format: json
    include_request_body: true
    include_response_body: true
    pii_redaction:
      enabled: true
      patterns:
        - email
        - phone
        - credit_card
        - api_key
    storage:
      type: elasticsearch
      endpoint: "https://es.example.com:9200"
      index: "llm-logs-{yyyy.MM.dd}"

2. Real-time Metrics Exposure

observability:
  metrics:
    enabled: true
    endpoint: /metrics
    format: prometheus
    custom_labels:
      - team
      - environment
      - cost_center

XiDao auto-generates standard metrics like llm_request_duration_seconds and llm_tokens_total, ready for Grafana integration.

3. Distributed Tracing Injection

observability:
  tracing:
    enabled: true
    exporter: otlp
    endpoint: "http://jaeger-collector:4317"
    sample_rate: 0.1  # 10% sampling in production
    propagation: w3c

4. Cost Dashboard

XiDao has built-in cost tracking with team, user, and project-level analysis:

# View cost distribution for the past 24 hours
xidao cost report --period 24h --group-by team

# Set budget alerts
xidao cost alert set \
  --team=engineering \
  --daily-limit=200 \
  --hourly-limit=30 \
  --webhook=https://hooks.slack.com/xxx

5. Multi-Model A/B Testing Tracing

routing:
  ab_tests:
    - name: "model-comparison-q2-2026"
      variants:
        - model: claude-4-opus
          weight: 30
        - model: gpt-5
          weight: 40
        - model: gemini-2.5-pro
          weight: 30
      metrics:
        - latency_p95
        - quality_score
        - cost_per_request

Best Practices Summary
#

Layered Observability Architecture
#

┌─────────────────────────────────────────────────┐
│               Application Layer                  │
│   Structured Logs │ Business Metrics │ Quality   │
├─────────────────────────────────────────────────┤
│                Collection Layer                   │
│   XiDao Gateway │ OpenTelemetry Collector        │
├─────────────────────────────────────────────────┤
│                  Storage Layer                    │
│   Elasticsearch │ Prometheus │ ClickHouse        │
├─────────────────────────────────────────────────┤
│               Visualization Layer                 │
│   Grafana │ LangSmith │ Custom Dashboard          │
├─────────────────────────────────────────────────┤
│                 Alerting Layer                    │
│   AlertManager │ PagerDuty │ Slack Webhook        │
└─────────────────────────────────────────────────┘

Key Recommendations
#

  1. Start logging from day one: Log schema is hard to change later — design it carefully upfront
  2. Propagate trace_id through the entire chain: Every step from user request to final response must carry it
  3. PII redaction is non-negotiable: When in doubt, redact more, not less
  4. Cost monitoring must be real-time: LLM costs can spiral out of control in minutes
  5. Automate quality monitoring: Human evaluation doesn’t scale — build automated evaluation pipelines
  6. Use XiDao Gateway to simplify infrastructure: Let the gateway handle log collection and metrics exposure while your app focuses on business logic

Conclusion
#

LLM applications in 2026 are no longer simple API calls — they are complex multi-model orchestration systems. Observability is not optional; it’s a fundamental requirement for surviving in production.

Start with structured logging, then progressively add metrics, distributed tracing, quality monitoring, and cost alerting. Use XiDao API Gateway as your observability entry point to make building the entire system simple and efficient.

Remember: You can’t optimize what you can’t see.


Author: XiDao Team | May 2026

Want to learn more about LLM observability practices? Visit XiDao Docs or join our community discussions.
