LLM Observability: Practical Methods for Monitoring Prompts and Responses

Your LLM service is live and users are complaining about output quality. You open your monitoring dashboard — CPU normal, memory normal, P99 latency within SLA. From an infrastructure perspective, the service looks perfectly healthy. But you have zero visibility into what prompts are being sent, what the model is returning, or why quality dropped. This is the LLM observability gap: traditional APM tools can tell you if the system is healthy, but not if the AI is.

Three Dimensions of LLM Observability

LLM observability differs from standard service monitoring. You need to watch three layers simultaneously:

Operational Metrics

These overlap with traditional APM, but LLMs have unique metrics:

Metric                       | Description                        | Alert threshold (reference)
latency_p99                  | End-to-end P99 latency             | > 10 s
ttft                         | Time to first token                | > 3 s
tokens_per_second            | Generation speed                   | < 10 tokens/s
input_tokens / output_tokens | Token usage (directly drives cost) | Over budget
error_rate                   | Error rate, grouped by error type  | > 1%
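
To make the latency metrics concrete, here is a minimal sketch of how ttft and tokens_per_second could be derived from a streamed response. It assumes you record a timestamp when the request is sent and one per received token; the names summarize_stream and StreamTiming are illustrative, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamTiming:
    ttft_s: float             # time to first token, in seconds
    total_latency_s: float    # end-to-end latency, in seconds
    tokens_per_second: float  # generation speed after the first token


def summarize_stream(request_start: float, token_times: Sequence[float]) -> StreamTiming:
    """Derive latency metrics from the arrival time of each streamed token.

    All timestamps are seconds on the same clock, for example values
    returned by time.monotonic().
    """
    if not token_times:
        raise ValueError("no tokens were received")
    first, last = token_times[0], token_times[-1]
    generation_time = last - first
    # The first token marks the start of generation, so only the
    # remaining tokens count toward throughput.
    tps = (len(token_times) - 1) / generation_time if generation_time > 0 else 0.0
    return StreamTiming(
        ttft_s=first - request_start,
        total_latency_s=last - request_start,
        tokens_per_second=tps,
    )


# Request sent at t=100.0 s, five tokens arriving 0.25 s apart from t=100.5 s.
timing = summarize_stream(100.0, [100.5, 100.75, 101.0, 101.25, 101.5])
print(timing)
```

Measuring throughput from the first token onward keeps queueing and prompt-processing time out of the generation speed, which is why ttft is tracked as its own metric.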

Quality Metrics

This is uniquely challenging for LLMs — how do you quantify “is the answer good?”:

  • Faithfulness: Is the answer grounded in retrieved context? (RAG scenarios)
  • Relevance: Does the answer actually address the user’s question?
  • Coherence: Is the answer internally consistent and logically clear?
  • LLM-as-Judge score: Use another LLM to score output — practical for batch evaluation
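
As a rough illustration of the LLM-as-Judge idea, the sketch below builds a grading prompt and turns the judge's reply into a 0.0 to 1.0 score. The prompt wording, the 1 to 5 scale, and the call_judge callable (any function that sends a prompt to a model and returns its text) are assumptions made for this example, not a standard.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; tune the rubric for your own use case.
JUDGE_TEMPLATE = """You are grading an answer for faithfulness.
Context:
{context}

Question: {question}
Answer: {answer}

Reply with a single integer from 1 (unsupported) to 5 (fully supported)."""


def parse_score(judge_reply: str) -> Optional[float]:
    """Take the first standalone digit 1-5 in the reply and map it onto 0.0-1.0."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    if match is None:
        return None  # unparseable reply: skip it rather than guess a score
    return (int(match.group(1)) - 1) / 4


def judge_faithfulness(
    call_judge: Callable[[str], str], question: str, context: str, answer: str
) -> Optional[float]:
    prompt = JUDGE_TEMPLATE.format(context=context, question=question, answer=answer)
    return parse_score(call_judge(prompt))


# A stand-in judge so the sketch runs without network access.
def fake_judge(prompt: str) -> str:
    return "Score: 4"


score = judge_faithfulness(
    fake_judge,
    question="What is the refund window?",
    context="Refunds are accepted within 30 days of purchase.",
    answer="You can request a refund within 30 days.",
)
print(score)
```

Returning None for replies that cannot be parsed keeps malformed judge output from silently dragging the average up or down.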

Business Metrics

Ultimately what matters: did users get what they wanted?

  • Task completion rate (does the user need to follow up multiple times?)
  • User satisfaction (explicit ratings or implicit behavioral signals)
  • Conversation abandonment rate (users who quit mid-session)
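
These rates are simple to compute once sessions are logged. The sketch below assumes a hypothetical Session record with a completion flag and an abandonment flag; how you derive those flags (a "resolved" button, a timeout, and so on) depends on your product.

```python
from dataclasses import dataclass


@dataclass
class Session:
    turns: int            # number of user messages in the session
    task_completed: bool  # hypothetical flag, e.g. set by a "resolved" button
    abandoned: bool       # user left before receiving a final answer


def business_metrics(sessions: list[Session]) -> dict[str, float]:
    """Aggregate per-session flags into the rates described above."""
    total = len(sessions)
    if total == 0:
        return {"task_completion_rate": 0.0, "abandonment_rate": 0.0, "avg_turns": 0.0}
    return {
        "task_completion_rate": sum(s.task_completed for s in sessions) / total,
        "abandonment_rate": sum(s.abandoned for s in sessions) / total,
        # A rising average turn count suggests users need more follow-ups.
        "avg_turns": sum(s.turns for s in sessions) / total,
    }


metrics = business_metrics([
    Session(turns=1, task_completed=True, abandoned=False),
    Session(turns=3, task_completed=True, abandoned=False),
    Session(turns=2, task_completed=False, abandoned=True),
    Session(turns=2, task_completed=False, abandoned=False),
])
print(metrics)
```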

Tool Selection

Choosing the right LLM observability tool saves you from reinventing the wheel:

Langfuse

Open source, self-hostable, with Python and JavaScript SDKs. Biggest advantage: native integration with LangChain and LlamaIndex — a few lines to instrument. Supports prompt version management and A/B test tracking. Best for teams that need full data sovereignty.

Helicone

Proxy-mode integration: change one line (the API base URL) and leave the rest of your code untouched. OpenAI requests are routed through Helicone’s proxy, which automatically logs all prompts and responses. Best for quick integration without touching your codebase.
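
A minimal sketch of what the base URL change could look like with the OpenAI Python SDK. The proxy URL and the Helicone-Auth header below follow Helicone's documented OpenAI integration as I understand it at the time of writing; treat both as assumptions and check the current Helicone docs. The helper name helicone_client_kwargs is hypothetical.

```python
# Assumed proxy endpoint for OpenAI traffic; verify against Helicone's docs.
HELICONE_OPENAI_BASE_URL = "https://oai.helicone.ai/v1"


def helicone_client_kwargs(helicone_api_key: str) -> dict:
    """Build keyword arguments that point an OpenAI client at the proxy."""
    return {
        "base_url": HELICONE_OPENAI_BASE_URL,
        # Identifies your Helicone account; the OpenAI key is still sent as usual.
        "default_headers": {"Helicone-Auth": f"Bearer {helicone_api_key}"},
    }


kwargs = helicone_client_kwargs("sk-helicone-example")
print(kwargs["base_url"])

# With the OpenAI Python SDK installed, the client would be created as:
#   client = openai.OpenAI(**kwargs)
```

Everything downstream of the client constructor stays the same, which is what makes the proxy approach attractive for a quick trial.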

Phoenix by Arize

Comes from an ML observability background, so its analysis of embeddings and retrieval quality is stronger. Ships with built-in hallucination and QA evaluators. Best for deep RAG system evaluation.

OpenTelemetry

If your team already has OTel infrastructure, use the standard protocol and build your own pipeline. The community maintains packages like opentelemetry-instrumentation-openai. Flexible but more build work required.

In Practice: Tracing Prompts/Responses with Langfuse

import os
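from langfuse import Langfuse
# Decorator API from the Langfuse Python SDK v2; newer SDK versions expose
# these helpers under different import paths, so check your installed version.
from langfuse.decorators import observe, langfuse_context
# Drop-in wrapper around the OpenAI SDK: records model, token usage, and
# latency for each completion as a generation inside the current trace.
from langfuse.openai import openai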

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host="https://cloud.langfuse.com",
)

openai_client = openai.OpenAI()

@observe()  # automatically traces this function call
def answer_question(user_question: str, user_id: str) -> str:
    langfuse_context.update_current_trace(
        user_id=user_id,
        tags=["production", "v2"],
    )

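    # Fetch the current version of the prompt from Langfuse prompt management.
    # Assumes "support-assistant" is a chat prompt with a {{question}} variable,
    # so compile() returns a list of chat messages.
    prompt = langfuse.get_prompt("support-assistant")
    compiled = prompt.compile(question=user_question)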

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=compiled,
    )
    answer = response.choices[0].message.content

    langfuse_context.update_current_observation(
        output=answer,
        metadata={"model": "gpt-4o", "user_id": user_id},
    )
    return answer

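def record_user_feedback(trace_id: str, score: float, comment: str = ""):
    # trace_id identifies the request being rated; inside a traced function
    # it can be read with langfuse_context.get_current_trace_id().
    langfuse.score(
        trace_id=trace_id,
        name="user_satisfaction",
        value=score,  # 0.0 - 1.0
        comment=comment,
    )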

Once this runs, each request appears in the Langfuse dashboard with its prompt, response, token usage, latency, and user ID, and the recorded user_satisfaction scores can be charted as a trend over time.

Key Metrics and Alerting

Set alerts in tiers:

P1 (immediate response):

  • Error rate > 5% (grouped by error_type: context_length_exceeded, rate_limit, content_filter)
  • P99 latency > 30s

P2 (respond within 1 hour):

  • P95 latency > 10s
  • Daily token consumption exceeds 80% of budget
  • Faithfulness score stays below 0.75 for more than one hour

Trend monitoring (daily review):

  • Average input/output token length trend (a sudden increase may indicate prompt leakage or a prompt-injection attack)
  • Quality score week-over-week change
  • New error types appearing
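
The P1 and P2 thresholds above can be sketched as a small classifier over one evaluation window. The metric key names are hypothetical, and faithfulness_low_minutes is assumed to be a counter, maintained elsewhere, of how long the faithfulness score has stayed below 0.75.

```python
from typing import Optional


def alert_tier(m: dict) -> Optional[str]:
    """Return "P1", "P2", or None for one window of aggregated metrics."""
    # P1: page someone immediately.
    if m["error_rate"] > 0.05 or m["latency_p99_s"] > 30:
        return "P1"
    # P2: respond within one hour.
    if (
        m["latency_p95_s"] > 10
        or m["daily_tokens"] > 0.8 * m["daily_token_budget"]
        or m["faithfulness_low_minutes"] > 60
    ):
        return "P2"
    return None


healthy = {
    "error_rate": 0.01,
    "latency_p99_s": 12.0,
    "latency_p95_s": 6.0,
    "daily_tokens": 400_000,
    "daily_token_budget": 1_000_000,
    "faithfulness_low_minutes": 0,
}
print(alert_tier(healthy))                                # no alert
print(alert_tier({**healthy, "daily_tokens": 850_000}))   # budget warning
print(alert_tier({**healthy, "error_rate": 0.08}))        # error spike
```

Checking P1 first means a window that breaches both tiers is reported at the higher severity.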

Prometheus alerting rules:

# prometheus-rules.yml
groups:
  - name: llm_alerts
    rules:
      - alert: LLMHighErrorRate
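        # Aggregate both sides with sum(); without it the division matches
        # series label-for-label and compares the error series only to itself.
        expr: >
          sum(rate(llm_requests_total{status="error"}[5m]))
          / sum(rate(llm_requests_total[5m])) > 0.05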
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate {{ $value | humanizePercentage }}"

      - alert: LLMHighLatency
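        # Sum bucket rates by "le" so the quantile covers all instances.
        expr: histogram_quantile(0.99, sum by (le) (rate(llm_latency_seconds_bucket[5m]))) > 30
        for: 5m
        labels:
          # P99 latency > 30s is a P1 condition in the tiers above.
          severity: critical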

Data Storage and Querying

LLM trace data can be high-volume — storage strategy matters:

Storage Options

  • ClickHouse: high write throughput, columnar storage, ideal for querying large trace datasets. Langfuse’s self-hosted version uses ClickHouse.
  • PostgreSQL: adequate for smaller volumes (< 1M requests/day), more flexible querying.
  • S3 + Athena: cold data archiving, lowest cost, higher query latency.

Sampling Strategy

You don’t need to store every request — sample by importance:

import random

def should_trace(request_context: dict) -> bool:
    """Decide, after the request completes, whether to keep its full trace."""
    # Error requests: always record
    if request_context.get("error"):
        return True

    # Requests with user feedback: always record
    if request_context.get("has_feedback"):
        return True

    # Low quality score (< 0.6): always record
    if request_context.get("quality_score", 1.0) < 0.6:
        return True

    # Normal requests: 10% random sample
    return random.random() < 0.10

RETENTION_POLICY = {
    "full_fidelity_days": 90,   # last 90 days at full detail
    "compressed_days": 365,     # 90-365 days: key fields only
    "archive_days": 730,        # 1-2 years: cold storage
}
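
One way to apply a retention policy like this is to map each trace's age to a storage tier during a periodic cleanup job. The sketch below repeats the policy so it runs on its own; the tier names and the choice to treat traces past the archive window as eligible for deletion are assumptions, since the policy itself does not say what happens after two years.

```python
from typing import Optional

RETENTION_POLICY = {
    "full_fidelity_days": 90,
    "compressed_days": 365,
    "archive_days": 730,
}


def retention_tier(age_days: int, policy: dict = RETENTION_POLICY) -> Optional[str]:
    """Map a trace's age in days to the tier it should live in."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= policy["full_fidelity_days"]:
        return "full"        # keep prompt, response, and all metadata
    if age_days <= policy["compressed_days"]:
        return "compressed"  # keep key fields only
    if age_days <= policy["archive_days"]:
        return "archive"     # move to cold storage
    return None              # assumption: past the archive window, delete


for age in (30, 200, 500, 800):
    print(age, retention_tier(age))
```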

Further Reading

  • Langfuse — open-source LLM observability platform with comprehensive docs
  • Helicone — zero-code proxy-mode integration
  • OpenTelemetry — standard observability protocol
  • Prometheus docs — metric collection and alerting rule configuration