LLM Observability: Practical Methods for Monitoring Prompt and Response

Simi included in AI

2025-12-23 818 words 4 minutes

Your LLM service is live and users are complaining about output quality. You open your monitoring dashboard — CPU normal, memory normal, P99 latency within SLA. From an infrastructure perspective, the service looks perfectly healthy. But you have zero visibility into what prompts are being sent, what the model is returning, or why quality dropped. This is the LLM observability gap: traditional APM tools can tell you if the system is healthy, but not if the AI is.

Three Dimensions of LLM Observability

LLM observability differs from standard service monitoring. You need to watch three layers simultaneously:

Operational Metrics

These overlap with traditional APM, but LLM services add metrics of their own:

Metric	Description	Alert Threshold (reference)
`latency_p99`	End-to-end P99 latency	> 10s
`ttft` (Time to First Token)	Delay from request start to the first streamed token	> 3s
`tokens_per_second`	Generation speed	< 10 TPS
`input_tokens` / `output_tokens`	Token usage (directly drives cost)	over budget
`error_rate`	Grouped by error type	> 1%
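
As a rough illustration of how `ttft` and `tokens_per_second` relate to end-to-end latency, the sketch below derives all three from the arrival time of each streamed token. The `StreamTiming` and `summarize_stream` names are hypothetical, and it assumes you can record a timestamp per token from your streaming client.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamTiming:
    ttft_s: float             # time to first token, in seconds
    total_latency_s: float    # request start to last token, in seconds
    tokens_per_second: float  # generation speed after the first token


def summarize_stream(request_start: float, token_times: Sequence[float]) -> StreamTiming:
    """Derive latency metrics from the arrival time of each streamed token.

    `request_start` and `token_times` are timestamps in seconds taken from the
    same monotonic clock (for example, time.monotonic()).
    """
    if not token_times:
        raise ValueError("no tokens were received")
    first, last = token_times[0], token_times[-1]
    generation_window = last - first
    # With a single token there is no window to measure generation speed over.
    tps = (len(token_times) - 1) / generation_window if generation_window > 0 else 0.0
    return StreamTiming(
        ttft_s=first - request_start,
        total_latency_s=last - request_start,
        tokens_per_second=tps,
    )


# Example: request sent at t=100.0s, five tokens arriving 0.25s apart.
timing = summarize_stream(100.0, [100.5, 100.75, 101.0, 101.25, 101.5])
```

Measuring speed only over the window after the first token keeps a slow start (high `ttft`) from being counted twice.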

Quality Metrics

This is uniquely challenging for LLMs — how do you quantify “is the answer good?”:

Faithfulness: Is the answer grounded in retrieved context? (RAG scenarios)
Relevance: Does the answer actually address the user’s question?
Coherence: Is the answer internally consistent and logically clear?
LLM-as-Judge score: Use another LLM to score output — practical for batch evaluation
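
One way an LLM-as-Judge faithfulness score might be wired up is sketched below. The prompt wording, the 0 to 10 scale, and the `call_judge` hook are all assumptions for illustration; `call_judge` stands for whatever function sends a prompt to your judge model and returns its text reply.

```python
import re
from typing import Callable

# Hypothetical grading prompt; tune the wording and scale for your use case.
JUDGE_TEMPLATE = """You are grading an AI assistant's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Rate how faithful the answer is to the context on a scale from 0 to 10.
Reply with the number only."""


def judge_faithfulness(
    question: str,
    context: str,
    answer: str,
    call_judge: Callable[[str], str],
) -> float:
    """Return a faithfulness score in [0.0, 1.0] parsed from the judge's reply."""
    prompt = JUDGE_TEMPLATE.format(question=question, context=context, answer=answer)
    reply = call_judge(prompt)
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"judge reply contained no score: {reply!r}")
    # Clamp so an out-of-range reply such as "15" cannot exceed the scale.
    raw = min(max(float(match.group()), 0.0), 10.0)
    return raw / 10.0


# Stand-in judge for local testing; a real one would call an LLM API.
def fake_judge(prompt: str) -> str:
    return "Score: 8"


score = judge_faithfulness(
    question="What is the refund window?",
    context="Refunds are accepted within 30 days.",
    answer="You can get a refund within 30 days.",
    call_judge=fake_judge,
)
```

Keeping the judge call behind a plain function makes it easy to swap models or to run the scorer offline over a batch of stored traces.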

Business Metrics

Ultimately what matters: did users get what they wanted?

Task completion rate (does the user need to follow up multiple times?)
User satisfaction (explicit ratings or implicit behavioral signals)
Conversation abandonment rate (users who quit mid-session)
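
These rates can be computed from session records once you log a few fields per conversation. The sketch below assumes a hypothetical `Session` record with a turn count and two flags; how you decide that a task was completed or abandoned is up to your product signals.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Session:
    user_turns: int       # number of messages the user sent
    task_completed: bool  # hypothetical flag, e.g. from a "resolved" signal
    abandoned: bool       # user quit mid-session without a resolution


def business_metrics(sessions: Iterable[Session]) -> Dict[str, float]:
    """Aggregate session-level outcomes into the three business rates."""
    sessions = list(sessions)
    if not sessions:
        raise ValueError("need at least one session")
    total = len(sessions)
    return {
        "task_completion_rate": sum(s.task_completed for s in sessions) / total,
        # Completed without any follow-up question from the user.
        "first_try_completion_rate": sum(
            s.task_completed and s.user_turns == 1 for s in sessions
        ) / total,
        "abandonment_rate": sum(s.abandoned for s in sessions) / total,
    }


metrics = business_metrics([
    Session(user_turns=1, task_completed=True, abandoned=False),
    Session(user_turns=3, task_completed=True, abandoned=False),
    Session(user_turns=2, task_completed=False, abandoned=True),
    Session(user_turns=1, task_completed=False, abandoned=False),
])
```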

Tool Selection

Choosing the right LLM observability tool saves you from reinventing the wheel:

Langfuse

Open source, self-hostable, with Python and JavaScript SDKs. Biggest advantage: native integration with LangChain and LlamaIndex — a few lines to instrument. Supports prompt version management and A/B test tracking. Best for teams that need full data sovereignty.

In Practice: Tracing Prompts/Responses with Langfuse

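import os
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context  # Langfuse Python SDK v2 decorator API
# Drop-in wrapper around the OpenAI SDK: records model, token usage, and latency per call
from langfuse.openai import openai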

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host="https://cloud.langfuse.com",
)

openai_client = openai.OpenAI()

@observe()  # automatically traces this function call
def answer_question(user_question: str, user_id: str) -> str:
    langfuse_context.update_current_trace(
        user_id=user_id,
        tags=["production", "v2"],
    )

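    # Fetch the versioned prompt managed in the Langfuse console.
    # Assumes "support-assistant" is a chat prompt, so compile() returns a list of messages.
    prompt = langfuse.get_prompt("support-assistant")
    compiled = prompt.compile(question=user_question)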

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=compiled,
    )
    answer = response.choices[0].message.content

    langfuse_context.update_current_observation(
        output=answer,
        metadata={"model": "gpt-4o", "user_id": user_id},
    )
    return answer

def record_user_feedback(trace_id: str, score: float, comment: str = ""):
    langfuse.score(
        trace_id=trace_id,
        name="user_satisfaction",
        value=score,  # 0.0 - 1.0
        comment=comment,
    )

After this runs, the Langfuse dashboard shows: prompt content, response content, token usage, latency, user ID, and quality score trends — all per request.

Key Metrics and Alerting

Set alerts in tiers:

P1 (immediate response):

Error rate > 5% (grouped by error_type: context_length_exceeded, rate_limit, content_filter)
P99 latency > 30s

P2 (respond within 1 hour):

P95 latency > 10s
Daily token consumption exceeds 80% of budget
Faithfulness score below 0.75 for more than 1 consecutive hour

Trend monitoring (daily review):

Average input/output token length trend (a sudden increase may indicate prompt leakage or an attack)
Quality score week-over-week change
New error types appearing
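
The P1 and P2 thresholds above can also be expressed as a small classifier, which is handy for unit-testing alert logic before encoding it in a rules file. This is a minimal sketch: the `LLMHealthSnapshot` fields and the `classify_alerts` name are hypothetical, and it assumes some upstream job already aggregates the metrics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LLMHealthSnapshot:
    error_rate: float              # fraction of failed requests, 0.0 - 1.0
    latency_p99_s: float
    latency_p95_s: float
    daily_tokens_used: int
    daily_token_budget: int
    faithfulness_low_minutes: int  # consecutive minutes with faithfulness < 0.75


def classify_alerts(s: LLMHealthSnapshot) -> Dict[str, List[str]]:
    """Map a metrics snapshot onto the P1/P2 tiers described above."""
    p1: List[str] = []
    p2: List[str] = []
    if s.error_rate > 0.05:
        p1.append("error_rate")
    if s.latency_p99_s > 30:
        p1.append("latency_p99")
    if s.latency_p95_s > 10:
        p2.append("latency_p95")
    if s.daily_tokens_used > 0.8 * s.daily_token_budget:
        p2.append("token_budget")
    if s.faithfulness_low_minutes > 60:
        p2.append("faithfulness")
    return {"P1": p1, "P2": p2}


alerts = classify_alerts(LLMHealthSnapshot(
    error_rate=0.02,
    latency_p99_s=35.0,
    latency_p95_s=12.0,
    daily_tokens_used=900_000,
    daily_token_budget=1_000_000,
    faithfulness_low_minutes=30,
))
```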

Prometheus alerting rules:

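# prometheus-rules.yml
groups:
  - name: llm_alerts
    rules:
      - alert: LLMHighErrorRate
        # Aggregate both sides with sum(): dividing the raw vectors would only
        # match series whose labels (including status) are identical.
        expr: >
          sum(rate(llm_requests_total{status="error"}[5m]))
          / sum(rate(llm_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate {{ $value | humanizePercentage }}"

      - alert: LLMHighLatency
        # Sum buckets by le so the quantile covers all instances together.
        expr: histogram_quantile(0.99, sum by (le) (rate(llm_latency_seconds_bucket[5m]))) > 30
        for: 5m
        labels:
          severity: critical  # P99 > 30s is listed as a P1 condition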

Data Storage and Querying

LLM trace data can be high-volume — storage strategy matters:

Storage Options

ClickHouse: high write throughput, columnar storage, ideal for querying large trace datasets. Langfuse’s self-hosted version uses ClickHouse.
PostgreSQL: adequate for smaller volumes (< 1M requests/day), more flexible querying.
S3 + Athena: cold data archiving, lowest cost, higher query latency.

Sampling Strategy

You don’t need to store every request — sample by importance:

import random

def should_trace(request_context: dict) -> bool:
    # Error requests: always record
    if request_context.get("error"):
        return True

    # Requests with user feedback: always record
    if request_context.get("has_feedback"):
        return True

    # Low quality score (< 0.6): always record
    if request_context.get("quality_score", 1.0) < 0.6:
        return True

    # Normal requests: 10% random sample
    return random.random() < 0.10

RETENTION_POLICY = {
    "full_fidelity_days": 90,   # last 90 days at full detail
    "compressed_days": 365,     # 90-365 days: key fields only
    "archive_days": 730,        # 1-2 years: cold storage
}
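
To show how a retention policy like this could be applied, the sketch below maps a trace's age to a storage tier. The `storage_tier` helper and the tier names are hypothetical; the thresholds are copied from the policy above, and the block redefines the dictionary so it runs on its own.

```python
from typing import Dict, Optional

RETENTION_POLICY = {
    "full_fidelity_days": 90,
    "compressed_days": 365,
    "archive_days": 730,
}


def storage_tier(age_days: int, policy: Dict[str, int] = RETENTION_POLICY) -> Optional[str]:
    """Return the tier a trace of the given age belongs to, or None if expired."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= policy["full_fidelity_days"]:
        return "full_fidelity"  # keep prompt, response, and all metadata
    if age_days <= policy["compressed_days"]:
        return "compressed"     # keep key fields only
    if age_days <= policy["archive_days"]:
        return "archive"        # move to cold storage
    return None                 # past retention; eligible for deletion


tiers = [storage_tier(d) for d in (10, 200, 500, 800)]
```

A scheduled job could call a helper like this per trace (or per daily partition) to decide what to compact, move, or delete.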