LLM Observability: Practical Methods for Monitoring Prompt and Response
Your LLM service is live and users are complaining about output quality. You open your monitoring dashboard — CPU normal, memory normal, P99 latency within SLA. From an infrastructure perspective, the service looks perfectly healthy. But you have zero visibility into what prompts are being sent, what the model is returning, or why quality dropped. This is the LLM observability gap: traditional APM tools can tell you if the system is healthy, but not if the AI is.
Three Dimensions of LLM Observability
LLM observability differs from standard service monitoring. You need to watch three layers simultaneously:
Operational Metrics
These overlap with traditional APM, but LLMs have unique metrics:
| Metric | Description | Alert Threshold (reference) |
|---|---|---|
| latency_p99 | End-to-end P99 latency | > 10s |
| ttft (Time to First Token) | Time to first token | > 3s |
| tokens_per_second | Generation speed | < 10 TPS |
| input_tokens / output_tokens | Token usage (directly drives cost) | over budget |
| error_rate | Grouped by error type | > 1% |
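To make the streaming metrics concrete, here is a minimal sketch of how ttft and tokens_per_second could be derived from a streamed response. It assumes each streamed chunk counts as one token, which real APIs do not guarantee; `measure_stream`, `StreamTiming`, and the injectable `clock` parameter are hypothetical names used only for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float]    # time to first token; None if the stream was empty
    total_s: float             # end-to-end latency
    output_tokens: int         # assumption: one chunk == one token
    tokens_per_second: float   # generation speed after the first token


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream and record when the first and last chunks arrive."""
    start = clock()
    first = last = None
    count = 0
    for _ in chunks:
        last = clock()
        if first is None:
            first = last
        count += 1
    total = clock() - start
    if first is None:
        return StreamTiming(None, total, 0, 0.0)
    decode_time = last - first
    # Rate of the tokens that followed the first one; 0.0 for single-chunk streams.
    tps = (count - 1) / decode_time if decode_time > 0 else 0.0
    return StreamTiming(first - start, total, count, tps)
```

In a real service the `chunks` iterable would be the provider's streaming response, and the resulting numbers would be exported as histogram observations rather than returned to the caller.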
Quality Metrics
This is uniquely challenging for LLMs — how do you quantify “is the answer good?”:
- Faithfulness: Is the answer grounded in retrieved context? (RAG scenarios)
- Relevance: Does the answer actually address the user’s question?
- Coherence: Is the answer internally consistent and logically clear?
- LLM-as-Judge score: Use another LLM to score output — practical for batch evaluation
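An LLM-as-Judge check can be sketched roughly as follows. The prompt wording, the 1 to 5 scale, and the names `JUDGE_TEMPLATE`, `parse_score`, and `judge_faithfulness` are all assumptions made for illustration; `call_judge` stands in for whatever function sends a prompt to your judge model and returns its text reply.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; tune the wording and scale for your own use case.
JUDGE_TEMPLATE = """You are grading an AI assistant's answer.
Question: {question}
Context: {context}
Answer: {answer}
Rate how faithful the answer is to the context, from 1 (unsupported) to 5 (fully supported).
Reply with the number only."""


def parse_score(raw: str, lo: int = 1, hi: int = 5) -> Optional[float]:
    """Take the first integer in the reply and normalize it to 0.0-1.0.

    Returns None when no integer is found or it falls outside [lo, hi].
    """
    match = re.search(r"\d+", raw)
    if match is None:
        return None
    value = int(match.group())
    if not lo <= value <= hi:
        return None
    return (value - lo) / (hi - lo)


def judge_faithfulness(
    question: str,
    context: str,
    answer: str,
    call_judge: Callable[[str], str],
) -> Optional[float]:
    """Build the grading prompt, call the judge model, and parse its reply."""
    prompt = JUDGE_TEMPLATE.format(question=question, context=context, answer=answer)
    return parse_score(call_judge(prompt))
```

Returning None for unparseable replies keeps malformed judge output out of the score averages instead of silently counting it as zero.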
Business Metrics
Ultimately what matters: did users get what they wanted?
- Task completion rate (does the user need to follow up multiple times?)
- User satisfaction (explicit ratings or implicit behavioral signals)
- Conversation abandonment rate (users who quit mid-session)
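As a rough illustration, these three signals could be aggregated from per-session records like this. The `Session` fields are hypothetical; how you decide that a session was resolved or abandoned depends on your product's own signals.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Session:
    turns: int        # number of user messages in the session
    resolved: bool    # hypothetical flag: the user's task was completed
    abandoned: bool   # hypothetical flag: the user quit mid-session


def business_metrics(sessions: Iterable[Session]) -> Dict[str, float]:
    """Aggregate completion rate, abandonment rate, and average follow-ups."""
    items = list(sessions)
    n = len(items)
    if n == 0:
        return {"task_completion_rate": 0.0, "abandonment_rate": 0.0, "avg_followups": 0.0}
    return {
        "task_completion_rate": sum(s.resolved for s in items) / n,
        "abandonment_rate": sum(s.abandoned for s in items) / n,
        # Every user message after the first counts as a follow-up.
        "avg_followups": sum(max(s.turns - 1, 0) for s in items) / n,
    }
```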
Tool Selection
Choosing the right LLM observability tool saves a lot of reinventing the wheel:
Langfuse
Open source, self-hostable, with Python and JavaScript SDKs. Biggest advantage: native integration with LangChain and LlamaIndex — a few lines to instrument. Supports prompt version management and A/B test tracking. Best for teams that need full data sovereignty.
Helicone
Proxy-mode integration: change one line (the API base URL), add a Helicone auth header, and leave the rest of your code untouched. OpenAI requests are routed through Helicone’s proxy, which automatically logs all prompts and responses. Best for quick integration without touching your codebase.
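The proxy pattern can be sketched with nothing but the standard library. The base URL `https://oai.helicone.ai/v1` and the `Helicone-Auth` header reflect Helicone's OpenAI integration as I understand it, so treat both as assumptions and confirm them against the current Helicone docs; `build_chat_request` is a hypothetical helper that only constructs the request and does not send it.

```python
import json
import os
import urllib.request
from typing import Dict, List

# Assumption: Helicone's OpenAI-compatible proxy endpoint.
HELICONE_BASE_URL = "https://oai.helicone.ai/v1"


def build_chat_request(
    messages: List[Dict[str, str]],
    model: str = "gpt-4o",
    base_url: str = HELICONE_BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request routed via the proxy."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # The provider key is passed through unchanged.
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
            # Assumption: this header tells the proxy which Helicone account logs the call.
            "Helicone-Auth": f"Bearer {os.environ.get('HELICONE_API_KEY', '')}",
        },
    )
```

With the official OpenAI SDK the same idea reduces to passing a different base URL and one extra default header when constructing the client.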
Phoenix by Arize
Comes from an ML observability background, with stronger analysis of embeddings and retrieval quality. Ships with built-in hallucination and QA evaluators. Best for deep RAG system evaluation.
OpenTelemetry
If your team already has OTel infrastructure, use the standard protocol and build your own pipeline. The community maintains packages like opentelemetry-instrumentation-openai. Flexible, but it requires more build work.
In Practice: Tracing Prompts/Responses with Langfuse
```python
import os

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
# Drop-in wrapper around the OpenAI SDK so model, token usage, and latency are logged
from langfuse.openai import openai

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host="https://cloud.langfuse.com",
)
openai_client = openai.OpenAI()


@observe()  # automatically traces this function call
def answer_question(user_question: str, user_id: str) -> str:
    langfuse_context.update_current_trace(
        user_id=user_id,
        tags=["production", "v2"],
    )
    # Fetch the versioned prompt from the Langfuse console.
    # It must be a chat prompt so that compile() returns a list of messages.
    prompt = langfuse.get_prompt("support-assistant", type="chat")
    compiled = prompt.compile(question=user_question)
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=compiled,
    )
    answer = response.choices[0].message.content
    langfuse_context.update_current_observation(
        output=answer,
        metadata={"model": "gpt-4o", "user_id": user_id},
    )
    return answer


def record_user_feedback(trace_id: str, score: float, comment: str = ""):
    """Attach a user rating to an existing trace."""
    langfuse.score(
        trace_id=trace_id,
        name="user_satisfaction",
        value=score,  # 0.0 - 1.0
        comment=comment,
    )
```

After this runs, the Langfuse dashboard shows prompt content, response content, token usage, latency, user ID, and quality score trends, all per request.
Key Metrics and Alerting
Set alerts in tiers:
P1 (immediate response):
- Error rate > 5% (grouped by error_type: context_length_exceeded, rate_limit, content_filter)
- P99 latency > 30s
P2 (respond within 1 hour):
- P95 latency > 10s
- Daily token consumption exceeds 80% of budget
- Faithfulness score below 0.75 for more than 1 consecutive hour
Trend monitoring (daily review):
- Average input/output token length trend (a sudden increase may indicate prompt leakage or an attack such as prompt injection)
- Quality score week-over-week change
- New error types appearing
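The P1 and P2 tiers above can be expressed as a small threshold check, sketched here under stated assumptions: the metric keys (`error_rate`, `latency_p99_s`, `latency_p95_s`, `daily_tokens`, `daily_token_budget`, `faithfulness_low_minutes`) and the function name `evaluate_alerts` are hypothetical, and the snapshot is assumed to be computed elsewhere by your metrics pipeline.

```python
from typing import Dict, List, Tuple


def evaluate_alerts(metrics: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (tier, reason) pairs for every threshold the snapshot breaches."""
    alerts: List[Tuple[str, str]] = []

    # P1: immediate response
    if metrics.get("error_rate", 0.0) > 0.05:
        alerts.append(("P1", "error rate above 5%"))
    if metrics.get("latency_p99_s", 0.0) > 30:
        alerts.append(("P1", "P99 latency above 30s"))

    # P2: respond within 1 hour
    if metrics.get("latency_p95_s", 0.0) > 10:
        alerts.append(("P2", "P95 latency above 10s"))
    budget = metrics.get("daily_token_budget", 0.0)
    if budget > 0 and metrics.get("daily_tokens", 0.0) > 0.8 * budget:
        alerts.append(("P2", "daily tokens above 80% of budget"))
    # Minutes the faithfulness score has stayed below 0.75 without recovering.
    if metrics.get("faithfulness_low_minutes", 0.0) > 60:
        alerts.append(("P2", "faithfulness below 0.75 for over an hour"))

    return alerts
```

Keeping the thresholds in one function like this makes it easier to review them alongside the alerting rules that enforce them in production.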
Prometheus alerting rules:
```yaml
# prometheus-rules.yml
groups:
  - name: llm_alerts
    rules:
      - alert: LLMHighErrorRate
        # sum() on both sides so error requests are divided by all requests,
        # not matched label-for-label against themselves
        expr: >
          sum(rate(llm_requests_total{status="error"}[5m]))
          / sum(rate(llm_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate {{ $value | humanizePercentage }}"
      - alert: LLMHighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(llm_latency_seconds_bucket[5m]))) > 30
        for: 5m
        labels:
          severity: warning
```

Data Storage and Querying
LLM trace data can be high-volume — storage strategy matters:
Storage Options
- ClickHouse: high write throughput, columnar storage, ideal for querying large trace datasets. Langfuse’s self-hosted version uses ClickHouse.
- PostgreSQL: adequate for smaller volumes (< 1M requests/day), more flexible querying.
- S3 + Athena: cold data archiving, lowest cost, higher query latency.
Sampling Strategy
You don’t need to store every request — sample by importance:
```python
import random


def should_trace(request_context: dict) -> bool:
    # Error requests: always record
    if request_context.get("error"):
        return True
    # Requests with user feedback: always record
    if request_context.get("has_feedback"):
        return True
    # Low quality score (< 0.6): always record
    if request_context.get("quality_score", 1.0) < 0.6:
        return True
    # Normal requests: 10% random sample
    return random.random() < 0.10


RETENTION_POLICY = {
    "full_fidelity_days": 90,  # last 90 days at full detail
    "compressed_days": 365,    # 90-365 days: key fields only
    "archive_days": 730,       # 1-2 years: cold storage
}
```

Further Reading
- Langfuse — open-source LLM observability platform with comprehensive docs
- Helicone — zero-code proxy-mode integration
- OpenTelemetry — standard observability protocol
- Prometheus docs — metric collection and alerting rule configuration