Prometheus + Grafana Monitoring: Making LLM Services Visible
Why LLMs Need Monitoring
Traditional service monitoring:
- CPU / memory / network
- request latency / error rate
LLM monitoring additionally needs:
- token consumption
- model response latency
- hallucination rate
- prompt length distribution

Prometheus Config
scrape_configs:
  - job_name: 'llm-service'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['llm-api:8000']

Key Metrics
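The queries in this section assume the service exposes counters and a latency histogram with matching names. A minimal instrumentation sketch using the Python prometheus_client library — the `call_llm` wrapper and its label names are illustrative assumptions, not part of any real API:

```python
# Sketch: instrumenting an LLM call with prometheus_client.
# Metric names are chosen to match the PromQL queries in this section.
import time
from prometheus_client import Counter, Histogram, generate_latest

TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["model"])
REQUESTS = Counter("llm_requests_total", "LLM requests", ["model"])
ERRORS = Counter("llm_errors_total", "LLM request errors", ["model", "type"])
LATENCY = Histogram("llm_request_duration_seconds", "LLM request latency", ["model"])

def call_llm(model: str, prompt: str) -> str:
    """Wrap a model call with metric updates; the call itself is stubbed out."""
    REQUESTS.labels(model=model).inc()
    start = time.perf_counter()
    try:
        reply = f"echo: {prompt}"  # placeholder for the real model API call
        # Count prompt + completion tokens (word count stands in for a tokenizer).
        TOKENS.labels(model=model).inc(len(prompt.split()) + len(reply.split()))
        return reply
    except Exception as exc:
        ERRORS.labels(model=model, type=type(exc).__name__).inc()
        raise
    finally:
        LATENCY.labels(model=model).observe(time.perf_counter() - start)
```

In a real service you would also call `start_http_server(8000)` so these metrics are served on /metrics for the scrape job above.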
# token consumption rate
rate(llm_tokens_total[5m])
# average response latency
rate(llm_request_duration_seconds_sum[5m]) / rate(llm_request_duration_seconds_count[5m])
# error rate
rate(llm_errors_total[5m]) / rate(llm_requests_total[5m])

Grafana Dashboard
Common panels:
- token consumption trend
- P50/P95/P99 latency distribution
- error type distribution
- model call success rate
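The P50/P95/P99 latency panel is usually driven by the histogram buckets via histogram_quantile. A sketch of the P95 query, assuming the llm_request_duration_seconds histogram used above:
# P95 response latency over 5-minute windows
histogram_quantile(0.95, sum by (le) (rate(llm_request_duration_seconds_bucket[5m])))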
Conclusion
LLM service = regular service + AI-specific metrics.
Monitoring must cover both, or you won’t know when problems occur.