Prometheus + Grafana Monitoring: Making LLM Services Visible

Why LLMs Need Monitoring

Traditional service monitoring:
  - CPU / memory / network
  - request latency / error rate

LLM monitoring additionally needs:
  - token consumption
  - model response latency
  - hallucination rate
  - prompt length distribution
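The AI-specific counters above can be recorded with the official `prometheus_client` library in production; as a dependency-free sketch, here is one way to track them and render the Prometheus text exposition format by hand. The metric names match the PromQL queries used later in this article; the class and helper names are illustrative.

```python
# Minimal, dependency-free sketch: record LLM-specific metrics and
# render them in the Prometheus text exposition format.
# In production you would use prometheus_client's Counter/Histogram.

class LLMMetrics:
    def __init__(self):
        self.tokens_total = 0       # counter: tokens consumed
        self.requests_total = 0     # counter: requests served
        self.errors_total = 0       # counter: failed requests
        self.duration_sum = 0.0     # histogram sum (seconds)
        self.duration_count = 0     # histogram count

    def observe_request(self, tokens, seconds, error=False):
        self.tokens_total += tokens
        self.requests_total += 1
        self.duration_sum += seconds
        self.duration_count += 1
        if error:
            self.errors_total += 1

    def render(self):
        # Prometheus text format: one "name value" sample per line.
        return "\n".join([
            f"llm_tokens_total {self.tokens_total}",
            f"llm_requests_total {self.requests_total}",
            f"llm_errors_total {self.errors_total}",
            f"llm_request_duration_seconds_sum {self.duration_sum}",
            f"llm_request_duration_seconds_count {self.duration_count}",
        ]) + "\n"

m = LLMMetrics()
m.observe_request(tokens=512, seconds=1.8)
m.observe_request(tokens=64, seconds=0.3, error=True)
print(m.render())
```

Prompt length distribution and hallucination rate would be tracked the same way (a histogram and a counter respectively), though hallucination detection itself needs an upstream evaluation step.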

Prometheus Config

scrape_configs:
  - job_name: 'llm-service'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['llm-api:8000']
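Once Prometheus is scraping the service, the same metrics can drive alerting. A hypothetical rule sketch (the 5% threshold and label values are placeholders, loaded via `rule_files` in prometheus.yml):

```yaml
groups:
  - name: llm-alerts
    rules:
      - alert: LLMHighErrorRate
        expr: rate(llm_errors_total[5m]) / rate(llm_requests_total[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LLM error rate above 5% for 10 minutes"
```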

Key Metrics

# token consumption rate
rate(llm_tokens_total[5m])

# average response latency
rate(llm_request_duration_seconds_sum[5m]) / rate(llm_request_duration_seconds_count[5m])

# error rate
rate(llm_errors_total[5m]) / rate(llm_requests_total[5m])
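The average-latency query divides two `rate()` results over the same window, so the window length cancels and it reduces to Δsum / Δcount between scrapes. A toy illustration with made-up sample values:

```python
# Toy illustration of rate(sum)/rate(count):
# both rates divide by the same window, so the result is
# delta(sum) / delta(count). Sample values are made up.

# (cumulative_sum_seconds, cumulative_count) sampled 5 minutes apart
t0 = (120.0, 100)   # at start of window
t1 = (300.0, 220)   # at end of window

window = 300.0  # seconds (5m)

rate_sum = (t1[0] - t0[0]) / window    # latency-seconds per second
rate_count = (t1[1] - t0[1]) / window  # requests per second

avg_latency = rate_sum / rate_count    # = 180.0 / 120 = 1.5 s/request
print(avg_latency)
```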

Grafana Dashboard

Common panels:

  • token consumption trend
  • P50/P95/P99 latency distribution
  • error type distribution
  • model call success rate
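For the percentile panels, assuming llm_request_duration_seconds is exported as a Prometheus histogram (so `_bucket` series exist), one way to compute P95 is:

```
# P95 latency over a 5m window; swap 0.95 for 0.50 / 0.99
histogram_quantile(0.95,
  sum by (le) (rate(llm_request_duration_seconds_bucket[5m])))
```

Quantiles computed this way are bucket-boundary approximations, so bucket edges should be chosen around the latencies you actually care about.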

Conclusion

LLM service = regular service + AI-specific metrics.

Monitoring has to keep pace with those extra metrics, or problems will surface only when users report them.