RAG Evaluation Guide: How to Know If Your RAG Is Good
You build a RAG system, it answers questions, and it looks great in demos. Then a user asks something and the system confidently returns wrong information from a completely unrelated document. Where’s the bug? You don’t know — because you have no quantitative evaluation. You need an evaluation framework before shipping RAG to production, otherwise problems only get discovered by users.
Three Core Metrics for RAG Evaluation
RAG quality is determined by three orthogonal dimensions:
Faithfulness
Question: Is the model’s answer actually grounded in the retrieved context? Or is it hallucinating?
Low faithfulness = hallucination. Even when the right document is retrieved, the model can still “not follow the document” and fabricate content.
Score reference:
- High faithfulness (> 0.85): every claim in the answer can be traced to a retrieved document
- Low faithfulness (< 0.6): answer contains information not present in any retrieved document
Answer Relevance
Question: Does the model’s answer actually address what the user asked?
A common failure mode: the right document is retrieved, but the model answers something else from the document rather than the user’s actual question.
Context Precision
Question: Of the retrieved documents, how many are actually useful?
| Situation | Description |
|---|---|
| High Precision + Low Recall | Retrieved docs are all relevant, but missed important ones |
| Low Precision + High Recall | Retrieved many docs, but most are irrelevant |
| High Precision + High Recall | Ideal state |
Tool Selection
RAGAS
The most mature RAG evaluation metrics library. Directly implements Faithfulness, Answer Relevancy, Context Precision and others. Native support for LangChain and LlamaIndex. Scores are LLM-as-Judge based — no manual annotation required.
TruLens
Evaluation + Tracing in one framework. End-to-end tracing for LangChain and LlamaIndex applications with real-time quality metrics. Best when you want evaluation embedded into application monitoring.
LLM-as-Judge (roll your own)
Use a separate LLM call to evaluate output quality. Cost-controllable and flexible, but requires careful evaluation prompt design.
In Practice: Evaluating Your RAG Pipeline with RAGAS
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
eval_data = {
"question": [
"What is the return policy?",
"How do I reset my password?",
"What payment methods are accepted?",
],
"answer": [
"You can return items within 30 days of purchase.",
"Click 'Forgot Password' on the login page and follow the email instructions.",
"We accept Visa, Mastercard, and PayPal.",
],
"contexts": [
["Our return policy allows returns within 30 days. Items must be unused."],
["To reset your password, click 'Forgot Password' and check your email."],
["Accepted payment methods: Visa, Mastercard, American Express, PayPal."],
],
"ground_truth": [
"Items can be returned within 30 days if unused.",
"Use the 'Forgot Password' link and follow email instructions.",
"Visa, Mastercard, American Express, and PayPal are accepted.",
],
}
dataset = Dataset.from_dict(eval_data)
result = evaluate(
dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# Example output:
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
# 'context_precision': 0.85, 'context_recall': 0.79}Reference score thresholds (varies by use case):
| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| Faithfulness | > 0.75 | > 0.85 | > 0.92 |
| Answer Relevancy | > 0.70 | > 0.80 | > 0.90 |
| Context Precision | > 0.65 | > 0.80 | > 0.90 |
Building an Evaluation Dataset
Evaluation quality depends on dataset quality. Three approaches:
Synthetic Dataset (fastest, good for getting started)
Use LLM to auto-generate QA pairs from your corpus:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
documents,
test_size=50,
distributions={
simple: 0.5, # 50% straightforward questions
reasoning: 0.3, # 30% multi-step reasoning
multi_context: 0.2, # 20% cross-document questions
},
)Human-annotated Dataset (most accurate, for critical scenarios)
- Domain experts manually write questions and ground-truth answers
- Higher cost, but highest quality
- Recommend at least 20 “golden QA pairs” covering core scenarios
Production Data Sampling (closest to real usage, requires privacy handling)
Sample real user queries from production traffic, then label them:
sampled_queries = (
db.query(ProductionTrace)
.filter(ProductionTrace.date >= "2024-01-01")
.order_by(func.random())
.limit(100)
.all()
)
for q in sampled_queries:
print(f"Q: {q.user_query}")
print(f"A: {q.model_response}")
print("---")Minimum viable dataset: 50 QA pairs is enough to start. Include at least:
- 10 “should answer well” questions for core scenarios
- 10 out-of-corpus questions (test if the model knows when to say “I don’t know”)
- 10 questions requiring synthesis across multiple documents
Continuous Evaluation: Run RAG Scoring in CI
Every change to retrieval parameters (embedding model, chunk size, top-k) should trigger an evaluation run:
# evaluate_rag.py — runs in CI
import sys
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
THRESHOLDS = {
"faithfulness": 0.80,
"answer_relevancy": 0.75,
}
def run_evaluation(dataset):
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
failed = []
for metric, threshold in THRESHOLDS.items():
score = result[metric]
status = "✅" if score >= threshold else "❌"
print(f"{status} {metric}: {score:.3f} (threshold: {threshold})")
if score < threshold:
failed.append(metric)
if failed:
print(f"\nFailed metrics: {failed}")
sys.exit(1) # fail CI
if __name__ == "__main__":
dataset = load_eval_dataset("eval_dataset.json")
run_evaluation(dataset)# .github/workflows/rag-eval.yml
name: RAG Evaluation
on:
push:
paths:
- "rag/**"
- "embeddings/**"
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run RAG evaluation
run: python evaluate_rag.py
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}Common Failure Patterns
Understanding failure modes helps you pinpoint problems faster:
Retrieval Failure
Symptom: Low Context Precision, but the right answer exists in the corpus.
Causes:
- Chunk size too large or small, cutting across key information
- Embedding model doesn’t understand domain vocabulary (general vs. domain-specific model)
- Top-k too small — right document ranked outside the cutoff
Debug: For questions with known answers, inspect the top-5 retrieved docs and find where the correct document ranks.
Generation Failure
Symptom: High Context Precision, but low Faithfulness — right document was retrieved, but model didn’t use it.
Causes:
- Context too long — model “forgets” key sections (long context forgetting)
- System prompt too vague — doesn’t explicitly require grounding in context
- Model’s intrinsic hallucination tendency (worse with smaller models)
Fix: Add explicit instruction in System Prompt: “Only answer based on the provided context. If the answer is not in the context, say ‘I don’t know’.”
Coverage Failure
Symptom: User asks something that genuinely isn’t in the corpus, but the model fabricates an answer instead of saying “I don’t know”.
Protection: Add out-of-scope questions to your evaluation set. Test whether the model correctly refuses to answer.
Further Reading
- RAGAS docs — the most comprehensive RAG evaluation metrics documentation
- TruLens — evaluation + tracing in one framework
- LlamaIndex evaluation guide — LlamaIndex official evaluation best practices