Contents

RAG Evaluation Guide: How to Know If Your RAG Is Good

You build a RAG system, it answers questions, and it looks great in demos. Then a user asks something and the system confidently returns wrong information from a completely unrelated document. Where’s the bug? You don’t know — because you have no quantitative evaluation. You need an evaluation framework before shipping RAG to production, otherwise problems only get discovered by users.

Three Core Metrics for RAG Evaluation

RAG quality is determined by three orthogonal dimensions:

Faithfulness

Question: Is the model’s answer actually grounded in the retrieved context? Or is it hallucinating?

Low faithfulness = hallucination. Even when the right document is retrieved, the model can still “not follow the document” and fabricate content.

Score reference:

  • High faithfulness (> 0.85): every claim in the answer can be traced to a retrieved document
  • Low faithfulness (< 0.6): answer contains information not present in any retrieved document

Answer Relevance

Question: Does the model’s answer actually address what the user asked?

A common failure mode: the right document is retrieved, but the model answers something else from the document rather than the user’s actual question.

Context Precision

Question: Of the retrieved documents, how many are actually useful?

Situation Description
High Precision + Low Recall Retrieved docs are all relevant, but missed important ones
Low Precision + High Recall Retrieved many docs, but most are irrelevant
High Precision + High Recall Ideal state

Tool Selection

RAGAS

The most mature RAG evaluation metrics library. Directly implements Faithfulness, Answer Relevancy, Context Precision and others. Native support for LangChain and LlamaIndex. Scores are LLM-as-Judge based — no manual annotation required.

TruLens

Evaluation + Tracing in one framework. End-to-end tracing for LangChain and LlamaIndex applications with real-time quality metrics. Best when you want evaluation embedded into application monitoring.

LLM-as-Judge (roll your own)

Use a separate LLM call to evaluate output quality. Cost-controllable and flexible, but requires careful evaluation prompt design.

In Practice: Evaluating Your RAG Pipeline with RAGAS

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

eval_data = {
    "question": [
        "What is the return policy?",
        "How do I reset my password?",
        "What payment methods are accepted?",
    ],
    "answer": [
        "You can return items within 30 days of purchase.",
        "Click 'Forgot Password' on the login page and follow the email instructions.",
        "We accept Visa, Mastercard, and PayPal.",
    ],
    "contexts": [
        ["Our return policy allows returns within 30 days. Items must be unused."],
        ["To reset your password, click 'Forgot Password' and check your email."],
        ["Accepted payment methods: Visa, Mastercard, American Express, PayPal."],
    ],
    "ground_truth": [
        "Items can be returned within 30 days if unused.",
        "Use the 'Forgot Password' link and follow email instructions.",
        "Visa, Mastercard, American Express, and PayPal are accepted.",
    ],
}

dataset = Dataset.from_dict(eval_data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(result)
# Example output:
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.79}

Reference score thresholds (varies by use case):

Metric Acceptable Good Excellent
Faithfulness > 0.75 > 0.85 > 0.92
Answer Relevancy > 0.70 > 0.80 > 0.90
Context Precision > 0.65 > 0.80 > 0.90

Building an Evaluation Dataset

Evaluation quality depends on dataset quality. Three approaches:

Synthetic Dataset (fastest, good for getting started)

Use LLM to auto-generate QA pairs from your corpus:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,
    distributions={
        simple: 0.5,         # 50% straightforward questions
        reasoning: 0.3,      # 30% multi-step reasoning
        multi_context: 0.2,  # 20% cross-document questions
    },
)

Human-annotated Dataset (most accurate, for critical scenarios)

  • Domain experts manually write questions and ground-truth answers
  • Higher cost, but highest quality
  • Recommend at least 20 “golden QA pairs” covering core scenarios

Production Data Sampling (closest to real usage, requires privacy handling)

Sample real user queries from production traffic, then label them:

sampled_queries = (
    db.query(ProductionTrace)
    .filter(ProductionTrace.date >= "2024-01-01")
    .order_by(func.random())
    .limit(100)
    .all()
)

for q in sampled_queries:
    print(f"Q: {q.user_query}")
    print(f"A: {q.model_response}")
    print("---")

Minimum viable dataset: 50 QA pairs is enough to start. Include at least:

  • 10 “should answer well” questions for core scenarios
  • 10 out-of-corpus questions (test if the model knows when to say “I don’t know”)
  • 10 questions requiring synthesis across multiple documents

Continuous Evaluation: Run RAG Scoring in CI

Every change to retrieval parameters (embedding model, chunk size, top-k) should trigger an evaluation run:

# evaluate_rag.py — runs in CI
import sys
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
}

def run_evaluation(dataset):
    result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

    failed = []
    for metric, threshold in THRESHOLDS.items():
        score = result[metric]
        status = "✅" if score >= threshold else "❌"
        print(f"{status} {metric}: {score:.3f} (threshold: {threshold})")
        if score < threshold:
            failed.append(metric)

    if failed:
        print(f"\nFailed metrics: {failed}")
        sys.exit(1)  # fail CI

if __name__ == "__main__":
    dataset = load_eval_dataset("eval_dataset.json")
    run_evaluation(dataset)
# .github/workflows/rag-eval.yml
name: RAG Evaluation
on:
  push:
    paths:
      - "rag/**"
      - "embeddings/**"
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run RAG evaluation
        run: python evaluate_rag.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Common Failure Patterns

Understanding failure modes helps you pinpoint problems faster:

Retrieval Failure

Symptom: Low Context Precision, but the right answer exists in the corpus.

Causes:

  • Chunk size too large or small, cutting across key information
  • Embedding model doesn’t understand domain vocabulary (general vs. domain-specific model)
  • Top-k too small — right document ranked outside the cutoff

Debug: For questions with known answers, inspect the top-5 retrieved docs and find where the correct document ranks.

Generation Failure

Symptom: High Context Precision, but low Faithfulness — right document was retrieved, but model didn’t use it.

Causes:

  • Context too long — model “forgets” key sections (long context forgetting)
  • System prompt too vague — doesn’t explicitly require grounding in context
  • Model’s intrinsic hallucination tendency (worse with smaller models)

Fix: Add explicit instruction in System Prompt: “Only answer based on the provided context. If the answer is not in the context, say ‘I don’t know’.”

Coverage Failure

Symptom: User asks something that genuinely isn’t in the corpus, but the model fabricates an answer instead of saying “I don’t know”.

Protection: Add out-of-scope questions to your evaluation set. Test whether the model correctly refuses to answer.

Further Reading