RAG in Production: Seven Pitfalls Nobody Tells You About
Background
RAG sounds simple in principle: before the LLM answers, retrieve relevant content from your knowledge base and pass it in as context.
User Query → Retrieval → Relevant Docs → LLM → Answer

But between principle and engineering implementation lie a hundred pitfalls.
These are real production debugging notes, not a RAG tutorial. I’m assuming you know what RAG is and focusing on what actually breaks in production.
Pitfall 1: Wrong Embedding Model
The embedding model sets the ceiling on your retrieval quality.
Common mistake: defaulting to OpenAI’s text-embedding-ada-002.

ada-002 was already behind the curve by late 2023; newer models offer significantly better Chinese understanding and code retrieval.
# ❌ Defaulting to ada-002
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Your text"
)

# ✅ Consider alternatives based on your use case
# Chinese embedding: multilingual-e5-large, BGE
# Code retrieval: codellama embedding, GTE-code

Internal test results (our data):
| Embedding Model | Chinese Semantic | Code Retrieval |
|---|---|---|
| ada-002 | 72% | 65% |
| text-embedding-3-small | 78% | 71% |
| BGE-large-zh | 89% | 68% |
| GTE-code-large | 81% | 91% |
Choice depends on your content. For technical docs, code retrieval capability matters.
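If you want to run this kind of head-to-head on your own content, a small recall@k helper over a labeled query set is all it takes. A minimal, model-agnostic sketch (the function and the doc IDs here are illustrative, not from any library):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

# Two hypothetical embedding models, same labeled query
relevant = ["doc_cpu", "doc_load"]
model_a_top5 = ["doc_gpu", "doc_cpu", "doc_mem", "doc_net", "doc_io"]
model_b_top5 = ["doc_cpu", "doc_load", "doc_mem", "doc_net", "doc_io"]

print(recall_at_k(model_a_top5, relevant, 5))  # 0.5
print(recall_at_k(model_b_top5, relevant, 5))  # 1.0
```

Average this over a few dozen labeled queries per candidate model and you get a table like the one above for your own data.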
Pitfall 2: Chunking Strategy is Arbitrary
Most tutorials teach:
# Fixed-size chunking
texts = text_splitter.split_text(document)

This cuts sentences in half. Retrieved chunks are headless, tailless fragments the LLM can’t interpret.
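The failure mode is easy to reproduce with naive fixed-size slicing in plain Python (no splitter library, toy text of my own):

```python
text = "Restart the service with systemctl restart app. If the service still fails, check the logs under /var/log/app for OOM errors."

# Naive fixed-size chunking: slice every 60 characters, no overlap
chunks = [text[i:i + 60] for i in range(0, len(text), 60)]
for c in chunks:
    print(repr(c))
# Chunk boundaries fall mid-word and mid-sentence, producing
# fragments with no leading context for the LLM to anchor on
```

The second chunk starts in the middle of a word; retrieved alone, it reads as noise.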
Better Strategy: Semantic Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # Target chunk size
    chunk_overlap=50,   # Overlap between chunks to preserve context
    separators=["。\n", ",\n", "\n", " "]  # Split by sentence/paragraph
)

Better still: chunk by semantic unit:
def semantic_chunk(text):
    """
    Chunk by semantic unit: each chunk is a complete Q&A or topic paragraph
    """
    sections = []
    for section in text.split("\n## "):  # Split by Markdown headers
        if len(section) < 100:
            continue  # Skip sections that are little more than a header line
        # If a section is too large, split it by sub-headers
        if len(section) > 1000:
            subsections = section.split("\n### ")
            for sub in subsections:
                if len(sub) > 50:
                    sections.append(sub.strip())
        else:
            sections.append(section.strip())
    return sections

Pitfall 3: Vector-Only Retrieval
Most people only do vector similarity search:
# This is insufficient
results = vector_store.similarity_search(query, k=5)

When the user query is structured, e.g. "find all logs where timeout > 30s", vector retrieval is useless. You need hybrid keyword + vector retrieval.
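What a hybrid retriever actually does is fuse two ranked lists with weights. The core idea, framework-free (a sketch of weighted score fusion with made-up scores, not LangChain’s exact implementation):

```python
def fuse_scores(keyword_hits, vector_hits, w_kw=0.3, w_vec=0.7):
    """Weighted fusion of two {doc_id: score} maps; scores assumed normalized to [0, 1]."""
    fused = {}
    for doc_id, score in keyword_hits.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w_kw * score
    for doc_id, score in vector_hits.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w_vec * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# "timeout > 30s" matches exactly in BM25 but only loosely in embedding space
keyword_hits = {"log_guide": 0.9, "api_ref": 0.2}
vector_hits = {"perf_tuning": 0.6, "log_guide": 0.4}
ranked = fuse_scores(keyword_hits, vector_hits)
print(ranked[0][0])  # log_guide ranks first: 0.3*0.9 + 0.7*0.4 = 0.55
```

The doc that both retrievers agree on wins, even though neither ranked it with the highest single score.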
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# BM25 keyword retrieval
bm25_retriever = BM25Retriever.from_texts(chunks)

# Vector retrieval
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# Hybrid retrieval
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7]  # Low keyword weight, high vector weight
)

Pitfall 4: No Query Rewriting
User’s question and knowledge base content often use different phrasing.
User asks: "Service CPU is very high, what do I do?"
Knowledge base: "High Load Troubleshooting Guide - CPU Usage Anomaly Handling"

Same meaning, completely different wording. Direct retrieval might miss it.
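You can put a number on that gap. A quick word-level Jaccard check (a toy illustration, with English stand-ins for the real phrasing) shows near-zero lexical overlap despite identical intent:

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

query = "service cpu is very high what do i do"
doc_title = "high load troubleshooting guide cpu usage anomaly handling"
print(round(jaccard(query, doc_title), 2))  # 0.14: only "cpu" and "high" overlap
```

BM25 sees almost nothing to match on here; that lexical gap is exactly what query rewriting closes.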
Query Rewriting:
def rewrite_query(query, llm):
    """
    Use the LLM to rewrite the user question into a form better suited for retrieval
    """
    prompt = f"""Rewrite the following natural language question into a form better suited for knowledge base retrieval.
Keep the original meaning, but use more formal language closer to documentation phrasing.
Original: {query}
Rewritten:"""
    response = llm.complete(prompt)
    return response.text.strip()

Usage:

# User asks: service won't start
original = "service won't start"
rewritten = rewrite_query(original, llm)  # Might become "service startup failure troubleshooting"

# Use the rewritten query for retrieval
results = ensemble_retriever.get_relevant_documents(rewritten)

Pitfall 5: Context Overloaded
Retrieval returns 5 chunks at ~1000 tokens each, i.e. 5000 tokens. Add the system prompt and conversation history, and the context window fills up fast.
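It’s worth making that budget explicit before choosing a strategy. A rough 4-characters-per-token heuristic (an approximation, not a real tokenizer) is enough for a back-of-envelope check:

```python
def rough_tokens(text):
    """Crude token estimate: ~4 characters per token for English text."""
    return len(text) // 4

system_prompt = "x" * 2000   # ~500 tokens of instructions
history = "x" * 4000         # ~1000 tokens of conversation
chunks = ["y" * 4000] * 5    # 5 retrieved chunks, ~1000 tokens each

used = rough_tokens(system_prompt) + rough_tokens(history) + sum(rough_tokens(c) for c in chunks)
print(used)  # 6500 tokens consumed before the model writes a single word
```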
Several handling strategies:
5.1 Only Keep Most Relevant Chunks
def rerank_and_truncate(query, llm, max_tokens=3000):
    """
    Rerank with an LLM, then truncate to max_tokens
    """
    # Retrieve more candidates first
    candidates = vector_store.similarity_search(query, k=10)

    # Rerank with the LLM (simple approach: score each candidate)
    scored = []
    for doc in candidates:
        score_prompt = f"""Rate the relevance of this document chunk to answering the question.
Question: {query}
Document: {doc.page_content}
Relevance (1-5, reply with the number only):"""
        score = int(llm.complete(score_prompt).text.strip())
        scored.append((score, doc))

    # Keep the top-scoring candidates
    top_docs = sorted(scored, key=lambda x: x[0], reverse=True)[:5]

    # Truncate to max_tokens
    total_tokens = 0
    selected = []
    for score, doc in top_docs:
        doc_tokens = len(doc.page_content) // 4  # Rough estimate
        if total_tokens + doc_tokens <= max_tokens:
            selected.append(doc)
            total_tokens += doc_tokens
    return selected

5.2 Use Summaries Instead of Full Chunks
def summarize_chunks(chunks, llm, max_per_chunk=200):
    """
    Summarize each chunk before putting it in the context
    """
    summarized = []
    for chunk in chunks:
        if len(chunk.page_content) > max_per_chunk * 4:  # Exceeds roughly max_per_chunk tokens
            summary = llm.complete(
                f"Summarize the following, keep key information, no more than {max_per_chunk} characters:\n\n{chunk.page_content}"
            )
            summarized.append(summary.text.strip())
        else:
            summarized.append(chunk.page_content)
    return summarized

Pitfall 6: Embeddings Don’t Auto-Update
The knowledge base is dynamic, but the embedding store doesn’t update itself.
Problem:
- Document updated, but embedding is stale—retrieval results inaccurate
- Product evolved, terminology changed—users can’t find with new terms
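Detecting the drift is the cheap part. One way to implement a version check is a stable content hash over the corpus (a sketch with an in-memory `{doc_id: content}` mapping of my own; a real system would hash files or DB rows):

```python
import hashlib

def docs_version(docs):
    """Stable hash over a {doc_id: content} mapping; changes iff any content changes."""
    h = hashlib.sha256()
    for doc_id in sorted(docs):
        h.update(doc_id.encode())
        h.update(docs[doc_id].encode())
    return h.hexdigest()

v1 = docs_version({"faq": "Restart the service.", "guide": "Check the logs."})
v2 = docs_version({"faq": "Restart the service.", "guide": "Check the logs first."})
print(v1 != v2)  # any edit flips the version
```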
Solution:
# Periodically rebuild embeddings
import schedule

def rebuild_index_if_needed():
    """
    Check document versions; rebuild embeddings for anything that changed
    """
    docs_version = get_docs_version()  # From DB or file hash
    if docs_version != cached_version:
        print("Documents updated, rebuilding index...")
        # Incremental update: only process changed documents
        changed_docs = get_changed_documents(cached_version, docs_version)
        for doc in changed_docs:
            # Delete the stale embedding
            vector_store.delete_by_doc_id(doc.id)
            # Re-embed and re-insert the updated document
            vector_store.add_texts([doc.content], ids=[doc.id])
        update_cached_version(docs_version)

# Check daily
schedule.every().day.at("02:00").do(rebuild_index_if_needed)

Pitfall 7: No Evaluation Metrics
After launch, how do you know if the RAG system is good? Most teams don’t think about this.
Basic Metrics
def evaluate_rag_system(test_questions, ground_truth, rag_pipeline):
    """
    Basic evaluation: retrieval recall and answer relevance
    """
    results = {
        'retrieval_recall': [],
        'answer_relevance': []
    }
    for question, expected_docs in zip(test_questions, ground_truth):
        # Retrieval evaluation
        retrieved = rag_pipeline.retrieve(question)
        retrieved_ids = {doc.id for doc in retrieved}
        expected_ids = {doc.id for doc in expected_docs}
        recall = len(retrieved_ids & expected_ids) / len(expected_ids)
        results['retrieval_recall'].append(recall)

        # Answer quality evaluation (LLM-scored)
        answer = rag_pipeline.answer(question)
        relevance_score = llm.evaluate(
            f"Question: {question}\nAnswer: {answer}\nScore 1-5, 5 is best"
        )
        results['answer_relevance'].append(int(relevance_score))

    return {
        'avg_recall': sum(results['retrieval_recall']) / len(results['retrieval_recall']),
        'avg_relevance': sum(results['answer_relevance']) / len(results['answer_relevance'])
    }

Conclusion
Real production RAG difficulties:
- Embedding selection — not one-size-fits-all with ada-002
- Chunking strategy — semantic » fixed-length
- Hybrid retrieval — keyword + vector, not vector-only
- Query rewriting — gap between user language and documentation language
- Context management — truncation, summarization, reranking
- Index updates — knowledge base changed, embeddings must sync
- Evaluation system — no metrics = no optimization
RAG isn’t “deploy and forget”—it needs continuous tuning and monitoring.