RAG in Production: Seven Pitfalls Nobody Tells You About
Background
RAG sounds simple in principle: before the LLM answers, retrieve relevant content from your knowledge base and pass it in as context.
User Query → Retrieval → Relevant Docs → LLM → Answer

But between principle and engineering implementation lie a hundred pitfalls.
These are real production debugging notes, not a RAG tutorial. I’m assuming you know what RAG is and focusing on what actually breaks in production.
Pitfall 1: Wrong Embedding Model
The embedding model sets the ceiling on your retrieval quality.
Common mistake: defaulting to OpenAI’s text-embedding-ada-002.

ada-002 was already behind the curve by late 2023; newer models offer significantly better Chinese understanding and code retrieval.
# ❌ Defaulting to ada-002
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Your text"
)

# ✅ Consider alternatives based on your use case
# Chinese embedding: multilingual-e5-large, BGE
# Code retrieval: codellama embedding, GTE-code

Internal test results (our data):
| Embedding Model | Chinese Semantic | Code Retrieval |
|---|---|---|
| ada-002 | 72% | 65% |
| text-embedding-3-small | 78% | 71% |
| BGE-large-zh | 89% | 68% |
| GTE-code-large | 81% | 91% |
Choice depends on your content. For technical docs, code retrieval capability matters.
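If you want to run this kind of head-to-head on your own content, a small recall@k helper over a labeled query set is all it takes. A minimal, model-agnostic sketch (the function and the doc IDs here are illustrative, not from any library):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

# Two hypothetical embedding models, same labeled query
relevant = ["doc_cpu", "doc_load"]
model_a_top5 = ["doc_gpu", "doc_cpu", "doc_mem", "doc_net", "doc_io"]
model_b_top5 = ["doc_cpu", "doc_load", "doc_mem", "doc_net", "doc_io"]

print(recall_at_k(model_a_top5, relevant, 5))  # 0.5
print(recall_at_k(model_b_top5, relevant, 5))  # 1.0
```

Average this over a few dozen labeled queries per candidate model and you get a table like the one above for your own data.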
Pitfall 2: Chunking Strategy is Arbitrary
Most tutorials teach:
# Fixed-size chunking
texts = text_splitter.split_text(document)

This cuts sentences in half. Retrieved chunks are headless, tailless fragments the LLM can’t interpret.
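The failure mode is easy to reproduce with naive fixed-size slicing in plain Python (no splitter library, toy text of my own):

```python
text = "Restart the service with systemctl restart app. If the service still fails, check the logs under /var/log/app for OOM errors."

# Naive fixed-size chunking: slice every 60 characters, no overlap
chunks = [text[i:i + 60] for i in range(0, len(text), 60)]
for c in chunks:
    print(repr(c))
# Chunk boundaries fall mid-word and mid-sentence, producing
# fragments with no leading context for the LLM to anchor on
```

The second chunk starts in the middle of a word; retrieved alone, it reads as noise.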
Better Strategy: Semantic Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # Target chunk size
    chunk_overlap=50,   # Overlap between chunks to preserve context
    separators=["。\n", ",\n", "\n", " "]  # Split by sentence/paragraph
)

Better still: chunk by semantic unit:
def semantic_chunk(text):
    """
    Chunk by semantic unit: each chunk is a complete Q&A or topic paragraph
    """
    sections = []
    for section in text.split("\n## "):  # Split by Markdown headers
        if len(section) < 100:
            continue  # Skip sections that are little more than a header line
        # If a section is too large, split it by sub-headers
        if len(section) > 1000:
            subsections = section.split("\n### ")
            for sub in subsections:
                if len(sub) > 50:
                    sections.append(sub.strip())
        else:
            sections.append(section.strip())
    return sections

Pitfall 3: Vector-Only Retrieval
Most people only do vector similarity search:
# This is insufficient
results = vector_store.similarity_search(query, k=5)

When the user query is structured, e.g. "find all logs where timeout > 30s", vector retrieval is useless. You need hybrid keyword + vector retrieval.
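What a hybrid retriever actually does is fuse two ranked lists with weights. The core idea, framework-free (a sketch of weighted score fusion with made-up scores, not LangChain’s exact implementation):

```python
def fuse_scores(keyword_hits, vector_hits, w_kw=0.3, w_vec=0.7):
    """Weighted fusion of two {doc_id: score} maps; scores assumed normalized to [0, 1]."""
    fused = {}
    for doc_id, score in keyword_hits.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w_kw * score
    for doc_id, score in vector_hits.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w_vec * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# "timeout > 30s" matches exactly in BM25 but only loosely in embedding space
keyword_hits = {"log_guide": 0.9, "api_ref": 0.2}
vector_hits = {"perf_tuning": 0.6, "log_guide": 0.4}
ranked = fuse_scores(keyword_hits, vector_hits)
print(ranked[0][0])  # log_guide ranks first: 0.3*0.9 + 0.7*0.4 = 0.55
```

The doc that both retrievers agree on wins, even though neither ranked it with the highest single score.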
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# BM25 keyword retrieval
bm25_retriever = BM25Retriever.from_texts(chunks)

# Vector retrieval
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# Hybrid retrieval
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7]  # Low keyword weight, high vector weight
)

Pitfall 4: No Query Rewriting
User’s question and knowledge base content often use different phrasing.
User asks: "Service CPU is very high, what do I do?"
Knowledge base: "High Load Troubleshooting Guide - CPU Usage Anomaly Handling"

Same meaning, completely different wording. Direct retrieval might miss it.
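You can put a number on that gap. A quick word-level Jaccard check (a toy illustration, with English stand-ins for the real phrasing) shows near-zero lexical overlap despite identical intent:

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

query = "service cpu is very high what do i do"
doc_title = "high load troubleshooting guide cpu usage anomaly handling"
print(round(jaccard(query, doc_title), 2))  # 0.14: only "cpu" and "high" overlap
```

BM25 sees almost nothing to match on here; that lexical gap is exactly what query rewriting closes.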
Query Rewriting:
def rewrite_query(query, llm):
    """
    Use the LLM to rewrite the user question into a form better suited for retrieval
    """
    prompt = f"""Rewrite the following natural language question into a form better suited for knowledge base retrieval.
Keep the original meaning, but use more formal language closer to documentation phrasing.
Original: {query}
Rewritten:"""
    response = llm.complete(prompt)
    return response.text.strip()

Usage:

# User asks: service won't start
original = "service won't start"
rewritten = rewrite_query(original, llm)  # Might become "service startup failure troubleshooting"

# Use the rewritten query for retrieval
results = ensemble_retriever.get_relevant_documents(rewritten)

Pitfall 5: Context Overloaded
Retrieval returns 5 chunks at ~1000 tokens each, i.e. 5000 tokens. Add the system prompt and conversation history, and the context window fills up fast.
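It’s worth making that budget explicit before choosing a strategy. A rough 4-characters-per-token heuristic (an approximation, not a real tokenizer) is enough for a back-of-envelope check:

```python
def rough_tokens(text):
    """Crude token estimate: ~4 characters per token for English text."""
    return len(text) // 4

system_prompt = "x" * 2000   # ~500 tokens of instructions
history = "x" * 4000         # ~1000 tokens of conversation
chunks = ["y" * 4000] * 5    # 5 retrieved chunks, ~1000 tokens each

used = rough_tokens(system_prompt) + rough_tokens(history) + sum(rough_tokens(c) for c in chunks)
print(used)  # 6500 tokens consumed before the model writes a single word
```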
Several handling strategies:
5.1 Only Keep Most Relevant Chunks
def rerank_and_truncate(query, llm, max_tokens=3000):
    """
    Rerank with an LLM, then truncate to max_tokens
    """
    # Retrieve more candidates first
    candidates = vector_store.similarity_search(query, k=10)

    # Rerank with the LLM (simple approach: score each candidate)
    scored = []
    for doc in candidates:
        score_prompt = f"""Rate the relevance of this document chunk to answering the question.
Question: {query}
Document: {doc.page_content}
Relevance (1-5, reply with the number only):"""
        score = int(llm.complete(score_prompt).text.strip())
        scored.append((score, doc))

    # Keep the top-scoring candidates
    top_docs = sorted(scored, key=lambda x: x[0], reverse=True)[:5]

    # Truncate to max_tokens
    total_tokens = 0
    selected = []
    for score, doc in top_docs:
        doc_tokens = len(doc.page_content) // 4  # Rough estimate
        if total_tokens + doc_tokens <= max_tokens:
            selected.append(doc)
            total_tokens += doc_tokens
    return selected

5.2 Use Summaries Instead of Full Chunks
def summarize_chunks(chunks, llm, max_per_chunk=200):
    """
    Summarize each chunk before putting it in the context
    """
    summarized = []
    for chunk in chunks:
        if len(chunk.page_content) > max_per_chunk * 4:  # Exceeds roughly max_per_chunk tokens
            summary = llm.complete(
                f"Summarize the following, keep key information, no more than {max_per_chunk} characters:\n\n{chunk.page_content}"
            )
            summarized.append(summary.text.strip())
        else:
            summarized.append(chunk.page_content)
    return summarized

Pitfall 6: Embeddings Don’t Auto-Update
The knowledge base is dynamic, but the embedding store doesn’t update itself.
Problem:
- Document updated, but embedding is stale—retrieval results inaccurate
- Product evolved, terminology changed—users can’t find with new terms
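Detecting the drift is the cheap part. One way to implement a version check is a stable content hash over the corpus (a sketch with an in-memory `{doc_id: content}` mapping of my own; a real system would hash files or DB rows):

```python
import hashlib

def docs_version(docs):
    """Stable hash over a {doc_id: content} mapping; changes iff any content changes."""
    h = hashlib.sha256()
    for doc_id in sorted(docs):
        h.update(doc_id.encode())
        h.update(docs[doc_id].encode())
    return h.hexdigest()

v1 = docs_version({"faq": "Restart the service.", "guide": "Check the logs."})
v2 = docs_version({"faq": "Restart the service.", "guide": "Check the logs first."})
print(v1 != v2)  # any edit flips the version
```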
Solution:
# Periodically rebuild embeddings
import schedule

def rebuild_index_if_needed():
    """
    Check document versions; rebuild embeddings for anything that changed
    """
    docs_version = get_docs_version()  # From DB or file hash
    if docs_version != cached_version:
        print("Documents updated, rebuilding index...")
        # Incremental update: only process changed documents
        changed_docs = get_changed_documents(cached_version, docs_version)
        for doc in changed_docs:
            # Delete the stale embedding
            vector_store.delete_by_doc_id(doc.id)
            # Re-embed and re-insert the updated document
            vector_store.add_texts([doc.content], ids=[doc.id])
        update_cached_version(docs_version)

# Check daily
schedule.every().day.at("02:00").do(rebuild_index_if_needed)

Pitfall 7: No Evaluation Metrics
After launch, how do you know if the RAG system is good? Most teams don’t think about this.
Basic Metrics
def evaluate_rag_system(test_questions, ground_truth, rag_pipeline):
    """
    Basic evaluation: retrieval recall and answer relevance
    """
    results = {
        'retrieval_recall': [],
        'answer_relevance': []
    }
    for question, expected_docs in zip(test_questions, ground_truth):
        # Retrieval evaluation
        retrieved = rag_pipeline.retrieve(question)
        retrieved_ids = {doc.id for doc in retrieved}
        expected_ids = {doc.id for doc in expected_docs}
        recall = len(retrieved_ids & expected_ids) / len(expected_ids)
        results['retrieval_recall'].append(recall)

        # Answer quality evaluation (LLM-scored)
        answer = rag_pipeline.answer(question)
        relevance_score = llm.evaluate(
            f"Question: {question}\nAnswer: {answer}\nScore 1-5, 5 is best"
        )
        results['answer_relevance'].append(int(relevance_score))

    return {
        'avg_recall': sum(results['retrieval_recall']) / len(results['retrieval_recall']),
        'avg_relevance': sum(results['answer_relevance']) / len(results['answer_relevance'])
    }

Conclusion
Real production RAG difficulties:
- Embedding selection — not one-size-fits-all with ada-002
- Chunking strategy — semantic » fixed-length
- Hybrid retrieval — keyword + vector, not vector-only
- Query rewriting — gap between user language and documentation language
- Context management — truncation, summarization, reranking
- Index updates — knowledge base changed, embeddings must sync
- Evaluation system — no metrics = no optimization
RAG isn’t “deploy and forget”—it needs continuous tuning and monitoring.