AI Context Management in Practice: RAG Is Not a Silver Bullet
Bottom Line First
RAG and long context aren't substitutes; they're complementary.

RAG is a good fit for:
- large-scale knowledge retrieval (corpora over ~1M tokens)
- scenarios that need precise retrieval
- cost-sensitive deployments (tokens are metered)

Long context is a good fit for:
- cross-document reasoning
- complex multi-hop queries
- medium-scale corpora (under ~200k tokens)

RAG's Real Capability
What RAG Is
# RAG = Retrieval-Augmented Generation
# 1. retrieve relevant documents
# 2. put the documents in the prompt
# 3. the LLM generates an answer
docs = vectorstore.similarity_search(query, k=5)
context = "\n\n".join(doc.page_content for doc in docs)
prompt = f"Context: {context}\n\nQuestion: {query}"
response = llm.generate(prompt)

RAG Good Scenarios
# RAG is a good fit when:
# - the document store exceeds ~1M tokens
# - you need precise retrieval (e.g. "find policy XX")
# - the knowledge base updates in real time
# - cost matters (tokens are metered)

RAG Limitations
# RAG's problems:
# 1. retrieval quality depends on the embedding model
# 2. multi-hop queries perform poorly
# 3. retrieved passages may be irrelevant
# example:
query = "Who authored this function? When was it last modified?"
# a cross-dimensional query like this is hard for similarity search to retrieve

Long Context's Real Capability
Long Context Good Scenarios
# Long context is a good fit for:
# - codebase analysis (understanding cross-file dependencies)
# - contract review (understanding relationships across chapters)
# - complex document Q&A
# example:
context = load_entire_codebase()  # e.g. ~50k tokens
query = "What's the data flow of this module? Where might it have problems?"
# a long-context model can follow dependencies across files

Long Context Limitations
# Long context's problems:
# 1. high cost (every token goes through the LLM)
# 2. information in the middle is easily ignored ("lost in the middle")
# 3. quality drops beyond ~200k tokens

# Research finding:
# LLMs pay the least attention to the middle of the context;
# important information works better at the beginning or the end

When to Use Which
| Scenario | Recommended | Reason |
|---|---|---|
| retrieval over 100k documents | RAG | corpus too large; retrieval is more efficient |
| deep analysis of a single codebase | Long Context | needs cross-file understanding |
| multi-chapter contract analysis | Long Context | chapters depend on each other |
| knowledge base with real-time updates | RAG | the data source keeps changing |
| simple Q&A | RAG | precise retrieval is more effective |
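The table above can be distilled into a small decision helper. This is an illustrative sketch, not any library's API; the function name, flag names, and the ~1M / ~200k token thresholds are assumptions taken from the guidelines earlier in this article.

```python
def choose_approach(corpus_tokens: int,
                    needs_cross_doc_reasoning: bool,
                    realtime_updates: bool) -> str:
    """Rough decision rule distilled from the table above.

    The thresholds are heuristics, not hard limits.
    """
    if realtime_updates:
        return "RAG"                    # data source keeps changing
    if corpus_tokens > 1_000_000:
        return "RAG"                    # too large to stuff into one prompt
    if needs_cross_doc_reasoning and corpus_tokens < 200_000:
        return "Long Context"           # fits in context, needs a global view
    if needs_cross_doc_reasoning:
        return "RAG + Long Context"     # coarse filter first, then close read
    return "RAG"                        # simple, precise lookup

# example: a 50k-token codebase with cross-file questions
print(choose_approach(50_000, True, False))  # -> Long Context
```

The order of the checks encodes the priorities from the table: freshness first, then scale, then reasoning needs.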
Practice: RAG + Long Context Combined
# Approach: RAG coarse filter first, then Long Context close reading

# Step 1: RAG coarse filter
relevant_docs = vectorstore.similarity_search(query, k=10)

# Step 2: Long Context close reading
context = "\n\n".join(doc.page_content for doc in relevant_docs)
prompt = f"Context: {context}\n\nQuestion: {query}"

# if the combined context still exceeds ~200k, process it in batches:
if len(context) > 200_000:
    results = []
    for start in range(0, len(context), 200_000):
        chunk = context[start:start + 200_000]
        results.append(llm.generate(f"Context: {chunk}\n\nQuestion: {query}"))
    final_result = merge_results(results)  # e.g. one more LLM pass over the partial answers
else:
    final_result = llm.generate(prompt)

Conclusion
RAG and long context each have their place:
- RAG: large-scale retrieval, cost-sensitive deployments, precise matching
- Long context: complex reasoning, cross-document understanding, medium-scale corpora

Real systems often combine both:
- RAG for coarse filtering
- Long context for close reading

Figure out whether your scenario is "retrieval" or "reasoning", and pick the approach that matches.
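One last practical detail for the long-context "close reading" step: the "lost in the middle" effect discussed earlier can be partly mitigated by reordering retrieved passages so the strongest candidates sit at the edges of the prompt. A minimal sketch; the function name is mine, and it assumes the input is already ranked best-first:

```python
def reorder_for_long_context(docs_best_first):
    """Interleave passages so the most relevant land at the prompt's
    beginning and end, pushing the least relevant toward the middle
    (a simple "lost in the middle" mitigation)."""
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# example: passages ranked 1 (best) .. 5 (worst)
print(reorder_for_long_context([1, 2, 3, 4, 5]))  # -> [1, 3, 5, 4, 2]
```

The best passage opens the prompt and the second-best closes it, matching the attention pattern described in the limitations section above.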