May 20, 2025
The RAG Quality Problem: Why Retrieval is Only Half the Battle
Every enterprise wants RAG. Retrieval-Augmented Generation promises to make LLMs useful for internal knowledge - your documents, your policies, your data.
The standard pitch: index your documents into a vector database, retrieve relevant chunks when users ask questions, feed those chunks to an LLM, and get accurate answers grounded in your content.
Simple, right?
After building RAG systems for banking, insurance, and government clients over the past 18 months, we’ve learned that the industry’s obsession with retrieval optimization misses where RAG actually fails in production.
The Retrieval Fixation
Open any RAG tutorial and you’ll find detailed coverage of:
- Chunking strategies (fixed-size, semantic, recursive)
- Embedding model selection (OpenAI, Cohere, open-source)
- Vector database comparisons (Pinecone, Weaviate, Qdrant, Milvus)
- Retrieval algorithms (similarity search, hybrid search, reranking)
These matter. But in our production deployments, retrieval accounts for maybe 30% of quality issues. The other 70% comes from problems that happen after you’ve retrieved the right documents.
Where RAG Actually Fails
Problem 1: The Model Ignores Retrieved Context
You retrieve the perfect document chunk. It contains exactly the answer the user needs. The model ignores it and hallucinates anyway.
This happens more than you’d expect, especially when:
- The retrieved context contradicts the model’s training data
- The answer requires synthesizing information across multiple chunks
- The context is technical or domain-specific
We measured this on a financial services RAG system. With perfect retrieval (human-verified relevant chunks), the model still produced incorrect or ungrounded answers 23% of the time.
Root cause: LLMs are trained to be helpful. When the retrieved context doesn’t clearly answer the question, the model fills in gaps from its parametric knowledge - which might be wrong or outdated.
What works: Explicit grounding instructions that tell the model to say “I don’t have enough information” rather than guess. We’ve found this prompt pattern effective:
```
You are answering questions using ONLY the provided context.

Rules:
1. If the context contains the answer, provide it with a citation
2. If the context is relevant but incomplete, say what you found and what's missing
3. If the context doesn't address the question, say "This information is not in the available documents"
4. NEVER use information from outside the provided context

Context:
{retrieved_chunks}

Question: {user_question}
```
This reduces hallucination but increases “I don’t know” responses. That’s the right tradeoff for enterprise use cases.
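For concreteness, here's a minimal sketch of how this template might be wired into a generation call, assuming an OpenAI-style chat client. The model name, the abbreviated prompt constant, and the chunk-numbering convention are illustrative assumptions; any chat-completion API follows the same shape.

```python
from openai import OpenAI

client = OpenAI()

# GROUNDING_PROMPT abbreviates the template shown above; only the two
# placeholders matter for the wiring.
GROUNDING_PROMPT = (
    "You are answering questions using ONLY the provided context.\n"
    "(...rules 1-4 from the template above...)\n\n"
    "Context:\n{retrieved_chunks}\n\n"
    "Question: {user_question}"
)


def grounded_answer(chunks: list[str], question: str) -> str:
    # Number the chunks so the model can cite them individually
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    prompt = GROUNDING_PROMPT.format(retrieved_chunks=context, user_question=question)
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # keep the answer close to the retrieved context
    )
    return response.choices[0].message.content
```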
Problem 2: Citation Accuracy
Users don’t just want answers - they want to verify those answers against source documents. RAG systems promise this but often fail to deliver accurate citations.
Common citation failures:
- Citing a document that doesn’t actually support the claim
- Providing page numbers that don’t exist
- Attributing synthesized information to a single source
- Hallucinating document names entirely
We evaluated citation accuracy on a legal document RAG system:
| Metric | Score |
|---|---|
| Answer relevance | 87% |
| Citation provided | 94% |
| Citation exists | 89% |
| Citation actually supports claim | 71% |
That last number is the killer. 71% means nearly 1 in 3 citations don’t actually support what the model claims they do.
What works: Generate-then-verify pipelines:
```mermaid
flowchart LR
    A[User Query] --> B[Retrieve Chunks]
    B --> C[Generate Answer + Citations]
    C --> D[Citation Verifier]
    D --> E{Citations Valid?}
    E -->|Yes| F[Return Response]
    E -->|No| G[Regenerate with Feedback]
    G --> C
    style D fill:#90EE90
```
The citation verifier checks whether each cited chunk actually contains information supporting the claim. If not, it provides feedback to the generator for a second attempt.
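Here's a minimal sketch of that verification step, assuming citations are tracked as (claim, chunk) pairs. The term-overlap check is a deliberately simple stand-in for whatever entailment check (an NLI model or an LLM-as-judge call) the production verifier uses.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    claim: str       # the sentence in the answer
    chunk_id: str    # identifier of the cited chunk
    chunk_text: str  # text of the cited chunk


def supports(claim: str, evidence: str) -> bool:
    # Placeholder check: simple term overlap. In practice this would be an
    # NLI model or an LLM-as-judge call asking "does this evidence support
    # this claim?"
    claim_terms = set(claim.lower().split())
    evidence_terms = set(evidence.lower().split())
    return len(claim_terms & evidence_terms) >= 0.5 * max(len(claim_terms), 1)


def verify_citations(citations: list[Citation]) -> list[str]:
    """Return one feedback message per citation that fails verification."""
    return [
        f"Claim '{c.claim}' is not supported by chunk {c.chunk_id}; "
        "cite a chunk that supports it or remove the claim."
        for c in citations
        if not supports(c.claim, c.chunk_text)
    ]
```

If `verify_citations` returns any feedback, the pipeline appends it to the generator prompt and makes the second attempt shown in the diagram.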
Problem 3: Cross-Document Synthesis
Real questions often require information from multiple documents. “What’s our leave policy for employees who joined after 2023 and work in the Bangalore office?”
This might require:
- General leave policy document
- 2023 policy amendments
- Bangalore-specific HR guidelines
Standard RAG retrieves chunks independently. The model receives a jumble of potentially contradictory information and has to figure out which parts apply.
What works: Hierarchical retrieval with explicit relationship mapping:
```python
class HierarchicalRetriever:
    def retrieve(self, query: str) -> RetrievalResult:
        # First pass: identify relevant document categories
        categories = self.classify_query(query)

        # Second pass: retrieve within each category
        chunks_by_category = {}
        for category in categories:
            chunks_by_category[category] = self.retrieve_from_category(
                query, category, top_k=3
            )

        # Third pass: identify relationships
        relationships = self.identify_relationships(chunks_by_category)

        return RetrievalResult(
            chunks=chunks_by_category,
            relationships=relationships,
            synthesis_guidance=self.generate_synthesis_prompt(relationships)
        )
```
The synthesis guidance tells the model: “Document A is the general policy. Document B contains amendments that override Document A for employees joining after 2023. Document C contains location-specific rules that take precedence for Bangalore office.”
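For a sense of what that guidance might look like in code, here's a minimal sketch of a `generate_synthesis_prompt` helper, assuming relationships are represented as (source, relation, target) triples. The actual representation inside the retriever isn't shown in this post, so treat this as an illustration rather than the real implementation.

```python
# Illustrative sketch: turn relationship triples into synthesis guidance text.
RELATION_TEMPLATES = {
    "amends": "{source} contains amendments that override {target} where they conflict.",
    "specializes": "{source} contains location- or role-specific rules that take precedence over {target}.",
    "general": "{source} is the general policy and applies unless overridden.",
}


def generate_synthesis_prompt(relationships: list[tuple[str, str, str]]) -> str:
    lines = ["When answering, combine the documents as follows:"]
    for source, relation, target in relationships:
        template = RELATION_TEMPLATES.get(relation, "{source} is related to {target}.")
        lines.append("- " + template.format(source=source, target=target))
    return "\n".join(lines)


# Example for the leave-policy question above:
guidance = generate_synthesis_prompt([
    ("General Leave Policy", "general", ""),
    ("2023 Policy Amendments", "amends", "General Leave Policy"),
    ("Bangalore HR Guidelines", "specializes", "General Leave Policy"),
])
```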
Problem 4: Temporal Reasoning
Documents have versions. Policies get updated. Which version applies to the user’s question?
We’ve seen RAG systems confidently cite outdated policies because:
- The old version was better chunked
- The old version had more similar terminology to the query
- The timestamp metadata wasn’t used in retrieval
What works: Temporal-aware retrieval that understands document lifecycle:
```python
from datetime import date


def retrieve_with_temporal_awareness(query: str, context_date: date) -> list[Chunk]:
    # Retrieve candidates
    candidates = vector_search(query, top_k=20)

    # Filter by temporal validity
    valid_candidates = []
    for chunk in candidates:
        doc = chunk.document
        if doc.effective_date <= context_date:
            if doc.superseded_date is None or doc.superseded_date > context_date:
                valid_candidates.append(chunk)

    # Re-rank valid candidates
    return rerank(query, valid_candidates, top_k=5)
```
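Usage then looks like the snippet below: the caller passes the date the question is about, which is often today but can be a past date for audit-style questions. The field names follow the sketch above.

```python
from datetime import date

# "Which leave policy applied in March 2024?" -> resolve against that date,
# not against today's documents.
chunks = retrieve_with_temporal_awareness(
    "leave policy for new joiners", context_date=date(2024, 3, 1)
)
```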
Problem 5: Confidence Calibration
When should the RAG system answer vs. escalate to a human? Most implementations have no principled way to make that decision.
A system that’s wrong 20% of the time sounds bad. But if it knew which 20% it was uncertain about and escalated those, it would be highly useful. The problem is that LLM confidence scores don’t correlate well with actual correctness.
What works: Multi-signal confidence estimation:
```mermaid
flowchart TD
    A[RAG Response] --> B[Confidence Estimator]
    B --> C[Retrieval Confidence]
    B --> D[Generation Confidence]
    B --> E[Citation Confidence]
    B --> F[Consistency Check]
    C --> G{Aggregate Score}
    D --> G
    E --> G
    F --> G
    G -->|High| H[Return Response]
    G -->|Medium| I[Return with Caveat]
    G -->|Low| J[Escalate to Human]
```
Each signal contributes:
- Retrieval confidence: How similar are the top chunks? High variance suggests uncertain retrieval.
- Generation confidence: Does the model express uncertainty in its language?
- Citation confidence: Do citations verify correctly?
- Consistency check: If we run the query again, do we get the same answer?
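Here's a minimal sketch of how these signals might be aggregated into an escalation decision. The weights and thresholds are illustrative assumptions; in practice they're tuned against labeled correct/incorrect outcomes.

```python
from dataclasses import dataclass


@dataclass
class ConfidenceSignals:
    retrieval: float    # e.g. mean similarity of top chunks, penalized by high variance
    generation: float   # e.g. 0 if the answer hedges heavily, 1 if assertive
    citation: float     # fraction of citations that verify correctly
    consistency: float  # agreement between repeated runs of the same query


def route(signals: ConfidenceSignals) -> str:
    # Illustrative weights; tune against labeled outcomes
    score = (
        0.25 * signals.retrieval
        + 0.15 * signals.generation
        + 0.35 * signals.citation
        + 0.25 * signals.consistency
    )
    if score >= 0.8:
        return "respond"
    if score >= 0.6:
        return "respond_with_caveat"
    return "escalate_to_human"
```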
The Architecture That Works
Based on our production deployments, here’s the RAG architecture we recommend:
```mermaid
flowchart TB
    subgraph QU["Query Understanding"]
        A[User Query] --> B[Query Analyzer]
        B --> C[Intent Classification]
        B --> D[Temporal Context]
        B --> E[Entity Extraction]
    end

    subgraph RP["Retrieval Pipeline"]
        F[Hierarchical Retriever]
        F --> G[Temporal Filter]
        G --> H[Cross-Reference Resolution]
        H --> I[Context Assembly]
    end

    subgraph GP["Generation Pipeline"]
        I --> J[Grounded Generator]
        J --> K[Citation Extractor]
        K --> L[Citation Verifier]
        L --> M{Valid?}
        M -->|No| J
        M -->|Yes| N[Confidence Estimator]
    end

    subgraph QC["Quality Control"]
        N --> O{Confidence Level}
        O -->|High| P[Direct Response]
        O -->|Medium| Q[Response with Caveats]
        O -->|Low| R[Human Escalation Queue]
    end

    QU --> RP
```
Measuring What Matters
Stop measuring just retrieval metrics. Here’s what actually predicts user satisfaction:
| Metric | What It Measures | Target |
|---|---|---|
| Answer Groundedness | % of claims supported by retrieved context | > 95% |
| Citation Accuracy | % of citations that verify correctly | > 90% |
| Completeness | % of relevant information included | > 85% |
| Appropriate Uncertainty | % of uncertain cases correctly flagged | > 80% |
| User Correction Rate | % of responses users mark as wrong | < 10% |
We’ve built these metrics into Guardian, our AI reliability monitoring platform. You can’t improve what you don’t measure.
The Indian Enterprise Context
RAG for Indian enterprises has specific challenges:
Multilingual documents: A policy document might be in English, but the regional office implementation guide is in Hindi or Tamil. Your retrieval needs to work across languages.
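One common approach, sketched below, is to embed everything with a multilingual model so that an English query can retrieve Hindi or Tamil chunks from the same vector space. The specific model named here is one widely used option, not necessarily what runs in any given production system.

```python
from sentence_transformers import SentenceTransformer

# A multilingual embedding model maps different languages into one vector
# space, so an English query can retrieve Hindi or Tamil chunks directly.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

texts = [
    "What is the leave policy for new joiners?",   # English query
    "नए कर्मचारियों के लिए छुट्टी नीति क्या है?",          # Hindi chunk
]
embeddings = model.encode(texts, normalize_embeddings=True)
similarity = float(embeddings[0] @ embeddings[1])  # cosine similarity
```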
Document quality variance: Government documents, regulatory circulars, and internal memos have wildly different formatting quality. OCR errors are common in scanned documents.
Update patterns: Indian regulatory documents update frequently with circulars and amendments rather than clean new versions. Tracking what’s current requires understanding the amendment chain.
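A minimal sketch of amendment-chain tracking, assuming each circular records which base document it amends and when it was issued; the data model here is an assumption for illustration, not Dastavez's internal representation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Circular:
    doc_id: str
    issued: date
    amends: str | None  # doc_id of the base document this circular amends, if any


def amendment_chain(base_doc_id: str, circulars: list[Circular]) -> list[str]:
    """Return [base, amendment_1, amendment_2, ...] in the order they apply."""
    chain = sorted(
        (c for c in circulars if c.amends == base_doc_id),
        key=lambda c: c.issued,
    )
    # Later circulars take precedence for the sections they touch, so retrieval
    # should prefer text from the tail of this list when sections conflict.
    return [base_doc_id] + [c.doc_id for c in chain]
```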
This is why we built Dastavez with Indian document understanding at its core - multi-script OCR, government form recognition, and amendment tracking built in.
Getting RAG Right
RAG isn’t a product you install. It’s an architecture you build, measure, and iterate on.
If you’re struggling with RAG quality in production:
- Measure generation quality, not just retrieval. You probably have blind spots.
- Implement citation verification. Your users will fact-check you.
- Build appropriate escalation paths. Not every question should get an AI answer.
- Test with your actual documents. Demo-quality PDFs behave differently than real enterprise content.
We’ve helped enterprises across banking, insurance, and government build RAG systems that actually work in production. The difference between demo and deployment is engineering discipline applied to all the problems that happen after retrieval.
Contact us if you want to discuss your RAG challenges.