December 22, 2025
Your RAG Pipeline is a Hallucination Machine
Retrieval-Augmented Generation was supposed to be the answer.
The pitch was compelling: instead of relying on the model’s parametric memory (which hallucinates), retrieve relevant documents and ground responses in actual sources. Problem solved.
Except it wasn’t.
After eighteen months of deploying RAG systems in production across financial services, legal, and healthcare in India, we’ve catalogued failure modes that nobody warned us about. RAG doesn’t eliminate hallucination. It transforms it into something more insidious: hallucination that looks like it has sources.
The Retrieval-Generation Gap
Let’s start with the fundamental problem: retrieval and generation are two different systems with two different failure modes, and we’ve glued them together hoping for the best.
The retriever finds chunks that match the query - usually by embedding similarity. The generator takes those chunks and produces a response. Nothing guarantees these two systems are aligned.
flowchart TD
subgraph Query["USER QUERY"]
Q["What is the penalty for late GST filing?"]
end
subgraph Retriever["RETRIEVER"]
R1["GST filing deadlines"]
R2["penalties under GST"]
R3["late fee calculations"]
R4["interest on delayed payment<br/><span style='color:#f97316'>← different topic</span>"]
R5["filing extensions during COVID<br/><span style='color:#ef4444'>← outdated</span>"]
Check1["✓ Semantically similar"]
Check2["? Actually answers the question"]
end
subgraph Generator["GENERATOR"]
G1["Receives: 5 chunks, some relevant, some not"]
G2["May synthesize contradictory information"]
G3["May cite outdated COVID provisions as current"]
G4["May confuse penalties with interest"]
G5["May fill gaps with parametric hallucination"]
Check3["✓ Fluent output"]
Check4["? Factually correct"]
end
Query --> Retriever
Retriever --> Generator
style R4 fill:#fef3c7,stroke:#f97316
style R5 fill:#fee2e2,stroke:#ef4444
THE GAP
Retrieval quality ≠ Generation quality. You can retrieve the right documents and still generate the wrong answer.
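To make the gap concrete, here's a minimal sketch of the two-stage architecture in the diagram above. The `retrieve` and `generate` callables are placeholders for whatever embedding search and LLM you actually use; the point is that nothing between the two calls checks recency, consistency, or whether the chunks answer the question.

```python
from typing import Callable, List

def naive_rag(
    query: str,
    retrieve: Callable[[str, int], List[str]],  # embedding-similarity search over chunks
    generate: Callable[[str], str],             # the LLM call
    top_k: int = 5,
) -> str:
    """Two independent systems glued together: retrieval picks chunks by
    similarity, generation produces fluent text from whatever it is given.
    No step in between checks that the chunks are current, consistent,
    or actually answer the question."""
    chunks = retrieve(query, top_k)
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate(prompt)
```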
The Seven Deadly Failure Modes
We’ve identified seven distinct ways RAG pipelines fail. Each one appears correct to users - they all come with citations and confident language. That’s what makes them dangerous.
1. Contradictory Chunk Synthesis
When the retriever pulls chunks that contradict each other, the generator doesn’t flag this. It synthesizes them into a coherent-sounding response that’s actually nonsense.
We saw this repeatedly with regulatory documents. An older circular says one thing, a newer one amends it. The retriever pulls both. The generator confidently mashes them together.
Contradictory Chunk Synthesis Example
CHUNK 1 (2019)
"Form 26AS must be filed annually by June 30th"
CHUNK 2 (2023)
"The deadline for Form 26AS has been extended to July 31st"
GENERATED RESPONSE
"Form 26AS must be filed by June 30th, though in some cases the deadline may be July 31st depending on circumstances."
FAILURE: Model doesn't identify that one chunk supersedes the other. Creates false optionality.
2. Confidence-Accuracy Inversion
This is the failure mode that scares me most.
When the model retrieves chunks that are tangentially related to the query, it becomes more confident, not less. It found “evidence.” Never mind that the evidence doesn’t actually answer the question.
In our testing, users rated responses that included irrelevant retrieved context as more trustworthy than responses where the model admitted uncertainty. The citations created false trust.
Confidence-Accuracy Inversion
Users trust responses MORE when they have citations, even when those citations are only tangentially relevant
3. Citation Fabrication
The model cites a document. The document exists. The quote… doesn’t.
This is different from pure hallucination. The model isn’t inventing sources out of thin air. It’s inventing quotes from real sources. It saw the document in the context, understood its general topic, and generated a plausible-sounding quote that matches the topic but never actually appears in the document.
We call this “attributed confabulation.” It’s nearly impossible for users to catch without reading the full source document.
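One cheap, partial defence is a verbatim quote check: if the response presents something as a quote from a cited document, confirm the string actually appears in that document. A minimal sketch using only the standard library; the 0.9 fuzzy-match threshold is an arbitrary starting point, not a recommendation.

```python
import re
from difflib import SequenceMatcher

def quote_appears_in_source(quote: str, source_text: str, threshold: float = 0.9) -> bool:
    """Return True if `quote` appears verbatim (or near-verbatim) in `source_text`."""
    normalise = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    quote, source_text = normalise(quote), normalise(source_text)

    if quote in source_text:
        return True

    # Fuzzy fallback: slide a quote-sized window across the source to catch
    # quotes mangled by OCR artefacts or minor whitespace differences.
    window = len(quote)
    step = max(1, window // 4)
    for start in range(0, max(1, len(source_text) - window + 1), step):
        candidate = source_text[start:start + window]
        if SequenceMatcher(None, quote, candidate).ratio() >= threshold:
            return True
    return False
```

This catches fabricated verbatim quotes; it does nothing for paraphrased confabulation, which needs the claim-level verification described later.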
4. Temporal Confusion
RAG pipelines typically don’t handle time well. A chunk from 2019 and a chunk from 2024 are treated equivalently. The model has no reliable way to know which represents current truth.
For legal and regulatory use cases, this is catastrophic. Laws change. Regulations update. A RAG system that doesn’t understand document temporality will confidently cite repealed provisions.
flowchart LR
subgraph Store["DOCUMENT STORE"]
D1["Circular 2015<br/>FDI limit: 49%"]
D2["Amendment 2021<br/>FDI limit: 74%"]
end
Q["Query: What are FDI limits<br/>in insurance?"] --> Retriever
D1 --> Retriever
D2 --> Retriever
Retriever["Both retrieved with<br/>similar relevance scores"] --> Outputs
subgraph Outputs["POSSIBLE OUTPUTS"]
O1["'The FDI limit is 49%'<br/><span style='color:#ef4444'>(outdated)</span>"]
O2["'The FDI limit is 74%'<br/><span style='color:#22c55e'>(correct but why?)</span>"]
O3["'The FDI limit ranges<br/>from 49-74%'<br/><span style='color:#f97316'>(confused)</span>"]
end
style O1 fill:#fee2e2,stroke:#ef4444
style O2 fill:#dcfce7,stroke:#22c55e
style O3 fill:#fef3c7,stroke:#f97316
Without explicit temporal reasoning, RAG can't distinguish current law from historical artifacts
5. Context Boundary Bleeding
Chunking creates artificial boundaries. A concept explained across two paragraphs gets split. The model retrieves one chunk without the other and generates a response based on incomplete information.
Even worse: sometimes the second paragraph qualifies or contradicts the first. “The standard rate is X. However, in cases of Y, the rate is Z.” If your chunk boundary falls between these sentences, the model will confidently cite the standard rate for a Y case.
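A contrived illustration of how this happens with naive fixed-size chunking; the passage and rates are made up, but the boundary behaviour is typical.

```python
def chunk_by_characters(text: str, chunk_size: int) -> list[str]:
    """Naive fixed-size chunking: no awareness of sentence or clause boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Hypothetical passage: the second sentence qualifies the first.
passage = (
    "The standard rate is 18%. "
    "However, in cases covered by the concessional scheme, the rate is 5%."
)

for chunk in chunk_by_characters(passage, 26):
    print(repr(chunk))
# The qualifier lands in a separate chunk. A retriever that returns only the
# first chunk will confidently cite the 18% rate for a concessional-scheme case.
```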
6. Retrieval Anchoring
The first few retrieved chunks have disproportionate influence on the response. If your retriever surfaces a tangentially relevant but incorrect chunk as the top result, the model will often anchor on it even if more relevant chunks appear lower.
This is the RAG equivalent of the “lost in the middle” problem. The model doesn’t treat all retrieved chunks equally - position matters, and position is determined by retriever quirks, not ground truth.
7. Parametric Gap-Filling
When retrieved chunks don’t fully answer the question, the model fills gaps with parametric knowledge. This might be correct, it might be hallucinated, and the user has no way to tell which parts came from retrieval and which came from the model’s training.
The dangerous part: the response reads seamlessly. Retrieved facts and parametric gap-fills are woven together with no indication of which is which.
Parametric Gap-Filling Example
RETRIEVED CONTEXT
"Company XYZ was founded in 2018 and operates in the fintech sector, focusing on payment solutions."
QUERY
"What is XYZ's annual revenue?"
GENERATED RESPONSE
"Company XYZ, founded in 2018, is a fintech company specializing in payment solutions. Based on their market position, their annual revenue is estimated at approximately Rs.500 crores with projected growth of 40% year-over-year given the expansion of digital payments in India."
User sees: Confident response with specific numbers
Reality: Only company name and founding year are grounded
Why Standard Fixes Don’t Work
The industry has proposed various fixes. None of them fully solve the problem.
“Just use better embeddings.” Better embeddings improve retrieval relevance, but they don’t solve temporal confusion, contradiction handling, or parametric gap-filling. You can have perfect retrieval and still generate hallucinated responses.
“Add a reranker.” Rerankers help with retrieval anchoring, but they’re still operating on semantic similarity. A reranker can’t tell you that a 2019 regulation was superseded by a 2023 amendment.
“Use hybrid search.” Combining keyword and semantic search improves recall, but it doesn’t address any of the generation-side failure modes. More retrieved chunks often means more opportunities for contradiction and confusion.
“Fine-tune on your domain.” Fine-tuning can help with domain-specific language, but it doesn’t fix the fundamental architecture. A fine-tuned model can still synthesize contradictions and fabricate citations.
“Prompt engineering.” Prompting the model to be more careful, cite sources, or admit uncertainty helps at the margins but doesn’t eliminate failure modes. Models prompted to “only use information from the provided context” still fill gaps with parametric knowledge.
What Actually Works
After extensive testing, here’s what we’ve found actually reduces RAG failure rates in production:
Chunk-Level Verification
Don’t trust the generator to correctly use retrieved chunks. Build a verification layer that checks whether each claim in the response actually appears in the cited chunk.
This catches citation fabrication and parametric gap-filling. It’s expensive - you’re essentially running inference twice - but for high-stakes applications, it’s necessary.
flowchart TD
Q[Query] --> R[Retriever]
R --> C["Retrieved Chunks (1-N)"]
C --> G[Generator]
G --> Resp["Response with<br/>inline citations"]
Resp --> V[VERIFIER]
subgraph Verification["VERIFICATION PROCESS"]
V --> V1["Extract cited chunk"]
V1 --> V2["Check if claim appears"]
V2 --> V3["Flag gaps"]
end
V3 --> Decision{Result?}
Decision -->|Verified| Pass["VERIFIED<br/>(Serve)"]
Decision -->|Flagged| Review["FLAGGED<br/>(Review)"]
Decision -->|Rejected| Retry["REJECTED<br/>(Retry)"]
style Pass fill:#22c55e,stroke:#16a34a,color:#000
style Review fill:#eab308,stroke:#ca8a04,color:#000
style Retry fill:#ef4444,stroke:#dc2626,color:#fff
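Here's a minimal sketch of the verifier stage from the diagram above. It assumes the response has already been decomposed into claims with chunk citations (the `claims` list), and it takes the entailment check as a callable (`is_supported`) so you can plug in an NLI model, a second LLM call, or strict string matching, whichever you trust.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ClaimCheck:
    claim: str
    chunk_id: str
    supported: bool

def verify_response(
    claims: List[Dict[str, str]],              # [{"text": ..., "chunk_id": ...}, ...]
    chunks: Dict[str, str],                    # chunk_id -> chunk text
    is_supported: Callable[[str, str], bool],  # (claim, chunk text) -> bool
) -> str:
    """Check every cited claim against the chunk it cites."""
    results = [
        ClaimCheck(
            claim=c["text"],
            chunk_id=c["chunk_id"],
            supported=c["chunk_id"] in chunks
            and is_supported(c["text"], chunks[c["chunk_id"]]),
        )
        for c in claims
    ]
    unsupported = sum(1 for r in results if not r.supported)
    if unsupported == 0:
        return "VERIFIED"   # serve
    if unsupported < len(results):
        return "FLAGGED"    # partial support: route to review
    return "REJECTED"       # nothing checks out: retry generation
```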
Temporal Metadata Enforcement
Every chunk needs temporal metadata. When was this document published? When was it last updated? Is it superseded by something else?
The retriever should use this metadata to filter or rank. The generator should receive explicit temporal context. “Based on the 2023 amendment (superseding 2019 circular)…” not just “Based on the retrieved documents…”
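A sketch of what enforcement can look like, assuming each chunk carries a publication date and, where known, a pointer to the document that supersedes it. Those fields are assumptions about your ingestion pipeline, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class Chunk:
    chunk_id: str
    text: str
    published: date
    superseded_by: Optional[str] = None  # id of the superseding document, if known

def temporally_filtered(chunks: List[Chunk]) -> List[Chunk]:
    """Drop chunks known to be superseded and put the newest material first."""
    current = [c for c in chunks if c.superseded_by is None]
    return sorted(current, key=lambda c: c.published, reverse=True)

def with_temporal_context(chunk: Chunk) -> str:
    """Make the document's date explicit in the prompt instead of hoping
    the model infers it."""
    return f"[Published {chunk.published.isoformat()}] {chunk.text}"
```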
Contradiction Detection
Before generation, run a contradiction detection pass on retrieved chunks. If chunks contradict each other, don’t just pass them to the generator and hope for the best.
Either resolve the contradiction (using temporal metadata, source authority ranking, or explicit conflict resolution rules), or surface it explicitly to the user. “The retrieved documents contain conflicting information about X. Document A says… while Document B says…”
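A sketch of the pre-generation pass. The `contradicts` check is whatever pairwise judgment you trust (an NLI model or an LLM judge), and the quadratic cost is fine at typical top-5 to top-10 retrieval sizes.

```python
from itertools import combinations
from typing import Callable, List, Tuple

def find_contradictions(
    chunks: List[str],
    contradicts: Callable[[str, str], bool],  # NLI model or LLM judge of your choice
) -> List[Tuple[int, int]]:
    """Return index pairs of retrieved chunks that contradict each other."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(chunks), 2)
        if contradicts(a, b)
    ]

# If pairs come back non-empty: resolve using temporal metadata or source
# authority, or surface the conflict to the user verbatim. Don't pass both
# chunks to the generator and hope.
```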
Confidence Decomposition
Decompose confidence into retrieval confidence and generation confidence. Show users both.
“High retrieval confidence, high generation confidence” means something different from “high retrieval confidence, low generation confidence” (the retrieved docs are relevant but don’t answer the question) or “low retrieval confidence, high generation confidence” (the answer is likely parametric).
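A sketch of the bookkeeping. The inputs here (retriever similarity scores and per-claim support scores from the verifier) are one reasonable choice among many, not a canonical definition of either confidence.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class ConfidenceReport:
    retrieval: float   # did we find relevant, current material?
    generation: float  # is the generated answer actually supported by that material?

def decompose_confidence(
    similarity_scores: List[float],  # retriever scores for the chunks actually used
    support_scores: List[float],     # per-claim support scores from the verifier
) -> ConfidenceReport:
    """Two numbers, shown separately, never collapsed into one."""
    return ConfidenceReport(
        retrieval=mean(similarity_scores) if similarity_scores else 0.0,
        generation=mean(support_scores) if support_scores else 0.0,
    )

# High retrieval, low generation: relevant documents that don't answer the question.
# Low retrieval, high generation: the answer is probably parametric; treat with suspicion.
```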
Provenance Tracking
Track exactly which parts of the response came from which chunks. Not just “this response cites documents A, B, and C” but “this sentence came from chunk A, this phrase came from chunk B, this part is unsourced.”
When users can see provenance at the sentence level, they can make informed trust decisions.
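A crude sketch of that bookkeeping, using lexical overlap as a stand-in for the entailment model you'd actually want; the 0.6 threshold is arbitrary.

```python
import re
from typing import Dict, List, Tuple

def sentence_provenance(
    response: str,
    chunks: Dict[str, str],    # chunk_id -> chunk text
    min_overlap: float = 0.6,  # arbitrary threshold; tune on your own data
) -> List[Tuple[str, str]]:
    """Attribute each response sentence to its best-matching chunk, or mark it unsourced."""
    tokenise = lambda s: set(re.findall(r"\w+", s.lower()))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]

    provenance = []
    for sentence in sentences:
        tokens = tokenise(sentence)
        best_id, best_score = "unsourced", 0.0
        for chunk_id, text in chunks.items():
            overlap = len(tokens & tokenise(text)) / len(tokens) if tokens else 0.0
            if overlap > best_score:
                best_id, best_score = chunk_id, overlap
        provenance.append((sentence, best_id if best_score >= min_overlap else "unsourced"))
    return provenance
```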
The Honest Conclusion
RAG is genuinely useful. It enables AI systems to work with private data, stay current, and cite sources. These are real capabilities.
But RAG doesn’t solve hallucination. It transforms it. And in some ways, the transformed version is more dangerous - because it comes with citations, users trust it more, even when they shouldn’t.
If you’re deploying RAG in production, especially for high-stakes applications, you need:
- Verification layers that catch fabricated citations
- Temporal reasoning that distinguishes current from historical
- Contradiction detection that surfaces conflicts rather than synthesizing nonsense
- Provenance tracking that shows users exactly what’s grounded and what isn’t
- Confidence decomposition that separates retrieval quality from generation quality
Building RAG without these safeguards is building a hallucination machine that looks like it has sources.
That’s worse than having no sources at all.
RAG is a tool, not a solution. Use it with appropriate safeguards - or don’t use it for high-stakes applications.
Struggling with RAG reliability? Guardian monitors RAG pipelines in production, catching hallucination patterns before they reach users. Get in touch to discuss your deployment.