RAG is a retrieval problem until it isn't
Most RAG failures aren't embedding quality issues. They're context discipline problems — and that's a harder fix.
Updated March 18, 2026

The dominant failure mode in RAG deployments is not what most teams think it is.
When a RAG system returns a wrong or hallucinated answer, the default assumption is that the embeddings are poor, the chunking is too coarse, or the retriever is returning irrelevant passages. So teams swap embedding models, tune chunk sizes, and re-run evals. Sometimes this helps. More often the score moves by a few points and the system still fails on the same class of questions.
The real problem is context discipline.
What context discipline means
A RAG system gives the LLM a set of passages and asks it to generate an answer. The LLM’s job is to use only those passages. In practice, LLMs don’t do this cleanly. They blend the retrieved context with their parametric knowledge — the information baked into the weights during training.
When parametric knowledge is accurate and the retrieved context is relevant, this blending produces good results. When they conflict, or when the retrieved context is noisy, the LLM tends to favour its parametric knowledge. It “ignores” the retrieval and answers from memory.
This isn’t a retrieval failure. The retriever returned correct passages. The failure is in how the generation step handles context conflict.
Why this is harder to fix
Embedding quality is a measurable, tractable problem. You can benchmark it with BEIR, swap models, and observe score changes. There’s a tight feedback loop.
Context discipline is murkier. You’re trying to change a behaviour that’s deeply embedded in how the model was trained. The options are:
- Prompt engineering: Explicit instructions to “use only the provided context” help somewhat but don’t fully prevent parametric bleeding, especially in long context windows (a minimal prompt sketch follows this list).
- Context-faithful fine-tuning: Training on examples where the correct answer requires ignoring conflicting parametric knowledge. This works but requires a carefully constructed training set and access to compute.
- LLM-as-judge evaluation: Separate faithfulness evaluation from relevance evaluation. A faithfulness judge checks whether the answer is grounded in the retrieved passages. This surfaces context discipline failures that standard retrieval metrics miss entirely.
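To make the first option concrete, here is a minimal sketch of a grounded-generation prompt. The exact wording, the passage numbering, and the refusal fallback are assumptions for illustration, not a tested recipe; the point stands that instructions like this reduce but don’t eliminate parametric bleeding.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Prompt that asks the model to answer strictly from the supplied passages."""
    # Number the passages so the model (and later, a judge) can refer to them.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the passages below. "
        "If the passages do not contain the answer, reply \"I don't know\". "
        "Do not use outside knowledge, even if you are confident you know the answer.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```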
The third approach is what RAG Arena was built around. Most RAG benchmarks collapse faithfulness and relevance into a single quality score, which obscures which part of the pipeline is failing.
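Here is a minimal sketch of what a faithfulness judge can look like. `call_llm` is a placeholder for whatever completion client you use, and the prompt wording and the YES/NO protocol are illustrative assumptions, not RAG Arena’s actual implementation.

```python
from typing import Callable

FAITHFULNESS_PROMPT = """\
You are grading a RAG answer for faithfulness, not correctness.

Passages:
{passages}

Question: {question}
Answer: {answer}

Is every factual claim in the answer supported by the passages above?
Reply with exactly one word: YES or NO."""


def judge_faithfulness(
    question: str,
    passages: list[str],
    answer: str,
    call_llm: Callable[[str], str],  # placeholder: your LLM client, prompt in, text out
) -> bool:
    """Return True if the judge says the answer is grounded in the passages."""
    prompt = FAITHFULNESS_PROMPT.format(
        passages="\n\n".join(passages), question=question, answer=answer
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("YES")
```

Keeping this verdict separate from any relevance score is the point: a passage set can be perfectly relevant while the answer floats free of it.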
The practical implication
If your RAG system is underperforming, before touching the retriever:
- Evaluate faithfulness separately from relevance. Use an LLM judge that answers: is this response supported by the retrieved passages?
- Identify the question types where faithfulness drops. They’ll cluster — typically around questions where the LLM has strong parametric knowledge (well-known entities, recent events in training data) and the retrieved context adds nuance or correction.
- Only then look at the retriever. If faithfulness is low even when relevance is high, you have a generation problem, not a retrieval problem.
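A small triage script makes the last two steps mechanical. The field names, the relevance threshold, and the question-type taxonomy are assumptions about your own eval output, not a standard schema.

```python
from collections import Counter


def triage(results: list[dict], relevance_threshold: float = 0.7) -> Counter:
    """Bucket eval results into retrieval-side vs generation-side failures.

    Each result dict is assumed to carry: 'relevance' (0-1 retrieval score),
    'faithful' (bool from the judge), and 'question_type' (your own taxonomy).
    """
    buckets = Counter()
    for r in results:
        relevant = r["relevance"] >= relevance_threshold
        if relevant and not r["faithful"]:
            # Retrieval did its job; the generator ignored the passages.
            buckets[("generation_problem", r["question_type"])] += 1
        elif not relevant:
            buckets[("retrieval_problem", r["question_type"])] += 1
        else:
            buckets[("ok", r["question_type"])] += 1
    return buckets
```

If the generation_problem bucket dominates for question types the model already “knows”, you are looking at a context discipline failure, not a retrieval one.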
Most RAG failures I’ve debugged turn out to be context discipline failures wearing retrieval’s clothes.