RAG is the easy part. Retrieval is the hard part.

Engineering notes · Black Iris

The first time you stand up a retrieval-augmented generation system, it feels too easy. Embed your documents, drop them into a vector database, run a similarity search, stuff the top-K chunks into the prompt, ship. Two afternoons of work. The demo is convincing.

Then real users start asking real questions and the system starts confidently citing the wrong section of the manual. The LLM didn't fail. The retriever did. And almost nobody is monitoring the retriever.

We shipped a RAG-based assistant over a 77-page administrator manual for one of our platforms — Postgres with pgvector as the store, a reasonably mainstream embedding model, a popular LLM behind it. Here's what we actually learned, and what we'd tell anyone about to start.

1. Chunking is where the system is decided

Most chunking advice is one of: "split on tokens", "split on sentences", "split on paragraphs". None of these is wrong, and none of these is enough. A heading like "Step 3: Approve a Volunteer" matters more than the paragraph beneath it, but split-on-paragraphs throws the heading away. Split-on-tokens cleaves mid-sentence and turns half your chunks into nonsense.

What worked for us was hierarchical, structure-aware chunking. Parse the document into sections by heading. Within sections, emit overlapping passages that always carry the section title as a prefix. Each chunk has not just text, but ancestry: which manual, which section, which subsection. That ancestry lands in metadata, in the embedding (because the section title is prefixed), and in the response (as the citation).
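A minimal sketch of that idea, assuming the manual is available as plain text with Markdown-style `#` headings. The `Chunk` type, the `chunk_document` function, the word-based windows, and the default sizes are all illustrative assumptions, not our production chunker.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")  # assumes Markdown-style headings


@dataclass
class Chunk:
    text: str            # what gets embedded: ancestry prefix + passage
    ancestry: list[str]  # [manual, section, subsection, ...], also the citation


def chunk_document(manual: str, body: str,
                   window: int = 80, overlap: int = 20) -> list[Chunk]:
    """Split `body` by heading, then emit overlapping word windows per section."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    chunks: list[Chunk] = []
    path: list[str] = []   # current heading stack
    words: list[str] = []  # words collected for the current section

    def flush() -> None:
        # Emit the passages of the section that just ended.
        if not words:
            return
        ancestry = [manual, *path]
        prefix = " > ".join(ancestry)
        step = window - overlap
        for start in range(0, len(words), step):
            passage = " ".join(words[start:start + window])
            chunks.append(Chunk(text=f"{prefix}\n{passage}", ancestry=ancestry))
            if start + window >= len(words):
                break  # the last window already reached the end of the section
        words.clear()

    for line in body.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            words.extend(line.split())
    flush()
    return chunks
```

Because the prefix is part of `text`, the section title is embedded along with the passage, and `ancestry` can be stored as metadata and reused as the citation.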

A weekend spent on the chunker pays back more than a quarter spent fine-tuning the LLM.

2. You cannot ship without recall metrics

"Vibes-based evaluation" is the default state of most RAG systems in production. Someone on the team asks five questions, gets answers that look reasonable, and the system goes live. Six months later, no one can say whether retrieval has gotten better or worse, because no one is measuring.

Build a labeled evaluation set the day you build the retriever. Real questions from real users (or plausible ones if you don't have traffic yet), each annotated with which chunks contain the correct answer. Now you can measure:

Recall@K — for what fraction of questions is at least one correct chunk in the top K?
MRR (mean reciprocal rank) — how high in the list does the correct chunk land, on average?
End-to-end answer accuracy — given the retrieved context, does the LLM produce a correct response?
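The first two metrics can be computed in a few lines. The sketch below assumes the evaluation set is a list of `(question, relevant_chunk_ids)` pairs and that `retrieve` is your own function returning a ranked list of chunk ids; both names are hypothetical. End-to-end answer accuracy is left out because it needs a grader for the generated answers.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant chunk id is in the top k, else 0.0."""
    return 1.0 if any(cid in relevant for cid in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(eval_set: list[tuple[str, set[str]]],
             retrieve: Callable[[str], list[str]],
             k: int = 5) -> dict[str, float]:
    """Average both metrics over every labeled question."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0.0
    rr_total = 0.0
    for question, relevant in eval_set:
        ranked = retrieve(question)
        hits += recall_at_k(ranked, relevant, k)
        rr_total += reciprocal_rank(ranked, relevant)
    n = len(eval_set)
    return {f"recall@{k}": hits / n, "mrr": rr_total / n}
```

Running `evaluate` on every change to the chunker, the embedding model, or K turns "did retrieval get better?" into a number you can track.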

You may well discover that "raise K from 3 to 10" buys more accuracy than "switch to the more expensive model." That's normal: the cheaper change is often also the better one. You'd never know without the metrics.

3. The LLM is downstream of the truth

A common failure mode: retrieval misses the answer, the LLM is asked to respond anyway, and the model, eager to please, fabricates something that sounds right. Hallucinations look like an LLM problem; most of them are a retrieval problem in disguise.

Two cheap defenses go a long way. First, instruct the model explicitly: if the answer is not in the provided context, say you don't know. Then test that instruction with adversarial queries. Second, require citations. Every claim in the response should map back to a specific chunk, and the UI should surface that mapping. A model that can't cite is a model that's guessing.
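Both defenses can be sketched without touching any particular LLM API. The prompt wording, the `[chunk_id]` citation format, the `REFUSAL` string, and the function names below are assumptions chosen for illustration. The check only confirms that cited ids exist in the retrieved context; it does not prove that each claim is actually supported by the chunk it cites.

```python
import re

# Hypothetical refusal string the model is told to use verbatim.
REFUSAL = "I don't know based on the provided manual."
CITATION = re.compile(r"\[([^\[\]]+)\]")


def build_prompt(question: str, chunks: dict[str, str]) -> str:
    """Build a prompt that demands citations and allows an explicit refusal."""
    context = "\n\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    return (
        "Answer using only the context below. "
        "After each claim, cite the supporting chunk id in square brackets. "
        f'If the answer is not in the context, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def check_citations(answer: str, chunks: dict[str, str]) -> tuple[bool, set[str]]:
    """Return (ok, unknown_ids).

    A verbatim refusal passes. Any other answer must cite at least one
    chunk, and every cited id must be one that was actually retrieved.
    """
    if answer.strip() == REFUSAL:
        return True, set()
    cited = set(CITATION.findall(answer))
    unknown = cited - set(chunks)
    return bool(cited) and not unknown, unknown
```

An answer that fails `check_citations` can be retried, replaced with the refusal, or logged as an adversarial test case for the instruction.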

4. Hybrid retrieval beats vector-only, almost every time

Vector similarity is great at "this question is semantically near that passage." It is terrible at "this question contains a literal phrase that appears in exactly one section." Most real questions are a mix.

Combining vector search with classical keyword scoring (BM25, or just Postgres full-text search) — and re-ranking the union — consistently outperformed vector-only retrieval for us. The improvement was largest on the hardest queries, where precise terminology matters and the embedding alone smooths over the wrong details.
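One simple way to merge the two result lists is Reciprocal Rank Fusion (RRF), sketched below. This is a swapped-in illustration of "re-ranking the union", not necessarily the re-ranker we used. It assumes each retriever returns a ranked list of chunk ids; `rrf_fuse` is a hypothetical name, and `k = 60` is the constant commonly used with RRF.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Merge ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains, so a
    chunk ranked well by both retrievers rises to the top of the union.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


# Usage: ids from a vector search and from a keyword (BM25 / full-text) search.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "d", "a"]
fused = rrf_fuse([vector_hits, keyword_hits])
```

RRF needs only ranks, so the vector distances and keyword scores never have to be put on a common scale.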

5. The cost question is upside down

The temptation when a RAG system underperforms is to reach for a more expensive model. Better generators do help — at the margins. The bigger wins, in our experience, come from cheap LLMs with excellent retrieval. A small model that always gets the right context outperforms a large model that's compensating for missing information.

This has commercial implications too. A system that nails recall doesn't need to send 16 KB of half-relevant context with every request. Tighter retrieval directly translates to lower per-query cost.

The actual takeaway

RAG isn't an LLM technique. It's an information retrieval problem with a generative interface bolted onto the end. The teams getting good results are the ones who took the retrieval part seriously: they built the eval set, measured recall, invested in chunking, used hybrid search, and refused to ship without a citation. The teams shipping convincing demos and quietly broken production systems are the ones who treated retrieval as a one-line library call. Don't be the second team.
