RAGnosis · benchmark report
Retrieval experiments over a synthetic clinic database
Applied research · retrieval-augmented generation

Model choice helped. Changing what the system retrieves helped more.

The experiment, run by run

Eight runs, two jumps

Each point is the best configuration of that run (the one with the highest overall answer score). The two marked interventions changed what was retrieved, not which model ranked it, and they account for most of the gain.

Where the gains came from

Answer quality by question category

Judge scores on a 1–5 scale, per question category, at four milestones. Holistic (whole-corpus) questions did not move until precomputed rollup documents made their answers retrievable; numerical questions needed rollups to pass 4. Each category holds 3–5 questions, so read swings as directional.
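A minimal sketch of how per-category judge scores could be averaged, and why a category of 3 to 5 questions is only directional. The function name `category_means` and the sample scores are made up for illustration; they are not the report's data or its actual scoring code.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def category_means(scores: Iterable[Tuple[str, int]]) -> Dict[str, Tuple[float, int, float]]:
    """Return {category: (mean score, question count, shift per 1-point change)}."""
    by_category: Dict[str, List[int]] = defaultdict(list)
    for category, score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"judge score out of range: {score}")
        by_category[category].append(score)
    # With n questions, one question moving by a single point shifts the mean by 1/n.
    return {
        cat: (mean(vals), len(vals), 1 / len(vals))
        for cat, vals in by_category.items()
    }


# Illustrative scores only (hypothetical, not taken from the benchmark).
sample = [
    ("holistic", 2), ("holistic", 3), ("holistic", 1),
    ("numerical", 4), ("numerical", 5), ("numerical", 4),
]
summary = category_means(sample)
```

With three questions in a category, a one-point change on a single question moves the category mean by about 0.33, which is why small swings should not be over-read.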

Scale: 1 (poor) to 5 (correct & complete)
The final architecture

How the final pipeline answers a question

Small documents are precise to search; large documents are complete to answer from. The pipeline searches one and answers from the other, and precomputed aggregates fill the gap neither can cover.

01 · QUESTION

A clinic-database question

“Which doctor has the largest appointment load?” — a fact, a join, or an aggregate.

02 · REWRITE

Query rewriting

An LLM rewrites the question into a search query; retrieval runs with both versions and merges candidates.
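One way to merge the two candidate lists is sketched below. The report does not say how candidates are merged, so the rule used here (union by document ID, keeping each document's best score) is an assumption, and `merge_candidates` and the document IDs are hypothetical.

```python
from typing import Dict, List, Tuple

# (doc_id, similarity score); higher scores are assumed to be better.
Candidate = Tuple[str, float]


def merge_candidates(*result_lists: List[Candidate]) -> List[Candidate]:
    """Union several ranked lists, keeping each document's best score."""
    best: Dict[str, float] = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Sort by score descending, then by doc_id so ties order reproducibly.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Made-up hits for the original question and for its rewritten form.
original_hits = [("visit-17", 0.82), ("doctor-3", 0.74)]
rewritten_hits = [("doctor-3", 0.91), ("rollup-appointments", 0.88)]
merged = merge_candidates(original_hits, rewritten_hits)
```

Keeping the union means a document found only by the rewritten query still reaches the reranker, which is the point of searching with both versions.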

03 · SEARCH

Vector search over small children

Only compact, identity-anchored documents are embedded:

per-visit chunks · doctor / dept docs · rollup aggregates
04 · RERANK

Rerank candidates

A local reranker reorders the merged candidates (BGE cross-encoder or Jina listwise).
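The rerank step can be sketched as a function that takes any query-document scorer. The toy `token_overlap` scorer below is only a stand-in so the example runs without a model; it is not the BGE or Jina reranker, and every name here is an assumption made for illustration.

```python
from typing import Callable, List, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[str]:
    """Reorder candidates by score_fn(query, text), highest first."""
    scored = [(score_fn(query, text), idx, text) for idx, text in enumerate(candidates)]
    # Ties keep their original retrieval order via the index.
    scored.sort(key=lambda row: (-row[0], row[1]))
    return [text for _, _, text in scored[:top_k]]


def token_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


# Made-up candidate texts.
candidates = [
    "Visit note: patient reported mild headache",
    "Rollup: appointment load per doctor",
    "Doctor profile: cardiology department",
]
top = rerank("doctor appointment load", candidates, token_overlap, top_k=2)
```

In a real pipeline `score_fn` would wrap the local reranker model, and the rest of the function would stay the same.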

05 · EXPAND

Small-to-Big expansion

Winning children swap for their full parent patient record; rollups pass through verbatim with exact numbers.
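The expansion step can be sketched as follows. The `Child` fields, the `kind` labels, and the rule that documents without a parent pass through unchanged are assumptions for illustration; the report only states that children swap for parent patient records and that rollups pass through verbatim.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Child:
    doc_id: str
    text: str
    kind: str                        # hypothetical labels: "visit", "doctor", "dept", "rollup"
    parent_id: Optional[str] = None  # patient record this chunk was cut from, if any


def expand_to_parents(ranked: List[Child], parents: Dict[str, str]) -> List[str]:
    """Swap each winning child for its full parent record, once per parent."""
    context: List[str] = []
    seen: Set[Tuple[str, str]] = set()
    for child in ranked:
        if child.kind == "rollup" or child.parent_id is None:
            # Rollups (and parentless docs) pass through verbatim.
            key, text = ("doc", child.doc_id), child.text
        else:
            key = ("parent", child.parent_id)
            # Fall back to the child text if the parent record is missing.
            text = parents.get(child.parent_id, child.text)
        if key not in seen:  # two chunks of one patient yield one record
            seen.add(key)
            context.append(text)
    return context


# Made-up records for illustration.
parents = {"patient-1": "FULL RECORD patient-1: visits on 2024-01-03 and 2024-02-10"}
ranked = [
    Child("visit-1a", "visit 2024-01-03", "visit", "patient-1"),
    Child("rollup-load", "Dr. Adams has 12 appointments", "rollup"),
    Child("visit-1b", "visit 2024-02-10", "visit", "patient-1"),
]
context = expand_to_parents(ranked, parents)
```

Deduplicating by parent keeps the prompt short when several chunks of the same patient record rank highly.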

06 · ANSWER

Generate the answer

The chat model answers from complete parent context instead of scattered chunks.

Rollup documents exist because aggregate answers (“how many”, “most”, “total”) appear in no single row — they are precomputed at indexing time as counts, rankings, and totals, so retrieval can find them as text.
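A minimal sketch of building one such rollup at indexing time, assuming appointment rows are available as dictionaries with a `doctor` field. The schema, function name, and output wording are hypothetical; the report does not show its rollup format.

```python
from collections import Counter
from typing import Dict, List


def build_appointment_rollup(appointments: List[Dict[str, str]]) -> str:
    """Precompute counts, a ranking, and a total as one plain-text document."""
    counts = Counter(row["doctor"] for row in appointments)
    # Rank by count descending, then by name so ties order reproducibly.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    lines = [f"Appointment load by doctor (total appointments: {len(appointments)})"]
    for rank, (doctor, n) in enumerate(ranked, start=1):
        lines.append(f"{rank}. {doctor} has {n} appointment(s)")
    return "\n".join(lines)


# Made-up rows for illustration.
rows = [
    {"doctor": "Dr. Adams", "patient": "p1"},
    {"doctor": "Dr. Baker", "patient": "p2"},
    {"doctor": "Dr. Adams", "patient": "p3"},
]
rollup_text = build_appointment_rollup(rows)
```

Because the result is ordinary text, it can be embedded and retrieved like any other small document, and the exact numbers reach the answer step unchanged.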

Reading the numbers

Every configuration, side by side

Filter, sort, and expand any run for its difficulty and category breakdown. Cells shade light→dark on each metric's own scale; the outlined cell is the best in view.

Leaderboard

Compare on one metric

Sorted best → worst

Run sets & fixed models

What each archived run changed, and the models it held constant