Model choice helped. Changing what the system retrieves helped more.
Eight runs, two jumps
Each point is the best configuration in that run (the one with the highest overall answer score). The two marked interventions changed what was retrieved, not which model ranked it, and they account for most of the gain.
Answer quality by question category
Judge scores on a 1–5 scale, per question category, at four milestones. Holistic (whole-corpus) questions did not move until precomputed rollup documents made their answers retrievable; numerical questions needed rollups to score above 4. Each category holds only 3–5 questions, so treat swings as directional rather than precise.
How the final pipeline answers a question
Small documents are precise to search; large documents are complete to answer from. The pipeline searches one and answers from the other, and precomputed aggregates fill the gap neither can cover.
A clinic-database question
“Which doctor has the largest appointment load?” — a fact, a join, or an aggregate.
Query rewriting
An LLM rewrites the question into a search query; retrieval runs with both versions and merges candidates.
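A minimal sketch of the dual-query step, under stated assumptions: the rewrite and search functions are passed in as callables, and candidates are merged by keeping each document's best score. The merge rule, the names `retrieve_with_rewrite`, `toy_rewrite`, and `toy_search`, and all scores are illustrative, not the pipeline's actual code.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, similarity score; higher is better)


def retrieve_with_rewrite(
    question: str,
    rewrite: Callable[[str], str],
    search: Callable[[str, int], List[Hit]],
    k: int = 5,
) -> List[Hit]:
    """Search with the original and the rewritten question, then merge."""
    queries = [question]
    rewritten = rewrite(question)
    if rewritten != question:
        queries.append(rewritten)

    best: Dict[str, float] = {}
    for query in queries:
        for doc_id, score in search(query, k):
            # Keep each document once, with its best score across queries
            # (an assumed merge rule; the source does not specify one).
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score

    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Toy stand-ins so the sketch runs without an LLM or a vector index.
def toy_rewrite(question: str) -> str:
    return "doctor appointment count ranking"


def toy_search(query: str, k: int) -> List[Hit]:
    index = {
        "Which doctor has the largest appointment load?": [("doc-a", 0.62), ("doc-b", 0.55)],
        "doctor appointment count ranking": [("doc-b", 0.81), ("doc-c", 0.70)],
    }
    return index.get(query, [])[:k]


merged = retrieve_with_rewrite(
    "Which doctor has the largest appointment load?", toy_rewrite, toy_search
)
```

Merging before reranking means a document found by either phrasing of the question stays in the candidate pool.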
Vector search over small children
Only compact, identity-anchored documents are embedded:
Rerank candidates
A local reranker reorders the merged candidates (BGE cross-encoder or Jina listwise).
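The reranking step can be sketched with a pluggable pair scorer. Everything here is an assumption made for illustration: `rerank` and `toy_overlap_score` are hypothetical names, and the token-overlap scorer is only a stand-in for a real cross-encoder or listwise model.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (document id, document text)


def rerank(
    query: str,
    candidates: List[Candidate],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Candidate]:
    """Score each (query, document) pair and keep the top_n documents."""
    scored = [(score_pair(query, text), doc_id, text) for doc_id, text in candidates]
    # Highest score first; ties broken by id for a deterministic order.
    scored.sort(key=lambda row: (-row[0], row[1]))
    return [(doc_id, text) for _, doc_id, text in scored[:top_n]]


def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


candidates = [
    ("c1", "patient allergy list"),
    ("c2", "appointment load per doctor"),
    ("c3", "doctor contact details"),
]
top = rerank("doctor appointment load", candidates, toy_overlap_score, top_n=2)
```

Swapping the scorer for a model-backed one would leave the rest of the function unchanged.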
Small-to-Big expansion
Top-ranked children are swapped for their full parent patient records; rollups pass through verbatim, keeping their exact numbers.
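A minimal sketch of the expansion step, assuming each child carries a `parent_id` and that parent records are available in a dictionary keyed by that id. The `Doc` class, the `kind` labels, and the sample texts are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    doc_id: str
    text: str
    kind: str  # "child" or "rollup" (labels assumed for this sketch)
    parent_id: Optional[str] = None


def expand_small_to_big(ranked: List[Doc], parents: Dict[str, str]) -> List[str]:
    """Replace ranked children with their parent text; pass rollups through."""
    context: List[str] = []
    seen = set()
    for doc in ranked:
        if doc.kind == "child" and doc.parent_id in parents:
            key, text = doc.parent_id, parents[doc.parent_id]
        else:
            # Rollups (and children with no known parent) are used verbatim.
            key, text = doc.doc_id, doc.text
        if key not in seen:  # several children may share one parent
            seen.add(key)
            context.append(text)
    return context


parents = {"patient-17": "Full record for patient-17 (all visits and notes)"}
ranked = [
    Doc("c1", "patient-17 appointment summary", "child", "patient-17"),
    Doc("r1", "Appointments per doctor (precomputed)", "rollup"),
    Doc("c2", "patient-17 medication summary", "child", "patient-17"),
]
context = expand_small_to_big(ranked, parents)
```

Deduplicating by parent keeps the context from repeating the same record when two of its children both rank highly.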
Generate the answer
The chat model answers from complete parent context instead of scattered chunks.
Rollup documents exist because aggregate answers (“how many”, “most”, “total”) appear in no single row — they are precomputed at indexing time as counts, rankings, and totals, so retrieval can find them as text.
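As an illustration of the idea, the sketch below precomputes one rollup for the clinic example: it counts appointments per doctor and writes the ranking out as plain text that can be indexed like any other document. The row shape, the doctor names, and the function name `build_appointment_rollup` are invented for the example.

```python
from collections import Counter
from typing import Dict, List


def build_appointment_rollup(appointments: List[Dict[str, str]]) -> str:
    """Turn appointment rows into a text document of per-doctor counts."""
    counts = Counter(row["doctor"] for row in appointments)
    # Most appointments first; ties broken alphabetically.
    ranking = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    lines = [f"Appointment load by doctor ({len(appointments)} appointments total):"]
    for rank, (doctor, n) in enumerate(ranking, start=1):
        lines.append(f"{rank}. {doctor}: {n}")
    return "\n".join(lines)


# Invented sample rows; a real index would read these from the database.
appointments = [
    {"doctor": "Dr. Adams"},
    {"doctor": "Dr. Baker"},
    {"doctor": "Dr. Adams"},
    {"doctor": "Dr. Chen"},
    {"doctor": "Dr. Adams"},
    {"doctor": "Dr. Chen"},
]
rollup_text = build_appointment_rollup(appointments)
```

Because the counts are computed once at indexing time, the answer to a "most" or "how many" question exists as retrievable text, and the chat model does not have to infer it from partial rows.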
Every configuration, side by side
Filter, sort, and expand any run for its difficulty and category breakdown. Cells shade light→dark on each metric's own scale; the outlined cell is the best in view.