Lesson 2 · Finding what matters

Finding what matters — retrieve, don't stuff

A folio of the *Littera Florentina*, the sixth-century manuscript in Latin uncial script that is the oldest surviving copy of Justinian's *Digest* of Roman law. Confronted with some two thousand books of the classical jurists (over three million lines, far more law than any court could hold), Tribonian's commission read the whole and kept roughly a twentieth: about 150,000 lines, excerpted, reordered into fifty books, and made citable. The *Digest* was promulgated in 533 AD. RAG rebuilds the same machine with vectors instead of scribes, and meets the same two failures as the compilers, who silently altered what they quoted: keeping the wrong passage, or losing the one that mattered. Source: Wikimedia Commons · public domain.

In Lesson 1 you watched a case bundle overflow the context window: not everything fits, and even what does fit is read unevenly, with the middle used least. The obvious response is to stop trying to cram the whole bundle in. Instead, you ask a question and pull back only the few passages that answer it. That is retrieval — the engine room of what gets called RAG (retrieval-augmented generation) — and the cell below is a working, if humble, version of it: the same fourteen-document file as Lesson 1, scored against your query, with only the top few handed back.

Now do the thing this course is built on: change the boxes and watch the ranking move. Ask “who owed a duty of care?” and the duty pleading rises cleanly to the top. Then ask “what was the settlement offer?” and look carefully: the actual without-prejudice settlement letter lands only third, behind two documents that merely share common words. Ask “is there a limitation on liability?” and the real limitation-of-liability clause does not crack the top three at all. Push k up to cast a wider net, or down to 1 to demand the single best passage. The bundle is fixed; only what you extract from it changes and, as those last two questions show, what you extract is sometimes the wrong thing. This is the chunk-and-retrieve skill you bank from this lesson: a way to feed the model the relevant slice of a file too big to load whole, together with a healthy distrust of what the retriever hands back.
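One way to act on that distrust is to check where a document you already know to be relevant actually lands. The sketch below is illustrative only: `rank_of` is a hypothetical helper, not part of the lesson cell, and the small `scored` list is made-up data in the same `(score, index)` shape the cell builds.

```python
def rank_of(doc_index, scored):
    """Return the 1-based rank of doc_index in a best-first list of
    (score, index) pairs, or None if the document is not in the list."""
    for rank, (_, i) in enumerate(scored, start=1):
        if i == doc_index:
            return rank
    return None


# Made-up scores, for illustration only (not the lesson's bundle).
scored = [(0.12, 0), (0.31, 4), (0.27, 7), (0.22, 11), (0.05, 2)]
scored.sort(reverse=True)  # best-first, as in the lesson cell

known_relevant = 11  # a document we have read and know answers the question
print("doc #11 is ranked", rank_of(known_relevant, scored))

# Would the known-relevant document have been retrieved at each k?
for k in (1, 3, 5):
    top_k = [i for _, i in scored[:k]]
    print("k =", k, "retrieved:", known_relevant in top_k)
```

With these made-up scores the known-relevant document sits third, so k = 1 would miss it and k = 3 would only just catch it. The check only works for documents you already know about, which is its limit as a safeguard.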

But notice what the cell's closing note warns and the closing question presses. The anchoring work here is Reuter et al., “Towards Reliable Retrieval in RAG Systems for Large Legal Datasets” (arXiv:2510.06999, 2025). It gives the “retrieved the wrong document” failure a name, Document-Level Retrieval Mismatch: in a realistic legal corpus, standard RAG retrieved the wrong document at rates above 95%, and a summary-augmented chunking strategy roughly halved that. Our toy scorer is word-overlap, which legal boilerplate fools easily; real systems use embeddings and still hit the same wall. Treat this as emerging, single-study evidence, not settled law, but take the caution seriously. Note too that the strategy improved the retrieval step only; the study did not show that the final answer was more truthful. That is exactly the seam Lesson 3 works: once you have retrieved the right passages, you still have to summarise them without inventing authorities the documents never contained.
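The general idea of summary-augmented chunking can be sketched in a few lines: split each document into chunks, then prepend a short document-level summary to every chunk before it is scored, so that each chunk carries some context about where it came from. This is a minimal sketch under stated assumptions, not the paper's implementation. The names `chunk`, `augment`, and `mismatch_rate` are hypothetical, the summaries are written by hand rather than generated, chunks are cut by fixed word count, and the mismatch rate is taken here to be simply the share of retrieved chunks that come from a document other than the intended one.

```python
def chunk(text, size=8):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def augment(docs, summaries, size=8):
    """Return (doc_id, chunk_text) pairs with the document's summary
    prepended to every chunk. `summaries` maps doc_id to a summary string."""
    out = []
    for doc_id, text in docs.items():
        for piece in chunk(text, size):
            out.append((doc_id, summaries[doc_id] + " | " + piece))
    return out


def mismatch_rate(retrieved_doc_ids, intended_doc_id):
    """Share of retrieved chunks drawn from a document other than the intended one."""
    if not retrieved_doc_ids:
        return 0.0
    wrong = sum(1 for d in retrieved_doc_ids if d != intended_doc_id)
    return wrong / len(retrieved_doc_ids)


# Made-up documents and hand-written summaries, for illustration only.
docs = {
    "lease": "The tenant shall pay rent monthly. Liability is limited to the deposit.",
    "nda": "The recipient shall keep information confidential. Liability is limited to fees paid.",
}
summaries = {
    "lease": "Summary: commercial lease between landlord and tenant.",
    "nda": "Summary: non-disclosure agreement between discloser and recipient.",
}

for doc_id, text in augment(docs, summaries, size=6):
    print(doc_id, "->", text)

# Two of three retrieved chunks came from the wrong document.
print(mismatch_rate(["nda", "lease", "nda"], "lease"))
```

Both sample documents contain a near-identical liability sentence, which is the kind of boilerplate that makes bare chunks hard to tell apart. With the summary prepended, the two chunks no longer look the same to a scorer.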

✏️ your question

✏️ how many passages to retrieve

the code · Python

# You don't need to change the code — just edit the boxes above and press Run.
# (It runs on its own the first time, so you can see what it does straight away.)

# In Lesson 1 you watched the bundle overflow the window: not all 14 documents fit, and
# even the ones that did were read unevenly. RETRIEVAL is the answer to that. Instead of
# stuffing the whole bundle in and hoping, you ask a QUESTION and pull back only the few
# documents that actually answer it. That is what this cell does.

# `query` is your question, from the first box. `k` is how many passages to bring back.
# `bundle` is the same 14-document case file as Lesson 1; `score(query, doc)` (built behind
# the scenes) rates how well a single document answers the query — higher means more relevant.

# Score EVERY document in the bundle against your question, keeping its index alongside:
scored = [(score(query, doc), i) for i, doc in enumerate(bundle)]

# Sort best-first, so the most relevant document is at the top:
scored.sort(reverse=True)

# Show what we asked, and how many of the 14 we're choosing to retrieve:
print("query:", query)
print("retrieving the top", k, "of", len(bundle), "documents (not loading all 14)")
print()

# Print the top `k` results: the document's index, a short preview, and its relevance score.
# THIS is the payoff — a handful of passages, not the whole bundle, fed to the model:
for rank, (s, i) in enumerate(scored[:k], start=1):
    preview = bundle[i][:50].replace("\n", " ")
    print(str(rank) + ".  doc #" + str(i), " score=" + str(round(s, 3)))
    print("     " + preview + "...")

# Now do the thing this course is built on: CHANGE THE BOXES and press Run. Ask a different
# question ("who owed a duty of care?", "what was the settlement offer?") and watch which
# documents rise to the top. Raise `k` to cast a wider net; drop it to 1 to demand the single
# best passage. The bundle never changes — only what you pull out of it does.

# One honesty note, because it is the whole point of the next lesson. This is simple word-
# overlap scoring, and legal documents share a LOT of boilerplate ("the Claimant", "in breach
# of", "save to the extent"). So a document that merely shares phrasing with your question can
# outrank the one that truly answers it — retrieval can hand you the WRONG document, confidently.
print()
print("note: word-overlap can rank shared legal boilerplate over the real answer — real systems")
print("use embeddings and STILL hit this 'retrieved the wrong document' failure (next lesson).")

Expected output

query: what damages is the claimant claiming?
retrieving the top 3 of 14 documents (not loading all 14)

1.  doc #9  score=0.287
     By reason of the matters aforesaid the Claimant ha...
2.  doc #2  score=0.182
     (2) A term which has not been individually negotia...
3.  doc #3  score=0.178
     In my judgment the test laid down in Caparo is not...

note: word-overlap can rank shared legal boilerplate over the real answer — real systems
use embeddings and STILL hit this 'retrieved the wrong document' failure (next lesson).
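
The `score(query, doc)` function is built behind the scenes, so its exact formula is not shown here. For intuition only, the sketch below uses Jaccard overlap on lowercase word sets as a stand-in. The name `overlap_score` and the three sample passages are assumptions, and its numbers will not match the output above.

```python
import re


def overlap_score(query, doc):
    """Jaccard overlap between the word sets of query and doc (0.0 to 1.0)."""
    q = set(re.findall(r"[a-z']+", query.lower()))
    d = set(re.findall(r"[a-z']+", doc.lower()))
    union = q | d
    if not union:
        return 0.0
    return len(q & d) / len(union)


# Made-up passages, for illustration only (not the lesson's bundle).
sample = [
    "The Supplier's total exposure under this agreement is capped at the fees paid.",
    "There is no admission of liability on the part of the Defendant.",
    "The Claimant claims damages for breach of contract.",
]
question = "is there a limitation on liability?"

# Same pattern as the lesson cell: score every passage, then sort best-first.
ranked = sorted(
    ((overlap_score(question, doc), i) for i, doc in enumerate(sample)),
    reverse=True,
)
for s, i in ranked:
    print("doc #" + str(i), "score=" + str(round(s, 3)))
```

Passage #0 is the one that actually caps liability, but it never uses the words "limitation" or "liability", so passage #1, which merely mentions liability in passing, ranks first. That is the boilerplate failure the note describes, reproduced in miniature.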

Think

Retrieval makes a silent editorial choice on your behalf: you ask one question, and a scoring function, not you, decides which handful of the file the model is even allowed to see. If the passage that exonerates your client shares less vocabulary with your query than a boilerplate clause that merely sounds on-point, it is never retrieved and never read. In a matter where being wrong has consequences, what is your professional obligation to test what the retriever left out, and how would you discharge it, given that you cannot read what you were never shown?