Lesson 2 · Finding what matters
Finding what matters — retrieve, don't stuff
In Lesson 1 you watched a case bundle overflow the context window: not everything fits, and even what does fit is read unevenly, with the middle used least. The obvious response is to stop trying to cram the whole bundle in. Instead, you ask a question and pull back only the few passages that answer it. That is retrieval — the engine room of what gets called RAG (retrieval-augmented generation) — and the cell below is a working, if humble, version of it: the same fourteen-document file as Lesson 1, scored against your query, with only the top few handed back.
Now do the thing this course is built on: change the boxes and watch the ranking move. Ask "who owed a duty of care?" and the duty pleading rises cleanly to the top. Then ask "what was the settlement offer?" and look carefully: the actual without-prejudice settlement letter lands only third, behind two documents that merely share common words. Ask "is there a limitation on liability?" and the real limitation-of-liability clause does not crack the top three at all. Push k up to cast a wider net, or down to 1 to demand the single best passage. The bundle is fixed; only what you extract from it changes, and, as those last two questions show, what you extract is sometimes the wrong thing. This is the chunk-and-retrieve skill you bank from this lesson: a way to feed the model the relevant slice of a file too big to load whole, together with a healthy distrust of what comes back.
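The cell in this lesson scores whole documents, but the same idea applies to pieces of one long document. Here is a minimal sketch of the chunking half, assuming fixed-size word windows with a small overlap. The name `chunk_words`, the window size and the overlap are illustrative choices, not the course's own code.

```python
def chunk_words(text, size=40, overlap=10):
    """Split text into windows of `size` words; neighbouring windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Hypothetical long document: 100 numbered words stand in for a real pleading.
long_doc = " ".join("w" + str(n) for n in range(100))
chunks = chunk_words(long_doc, size=40, overlap=10)

print(len(chunks), "chunks")
for c in chunks:
    words = c.split()
    print(words[0], "...", words[-1], "(" + str(len(words)) + " words)")
```

Each chunk can then be scored against the query exactly as a whole document is. The overlap is there so that a sentence straddling a boundary appears intact in at least one chunk.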
But notice what the cell's closing note warns and the closing question presses. The anchoring work here is Reuter et al., “Towards Reliable Retrieval in RAG Systems for Large Legal Datasets” (arXiv:2510.06999, 2025). In a realistic legal corpus, standard RAG retrieved the wrong document at rates above 95%, and a summary-augmented chunking strategy roughly halved that rate. This is Document-Level Retrieval Mismatch, the “retrieved the wrong document” failure. Our toy scorer is word-overlap, which legal boilerplate fools easily; real systems use embeddings and still hit the same wall. Treat this as emerging, single-study evidence, not settled law. Take the caution seriously all the same, and note the limit of the fix: the chunking strategy improved the retrieval step only and did not prove that the final answer was more truthful. That is exactly the seam Lesson 3 works: once you have retrieved the right passages, you still have to summarise them without inventing authorities the documents never contained.
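The lesson's `score` function is built behind the scenes, so its exact formula is not shown here. As a rough illustration of how word-overlap scoring can be fooled, the sketch below uses one common measure, Jaccard similarity over lowercased word sets. The function `overlap_score` and the three toy documents are hypothetical and are not taken from the course bundle.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of distinct words (digits are ignored)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def overlap_score(query, doc):
    """Jaccard similarity: shared words divided by all distinct words in either text."""
    q, d = tokens(query), tokens(doc)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)

# Hypothetical three-document bundle (not the course's fourteen-document file).
toy_bundle = [
    "We offer 40,000 pounds in full and final satisfaction, without prejudice.",   # the real answer
    "What the Claimant was told was that the offer of employment was withdrawn.",  # shares common words
    "The lease commenced on 1 March.",
]
query = "what was the settlement offer?"

# Rank document indices best-first by their overlap with the query.
ranked = sorted(range(len(toy_bundle)),
                key=lambda i: overlap_score(query, toy_bundle[i]),
                reverse=True)

for rank, i in enumerate(ranked, start=1):
    print(str(rank) + ". doc #" + str(i), "score=" + str(round(overlap_score(query, toy_bundle[i]), 3)))
```

In this toy bundle the settlement letter (doc #0) comes last, because it never uses the word "settlement", while doc #1 wins on "what", "was", "the" and "offer" alone.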
Show the code and expected output
# You don't need to change the code — just edit the boxes above and press Run.
# (It runs on its own the first time, so you can see what it does straight away.)
# In Lesson 1 you watched the bundle overflow the window: not all 14 documents fit, and
# even the ones that did were read unevenly. RETRIEVAL is the answer to that. Instead of
# stuffing the whole bundle in and hoping, you ask a QUESTION and pull back only the few
# documents that actually answer it. That is what this cell does.
# `query` is your question, from the first box. `k` is how many passages to bring back.
# `bundle` is the same 14-document case file as Lesson 1; `score(query, doc)` (built behind
# the scenes) rates how well a single document answers the query — higher means more relevant.
# Score EVERY document in the bundle against your question, keeping its index alongside:
scored = [(score(query, doc), i) for i, doc in enumerate(bundle)]
# Sort best-first, so the most relevant document is at the top:
scored.sort(reverse=True)
# Show what we asked, and how many of the 14 we're choosing to retrieve:
print("query:", query)
print("retrieving the top", k, "of", len(bundle), "documents (not loading all 14)")
print()
# Print the top `k` results: the document's index, a short preview, and its relevance score.
# THIS is the payoff — a handful of passages, not the whole bundle, fed to the model:
for rank, (s, i) in enumerate(scored[:k], start=1):
    preview = bundle[i][:50].replace("\n", " ")
    print(str(rank) + ". doc #" + str(i), " score=" + str(round(s, 3)))
    print(" " + preview + "...")
# Now do the thing this course is built on: CHANGE THE BOXES and press Run. Ask a different
# question ("who owed a duty of care?", "what was the settlement offer?") and watch which
# documents rise to the top. Raise `k` to cast a wider net; drop it to 1 to demand the single
# best passage. The bundle never changes — only what you pull out of it does.
# One honesty note, because it is the whole point of the next lesson. This is simple word-
# overlap scoring, and legal documents share a LOT of boilerplate ("the Claimant", "in breach
# of", "save to the extent"). So a document that merely shares phrasing with your question can
# outrank the one that truly answers it — retrieval can hand you the WRONG document, confidently.
print()
print("note: word-overlap can rank shared legal boilerplate over the real answer — real systems")
print("use embeddings and STILL hit this 'retrieved the wrong document' failure (next lesson).")
query: what damages is the claimant claiming?
retrieving the top 3 of 14 documents (not loading all 14)
1. doc #9 score=0.287
By reason of the matters aforesaid the Claimant ha...
2. doc #2 score=0.182
(2) A term which has not been individually negotia...
3. doc #3 score=0.178
In my judgment the test laid down in Caparo is not...
note: word-overlap can rank shared legal boilerplate over the real answer — real systems
use embeddings and STILL hit this 'retrieved the wrong document' failure (next lesson).
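The cell prints only the top `k`. One partial way to act on the lesson's distrust is to look below the cut as well. The sketch below assumes the same `(score, index)` pairs the cell builds. The scores are made-up numbers, and `split_at_k` is a hypothetical helper that is not part of the lesson's code.

```python
def split_at_k(scored, k):
    """Return (retrieved, left_out) index lists from (score, index) pairs, best-first."""
    ordered = sorted(scored, reverse=True)
    retrieved = [i for _, i in ordered[:k]]
    left_out = [i for _, i in ordered[k:]]
    return retrieved, left_out

# Made-up scores for a five-document bundle, in the cell's (score, index) shape.
scored = [(0.10, 0), (0.29, 1), (0.18, 2), (0.05, 3), (0.17, 4)]
retrieved, left_out = split_at_k(scored, k=3)

print("retrieved:", retrieved)
print("never shown to the model:", left_out)
```

Listing the excluded documents does not tell you whether any of them mattered, but it does show what you would need to check by hand, and how narrow the margin at the cut was.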
Think
Retrieval makes a silent editorial choice on your behalf: you ask one question, and a scoring function, not you, decides which handful of the file the model is even allowed to see. If the passage that exonerates your client shares less vocabulary with your query than a boilerplate clause that merely sounds on-point, it is never retrieved and never read. In a matter where being wrong has consequences, what is your professional obligation to test what the retriever left out, and how would you discharge it, given that you cannot read what you were never shown?