Lesson 4 · Working at scale

Working at scale — fan out, and verify every result

A single page of the Domesday Book manuscript: one tall column of dense medieval Latin in dark ink on parchment, with several headings and place-names picked out in red, shown in black-and-white.
A page of Great Domesday, the survey William I ordered in 1086 (The National Archives, E 31/2). Royal commissioners worked in parallel across seven circuits, each putting the same sworn questions to local juries and compiling the answers into one final record — "Domesday" because, as the king's treasurer wrote a century later, like the Last Judgement there was no appeal from it. Working at scale is exactly this: run one procedure across the whole corpus at once, and reach an external source for the facts you do not already hold. Source: Wikimedia Commons · public domain.

The Domesday survey is what working at scale looks like nine centuries before the phrase existed. Faced with a whole conquered country to assess, William’s clerks did not work through it end to end; they split it into seven circuits and ran one procedure in parallel across all of them — the same sworn questions, put to local juries who held the facts the Treasury did not. The answers came back and were compiled into a single record so authoritative that, like the Last Judgement it was nicknamed for, there was no appeal from it. Two moves make that possible, and they are exactly the two this lesson is about: run one routine across the entire corpus at once, and reach out to an external source for the facts you do not already hold.

In our terms, the first move is fan-out: taking the grounding check you banked in Lesson 3 and applying it automatically to every document in a matter, because once the pile is hundreds deep you can no longer read each output the way you read one. The second is the connector: a tool the model calls to fetch something it cannot know on its own (here, whether a cited case actually exists). On this public page every connector is mocked by design, the constraint that keeps the course backend-free and anonymous: the lookup returns a pre-baked verdict rather than calling a live database, and on an internal build that single function is all that would change.

But scale also introduces a failure that is simply not visible at small size: error accumulates along a chain of tool-calls. The anchor here is Fan et al., “Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol” (arXiv:2602.13320, 2026), which finds that cumulative distortion “exhibits linear growth and high-probability deviations bounded by O(√T)”, with T the length of the tool-call chain. That bound rules out exponential failure modes, which is the reassuring half. The paper also finds that “periodic re-grounding approximately every 9 steps suffices for error control.”

Treat those numbers as emerging, not settled: it is a single 2026 preprint on deterministic tools and single-agent chains, and the eye-catching “multi-agent reliability multipliers” you may have seen quoted elsewhere are not verifiable in this course’s sources; they reach us only second-hand through a marketing blog. The discipline survives the caveats even if the constants move.
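As a rough illustration of the mocked connector described above, the sketch below hides the lookup behind a single function, so that switching to a live source would mean passing in a different function and nothing else. Everything in it is hypothetical: the table `PREBAKED_VERDICTS` and the functions `mock_lookup` and `check_citations` are names invented for this example, not the page's actual implementation. The two citations and the verdict strings "valid" and "not found" are borrowed from this lesson's cell and its sample output.

```python
from typing import Callable, Dict, Iterable

# Hypothetical pre-baked verdicts: a stand-in for a real case-law database.
PREBAKED_VERDICTS = {
    "Caparo Industries plc v Dickman [1990] UKHL 2": "valid",
}


def mock_lookup(cite: str) -> str:
    """Mocked connector: answer from the pre-baked table, never from a network."""
    return PREBAKED_VERDICTS.get(cite, "not found")


def check_citations(cites: Iterable[str], lookup: Callable[[str], str]) -> Dict[str, str]:
    """Fan one lookup across every citation; the caller decides which connector to pass in."""
    return {cite: lookup(cite) for cite in cites}


results = check_citations(
    [
        "Caparo Industries plc v Dickman [1990] UKHL 2",
        "Harrington v Bellwether [2019] EWHC 244",
    ],
    lookup=mock_lookup,
)
for cite, verdict in results.items():
    print(verdict.ljust(10), cite)
```

Because `check_citations` only ever sees a function from citation to verdict, the fan-out logic is the same whether that function reads a dictionary or queries a database.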

So the cell below does both moves on pre-baked data and lets you feel where each one breaks. On load it fans the grounding check across the whole ten-document batch in one pass and flags the three that smuggle in a confident citation to a case found in none of the sources. That is caught automatically, exactly the work you would never finish by hand, while the mocked connector confirms the one genuine authority (Caparo), so you can see it is a checker and not merely a rejector. Then it runs the verify chain across a sixty-document matter with re-grounding switched off, and the peak distortion sails over the safe bar: at that length the chain drifts off its sources before it finishes.

Now do the thing this course is built on: change the boxes. Push `documents` higher and watch the distortion grow like √T: slowly, never explosively. Then set `reground_every` to 9 and watch it fall back under the bar and stay there however large the matter gets.

That is the skill you bank from this lesson, Batch-and-verify: never fan a generation step across a corpus without fanning a verification step across it too, and re-ground against the source of truth often enough that the drift never compounds. It carries straight into Lesson 5, where the thing being run at scale is no longer a check but a judgment, and where one verifier, however tireless, turns out not to be enough.
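To make the shape of that experiment concrete, here is a minimal sketch of a distortion model with the behaviour just described. It is an assumption, not the page's own `peak_distortion` and not the model in Fan et al. It supposes that distortion grows as `DRIFT * sqrt(steps since the last re-grounding)` and that each re-grounding resets it to zero. The constant `DRIFT = 0.08` is a guess, chosen only because it is consistent with the sample output for this lesson (a peak of 1.07 at 180 tool-calls). The bar of 0.35 is the one the cell prints, and `peak_distortion_sketch` is a hypothetical name.

```python
import math

DRIFT = 0.08      # assumed drift constant (an illustrative guess, see the note above)
SAFE_BAR = 0.35   # the safe bar printed by the lesson's cell


def peak_distortion_sketch(total_steps: int, reground_every: int) -> float:
    """Peak distortion under the assumed sqrt-growth model with resets at each re-grounding."""
    if total_steps <= 0:
        return 0.0
    # The peak is reached at the end of the longest uninterrupted run of tool-calls.
    if reground_every and reground_every > 0:
        longest_run = min(total_steps, reground_every)
    else:
        longest_run = total_steps  # never re-grounded: the run is the whole chain
    return DRIFT * math.sqrt(longest_run)


for documents in (6, 60, 600):
    steps = documents * 3  # three tool-calls per document, as in the cell
    never = peak_distortion_sketch(steps, 0)
    every_9 = peak_distortion_sketch(steps, 9)
    print(
        str(documents) + " documents: "
        + str(round(never, 2)) + " with no re-grounding, "
        + str(round(every_9, 2)) + " re-grounding every 9 steps"
    )
```

Under these assumptions the peak with re-grounding every 9 steps is 0.08 × √9 = 0.24 whatever the size of the matter, which is why it stays under the 0.35 bar, while the un-grounded peak keeps climbing with the square root of the chain length.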

✏️ `documents`: how many documents in the matter
✏️ `reground_every`: re-ground every N tool-call steps (0 = never)
# You don't need to change the code — just edit the boxes above and press Run.
# (It runs on its own the first time, so you can see what it does straight away.)

# Lesson 3 checked ONE claim against the sources, by hand. That does not scale: a real matter is
# hundreds of documents, and you cannot eyeball hundreds of model outputs. Working at scale is two
# new moves. (1) FAN OUT — run the same verify step AUTOMATICALLY across the whole batch at once.
# (2) Reach an external CONNECTOR for facts you do not hold yourself (here: is a cited case real?).
# Each move brings a failure that only shows up at scale. This cell shows both, on pre-baked data.

# `documents` (first box) is how big the matter is. `reground_every` (second box) is how often the
# agent stops and re-checks itself against the source of truth, counted in tool-call steps (0 = never).

# ---- Move 1: FAN OUT the Lesson-3 grounding check across the whole batch, in a single pass -------
flagged = []
for i, item in enumerate(batch):
    grounded = is_grounded(item["text"])                 # the Lesson-3 check, reused unchanged
    cite = item["cite"]
    status = lookup_citation(cite) if cite else "none"   # the MOCKED connector
    if (not grounded) or status == "not found":
        flagged.append((i, cite, status))

print("fanned out the grounding check across " + str(len(batch)) + " documents in ONE pass")
print("flagged " + str(len(flagged)) + " possible fabricated authorities — caught automatically, not by hand:")
for i, cite, status in flagged:
    shown = cite if cite else "(unsupported by any source; no citation given)"
    print("   doc #" + str(i) + ":  " + shown + "   [connector: " + status + "]")

# the connector is not just a rejector — show the genuine authority it positively confirmed:
confirmed = [it["cite"] for it in batch if it["cite"] and lookup_citation(it["cite"]) == "valid"]
print("for contrast, the connector CONFIRMED these as genuine: " + ", ".join(confirmed))
print()

# ---- Move 2: the catch nobody sees at small scale — distortion ACCUMULATES along the chain -------
steps_per_doc = 3                                # each doc ~ retrieve, then ground, then look up the cite
total_steps = documents * steps_per_doc
peak = peak_distortion(total_steps, reground_every)

print("now run that verify chain across the whole matter, not just these " + str(len(batch)) + " documents:")
print("   " + str(documents) + " documents x " + str(steps_per_doc) + " tool-calls each = " + str(total_steps) + " tool-calls in the chain")
if reground_every:
    print("   re-grounding the agent every " + str(reground_every) + " steps")
else:
    print("   re-grounding the agent: NEVER")
print("   peak cumulative distortion = " + str(round(peak, 2)) + "   (safe bar = " + str(SAFE_BAR) + ")")
print()

if peak <= SAFE_BAR:
    print("UNDER the bar: the chain stayed grounded across the whole batch.")
else:
    print("OVER the bar: at this scale the chain drifts off its sources before it finishes.")
    print("   -> set `reground_every` to about 9 (the figure Fan et al. report) and run again.")

# Now do the thing this course is built on: CHANGE THE BOXES. Push `documents` up and watch the peak
# distortion climb like the square root of the chain length — slowly, NOT explosively (that bound is
# the reassuring half of the result). Then set `reground_every` to 9 and watch it drop back under the
# bar and STAY there, however large the matter grows: the cure for scale is not "trust it less", it is
# "re-check against the source of truth, often".

print()
print("note: the connector is MOCKED (no live database on this page), and the sqrt(T) growth and the")
print("9-step figure are from a single 2026 preprint on deterministic, single-agent tool chains —")
print("illustrative numbers, but the discipline is the lesson: verify every item, and re-ground often.")
Expected output on load (60 documents, re-grounding off):
fanned out the grounding check across 10 documents in ONE pass
flagged 3 possible fabricated authorities — caught automatically, not by hand:
   doc #7:  Harrington v Bellwether [2019] EWHC 244   [connector: not found]
   doc #8:  Rylands v Fletcher (No 2) [2021] EWCA Civ 99   [connector: not found]
   doc #9:  Marsden v Cole [2018] QB 511   [connector: not found]
for contrast, the connector CONFIRMED these as genuine: Caparo Industries plc v Dickman [1990] UKHL 2

now run that verify chain across the whole matter, not just these 10 documents:
   60 documents x 3 tool-calls each = 180 tool-calls in the chain
   re-grounding the agent: NEVER
   peak cumulative distortion = 1.07   (safe bar = 0.35)

OVER the bar: at this scale the chain drifts off its sources before it finishes.
   -> set `reground_every` to about 9 (the figure Fan et al. report) and run again.

note: the connector is MOCKED (no live database on this page), and the sqrt(T) growth and the
9-step figure are from a single 2026 preprint on deterministic, single-agent tool chains —
illustrative numbers, but the discipline is the lesson: verify every item, and re-ground often.

Think

Fan-out changes the unit of trust: in Lesson 3 you vouched for one summary you had read; here a procedure you wrote vouches for hundreds you will never read line by line. That is a genuine gain (the machine checks every item, tirelessly, where you would have spot-checked ten), but it relocates the risk rather than removing it, because a flaw in the verify step is now applied at scale too, silently, to the whole matter. So before you run any check across a thousand documents and act on the tally: how do you assure yourself the checker itself is sound? What would you deliberately feed it to see it fail, and what is the smallest sample you would still read by hand to keep the automation honest?
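One hedged way to start on that question is to plant items whose correct verdict you already know, confirm the checker gets each of them right, and then draw a small random sample to read in full. The sketch below illustrates this under stated assumptions: `toy_checker`, `KNOWN_GOOD`, `canary_report` and `audit_sample` are hypothetical names, and the toy checker merely looks a citation up in a fixed set. It is not the lesson's `is_grounded` or `lookup_citation`, and the planted fake citation is deliberately invented.

```python
import random

# Hypothetical trusted set, standing in for whatever the real checker consults.
KNOWN_GOOD = {"Caparo Industries plc v Dickman [1990] UKHL 2"}


def toy_checker(doc: dict) -> bool:
    """Return True when the document should be flagged (it cites something not in the trusted set)."""
    cite = doc.get("cite")
    return bool(cite) and cite not in KNOWN_GOOD


def canary_report(checker, canaries):
    """Run the checker on planted items with known answers; return the ids it gets wrong."""
    return [c["id"] for c in canaries if checker(c) != c["should_flag"]]


def audit_sample(docs, k, seed=0):
    """Pick k documents at random, reproducibly, for a human to read in full."""
    rng = random.Random(seed)
    return rng.sample(docs, min(k, len(docs)))


canaries = [
    {"id": "planted-fake", "cite": "Invented v Nonexistent [2099] EWHC 1", "should_flag": True},
    {"id": "planted-real", "cite": "Caparo Industries plc v Dickman [1990] UKHL 2", "should_flag": False},
    {"id": "planted-none", "cite": None, "should_flag": False},
]
missed = canary_report(toy_checker, canaries)
print("canaries the checker got wrong:", missed)

matter = [{"id": n, "cite": None} for n in range(1000)]
to_read = audit_sample(matter, k=10)
print("read these by hand:", [d["id"] for d in to_read])
```

A non-empty `missed` list means the checker failed on a case where the answer was never in doubt, so its tally across the rest of the matter should not be acted on. How large the hand-read sample needs to be is a judgment the sketch leaves open.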