Lesson 5 · Reasoning across

Reasoning across — convene a panel, and make it disagree

The Court of King's Bench, illuminated miniature, c. 1460, the earliest known image of an English court: a row of red-robed judges seated together on a high bench, with court officers below and the accused standing at the bar. A panel of judges, not one, ranged across a single bench. This was the form English law trained into: at the Inns of Court, students argued *moots*, hypothetical cases pleaded before a bench of benchers who weighed both sides and ruled. The multi-agent moot court in this lesson rebuilds that panel for machine outputs, and meets the benchers' caution: judges who agree too quickly stop adding anything. Source: Wikimedia Commons · public domain.

The earliest image we have of an English court, from c. 1460, shows not a single arbiter but a panel: the judges of King’s Bench ranged together along one high bench. English law trained its advocates into the same shape: at the Inns of Court a student argued moots, hypothetical cases pleaded before a bench of benchers who heard both sides and ruled. The form encodes a hard-won distrust of the lone decision-maker and, with it, the benchers’ own caution, which is the sharp edge of this lesson: a panel is only worth convening if its members genuinely differ. Judges who agree too quickly have stopped adding anything; they have become one judge counted several times.

That is exactly the problem when the “judge” is a language model. A single model-judge carries inherent parametric biases, such as favouring verbosity, and suffers cognitive overload when attempting to evaluate all dimensions in a single inference step. The wording is from You et al., “A Survey on Agent-as-a-Judge” (arXiv:2601.05111, 2026), which argues that a multi-agent panel “ensures the final judgment transcends the heuristics of any single model” and notes that “ChatEval pioneered this with a courtroom-inspired discussion mechanism”: the moot, rebuilt for machine outputs.

The empirical shape comes from Li et al., “Leveraging LLMs as Meta-Judges” (arXiv:2504.17087, 2025): majority voting reached 77.3% against a single agent’s 68.9%. Crucially, though, panel discussion could underperform, because “the opinions of different agents tend to converge over time” and extra roles bring information overload. So the design rule is counter-intuitive, and it is the thing to carry away: diversity beats premature consensus, and more judges is not better.

Treat all of this as emerging. These are 2025–26 preprints on general tasks; legal multi-agent judges are barely studied (they surface only as one-line citations inside that survey, and the nearest legal system, JurisMMA, does judgment prediction, not evaluation). And on this public page every judge is mocked by design: its reliability is a number, not a live model call.

So the cell below makes both halves visible. First a fixed moot: the candidate answer is the very fabrication you have been hunting since Lesson 3 — a confident Harrington v Bellwether and a wrong three-year period — and a single judge that rewards fluent, assured prose passes it, while a diverse three-judge bench (one weighing style, one checking the citation, one checking the statute) outvotes that judge two-to-one and catches it. Then the dial: on load, a panel of nine judges at just 10% independence scores 69% — no better than one judge, because at that level they have all converged into a single view, and the nine bought you nothing. Now do the thing this course is built on: change the boxes. Push independence up and watch accuracy climb (77%, 82%, 86%) as the bench learns to disagree; then hold it fixed and push judges to ninety-nine and watch the number not move — past the diversity the panel can actually support, every extra judge is a copy. The skill you bank here is Diverse panel: when you must judge rather than check a model’s output, convene more than one judge, force them to reason differently, and take the majority — but never mistake a louder consensus for a more reliable one. That distrust of the too-easy agreement is what you carry into Lesson 6, where the loop finally closes: every correction the panel surfaces becomes a written rule, and every rule a reusable skill.
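The fixed moot can be sketched as three mock judges and a majority vote. This is a minimal illustration under stated assumptions, not the page's own code: the three judge functions, the `KNOWN_CASES` set and the `STATUTE_YEARS` value are hypothetical stand-ins, and no model is called.

```python
from collections import Counter

# Hypothetical stand-ins for real lookups (assumptions for illustration only).
KNOWN_CASES = set()   # a real citator would go here; the invented case is absent
STATUTE_YEARS = 6     # s.5 Limitation Act 1980, as quoted in the moot

candidate = {
    "text": "The limitation period is three years, per Harrington v Bellwether [2019] EWHC 244.",
    "citation": "Harrington v Bellwether [2019] EWHC 244",
    "years": 3,
}

def style_judge(c):
    # Mock of the biased judge: passes anything that sounds assured,
    # and never looks at the substance.
    hedges = {"may", "might", "possibly", "unclear"}
    words = set(c["text"].lower().split())
    return "UNSOUND" if words & hedges else "SOUND"

def citation_judge(c):
    # Checks only whether the cited case can be found.
    return "SOUND" if c["citation"] in KNOWN_CASES else "UNSOUND"

def statute_judge(c):
    # Checks only the stated period against the statute.
    return "SOUND" if c["years"] == STATUTE_YEARS else "UNSOUND"

def majority(verdicts):
    # Most common verdict; with an odd panel there is no tie.
    return Counter(verdicts).most_common(1)[0][0]

verdicts = [judge(candidate) for judge in (style_judge, citation_judge, statute_judge)]
panel_verdict = majority(verdicts)
print(verdicts, "->", panel_verdict)
```

The panel catches the fabrication only because two of its members look at something the style judge ignores; three copies of `style_judge` would return SOUND unanimously.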

✏️ how many judges sit on the panel

✏️ how independent the judges are, 0–100 (low = they converge)

the code

# You don't need to change the code — just edit the boxes above and press Run.
# (It runs on its own the first time, so you can see what it does straight away.)

# Lessons 3 and 4 CHECKED facts. This lesson JUDGES — a softer, more dangerous task, because one
# model-judge has biases (it favours fluent, confident prose) and overloads when asked to weigh every
# dimension at once. The classical answer is the one English law already used: don't trust a single
# judge, convene a PANEL, hear different views, take the majority. But a panel only helps if its
# members genuinely DIFFER — judges who agree too quickly add nothing. This cell shows both halves.

# `judges` = how many sit on the panel. `independence` (0-100) = how different their perspectives
# are; low independence means they converge on one view (groupthink) and the panel is worth ~1 judge.

# ---- A concrete moot: one biased judge vs a DIVERSE three-judge panel (fixed illustration) --------
print("MOOT: is this candidate answer sound?")
print('  candidate: "The limitation period is three years, per Harrington v Bellwether [2019] EWHC 244."')
print("  (fluent, confident — and wrong: no such case, and the real period is six years)")
print("     judge A, who rewards fluent confident prose -> SOUND (fooled by style — the single-judge bias)")
print("     judge B, who checks every citation -> UNSOUND (no such case exists)")
print("     judge C, who checks against the statute -> UNSOUND (s.5 Limitation Act 1980: six years)")
print("  majority of the 3 DIVERSE judges: UNSOUND — the panel caught what one fluent judge waved through")
print()

# ---- Now the catch: a panel's value comes from DIVERSITY, not headcount ---------------------------
acc, effective = panel_accuracy(judges, independence)
single = round(p_single * 100)
print("panel of " + str(make_odd(judges)) + " judges at " + str(independence) + "% independence")
print("   -> worth about " + str(effective) + " genuinely independent vote(s)")
print("   -> accuracy on a held-out set of rulings: " + str(round(acc * 100)) + "%   (one judge alone: " + str(single) + "%)")
if effective <= 1:
    print("   so all those judges bought nothing: same blind spot, same answer as a single judge.")
    print("   -> raise `independence` to give them genuinely different perspectives, and run again.")
else:
    print("   the majority of differing judges is more reliable than the best single judge.")
print()

# Now do the thing this course is built on: CHANGE THE BOXES. Push `independence` up and watch accuracy
# climb as the panel's perspectives diverge. Then HOLD independence fixed and push `judges` far up —
# and watch accuracy STOP moving: once you have as many independent views as the diversity allows,
# extra judges are extra copies, not extra information. More judges is NOT more reliable; more
# DIVERSITY is. (It is why a moot pairs opposing counsel — not five judges who already agree.)

print("note: this is an ILLUSTRATIVE model, not a measured legal benchmark. The single/panel split echoes")
print("Li et al. (~69% one judge vs ~77% a small diverse panel); the 'more judges != better' ceiling is")
print("their finding that judges' opinions converge. On the public page every judge is MOCKED — no model")
print("is called. And legal multi-agent JUDGES are barely studied yet: treat this as emerging practice.")

Expected output

MOOT: is this candidate answer sound?
  candidate: "The limitation period is three years, per Harrington v Bellwether [2019] EWHC 244."
  (fluent, confident — and wrong: no such case, and the real period is six years)
     judge A, who rewards fluent confident prose -> SOUND (fooled by style — the single-judge bias)
     judge B, who checks every citation -> UNSOUND (no such case exists)
     judge C, who checks against the statute -> UNSOUND (s.5 Limitation Act 1980: six years)
  majority of the 3 DIVERSE judges: UNSOUND — the panel caught what one fluent judge waved through

panel of 9 judges at 10% independence
   -> worth about 1 genuinely independent vote(s)
   -> accuracy on a held-out set of rulings: 69%   (one judge alone: 69%)
   so all those judges bought nothing: same blind spot, same answer as a single judge.
   -> raise `independence` to give them genuinely different perspectives, and run again.

note: this is an ILLUSTRATIVE model, not a measured legal benchmark. The single/panel split echoes
Li et al. (~69% one judge vs ~77% a small diverse panel); the 'more judges != better' ceiling is
their finding that judges' opinions converge. On the public page every judge is MOCKED — no model
is called. And legal multi-agent JUDGES are barely studied yet: treat this as emerging practice.
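The listing calls `panel_accuracy`, `make_odd` and `p_single`, which the page supplies but does not show. The sketch below is one possible model, offered as an assumption rather than the page's actual helper: each effective judge is right with probability `p_single`, the panel takes a majority vote, and `independence` caps how many votes count as genuinely distinct. The 25-point step in that cap and the round-up behaviour of `make_odd` are invented for illustration.

```python
from math import comb

p_single = 0.69  # one judge's accuracy, matching the 69% in the printed output

def make_odd(n):
    # Assumption: round an even headcount up so a majority vote cannot tie.
    n = max(1, int(n))
    return n if n % 2 == 1 else n + 1

def majority_accuracy(p, n):
    # Probability that more than half of n independent judges, each right
    # with probability p, reach the correct verdict (binomial tail).
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

def panel_accuracy(judges, independence):
    # Assumption: independence caps the number of distinct votes at
    # 1, 3, 5, 7 or 9 (one step per 25 points). Judges beyond the cap are copies.
    level = max(0, min(100, int(independence)))
    cap = 1 + 2 * (level // 25)
    effective = min(make_odd(judges), cap)
    return majority_accuracy(p_single, effective), effective

for judges, independence in [(9, 10), (9, 30), (9, 55), (9, 80), (99, 80)]:
    acc, effective = panel_accuracy(judges, independence)
    print(judges, "judges at", independence, "% ->", effective, "vote(s),", round(acc * 100), "%")
```

Under these assumptions the loop prints 69%, 77%, 82% and 86% for the four independence levels, and the last row (99 judges at 80% independence) stays at 86%, because the cap, not the headcount, sets the number of distinct votes.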

Think

A panel only outperforms a single judge to the extent its members actually see differently, and that is far easier to assert than to secure. Three AI judges built on the same model, given the same prompt, are not a bench; they are one judge wearing three robes, and they will agree confidently and be wrong together. So if you stand up a multi-judge check over your own outputs: what concretely makes your judges independent (different models, different instructions, an adversary briefed to dissent), and how would you test that their agreement is worth anything, rather than three echoes you have mistaken for a verdict?
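One way to start testing that question is to run the judges over items whose answers you already know and compare how they fail, not just how often they agree. The sketch below is an assumption-laden illustration: the verdicts are toy data invented for the example, and the two measures (`agreement` and `shared_error_rate`) are simple choices, not an established benchmark.

```python
from itertools import combinations

# Toy verdicts on eight items with known answers (1 = sound, 0 = unsound).
# Invented purely for illustration; these are not measurements.
truth = [1, 0, 1, 1, 0, 0, 1, 0]
verdicts = {
    "judge_a": [1, 1, 1, 1, 1, 0, 1, 0],
    "judge_b": [1, 1, 1, 1, 1, 0, 1, 0],  # an echo of judge_a
    "judge_c": [1, 0, 0, 1, 0, 0, 1, 1],
}

def errors(name):
    # Indices of the items this judge got wrong.
    return {i for i, (v, t) in enumerate(zip(verdicts[name], truth)) if v != t}

def agreement(a, b):
    # Fraction of items on which two judges return the same verdict.
    same = sum(x == y for x, y in zip(verdicts[a], verdicts[b]))
    return same / len(truth)

def shared_error_rate(a, b):
    # Jaccard overlap of mistakes: 1.0 means wrong on exactly the same items.
    ea, eb = errors(a), errors(b)
    union = ea | eb
    return len(ea & eb) / len(union) if union else 0.0

for a, b in combinations(verdicts, 2):
    print(a, b, "agreement:", agreement(a, b), "shared errors:", shared_error_rate(a, b))
```

In this toy data `judge_a` and `judge_b` agree on everything and share every mistake, so the pair is worth one vote; `judge_c` is equally fallible but errs on different items, which is what lets a majority correct anyone.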