Lesson 5 · Reasoning across
Reasoning across — convene a panel, and make it disagree
The earliest image we have of an English court shows a panel — the judges of King’s Bench ranged together along one high bench, c. 1460, not a single arbiter. English law trained its advocates into the same shape: at the Inns of Court a student argued moots, hypothetical cases pleaded before a bench of benchers who heard both sides and ruled. The form encodes a hard-won distrust of the lone decision-maker — and, with it, the benchers’ own caution, which is the sharp edge of this lesson: a panel is only worth convening if its members genuinely differ. Judges who agree too quickly have stopped adding anything; they have become one judge counted several times.
That is exactly the problem when the “judge” is a language model. A single model-judge carries inherent parametric biases, such as favouring verbosity, and suffers cognitive overload when attempting to evaluate all dimensions in a single inference step. The wording is from You et al., “A Survey on Agent-as-a-Judge” (arXiv:2601.05111, 2026), which argues that a multi-agent panel “ensures the final judgment transcends the heuristics of any single model” and notes that “ChatEval pioneered this with a courtroom-inspired discussion mechanism”: the moot, rebuilt for machine outputs.
The empirical shape comes from Li et al., “Leveraging LLMs as Meta-Judges” (arXiv:2504.17087, 2025): majority voting reached 77.3% against a single agent’s 68.9%. Crucially, though, panel discussion could underperform, because “the opinions of different agents tend to converge over time” and extra roles bring information overload. So the design rule is counter-intuitive, and it is the thing to carry away: diversity beats premature consensus, and more judges is not better.
Treat all of this as emerging. These are 2025–26 preprints on general tasks, and legal multi-agent judges are barely studied: they surface only as one-line citations inside that survey, and the nearest legal system, JurisMMA, does judgment prediction, not evaluation. And on this public page every judge is mocked by design: its reliability is a number, not a live model call.
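As a rough sanity check on why a vote can beat one judge, the idealized three-judge case can be worked by hand. The sketch assumes each judge is right with probability p = 0.689 (the single-agent figure quoted above) and that the judges err independently of one another. That independence is an assumption of this sketch, not something the cited papers claim of real model-judges.

```latex
\begin{align*}
P_{\text{majority}}(3) &= p^{3} + 3p^{2}(1-p) \\
                       &= 0.689^{3} + 3\,(0.689)^{2}(0.311) \\
                       &\approx 0.327 + 0.443 \\
                       &\approx 0.770
\end{align*}
```

The result sits close to the reported 77.3%, but the closeness should not be read as evidence that the judges in that study were independent. At the other extreme, three judges who always return the same verdict give a majority that is right exactly when one judge is right, so the panel's accuracy falls back to p.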
So the cell below makes both halves visible. First a fixed moot: the candidate answer is the very fabrication you have been hunting since Lesson 3 — a confident Harrington v Bellwether and a wrong three-year period — and a single judge that rewards fluent, assured prose passes it, while a diverse three-judge bench (one weighing style, one checking the citation, one checking the statute) outvotes that judge two-to-one and catches it. Then the dial: on load, a panel of nine judges at just 10% independence scores 69% — no better than one judge, because at that level they have all converged into a single view, and the nine bought you nothing. Now do the thing this course is built on: change the boxes. Push independence up and watch accuracy climb (77%, 82%, 86%) as the bench learns to disagree; then hold it fixed and push judges to ninety-nine and watch the number not move — past the diversity the panel can actually support, every extra judge is a copy. The skill you bank here is Diverse panel: when you must judge rather than check a model’s output, convene more than one judge, force them to reason differently, and take the majority — but never mistake a louder consensus for a more reliable one. That distrust of the too-easy agreement is what you carry into Lesson 6, where the loop finally closes: every correction the panel surfaces becomes a written rule, and every rule a reusable skill.
# You don't need to change the code — just edit the boxes above and press Run.
# (It runs on its own the first time, so you can see what it does straight away.)
# Lessons 3 and 4 CHECKED facts. This lesson JUDGES — a softer, more dangerous task, because one
# model-judge has biases (it favours fluent, confident prose) and overloads when asked to weigh every
# dimension at once. The classical answer is the one English law already used: don't trust a single
# judge, convene a PANEL, hear different views, take the majority. But a panel only helps if its
# members genuinely DIFFER — judges who agree too quickly add nothing. This cell shows both halves.
# `judges` = how many sit on the panel. `independence` (0-100) = how different their perspectives
# are; low independence means they converge on one view (groupthink) and the panel is worth ~1 judge.
# ---- A concrete moot: one biased judge vs a DIVERSE three-judge panel (fixed illustration) --------
print("MOOT: is this candidate answer sound?")
print(' candidate: "The limitation period is three years, per Harrington v Bellwether [2019] EWHC 244."')
print(" (fluent, confident — and wrong: no such case, and the real period is six years)")
print(" judge A, who rewards fluent confident prose -> SOUND (fooled by style — the single-judge bias)")
print(" judge B, who checks every citation -> UNSOUND (no such case exists)")
print(" judge C, who checks against the statute -> UNSOUND (s.5 Limitation Act 1980: six years)")
print(" majority of the 3 DIVERSE judges: UNSOUND — the panel caught what one fluent judge waved through")
print()
# ---- Now the catch: a panel's value comes from DIVERSITY, not headcount ---------------------------
acc, effective = panel_accuracy(judges, independence)
single = round(p_single * 100)
print("panel of " + str(make_odd(judges)) + " judges at " + str(independence) + "% independence")
print(" -> worth about " + str(effective) + " genuinely independent vote(s)")
print(" -> accuracy on a held-out set of rulings: " + str(round(acc * 100)) + "% (one judge alone: " + str(single) + "%)")
if effective <= 1:
    print(" so all those judges bought nothing: same blind spot, same answer as a single judge.")
    print(" -> raise `independence` to give them genuinely different perspectives, and run again.")
else:
    print(" the majority of differing judges is more reliable than the best single judge.")
print()
# Now do the thing this course is built on: CHANGE THE BOXES. Push `independence` up and watch accuracy
# climb as the panel's perspectives diverge. Then HOLD independence fixed and push `judges` far up —
# and watch accuracy STOP moving: once you have as many independent views as the diversity allows,
# extra judges are extra copies, not extra information. More judges is NOT more reliable; more
# DIVERSITY is. (It is why a moot pairs opposing counsel — not five judges who already agree.)
print("note: this is an ILLUSTRATIVE model, not a measured legal benchmark. The single/panel split echoes")
print("Li et al. (~69% one judge vs ~77% a small diverse panel); the 'more judges != better' ceiling is")
print("their finding that judges' opinions converge. On the public page every judge is MOCKED — no model")
print("is called. And legal multi-agent JUDGES are barely studied yet: treat this as emerging practice.")
MOOT: is this candidate answer sound?
candidate: "The limitation period is three years, per Harrington v Bellwether [2019] EWHC 244."
(fluent, confident — and wrong: no such case, and the real period is six years)
judge A, who rewards fluent confident prose -> SOUND (fooled by style — the single-judge bias)
judge B, who checks every citation -> UNSOUND (no such case exists)
judge C, who checks against the statute -> UNSOUND (s.5 Limitation Act 1980: six years)
majority of the 3 DIVERSE judges: UNSOUND — the panel caught what one fluent judge waved through
panel of 9 judges at 10% independence
-> worth about 1 genuinely independent vote(s)
-> accuracy on a held-out set of rulings: 69% (one judge alone: 69%)
so all those judges bought nothing: same blind spot, same answer as a single judge.
-> raise `independence` to give them genuinely different perspectives, and run again.
note: this is an ILLUSTRATIVE model, not a measured legal benchmark. The single/panel split echoes
Li et al. (~69% one judge vs ~77% a small diverse panel); the 'more judges != better' ceiling is
their finding that judges' opinions converge. On the public page every judge is MOCKED — no model
is called. And legal multi-agent JUDGES are barely studied yet: treat this as emerging practice.
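The cell calls `panel_accuracy`, `make_odd` and `p_single` without showing them, so the page's real definitions are unknown. What follows is a minimal sketch of one model that happens to reproduce the quoted figures (69%, 77%, 82%, 86%). It assumes the panel is worth some odd number of independent votes, each right with probability 0.689, and takes the binomial probability that the majority is right. The rounding rule in `make_odd`, the helper names `effective_votes` and `majority_accuracy`, and especially the mapping from `independence` to a ceiling on distinct views are all guesses made for illustration.

```python
from math import comb

# ASSUMPTION: the page never shows these definitions. Every name and constant
# below is a guess chosen to reproduce the numbers quoted in the lesson.
P_SINGLE = 0.689  # single-judge accuracy, echoing the 68.9% figure from Li et al.


def make_odd(n: int) -> int:
    """Round a panel size down to an odd number (minimum 1) so a vote cannot tie."""
    n = max(1, int(n))
    return n if n % 2 == 1 else n - 1


def effective_votes(judges: int, independence: int) -> int:
    """Hypothetical mapping: every 25 points of independence allows two more distinct views."""
    ceiling = 1 + 2 * (max(0, min(100, independence)) // 25)
    return min(make_odd(judges), ceiling)


def majority_accuracy(k: int, p: float) -> float:
    """Probability that a majority of k independent judges is right, each right with probability p."""
    need = k // 2 + 1
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


def panel_accuracy(judges: int, independence: int, p: float = P_SINGLE):
    """Return (accuracy, effective independent votes) for the panel."""
    effective = effective_votes(judges, independence)
    return majority_accuracy(effective, p), effective


for judges, independence in [(9, 10), (9, 25), (9, 50), (9, 75), (99, 50)]:
    acc, effective = panel_accuracy(judges, independence)
    print(judges, "judges at", independence, "% ->", effective, "vote(s),", round(acc * 100), "%")
```

Under these assumptions, nine judges and ninety-nine judges at the same independence give the same accuracy, because the ceiling on distinct views binds and not the headcount. A real panel would need that ceiling measured, not assumed.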
Think
A panel only outperforms a single judge to the extent its members actually see differently, and that is far easier to assert than to secure. Three AI judges built on the same model, given the same prompt, are not a bench; they are one judge wearing three robes, and they will agree confidently and be wrong together. So if you stand up a multi-judge check over your own outputs: what concretely makes your judges independent (different models, different instructions, an adversary briefed to dissent), and how would you test that their agreement is worth anything, rather than three echoes you have mistaken for a verdict?
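One way to make that last question concrete, offered as a sketch and not an established method: score every judge against a small hand-labelled set, then check how often pairs of judges are wrong on the same items. The gold labels, the verdicts, the judge names and the `shared_error_rate` helper below are all hypothetical, invented to contrast a diverse bench with three copies of one judge.

```python
from itertools import combinations

# Hypothetical toy data (True = "sound"); invented for illustration, not measured results.
gold = [True, False, False, True, False, True, False, False]

diverse = {
    "style":    [True, True,  True,  True, False, True,  True,  False],
    "citation": [True, False, False, True, False, False, False, False],
    "statute":  [True, False, False, True, False, True,  False, True],
}
# Three judges built the same way: one judge wearing three robes.
echo = {name: list(diverse["style"]) for name in ("copy_1", "copy_2", "copy_3")}


def accuracy(votes, gold):
    """Fraction of items on which the votes match the gold labels."""
    return sum(v == g for v, g in zip(votes, gold)) / len(gold)


def majority(panel):
    """Item-by-item majority verdict across all judges in the panel."""
    return [sum(col) * 2 > len(col) for col in zip(*panel.values())]


def shared_error_rate(a, b, gold):
    """Of the items either judge gets wrong, the fraction that both get wrong."""
    wrong_a = {i for i, (v, g) in enumerate(zip(a, gold)) if v != g}
    wrong_b = {i for i, (v, g) in enumerate(zip(b, gold)) if v != g}
    either = wrong_a | wrong_b
    return len(wrong_a & wrong_b) / len(either) if either else 0.0


for label, panel in (("diverse", diverse), ("echo", echo)):
    best = max(accuracy(v, gold) for v in panel.values())
    print(label, "best single judge:", best, "majority:", accuracy(majority(panel), gold))
    for (n1, v1), (n2, v2) in combinations(panel.items(), 2):
        print("  ", n1, "vs", n2, "shared errors:", shared_error_rate(v1, v2, gold))
```

In this toy set the diverse judges never err on the same item, so the majority corrects each of them. The echo panel shares every error, so its majority is no better than one copy. On real outputs the labelled set has to be large enough for the shared-error figure to mean something, and high raw agreement alone says nothing until it is compared against the gold labels.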