Lesson 6 · Improving the loop

Improving the loop — reliability you compound from your own corrections

The index from John Locke's *A New Method of Making Common-Place-Books* (1706; first published in French in 1686), printed in red and black across a double page, tabulating each letter of the alphabet against the following vowel as an addressing scheme for entries. For centuries, and nowhere more than in legal training, readers distilled what they read into a personal, indexed book of rules, maxims and cases, kept for reuse and grown over a lifetime. The `toolkit/` you assemble across this course is a commonplace book: each correction turned into a rule, each rule filed as a reusable, addressable skill, which is the humanist practice Erasmus codified five centuries ago. Source: Wikimedia Commons · public domain.

A commonplace book is the oldest version of the thing this whole course has been quietly building. For centuries, and nowhere more methodically than in legal training, a reader kept a personal, indexed book into which they distilled what they read: a rule here, a maxim there, a case worth keeping, each filed under a heading so it could be found again and reused. Locke's formal addressing scheme for it appeared in French in 1686 and in English in 1706; Erasmus had codified the practice nearly two centuries earlier in *De Copia*. The point was never collection for its own sake. It was that a working lawyer’s reliability is accumulated: built rule by rule over a career, each hard-won correction written down so it never has to be learned twice. The `toolkit/` you have assembled across these six lessons is exactly that book, in software: every lesson ended by turning a way the model fails into a durable, reusable skill.

This last lesson is about the loop that makes the book worth keeping: correction → rule → skill, repeated until reliability compounds. It is worth being honest about what this is and is not. Unlike the earlier lessons, it rests on the weakest evidence in the course: the loop is a working engineering practice (it is how Claude Code skills are built), not a peer-reviewed result, and you should hold it as such. The nearest scholarly support is the literature on reliability through iteration: the Six Sigma Agent (arXiv:2601.22290, 2026) reports consensus checking taking a 5% per-action error down to 0.11% as independent checks accumulate, which is the same compounding shape in a different setting. And it carries the same warning you met in Lesson 5, now turned on your own toolkit: the multiplication only holds if the checks are genuinely independent. Real skills overlap, so true stacking is always weaker than the tidy product suggests, and a toolkit whose rules all reflect the same instinct gives you multiplied confidence without added coverage.
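The independence caveat can be made concrete with a small sketch. It is an illustration under stated assumptions, not the course's own model: both function names are hypothetical, the 40% base error and the 20%, 30% and 45% catch rates are the figures this lesson's cell prints for the first three skills, and the "fully overlapping" case assumes the weaker checks only catch errors the strongest check already catches.

```python
def residual_if_independent(base_error, catch_rates):
    """Residual error when each check catches its share of whatever reaches it."""
    residual = base_error
    for rate in catch_rates:
        residual *= 1 - rate
    return residual


def residual_if_fully_overlapping(base_error, catch_rates):
    """Residual error when all checks catch the same errors (assumed worst case).

    Only the strongest check adds coverage; the others re-catch what it caught.
    """
    return base_error * (1 - max(catch_rates))


# Figures taken from the lesson's printed output for skills 1-3.
base_error = 0.40
catch_rates = [0.20, 0.30, 0.45]

independent = residual_if_independent(base_error, catch_rates)
overlapping = residual_if_fully_overlapping(base_error, catch_rates)

print(f"independent checks:       {independent:.1%} residual error")  # 12.3%
print(f"fully overlapping checks: {overlapping:.1%} residual error")  # 22.0%
```

A toolkit whose skills partly overlap lands somewhere between these two figures, which is why the tidy product is better read as the optimistic end of the range than as a forecast.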

So the cell below makes the compounding visible, using your own six skills. On load, only the first three are switched on, and the model’s error rate sits at 12.3% — better than the raw 40%, but a long way from safe. Now do the thing this course is built on, one last time: change the box. Add 4 and watch the Grounding check — Lesson 3, the heart of the course — deliver the single largest drop on its own; add 5 and 6 and watch the residual fall to about 2%; then remove any one number and see precisely how much that skill was holding back. The lesson in the numbers is the whole course in miniature: that reliability did not arrive as one clever model or one perfect prompt, but as a stack of ordinary corrections, each catching what the last let through. That is the final skill — the one that makes every future skill — and it is yours to keep: when the model fails you, don’t just fix it, file it. Write the correction as a rule, bank the rule as a skill, and your commonplace book grows past these six on its own.
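"Don't just fix it, file it" can be sketched in a few lines. What follows is one possible shape for the habit, not the course's actual `toolkit/` layout: the `Rule` fields, the `file_rule` and `rules_under` helpers, the `rules.jsonl` file name and the two example rules are all hypothetical, chosen only to show a correction being written down under a heading and found again later.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Rule:
    heading: str   # index heading, so the rule can be found again
    failure: str   # what the model got wrong
    rule: str      # the correction, stated as a reusable instruction


def file_rule(rule, book_path):
    """Append one rule to the book as a single line of JSON."""
    book_path = Path(book_path)
    book_path.parent.mkdir(parents=True, exist_ok=True)
    with book_path.open("a", encoding="utf-8") as book:
        book.write(json.dumps(asdict(rule)) + "\n")


def rules_under(heading, book_path):
    """Return every filed rule whose heading matches."""
    book_path = Path(book_path)
    if not book_path.exists():
        return []
    with book_path.open(encoding="utf-8") as book:
        entries = [json.loads(line) for line in book if line.strip()]
    return [entry for entry in entries if entry["heading"] == heading]


# Demo in a temporary folder, so nothing in a real toolkit/ is touched.
with tempfile.TemporaryDirectory() as folder:
    book = Path(folder) / "toolkit" / "rules.jsonl"
    file_rule(Rule("citations", "Cited a case that was not in the bundle.",
                   "Quote the source passage for every citation."), book)
    file_rule(Rule("dates", "Mixed up filing and hearing dates.",
                   "State which event each date refers to."), book)
    found = rules_under("citations", book)
    print(len(found), "rule(s) under 'citations':", found[0]["rule"])
```

The heading plays the part of Locke's index: a rule that cannot be found again at the moment it is needed has not really been banked.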

✏️ which of your six skills are switched on (their numbers, 1–6)


# You don't need to change the code — just edit the box above and press Run.
# (It runs on its own the first time, so you can see what it does straight away.)

# This is the last lesson, and it is about the course itself. Across L0–L5 you did not just learn
# six facts — you BANKED six skills, one per lesson, each one a correction turned into a reusable rule:
# size the window, budget the context, retrieve don't stuff, ground every claim, verify at scale,
# convene a diverse panel. The loop of this lesson is what turns that pile into compounding reliability:
# stack the skills as independent CHECKS, and each one catches a share of what the last let through.

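# `skills_on` lists which skills are currently switched on. Edit it to add or remove their numbers.
# (`skills_on`, `BASE_ERROR` and `SKILLS` are not defined in this listing; they are set before this code runs.)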

# work out which skills are on (any of 1..6, in any order):
on = set()
for token in skills_on.replace(",", " ").split():
    if token.isdigit() and 1 <= int(token) <= 6:
        on.add(int(token))

# stack the active checks: each catches a fraction of the errors that survived the earlier ones,
# so the residual error is the base rate multiplied down by every check you switch on.
print("your toolkit — six skills, banked one per lesson, stacked as independent checks:")
residual = BASE_ERROR
for num, name, lesson, eff in SKILLS:
    if num in on:
        residual *= (1 - eff)
        print("   [on ] " + str(num) + ". " + name + " (" + lesson + ")  -  catches ~" + str(round(eff * 100)) + "% of the errors that reach it")
    else:
        print("   [off] " + str(num) + ". " + name + " (" + lesson + ")  -  (switched off)")
print()

raw_pct = round(BASE_ERROR * 100, 1)
now_pct = round(residual * 100, 1)
print("raw model, no toolkit:    " + str(raw_pct) + "%  error on these hard tasks  (~" + str(round(BASE_ERROR * 1000)) + " of every 1,000 documents wrong)")
print("with " + str(len(on)) + " of 6 skills on:    " + str(now_pct) + "%  error  (~" + str(round(residual * 1000)) + " of every 1,000 still slip through)")
print()

# Now do the thing this course is built on, one last time: CHANGE THE BOX.
if len(on) < 6:
    print("-> switch on more skills (add their numbers) and watch the error fall; the Grounding check (4)")
    print("   is the single biggest drop. Or remove one and watch how much that one skill was holding back.")
else:
    print("-> all six on. From " + str(raw_pct) + "% to " + str(now_pct) + "% — not by one clever fix, but by")
    print("   stacking independent checks, each catching what the last let through. That compounding IS the loop.")

print()
print("note: this is an ILLUSTRATIVE model. Two honest caveats. (1) It assumes the checks are INDEPENDENT")
print("(errors multiply) — real skills overlap, so true stacking is weaker than the product implies; the")
print("same 'independence' caveat you met judging panels in Lesson 5. (2) The correction -> rule -> skill")
print("loop is a working PRACTICE, not a peer-reviewed result — the nearest evidence is reliability-through-")
print("iteration (the Six Sigma Agent reports 5% per-action error falling to 0.11% as checks accumulate).")

Expected output with skills 1, 2 and 3 switched on:

your toolkit — six skills, banked one per lesson, stacked as independent checks:
   [on ] 1. Token counter (L0)  -  catches ~20% of the errors that reach it
   [on ] 2. Context budgeter (L1)  -  catches ~30% of the errors that reach it
   [on ] 3. Chunk-and-retrieve (L2)  -  catches ~45% of the errors that reach it
   [off] 4. Grounding check (L3)  -  (switched off)
   [off] 5. Batch-and-verify (L4)  -  (switched off)
   [off] 6. Diverse panel (L5)  -  (switched off)

raw model, no toolkit:    40.0%  error on these hard tasks  (~400 of every 1,000 documents wrong)
with 3 of 6 skills on:    12.3%  error  (~123 of every 1,000 still slip through)

-> switch on more skills (add their numbers) and watch the error fall; the Grounding check (4)
   is the single biggest drop. Or remove one and watch how much that one skill was holding back.

note: this is an ILLUSTRATIVE model. Two honest caveats. (1) It assumes the checks are INDEPENDENT
(errors multiply) — real skills overlap, so true stacking is weaker than the product implies; the
same 'independence' caveat you met judging panels in Lesson 5. (2) The correction -> rule -> skill
loop is a working PRACTICE, not a peer-reviewed result — the nearest evidence is reliability-through-
iteration (the Six Sigma Agent reports 5% per-action error falling to 0.11% as checks accumulate).
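
The "remove one" experiment can also be run for every skill at once, as a leave-one-out calculation. This is a sketch under assumptions: it uses only the base error and the three catch rates printed in the output above (skills 4 to 6 are left out because their catch rates are not shown there), and `residual_error` and `leave_one_out` are hypothetical helper names, not part of the lesson's code.

```python
BASE_ERROR = 0.40  # from the output above: "raw model, no toolkit: 40.0%"

# (number, name, catch rate) for the three skills whose rates the output shows.
SKILLS = [
    (1, "Token counter", 0.20),
    (2, "Context budgeter", 0.30),
    (3, "Chunk-and-retrieve", 0.45),
]


def residual_error(on):
    """Base error multiplied down by every switched-on check (independence assumed)."""
    residual = BASE_ERROR
    for num, _name, eff in SKILLS:
        if num in on:
            residual *= 1 - eff
    return residual


def leave_one_out(on):
    """For each active skill, how far the error rises if that skill alone is removed."""
    full = residual_error(on)
    return {num: residual_error(on - {num}) - full for num in sorted(on)}


held_back = leave_one_out({1, 2, 3})
for num, name, _eff in SKILLS:
    print(f"without {num}. {name}: +{held_back[num] * 100:.1f} points of error")
# without 1. Token counter: +3.1 points of error
# without 2. Context budgeter: +5.3 points of error
# without 3. Chunk-and-retrieve: +10.1 points of error
```

Because the model multiplies, the cost of removing a skill depends on which other skills are on, so these figures apply to this three-skill stack only.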

Think

The loop's promise is that reliability compounds, but only if each correction becomes a durable, independent rule, and not a fix you make once and forget by the next matter. Two things quietly defeat it: corrections you never write down, and a toolkit whose rules all encode the same instinct, so that stacking them multiplies confidence without adding coverage (the independence trap you met in Lesson 5, now pointed at yourself). So, leaving this course with six skills and the means to grow more: what would actually make you stop and write the rule down mid-matter, and how would you tell whether your toolkit is six independent checks, or one blind spot you have copied six times?
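
One rough way to approach the second question is to log which known errors each check actually caught, then compare the stack's real coverage with what independence would predict. The sketch below is an assumption-laden illustration: `overlap_report` is a hypothetical helper, the two check names and the ten numbered errors are made-up data, and "residual" here means the share of known errors that slipped past every check, not an overall error rate.

```python
from itertools import combinations


def overlap_report(known_errors, caught_by):
    """Compare actual stacked coverage with the independent-checks prediction.

    known_errors: set of error ids found by review.
    caught_by:    dict mapping check name -> set of error ids that check caught.
    """
    total = len(known_errors)

    # Actual: an error survives only if no check caught it.
    caught_any = set().union(*caught_by.values()) & known_errors
    actual_residual = 1 - len(caught_any) / total

    # Predicted: multiply each check's miss rate, as the lesson's model does.
    predicted_residual = 1.0
    for caught in caught_by.values():
        predicted_residual *= 1 - len(caught & known_errors) / total

    # Pairwise Jaccard overlap: 1.0 means two checks caught exactly the same errors.
    jaccard = {}
    for (name_a, a), (name_b, b) in combinations(caught_by.items(), 2):
        union = a | b
        jaccard[(name_a, name_b)] = len(a & b) / len(union) if union else 0.0

    return actual_residual, predicted_residual, jaccard


# Made-up log: ten known errors, two checks that mostly catch the same ones.
known_errors = set(range(10))
caught_by = {
    "grounding": {0, 1, 2, 3, 4},
    "panel": {0, 1, 2, 3, 5},
}

actual, predicted, jaccard = overlap_report(known_errors, caught_by)
print(f"actual residual:    {actual:.0%}")     # 40%
print(f"predicted residual: {predicted:.0%}")  # 25%
print(f"overlap:            {jaccard[('grounding', 'panel')]:.2f}")  # 0.67
```

When the actual residual sits well above the predicted one and the pairwise overlap is high, the checks are behaving more like copies of one instinct than like independent rules.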