Reference

Resources & further reading

Everything this course teaches is anchored to published work, and this page is that anchor map: the strongest source behind each lesson, with an honest mark of how far the evidence actually reaches. The field is young — much of what follows is preprint, single-study, or drawn from general rather than legal settings, and a few techniques the course demonstrates have no supporting literature yet at all. Those gaps are listed plainly below, because for a legal reader they are part of the evidence, not an embarrassment to it. Read the verdict on each item before you rely on it.

How to read the verdicts

Strongly evidenced: peer-reviewed and replicated or multiply corroborated; teach as established.
Emerging: plausible and often rigorous, but single-study, preprint, or narrow-domain; a finding to watch.
Contested: good sources disagree; learn the disagreement, not a winner.
Vendor-claimed: asserted by a product source, not independent evidence.

L0 · Foundations — tokenisation

Ariai & Demartini, Natural Language Processing for the Legal Domain: A Survey · Emerging

2024 · arXiv:2410.21306 · legal text tokenises inefficiently and uses specialised vocabulary, so a fixed window holds fewer legal documents than you would expect.

Vaswani et al., Attention Is All You Need · Background

2017 · arXiv:1706.03762 · the transformer architecture itself. Foundational background, named for completeness — it sits outside this course's verified source base, so treat it as established context, not a course claim.
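
The window arithmetic behind the Ariai & Demartini point can be sketched in a few lines. This is a minimal illustration, not a measurement: the window size, document length and both tokens-per-word ratios are hypothetical values chosen only to show the direction of the effect, and `documents_per_window` is a name invented for this example.

```python
def documents_per_window(window_tokens: int, words_per_document: int,
                         tokens_per_word: float) -> int:
    """Return how many whole documents fit in a fixed token window."""
    tokens_per_document = words_per_document * tokens_per_word
    return int(window_tokens // tokens_per_document)


# Hypothetical figures, not taken from the survey.
WINDOW_TOKENS = 128_000
WORDS_PER_DOCUMENT = 8_000

# Assumed ratios: legal prose is given a higher tokens-per-word cost.
everyday = documents_per_window(WINDOW_TOKENS, WORDS_PER_DOCUMENT, tokens_per_word=1.25)
legal = documents_per_window(WINDOW_TOKENS, WORDS_PER_DOCUMENT, tokens_per_word=1.5)

print(f"everyday prose: {everyday} documents; legal prose: {legal} documents")
```

To replace the assumed ratios with real ones, count tokens and words on your own documents with the tokeniser you actually use and divide.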

L1 · Getting documents in — long context

Liu et al., Lost in the Middle: How Language Models Use Long Contexts · Strongly evidenced

Stanford / UC Berkeley / Samaya AI · widely cited as TACL 2024, arXiv:2307.03172 (venue/id not printed in the captured source) · performance is highest when relevant material is at the very start or end of the window, and degrades in the middle.

Veseli et al., Positional Biases Shift as Inputs Approach Context-Window Limits; and the GM-Extract study · Contested shape

2025 · arXiv:2508.07479 · arXiv:2511.13900 · the raw U-shape weakens past roughly half window-fill and becomes a relative distance/recency bias — so don't teach a fixed U-curve as universal law.
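
One possible way to act on the Liu et al. finding is to order passages so the highest-ranked ones sit at the edges of the prompt. The sketch below is an illustration only: neither paper above is claimed to prescribe it, `place_at_edges` is a hypothetical helper, and the contested-shape result suggests the benefit may shrink as the window fills.

```python
def place_at_edges(ranked_chunks: list[str]) -> list[str]:
    """Reorder chunks ranked best-first so the strongest sit at the start
    and end of the prompt and the weakest end up in the middle."""
    front: list[str] = []
    back: list[str] = []
    for position, chunk in enumerate(ranked_chunks):
        # Alternate: ranks 1, 3, 5... go to the front, ranks 2, 4... to the back.
        if position % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk is the very last item.
    return front + back[::-1]


ordered = place_at_edges(["A", "B", "C", "D", "E"])  # "A" is the best match
print(ordered)
```

Treat the ordering as something to test on your own documents and model, not as a fixed rule.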

L2 · Finding what matters — retrieval / RAG

Reuter et al., Towards Reliable Retrieval in RAG Systems for Large Legal Datasets · Emerging

2025 · arXiv:2510.06999 · standard RAG over a realistic legal corpus retrieved the wrong document at rates above 95%; summary-augmented chunking roughly halved that "document-level retrieval mismatch."

Honest limit: it improved the retrieval step only — it did not prove the final answer was more truthful.
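
The idea of summary-augmented chunking, and of counting wrong-document retrievals, can be sketched as follows. This is a simplified reading under stated assumptions, not the paper's implementation: the word-based chunk size, the `Chunk` type and both function names are hypothetical, and the mismatch rate here is simply the share of retrieved chunks that come from a document other than the expected one.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str


def summary_augmented_chunks(doc_id: str, summary: str, body: str,
                             chunk_words: int = 200) -> list[Chunk]:
    """Split a document into word-based chunks and prepend the same
    document-level summary to each one before it is indexed."""
    words = body.split()
    chunks = []
    for start in range(0, len(words), chunk_words):
        piece = " ".join(words[start:start + chunk_words])
        chunks.append(Chunk(doc_id, f"{summary}\n\n{piece}"))
    return chunks


def document_mismatch_rate(retrieved: list[Chunk], expected_doc_id: str) -> float:
    """Share of retrieved chunks that belong to the wrong document."""
    if not retrieved:
        return 0.0
    wrong = sum(1 for chunk in retrieved if chunk.doc_id != expected_doc_id)
    return wrong / len(retrieved)


chunks = summary_augmented_chunks(
    "nda-1", "NDA between Alpha Ltd and Beta LLP.",
    "one two three four five", chunk_words=2)
print(len(chunks), document_mismatch_rate(chunks + [Chunk("nda-2", "x")], "nda-1"))
```

The metric only looks at where chunks came from, which is why a lower mismatch rate says nothing by itself about the truthfulness of the final answer.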

L3 · Summarising reliably — grounding & hallucination

Dahl, Magesh, Suzgun & Ho, Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models · Strongly evidenced

Journal of Legal Analysis, 16, 64–93, 2024 · DOI 10.1093/jla/laae003 · arXiv:2401.01301 · general-purpose models hallucinated fabricated case-law on 58–88% of specific legal queries.

Magesh et al., Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools · Strongly evidenced

J. Empirical Legal Studies, 2025 · arXiv:2405.20362 · even RAG-backed commercial tools still invented material: Lexis+ AI ~17%, Westlaw ~33%, GPT-4 ~43%.

Chalkidis et al., LEGAL-BERT: The Muppets straight out of Law School; and LexGLUE · Strongly evidenced

Findings of ACL: EMNLP 2020, DOI 10.18653/v1/2020.findings-emnlp.261 / arXiv:2010.02559 · LexGLUE, ACL 2022, arXiv:2110.00976 · domain-adapted legal models earn their keep.

L4 · Working at scale — agents & connectors

Fan et al., Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol · Emerging

2026 · arXiv:2602.13320 · cumulative tool-call distortion grows like √T (bounded, not exponential), and re-grounding about every 9 steps suffices for error control.

Honest limit: a single v1 preprint on deterministic tools and single-agent chains.
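
The two numbers in the Fan et al. entry, square-root growth and a re-grounding interval of about nine steps, can be made concrete with a small sketch. It assumes a unit per-step distortion scale and a simple "every ninth step" schedule; both are illustrative choices, the function names are hypothetical, and the paper's own constants and procedure are not reproduced here.

```python
import math


def sqrt_growth(steps: int, per_step: float = 1.0) -> float:
    """Distortion scale after `steps` tool calls if it grows like sqrt(T).
    `per_step` is a hypothetical unit, not a constant from the paper."""
    return per_step * math.sqrt(steps)


def regrounding_schedule(total_steps: int, interval: int = 9) -> list[int]:
    """Steps (1-based) at which the agent goes back to the source material."""
    return [step for step in range(1, total_steps + 1) if step % interval == 0]


# After 36 calls the scale is 6x one step, not 36x.
print(sqrt_growth(36))
# A 30-step chain would re-ground three times under this schedule.
print(regrounding_schedule(30))
```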

L5 · Reasoning across — multi-agent evaluation

You et al., A Survey on Agent-as-a-Judge · Emerging

2026 · arXiv:2601.05111 · a single model-judge favours verbosity and overloads on many dimensions at once; a multi-agent panel "transcends the heuristics of any single model" (and credits a courtroom-inspired mechanism).

Li et al., Leveraging LLMs as Meta-Judges · Emerging

2025 · arXiv:2504.17087 · majority voting reached 77.3% vs a single agent's 68.9%, but panel discussion can underperform because opinions converge: diversity beats premature consensus, and more judges is not better.

JurisMMA: a six-agent courtroom for legal judgment · Emerging

2026 · arXiv:2601.12815 · the nearest legal multi-agent system — but it does judgment prediction, not evaluation.
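
Majority voting, the mechanism behind the Li et al. figures above, can be sketched in a few lines. The sketch assumes each judge returns its verdict independently, with no discussion round, and that a tie yields no decision; the tie rule and the name `majority_verdict` are assumptions made for this example, not details from the paper.

```python
from collections import Counter
from typing import Optional


def majority_verdict(verdicts: list[str]) -> Optional[str]:
    """Return the most common verdict, or None if the panel is empty
    or the top two verdicts are tied."""
    if not verdicts:
        return None
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: left for a human or a further round to resolve
    return ranked[0][0]


# Three hypothetical judges, collected without seeing each other's answers.
print(majority_verdict(["pass", "fail", "pass"]))
```

Collecting the votes before any discussion is the design choice that matters here, since the entry above reports that discussion can pull opinions together and cost accuracy.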

L6 · Improving the loop — reliability through iteration

The Six Sigma Agent · Vendor-affiliated

2026 · arXiv:2601.22290 · consensus checking took a 5% per-action error down to 0.11% as independent checks accumulated — the compounding shape behind the lesson's meter.

Honest limit: the correction → rule → skill loop is a working engineering practice, not a peer-reviewed result. Hold it as a pattern, anchored to the reliability-through-iteration literature.
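
The compounding shape can be illustrated with a simple model: several checkers that err independently at the same rate, with the majority deciding. Independence, equal error rates and an odd number of checkers are all assumptions of this sketch, and whether the paper uses exactly this model is not stated on this page. Under these assumptions, five checkers at 5% each give a majority error of roughly 0.12%, the same order as the figure quoted above.

```python
from math import comb


def majority_error(per_check_error: float, checkers: int) -> float:
    """Probability that a strict majority of independent checkers is wrong.
    Assumes an odd number of checkers with identical error rates."""
    needed = checkers // 2 + 1
    return sum(
        comb(checkers, k)
        * per_check_error ** k
        * (1 - per_check_error) ** (checkers - k)
        for k in range(needed, checkers + 1)
    )


for n in (1, 3, 5):
    print(n, f"{majority_error(0.05, n):.4%}")
```

Real checkers that share a model or a prompt tend to make correlated mistakes, so treat the output as a best case.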

Adjacent — speech & document capture

Not taught as public lessons (they need real models, kept for the internal build), but on the same spine and useful if you digitise audio or scans:

Whisper: Courtside Edition; Seeing Justice Clearly; OCR or Not · Mixed

arXiv:2602.18966 · AAAI 2026 · arXiv:2603.02789 · transcription accuracy is strong on clean speech but collapses on spontaneous/forensic audio, and OCR-free models fabricate names, dates and locations on handwritten legal documents.

Broader landscape

Guha et al., LegalBench

NeurIPS 2023, Datasets & Benchmarks track · a collaboratively built benchmark of legal-reasoning tasks.

Large Language Models Meet Legal Artificial Intelligence: A Survey

2025 · arXiv:2509.09969 · the wider map of legal-AI methods and tasks.

The honest gaps

For a sceptical reader these disclosures are the credibility. Where the course shows a technique the literature does not yet fully support, it says so — here and in the lessons themselves.

Foundations sit outside this source base. No transformer/attention paper is part of the verified set; the architecture is presented as established background, not a course finding.
"Lost in the middle" is real, but its shape is debated (2025–26) and depends on the model and the input length. There is no legal-domain study of it; the legal relevance is argued by analogy.
Better retrieval is not the same as a truer answer. Summary-augmented chunking cut wrong-document retrieval, but was not shown to reduce final-answer hallucination.
The commercial-tool hallucination figures are point-in-time. They test 2023–24 product versions, vendors dispute them, and "hallucination-free" marketing claims are contradicted by the peer-reviewed numbers — present them as a moment, not a verdict on any tool today.
The hard evidence for citation grounding comes from a non-common-law setting (Ukrainian law) and is vendor-affiliated; no US/UK citation-graph study is in the source base.
The viral multi-agent "reliability multipliers" (e.g. 17×) are not verifiable here — they reach us only relayed through a marketing blog citing papers not in the source base. Do not present them as established.
Legal multi-agent judges are barely studied. The moot-panel design leans on general-domain work; named legal systems appear only as one-line citations inside a survey.
The improvement loop is practice, not theory. "Correction → rule → skill" is how these tools are actually built, anchored to reliability-through-iteration — not a published result in its own right.
Courtroom/legal speech recognition is essentially untested in this source base; the strong word-error numbers come from general or forensic speech, and OCR-free models fabricate on handwritten legal material.

How this course runs

Every cell you ran executed in your own browser — Python compiled to WebAssembly (Pyodide), with the real tiktoken tokeniser and its vocabulary served from this site, never a third-party server. There is no backend, no login, and nothing about you is stored or sent anywhere: your progress lives only in this browser's local storage, which is why the toolkit has Save and Load buttons rather than an account. Where a lesson needed a live model or a connector, the output was pre-baked and the connector mocked — by design, so the page stays static, private, and safe to leave running in a law school. The agentic skills themselves are Claude Code practice; this course is the gentle, backend-free on-ramp to it.

Going further: take-home experiments that reproduce each finding in a live Claude or Gemini.