Reference
Resources & further reading
Everything this course teaches is anchored to published work, and this page is that anchor map: the strongest source behind each lesson, with an honest mark of how far the evidence actually reaches. The field is young — much of what follows is preprint, single-study, or drawn from general rather than legal settings, and a few techniques the course demonstrates have no supporting literature yet at all. Those gaps are listed plainly below, because for a legal reader they are part of the evidence, not an embarrassment to it. Read the verdict on each item before you rely on it.
- Strongly evidenced: peer-reviewed and replicated or multiply corroborated; teach as established.
- Emerging: plausible and often rigorous, but single-study, preprint, or narrow-domain; a finding to watch.
- Contested: good sources disagree; learn the disagreement, not a winner.
- Vendor-claimed: asserted by a product source, not independent evidence.
L0 · Foundations — tokenisation
- Ariai & Demartini, Natural Language Processing for the Legal Domain: A Survey [Emerging]
- Vaswani et al., Attention Is All You Need [Background]
L1 · Getting documents in — long context
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts [Strongly evidenced]
- Veseli et al., Positional Biases Shift as Inputs Approach Context-Window Limits; and the GM-Extract study [Contested shape]
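The positional effect these two sources describe can be probed directly. The sketch below is a minimal illustration, not the protocol of either paper: it places one target passage at the start, middle, and end of otherwise identical filler and returns the resulting contexts, which could then be sent to whatever model is under test. The function name `build_position_probes`, the filler clauses, and the target sentence are all invented for this example.

```python
def build_position_probes(needle, fillers, positions=(0.0, 0.5, 1.0)):
    """Return (relative_position, context) pairs with the needle placed at each position."""
    probes = []
    for rel in positions:
        # Convert the relative position into an insertion index among the filler passages.
        index = round(rel * len(fillers))
        passages = fillers[:index] + [needle] + fillers[index:]
        probes.append((rel, "\n\n".join(passages)))
    return probes


# Invented filler and target text, for illustration only.
fillers = [f"Clause {i}: this clause is unrelated boilerplate." for i in range(1, 9)]
needle = "Clause X: the notice period for termination is 45 days."

probes = build_position_probes(needle, fillers)
for rel, context in probes:
    # Report where the target actually landed, measured in characters.
    offset = context.index(needle) / len(context)
    print(f"target at {rel:.0%} of passages, character offset {offset:.0%}")
```

Asking the same question of each context and comparing the answers gives a rough picture of position sensitivity for one model at one length. Because the shape of the effect is reported as model- and length-dependent, a single run of this kind says little beyond the configuration it tested.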
L2 · Finding what matters — retrieval / RAG
- Reuter et al., Towards Reliable Retrieval in RAG Systems for Large Legal Datasets [Emerging]. Honest limit: it improved the retrieval step only; it did not prove the final answer was more truthful.
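The gaps list on this page names the technique this study evaluated as summary-augmented chunking. The following is a minimal sketch of the idea, assuming it means prefixing every chunk with a short summary of its parent document before indexing, so that near-identical boilerplate chunks from different documents stay distinguishable. The helper names, the chunk size, the document, and the hand-written summary are assumptions for illustration; the paper's actual pipeline is not reproduced here.

```python
def chunk_words(text, size=40):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def summary_augmented_chunks(doc_id, summary, text, size=40):
    """Prefix every chunk with the same document-level summary (assumed form)."""
    return [
        {"doc_id": doc_id, "chunk_no": n, "text": f"[Document summary: {summary}]\n{chunk}"}
        for n, chunk in enumerate(chunk_words(text, size))
    ]


# Invented example document: 96 words of repetitive boilerplate.
boilerplate = " ".join(["The parties agree to keep all information confidential."] * 12)

chunks = summary_augmented_chunks(
    doc_id="nda-acme-2021",  # hypothetical identifier
    summary="Mutual NDA between Acme Ltd and Borealis LLP, governed by English law.",
    text=boilerplate,
)
for c in chunks:
    print(c["doc_id"], c["chunk_no"], len(c["text"].split()), "words")
```

In this sketch the summary is written by hand; in practice it would have to come from somewhere, and its quality becomes part of what is being evaluated. As the honest limit says, a change like this targets which document is retrieved and makes no claim about the truthfulness of the final answer.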
L3 · Summarising reliably — grounding & hallucination
- Dahl, Magesh, Suzgun & Ho, Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models [Strongly evidenced]
- Magesh et al., Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools [Strongly evidenced]
- Chalkidis et al., LEGAL-BERT: The Muppets straight out of Law School; and LexGLUE [Strongly evidenced]
L4 · Working at scale — agents & connectors
- Fan et al., Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [Emerging]. Honest limit: a single v1 preprint on deterministic tools and single-agent chains.
L5 · Reasoning across — multi-agent evaluation
- You et al., A Survey on Agent-as-a-Judge [Emerging]
- Li et al., Leveraging LLMs as Meta-Judges [Emerging]
- JurisMMA, a six-agent courtroom for legal judgment [Emerging]
L6 · Improving the loop — reliability through iteration
- The Six Sigma Agent [Vendor-affiliated]. Honest limit: the correction → rule → skill loop is a working engineering practice, not a peer-reviewed result. Hold it as a pattern, anchored to the reliability-through-iteration literature.
Adjacent — speech & document capture
These are not taught as public lessons (they need real models, which are kept for the internal build), but they sit on the same spine and are useful if you digitise audio or scans:
- Whisper: Courtside Edition; Seeing Justice Clearly; OCR or Not [Mixed]
Broader landscape
- Guha et al., LegalBench
- Large Language Models Meet Legal Artificial Intelligence: A Survey
The honest gaps
For a sceptical reader these disclosures are the credibility. Where the course shows a technique the literature does not yet fully support, it says so — here and in the lessons themselves.
- Foundations sit outside this source base. No transformer/attention paper is part of the verified set; the architecture is presented as established background, not a course finding.
- "Lost in the middle" is real, but its shape is debated (2025–26) and is model- and length-dependent. There is no legal-domain study of it, so the legal relevance is argued by analogy.
- Better retrieval is not the same as a truer answer. Summary-augmented chunking cut wrong-document retrieval, but was not shown to reduce final-answer hallucination.
- The commercial-tool hallucination figures are point-in-time. They test 2023–24 product versions, vendors dispute them, and "hallucination-free" marketing claims are contradicted by the peer-reviewed numbers — present them as a moment, not a verdict on any tool today.
- The hard evidence on citation grounding is non-common-law (Ukrainian law) and vendor-affiliated; no US/UK citation-graph study is in the source base.
- The viral multi-agent "reliability multipliers" (e.g. 17×) are not verifiable here — they reach us only relayed through a marketing blog citing papers not in the source base. Do not present them as established.
- Legal multi-agent judges are barely studied. The moot-panel design leans on general-domain work; named legal systems appear only as one-line citations inside a survey.
- The improvement loop is practice, not theory. "Correction → rule → skill" is how these tools are actually built, anchored to reliability-through-iteration — not a published result in its own right.
- Courtroom/legal speech recognition is essentially untested in this source base; strong word-error numbers are general or forensic speech, and OCR-free models fabricate on handwritten legal material.
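The word-error numbers mentioned in the last gap are conventionally reported as word error rate (WER): the word-level edit distance between a reference transcript and the recogniser's output, divided by the number of reference words. The sketch below is a minimal version, with lowercasing and whitespace splitting as simplifying assumptions (real evaluations normalise text more carefully) and an invented example transcript.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented transcript pair: one word dropped out of five.
wer = word_error_rate(
    "the defendant pleaded not guilty",
    "the defendant pleaded guilty",
)
print(f"WER = {wer:.2f}")
```

In this invented pair the single dropped word is "not", so the score alone does not show how much the meaning changed. That is one reason to treat general-speech word-error figures with care on legal material.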
How this course runs
Every cell you ran executed in your own browser: the Python interpreter compiled to WebAssembly (Pyodide), with the real tiktoken tokeniser and its vocabulary served from this site, never from a third-party server. There is no backend and no login, and nothing about you is sent to or stored on a server: your progress lives only in this browser's local storage, which is why the toolkit has Save and Load buttons rather than an account. Where a lesson needed a live model or a connector, the output was pre-baked and the connector mocked. That is by design, so the page stays static, private, and safe to leave running in a law school. The agentic skills themselves are Claude Code practice; this course is the gentle, backend-free on-ramp to it.
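The pre-baked-output, mocked-connector pattern described above can be sketched in a few lines. This is an illustration of the general pattern under stated assumptions, not the course's actual code: the class name `MockConnector`, the request string, and the canned answer are all hypothetical.

```python
class MockConnector:
    """Stand-in for a live model or connector: replays pre-baked outputs only."""

    def __init__(self, prebaked):
        self._prebaked = dict(prebaked)  # request -> canned output
        self.calls = []                  # record of requests, useful for inspection

    def run(self, request):
        self.calls.append(request)
        if request not in self._prebaked:
            # Fail loudly rather than fall back to any network call.
            raise KeyError(f"no pre-baked output for request: {request!r}")
        return self._prebaked[request]


# Invented request and output, for illustration only.
PREBAKED = {"summarise: lease.txt": "Twelve-month lease; rent reviewed annually."}

connector = MockConnector(PREBAKED)
answer = connector.run("summarise: lease.txt")
print(answer)
```

Because the mock can only replay what it was given, nothing leaves the page at run time. The trade-off is that it demonstrates the shape of an interaction, not the behaviour of a live model.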
Going further: take-home experiments that reproduce each finding in a live Claude or Gemini session.