Lesson 0 · Foundations

Foundations — the model reads tokens, not your PDF

A page from a ninth-century manuscript listing rows of small Tironian shorthand signs beside their Latin meanings, headed by large decorated initials.
A page of the Commentarii notarum Tironianarum, a 9th-century Carolingian copy of the shorthand of Marcus Tullius Tiro — Cicero's freedman secretary. To catch courtroom Latin at the speed of argument (Plutarch has scribes recording Cato's Senate speech against Catiline in 63 BC), Tiro fixed spoken language to an inventory of several thousand compact signs. A tokeniser does the same to your case bundle two thousand years later: language broken into a finite set of sub-word symbols a machine records and replays. The model never reads your PDF — only the notes. Source: Wikimedia Commons · public domain.

A token is the unit a language model actually works in. Before the model sees a word of your text, a tokeniser chops it into pieces drawn from a fixed vocabulary: whole common words, fragments such as indemn + ity, and single characters for anything unusual. Each piece is then replaced by its ID number in that vocabulary, and those numbers are all the model ever reads. This is what “the model reads tokens, not your PDF” means literally: your formatting, your page breaks and your highlighting are gone; what remains is a stream of sub-word symbols, like the compact signs Tiro fixed for courtroom Latin two thousand years ago.
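To make the chopping step concrete, here is a minimal sketch of a toy tokeniser. It is not how the real tokeniser in this lesson works: production tokenisers learn their vocabulary from data using byte-pair encoding, whereas this sketch simply takes the longest matching piece at each position. The vocabulary, the ID numbers and the fallback scheme for uncovered characters are all invented for illustration.

```python
# Toy illustration only. The vocabulary below is invented for this sketch;
# real tokenisers learn vocabularies of many thousands of pieces from data.
VOCAB_PIECES = ["the", " the", " clause", " that", " indemn", "ity", " is"]
VOCAB = {piece: token_id for token_id, piece in enumerate(VOCAB_PIECES)}

# Hypothetical fallback: an uncovered character gets 1000 + its code point.
FALLBACK_BASE = 1000


def tokenise(text, vocab):
    """Split text left to right, taking the longest vocabulary piece each time.

    Where no vocabulary piece fits, fall back to a single character.
    """
    longest = max(len(piece) for piece in vocab)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocab:
                pieces.append(candidate)
                i += size
                break
        else:
            # No piece matched at this position: emit one raw character.
            pieces.append(text[i])
            i += 1
    return pieces


def to_ids(pieces, vocab):
    """Replace each piece with its ID number, which is all a model would read."""
    return [vocab[p] if p in vocab else FALLBACK_BASE + ord(p) for p in pieces]


pieces = tokenise("the indemnity clause §2", VOCAB)
ids = to_ids(pieces, VOCAB)
print(pieces)
print(ids)
```

In this toy, the common word survives whole, the specialist word breaks into two fragments, and the section sign falls back to single characters, which mirrors the three cases described above.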

The cell below runs on its own the first time: it counts the tokens in the sentence and prints the first few pieces so you can see where the cuts fall. Now do the thing this course is built on: change the input and watch the output move. Replace the sentence in the top box with one of your own (a line from a judgment, a clause from a contract, a passage thick with Latin or citations), raise the number in the second box to see more pieces, and press Run. The count is how many tokens that text would take up in the model's finite context window.

Watch what looks right and what surprises you. Plain English tends to land near one token per short word; dense legal vocabulary and citations often splinter into several pieces apiece. That is the whole point, and it is what the closing question asks you to carry forward.

✏️ your sentence
✏️ how many pieces to show
the code
# You don't need to change the code — just edit the boxes above and press Run.
# (It runs on its own the first time, so you can see what it does straight away.)

# `enc` is a real tokeniser, the same one GPT-4o-era models use to read text.
# The page sets up `enc`, `text` and `n` before this cell runs, so they are not defined here.
# `text` is your sentence. .encode() turns it into the token-numbers the model sees:
tokens = enc.encode(text)

# How many tokens your sentence became. A model's context window holds only so many,
# so this number is, quite literally, what your text "costs" the model:
print(len(tokens), "tokens")

# `n` is the number in the second box. Turn the first n token-numbers back into the
# text pieces they stand for, so you can see exactly where the tokeniser cut your words:
print("pieces:", [enc.decode([t]) for t in tokens[:n]])
Expected output
13 tokens
pieces: ['The', ' appellant', ' submits', ' that', ' the', ' indemn', 'ity', ' clause', ' is', ' unen', 'force', 'able']
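
The pieces printed above can also be regrouped into words to see which ones splintered. The sketch below assumes the convention visible in that output, namely that a piece with a leading space starts a new word; the helper name `group_into_words` is made up for this example. A punctuation piece with no leading space would be attached to the word before it, so treat the result as a rough guide.

```python
# The pieces from the expected output above (the first 12 of the 13 tokens).
pieces = ['The', ' appellant', ' submits', ' that', ' the', ' indemn', 'ity',
          ' clause', ' is', ' unen', 'force', 'able']


def group_into_words(pieces):
    """Group pieces into words: a piece with a leading space starts a new word."""
    words = []
    for piece in pieces:
        if piece.startswith(" ") or not words:
            words.append([piece])
        else:
            words[-1].append(piece)
    return words


words = group_into_words(pieces)
for parts in words:
    word = "".join(parts).strip()
    print(f"{word:15} {len(parts)} piece(s)")

# Keep only the words that needed more than one piece.
split_words = {"".join(parts).strip(): len(parts) for parts in words if len(parts) > 1}
print(split_words)
```

Here nine words account for the twelve pieces shown, and the two specialist terms carry all of the extra cost.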

Think

A 40-document case bundle can swallow far more of the window than its page count suggests. Knowing that specialist legal terms ('indemnity', 'estoppel', a neutral citation) often fracture into several tokens each, where in your own practice would token cost quietly outrun the number of pages you think you are handing over?
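
One rough way to put numbers on that question is sketched below. Every figure in it is a hypothetical placeholder rather than a measurement: the pages per document, the tokens per page and the window size are assumptions to replace with counts from your own documents and the documented limit of the model you use.

```python
def window_share(total_tokens, window_tokens):
    """Fraction of the context window that a text of total_tokens would occupy."""
    return total_tokens / window_tokens


# All figures below are hypothetical placeholders, not measurements.
documents = 40            # the bundle size from the question above
pages_per_document = 10   # assumed
window_tokens = 128_000   # assumed window size; check your model's documentation

# Assumed densities: lighter prose versus citation-heavy pages.
for tokens_per_page in (500, 800):
    total = documents * pages_per_document * tokens_per_page
    share = window_share(total, window_tokens)
    print(f"{tokens_per_page} tokens/page: {total:,} tokens, {share:.0%} of the window")
```

The page count is identical in both runs; only the assumed density changes, which is the effect the question asks you to look for.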