Lesson 0 · Foundations
Foundations — the model reads tokens, not your PDF
A token is the unit a language model actually works in. Before the model sees a word of your text, a tokeniser chops it into pieces drawn from a fixed vocabulary: whole common words, fragments such as indemn + ity, and single characters for anything unusual. Each piece is then mapped to a number, and those numbers are all the model ever reads. This is what “the model reads tokens, not your PDF” means literally: your formatting, your page breaks and your highlighting are gone; what remains is a stream of sub-word symbols, like the compact signs Tiro fixed for courtroom Latin two thousand years ago.
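To make the chopping concrete, here is a minimal sketch of a toy tokeniser. It is not how the real one works internally: production tokenisers are typically built with byte-pair encoding and learned merge rules, whereas this sketch swaps in a simple greedy longest-match against a tiny vocabulary. `TOY_VOCAB`, its token numbers and `toy_encode` are all invented for illustration.

```python
# Hypothetical toy vocabulary: piece -> token number.
# A real vocabulary is vastly larger; these entries and numbers are made up.
TOY_VOCAB = {
    "The": 0, " that": 1, " the": 2, " is": 3, " clause": 4,
    " indemn": 5, "ity": 6, " unen": 7, "force": 8, "able": 9,
}


def toy_encode(text, vocab=TOY_VOCAB):
    """Split text greedily into the longest vocabulary pieces available.

    Anything the vocabulary does not cover falls back to single characters.
    """
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until an entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit one character on its own.
            pieces.append(text[i])
            i += 1
    return pieces


pieces = toy_encode("The indemnity clause is unenforceable")
print(pieces)
# → ['The', ' indemn', 'ity', ' clause', ' is', ' unen', 'force', 'able']
print([TOY_VOCAB[p] for p in pieces])  # the numbers the toy "model" would read
# → [0, 5, 6, 4, 3, 7, 8, 9]
print(toy_encode("§2"))  # nothing matches, so each character stands alone
# → ['§', '2']
```

Five words become eight pieces here because "indemnity" and "unenforceable" are not in the toy vocabulary as whole words. The same effect, at a much larger scale, is what drives token counts up for specialist text.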
The cell below runs on its own the first time — it counts the tokens in the sentence and prints the first few pieces so you can see where the cuts fall. Now do the thing this course is built on: change the input and watch the output move. Replace the sentence in the top box with one of your own — a line from a judgment, a clause from a contract, a passage thick with Latin or citations — nudge the number box to see more pieces, and press Run. The count is the number the model would charge against its finite window for that text.
Watch what looks right and what surprises you. Plain English tends to land near one token per short word; dense legal vocabulary and citations often splinter into two, three or more pieces apiece. That is the whole point, and the thing the closing question asks you to carry forward.
# You don't need to change the code — just edit the boxes above and press Run.
# (It runs on its own the first time, so you can see what it does straight away.)
# `enc` is a real tokeniser — the same one GPT-4o-era models use to read text.
# `text` is your sentence. .encode() turns it into the token-numbers the model sees:
tokens = enc.encode(text)
# How many tokens your sentence became. A model's context window holds only so many,
# so this number is, quite literally, what your text "costs" the model:
print(len(tokens), "tokens")
# `n` is the number in the second box. Turn the first n token-numbers back into the
# text pieces they stand for, so you can see exactly where the tokeniser cut your words:
print("pieces:", [enc.decode([t]) for t in tokens[:n]])
Expected output:
13 tokens
pieces: ['The', ' appellant', ' submits', ' that', ' the', ' indemn', 'ity', ' clause', ' is', ' unen', 'force', 'able']
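To see which words splintered, the pieces in the expected output can be regrouped into words. The sketch below is a rough heuristic, not part of the lesson's cell: it assumes a piece that begins with a space starts a new word, which holds for this output but would attach punctuation to the preceding word. `group_into_words` is a hypothetical helper name.

```python
# Pieces copied from the lesson's expected output (the first 12 of 13 tokens).
pieces = ['The', ' appellant', ' submits', ' that', ' the', ' indemn', 'ity',
          ' clause', ' is', ' unen', 'force', 'able']


def group_into_words(pieces):
    """Group pieces into words, starting a new word at each leading space."""
    words = []
    for piece in pieces:
        if piece.startswith(" ") or not words:
            words.append([piece])      # this piece opens a new word
        else:
            words[-1].append(piece)    # this piece continues the current word
    return words


words = group_into_words(pieces)
print(len(words), "words from", len(pieces), "pieces")
# → 9 words from 12 pieces

for parts in words:
    if len(parts) > 1:
        print("".join(parts).strip(), "->", len(parts), "pieces")
# → indemnity -> 2 pieces
# → unenforceable -> 3 pieces
```

The everyday words each cost one token; the two legal terms account for all the extra pieces.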
Think
A 40-document case bundle can swallow far more of the window than its page count suggests. Knowing that specialist legal terms ('indemnity', 'estoppel', a neutral citation) often fracture into several tokens each, where in your own practice would token cost quietly outrun the number of pages you think you are handing over?
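One rough way to start on that question is back-of-envelope arithmetic. The sketch below is only an estimate under loud assumptions: the page count, words per page, tokens-per-word ratios and window size are all made-up placeholders, not measurements of any real bundle or model. Substitute figures from your own documents and from a real token count.

```python
def estimate_window_share(pages, words_per_page, tokens_per_word, window_tokens):
    """Return (estimated tokens, fraction of the context window they occupy)."""
    tokens = pages * words_per_page * tokens_per_word
    return tokens, tokens / window_tokens


# Every number below is a hypothetical placeholder chosen for illustration.
for label, ratio in [("plain prose", 1.3), ("citation-heavy", 2.0)]:
    tokens, share = estimate_window_share(
        pages=400,               # assumed total pages across the bundle
        words_per_page=350,      # assumed average words per page
        tokens_per_word=ratio,   # assumed splitting rate for this kind of text
        window_tokens=200_000,   # assumed context window size
    )
    print(f"{label}: about {tokens:,.0f} tokens, {share:.0%} of the window")
```

The page count is identical in both runs; only the assumed splitting rate changes, and that alone decides whether the bundle fits.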