LLM City and Tokens
Exploring LLM City and its unique currency
Welcome to LLM City
On a bright sunny day, an excited kid walked into LLM City. Everything looked familiar — rides, lights, laughter — but something felt… different.
He ran to the counter and asked:
“How much for roller coaster ride?”
The clerk didn’t answer. He took the sentence, stamped it, and slid it back:
≈ 7 tokens
The kid tilted his head. “Seven… what?” The clerk pointed beyond the gate and said softly: “If you want to ride, you’ll have to go through.”
The kid looked down at his sentence. For the first time, it didn’t feel simple anymore. He stepped forward.
LLM Land
One sentence enters. Tokens come out.
The Chopper & The Price Board
The sentence was broken apart:
How | much | for | roller | coaster | ride | ?
Each piece fell on its own. Small. Separate. Countable. Then, each one was picked up and assigned a number from a giant board:
How → 2
much → 3
for → 4
roller → 5
coaster → 6
ride → 7
? → 8
The clerk spoke: “This is how we keep track. Because numbers are all the system understands.”
What a Token Really Is
- Every token is mapped to a number
- That mapping is called a vocabulary
- If we were building a system, it would now understand 7 words
- The size of this vocabulary is fixed
So instead of: How much for roller coaster ride ?
The system now sees: 2 | 3 | 4 | 5 | 6 | 7 | 8
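In code, the price board is nothing more than a dictionary. A toy sketch using the story's made-up IDs (real models use far larger vocabularies):

```python
# Toy vocabulary: the story's made-up word → ID mapping.
vocab = {"How": 2, "much": 3, "for": 4, "roller": 5, "coaster": 6, "ride": 7, "?": 8}

def encode(words):
    """Look up each word's ID on the price board."""
    return [vocab[w] for w in words]

print(encode(["How", "much", "for", "roller", "coaster", "ride", "?"]))
# [2, 3, 4, 5, 6, 7, 8]
```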
What is a token?
Think of a token as a “chunk” of text. It can be a whole word, a part of a word (like “ing”), or even just a single punctuation mark. It’s the smallest unit of “currency” the model can spend to understand your sentence.
Handling the Mess: Tyops & BPE
I mean typos …
Real input isn’t always perfect.
Consider this: How much for roler coaster ride?
This exact sequence roler does not exist in the vocabulary. But the system does not stop; it breaks it down further:
roler → rol | er
- the system does not fail
- it finds smaller known pieces
- it keeps going
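That fallback can be sketched in a few lines. This toy version uses a greedy longest-match over an assumed set of known pieces; real BPE tokenizers apply learned merge rules instead, but the spirit is the same:

```python
# Assumed known pieces in our toy vocabulary (subwords included).
pieces = {"roller", "rol", "er"}

def split_unknown(word):
    """Greedily take the longest known piece from the left,
    shrinking until something matches (single chars as last resort)."""
    out = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest slice first
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])  # no known piece: fall back to one character
            i += 1
    return out

print(split_unknown("roler"))  # ['rol', 'er']
```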
In the real world
This isn’t just our story.
Try it yourself: https://huggingface.co/spaces/Xenova/the-tokenizer-playground
Try: How much for roller coaster ride
Then try: How much for roler coaster ride
Or even gibberish: r0l3rcoast3rzzz
You’ll notice:
- token count changes
- tokens split differently
- unknown words get broken into smaller parts
This has a name: Byte Pair Encoding (BPE).
Why this works
The system does not store every word.
That would be impossible.
Instead:
- it stores common pieces
- it builds words from those pieces
- those pieces make up the system’s vocabulary
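How are those common pieces found? BPE repeatedly counts adjacent pairs across a corpus and merges the most frequent pair into a new piece. Here is just the counting step, on a made-up four-word corpus:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words — the core
    counting step BPE repeats to decide which pieces to merge."""
    pairs = Counter()
    for word in words:
        symbols = list(word)
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0]

corpus = ["roller", "roller", "ride", "rider"]
print(most_frequent_pair(corpus))  # (('e', 'r'), 3)
```

Run this merge-and-recount loop thousands of times and frequent chunks like “er” or “ing” earn their own vocabulary slots.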
Rocket Fuel: Vocabulary Size
Think of vocabulary size as a rocket.
Small vocabulary → small rocket
It knows fewer pieces. Runs out of fuel early.
Large vocabulary → bigger rocket
More pieces. More fuel. Goes farther.
So the real question: which model has enough fuel to reach the moon?
My LLM, My Rules
Same sentence. Different worlds. Different tickets.
A GPT token is a currency that has no value in Claude’s world. Each model has its own secret code.
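A toy illustration — the vocabularies and IDs below are entirely made up, but the point holds for real models: each tokenizer assigns its own, unrelated numbers.

```python
# Entirely made-up vocabularies for two hypothetical models.
model_a_vocab = {"roller": 5021, "coaster": 882, "ride": 4410}
model_b_vocab = {"roller": 17, "coaster": 93044, "ride": 205}

word = "ride"
print(model_a_vocab[word], model_b_vocab[word])
# Same word, two unrelated ticket numbers — tokens don't transfer between worlds.
```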
From Numbers to Meaning
The kid stared at the board: 2 | 3 | 4 | 5 | 6 | 7 | 8.
“They’re just numbers,” he said.
The clerk nodded. “Exactly. And numbers alone don’t mean anything. If we stopped here, the computer would think 3 is ‘closer’ to 4 than it is to 7 just because of the math. But in the world of ideas, that’s not true.”
The kid frowned.
“But that’s not how words work.”
The clerk smiled. “Right. So we fix it.”
Why numbers are not enough
In a model, token IDs are just locker numbers. They are labels, not definitions.
- The order is arbitrary.
- The meaning is missing.
- The number 2 isn’t “similar” to 3 in any way that matters.
Idea 1: Bag of words
The kid looked at the numbers: 2 | 3 | 4 | 5 | 6 | 7 | 8
“So,” he said, “I just give the machine this list of numbers?”
The clerk shook his head.
In the early days, we threw words into a bag and just counted them.
The kid frowned. “A bag?”
The clerk grabbed the sentence:
How much for roller coaster ride?
Then he dropped the words into a sack:
Bag of Words
Order falls away. Counts remain.
No first word.
No last word.
No sentence shape.
Just counts.
The problem
“The dog bit the man” and “The man bit the dog” look exactly the same to a bag. The meaning is lost.
The kid nodded.
Both bags contain the same words. The meanings are completely different, but the system is blind to the distinction.
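In Python, a bag is literally a `Counter`, and the two sentences collapse into the same bag:

```python
from collections import Counter

# Two sentences with opposite meanings...
bag1 = Counter("the dog bit the man".split())
bag2 = Counter("the man bit the dog".split())

print(bag1)          # word → count; order is gone
print(bag1 == bag2)  # True — the bag cannot tell them apart
```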
Idea 2: One-Hot Encoding
Next, we tried switchboards. Every word got its own light.
One light on. The rest stay dark.
The clerk leaned over the board. “Remember that rocket we saw earlier? BERT was among the smallest with a vocabulary of 30,000 words.”
The kid nodded.
“Now imagine this switchboard isn’t just 7 lights long. Imagine it has 30,000 lights. To say the word roller, you have to turn on one single light and leave 29,999 of them dark.”
The kid’s eyes widened. “For every single word?”
“For every single word,” the clerk confirmed.
“And here is the problem: because every word has its own isolated light, the computer thinks roller is exactly as ‘close’ to ride as it is to How. There are no neighborhoods. No friends. No vibes.”
For the most powerful rocket, with a 256,000-word vocabulary, each word’s island is even more isolated.
One-Hot Encoding = a massive checklist. It knows who is present, but it doesn’t know who is related.
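A quick sketch of the switchboard, and of why it has no neighborhoods — the overlap (dot product) between any two different words is always zero:

```python
def one_hot(token_id, vocab_size):
    """One light on; all the other lights stay dark."""
    vec = [0] * vocab_size
    vec[token_id] = 1
    return vec

roller = one_hot(5, 30_000)  # 1 light on, 29,999 dark
ride = one_hot(7, 30_000)

overlap = sum(a * b for a, b in zip(roller, ride))
print(overlap)  # 0 — every pair of distinct words is equally unrelated
```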
Idea 3: Embeddings (Number → Vector)
We started with How much for roller coaster ride ?
We turned words into numbers:
How → 2
much → 3
for → 4 …
Now we go one step further.
Each number becomes a fixed-length array of 5 numbers:
2 → [0.2, -0.4, 0.9, 0.1, -0.7]
3 → [-0.1, 0.7, -0.3, 0.5, 0.2]
4 → [0.0, -0.2, 0.6, -0.8, 0.4]
This array has a name: It’s a vector
Think of a vector like a GPS coordinate for a word’s meaning.
- In 1D: You only know how far apart numbers are (3 - 2 = 1).
- In 5D (or 1,000D!): You can measure “closeness” across many different traits at once.
Now, the computer can finally ask: “How close are these two ideas?”
Instead of just comparing IDs, it looks at the space between them. Words that feel the same—like “Apple” and “Orange”—will have vectors that sit right next to each other in this invisible galaxy.
Apple - Is ‘Apple’ a fruit or a tech company? Context helps the vector find the right neighborhood.
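Closeness in this galaxy is usually measured with cosine similarity. A minimal sketch, with made-up 5-number vectors (not from any real model):

```python
import math

def cosine(u, v):
    """Cosine similarity: near 1.0 = same direction, near 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, just to show the measurement.
apple  = [0.9, 0.1, 0.3, 0.0, 0.2]
orange = [0.8, 0.2, 0.4, 0.1, 0.2]
rocket = [-0.5, 0.9, -0.3, 0.7, -0.6]

print(cosine(apple, orange) > cosine(apple, rocket))  # True — fruits sit closer
```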
Embedding size
For our vector, we used 5 numbers to describe a word. That count is called the embedding size (or dimension). It represents how much “room” the model has to store the nuance of a word.
But real models need a lot more than five coordinates to understand the complexity of human language.
| Model Scale | Analogy | Result |
|---|---|---|
| Small (128–512) | Pocket Map | Fast and light, but misses subtle “vibes” or sarcasm. |
| Standard (768–1024) | High-Res GPS | Finds the right “neighborhood” for complex words (like BERT). |
| Massive (4096+) | Topographical Map | Captures every tiny detail and nuance of human thought (like GPT-4). |
Vocabulary size is the dictionary. Embedding size is the detail inside each definition.
Short summary
Here is where we are so far
Sentence → IDs → Vectors → Embedding matrix
How much for roller coaster ride ? → 2 | 3 | 4 | 5 | 6 | 7 | 8
Each ID maps to a vector:
2 → [0.2, -0.4, 0.9, 0.1, -0.7]
3 → [-0.1, 0.7, -0.3, 0.5, 0.2]
4 → [0.0, -0.2, 0.6, -0.8, 0.4]
5 → [0.5, 0.1, -0.2, 0.9, -0.1]
6 → [0.4, 0.2, -0.1, 0.8, -0.2]
7 → [-0.3, 0.6, 0.4, -0.1, 0.5]
8 → [0.1, -0.1, 0.8, 0.2, -0.9]
Together, the vectors stack into an embedding matrix of 7 tokens × 5 dimensions:

| Token | d1 | d2 | d3 | d4 | d5 |
|---|---|---|---|---|---|
| 2 | 0.2 | -0.4 | 0.9 | 0.1 | -0.7 |
| 3 | -0.1 | 0.7 | -0.3 | 0.5 | 0.2 |
| 4 | 0.0 | -0.2 | 0.6 | -0.8 | 0.4 |
| 5 | 0.5 | 0.1 | -0.2 | 0.9 | -0.1 |
| 6 | 0.4 | 0.2 | -0.1 | 0.8 | -0.2 |
| 7 | -0.3 | 0.6 | 0.4 | -0.1 | 0.5 |
| 8 | 0.1 | -0.1 | 0.8 | 0.2 | -0.9 |
The model now sees the entire sentence as one matrix.
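Building that matrix is a plain lookup — one row per token ID, using the story's toy table:

```python
# The story's toy embedding table: token ID → 5-number vector.
embeddings = {
    2: [0.2, -0.4, 0.9, 0.1, -0.7],
    3: [-0.1, 0.7, -0.3, 0.5, 0.2],
    4: [0.0, -0.2, 0.6, -0.8, 0.4],
    5: [0.5, 0.1, -0.2, 0.9, -0.1],
    6: [0.4, 0.2, -0.1, 0.8, -0.2],
    7: [-0.3, 0.6, 0.4, -0.1, 0.5],
    8: [0.1, -0.1, 0.8, 0.2, -0.9],
}

ids = [2, 3, 4, 5, 6, 7, 8]
matrix = [embeddings[i] for i in ids]  # one row per token
print(len(matrix), len(matrix[0]))     # 7 5  →  a 7 × 5 matrix
```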
Special tokens
The Problem: The machine wants the same-sized boxes
The kid asked what if someone only says “How much”?
“That one only has 2 tokens. So its matrix is 2 × 5?”
The clerk nodded.
“Right. And that creates a problem.”
The LLM is like a conveyor belt.
It does not want random box sizes.
It wants every sentence lined up the same way.
Sentence A → 7 × 5
Sentence B → 2 × 5
That is messy.
So we choose a fixed length, say 10.
Padding (PAD)
Our sentence has 7 token IDs.
[2, 3, 4, 5, 6, 7, 8]
But the machine expects 10.
So we add empty tokens at the end.
[2, 3, 4, 5, 6, 7, 8, 0, 0, 0]
Those extra 0s are called padding.
They mean:
“Nothing here. Just keeping the shape.”
The model knows to ignore these positions when it computes meaning.
10 tokens × 5 dimensions = a 10 × 5 embedding matrix
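Padding and the “ignore these” signal can be sketched together — real frameworks call the second list an attention mask:

```python
PAD = 0
MAX_LEN = 10

def pad_to_length(ids, max_len=MAX_LEN):
    """Pad with 0s to a fixed length; the mask marks real (1) vs empty (0)."""
    n_empty = max_len - len(ids)
    padded = ids + [PAD] * n_empty
    mask = [1] * len(ids) + [0] * n_empty
    return padded, mask

padded, mask = pad_to_length([2, 3, 4, 5, 6, 7, 8])
print(padded)  # [2, 3, 4, 5, 6, 7, 8, 0, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```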
Special Numbers
Models reserve some IDs for special jobs.
0 → PAD → empty seat
1 → UNK → unknown word
101 → START → start marker (BERT-style)
102 → END → end / separator
These are not normal words.
They are instructions for the machine. Different models define different sets of special tokens.
Notice UNK.
If you type gibberish—something the model has never seen—it gets mapped to:
1 → UNK
Which simply means:
“I don’t recognize this.”
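In a word-level toy like ours, the UNK fallback is a one-line dictionary default (real subword tokenizers usually break gibberish into known pieces first, so they rarely emit UNK):

```python
UNK = 1
vocab = {"How": 2, "much": 3, "for": 4, "roller": 5, "coaster": 6, "ride": 7, "?": 8}

def to_ids(words):
    """Anything the vocabulary doesn't recognize becomes UNK (1)."""
    return [vocab.get(w, UNK) for w in words]

print(to_ids(["How", "much", "for", "r0l3rcoast3rzzz", "?"]))
# [2, 3, 4, 1, 8]
```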
One Last Thing: Who Decides These Numbers?
The kid paused.
“Wait… who assigned these IDs and vectors?”
The clerk smiled.
“The model learned them.”
Before you ever typed a sentence, the LLM was trained on massive amounts of text.
During that training:
- Words (or pieces of words) were assigned token IDs
- Each token ID was mapped to a vector
- Those vectors were adjusted again and again until similar words ended up close together
"king" → [ ... ]
"queen" → [ ... ] ← ends up nearby
"apple" → [ ... ] ← far away (different meaning)
No human hand-picks these numbers.
One More Leap: From Vectors to Memory
The kid pointed at the vectors.
“So these numbers… can we store them somewhere?”
The clerk nodded.
“That’s exactly what we do.”
Every word, sentence, or document can be turned into a vector.
And once you have vectors, you can store them in something called a vector database.
"What is a roller coaster?" → [ ... ]
"Theme park ride cost" → [ ... ]
"Apple pie recipe" → [ ... ]
Instead of searching by exact words…
We search by meaning.
Similar meaning → vectors close together
Different meaning → vectors far apart
So when you ask:
"How much is a ride?"
The system doesn’t just look for exact matches.
It finds vectors that are nearby in this space.
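A miniature vector search, with made-up 3-number vectors standing in for real embeddings — the query matches the closest stored meaning, not exact words:

```python
import math

def cosine(u, v):
    """Cosine similarity: higher = closer in meaning."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# A tiny "vector database": made-up vectors for three documents.
db = {
    "What is a roller coaster?": [0.9, 0.1, 0.4],
    "Theme park ride cost":      [0.8, 0.2, 0.5],
    "Apple pie recipe":          [-0.3, 0.9, -0.1],
}

def search(query_vec):
    """Return the stored text whose vector is closest to the query."""
    return max(db, key=lambda text: cosine(query_vec, db[text]))

query = [0.75, 0.25, 0.55]  # pretend embedding of "How much is a ride?"
print(search(query))        # Theme park ride cost
```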
Souvenirs from LLM City
Quiz
86% of people love a quiz after learning! Are you one of them?
Each correct answer gives you 10 points. A second try gives 5.