LLM City and Tokens
Exploring LLM City and its unique currency
Welcome to LLM City
On a bright sunny day, an excited kid walked into LLM City. Everything looked familiar — rides, lights, laughter — but something felt… different.
He ran to the counter and asked:
“How much for roller coaster ride?”
The clerk didn’t answer. He took the sentence, stamped it, and slid it back:
≈ 7 tokens
The kid tilted his head. “Seven… what?” The clerk pointed beyond the gate and said softly: “If you want to ride, you’ll have to go through.”
The kid looked down at his sentence. For the first time, it didn’t feel simple anymore. He stepped forward.
LLM Land
One sentence enters. Tokens come out.
The Chopper & The Price Board
The sentence was broken apart:
How | much | for | roller | coaster | ride | ?
Each piece fell on its own. Small. Separate. Countable. Then, each one was picked up and assigned a number from a giant board:
How → 2
much → 3
for → 4
roller → 5
coaster → 6
ride → 7
? → 8
The clerk spoke: “This is how we keep track. Because numbers are all the system understands.”
What a Token Really Is
- Every token is mapped to a number
- That mapping is called a vocabulary
- If we were building a system, it would now understand 7 words
- The size of this vocabulary is fixed
So instead of: How much for roller coaster ride ?
The system now sees: 2 | 3 | 4 | 5 | 6 | 7 | 8
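In code, the price board is nothing more than a dictionary. A toy sketch using the story's made-up IDs (real models use far larger vocabularies):

```python
# Toy vocabulary: the story's made-up word → ID mapping.
vocab = {"How": 2, "much": 3, "for": 4, "roller": 5, "coaster": 6, "ride": 7, "?": 8}

def encode(words):
    """Look up each word's ID on the price board."""
    return [vocab[w] for w in words]

print(encode(["How", "much", "for", "roller", "coaster", "ride", "?"]))
# [2, 3, 4, 5, 6, 7, 8]
```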
What is a token?
Think of a token as a “chunk” of text. It can be a whole word, a part of a word (like “ing”), or even just a single punctuation mark. It’s the smallest unit of “currency” the model can spend to understand your sentence.
Handling the Mess: Tyops & BPE
I mean typos …
Real input isn’t always perfect.
Consider this: How much for roler coaster ride?
This exact sequence roler does not exist in the vocabulary. But the system does not stop; it breaks it down further:
roler → rol | er
- the system does not fail
- it finds smaller known pieces
- it keeps going
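That fallback can be sketched in a few lines. This toy version uses a greedy longest-match over an assumed set of known pieces; real BPE tokenizers apply learned merge rules instead, but the spirit is the same:

```python
# Assumed known pieces in our toy vocabulary (subwords included).
pieces = {"roller", "rol", "er"}

def split_unknown(word):
    """Greedily take the longest known piece from the left,
    shrinking until something matches (single chars as last resort)."""
    out = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest slice first
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])  # no known piece: fall back to one character
            i += 1
    return out

print(split_unknown("roler"))  # ['rol', 'er']
```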
In the real world
This isn’t just our story.
Try it yourself: https://huggingface.co/spaces/Xenova/the-tokenizer-playground
Try: How much for roller coaster ride
Then try: How much for roler coaster ride
Or even gibberish: r0l3rcoast3rzzz
You’ll notice:
- token count changes
- tokens split differently
- unknown words get broken into smaller parts
This has a name: Byte Pair Encoding (BPE).
Why this works
The system does not store every word.
That would be impossible.
Instead:
- it stores common pieces
- it builds words from those pieces
- those pieces make up the system’s vocabulary
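How are those common pieces found? BPE repeatedly counts adjacent pairs across a corpus and merges the most frequent pair into a new piece. Here is just the counting step, on a made-up four-word corpus:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words — the core
    counting step BPE repeats to decide which pieces to merge."""
    pairs = Counter()
    for word in words:
        symbols = list(word)
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0]

corpus = ["roller", "roller", "ride", "rider"]
print(most_frequent_pair(corpus))  # (('e', 'r'), 3)
```

Run this merge-and-recount loop thousands of times and frequent chunks like “er” or “ing” earn their own vocabulary slots.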
Rocket Fuel: Vocabulary Size
Think of vocabulary size as a rocket.
Small vocabulary → small rocket
It knows fewer pieces. Runs out of fuel early.
Large vocabulary → bigger rocket
More pieces. More fuel. Goes farther.
So the real question: which model has enough fuel to reach the moon?
My LLM, My Rules
Same sentence. Different worlds. Different tickets.
A GPT token is a currency that has no value in Claude’s world. Each model has its own secret code.
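A toy illustration — the vocabularies and IDs below are entirely made up, but the point holds for real models: each tokenizer assigns its own, unrelated numbers.

```python
# Entirely made-up vocabularies for two hypothetical models.
model_a_vocab = {"roller": 5021, "coaster": 882, "ride": 4410}
model_b_vocab = {"roller": 17, "coaster": 93044, "ride": 205}

word = "ride"
print(model_a_vocab[word], model_b_vocab[word])
# Same word, two unrelated ticket numbers — tokens don't transfer between worlds.
```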
From Numbers to Meaning
The kid stared at the board: 2 | 3 | 4 | 5 | 6 | 7 | 8.
“They’re just numbers,” he said.
The clerk nodded. “Exactly. And numbers alone don’t mean anything. If we stopped here, the computer would think 3 is ‘closer’ to 4 than it is to 7 just because of the math. But in the world of ideas, that’s not true.”
The kid frowned.
“But that’s not how words work.”
The clerk smiled. “Right. So we fix it.”
Why numbers are not enough
In a model, token IDs are just locker numbers. They are labels, not definitions.
- The order is arbitrary.
- The meaning is missing.
- The number 2 isn’t “similar” to 3 in any way that matters.
Idea 1: Bag of words
The kid looked at the numbers: 2 | 3 | 4 | 5 | 6 | 7 | 8
“So,” he said, “I just give the machine this list of numbers?”
The clerk shook his head.
In the early days, we threw words into a bag and just counted them.
The kid frowned. “A bag?”
The clerk grabbed the sentence:
How much for roller coaster ride?
Then he dropped the words into a sack:
Bag of Words
Order falls away. Counts remain.
No first word.
No last word.
No sentence shape.
Just counts.
The problem
“The dog bit the man” and “The man bit the dog” look exactly the same to a bag. The meaning is lost.
The kid nodded.
Both bags contain the same words. The meanings are completely different, but the system is blind to the distinction.
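In Python, a bag is literally a `Counter`, and the two sentences collapse into the same bag:

```python
from collections import Counter

# Two sentences with opposite meanings...
bag1 = Counter("the dog bit the man".split())
bag2 = Counter("the man bit the dog".split())

print(bag1)          # word → count; order is gone
print(bag1 == bag2)  # True — the bag cannot tell them apart
```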
Idea 2: One-Hot Encoding
Next, we tried switchboards. Every word got its own light.
One light on. The rest stay dark.
The clerk leaned over the board. “Remember that rocket we saw earlier? BERT was among the smallest with a vocabulary of 30,000 words.”
The kid nodded.
“Now imagine this switchboard isn’t just 7 lights long. Imagine it has 30,000 lights. To say the word roller, you have to turn on one single light and leave 29,999 of them dark.”
The kid’s eyes widened. “For every single word?”
“For every single word,” the clerk confirmed.
“And here is the problem: because every word has its own isolated light, the computer thinks roller is exactly as ‘close’ to ride as it is to How. There are no neighborhoods. No friends. No vibes.”
For the most powerful rocket, with a 256,000-word vocabulary, each word’s island is even more isolated.
One-Hot Encoding = a massive checklist. It knows who is present, but it doesn’t know who is related.
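A quick sketch of the switchboard, and of why it has no neighborhoods — the overlap (dot product) between any two different words is always zero:

```python
def one_hot(token_id, vocab_size):
    """One light on; all the other lights stay dark."""
    vec = [0] * vocab_size
    vec[token_id] = 1
    return vec

roller = one_hot(5, 30_000)  # 1 light on, 29,999 dark
ride = one_hot(7, 30_000)

overlap = sum(a * b for a, b in zip(roller, ride))
print(overlap)  # 0 — every pair of distinct words is equally unrelated
```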
Idea 3: Embeddings (Number → Vector)
We started with How much for roller coaster ride ?
We turned words into numbers:
How → 2
much → 3
for → 4 …
Now we go one step further.
Each number becomes a fixed-length array of 5 numbers:
2 → [0.2, -0.4, 0.9, 0.1, -0.7]
3 → [-0.1, 0.7, -0.3, 0.5, 0.2]
4 → [0.0, -0.2, 0.6, -0.8, 0.4]
This array has a name: It’s a vector
Think of a vector like a GPS coordinate for a word’s meaning.
- In 1D: You only know how far apart numbers are (3 - 2 = 1).
- In 5D (or 1,000D!): You can measure “closeness” across many different traits at once.
Now, the computer can finally ask: “How close are these two ideas?”
Instead of just comparing IDs, it looks at the space between them. Words that feel the same—like “Apple” and “Orange”—will have vectors that sit right next to each other in this invisible galaxy.
Apple - Is ‘Apple’ a fruit or a tech company? Context helps the vector find the right neighborhood.
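Closeness in this galaxy is usually measured with cosine similarity. A minimal sketch, with made-up 5-number vectors (not from any real model):

```python
import math

def cosine(u, v):
    """Cosine similarity: near 1.0 = same direction, near 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, just to show the measurement.
apple  = [0.9, 0.1, 0.3, 0.0, 0.2]
orange = [0.8, 0.2, 0.4, 0.1, 0.2]
rocket = [-0.5, 0.9, -0.3, 0.7, -0.6]

print(cosine(apple, orange) > cosine(apple, rocket))  # True — fruits sit closer
```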
Embedding size
For our vector, we used 5 numbers to describe a word. That count is called the embedding size (or dimension). It represents how much “room” the model has to store the nuance of a word.
But real models need a lot more than five coordinates to understand the complexity of human language.
| Model Scale | Analogy | Result |
|---|---|---|
| Small (128–512) | Pocket Map | Fast and light, but misses subtle “vibes” or sarcasm. |
| Standard (768–1024) | High-Res GPS | Finds the right “neighborhood” for complex words (like BERT). |
| Massive (4096+) | Topographical Map | Captures every tiny detail and nuance of human thought (like GPT-4). |
Vocabulary size is the dictionary. Embedding size is the detail inside each definition.
Short summary
Here is where we are so far
Sentence → IDs → Vectors → Embedding matrix
How much for roller coaster ride ? → 2 | 3 | 4 | 5 | 6 | 7 | 8
Each ID maps to a vector:
2 → [0.2, -0.4, 0.9, 0.1, -0.7]
3 → [-0.1, 0.7, -0.3, 0.5, 0.2]
4 → [0.0, -0.2, 0.6, -0.8, 0.4]
5 → [0.5, 0.1, -0.2, 0.9, -0.1]
6 → [0.4, 0.2, -0.1, 0.8, -0.2]
7 → [-0.3, 0.6, 0.4, -0.1, 0.5]
8 → [0.1, -0.1, 0.8, 0.2, -0.9]
Together, the vectors stack into an embedding matrix of 7 tokens × 5 dimensions:

| Token | d1 | d2 | d3 | d4 | d5 |
|---|---|---|---|---|---|
| 2 | 0.2 | -0.4 | 0.9 | 0.1 | -0.7 |
| 3 | -0.1 | 0.7 | -0.3 | 0.5 | 0.2 |
| 4 | 0.0 | -0.2 | 0.6 | -0.8 | 0.4 |
| 5 | 0.5 | 0.1 | -0.2 | 0.9 | -0.1 |
| 6 | 0.4 | 0.2 | -0.1 | 0.8 | -0.2 |
| 7 | -0.3 | 0.6 | 0.4 | -0.1 | 0.5 |
| 8 | 0.1 | -0.1 | 0.8 | 0.2 | -0.9 |
The model now sees the entire sentence as one matrix.
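Building that matrix is a plain lookup — one row per token ID, using the story's toy table:

```python
# The story's toy embedding table: token ID → 5-number vector.
embeddings = {
    2: [0.2, -0.4, 0.9, 0.1, -0.7],
    3: [-0.1, 0.7, -0.3, 0.5, 0.2],
    4: [0.0, -0.2, 0.6, -0.8, 0.4],
    5: [0.5, 0.1, -0.2, 0.9, -0.1],
    6: [0.4, 0.2, -0.1, 0.8, -0.2],
    7: [-0.3, 0.6, 0.4, -0.1, 0.5],
    8: [0.1, -0.1, 0.8, 0.2, -0.9],
}

ids = [2, 3, 4, 5, 6, 7, 8]
matrix = [embeddings[i] for i in ids]  # one row per token
print(len(matrix), len(matrix[0]))     # 7 5  →  a 7 × 5 matrix
```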
Special tokens
The Problem: The machine wants the same-sized boxes
The kid asked what if someone only says “How much”?
“That one only has 2 tokens. So its matrix is 2 × 5?”
The clerk nodded.
“Right. And that creates a problem.”
The LLM is like a conveyor belt.
It does not want random box sizes.
It wants every sentence lined up the same way.
Sentence A → 7 × 5
Sentence B → 2 × 5
That is messy.
So we choose a fixed length, say 10.
Padding (PAD)
Our sentence has 7 token IDs.
[2, 3, 4, 5, 6, 7, 8]
But the machine expects 10.
So we add empty tokens at the end.
[2, 3, 4, 5, 6, 7, 8, 0, 0, 0]
Those extra 0s are called padding.
They mean:
“Nothing here. Just keeping the shape.”
The model knows to ignore these positions when it computes meaning.
10 tokens × 5 dimensions = a 10 × 5 embedding matrix
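Padding and the “ignore these” signal can be sketched together — real frameworks call the second list an attention mask:

```python
PAD = 0
MAX_LEN = 10

def pad_to_length(ids, max_len=MAX_LEN):
    """Pad with 0s to a fixed length; the mask marks real (1) vs empty (0)."""
    n_empty = max_len - len(ids)
    padded = ids + [PAD] * n_empty
    mask = [1] * len(ids) + [0] * n_empty
    return padded, mask

padded, mask = pad_to_length([2, 3, 4, 5, 6, 7, 8])
print(padded)  # [2, 3, 4, 5, 6, 7, 8, 0, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```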
Special Numbers
Models reserve some IDs for special jobs.
0 → PAD → empty seat
1 → UNK → unknown word
101 → START → start marker (BERT-style)
102 → END → end / separator
These are not normal words.
They are instructions for the machine. Different models define different sets of special tokens.
Notice UNK.
If you type gibberish—something the model has never seen—it gets mapped to:
1 → UNK
Which simply means:
“I don’t recognize this.”
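In a word-level toy like ours, the UNK fallback is a one-line dictionary default (real subword tokenizers usually break gibberish into known pieces first, so they rarely emit UNK):

```python
UNK = 1
vocab = {"How": 2, "much": 3, "for": 4, "roller": 5, "coaster": 6, "ride": 7, "?": 8}

def to_ids(words):
    """Anything the vocabulary doesn't recognize becomes UNK (1)."""
    return [vocab.get(w, UNK) for w in words]

print(to_ids(["How", "much", "for", "r0l3rcoast3rzzz", "?"]))
# [2, 3, 4, 1, 8]
```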
One Last Thing: Who Decides These Numbers?
The kid paused.
“Wait… who assigned these IDs and vectors?”
The clerk smiled.
“The model learned them.”
Before you ever typed a sentence, the LLM was trained on massive amounts of text.
During that training:
- Words (or pieces of words) were assigned token IDs
- Each token ID was mapped to a vector
- Those vectors were adjusted again and again until similar words ended up close together
"king" → [ ... ]
"queen" → [ ... ] ← ends up nearby
"apple" → [ ... ] ← far away (different meaning)
No human hand-picks these numbers.
One More Leap: From Vectors to Memory
The kid pointed at the vectors.
“So these numbers… can we store them somewhere?”
The clerk nodded.
“That’s exactly what we do.”
Every word, sentence, or document can be turned into a vector.
And once you have vectors, you can store them in something called a vector database.
"What is a roller coaster?" → [ ... ]
"Theme park ride cost" → [ ... ]
"Apple pie recipe" → [ ... ]
Instead of searching by exact words…
We search by meaning.
Similar meaning → vectors close together
Different meaning → vectors far apart
So when you ask:
"How much is a ride?"
The system doesn’t just look for exact matches.
It finds vectors that are nearby in this space.
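A miniature vector search, with made-up 3-number vectors standing in for real embeddings — the query matches the closest stored meaning, not exact words:

```python
import math

def cosine(u, v):
    """Cosine similarity: higher = closer in meaning."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# A tiny "vector database": made-up vectors for three documents.
db = {
    "What is a roller coaster?": [0.9, 0.1, 0.4],
    "Theme park ride cost":      [0.8, 0.2, 0.5],
    "Apple pie recipe":          [-0.3, 0.9, -0.1],
}

def search(query_vec):
    """Return the stored text whose vector is closest to the query."""
    return max(db, key=lambda text: cosine(query_vec, db[text]))

query = [0.75, 0.25, 0.55]  # pretend embedding of "How much is a ride?"
print(search(query))        # Theme park ride cost
```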
Souvenirs from LLM City
Quiz
86% of people love a quiz after learning! Are you one of them?
Each correct answer gives you 10 points. A second try gives 5.