
50 Days of Building a Small Language Model from Scratch

Day 2 - Tokenizers: The Unsung Heroes of Language Models


When you interact with ChatGPT or any other large language model (LLM), you type in human-readable text like:
Hello, how are you?
But here's a secret: these models have no idea what words are.

Before your input ever reaches the model, it goes through a crucial transformation by a component called a tokenizer. It breaks down your sentence into tokens, the atomic units the model can understand.
Let's dive under the hood and understand how it all works.

What Is a Token?

At a high level, a token can be:

  • A word (e.g., hello)
  • A subword (e.g., un, believ, able)
  • A character (in some models)
  • Even a piece of punctuation or whitespace

Think of tokens like LEGO bricks. Individually, they're just plastic pieces, but put them together, and you can build houses, castles, or spaceships.
The tokenizer converts human language into a sequence of tokens (usually integers), which the model then uses for training or inference.
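
To make this concrete, here's a quick sketch using OpenAI's tiktoken library (which comes up again below). It assumes tiktoken is installed; the exact IDs you get depend on the vocabulary you load.

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")        # GPT-2's BPE vocabulary
ids = enc.encode("Hello, how are you?")    # text -> a list of integer token IDs

print(ids)              # a short list of integers; exact values depend on the vocabulary
print(enc.decode(ids))  # -> "Hello, how are you?" (decoding reverses the mapping)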

Tokenization Techniques: The Building Blocks

There are three main ways models tokenize text. Let's break them down:

1. Whitespace or Word-Level Tokenization (Basic and Rare Now)

Splits by spaces and punctuation. Simple, but inefficient for handling unknown or rare words.
Example:

"unbelievable" → ["unbelievable"]

NOTE: If the model hasn't seen unbelievable before, it's stuck: the whole word collapses into a single unknown token such as <unk>.
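
Here's a minimal sketch of word-level tokenization; the regex split and the toy vocabulary are just for illustration.

import re

def word_tokenize(text):
    # split on whitespace, keeping punctuation as separate tokens
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']

# Any word missing from the training vocabulary collapses into a single unknown token
vocab = {"hello", "how", "are", "you"}   # toy vocabulary
print(["<unk>" if w.lower() not in vocab else w for w in word_tokenize("how unbelievable")])
# ['how', '<unk>']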

2. Character-Level Tokenization (High Flexibility, Low Efficiency)

Breaks everything into individual characters. It works for any language, but the sequences become very long.
Example:

"unbelievable" → ["u", "n", "b", "e", "l", "i", "e", "v", "a", "b", "l", "e"]

Great for robust handling, but slow and hard to model long-term structure.
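
Character-level tokenization needs no special machinery; a plain list() call in Python does it, and it also shows why sequences blow up in length.

text = "unbelievable"
tokens = list(text)   # every character becomes its own token
print(tokens)
# ['u', 'n', 'b', 'e', 'l', 'i', 'e', 'v', 'a', 'b', 'l', 'e']
print(len(tokens))    # 12 tokens for a single word: sequences grow quickly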

3. Subword Tokenization (The Gold Standard)

Used by almost all modern LLMs. It breaks words into frequent chunks based on the training corpus. This allows it to handle:

  • Common words as single tokens
  • Rare or made-up words using combinations

Popular algorithms include:

  • Byte-Pair Encoding (BPE) - Used in GPT-2, GPT-3, RoBERTa
  • WordPiece - Used in BERT
  • Unigram Language Model - Used in SentencePiece / T5
  • GPT-4 also uses BPE, through OpenAI's tiktoken library, which is optimized for speed

Example with BPE:

"unbelievable" → ["un", "believ", "able"]
"unicornify" → ["un", "icorn", "ify"]

Under the Hood: How a Tokenizer Actually Works

Let's unpack what happens when you run:

tokenizer.encode("Hello, world!")

Here's what happens step-by-step:

  1. Normalization
    The text is cleaned and standardized: lowercased (for uncased tokenizers), extra whitespace removed, and Unicode normalization applied.
    "Hello, world!" → "hello, world!"
  2. Pre-tokenization
    Text is split based on predefined rules (e.g., whitespace, punctuation). This is language-dependent.
    "hello, world!" → ["hello", ",", "world", "!"]
  3. Subword Tokenization
    Now the magic begins. Each piece is matched to subword units from a learned vocabulary.
    "hello" → ["he", "llo"]
    "world" → ["wor", "ld"]
    Each subword is mapped to a token ID (an integer).

Real Output

Let's say your tokenizer has these mappings:

Token   ID
he      42
llo     91
wor     57
ld      82
,       11
!       99

The final output becomes:

[42, 91, 11, 57, 82, 99]

That's what your model actually sees: not words, but numbers.
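
Here's a toy end-to-end version of those three steps, using the vocabulary from the table above. The greedy longest-match loop is a stand-in for a real WordPiece or BPE implementation; it's only a sketch to show the flow.

import re
import unicodedata

vocab = {"he": 42, "llo": 91, "wor": 57, "ld": 82, ",": 11, "!": 99}   # toy vocabulary from the table

def normalize(text):
    # Step 1: Unicode normalization, lowercasing, collapsing extra spaces
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def pre_tokenize(text):
    # Step 2: split on whitespace, keeping punctuation separate
    return re.findall(r"\w+|[^\w\s]", text)

def subword_tokenize(piece):
    # Step 3: greedy longest-prefix match against the vocabulary
    tokens, start = [], 0
    while start < len(piece):
        for end in range(len(piece), start, -1):
            if piece[start:end] in vocab:
                tokens.append(piece[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot tokenize {piece[start:]!r}")
    return tokens

def encode(text):
    tokens = [t for p in pre_tokenize(normalize(text)) for t in subword_tokenize(p)]
    return [vocab[t] for t in tokens]

print(encode("Hello, world!"))   # [42, 91, 11, 57, 82, 99]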

Vocabulary: Why It Matters

Each tokenizer has a fixed-size vocabulary, often around 32k to 50k tokens (newer models like GPT-4 use roughly 100k).
If your vocabulary is too small → more tokens are needed per sentence, so sequences get longer
If your vocabulary is too big → the embedding table grows, so training gets harder and memory use goes up

That's why models often optimize for the sweet spot, using tools like SentencePiece or tiktoken to find balance.
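
As a quick check, you can read the vocabulary size straight off a Hugging Face tokenizer (this sketch downloads the tokenizers from the Hub on first run):

from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece
gpt2 = AutoTokenizer.from_pretrained("gpt2")                # BPE

print("BERT vocab size :", bert.vocab_size)   # 30522
print("GPT-2 vocab size:", gpt2.vocab_size)   # 50257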

Why Tokenization Affects Model Performance

  • Context length: If tokenization is inefficient (splits too much), you run out of context window faster.
  • Training cost: More tokens = more compute.
  • Hallucinations: awkward or inconsistent splits make it harder for the model to map inputs cleanly, which can show up as odd generations.

For example, ChatGPT might be tokenized as ["Chat", "G", "PT"] by one tokenizer and ["ChatGPT"] by another - leading to different responses.
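
A simple way to feel this difference is to run the same sentence through two tokenizers; the splits and token counts won't match, which is exactly why the same prompt can consume different amounts of context window.

from transformers import AutoTokenizer

text = "ChatGPT handles unbelievable scenes effortlessly."

for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    tokens = tok.tokenize(text)
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
# Each tokenizer splits the sentence differently, so the token counts differ too.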


Try It Yourself (Hugging Face Example)

# Install the library first: pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Unbelievable scenes!")
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)
# ['un', '##believable', 'scenes', '!']
print(ids)
# [4895, 14474, 3793, 999]

Note: The ## prefix marks a subword continuation in WordPiece: ##believable attaches to the piece before it (un).

Final Thoughts

Tokenizers are the quiet enablers of LLMs. Without them, your model wouldn't even know where to begin.
Next time you use a language model, take a moment to appreciate the work done before the first layer of the transformer even kicks in. If tokenization goes wrong, it's like giving someone a story with the words all scrambled — even the smartest person won't understand it.

Next up: we'll build our very first tokenizer from scratch, step by step!

If you're looking for a one-stop solution for AI model training, evaluation, and deployment, with advanced RAG capabilities and seamless MCP (Model Context Protocol) integration, check out IdeaWeaver.

  • 🚀 Train, fine-tune, and deploy language models with enterprise-grade features.
  • 📚 Docs
  • 💻 GitHub

If you find IdeaWeaver helpful, a ⭐ on the repo would mean a lot!