Yesterday, I explained what a tokenizer is. Today, we're going to build our first tokenizer from scratch.
In its simplest form, the main job of a tokenizer is to break down your sentence into tokens, the atomic units that a model can understand.
Step 1: Creating Tokens
Our goal is to tokenize our data into individual words and special characters that we can then turn into embeddings for LLM training.
Now, the question is: how can we create these tokens? Let's not worry about LLMs at this stage. If I gave you this sentence, "IdeaWeaver, a comprehensive CLI tool for AI model training and evaluation"
and asked you to break it into smaller parts, how would you do it in Python?
The first Python method that comes to mind is .split(), which, with no arguments, splits on any run of whitespace and discards it.
text = "IdeaWeaver, a comprehensive CLI tool for AI model training and evaluation"
parts = text.split()
print(parts)
# ['IdeaWeaver,', 'a', 'comprehensive', 'CLI', 'tool', 'for', 'AI', 'model', 'training', 'and', 'evaluation']
Another option is the re module (regular expressions), one of the most popular Python modules, where you can use the pattern r'(\s)' to split on any single whitespace character and capture that character.
import re
text = "IdeaWeaver, a comprehensive CLI tool for AI model training and evaluation"
result = re.split(r'(\s)', text)
print(result)
# ['IdeaWeaver,', ' ', 'a', ' ', 'comprehensive', ' ', 'CLI', ' ', 'tool', ' ', 'for', ' ', 'AI', ' ', 'model', ' ', 'training', ' ', 'and', ' ', 'evaluation']
Let's extend the above code further and split on whitespace (\s), commas (,), or periods (\.):
import re
text = "IdeaWeaver, a comprehensive CLI tool for AI model training and evaluation."
result = re.split(r'([\s,\.])', text)
print(result)
# ['IdeaWeaver', ',', '', ' ', 'a', ' ', 'comprehensive', ' ', 'CLI', ' ', 'tool', ' ', 'for', ' ', 'AI', ' ', 'model', ' ', 'training', ' ', 'and', ' ', 'evaluation', '.', '']
We can see that both the words and punctuation marks are now separate entries in the list, just as we wanted.
However, there's still a small issue: the list includes empty strings and whitespace entries. If those are not required, we can safely remove these redundant entries using the following approach:
# keep only tokens that aren't empty or all whitespace
result = [token for token in result if token.strip()]
print(result)
# ['IdeaWeaver', ',', 'a', 'comprehensive', 'CLI', 'tool', 'for', 'AI', 'model', 'training', 'and', 'evaluation', '.']
Should We Keep or Remove Whitespace?
When building a simple tokenizer, the decision to keep or remove whitespace characters depends on the specific needs of your application. Removing whitespace can reduce memory usage and computational overhead. However, preserving whitespace is important in cases where the structure of the text matters, such as when processing Python code, where indentation and spacing are critical.
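For example, if you did want to preserve the spaces, one small variation of the filter above (just a sketch, not part of the pipeline we continue with) is to drop only the truly empty strings and keep the whitespace tokens:
# keep whitespace tokens; drop only the empty strings produced by re.split
result_with_ws = [token for token in re.split(r'([\s,\.])', text) if token != '']
print(result_with_ws)
# ['IdeaWeaver', ',', ' ', 'a', ' ', 'comprehensive', ' ', 'CLI', ' ', 'tool', ' ', 'for', ' ', 'AI', ' ', 'model', ' ', 'training', ' ', 'and', ' ', 'evaluation', '.']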
In this example, we choose to remove whitespace for simplicity and to keep the tokenized output concise. Later, we'll explore a tokenization approach that retains whitespace characters.
The tokenization scheme we developed above works well on the simple sample text. Now, let's refine it further to handle additional types of punctuation such as question marks, quotation marks, and double dashes.
text = "IdeaWeaver-- a comprehensive CLI tool for AI model training and evaluation?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
print(result)
# ['IdeaWeaver', '--', '', ' ', 'a', ' ', 'comprehensive', ' ', 'CLI', ' ', 'tool', ' ', 'for', ' ', 'AI', ' ', 'model', ' ', 'training', ' ', 'and', ' ', 'evaluation', '?', '']
Now, let's remove any tokens that are empty or consist solely of whitespace:
result = [token for token in result if token.strip()]
print(result)
# ['IdeaWeaver', '--', 'a', 'comprehensive', 'CLI', 'tool', 'for', 'AI', 'model', 'training', 'and', 'evaluation', '?']
So our final code will look like this:
import re
text = "IdeaWeaver-- a comprehensive CLI tool for AI model training and evaluation?"
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [tok.strip() for tok in tokens if tok.strip()]
Step 2: Creating Token IDs
In the previous section, we tokenized our text. Let's now create a list of all unique tokens and sort them alphabetically to determine the vocabulary size:
all_tokens = sorted(set(tokens))
vocab_size = len(all_tokens)
print(vocab_size)
# This prints the vocabulary size
# 13
After determining the vocabulary size, we can create the vocabulary and print it:
vocab = {token: idx for idx, token in enumerate(all_tokens)}
# e.g. {'--': 0, '?': 1, 'AI': 2, ...}
for token, idx in vocab.items():
    print(f"{token}: {idx}")
# --: 0
# ?: 1
# AI: 2
# CLI: 3
# IdeaWeaver: 4
# a: 5
# and: 6
# comprehensive: 7
# evaluation: 8
# for: 9
# model: 10
# tool: 11
# training: 12
As shown in the output above, the dictionary maps each individual token to a unique integer ID.
Later in the blog, when we want to convert the numeric outputs of a language model back into readable text, we'll need a way to map token IDs back to their original tokens.
To achieve this, we can create an inverse vocabulary that reverses the mapping from token IDs to text tokens.
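For example, a quick sketch of such an inverse mapping (here called inverse_vocab, a name used just for illustration) looks like this:
# inverse vocabulary: token ID -> token
inverse_vocab = {idx: token for token, idx in vocab.items()}
print(inverse_vocab[4])
# IdeaWeaver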
Step 3: Implementing a Complete Tokenizer Class
Let's now implement a complete tokenizer class, BasicTokenizer, in Python.
This class will include an encode method that splits input text into tokens and maps each token to its corresponding integer ID using the vocabulary.
It will also provide a decode method, which performs the reverse: it translates token IDs back into their original text form.
- Keep your vocabulary (the token→ID map) inside the tokenizer class so both encoding and decoding can use it
- Make a reverse map (ID→token) so you can turn numbers back into words
- To encode, split the text into tokens, clean them up, and look up each token's ID
- To decode, look up each ID's token and join them back into a string
- Finally, remove any extra spaces before punctuation so the output reads naturally
import re
from typing import List, Dict, Pattern

class BasicTokenizer:
    """
    A minimal whitespace- and punctuation-based tokenizer
    with bidirectional mapping between tokens and integer IDs.
    """
    def __init__(
        self,
        token_index: Dict[str, int],
        split_pattern: Pattern = re.compile(r'([,.:;?_!"\'()]|--|\s)'),
        rejoin_pattern: Pattern = re.compile(r'\s+([,.:;?_!"\'()]|--)')  # also re-attach '--'
    ):
        # forward and reverse vocab
        self.token_index = token_index
        self.index_token = {idx: tok for tok, idx in token_index.items()}
        # patterns for tokenization and for stitching text back together
        self._split_pattern = split_pattern
        self._rejoin_pattern = rejoin_pattern
        # optional unknown-token ID (used only if "<|unk|>" is in the vocabulary)
        self.unknown_id = token_index.get("<|unk|>", None)

    def _tokenize(self, text: str) -> List[str]:
        """
        Split on punctuation, double-dash, or whitespace,
        strip out empty pieces.
        """
        raw = self._split_pattern.split(text)
        return [piece.strip() for piece in raw if piece.strip()]

    def encode(self, text: str) -> List[int]:
        """
        Convert a string into a list of token IDs.
        Unknown tokens map to "<|unk|>" if present, otherwise are skipped.
        """
        tokens = self._tokenize(text)
        ids = []
        for tok in tokens:
            if tok in self.token_index:
                ids.append(self.token_index[tok])
            elif self.unknown_id is not None:
                ids.append(self.unknown_id)
            # else: drop it
        return ids

    def decode(self, ids: List[int]) -> str:
        """
        Convert a list of IDs back into a human-readable string,
        rejoining tokens with spaces, then fixing space-before-punctuation.
        """
        # map IDs → tokens, skip missing IDs
        tokens = [self.index_token[i] for i in ids if i in self.index_token]
        text = " ".join(tokens)
        # remove unwanted spaces before punctuation (and before "--")
        return self._rejoin_pattern.sub(r"\1", text)
Let's instantiate a tokenizer class:
# Instantiate
tokenizer = BasicTokenizer(vocab)
# Tokenize & encode
text = "IdeaWeaver-- a comprehensive CLI tool for AI model training and evaluation?"
ids = tokenizer.encode(text)
# See the result
print(ids)
# ➞ [4, 0, 5, 7, 3, 11, 9, 2, 10, 12, 6, 8, 1]
After printing out the token IDs, we can decode the original text by calling:
decoded_text = tokenizer.decode(ids)
print(decoded_text)
# IdeaWeaver-- a comprehensive CLI tool for AI model training and evaluation?
The output above shows that the decode method successfully translated the token IDs back into the original text.
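As a quick sanity check (assuming the rejoin pattern above, which also re-attaches '--'), we can confirm that encoding followed by decoding reproduces the original string:
# round-trip check: encode then decode should give back the input text
print(tokenizer.decode(tokenizer.encode(text)) == text)
# True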
Everything looks good so far: we've built a tokenizer that can tokenize and de-tokenize text based on a sample from the training set.
Now, let's test it on a new text sample that wasn't part of the training set.
text = "Hello, how are you?"
print(f"\nTesting with new text: '{text}'")
encoded_ids = tokenizer.encode(text)
print("Encoded IDs:", encoded_ids)
# Encoded IDs: [1]
What's Happening? 🤔
The tokenizer only recognizes tokens that were in the original vocabulary. From Hello, how are you?:
- Hello, how, are, and you are not in the vocabulary → skipped
- , is not in the vocabulary → skipped
- ? is in the vocabulary (from the original text) → encoded as ID 1
This demonstrates a fundamental limitation of this basic tokenizer: it can only work with text containing tokens from its training vocabulary.
Adding Special Context Tokens
In the previous section, we built a simple tokenizer and applied it to a sample text.
Now, we'll enhance that tokenizer to handle unknown words and mark the boundaries between separate texts. Specifically, we'll extend the BasicTokenizer's vocabulary with two special tokens: <|unk|> and <|endoftext|>.
The <|unk|> token will be used to represent any word that is not found in the vocabulary. This helps the model handle unexpected or out-of-vocabulary inputs.
The <|endoftext|> token will serve as a separator between unrelated text segments. For instance, when training GPT-style language models on multiple independent documents or books, it's common to insert a boundary token before each new text to indicate a transition.
Let's go ahead and modify the vocabulary to include these two special tokens by appending them to the list of unique words we created in the previous section:
all_tokens = sorted(list(set(tokens)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)
# ('model', 10)
# ('tool', 11)
# ('training', 12)
# ('<|endoftext|>', 13)
# ('<|unk|>', 14)
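To see both special tokens in action, here's a small sketch (assuming the BasicTokenizer implementation above, which falls back to the <|unk|> ID when a token is missing from the vocabulary): we rebuild the tokenizer with the extended vocabulary, join two unrelated texts with <|endoftext|>, and encode the result.
# rebuild the tokenizer with the extended vocabulary
tokenizer = BasicTokenizer(vocab)
text1 = "Hello, how are you?"
text2 = "IdeaWeaver-- a comprehensive CLI tool for AI model training and evaluation?"
text = " <|endoftext|> ".join((text1, text2))
ids = tokenizer.encode(text)
print(ids)
# [14, 14, 14, 14, 14, 1, 13, 4, 0, 5, 7, 3, 11, 9, 2, 10, 12, 6, 8, 1]
print(tokenizer.decode(ids))
# <|unk|> <|unk|> <|unk|> <|unk|> <|unk|>? <|endoftext|> IdeaWeaver-- a comprehensive CLI tool for AI model training and evaluation?
Unknown words like Hello now map to the <|unk|> ID (14) instead of being silently dropped, and <|endoftext|> (13) marks the boundary between the two texts.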
Up to this point, we've discussed tokenization as a critical preprocessing step for feeding text into large language models (LLMs). Depending on the model architecture and training methodology, some researchers incorporate additional special tokens such as:
- [BOS] (Beginning of Sequence): Marks the start of a sequence, signaling the model where the input begins.
- [EOS] (End of Sequence): Indicates the end of a sequence. This is especially useful when concatenating multiple unrelated texts, similar to GPT's <|endoftext|> token. For example, when combining two separate Wikipedia articles or books, [EOS] marks where one ends and the next begins.
- [PAD] (Padding): When training with batches of varying-length texts, the shorter sequences are padded with [PAD] tokens to match the length of the longest sequence in the batch.
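To make the [PAD] idea concrete, here's a rough sketch of how a batch of ID sequences could be padded to a common length (pad_id below is a hypothetical ID for [PAD]; it is not part of the vocabulary we built above):
# pad every sequence in a batch to the length of the longest one
batch = [[4, 0, 5, 7], [9, 2], [10]]
pad_id = 15  # hypothetical [PAD] ID
max_len = max(len(seq) for seq in batch)
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
print(padded)
# [[4, 0, 5, 7], [9, 2, 15, 15], [10, 15, 15, 15]]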
It's important to note that GPT models do not use tokens like [BOS], [EOS], or [PAD]. Instead, they rely solely on a single special token: <|endoftext|>.
Additionally, GPT models do not include an <|unk|> token for handling out-of-vocabulary words. Instead, they use a Byte Pair Encoding (BPE) tokenizer, which breaks words down into smaller subword units, allowing them to process virtually any input text without the need for an explicit unknown-word token. BPE is the topic we'll be discussing tomorrow.