Congratulations, we've reached Day 5 of Building a Small Language Model from Scratch.
So far, we've covered the fundamentals of what a small language model is. We built our first tokenizer from scratch and explored its limitations, particularly in handling unknown words. This led us to the Byte Pair Encoding (BPE) tokenizer, which addresses many of those issues.
I also introduced the tiktoken module from OpenAI, and today, we'll take a closer look at how we're using it in our code.
Using tiktoken: The GPT-2 BPE Tokenizer
To use the tiktoken library, we start by installing and importing it. Then we initialize the GPT-2 tokenizer like this:
import tiktoken
self.enc = tiktoken.get_encoding("gpt2")
If you don't have tiktoken installed yet, you can install it with:
pip install tiktoken
The GPT-2 tokenizer uses Byte Pair Encoding (BPE), which is:
- Fast
- Memory-efficient
- Trained on a wide range of English text
This makes it effective at breaking down words into meaningful subword units. For example, words like play, playing, and played are represented in a compact and generalized way that improves downstream model performance.
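To see this in action, here's a quick sketch (not part of the pipeline code) that prints how the GPT-2 encoding splits those word forms into token IDs and subword pieces:
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Show the token IDs and decoded subword pieces for related word forms
for word in ["play", "playing", "played"]:
    ids = enc.encode_ordinary(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {ids} -> {pieces}")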
GPT-2's tokenizer remains one of the best options for building and experimenting with small- to mid-sized language models, especially in early prototyping stages.
Special Tokens for Storytelling
Further down in the code, we use special tokens to help the model understand story structure:
self.special_tokens = {
    "start_story": "<|startofstory|>",
    "end_story": "<|endofstory|>",
    "title": "<|title|>",
}
These are vital for storytelling tasks. During both training and generation, these tokens tell the model:
- When to start and stop a story
- Where the title goes
- When a new narrative section begins
By marking the structure explicitly, these special tokens give the model the context it needs to organize text the way a human storyteller would.
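To make the structure concrete, here is an illustrative sketch of how one training example could look once the markers are wrapped around a title and a (made-up) story:
special_tokens = {
    "start_story": "<|startofstory|>",
    "end_story": "<|endofstory|>",
    "title": "<|title|>",
}

# Illustrative layout of a single training example (story text is invented)
example = (
    f"{special_tokens['title']} the brave little fox "
    f"{special_tokens['start_story']} once upon a time, a little fox set out to find the moon... "
    f"{special_tokens['end_story']}"
)
print(example)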
Text Preprocessing
Before tokenizing, we need to clean and normalize the text. The preprocess_text
method ensures consistency by converting text to lowercase, removing unnecessary whitespace, and replacing newlines with spaces.
def preprocess_text(self, text):
    # Basic text cleaning
    text = text.lower()             # Convert to lowercase for consistency
    text = text.replace('\n', ' ')  # Replace newlines with spaces
    text = ' '.join(text.split())   # Normalize whitespace
    return text
This function prepares both the prompt and the story sections of each example. It helps reduce noise in the data and ensures that tokenization operates on clean, uniform input.
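As a quick sanity check, the same three operations can be run on a messy string outside the class (the sample text is made up):
raw = "The  Little Fox\nwent into the   Woods.\n"

cleaned = raw.lower()                 # lowercase
cleaned = cleaned.replace('\n', ' ')  # newlines -> spaces
cleaned = ' '.join(cleaned.split())   # collapse repeated whitespace

print(repr(cleaned))  # 'the little fox went into the woods.'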
Tokenization with Error Handling
The process
method prepares each example by combining the prompt and story with special tokens. It then tokenizes the text using the GPT-2 tokenizer from tiktoken. To stay within the model's context limit, the tokenized output is truncated to a maximum of 1024 tokens. If any error occurs during tokenization, it catches the exception and returns an empty sequence.
def process(self, example):
    # Preprocess both prompt and story
    prompt = self.preprocess_text(example['prompt'])
    story = self.preprocess_text(example['text'])

    # Create structured text with special tokens
    full_text = (
        f"{self.special_tokens['title']} {prompt} "
        f"{self.special_tokens['start_story']} {story} {self.special_tokens['end_story']}"
    )

    # Tokenize with error handling
    try:
        # encode_ordinary treats the marker strings as plain text
        ids = self.enc.encode_ordinary(full_text)
        # Truncate to GPT-2's context limit
        if len(ids) > 1024:
            ids = ids[:1024]
        return {'ids': ids, 'len': len(ids)}
    except Exception as e:
        print(f"Error tokenizing text: {e}")
        return {'ids': [], 'len': 0}
This function ensures that every example is robustly tokenized, avoiding crashes during processing due to malformed input.
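To see the same encode, truncate, and fallback flow outside the class, here's a minimal standalone sketch; the sample text is made up and the markers are written inline:
import tiktoken

enc = tiktoken.get_encoding("gpt2")
full_text = "<|title|> a kind dragon <|startofstory|> once upon a time, a kind dragon helped the village... <|endofstory|>"

try:
    ids = enc.encode_ordinary(full_text)  # the markers are encoded as ordinary text here
    ids = ids[:1024]                      # respect GPT-2's 1024-token context window
    print({'ids': ids[:10], 'len': len(ids)})
except Exception as e:
    print(f"Error tokenizing text: {e}")
    print({'ids': [], 'len': 0})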
Dataset Loading and Splitting
The prepare_dataset
method starts by downloading the Children Stories Collection dataset using the Hugging Face datasets library. It filters out examples that are too short or too long, and then splits the dataset into three subsets: train, validation, and fine-tune.
ds = load_dataset("ajibawa-2023/Children-Stories-Collection")

# Filter out too short/long examples
def filter_by_length(example):
    return 50 <= example['text_token_length'] <= 1000

ds = ds.filter(filter_by_length)

# Split the dataset: 80% train, 10% validation, 10% fine-tune
train_val_test = ds["train"].train_test_split(test_size=0.2, seed=42)
val_finetune = train_val_test["test"].train_test_split(test_size=0.5, seed=42)

ds = {
    "train": train_val_test["train"],
    "validation": val_finetune["train"],
    "finetune": val_finetune["test"]
}
This ensures a well-balanced dataset with examples of appropriate length for training a small LLM.
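A quick check on the resulting ds dict (an extra snippet, not part of the original pipeline) confirms the rough 80/10/10 proportions:
total = sum(len(split) for split in ds.values())
for name, split in ds.items():
    print(f"{name}: {len(split)} examples ({len(split) / total:.0%})")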
Parallel Tokenization and Binary Serialization
Once the dataset is split, each subset is tokenized using multiple processes (num_proc=8
) to speed things up. The resulting token IDs are then stored in .bin
files using memory-mapped NumPy arrays. This format allows for fast, efficient access during model training.
tokenized = split_data.map(
    self.process,
    remove_columns=['text', 'prompt', 'text_token_length'],
    desc=f"tokenizing {split_name} split",
    num_proc=8,
)

filename = os.path.join(self.data_dir, f"{split_name}.bin")
arr_len = np.sum(tokenized['len'], dtype=np.uint64)
dtype = np.uint16  # GPT-2's vocabulary (50,257 tokens) fits in uint16
arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))

total_batches = 1024
idx = 0
for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
    # Write one contiguous shard of tokenized examples at a time
    batch = tokenized.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
    arr_batch = np.concatenate(batch['ids'])
    arr[idx : idx + len(arr_batch)] = arr_batch
    idx += len(arr_batch)
arr.flush()
By using memory mapping and sharding, this step optimizes the preprocessing pipeline for scalability, even with large datasets.
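As a sketch of the payoff, here's how such a .bin file can be read back with a memory map at training time; the path and block size below are placeholders rather than values from the pipeline above:
import numpy as np

# Re-open the serialized token IDs; the dtype must match what was written (uint16)
data = np.memmap("data/train.bin", dtype=np.uint16, mode="r")

# Sample one contiguous block of tokens, e.g. for a single training sequence
block_size = 256  # placeholder for the model's context length
start = np.random.randint(0, len(data) - block_size - 1)
x = data[start : start + block_size].astype(np.int64)          # inputs
y = data[start + 1 : start + 1 + block_size].astype(np.int64)  # next-token targets
print(x.shape, y.shape)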
✅ Wrapping Up Week 1
This marks the end of our first week!
We've successfully built a robust and scalable preprocessing pipeline using tiktoken and Byte Pair Encoding (BPE), laying a solid foundation for training small language models efficiently.
Next week, we'll start working on the model itself.
Stay tuned, and have a great weekend ahead!