
Day 5 of 50 Days of Building a Small Language Model from Scratch — Byte Pair Encoding Explained: Using tiktoken In LLM Workflows

By Prashant Lakhera

Congratulations, we've reached Day 5 of Building a Small Language Model from Scratch.
So far, we've covered the fundamentals of what a small language model is. We built our first tokenizer from scratch and explored its limitations, particularly in handling unknown words. This led us to the Byte Pair Encoding (BPE) tokenizer, which addresses many of those issues.
I also introduced the tiktoken module from OpenAI, and today, we'll take a closer look at how we're using it in our code.

Using tiktoken: The GPT-2 BPE Tokenizer

To use the tiktoken library, we start by installing and importing it. Then we initialize the GPT-2 tokenizer like this:

import tiktoken
self.enc = tiktoken.get_encoding("gpt2")  # load the pretrained GPT-2 BPE tokenizer

If you don't have tiktoken installed yet, you can install it with:

pip install tiktoken

The GPT-2 tokenizer uses Byte Pair Encoding (BPE), which is:

  • Fast
  • Memory-efficient
  • Pretrained on a wide range of English text

This makes it effective at breaking down words into meaningful subword units. For example, words like play, playing, and played are represented in a compact and generalized way that improves downstream model performance.
GPT-2's tokenizer remains one of the best options for building and experimenting with small- to mid-sized language models, especially in early prototyping stages.
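
To see this in practice, here's a quick, self-contained check of how the GPT-2 encoding splits related words into subword pieces (the word list is just an illustration):

import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Print the token IDs and the subword pieces for a few related words
for word in ["play", "playing", "played", "playfulness"]:
    ids = enc.encode_ordinary(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word}: ids={ids}, pieces={pieces}")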

Special Tokens for Storytelling

Further down in the code, we use special tokens to help the model understand story structure:

self.special_tokens = {
    "start_story": "<|startofstory|>",
    "end_story": "<|endofstory|>",
    "title": "<|title|>",
}

These are vital for storytelling tasks. During both training and generation, these tokens tell the model:

  • When to start and stop a story
  • Where the title goes
  • When a new narrative section begins

These special tokens provide the structural context that helps the model organize its output the way a human storyteller would.
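
For instance, once the markers are inserted, a single formatted training example looks roughly like this (the prompt and story text here are made up):

# Illustrative only: a made-up example of one formatted training string
formatted = (
    "<|title|> write a story about a brave little fox "
    "<|startofstory|> once upon a time, there was a brave little fox... "
    "<|endofstory|>"
)
print(formatted)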

Text Preprocessing

Before tokenizing, we need to clean and normalize the text. The preprocess_text method ensures consistency by converting text to lowercase, removing unnecessary whitespace, and replacing newlines with spaces.

def preprocess_text(self, text):
    # Basic text cleaning
    text = text.lower()  # Convert to lowercase for consistency
    text = text.replace('\n', ' ')  # Replace newlines with spaces
    text = ' '.join(text.split())  # Normalize whitespace
    return text

This function prepares both the prompt and the story sections of each example. It helps reduce noise in the data and ensures that tokenization operates on clean, uniform input.
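
To see what this cleaning step produces, here's a standalone version of the same logic applied to a made-up sample string:

# Standalone copy of the cleaning logic, for illustration
def preprocess_text(text):
    text = text.lower()               # convert to lowercase
    text = text.replace('\n', ' ')    # replace newlines with spaces
    return ' '.join(text.split())     # normalize whitespace

raw = "The Brave Little Fox\n  Once upon a time,   there was a fox."
print(preprocess_text(raw))
# the brave little fox once upon a time, there was a fox.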

Tokenization with Error Handling

The process method prepares each example by wrapping the prompt and story in the special tokens defined above, then tokenizes the combined text using the GPT-2 tokenizer from tiktoken. Because encode_ordinary ignores special-token handling, the marker strings are simply encoded as ordinary text. To stay within the model's context limit, the tokenized output is truncated to a maximum of 1024 tokens. If any error occurs during tokenization, the method catches the exception and returns an empty sequence.

def process(self, example):
    # Preprocess both prompt and story
    prompt = self.preprocess_text(example['prompt'])
    story = self.preprocess_text(example['text'])
    
    # Create structured text with special tokens: the prompt is wrapped
    # with the title marker, the story with the start/end story markers
    full_text = (
        f"{self.special_tokens['title']} {prompt} "
        f"{self.special_tokens['start_story']} {story} {self.special_tokens['end_story']}"
    )
    
    # Tokenize with error handling
    try:
        ids = self.enc.encode_ordinary(full_text)
        # Truncate to GPT-2's context limit
        if len(ids) > 1024:
            ids = ids[:1024]
        return {'ids': ids, 'len': len(ids)}
    except Exception as e:
        print(f"Error tokenizing text: {e}")
        return {'ids': [], 'len': 0}

This function ensures that every example is robustly tokenized, avoiding crashes during processing due to malformed input.
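
As a rough usage sketch, here's how a single record might flow through this method (prep is an assumed instance of the preprocessing class, and the sample dict mimics one dataset record):

# Hypothetical usage: `prep` is an instance of the preprocessing class above
sample = {
    "prompt": "Write a story about a brave little fox.",
    "text": "Once upon a time, there was a brave little fox...",
}
out = prep.process(sample)
print(out["len"])       # number of token IDs produced
print(out["ids"][:10])  # first few token IDs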

Dataset Loading and Splitting

The prepare_dataset method starts by downloading the Children Stories Collection dataset using the Hugging Face datasets library. It filters out examples that are too short or too long, and then splits the dataset into three subsets: train, validation, and fine-tune.

from datasets import load_dataset

ds = load_dataset("ajibawa-2023/Children-Stories-Collection")
# Filter out examples that are too short or too long
def filter_by_length(example):
    return 50 <= example['text_token_length'] <= 1000
ds = ds.filter(filter_by_length)
# Split the dataset: 80% train, 10% validation, 10% fine-tune
train_val_test = ds["train"].train_test_split(test_size=0.2, seed=42)
val_finetune = train_val_test["test"].train_test_split(test_size=0.5, seed=42)
ds = {
    "train": train_val_test["train"],
    "validation": val_finetune["train"],
    "finetune": val_finetune["test"]
}

This ensures a well-balanced dataset with examples of appropriate length for training a small LLM.
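
A quick way to sanity-check the result is to print how many examples ended up in each split:

# Illustrative check of the split sizes
for name, split in ds.items():
    print(f"{name}: {len(split)} examples")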

Parallel Tokenization and Binary Serialization

Once the dataset is split, each subset is tokenized using multiple processes (num_proc=8) to speed things up. The resulting token IDs are then stored in .bin files using memory-mapped NumPy arrays. This format allows for fast, efficient access during model training.

import os
import numpy as np
from tqdm import tqdm

tokenized = split_data.map(
    self.process,
    remove_columns=['text', 'prompt', 'text_token_length'],
    desc=f"tokenizing {split_name} split",
    num_proc=8,
)
filename = os.path.join(self.data_dir, f"{split_name}.bin")
arr_len = np.sum(tokenized['len'], dtype=np.uint64)
dtype = np.uint16  # GPT-2 token IDs are below 50,257, so they fit in 16 bits
arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))
total_batches = 1024  # write the token stream in 1,024 contiguous shards
idx = 0
for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
    batch = tokenized.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
    arr_batch = np.concatenate(batch['ids'])
    arr[idx : idx + len(arr_batch)] = arr_batch
    idx += len(arr_batch)
arr.flush()

By using memory mapping and sharding, this step optimizes the preprocessing pipeline for scalability, even with large datasets.
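
As a rough illustration of why this format is convenient, the resulting .bin files can later be read back just as cheaply with np.memmap during training (the filename and block size below are assumed values, not part of the pipeline above):

import numpy as np

# Open the serialized token stream without loading it all into RAM
data = np.memmap("train.bin", dtype=np.uint16, mode="r")
block_size = 256
x = data[0:block_size].astype(np.int64)       # input token IDs
y = data[1:block_size + 1].astype(np.int64)   # next-token targets
print(len(data), x[:5], y[:5])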

✅ Wrapping Up Week 1

This marks the end of our first week!
We've successfully built a robust and scalable preprocessing pipeline using tiktoken and Byte Pair Encoding (BPE), laying a solid foundation for training small language models efficiently.
Next week, we'll start working on the model itself.
Stay tuned, and have a great weekend ahead!

🔗 Related Resources