50 Days of Building a Small Language Model from Scratch

Day 1 - What Are Small Language Models?

Welcome to Day 1 of our journey in building a small language model (SLM) from scratch. I'm genuinely excited you're here. By the end of this series, you'll have hands-on experience spinning up a compact, efficient model that can run on your laptop.

What Makes a Model "Small"?

Did you know that there is no official definition of a small language model? I used to think any model under some fixed parameter count would qualify, but there is no universally accepted cutoff.

In practice, though, researchers tend to call a language model "small" based on two main factors:

1️⃣ Parameter count:

Typically under 100 million parameters, though the cutoff varies (a quick way to estimate a model's parameter count from its architecture is sketched right after this list). Think of models like:

  • ✔️ DeepSeek-V2 Tiny (15–20M): great balance of size and performance
  • ✔️ Phi-2 (2.7B but highly efficient): not technically "small," but optimized for low-resource scenarios
  • ✔️ DistilBERT (66M): a distilled version of BERT
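
If you want a rough feel for where a given design lands on this axis, you can estimate the parameter count directly from the architecture. The sketch below uses the standard GPT-style approximation (token and positional embeddings plus roughly 12·d² weights per Transformer block, assuming a 4× MLP expansion); the function name and defaults are my own, not taken from any of the models above.

```python
def estimate_gpt_params(vocab_size: int, d_model: int, n_layers: int,
                        context_len: int = 1024) -> int:
    """Rough parameter count for a GPT-style decoder.

    Assumes learned positional embeddings and a 4x MLP expansion, so each
    Transformer block contributes ~12 * d_model^2 weights
    (4*d^2 for the attention projections + 8*d^2 for the MLP).
    """
    embeddings = vocab_size * d_model + context_len * d_model
    per_block = 12 * d_model ** 2
    return embeddings + n_layers * per_block

# A DistilBERT-sized budget vs. the tiny story model we'll build later
print(f"{estimate_gpt_params(30522, 768, 6) / 1e6:.1f}M")   # ~66M, DistilBERT's ballpark
print(f"{estimate_gpt_params(50257, 256, 24) / 1e6:.1f}M")  # ~32M, close to our 30M target
```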

2️⃣ Deployment footprint:

Whether the model can run efficiently on edge devices, mobile phones, or other resource-constrained environments; a rough memory-footprint estimate follows the checklist below.

  • ✔️ Can it run on a mobile device?
  • ✔️ Can it serve real-time tasks on CPU or a single GPU?
  • ✔️ Does it use quantization or distillation effectively?
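
To see why quantization matters so much for that checklist, note that the memory needed just to store the weights is simply the parameter count times the bytes per parameter. Here's a minimal sketch (my own helper, with the usual precision sizes assumed):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_mb(n_params: int, dtype: str = "fp16") -> float:
    """Approximate memory to hold the weights only (no activations or KV cache)."""
    return n_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)

for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"30M params @ {dtype}: ~{weight_memory_mb(30_000_000, dtype):.0f} MB")
# fp32 ~114 MB, fp16 ~57 MB, int8 ~29 MB, int4 ~14 MB: comfortably phone-sized
```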

At the end of this series, we'll build two models:

  • GPT-based Children's Stories (30M parameters) 🔗
  • DeepSeek Children's Stories (15M parameters) 🔗

These models are still large enough to capture meaningful patterns, but small enough that you can train or fine-tune them on a single GPU, or even a CPU, in just a few hours.

Why Build an SLM?

Now, the big question: why bother building a small language model when there are already so many powerful models available?

Cost & Speed

Training a 30M-parameter model costs a tiny fraction of what it takes to train a 70B-parameter one. You'll iterate faster, experiment more, and burn far less electricity.

Edge-Friendly

Want to run inference on your phone, Raspberry Pi, or edge GPU? SLMs make on-device AI practical by keeping data local, cutting latency, and boosting privacy.

Domain Expertise

By focusing on a narrow domain, say, legal Q&A or children's story generation, you can match (or even beat) larger models within your niche, all while staying lean.

Core Architecture Patterns

Under the hood, small language models still rely on Transformer blocks; the cleverness is in how they scale those blocks down. One example of a deep-and-thin architecture comes from the world of children's stories: IdeaWeaver's Tiny Children's Stories model packs storytelling power into just 30 million parameters by stacking 24 slimmed-down Transformer layers with 256-dimensional embeddings, a far narrower setup than the 768 dimensions used in standard GPT models. And yet it still generates charming, multi-paragraph tales. Going even smaller, the DeepSeek Children's Stories model uses just 15 million parameters, applying the same principle of many narrow layers to maintain narrative structure while keeping the model tiny.
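
To make the deep-and-thin idea concrete, here is a minimal PyTorch sketch of what such a configuration might look like. The hyperparameter names and the GPT-2 vocabulary size (50,257) are assumptions for illustration, and a stock encoder layer stands in for a decoder block; the real models add causal masking, an LM head, and other details, but the parameter budget already lands near 30M.

```python
from dataclasses import dataclass
import torch.nn as nn

@dataclass
class TinyStoryConfig:
    vocab_size: int = 50257   # GPT-2 BPE vocabulary size (assumed)
    n_layers: int = 24        # deep...
    d_model: int = 256        # ...but thin (GPT-2 small uses 12 layers x 768 dims)
    n_heads: int = 8
    context_len: int = 512

cfg = TinyStoryConfig()

# One slimmed-down Transformer block, cloned n_layers times
block = nn.TransformerEncoderLayer(
    d_model=cfg.d_model,
    nhead=cfg.n_heads,
    dim_feedforward=4 * cfg.d_model,
    batch_first=True,
    norm_first=True,          # pre-LayerNorm, as in GPT-style models
)
backbone = nn.TransformerEncoder(block, num_layers=cfg.n_layers)
embed = nn.Embedding(cfg.vocab_size, cfg.d_model)

n_params = sum(p.numel() for p in backbone.parameters()) + embed.weight.numel()
print(f"~{n_params / 1e6:.1f}M parameters")  # roughly 32M with these settings
```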

Limitations to Keep in Mind

So, if small models can generate such good results, do we even need the big ones anymore? Well… not so fast. When I trained the GPT-based Children's Stories model (30M parameters) and the DeepSeek Children's Stories model (15M parameters), I was honestly impressed. Both models were able to craft multi-paragraph stories with a strong narrative arc, all while running on modest hardware.

However, when I applied the same principles to a small model trained on a DevOps-related dataset, the results weren't as consistent. Some things worked beautifully. Others? Not so much.

It turns out, small language models come with trade-offs. Over the next few days, I'll share some of the most interesting lessons and limitations I encountered - but here's a sneak peek:

Shallower Understanding

Fewer parameters mean less capacity to grasp subtle idioms, multi-clause sentences, or references that span across paragraphs. Think of it like having a bookshelf of ten favorite books instead of a whole library - it works great for specific topics, but struggles outside that comfort zone.

Hallucinations

Without broad world knowledge, small models are more likely to "make stuff up" (this is true even for LLMs). They can sound confident even when they are entirely wrong, especially when you push them outside their fine-tuned domain.

Limited Context Windows

Most small models max out at 512 to 2048 tokens. Hand them a 10-page document, and they'll forget the beginning by the time they reach the end.

Weak Emergent Abilities

Don't expect complex reasoning or chain-of-thought problem-solving here. Those magical behaviors, such as solving puzzles step by step, tend to appear only in larger models.

Overfitting Risk

With limited capacity, it's tempting to train longer on small datasets. But if you're not careful, you'll end up with a model that memorizes the data instead of learning to generalize.
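
A common guard is to hold out a validation split and stop as soon as validation loss stops improving. Here's a minimal early-stopping sketch; `train_one_epoch` and `evaluate` are hypothetical helpers standing in for whatever training loop you use.

```python
import torch

best_val_loss = float("inf")
patience, bad_epochs = 3, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)      # hypothetical training helper
    val_loss = evaluate(model, val_loader)    # hypothetical validation helper

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")   # keep only the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Stopping at epoch {epoch}: no improvement for {patience} epochs")
            break
```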

Looking Ahead

Today, I gave you a brief introduction to small language models. Tomorrow, we'll dive into tokenization strategies and explore how vocabulary size can directly impact model performance. And soon after, I'll walk you through a powerful technique called distillation, one that's helped me a lot in building efficient models.

Stay curious, experiment boldly, and remember: small doesn't mean simple; it means focused.

Let's make our SLMs mighty!

If you're looking for a one-stop solution for AI model training, evaluation, and deployment, with advanced RAG capabilities and seamless MCP (Model Context Protocol) integration, check out IdeaWeaver.

  • 🚀 Train, fine-tune, and deploy language models with enterprise-grade features.
  • 📚 Docs
  • 💻 GitHub

If you find IdeaWeaver helpful, a ⭐ on the repo would mean a lot!