> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/dataset.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/dataset.py)
> - [picollm/accelerated/pretrain/train_tokenizer.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train_tokenizer.py)
> - [picollm/accelerated/tokenizer.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/tokenizer.py)

## What This Concept Is

Imagine training two models with the same architecture and similar budgets, but one is trained on cleaner, better-shaped data. In practice, that data difference can matter more than many small architectural tweaks. This note is about why.

A model learns from the distribution you feed it. If the data is messy, repetitive, badly formatted, or mismatched to the task, those weaknesses show up in the model.

## Foundation Terms You Need First

A **training distribution** is the actual mix of patterns the model sees while learning. **Filtering** removes low-quality or unwanted records. **[[Glossary#Deduplication|Deduplication]]** reduces repeated examples so the run does not waste budget relearning the same text. **Formatting** makes examples consistent enough for the model to learn the intended structure.

So when you read this note, think less in terms of "a dataset" and more in terms of a pipeline that shapes what the model will treat as normal.

## Data quality is part of the model

People sometimes speak as if the dataset and the model are separate objects: first we choose a model, then we choose some data to feed into it. In practice, that separation is weaker than it sounds. A specific trained [[Glossary#Checkpoint|checkpoint]] is the result of a specific architecture trained on a specific data distribution under a specific objective. Change the data and you change the model in a very real sense.

This is why data curation decisions should be treated as scientific decisions. Which domains are included?
How much duplication is allowed? Are code examples mixed with prose? Are low-quality forum fragments filtered out? Are chat messages normalized into a consistent format? These questions are not clerical details. They directly shape what the model learns to continue, imitate, and prioritize.[^2]

## The main dataset failure modes

There are several failure modes that appear again and again in LLM training.

The first is duplication. If the same or near-identical examples appear repeatedly, the model can overfit to those patterns and waste part of its training budget relearning the same signal. Duplication also distorts the effective data distribution.

The second is formatting inconsistency. In chat post-training especially, inconsistent role markers, uneven separators, or mixed conventions for system and user turns can teach the model unstable turn-taking behavior. This often shows up later as "the chatbot feels weird," but the root cause sits in formatting, not in the optimizer.[^3]

The third is low-information or low-quality text. Boilerplate, spam, malformed web content, and synthetic junk can all dilute the useful signal in the corpus. The model does not know that some text is "beneath it." It simply learns from the distribution it is given.

The fourth is mismatch between training data and target use case. A model trained mostly on broad web prose may be competent at next-token completion yet still feel weak as a conversational assistant unless it later receives post-training on clean chat-style interactions.

## Deduplication and filtering

Real large-scale training pipelines often include deduplication and filtering because the raw web is not a clean language-learning environment. Datasets such as FineWeb and related curated corpora exist partly because raw internet text benefits from quality control.[^4]

The point of deduplication is not aesthetic purity. It is to protect training efficiency and generalization.
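That protection can be made concrete with a minimal, illustrative exact-deduplication pass. This is a sketch, not the picollm pipeline: the function name is hypothetical, and production corpora such as FineWeb also apply near-duplicate methods (e.g. MinHash over shingles) that this example omits.

```python
import hashlib

def dedup_exact(records):
    """Yield each record once, keyed on a whitespace-normalized,
    lowercased hash. Exact deduplication only; near-duplicate
    detection is out of scope for this sketch."""
    seen = set()
    for record in records:
        key = hashlib.sha256(
            " ".join(record.lower().split()).encode("utf-8")
        ).hexdigest()
        if key not in seen:  # first occurrence wins
            seen.add(key)
            yield record

docs = ["The cat sat.", "the  cat sat.", "A different line."]
print(list(dedup_exact(docs)))  # the near-identical second copy is dropped
```

Hashing the normalized text rather than storing it keeps the `seen` set small even for large corpora, at the cost of catching only exact (post-normalization) repeats.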
If you spend a fixed token budget, you want as much unique, useful learning signal as possible. Likewise, filtering aims to remove malformed, empty, obviously noisy, or otherwise low-value examples so the optimizer spends less time on garbage text.

For a course project, the exact filtering system can be simpler than industrial data pipelines. But the conceptual habit should still be taught clearly: do not treat every line of text as equally valuable.

## Why tokenizer training is also a data decision

Tokenizer training is often presented as if it were a separate technical stage. But tokenizer training is also a data-quality decision. The tokenizer [[Glossary#Vocabulary|vocabulary]] is shaped by the text it is trained on. If that text is dominated by one domain or one style, the resulting tokenization may be efficient for that distribution and less efficient for others. This affects sequence lengths, token budgets, and ultimately compute cost.[^5]

That is why tokenizer training belongs in the same conceptual conversation as dataset curation. A tokenizer is not a neutral front-end. It is an adaptation to a distribution.

## Why chat data must be cleaner than you might expect

Chat post-training is where many beginner pipelines become unstable. The reason is that chat data carries extra structure. It is not just text. It contains roles, turn boundaries, response style assumptions, and sometimes hidden conventions about how prompts and answers are wrapped. If that structure is inconsistent, the model may learn messy role boundaries or awkward response habits.

This is why [[Chat Format and SFT]] is not merely an inference concern. The formatting quality of the dataset affects whether the final chatbot feels coherent and properly conversational. Even a good [[Glossary#Base model|base model]] can become a messy assistant if the [[Glossary#SFT|SFT]] data teaches inconsistent turn-taking.
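One concrete habit that follows: normalize role markers before SFT so every conversation uses a single convention. The sketch below is illustrative only; the alias table and field names are assumptions, not the picollm chat schema.

```python
# Map role aliases that commonly appear in mixed chat corpora onto one
# canonical convention. This alias table is a hypothetical example.
ROLE_ALIASES = {
    "human": "user", "usr": "user", "user": "user",
    "bot": "assistant", "ai": "assistant", "assistant": "assistant",
    "system": "system",
}

def normalize_turns(turns):
    """Return turns with canonical roles and trimmed content,
    rejecting anything the alias table does not recognize."""
    normalized = []
    for turn in turns:
        role = ROLE_ALIASES.get(turn["role"].strip().lower())
        if role is None:
            raise ValueError(f"unknown role: {turn['role']!r}")
        normalized.append({"role": role, "content": turn["content"].strip()})
    return normalized

print(normalize_turns([{"role": "Human", "content": " Hi there "}]))
```

Failing loudly on unknown roles is deliberate: silently passing through an unrecognized marker is exactly how inconsistent turn-taking leaks into an SFT set.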
## Dataset quality versus dataset size

There is always a temptation to think "more data is automatically better." But more low-quality or poorly matched data can be worse than a smaller, cleaner dataset. This is especially true for post-training. A compact, carefully curated conversational dataset can sometimes teach a more stable chat behavior than a huge but inconsistent mixture of assistant-like text, roleplay fragments, and accidental prompt wrappers.

This does not mean scale is irrelevant. Large-scale pretraining still matters enormously. It means that scale without curation is not enough.

## What you should learn to ask about every dataset

Before training, you should be able to ask a few disciplined questions:

- What distribution does this dataset represent?
- Is it aligned with the use case of the model?
- How noisy is it?
- How consistent is the formatting?
- How much duplication is likely?
- What filtering or normalization happened before training?

These questions make the dataset visible as an engineered artifact instead of an opaque blob downloaded from the internet.

## Why this matters for `picollm`

In `picollm`, you see a clean split between base pretraining and chat post-training. That split is useful because it mirrors the data story. Broad text corpora teach language and general continuation. Clean chat data teaches conversational structure. The code paths are easier to understand when you realize that the data itself is also split by purpose.

The final practical lesson is not just "train on FineWeb, then SFT on chat data." The deeper lesson is: each stage has a different data objective, and the quality of that data governs whether the resulting model feels coherent.
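Several of the checklist questions above can be answered approximately with a quick audit pass before training. A rough sketch, with hypothetical function and field names:

```python
from collections import Counter

def audit(records):
    """Rough dataset audit: record count, exact-duplicate rate after
    whitespace normalization, and median record length. Approximate
    by design; it only makes the checklist questions measurable."""
    normalized = [" ".join(r.split()).lower() for r in records]
    counts = Counter(normalized)
    duplicates = sum(c - 1 for c in counts.values())
    lengths = sorted(len(r) for r in records)
    return {
        "records": len(records),
        "dup_rate": duplicates / max(len(records), 1),
        "median_len": lengths[len(lengths) // 2] if lengths else 0,
    }

print(audit(["a b", "A  b", "c"]))  # one near-exact duplicate out of three
```

Even numbers this crude are useful: a high `dup_rate` or a strange length distribution is a cheap early warning that the corpus needs more curation before it earns any compute.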
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
    <div><a class="internal-link" data-href="Research Workflow and Ablations" href="Research%20Workflow%20and%20Ablations">Research Workflow and Ablations</a></div>
  </div>
  <div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
    <div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
    <div><a class="internal-link" data-href="Chat Format and SFT" href="Chat%20Format%20and%20SFT">Chat Format and SFT</a></div>
  </div>
</div>

## References

[^1]: Sebastian Raschka, [LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)
[^2]: Jordan Hoffmann et al., DeepMind, [Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556)
[^3]: Hugging Face TRL, [SFTTrainer documentation](https://huggingface.co/docs/trl/sft_trainer)
[^4]: Hugging Face, [FineWeb dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
[^5]: Taku Kudo and John Richardson, Google, [SentencePiece](https://arxiv.org/abs/1808.06226)