> [!info] Course code
> Use these repo paths together with this note:
> - [picollm/accelerated/dataset.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/dataset.py)
> - [picollm/accelerated/tokenizer.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/tokenizer.py)
> - [picollm/accelerated/pretrain/train_tokenizer.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train_tokenizer.py)
## What This Concept Is
In the basic data-quality note, the main question is whether the data is clean enough to train on. Here the question becomes more deliberate: given several data sources, several filters, and several target behaviors, how do you shape the corpus on purpose?
That is why this note feels less like data cleaning and more like distribution design.
## Foundation Terms You Need First
A serious corpus is not one pile of text. It is a **data pipeline** with acquisition, filtering, [[Glossary#Deduplication|deduplication]], mixture design, and evaluation-hygiene checks. **[[Glossary#Contamination|Contamination]]** is what happens when benchmark or evaluation content leaks into training. **[[Glossary#Deduplication|Deduplication]]** reduces repeated examples. **Mixture weighting** decides how much influence each source or domain gets.
So the mindset shift here is important: the data pipeline is part of the model design, not just preparation work that happens before the "real" modeling begins.
## Deduplication
Deduplication exists because web-scale data contains a large amount of repeated or near-repeated text. If duplicates are left in place, training compute is wasted on redundant signal, and the risk of [[Glossary#Benchmark|benchmark]] contamination increases. Deduplication is therefore both an efficiency technique and an evaluation-hygiene technique.[^1]
You should distinguish:
- exact duplicate removal
- near-duplicate clustering
- document-level versus span-level deduplication
Those choices change the resulting corpus. A dedup pipeline is not just janitorial work. It changes which patterns the model rehearses most often.
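To make these distinctions concrete, here is a minimal sketch of exact versus near-duplicate removal. Everything in it is illustrative: the normalization, shingle size, and Jaccard threshold are assumptions, and a pipeline at Dolma scale replaces the pairwise comparison with MinHash/LSH.[^1]

```python
# Illustrative sketch only: names, thresholds, and the toy corpus are assumptions.
import hashlib
import re


def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace so trivial variants match.
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def exact_dedup(docs: list[str]) -> list[str]:
    # Hash the normalized text and keep only the first occurrence of each hash.
    seen: set[str] = set()
    kept = []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept


def shingles(text: str, n: int = 5) -> set[str]:
    # Word n-grams ("shingles") used for near-duplicate comparison.
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def near_dedup(docs: list[str], threshold: float = 0.5) -> list[str]:
    # Pairwise Jaccard over shingle sets: fine for a demo, but O(n^2);
    # at scale this step is replaced by MinHash/LSH.
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        s = shingles(doc)
        if all(len(s & ks) / len(s | ks) < threshold for _, ks in kept):
            kept.append((doc, s))
    return [doc for doc, _ in kept]


base = "Large corpora contain many copies of the same boilerplate text spread across pages."
corpus = [
    base,
    base.upper(),                          # exact duplicate after normalization
    base + " Mirrors add a footer line.",  # near duplicate: shared prefix, extra tail
    "A completely unrelated document about tokenizer training.",
]
print(len(near_dedup(exact_dedup(corpus))))  # -> 2
```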
## Contamination
Contamination matters because a model can appear to generalize when it is partly recalling evaluation content that was already present in the training data. At modern scale, contamination can happen through public repositories, mirrored benchmark files, prompt collections, or repeated web copies. Good data engineering therefore includes some awareness of evaluation hygiene, not just cleaning malformed HTML.[^1]
The methodological standard to teach is:
- define the evaluation assets you care about
- screen training data against them where feasible
- report residual contamination risk honestly
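As a concrete sketch of the screening step, the snippet below flags training documents that share any long word n-gram with a protected evaluation set. The 13-gram default echoes GPT-3-style decontamination, but the window size and the flag-rather-than-delete policy are illustrative assumptions, not this course's pipeline.

```python
# Hedged sketch of n-gram contamination screening; all names are illustrative.


def ngram_set(text: str, n: int = 13) -> set[tuple[str, ...]]:
    # All word n-grams in a text; documents shorter than n yield an empty set.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_eval_index(eval_texts: list[str], n: int = 13) -> set[tuple[str, ...]]:
    # Union of all n-grams appearing in the protected evaluation assets.
    index: set[tuple[str, ...]] = set()
    for text in eval_texts:
        index |= ngram_set(text, n)
    return index


def is_contaminated(train_doc: str, eval_index: set[tuple[str, ...]], n: int = 13) -> bool:
    # Flag a training document if it shares any long n-gram with an eval asset.
    return not ngram_set(train_doc, n).isdisjoint(eval_index)


# Demo uses n=5 so the toy strings are long enough to overlap.
eval_index = build_eval_index(["What is the capital of France? Paris is the capital."], n=5)
print(is_contaminated("Trivia dump: what is the capital of France? Paris of course.", eval_index, n=5))  # -> True
```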
## Data mixtures and domain balancing
Large training corpora are often mixtures of domains: web text, code, books, academic text, conversations, reference material, and more. The mixture matters because it determines what the model sees often and what it sees rarely. Domain balance is therefore a design choice, not an incidental artifact.
At research level, mixture design is one of the clearest places where data engineering becomes model design. Changing the mixture changes:
- [[Glossary#Vocabulary|vocabulary]] statistics
- style priors
- factual density
- code versus prose competence
- benchmark transfer patterns
That is why advanced teams think in terms of mixture weights rather than one giant pooled dataset.
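Operationally, "thinking in mixture weights" can be as simple as sampling each next document's domain from a designed weight table. The domains and weights below are illustrative assumptions (with web text standing in for sources like FineWeb[^2]), not a recommended recipe.

```python
# Minimal sketch of mixture-weighted sampling over domain shards.
import random

# Illustrative design targets: each domain's expected share of training documents.
MIXTURE_WEIGHTS = {"web": 0.6, "code": 0.2, "books": 0.15, "chat": 0.05}


def sample_domain(rng: random.Random) -> str:
    # Draw a domain with probability proportional to its mixture weight,
    # so the realized domain share tracks the design target in expectation.
    domains, weights = zip(*MIXTURE_WEIGHTS.items())
    return rng.choices(domains, weights=weights, k=1)[0]


rng = random.Random(0)
draws = [sample_domain(rng) for _ in range(10_000)]
print({d: draws.count(d) / len(draws) for d in MIXTURE_WEIGHTS})
```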
## Tokenizer-corpus co-design
Tokenizer training is usually taught earlier as a separate concept. But for research work, tokenizer design and corpus design interact. [[Glossary#Vocabulary size|Vocabulary size]], normalization rules, and subword frequency are all downstream of the corpus distribution.
This means:
- a different corpus can justify a different tokenizer
- a multilingual or code-heavy corpus changes merge behavior
- domain-specific corpora can make token efficiency much better or much worse
You should learn that tokenization is part of data engineering, not a detached preprocessing convenience.
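As one concrete illustration, here is a hedged sketch of corpus-dependent tokenizer training with the SentencePiece library.[^3] The repo's `train_tokenizer.py` may use different settings entirely; the file paths and hyperparameters here are placeholders.

```python
# Hedged sketch: paths and hyperparameters are assumptions, not the repo's values.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",           # assumed path: the curated training corpus
    model_prefix="picollm_tok",   # illustrative output name
    vocab_size=8000,              # a corpus-dependent design choice
    model_type="bpe",             # merge behavior depends on the corpus mix
    character_coverage=0.9995,    # raise for multilingual or code-heavy corpora
)

sp = spm.SentencePieceProcessor(model_file="picollm_tok.model")
# Token-efficiency check: fewer tokens per character means cheaper training
# on the target domain, which is why corpus and tokenizer should be co-designed.
text = "def tokenize(text): return text.split()"
print(len(sp.encode(text)) / len(text))
```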
## Synthetic data trade-offs
Synthetic data can improve coverage, style consistency, or instruction-following behavior, but it can also narrow the model toward the biases and stylistic habits of the generating system. Synthetic data is powerful, but it is not automatically high quality just because it is cleanly formatted.
The right research question is not "did we add synthetic data?" It is:
- what gap did synthetic data fill?
- what bias did it import?
- how was quality audited?
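To make "how was quality audited?" tangible, here is one illustrative audit under stated assumptions: compare the bigram diversity of a synthetic batch against a real reference batch, since generator-written text often collapses toward a few stock phrasings. The metric and the toy data are illustrative assumptions, not a course standard.

```python
# One crude diversity audit; real audits combine several signals.
from collections import Counter


def distinct_2(texts: list[str]) -> float:
    # Fraction of word bigrams that are unique: lower means more repetitive.
    bigrams = Counter()
    for t in texts:
        words = t.lower().split()
        bigrams.update(zip(words, words[1:]))
    total = sum(bigrams.values())
    return len(bigrams) / total if total else 0.0


real = ["The cat sat on the mat.", "A storm rolled in over the harbor."]
synthetic = ["Sure! Here is a summary.", "Sure! Here is an overview."]
print(distinct_2(real), distinct_2(synthetic))  # lower score -> more repetitive
```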
## Curriculum design
Curriculum design asks whether certain data should appear earlier, later, or in different proportions during training. Even if a course project does not implement a sophisticated curriculum, you should understand the concept: data ordering and stage structure can matter.
Examples include:
- general corpus first, chat corpus later
- code mixture increased during a specialized phase
- high-quality subsets emphasized early for stability
That is a data-system decision, not only an optimizer decision.
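A staged mixture schedule can be expressed as a small lookup from training progress to mixture weights, in the spirit of the multi-stage recipes discussed in the Smol Training Playbook.[^4] The stage boundaries and weights below are illustrative assumptions.

```python
# Hedged sketch of a staged data curriculum; stages and weights are illustrative.

# Mixture weights per stage, keyed by the fraction of total steps where the stage begins.
CURRICULUM = [
    (0.0, {"web": 0.7, "books": 0.2, "code": 0.1, "chat": 0.0}),  # general corpus first
    (0.6, {"web": 0.4, "books": 0.2, "code": 0.4, "chat": 0.0}),  # code-heavy phase
    (0.9, {"web": 0.2, "books": 0.1, "code": 0.2, "chat": 0.5}),  # chat corpus last
]


def mixture_at(progress: float) -> dict[str, float]:
    # Return the weights of the latest stage whose start fraction <= progress.
    current = CURRICULUM[0][1]
    for start, weights in CURRICULUM:
        if progress >= start:
            current = weights
    return current


for p in (0.1, 0.7, 0.95):
    print(p, mixture_at(p))
```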
## Provenance and auditability
At PhD level, data engineering should be auditable. You should be able to answer:
- where did this data come from?
- what transformations were applied?
- what filters were used?
- what evaluation assets were protected?
Without provenance, later debugging becomes guesswork because every failure can be blamed on an invisible data difference.
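One lightweight way to keep those questions answerable is to log a provenance record alongside every shard. A minimal sketch, where the field names and example values are illustrative assumptions about what an auditable pipeline should record:

```python
# Hedged sketch of a per-shard provenance record; fields are illustrative.
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    source: str                                                # where the data came from
    snapshot: str                                              # which crawl/dump/version
    transformations: list[str] = field(default_factory=list)   # applied, in order
    filters: list[str] = field(default_factory=list)           # with their thresholds
    protected_evals: list[str] = field(default_factory=list)   # decontamination targets


record = ProvenanceRecord(
    source="common-crawl",   # illustrative source name
    snapshot="2024-10",
    transformations=["html-strip", "unicode-nfc", "exact-dedup", "near-dedup@0.5"],
    filters=["lang=en>0.9", "min-words=50"],
    protected_evals=["mmlu", "gsm8k"],
)
# One JSON line per shard keeps the audit trail greppable and diffable.
print(json.dumps(asdict(record)))
```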
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Previous</div>
<div><a class="internal-link" data-href="Safety and Alignment Evaluation" href="Safety%20and%20Alignment%20Evaluation">Safety and Alignment Evaluation</a></div>
</div>
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Interpretability and Mechanistic Analysis" href="Interpretability%20and%20Mechanistic%20Analysis">Interpretability and Mechanistic Analysis</a></div>
</div>
</div>
## References
[^1]: Luca Soldaini et al., Allen Institute for AI, [Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research](https://arxiv.org/abs/2402.00159)
[^2]: Hugging Face, [FineWeb dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
[^3]: Taku Kudo and John Richardson, Google, [SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing](https://arxiv.org/abs/1808.06226)
[^4]: Hugging Face, [The Smol Training Playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)