<video src="https://assets.montek.dev/lectures/course_intro.mov" controls></video>
Hey everyone, and welcome.
I’m really excited to share a course I created! It’s about building an LLM from scratch and actually deploying it into real applications.
This course came from a lot of curiosity, a lot of reading, and a lot of hands-on experimentation. I spent time learning from Sebastian Raschka’s LLMs from Scratch, exploring Andrej Karpathy’s nanochat, and going through other papers, tools, and references that helped me understand how this whole space fits together. And while learning, I kept feeling the same thing: this knowledge is incredibly powerful, but it’s often scattered everywhere.
So I wanted to create the course I wish I had in the beginning. A course that brings everything together in one clear, practical, organized journey.
We’re not just going to talk about LLMs at a high level. We’re going to understand how they work, how they’re trained, how they’re evaluated, and how they’re deployed. And then we’ll take it one step further by looking at how real product workflows come together, including terminal experiences, web apps, and modern tooling like OpenTUI and Vercel.
By the end of this course, you’ll be able to build an LLM, understand it, and know how to ship it!
I’m genuinely excited to share what I’ve learned, and I hope you enjoy the course.
## What You Will Build
You will build up to this pipeline:
```mermaid
flowchart TD
A["Raw text and chat data"] --> B["Tokenizer"]
B --> C["Transformer model"]
C --> D["Training and evaluation"]
D --> E["Chat specialization"]
E --> F["CLI or web chatbot"]
```
You are not only learning isolated ideas like attention or loss curves. You are learning how those pieces connect when you train, evaluate, serve, and release a model.
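To make the diagram concrete, here is a deliberately tiny Python sketch of the same pipeline. This is not the course code: the character-level tokenizer and two-layer model are stand-ins for the real BPE tokenizer and Transformer you will build, and every name here is illustrative.

```python
# Toy version of the pipeline above: raw text -> tokenizer -> model -> training step.
# Illustrative only; the course builds a BPE tokenizer and a full Transformer.
import torch
import torch.nn as nn

text = "hello world, hello model"

# Stage 1-2: raw text -> tokens (character-level stand-in for BPE)
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

# Stage 3: a minimal "model" (embedding + linear head stands in for a Transformer)
model = nn.Sequential(nn.Embedding(len(vocab), 32), nn.Linear(32, len(vocab)))

# Stage 4: one training step of next-token prediction
logits = model(ids[:-1])                      # predict each next character
loss = nn.functional.cross_entropy(logits, ids[1:])
loss.backward()                               # gradients for an optimizer step
print(f"vocab={len(vocab)} loss={loss.item():.3f}")
```

Everything after this in the course is about replacing each stand-in with the real component and connecting the stages into one run.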
## How To Use This Course
Move through the course in this order:
1. read the concept note on `lectures.montek.dev`
2. open the linked notebook or code surface
3. connect that idea to the serious `picollm/accelerated/` path
4. return to the product notes once the model path is clear
The repo is split into three layers:
- `lectures.montek.dev` explains the ideas
- the notebooks and `course_tools` show the smallest runnable versions
- `picollm/accelerated/` is the serious end-to-end capstone path
> [!info] Course code
> - repo overview: [README.md](https://github.com/Montekkundan/llm/blob/main/README.md)
> - final model workflow: [picollm/](https://github.com/Montekkundan/llm/tree/main/picollm)
> - accelerated stack: [picollm/accelerated/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/README.md)
> - accelerated speedrun: [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh)
> - cost and runtime note: [[Compute, Time, and Cost of LLMs]]
> - repo reading guide: [[picollm Code Map]]
> - evaluation note: [[Evaluation and Model Quality]]
> - telemetry note: [[Experiment Tracking and Run Analysis]]
> - deployment workflow note: [[Real Chatbot Workflow]]
## Why This Course Exists
The goal is to make the full stack feel connected. You should be able to move from clean Transformer theory to a real training run, then from a checkpoint to a chatbot you can actually use.
## The Main Course Path
The main path has two layers:
```mermaid
flowchart TD
A["Core concepts"] --> B["Training and evaluation"]
B --> C["Runtime and deployment"]
C --> D["Final chatbot workflow"]
```
The final practical block is the `picollm/accelerated/speedrun.sh` pipeline.
Its stages, sketched in code after this list, are:
1. preflight and hardware validation
2. dataset bootstrap
3. tokenizer train and tokenizer eval
4. base pretraining
5. base evaluation
6. identity-data verification for [[Glossary#SFT|SFT]]
7. chat SFT
8. chat evaluation
9. report generation and run-manifest writeout
10. optional Hugging Face upload
11. launch into CLI or web chat
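If you want a mental model for how a staged run like this hangs together, here is a hedged Python sketch. It is not the contents of `speedrun.sh` (the real entry point is a shell script); the stage names and the manifest shape here are hypothetical, and only the ordering mirrors the list above.

```python
# Hypothetical orchestration sketch, NOT the real speedrun.sh.
# The idea: stages run in a fixed order, and each result lands in a run manifest.
import json
import time

STAGES = [
    "preflight", "dataset_bootstrap", "tokenizer_train_eval",
    "base_pretrain", "base_eval", "identity_data_check",
    "chat_sft", "chat_eval",
]

manifest = {"started": time.time(), "stages": {}}
for name in STAGES:
    # a real runner would dispatch to training/eval code and fail fast on error
    manifest["stages"][name] = {"status": "ok"}

manifest["finished"] = time.time()
print(json.dumps(manifest, indent=2))  # the "run-manifest writeout" stage
```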
This matters because almost every lecture maps to one stage of that run:
- [[LLM/concepts/Tokenization]] explains the tokenizer stage
- [[Training Loop]] explains base pretraining
- [[Evaluation and Model Quality]] explains validation and chat-eval outputs
- [[Chat Format and SFT]] explains why post-training changes behavior
- [[Real Chatbot Workflow]] explains how the checkpoints become a real chatbot
## Day-By-Day Plan
| Day | Notes | Goal |
| --- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| 1 | [[Roadmap]], [[Glossary]], [[LLM/concepts/Tokenization]] | Learn why text must become tokens before a model can use it. |
| 2 | [[Embedding Layer]], [[Positional Encoding]] | Learn how token IDs become vectors and how order enters the model. |
| 3 | [[Scaled Dot-Product Attention]], [[Multi-head Attention]] | Learn how tokens interact across a sequence. |
| 4 | [[Feed-Forward Network]], [[Layer Normalization]] | Learn the nonlinear and stability parts of a Transformer block. |
| 5 | [[Encoder Block]], [[Decoder Block]], [[Causal Language Modeling]] | Learn why decoder-only Transformers become generators. |
| 6 | [[Training Loop]], [[Training Configuration and Hyperparameters]] | Learn how optimization actually runs and which knobs matter. |
| 7 | [[Inference and Sampling]], [[Evaluation and Model Quality]] | Learn how generation works and how to judge outputs honestly. |
| 8 | [[Compute, Time, and Cost of LLMs]], [[Distributed Training and Multi-GPU]] | Learn why runtime, cost, and hardware behave the way they do. |
| 9 | [[Experiment Tracking and Run Analysis]], [[Research Workflow and Ablations]] | Learn how to read runs and compare experiments carefully. |
| 10 | [[Data Curation and Dataset Quality]], [[Chat Format and SFT]] | Learn how data shape changes model behavior. |
| 11 | [[Failure Modes and Debugging]], [[Inference Runtime and KV Cache]] | Learn where chatbot systems break and how the runtime behaves. |
| 12 | [[Quantization]], [[Serving, Latency, and Observability]] | Learn inference tradeoffs and serving measurement. |
| 13 | [[FastAPI Chat App]], [[Serving, Latency, and Observability]] | Learn how a model becomes a service. |
| 14 | [[Deployment]], [[OpenTUI Terminal Chat App]] | Learn how the backend becomes a usable product surface. |
| 15 | [[Real Chatbot Workflow]], [[picollm Code Map]] | Connect theory to the real codebase and run order. |
| 16 | [[Vercel AI SDK Chat App]] | Learn how the same backend powers a browser-native product. |
| 17 | [[Scaling Laws and Compute-Optimal Training]], [[Optimizer Theory for Transformer Training]] | Learn how serious training budgets are planned. |
| 18 | [[Advanced Distributed Training Systems]], [[Advanced Inference Systems]] | Learn the deeper systems layer behind large runs. |
| 19 | [[Formal Evaluation and Benchmarking]], [[Reproducibility and Research Method]] | Learn how to make and defend experimental claims. |
| 20 | [[Post-Training Beyond SFT]], [[Safety and Alignment Evaluation]] | Learn what comes after SFT before release. |
| 21 | [[Advanced Data Engineering for LLMs]], [[Interpretability and Mechanistic Analysis]] | Learn the deeper data and analysis questions behind serious LLM work. |
## Accounts And References You May Need Later
You do not need every account on day 1.
These matter later for the practical path:
- [Hugging Face](https://huggingface.co/): for datasets, model artifacts, and optional sharing
- [Vast.ai](https://vast.ai/): for rentable cloud GPUs during heavier runs
- [GitHub](https://github.com/): for the repo, code reading, and release workflow
- [Vercel](https://vercel.com/): optional, for the browser deployment lecture
These references are useful comparison points:
- [TensorTonic](https://www.tensortonic.com/): visual and interactive Transformer intuition
- [rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch): concept-first comparison
- [nanochat](https://github.com/karpathy/nanochat): systems-first comparison
## Optional Advanced Track
If you want the research-facing layer after the main path, follow [[Roadmap]].
## What You Should Know By The End
By the end of the course, you should be able to:
- explain how raw text becomes tokens, embeddings, attention patterns, and predictions
- read a decoder-only Transformer and know what each major part is doing
- train a small language model end to end and understand what the loss is measuring
- change generation behavior by adjusting decoding rather than guessing blindly (see the sampling sketch after this list)
- explain the difference between a [[Glossary#Base model|base model]] and a chat-SFT model
- serve a model behind a stable API and connect that backend to different clients
- move comfortably between theory, code, experiments, and deployment
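As one concrete example of the decoding point above, here is a minimal temperature and top-k sampling sketch. It assumes you already have a logits vector from a trained decoder-only model; the random logits and the specific numbers are placeholders.

```python
# Minimal temperature + top-k sampling over a placeholder logits vector.
import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 50) -> int:
    """Pick the next token id from a (vocab,) logits vector."""
    logits = logits / max(temperature, 1e-6)  # temperature rescales confidence
    if top_k > 0:
        kth = torch.topk(logits, min(top_k, logits.numel())).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))  # drop the tail
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.randn(100)  # stand-in for real model output
print(sample_next(logits, temperature=0.8, top_k=20))
```

Lowering `temperature` sharpens the distribution toward the top tokens, while `top_k` hard-limits how much of the tail can ever be sampled; the decoding lectures cover when each knob matters.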
## Reading Order
Follow [[Roadmap]] for the navigation view of the course.
For the final practical block, keep these docs close:
- [picollm/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/README.md)
- [picollm/accelerated/README.md](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/README.md)
- [picollm/accelerated/speedrun.sh](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/speedrun.sh)
- [prompts/real_chatbot_workflow/base_vs_chat_sft_prompts.md](https://github.com/Montekkundan/llm/blob/main/prompts/real_chatbot_workflow/base_vs_chat_sft_prompts.md)
## References Behind This Course
- Sebastian Raschka, [Build a Large Language Model (From Scratch)](https://www.manning.com/books/build-a-large-language-model-from-scratch) and [rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)
- Andrej Karpathy, [nanochat](https://github.com/karpathy/nanochat)
- Hugging Face, [The Smol Training Guide: The Secrets to Building World-Class LLMs](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#attention)