> [!info] Course code
> Use the companion repository for runnable notebooks, figures, and implementation references for this lecture.
> Some notebook viewers ignore deep links to individual code cells, so use the markdown heading immediately above the code cell as the stable locator.
> - [notebooks/tokenization/lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb): `Word-level tokenizer and the OOV failure`, `Regex tokenizer and fallback tokens`, `BPE merges on a micro-corpus`, `Bytes versus tokens on the same string`, `WordPiece longest-match decoding rule`, `SentencePiece intuition and whitespace markers`, `Side-by-side comparison`
> - [course_tools/runtime.py](https://github.com/Montekkundan/llm/blob/main/course_tools/runtime.py)
> - [picollm/accelerated/tokenizer.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/tokenizer.py)
> - [picollm/accelerated/pretrain/train_tokenizer.py](https://github.com/Montekkundan/llm/blob/main/picollm/accelerated/pretrain/train_tokenizer.py)
## What This Concept Is
Take the sentence `hello im montek`. You see one short sentence. The model does not. Before a Transformer can do anything useful, that sentence has to be broken into smaller pieces and turned into numbers. That whole conversion step is tokenization.
If you keep one picture in your head for this lecture, use this one: text becomes pieces, pieces become IDs, and only then does the rest of the model begin.
## Foundation Terms You Need First
Stay with the same sentence for a second. A **[[Glossary#Token|token]]** is one piece the tokenizer decides to work with. Depending on the tokenizer, `hello im montek` might become `hello`, `im`, and `montek`, or it might split `montek` into smaller pieces. The full list of pieces the tokenizer knows how to emit is its **[[Glossary#Vocabulary|vocabulary]]**. Each vocabulary entry gets a number called a **[[Glossary#Token ID|token ID]]**.
Those IDs are still just labels. The model cannot do geometry or attention with the number `57` by itself. The next step is the **[[Glossary#Embedding|embedding]]** layer, which turns each token ID into a **[[Glossary#Vector|vector]]**, meaning an ordered list of numbers the Transformer can actually compute on. That is why this note starts with tokens and ends by pointing forward to embeddings.
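To make those terms concrete, here is the smallest possible version in code; the pieces and ID numbers are invented for illustration and are not taken from any real tokenizer.

```python
# A vocabulary is literally a lookup table from pieces to token IDs.
# The pieces and IDs below are made up for this toy example.
vocab = {"hello": 0, "im": 1, "montek": 2}

pieces = "hello im montek".split()        # one possible split into pieces
token_ids = [vocab[p] for p in pieces]    # look each piece up in the vocabulary

print(pieces)      # ['hello', 'im', 'montek']
print(token_ids)   # [0, 1, 2]  -> still just labels; the embedding layer comes next
```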
Read the diagram below from top to bottom. It is the simplest overview of the whole note:
```mermaid
flowchart TD
A["Raw text"] --> B["Normalization rules"]
B --> C["Split into bytes, characters, or subwords"]
C --> D["Match pieces to a fixed vocabulary"]
D --> E["Emit token IDs"]
E --> F["Feed IDs into embeddings and the model"]
```
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Tokenization/01_text_to_ids_pipeline.mp4" controls></video>
## Which Code Surfaces Matter
- `lecture_walkthrough.ipynb` is the main runnable surface. It contains word-level tokenization, [[Glossary#BPE|BPE]], bytes-versus-tokens comparisons, [[Glossary#WordPiece|WordPiece]], [[Glossary#SentencePiece|SentencePiece]]-style unigram intuition, and side-by-side comparisons.
- `course_tools/runtime.py` is the smallest end-to-end runtime. Use it when you want the cleanest mapping from text to IDs.
- `picollm/accelerated/tokenizer.py` is the serious [[Glossary#Tokenizer|tokenizer]] surface. It defines special chat tokens, the split regex, byte-level BPE training logic, and conversation rendering.
- `picollm/accelerated/pretrain/train_tokenizer.py` is the real tokenizer training and evaluation entrypoint used before long runs.
Use those four surfaces in that order: notebook first, tiny runtime second, serious tokenizer third, real training entrypoint last.
## How Tokenization Works
A tokenizer is a pair of functions:
- **encode**: map text into a finite sequence of token IDs
- **decode**: map token IDs back into text
The full conversion path is:
1. clean or normalize the text if needed
2. split the text into candidate pieces
3. match those pieces against the vocabulary
4. emit token IDs
5. hand those IDs to the embedding layer
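Here is a minimal sketch of that whole path as an encode/decode pair. The vocabulary, the `<unk>` fallback token, and the whitespace split are hand-written toy choices; the real surfaces listed earlier, like `course_tools/runtime.py` and `picollm/accelerated/tokenizer.py`, do considerably more.

```python
# Minimal word-level tokenizer sketch: normalize, split, match, emit IDs.
# The vocabulary and the <unk> special token are invented for this example.
VOCAB = {"<unk>": 0, "hello": 1, "im": 2, "montek": 3}
ID_TO_PIECE = {i: p for p, i in VOCAB.items()}

def encode(text: str) -> list[int]:
    text = text.lower().strip()                  # 1. normalize
    pieces = text.split()                        # 2. split into candidate pieces
    return [VOCAB.get(p, VOCAB["<unk>"])         # 3-4. match the vocabulary, emit IDs
            for p in pieces]

def decode(token_ids: list[int]) -> str:
    return " ".join(ID_TO_PIECE[i] for i in token_ids)

ids = encode("Hello im Montek")
print(ids)                       # [1, 2, 3]
print(decode(ids))               # 'hello im montek'
print(encode("hello im ada"))    # [1, 2, 0]  -> 'ada' falls back to <unk>
```

The `<unk>` fallback in the last line previews the out-of-vocabulary failure discussed below: encoding still produces IDs, but decoding can no longer recover the original word.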
> [!question] What are normalization rules?
>> [!answer] Normalization rules clean or standardize text before segmentation. Common examples are lowercasing, Unicode normalization such as [[Glossary#NFC|NFC]] or [[Glossary#NFKC|NFKC]], [[Glossary#Collapsing runs|collapsing repeated whitespace runs]], and [[Glossary#Stripping|stripping]] unwanted leading or trailing whitespace.[^2]
Normalization can make a tokenizer more consistent, but it can also remove distinctions that are hard or impossible to reconstruct later.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Tokenization/08_normalization_irreversibility.mp4" controls></video>
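One quick way to see that irreversibility, using Python's standard `unicodedata` module; the sample strings are arbitrary:

```python
import unicodedata

# NFKC folds "compatibility" characters into plain forms: useful, but lossy.
samples = ["ﬁnance", "①②③", "Ｈｅｌｌｏ"]
for s in samples:
    print(repr(s), "->", repr(unicodedata.normalize("NFKC", s)))
# 'ﬁnance'   -> 'finance'   (the 'ﬁ' ligature is gone for good)
# '①②③'     -> '123'
# 'Ｈｅｌｌｏ' -> 'Hello'

# Lowercasing is also irreversible: the original casing cannot be recovered.
print("US".lower(), "us".lower())   # both become 'us'
```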
> [!question] What is [[Glossary#Vocabulary size|vocabulary size]]?
>> [!answer] Vocabulary size is the number of distinct token entries a tokenizer can emit. If the vocabulary size is `50,257`, the model can read and predict `50,257` token IDs.[^3]
> [!question] What is [[Glossary#Out-of-vocabulary (OOV)|out-of-vocabulary (OOV)]]?
>> [!answer] OOV means a word or symbol cannot be represented directly by a word-level vocabulary. Subword methods were adopted largely to avoid this failure mode by breaking unseen text into smaller reusable pieces.[^4]
> [!question] Why are token IDs not enough by themselves?
>> [!answer] Token IDs are discrete labels, not learned numerical representations. The next step is the [[Embedding Layer]], where each token ID is mapped to a learned [[Glossary#Vector|vector]] so attention and feed-forward layers can compute on it.
Two properties matter immediately:
- **determinism**: the same text must produce the same token IDs under the same tokenizer version
- **coverage**: the tokenizer must still handle unfamiliar names, code, symbols, and multilingual text
That is why modern LLM tokenizers are usually byte-level or subword-based rather than pure word lists.[^4][^5]
## Why tokenization exists
The modern story starts in compression. Byte Pair Encoding was introduced as a compression algorithm: repeatedly merge the most frequent adjacent symbol pair to reduce sequence length.[^6]
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Tokenization/07_bpe_compression_origin.mp4" controls></video>
Neural machine translation then turned this into a language-modeling tool. The key insight was simple: language is open-vocabulary, so a fixed word list breaks too easily on rare names, domain terms, or spelling variation. Subword tokenization fixes that by composing rare words from smaller units.[^4]
The original Transformer paper used a **shared source-target vocabulary**, meaning the same learned subword vocabulary was used on both the encoder input side and the decoder output side rather than training separate vocabularies for each language stream.[^3]
From there, different toolkits became standard:
- **BPE** stayed popular because it is simple and practical
- **WordPiece** became widely known through BERT-style models, where BERT is the canonical encoder model trained with masked-token prediction[^8][^12]
- **SentencePiece** became important because it treats tokenization as a trainable, language-agnostic toolkit that can work directly from raw text[^2][^9]
## The major tokenization families
Use this table as the simplest compare-and-contrast view:
| Family | Main idea | Main strength | Main limitation | Notebook section |
|---|---|---|---|---|
| Word-level | split on whitespace or word boundaries | intuitive | breaks on OOV text | `Word-level tokenizer and the OOV failure` |
| Character-level | one symbol per character | no OOV for plain text | long sequences | `Word-level tokenizer and the OOV failure` |
| BPE | merge frequent adjacent pieces | efficient practical default | merge rules are fixed after training | `BPE merges on a micro-corpus` |
| WordPiece | greedy longest-match segmentation over a trained vocab | stable, widely used in BERT | depends heavily on the trained vocab and decoding rule | `WordPiece longest-match decoding rule` |
| SentencePiece / unigram | train pieces directly from raw text with whitespace markers | language-agnostic and reproducible | less intuitive until you see the whitespace markers and scores | `SentencePiece intuition and whitespace markers` |
### Word-level tokenization
Word-level tokenization is useful because it makes the OOV problem obvious immediately. The moment a new word is missing from the vocabulary, the tokenizer either fails or falls back to an unknown token.[^4]
> [!example] Notebook follow-up
> - [`Word-level tokenizer and the OOV failure`](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb#word-level-tokenizer-and-the-oov-failure)
> Use this section right after the paragraph above to see the OOV failure mode in a runnable toy example.
> [!tip] TensorTonic follow-up
> - [TensorTonic: Transformers Tokenization](https://www.tensortonic.com/research/transformer/transformers-tokenization)
> Use it here to practice the same word-level versus subword trade-off interactively.
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Tokenization/02_word_level_oov_failure.mp4" controls></video>
### Character-level tokenization
Character-level tokenization avoids OOV for alphabetic text, but it makes sequences much longer. That matters because Transformer [[Glossary#Self-attention|self-attention]] becomes more expensive as sequence length grows; the exact comparison is discussed in Section 4 and Table 1 of the Transformer paper.[^10]
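A rough way to feel that difference is to count token positions and the pairwise comparisons self-attention would make over them. The sentence below is arbitrary, and the squared count is only a back-of-the-envelope proxy for attention cost:

```python
sentence = "tokenization turns text into pieces the model can count"

char_tokens = list(sentence)        # character-level: one token per character
word_tokens = sentence.split()      # word-level: one token per whitespace word

for name, toks in [("chars", char_tokens), ("words", word_tokens)]:
    n = len(toks)
    print(f"{name}: {n} positions, ~{n * n} pairwise attention comparisons")
# chars: 55 positions, ~3025 pairwise attention comparisons
# words: 9 positions, ~81 pairwise attention comparisons
```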
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Tokenization/03_character_vs_subword_tradeoff.mp4" controls></video>
### BPE
BPE starts with a base alphabet, counts frequent adjacent pairs, merges the most common pair, and repeats until it reaches a target vocabulary size.[^4][^6]
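Here is a compact sketch of that merge loop on a micro-corpus. The corpus, the word frequencies, and the number of merges are invented; the notebook section below runs a fuller version of the same idea.

```python
from collections import Counter

# Micro-corpus as word -> frequency, with each word split into characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

def count_pairs(words):
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(words, pair):
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])   # fuse the chosen pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for step in range(5):                                     # learn five merges
    best = count_pairs(words).most_common(1)[0][0]
    words = merge(words, best)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")
print(list(words))                                        # segmented corpus after the merges
```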
> [!example] Notebook follow-up
> - [`BPE merges on a micro-corpus`](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb#bpe-merges-on-a-micro-corpus)
> Run this notebook section here to watch the merge process form reusable subword units step by step.
> [!tip] TensorTonic follow-up
> - [TensorTonic: GPT-2 BPE Training](https://www.tensortonic.com/research/gpt2/gpt2-bpe-training)
> - [TensorTonic: GPT-2 BPE Encode/Decode](https://www.tensortonic.com/research/gpt2/gpt2-bpe-encode-decode)
> Use these exercises after this paragraph to practice both learning BPE merges and applying them at encode/decode time.
Modern byte-level BPE uses bytes as the base alphabet. That gives full coverage for arbitrary UTF-8 text and is why GPT-2-style tokenizers can handle emojis, code, punctuation, and multilingual strings without needing a conventional unknown token.[^5][^11]
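Bytes give that coverage because any UTF-8 string, whatever the script or symbol, decomposes into values between 0 and 255, so a 256-entry base alphabet never meets an unknown symbol. The strings below are arbitrary examples:

```python
# Any UTF-8 string decomposes into bytes in the range 0-255,
# so a byte-level base alphabet of 256 symbols covers everything.
for text in ["hello", "naïve", "日本語", "🙂", "x = f(y)"]:
    raw = text.encode("utf-8")
    print(f"{text!r}: {len(text)} characters -> {len(raw)} bytes {list(raw)}")
# '🙂' is 1 character but 4 bytes; none of them falls outside 0-255.
```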
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Tokenization/04_bpe_merge_process.mp4" controls></video>
### WordPiece
WordPiece is not just “BPE with a different name.” The key detail is its inference-time rule: it usually applies a greedy longest-match-first segmentation over a trained vocabulary.[^12]
That is why the `##` continuation convention matters. It marks that a piece is continuing a word rather than starting a fresh word span. The notebook already demonstrates this in the `WordPiece longest-match decoding rule` section.
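A minimal sketch of that longest-match rule, with the `##` continuation marker, over a hand-written vocabulary. The vocabulary entries and the `[UNK]` fallback here are invented for illustration; a real WordPiece vocabulary is trained, not hand-picked:

```python
# Greedy longest-match WordPiece-style segmentation over a toy vocabulary.
# "##" marks a piece that continues the current word rather than starting one.
VOCAB = {"token", "##ization", "##ize", "play", "##ing", "un", "##believ", "##able"}

def wordpiece(word: str, vocab=VOCAB) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:                       # try the longest candidate first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                pieces.append(candidate)
                break
            end -= 1
        else:                                    # nothing matched at this position
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece("tokenization"))   # ['token', '##ization']
print(wordpiece("playing"))        # ['play', '##ing']
print(wordpiece("unbelievable"))   # ['un', '##believ', '##able']
print(wordpiece("xylophone"))      # ['[UNK]']
```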
> [!example] Notebook follow-up
> - [`WordPiece longest-match decoding rule`](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb#wordpiece-longest-match-decoding-rule)
> Use this notebook section here to inspect how longest-match segmentation behaves on real token pieces.
> [!tip] TensorTonic follow-up
> - [TensorTonic: BERT WordPiece](https://www.tensortonic.com/research/bert/bert-wordpiece)
> Work through it here to practice the same WordPiece continuation-rule behavior directly.
> [!question] What is BERT?
>> [!answer] BERT is a well-known encoder model trained with masked-token prediction rather than next-token generation. It made WordPiece widely familiar in NLP practice.[^8][^12]
### SentencePiece and unigram tokenization
SentencePiece is best understood as a toolkit, not just one algorithm. It packages normalization, training, segmentation, and decoding into one reproducible model file.[^2][^9]
It also treats whitespace as an explicit symbol rather than assuming a language-specific word splitter. That is why the visible whitespace marker matters so much in SentencePiece examples.[^2][^9]
> [!question] What does unigram mean here?
>> [!answer] [[Glossary#Unigram tokenization|Unigram tokenization]] treats tokenization as a scoring problem over candidate pieces. Instead of repeatedly merging pairs like BPE, it keeps a candidate vocabulary and searches for a high-probability segmentation of the text.[^2]
SentencePiece supports both BPE and unigram language-model tokenization.[^2][^9]
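Here is a small sketch of both ideas together: whitespace becomes a visible `▁` symbol, and segmentation becomes a search for the highest-scoring split under per-piece scores. The piece scores are invented and the search is brute force; SentencePiece trains the scores and uses much faster dynamic programming:

```python
import math
from functools import lru_cache

# SentencePiece-style: whitespace becomes an explicit symbol before segmentation.
MARK = "\u2581"                                   # the visible "▁" marker
text = "hello im montek".replace(" ", MARK)

# Toy unigram piece scores (log probabilities), invented for illustration.
PIECES = {MARK + "hello": -2.0, MARK + "im": -2.5, MARK + "montek": -8.0,
          MARK + "mon": -3.0, "tek": -3.5, MARK: -1.5,
          **{c: -6.0 for c in "helloimmontek"}}   # single characters as fallback

@lru_cache(maxsize=None)
def best_segmentation(s: str):
    """Return (total log-prob, pieces) for the highest-scoring split of s."""
    if not s:
        return 0.0, ()
    best = (-math.inf, ())
    for i in range(1, len(s) + 1):
        piece = s[:i]
        if piece in PIECES:
            rest_score, rest_pieces = best_segmentation(s[i:])
            best = max(best, (PIECES[piece] + rest_score, (piece,) + rest_pieces))
    return best

score, pieces = best_segmentation(text)
print(pieces)   # ('▁hello', '▁im', '▁mon', 'tek') -> the rare whole word loses to the split
print(score)    # -11.0 with these toy scores
```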
The practical production lesson is that large multilingual vocabularies are often implemented through SentencePiece-style tooling. In Gemma and Gemini-adjacent tooling, the large vocabulary is there to reduce fragmentation across many languages, digits, and whitespace patterns, not because “bigger is always better.”[^13][^14]
## Engineering trade-offs that actually matter
### Vocabulary size versus sequence length
There is a real trade-off:
- larger vocabularies usually shorten sequences
- shorter sequences reduce attention cost and KV-cache growth
- larger vocabularies also make embeddings and output layers larger
- rare tokens may then receive too little training signal
What that means in practice is:
- if the tokenizer has bigger pieces available, a word or phrase may be represented by fewer tokens
- fewer tokens mean the model has fewer positions to attend over, so generation is usually cheaper and faster
- but every extra vocabulary entry needs parameters in the embedding table, and usually in the output layer too
- if you add many specialized tokens that appear only rarely, the model may never see them often enough to learn strong representations
Here is the beginner version of the same trade-off:
- **larger vocabulary**: more compact tokenization, but a bigger model table to learn
- **smaller vocabulary**: smaller embedding/output tables, but longer token sequences and more runtime work
For example, imagine the phrase `machine learning engineer`:
- with a more expressive vocabulary, the tokenizer might encode it as a few large subword pieces
- with a smaller vocabulary, it may need many smaller chunks
If the sequence gets longer, then:
- attention has to compare more token positions
- generation has to carry a larger KV cache over time
- latency and memory use increase
If the vocabulary gets larger, then:
- the embedding layer gets bigger
- the output logits layer gets bigger unless weights are tied
- some rare tokens may be poorly learned because they do not appear often enough during training
> [!question] What is [[Glossary#KV cache|KV cache]]?
>> [!answer] During generation, the model stores previously computed keys and values so it does not recompute the whole prefix at every step. You will see the full runtime story in [[Inference Runtime and KV Cache]].
> [!question] What does “data-starved” mean?
>> [!answer] A token is data-starved when it exists in the vocabulary but appears too rarely in training for its embedding to be learned well. For example, a rare chemical name may get its own token, but if it appears only a handful of times, that dedicated embedding may stay weak.
The trade-off is easiest to see with parameter accounting:
```mermaid
flowchart TD
A["Vocabulary size V"]
B["Input embedding matrix: V x d_model"]
C["Output projection: d_model x V"]
D["Weight tying"]
E["Reuse the same matrix instead of storing both separately"]
A --> B
B --> C
C --> D
D --> E
```
The Transformer paper explicitly shares the embedding matrix with the pre-[[Glossary#Softmax|softmax]] output projection, and Press and Wolf explain why that tying can be a strong design choice.[^3][^15]
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Tokenization/05_vocab_size_vs_sequence_length.mp4" controls></video>
### Preview: why vocabulary size changes model size
Here, **model size** just means how many learned numbers the model has to store.
Those learned numbers are called **[[Glossary#Parameter|parameters]]**.
Suppose:
- [[Glossary#Vocabulary size|vocabulary size]] $V = 50{,}000$
- token vector size $d_{\text{model}} = 1024$
Here:
- $V$ means the tokenizer knows `50,000` distinct token entries
- for now, read $d_{\text{model}}$ as: how many numbers the model stores for each token after lookup
- you do not need the full embedding story yet; that comes in [[Embedding Layer]]
- `1024` means each token is represented inside the model as a [[Glossary#Vector|vector]] with `1024` numbers
So if token ID `123` is looked up in the embedding table, the model does not get back a single number. It gets back a length-`1024` vector.
Then:
- input embeddings require $50{,}000 \times 1024 = 51{,}200{,}000$ [[Glossary#Parameter|parameters]]
- an untied output projection adds another $1024 \times 50{,}000 = 51{,}200{,}000$ [[Glossary#Parameter|parameters]]
- tying the weights removes that second full matrix
If you are new, read that multiplication very literally:
- `50,000` is the number of token types in the vocabulary
- `1024` is the number of learned values stored for each token
- so for every token, the model stores a [[Glossary#Vector|vector]] of `1024` [[Glossary#Parameter|parameters]]
- doing that for all `50,000` tokens gives `50,000 x 1024 = 51,200,000` learned numbers
You can picture the embedding table like a spreadsheet or [[Glossary#Matrix|matrix]]:
- each **row** corresponds to one token in the vocabulary
- each **column** corresponds to one dimension of that token's representation
- the value in each cell is one learned [[Glossary#Parameter|parameter]]
So the total number of learned [[Glossary#Parameter|parameters]] is just:
- number of rows x number of columns
- $50{,}000 \times 1024 = 51{,}200{,}000$
For a tiny toy example, imagine:
- vocabulary size `V = 3`
- token vector size $d_{\text{model}} = 4$
Then the embedding table would have:
- `3` rows, one for each token
- `4` columns, one for each feature slot in the token vector
So the table would contain:
- $3 \times 4 = 12$ [[Glossary#Parameter|parameters]] total
The same counting rule scales up to the real example. Nothing more complicated is happening there.
The first multiplication comes from:
- one row per vocabulary entry
- one column per vector slot
So the embedding [[Glossary#Matrix|matrix]] shape is:
- $V \times d_{\text{model}} = 50{,}000 \times 1024$
That means the model stores a learned `1024`-number vector for each of the `50,000` possible tokens.
If you want a preview of what comes later, there is usually also an output projection [[Glossary#Matrix|matrix]].
That output projection goes in the opposite direction:
- it takes a `1024`-number model state
- and maps it to `50,000` output [[Glossary#Logits|logits]], one score for each possible next token
That is why its shape is:
- $d_{\text{model}} \times V = 1024 \times 50{,}000$
You can count that matrix the same way:
- `1024` input features go in
- `50,000` token scores come out
- so the matrix again contains `1024 x 50,000 = 51,200,000` [[Glossary#Parameter|parameters]]
The only difference is the direction of use:
- the embedding matrix turns a token ID into a vector
- the output matrix turns a vector into token scores
If that second matrix feels too early, the main takeaway is still simple:
- bigger vocabulary means a bigger embedding table
That alone is enough to show that tokenization choices affect model size.
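The same counting rule, written as a few lines of code; the vocabulary sizes in the loop are arbitrary comparison points:

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool) -> int:
    """Parameters in the input embedding table plus the output projection."""
    input_table = vocab_size * d_model              # one d_model-length row per token
    output_proj = 0 if tied else d_model * vocab_size
    return input_table + output_proj

d_model = 1024
for vocab_size in (32_000, 50_000, 128_000):        # arbitrary example sizes
    tied = embedding_params(vocab_size, d_model, tied=True)
    untied = embedding_params(vocab_size, d_model, tied=False)
    print(f"V={vocab_size:>7}: tied {tied:>12,}  untied {untied:>12,}")
# The V=50,000 row reproduces the 51,200,000 and 102,400,000 figures worked out above.
```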
> [!question] Why this exercise matters
>> [!answer] It makes vocabulary design concrete. Tokenization is not cosmetic. Vocabulary size changes how many learned numbers the model has to store, how much memory it needs, and sometimes how well optimization works.
### Token counts affect cost and throughput
In production APIs, tokens are not only representation units. They are also billing and quota units. OpenAI pricing, OpenAI token guidance, and Gemini token-count APIs all make this explicit.[^1][^16][^17]
That is why tokenization is part of the economics of LLM usage.
For a concrete demo, open [lecture_walkthrough.ipynb](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb) and go to the heading `Bytes versus tokens on the same string` right above the code cell that prints UTF-8 byte counts beside token pieces.
> [!example] Notebook follow-up
> - [`Bytes versus tokens on the same string`](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb#bytes-versus-tokens-on-the-same-string)
> Use this notebook section here to compare byte counts and token counts on the exact same input string.
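If you want the same comparison outside the notebook, a sketch along these lines works, assuming the `tiktoken` package is installed; the choice of the `gpt2` encoding and the sample strings are arbitrary:

```python
import tiktoken  # pip install tiktoken; any BPE encoding it ships will do

enc = tiktoken.get_encoding("gpt2")

for text in ["hello im montek", "antidisestablishmentarianism", "こんにちは 🙂"]:
    n_bytes = len(text.encode("utf-8"))
    token_ids = enc.encode(text)
    print(f"{text!r}: {n_bytes} UTF-8 bytes, {len(token_ids)} tokens -> {token_ids}")
```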
### Tokenization bias and token-free research
Tokenization is useful, but it is not neutral. It can fragment some languages more than others, inflate some inputs more than others, and introduce provider-specific counting differences. That is one reason token-free work such as SpaceByte, ByT5, and CANINE exists.[^18][^19][^20]
## Do Gemini, Claude, Grok, and OpenAI use the same tokenizers?
No. There are three layers of difference:
1. they may use different tokenizer families
2. even inside the same family, they may use different vocabularies and normalization rules
3. API wrappers may insert extra hidden tokens that change count and cost
OpenAI’s public tooling is BPE-based through `tiktoken`.[^7]
Gemma and Gemini-adjacent tooling use large SentencePiece-based vocabularies.[^13][^14]
Grok-1 publicly reports a SentencePiece tokenizer with 131,072 tokens.[^21]
Anthropic explicitly says token counting can be an estimate and may include system-added tokens not billed.[^22]
xAI explicitly warns that displayed counts can differ from actual endpoint consumption because inference endpoints may add predefined tokens.[^23]
<video src="https://assets.montek.dev/lectures/media/llm/concepts/Tokenization/06_provider_tokenizer_comparison.mp4" controls></video>
> [!question] Quick check
> Which explanation is best when two models both say they use SentencePiece but still produce different token counts for the same chat request?
> A. the toolkit name guarantees identical tokenization
> B. vocabulary, normalization, special tokens, and message wrappers can still differ
> C. only pricing rules differ
>> [!answer] B. The toolkit name does not force the same vocabulary, normalization rules, special tokens, or chat wrappers.
## Common questions you may have
> [!question] Why not just use words?
>> [!answer] Because language is open-vocabulary. Rare words, names, code, and multilingual strings break pure word-level vocabularies too easily, which is why subword tokenization became standard.[^4]
> [!question] Why not just use characters or bytes and remove tokenization completely?
>> [!answer] You can, but sequences usually become much longer. That increases attention cost and runtime pressure. Token-free research exists, but it has to close a real efficiency gap.[^10][^18][^19][^20]
> [!question] Why can token counts differ between a tokenizer UI and API billing?
>> [!answer] Because some providers add hidden wrapper tokens, system prompts, policy text, or protocol-specific message tokens. The visible text is not always the full sequence the endpoint actually sees.[^22][^23][^24]
> [!example] Notebook walkthroughs in this lecture
>
> If you want to run the companion code while reading this note, use these notebook sections. If the viewer ignores the fragment, search for the exact heading text in the notebook:
>
> - [`Word-level tokenizer and the OOV failure`](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb#word-level-tokenizer-and-the-oov-failure)
> - [`BPE merges on a micro-corpus`](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb#bpe-merges-on-a-micro-corpus)
> - [`Bytes versus tokens on the same string`](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb#bytes-versus-tokens-on-the-same-string)
> - [`WordPiece longest-match decoding rule`](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb#wordpiece-longest-match-decoding-rule)
> - [`SentencePiece intuition and whitespace markers`](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb#sentencepiece-intuition-and-whitespace-markers)
> - [`Side-by-side comparison`](https://github.com/Montekkundan/llm/blob/main/notebooks/tokenization/lecture_walkthrough.ipynb#side-by-side-comparison)
>
> Those sections already cover the main comparison path for this lecture. They are the best place to demonstrate how the families differ in practice.
> [!tip] TensorTonic practice for this lecture
>
> If you want to practice this lecture in a more implementation-focused format, work through these TensorTonic exercises:
>
> - [TensorTonic: Transformers Tokenization](https://www.tensortonic.com/research/transformer/transformers-tokenization)
> - [TensorTonic: BERT WordPiece](https://www.tensortonic.com/research/bert/bert-wordpiece)
> - [TensorTonic: GPT-2 BPE Training](https://www.tensortonic.com/research/gpt2/gpt2-bpe-training)
> - [TensorTonic: GPT-2 BPE Encode/Decode](https://www.tensortonic.com/research/gpt2/gpt2-bpe-encode-decode)
>
> They are good follow-ups because they let you compare the main tokenizer families directly:
>
> - subword tokenization in the original Transformer era
> - WordPiece longest-match behavior in BERT
> - byte-pair merge learning in GPT-2
> - the difference between learning merges and using them at encode/decode time
<div style="display:flex; gap:1rem; margin:1.5rem 0; flex-wrap:wrap;">
<div style="flex:1; min-width:220px; border:1px solid var(--background-modifier-border); border-radius:12px; padding:1rem; background:var(--background-secondary);">
<div style="font-size:0.85em; color:var(--text-muted); margin-bottom:0.35rem;">Next</div>
<div><a class="internal-link" data-href="Embedding Layer" href="Embedding%20Layer">Embedding Layer</a></div>
</div>
</div>
## References
[^1]: OpenAI Help, "What are tokens and how to count them?," 2025. https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
[^2]: Taku Kudo and John Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing," 2018. https://arxiv.org/abs/1808.06226
[^3]: Ashish Vaswani et al., "Attention Is All You Need," 2017. https://arxiv.org/abs/1706.03762
[^4]: Rico Sennrich, Barry Haddow, and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units," 2015. https://arxiv.org/abs/1508.07909
[^5]: Hugging Face, "Tokenizer summary," 2025. https://huggingface.co/docs/transformers/en/tokenizer_summary
[^6]: Philip Gage, "A New Algorithm for Data Compression," 1994. https://www.semanticscholar.org/paper/A-new-algorithm-for-data-compression-Gage/1aa9c0045f1fe8c79cce03c7c14ef4b4643a21f8
[^7]: OpenAI, "tiktoken," 2025. https://github.com/openai/tiktoken
[^8]: Hugging Face, "BERT," 2025. https://huggingface.co/docs/transformers/en/model_doc/bert
[^9]: Google, "SentencePiece repository," 2025. https://github.com/google/sentencepiece
[^10]: Ashish Vaswani et al., "Attention Is All You Need," Section 4 and Table 1, 2017. https://arxiv.org/html/1706.03762v7#S4
[^11]: Hugging Face, "GPT-2," 2025. https://huggingface.co/docs/transformers/en/model_doc/gpt2
[^12]: Xinying Song, Alex Salcianu, and Yang Song, "Fast WordPiece Tokenization," 2020. https://arxiv.org/abs/2012.15524
[^13]: Gemma Team et al., "Gemma 2: Improving Open Language Models at a Practical Size," 2024. https://arxiv.org/abs/2408.00118
[^14]: Google Developers, "Gemma explained: Overview of Gemma model family architectures," 2024. https://developers.googleblog.com/en/gemma-explained-overview-gemma-model-family-architectures/
[^15]: Ofir Press and Lior Wolf, "Using the Output Embedding to Improve Language Models," 2016. https://arxiv.org/abs/1608.05859
[^16]: OpenAI, "API pricing," 2025. https://openai.com/api/pricing/
[^17]: Google, "Tokens," 2025. https://ai.google.dev/gemini-api/docs/tokens
[^18]: Kevin Slagle, "SpaceByte: Towards Deleting Tokenization from Large Language Modeling," 2024. https://arxiv.org/abs/2404.14408
[^19]: Linting Xue, Aditya Barua, and Noah Constant, "ByT5: Towards a token-free future with pre-trained byte-to-byte models," 2021. https://arxiv.org/abs/2105.13626
[^20]: Jonathan H. Clark, Dan Garrette, and Iulia Turc, "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation," 2021. https://arxiv.org/abs/2103.06874
[^21]: xAI, "Grok-1 repository," 2024. https://github.com/xai-org/grok-1
[^22]: Anthropic, "Token counting," 2025. https://platform.claude.com/docs/en/build-with-claude/token-counting
[^23]: xAI, "Rate limits," 2025. https://docs.x.ai/developers/rate-limits
[^24]: Anthropic, "Constitutional AI," 2025. https://www.anthropic.com/constitution