There's more than one way to do self-supervised training.
This is the approach the author has taken:
Training corpus: "The fat cat sat on the mat"
Input -> Label
--------------
"The" -> " fat"
"The fat" -> " cat"
"The fat cat" -> " sat"
Hugging Face's Trainer class takes a different approach: the labels are effectively the input shifted left by one position, with the final position filled with the <ignore> token (-100).
Training corpus: "The fat cat sat on the mat"
Input (7 tokens): "The fat cat sat on the mat"
Output (7 tokens, the model's prediction at each position): "mat fat sat on fat mat and"
Shifted label (7 tokens): "fat cat sat on the mat <ignore>"
Cross entropy is then calculated between the output logits and the shifted labels. At least this is my understanding after reviewing the code.
The two ways are equivalent (it's always next-token prediction), but the latter is far more efficient as it computes the loss for all N tokens in a single forward pass.
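
To make that concrete, here's a rough sketch of the shifted-label loss in PyTorch. It's not the library's actual code; it just mirrors the idea, with the shift applied inside the loss function (which is roughly how Hugging Face's causal LMs handle it) and -100 as the ignore index:

    # Rough sketch of the shifted-label loss, not Hugging Face's actual code.
    # Assumes the usual causal-LM shapes: logits are (batch, seq_len, vocab_size)
    # and labels are a copy of the input IDs, with -100 marking ignored positions.
    import torch
    import torch.nn.functional as F

    def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Drop the last logit (there is no next token to predict after it)
        # and the first label (nothing predicts the first token): this is
        # the "shift left by 1" described above.
        shift_logits = logits[:, :-1, :]
        shift_labels = labels[:, 1:]
        # One cross-entropy call over all N-1 positions at once;
        # positions labelled -100 are ignored.
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        )

    # Toy usage: batch of 1, sequence of 7 tokens, vocabulary of 10.
    logits = torch.randn(1, 7, 10)
    labels = torch.randint(0, 10, (1, 7))
    print(causal_lm_loss(logits, labels))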
Part 1: https://www.gilesthomas.com/2024/12/llm-from-scratch-1