There's more than one way to do self-supervised training.
This is the approach the author has taken:
Training corpus: "The fat cat sat on the mat"
Input -> Label
--------------
"The" -> " fat"
"The fat" -> " cat"
"The fat cat" -> " sat"
Hugging Face's Trainer class takes a different approach: the labels are effectively the input shifted left by one position, with the final position filled with the <ignore> token (-100).
Training corpus: "The fat cat sat on the mat"
Input (7 tokens): "The fat cat sat on the mat"
Output (7 tokens, the model's prediction at each position): "mat fat sat on fat mat and"
Shifted label (7 tokens): "fat cat sat on the mat <ignore>"
Cross entropy is then calculated between the output logits and the shifted labels. At least this is my understanding after reviewing the code.
The two ways are equivalent (it's always next-token prediction), but the latter is far more efficient as it computes the loss for all N tokens in a single forward pass.
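
To make that concrete, here's a rough sketch of the shifted-label loss in PyTorch. It's not the library's actual code; it just mirrors the idea, with the shift applied inside the loss function (which is roughly how Hugging Face's causal LMs handle it) and -100 as the ignore index:

    # Rough sketch of the shifted-label loss, not Hugging Face's actual code.
    # Assumes the usual causal-LM shapes: logits are (batch, seq_len, vocab_size)
    # and labels are a copy of the input IDs, with -100 marking ignored positions.
    import torch
    import torch.nn.functional as F

    def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Drop the last logit (there is no next token to predict after it)
        # and the first label (nothing predicts the first token): this is
        # the "shift left by 1" described above.
        shift_logits = logits[:, :-1, :]
        shift_labels = labels[:, 1:]
        # One cross-entropy call over all N-1 positions at once;
        # positions labelled -100 are ignored.
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        )

    # Toy usage: batch of 1, sequence of 7 tokens, vocabulary of 10.
    logits = torch.randn(1, 7, 10)
    labels = torch.randint(0, 10, (1, 7))
    print(causal_lm_loss(logits, labels))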
Part 1: https://www.gilesthomas.com/2024/12/llm-from-scratch-1