- What matters when training LLMs:
- Architecture
- Training Algorithm and Loss
- Data
- Evaluation
- Systems
Pre-Training.
Post-Training.
Language Modeling.
P(the, mouse, ate, the, cheese) = 0.02 – a fluent, plausible sentence gets relatively high probability
P(the, the, mouse, ate, cheese) = 0.0001 – ungrammatical word order gets low probability (syntactic knowledge)
P(the, cheese, ate, the, mouse) – an implausible meaning should also get low probability (semantic knowledge)
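These sequence probabilities come from the chain rule: the model predicts one next-token conditional at a time and the log-probabilities add up. A minimal sketch, assuming a hypothetical `next_token_probs(prefix)` that returns a distribution over the vocabulary (not the lecture's code):

```python
import math

def sequence_log_prob(tokens, next_token_probs):
    """Chain rule: log P(x1..xT) = sum_t log P(x_t | x_<t)."""
    total = 0.0
    for t in range(len(tokens)):
        probs = next_token_probs(tokens[:t])  # distribution over the next token given the prefix
        total += math.log(probs[tokens[t]])   # log-probability of the token that actually occurred
    return total

# A model with syntactic/semantic knowledge scores "the mouse ate the cheese"
# higher than "the the mouse ate cheese" or "the cheese ate the mouse".
```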
Auto-Regressive (AR) language model: predict the next word.
Steps (e.g. "she likely prefers"): tokenize -> [1: she, 2: likely, 3: prefers] -> pass the tokens to the (black-box) model -> get a probability distribution over the next token -> sample -> detokenize.
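A minimal sketch of that loop, with a toy word-level vocabulary and a uniform "model" standing in for the real tokenizer and network (both are placeholders, not the lecture's code):

```python
import random

vocab = ["she", "likely", "prefers", "tea", "coffee", "<eos>"]
tok2id = {w: i for i, w in enumerate(vocab)}

def tokenize(text):
    return [tok2id[w] for w in text.lower().split()]

def detokenize(ids):
    return " ".join(vocab[i] for i in ids)

def model(prefix_ids):
    # Placeholder for the black box: a real model returns a learned
    # probability distribution over the next token; here it is uniform.
    return [1.0 / len(vocab)] * len(vocab)

def generate(prompt, max_new_tokens=3):
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = model(ids)                                             # next-token distribution
        next_id = random.choices(range(len(vocab)), weights=probs)[0]  # sample
        if vocab[next_id] == "<eos>":
            break
        ids.append(next_id)
    return detokenize(ids)

print(generate("she likely prefers"))
```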
Loss. Cross-entropy on the next token: the negative log-probability the model assigns to the observed next token, averaged over positions.
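In frameworks this is the standard cross-entropy between the model's next-token logits and the tokens that actually follow; a minimal PyTorch sketch with random tensors standing in for a real model's output and a real batch:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 2, 8, 1000
logits = torch.randn(batch, seq_len, vocab_size)          # stand-in for model output
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for token ids

# Predict token t+1 from positions <= t: shift logits and targets by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 0..T-2
    tokens[:, 1:].reshape(-1),               # the observed next tokens
)
print(loss.item())  # average per-token negative log-likelihood
```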
Tokenizer. Byte-Pair Encoding (BPE): start from a large corpus of text, map each character to its own token, then repeatedly merge the most frequent adjacent pair of tokens into a new token.
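A simplified sketch of that merge loop (word-internal merges only, no byte fallback or special tokens; illustrative rather than a production BPE):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Start with each word as a sequence of single-character tokens.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq        # count adjacent token pairs
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # most frequent pair -> new token
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

print(train_bpe("the mouse ate the cheese the mouse", num_merges=5))
```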
Evaluation. Perplexity (from validation loss): take the average per-token loss and exponentiate it.
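So perplexity is just exp(average per-token negative log-likelihood); a quick sketch assuming the per-token losses are already in hand:

```python
import math

def perplexity(per_token_nll):
    """exp of the average per-token negative log-likelihood (in nats)."""
    return math.exp(sum(per_token_nll) / len(per_token_nll))

print(perplexity([2.0, 2.0, 2.0]))  # a 2.0 nats/token loss ~ perplexity 7.4
```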
Data. Essentially all of the internet: crawl web pages and extract the text from the HTML (then filter/deduplicate).
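A toy sketch of the extraction/cleaning step; real pipelines (e.g. over Common Crawl) use far heavier filtering and deduplication, and the regexes below are only illustrative:

```python
import re

def extract_text(html):
    """Crude HTML-to-text: drop script/style blocks, strip tags, collapse whitespace."""
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def dedupe(docs):
    """Exact-duplicate removal over extracted documents."""
    seen, unique = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

print(extract_text("<html><body><p>The mouse ate the cheese.</p></body></html>"))
```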
Training a SOTA model:
- LLaMa 3 400B
- 15.6T tokens
- FLOPs: ~6NP (N = training tokens, P = parameters); see the quick check after this list
- 16K H100s at an average realized throughput of 400 TFLOPS per GPU
- Parameters: 400B
- Time = 70 days
- Cost: rented compute + salary: $75M
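A quick back-of-the-envelope check that these numbers are consistent (values taken from the list above):

```python
tokens = 15.6e12                 # N: training tokens
params = 400e9                   # P: parameters
flops = 6 * tokens * params      # 6NP rule of thumb ~ 3.7e25 FLOPs

gpus = 16_000
per_gpu = 400e12                 # realized FLOPs/s per H100 (well below peak)
seconds = flops / (gpus * per_gpu)

print(f"{flops:.2e} FLOPs -> ~{seconds / 86400:.0f} days")  # ~68 days, i.e. roughly 70
```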
Research problems:
- Synthetic data?
- Multimodal data?
Systems.
References:
- Stanford CS229 I Machine Learning I Building Large Language Models (LLMs): https://www.youtube.com/watch?v=9vM4p9NN0Ts