- What matters when training LLMs:
- Architecture
- Training Algorithm and Loss
- Data
- Evaluation
- Systems
Pre-Training.
Post-Training.
Language Modeling.
P(the, mouse, ate, the, cheese) = 0.02 – a fluent, plausible sentence gets relatively high probability
P(the, the, mouse, ate, cheese) = 0.0001 – ungrammatical word order gets low probability (syntactic knowledge)
P(the, cheese, ate, the, mouse) – an implausible meaning should also get low probability (semantic knowledge)
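These sequence probabilities come from the chain rule: the model predicts one next-token conditional at a time and the log-probabilities add up. A minimal sketch, assuming a hypothetical `next_token_probs(prefix)` that returns a distribution over the vocabulary (not the lecture's code):

```python
import math

def sequence_log_prob(tokens, next_token_probs):
    """Chain rule: log P(x1..xT) = sum_t log P(x_t | x_<t)."""
    total = 0.0
    for t in range(len(tokens)):
        probs = next_token_probs(tokens[:t])  # distribution over the next token given the prefix
        total += math.log(probs[tokens[t]])   # log-probability of the token that actually occurred
    return total

# A model with syntactic/semantic knowledge scores "the mouse ate the cheese"
# higher than "the the mouse ate cheese" or "the cheese ate the mouse".
```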
Auto-Regressive (AR) language model: predict the next word.
Steps (e.g. "she likely prefers"): tokenize -> [1: she, 2: likely, 3: prefers] -> pass the tokens to the (black-box) model -> get a probability distribution over the next token -> sample -> detokenize.
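A minimal sketch of that loop, with a toy word-level vocabulary and a uniform "model" standing in for the real tokenizer and network (both are placeholders, not the lecture's code):

```python
import random

vocab = ["she", "likely", "prefers", "tea", "coffee", "<eos>"]
tok2id = {w: i for i, w in enumerate(vocab)}

def tokenize(text):
    return [tok2id[w] for w in text.lower().split()]

def detokenize(ids):
    return " ".join(vocab[i] for i in ids)

def model(prefix_ids):
    # Placeholder for the black box: a real model returns a learned
    # probability distribution over the next token; here it is uniform.
    return [1.0 / len(vocab)] * len(vocab)

def generate(prompt, max_new_tokens=3):
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = model(ids)                                             # next-token distribution
        next_id = random.choices(range(len(vocab)), weights=probs)[0]  # sample
        if vocab[next_id] == "<eos>":
            break
        ids.append(next_id)
    return detokenize(ids)

print(generate("she likely prefers"))
```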
Loss. Cross-entropy on the next token: the negative log-probability the model assigns to the observed next token, averaged over positions.
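In frameworks this is the standard cross-entropy between the model's next-token logits and the tokens that actually follow; a minimal PyTorch sketch with random tensors standing in for a real model's output and a real batch:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 2, 8, 1000
logits = torch.randn(batch, seq_len, vocab_size)          # stand-in for model output
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for token ids

# Predict token t+1 from positions <= t: shift logits and targets by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 0..T-2
    tokens[:, 1:].reshape(-1),               # the observed next tokens
)
print(loss.item())  # average per-token negative log-likelihood
```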
Tokenizer. Byte-Pair Encoding (BPE): start from a large corpus of text, map each character to its own token, then repeatedly merge the most frequent adjacent pair of tokens into a new token.
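A simplified sketch of that merge loop (word-internal merges only, no byte fallback or special tokens; illustrative rather than a production BPE):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Start with each word as a sequence of single-character tokens.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq        # count adjacent token pairs
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # most frequent pair -> new token
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

print(train_bpe("the mouse ate the cheese the mouse", num_merges=5))
```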
Evaluation. Perplexity (from validation loss): take the average per-token loss and exponentiate it.
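So perplexity is just exp(average per-token negative log-likelihood); a quick sketch assuming the per-token losses are already in hand:

```python
import math

def perplexity(per_token_nll):
    """exp of the average per-token negative log-likelihood (in nats)."""
    return math.exp(sum(per_token_nll) / len(per_token_nll))

print(perplexity([2.0, 2.0, 2.0]))  # a 2.0 nats/token loss ~ perplexity 7.4
```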
Data. Essentially all of the internet: crawl web pages and extract the text from the HTML (then filter/deduplicate).
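A toy sketch of the extraction/cleaning step; real pipelines (e.g. over Common Crawl) use far heavier filtering and deduplication, and the regexes below are only illustrative:

```python
import re

def extract_text(html):
    """Crude HTML-to-text: drop script/style blocks, strip tags, collapse whitespace."""
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def dedupe(docs):
    """Exact-duplicate removal over extracted documents."""
    seen, unique = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

print(extract_text("<html><body><p>The mouse ate the cheese.</p></body></html>"))
```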
Training a SOTA model:
- LLaMa 3 400B
- 15.6T tokens
- FLOPs: ~6NP (N = training tokens, P = parameters); see the quick check after this list
- 16K H100s at an average realized throughput of 400 TFLOPS per GPU
- Parameters: 400B
- Time = 70 days
- Cost: rented compute + salary: $75M
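A quick back-of-the-envelope check that these numbers are consistent (values taken from the list above):

```python
tokens = 15.6e12                 # N: training tokens
params = 400e9                   # P: parameters
flops = 6 * tokens * params      # 6NP rule of thumb ~ 3.7e25 FLOPs

gpus = 16_000
per_gpu = 400e12                 # realized FLOPs/s per H100 (well below peak)
seconds = flops / (gpus * per_gpu)

print(f"{flops:.2e} FLOPs -> ~{seconds / 86400:.0f} days")  # ~68 days, i.e. roughly 70
```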
Research problems:
- Synthetic data?
- Multimodal data?
Systems.
References:
- Stanford CS229 I Machine Learning I Building Large Language Models (LLMs): https://www.youtube.com/watch?v=9vM4p9NN0Ts