Building Large Language Models (LLMs)

  • What matters when training LLMs:
    • Architecture
    • Training Algorithm and Loss
    • Data
    • Evaluation
    • Systems

Pre-Training.

Post-Training.

Language Modeling.
P(the, mouse, ate, the, cheese) = 0.02
P(the, the, mouse, ate, cheese) = 0.0001 – low because the word order is ungrammatical: syntactic knowledge
P(…) – a grammatical but implausible sentence also gets low probability: semantic knowledge

Auto-Regressive (AR) language model: predict the next word given all previous words.
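
An autoregressive model factorizes the joint probability with the chain rule: P(x1, …, xT) = P(x1) · P(x2 | x1) · … · P(xT | x1, …, xT-1). A toy sketch of that product, with made-up conditional probabilities:

```python
# Chain rule: P(x_1, ..., x_T) = prod_t P(x_t | x_{<t}).
# The conditional probabilities below are made up for illustration.
conditionals = [
    ("the",    0.10),   # P(the)
    ("mouse",  0.02),   # P(mouse | the)
    ("ate",    0.30),   # P(ate | the, mouse)
    ("the",    0.40),   # P(the | the, mouse, ate)
    ("cheese", 0.25),   # P(cheese | the, mouse, ate, the)
]

joint = 1.0
for _, p in conditionals:
    joint *= p
print(f"P(the, mouse, ate, the, cheese) = {joint:.2e}")
```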

Steps, e.g. for the prompt "she likely prefers": tokenize -> [1: she, 2: likely, 3: prefers] => pass the token IDs to the model => get a probability distribution over the next token => sample & detokenize back to text (a sketch of this loop follows below).
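
A minimal sketch of one decoding step in PyTorch, assuming a hypothetical `model` that returns next-token logits of shape (batch, sequence, vocab) and a hypothetical `tokenizer` with `encode`/`decode`:

```python
import torch

def generate_next(model, tokenizer, prompt, temperature=1.0):
    """One step of autoregressive decoding: tokenize, score, sample, detokenize."""
    token_ids = tokenizer.encode(prompt)                 # e.g. "she likely prefers" -> [1, 2, 3]
    input_ids = torch.tensor([token_ids])                # add a batch dimension
    logits = model(input_ids)[0, -1]                     # logits for the next token only
    probs = torch.softmax(logits / temperature, dim=-1)  # distribution over the vocabulary
    next_id = torch.multinomial(probs, num_samples=1)    # sample one token from it
    return tokenizer.decode([next_id.item()])            # detokenize back to text
```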

Loss. Cross-entropy on the next-token prediction: the average negative log-probability the model assigns to each observed next token.
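
A small sketch of that per-token cross-entropy in PyTorch, with random logits and targets standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 5
logits = torch.randn(seq_len, vocab_size)           # model outputs for each position (random here)
targets = torch.randint(0, vocab_size, (seq_len,))  # the actual next tokens

# Average negative log-likelihood of the observed next tokens.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```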

Tokenizer. Byte Pair Encoding (BPE): take a large corpus of text -> start with each character as its own token -> repeatedly merge the most frequent adjacent pair of tokens into a new token (sketch below).
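
A minimal sketch of one BPE-style merge step on a toy corpus (illustrative only, not a production tokenizer):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent token pairs across the corpus and return the most common one."""
    pairs = Counter()
    for tokens, freq in words.items():
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged token."""
    merged = {}
    for tokens, freq in words.items():
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with frequencies.
corpus = {tuple("lower"): 5, tuple("lowest"): 3, tuple("newer"): 2}
pair = most_frequent_pair(corpus)   # ('w', 'e') for this corpus
corpus = merge_pair(corpus, pair)   # 'w','e' -> 'we' everywhere
print(pair, corpus)
```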

Evaluation. Perplexity on a validation set: take the average per-token loss and exponentiate it (lower is better).
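
A tiny sketch, with made-up per-token losses standing in for a real validation run:

```python
import math

# Per-token negative log-likelihoods on a validation set (made-up values).
token_losses = [2.1, 1.8, 2.5, 2.0, 1.9]

avg_loss = sum(token_losses) / len(token_losses)
perplexity = math.exp(avg_loss)
print(f"avg loss = {avg_loss:.2f}, perplexity = {perplexity:.1f}")
```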

Data. Start from crawls of (essentially) all of the internet; extract plain text from the raw HTML.
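
A minimal sketch of HTML text extraction using BeautifulSoup (illustrative; a real data pipeline does far more filtering and cleaning than shown here):

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>Some article text.</p><script>ignore()</script></body></html>"

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):   # drop non-content elements
    tag.decompose()
text = soup.get_text(separator=" ", strip=True)
print(text)  # "Title Some article text."
```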

Training a SOTA model:

  • LLaMA 3 400B
    • 15.6T tokens
    • FLOPs ≈ 6 × (parameters) × (tokens)
    • 16K H100s with an average throughput of 400 TFLOPS each
    • Parameters: 405B
    • Time: ~70 days
    • Cost (rented compute + salaries): ~$75M (see the back-of-envelope sketch below)
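
A back-of-envelope check of those numbers (the ~$2 per GPU-hour rental rate is an assumption for illustration):

```python
# Back-of-envelope for the LLaMA 3 400B numbers above.
params = 405e9          # model parameters
tokens = 15.6e12        # training tokens
flops = 6 * params * tokens              # ~6 FLOPs per parameter per token
print(f"training FLOPs ~ {flops:.2e}")   # ~3.8e25

gpus = 16_000
throughput = 400e12                      # realized FLOP/s per H100 (figure given above)
seconds = flops / (gpus * throughput)
print(f"time ~ {seconds / 86_400:.0f} days")      # ~70 days

# Assumed rental rate of ~$2 per GPU-hour (illustrative assumption).
dollars = gpus * (seconds / 3_600) * 2
print(f"rented compute ~ ${dollars / 1e6:.0f}M")  # ~$50M, plus salaries -> ~$75M total
```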

Research problems:

  1. Synthetic data?
  2. Multimodal data?

Systems.

References:

  1. Stanford CS229 | Machine Learning | Building Large Language Models (LLMs). https://www.youtube.com/watch?v=9vM4p9NN0Ts
