Introduction to Large Language Models
https://developers.google.com/machine-learning/crash-course/llm
What Is a Language Model?

At its simplest, a language model is a statistical tool that predicts the next piece of text in a sequence.
- Tokens: Models break text down into chunks called “tokens,” which can be words, parts of words (subwords), or individual characters.
- Probability: The model looks at a sequence (e.g., “When I hear rain on my roof, I…”) and calculates the probability of various tokens filling in the blank. It might assign a 9.4% chance to “cook soup” and a 2.5% chance to “nap.”
- Usage: These predictions are used for translating languages, summarizing documents, and generating new text.
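To make the idea of "predicting the next piece of text" concrete, here is a minimal sketch in Python. Only the "cook soup" (9.4%) and "nap" (2.5%) figures come from the example above; the `<other>` bucket and the remainder of the probability mass are assumptions added purely for illustration.

```python
import random

# Toy next-token distribution for the prefix in the example above.
# "cook soup" and "nap" probabilities are from the text; "<other>" is an
# assumed bucket holding the rest of the probability mass.
prefix = "When I hear rain on my roof, I ..."
next_token_probs = {
    "cook soup": 0.094,
    "nap": 0.025,
    "<other>": 0.881,  # assumed remainder over all other continuations
}

# The model's single best guess is simply the highest-probability candidate.
best = max(next_token_probs, key=next_token_probs.get)
print(f"{prefix!r} -> {best!r} (p={next_token_probs[best]:.3f})")

# Text generation usually samples from the distribution rather than always
# taking the top token, which is how varied outputs arise.
sampled = random.choices(list(next_token_probs),
                         weights=list(next_token_probs.values()))[0]
print(f"Sampled continuation: {sampled!r}")
```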

The N-gram Approach
Early language models used “N-grams,” which are simply ordered sequences of words where N represents the number of words.
- Bigram (2-gram): Looks at pairs of words (e.g., “you are”).
- Trigram (3-gram): Looks at groups of three words (e.g., “you are very”).
- How it works: To predict the next word, the model looks at the previous words. If the input is “orange is,” the model checks its training data to see what usually comes next (e.g., “ripe” or “cheerful”).
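A count-based trigram model can be sketched in a few lines. The toy corpus below is an assumption invented to mirror the "orange is ripe / orange is cheerful" example; a real model would be built from a large training corpus.

```python
from collections import Counter, defaultdict

# Tiny trigram model: count which word follows each pair of words in the
# corpus, then predict the most frequent continuation for a given pair.
corpus = (
    "the orange is ripe . the orange is ripe . "
    "the wall is orange . orange is cheerful ."
).split()

following = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    following[(w1, w2)][w3] += 1

def predict_next(w1, w2):
    """Return the word most often seen after the pair (w1, w2), if any."""
    candidates = following[(w1, w2)]
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("orange", "is"))  # 'ripe' — it appears more often than 'cheerful'
```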

Context
Context refers to the information surrounding a word that helps determine its specific meaning.
- The Limitation: N-grams struggle with context. If a model only looks at a 3-gram (“orange is…”), it doesn’t have enough prior information to know if “orange” refers to a fruit (orange is ripe) or a color (orange is cheerful).
- The Trade-off: You cannot simply make the N-gram massive (e.g., a 20-gram) to get more context. As N grows larger, the specific sequence becomes so rare in the training data that the model cannot make reliable predictions.
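The trade-off can be seen by counting N-grams directly: as N grows, a larger and larger share of N-grams occur only once, so there are no reliable statistics about "what usually comes next." The short text below is an assumed stand-in for real training data.

```python
from collections import Counter

# Rough illustration of N-gram sparsity: the fraction of N-grams seen only
# once rises quickly as N grows, even in a repetitive toy corpus.
text = (
    "when i hear rain on my roof i cook soup "
    "when i hear rain on my roof i nap "
    "the orange is ripe the orange is cheerful "
    "the wall is orange and the fruit is orange"
)
tokens = text.split()

for n in range(1, 7):
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    singletons = sum(1 for count in ngrams.values() if count == 1)
    print(f"{n}-grams: {len(ngrams)} distinct, "
          f"{singletons / len(ngrams):.0%} seen only once")
```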

Recurrent Neural Networks (RNNs)
An RNN is a type of neural network trained on sequences of tokens that provides more context than N-grams can. Unlike N-grams, RNNs process data conceptually the way a person listens to a sentence: they evaluate information “token by token.”
- Capabilities: They can “learn” to keep or ignore context gradually, allowing them to handle longer passages (several sentences) compared to N-grams.
- Limitations: While better than N-grams, RNNs are still limited in how much context they can intuit due to the “vanishing gradient problem.”
Modern Large Language Models differ from RNNs because LLMs can evaluate the whole context at once, rather than token by token.
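To make the “token by token” idea concrete, here is a minimal sketch of an RNN cell carrying a hidden state forward as it reads each token. The vocabulary, dimensions, and random weights are all assumptions for illustration; a trained model would learn these weights.

```python
import numpy as np

# Minimal RNN step: the hidden state summarizes everything read so far, and
# each new token updates it. Weights are random, purely illustrative.
rng = np.random.default_rng(0)
vocab = {"orange": 0, "is": 1, "ripe": 2, "cheerful": 3}
embed_dim, hidden_dim = 4, 8

embeddings = rng.normal(size=(len(vocab), embed_dim))
W_xh = rng.normal(size=(embed_dim, hidden_dim))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights

hidden = np.zeros(hidden_dim)
for token in ["orange", "is"]:
    x = embeddings[vocab[token]]
    # The new hidden state mixes the current token with the running summary
    # of earlier tokens, which is how context is carried forward.
    hidden = np.tanh(x @ W_xh + hidden @ W_hh)

print(hidden)  # a fixed-size summary of the prefix, used to predict the next token
```

Because the summary is squeezed through this fixed-size state at every step, information from early tokens can fade, which is one way to picture the vanishing-gradient limitation mentioned above.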
