Key Concepts and Terminology in Reinforcement Learning

The Agent-Environment Interaction

In Reinforcement Learning, the two core components are the Agent and the Environment. The environment serves as the simulated or physical world where the agent operates.

[Figure: the Agent sends an Action (a_t) to the Environment; the Environment returns an Observation and a Reward to the Agent.]

In a continuous loop, the agent receives an observation detailing the current condition of this world. Based on that information, it executes a specific action. The environment then updates and shifts to a new state, driven both by the agent’s action and its own internal rules of physics or logic.
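The loop described above can be sketched in a few lines of Python. This is only an illustration: `CorridorEnv`, its `reset`/`step` methods, and `run_episode` are hypothetical names invented for this example, not part of any particular library.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: the agent starts at cell 0 and must reach the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial observation

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Run one agent-environment loop and return the total reward collected."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(observation)            # agent acts on what it sees
        observation, reward, done = env.step(action)   # environment updates
        total_reward += reward
        if done:
            break
    return total_reward


random.seed(0)
random_agent = lambda obs: random.choice([-1, 1])
print(run_episode(CorridorEnv(), random_agent))
```

Here the "agent" is just a function from observation to action; replacing `random_agent` with a learned policy leaves the loop itself unchanged.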

Key Concepts of Reinforcement Learning

Reinforcement Learning (RL) is about an Agent learning to master an Environment through trial, error, and rewards. Here is the absolute shortest breakdown of the core terminology:

State & Observation s, o

A complete (state) or partial (observation) description of what the environment looks like right now.

Action a

What the agent decides to do (e.g., move left, jump, apply torque to a motor).

Reward & Return r, R

The immediate points scored (r) vs. the total cumulative score (R). Future rewards are often discounted by γ.

Policy π

The agent’s “brain” or rulebook. It is a mathematical function that maps a state to a chosen action.

Trajectory τ

The recorded timeline of a single episode: State → Action → Reward → Next State.

Value Functions V, Q

The agent’s prediction of its expected future Return, starting either from a given state (V) or from a given state and action (Q).

Interactive RL Loop

[Interactive figure: how the math flows to build a Trajectory (τ). The Environment produces the next state via P(s’|s,a) and emits the State (s_t) and Reward (r_t); the Agent Policy samples an Action a_t ~ π(a|s). The panel tracks the Timestep (t), the Immediate Reward (r), the Discounted Return R(τ), and the trajectory τ recorded so far.]

States and Observations

A state s provides a comprehensive blueprint of the environment at a specific moment, with absolutely no hidden variables. Conversely, an observation o offers merely a restricted or incomplete view of that state, where certain details remain obscured from the agent.

[Figure: two panels comparing what the agent can see.]
Full Visibility (State): The agent possesses a bird’s-eye view of the entire environment, including all obstacles and objectives.
Partial Visibility (Observation): The agent’s sensors are limited, revealing only nearby areas while the rest is hidden in a “fog of war.”

In the realm of Deep Reinforcement Learning, both states and observations are typically encoded mathematically as real-valued vectors, matrices, or higher-order tensors. To illustrate, a video game agent might process an observation as a dense grid of RGB pixel values, whereas a robotic arm’s state might be defined by a precise array of joint angles and movement speeds.

Environments that grant the agent total access to all underlying variables are known as fully observed. On the other hand, when the agent must make decisions based on restricted sensory input, the system is classified as partially observed.
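To make the state/observation distinction concrete, here is a minimal sketch using a small grid world. The grid layout, the cell codes, and the `observe` helper are all assumptions made up for this example.

```python
# Full state: every cell of a 5x5 grid world (0 = empty, 1 = wall, 2 = goal).
state = [
    [0, 0, 0, 0, 2],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
agent_row, agent_col = 4, 0  # agent sits in the bottom-left corner


def observe(grid, row, col, radius=1, unknown=-1):
    """Return the square window around the agent; cells off the map are `unknown`."""
    window = []
    for r in range(row - radius, row + radius + 1):
        line = []
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            line.append(grid[r][c] if inside else unknown)
        window.append(line)
    return window


# Partial observation: only the 3x3 neighbourhood is visible.
observation = observe(state, agent_row, agent_col)
print(observation)
```

With full access to `state` the agent can locate the goal in the top-right corner; from `observation` alone it cannot, which is what makes this setting partially observed.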

Action Spaces

Every environment defines an Action Space: the complete menu of valid moves an agent is allowed to make. These spaces generally fall into two distinct categories:

  • Discrete Action Space: A fixed, finite number of specific choices (e.g., playing a board game or pressing D-pad buttons on a controller).
  • Continuous Action Space: An infinite range of fluid, real-number values (e.g., the exact steering angle or acceleration of a physical robot, expressed as a value somewhere between -1.0 and 1.0).

Understanding this distinction is critical in deep reinforcement learning. Many algorithms can be applied directly to only one type of action space, and adapting them to the other usually requires substantial changes, in particular to how the policy network represents and outputs actions.
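As a rough illustration of the two categories, the sketch below draws one action from each kind of space. The action names and the steering range are assumptions chosen for the example.

```python
import random

DISCRETE_ACTIONS = ["up", "down", "left", "right"]  # finite menu of choices
STEERING_RANGE = (-1.0, 1.0)                        # continuous interval of real values


def sample_discrete(rng):
    """Pick one of a finite number of actions."""
    return rng.choice(DISCRETE_ACTIONS)


def sample_continuous(rng):
    """Pick any real number inside the allowed interval."""
    low, high = STEERING_RANGE
    return rng.uniform(low, high)


rng = random.Random(0)
print(sample_discrete(rng), sample_continuous(rng))
```

A discrete action can be represented by an index into the menu, while a continuous action is a real number (or a vector of them), which is why the two cases call for different policy outputs.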

Policies: The Agent’s Brain

A policy is the fundamental rulebook an agent uses to decide which action to take in any given state. Because the policy dictates behavior, researchers often substitute the word “policy” for the “agent” itself.

In Deep Reinforcement Learning, we use parameterized policies. This means the policy is powered by a neural network. The network’s weights and biases are the parameters (denoted as θ), which an optimization algorithm adjusts over time to improve performance.

1. Deterministic Policies

A deterministic policy maps a state directly to one exact, specific action. There is no randomness. If the agent sees the exact same state twice, it will take the exact same action twice. It is usually denoted by the symbol μ:

a_t = μ_θ(s_t)

2. Stochastic Policies

A stochastic policy outputs a probability distribution over a range of possible actions. The agent then rolls the dice and samples an action from that distribution. This randomness is crucial for letting the agent explore its environment. It is denoted by the symbol π:

a_t ~ π_θ(· | s_t)

To train a stochastic policy, algorithms need to do two things: mathematically sample an action, and calculate the Log-Likelihood log π_θ(a|s) (how probable that specific action was). There are two main types of stochastic policies depending on the action space:

  • Categorical Policies (For Discrete Actions): These work like an image classifier. The neural network outputs probabilities for a finite set of buttons (e.g., 10% Jump, 70% Duck, 20% Punch). The action is sampled based on those percentages.
  • Diagonal Gaussian Policies (For Continuous Actions): These work for fluid movements (like steering). The neural network outputs a Mean (μ) and a Standard Deviation (σ) for each action dimension. The action is sampled by taking the mean and adding spherical Gaussian noise scaled by the standard deviation: a = μ + σ ⊙ z, with z ~ N(0, I).
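The two operations every stochastic policy needs, sampling and computing a log-likelihood, can be sketched without a neural network by plugging in fixed numbers where the network outputs would go. The function names below are illustrative, and the probabilities reuse the Jump/Duck/Punch example.

```python
import math
import random


def softmax(logits):
    """Convert raw network outputs (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_categorical(probs, rng):
    """Sample an action index and return it with its log-likelihood."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


def gaussian_log_likelihood(action, mean, std):
    """Sum of per-dimension normal log-densities (dimensions are independent)."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


def sample_diagonal_gaussian(mean, std, rng):
    """Sample a = mean + std * z with z ~ N(0, I); return the action and its log-likelihood."""
    action = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
    return action, gaussian_log_likelihood(action, mean, std)


rng = random.Random(0)
probs = softmax([math.log(0.1), math.log(0.7), math.log(0.2)])  # Jump, Duck, Punch
discrete_action, discrete_logp = sample_categorical(probs, rng)
continuous_action, continuous_logp = sample_diagonal_gaussian([0.5], [0.1], rng)
print(discrete_action, discrete_logp, continuous_action, continuous_logp)
```

Because the Gaussian is diagonal, the log-likelihood of a multi-dimensional action is simply the sum of the one-dimensional log-densities.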

Action Sampling Simulation

[Interactive figure: how different policies react to the exact same environment state.]

  • Deterministic (example output: 0.50): Always outputs exact center.
  • Stochastic (Categorical) (example output: Right): Samples discrete choices based on weight.
  • Stochastic (Gaussian) (example output: 0.50): Samples continuous values around a mean.

Trajectories (Episodes)

A trajectory τ represents a chronological record of an agent’s experience in the world. It is a sequence consisting of every state encountered and every action taken during a specific period of time:

τ = (s_0, a_0, s_1, a_1, …)

[Figure: Visualizing the Sequence (τ): State s_0 → Action a_0 → State s_1 → Action a_1 → State s_2]

The journey begins at an initial state s_0, which is randomly sampled from a start-state distribution ρ_0. From there, the sequence evolves through State Transitions.

These transitions (the movement from one state to the next) are determined by the environment’s internal logic. Each transition depends only on the most recent state and action. It can be deterministic (the outcome is certain) or stochastic (the outcome involves probability).

Deterministic Transition: s_{t+1} = f(s_t, a_t)
Stochastic Transition: s_{t+1} ~ P(· | s_t, a_t)

While the environment controls the transitions, the actions themselves are provided by the agent according to its specific policy. In Reinforcement Learning literature, these recorded sequences are also commonly referred to as episodes or rollouts.
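The pieces of a trajectory (start distribution, policy, and transition) can be put together in a short rollout sketch. The five-cell world, the 20% slip probability, and the 80/20 policy are assumptions invented for this illustration.

```python
import random


def sample_start_state(rng):
    """rho_0: start in cell 0 or cell 1 with equal probability (assumed)."""
    return rng.choice([0, 1])


def transition(state, action, rng, slip_prob=0.2):
    """Stochastic P(.|s, a): with probability slip_prob the move fails and the state is unchanged."""
    if rng.random() < slip_prob:
        return state
    return max(0, min(4, state + action))  # cells are numbered 0..4


def policy(state, rng):
    """A stochastic policy: move right (+1) 80% of the time, left (-1) otherwise."""
    return 1 if rng.random() < 0.8 else -1


def rollout(rng, horizon=5):
    """Return tau as a flat list [s_0, a_0, s_1, a_1, ..., s_horizon]."""
    state = sample_start_state(rng)
    tau = [state]
    for _ in range(horizon):
        action = policy(state, rng)             # the agent supplies the action
        state = transition(state, action, rng)  # the environment supplies the next state
        tau += [action, state]
    return tau


tau = rollout(random.Random(0))
print(tau)
```

Note that `transition` only receives the current state and action, never the earlier history, which mirrors the dependence described above.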

Reward and Return

The reward function is the engine that drives reinforcement learning. It acts as the immediate feedback mechanism, outputting a number that tells the agent exactly how “good” or “bad” its current situation is.

Mathematically, the reward r_t often depends on the current state, the action taken, and the resulting next state. However, it is frequently simplified to just depend on the current state-action pair:

r_t = R(s_t, a_t, s_{t+1})

The ultimate goal of the agent is not just to get a high reward right now, but to maximize its cumulative reward over time. We call this total accumulated score the Return, denoted by R(τ). There are two standard ways to formulate this Return.

1. Finite-Horizon Undiscounted Return

If an episode has a strict time limit (a fixed window of steps), we can simply add up all the rewards the agent collected before the timer ran out:

R(τ) = ∑_{t=0}^{T} r_t

2. Infinite-Horizon Discounted Return

If an environment runs forever, simply adding up rewards can produce a sum that grows without bound, so different behaviors can no longer be compared. To solve this, we introduce a discount factor γ (gamma), a number between 0 and 1 (strictly less than 1 when the horizon is infinite).

R(τ) = ∑_{t=0}^{∞} γ^t r_t

Why use a discount factor? Intuitively, it mimics real life: a dollar today is worth more to you than a dollar promised ten years from now. The agent prefers immediate rewards over distant, uncertain ones. Mathematically, multiplying future rewards by a shrinking factor (γ^t) ensures that, as long as the rewards are bounded and γ < 1, the infinite sum converges to a finite, computable number.

The Discount Factor Simulation

Assume the agent receives a constant reward of +100 at every timestep. Varying Gamma (γ) shows how it shrinks the value of future rewards. For example, with γ = 0.9:

Total Return (10 steps) = 100 × (1 − 0.9^10) / (1 − 0.9) ≈ 651.32

Notice how a Gamma of 1.0 treats all future rewards equally, while a Gamma of 0.0 makes the agent incredibly short-sighted, completely ignoring everything except the very first immediate reward.
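Both return formulas are short enough to check by hand or in code. The sketch below uses the constant +100 reward over 10 steps from the example; a discount of 0.9 reproduces the 651.32 figure. The function names are illustrative.

```python
def finite_horizon_return(rewards):
    """Undiscounted return: the plain sum of all rewards in the window."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Discounted return: the reward at step t is weighted by gamma ** t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rewards = [100.0] * 10  # constant +100 at every one of 10 timesteps

print(finite_horizon_return(rewards))
for gamma in (0.0, 0.9, 1.0):
    print(gamma, round(discounted_return(rewards, gamma), 2))
```

If the same reward stream continued forever with γ = 0.9, the geometric series would converge to 100 / (1 − 0.9) = 1000, which is the finite limit the discount factor guarantees.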

Value Functions & Bellman Equations

While a reward tells the agent how it is doing now, a Value Function predicts how the agent will do over the long haul. It estimates the total future return the agent can expect to receive starting from a specific point.

1. The Four Main Functions

In reinforcement learning, we primarily track four types of “expectations”:

  • State-Value V^π(s): Expected return starting in state s and following policy π from then on.
  • Action-Value Q^π(s,a): Expected return starting in state s, taking action a (which need not come from π), then following policy π.
  • Optimal Value V*(s): Expected return starting in state s and always acting according to the optimal policy.
  • Optimal Action-Value Q*(s,a): Expected return starting in state s, taking action a, then always acting according to the optimal policy.

2. Bellman Equations

The “Bellman Equation” is a self-consistency rule. It states that the value of where you are now must equal the reward you just got, plus the discounted value of where you land next.

V^π(s) = E[ r + γ V^π(s’) ]
[Figure: The Bellman Backup. Values flow backward from future estimates, FUTURE γV(s’), to update the present, PRESENT V(s).]

Value(Now) = Reward + Discounted Value(Next)
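One way to see the self-consistency rule at work is to apply it repeatedly until the values stop changing (iterative policy evaluation). The two-state problem below is a made-up example: under a fixed policy, state 0 either stays put with reward 0 or moves to the absorbing state 1 with reward 1, each with probability 0.5.

```python
def evaluate_policy(transitions, gamma, sweeps=500):
    """Repeatedly apply V(s) = E[r + gamma * V(s')] until the values settle.

    transitions[s] lists (probability, next_state, reward) outcomes
    under the fixed policy being evaluated.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        values = {
            s: sum(p * (r + gamma * values[s_next]) for p, s_next, r in outcomes)
            for s, outcomes in transitions.items()
        }
    return values


transitions = {
    0: [(0.5, 0, 0.0), (0.5, 1, 1.0)],  # stay (reward 0) or reach state 1 (reward 1)
    1: [(1.0, 1, 0.0)],                 # absorbing state, no further reward
}
values = evaluate_policy(transitions, gamma=0.9)
print(values)
```

Solving the equation by hand gives V(0) = 0.5 · (0 + 0.9 · V(0)) + 0.5 · (1 + 0), so V(0) = 0.5 / 0.55 ≈ 0.909, which is the value the sweeps converge to.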

3. The Advantage Function

Sometimes we don’t care about the absolute value, but rather how much better one action is compared to the average. This is the Advantage Function A^π(s,a). It is the difference between the Action-Value and the State-Value:

A^π(s,a) = Q^π(s,a) − V^π(s)
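A small numeric sketch makes the "better than average" reading explicit. It relies on the fact that the state-value is the policy-weighted average of the action-values; the Q-values and action probabilities are invented for the example.

```python
def advantages(q_values, action_probs):
    """Return A(s, a) = Q(s, a) - V(s) for each action in a single state.

    V(s) is computed as the average of Q(s, a) weighted by the policy's
    action probabilities.
    """
    v = sum(action_probs[a] * q for a, q in q_values.items())
    return {a: q - v for a, q in q_values.items()}


q_values = {"left": 1.0, "right": 3.0}      # hypothetical Q(s, a) estimates
action_probs = {"left": 0.5, "right": 0.5}  # hypothetical policy in state s
adv = advantages(q_values, action_probs)
print(adv)
```

Here V(s) = 2.0, so moving right has a positive advantage and moving left a negative one; weighted by the policy, the advantages always average to zero.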

4. The MDP

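The math above is formalized as a Markov Decision Process (MDP). An MDP is defined by a 5-tuple ⟨S, A, R, P, ρ_0⟩: the set of states S, the set of actions A, the reward function R, the transition probability function P, and the start-state distribution ρ_0. It operates on the Markov Property: the future depends only on the current state and action, not the path taken to get there.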

Take a Quiz

1. What is the difference between a “State” and an “Observation”?
  • A State is a complete description of the world; an Observation is a partial view.
  • A State is only used in discrete spaces; an Observation is for continuous spaces.
  • An Observation is the reward the agent receives; a State is the action taken.

2. Which policy type outputs a probability distribution over actions?
  • Deterministic Policy (μ)
  • Stochastic Policy (π)
  • Optimal Value Function (V*)

3. What does the Bellman Equation primarily describe?
  • The probability of a trajectory occurring.
  • How to convert continuous actions into discrete buttons.
  • The relationship between the value of a current state and the value of the next state.

4. Why is the discount factor (γ) used in the Return calculation?
  • To increase the value of rewards obtained in the far future.
  • To ensure infinite-horizon sums converge and prioritize immediate rewards.
  • To record the chronological sequence of states and actions.
