Resources:
- "Welcome to the 🤗 Deep Reinforcement Learning Course – Hugging Face Deep RL Course," Huggingface.co, 2018. https://huggingface.co/learn/deep-rl-course/en/unit0/introduction (accessed Feb. 21, 2026).
- "Welcome to Spinning Up in Deep RL! - Spinning Up documentation," Openai.com, 2018. https://spinningup.openai.com/en/latest/ (accessed Feb. 21, 2026).
- "CS 185/285," Berkeley.edu, 2023. https://rail.eecs.berkeley.edu/deeprlcourse/ (accessed Feb. 21, 2026).
- A. Plaat, "Deep Reinforcement Learning".
Imagine you have just moved to a new city, you are hungry, and you want to buy some groceries. There is an unrealistic catch: you have no map and no smartphone. After some random exploration, you find a supermarket. You carefully note the route in your notebook and return home.
What will you do next time? You could exploit your current knowledge and follow the same path – it’s guaranteed to work. Or, you could be adventurous and explore, trying to find a quicker route. This is the classic Exploration-Exploitation trade-off.
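A common way to balance the two is an epsilon-greedy rule: with a small probability, explore a random option; otherwise, exploit the best-known one. Here is a minimal sketch; the route values and epsilon are made-up numbers for illustration.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """Pick a random action with probability epsilon (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(action_values))  # explore: try something new
    return max(range(len(action_values)), key=lambda a: action_values[a])  # exploit: best known

# Illustrative values for two routes to the supermarket: [known route, untried shortcut]
estimated_route_values = [10.0, 0.0]  # we have only ever tried the first route
action = epsilon_greedy(estimated_route_values, epsilon=0.1)
```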
What is Deep Reinforcement Learning?
At its core, Deep Reinforcement Learning (DRL) is a subfield of machine learning that combines two heavy hitters: Reinforcement Learning (RL) and Deep Learning (DL).

1. The Core Components
To understand DRL, we first have to look at the standard Reinforcement Learning loop. It’s essentially a “trial-and-error” framework where an Agent learns to make decisions.
- Agent: The AI “player” or decision-maker.
- Environment: The world the agent lives in (e.g., a video game, a stock market, or a robotic arm).
- State (s): The current situation or “snapshot” of the environment.
- Action (a): What the agent chooses to do.
- Reward (r): The feedback (positive or negative) given to the agent based on its action.
[Diagram: the Agent (the AI player) sends an Action to the Environment (the world), which returns a new State and a Reward.]
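Putting these pieces together, one pass through the loop looks like the sketch below. It uses the Gymnasium library (the maintained fork of OpenAI Gym) and the CartPole-v1 environment purely as an illustration; the random policy is a placeholder for a real learning agent.

```python
import gymnasium as gym  # maintained fork of OpenAI Gym

env = gym.make("CartPole-v1")     # Environment: the world the agent lives in
state, info = env.reset(seed=0)   # State: the initial "snapshot"

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # Agent: a random policy stands in for a real learner
    state, reward, terminated, truncated, info = env.step(action)  # the environment reacts
    total_reward += reward              # Reward: feedback for the chosen action
    done = terminated or truncated

env.close()
print(f"Episode return: {total_reward}")
```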
2. Why “Deep”?
In traditional RL, we use simple tables (like a spreadsheet) to map states to the best actions. This works for Tic-Tac-Toe, but it fails in the real world. Imagine a self-driving car; the number of possible “states” (camera pixels, sensor data, speed) is infinite. We can’t fit that in a table.
This is where Deep Learning comes in. We use Neural Networks as “function approximators.” Instead of looking up a value in a table, the agent passes the state through a deep neural network to predict which action will yield the highest long-term reward.
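As a rough sketch of what such a function approximator looks like, the network below maps a state vector to one estimated value per action. The layer sizes, state dimension, and action count are arbitrary assumptions for illustration, not the architecture of any particular published agent.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state to one estimated long-term value (Q-value) per action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output per possible action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Illustrative usage: a 4-dimensional state, 2 possible actions
q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)                    # a fake "snapshot" of the environment
q_values = q_net(state)                      # estimated value of each action
best_action = q_values.argmax(dim=1).item()  # the greedy choice
```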
Traditional RL:
- Uses Q-Tables
- Hand-crafted features
- Limited complexity

Deep RL:
- Uses Neural Networks
- Raw data (Pixels/Sensors)
- End-to-end learning
3. How It Learns: The Goal
The agent’s goal isn’t just to get an immediate reward, but to maximize the cumulative reward over time, often called the Return.
Because future rewards are less certain than immediate ones, we use a discount factor ($\gamma$, typically between 0 and 1) to weight them. This is often expressed via the Bellman Equation:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

where $r$ is the immediate reward and $s'$ is the next state.
In DRL, the neural network learns to estimate this $Q$ value (the “quality” of an action) for every possible scenario.
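To make the discounting concrete, here is a small sketch of (a) the discounted return for a sequence of rewards and (b) the one-step Bellman target that a DQN-style agent is trained to match. The reward values and gamma below are made-up numbers for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Return = r1 + gamma*r2 + gamma^2*r3 + ..."""
    g = 0.0
    for r in reversed(rewards):  # work backwards so each step folds in the future
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29

def bellman_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step target: r + gamma * max_a' Q(s', a'); no bootstrap at episode end."""
    return reward if done else reward + gamma * max(next_q_values)

print(bellman_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.9))  # 1 + 0.9 * 2.0 = 2.8
```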
4. Famous Examples
- AlphaGo: DeepMind’s AI that defeated the world champion at Go. It used DRL to evaluate board positions and choose moves. https://deepmind.google/research/alphago/
- Atari Games: DRL agents can learn to play games like Breakout or Pong just by looking at the pixels on the screen, with no prior knowledge of the rules. https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf, https://en.wikipedia.org/wiki/Pong
- Robotics: Teaching a robot to walk or pick up fragile objects by rewarding “success” and penalizing “falls” or “breaks.”
| Feature | Traditional RL | Deep RL |
|---|---|---|
| State Space | Small/Discrete (Tables) | High-dimensional (Pixels) |
| Brain | Q-Tables | Neural Networks |
| Scalability | Simple Games | Complex/Real-world |