12: Reinforcement Learning

Reinforcement Learning (RL) focuses on decision-making in dynamic and uncertain environments. Agents aim to maximize cumulative rewards over time by exploring and exploiting their environments.


12.1 The Reinforcement Learning Framework

The RL framework consists of:

  1. Agent: The learner or decision-maker.

  2. Environment: The external system with which the agent interacts.

  3. State (s): A representation of the current situation.

  4. Action (a): A choice made by the agent.

  5. Reward (r): Feedback from the environment indicating the value of an action.

  6. Policy (π(s)): A mapping from states to actions that defines the agent’s behavior.


12.1.1 The RL Cycle

  1. Observe: The agent perceives the current state (s).

  2. Act: It selects an action (a) based on its policy.

  3. Receive Feedback: The environment provides a reward (r) and a new state (s′).

  4. Update: The agent adjusts its policy to improve future rewards (the full loop is sketched below).
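
A minimal version of this loop, assuming a generic environment object with reset() and step() methods (in the spirit of toolkits such as Gymnasium) and a random placeholder policy, might look like the following sketch; the toy environment and all names are illustrative, not taken from the chapter.

```python
import random

class CoinFlipEnv:
    """Toy stand-in environment: guess a coin flip, earn +1 for a correct guess."""
    def reset(self):
        return 0                                  # a single dummy state
    def step(self, action):
        coin = random.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        return 0, reward, True                    # next state s', reward r, done

env = CoinFlipEnv()
for episode in range(3):
    s = env.reset()                               # 1. observe the current state s
    done = False
    while not done:
        a = random.randint(0, 1)                  # 2. act (placeholder policy)
        s_next, r, done = env.step(a)             # 3. receive reward r and new state s'
        # 4. update: a learning agent would adjust its policy here
        s = s_next
```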


12.2 Types of Reinforcement Learning

12.2.1 Model-Based RL

  • The agent builds a model of the environment to predict the consequences of actions.

  • Example: Chess-playing algorithms that simulate potential moves.


12.2.2 Model-Free RL

  • The agent learns directly from experience without building an explicit model of the environment.

Subtypes:

  1. Value-Based Methods: Learn a value function that estimates the expected rewards for actions.

    • Example: Q-Learning.

  2. Policy-Based Methods: Learn the policy directly, mapping states to actions.

    • Example: REINFORCE algorithm.

  3. Actor-Critic Methods: Combine value-based and policy-based approaches.

    • Example: Advantage Actor-Critic (A2C).


12.3 Key Concepts in RL

12.3.1 Rewards

  • Immediate feedback received after an action.

  • Example: A self-driving car gets a reward for reaching its destination safely.


12.3.2 Cumulative Reward

  • The total reward an agent aims to maximize over time.

Formula:

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots

where G_t is the cumulative (discounted) return at time t, and γ is the discount factor (0 ≤ γ ≤ 1), controlling the weight of future rewards.
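
As a quick numeric check, the return can be computed directly from this formula; the reward sequence and γ = 0.9 below are assumed values for illustration, not figures from the text.

```python
# Discounted return for a short, assumed reward sequence with gamma = 0.9.
rewards = [1.0, 0.0, 0.0, 5.0]    # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
gamma = 0.9

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0 + 0 + 0.9**3 * 5.0 = 4.645
```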


12.3.3 Exploration vs. Exploitation

  • Exploration: Trying new actions to discover their effects.

  • Exploitation: Choosing actions that maximize rewards based on existing knowledge.

Balancing exploration and exploitation is critical in RL.
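
A common way to strike this balance is an ε-greedy rule: with probability ε the agent tries a random action (exploration), otherwise it picks the best-known one (exploitation). The sketch below is a minimal illustration; the Q-value list and ε value are assumed, not taken from the chapter.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Return a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit

# q_values[a] holds the current estimate of action a's value in some state.
print(epsilon_greedy([0.2, 0.8, 0.5], epsilon=0.1))   # usually prints 1
```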


12.4 Value-Based Methods

12.4.1 Q-Learning

Q-Learning is a popular model-free algorithm that learns the optimal action-value function Q(s, a):

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

  • Q(s, a): The value of taking action a in state s.

  • α: Learning rate.

  • γ: Discount factor.

Example: A robot learns the best path in a maze by updating Q(s, a) values based on trial and error.
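
A minimal tabular sketch of this update rule is shown below; the grid-style state tuples, action names, and hyperparameters are illustrative assumptions rather than anything the chapter prescribes.

```python
from collections import defaultdict

# Q-table: Q[state][action], defaulting to 0.0 for unseen state-action pairs.
Q = defaultdict(lambda: defaultdict(float))
alpha, gamma = 0.1, 0.99
actions = ["up", "down", "left", "right"]

def q_learning_update(s, a, r, s_next):
    """One off-policy Q-Learning update for the transition (s, a, r, s')."""
    best_next = max(Q[s_next][a2] for a2 in actions)       # max_a' Q(s', a')
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Example transition: moving "right" from cell (0, 0) incurs a step penalty of -1.
q_learning_update((0, 0), "right", -1.0, (0, 1))
print(Q[(0, 0)]["right"])   # -0.1 after this first update
```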


12.4.2 SARSA

SARSA is another value-based method that updates Q(s, a) using the next action a′ the agent actually takes:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]

Difference from Q-Learning: SARSA is "on-policy," updating toward the value of the action its current policy actually selects next, while Q-Learning is "off-policy," updating toward the greedy (optimal) action regardless of what the agent does next.
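
For contrast with the Q-Learning sketch above, a SARSA update might look like the following; the function signature and default hyperparameters are illustrative assumptions.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One on-policy SARSA update for the transition (s, a, r, s', a')."""
    # Unlike Q-Learning, the target uses Q(s', a') for the action the agent
    # will actually take next, not the maximum over all actions.
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
```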


12.5 Policy-Based Methods

Policy-based methods directly optimize the agent's policy without relying on value functions.

12.5.1 Gradient Ascent for Policy Optimization

The agent adjusts its policy parameters θ to maximize expected rewards:

\theta \leftarrow \theta + \alpha \nabla J(\theta)

where J(θ) is the performance measure and ∇J(θ) is its gradient.

Example: Training a drone to balance in the air by optimizing its control policy.
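
One concrete way to realise this gradient-ascent step is the REINFORCE idea from Section 12.2.2 with a softmax policy on a toy one-step task; everything below (the task, parameters, and helper names) is an illustrative assumption, not the chapter's own example.

```python
import math
import random

# Softmax policy over two actions, parameterised by one score per action.
theta = [0.0, 0.0]
alpha = 0.1

def policy(theta):
    """Return action probabilities pi(a) under a softmax over the scores."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(reward_fn):
    """One REINFORCE-style gradient-ascent step on a one-step (bandit) task."""
    probs = policy(theta)
    a = random.choices([0, 1], weights=probs)[0]
    G = reward_fn(a)                            # return of the sampled action
    # For a softmax policy, d/d theta_i of log pi(a) = 1[i == a] - pi(i).
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * grad_log * G        # theta <- theta + alpha * grad J

# Hypothetical task: action 1 pays 1.0, action 0 pays nothing.
for _ in range(500):
    reinforce_step(lambda a: 1.0 if a == 1 else 0.0)
print(policy(theta))    # the probability of action 1 should now dominate
```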


12.6 Actor-Critic Methods

Actor-critic methods combine the strengths of value-based and policy-based approaches:

  • Actor: Updates the policy.

  • Critic: Evaluates the action taken by estimating its value.

Example: Advantage Actor-Critic (A2C) uses the advantage function A(s, a) = Q(s, a) − V(s) to improve learning stability.
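
In many implementations the critic learns only the state-value function V(s) and approximates the advantage with the TD error; this is a common design choice rather than something the chapter specifies, sketched below with assumed names.

```python
def advantage_estimate(V, s, r, s_next, gamma=0.99, done=False):
    """Approximate A(s, a) with the TD error: r + gamma * V(s') - V(s)."""
    target = r if done else r + gamma * V[s_next]
    return target - V[s]   # positive => the action did better than the critic expected

# The actor scales its policy-gradient step by this advantage,
# while the critic is trained to shrink the same error.
```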


12.7 Applications of Reinforcement Learning

12.7.1 Robotics

Robots learn to perform tasks like picking objects or navigating environments. Example: A robot arm learns to stack blocks efficiently.


12.7.2 Game AI

Reinforcement learning powers agents in games like Chess, Go, and StarCraft. Example: AlphaGo used RL to defeat human champions in Go.


12.7.3 Autonomous Vehicles

Self-driving cars use RL to optimize decisions for navigation, lane-changing, and collision avoidance.


12.7.4 Finance

RL is applied in trading systems to optimize portfolio strategies. Example: Algorithms adapt to market trends to maximize returns.


12.8 Summary

In this chapter, we explored:

  1. The framework and types of reinforcement learning (model-based and model-free).

  2. Key concepts like rewards, exploration vs. exploitation, and cumulative rewards.

  3. Value-based, policy-based, and actor-critic methods.

  4. Applications in robotics, games, autonomous vehicles, and finance.
