12: Reinforcement Learning

Reinforcement Learning (RL) focuses on decision-making in dynamic and uncertain environments. Agents aim to maximize cumulative rewards over time by exploring and exploiting their environments.


12.1 The Reinforcement Learning Framework

The RL framework consists of:

  1. Agent: The learner or decision-maker.

  2. Environment: The external system with which the agent interacts.

  3. State (s): A representation of the current situation.

  4. Action (a): A choice made by the agent.

  5. Reward (r): Feedback from the environment indicating the value of an action.

  6. Policy (π(s)): A mapping from states to actions that defines the agent’s behavior.


12.1.1 The RL Cycle

  1. Observe: The agent perceives the current state (s).

  2. Act: It selects an action (a) based on its policy.

  3. Receive Feedback: The environment provides a reward (r) and a new state (s′).

  4. Update: The agent adjusts its policy to improve future rewards (the full loop is sketched below).
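
A minimal version of this loop, assuming a generic environment object with reset() and step() methods (in the spirit of toolkits such as Gymnasium) and a random placeholder policy, might look like the following sketch; the toy environment and all names are illustrative, not taken from the chapter.

```python
import random

class CoinFlipEnv:
    """Toy stand-in environment: guess a coin flip, earn +1 for a correct guess."""
    def reset(self):
        return 0                                  # a single dummy state
    def step(self, action):
        coin = random.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        return 0, reward, True                    # next state s', reward r, done

env = CoinFlipEnv()
for episode in range(3):
    s = env.reset()                               # 1. observe the current state s
    done = False
    while not done:
        a = random.randint(0, 1)                  # 2. act (placeholder policy)
        s_next, r, done = env.step(a)             # 3. receive reward r and new state s'
        # 4. update: a learning agent would adjust its policy here
        s = s_next
```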


12.2 Types of Reinforcement Learning

12.2.1 Model-Based RL

  • The agent builds a model of the environment to predict the consequences of actions.

  • Example: Chess-playing algorithms that simulate potential moves.


12.2.2 Model-Free RL

  • The agent learns directly from experience without building an explicit model of the environment.

Subtypes:

  1. Value-Based Methods: Learn a value function that estimates the expected rewards for actions.

    • Example: Q-Learning.

  2. Policy-Based Methods: Learn the policy directly, mapping states to actions.

    • Example: REINFORCE algorithm.

  3. Actor-Critic Methods: Combine value-based and policy-based approaches.

    • Example: Advantage Actor-Critic (A2C).


12.3 Key Concepts in RL

12.3.1 Rewards

  • Immediate feedback received after an action.

  • Example: A self-driving car gets a reward for reaching its destination safely.


12.3.2 Cumulative Reward

  • The total reward an agent aims to maximize over time.

Formula:

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots

where G_t is the cumulative (discounted) return at time t, and γ is the discount factor (0 ≤ γ ≤ 1), controlling the weight of future rewards.
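
As a quick numeric check, the return can be computed directly from this formula; the reward sequence and γ = 0.9 below are assumed values for illustration, not figures from the text.

```python
# Discounted return for a short, assumed reward sequence with gamma = 0.9.
rewards = [1.0, 0.0, 0.0, 5.0]    # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
gamma = 0.9

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0 + 0 + 0.9**3 * 5.0 = 4.645
```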


12.3.3 Exploration vs. Exploitation

  • Exploration: Trying new actions to discover their effects.

  • Exploitation: Choosing actions that maximize rewards based on existing knowledge.

Balancing exploration and exploitation is critical in RL.
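
A common way to strike this balance is an ε-greedy rule: with probability ε the agent tries a random action (exploration), otherwise it picks the best-known one (exploitation). The sketch below is a minimal illustration; the Q-value list and ε value are assumed, not taken from the chapter.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Return a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit

# q_values[a] holds the current estimate of action a's value in some state.
print(epsilon_greedy([0.2, 0.8, 0.5], epsilon=0.1))   # usually prints 1
```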


12.4 Value-Based Methods

12.4.1 Q-Learning

Q-Learning is a popular model-free algorithm that learns the optimal action-value function Q(s, a):

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

  • Q(s, a): The value of taking action a in state s.

  • α: Learning rate.

  • γ: Discount factor.

Example: A robot learns the best path in a maze by updating Q(s, a) values based on trial and error.
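
A minimal tabular sketch of this update rule is shown below; the grid-style state tuples, action names, and hyperparameters are illustrative assumptions rather than anything the chapter prescribes.

```python
from collections import defaultdict

# Q-table: Q[state][action], defaulting to 0.0 for unseen state-action pairs.
Q = defaultdict(lambda: defaultdict(float))
alpha, gamma = 0.1, 0.99
actions = ["up", "down", "left", "right"]

def q_learning_update(s, a, r, s_next):
    """One off-policy Q-Learning update for the transition (s, a, r, s')."""
    best_next = max(Q[s_next][a2] for a2 in actions)       # max_a' Q(s', a')
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Example transition: moving "right" from cell (0, 0) incurs a step penalty of -1.
q_learning_update((0, 0), "right", -1.0, (0, 1))
print(Q[(0, 0)]["right"])   # -0.1 after this first update
```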


12.4.2 SARSA

SARSA is another value-based method that updates Q(s, a) using the next action a′ the agent actually takes:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]

Difference from Q-Learning: SARSA is "on-policy," updating toward the value of the action its current policy actually selects next, while Q-Learning is "off-policy," updating toward the greedy (optimal) action regardless of what the agent does next.
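
For contrast with the Q-Learning sketch above, a SARSA update might look like the following; the function signature and default hyperparameters are illustrative assumptions.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One on-policy SARSA update for the transition (s, a, r, s', a')."""
    # Unlike Q-Learning, the target uses Q(s', a') for the action the agent
    # will actually take next, not the maximum over all actions.
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
```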


12.5 Policy-Based Methods

Policy-based methods directly optimize the agent's policy without relying on value functions.

12.5.1 Gradient Ascent for Policy Optimization

The agent adjusts its policy parameters θ to maximize expected rewards:

\theta \leftarrow \theta + \alpha \nabla J(\theta)

where J(θ) is the performance measure and ∇J(θ) is its gradient.

Example: Training a drone to balance in the air by optimizing its control policy.
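
One concrete way to realise this gradient-ascent step is the REINFORCE idea from Section 12.2.2 with a softmax policy on a toy one-step task; everything below (the task, parameters, and helper names) is an illustrative assumption, not the chapter's own example.

```python
import math
import random

# Softmax policy over two actions, parameterised by one score per action.
theta = [0.0, 0.0]
alpha = 0.1

def policy(theta):
    """Return action probabilities pi(a) under a softmax over the scores."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(reward_fn):
    """One REINFORCE-style gradient-ascent step on a one-step (bandit) task."""
    probs = policy(theta)
    a = random.choices([0, 1], weights=probs)[0]
    G = reward_fn(a)                            # return of the sampled action
    # For a softmax policy, d/d theta_i of log pi(a) = 1[i == a] - pi(i).
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * grad_log * G        # theta <- theta + alpha * grad J

# Hypothetical task: action 1 pays 1.0, action 0 pays nothing.
for _ in range(500):
    reinforce_step(lambda a: 1.0 if a == 1 else 0.0)
print(policy(theta))    # the probability of action 1 should now dominate
```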


12.6 Actor-Critic Methods

Actor-critic methods combine the strengths of value-based and policy-based approaches:

  • Actor: Updates the policy.

  • Critic: Evaluates the action taken by estimating its value.

Example: Advantage Actor-Critic (A2C) uses the advantage function A(s, a) = Q(s, a) − V(s) to improve learning stability.
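
In many implementations the critic learns only the state-value function V(s) and approximates the advantage with the TD error; this is a common design choice rather than something the chapter specifies, sketched below with assumed names.

```python
def advantage_estimate(V, s, r, s_next, gamma=0.99, done=False):
    """Approximate A(s, a) with the TD error: r + gamma * V(s') - V(s)."""
    target = r if done else r + gamma * V[s_next]
    return target - V[s]   # positive => the action did better than the critic expected

# The actor scales its policy-gradient step by this advantage,
# while the critic is trained to shrink the same error.
```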


12.7 Applications of Reinforcement Learning

12.7.1 Robotics

Robots learn to perform tasks like picking objects or navigating environments. Example: A robot arm learns to stack blocks efficiently.


12.7.2 Game AI

Reinforcement learning powers agents in games like Chess, Go, and StarCraft. Example: AlphaGo used RL to defeat human champions in Go.


12.7.3 Autonomous Vehicles

Self-driving cars use RL to optimize decisions for navigation, lane-changing, and collision avoidance.


12.7.4 Finance

RL is applied in trading systems to optimize portfolio strategies. Example: Algorithms adapt to market trends to maximize returns.


12.8 Summary

In this chapter, we explored:

  1. The framework and types of reinforcement learning (model-based and model-free).

  2. Key concepts like rewards, exploration vs. exploitation, and cumulative rewards.

  3. Value-based, policy-based, and actor-critic methods.

  4. Applications in robotics, games, autonomous vehicles, and finance.
