12: Reinforcement Learning

Reinforcement Learning (RL) focuses on decision-making in dynamic and uncertain environments. Agents aim to maximize cumulative rewards over time by exploring and exploiting their environments.
12.1 The Reinforcement Learning Framework
The RL framework consists of:
Agent: The learner or decision-maker.
Environment: The external system with which the agent interacts.
State ($s$): A representation of the current situation.
Action ($a$): A choice made by the agent.
Reward ($r$): Feedback from the environment indicating the value of an action.
Policy ($\pi(s)$): A mapping from states to actions that defines the agent’s behavior.
12.1.1 The RL Cycle
Observe: The agent perceives the current state ($s$).
Act: It selects an action ($a$) based on its policy.
Receive Feedback: The environment provides a reward ($r$) and a new state ($s'$).
Update: The agent adjusts its policy to improve future rewards.
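The loop below is a minimal sketch of this cycle in Python; the Environment and Agent classes are hypothetical toy stand-ins, not part of any RL library.

```python
import random

# A minimal sketch of the observe-act-feedback-update cycle.
# Environment and Agent are hypothetical toy stand-ins.

class Environment:
    """Trivial environment: states 0..4, reaching state 4 ends the episode."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: +1 moves right, -1 moves left (clamped at 0)
        self.state = max(0, self.state + action)
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

class Agent:
    """Placeholder agent with a random policy and a no-op update."""
    def act(self, state):
        return random.choice([-1, +1])      # pick an action from the (random) policy

    def update(self, state, action, reward, next_state):
        pass                                # a real agent would learn here

env, agent = Environment(), Agent()
state = env.reset()                                   # Observe
for _ in range(100):                                  # cap steps for safety
    action = agent.act(state)                         # Act
    next_state, reward, done = env.step(action)       # Receive feedback
    agent.update(state, action, reward, next_state)   # Update
    state = next_state
    if done:
        break
```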
12.2 Types of Reinforcement Learning
12.2.1 Model-Based RL
The agent builds a model of the environment to predict the consequences of actions.
Example: Chess-playing algorithms that simulate potential moves.
12.2.2 Model-Free RL
The agent learns directly from experience without building an explicit model of the environment.
Subtypes:
Value-Based Methods: Learn a value function that estimates the expected rewards for actions.
Example: Q-Learning.
Policy-Based Methods: Learn the policy directly, mapping states to actions.
Example: REINFORCE algorithm.
Actor-Critic Methods: Combine value-based and policy-based approaches.
Example: Advantage Actor-Critic (A2C).
12.3 Key Concepts in RL
12.3.1 Rewards
Immediate feedback received after an action.
Example: A self-driving car gets a reward for reaching its destination safely.
12.3.2 Cumulative Reward
The total reward an agent aims to maximize over time.
Formula: $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$, where $G_t$ is the cumulative (discounted) reward, or return, from time $t$, and $\gamma$ is the discount factor ($0 \le \gamma \le 1$), controlling the weight of future rewards.
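As a quick illustration, the snippet below computes $G_t$ for a made-up reward sequence and discount factor.

```python
# A short sketch of the discounted return G_t.
# The rewards and gamma below are made-up example values.

rewards = [0.0, 0.0, 1.0, 0.5]   # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
gamma = 0.9

# G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)  # 0.0 + 0.9*0.0 + 0.81*1.0 + 0.729*0.5 = 1.1745
```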
12.3.3 Exploration vs. Exploitation
Exploration: Trying new actions to discover their effects.
Exploitation: Choosing actions that maximize rewards based on existing knowledge.
Balancing exploration and exploitation is critical in RL.
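One common balancing strategy is ε-greedy action selection: with probability ε the agent explores a random action, and otherwise it exploits the best-known one. The sketch below assumes an illustrative Q-table and ε value.

```python
import random

# A minimal sketch of epsilon-greedy action selection.
# The Q-table below is an illustrative placeholder, not learned values.

epsilon = 0.1
actions = ["left", "right"]
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}

def epsilon_greedy(state):
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit

print(epsilon_greedy("s0"))  # usually "right", occasionally a random action
```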
12.4 Value-Based Methods
12.4.1 Q-Learning
Q-Learning is a popular model-free algorithm that learns the optimal action-value function $Q(s, a)$:
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
$Q(s, a)$: The value of taking action $a$ in state $s$.
$\alpha$: Learning rate.
$\gamma$: Discount factor.
Example: A robot learns the best path in a maze by updating $Q(s, a)$ values based on trial and error.
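A compact sketch of tabular Q-Learning on a toy one-dimensional "maze" appears below; the environment and hyperparameters are illustrative stand-ins, not a reference implementation.

```python
import random
from collections import defaultdict

# A compact sketch of tabular Q-Learning on a toy 1-D "maze":
# states 0..4, the goal is state 4. The environment is a made-up stand-in.

alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = [-1, +1]                          # move left / move right
Q = defaultdict(float)                      # Q[(state, action)], defaults to 0

def step(state, action):
    next_state = min(4, max(0, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-Learning update: bootstrap from the best action in the next state
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

print(max(actions, key=lambda a: Q[(0, a)]))  # learned best first move (expected: +1)
```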
12.4.2 SARSA
SARSA is another value-based method that updates $Q(s, a)$ using the action actually taken in the next state:
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$
Difference from Q-Learning: SARSA is "on-policy," meaning it updates based on the policy the agent follows, while Q-Learning is "off-policy," updating based on the optimal action.
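The snippet below contrasts the two update targets on placeholder values: SARSA bootstraps from the action the policy actually takes next, while Q-Learning bootstraps from the best available next action.

```python
# A minimal sketch contrasting the SARSA and Q-Learning targets.
# Q, alpha, and gamma are illustrative placeholders.

alpha, gamma = 0.1, 0.9
actions = [-1, +1]
Q = {("s", -1): 0.2, ("s", +1): 0.5, ("s_next", -1): 0.4, ("s_next", +1): 0.3}

s, a, r, s_next = "s", +1, 1.0, "s_next"
a_next = -1   # the action the current policy actually chooses in s_next

# SARSA (on-policy): bootstrap from the action actually taken next
sarsa_target = r + gamma * Q[(s_next, a_next)]

# Q-Learning (off-policy): bootstrap from the best possible next action
q_learning_target = r + gamma * max(Q[(s_next, b)] for b in actions)

Q[(s, a)] += alpha * (sarsa_target - Q[(s, a)])   # either target is applied the same way
```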
12.5 Policy-Based Methods
Policy-based methods directly optimize the agent's policy without relying on value functions.
12.5.1 Gradient Ascent for Policy Optimization
The agent adjusts its policy parameters to maximize expected rewards: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $J(\theta)$ is the performance measure and $\nabla_\theta J(\theta)$ is its gradient.
Example: Training a drone to balance in the air by optimizing its control policy.
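The sketch below applies a REINFORCE-style policy-gradient update to a two-armed bandit with a softmax policy; the reward probabilities and learning rate are made-up example values.

```python
import numpy as np

# A small sketch of the REINFORCE policy-gradient update on a two-armed bandit.
# The reward probabilities below are made-up example values.

rng = np.random.default_rng(0)
theta = np.zeros(2)          # one preference per action (softmax policy parameters)
alpha = 0.1
reward_prob = [0.2, 0.8]     # hypothetical chance of reward for each arm

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = float(rng.random() < reward_prob[action])

    # For a softmax policy, grad log pi(a) = one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += alpha * reward * grad_log_pi   # theta <- theta + alpha * G * grad log pi

print(softmax(theta))  # probability mass should concentrate on the better arm (index 1)
```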
12.6 Actor-Critic Methods
Actor-critic methods combine the strengths of value-based and policy-based approaches:
Actor: Updates the policy.
Critic: Evaluates the action taken by estimating its value.
Example: Advantage Actor-Critic (A2C) uses the advantage function $A(s, a) = Q(s, a) - V(s)$ to improve learning stability.
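The snippet below sketches how one actor-critic step can use a one-step advantage estimate as the learning signal; the values and the dictionary-based critic are illustrative placeholders.

```python
# A minimal sketch of one actor-critic step using the advantage as the learning signal.
# All values and the dictionary-based critic are illustrative placeholders.

alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.9

V = {"s": 0.4, "s_next": 0.6}    # critic's current state-value estimates
r = 1.0                           # reward observed after taking action a in state s

# One-step advantage estimate: A(s, a) ~ r + gamma * V(s') - V(s)
advantage = r + gamma * V["s_next"] - V["s"]

# Critic: move V(s) toward the bootstrapped target
V["s"] += alpha_critic * advantage

# Actor: scale the policy-gradient step for action a by the advantage
# (with a softmax policy: theta += alpha_actor * advantage * grad log pi(a|s))
print(advantage, V["s"])
```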
12.7 Applications of Reinforcement Learning
12.7.1 Robotics
Robots learn to perform tasks like picking objects or navigating environments. Example: A robot arm learns to stack blocks efficiently.
12.7.2 Game AI
Reinforcement learning powers agents in games like Chess, Go, and StarCraft. Example: AlphaGo used RL to defeat human champions in Go.
12.7.3 Autonomous Vehicles
Self-driving cars use RL to optimize decisions for navigation, lane-changing, and collision avoidance.
12.7.4 Finance
RL is applied in trading systems to optimize portfolio strategies. Example: Algorithms adapt to market trends to maximize returns.
12.8 Summary
In this chapter, we explored:
The framework and types of reinforcement learning (model-based and model-free).
Key concepts like rewards, exploration vs. exploitation, and cumulative rewards.
Value-based, policy-based, and actor-critic methods.
Applications in robotics, games, autonomous vehicles, and finance.