Fundamentals of Reinforcement Learning Algorithms
What is Reinforcement Learning?
Reinforcement Learning (RL) is a type of machine learning in which an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. The agent interacts with the environment in discrete time steps, receiving feedback in the form of rewards or penalties based on its actions. The goal is to learn a policy that maps states to actions so as to maximize the long-term reward. These problems often involve dynamic and uncertain environments where the outcomes of actions are not immediately apparent and may depend on a series of prior decisions.
To address this complexity, RL relies on the Markov Decision Process (MDP) as the foundational framework. The MDP provides a mathematical model to describe the interaction between the agent and the environment, capturing the essential components: states, actions, transitions, and rewards. By doing so, it formalizes the problem into a structure that can be analyzed and optimized, enabling the development of algorithms to learn strategies or policies that guide the agent’s actions.
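Before formalizing these ideas, the agent-environment loop can be sketched in a few lines of Python. The snippet below is only an illustrative sketch: the ToyEnvironment class and its 'go'/'stay' actions are hypothetical and exist purely for this example. An agent following a random policy observes a state, takes an action, and receives the next state and a reward at each discrete time step.

import random

class ToyEnvironment:
    """A hypothetical two-state environment used only to illustrate the RL loop."""
    def __init__(self):
        self.state = 'A'

    def step(self, action):
        # 'go' moves the agent to state 'B'; reaching 'B' yields a reward of 1 and ends the episode
        self.state = 'B' if action == 'go' else 'A'
        reward = 1 if self.state == 'B' else 0
        done = self.state == 'B'
        return self.state, reward, done

env = ToyEnvironment()
state, total_reward, done = env.state, 0, False
while not done:
    action = random.choice(['go', 'stay'])  # A random policy, purely for illustration
    state, reward, done = env.step(action)  # The environment returns feedback
    total_reward += reward                  # The agent accumulates reward over time
print(f"Episode finished in state {state} with a total reward of {total_reward}")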
Components of Reinforcement Learning
- Agent
The agent is the decision-maker in the RL environment. It learns a policy to maximize the cumulative reward over time. The agent interacts with the environment by performing actions based on its policy and updates its knowledge using the feedback received (rewards or penalties).
- Key Features:
- Exploration: Discovering new actions to gather information about the environment.
- Exploitation: Leveraging known information to maximize rewards
- Environment
The environment is the external system with which the agent interacts. It defines the state space, action space, and the dynamics (transition probabilities) that dictate how actions influence the environment.
- State Space: A set of all possible situations or configurations the environment can be in.
- Action Space: The set of all possible actions the agent can take.
- Dynamics: Rules that describe how actions affect state transitions and rewards.
- State (S)
The state represents the current situation or context of the environment, as perceived by the agent. It provides the necessary information for the agent to decide the next action.
- State Representation: States can be discrete (e.g., chess-board positions) or continuous (e.g., robot position in a room).
- Partial Observability: In some cases, the agent may not have full knowledge of the current state (e.g., in partially observable Markov decision processes).
- Action (A)
Actions are the possible moves or decisions the agent can make at a given state. The agent chooses an action based on its policy.
- Deterministic vs. Stochastic Actions: Actions can produce deterministic outcomes or have probabilistic effects.
- Discrete vs. Continuous Actions: Action spaces can be discrete (e.g., move left/right) or continuous (e.g., steering angle of a car).
- Reward (R)
The reward is a scalar value received by the agent after taking an action in a particular state. It serves as feedback to guide the agent’s learning process.
- Policy (π)
The policy defines the agent’s behavior by mapping states to actions. It is the strategy that determines which action to take in a given state.
- Deterministic Policy: Maps each state to a specific action.
- Stochastic Policy: Maps each state to a probability distribution over actions.
- Optimal Policy: The policy that maximizes expected cumulative rewards.
- Value Function (V)
The value function estimates the expected cumulative reward obtained by starting from a specific state (or state-action pair) and then following a given policy. A short code sketch illustrating policies and the exploration/exploitation trade-off follows this list.
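To make the components above concrete, here is a minimal sketch using hypothetical states, actions, and Q-values (none of which come from the implementation later in this post). It shows a deterministic policy, a stochastic policy, and an epsilon-greedy policy that balances exploration and exploitation:

import random

# Hypothetical Q-values for two states and two actions (illustration only)
q_values = {
    'S0': {'left': 0.2, 'right': 0.8},
    'S1': {'left': 0.5, 'right': 0.1},
}

def deterministic_policy(state):
    """Deterministic policy: always pick the action with the highest Q-value."""
    return max(q_values[state], key=q_values[state].get)

def stochastic_policy(state):
    """Stochastic policy: sample an action from a probability distribution."""
    actions, probabilities = ['left', 'right'], [0.3, 0.7]
    return random.choices(actions, weights=probabilities)[0]

def epsilon_greedy_policy(state, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.choice(list(q_values[state]))  # Explore: random action
    return deterministic_policy(state)               # Exploit: greedy action

print(deterministic_policy('S0'))   # always 'right'
print(stochastic_policy('S0'))      # 'left' or 'right', sampled at random
print(epsilon_greedy_policy('S0'))  # mostly 'right', occasionally a random action

The epsilon-greedy rule shown here is the same action-selection strategy used by the Q-Learning and SARSA implementations later in this post.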
Comparison of RL with other ML approaches
| Aspect | Reinforcement Learning (RL) | Supervised Learning | Unsupervised Learning |
| --- | --- | --- | --- |
| Goal | Maximize cumulative rewards | Learn a mapping from inputs to outputs | Find hidden patterns in data |
| Data | Interaction with environment | Labeled data | Unlabeled data |
| Feedback | Rewards and penalties | Correct output labels | No explicit feedback |
| Learning Process | Trial and error, exploration and exploitation | Minimize prediction error | Discover structure in data |
| Key Components | Agent, Environment, State, Action, Reward, Policy | Input Features, Output Labels, Training Data | Input Features |
| Common Algorithms | Q-Learning, SARSA, Deep Q-Networks (DQN) | Linear Regression, SVM, Neural Networks | K-Means, PCA, Autoencoders |
Application of RL
Reinforcement Learning (RL) has a wide range of applications across various domains. Here are some notable examples:
- Game Playing – RL has been successfully applied to train agents to play complex games. Notable examples include:
- AlphaGo: Developed by DeepMind, AlphaGo uses RL to play the board game Go at a superhuman level.
- Atari Games: Deep Q-Networks (DQN) have been used to achieve human-level performance in playing various Atari games.
- Robotics – In robotics, RL is used to teach robots to perform tasks such as:
- Navigation: Robots learn to navigate through complex environments.
- Manipulation: Robots learn to grasp and manipulate objects.
- Locomotion: Robots learn to walk, run, or perform other forms of movement.
- Finance – RL is applied in finance for tasks such as:
- Portfolio Management: RL algorithms optimize the allocation of assets in a portfolio to maximize returns.
- Trading Strategies: RL is used to develop trading strategies that adapt to market conditions.
- Autonomous Vehicles – RL is crucial in the development of autonomous vehicles for:
- Path Planning: Vehicles learn to plan optimal paths in dynamic environments.
- Decision Making: Vehicles learn to make real-time decisions in complex traffic scenarios.
- Natural Language Processing (NLP) – In NLP, RL is used for:
- Dialogue Systems: RL helps in training chatbots and virtual assistants to have more natural and effective conversations.
- Machine Translation: RL improves the quality of machine translation systems by optimizing translation accuracy.
These applications demonstrate the versatility and power of RL in solving complex, real-world problems across various industries.
Description of RL algorithms
Q-Learning
Q-Learning is a model-free, off-policy reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov Decision Process (MDP). It aims to learn the quality of actions, denoted as Q-values, which represent the expected utility of taking a given action in a given state and following the optimal policy thereafter. The Q-values are updated with the following rule, based on the Bellman optimality equation:
- Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
- where α is the learning rate, γ is the discount factor, r is the reward, s' is the next state, and the maximum is taken over the actions a' available in s'.
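As a quick worked example with illustrative numbers (assumed values, not taken from the implementation later in this post): suppose α = 0.1, γ = 0.9, the current estimate is Q(s, a) = 2, the observed reward is r = 1, and the best Q-value in the next state is max_{a'} Q(s', a') = 3. The target is then 1 + 0.9 × 3 = 3.7, and the update moves Q(s, a) a fraction α of the way toward it:
- Q(s, a) ← 2 + 0.1 × (3.7 − 2) = 2 + 0.17 = 2.17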
SARSA (State-Action-Reward-State-Action)
SARSA is a model-free, on-policy reinforcement learning algorithm that updates the Q-values based on the action actually taken by the agent. It learns the value of the policy being followed by the agent. The Q-values are updated using the following rule:
- Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]
- where α is the learning rate, γ is the discount factor, r is the reward, s' is the next state, and a' is the next action actually taken by the agent.
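The only practical difference between Q-Learning and SARSA is the bootstrap target used in the update. The sketch below is a standalone illustration (with hypothetical helper functions, separate from the full implementations later in this post); q_next is assumed to be a dictionary mapping each action in the next state to its current Q-value.

def q_learning_target(reward, gamma, q_next):
    """Off-policy target: bootstrap from the best possible action in the next state."""
    return reward + gamma * max(q_next.values())

def sarsa_target(reward, gamma, q_next, next_action):
    """On-policy target: bootstrap from the action the agent will actually take next."""
    return reward + gamma * q_next[next_action]

# Same transition, different targets: reward 1, gamma 0.9, Q-values of the next state
q_next = {'a0': 3.0, 'a1': 1.0}
print(q_learning_target(1.0, 0.9, q_next))   # 1 + 0.9 * 3.0 = 3.7
print(sarsa_target(1.0, 0.9, q_next, 'a1'))  # 1 + 0.9 * 1.0 = 1.9, if the policy picks 'a1'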
Deep Q-Networks (DQN)
Deep Q-Networks (DQN) extend Q-Learning by using deep neural networks to approximate the Q-value function, making it suitable for environments with large or continuous state spaces. DQN uses experience replay to store and sample past experiences to break correlation and improve learning stability. It also utilizes a target network to stabilize training by reducing oscillations. The Q-value approximation is given by:
- Q(s, a; θ) ≈ Q*(s, a)
- where θ denotes the parameters (weights) of the neural network.
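A complete DQN is beyond the scope of this post, but the core pieces described above can be sketched compactly. The snippet below is a minimal, illustrative sketch that assumes PyTorch (any deep learning library would do) and hypothetical sizes STATE_DIM and NUM_ACTIONS; each stored transition is assumed to be a (state, action_index, reward, next_state, done_flag) tuple with states given as lists of floats.

import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 4, 2  # Illustrative sizes, assumed for this sketch

def build_q_network():
    """A small fully connected network mapping a state to one Q-value per action."""
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))

q_net = build_q_network()
target_net = build_q_network()
target_net.load_state_dict(q_net.state_dict())  # The target network starts as a copy

replay_buffer = deque(maxlen=10_000)  # Experience replay: (state, action, reward, next_state, done)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(batch_size=32):
    """One DQN update on a random minibatch sampled from the replay buffer."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Q(s, a; θ) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions).squeeze(1)
    # Target from the frozen target network: r + γ max_a' Q(s', a'), zeroed at terminal states
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
    targets = rewards + gamma * (1.0 - dones) * max_next_q

    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In practice, transitions are appended to replay_buffer as the agent interacts with its environment, and the target network's weights are refreshed from the online network every few thousand steps to keep training stable.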
Implementation of RL using an MDP
MDP Class – The implementation of the Markov Decision Process (MDP), the foundational framework that provides the structure for defining RL problems, is provided below:
class MDP:
    def __init__(self, states, actions, transitions, rewards, gamma=0.9):
        """
        Initialize MDP components.
        :param states: List of states
        :param actions: List of actions
        :param transitions: Nested dict, P[state][action] = [(prob, next_state)]
        :param rewards: Dict, R[state][action][next_state] = reward
        :param gamma: Discount factor (0 <= gamma <= 1)
        """
        self.states = states
        self.actions = actions
        self.transitions = transitions
        self.rewards = rewards
        self.gamma = gamma

    def get_transition(self, state, action):
        """Return possible next states and probabilities for a state-action pair."""
        try:
            return self.transitions[state][action]
        except KeyError:
            raise ValueError(f"Invalid state-action pair: ({state}, {action})")

    def get_reward(self, state, action, next_state):
        """Return reward for a state-action-next_state triplet."""
        try:
            return self.rewards[state][action][next_state]
        except KeyError:
            raise ValueError(f"Invalid state-action-next_state triplet: ({state}, {action}, {next_state})")
Q-Learning – The implementation of Q-Learning, a model-free, off-policy reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov Decision Process (MDP), is provided below:
import random

from markov_decision_process import MDP


class QLearning:
    def __init__(self, mdp, alpha=0.1, gamma=0.9, epsilon=0.1):
        """
        Initialize Q-learning components.
        :param mdp: An instance of the MDP class
        :param alpha: Learning rate (0 < alpha <= 1)
        :param gamma: Discount factor (0 <= gamma <= 1)
        :param epsilon: Exploration rate (0 <= epsilon <= 1)
        """
        self.mdp = mdp
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.q_table = {state: {action: 0 for action in mdp.actions} for state in mdp.states}

    def choose_action(self, state):
        """Choose an action based on epsilon-greedy policy."""
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(self.mdp.actions)  # Explore
        else:
            return max(self.q_table[state], key=self.q_table[state].get)  # Exploit

    def update_q_value(self, state, action, reward, next_state):
        """Update Q-value based on the observed transition."""
        best_next_action = max(self.q_table[next_state], key=self.q_table[next_state].get)
        td_target = reward + self.gamma * self.q_table[next_state][best_next_action]
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += self.alpha * td_error

    def simulate(self, episodes=1000):
        """Run the Q-learning simulation."""
        for episode in range(episodes):
            state = random.choice(self.mdp.states)
            while True:
                action = self.choose_action(state)
                try:
                    next_state = random.choices(
                        [s for (p, s) in self.mdp.get_transition(state, action)],
                        [p for (p, s) in self.mdp.get_transition(state, action)]
                    )[0]
                    reward = self.mdp.get_reward(state, action, next_state)
                except ValueError as e:
                    print(e)
                    return
                self.update_q_value(state, action, reward, next_state)
                state = next_state
                if state == 'S2':  # Assuming 'S2' is a terminal state
                    break

    def get_policy(self):
        """Return the optimal policy based on Q-values."""
        policy = {state: max(actions, key=actions.get) for state, actions in self.q_table.items()}
        return policy

    def display_q_table(self):
        """Display the Q-table."""
        print("Q-Table:")
        for state, actions in self.q_table.items():
            print(f"State {state}: {actions}")

    def display_policy(self):
        """Display the policy."""
        policy = self.get_policy()
        print("Policy:")
        for state, action in policy.items():
            print(f"State {state}: {action}")


def q_learning(mdp: MDP):
    try:
        ql = QLearning(mdp)
        ql.simulate(episodes=1000)
        print("Optimal Policy (Q-Learning):")
        ql.display_q_table()
        ql.display_policy()
    except Exception as e:
        print(f"Error during Q-Learning: {e}")
SARSA (State-Action-Reward-State-Action) – The implementation of SARSA, a model-free, on-policy reinforcement learning algorithm that updates the Q-values based on the action actually taken by the agent, is provided below:
import random

from markov_decision_process import MDP


class SARSA:
    def __init__(self, mdp, alpha=0.1, gamma=0.9, epsilon=0.1):
        """
        Initialize SARSA components.
        :param mdp: An instance of the MDP class
        :param alpha: Learning rate (0 < alpha <= 1)
        :param gamma: Discount factor (0 <= gamma <= 1)
        :param epsilon: Exploration rate (0 <= epsilon <= 1)
        """
        self.mdp = mdp
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.q_table = {state: {action: 0 for action in mdp.actions} for state in mdp.states}

    def choose_action(self, state):
        """Choose an action based on epsilon-greedy policy."""
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(self.mdp.actions)  # Explore
        else:
            return max(self.q_table[state], key=self.q_table[state].get)  # Exploit

    def update_q_value(self, state, action, reward, next_state, next_action):
        """Update Q-value based on the observed transition."""
        td_target = reward + self.gamma * self.q_table[next_state][next_action]
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += self.alpha * td_error

    def simulate(self, episodes=1000):
        """Run the SARSA simulation."""
        for episode in range(episodes):
            state = random.choice(self.mdp.states)
            action = self.choose_action(state)
            while True:
                try:
                    next_state = random.choices(
                        [s for (p, s) in self.mdp.get_transition(state, action)],
                        [p for (p, s) in self.mdp.get_transition(state, action)]
                    )[0]
                    reward = self.mdp.get_reward(state, action, next_state)
                    next_action = self.choose_action(next_state)
                except ValueError as e:
                    print(e)
                    return
                self.update_q_value(state, action, reward, next_state, next_action)
                state, action = next_state, next_action
                if state == 'S2':  # Assuming 'S2' is a terminal state
                    break

    def get_policy(self):
        """Return the optimal policy based on Q-values."""
        policy = {state: max(actions, key=actions.get) for state, actions in self.q_table.items()}
        return policy

    def display_q_table(self):
        """Display the Q-table."""
        print("Q-Table:")
        for state, actions in self.q_table.items():
            print(f"State {state}: {actions}")

    def display_policy(self):
        """Display the policy."""
        policy = self.get_policy()
        print("Policy:")
        for state, action in policy.items():
            print(f"State {state}: {action}")


def sarsa(mdp: MDP):
    # Solve using SARSA
    try:
        sarsa = SARSA(mdp)
        sarsa.simulate(episodes=1000)
        print("\nOptimal Policy (SARSA):")
        sarsa.display_q_table()
        sarsa.display_policy()
    except Exception as e:
        print(f"Error during SARSA: {e}")
Test and results
Test setup
The execute function in the provided code demonstrates how to test the RL setup using both the Q-Learning and SARSA algorithms described above. It defines three states ('S0', 'S1', 'S2') and two actions ('a0', 'a1'), as illustrated in the diagram below, and specifies the transitions and rewards for each state-action pair. An MDP instance is created with these parameters and a discount factor (gamma) of 0.9. The code then initializes and runs the Q-Learning and SARSA algorithms on this MDP, displaying the optimal policy and Q-table for each algorithm.

from markov_decision_process import MDP
from q_learning import q_learning
from sarsa import sarsa


def execute():
    states = ['S0', 'S1', 'S2']
    actions = ['a0', 'a1']
    transitions = {
        'S0': {
            'a0': [(1.0, 'S1')],
            'a1': [(1.0, 'S2')],
        },
        'S1': {
            'a0': [(1.0, 'S0')],
            'a1': [(1.0, 'S2')],
        },
        'S2': {
            'a0': [(1.0, 'S0')],
            'a1': [(1.0, 'S1')],
        }
    }
    rewards = {
        'S0': {
            'a0': {'S1': 1},
            'a1': {'S2': 0},
        },
        'S1': {
            'a0': {'S0': 0},
            'a1': {'S2': 2},
        },
        'S2': {
            'a0': {'S0': 1},
            'a1': {'S1': 2},
        }
    }
    mdp = MDP(states, actions, transitions, rewards, gamma=0.9)

    # Solve using Q-Learning
    q_learning(mdp)

    # Solve with SARSA
    sarsa(mdp)
The main block in the provided code serves as the application entry point that runs the example and displays the results:
# Application entrypoint
if __name__ == "__main__":
    execute()
Results
Running the above code prints, for each of the two algorithms (Q-Learning and SARSA), the learned Q-table of state-action values and the optimal policy derived from those values.

Conclusion
Reinforcement Learning (RL) combined with Markov Decision Processes (MDPs) provides a robust framework for solving decision-making problems influenced by both uncertainty and the actions of a decision-maker. By modeling the environment through states, actions, transitions, rewards, and a discount factor, MDPs enable the analysis and optimization of policies to maximize cumulative, long-term rewards.
Algorithms such as Q-Learning and SARSA are essential tools that iteratively refine strategies through interaction and learning. These methods empower agents to make optimal decisions even in complex, uncertain environments, making RL a cornerstone of intelligent systems development across diverse domains.
By leveraging RL with MDPs and their associated algorithms, decision-makers can develop robust strategies that account for uncertainty and optimize long-term outcomes in various applications, including robotics, finance, healthcare, and more.
Key Takeaways
- Structured Framework: MDPs provide a systematic approach to modeling decision-making problems under uncertainty.
- Policy Optimization: RL analyzes and optimizes policies to maximize long-term rewards.
- Effective Algorithms: Q-Learning and SARSA enable agents to iteratively learn and improve decision-making.
- Wide Applications: RL with MDPs enhances decision-making in fields like robotics, finance, and healthcare.
- Sustainability: The focus on maximizing long-term outcomes ensures robust and adaptable strategies.