Self-Learning AI Agents, Part 1: Markov Decision Processes


Written by Artem Oppermann

This is the first article of a multi-part series on self-learning AI agents or, more precisely, deep reinforcement learning. The purpose of the series is not merely to give you an intuition for these topics; rather, I want to give you a deeper understanding of the theory, mathematics, and implementation behind the most popular and effective approaches to deep reinforcement learning.

Self-Learning AI Agents Series: Contents

Part 1: Markov Decision Processes (this article)

Part 2: Deep Q-Learning

Part 3: Deep (Double) Q-Learning

Part 4: Policy Gradients for Continuous Action Spaces

Part 5: Dueling Networks

Part 6: Asynchronous Actor-Critic Agents

...

Figure 1. An AI agent learns how to run and overcome obstacles using Markov decision processes

Contents

0. Introduction

1. Reinforcement Learning in a Nutshell

2. Markov Decision Processes

2.1 Markov Process

2.2 Markov Reward Process

2.3 Value Function

3. Bellman Equation

3.1 Bellman Equation for Markov Reward Processes

3.2 Markov Decision Processes: Definition

3.3 Policy

3.4 Action-Value Function

3.5 Optimal Policy

3.6 Bellman Optimality Equation

0. Introduction

Deep reinforcement learning is on the rise. In recent years, no other subfield of deep learning has received more attention from researchers and industry media around the world. The greatest recent achievements in deep learning are due to deep reinforcement learning: Google's AlphaGo beat the world champion at Go (an achievement considered impossible a few years earlier), and DeepMind's AI agents taught themselves to walk, run, and overcome obstacles (Figures 1-3).

Figure 2. An AI agent learning how to run and overcome obstacles

Figure 3. An AI agent learning how to run and overcome obstacles

Other AI agents have outperformed humans at playing Atari games since 2014 (Figure 4). In my opinion, the most amazing fact about all of this is that none of these AI agents were explicitly programmed or taught by humans how to solve these tasks. They taught themselves through the power of deep learning and reinforcement learning. The goal of this first article of the multi-part series is to provide the mathematical foundation needed to tackle the most promising areas of this subfield in the upcoming articles.

Figure 4. AI agents learning how to play Atari games

1. Reinforcement Learning in a Nutshell

Deep reinforcement learning can be summarized as building an algorithm (or AI agent) that learns directly from interactions with its environment (Figure 5). The environment may be the real world, a computer game, a simulation, or even a board game such as Go or chess. Like humans, AI agents learn from the consequences of their actions rather than from explicit teachings.

Figure 5. Schematic diagram of deep reinforcement learning

In deep reinforcement learning, the agent is represented by a neural network. The neural network interacts directly with the environment: it observes the current state of the environment and decides which action to take (e.g., move left, move right, etc.) based on the current state and its past experience. Based on the action taken, the AI agent receives a reward. The amount of reward determines the quality of the taken action with regard to solving the given problem (e.g., learning how to walk). The agent's goal is to learn to act in any given situation so as to maximize the accumulated reward.
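
To make this loop concrete, here is a minimal sketch in Python. The env object (with reset and step methods) and the agent object (with act and learn methods) are hypothetical placeholders introduced only for illustration; the article itself does not define such an interface.

    # Minimal sketch of the observe-act-reward loop described above.
    # `env` and `agent` are assumed, Gym-like placeholder objects.
    def run_episode(env, agent, max_steps=1000):
        state = env.reset()                                 # observe the initial state
        total_reward = 0.0
        for _ in range(max_steps):
            action = agent.act(state)                       # decide based on state and experience
            next_state, reward, done = env.step(action)     # environment reacts to the action
            agent.learn(state, action, reward, next_state)  # learn from the consequence
            total_reward += reward
            state = next_state
            if done:
                break
        return total_reward                                 # the quantity the agent tries to maximize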

2. Markov Decision Processes

A Markov decision process (MDP) is a discrete-time stochastic control process. MDPs are the best approach we have so far to model the complex environment of an AI agent. Every problem the agent aims to solve can be considered as a sequence of states S1, S2, S3, ..., Sn (a state may be, for example, a Go or chess board configuration). The agent takes actions and thereby moves from one state to another. In the following, you will learn the mathematics that determines which action the agent must take in any given situation.

Equation 1 Markov property

2.1 Markov Process

A Markov process is a stochastic model describing a sequence of possible states, in which the current state depends only on the previous state. This is also called the Markov property (Equation 1). For reinforcement learning, it means that the next state of an AI agent depends only on the current state and not on all the states before it.
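
Since the equation images are not reproduced here, Equation 1 (the Markov property) is restated below in its standard textbook form, where S_t denotes the state at time step t:

    P[S_{t+1} | S_t] = P[S_{t+1} | S_1, S_2, ..., S_t]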

A Markov process is a stochastic process, which means that the transition from the current state s to the next state s' can only happen with a certain probability Pss' (Equation 2). In a Markov process, an agent that is told to go left would go left only with a certain probability (e.g., 0.998); with the remaining small probability, it is up to the environment to decide where the agent ends up.

Equation 2 Transition probability from state s to state s'

Pss' can be thought of as an entry in the state transition matrix P, which defines the transition probabilities from all states s to all successor states s' (Equation 3).

Equation 3 Transition probability matrix
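
Reconstructed in standard notation (the original equation images are not shown), Equation 2 and Equation 3 read:

    P_{ss'} = P[S_{t+1} = s' | S_t = s]

    P = \begin{pmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nn} \end{pmatrix}, where each row sums to 1.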

Remember: a Markov process (or Markov chain) is a tuple (S, P), where S is a (finite) set of states and P is the state transition probability matrix.
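
As a toy illustration of such a tuple (S, P), the three-state chain below is invented for this article only; it simply shows how a trajectory is sampled row by row from P.

    import numpy as np

    # Hypothetical 3-state Markov chain; each row of P sums to 1.
    P = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])

    rng = np.random.default_rng(seed=0)
    state = 0
    trajectory = [state]
    for _ in range(10):
        state = rng.choice(3, p=P[state])   # next state drawn from the row of the current state
        trajectory.append(int(state))
    print(trajectory)                       # one random walk through the chain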

2.2 Markov Reward Process

A Markov reward process is a tuple (S, P, R), where R is the reward the agent expects to receive in state s (Equation 4). The motivation for this process is that, for an AI agent aiming to achieve a certain goal such as winning a chess game, certain states (game configurations) are more promising than others in terms of strategy and the potential to win the game.

Equation 4: Expected reward in state s

The main topic of interest is the total reward Gt (Equation 5), which is the expected accumulated reward the agent will receive across the sequence of all states. Each reward is weighted by the so-called discount factor γ ∈ [0, 1]. Discounting rewards is mathematically convenient, since it avoids infinite returns in cyclic Markov processes. Besides, the discount factor means that the further in the future a reward lies, the less important it becomes, because the future is often uncertain: if the reward is financial, an immediate reward may earn more interest than a delayed one, and animal and human behavior likewise shows a preference for immediate rewards.

Equation 5 Total reward for all states
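
Written out (reconstructed from the standard definition), Equation 5 is

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},   with \gamma \in [0, 1]

and for a finite reward sequence the discounted sum can be computed with a one-line Python helper (illustrative only, not from the original article):

    # Discounted return of a finite sequence of rewards.
    def discounted_return(rewards, gamma=0.99):
        return sum(gamma ** k * r for k, r in enumerate(rewards))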

2.3 Value Function

Another important concept is the value function v(s). The value function maps a value to each state s. The value of a state s is defined as the expected total reward the AI agent will receive if it starts its progress in state s (Equation 6).

Equation 6: The value function as the expected return starting from state s

The value function can be broken down into two parts:

The immediate reward R(t+1) the agent receives for being in state s.

The discounted value v(S(t+1)) of the state that follows state s.

Equation 7: Decomposition of the value function
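
Reconstructed in standard notation, Equation 6 and its decomposition in Equation 7 are:

    v(s) = E[ G_t | S_t = s ]                                   (Equation 6)
    v(s) = E[ R_{t+1} + \gamma v(S_{t+1}) | S_t = s ]           (Equation 7)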

3. Bellman Equation

3.1 Bellman Equation for Markov Reward Processes

The decomposed value function (Equation 8) is also called the Bellman equation for Markov reward processes. This function can be visualized in a node graph (Figure 6): starting in state s leads to the value v(s). Being in state s, we have a certain probability Pss' of ending up in the next state s'; in this particular case there are two possible successor states. To obtain the value v(s), we must sum up the values v(s') of the possible successor states, weighted by the probabilities Pss', and add the immediate reward for being in state s. This yields Equation 9, which is nothing other than Equation 8 with the expectation operator E executed.

Equation 8: The decomposed value function

Figure 6. Stochastic transition from s to s'

Equation 9 Bellman equation after executing expectation operator E
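
After executing the expectation operator, the Bellman equation of Equation 9 takes the standard form

    v(s) = R_s + \gamma \sum_{s'} P_{ss'} v(s')

In matrix form this is v = R + γPv, which for a small, fully known Markov reward process can be solved exactly as v = (I − γP)^{-1} R. The sketch below reuses the invented three-state transition matrix from above together with made-up rewards; it is not part of the original article.

    import numpy as np

    # Solve v = R + gamma * P v exactly for a tiny, invented MRP.
    P = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])
    R = np.array([1.0, 0.0, 10.0])      # expected immediate reward per state
    gamma = 0.9

    v = np.linalg.solve(np.eye(3) - gamma * P, R)
    print(v)                             # value of each of the three states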

3.2 Markov Decision Processes: Definition

A Markov decision process is a Markov reward process with decisions. It is described by the tuple (S, A, P, R), where A is a finite set of possible actions the agent can take in state s. Thus, the immediate reward of being in state s now also depends on the action a the agent takes in that state (Equation 10).

Equation 10: The expected reward depends on the state and the action

3.3 Policy

At this point, we discuss how the agent decides which action must be taken in a particular state. This is determined by the so-called policy π (Equation 11). Mathematically, a policy is a distribution over all actions given a state. The policy determines the mapping from a state s to the action a that the agent must take.

Equation 11: The policy as a mapping from state s to action a
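
Reconstructed, Equation 11 defines the policy as a distribution over actions given the state:

    \pi(a | s) = P[ A_t = a | S_t = s ]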

Remember: intuitively speaking, the policy π can be described as the agent's strategy for selecting certain actions depending on the current state s.

Introducing a policy leads to a new definition of the state-value function v(s) (Equation 12), which we now define as the expected return starting in state s and then following the policy π.

Equation 12 State Value Function
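
In standard notation, Equation 12 reads:

    v_\pi(s) = E_\pi[ G_t | S_t = s ]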

3.4 Action-Value Function

Another important function, besides the state-value function, is the so-called action-value function q(s, a) (Equation 13). The action-value function is the expected return we obtain by starting in state s, taking action a, and then following the policy π. Note that for a state s, q(s, a) can take several values, since the agent can take several different actions in state s. The calculation of Q(s, a) is achieved by a neural network: given a state as input, the network calculates the quality of each possible action in that state as a scalar (Figure 7). Higher quality means a better action with regard to the given goal.

Figure 7. Graphical representation of the action value function

Remember: the action value function tells us how good it is to take a particular action in a particular state.

Equation 13: Action Value Function
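
In standard notation, Equation 13 reads q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a ]. The network of Figure 7 can be sketched as a small multilayer perceptron that maps a state vector to one Q-value per action. The PyTorch code below is only an illustrative assumption; the article does not prescribe a framework, layer sizes, or the dimensions of the state and action spaces.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to one scalar Q-value per possible action."""
        def __init__(self, state_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),   # one output per action
            )

        def forward(self, state):
            return self.net(state)

    # Usage: pick the action whose predicted quality is highest.
    q_net = QNetwork(state_dim=4, n_actions=2)
    dummy_state = torch.zeros(1, 4)
    best_action = q_net(dummy_state).argmax(dim=1).item()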

Earlier, we saw that the state-value function v(s) can be decomposed into the following form:

Equation 14: The decomposed state-value function

The same decomposition can be applied to the action-value function:

Equation 15: The decomposed action-value function

At this point, we discuss how v(s) and q(s, a) relate to each other. The relationship between these functions can again be visualized in a graph:

Figure 8. Visualization of the relationship between v (s) and q (s, a)

Being in state s in this example allows us to take two possible actions a. By definition, taking a particular action in a particular state gives us the action value q(s, a). The state-value function v(s) is the sum of the possible q(s, a), each weighted by the probability of taking action a in state s, which is nothing other than the policy π (Equation 16).

Equation 16: The state-value function as a weighted sum of action values
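
Reconstructed, Equation 16 is:

    v_\pi(s) = \sum_{a \in A} \pi(a | s) \, q_\pi(s, a)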

Let us now consider the opposite case in Figure 9. The root of the binary tree is now a state in which we choose to take a particular action a. Remember that the Markov process is stochastic: taking an action does not mean that you end up where you want to be with 100% certainty. Strictly speaking, you must consider the probabilities of ending up in other states after taking the action. In this particular case, after taking the action, you can end up in two different next states s':

Figure 9. Visualization of the relationship between v (s) and q (s, a)

To obtain the action value, you must take the discounted state values, weighted by the probabilities Pss' of ending up in each possible next state (only two in this case), and add the immediate reward:

Equation 17 Relationship between q (s, a) and v (s)
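
Reconstructed, Equation 17 is:

    q_\pi(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_\pi(s')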

Now that we know the relationship between these functions, we can insert v(s) from Equation 16 into q(s, a) from Equation 17. We obtain Equation 18, where it can be noticed that there is a recursive relationship between the current q(s, a) and the next action value q(s', a').

Equation 18 Recursive property of action value function
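
Combining the two previous relations yields, in standard notation, the recursion of Equation 18:

    q_\pi(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \sum_{a'} \pi(a' | s') \, q_\pi(s', a')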

This recursive relationship can again be visualized in a binary tree (Figure 10). We start in q(s, a), end up in the next state s' with a certain probability Pss', from there we take an action a' with probability π, and we end up in the action value q(s', a'). To obtain q(s, a), we must go up the binary tree and integrate over all probabilities, as shown in Equation 18.

Figure 10. Visualization of recursive behavior of q (s, a)

3.5 Optimal Policy

The most important topic in deep reinforcement learning is finding the optimal action-value function q*. Finding q* means that the agent knows exactly the quality of an action in any given state and can therefore decide which action must be taken. Let us define what q* means: the optimal action-value function is the function that follows the policy which maximizes the action values:

Equation 19: Definition of the optimal action-value function

To find the best possible policy, we must maximize over q(s, a). Maximizing means that out of all possible actions we select only the action a for which q(s, a) has the highest value. This yields the following definition for the optimal policy π*:

Equation 20: Definition of the optimal policy
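
In standard notation, the two definitions read:

    q_*(s, a) = \max_\pi q_\pi(s, a)                                      (Equation 19)
    \pi_*(a | s) = 1 if a = \arg\max_{a'} q_*(s, a'), and 0 otherwise     (Equation 20)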

3.6 Bellman Optimality Equation

The condition for the optimal policy can be inserted into Equation 18, which gives us the Bellman optimality equation:

Equation 21 Bellman Optimality Equation
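
Reconstructed, the Bellman optimality equation for the action-value function is:

    q_*(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \max_{a'} q_*(s', a')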

If an AI agent can solve this equation, then it has basically solved the problem of the given environment: in any given state or situation, the agent knows the quality of every possible action with regard to the goal and can behave accordingly.
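
For a small MDP whose transition probabilities and rewards are fully known, this equation can be solved by simple fixed-point iteration over a Q-table. The sketch below, with assumed array shapes, is only meant to make the backup concrete; the upcoming articles will instead solve the equation with deep Q-learning, where the dynamics are not known.

    import numpy as np

    def q_value_iteration(P, R, gamma=0.9, iters=500):
        """Tabular Q-value iteration.
        P: transition tensor of shape (n_actions, n_states, n_states)
        R: expected immediate rewards of shape (n_states, n_actions)"""
        n_states, n_actions = R.shape
        Q = np.zeros((n_states, n_actions))
        for _ in range(iters):
            V = Q.max(axis=1)                              # best achievable value in each successor state
            Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Bellman optimality backup
        return Q

    # The greedy (optimal) policy then simply takes argmax_a Q(s, a) in every state.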

Solving the Bellman optimality equation will be the topic of the upcoming articles. In the next article, I will introduce the first technique for solving this equation: deep Q-learning.
