This article explains the principle of the DQN algorithm and walks through a Python implementation. The method introduced here is simple, fast and practical, so interested readers may wish to follow along and try it out.
1. Brief introduction to the DQN algorithm
The Q-learning algorithm uses a Q-table to record the value of each action in each state. When the state space or action space is large, the required storage space grows accordingly, and if the state space or action space is continuous, the algorithm cannot be used at all. Therefore, Q-learning can only solve problems with discrete, low-dimensional state and action spaces. The core idea of the DQN algorithm is to use an artificial neural network in place of the Q-table to approximate the action-value function. The input of the network is the state and the output is the value of each action, so DQN can handle continuous state spaces with discrete action spaces, but it still cannot handle continuous action spaces. Algorithms for continuous action spaces will be introduced in a later post.
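To make the contrast concrete, here is a minimal sketch (not taken from the article's code; the sizes n_states, n_actions, state_dim and the sample state are placeholder values) showing how a tabular lookup is replaced by a network forward pass:

import numpy as np
import torch
import torch.nn as nn

n_states, n_actions, state_dim = 10, 4, 8   # placeholder sizes for illustration

# Tabular Q-learning: one stored value per (state, action) pair.
q_table = np.zeros((n_states, n_actions))
value_of_action_2_in_state_3 = q_table[3, 2]

# DQN: a network maps a (possibly continuous) state vector to one value per action.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
state = torch.rand(state_dim)               # a continuous state that a table could not index
action_values = q_net(state)                # one estimated value per discrete action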
2. Principle of the DQN algorithm
The DQN algorithm is an off-policy algorithm. When off-policy learning, bootstrapping and function approximation are combined, convergence cannot be guaranteed, and problems such as unstable or difficult training arise easily. To address these problems, researchers made improvements in two main areas.
(1) Experience replay: each experience (current state st, action at, immediate reward rt+1, next state st+1, episode-termination flag done) is stored in a replay buffer (experience pool) and sampled from it according to certain rules.
(2) Target network: the way the network is updated is modified so that, for example, newly learned weights are not immediately used to compute the bootstrapped targets.
2.1 Experience replay
Experience replay is a technique that stabilizes the distribution of the training data and thereby improves training stability. It has two key steps: storage and replay.
Storage: experiences are stored in the replay buffer in the form (st, at, rt+1, st+1, done).
Replay: one or more experiences are sampled from the buffer according to certain rules.
From the storage perspective, experience replay can be divided into centralized replay and distributed replay:
Centralized replay: a single agent runs in one environment and stores its experiences in the replay buffer.
Distributed replay: multiple agents run in multiple environments at the same time and store their experiences in a shared replay buffer. Because several agents generate experiences simultaneously, experience is collected faster, at the cost of using more resources.
From the sampling perspective, experience replay can be divided into uniform replay and prioritized replay:
Uniform replay: experiences are sampled from the buffer with equal probability.
Prioritized replay: each experience in the buffer is assigned a priority, and experiences with higher priority are preferred during sampling. As a common choice, if the priority of experience i is p_i, the probability of selecting it is P(i) = p_i / Σ_k p_k (a small numerical sketch follows below).
For the details of prioritized replay, see the paper Prioritized Experience Replay.
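As a rough illustration of priority-proportional sampling (a minimal numpy sketch with made-up priorities; it is not part of the article's replay buffer, which uses uniform sampling):

import numpy as np

priorities = np.array([0.1, 0.5, 1.0, 2.0])      # made-up priorities for four stored experiences
probs = priorities / priorities.sum()            # P(i) = p_i / sum_k p_k
batch = np.random.choice(len(priorities), size=2, p=probs)  # higher-priority experiences are sampled more often
print(probs, batch)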
Advantages of experience replay:
1. When training the Q network, replay breaks the correlation between consecutive samples and makes the data closer to independent and identically distributed, which reduces the variance of parameter updates and improves convergence speed.
2. Experiences can be reused, giving high data efficiency, which is especially useful when data is hard to obtain.
Disadvantages of experience replay:
It cannot be applied to episodic (Monte Carlo) updates or to multi-step learning algorithms. Applying experience replay to Q-learning, however, avoids this drawback.
The code uses centralized uniform replay, as follows:
import numpy as np


class ReplayBuffer:
    def __init__(self, state_dim, action_dim, max_size, batch_size):
        self.mem_size = max_size
        self.batch_size = batch_size
        self.mem_cnt = 0

        self.state_memory = np.zeros((self.mem_size, state_dim))
        self.action_memory = np.zeros((self.mem_size,), dtype=np.int64)   # discrete action indices
        self.reward_memory = np.zeros((self.mem_size,))
        self.next_state_memory = np.zeros((self.mem_size, state_dim))
        self.terminal_memory = np.zeros((self.mem_size,), dtype=np.bool_)  # np.bool_ replaces the removed np.bool alias

    def store_transition(self, state, action, reward, state_, done):
        # Overwrite the oldest transition once the buffer is full (ring buffer).
        mem_idx = self.mem_cnt % self.mem_size

        self.state_memory[mem_idx] = state
        self.action_memory[mem_idx] = action
        self.reward_memory[mem_idx] = reward
        self.next_state_memory[mem_idx] = state_
        self.terminal_memory[mem_idx] = done

        self.mem_cnt += 1

    def sample_buffer(self):
        # Sample uniformly from the part of the buffer that has been filled.
        mem_len = min(self.mem_size, self.mem_cnt)
        batch = np.random.choice(mem_len, self.batch_size, replace=True)

        states = self.state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        states_ = self.next_state_memory[batch]
        terminals = self.terminal_memory[batch]

        return states, actions, rewards, states_, terminals

    def ready(self):
        return self.mem_cnt > self.batch_size
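A quick usage sketch of this buffer (the transition values here are random placeholders, just to show the call pattern):

import numpy as np
from buffer import ReplayBuffer   # the class above, saved as buffer.py

buffer = ReplayBuffer(state_dim=8, action_dim=4, max_size=10000, batch_size=32)

for _ in range(64):
    s = np.random.rand(8)
    buffer.store_transition(state=s, action=1, reward=0.5, state_=np.random.rand(8), done=False)

if buffer.ready():
    states, actions, rewards, next_states, dones = buffer.sample_buffer()
    print(states.shape)   # (32, 8)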
2.2 Target network
In bootstrapping-based Q-learning, the action-value estimate depends on the network weights; when the weights change, the action-value estimates change as well. During learning, the action values are therefore chasing a moving target, which easily leads to instability.
The idea of the target network is to build a second network with the same structure as the original one. The original network is called the evaluation network, and the newly built one is called the target network. During learning, the target network is used to compute the bootstrapped return that serves as the learning target. During updates, only the weights of the evaluation network are updated, not those of the target network. In this way, the target that the weights are updated towards does not change within each iteration but stays fixed. After a certain number of updates, the weights of the evaluation network are copied to the target network, and the next batch of updates is carried out, so the target network is also updated over time. Because the return estimate stays fixed during the period in which the target network does not change, introducing the target network increases the stability of learning.
How to update the target network:
Hard update: the target network is kept fixed for a period of time, and after a certain number of updates the evaluation network weights are copied directly to the target network, that is,
w_target ← w_eval
where w_target denotes the target network weights and w_eval denotes the evaluation network weights.
Another commonly used method is the soft update, which introduces a learning rate τ and assigns a weighted average of the old target network parameters and the new evaluation network parameters to the target network:
w_target ← (1 − τ) · w_target + τ · w_eval
where the learning rate τ ∈ (0, 1).
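A minimal PyTorch sketch of the two update rules (a standalone illustration with two small linear layers, not the article's update_network_parameters method shown later):

import torch
import torch.nn as nn

eval_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)
tau = 0.005

# Hard update: copy the evaluation weights into the target network.
target_net.load_state_dict(eval_net.state_dict())

# Soft update: move the target weights a small step towards the evaluation weights.
with torch.no_grad():
    for t_param, e_param in zip(target_net.parameters(), eval_net.parameters()):
        t_param.data.copy_((1.0 - tau) * t_param.data + tau * e_param.data)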
3. Pseudocode of the DQN algorithm
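In outline (a condensed summary rather than formal pseudocode), one episode of DQN training proceeds as follows:
1. Observe state s and choose action a with an ε-greedy policy based on the evaluation network.
2. Execute a, observe reward r, next state s' and the done flag; store (s, a, r, s', done) in the replay buffer.
3. Sample a mini-batch from the buffer and compute targets y = r + γ · max_a' Q_target(s', a'), with y = r for terminal transitions.
4. Update the evaluation network by gradient descent on the mean squared error between Q_eval(s, a) and y.
5. Periodically (hard update) or gradually (soft update) copy the evaluation network weights to the target network.
6. Decay ε and repeat until the episode ends.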
The implementation code of the DQN algorithm is as follows:
import torch as T
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
from buffer import ReplayBuffer

device = T.device("cuda:0" if T.cuda.is_available() else "cpu")


class DeepQNetwork(nn.Module):
    def __init__(self, alpha, state_dim, action_dim, fc1_dim, fc2_dim):
        super(DeepQNetwork, self).__init__()

        self.fc1 = nn.Linear(state_dim, fc1_dim)
        self.fc2 = nn.Linear(fc1_dim, fc2_dim)
        self.q = nn.Linear(fc2_dim, action_dim)

        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        self.to(device)

    def forward(self, state):
        x = T.relu(self.fc1(state))
        x = T.relu(self.fc2(x))
        q = self.q(x)

        return q

    def save_checkpoint(self, checkpoint_file):
        T.save(self.state_dict(), checkpoint_file, _use_new_zipfile_serialization=False)

    def load_checkpoint(self, checkpoint_file):
        self.load_state_dict(T.load(checkpoint_file))


class DQN:
    def __init__(self, alpha, state_dim, action_dim, fc1_dim, fc2_dim, ckpt_dir,
                 gamma=0.99, tau=0.005, epsilon=1.0, eps_end=0.01, eps_dec=5e-4,
                 max_size=1000000, batch_size=256):
        self.tau = tau
        self.gamma = gamma
        self.epsilon = epsilon
        self.eps_min = eps_end
        self.eps_dec = eps_dec
        self.batch_size = batch_size
        self.action_space = [i for i in range(action_dim)]
        self.checkpoint_dir = ckpt_dir

        self.q_eval = DeepQNetwork(alpha=alpha, state_dim=state_dim, action_dim=action_dim,
                                   fc1_dim=fc1_dim, fc2_dim=fc2_dim)
        self.q_target = DeepQNetwork(alpha=alpha, state_dim=state_dim, action_dim=action_dim,
                                     fc1_dim=fc1_dim, fc2_dim=fc2_dim)

        self.memory = ReplayBuffer(state_dim=state_dim, action_dim=action_dim,
                                   max_size=max_size, batch_size=batch_size)

        # Initialize the target network as an exact copy of the evaluation network.
        self.update_network_parameters(tau=1.0)

    def update_network_parameters(self, tau=None):
        if tau is None:
            tau = self.tau

        # Soft update: move the target parameters towards the evaluation parameters.
        for q_target_params, q_eval_params in zip(self.q_target.parameters(), self.q_eval.parameters()):
            q_target_params.data.copy_(tau * q_eval_params + (1 - tau) * q_target_params)

    def remember(self, state, action, reward, state_, done):
        self.memory.store_transition(state, action, reward, state_, done)

    def choose_action(self, observation, isTrain=True):
        state = T.tensor([observation], dtype=T.float).to(device)
        actions = self.q_eval.forward(state)
        action = T.argmax(actions).item()

        # Epsilon-greedy exploration during training.
        if (np.random.random() < self.epsilon) and isTrain:
            action = np.random.choice(self.action_space)

        return action

    def learn(self):
        if not self.memory.ready():
            return

        states, actions, rewards, next_states, terminals = self.memory.sample_buffer()
        batch_idx = np.arange(self.batch_size)

        states_tensor = T.tensor(states, dtype=T.float).to(device)
        rewards_tensor = T.tensor(rewards, dtype=T.float).to(device)
        next_states_tensor = T.tensor(next_states, dtype=T.float).to(device)
        terminals_tensor = T.tensor(terminals).to(device)

        with T.no_grad():
            q_ = self.q_target.forward(next_states_tensor)
            q_[terminals_tensor] = 0.0
            target = rewards_tensor + self.gamma * T.max(q_, dim=-1)[0]
        q = self.q_eval.forward(states_tensor)[batch_idx, actions]

        loss = F.mse_loss(q, target.detach())
        self.q_eval.optimizer.zero_grad()
        loss.backward()
        self.q_eval.optimizer.step()

        self.update_network_parameters()
        self.epsilon = self.epsilon - self.eps_dec if self.epsilon > self.eps_min else self.eps_min

    def save_models(self, episode):
        self.q_eval.save_checkpoint(self.checkpoint_dir + 'Q_eval/DQN_Q_eval_{}.pth'.format(episode))
        print('Saving Q_eval network successfully!')
        self.q_target.save_checkpoint(self.checkpoint_dir + 'Q_target/DQN_Q_target_{}.pth'.format(episode))
        print('Saving Q_target network successfully!')

    def load_models(self, episode):
        self.q_eval.load_checkpoint(self.checkpoint_dir + 'Q_eval/DQN_Q_eval_{}.pth'.format(episode))
        print('Loading Q_eval network successfully!')
        self.q_target.load_checkpoint(self.checkpoint_dir + 'Q_target/DQN_Q_target_{}.pth'.format(episode))
        print('Loading Q_target network successfully!')
The simulation environment for the algorithm is the LunarLander-v2 environment from the gym library, so gym needs to be configured first. Enter the corresponding Python environment in Anaconda and run the following command:
pip install gym
However, the gym library installed this way contains only a small number of built-in environments, such as the algorithmic environments, simple text games and classic control environments, and LunarLander-v2 is not available.
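LunarLander-v2 belongs to gym's Box2D environments, which usually have to be installed separately. One common way to do this (assuming a gym release that provides the box2d extra) is:
pip install gym[box2d]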
The training script is as follows:
import gym
import numpy as np
import argparse

from DQN import DQN
from utils import plot_learning_curve, create_directory

parser = argparse.ArgumentParser()
parser.add_argument('--max_episodes', type=int, default=500)
parser.add_argument('--ckpt_dir', type=str, default='./checkpoints/DQN/')
parser.add_argument('--reward_path', type=str, default='./output_images/avg_reward.png')
parser.add_argument('--epsilon_path', type=str, default='./output_images/epsilon.png')
args = parser.parse_args()


def main():
    env = gym.make('LunarLander-v2')
    agent = DQN(alpha=0.0003, state_dim=env.observation_space.shape[0], action_dim=env.action_space.n,
                fc1_dim=256, fc2_dim=256, ckpt_dir=args.ckpt_dir, gamma=0.99, tau=0.005, epsilon=1.0,
                eps_end=0.05, eps_dec=5e-4, max_size=1000000, batch_size=256)
    create_directory(args.ckpt_dir, sub_dirs=['Q_eval', 'Q_target'])
    total_rewards, avg_rewards, eps_history = [], [], []

    for episode in range(args.max_episodes):
        total_reward = 0
        done = False
        observation = env.reset()
        while not done:
            action = agent.choose_action(observation, isTrain=True)
            observation_, reward, done, info = env.step(action)
            agent.remember(observation, action, reward, observation_, done)
            agent.learn()
            total_reward += reward
            observation = observation_

        total_rewards.append(total_reward)
        avg_reward = np.mean(total_rewards[-100:])
        avg_rewards.append(avg_reward)
        eps_history.append(agent.epsilon)
        print('EP:{} reward:{} avg_reward:{} epsilon:{}'.
              format(episode + 1, total_reward, avg_reward, agent.epsilon))

        if (episode + 1) % 50 == 0:
            agent.save_models(episode + 1)

    episodes = [i for i in range(args.max_episodes)]
    plot_learning_curve(episodes, avg_rewards, 'Reward', 'reward', args.reward_path)
    plot_learning_curve(episodes, eps_history, 'Epsilon', 'epsilon', args.epsilon_path)


if __name__ == '__main__':
    main()
A plotting function and a directory-creation function are also used during training. I put them in a separate utils.py script with the following code:
import os
import matplotlib.pyplot as plt


def plot_learning_curve(episodes, records, title, ylabel, figure_file):
    plt.figure()
    plt.plot(episodes, records, linestyle='-', color='r')
    plt.title(title)
    plt.xlabel('episode')
    plt.ylabel(ylabel)

    plt.savefig(figure_file)  # save before show so the figure is not cleared by the GUI backend
    plt.show()


def create_directory(path: str, sub_dirs: list):
    for sub_dir in sub_dirs:
        if os.path.exists(path + sub_dir):
            print(path + sub_dir + ' already exists!')
        else:
            os.makedirs(path + sub_dir, exist_ok=True)
            print(path + sub_dir + ' created successfully!')
The simulation results are shown in the following figure:
From the average reward curve, we can see that the algorithm starts to converge after roughly 400 training episodes.
At this point, I believe you have a deeper understanding of the principle of the DQN algorithm. Why not try it out in practice?