2025-04-06 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
Recently, OpenAI developed OpenAI Five, a team of five neural networks that can already work together to beat amateur teams at Dota 2.
OpenAI said that although today's games are still played with some restrictions, and with a limited pool of heroes, their goal is to beat a team of top professionals at an international competition in August.
Dota 2 is one of the most popular and complex e-sports games in the world.
OpenAI Five plays 180 years' worth of games against itself every day, learning through self-play. It trains using a scaled-up version of Proximal Policy Optimization running on 256 GPUs and 128,000 CPU cores, a larger version of the system they built last year for the simpler, single-hero 1v1 version of the game. Using a separate LSTM for each hero and no human data, it learns recognizable strategies.
The game between OpenAI Five and a team of OpenAI employees was commentated by professional caster Blitz and OpenAI Dota team member Christy Dennison, with many community members watching as well.
Problem
A milestone for artificial intelligence is to surpass human capabilities in a complex video game like StarCraft or Dota. In contrast to previous AI milestones such as chess or Go, complex video games begin to capture the messy, continuous nature of the real world. A system that can solve complex video games would therefore be highly general, with applications well beyond games.
Dota 2 is a real-time strategy game played between two teams of five players, each player controlling a character called a "hero". An AI that plays Dota must master the following:
Long time horizons. Dota games run at 30 frames per second for an average of 45 minutes, about 80,000 ticks per game. Most individual actions (such as ordering a hero to move to a location) have only minor impact on their own, but some individual actions can matter strategically, and some strategies can shape the entire game. OpenAI Five observes every fourth frame, yielding 20,000 moves per game. Chess, by comparison, usually ends before 40 moves and Go before 150, and almost every move is strategic.

Partially-observed state. Parts of the map are hidden in the fog of war, so to handle unseen enemies and their strategies, the AI must make inferences from incomplete data and model what its opponent might be up to. Chess and Go, by contrast, are full-information games, which makes them comparatively easy in this respect.

High-dimensional, continuous action space. In Dota, each hero can take dozens of actions, many of which target either another unit or a position on the ground. We discretize the space into 170,000 possible actions per hero (not all valid on every tick, such as spells on cooldown). The average number of possible moves in chess is 35; in Go, 250.

High-dimensional, continuous observation space. Dota is played on a large map containing ten heroes, dozens of buildings, dozens of NPC units, and game features such as runes, trees, and wards, so the state of a Dota game is complex. The model observes it via Valve's Bot API as 20,000 (mostly floating-point) numbers representing all the information a human is allowed to access. A chess board is naturally represented as about 70 enumerated values; a Go board as about 400.
The Dota rules are also very complex. The game has been actively developed for more than a decade, with game logic implemented in hundreds of thousands of lines of code. This logic takes milliseconds per tick to execute, versus nanoseconds for chess or Go engines. The game is also updated about every two weeks, constantly changing the semantics of the environment.
Method
OpenAI Five learns using a massively-scaled version of Proximal Policy Optimization (PPO). Both OpenAI Five and the earlier 1v1 bot learn entirely from self-play. They start with random parameters and do not use any human data.
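The core of PPO, which Rapid scales up here, is a clipped surrogate objective that keeps each policy update close to the data-collecting policy. Below is a minimal numpy sketch of that objective only, not OpenAI's training code; the 0.2 clip range is the PPO paper's default.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective from the PPO paper.

    ratio:     pi_new(a|s) / pi_old(a|s) for each sampled action
    advantage: advantage estimate for each sampled action
    eps:       clip range (0.2 is the paper's default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Take the pessimistic (elementwise minimum) bound, then average.
    return np.mean(np.minimum(unclipped, clipped))

# An action whose probability grew 4x with positive advantage only
# contributes the clipped ratio (1 + eps = 1.2), capping the update.
ratios = np.array([4.0, 1.0])
advs = np.array([1.0, 1.0])
print(ppo_clip_objective(ratios, advs))  # (1.2 + 1.0) / 2 = 1.1
```

The clipping removes the incentive to move the policy more than `eps` away from the old policy in a single update, which is what makes large-batch, distributed training stable.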
Reinforcement learning researchers (including ourselves) have generally believed that long time horizons would require fundamentally new advances, such as hierarchical reinforcement learning. Our results suggest that we have not been giving today's algorithms enough credit, at least when they are run at sufficient scale and with a reasonable way of exploring.
Agents are trained to maximize the exponentially decayed sum of future rewards, weighted by a discount factor called γ. In the latest training run of OpenAI Five, γ was annealed from 0.998 (valuing future rewards with a half-life of 46 seconds) to 0.9997 (a half-life of five minutes). For comparison, the longest horizon in the PPO paper was a half-life of 0.5 seconds, the longest in the Rainbow paper was 4.4 seconds, and the Observe and Look Further paper used 46 seconds.
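The quoted half-lives follow directly from γ: solving γⁿ = 0.5 gives the number of agent steps until a future reward counts for half, and converting steps to game time uses the post's action rate of 7.5 actions per second (30 fps, acting every fourth frame). A quick check:

```python
import math

def halflife_seconds(gamma, actions_per_second=7.5):
    """Game time until a future reward is discounted to half its value.

    Solve gamma**n = 0.5 for n (agent steps), then convert to seconds.
    7.5 actions/second assumes 30 fps with an action every fourth frame.
    """
    steps = math.log(0.5) / math.log(gamma)
    return steps / actions_per_second

print(round(halflife_seconds(0.998)))            # ~46 seconds
print(round(halflife_seconds(0.9997) / 60, 1))   # ~5 minutes
```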
Although the current version of OpenAI Five is weak at last-hitting (watching our test games, professional Dota commentator Blitz estimated it was around the median for Dota players), its objective prioritization matches a common professional strategy. Gaining long-term rewards such as strategic map control often requires sacrificing short-term rewards, since grouping up to attack towers takes time. This observation reinforces OpenAI's belief that the system is truly optimizing over a long horizon.
Model architecture
Each of OpenAI Five's networks contains a single-layer, 1024-unit LSTM that sees the current game state (extracted from Valve's Bot API) and emits actions through several possible action heads. Each head has semantic meaning, for example the number of ticks to delay this action, which action to select, the X or Y coordinate of this action in a grid around the unit, and so on.
OpenAI Five sees the world as a list of 20,000 numbers and takes an action by emitting a list of eight enumerated values. (An interactive demo in the original post shows how OpenAI Five encodes each action and how it observes the world.)
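The shape of this interface, a flat numeric observation going through an LSTM into several discrete action heads, can be sketched as a toy model. Sizes are shrunk so it runs instantly and the parameters are random; this illustrates the interface only, not OpenAI's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the real model consumes a 20,000-number observation
# with a 1024-unit LSTM. Kept tiny here on purpose.
OBS_DIM, HIDDEN, N_HEADS, HEAD_SIZE = 200, 64, 8, 16

def lstm_step(x, h, c, W, U, b):
    """One step of a standard single-layer LSTM cell (gates stacked i|f|o|g)."""
    z = x @ W + h @ U + b
    i, f, o, g = np.split(z, 4)
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return h, c

# Random, untrained parameters; the real networks are trained with PPO.
W = rng.normal(0, 0.1, (OBS_DIM, 4 * HIDDEN))
U = rng.normal(0, 0.1, (HIDDEN, 4 * HIDDEN))
b = np.zeros(4 * HIDDEN)
heads = [rng.normal(0, 0.1, (HIDDEN, HEAD_SIZE)) for _ in range(N_HEADS)]

obs = rng.normal(size=OBS_DIM)        # stand-in for the 20,000-number state
h = c = np.zeros(HIDDEN)
h, c = lstm_step(obs, h, c, W, U, b)
# One enumerated value per head, analogous to the eight-value action list.
action = [int(np.argmax(h @ Wh)) for Wh in heads]
print(action)
```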
OpenAI Five can cope with missing pieces of state. For example, until recently OpenAI Five's observations did not include shrapnel zones (areas where projectiles rain down on enemies), only the areas a human would see on screen. Nevertheless, we observed OpenAI Five learn to walk out of (though not avoid entering) active shrapnel zones, since it could see its health declining.
Exploration
Even given a learning algorithm that can handle long horizons, we still have to explore the environment. Even with our restrictions, there are hundreds of items, dozens of buildings, spells, and unit types, and a great number of game mechanics to learn about, many of which yield powerful combinations. Exploring this combinatorially vast space efficiently is not easy.
OpenAI Five learns from self-play (starting with random weights), which provides a natural curriculum for exploring the environment. To avoid "strategy collapse", the agent trains 80% of its games against itself and the other 20% against its past selves. In the first games, the heroes wander aimlessly around the map. After several hours of training, concepts such as farming and fighting over mid emerge. After a few days, they consistently adopt basic human strategies: attempting to steal bounty runes from their opponents, walking to their tier-one towers to farm, and rotating heroes around the map to gain lane advantage. With further training, they become proficient at higher-level strategies.
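The 80/20 opponent mix described above amounts to a simple sampling rule over a growing pool of past checkpoints. A minimal sketch (function and variable names here are hypothetical, not OpenAI's code):

```python
import random

def pick_opponent(current, past_checkpoints, p_self=0.8):
    """Sample a training opponent: 80% of games against the current self,
    20% against a randomly chosen past version (to avoid strategy collapse)."""
    if not past_checkpoints or random.random() < p_self:
        return current
    return random.choice(past_checkpoints)

pool = []                       # grows as training checkpoints are saved
for step in range(3):
    opponent = pick_opponent("current", pool)
    pool.append(f"checkpoint-{step}")
```

Playing a slice of games against past selves keeps the agent robust to strategies it has already moved beyond, instead of overfitting to its newest self.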
In March 2017, OpenAI's first agent defeated bots but was helpless against humans. To force exploration in strategy space, during training (and only during training) we randomized the properties of units (health, speed, starting level, and so on) and had it play against humans. Later, when a tester was consistently beating the 1v1 bot, we added the randomizations to training and the tester started losing. (Our robotics team applies similar randomization techniques to physical robots to transfer from simulation to the real world.)
OpenAI Five uses the randomizations written for the 1v1 bot. It also uses a new "lane assignment" randomization: at the beginning of each training game, each hero is randomly assigned to a subset of lanes and penalized for straying from those lanes.
Exploration is also helped by a good reward. The reward consists mostly of metrics humans track to judge how they are doing in the game: net worth, kills, deaths, assists, last hits, and so on. We post-process each agent's reward by subtracting the other team's average reward, to prevent the agents from finding positive-sum situations.
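That post-processing step can be sketched as follows (a minimal illustration, not OpenAI's code). Subtracting the opposing team's mean makes the adjusted rewards zero-sum across the ten heroes, so both teams cannot profit from the same situation.

```python
import numpy as np

def zero_sum_rewards(team_a, team_b):
    """Subtract the opposing team's mean reward from each agent's reward,
    removing any 'positive-sum' situations where both teams profit."""
    team_a = np.asarray(team_a, dtype=float)
    team_b = np.asarray(team_b, dtype=float)
    return team_a - team_b.mean(), team_b - team_a.mean()

# Five heroes per side: one team earned rewards, the other earned nothing.
a, b = zero_sum_rewards([1, 2, 3, 4, 5], [0, 0, 0, 0, 0])
print(a)   # unchanged: the opponents earned nothing
print(b)   # each opponent is penalized by our mean reward (3.0)
```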
Coordination
Teamwork is governed by a hyperparameter called "team spirit". Team spirit ranges from 0 to 1, weighting how much each of OpenAI Five's heroes should care about its own individual reward function versus the average of the team's reward functions. We anneal its value from 0 to 1 over training.
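One natural reading of this weighting is a convex combination of each hero's own reward and the team average. The exact formula below is an assumption for illustration, not OpenAI's published definition:

```python
import numpy as np

def team_spirit_rewards(individual, tau):
    """Blend each hero's individual reward with the team mean.

    tau (team spirit) ranges from 0 (purely selfish) to 1 (purely team
    average); the post describes annealing it from 0 to 1 over training.
    Assumed formulation: (1 - tau) * own + tau * mean(team).
    """
    r = np.asarray(individual, dtype=float)
    return (1.0 - tau) * r + tau * r.mean()

solo_kill = [10.0, 0.0, 0.0, 0.0, 0.0]      # one hero earns a kill reward
print(team_spirit_rewards(solo_kill, 0.0))   # [10.  0.  0.  0.  0.]
print(team_spirit_rewards(solo_kill, 1.0))   # [2. 2. 2. 2. 2.]
```

At tau = 0 each hero is credited only for its own contribution (useful early, when individual skills are being learned); at tau = 1 the whole team shares every reward, encouraging coordinated play.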
Rapid
The system is implemented as a general-purpose RL training system called Rapid, which can be applied to any environment. We have used Rapid to solve other problems at OpenAI, including competitive self-play.
The training system is split into rollout workers, which run copies of the game with an agent gathering experience, and optimizer nodes, which perform synchronous gradient descent across a fleet of GPUs. The rollout workers sync their experience to the optimizers through Redis. Each experiment also contains workers evaluating the trained agent against reference agents, as well as monitoring software such as TensorBoard, Sentry, and Grafana.
During synchronous gradient descent, each GPU computes a gradient on its part of the batch, and the gradients are then globally averaged. We originally used MPI's allreduce for the averaging, but now use NCCL2 to parallelize GPU computation and network data transfer.
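The synchronous update can be sketched as follows, with a plain mean standing in for the allreduce collective that MPI or NCCL2 would perform across GPUs. This is a single-process simulation, not distributed code:

```python
import numpy as np

def synchronous_sgd_step(params, per_gpu_grads, lr=0.01):
    """Simulate synchronous data-parallel SGD: each 'GPU' computes a gradient
    on its shard of the batch, the gradients are globally averaged (the job of
    an allreduce collective), and the identical update is applied everywhere."""
    avg_grad = np.mean(per_gpu_grads, axis=0)   # stands in for allreduce
    return params - lr * avg_grad

params = np.zeros(3)
shard_grads = [np.array([1.0, 2.0, 3.0]),       # gradient from GPU 0's shard
               np.array([3.0, 2.0, 1.0])]       # gradient from GPU 1's shard
print(synchronous_sgd_step(params, shard_grads))  # [-0.02 -0.02 -0.02]
```

Because every replica applies the same averaged gradient, all copies of the parameters stay bit-for-bit identical without any extra synchronization.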
The latency for synchronizing 58MB of data (the size of OpenAI Five's parameters) across various numbers of GPUs is low enough to be largely masked by the GPU computation running in parallel with it.
OpenAI has implemented Kubernetes, Azure, and GCP backends for Rapid.
Games
So far, OpenAI Five has played (under various restrictions) against each of these teams:

Best OpenAI employee team: 2.5k MMR (46th percentile)

Best audience players watching the OpenAI employee matches (including Blitz, who commentated the first OpenAI employee match): 4-6k MMR (90th-99th percentile), although they had never played as a team

Company employee team: 2.5k-4k MMR (46th-90th percentile)

Amateur team: 4.2k MMR (93rd percentile), which trains as a team

Semi-professional team: 5.5k MMR (99th percentile), which trains as a team
The April 23rd version of OpenAI Five was the first to beat the scripted baseline. The May 15th version was evenly matched with the first team, winning one game and losing another. The June 6th version decisively won all of its games. Informal matches were set up with teams 4 and 5, and although we expected to lose soundly, OpenAI Five won two of the first three games.
We have observed that OpenAI Five:
Repeatedly sacrificed its own safe lane (dire top lane; radiant bottom lane) in exchange for controlling the enemy's safe lane, forcing the fight over to the side the opponent finds harder to defend. This strategy emerged in the professional scene over the past few years and is now considered a prevailing tactic.
Pushed the transition from early game to mid game faster than its opponents. It did this by: (1) setting up successful ganks (when players move around the map to ambush an enemy hero; see animation) while players overextended in their lanes; and (2) grouping up to push towers before the opponents could organize.
Deviated from the current playstyle in a few areas, such as giving early experience and gold to support heroes (which usually do not get resource priority). OpenAI Five's prioritization lets its damage peak sooner, pressing its advantage to win team fights and capitalize on mistakes to secure a fast win.
Differences from humans
OpenAI Five is given access to the same information as humans, but it instantly sees data such as positions, health, and item inventories that humans have to check manually. Our method is not fundamentally tied to observing state, but just rendering pixels from the game would require thousands of GPUs.
OpenAI Five averages around 150-170 actions per minute (with a theoretical maximum of 450, since it observes every fourth frame). Frame-perfect timing, while possible for skilled players, is trivial for OpenAI Five. Its average reaction time is 80ms, faster than a human's.
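The 450-action ceiling follows directly from the observation rate:

```python
# The game simulates 30 frames per second and OpenAI Five observes
# (and can act) on every fourth frame, which caps its action rate.
FPS, FRAMES_PER_ACTION = 30, 4
max_actions_per_minute = FPS / FRAMES_PER_ACTION * 60
print(max_actions_per_minute)  # 450.0
```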
These differences matter most in 1v1 (where our bot's reaction time was 67ms), but the playing field is comparatively level, because we have seen humans learn from and adapt to the bot. In the months after last year's TI, dozens of professionals trained against our 1v1 bot. According to Blitz, the 1v1 bot has changed the way people think about 1v1 (it used a fast-paced playstyle that everyone has now adapted to).
Surprising findings
Binary rewards can give good performance. Our 1v1 model had a shaped reward, including rewards for last hits, kills, and so on. We ran an experiment rewarding the agent only for winning or losing, and it trained an order of magnitude slower, with some stagnation in the middle, in contrast to the smooth learning curves we usually see. (A figure in the original post compares the sparse- and dense-reward learning curves, showing the dense reward reaching the same level of performance faster.) The experiment ran on 4,500 cores and 16 k80 GPUs, training to the level of semi-pros (70 TrueSkill) rather than the 90 TrueSkill of our best 1v1 bot.
Creep blocking can be learned from scratch. For 1v1, we learned creep blocking using traditional RL with a reward for blocking. A team member left the 2v2 model training while on vacation, to see how much longer training would improve performance. To his surprise, the model had learned to creep block without any special guidance or reward.
Fixing bugs matters. A chart in the original post compares training runs of the code that beat amateur players before and after fixing several bugs, such as occasional crashes during training, or a bug that yielded a large negative reward for reaching level 25, showing how much bug fixes improve the learning speed.
© 2024 shulou.com SLNews company. All rights reserved.