Welcome to the Tungsten Fabric user case series, where we explore more application scenarios for TF. The protagonist of the "Revealing LOL" series is Tungsten Fabric user Riot Games. As the developer and operator of League of Legends (LOL), Riot Games faces the challenge of complex deployments on a global scale. Let's reveal the "heroes" behind LOL and see how Riot runs its online services.
Authors: Nicolas Tittley and Ala Shiban (source: Riot Games) | Translation: TF compilation group
This long-running series of articles explores and documents how Riot Games develops, deploys, and operates its back-end infrastructure. We are Nicolas Tittley and Ala Shiban, software architect and product manager on Riot's developer experience team. Our team helps Riot developers build, deploy, and operate games wherever players are located, while focusing on cloud-agnostic platforms that make games easier to release and operate. In an article two years ago, Maxfield Stewart introduced our developer ecosystem and many of the tools in use at the time. Here we update that picture: the new challenges, how we addressed them, and what we learned along the way.

A quick review
We strongly recommend going back to the previous article, but if you want to read this one directly, here is a stripped-down version to catch you up. Riot uses a combination of bare-metal and cloud infrastructure to run back-end systems around the world. These back ends are distributed across different geographical locations, and each runs a complete set of services that let players interact with League of Legends through quite different deployments. Like the back ends of most games, the LOL back end started as a monolith run by a dedicated operations team. Over time, Riot gradually embraced DevOps practices and microservice-based architectures. As detailed in the first article in this series, to help our developers serve players faster, Riot relied heavily on packaging services as Docker containers and began running them under a cluster scheduler. Our most recent article discussed many of the tools used to achieve this.

How did that work out?
Very well — though with pain as well as joy. At the time of the last article (editor's note: December 2017), we were operating more than 5,000 production containers. That number has not stopped growing: today we run more than 14,500 containers in the regions Riot operates itself. Riot developers love creating new things for players, and the easier it is for them to write, deploy, and operate services, the more exciting experiences they can create. In true DevOps fashion, development teams own and are responsible for their services. They create workflows to deploy, monitor, and operate those services, and when they can't find what they need, they simply invent (and reinvent) it themselves. This was a very liberating time for developers, who rarely hit problems they couldn't solve on their own. But slowly, we began to notice some worrying trends. Monthly QA and load-test environments were becoming more and more unstable. We had to spend more and more time hunting down misconfigurations or outdated dependencies. In isolation none of these incidents was critical, but together they drained time and energy from teams — time we would rather spend creating player value. Worse, similar difficulties began to appear in shards not operated by Riot, where a series of other problems surfaced as well. Partners had to coordinate with more and more developers and adopt more and more microservices, each built and operated differently from the others. Operators had to work harder than ever to stand up effective, stable shards. In these regions that Riot does not operate itself, the incident rate was much higher, driven directly by incompatibilities between live versions of microservices and similar integration issues.

The current state of DevOps at Riot
Before we discuss how we addressed packaging, deployment, and operations, let's take a moment to explore Riot's operating environment. None of these details is unique to Riot, but together they show how we are organized to deliver value to all players.

The developer model
Engineers at Riot like to build things themselves! To support this, we adopted a strong DevOps mindset. A team builds and owns its back-end service, keeps it supported, and is on the hook when the service does not perform as expected. Overall, Riot engineers are happy to iterate quickly and happy to be responsible for their live services. This is a very standard DevOps setup; Riot is not bucking the trend in any way.

The stateful shard model
For historical, scale, and legal reasons, the back ends of Riot products are organized into shards, with each production shard usually geographically close to its target audience. This has many benefits, including lower latency, better matchmaking, limited failure domains, and a clear off-peak window in which maintenance can be performed. We also run many developer and QA shards, both internally and externally — for example, the League of Legends Public Beta Environment (PBE).
The operations model
This is where things get more complicated. Although Riot is the developer, for compliance and local-expertise reasons we work with regional partners who operate some shards. In practice, this means Riot developers must package every component of a shard, deliver it to the partner's operators, and instruct them on how to deploy, configure, and operate all of it. Riot developers do not operate, access, or even see these shards themselves. (Editor's note: a "shard" in this article can be understood as a regional deployment or sub-region.)

Iterating toward a solution
Attempt 1: A new League deployment tool
In our first attempt to improve the situation, we took a whole new approach, trying to leverage open-source components with minimal Riot-specific customization to drive Riot's deployments and operations. Although this effort successfully deployed a complete League of Legends shard, the tool's design met the expectations of neither developers nor operators. Teams were unhappy with it: it proved too difficult for operators and too constraining for developers. So after that first shard deployment, we made the painful decision to decommission the tooling. That may sound radical, but because every team still maintained its own deployment system and had not yet fully transitioned, we were able to phase the new tools out quickly.

Attempt 2: More process
Since the first attempt was not as successful as hoped, we fell back on tradition and tried to meet our requirements by adding process. Extensive communication, firm release dates, documented procedures, change-management meetings and ceremonies, and ever-present spreadsheets made some headway, but it always felt wrong. Teams liked their DevOps freedom, and the sheer volume and velocity of changes made all this process burdensome. Although the situation for our partners improved, we never reached the operational level we wanted.

Attempt 3: Metadata
So we decided to try yet another approach. Where we had previously treated developers as the tooling's primary audience, we now started from how a deployment and operations system should work for partner operators. We carefully designed tooling that let developers attach standardized metadata — such as required configuration and scaling characteristics — to the microservices packaged in their Docker containers. This brought progress: operators could understand a service's required configuration and deployment characteristics in a more standardized way, and day-to-day operations relied less on developers. Failure rates, incident rates, and unplanned downtime at both local and partner-operated sites improved, but we still experienced frequent deployment and operational failures that could have been avoided.
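To make the idea concrete, here is a minimal sketch of the kind of standardized, machine-readable metadata this attempt attached to each packaged microservice. The field names, service name, and registry path are hypothetical illustrations, not Riot's actual schema.

```python
from dataclasses import dataclass

# Hypothetical per-service metadata descriptor: the developer declares
# what an operator must supply and how the service scales.
@dataclass
class ServiceMetadata:
    name: str
    docker_image: str
    required_config: list   # keys an operator must supply per shard
    optional_config: dict   # keys with safe defaults
    scaling: dict           # e.g. instance bounds and scaling metric

ranked = ServiceMetadata(
    name="ranked-ladder",
    docker_image="registry.example/ranked-ladder:2.3.1",
    required_config=["db_password", "region_id"],
    optional_config={"log_level": "info"},
    scaling={"min_instances": 2, "max_instances": 20, "metric": "cpu"},
)

def missing_config(meta: ServiceMetadata, supplied: dict) -> list:
    """Return required keys an operator has not yet provided."""
    return [k for k in meta.required_config if k not in supplied]

print(missing_config(ranked, {"db_password": "secret"}))  # ['region_id']
```

With metadata like this, an operator's tooling can validate a shard's configuration before deployment instead of discovering missing values at runtime.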
Attempt 4: Riot's application and environment model

We finally adopted a new approach, shifting the focus from individual services to the entire product. We created a high-level declarative specification and a set of tools that operate on it, keeping the specification and the tools distinct. Before going into more detail, let's look at what went wrong in the first three attempts.

Reflecting on our mistakes: deploy and operate the product, not the service
While embracing DevOps and microservices brought us many benefits, it created a dangerous feedback loop. A development team creates a microservice, deploys it, operates it, and is accountable for its performance. That means it optimizes logs, metrics, and processes for itself, often with little thought for whether the service can be understood by others — including people with no development background or even engineering skills. As developers created more and more microservices, operating the overall product became very difficult and led to more and more failures. On top of that, Riot's fluid team structure left the ownership of some microservices unclear, making it hard to figure out whom to contact during triage and resulting in many misattributed incidents. An ever-growing set of heterogeneous microservices, deployment processes, and organizational changes overwhelmed the operations teams in partner regions.

Figuring out the "why"
We examined failures in Riot-operated and non-Riot-operated regions and distilled the difference in failure frequency into a key observation: allowing a discontinuous flow of changes into a distributed system eventually leads to preventable incidents. Teams start to fail when they must coordinate across boundaries, because dependencies force releases to be bundled with multiple changes. A team either imposes a release cycle through manual process, coordinating releases with project-management rituals, or keeps shipping small ad-hoc releases, leaving everyone confused while hunting for compatible versions. Each approach has advantages and disadvantages, but both tend to collapse in a large organization. Imagine dozens of teams, each with different development practices, needing to continuously deliver hundreds of microservices that together represent a shared product, in a coordinated fashion. Worse still, partners found it very difficult to apply these processes, since their operators lacked the context of how the pieces fit together.

The new solution: Riot's application and environment model

Given that previous attempts failed to produce the desired results, we decided to eliminate piecemeal state manipulation by creating an opinionated, declarative specification that captures an entire distributed product: the environment. An environment contains all the declarative metadata required to fully specify, deploy, configure, run, and operate a set of distributed microservices that together represent a product, and it is complete and immutable. We chose the name "environment" because it was the last word Riot had not already overloaded. Naming really is hard.
With the release of Legends of Runeterra (LoR), we proved that we could describe an entire microservice game back end (including the game servers) and deploy, run, and operate it as a single product, both in Riot-operated data centers and with partners around the world. We also showed we could do this while preserving the benefits of the already-popular DevOps approach.

Opinionated on the "what"
The specification describes the hierarchy from individual services up to bundles of services, called environments.
Application specifications are bundled into an environment specification
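The bundling the figure describes can be sketched as follows. The application and environment shapes, names, and registry paths here are illustrative assumptions, not Riot's real format.

```python
# Hypothetical application specs: each wraps one or more Docker images.
game_history = {
    "kind": "application",
    "name": "game-history",
    "images": ["registry.example/game-history:1.4.0"],
}

matchmaking = {
    "kind": "application",
    "name": "matchmaking",
    "images": ["registry.example/matchmaking:3.0.2"],
}

# The environment spec bundles the applications and is versioned as one
# unit, so nothing can be forgotten when it is handed to a partner.
metagame_env = {
    "kind": "environment",
    "name": "metagame",
    "version": "2021.03.1",
    "applications": [game_history, matchmaking],
}

assert all(a["kind"] == "application" for a in metagame_env["applications"])
```

Because the environment fully enumerates its applications, an operator never has to reconstruct the product from a pile of independently versioned pieces.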
One advantage of a high-level declarative specification is that it is easy to operate on. One difficulty for partner operators was their inability to understand, adjust, and potentially automate the deployment of an entire game back end. The declarative nature of the specification means that changing most of it requires no scripting or programming expertise. Keeping the specification high-level also decouples the definition of a game back end from the underlying implementation. This allowed us to migrate from an internal orchestrator/scheduler called Admiral to a Mesos-based scheduler, and to consider migrating to Kubernetes, with minimal impact on the game studios. It also lets partner operators swap out infrastructure components when needed — for example, using a different metrics-aggregation system without changing the microservice tooling.

Immutability and versioning

We found that a shared language for referring to services and environments is critical to deploying and operating effectively in a fast-growing DevOps world. Versioning services and environments, along with their associated metadata, lets us ensure that the correct version is deployed everywhere. Partner operators know exactly which version they are running and can report it back to us. Applied to a whole environment, versioning also yields a well-known set of services that can be quality-checked and marked as "good". This bundling eliminates any chance of a lost dependency when communicating a new version to a partner. Making these versions immutable preserves this lingua franca: when the same version of a service is deployed in two different shards, we can now be sure they are exactly identical.
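One way to make the guarantee "same version means exactly the same thing" mechanically checkable is to content-address the immutable spec. This is a sketch under that assumption, not a description of Riot's actual tooling.

```python
import hashlib
import json

def env_digest(env_spec: dict) -> str:
    """Content-address an environment spec: identical specs always
    produce the same digest, so two shards reporting the same version
    are provably running the same bundle."""
    canonical = json.dumps(env_spec, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

v1 = {
    "name": "metagame",
    "version": "2021.03.1",
    "services": {"matchmaking": "3.0.2", "game-history": "1.4.0"},
}
v1_copy = json.loads(json.dumps(v1))  # round-trip: same content

assert env_digest(v1) == env_digest(v1_copy)  # same spec, same identity

v2 = dict(v1, version="2021.03.2")
assert env_digest(v1) != env_digest(v2)       # any change is a new version
```

Immutability is what makes this work: since a published version never changes, its digest is a permanent, unambiguous name for that exact set of services.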
A focus on operations

Given that our goal was to raise the level at which partner operators serve players, we quickly realized that deploying software was only the first step; understanding how to triage, operate, and maintain the live system matters just as much. Historically, we relied heavily on runbooks, maintained by developers with varying degrees of success, recording everything from required configuration values to high-level architecture. To equip partner operators with all the knowledge needed to configure and operate each service, we decided to pull as much of that runbook information as possible into the service specification itself. This significantly reduced the ramp-up time for partner regions adopting new services and ensures they are informed of every important change when a microservice is updated. Today, partner operators can use the specification to learn operational metadata including required and optional configuration, scaling characteristics, maintenance operations, key metrics and alert definitions, deployment policies, inter-service dependencies, and a growing list of other useful information.

Dealing with shard differences
Of course, shards are not identical copies of one another. Although we want them as close as possible, some configuration must always differ: database passwords, supported languages, scaling parameters, and shard-specific tuning all vary from shard to shard. To support this, our tools deploy environment specifications through a hierarchical overlay system, letting operators specialize a particular deployment while still knowing it derives from a known-good version. Let's see how it works!

A sample use case
A simple game back end can comprise two environments: one for the game servers and one for the meta-game services (rankings, matchmaking, and so on). The meta-game environment consists of various services — ranking, matchmaking, game history, and more — and each service contains one or more Docker images, conceptually equivalent to Kubernetes containers. The same hierarchy holds for all environments; philosophically, each environment encapsulates everything needed to deploy, run, and operate a game back end, with all its dependencies, on any supported infrastructure or cloud. The specification also includes all the metadata needed to run and operate the entire environment; this growing collection covers configuration, secrets, metrics, alerts, documentation, deployment and rollout policies, inbound network restrictions, and storage, database, and cache requirements. Below is an example of two hypothetical game-shard deployments in two regions. Both consist of a meta-game environment and a game-server environment; the game-server production environment in the European shard lags slightly behind its counterpart in the American shard. This gives game and operations teams a common language to describe and compare different shard deployments, and keeps the growing set of services in each environment simple enough that dozens of shards can be deployed reliably.
An example of two game-shard deployments
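The hierarchical overlay idea from the previous section can be sketched like this: a shard-specific patch is layered over a known-good base environment, so an operator can see exactly what differs from the blessed version. The merge function and the keys shown are illustrative assumptions.

```python
def overlay(base: dict, patch: dict) -> dict:
    """Recursively layer shard-specific values over a base environment,
    leaving the base itself untouched."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            merged[key] = overlay(base[key], value)
        else:
            merged[key] = value
    return merged

base_env = {
    "version": "2021.03.1",   # the known-good bundle
    "config": {"language": "en_US", "log_level": "info"},
    "scaling": {"game-server": {"min": 10}},
}

eu_overlay = {
    "config": {"language": "de_DE"},          # per-shard difference
    "scaling": {"game-server": {"min": 25}},  # bigger region
}

eu_shard = overlay(base_env, eu_overlay)
assert eu_shard["version"] == "2021.03.1"          # still the known-good base
assert eu_shard["config"]["log_level"] == "info"   # inherited
assert eu_shard["config"]["language"] == "de_DE"   # specialized
```

Because only the overlay varies per shard, comparing two shards reduces to comparing two small patches against the same immutable base.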
What's next: latency-aware scheduling
We want to be able to describe the expected and acceptable latencies between services, and have the tooling optimize placement across the underlying regions and lower-level PaaS services to satisfy those requirements. This would cause some services to be co-located on the same rack, host, or cloud zone rather than spread arbitrarily among the others. This matters a great deal to us because of the performance characteristics of game servers and their supporting services. Riot is already a multi-cloud company, with our own data centers as well as AWS and partner clouds, but today we rely on statically designed topologies. A card game and a shooter have different profiles; not having to hand-design a topology for each case saves engineers time and lets them focus on the game.
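To illustrate the idea, here is a hypothetical sketch of what latency-aware placement could look like: the spec declares an acceptable latency budget between service pairs, and a scheduler rejects placements whose measured link latencies exceed it. All names, budgets, and measurements are invented for illustration.

```python
# Declared budgets: maximum acceptable latency between service pairs (ms).
MAX_LATENCY_MS = {
    ("game-server", "matchmaking"): 30,
    ("game-server", "game-history"): 100,
}

# Illustrative measured link latencies between locations (ms).
LINK_LATENCY_MS = {
    ("rack-1", "rack-1"): 0.2,
    ("rack-1", "cloud-a"): 45.0,
}

def placement_ok(placement: dict) -> bool:
    """Check a proposed service->location placement against the budgets."""
    for (svc_a, svc_b), budget in MAX_LATENCY_MS.items():
        loc_a, loc_b = placement[svc_a], placement[svc_b]
        link = LINK_LATENCY_MS.get(
            (loc_a, loc_b), LINK_LATENCY_MS.get((loc_b, loc_a), 0.0)
        )
        if link > budget:
            return False
    return True

# Putting matchmaking in the cloud blows the 30 ms budget to the game
# server, while game-history tolerates the 45 ms link:
assert not placement_ok({"game-server": "rack-1",
                         "matchmaking": "cloud-a",
                         "game-history": "cloud-a"})
assert placement_ok({"game-server": "rack-1",
                     "matchmaking": "rack-1",
                     "game-history": "cloud-a"})
```

A real scheduler would search for a placement satisfying all budgets rather than merely validating one, but the declared-budget input is the part the specification would carry.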
To recap: we faced declining stability in our running games, mostly in shards operated by partners. The tooling team bundled open-source deployment tools and added metadata to containers, while the game teams implemented centralized release processes. These measures treated the symptoms but not the root cause, and we did not reach our target level. The solution we ultimately adopted introduces a new specification that captures the full topology, hierarchy, and metadata of a game back end and all its dependencies. This approach works because it ships, as one consistent release, the bundled containers, the dependencies describing how they interact, and all the supporting metadata needed to start and operate the entire game. Immutability yields deterministic deployments and predictable operations.

As a platform team, our goal is to choose systems and building blocks that produce a virtuous cycle, in which feature development naturally yields products that are easy to operate. Combining the agility of the DevOps model with an easy-to-operate whole product is the key to long-term organizational agility. Our environment-bundling approach directly improved our operational metrics and, more importantly, the quality of the player experience.

We are excited to see how others in the industry are solving similar problems. We have seen relevant ideas and projects from the CNCF (Cloud Native Computing Foundation) and from large cloud providers, such as Microsoft's Open Application Model specification. We hope some of these projects will eventually replace our in-house specification and move the industry toward shared solutions. In future articles, we will explore Riot's specification in more detail, walk through examples, and discuss design trade-offs and Riot-specific shortcuts. Thanks for reading! If you have any questions, feel free to contact us.
END
More articles in the "Revealing LOL" series:
Revealing the IT infrastructure behind LOL | Embarking on a journey of deployment diversity
Revealing the IT infrastructure behind LOL | The key role of "scheduling"
Revealing the IT infrastructure behind LOL | SDN unlocks the new infrastructure
Revealing the IT infrastructure behind LOL | Infrastructure as code
Revealing the IT infrastructure behind LOL | The microservices ecosystem
Revealing the IT infrastructure behind LOL | What can developers do?