Technology Architecture Sharing: Meituan Delivery System Architecture Evolution in Practice

Since its founding, Meituan Delivery has gone through several periods of leapfrog growth. Rapid business growth has placed ever higher demands on the overall architecture and infrastructure of the system, and has also pushed the technical team to understand the business deeply, identify the domain model accurately, and support system expansion efficiently. How can the system architecture be upgraded quickly and effectively while the business grows rapidly and availability requirements keep rising? How can R&D efficiency and quality be maintained in a complex business? This article introduces some of Meituan Delivery's thinking and practice.

Delivery business

From Logistics to Same-City Instant Delivery

The development of the logistics industry is inseparable from the development of commerce. In recent years, changes in commerce have created new opportunities for logistics. The rise of e-commerce drove the rapid growth of the express delivery industry and directly gave rise to express companies such as SF Express (Shunfeng) and the "Sitong Yida" carriers. More recently, the rise of the O2O business model, especially takeout and fresh-food scenarios, has driven the rapid growth of same-city instant delivery.

Unlike other branches of logistics, same-city instant delivery has the following characteristics:

High timeliness: the average delivery time for Meituan takeout is 28 minutes.

Short distance: most deliveries cover 3 to 5 km, and the longest extend across the same city.

Strong randomness: pickup and drop-off points are random in both time and space, which makes prediction and planning relatively difficult.

Development opportunities for same-city instant delivery

Generally speaking, the reshaping of an industry's processes depends on two factors:

Internal causes: major breakthroughs in technology or infrastructure

External causes: upgrading of user consumption or major changes in the market

In terms of technology, AI and big data are being applied ever more widely; AI makes it possible to accurately estimate delivery difficulty, ETA and rider capability. With the rapid development of GPS and the continued opening up of capabilities by GIS vendors, the cost of developing LBS-based applications has dropped sharply. In terms of infrastructure, thanks to sustained state investment, mobile network quality keeps improving while costs fall year by year, which has indirectly helped smartphones reach almost universal coverage.

In terms of the market, because of China's very large population and its high degree of concentration, demand for home-delivery scenarios such as takeout keeps growing in major cities, especially first-tier cities. Users also expect more in terms of safety, timeliness, and the dress and courtesy of delivery staff.

Together, these two factors have driven the growth of the same-city instant delivery industry. For this business, fulfillment capability and operational efficiency are the two key problems the R&D team has to solve:

Fulfillment guarantee: achieve real-time control of waybill scheduling on the platform, with the ability to regulate supply and demand.

Operational efficiency: strengthen the management and control of delivery riders, improve the operational efficiency of the whole delivery business, and keep reducing costs.

Technical challenges

In essence, Meituan's delivery system is a large-scale collaboration system in which machines and a huge number of riders work together to serve users and merchants across the country. The technical challenges stem from the pain points of the business, which demands strong fulfillment capability online and strong operational capability offline. They therefore come from both the online and the offline side:

Online fulfillment SLAs are more demanding. The delivery business has to balance the interests of users, merchants and riders, and the impact of any outage can be catastrophic. When the experience is poor, users ask, "Why am I still going hungry after I have paid?"; merchants say the food is ready but nobody has come to pick it up; and riders feel that they have put in time and effort without earning enough in return.

Offline business is more complex. Different business lines are managed in different ways, which makes it a major challenge for the system to serve both what they have in common and where they differ.

Evolution of system architecture

The evolution of the Meituan delivery system architecture can be divided into three stages:

MVP stage: the business model is being explored through rapid trial and error; the question is how to iterate quickly.

Large-scale stage: the business grows exponentially; the question is how to keep up with business growth while solving problems of system availability, scalability and R&D efficiency.

Refinement stage: the business model matures and operations become more fine-grained; the question is how to drive the business through product and technology innovation.

MVP stage

In the trial-and-error stage, you need to find out quickly whether the business model is heading in the right direction. Do not expect to think everything through at this stage; users and the market will give feedback quickly. For the technical team, therefore, the most important capability at this stage is speed: the aim is to seize the market, and speed is the one thing that cannot be beaten.

From the perspective of system architecture, the MVP stage only needs a coarse-grained decomposition. We made a preliminary division of services along three main domains (people, finance and goods), so that later business domains could be split off from, and inherit from, these three.

A word on how the team was organized at the time: the R&D team was organized by project, and everyone maintained a single system together. There was no QA role on the team; quality was ensured by PMs and RDs. Releasing more than 20 times a day was normal.

Large-scale stage

By this stage, the business and the product have been preliminarily validated by the market and have found the right direction. At the same time, the pace of business growth places higher demands on the R&D team: a large number of urgent and important tasks emerge, and problems of system availability and scalability gradually come to the fore. If these are not handled properly, the result is frequent system failures and low R&D efficiency, leaving the R&D team exhausted.

At the architecture level, we focused on three groups of questions at this stage:

How should the overall architecture evolve? Where is the boundary between the fulfillment system and the operation system?

How can the availability of the fulfillment system be ensured? How should system capacity be planned?

How does the operation system address the real pain points of the business? How can R&D efficiency be improved under a flood of "trivial requirements"?

The overall approach to these problems is to simplify complexity (sort out the logical relationships), divide and conquer (let specialists do specialist work), and evolve gradually (weigh the ROI).

Overall architecture design

In the overall architecture, we split the delivery system into the fulfillment system, the operation system and the master data platform.

In designing the fulfillment system (upper right of the architecture diagram), we first made a preliminary split between the user side and the rider side, so that the split accommodates the roles at both ends while keeping the scheduling process unified. For example, the user side cares more about the order placement success rate and the consistency of order status, while the rider side cares more about dispatch quality and the order push success rate. This decouples order placement, payment, scheduling and other modules from one another.

As for the operation system (upper left of the diagram), its requirements have long been numerous and varied, so the architecture design has to work out what the delivery operation system should and should not manage. Over long-term project development, we started from the business strategy and organizational structure, clarified the strategic business objectives and stage-by-stage strategies, sorted out the core responsibilities, assessment targets and collaboration processes of each business team and post, and finally organized the delivery operation management center, at this stage, into four areas:

Business planning: how to set goals scientifically and make sure they can actually be achieved.

Business management: how to improve the efficiency and quality of each business management process.

Rider operation: riders are the core resource. How many riders a city needs, whether rider grading is scientific, and how to regulate rider supply all require a systematic plan.

Settlement platform: using money more efficiently is the key to cost leadership. How to spend money correctly and accurately is a question that needs long-term thought.

Beyond the architecture of the fulfillment and operation systems themselves, there is another key architecture question: how to divide the boundaries and responsibilities between the two. In my view, this may be the most critical architecture design problem in the large-scale stage of an O2O business; if it is not solved effectively, it creates a serious hidden risk for the availability and scalability of the system. Fulfillment and operation differ greatly in business requirements and technical responsibilities. Most data is produced in the operation system, while the core, most critical use of that data is in the fulfillment system. Although their domain responsibilities are clear, the boundary for a specific requirement is not always simple or obvious. To address this, we borrowed the idea of MDM and proposed the concept of a master data platform (bottom of the diagram), focused on the cooperation and boundary problems between the fulfillment system and the operation system.

Master data platform

Master data is the most basic business-entity data in an enterprise information system; for delivery, it covers organizations, posts, personnel, merchants, users, cities and so on. Its counterpart is business (transaction) data, such as orders, attendance and payroll. Master data has two key characteristics:

Foundational: business data is built on top of master data dimensions. For example, order data is transaction data that sits under two master data entities, the user and the merchant.

Shared: all kinds of systems depend strongly on master data, and upstream business systems need to detect and react to changes in it.

Master data management was not achieved overnight; it was iterated step by step as the business developed. In the early days, the system was simple, and upstream systems read data directly from the DB. As the system grew more complex, developers from multiple teams could easily interfere with one another, which hindered system expansion and posed a serious availability risk. We therefore set up a dedicated master data team, split the master data service out on its own, and routed all access to the data through that service. On this basis, after continuous iteration, we eventually adopted ideas from CQRS (Command Query Responsibility Segregation) and MDM (Master Data Management) and gradually divided the master data platform into four parts:

Production system: models data production and isolates its impact on the core model. Examples: rider onboarding, the organization-splitting process, and so on.

Core model: explores the relationships between data entities to strengthen the model. Examples: one person holding multiple posts, dual reporting lines, and so on.

Capacity center: supports the application scenarios of the fulfillment system, abstracts the many attributes of a rider into a delivery capacity model, and focuses on availability and throughput.

Management center: provides a standardized framework for the operation system, with unified solutions for information retrieval, process approval, access control and similar scenarios.
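The split between a write-oriented production side and a read-oriented view for fulfillment can be illustrated with a minimal CQRS-style sketch. Everything named here (Rider, CoreModel, ProductionSystem, RiderCapacityView and their methods) is hypothetical and chosen for illustration only; it is not Meituan's implementation, and a real platform would propagate changes through messaging rather than in-process callbacks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rider:
    rider_id: int
    city: str
    on_duty: bool = False


@dataclass
class CoreModel:
    """Hypothetical core model: the single source of truth for rider entities."""
    riders: dict = field(default_factory=dict)
    listeners: list = field(default_factory=list)

    def apply(self, rider: Rider) -> None:
        # Store the new state, then notify every read-side view of the change.
        self.riders[rider.rider_id] = rider
        for notify in self.listeners:
            notify(rider)


class ProductionSystem:
    """Command side: validates writes such as rider onboarding."""

    def __init__(self, core: CoreModel) -> None:
        self.core = core

    def onboard_rider(self, rider_id: int, city: str) -> None:
        if rider_id in self.core.riders:
            raise ValueError(f"rider {rider_id} already exists")
        self.core.apply(Rider(rider_id, city))

    def set_duty(self, rider_id: int, on_duty: bool) -> None:
        rider = self.core.riders[rider_id]
        self.core.apply(Rider(rider.rider_id, rider.city, on_duty))


class RiderCapacityView:
    """Query side: a read-optimized view of on-duty riders per city."""

    def __init__(self, core: CoreModel) -> None:
        self.available = {}
        core.listeners.append(self._on_change)

    def _on_change(self, rider: Rider) -> None:
        ids = self.available.setdefault(rider.city, set())
        if rider.on_duty:
            ids.add(rider.rider_id)
        else:
            ids.discard(rider.rider_id)

    def available_riders(self, city: str) -> set:
        return set(self.available.get(city, set()))


core = CoreModel()
production = ProductionSystem(core)
capacity = RiderCapacityView(core)
production.onboard_rider(1, "Beijing")
production.onboard_rider(2, "Beijing")
production.set_duty(1, True)
print(capacity.available_riders("Beijing"))
```

In this sketch, fulfillment-style readers only ever touch the view, so changes to onboarding rules on the command side do not affect the read path.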

System availability

Rapid business growth places ever higher demands on system availability. At the level of methodology, we defined four capabilities to build along the timeline of an incident (before, during and after): prevention, diagnosis, resolution and avoidance. In practice, the work is split into two aspects: process and system.

Building availability is a long-term project. Considering ROI, the initial stage focused on building processes ahead of incidents, that is, a series of online operation processes such as release standards, which in the early stage can avoid 80% of online failures. Once the process standards were running and proven effective, we gradually improved productivity by building systems to support them.

Disaster recovery capability

In building disaster recovery capability, the first question to ask is where the system's biggest risk lies. From a management point of view, the "grey areas" of responsibility are usually where system quality is most at risk. The first disaster recovery work in the early stage was therefore degradation for core dependencies and third-party dependencies, making sure that if a dependent service or piece of middleware has a problem, the system itself has at least a basic degradation capability.
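Degradation of a dependency can be sketched as a wrapper that serves a fallback value when the dependency fails, and stops calling it for a cool-down period after repeated failures (a simple circuit-breaker pattern, used here as an illustration rather than as the article's actual mechanism). The class name DegradableCall, the ETA service, the thresholds and the fallback value of 28 minutes are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


class DegradableCall:
    """Wrap a dependency call; fall back on failure and pause calls after repeated failures."""

    def __init__(self, call: Callable[[int], int], fallback: Callable[[int], int],
                 max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.degraded_until: Optional[float] = None

    def __call__(self, order_id: int) -> int:
        now = self.clock()
        # While degraded, skip the dependency entirely and answer from the fallback.
        if self.degraded_until is not None and now < self.degraded_until:
            return self.fallback(order_id)
        try:
            result = self.call(order_id)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.degraded_until = now + self.cooldown_s
                self.failures = 0
            return self.fallback(order_id)
        self.failures = 0
        self.degraded_until = None
        return result


calls = []


def flaky_eta(order_id: int) -> int:
    # Hypothetical ETA service that is currently down.
    calls.append(order_id)
    raise TimeoutError("eta service unavailable")


def default_eta(order_id: int) -> int:
    # Hypothetical static estimate, in minutes, used while degraded.
    return 28


fake_now = [0.0]
eta = DegradableCall(flaky_eta, default_eta, max_failures=2,
                     cooldown_s=10.0, clock=lambda: fake_now[0])
results = [eta(order_id) for order_id in range(4)]
print(results, calls)
```

The injected clock is only there to make the behavior easy to exercise without waiting; in this run the dependency is attempted twice and then skipped for the remaining calls.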

In the second stage, we introduced end-to-end disaster recovery capability. First, we built a business dashboard and defined the core business metrics to monitor in real time (order volume, number of online riders, and so on), so that we could quickly judge whether something was wrong with the system. Second, we broke the core metrics down by key dimensions (city, App version, operator, and so on) to quickly assess the scope of a problem. Finally, through the Trace system, we visualized the call relationships between services and the success rate of each link, which let us quickly locate the root cause of a problem.
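Breaking a core metric down by a dimension to see where a problem is concentrated can be sketched as follows. The function name drop_by_dimension, the event layout and the sample data are hypothetical; the sketch simply compares order counts per dimension value against a baseline window.

```python
from collections import Counter


def drop_by_dimension(baseline: list, current: list, dimension: str) -> dict:
    """Return the fractional drop in order count per dimension value, largest drop first."""
    base = Counter(event[dimension] for event in baseline)
    cur = Counter(event[dimension] for event in current)
    report = {
        key: round(1 - cur.get(key, 0) / expected, 2)
        for key, expected in base.items()
    }
    return dict(sorted(report.items(), key=lambda item: item[1], reverse=True))


# Hypothetical order events from a normal window and from the current window.
baseline = [{"city": "Beijing"}] * 4 + [{"city": "Shanghai"}] * 4
current = [{"city": "Beijing"}] * 4 + [{"city": "Shanghai"}] * 1

report = drop_by_dimension(baseline, current, "city")
print(report)
```

The same function can be pointed at another dimension key, such as an App version field, to check whether a drop follows a release rather than a region.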

In the third stage, we aim to build disaster recovery plans into the system itself and create customized, integrated disaster recovery tools for different incident scenarios, which further reduces fault response and handling time as well as the learning cost for R&D. For example, to further improve the SLA of the delivery system, we optimized end-to-end disaster recovery in depth, focusing on end-to-end interaction when a rider has a weak network connection or none at all. In some areas of China the population is very dense but mobile network quality is poor, so riders using the App there experience long delays or cannot operate it at all, which seriously affects their work. We therefore kept strengthening long-lived connections and multi-channel interoperability at the mobile network link level, and integrated tools for network diagnosis, handling and verification, which further improved the end-to-end arrival rate of the rider App.

System capacity

For a rapidly growing business, capacity planning is a long-term concern. Its two key points are evaluation and expansion.

In terms of evaluation, in the early stage of the business a single architect could grasp the entire system, and static evaluation was basically enough to measure system capacity. As the system grew more complex, we gradually introduced tools such as Trace and middleware capacity monitoring to assist. The architect team defined the overall framework for capacity evaluation, and each team evaluated the capacity of its own subsystems in detail. Once the business had become very complex, no single person or team could guarantee an accurate capacity evaluation. At that point we launched projects such as scenario load testing, traffic-diversion load testing and full-link load testing. Using traffic marking, shadow tables, traffic offset and scenario replay, we replay online traffic at a chosen ratio to evaluate capacity and bottlenecks accurately.
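The combination of traffic marking and shadow tables can be sketched in a few lines: a request-scoped mark identifies load-test traffic, and the data access layer writes marked traffic to a shadow table so that production data is untouched. The names is_load_test, table_for, save_order and the "_shadow" suffix are assumptions for this example, and SQLite stands in for the real database.

```python
import sqlite3
from contextvars import ContextVar

# Traffic mark: assumed to be set by a gateway when a request is replayed load-test traffic.
is_load_test: ContextVar = ContextVar("is_load_test", default=False)

SHADOW_SUFFIX = "_shadow"


def table_for(base: str) -> str:
    """Route marked traffic to the shadow table, everything else to the real one."""
    return base + SHADOW_SUFFIX if is_load_test.get() else base


def create_schema(conn: sqlite3.Connection) -> None:
    for name in ("orders", "orders" + SHADOW_SUFFIX):
        conn.execute(f"CREATE TABLE {name} (order_id INTEGER PRIMARY KEY, city TEXT)")


def save_order(conn: sqlite3.Connection, order_id: int, city: str) -> None:
    conn.execute(
        f"INSERT INTO {table_for('orders')} (order_id, city) VALUES (?, ?)",
        (order_id, city),
    )


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
create_schema(conn)

save_order(conn, 1, "Beijing")       # real traffic goes to "orders"

token = is_load_test.set(True)       # replayed traffic carries the mark
save_order(conn, 1, "Beijing")       # same id, no conflict: different table
save_order(conn, 2, "Shanghai")
is_load_test.reset(token)

print(count_rows(conn, "orders"), count_rows(conn, "orders_shadow"))
```

Because the routing decision lives in one helper, business code does not need to know whether it is handling real or replayed traffic.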

In terms of capacity expansion, we implemented, in stages, redundant backup (primary/replica separation), vertical splitting (separating core attributes from non-core attributes), horizontal splitting (database and table sharding), and automatic archiving.
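Horizontal splitting needs a routing rule that maps each record to a database and a table. A minimal sketch, assuming a numeric sharding key and simple modulo routing (the key choice, shard counts and naming scheme below are illustrative assumptions, not details from the article):

```python
DB_COUNT = 4          # hypothetical number of databases
TABLES_PER_DB = 8     # hypothetical number of tables per database


def route(waybill_id: int) -> tuple:
    """Map a waybill id to a (database, table) pair using modulo routing."""
    slot = waybill_id % (DB_COUNT * TABLES_PER_DB)
    return f"waybill_db_{slot // TABLES_PER_DB}", f"waybill_{slot % TABLES_PER_DB}"


print(route(35))   # slot 3  -> database 0, table 3
print(route(45))   # slot 13 -> database 1, table 5
```

Modulo routing spreads sequential ids evenly but makes changing the shard count expensive, which is one reason the sharding key and shard count are usually fixed with future growth in mind.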

Iteration efficiency of the operation system

The operation system touches every aspect of business operation and management. In addition to defining the business domains of objectives, processes, delivery capacity and funds, we created a set of integrated solutions for the operation system. By continuously investing in the long-term construction of platform services and components, R&D ensures the scalability of each vertical operation system and thereby keeps improving R&D efficiency. Take workflow as an example: through dynamic forms and a process platform, we unified the engineering implementation of the various business flows and approval flows, quantified the efficiency and quality of management actions, identified the nodes where processes get blocked, and automated some process steps, continuing to reduce labor costs through technical means.
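Once all approval flows run through one process platform, finding the node where flows get blocked becomes a simple aggregation over the platform's event log. The sketch below assumes a hypothetical log of (flow id, node, entered hour, left hour) tuples; the node names and figures are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def node_durations(events: list) -> dict:
    """Average time spent in each workflow node, in hours."""
    per_node = defaultdict(list)
    for _flow_id, node, entered, left in events:
        per_node[node].append(left - entered)
    return {node: mean(durations) for node, durations in per_node.items()}


def blocking_node(events: list) -> str:
    """Return the node with the longest average dwell time."""
    durations = node_durations(events)
    return max(durations, key=durations.get)


# Hypothetical approval-flow log: (flow id, node, entered at hour, left at hour).
events = [
    ("f1", "submit", 0, 1),
    ("f1", "city_manager", 1, 9),
    ("f1", "finance", 9, 11),
    ("f2", "submit", 0, 1),
    ("f2", "city_manager", 1, 13),
    ("f2", "finance", 13, 14),
]

durations = node_durations(events)
print(durations, blocking_node(events))
```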

Refinement stage

As the business matures, its operation and management actions become more fine-grained. At this stage the business expects more of product and technology: continued innovation that builds technical barriers and maintains a leading edge. The delivery business naturally has strong demand for AI applications; from supply regulation to resource allocation, these are the main arenas where AI proves its value. On the engineering side, the ongoing question is how to put AI to work in the business more effectively. To this end, we focused on improving our capabilities in several areas:

Reduce the cost of trial and error: build a simulation platform that gives algorithms a "sandbox environment" in which their effectiveness can be evaluated quickly offline.

Improve the efficiency of algorithm feature iteration: build a feature platform, unify the algorithm strategy iteration framework and feature data production framework, and improve the quality of feature data.

Improve the quality of navigation data: keep investing in the LBS platform, improve the quality of basic data, and provide location, navigation and spatial capabilities.

Simulation platform

The core of the simulation platform is to create a "sandbox environment". Because delivery is a service business in which users, merchants and riders are all deeply involved, trying out an algorithm online is very costly. To build the simulation platform, we stripped away the details of the scheduling system and built a coarse-grained, miniature scheduling system, then used order replay, entity modeling of users, merchants and riders, rider behavior simulation and other methods to simulate online scenarios. Each simulation run produces a KPI report for the algorithm, giving an offline prediction of its effectiveness.
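The idea of replaying orders through a coarse-grained scheduler and comparing algorithms by KPI can be shown with a toy sketch. The Order fields, the two dispatch policies, the 30-minute promise and the sample orders are all assumptions for illustration; a real simulation would model geography, merchants and rider behavior in far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    order_id: int
    created_at: int        # minute the order was placed
    service_minutes: int   # pickup plus travel time once a rider starts


def earliest_free(free_at: list, order: Order) -> int:
    """Policy A: give the order to the rider who becomes free first."""
    return min(range(len(free_at)), key=free_at.__getitem__)


def round_robin(free_at: list, order: Order) -> int:
    """Policy B: assign by order id, ignoring rider workload."""
    return order.order_id % len(free_at)


def simulate(orders: list, rider_count: int, policy, promise_minutes: int = 30) -> dict:
    """Replay orders in time order and return a small KPI report for the policy."""
    free_at = [0] * rider_count
    delivery_times = []
    for order in sorted(orders, key=lambda o: o.created_at):
        rider = policy(free_at, order)
        start = max(free_at[rider], order.created_at)
        done = start + order.service_minutes
        free_at[rider] = done
        delivery_times.append(done - order.created_at)
    on_time = sum(t <= promise_minutes for t in delivery_times)
    return {
        "avg_minutes": sum(delivery_times) / len(delivery_times),
        "on_time_rate": on_time / len(delivery_times),
    }


# Hypothetical recorded orders to replay.
orders = [Order(0, 0, 20), Order(2, 0, 20), Order(1, 5, 20), Order(3, 10, 20)]

report_a = simulate(orders, rider_count=2, policy=earliest_free)
report_b = simulate(orders, rider_count=2, policy=round_robin)
print(report_a)
print(report_b)
```

Both policies see exactly the same replayed orders, so any difference between the two reports comes from the dispatch logic alone, which is the property that makes offline comparison meaningful.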

Algorithm data platform

The effect of an algorithm strategy depends mainly on the quality of the algorithm model and the feature data. For this reason, we built a one-stop algorithm data platform around models and features. It provides a complete closed-loop data solution, from data cleaning, feature extraction and model training through online prediction to the evaluation of algorithm effect, and supports the rollout of machine learning and deep learning models across the delivery business lines.

LBS platform

The LBS platform was put in place as early as the initial stage of the delivery business. As algorithm scenarios have developed, the LBS platform has kept deepening its point, line and area spatial capabilities, supporting business scenarios such as delivery scheduling, time estimation and pricing, and producing products such as task maps, path planning, voice navigation and heat maps.

Conclusion

Throughout the evolution of the Meituan delivery system architecture, the architect team has focused on key questions such as how technology drives the business and how to keep domain responsibilities and boundaries clear, while continually weighing ROI along the way. Technology keeps improving the experience, increasing scale and reducing operating costs; the job of architecture is to turn complex problems into simple ones and have domain experts solve them piece by piece. As scale keeps growing, continued business innovation will pose ever greater challenges to the system architecture, and system architecture design will remain a topic for our long-term study.

About the author

Yong Jun, senior technical expert at Meituan and head of the delivery business system team, has long worked on delivery system quality assurance, operation system construction and system architecture upgrades.

[This article is reproduced from the WeChat official account of the Meituan technical team, ID: meituantech. Original link: https://mp.weixin.qq.com/s/Ik5vp5zQfx5dS4JFvAlxgQ]
