
How to Deploy Spark Cluster Technology at Meituan


This article introduces how Spark cluster technology is deployed at Meituan. Many people run into difficulties with this in real projects, so the editor walks through how such situations are handled. I hope you read it carefully and come away with something useful!

Preface

Meituan is a data-driven Internet service. Users' daily clicks, browsing, and order payments on Meituan generate massive logs, which are summarized, analyzed, mined, and learned from to provide data support for Meituan's recommendation and search systems and even for setting the company's strategic goals. Big data processing permeates all kinds of application scenarios across Meituan's business lines. Choosing an appropriate and efficient data processing engine can greatly improve the efficiency of data production and, in turn, directly or indirectly improve the productivity of the teams involved.

Meituan's early data processing was mainly based on Hive SQL, with MapReduce as the underlying computing engine; some of the more complex business logic was implemented by engineers writing MapReduce programs directly. As the business grew, simple Hive SQL queries or MapReduce programs became increasingly unable to meet the needs of data processing and analysis.

On the one hand, the MapReduce computing model does not support multi-round iterative DAG jobs: data must be written to disk after each round, which greatly hurts job execution efficiency. In addition, it only provides two computing primitives, Map and Reduce, so implementing iterative computation (such as machine learning algorithms) is costly and inefficient for users.

On the other hand, in daily data warehouse production, some of the raw logs are semi-structured or unstructured, and cleaning and conversion require combining SQL queries with complex procedural logic, which used to be done with Hive SQL plus Python scripts. This approach has efficiency problems: when the data volume is large, the processes run for a long time, and since these ETL processes sit upstream, they directly delay the completion of a whole series of downstream jobs and the generation of various important data reports.

For these reasons, Meituan introduced Spark in 2014. In order to make full use of the existing Hadoop cluster's resources, we adopted the Spark on YARN mode, so that all Spark applications and MapReduce jobs are scheduled and executed through YARN. The position of Spark in Meituan's data platform architecture is shown in the figure below.
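As a minimal illustration of the Spark on YARN mode, here is a sketch of a Spark application whose executors are scheduled through YARN; the application name is made up, and in practice the master and YARN queue are usually supplied via spark-submit rather than hard-coded.

```scala
import org.apache.spark.sql.SparkSession

object YarnAppSketch {
  def main(args: Array[String]): Unit = {
    // Master "yarn" lets YARN schedule the executors alongside MapReduce jobs.
    val spark = SparkSession.builder()
      .appName("meituan-etl-example")   // hypothetical application name
      .master("yarn")
      .getOrCreate()

    // ... job logic would go here ...

    spark.stop()
  }
}
```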

After nearly two years of promotion and development, Spark has gone from being tried by only a handful of teams for data processing and machine learning problems to covering all kinds of application scenarios across Meituan's major business lines. From upstream ETL production to downstream SQL query analysis and machine learning, Spark is gradually replacing MapReduce jobs and becoming the mainstream computing engine for big data processing at Meituan. At present, the ratio of Spark jobs to MapReduce jobs submitted by users on Meituan's Hadoop cluster is 4:1, and for some upstream Hive ETL processes migrated to Spark, job execution speed has increased roughly tenfold under the same resource usage, greatly improving the production efficiency of the business side.

Next we will introduce the practice of Spark at Meituan, including our platform work built on Spark and application cases of Spark in the production environment. These include an interactive development platform built around Zeppelin, an ETL data conversion tool implemented with Spark tasks, a feature platform and a data mining platform developed by the data mining group on top of Spark, and an interactive user behavior analysis system based on Spark as well as its application in the SEM delivery service. Each is introduced in detail below.

Spark interactive development platform

In the process of promoting Spark, we summarized the main requirements users have when developing applications:

Data research: before formally developing a program, one first needs to understand the business data to be processed, including its format, types (the field types if it is stored in a table structure), storage method, whether there is dirty data, and even whether data skew may occur given the business logic to be implemented. This requirement is basic but important: only with a sufficient grasp of the data can one write efficient Spark code.

Code debugging: the coding of business logic is rarely done in one shot and may need constant debugging. If every small modification requires compiling, packaging, and submitting the test code online, it severely hurts development efficiency.

Joint development: implementing an entire piece of business usually involves multiple parties, so a convenient way to share code and execution results is needed for exchanging ideas and experimental conclusions.

Based on these requirements, we investigated the existing open source systems and finally chose Zeppelin, an Apache incubator project, as the interactive development platform based on Spark. Zeppelin integrates Spark, Markdown, Shell, Angular, and other engines, and provides features such as data analysis and visualization.

On top of native Zeppelin we added user login authentication, user behavior log auditing, permission management, and Spark job resource isolation, creating an interactive development platform for Spark at Meituan on which different users can investigate data, debug programs, and share code and conclusions.

The Spark integration in Zeppelin provides three interpreters: Spark, PySpark, and SQL, suitable for writing Scala, Python, and SQL code respectively. For the data research requirement above, whether at the start of programming or during coding, whenever data information needs to be checked, analysis results can be obtained easily through the SQL interface provided by Zeppelin. In addition, the interactive features of Zeppelin's Scala and Python interpreters meet users' needs for step-by-step debugging of Spark and PySpark. At the same time, because Zeppelin connects directly to online clusters, it can serve users' read and write requests against online data. Finally, Zeppelin uses WebSockets for communication: a user only needs to send the HTTP link of the content to be shared, and all recipients can see code changes and run results in sync, so multiple developers can work together.
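To make the data-research step concrete, here is a small sketch of the kind of exploration one might run in a Zeppelin Spark paragraph; the database, table, and column names are hypothetical, and a Spark session named `spark` is assumed to be in scope.

```scala
import org.apache.spark.sql.functions.desc

// Assumes a Spark session named `spark` is available (as in recent Zeppelin
// Spark interpreters); the table and columns are hypothetical.
val logs = spark.sql(
  "SELECT * FROM hypothetical_db.user_click_log WHERE dt = '2016-01-01'")

logs.printSchema()   // data format and field types
logs.show(10)        // eyeball a sample for dirty data

// Rough check for data skew on a likely join/group key.
logs.groupBy("city_id").count().orderBy(desc("count")).show(20)
```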

Spark Job ETL template

Besides providing platform tools, we also improve users' development efficiency in other ways, for example by encapsulating similar requirements into unified ETL templates, so that users can easily use Spark to implement their business needs.

The main body of Meituan's current data production loads raw logs into Hive tables via ETL after cleaning, conversion, and other steps. However, many online businesses need to import the data in Hive tables into Tair as key-value pairs built with certain rules, for fast access by upper-layer applications. Most of these requirements share the same logic: splice the values of several specified fields in a Hive table into a key according to certain rules, serialize the values of the remaining fields into a JSON string as the value, and finally write the resulting pairs into Tair.
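As a rough sketch of the hive2Tair logic described above (not Meituan's actual template code), the following Spark job splices configured key fields into a key, packs the remaining fields into a JSON value, and hands the pairs to a writer; the table, field names, and Tair client are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, struct, to_json}

object Hive2TairSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive2tair-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val keyFields   = Seq("user_id", "city_id")            // hypothetical fields spliced into the key
    val valueFields = Seq("order_cnt", "last_visit_time")  // hypothetical fields packed into the JSON value

    val src = spark.table("hypothetical_db.user_summary")

    val pairs = src.select(
      concat_ws("_", keyFields.map(col): _*).as("tair_key"),
      to_json(struct(valueFields.map(col): _*)).as("tair_value")
    )

    // One Tair client per partition would live here; the client itself is hypothetical.
    pairs.rdd.foreachPartition { rows =>
      // val client = TairClient.connect(...)                 // hypothetical Tair client
      rows.foreach { row =>
        // client.put(row.getString(0), row.getString(1))     // hypothetical write to Tair
        println(s"${row.getString(0)} -> ${row.getString(1)}") // stand-in for the Tair write
      }
    }

    spark.stop()
  }
}
```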

Because the data volume of a Hive table is generally large, reading the data and writing to Tair with a stand-alone program is inefficient, so some business teams decided to implement this logic with Spark. Initially, engineers on the business side wrote their own Spark programs to read data from Hive and write it to Tair (hereinafter the hive2Tair process). This approach has the following problems:

Each business team has to implement a set of logically similar processes on its own, resulting in a lot of repetitive development work.

Because Spark is a distributed computing engine, improper code or parameter settings can easily put enormous pressure on the Tair cluster and affect Tair's normal service.

For the above reasons, we developed a hive2Tair process on Spark and encapsulated it into a standard ETL template, whose format and content are as follows:

source specifies the source Hive table, and target specifies the target Tair library and table. These two parameters can also be used by the scheduling system to resolve the upstream and downstream dependencies of the ETL, so it can easily be plugged into the existing ETL production system.

With this template, users only need to fill in some basic information (the source Hive table, the list of fields that make up the key, the list of fields that make up the value, and the target Tair cluster) to generate a hive2Tair ETL process. The whole generation process requires no Spark background and no code development at all, which greatly lowers the barrier for users, avoids repeated development, and improves development efficiency. When the process runs, a Spark job is generated automatically with relatively conservative parameters: dynamic resource allocation is enabled by default, each Executor has 2 cores and 2 GB of memory, and the maximum number of Executors is set to 100. If there are high performance requirements and a large Tair cluster has been applied for, some tuning parameters can be used to improve write performance. Currently we only expose the number of Executors and the memory per Executor to users, and we set relatively safe maximum values to avoid putting abnormal pressure on the Hadoop and Tair clusters through unreasonable parameter settings.
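For illustration, the conservative defaults described above correspond to standard Spark configuration keys; a sketch of how they might be set is shown below, though the real template presumably passes them as spark-submit --conf flags rather than in code.

```scala
import org.apache.spark.SparkConf

// Standard Spark configuration keys matching the defaults described above.
val conservativeConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")      // dynamic resource allocation on by default
  .set("spark.shuffle.service.enabled", "true")        // needed for dynamic allocation on YARN
  .set("spark.executor.cores", "2")                    // 2 cores per Executor
  .set("spark.executor.memory", "2g")                  // 2 GB per Executor
  .set("spark.dynamicAllocation.maxExecutors", "100")  // at most 100 Executors
```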

User feature platform based on Spark

Before the feature platform existed, data miners on different projects each extracted user feature data according to their own project's needs, mostly through Meituan's ETL scheduling platform on a monthly or daily basis.

From the perspective of user features, however, this involves a lot of repetitive work, since many user features required by different projects are the same. To reduce redundant extraction work and save computing resources, the need for a feature platform arose: the feature platform only needs to aggregate the feature data already extracted by the various developers and provide it to others. The feature platform mainly uses Spark's batch processing capability to complete data extraction and aggregation.

Developers mainly extract features through ETL, and some of the data is processed with Spark, such as statistics on users' search keywords.

Feature data provided by developers must be added to the feature library according to the configuration file format defined by the platform. For example, in the group-purchase configuration file, one feature is the number of payments a user made within 24 hours in the group-purchase business; its input is a generated feature table. After the developer verifies through testing that the data is correct, the data goes online. In addition, for some features only part of the feature data needs to be extracted from an existing table, in which case the developer only needs a simple configuration.

As the figure shows, feature aggregation is divided into two layers. The first layer is the internal aggregation of each business's data. For example, the group-purchase data configuration file contains many group-purchase features, such as payment and browsing, scattered across different tables, and each business has its own independent Spark task to aggregate them into a user group-purchase feature table. Feature aggregation is a typical join task, and it runs roughly 10 times faster than the MapReduce implementation. The second layer aggregates the tables of the various businesses again to generate the final user feature data table.
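A minimal sketch of the two-layer, join-based aggregation described above, with hypothetical table and column names:

```scala
// Assumes a SparkSession named `spark`; database, table, and column names are hypothetical.
val pay    = spark.table("hypothetical_db.groupbuy_pay_features")
val browse = spark.table("hypothetical_db.groupbuy_browse_features")

// Layer 1: aggregate the scattered tables of one business (group purchase) into one feature table.
val groupbuyFeatures = pay.join(browse, Seq("user_id"), "full_outer")

// Layer 2: aggregate the per-business tables into the final user feature table.
val searchFeatures = spark.table("hypothetical_db.search_features")
val userFeatures   = groupbuyFeatures.join(searchFeatures, Seq("user_id"), "full_outer")

userFeatures.write.mode("overwrite").saveAsTable("hypothetical_db.user_features")
```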

The features in the feature library are visualized. When we aggregate the features, we count the number of users covered by each feature and its maximum and minimum values, and then synchronize these statistics to an RDB, so that both managers and developers can understand the features intuitively through visualization. We also provide feature monitoring and alerting: using the feature statistics of the last 7 days, we compare the number of users covered by each feature yesterday and today to see whether it increased or decreased. Take the number of female users covered as an example: if today's coverage is 1% lower than yesterday's (say there were 600 million users in total and 200 million women yesterday, a 1% drop means 1% × 200 million = 2 million fewer), a sudden loss of 2 million female users indicates a serious anomaly in the data, especially since the site's user count grows every day. Such anomalies are reported by email to the platform team and the people involved in extracting that feature.

Spark data mining platform

The data mining platform relies completely on the user feature library: the feature library provides user features, and the data mining platform converts those features and outputs them in a unified format, so that developers can quickly complete model development and iteration. Developing a model used to take two weeks; now it can be finished in as little as a few hours or a few days. Feature transformation includes encoding feature names and smoothing and normalizing feature values; the platform also provides feature discretization and feature selection, all of which are completed offline with Spark.
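As an illustration of these kinds of transformations, here is a sketch using standard spark.ml transformers (indexing, discretization, normalization); the column names are hypothetical and the platform's own encoding and smoothing logic is not public.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{MinMaxScaler, QuantileDiscretizer, StringIndexer, VectorAssembler}

// Assumes a SparkSession named `spark`; column names are hypothetical.
val raw = spark.table("hypothetical_db.user_features")

// Encode a categorical feature into a numeric index.
val indexer = new StringIndexer().setInputCol("city").setOutputCol("city_idx")

// Discretize a continuous feature into buckets.
val discretizer = new QuantileDiscretizer()
  .setInputCol("order_cnt").setOutputCol("order_cnt_bucket").setNumBuckets(10)

// Assemble numeric columns into a vector and normalize it to [0, 1].
val assembler = new VectorAssembler()
  .setInputCols(Array("city_idx", "order_cnt_bucket")).setOutputCol("features_raw")
val scaler = new MinMaxScaler().setInputCol("features_raw").setOutputCol("features")

val transformed = new Pipeline()
  .setStages(Array(indexer, discretizer, assembler, scaler))
  .fit(raw)
  .transform(raw)
```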

After obtaining training samples, developers can use Spark MLlib or Python's scikit-learn to train the model, obtain the optimized model, save it in the model storage format defined by the platform, and provide the relevant configuration parameters. The model can then be brought online through the platform and scheduled to run daily or weekly; of course, if the model needs retraining or other adjustments, the developer can also take it offline. Beyond that, the platform provides a model accuracy alarm: after each prediction run, the model's prediction accuracy is computed on samples provided by the user and compared with the alarm threshold set by the developer. If it falls below the threshold, the developer is notified by email and asked to consider whether the model needs retraining.
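A sketch of the train / schedule / accuracy-alarm flow using Spark ML; the model choice, table names, and threshold are hypothetical, not the platform's actual implementation.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Assumes a SparkSession named `spark`; tables are hypothetical and contain
// "label" and "features" (vector) columns.
val training = spark.table("hypothetical_db.training_samples")
val model = new LogisticRegression()
  .setLabelCol("label").setFeaturesCol("features")
  .fit(training)

// Scheduled daily/weekly prediction on fresh samples supplied by the user.
val samples     = spark.table("hypothetical_db.daily_samples")
val predictions = model.transform(samples)

// Compare prediction accuracy with the developer-supplied alarm threshold.
val accuracy = new MulticlassClassificationEvaluator()
  .setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy")
  .evaluate(predictions)

val alarmThreshold = 0.85   // hypothetical threshold provided by the developer
if (accuracy < alarmThreshold) {
  // sendAlarmMail(...)     // hypothetical notification asking whether to retrain
  println(s"accuracy $accuracy is below threshold $alarmThreshold; consider retraining")
}
```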

We took a detour when developing the model prediction function of the mining platform. At the beginning, the platform's model prediction was made compatible with the Spark interface, that is, Spark was used to save and load model files and to run predictions. Anyone who has used Spark MLlib knows that many of its APIs are private and cannot be used directly by developers, so we wrapped these interfaces and exposed them to developers, but that only solved the problem for Spark developers. The platform also needed to be compatible with the model output, loading, and prediction functions of other platforms, which left us maintaining multiple interfaces per model at a high development and maintenance cost. In the end we gave up on compatibility with the Spark interface and defined our own model storage format, along with our own model loading and model prediction functions.

The above describes Meituan's platform work based on Spark. These platforms and tools serve all of the company's business lines and are designed to spare teams from meaningless repetitive work and improve the company's overall data production efficiency. So far the results have been good: these platforms and tools have been widely recognized and adopted within the company, and of course we have also received many suggestions that drive our continuous optimization.

With the development and promotion of Spark, from upstream ETL to downstream daily data statistics and analysis, recommendation, and search systems, more and more business lines have begun to use Spark for all kinds of complex data processing and analysis work. Below, the interactive user behavior analysis system and the SEM delivery service are taken as examples to introduce how Spark is applied in Meituan's actual business production environment.

The practice of Spark in the interactive user behavior analysis system

Meituan's interactive user behavior analysis system provides interactive analysis of massive traffic data. The system's main users are PMs and operations staff within the company. An ordinary BI reporting system can only provide queries on aggregate indicators such as PV and UV. However, besides looking at aggregate indicators, PMs and operations staff also need to analyze the traffic data of particular user groups according to their own needs, so as to understand the behavior trajectories of various user groups in the app. Based on this data, PMs can optimize product design and operations staff can support their own operational work. The core user requirements include:

Self-service queries: different PMs or operations staff may need to run a variety of analyses at any time, so the system must support self-service use.

Response speed: most analysis functions must complete within a few minutes.

Visualization: analysis results can be viewed visually.

To meet these requirements, the engineers needed to solve two core problems:

Processing massive data: users' traffic data is all stored in Hive, the data volume is very large, and every day's data is on the scale of billions of records.

Computing results quickly: the system must be able to accept analysis tasks submitted by users at any time and compute the results they want within a few minutes.

To solve these two problems, there were two main technologies to choose from: MapReduce and Spark. In the initial architecture we chose the more mature MapReduce, but testing showed that complex analysis tasks developed on MapReduce took several hours to complete, which would have meant a very poor, unacceptable user experience.

We therefore turned to Spark, a memory-based fast big data computing engine, as the core of the system architecture, mainly using the Spark Core and Spark SQL components to implement the various complex business logic. In practice we found that although Spark's performance is excellent, at its current stage of development it still exhibits some performance and OOM issues. So during development we applied various kinds of performance tuning to a large number of Spark jobs, including operator tuning, parameter tuning, shuffle tuning, and data skew tuning. In the end, every Spark job's execution time came down to a few minutes, and the OOM problems caused by shuffles and data skew were resolved in practice, ensuring the stability of the system.
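A few representative examples of the kinds of tuning mentioned above (parameter tuning, operator tuning, and key salting for data skew); the values and data are illustrative only, not Meituan's actual settings.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

object TuningSketch {
  final case class Event(userId: Long, page: String)   // hypothetical log record

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tuning-sketch").getOrCreate()

    // Parameter tuning: raise shuffle parallelism for large SQL shuffles.
    spark.conf.set("spark.sql.shuffle.partitions", "800")

    val events = spark.sparkContext.parallelize(
      Seq(Event(1L, "home"), Event(1L, "deal"), Event(2L, "home")))

    // Operator tuning: reduceByKey does map-side combine, unlike groupByKey.
    val clicksPerUser = events.map(e => (e.userId, 1L)).reduceByKey(_ + _)

    // Data-skew tuning: salt hot keys, aggregate twice, then drop the salt.
    val desalted = events
      .map(e => ((e.userId, Random.nextInt(10)), 1L))
      .reduceByKey(_ + _)
      .map { case ((userId, _), cnt) => (userId, cnt) }
      .reduceByKey(_ + _)

    clicksPerUser.take(10).foreach(println)
    desalted.take(10).foreach(println)
    spark.stop()
  }
}
```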

Combining the above analysis, the final system architecture and workflow are as follows:

The user selects the menu for an analysis function in the system interface, enters the corresponding task creation page, chooses filter conditions and task parameters, and submits the task.

Because the system has to support different kinds of user behavior analysis (more than ten analysis functions are currently provided), a separate Spark job is developed for each analysis function.

The web service, built as the backend with J2EE technology, receives the task submitted by the user, selects the corresponding Spark job according to the task type, and starts a child thread that executes the spark-submit command to submit the Spark job (see the sketch after these workflow steps).

The Spark job runs on the YARN cluster, computes over the massive data in Hive, and finally writes the results into the database.

The user views the analysis results through the system interface, and the J2EE backend returns the computed results from the database to the interface for display.
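Sketch of the job-submission step in the workflow above. The article's backend is J2EE and shells out to spark-submit from a child thread; this sketch uses Spark's SparkLauncher API in Scala as one possible way to do the same thing, with hypothetical paths and class names.

```scala
import org.apache.spark.launcher.SparkLauncher

// Hypothetical helper: submit the Spark job matching the user's task type
// from a child thread so the web request is not blocked.
def submitAnalysisJob(taskType: String, taskId: String): Unit = {
  val submitter = new Thread(new Runnable {
    override def run(): Unit = {
      val process = new SparkLauncher()
        .setMaster("yarn")
        .setAppResource("/opt/jobs/behavior-analysis.jar")      // hypothetical job jar
        .setMainClass(s"com.example.analysis.${taskType}Job")   // hypothetical per-task-type job class
        .addAppArgs(taskId)
        .launch()
      process.waitFor()   // block this child thread until the job exits
    }
  })
  submitter.start()
}
```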

The system has worked well since launch: 90% of Spark jobs finish within 5 minutes and the remaining 10% within about 30 minutes, fast enough to respond quickly to users' analysis needs, and user feedback on the experience has been very positive. The system currently runs hundreds of user behavior analysis tasks every month, effectively and quickly supporting the various analysis needs of PMs and operations staff.

Application of Spark in the SEM delivery service

The Traffic Technology Group is responsible for Meituan's off-site advertising technology. The Spark platform is now widely used in SEM, SEO, DSP, and other services, covering offline mining, model training, stream data processing, and more. Meituan SEM (search engine marketing) delivers hundreds of millions of keywords. A keyword is discovered by the mining strategy and sets out on its SEM journey: it is screened by the prediction model and delivered to the major search engines, and it may later be forced offline by frequent bid adjustments in a competitive market or by poor results. Journeys like this happen every minute at Meituan, and Spark is what keeps such large-scale "migrations" running smoothly.

Spark is used at Meituan not only in SEM keyword mining, prediction model training, delivery effect statistics, and similar scenarios, but also, somewhat unusually, in the keyword delivery service itself, which is the focus of this section. A fast and stable delivery system is the foundation of precision marketing.

Meituan's early SEM delivery service used a stand-alone architecture, and as the number of keywords grew rapidly, the old service's problems were gradually exposed. Constrained by the major search engines' API quotas (request frequency), account structures, and other rules, it is far from enough for the delivery service to merely process API requests; it also has to handle a large amount of business logic. A stand-alone program running multiple processes could barely cope when the data volume was small, but it could not "see the whole picture" for delivery at this scale.

The new SEM delivery service went live in Q2 2015 under the internal code name Medusa. Built on the Spark platform, Medusa takes full advantage of Spark's big data processing capability and provides a high-performance, highly available distributed SEM delivery service with the following features:

Low threshold. The overall design idea of Medusa is to provide a database-like service. At the interface layer, RDs can "add, delete, update, and query" the online keyword list through SQL, just as with a local database, and only need to care about their own strategy tags rather than the physical storage location of the keywords (a small sketch of this interface appears after this feature list). Medusa uses Spark SQL as the service interface, which improves the ease of use of the service and standardizes data storage while providing data support for other services. Building the distributed delivery system on Spark also frees RDs from system-level details; the total code is only about 400 lines.

High performance and scalability. To optimize for both "time" and "space", Medusa uses Spark to compute the optimal storage location of each keyword in the remote account and the best timing and content of each API request. With limited quota and account capacity, it comfortably manages the delivery of keywords at the hundred-million scale online. Delivery performance scales by controlling the number of Executors; in practice, a full rollback of all channels was completed within 4 hours.

High availability. Some readers may wonder whether API requests are a good fit for Spark, since functional programming expects pure functions without side effects (given the same input, the output is determined). This is indeed an issue. Medusa's approach is to encapsulate the API requests in an independent module, keeping that module as close to a side-effect-free "pure function" as possible, and, borrowing the idea of railway-oriented programming, to return all request logs to Spark for further processing and finally land them in Hive, thereby guaranteeing the delivery success rate. To control quota consumption more precisely, Medusa does not retry individual requests; instead it defines a service degradation scheme and records each keyword's journey completely with a very low data-loss rate.
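A sketch of what the database-like Spark SQL interface described in the "low threshold" feature above might look like; the database, table, column names, and strategy tag are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical tables standing in for Medusa's keyword storage.
val spark = SparkSession.builder()
  .appName("medusa-sketch").enableHiveSupport().getOrCreate()

// "Query": inspect the keywords currently online for one strategy tag.
spark.sql(
  """SELECT keyword, bid, engine
    |FROM hypothetical_db.sem_keywords_online
    |WHERE strategy_tag = 'poi_mining_v2'""".stripMargin).show(20)

// "Add": stage a batch of new keywords for delivery under the same tag.
spark.sql(
  """INSERT INTO hypothetical_db.sem_keywords_online
    |SELECT keyword, bid, engine, 'poi_mining_v2' AS strategy_tag
    |FROM hypothetical_db.sem_keywords_candidates""".stripMargin)
```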

This concludes "How to Deploy Spark Cluster Technology at Meituan". Thank you for reading. If you want to learn more about the industry, you can follow the site, where the editor will keep publishing practical articles for you!
