How Apache Kylin is practiced in Baidu Maps

This article summarizes how Apache Kylin is used in production at Baidu Maps: why we chose it, how we extended it, and the problems we solved along the way.
1. Preface
The Data Intelligence Group of Baidu Maps' Open Platform business department is responsible for the big data computation and analysis of Baidu Maps' internal businesses. It processes data at the scale of tens of billions of records per day and provides different businesses with OLAP multi-dimensional analysis query services with millisecond response for a single SQL query.
Baidu Maps' Data Intelligence Group was among the earliest production adopters of Apache Kylin in China. Apache Kylin was open-sourced in November 2014. At that time our team needed to build a complete big data OLAP analysis and computation platform providing millisecond-to-second multi-dimensional query responses for single SQL queries over tens of billions of rows. During technology selection we evaluated Apache Drill, Presto, Impala, Spark SQL, Apache Kylin and others. We set aside Apache Drill and Presto because production cases were scarce, so problems found later would have been hard to discuss with peers, and Apache Drill as a whole was not yet mature. Impala and Spark SQL are mainly memory-based and demand substantial machine resources; a single SQL query can achieve a second-level dynamic response, but an interactive page usually issues several SQL queries, and at very large data scale dynamic computation struggles to meet the requirement. We therefore focused on Apache Kylin, which pre-computes Cubes with MapReduce and serves low-latency queries, and completed our first full production deployment of Apache Kylin around February 2015.
Apache Kylin is an open-source distributed analysis engine that provides a SQL query interface and multi-dimensional (OLAP) analysis on Hadoop for extremely large datasets. It was originally developed at eBay Inc., contributed to the open-source community, and officially graduated to a top-level Apache project in November 2015.
2. The challenge of multi-dimensional analysis on big data
We ran several Cube tests on the Apache Kylin cluster, and the results showed that it effectively addresses three pain points in big data computation and analysis.
Pain point 1: dynamically computing multi-dimensional indicators over tens of billions of rows is too slow. Apache Kylin solves this by pre-computing Cube result sets and storing them in HBase.
Pain point 2: complex filter conditions at query time. Apache Kylin addresses this with its cube-routing search algorithm and an optimized HBase Coprocessor.
Pain point 3: queries spanning large time ranges such as months, quarters, or years. For storing the pre-computed results, Apache Kylin manages each Cube in partitioned Data Segments.
Solving these three pain points lets us achieve millisecond responses for a single SQL query at this data scale, for multi-dimensional analysis products whose data model is fixed in advance. This is why Apache Kylin interested us: in analytical query applications a page usually issues several SQL queries, and if a single query takes 2 seconds and a page issues 5 of them, the page takes roughly 10 seconds to load, which is unacceptable. Kylin's low latency is a decisive advantage when one page fans out into multiple SQL queries.
In practice, driven by the needs of different business lines, the storage and query backend of our team's big data OLAP platform combines Apache Kylin, Impala, and Spark SQL. For small and medium data volumes with ad hoc analysis dimensions and metrics, the platform serves queries with Impala or Spark SQL. For specific products over tens of billions of rows, where query performance requirements are high but the dimensions and metrics to analyze are known in advance, we use Apache Kylin. The rest of this article focuses on the practical use of Apache Kylin in Baidu Maps.
3. Big data OLAP platform system architecture
Main modules
Data access: responsible for fetching the finest-grained fact-table data the business needs from the data warehouse side.
Task management: responsible for executing and managing Cube-related tasks.
Task monitoring: responsible for tracking the status of Cube tasks during execution and the corresponding operational handling.
Cluster monitoring: covers monitoring of the Hadoop ecosystem processes and the Kylin process.
Cluster environment
Because of the particularities of the business, we do not use the company's internal Hadoop cluster for computation, storage and queries; instead we deployed a complete cluster and maintain it independently.
Cluster machines: 4 in total, 1 master (100 GB memory) + 3 slaves (30 GB memory each).
Software environment: CDH + Hive + HBase + Kylin
4. Secondary development based on Apache Kylin
4.1 Secondary development of the data access module
For any data computing and processing platform, data access is critical; even well-known engines such as Spark put great weight on it. At present our big data OLAP platform supports two kinds of data sources: MySQL and HDFS. One problem we ran into in practice: if a data source carries no explicit flag marking that the T-1 day's data has finished being produced, how do we know it is ready? For Hive data sources, we check whether the partition holding the data is registered in the Hive metastore. For MySQL, the approach we currently use is to poll the row count for that day at a fixed interval; if the count stops changing across a polling cycle, we can reasonably assume the warehouse has finished producing the T-1 data. We then trigger the data-pull module, which pulls the data into the local file system of the Master node, applies whatever business-specific processing is needed, imports it into Hive on the Master, and kicks off the tasks for every cube based on that fact table, starting the MapReduce computation. Once the computation finishes, the frontend can refresh and see the latest data. If by a specified time the warehouse data is still not ready, the data access module sends an SMS alert to the warehouse team and keeps polling until a cutoff time, at which point it gives up.
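A minimal sketch of this readiness check, assuming a `dt`-partitioned MySQL table and Kylin's v1 REST rebuild endpoint; the table name, host, credentials and intervals are illustrative assumptions, not our production values:

```python
# Hedged sketch: poll a MySQL source until the T-1 row count stabilizes,
# then trigger the cube build through Kylin's REST API.
import time
import pymysql
import requests

def mysql_rows(conn, table, day):
    """Count rows for one day; a stable count across checks suggests T-1 is complete."""
    with conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {table} WHERE dt = %s", (day,))
        return cur.fetchone()[0]

def wait_until_ready(conn, table, day, interval_s=600, deadline_s=6 * 3600):
    last, waited = -1, 0
    while waited < deadline_s:
        n = mysql_rows(conn, table, day)
        if n > 0 and n == last:          # unchanged over one full polling cycle
            return True
        last = n
        time.sleep(interval_s)
        waited += interval_s
    return False                          # caller sends the SMS alert and gives up

def trigger_cube_build(cube, start_ms, end_ms):
    # Kylin exposes a REST endpoint for (re)building a cube segment; the exact
    # path and payload follow the v1 API and should be checked per release.
    requests.put(
        f"http://kylin-host:7070/kylin/api/cubes/{cube}/rebuild",
        json={"startTime": start_ms, "endTime": end_ms, "buildType": "BUILD"},
        auth=("ADMIN", "KYLIN"),
    )
```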
Data access module
4.2 Secondary development of task management module
Task management is core to any computing platform service, and it is one of the key extensions of our big data OLAP multi-dimensional analysis platform. In Apache Kylin, the minimum storage unit of a Cube is the data segment. Like a Hive partition, a data segment is denoted by a left-closed, right-open interval, e.g. [2015-11-01, 2015-11-02), which contains the data of 2015-11-01. Cube data is managed at data segment granularity through roughly three operations: build, refresh and merge. For a given product, its data is routinely computed into the cube every day; normally one data segment is generated per day, but a single segment may cover 2 or more days when the warehouse tasks are delayed. Over time, the accumulation of many small data segments both hurts performance and makes management painful. We therefore settled on one data segment per natural month per cube, which is clear and easy to manage.
Suppose one month's 30 days of data ended up in 23 data segments, for example: [2015-11-01, 2015-11-02), [2015-11-02, 2015-11-04), [2015-11-04, 2015-11-11), [2015-11-11, 2015-11-12), [2015-11-12, 2015-11-13), …, [2015-11-30, 2015-12-01).
Question 1: suppose we need to backfill the data of 2015-11-01 because something was wrong with it. The cube contains a data segment [2015-11-01, 2015-11-02) that exactly matches this time range, so we can directly start a refresh of that data segment through the web interface or the REST API.
Question 2: suppose we need to backfill the data from 2015-11-02 to 2015-11-03. Likewise, we find the matching data segment [2015-11-02, 2015-11-04) and refresh it.
Question 3: suppose we need to backfill the data from 2015-11-01 to 2015-11-02. No single data segment covers this interval, so there are two options. The first is to refresh the two data segments [2015-11-01, 2015-11-02) and [2015-11-02, 2015-11-04) one after the other. The second is to first merge [2015-11-01, 2015-11-02) and [2015-11-02, 2015-11-04) into one data segment, then pull the new data and refresh that merged segment.
Question 4: suppose we need to refresh the whole month, 2015-11-01 through 2015-11-30 (we tested the same cube on another cluster running Kylin 1.1.1). With the first scheme from question 3 we would refresh all 23 data segments one by one, at roughly 17.93 minutes per refresh. With the second scheme we first merge the 23 data segments into one data segment [2015-11-01, 2015-12-01), then run a single refresh: two operations in total, taking about 83.78 minutes overall, far better than the first approach.
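A hedged sketch of the two backfill strategies from questions 3 and 4, driven through Kylin's REST API (the `/rebuild` path and payload follow the v1 API and should be verified against your release; host, cube name and credentials are placeholders):

```python
# Timestamps are epoch milliseconds at UTC midnight, as Kylin expects.
from datetime import datetime, timezone
import requests

KYLIN = "http://kylin-host:7070/kylin/api"
AUTH = ("ADMIN", "KYLIN")

def ms(day):
    return int(datetime.strptime(day, "%Y-%m-%d")
               .replace(tzinfo=timezone.utc).timestamp() * 1000)

def rebuild(cube, start, end, build_type):
    # build_type is one of "BUILD", "REFRESH", "MERGE"
    r = requests.put(f"{KYLIN}/cubes/{cube}/rebuild",
                     json={"startTime": ms(start), "endTime": ms(end),
                           "buildType": build_type}, auth=AUTH)
    r.raise_for_status()

# Scheme 1: refresh each existing segment in turn (23 jobs for the month), e.g.
# rebuild("demo_cube", "2015-11-01", "2015-11-02", "REFRESH"); ...

# Scheme 2: one MERGE over the month, then one REFRESH of the merged segment.
rebuild("demo_cube", "2015-11-01", "2015-12-01", "MERGE")
rebuild("demo_cube", "2015-11-01", "2015-12-01", "REFRESH")
```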
To address the problems above, we carried out secondary development on Apache Kylin and extended its task management module.
For the cube build operation: suppose the warehouse data is delayed and the data up through 2015-12-03 (i.e., several days of T-1 data) arrives at once. Without special handling, a single data segment spanning [2015-11-29, 2015-12-03) would be generated, crossing the month boundary. Our logic therefore automatically checks for cross-month ranges before each cube build and generates two data segments instead: [2015-11-29, 2015-12-01) and [2015-12-01, 2015-12-03).
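A minimal sketch of the month-boundary split described above (names are illustrative; the actual check lives in our build logic):

```python
# Split a pending build range at natural-month boundaries so that no
# data segment ever spans two months.
from datetime import date, timedelta

def split_at_month_boundaries(start: date, end: date):
    """Yield left-closed, right-open [s, e) ranges that never cross a month."""
    s = start
    while s < end:
        first_of_next = (s.replace(day=1) + timedelta(days=32)).replace(day=1)
        e = min(end, first_of_next)
        yield (s, e)
        s = e

# The delayed-build case from the text:
print(list(split_at_month_boundaries(date(2015, 11, 29), date(2015, 12, 3))))
# -> [(2015-11-29, 2015-12-01), (2015-12-01, 2015-12-03)]
```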
For the cube refresh operation, we use the second scheme from questions 3 and 4: automatically merge the covering data segments first, then run the refresh. Because the build logic above guarantees that no data segment crosses a month boundary, this automatic merge can never produce a segment spanning two natural months.
For the cube merge operation: if every day we automatically merged all data segments up to the previous date within the natural month, then backfilling, say, 2015-11-11 would force us to go back to 2015-11-01 (because that whole range would already have been merged into one segment), even though the data of 2015-11-01 through 2015-11-10 does not need recomputing. For the current natural month we therefore keep one data segment per day: problems are most likely to be found in the last few days' data, and backfilling a small single-day segment is cheap and more reasonable. Last month's data, by contrast, has usually stabilized by the middle of the following month and is rarely backfilled, so it makes more sense to merge the whole previous month into one segment once, during the first or middle ten days of the new month, rather than re-merging it every day.
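A hedged sketch of this merge policy; the 10-day "stabilization" threshold is the judgment call described above, not a fixed Kylin setting:

```python
# Keep daily segments for the current month (cheap, targeted backfills) and
# merge last month into one segment only once the new month is ~10 days old.
from datetime import date, timedelta

def monthly_merge_range(today: date, stabilize_days: int = 10):
    """Return the [start, end) range to merge, or None if it is too early."""
    if today.day < stabilize_days:
        return None                       # last month may still be backfilled
    first_this_month = today.replace(day=1)
    first_last_month = (first_this_month - timedelta(days=1)).replace(day=1)
    return (first_last_month, first_this_month)

print(monthly_merge_range(date(2015, 12, 15)))   # (2015-11-01, 2015-12-01)
print(monthly_merge_range(date(2015, 12, 5)))    # None: wait for stability
```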
Task management module
4.3 Secondary development of the platform monitoring module
4.3.1 Task monitoring
Usually a product corresponds to multiple pages, each page to one fact table, and each fact table to multiple cubes, so one product typically involves many cubes. Since each cube runs the three task types above at data segment granularity, checking them by hand is impractical, and monitoring task execution is essential. After a task is submitted, we check its status at a fixed interval; if the task fails, or when it finally succeeds, an email or SMS alert notifies the user.
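A minimal polling loop for this kind of task monitor; the jobs endpoint and status values follow Kylin's v1 REST API (verify per release), and the alert hook is a placeholder assumption:

```python
import time
import requests

def poll_job(job_id, interval_s=300):
    while True:
        r = requests.get(f"http://kylin-host:7070/kylin/api/jobs/{job_id}",
                         auth=("ADMIN", "KYLIN"))
        status = r.json().get("job_status")
        if status in ("FINISHED", "ERROR", "DISCARDED"):
            notify(job_id, status)        # email/SMS hook, implementation-specific
            return status
        time.sleep(interval_s)

def notify(job_id, status):
    print(f"job {job_id} ended with status {status}")  # stand-in for SMS/email
```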
4.3.2 Cluster monitoring
Because the cluster is deployed and maintained independently within the team, we developed a separate monitoring module to efficiently track the process health of the whole Hadoop, Hive, HBase and Kylin stack, and to clean up the large volume of temporary files. As soon as something goes wrong in the cluster, we receive an SMS or email alert.
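A hedged sketch of such a process check; the expected daemon names and the alert hook are illustrative (how Hive and Kylin processes show up under `jps` varies by deployment):

```python
# Verify the expected Hadoop/HBase daemons are alive and alert when one is not.
import subprocess

EXPECTED = ["NameNode", "DataNode", "HMaster", "HRegionServer"]  # illustrative

def running_java_daemons():
    out = subprocess.run(["jps"], capture_output=True, text=True).stdout
    return {line.split()[1] for line in out.splitlines() if len(line.split()) > 1}

def check_and_alert():
    alive = running_java_daemons()
    for proc in EXPECTED:
        if proc not in alive:
            print(f"ALERT: {proc} not running")  # stand-in for SMS/email alert
```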
Platform monitoring module
4.4 Secondary development of resource isolation
Because we offer the platform to multiple business lines, when one line's computation grows large enough to exhaust the platform's existing resources, we ask that business to contribute machines according to its actual needs, which raises the question of how to isolate compute queues per contribution. The official Apache Kylin release at the time supported only a single kylin_job_conf.xml for the whole cluster, so all three cube operations of every project on the platform had to share one queue. Based on the kylin-1.1.1-incubating source tag, we therefore modified Kylin to support resource isolation at project granularity and submitted the issue KYLIN-1241 (https://issues.apache.org/jira/browse/KYLIN-1241). The solution fits scenarios like ours, where the platform administrators also take part in project development. If a project does not need a dedicated compute queue, or no project-specific kylin_job_conf.xml exists under $KYLIN_HOME, the system falls back to the original official logic and computes in the default Hadoop queue.
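A sketch of the lookup idea behind this project-granularity isolation (the actual patch is in KYLIN-1241; the per-project conf path shown here is an assumption for illustration):

```python
# Prefer a per-project kylin_job_conf.xml, which can pin that project's
# MapReduce jobs to a dedicated YARN queue; otherwise fall back to the default.
import os

def job_conf_for(project: str) -> str:
    kylin_home = os.environ["KYLIN_HOME"]
    per_project = os.path.join(kylin_home, "conf", project, "kylin_job_conf.xml")
    default = os.path.join(kylin_home, "conf", "kylin_job_conf.xml")
    return per_project if os.path.exists(per_project) else default
```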
Resource isolation
4.5 Hadoop and HBase optimization
Because the hardware of the independently deployed Hadoop cluster is modest and memory is scarce, we hit many problems in practice.
4.5.1 Insufficient memory for Hadoop tasks causing cube build failures
Adjusting the MapReduce resource allocation parameters: during cube computation, MapReduce tasks would fail, and the logs showed the failures were mainly due to insufficient memory allocated to MapReduce. We therefore tuned yarn.nodemanager.resource.memory-mb, mapreduce.map.memory.mb, mapreduce.map.java.opts, mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts to match the actual tasks.
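For orientation, here is an illustrative (not prescriptive) set of values for the parameters named above, as one might place them in kylin_job_conf.xml / yarn-site.xml; the right values depend entirely on each node's physical memory:

```python
mr_memory_settings = {
    "yarn.nodemanager.resource.memory-mb": 24576,   # memory YARN may hand out per node
    "mapreduce.map.memory.mb": 2048,                # container size for each map task
    "mapreduce.map.java.opts": "-Xmx1638m",         # JVM heap ~80% of the container
    "mapreduce.reduce.memory.mb": 4096,             # reducers usually need more
    "mapreduce.reduce.java.opts": "-Xmx3276m",
}
```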
4.5.2 HBase RegionServers going down at random on different nodes
Due to the machines' overall resource constraints, we configured a relatively small HBASE_HEAPSIZE for HBase. As the platform took on more and more projects over time, memory and compute demands grew, and HBase RegionServers began to die at random on different nodes, making HBase unavailable and breaking Kylin's query service. This problem troubled the team for a long time. Drawing on online material and our own experience, we tuned the relevant HBase and Hadoop parameters, listed below (with a consolidated sketch after the list).
A. Tune HBase's JVM GC parameters and enable HBase's MSLAB feature: GC tuning yields better GC behaviour, reducing both the duration of a single GC and the frequency of full GCs.
B. Tune the HBase-side ZK connection timeout: the default ZK timeout is too short, and once a full GC occurs it very easily causes a ZK session timeout.
C. Tune the ZK server's maxSessionTimeout: the ZK client's timeout (e.g., the HBase client's) must fall within the server's allowed range, otherwise the client-side timeout setting has no effect.
D. Tune HBASE_OPTS: enable the CMS garbage collector and increase the PermSize and MaxPermSize values.
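To make items A-D concrete, here is a consolidated, hedged view of where each knob lives; the setting names are the standard HBase/ZooKeeper ones, but the values are examples to adapt, not the ones we used:

```python
hbase_stability_tuning = {
    # A: MSLAB reduces memstore heap fragmentation, smoothing GC behaviour
    "hbase.hregion.memstore.mslab.enabled": "true",            # hbase-site.xml
    # B: client-side ZK session timeout, long enough to survive a full-GC pause
    "zookeeper.session.timeout": 120000,                       # hbase-site.xml, ms
    # C: server-side cap; client timeouts are negotiated within server bounds
    "maxSessionTimeout": 180000,                               # zoo.cfg, ms
    # D: HBASE_OPTS in hbase-env.sh: CMS collector plus a larger PermGen
    "HBASE_OPTS": "-XX:+UseConcMarkSweepGC -XX:PermSize=128m -XX:MaxPermSize=256m",
}
```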
Hadoop and HBase optimization
5. Apache Kylin project practice
5.1 Build cubes on fact tables already joined at the warehouse side, to reduce Hive join pressure on a small cluster
For Cube design there is ample guidance in the documentation, e.g.: a cube should not exceed 15 dimensions, dimensions with larger cardinality should come first, dimension values should not be too large, dimension hierarchies should be set up, and so on.
In practice we split a product's requirements across multiple pages for development. Each page's queries run against cubes built from one fact table, and each page corresponds to several dimension tables and one fact table. Dimension tables live in MySQL and are managed centrally by the warehouse team; fact tables are computed and stored in HDFS. The fact table stores only dimension ids, never dimension names, for three reasons. First, it shrinks the fact table. Second, our Hadoop cluster is a small standalone deployment with limited MapReduce capacity, so we want joins completed on the warehouse side rather than burdening the Kylin cluster with Hive joins. Third, it reduces backfill cost: if dimension names were stored in the cube, any name change would force a backfill of the entire cube, which is very expensive. One might ask: with only dimension ids in the fact table, don't the query results need a join to show dimension names? For a product page, the dimension id is passed to the backend with the query; the names corresponding to the ids come from the MySQL dimension tables, and can be queried once and kept as an id-to-name dimension map for later use. Meanwhile, the visible area of a page is limited: however large the total result set, each page returns only a bounded number of qualifying fact rows, so we can map each id column to its name through the saved dimension map, effectively performing the traditional id-name join in the front-end logic.
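A sketch of this "join in the front end" idea, assuming result rows arrive as tuples with the dimension id in the first column; the table and column names are illustrative:

```python
# Query Kylin with dimension ids, then map ids to display names using a
# dictionary loaded once from the MySQL dimension table.
def load_dim_map(conn, table="dim_city"):
    with conn.cursor() as cur:
        cur.execute(f"SELECT id, name FROM {table}")
        return dict(cur.fetchall())

def decorate_rows(rows, dim_map, id_col=0):
    """Replace the dimension id column with its display name for one page of results."""
    return [(dim_map.get(r[id_col], "unknown"),) + tuple(r[1:]) for r in rows]
```

Because only one visible page of fact rows is decorated, the per-request cost of this pseudo-join stays tiny even when the full result set is very large.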
5.2 Aggregation cubes assist the calculation of mid- and high-level dimension metrics, solving data inflation in upward roll-ups
For example, our fact table has a detail partition holding the finest-grained data by os and appversion (the cuid-level computation is handled on the warehouse side). The cube design also uses os and appversion, with a hierarchy in which os is the parent of appversion. For the combined dimension os+appversion (group by os, appversion), the user counts are correct; but counting users by os alone (group by os) from a cube built on the detail partition means rolling up from below. Suppose many users run Android 8.0 in the morning and upgrade to Android 8.1 in the afternoon: rolling the android+8.0 and android+8.1 combined dimensions up to the single dimension os=android double-counts those users, so the data inflates and becomes inaccurate. We therefore add an agg partition to the fact table that already contains the os-level results deduplicated by cuid. When a user requests an os-level summary, Apache Kylin computes the eligible candidate cubes with its routing algorithm and ranks them by weight (anyone familiar with BI products such as MicroStrategy will recognize this pattern), and the selector picks the os single-dimension agg cube built on the agg partition rather than rolling the detail-partition cube up from the finest granularity, which keeps the data correct.
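A tiny, self-contained demonstration of the inflation problem, with made-up users and versions:

```python
# Summing per-appversion distinct users up to the os level double-counts
# any user who switched versions that day.
events = [("u1", "android", "8.0"), ("u1", "android", "8.1"),  # u1 upgraded
          ("u2", "android", "8.0")]

by_version = {}
for uid, os_, ver in events:
    by_version.setdefault((os_, ver), set()).add(uid)

rolled_up = sum(len(users) for (os_, _), users in by_version.items() if os_ == "android")
true_count = len({uid for uid, os_, _ in events if os_ == "android"})
print(rolled_up, true_count)   # 3 vs 2: why the agg partition stores the
                               # cuid-deduplicated os-level result directly
```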
5.3 Adding retention analysis: how can history be updated more efficiently?
On a small cluster, compute resources are precious. Suppose a project analyzes retention over many windows: 1-day through 30-day retention, 1-week through 4-week retention, and 1-month through 4-month retention. The traditional storage scheme then runs into trouble.
5.3.1 Traditional scheme
If today is 2015-12-02, the data we actually have is for 2015-12-01.
The idea of this storage scheme: when today is 2015-12-02, the active users of 2015-12-01 are known, so we fill in day-1 retention on the 2015-11-30 row, day-2 retention on 2015-11-29, day-3 retention on 2015-11-28, and so on, updating those columns (the red diagonal in the figure). Each column of a row is anchored to that row's date: when the nth day after a date arrives, that date's day-n retention is filled in. A single daily task therefore cascades updates into many days of history, namely the cells along the red diagonal.
Advantages of this solution:
A. If you want to view one or more metrics within a time range, you can select the desired columns directly by time interval.
B. If you want to view multiple metrics for a particular day, you can select them directly from that day's row.
Disadvantages of this scheme:
A. Historical data must be updated every day (the red-diagonal cells), generating a large number of MapReduce cube pre-computation tasks and demanding considerable machine resources.
B. If new retention windows are added later, such as half-year or one-year retention, the diagonal grows longer: more days of history must be updated every day, and tasks take longer to run.
C. Cascading updates over large amounts of historical data depend critically on all of the project's cubes being cascade-updated correctly every day, which is complex, hard to maintain, and hard to monitor; the data warehouse side faces the same problem.
D. For batch backfills spanning a large time interval, the task computation difficulties discussed in question 3 become especially acute.
5.3.2 Improved scheme
If today is 2015-12-02, what we compute is the actual data of 2015-12-01 (compare with the structure above).
The idea of this scheme: when today is 2015-12-02 (i.e., we have the data of 2015-12-01), the data is stored as in the example above, but the day-n retention column of a row now refers to the cohort from n days before that row's date. This is equivalent to rotating the red diagonal into a single row.
Advantages of this solution:
A. If you want to view a given metric over a time range, you can directly select that column over the range.
B. If new retention windows are added later, such as half-year or one-year retention, no cascading update of historical days is needed: only the row for 2015-12-01 is updated. The daily update cost stays O(1), with no heavy demands on machine resources.
Disadvantages of this scheme:
A. If a query involves multiple column metrics for one day or a time range, the front end needs special retention-analysis logic that slides the corresponding time window, picks different columns from different rows, and then renders the result.
In the current project we use a flexible combination of these storage schemes.
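A compact sketch contrasting the write patterns of the two layouts, using the example date from the text; the retention horizons are illustrative:

```python
# Traditional layout: cell (d, n) means "day-n retention of cohort d", so the
# value computable on date d+n must be written back into row d (the diagonal).
# Rotated layout: cell (d, n) means "day-n retention of cohort d-n", so every
# value computable on date d lands in row d: one row written per day.
from datetime import date, timedelta

def cells_to_write(today: date, horizons=(1, 2, 3, 7, 30)):
    traditional = [(today - timedelta(days=n), n) for n in horizons]  # n rows touched
    rotated = [(today, n) for n in horizons]                          # single row
    return traditional, rotated

trad, rot = cells_to_write(date(2015, 12, 2))
print(len({r for r, _ in trad}), "rows updated vs", len({r for r, _ in rot}))
```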