Using Spark+CarbonData to replace Impala instance Analysis


This article analyzes a practical case of using Spark+CarbonData to replace Impala: the problems encountered with Impala, the technology selection that followed, and the lessons learned when the solution went into production.

A mobile carrier site in China uses Impala to process telecom detail records, roughly 100 TB and more than 10 billion records per day. The following problems arose while using Impala:

The detail records are stored in Parquet format and the data table is partitioned by time + MSISDN. When a query does not hit the partition columns, Impala's query performance is poor.

Many performance problems were encountered with Impala, such as slow metadata synchronization caused by the growth of catalog metadata and poor performance under concurrent queries.

Impala is an MPP architecture and scales to only about 100 nodes. Once concurrency reaches roughly 20 queries, the whole system is saturated, and adding nodes does not improve throughput further.

Impala resources cannot be managed and scheduled through YARN, so the Hadoop cluster cannot share resources dynamically among Impala, Spark, Hive and other components, and resource isolation cannot be guaranteed when the detail-record query capability is opened to third parties.

Facing this series of problems, the customer at this mobile site asked us for a solution. Our big data team analyzed the issues and carried out a technology selection. Using several typical business scenarios of this site as input, we built prototypes and ran performance verification on Spark+CarbonData, Impala 2.6, HAWQ, Greenplum and SybaseIQ, optimizing CarbonData's data loading and query performance for our scenarios and contributing the improvements back to the CarbonData open source community. In the end we chose Spark+CarbonData, a typical SQL on Hadoop solution, which also reflects the broader trend of traditional data warehouses migrating to SQL on Hadoop. Based on the community documentation, combined with our own verification, testing and understanding: CarbonData is a high-performance data storage solution for the Hadoop big data ecosystem, especially suited to large data volumes. It is deeply integrated with Spark and is compatible with Spark's ecosystem features (SQL, ML, DataFrame, etc.). Spark+CarbonData lets a single copy of data serve multiple business scenarios. It provides the following capabilities:

Storage: both columnar and row-based file storage; the columnar format is similar to Parquet and ORC, the row format is similar to Avro. Multiple index structures are supported for data such as call detail records, logs and transaction streams.

Computing: deep integration and optimization with the Spark compute engine; supports integration with Presto, Flink, Hive and other engines.

Interface:

API: compatible with native Spark APIs such as DataFrame, MLlib, PySpark, etc.

SQL: compatible with Spark SQL syntax, while supporting CarbonData's SQL extensions (update/delete, indexes, pre-aggregate tables, etc.).

Data Management:

Supports incremental data storage and batch data management (data aging)

Supports data update and delete

Supports integration with Kafka for near-real-time ingestion

For a detailed introduction to and usage of these key technologies, see the official documentation at https://carbondata.apache.org/.
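
As a rough illustration of the Spark integration and SQL extensions listed above, here is a minimal sketch assuming the CarbonData 1.x CarbonSession API; the store path, table name (cdr_detail) and columns are hypothetical and only for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// Create a Spark session with CarbonData support (CarbonData 1.x style API).
val carbon = SparkSession.builder()
  .appName("carbondata-example")
  .getOrCreateCarbonSession("hdfs://ns1/carbon/store")

// Create a CarbonData table; "STORED BY 'carbondata'" is the 1.x DDL form.
carbon.sql("""
  CREATE TABLE IF NOT EXISTS cdr_detail (
    msisdn     STRING,
    start_time TIMESTAMP,
    duration   INT,
    home_area  STRING
  )
  STORED BY 'carbondata'
""")

// CarbonData SQL extensions: row-level update and delete.
carbon.sql("UPDATE cdr_detail SET (home_area) = ('0755') WHERE msisdn = '13800000000'")
carbon.sql("DELETE FROM cdr_detail WHERE start_time < '2018-01-01 00:00:00'")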

Here is a supplementary explanation of why SQL on Hadoop technology was chosen as the final solution.

Anyone who has worked with big data knows its 5V characteristics. From traditional Internet data to mobile Internet data, and now to the increasingly popular IoT, each step forward in the industry has increased data volumes by two to three orders of magnitude, and growth is still accelerating. Hence the five characteristics now attributed to big data spanning the Internet, mobile Internet and IoT: Volume, Velocity, Variety, Value, Veracity. As data volumes grow, traditional data warehouses face more and more challenges.

Challenges facing traditional data warehouses:

At the same time, data systems themselves are constantly evolving:

Evolution of storage: offline / nearline -> fully online

Evolution of storage architecture: centralized storage -> distributed storage

Evolution of the storage model: fixed structure -> flexible structure

Evolution of data processing patterns: fixed model, fixed algorithm -> flexible model, flexible algorithm

Evolution of data processing types: structured, centralized, single-source computing -> multi-structured, distributed, multi-source computing

Evolution of the data processing architecture: static database processing -> real-time / streaming / massive-scale data processing

Regarding the changes described above, Kimball, regarded as a father of the data warehouse, put forward a point of view.

Kimball's core point:

Hadoop changes the data processing mechanism of the traditional data warehouse: a single processing unit in a traditional database is decoupled into three layers in Hadoop:

Storage tier: HDFS

Metadata layer: HCatalog

Query layer: Hive, Impala, Spark SQL

Schema on Read gives users more choices:

Data is imported into the storage tier in its original format

Manage target data structures through the metadata layer

It is up to the query layer to decide when to extract the data

After long-term exploration, once users are familiar with the data, they can adopt the Schema on Write mode and solidify intermediate tables to improve query performance. A sketch of this progression follows.
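
To make the Schema on Read / Schema on Write distinction concrete, here is a hedged Spark sketch (the raw path, source format, table and column names are made up for illustration, and it assumes the CarbonData 1.x CarbonSession and CTAS support): raw files are explored in their original format first, and only once the access pattern is understood is an intermediate table solidified.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder()
  .appName("schema-on-read-example")
  .getOrCreateCarbonSession("hdfs://ns1/carbon/store")

// Schema on Read: query the raw files in their original format, no up-front conversion.
carbon.read
  .option("header", "true")
  .csv("hdfs://ns1/raw/cdr/2018-09/")
  .createOrReplaceTempView("cdr_raw")

// Schema on Write: solidify an intermediate table (CTAS into CarbonData)
// so that hot queries no longer re-parse the raw files.
carbon.sql("""
  CREATE TABLE cdr_daily_summary
  STORED BY 'carbondata'
  AS SELECT msisdn,
            to_date(start_time) AS call_date,
            count(*)            AS call_cnt,
            sum(duration)       AS total_duration
     FROM cdr_raw
     GROUP BY msisdn, to_date(start_time)
""")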

Comparison of data processing modes, RDBMS-based vs. Hadoop-based:

1. RDBMS-based: strong consistency. Hadoop-based: eventual consistency, with processing efficiency valued over strict data accuracy.

2. RDBMS-based: data must be converted, otherwise subsequent processing cannot continue. Hadoop-based: data can be kept long-term in its original format without conversion.

3. RDBMS-based: data must be cleansed and modeled. Hadoop-based: up-front data cleansing and normalization are not required.

4. RDBMS-based: data is mostly stored in physical tables, and file access is inefficient. Hadoop-based: data is mostly stored in files, and a physical table is simply a structured file.

5. RDBMS-based: metadata is limited to dictionary tables. Hadoop-based: metadata is extended into the HCatalog service.

6. RDBMS-based: SQL is the only data processing engine. Hadoop-based: open data processing engines, including SQL, NoSQL and Java APIs.

7. RDBMS-based: the data processing workflow is fully controlled by IT staff. Hadoop-based: data engineers, data scientists and data analysts can all take part in data processing.

SQL on Hadoop data warehouse technology stack:

Data processing and analysis: SQL on Hadoop engines such as Kudu+Impala, Spark, HAWQ, Presto and Hive.

Data modeling and storage: Schema on Read, with file formats such as Avro, ORC, Parquet and CarbonData.

Stream processing: Flume + Kafka + Spark Streaming.

The development and maturing of SQL-on-Hadoop technology is what drives this change.

After the above technical analysis, we finally chose SQL on Hadoop as the evolution direction of our platform's data warehouse. Some will ask why we did not choose MPPDB technology. We compared SQL on Hadoop with MPPDB as follows (note that Impala is in fact an MPPDB-like technology):

Fault tolerance. SQL on Hadoop: supports fine-grained fault tolerance, meaning a failed task is automatically retried without resubmitting the whole query. MPPDB: coarse-grained fault tolerance, and straggler nodes cannot be handled; a failed task causes the whole query to fail, after which the system resubmits the entire query to get a result.

Scalability. SQL on Hadoop: the cluster can scale to hundreds or even thousands of nodes. MPPDB: difficult to scale beyond 100 nodes, usually around 50 (for example, in our Greenplum verification, performance degraded beyond 32 machines).

Concurrency. SQL on Hadoop: concurrency grows nearly linearly as cluster resources increase. MPPDB: each query tries to use as many resources as possible to maximize its own performance, so supported concurrency is low; typically the whole system is saturated at around 20 concurrent queries.

Query latency. SQL on Hadoop: with less than 1 PB of data and single tables at the billion-record level, single-query latency is usually around 10 s; above 1 PB, query performance can be maintained by adding cluster resources. MPPDB: with less than 1 PB of data and single tables at the billion-record level, single-query latency is usually sub-second or even milliseconds; above 1 PB, query performance may drop sharply due to architectural limits.

Data sharing. SQL on Hadoop: storage and compute are separated, and a common storage format can serve different data analysis engines, including data mining. MPPDB: the database uses its own storage format, which other analysis engines cannot read directly.

Since this site went live with Spark+CarbonData as the Impala replacement at the end of September 2018, it has processed more than 100 TB of data per day. During business peaks, average per-node data loading throughput rose from about 60 MB/s with the previous Impala solution to about 100 MB/s on the new platform. In typical business scenarios with 20 concurrent queries, the query performance of Spark+CarbonData is more than twice that of Impala+Parquet.

At the same time, the following problems are solved:

Hadoop cluster resource sharing: Impala resources could not be scheduled and managed by YARN in a unified way, whereas Spark+CarbonData can share resources dynamically with Spark, Hive and other components under unified YARN resource scheduling.

Hadoop cluster scalability: previously Impala could only use about 100 machines, whereas Spark+CarbonData can scale to clusters of thousands of nodes.

Matters needing attention during implementation:

Data loading uses CarbonData's local sort. To avoid creating too many small files on a large cluster, only a small number of machines are assigned to data loading. In addition, for tables that load a relatively small amount of data each time, a table-level compaction can be triggered to merge the small files generated during loading, as in the sketch below.
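
A rough sketch of the loading and compaction steps just described, assuming CarbonData 1.x DDL/DML, the hypothetical cdr_detail table from the earlier example, and an illustrative staging path; the table is assumed to use local sort (CarbonData's default sort scope):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder()
  .appName("carbondata-loading")
  .getOrCreateCarbonSession("hdfs://ns1/carbon/store")

// Load a batch of detail records; with local sort each node sorts its own data,
// which keeps load throughput high.
carbon.sql("""
  LOAD DATA INPATH 'hdfs://ns1/staging/cdr/2018-09-30/'
  INTO TABLE cdr_detail
  OPTIONS('DELIMITER'=',', 'HEADER'='false')
""")

// For tables that receive many small loads, trigger a table-level compaction
// to merge the small files produced during loading ('MAJOR' merges more aggressively).
carbon.sql("ALTER TABLE cdr_detail COMPACT 'MINOR'")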

Based on the query characteristics of the business, the fields that are frequently queried and filtered are set as the table's sort columns (for telecom services, for example, the frequently queried subscriber numbers). The sort columns are ordered by query frequency from high to low; if the query frequencies are similar, the fields are ordered by number of distinct values from high to low, to improve query performance. A combined example with the blocksize setting appears after the next point.

Set the blocksize of the data table appropriately. The data file block size of a single table can be defined through TBLPROPERTIES, in MB; the default is 1024 MB. The value is chosen based on how much data is loaded into the table each time. From our practical experience, tables with a small data volume are generally set to 256 MB and tables with a large data volume to 512 MB.
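
Putting the sort-column and blocksize recommendations together, here is a hedged DDL sketch; the column names, their ordering and the property values are only illustrative, and it assumes the CarbonData 1.x 'SORT_COLUMNS', 'SORT_SCOPE' and 'TABLE_BLOCKSIZE' table properties:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder()
  .appName("carbondata-ddl")
  .getOrCreateCarbonSession("hdfs://ns1/carbon/store")

// msisdn is assumed to be the most frequently filtered field, so it comes first
// in SORT_COLUMNS; TABLE_BLOCKSIZE is in MB (default 1024), 512 for a large table.
carbon.sql("""
  CREATE TABLE IF NOT EXISTS cdr_detail (
    msisdn     STRING,
    cell_id    STRING,
    start_time TIMESTAMP,
    duration   INT
  )
  STORED BY 'carbondata'
  TBLPROPERTIES (
    'SORT_COLUMNS'    = 'msisdn,cell_id',
    'SORT_SCOPE'      = 'LOCAL_SORT',
    'TABLE_BLOCKSIZE' = '512'
  )
""")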

Query performance can also be tuned based on the characteristics of business queries, for example by creating a datamap such as a bloomfilter on high-frequency query fields.
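
A hedged example of the bloomfilter datamap mentioned above, using the CarbonData 1.4-era datamap syntax (in CarbonData 2.x this capability moved to CREATE INDEX); the datamap name, indexed column and property values are illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder()
  .appName("carbondata-datamap")
  .getOrCreateCarbonSession("hdfs://ns1/carbon/store")

// Bloomfilter datamap on the high-frequency query column msisdn:
// equality filters on msisdn can then prune data blocks that cannot match.
carbon.sql("""
  CREATE DATAMAP dm_msisdn_bloom ON TABLE cdr_detail
  USING 'bloomfilter'
  DMPROPERTIES (
    'INDEX_COLUMNS' = 'msisdn',
    'BLOOM_SIZE'    = '640000',
    'BLOOM_FPP'     = '0.00001'
  )
""")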

There are also some Spark-related parameter settings. For both data loading and queries, we first analyze the performance bottleneck with the Spark UI; the relevant parameters are not listed one by one here. Remember that performance tuning is a matter of technical detail and that parameter adjustment should be targeted: adjust only one or a few parameters at a time and check the effect, and revert any change that does not help. Do not adjust many parameters at once.
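
Purely as an illustration of targeted tuning: the keys below are standard Spark configuration parameters, but the values are made up; appropriate values must come from analyzing the actual bottleneck in the Spark UI, changing one thing at a time.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// Adjust one (or a few) parameters at a time and verify the effect in the Spark UI;
// revert anything that does not help rather than stacking changes.
val carbon = SparkSession.builder()
  .appName("cdr-query")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "16g")
  .config("spark.sql.shuffle.partitions", "400")
  .config("spark.dynamicAllocation.enabled", "true")
  .getOrCreateCarbonSession("hdfs://ns1/carbon/store")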

The above is the example analysis of using Spark+CarbonData to replace Impala; hopefully some of the points covered here will be useful in daily work.
