SparkSQL + SequoiaDB Performance Tuning Strategy

This article describes how to integrate SequoiaDB (distributed storage) with Spark (distributed computing), and how to improve the performance of statistical analysis in massive-data scenarios.
01 Introduction to SequoiaDB and SparkSQL
SequoiaDB is an open-source, financial-grade distributed relational database. It supports standard SQL and transactions, supports complex index queries, and integrates deeply with Hadoop, Hive and Spark. SequoiaDB provides more data sharding options than typical big data products, including hash sharding, range sharding, main/sub-table sharding (similar to partitioning) and multi-dimensional sharding. Users can choose the sharding method that fits each scenario to improve the storage capacity and performance of the system.
Spark has developed rapidly in recent years, and more and more developers use SparkSQL for big data processing and analysis. SparkSQL is an integral part of the Spark product line, and its SQL execution engine is implemented on top of Spark's RDD and DataFrame APIs.
SparkSQL resembles another popular big data SQL product, Hive, but the two still differ in essential ways. The biggest difference is the execution engine: Hive supports the Hadoop MapReduce and Tez computing frameworks by default, while SparkSQL only supports the Spark RDD computing framework. SparkSQL, however, performs deeper optimization of both execution plans and the processing engine.
02 How to integrate SequoiaDB and SparkSQL
Spark itself is a distributed computing framework. Unlike Hadoop, it does not provide both distributed computing and distributed storage; instead, it opens up a development interface to the storage layer. As long as developers implement the interface methods according to Spark's interface specification, any storage product can serve as a data source for Spark computation, including for SparkSQL.
SequoiaDB is a distributed database that can store massive amounts of data for its users, but statistics and analysis over that data still depend on the concurrency of a distributed computing framework to be efficient. SequoiaDB therefore developed the SequoiaDB for Spark connector, which lets Spark fetch data from SequoiaDB concurrently and then perform the corresponding computation.
Connecting Spark to SequoiaDB is relatively simple. Users only need to add the SequoiaDB for Spark connector spark-sequoiadb.jar and the SequoiaDB Java driver sequoiadb.jar to the CLASSPATH of each Spark Worker.
For example, to connect SparkSQL to SequoiaDB, add the SPARK_CLASSPATH parameter to the spark-env.sh configuration file; if the parameter already exists, append the new jar packages to it, such as:
SPARK_CLASSPATH="/media/psf/mnt/sequoiadb-driver-2.9.0-SNAPSHOT.jar:/media/psf/mnt/spark-sequoiadb_2.11-2.9.0-SNAPSHOT.jar"
After modifying spark-env.sh, restart spark-sql or the thriftserver to complete the integration of Spark and SequoiaDB.
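For example, if SparkSQL is used through the Thrift server, the restart could look roughly like this (assuming a standard Spark distribution layout):

# restart the Thrift server so the updated CLASSPATH takes effect
$SPARK_HOME/sbin/stop-thriftserver.sh
$SPARK_HOME/sbin/start-thriftserver.sh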
03 SequoiaDB and SparkSQL performance optimization
The performance optimization of SparkSQL + SequoiaDB is covered below from four angles: how the connector computes, SparkSQL optimization, SequoiaDB optimization, and connector parameter optimization.
3.1 SequoiaDB for SparkSQL
A) How the connector works
Although Spark provides a variety of functional modules, they are all modules for data computation. Spark itself has no storage capability; by default it reads data from the local file system or HDFS. However, Spark opens its storage-layer interface to developers: as long as developers implement a storage connector according to Spark's interface specification, any data source can become a data source for Spark computation.
The following figure shows the relationship between Spark workers and the data nodes of the storage layer.
Figure 1
The interaction between the Spark computing framework and the storage layer is shown in the next figure. After receiving a computing task, the Spark master first communicates with the storage layer to obtain, from its access snapshot or storage plan, the storage locations of all data involved in the task. The storage layer returns to the Spark master a queue of partitions describing where the data is stored.
The Spark master then assigns the partitions in this queue to Spark workers one by one. Once a worker receives the partition information, it knows where to obtain the data to be computed: it actively connects to the corresponding node of the storage layer to fetch the data, and then performs the computation according to the task the Spark master assigned to it.
Figure 2
The SequoiaDB for Spark connector works essentially as described above, except that when generating the partition tasks for a computation, the connector first obtains the query plan from SequoiaDB based on the query conditions pushed down by Spark.
If SequoiaDB can serve the query conditions with an index scan, the partition tasks generated by the connector have the Spark workers connect directly to the data nodes of SequoiaDB.
If SequoiaDB cannot use an index scan for the query conditions, the connector fetches all the data-block information of the relevant collection and then generates partition tasks, each containing the connection information for several blocks, based on the partitionblocknum and partitionmaxnum parameters.
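As a rough illustration of how these parameters interact (the exact partitioning logic is internal to the connector): with the default partitionblocknum of 4, a collection stored in 4,000 data blocks would yield roughly 4,000 / 4 = 1,000 partition tasks, which happens to equal the default partitionmaxnum cap of 1,000; for a larger collection, the cap forces more blocks into each task rather than creating more tasks.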
B) Connector parameters
The SequoiaDB for Spark connector was refactored as of SequoiaDB 2.10 to improve the performance of Spark's concurrent data fetching from SequoiaDB, and its parameters were adjusted accordingly.
To use it, create a table in SparkSQL whose data source is SequoiaDB. The table-creation template is as follows:
create [temporary] table/view <name> [(schema)] using com.sequoiadb.spark options (<options>)
Keywords of the SparkSQL table creation command:
1. The temporary keyword indicates that the table or view is temporary: if the user specifies it, the table or view is automatically deleted after the client restarts.
2. When creating a table, the user may omit the table structure: if it is not explicitly specified, SparkSQL automatically samples the existing data to detect the table structure at creation time.
3. The com.sequoiadb.spark keyword is the entry class of the SequoiaDB for Spark connector.
4. options are the configuration parameters of the SequoiaDB for Spark connector.
An example of SparkSQL table creation is as follows:
create table tableName (name string, id int) using com.sequoiadb.spark options (host 'sdb1:11810,sdb2:11810,sdb3:11810', collectionspace 'foo', collection 'bar', username 'sdbadmin', password 'sdbadmin')
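Once created, the table can be queried like any other SparkSQL table; for example (a hypothetical query against the example table above):

select name, id from tableName where id > 1000

As noted above, the (name string, id int) schema could also be omitted, in which case SparkSQL samples the SequoiaDB collection to infer the table structure.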
The list of options parameters for creating a SequoiaDB table in SparkSQL is as follows:
Table 1
3.2 SparkSQL optimization
To use SparkSQL for statistical analysis of massive data, performance should be tuned in three areas:
1. Increase the maximum available memory of each Spark Worker, to prevent data from overflowing memory during computation and being partially written to temporary files.
2. Increase the number of Spark Workers, and configure how much of the current server's CPU resources each Worker may use, to improve concurrency.
3. Adjust Spark's runtime parameters.
Users can set the first two in the spark-env.sh configuration file: SPARK_WORKER_MEMORY controls the memory available to each Worker, and SPARK_WORKER_INSTANCES controls how many Workers each server starts.
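A minimal spark-env.sh sketch, assuming a server with 16 CPU cores and 64GB of memory reserved for Spark (the values are illustrative only, not recommendations):

# spark-env.sh (illustrative values)
SPARK_WORKER_INSTANCES=4   # start four Workers per server
SPARK_WORKER_CORES=4       # each Worker may use 4 cores
SPARK_WORKER_MEMORY=16g    # each Worker may use 16GB of memory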
To adjust Spark's runtime parameters, modify the spark-defaults.conf configuration file. The parameters that most noticeably improve statistical computation over massive data are:
1. spark.storage.memoryFraction: the fraction of Worker memory used to cache data. The default is 0.6, i.e. 60%.
2. spark.shuffle.memoryFraction: the fraction of each Worker's memory that shuffle may occupy during computation. The default is 0.2, i.e. 20%. If little data needs to be cached but the computation involves many group by, sort, join and similar operations, consider increasing spark.shuffle.memoryFraction and decreasing spark.storage.memoryFraction, to avoid spilling data that exceeds memory into temporary files.
3. spark.serializer: the serialization method Spark uses at runtime. The default is org.apache.spark.serializer.JavaSerializer, but for better performance org.apache.spark.serializer.KryoSerializer should be selected.
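Putting the three parameters together, a spark-defaults.conf sketch for a shuffle-heavy workload might look like this (the ratios are illustrative and should be tuned against the actual job):

# spark-defaults.conf (illustrative values)
spark.storage.memoryFraction   0.4
spark.shuffle.memoryFraction   0.4
spark.serializer               org.apache.spark.serializer.KryoSerializer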
3.3 SequoiaDB optimization
When SparkSQL is combined with SequoiaDB, the data is read from SequoiaDB, so performance tuning should consider three points:
1. Data in large tables should be stored as distributed as possible. For tables that meet the conditions for two-dimensional partitioning, it is recommended to combine the two data-distribution methods of main/sub-table partitioning and hash sharding.
2. When importing data, avoid importing into multiple collections of the same collection space at the same time, because collections in one collection space share the same data files. Concurrent imports scatter each collection's data blocks across the files, so SparkSQL has to read far more data blocks when fetching massive data from SequoiaDB.
3. If a SparkSQL query contains filter conditions, create indexes on the corresponding fields in SequoiaDB, as in the sketch below.
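For example, if SparkSQL frequently filters on the id field of the example collection foo.bar, an index can be created in the SequoiaDB shell roughly as follows (host, port and index name are hypothetical):

// connect to a coord node and index the id field of foo.bar
var db = new Sdb("sdb1", 11810);
db.foo.bar.createIndex("idx_id", { id: 1 }, false);  // non-unique ascending index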
3.4 Connector optimization
Parameter tuning of the SequoiaDB for Spark connector falls into two scenarios: data reading and data writing.
There is little to tune for data writing; only one parameter matters, bulksize, which controls how many records the connector batches into one network packet (500 by default) before sending a write request to SequoiaDB. bulksize should usually be set so that a single network packet does not exceed 2MB.
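A hedged example: assuming an average record size of about 2KB, a bulksize of around 1000 keeps each packet near the 2MB guideline (1000 × 2KB ≈ 2MB); the parameter is passed through the table's options like any other connector parameter:

create table tableName (name string, id int) using com.sequoiadb.spark options (host 'sdb1:11810,sdb2:11810,sdb3:11810', collectionspace 'foo', collection 'bar', bulksize '1000')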
For data reading, users need to pay attention to three parameters: partitionmode, partitionblocknum and partitionmaxnum.
partitionmode: the partition mode of the connector. Valid values are single, sharding, datablock and auto. The default is auto, which lets the connector choose automatically.
1. single: when SparkSQL accesses SequoiaDB data, concurrency is not considered; a single thread connects to a coord node of SequoiaDB. This mode is generally used at table-creation time to sample the data for schema detection.
2. sharding: when SparkSQL accesses SequoiaDB data, workers connect directly to each data node of SequoiaDB. This mode is generally used when the SQL command contains query conditions that SequoiaDB can serve through an index.
3. datablock: when SparkSQL accesses SequoiaDB data, workers concurrently read SequoiaDB's data blocks. This mode is generally used when the SQL command cannot use an index in SequoiaDB and the amount of data queried is large.
4. auto: when SparkSQL queries SequoiaDB, the connector decides how to access SequoiaDB depending on the circumstances.
partitionblocknum: takes effect only when partitionmode=datablock; it determines how many SequoiaDB data-block read tasks each Worker receives at a time during computation. The default is 4. If the amount of data stored in SequoiaDB is large and many data blocks are involved in the computation, increase this parameter to keep the number of SparkSQL computing tasks in a reasonable range and improve read efficiency.
partitionmaxnum: takes effect only when partitionmode=datablock; it sets the maximum number of block-read tasks the connector may generate. The default is 1000. Its main purpose is to prevent a very large total number of data blocks in SequoiaDB from producing too many SparkSQL computing tasks, which would degrade overall performance.
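Combining the three read parameters, a table intended mainly for full-collection scans might be declared roughly as follows (the values are illustrative):

create table tableName (name string, id int) using com.sequoiadb.spark options (host 'sdb1:11810,sdb2:11810,sdb3:11810', collectionspace 'foo', collection 'bar', partitionmode 'datablock', partitionblocknum '16', partitionmaxnum '1000')

With these settings, a collection of 16,000 data blocks would be split into roughly 16,000 / 16 = 1,000 read tasks, right at the partitionmaxnum cap (again an illustration; the connector's exact behavior may differ).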
This concludes the overview of the SparkSQL + SequoiaDB performance tuning strategy; hopefully the content above is of some help.