Because I work in big data, I frequently run into problems such as data taking up too much storage space and reading too slowly, so I began studying the Parquet storage format. Below are some of my own findings (the tables used come from a real project; think of them as wide tables):
I. Read and write performance of the two storage formats in SparkSQL (using the resource-product topology information wide table as an example)
1. Hypothesis: because Parquet is a columnar format, its query, read, and computation performance should be higher than that of the ordinary storage format; therefore, when running the same SQL, the table stored as Parquet should read faster and take less time.
2. Test steps:
① For each Hive table the SparkSQL job uses, create a counterpart table stored in Parquet format;
② run the jar packages separately to integrate the data;
③ record the running times and compare them to draw a conclusion (a minimal sketch of steps ① and ③ follows this list).
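A minimal sketch of steps ① and ③ in Spark (Scala) with Hive support. The table name ksdd_oss.wide_table and the output paths are placeholders, not names from the project, and the real jobs run from packaged jars:

import org.apache.spark.sql.SparkSession

object ParquetReadBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-read-benchmark")
      .enableHiveSupport()
      .getOrCreate()

    // Step ①: create a Parquet counterpart of an ordinary Hive table
    // (both table names here are hypothetical).
    spark.sql(
      """CREATE TABLE IF NOT EXISTS ksdd_oss.wide_table_parquet
        |STORED AS PARQUET
        |AS SELECT * FROM ksdd_oss.wide_table""".stripMargin)

    // Step ③: run the same SQL against both tables and compare wall-clock
    // time; writing the result out forces full execution of the read.
    def time(label: String, table: String): Unit = {
      val start = System.nanoTime()
      spark.sql(s"SELECT * FROM $table")
        .write.mode("overwrite").parquet(s"/tmp/bench/$label")
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
    }

    time("common",  "ksdd_oss.wide_table")
    time("parquet", "ksdd_oss.wide_table_parquet")
  }
}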
3. Test results (in the third run the Parquet table took noticeably longer, probably because of resource contention, but it was still faster than the table stored as plain files on HDFS):
Run         | Parquet format table | Common format table
First run   | 9 min 0 s            | 14 min 51 s
Second run  | 9 min 38 s           | 14 min 18 s
Third run   | 13 min 9 s           | 16 min 14 s
Average     | 10 min 36 s          | 15 min 8 s
4. Test analysis
In this test the same SparkSQL job was used for the read-and-transform step; only the storage format of the underlying Hive tables differed. The execution times above show clearly that, when reading and writing to generate the resource-product topology information wide table, the Parquet table takes less time. The hypothesis holds.
II. With different storage formats, Parquet storage is more efficient than ordinary file storage for columnar computation
1. Hypothesis: because Parquet stores data by column, a query does not have to scan all the data; it reads only the columns it involves, which can cut I/O several-fold, and the per-column statistics Parquet keeps enable partial predicate pushdown. We therefore assume that, for column-oriented computation, Parquet storage is more efficient than ordinary file storage.
2. Test steps:
① Store the tables that the SparkSQL job processes in both Parquet format and the common format;
② run a SparkSQL query that performs column-oriented computation;
③ compare the execution times and draw conclusions (a sketch follows this list).
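A sketch of step ②, modeled on the first test case (repeated grouping over roughly 5 million rows to find maxima). The column names build_id and fix_duration are hypothetical, and ksdd_oss.build_mov_fix_txt stands in for the common-format copy; `spark` is the SparkSession from the previous sketch:

def runAgg(table: String): Double = {
  val start = System.nanoTime()
  // The aggregation touches only two columns, which is where Parquet's
  // column pruning should pay off.
  spark.sql(s"SELECT build_id, MAX(fix_duration) FROM $table GROUP BY build_id")
    .write.mode("overwrite").parquet(s"/tmp/agg/${table.replace('.', '_')}")
  (System.nanoTime() - start) / 1e9
}

println(f"parquet: ${runAgg("ksdd_oss.build_mov_fix")}%.1f s")
println(f"common:  ${runAgg("ksdd_oss.build_mov_fix_txt")}%.1f s")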
3. Test results:
Job (output file) | Tables used | Parquet storage time | Common storage time | SQL difficulty
7-day quality tracking (Build_mov_fix_7day_fix) | ksdd_oss.build_mov_fix | 2 min 18 s | 7 min 50 s | SparkSQL groups about 5 million rows several times to find maxima
Customer health score (Cust_Count_Yarn) | ksdd_oss.cust_health_detail | 4 min 3 s | 4 min 45 s | SparkSQL scans 4 billion rows for one day and performs several groups of complex additions and subtractions to find maxima and judge the final result
Customer health file, installation/repair monitoring-period data (Build_mov_fix_7day_health) | ksdd_oss.cust_health_detail and ksdd_oss.cust_health_detail_num | 12 min 49 s | 4 min 2 s | SparkSQL computes and joins the health-file detail table with the statistics table over 7 days; the two tables hold more than 4 billion rows
4. Test analysis
This test performs complex computation over large volumes of data and writes out the results, so the execution times show how efficient each format is. In the first two tests, Parquet storage has a clear advantage over ordinary file storage. The third test, a join over a large amount of data, did not match our assumption. Looking into the cause, we found that when a query touches few columns, Parquet storage is very efficient; even direct Impala queries over the large tables take very little time. The main cause of the third result appears to be that too many fields had to be aggregated, which makes queries over Parquet storage very inefficient. So when writing SparkSQL we should consider how many of a table's fields are used: if a query touches many fields, the common storage format may be a better choice than Parquet. In short, the hypothesis holds only case by case, and the number of fields involved should decide which storage format to use.
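One way to check how many columns a query actually reads from a Parquet table, which bears on the field-count advice above, is Spark's physical plan: the FileScan parquet node lists ReadSchema (the pruned columns) and PushedFilters. A small sketch, with hypothetical column names:

// Print the physical plan instead of running the query.
spark.sql(
  """SELECT cust_id, MAX(health_score)
    |FROM ksdd_oss.cust_health_detail
    |WHERE day_id = '20190101'
    |GROUP BY cust_id""".stripMargin).explain()
// In the output, the FileScan parquet line shows ReadSchema and
// PushedFilters; the fewer columns ReadSchema lists relative to the
// table's full width, the more column pruning is helping.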
III. With different storage formats, Parquet storage takes up less space
1. Hypothesis: because the members of each column in Parquet are of the same type, a more efficient compression algorithm can be chosen per data type, further reducing I/O; and because column members are homogeneous, encodings better suited to the CPU pipeline can be used, reducing CPU cache misses. We therefore assume that, with different storage formats, Parquet storage occupies less space.
2. Test steps:
① Generate the same amount of data;
② store the generated data in both the common file format and Parquet format;
③ check the size of the space each generated file occupies and draw a conclusion (a sketch follows this list).
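A sketch of steps ② and ③: write the same DataFrame in both formats, then read back the space each copy occupies on HDFS. Here `df`, the paths, and the explicit snappy codec are placeholders and assumptions, not the project's actual settings:

import org.apache.hadoop.fs.{FileSystem, Path}

// ②: one copy as delimited text, one as Parquet (snappy is Spark's
// default Parquet codec; shown explicitly here).
df.write.mode("overwrite").option("sep", "|").csv("/tmp/fmt/common")
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/fmt/parquet")

// ③: compare the on-disk sizes of the two output directories.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
def sizeMB(p: String): Long =
  fs.getContentSummary(new Path(p)).getLength / 1024 / 1024
println(s"common:  ${sizeMB("/tmp/fmt/common")} MB")
println(s"parquet: ${sizeMB("/tmp/fmt/parquet")} MB")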
3. Test results:
Generated file | Parquet file size | Common file size | Savings ratio | Space saved
Resource-product topology information wide table intermediate file (zyproinfotwo) | 3.6 GB | 12.6 GB | 71.4% | 9 GB
Resource wide table written back to Oracle (zydata_to_oracle) | 726 MB | 2.3 GB (2355.2 MB) | 69.2% | 1629.2 MB
Topology import of BRAS data (topo_add_stb) | 294.3 MB | 1.2 GB (1228.8 MB) | 76% | 934.5 MB
4. Test analysis
The test data show that storing data in Parquet format takes up much less space, saving around 70%. In our application, large base tables can therefore be generated in Parquet format; in code, we can read the required files, register them as temporary views, and then extract only the fields we need (a sketch follows). The hypothesis holds: with different storage formats, Parquet storage occupies less space.
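A sketch of that pattern: register a Parquet file as a temporary view and pull out only the needed fields. The path and column names are placeholders:

// Read the Parquet base table, expose it as a temp view, then select
// only the fields this job needs.
val topo = spark.read.parquet("/data/base/zyproinfotwo")
topo.createOrReplaceTempView("zyproinfotwo")
val needed = spark.sql("SELECT res_id, prod_id, topo_path FROM zyproinfotwo")
needed.show(10)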