
Problems encountered in Hive and Spark development and production


This article walks through problems the author has run into while developing and running Hive and Spark jobs in production. Each section states the problem and a practical solution.

Production environment versions: Hive 1.2.1, Spark 2.3.2

1. insert overwrite directory does not overwrite data

Note that the result of insert overwrite directory is a directory, and files with different names already in that directory are not overwritten, so it is easy to end up with doubled data or data that is not replaced at all. For example, suppose the original output was split into the following files:

/mytable/000000_0
/mytable/000000_1
/mytable/000000_2
/mytable/000000_3
## if the newly generated output contains only 000000_0, the _1 / _2 / _3 shards are not deleted

Solution: instead of insert overwrite directory, create an external table over the directory and insert overwrite the table; the stale-file problem is even more damaging if the directory is imported into another system. Note that with an external partitioned table, dropping a partition and then running insert overwrite on it can also produce duplicate data. Tested on version 2.3.2:

-- contents of file a (id and dt): rows of "2 10"
-- create a managed table and load the file into it
create table T2 (id int) partitioned by (dt string);
load data local inpath '/xxx/a'
-- create an external partitioned table
create external table T1 (id int) partitioned by (dt string);
-- overwrite partition 10
insert overwrite table T1 partition (dt='10') select 1 from T2 where dt=10;
-- drop partition 10
ALTER TABLE T1 DROP PARTITION (dt='10');
-- overwrite partition 10 again
insert overwrite table T1 partition (dt='10') select 2 from T2 where dt=10;

The final query returned 6 rows, which is clearly wrong, while the same steps in Hive gave a normal result. (It eventually turned out that a teammate had changed a parameter while debugging; the default value, true, is fine.)

Solution:

set spark.sql.hive.convertMetastoreParquet=true
set spark.sql.hive.convertMetastoreOrc=true

2. insert tableA select xx from tableB: field order problem

Note: even if the column names in the select match the column names of tableA, the values are mapped by position, so if the select order differs from tableA's column order the result will be wrong.

This commonly shows up with a select under a lateral view, where it is tempting to assume that matching column names are enough. In fact the insert is positional: the selected columns are written in order, which is probably related to the underlying MR implementation.
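
As a rough illustration (hypothetical table and column names; assumes a spark-shell session with Hive support, so spark already exists), the columns are matched by position, not by name:

// both tables have the same column names, but the select lists them in a different order
spark.sql("create table tableA (id int, name string)")
spark.sql("create table tableB (id int, name string)")
spark.sql("insert into tableB values (1, 'a')")
// the names in the select match tableA's columns, but the order is (name, id):
// 'a' is cast into tableA.id (becoming null) and 1 is written into tableA.name
spark.sql("insert into table tableA select name, id from tableB")
spark.sql("select * from tableA").show()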

3. spark.write with overwrite: concurrent writes to the same path can also produce duplicate data

Data is written with spark.write and the path is then attached to a table. There are two ways the path gets written, controlled by the parameter mapreduce.fileoutputcommitter.algorithm.version.

Spark writes to HDFS through Hadoop's FileOutputCommitter, which has two algorithms. The v1 version writes to a temporary directory first and then copies the results into the output directory; the v2 version writes directly into the output directory and has better output performance. Neither guarantees consistency in the end: with concurrent writes to the same path, both can go wrong. v1, however, is a two-phase commit, so at least readers will not see dirty data.

For more information, please refer to https://blog.csdn.net/u013332124/article/details/92001346

It typically happens when a task is started, killed, and then rescheduled again immediately.

Note: offline jobs have weak transactionality, so be extra careful. Try not to restart a task immediately in a way that lets two jobs write to the same directory at the same time. If you must reschedule right away, kill the running task and delete the _temporary directory first. For offline pipelines, consistency usually matters more than performance, whichever commit version (v1 or v2) you choose.
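
For reference, a minimal sketch of choosing the committer algorithm when the session is created (the property is the Hadoop one named above, passed through Spark's spark.hadoop.* prefix; the application name and output path are made up for the example):

import org.apache.spark.sql.SparkSession

// v1: tasks write to a temporary directory and results are moved on commit (slower, two-phase);
// v2: tasks write directly into the output directory (faster, but partial output is visible on failure)
val spark = SparkSession.builder()
  .appName("committer-demo")
  .enableHiveSupport()
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()

spark.range(10).write.mode("overwrite").parquet("/tmp/committer_demo_output")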

4. The problem with msck repair table

When backfilling data while routine tasks are also running, the backfill job may run msck repair table before the write has finished, so downstream tasks read incomplete data. This is similar to problem 3: with the v2 commit algorithm, the partition becomes visible to downstream tasks as soon as the msck repair runs, even though its data is still incomplete.

Solution:

write + msck repair  ->  insert, or write + add partition (register the partition only after the write has finished; a sketch follows)
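
A rough sketch of the write + add partition form (table name, path and date are hypothetical; assumes a spark-shell session with Hive support):

// 1) write the partition directory first
val partDf = spark.sql("select id from src_table where dt = '20201207'")   // hypothetical source
partDf.write.mode("overwrite").parquet("/warehouse/mytable/dt=20201207")

// 2) expose only this partition once the write has completed, instead of repairing the whole table
spark.sql("""
  ALTER TABLE mytable ADD IF NOT EXISTS PARTITION (dt='20201207')
  LOCATION '/warehouse/mytable/dt=20201207'
""")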

5. insert overwrite into a table's own partition

Hive supports writing a partition from the table itself directly. In Spark, the target partition is deleted first and the data is then read lazily, which leads to null pointer errors or empty results. The workaround is to cache the data and trigger an action first, so that it is already materialized in memory.
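
A minimal sketch of that cache-plus-action workaround (table, column and partition value are hypothetical; assumes a spark-shell session with Hive support):

// read the partition that is about to be overwritten and materialize it before the overwrite drops it
val snapshot = spark.sql("select id from mytable where dt = '20201207'")
snapshot.cache()
snapshot.count()   // any action, so the data is cached before the partition is deleted

snapshot.createOrReplaceTempView("snapshot")
spark.sql("insert overwrite table mytable partition (dt='20201207') select id from snapshot")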

Note: overwriting a table's own partition is generally not recommended, except for unimportant tables at the very beginning of development. The usual offline approach is to add a version field to distinguish versions; this is commonly used when adding fields or changing how a field is generated. Produce the data under a new version first, verify it, then notify downstream consumers to switch, or copy the verified version over the old partition once you are sure it is correct.

6. Depend on partitions in a way that keeps reruns idempotent

In real scenarios, make jobs as idempotent as possible; otherwise production problems will surface that nobody knows how to maintain. A common convention is ods_day <= the processing day for incremental tables and ods_day = the processing day for full (snapshot) tables. Even if you do not strictly need to gate on the current day's partition, when repairing the current day's data or backfilling history there is no guarantee the data will always be there, and data that cannot be backfilled will make you miserable.
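
A sketch of the two conventions (table names and the bizDate scheduling parameter are hypothetical):

val bizDate = "20201207"   // normally injected by the scheduler

// incremental table: everything up to and including the processing day, so reruns and backfills stay idempotent
val incr = spark.sql(s"select * from ods_incr_table where ods_day <= '$bizDate'")

// full (snapshot) table: exactly the snapshot of the processing day
val full = spark.sql(s"select * from ods_full_table where ods_day = '$bizDate'")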

7. Incompatibility between Spark and Hive

For the ORC and Parquet file formats, Spark's implementation is newer and Hive's is older, so compatibility problems appear; for example, partitions created by Spark can come back as null or empty in some cases. You can set the two parameters below so that Spark and Hive can both read the same tables. Unless it is only a temporary table, it is recommended to create the table through Hive; then, as long as the parameters below are added when Spark writes, there will be no compatibility problem.

// the first approach
set spark.sql.hive.convertMetastoreParquet=true
set spark.sql.hive.convertMetastoreOrc=true

// the second approach: make Spark write parquet data with the same convention as Hive
.config("spark.sql.parquet.writeLegacyFormat", "true")

8. The problem of too-high QPS when reading files

20-12-07 19:39:17 ERROR executor.Executor: Exception in task 740.2 in stage 2.0 (TID 5250)
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 16384; received: 0)

When the error above appears, on the surface it means the file is being read too frequently, so only part of the data gets transferred; this is related to rate limiting on the machine. At our company, the answer from the people maintaining BMR was to reduce the read frequency on a single file.

Solution: it was a bit frustrating at first that offline reads could hit a concurrency limit; in fact what is limited is the read QPS on a single block of a file on a given DataNode. So investigate from that angle: identify the key tables, check whether they consist of very large files, and if so split them into smaller files. The common ways are to repartition the output in Spark or to increase the number of Hive reducers.
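
A sketch of the Spark-side fix (table name, partition, path and partition count are hypothetical): rewrite the hot data with more, smaller files so reads are spread over more blocks.

// rewrite the hot partition into more output files
val hot = spark.table("hot_table").where("dt = '20201207'")
hot.repartition(200)
  .write
  .mode("overwrite")
  .parquet("/warehouse/hot_table_small_files/dt=20201207")

// Hive-side alternative: raise the reducer count (e.g. set mapreduce.job.reduces=200;)
// for the job that produces the table, so each reducer writes a smaller file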

9. A bug when sorting in a window function

After many tests: the - symbol (unary minus) does not give a descending sort.

// in spark-shell; otherwise also import spark.implicits._ for the ' and $ column syntax
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// buggy: the unary minus does not produce a descending sort
spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(-'f_modify_time)))
// alternatives (preferably written in SQL rather than the DSL)
spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy($"f_modify_time".desc)))
spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(col("f_modify_time").desc)))
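
Since the text recommends SQL over the DSL, here is a sketch of the equivalent SQL form (the source table name my_table is hypothetical; the columns are the ones used above):

val ranked = spark.sql("""
  select t.*,
         row_number() over (partition by f_trans_id order by f_modify_time desc) as rn
  from my_table t
""")
ranked.show()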

10. alter table ... cascade

When adding columns, ALTER TABLE needs the CASCADE keyword; otherwise backfilled historical partitions cannot be read correctly in Hive. The reason is that without CASCADE the new column only takes effect in the table-level metadata, not in the existing partitions, so even after old partitions are backfilled you cannot read the new column from them. If you only discover this after backfilling, you can patch the partition metadata with a change column statement.
