The editor would like to share with you the pitfalls of using Sqoop to import data from Oracle into Hive. I hope you learn something from this article; let's dig in!
1. Use the --columns parameter to select individual columns for import.
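As a minimal sketch (the connection string, credentials, table, and column names here are placeholders, not values from the original article):

    # Import only three columns of the EMPLOYEES table into Hive
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --table EMPLOYEES \
      --columns "EMP_ID,EMP_NAME,HIRE_DATE" \
      --hive-import \
      --hive-table employees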
2. Use the --where parameter to import only rows that meet a condition; you can use Oracle functions inside the expression. For example, I used the to_date('2013-06-20', 'yyyy-mm-dd') function to restrict the import to a given date.
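A sketch of such a filter (CREATE_TIME is a hypothetical column name; the rest of the command is placeholder boilerplate):

    # The --where expression is passed through to Oracle verbatim,
    # so Oracle functions such as to_date() work inside it
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --table EMPLOYEES \
      --where "CREATE_TIME >= to_date('2013-06-20', 'yyyy-mm-dd')" \
      --hive-import \
      --hive-table employees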
3. Use the --fields-terminated-by parameter to specify the column delimiter Sqoop uses in the files it writes.
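For example (a sketch; '\001', i.e. Ctrl-A, is Hive's default field delimiter and is usually a safe choice for text data):

    # Separate columns with \001 so Hive can read the files directly
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --table EMPLOYEES \
      --fields-terminated-by '\001' \
      --hive-import \
      --hive-table employees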
4. Use the --target-dir parameter to specify the output location. Note that the folder is created by Sqoop itself, and an exception is thrown if it already exists. Sometimes you have several tasks that should all import into the same folder; in that case add the --append parameter. Don't worry that two tasks might generate files with the same name and fail: with --append, Sqoop first writes the results to a temporary folder under the current user's directory, then moves them from the temporary folder to the location you specified, automatically renaming the files to avoid duplicates.
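A sketch of a run appending into an existing directory (the HDFS path is a placeholder):

    # With --append, the output is staged in a temporary folder under
    # the current user's home directory, then moved into the target
    # directory under non-colliding file names
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --table EMPLOYEES \
      --target-dir /user/hive/warehouse/emp_data \
      --append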
5. To run multiple map tasks, you generally just use the -m parameter. By default Sqoop shards the table on its primary key, that is, it effectively runs with --split-by set to your primary key. The process: it takes the minimum and maximum values of the primary key, divides the difference by the number of map tasks you set, and assigns each map task one of the resulting ranges. It sounds wonderful, but there are real shortcomings. First, if the table has no primary key, you must specify --split-by manually, or give up parallelism and set -m 1. Second, there is the data-skew problem I ran into this time. For example, I set -m 4, and my primary key ran from a minimum of 1000 to a maximum of 10000, but I had only 100 rows, most of them with keys between 1000 and 2000. The four map tasks therefore got roughly the ranges 1000~3250, 3250~5500, 5500~7750, and 7750~10000, so the data was concentrated in the first map task while the other three had almost no data at all. Not only was there no parallel speedup, it actually hurt performance somewhat. So choose the --split-by column according to the actual distribution of your data.
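A sketch with an explicit split column and four mappers (EMP_ID is a placeholder):

    # Sqoop computes min(EMP_ID) and max(EMP_ID), splits that range
    # into 4 equal intervals, and gives each map task one interval;
    # a skewed key distribution therefore yields skewed task sizes
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --table EMPLOYEES \
      --split-by EMP_ID \
      -m 4 \
      --target-dir /user/hive/warehouse/emp_data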
6. How to use a date-type column as the --split-by value. For this task I found that sharding on a date column gives a more uniform distribution, but when I used --split-by directly on my date column, I got the error java.sql.SQLDataException: ORA-01861: literal does not match format string. Googling turned up that this is a known bug in some versions, and the suggested workaround is roughly to convert the column to a string type; see https://issues.apache.org/jira/browse/SQOOP-1946. My solution: --split-by "to_char(my_date_column, 'yyyy-MM-dd')".
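In command form, the workaround might look like this (CREATE_TIME is a hypothetical DATE column):

    # Convert the DATE column to a string on the Oracle side so the
    # boundary queries Sqoop generates compare strings, avoiding
    # ORA-01861
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --table EMPLOYEES \
      --split-by "to_char(CREATE_TIME, 'yyyy-MM-dd')" \
      -m 4 \
      --target-dir /user/hive/warehouse/emp_data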
7. How to handle newline characters in an imported column. I thought everything would be fine, but it turned out one of the imported columns was of NCLOB type and stored whole articles, so it was bound to contain newlines. Sure enough, the data queried from Hive came out garbled. Searching turned up the --hive-delims-replacement parameter (replace \n, \r, and similar characters with a character you specify) and the --hive-drop-import-delims parameter. I added them and found they had no effect whatsoever. After a long round of Baidu and Google, I finally found a post, http://stackoverflow.com/questions/28076200/hive-drop-import-delims-not-removing-newline-while-using-hcatalog-in-sqoop, whose point is to use the --map-column-java parameter to explicitly map the column to the String type. My solution was --map-column-java my_clob_column=String (substituting my CLOB column's name), and sure enough the problem was solved: the newline characters were all removed.
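A sketch combining the column mapping with the delimiter option (ARTICLES and ARTICLE_BODY are hypothetical table and column names):

    # Mapping the LOB column to a Java String makes Sqoop treat it as
    # text, so --hive-drop-import-delims can strip \n, \r and \001
    # from the field values
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username scott \
      --password tiger \
      --table ARTICLES \
      --map-column-java ARTICLE_BODY=String \
      --hive-drop-import-delims \
      --hive-import \
      --hive-table articles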
After reading this article, I believe you now have some understanding of using Sqoop to import data from Oracle into Hive. Thank you for reading!