

Sqoop data import: single-table import, full-database import, job creation, and scheduled incremental import



This article explains in detail how to import data with sqoop: single-table import, full-database import, creating sqoop jobs, and implementing scheduled incremental import, with example analysis throughout. The content is shared here as a reference; I hope it gives you a solid understanding of the relevant topics after reading.

1. sqoop single-table import

Start with a simple example: import a single MySQL table into Hive with sqoop:

sqoop import \
--connect jdbc:mysql://192.168.49.214:3306/mysqlcdc \
--username root \
--password 123456 \
--table data \
--hive-import \
--fields-terminated-by '\t' \
-m 1

This is the simplest way to import a MySQL table into Hive: no Hive table name is specified, so the default database is used and the Hive table keeps the MySQL table name. Sqoop can also select just the data you want from multiple tables via a SQL statement, for example:

sqoop import \
--connect 'jdbc:sqlserver://192.168.49.180:1433;database=rcscounty_qn' \
--username sa \
--password 123456! \
--fields-terminated-by '\t' \
--hive-import \
--hive-table rcs.user_orgname \
--m 1 \
--query 'SELECT u.USER_ID as id, u.USER_NAME as name, u.ORG_ID as orgId, o.ORG_NAME as orgName FROM USER u, ORG o WHERE o.ORG_ID = u.ORG_ID AND $CONDITIONS'

This imports data from a SQL Server database through sqoop: --query selects the rows you want, and those rows are imported into Hive. $CONDITIONS is mandatory. When the query has its own conditions, join $CONDITIONS to them with AND; when there are no conditions, put $CONDITIONS in the WHERE clause by itself.
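For contrast, here is a minimal sketch of the no-condition case, reusing the connection above; the query and the Hive table name rcs.user_all are placeholders for illustration. $CONDITIONS simply stands alone in the WHERE clause:

sqoop import \
--connect 'jdbc:sqlserver://192.168.49.180:1433;database=rcscounty_qn' \
--username sa \
--password 123456! \
--fields-terminated-by '\t' \
--hive-import \
--hive-table rcs.user_all \
--m 1 \
--query 'SELECT u.USER_ID as id, u.USER_NAME as name FROM USER u WHERE $CONDITIONS'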

Importing data into hive via sqoop has the following characteristics:

1) The specified Hive table may or may not already exist. If it does not exist, it is created automatically. If the table exists but has no data, the data is imported; if the data format does not match, an error is reported. Adding --hive-overwrite overwrites the Hive table.

2) Hive tables created by sqoop can only be internal (managed) tables. Even if --target-dir specifies where the data should live in HDFS, only the folder is actually created there; the data itself is still placed under /user/hive/warehouse/ by default.

3) The separator specified by --fields-terminated-by must be the same for all imports into the same Hive table; otherwise the imported data ends up squeezed into a single column and cannot be split.

2. sqoop full-database import

sqoop import-all-tables "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:sqlserver://192.168.49.180:1433;database=rcscounty_qn' \
--username sa \
--password 123456 \
--fields-terminated-by '\t' \
--hive-database ods_quannan \
-m 10 \
--create-hive-table \
--hive-import \
--hive-overwrite \
--autoreset-to-one-mapper

This imports all tables in a database into one Hive database. If every table in the source database has a primary key, the --autoreset-to-one-mapper option on the last line is unnecessary.

If you need to rename each Hive table according to a rule, for example the source table is named table and you want the imported table to be called ods_table, sqoop cannot do this by itself; you need to write your own script around the import, as sketched below.
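A rough sketch of such a script, assuming a MySQL source and a hardcoded table list; the connection details, table list, and ods_ prefix below are illustrative:

#!/bin/bash
# Import each source table into Hive under a renamed table (ods_ prefix).
# Table list and connection details are placeholders for illustration.
TABLES="data blade_blog"

for t in $TABLES; do
  sqoop import \
    --connect jdbc:mysql://192.168.49.214:3306/mysqlcdc \
    --username root \
    --password 123456 \
    --table "$t" \
    --hive-import \
    --hive-table "ods_quannan.ods_${t}" \
    --fields-terminated-by '\t' \
    -m 1
done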

When I imported a whole database, the source had 133 tables but only 80 were actually imported. Perhaps something was wrong with my parameters; it happened across many runs. It is therefore recommended to check that the table count is correct after importing.
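A quick sanity check, assuming the Hive CLI is available on the machine, is to count the tables that actually landed in the target database and compare with the source count:

# Count tables imported into the target Hive database
hive -e 'USE ods_quannan; SHOW TABLES;' | wc -l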

3. sqoop incremental import

Because Hive tables have no primary key, existing rows cannot be updated in place; only newly inserted data can be added, which is what incremental import does.

There are two incremental import modes: append and lastmodified.

Incremental imports can only write to HDFS, not directly into Hive, so do not include --hive-import in the statement.

Append method:

sqoop import \
--connect 'jdbc:mysql://192.168.49.201:3307/blade?serverTimezone=UTC&zeroDateTimeBehavior=CONVERT_TO_NULL' \
--username root \
--password 123456 \
--table blade_blog \
--target-dir '/user/hive/warehouse/ods.db/blade_blog' \
--incremental append \
--check-column id \
--last-value 3 \
-m 1

Since you are writing data into HDFS, you need to specify the HDFS path with --target-dir. Without the time zone parameter you may get the error "Establishing SSL connection without server's identity verification is not recommended." However, if the time zone is added but set incorrectly, time-type values that arrive in Hive may be ahead of or behind the values in MySQL. Before setting it, check the time zone configuration of your own database; MySQL generally defaults to UTC.

The zeroDateTimeBehavior=CONVERT_TO_NULL parameter is in the same category as the time zone setting: without it you may not be able to connect to the MySQL database. If you can connect without it, you can leave it out.

This specifies the incremental mode as append, the check column as id, and the last value as 3, so rows with id greater than 3 (not including 3) are imported. Duplicate data is not merged, so if you run the command twice in a row you will see two rows with id 4.

The check column cannot be a character type; it must be numeric or a time type. For append mode the official recommendation is a numeric column; for time columns, lastmodified mode is recommended.

The lastmodified mode itself offers two ways to import increments: one does not merge duplicate data (--append) and the other merges duplicates (--merge-key). Examples of both follow.

Append mode:

sqoop import \
--connect 'jdbc:mysql://192.168.49.214:3306/mysqlcdc?serverTimezone=UTC&zeroDateTimeBehavior=CONVERT_TO_NULL' \
--username root \
--password 123456 \
--table data \
--target-dir '/user/hive/warehouse/data' \
--check-column last_mod \
--incremental lastmodified \
--last-value '2019-08-30 16:49:12' \
--m 1 \
--append

All rows whose last_mod column is greater than or equal to 2019-08-30 16:49:12 will be imported.

Merge-key method:

sqoop import \
--connect 'jdbc:mysql://192.168.49.214:3306/mysqlcdc?serverTimezone=CST&zeroDateTimeBehavior=CONVERT_TO_NULL' \
--username root \
--password 123456 \
--table data \
--target-dir '/user/hive/warehouse/data' \
--check-column last_mod \
--incremental lastmodified \
--fields-terminated-by ',' \
--last-value '2019-08-28 17:31:58' \
--m 1 \
--merge-key id

With --merge-key id, any rows with duplicate id values in the target data are merged, whether or not they were added by this incremental import.

Note that the actual time range of the imported data runs from the last-value you specify up to the moment the sqoop statement executes. For example, if last-value is 2019-08-28 17:31:58 and the statement runs at 2021-01-08 15:00:00, then a row in the database with a timestamp of 2022-02-04 12:31:21 will not be imported. The upper bound actually used can be seen in the printed log.

If the execution reports no errors and duplicate data is merged, but no data is updated or newly imported, it is recommended to check whether the HDFS file path is correct.
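For example, listing the target directory confirms whether the part files were actually written there:

# Check that the imported/merged files exist under the --target-dir used above
hdfs dfs -ls /user/hive/warehouse/data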

4. Create the incremental import as a job and set up a scheduled task

sqoop can save an import operation as a job to be executed later. Creating a scheduled task for that job then achieves scheduled incremental import.

Create a sqoop job:

sqoop job \
--create one.more.time \
-- import \
--connect 'jdbc:mysql://192.168.49.101:3307/maybe?serverTimezone=UTC&zeroDateTimeBehavior=CONVERT_TO_NULL' \
--username root \
--password 123456 \
--table blade_blog \
--target-dir '/user/hive/warehouse/ods.db/b_blog' \
--fields-terminated-by '\t' \
--m 1 \
--check-column update_time \
--incremental lastmodified \
--last-value '2020-01-16 15:34:01' \
--merge-key id

This creates a job named one.more.time.

View existing jobs:

sqoop job --list

Scheduled incremental import is performed by running the job. On the first run, last-value is the value you specified; after that, the job's execution time is recorded and used as the next last-value, so last-value is maintained automatically and repeated imports are avoided.
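You can confirm the value the job will use next with sqoop job --show, which prints the job's saved configuration; the recorded last-value appears among the listed properties:

# Inspect the saved job definition, including the stored last-value
sqoop job --show one.more.time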

Execute the job:

sqoop job --exec one.more.time

Create a timed task:

Check whether crontab is installed:

rpm -qa | grep crontab

If it is not installed, install crontabs; on CentOS: yum install crontabs.
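On CentOS 7 and later you may also need to make sure the cron daemon itself is enabled and running; a sketch assuming systemd:

# Install crontabs if missing, then enable and start the cron daemon
yum install -y crontabs
systemctl enable crond
systemctl start crond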

Write a shell script to execute the job.
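A minimal sketch of such a script, assuming the path /home/software/sqoop-script/maybe.sh used below and the job created above; the log file location is illustrative:

#!/bin/bash
# /home/software/sqoop-script/maybe.sh
# cron runs with a minimal environment, so load the usual profile first
source /etc/profile

# Execute the saved sqoop job and append all output to a log file for troubleshooting
sqoop job --exec one.more.time >> /home/software/sqoop-script/maybe.log 2>&1

Note that depending on how sqoop stores job metadata, --exec may prompt for the database password; for unattended runs the password has to be stored or supplied another way, for example via --password-file.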

Create a timed task for the current user:

crontab -e

and add the following line in the editor:

40 08 * * 1-5 sh /home/software/sqoop-script/maybe.sh

This runs maybe.sh at 08:40 every Monday through Friday. For more on crontab scheduling syntax, see the Linux basics material on scheduled tasks.

That is how scheduled incremental synchronization is set up.

That concludes the discussion of sqoop data import, full-database import, job creation, and scheduled incremental import. I hope the content above is helpful. If you found the article useful, feel free to share it so more people can see it.



