1. Overview
This document mainly explains the use of Sqoop, drawing primarily on the official Cloudera Sqoop documentation. To describe the meaning of each parameter more clearly, I have verified almost all of the parameter descriptions in this document myself.
2. Codegen
Codegen maps a relational database table to a Java source file, a Java class, and the related JAR package. This serves two main purposes:
1. Map the database table to a Java file whose attributes correspond to the fields of the table.
2. The generated JAR and class files are used by the metastore function.
Basic statement:
sqoop codegen --connect jdbc:mysql://localhost:3306/hive --username root --password 123456 --table TBLS2
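For illustration only, a variant that also controls where the generated code is placed might look like the following; the output paths and class name here are hypothetical, and the package name reuses the cn.cnnic example from the parameter descriptions below:
sqoop codegen --connect jdbc:mysql://localhost:3306/hive --username root --password 123456 --table TBLS2 --bindir /tmp/sqoop-codegen --outdir /tmp/sqoop-codegen/src --package-name cn.cnnic --class-name Tbls2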
Parameter description:
--bindir: the output path for the generated Java file, the compiled class file, and the JAR package that bundles them.
--class-name: the name of the generated Java class (and file).
--outdir: the storage path of the generated Java source file.
--package-name: the package name, such as cn.cnnic; the two-level directories cn and cnnic are created, and the generated files (such as the Java file) are placed in the cnnic directory.
--input-null-non-string: in the generated Java file, lets you replace the null string for non-string columns with a value of your choice (such as the empty string '').
--input-null-string: same as above, but for string columns; it is best to set it together with the option above and to the same value (such as the empty string).
--map-column-java: changes how database columns are mapped to Java types in the generated file. By default the Java type corresponds to the database type; for example, a bigint column maps to a Java long. Format: --map-column-java DB_ID=String,id=Integer
--null-non-string: in the generated Java file, code such as TBL_ID==null?"null":"" is produced; this option lets you substitute another value, e.g. ddd, giving TBL_ID==null?"ddd":""
--null-string: same as above; it is best to use it together with the option above and set the same value.
--table: the relational database table name; the attributes in the generated Java file correspond one to one to the fields of the table.

3. Create-hive-table
Generate HIVE tables corresponding to the table structure of relational database tables
Basic statement:
sqoop create-hive-table --connect jdbc:mysql://localhost:3306/hive --username root --password 123456 --table TBLS --hive-table h_tbls2
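As a purely illustrative variant using the options described below, the same statement can point at a non-default Hive installation and overwrite existing data in the Hive table (the Hive path here is hypothetical):
sqoop create-hive-table --connect jdbc:mysql://localhost:3306/hive --username root --password 123456 --table TBLS --hive-table h_tbls2 --hive-home /usr/local/hive --hive-overwrite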
Parameter description:
--hive-home: the Hive installation directory; this parameter overrides the default Hive directory.
--hive-overwrite: overwrites data that already exists in the Hive table.
--create-hive-table: defaults to false; if set and the target table already exists, the creation task fails.
--hive-table: followed by the name of the Hive table to create.
--table: specifies the relational database table name.

4. Eval
Eval lets you quickly run SQL statements against a relational database and shows the results on the console, so you can check in advance that the SQL statements you plan to use in an import are correct.
Query example:
sqoop eval --connect jdbc:mysql://localhost:3306/hive --username root --password 123456 --query "SELECT * FROM tbls LIMIT 10"
Example of data insertion:
sqoop eval --connect jdbc:mysql://localhost:3306/hive --username root --password 123456 -e "INSERT INTO TBLS2 VALUES (100, 1375170308, 0, 0, 0, 0, 0)"
Both the -e and --query parameters have been tested and work correctly, as with the query and insert SQL statements above.
5. Export
Exports data from HDFS into a relational database.
sqoop export --connect jdbc:mysql://localhost:3306/hive --username root --password 123456 --table TBLS2 --export-dir sqoop/test
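As an additional sketch (the key column below is an assumption), an export that updates rows already present in the target table and inserts the rest might use the --update-key and --update-mode options described next:
sqoop export --connect jdbc:mysql://localhost:3306/hive --username root --password 123456 --table TBLS2 --export-dir sqoop/test --update-key TBL_ID --update-mode allowinsert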
Parameter description:
--direct: quick mode; uses a database-native import tool such as MySQL's mysqlimport, which can load data into the relational database more efficiently than a JDBC connection.
--export-dir: the HDFS source directory where the data to be exported is stored.
-m, --num-mappers: starts N map tasks to export data in parallel. The default is 4; it is best not to set this higher than the maximum number of map tasks in the cluster.
--table: the relational database table to export into.
--update-key: followed by the name of the key column; with this parameter, data that already exists in the relational database is updated, similar to an UPDATE operation in the database.
--update-mode: the update mode; the two values are updateonly (the default) and allowinsert. With updateonly, a record whose key is not already present in the table is not written; for example, if the HDFS data to be exported contains a record with id=1 but the table only contains id=2, that record is not applied, whereas allowinsert inserts such records as new rows.
--input-null-string: optional parameter; if not specified, the string "null" is used.
--input-null-non-string: optional parameter; if not specified, the string "null" is used.
--staging-table: used to guarantee transaction safety when exporting into a relational database table. An export may consist of multiple transactions, so one failed transaction could otherwise affect the others and leave erroneous or duplicate records in the target table; this parameter avoids that. Create a staging table with the same structure as the target table and make sure it is empty before running the export; all transactions write their results to the staging table, and the staging table then writes its contents to the target table in a single transaction.
--clear-staging-table: if the staging table is not empty, this parameter clears its data before the export runs.
--batch: use batch mode to execute the underlying statements (the exact meaning is not yet clear to me).

6. Import
Import the data from the database table into hive, and if there is no corresponding table in hive, a table with the same name as the database table is automatically generated.
sqoop import --connect jdbc:mysql://localhost:3306/hive --username root --password 123456 --table user --split-by id --hive-import
--split-by specifies the name of the primary key field of the database table, in this case id.
Parameter description:
--append: appends the data to a dataset that already exists in HDFS. With this parameter, Sqoop first imports the data into a temporary directory and then renames the files into the target directory, avoiding conflicts with file names that already exist there.
--as-avrodatafile: imports the data as Avro data files.
--as-sequencefile: imports the data as SequenceFiles.
--as-textfile: imports the data as plain text files; once the text files are generated, the results can be queried in Hive with SQL statements.
--boundary-query: boundary query; a result set is obtained with a SQL query before the import, and the imported data is the data in that result set. For example, --boundary-query 'select id,creationdate from person where id=3' imports the record with id=3; a query of the form select min(), max() from ... can also be used. Note that the queried columns cannot have a string data type, otherwise the error java.sql.SQLException: Invalid value for getLong() is reported (the cause of this problem is still unknown).
--columns: specifies the fields to import, in a format such as --columns id,username
--direct: direct import mode, using the relational database's own import/export tool; according to the official site this is faster.
--direct-split-size: on top of --direct, splits the imported stream into chunks of the given number of bytes; in particular, when importing from PostgreSQL in direct mode, a file that reaches the set size is split into several separate files.
--inline-lob-limit: sets the maximum size for inlined large object (LOB) data types.
-m, --num-mappers: starts N map tasks to import data in parallel. The default is 4; it is best not to set this higher than the number of nodes in the cluster.
--query, -e: imports data from the result of a query. When this parameter is used, --target-dir and --hive-table must be specified, the query must contain a WHERE clause, and the condition must include $CONDITIONS. Example: --query 'select * from person where $CONDITIONS' --target-dir /user/hive/warehouse/person --hive-table person
--split-by: the table column used to split the units of work, usually the primary key ID.
--table: the relational database table name from which the data is read.
--target-dir: specifies the HDFS target path.
--warehouse-dir: cannot be used together with --target-dir; specifies the parent storage directory for the imported data, suitable for importing to HDFS but not for importing into a Hive directory.
--where: the query condition used when importing data from the relational database. Example: --where 'id = 2'
-z, --compress: the data is not compressed by default; with this parameter the data is compressed with the gzip algorithm, which works with SequenceFiles, text files, and Avro files.
--compression-codec: the Hadoop compression codec; the default is gzip.
--null-string: optional parameter; if not specified, the string "null" is used.
--null-non-string: optional parameter; if not specified, the string "null" is used.

Incremental import parameter description:
--check-column (col): the column used to decide which rows to import, such as id.
--incremental (mode): append appends rows whose check column is greater than the value specified by --last-value; lastmodified appends rows modified after the date specified by --last-value.
--last-value (value): the maximum value of the check column from the previous import (rows with greater values are imported); you can also set a value yourself.
For the --incremental parameter, use lastmodified when a date is the basis for appending the import; otherwise use append, as in the example below.
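A minimal sketch of an append-mode incremental import, assuming the same user table as above and that only rows with id greater than 1 should be fetched:
sqoop import --connect jdbc:mysql://localhost:3306/hive --username root --password 123456 --table user --split-by id --check-column id --incremental append --last-value 1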
7. Import-all-tables
Import all the tables in the database into HDFS, and each table corresponds to a separate directory in hdfs.
sqoop import-all-tables --connect jdbc:mysql://localhost:3306/test
sqoop import-all-tables --connect jdbc:mysql://localhost:3306/test --hive-import
Parameter description:
--as-avrodatafile: same as the import parameter.
--as-sequencefile: same as the import parameter.
--as-textfile: same as the import parameter.
--direct: same as the import parameter.
--direct-split-size: same as the import parameter.
--inline-lob-limit: same as the import parameter.
-m, --num-mappers: same as the import parameter.
--warehouse-dir: same as the import parameter.
-z, --compress: same as the import parameter.
--compression-codec: same as the import parameter.

8. Job
Creates a saved Sqoop job; after it is created, the job is not executed unless the exec command is used to run it.
sqoop job
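As a minimal sketch of the typical workflow (the job name myjob is hypothetical), a job is first created, optionally listed, and then executed:
sqoop job --create myjob -- import --connect jdbc:mysql://localhost:3306/test --table Person
sqoop job --list
sqoop job --exec myjob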
Parameter description:
--create: creates a job. Example: sqoop job --create myjob -- import --connect jdbc:mysql://localhost:3306/test --table Person
--delete: deletes a job. Example: sqoop job --delete myjob
--exec: executes a job. Example: sqoop job --exec myjob
--help: displays help information.
--list: lists all jobs. Example: sqoop job --list
--meta-connect: used to connect to a metastore service. Example: --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop
--show: shows the parameters of a job. Example: sqoop job --show myjob
--verbose: prints details while the command runs.

9. List-databases
Print out all the database names of the relational database
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password 123456
10. List-tables
Print out all the table names of a relational database
sqoop list-tables --connect jdbc:mysql://localhost:3306/zihou --username root --password 123456
11. Merge
Combine the data under different directories in HDFS and store them in the specified directory, for example:
sqoop merge --new-data /test/p1/person --onto /test/p2/person --target-dir /test/merged --jar-file /opt/data/sqoop/person/Person.jar --class-name Person --merge-key id
Where the class name specified by --class-name corresponds to the Person class in Person.jar, and Person.jar is generated by Codegen.
Parameter description:
--new-data: an HDFS directory holding data whose contents should be kept after the merge; in principle, this parameter points to the directory with the newer data.
--onto: an HDFS directory holding data that should be replaced by the newer data after the merge; in principle, this parameter points to the directory with the older data.
--merge-key: the merge key, usually the primary key ID.
--jar-file: the JAR introduced for the merge; this JAR is generated by the Codegen tool.
--class-name: the class, contained in the JAR generated by Codegen, that corresponds to the table name or object name.
--target-dir: the HDFS directory where the merged data is stored.

12. Metastore
Records the metadata of Sqoop jobs. If no metastore instance is started, the default metadata storage directory is ~/.sqoop; to change the storage directory, modify the configuration file sqoop-site.xml.
Metastore instance startup: sqoop metastore
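Purely as an illustration (the host and port reuse the --meta-connect example from the Job section), jobs stored in a running metastore can then be listed by pointing the job tool at it:
sqoop job --list --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop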
Parameter description:
--shutdown: shuts down a running metastore instance.

13. Version
Display sqoop version information
Statement: sqoop version
14. Help
Print sqoop help information
Statement: sqoop help
15. Common parameters
Hive parameter description:
--hive-delims-replacement: replaces characters such as \n, \r, and \01 in the data with a custom string.
--hive-drop-import-delims: removes characters such as \n, \r, and \01 from the data when importing into Hive.
--map-column-hive: changes the data type of the generated columns when the Hive table is created. Format: --map-column-hive TBL_ID=String,LAST_ACCESS_TIME=string
--hive-partition-key: creates a partition, followed directly by the partition name; after creation the partition can be seen with describe <table name>. The default type is string.
--hive-partition-value: when importing data into Hive, the value for the key set by --hive-partition-key.
--hive-home: the Hive installation directory; this parameter overrides the default Hive directory.
--hive-import: imports data from the relational database into a Hive table.
--hive-overwrite: overwrites data that already exists in the Hive table.
--create-hive-table: defaults to false; if set and the target table already exists, the creation task fails.
--hive-table: followed by the name of the Hive table to create.
--table: specifies the relational database table name.

Database connection parameter description:
--connect: the JDBC connection URL, for example --connect jdbc:mysql://localhost:3306/hive
--connection-manager: specifies the connection manager class to use.
--driver: the database driver class.
--hadoop-home: the Hadoop root directory.
--help: prints help information.
-P: reads the password from the console.
--password: the database connection password from the JDBC URL.
--username: the database connection user name from the JDBC URL.
--verbose: prints details on the console.
--connection-param-file: a file recording the database connection parameters.

File output parameters
Used for import scenarios.
For example:
sqoop import --connect jdbc:mysql://localhost:3306/test --username root -P --table person --split-by id --check-column id --incremental append --last-value 1 --enclosed-by '\"' --escaped-by \\ --fields-terminated-by .
Parameter description:
--enclosed-by: adds the specified character before and after each field value, such as double quotation marks. Example: --enclosed-by '\"'. Displayed as: "3","jimsss","dd@dd.com"
--escaped-by: escapes double quotation marks. For example, if the field value is "test", then after --escaped-by \\ the value shown in HDFS is \"test\"; it has no effect on single quotation marks.
--fields-terminated-by: sets the symbol each field ends with; the default is a comma, and it can be changed to another symbol such as a period. Example: --fields-terminated-by .
--lines-terminated-by: sets the delimiter between record lines; the default is a newline, but you can set any string you want. Example: --lines-terminated-by '#' separates lines with a # sign.
--mysql-delimiters: MySQL's default delimiter settings: fields separated by commas, lines separated by newlines, the default escape character \, and field values enclosed in single quotation marks.
--optionally-enclosed-by: --enclosed-by forces the specified symbol around every field value, whereas --optionally-enclosed-by only adds the symbol around field values that contain double or single quotation marks, so it is optional. Example: --optionally-enclosed-by '$'
Display the results:
$"hehe", test $
File input parameters
Parsing of the data format, used in export scenarios, corresponding to file output parameters.
For example:
sqoop export --connect jdbc:mysql://localhost:3306/test --username root --password 123456 --table person2 --export-dir /user/hadoop/person --staging-table person3 --clear-staging-table --input-fields-terminated-by ','
The data in HDFS has a certain format; when exporting such data into a relational database, the field values must be parsed according to that format. For example, HDFS contains data in this format:
3,jimsss,dd@dd.com,1,2013-08-07 16:00:48.0,"hehe",test
The above fields are separated by commas, so when parsing, you must parse each field value with a comma, such as:
--input-fields-terminated-by ','
Parameter description:
--input-enclosed-by: parses field values surrounded before and after by the specified character, such as double quotation marks. Data example: "3","jimsss","dd@dd.com"
--input-escaped-by: parses field values containing escaped double quotation marks. For example, if the field value is \"test\", then after --input-escaped-by \\ the parsed value is "test"; it has no effect on single quotation marks.
--input-fields-terminated-by: parses field values using the given delimiter between fields. Example: --input-fields-terminated-by ,
--input-lines-terminated-by: parses records using the given delimiter between record lines. Example: --input-lines-terminated-by '#' separates records with a # sign.
--input-optionally-enclosed-by: similar in function to --input-enclosed-by; for the difference from --input-enclosed-by, see the description of --optionally-enclosed-by in the file output parameters.