
How to import data from a traditional relational database into Hadoop


This article introduces how to import data from a traditional relational database into Hadoop. It should serve as a useful reference; I hope you gain a lot from reading it.

Most enterprises keep their critical data in OLTP databases, and the data stored there includes information about users, products, and other useful entities. If you want to analyze this data, the traditional approach is to periodically copy it into an OLAP data warehouse. Hadoop has emerged in this space and plays two roles: a substitute for the data warehouse, and a bridge between structured and unstructured data and the data warehouse. Figure 5.8 shows the first role, where Hadoop is used as a large-scale join and aggregation tool before data is loaded into OLAP systems, a common platform for BI applications.

Figure 5.8 using Hadoop for OLAP data input, output and processing

Take Facebook as an example: the company has successfully used Hadoop and Hive as an OLAP platform to process petabytes of data. Figure 5.9 shows an architecture similar to Facebook's. This architecture also includes a feedback loop to the OLTP system, which can be used to push insights discovered in Hadoop back to users, for example as recommendations.

In either usage model, we need a way to bring relational data into Hadoop and to export it back into a relational database. In this section, we will use Sqoop to simplify the process of moving relational data into Hadoop.

Practice: import data from MySQL using Sqoop

Sqoop is a project for importing relational data into Hadoop and exporting it back out. It is a useful high-level tool that encapsulates the logic of moving relational data into Hadoop: all we need to do is supply Sqoop with the SQL query that determines which data to export. This technique provides details on how to use Sqoop to move some data from MySQL to HDFS.

Figure 5.9 OLAP using Hadoop and feedback to the OLTP system

This section uses Sqoop version 1.4.4; the code and scripts used in this technique may not work with other versions of Sqoop, especially Sqoop 2, which is implemented as a web application.

Problem

Load relational data into the cluster and ensure that writes are valid and idempotent.

Solution

In this technique, we will see how to use Sqoop as a simple mechanism for bringing relational data into a Hadoop cluster. We will walk through the process of importing data from MySQL into HDFS with Sqoop, as well as bulk imports using the fast connector (connectors are database-specific components that provide read and write access to the database).

Discussion

Sqoop is a relational database input and output system created by Cloudera and currently an Apache project.

When performing an import, Sqoop can write to HDFS, Hive, and HBase, and for exports it can do the reverse. An import is divided into two parts: connecting to the data source to gather statistics, and then running the MapReduce job that performs the actual import. Figure 5.10 shows these steps.

Figure 5.10 Sqoop Import: connect to the data source and use MapReduce

Sqoop has the concept of a connector, which contains the dedicated logic for reading from and writing to an external system. Sqoop provides two types of connectors: generic connectors for regular reads and writes, and fast connectors that use database-specific batch mechanisms for efficient imports. Figure 5.11 shows these two types of connectors and the databases they support.

Figure 5.11 Sqoop connector for reading and writing external systems

Before continuing, we need access to a MySQL database, and the MySQL JDBC JAR must be available. The following script creates the necessary MySQL user and schema and loads the data. It creates the hip_sqoop_user MySQL user and an sqoop_test database with three tables: stocks, stocks_export, and stocks_staging. It then loads the stock sample data into the stocks table. All of these steps are performed by running the following command:

Here is a quick look at what the script does:
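
The original listing is not reproduced here, but a minimal sketch of the kind of SQL such a script runs might look like the following (the column definitions and password are illustrative assumptions; only the user, database, and table names come from the description above, and the sample-data load is omitted):

CREATE DATABASE sqoop_test;
GRANT ALL PRIVILEGES ON sqoop_test.* TO 'hip_sqoop_user'@'%' IDENTIFIED BY 'password';

USE sqoop_test;
CREATE TABLE stocks (
  id          INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  symbol      VARCHAR(10),
  quote_date  DATE,
  open_price  DOUBLE,
  close_price DOUBLE,
  volume      INT
);
CREATE TABLE stocks_export  LIKE stocks;
CREATE TABLE stocks_staging LIKE stocks;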

The first Sqoop command is a basic import, in which you specify the connection information for the MySQL database and the table to export:
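
The command itself is not shown above; a minimal sketch of a basic import, assuming the MySQL server runs on a host named mysql-server (a placeholder) and using the user and database created earlier, is:

$ sqoop import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user \
    --password password \
    --table stocks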

MySQL table names

MySQL table names on Linux are case-sensitive; make sure that the table name supplied in the Sqoop command uses the correct case.

By default, Sqoop uses the table name as the destination directory in HDFS for the MapReduce job that performs the import. If you run the same command again, the MapReduce job will fail because the directory already exists.

Let's take a look at the stocks directory in HDFS:
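
Two standard HDFS commands are enough to inspect the result; the part-m-NNNNN file names shown here are the usual MapReduce map-task output naming, not output reproduced from the original listing:

$ hadoop fs -ls stocks
$ hadoop fs -cat stocks/part-m-00000 | head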

Import data format

Sqoop has imported the data as comma-delimited text files. It supports a number of other file formats, which can be selected using the parameters listed in Table 5.6.

Table 5.6 Sqoop parameters that control the import file format

If you are importing large amounts of data, you may want to use a file format such as Avro, which is a compact data format that can be used in conjunction with compression. The following example uses the Snappy compression codec together with Avro files. It also uses the --target-dir option to write the output to a directory different from the table name, and uses the --where option to specify that only a subset of rows should be imported. You can use --columns to specify which columns to extract:
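
A sketch of such a command follows; the target directory, WHERE predicate, and column list are illustrative, not taken from the original listing:

$ sqoop import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user --password password \
    --table stocks \
    --as-avrodatafile \
    --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
    --target-dir stocks-avro \
    --where "symbol = 'AAPL'" \
    --columns "symbol,quote_date,close_price"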

Note that the compression codec specified on the command line must be defined in the core-site.xml configuration file under the io.compression.codecs property. The Snappy compression codec also requires the Hadoop native libraries to be installed. For more details on compression settings and configuration, see Chapter 4 (link at the end of the article).

You can use the AvroDump tool to look at the structure of the Avro files and understand how Sqoop lays out its records. Sqoop uses Avro's GenericRecord for record-level storage (see Chapter 3 for more information; link at the end of the article). If you run AvroDump against a Sqoop-generated file in HDFS, you will see the following:

Use Sqoop with SequenceFiles

One reason SequenceFiles are difficult to work with is that there is no generic way to access the data in a SequenceFile; you must have access to the Writable class that was used to write the data. In Sqoop's case, that class is code-generated, which introduces a major problem: if you move to a newer version of Sqoop and that version modifies the code generator, classes generated by the old version may not work with SequenceFiles produced by the new one. You would need to either migrate all of your old SequenceFiles to the new version or keep code around that can work with the different versions of these SequenceFiles. Because of this limitation, using SequenceFiles with Sqoop is not recommended. If you are looking for more information about how SequenceFiles work, run the Sqoop import tool and look at the stocks.java file generated in the working directory.

You can go a step further and specify the entire query using the --query option, as follows:
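
A sketch of a free-form query import is shown below. With --query, Sqoop requires the literal $CONDITIONS token in the WHERE clause (each mapper's split predicate is substituted there), and you must also supply --target-dir, plus --split-by unless you run a single mapper. The column names here are assumptions about the schema:

$ sqoop import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user --password password \
    --query 'SELECT * FROM stocks WHERE symbol = "AAPL" AND $CONDITIONS' \
    --split-by id \
    --target-dir stocks-query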

Protecting passwords

So far, we have been using plaintext passwords on the command line, which is a security vulnerability because other users on the host can easily list running processes and view passwords. Fortunately, Sqoop has some mechanisms that can be used to avoid password disclosure.

The first method is to use the -P option, which causes Sqoop to prompt for the password. This is the safest method, because no password needs to be stored, but it means the Sqoop command cannot be executed automatically.

The second method is to use the --password-file option, with which you specify a file that contains the password. Note that this file must exist on the configured file system (usually HDFS), not on the local disk of the Sqoop client. You will want to lock the file down so that only you have read access to it. This is still not the most secure option, because root users on the file system can still snoop on the file, and even non-root users may be able to access it unless you are running Hadoop with a higher security level.
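
A sketch of the password-file approach; the HDFS path is a placeholder:

$ echo -n "password" > sqoop.password
$ hadoop fs -put sqoop.password /user/hip/sqoop.password
$ hadoop fs -chmod 400 /user/hip/sqoop.password
$ sqoop import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user \
    --password-file /user/hip/sqoop.password \
    --table stocks

The echo -n avoids a trailing newline ending up in the password file.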

The last option is to use an options file. Create a file called ~/.sqoop-import-opts:
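
The file contents are not shown above; an options file simply lists one option or value per line, so it might look something like this (the connection details are placeholders):

import
--connect
jdbc:mysql://mysql-server/sqoop_test
--username
hip_sqoop_user
--password
password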

Don't forget to lock the file to prevent users from snooping:
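
For example:

$ chmod 600 ~/.sqoop-import-opts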

We can then provide this file name to the Sqoop job with the --options-file option, and Sqoop will read the options specified in the file, which means there is no need to provide them on the command line:
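
A sketch of the invocation; because the tool name (import) is already in the file, only the remaining options need to follow:

$ sqoop --options-file ~/.sqoop-import-opts --table stocks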

Data split

How does Sqoop parallelize imports across multiple mappers? In figure 5.10, I showed how Sqoop's first step is to extract metadata from the database. It inspects the imported table to determine the primary key and runs a query to determine the lower and upper bounds of the data in the table (see figure 5.12). Sqoop assumes that the data is roughly evenly distributed between the minimum and maximum keys, because it splits the delta (the range between the minimum and maximum keys) evenly across the number of mappers. Each mapper is then given a unique query covering its range of primary keys.
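
As a concrete illustration, assuming the primary key column is named id (an assumption about the schema), four mappers, and a boundary query that reports ids 1 through 1,000,000, the preprocessing query and per-mapper queries would look roughly like this:

SELECT MIN(id), MAX(id) FROM stocks

SELECT * FROM stocks WHERE id >= 1      AND id < 250001
SELECT * FROM stocks WHERE id >= 250001 AND id < 500001
SELECT * FROM stocks WHERE id >= 500001 AND id < 750001
SELECT * FROM stocks WHERE id >= 750001 AND id <= 1000000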

Figure 5.12 The Sqoop preprocessing steps that determine the query splits

We can configure Sqoop to use a non-primary-key column with the --split-by parameter, which is useful when the values between the minimum and maximum are not evenly distributed. For large tables, however, note that the column specified in --split-by should be indexed to ensure optimal import times, and the --boundary-query parameter can be used to construct an alternative query for determining the minimum and maximum values.
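
A sketch showing both parameters together; the volume column used for splitting is an assumption about the schema:

$ sqoop import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user --password password \
    --table stocks \
    --split-by volume \
    --boundary-query "SELECT MIN(volume), MAX(volume) FROM stocks"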

Incremental import

Sqoop supports two types of incremental import: append, for numeric data that increases over time, such as auto-increment keys; and lastmodified, for timestamped data. In both cases, you need to specify the column with --check-column, the mode with the --incremental parameter (the value must be append or lastmodified), and the actual value used to determine the incremental changes with --last-value.

For example, if you want to import stock data that has been updated since January 1, 2005, do the following:
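
A sketch of such an incremental import; the quote_date check column is an assumption about the schema:

$ sqoop import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user --password password \
    --table stocks \
    --incremental lastmodified \
    --check-column quote_date \
    --last-value "2005-01-01"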

Assuming that another system continues to write to the table, you can use the --last-value output of this job as the input to a subsequent Sqoop job, so that only rows newer than that date are imported.

Sqoop jobs and Metastore

You can see the last value of the incremental column in the command output. How can you best automate a process that can reuse this value? Sqoop has the concept of a job, which can save this information and reuse it in subsequent executions:
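
The command is not shown above; creating a saved job looks like the following (the bare -- separates the job options from the tool definition). The job name stock_increment matches the one deleted later in this section; the check column is an assumption about the schema:

$ sqoop job --create stock_increment -- import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user -P \
    --table stocks \
    --incremental lastmodified \
    --check-column quote_date \
    --last-value "2005-01-01"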

Executing the above command creates a named job in the Sqoop Metastore, which keeps track of all jobs. By default, the Metastore is stored in your home directory under .sqoop and is only used for your own jobs. If you want to share jobs across users and teams, you need to install a JDBC-compliant database for Sqoop's Metastore and specify its location with the --meta-connect parameter when issuing job commands.

The job creation command executed in the previous example does nothing other than add the job to the Metastore. To run the job, you need to execute it explicitly, as follows:
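
For example (--show displays the job's saved parameters, including the last value of the incremental column):

$ sqoop job --exec stock_increment
$ sqoop job --show stock_increment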

The metadata displayed by the show parameter includes the last value of the incremental column. This is actually the time the command was executed, not the last value in the table. If you are using this feature, make sure the clocks of the database server and of any clients that interact with it (including the Sqoop client) are synchronized using the Network Time Protocol (NTP).

Sqoop prompts for a password when you run the job. To run it from an automated script, you need to use Expect (a Linux automation tool) to supply the password from a local file when the Sqoop prompt is detected. An Expect script that works with Sqoop is available on GitHub at https://github.com/alexholmes/hadoop-book/blob/master/bin/sqoop-job.exp.

The Sqoop job can also be deleted, as shown below:

$ sqoop job --delete stock_increment

Fast MySQL Import

What if you want to bypass JDBC entirely and use the fast MySQL Sqoop connector for high-throughput loads into HDFS? This approach uses the mysqldump utility that ships with MySQL to perform the load. You must make sure that mysqldump is on the path of the user running the MapReduce job. To enable the fast connector, you must specify the --direct parameter:
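
A sketch of a direct-mode import:

$ sqoop import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user --password password \
    --table stocks \
    --direct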

What are the disadvantages of the fast connector? The fast connector only works with text output files; specifying Avro or SequenceFile as the import output format will not work.

Import to Hive

The final step in this technique is to import the data into the Hive table using Sqoop. The only difference between the HDFS import and the Hive import is that the Hive import has a post-processing step in which the Hive table is created and loaded, as shown in figure 5.13.

Figure 5.13 Sqoop Hive Import event sequence

When data is loaded into Hive from an HDFS file or directory, as in the case of a Sqoop Hive import (step 4 in the figure), Hive moves the directory into its warehouse rather than copying the data (step 5), for efficiency. After the import, the HDFS directory that the Sqoop MapReduce job wrote to will no longer exist.

The Hive import is triggered by the --hive-import parameter. Like the fast connector, this option is not compatible with the --as-avrodatafile and --as-sequencefile options:
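
A sketch of a basic Hive import:

$ sqoop import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user --password password \
    --table stocks \
    --hive-import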

Importing strings that contain Hive delimiters

Downstream processing problems may occur if you import columns that can contain any of the Hive delimiters (the \n, \r, and \01 characters). In this case, there are two options: specify --hive-drop-import-delims, which strips the conflicting characters during the import, or specify --hive-delims-replacement, which replaces them with a different character.

If the Hive table already exists, the data will be appended to it. If this is not the desired behavior, you can use the --hive-overwrite parameter to indicate that the existing table should be replaced with the imported data. Sqoop currently only supports text output for Hive, so the LZOP compression codec is the best choice because it can be split in Hadoop (see Chapter 4 for details). The following example shows how to use --hive-overwrite together with LZOP compression. To do this, LZOP must be built and installed on the cluster, because it is not bundled with Hadoop (or CDH) by default. For more information, see Chapter 4 (link at the end of the article):
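
A sketch of the combination, assuming the LZOP codec has already been installed on the cluster (the codec class name is the one shipped with the hadoop-lzo project):

$ sqoop import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user --password password \
    --table stocks \
    --hive-import \
    --hive-overwrite \
    --compress \
    --compression-codec com.hadoop.compression.lzo.LzopCodec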

Finally, we can use the --hive-partition-key and --hive-partition-value parameters to create different Hive partitions based on the value of a column being imported. For example, if you want to partition the input by stock name, do the following:
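
A sketch of a partitioned import; the column list is an assumption about the schema, and the partition key column (symbol) is excluded from the imported columns because Sqoop does not allow it to appear in both places:

$ sqoop import \
    --connect jdbc:mysql://mysql-server/sqoop_test \
    --username hip_sqoop_user --password password \
    --table stocks \
    --where "symbol = 'AAPL'" \
    --columns "id,quote_date,close_price" \
    --hive-import \
    --hive-partition-key symbol \
    --hive-partition-value "AAPL"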

Now, the previous example is by no means optimal. Ideally, a single import would create multiple Hive partitions. Because you are limited to specifying a single key and value, you need to run one import per unique partition value, which is laborious. It is better to import into an unpartitioned Hive table and then create the partitions on the table after the load.

In addition, you must take care that the SQL query supplied to Sqoop filters the results so that only rows matching the partition are included. In other words, it would be useful if Sqoop itself updated the WHERE clause with symbol = "AAPL", but it does not.

Continuous Sqoop execution

If you need to run imports into HDFS on a regular schedule, Oozie offers Sqoop integration that lets you perform imports and exports periodically. An example Oozie workflow.xml is as follows:
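
The listing is not reproduced above; a minimal sketch of a workflow with a sqoop action might look like the following (schema versions, host names, and paths are placeholders):

<workflow-app xmlns="uri:oozie:workflow:0.4" name="sqoop-import-wf">
  <start to="sqoop-import"/>
  <action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect jdbc:mysql://mysql-server/sqoop_test --username hip_sqoop_user --password-file /user/hip/sqoop.password --table stocks --target-dir /data/stocks</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Sqoop import failed</message>
  </kill>
  <end name="end"/>
</workflow-app>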

Single and double quotes are not supported inside the <command> element, so if you need to specify parameters that contain spaces, you need to use <arg> elements instead:
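
For example, a --where predicate containing spaces can be passed as separate <arg> elements, one per option or value (again, host names and paths are placeholders):

<sqoop xmlns="uri:oozie:sqoop-action:0.2">
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <arg>import</arg>
  <arg>--connect</arg>
  <arg>jdbc:mysql://mysql-server/sqoop_test</arg>
  <arg>--username</arg>
  <arg>hip_sqoop_user</arg>
  <arg>--table</arg>
  <arg>stocks</arg>
  <arg>--where</arg>
  <arg>symbol = 'AAPL'</arg>
  <arg>--target-dir</arg>
  <arg>/data/stocks</arg>
</sqoop>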

Another consideration when using Sqoop from Oozie is that the JDBC driver JAR must be made available to Oozie. We can either copy the JAR into the workflow's lib/ directory or add it to the lib directory of the Hadoop installation.

Thank you for reading this article carefully. I hope "How to import data from a traditional relational database into Hadoop" has been helpful to you.
