
Data Migration Experience


A few days ago I went on a business trip to a customer site to help migrate data. After a few days of hard work, the migration finally runs automatically and can be executed in daily batches. Here I'd like to share the pitfalls I hit along the way (some of which probably just come down to a lack of hands-on experience).

First of all, allow me a small complaint: working in an unfamiliar development environment is genuinely painful. The machine I was given had no IDE, not even Java installed, and no tool to connect to the database. The only thing to be thankful for was that Xshell was there, although it didn't match my personal shortcut preferences. There was nothing for it: to work efficiently I had to configure everything myself. Setting up the development environment alone cost me the entire morning, but fortunately things sped up noticeably afterwards. After lunch it was time to get down to business.

I received the migration task in the afternoon and found that the source library contained a great many tables, large and small. I fell silent for a few seconds, feeling I would be stuck here for the week. During those few seconds I also thought through the migration process. The data lives in a relational database, so we first upload it to HDFS with sqoop, then map the files with Hive external tables, and finally build internal tables that hold a complete, indexed copy of the data. The internal tables are generally partitioned, bucketed, indexed ORC tables, whose query speed is clearly much faster than the external tables. Below I break the steps down one by one.

Preliminary preparation:

When we take over data from a database, we first work out which tables are large and which are small, separate them, and migrate them in different ways. Customers usually provide the row count of each table; if not, a simple select count(*) from table; tells us. This not only makes it easy to separate large tables from small ones, it is also a great help for the later data verification.
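If the customer cannot supply the counts and running count(*) on every table is too slow for a first pass, a rough overview can also be pulled from the source metadata. A minimal sketch, assuming a MySQL source and that the approximate counts in information_schema are good enough for sizing (exact verification still uses count(*)); 'your_db' is a placeholder:

-- approximate row counts per table, largest first
select TABLE_NAME, TABLE_ROWS
from information_schema.TABLES
where TABLE_SCHEMA = 'your_db'
order by TABLE_ROWS desc;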

1. Migrate data from a relational database to HDFS

For the migration I used sqoop. Although sqoop is relatively slow, the learning cost is low, it is easy to generate statements in batches, and the demands on development are modest. First, test whether a single sqoop command can migrate data successfully, then write a script that generates sqoop statements in batches, and finally run those statements in parallel in the background. Here are some tips for using sqoop:

If cluster resources are sufficient (a new cluster usually has no production jobs yet), the parameter -m can be set to a value greater than 1 in the sqoop statement, which means several map tasks extract the data in parallel. Of course, whenever -m is set greater than 1, also set --split-by to tell sqoop which field of the table to split on for the parallel map tasks.

For --split-by, try to pick a well-distributed field in the table, so that each map task extracts roughly the same amount of data.

If a table is particularly large, say more than 100 million rows, we need to split it across multiple sqoop tasks:

--query: specify a where condition so that each task extracts only part of the data.

The advantage is this: if there is only one sqoop task and it dies after extracting 90% of the data, the whole extraction fails, which wastes time and leaves no data landed. Split into multiple sqoop tasks, the work can run in parallel and each task handles a modest amount of data; if one task fails, only the data under that where condition needs to be re-extracted, which also makes troubleshooting much easier.

The choice of the split condition also matters. A date field is generally used in the where clause, so that each sqoop task handles roughly the same amount of data.

Directory planning for the data sqoop extracts:
# small tables: /tmp/<library name>/<table name>
# large tables: /tmp/<library name>/<table name>/<partition name>
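Putting the tips above together, here is a minimal sqoop sketch for one date slice of a large table; the host, credentials, field names and date range are placeholders, and the delimiter should follow whatever the Hive external tables expect:

sqoop import \
  --connect jdbc:mysql://<host>:3306/db1 \
  --username <user> --password <password> \
  --query "select id, field1, field2 from big_table where create_date >= '2020-01-01' and create_date < '2020-02-01' and \$CONDITIONS" \
  --split-by id -m 4 \
  --target-dir /tmp/db1/big_table/2020-01 \
  --delete-target-dir \
  --fields-terminated-by '\001'

With --query, sqoop requires the literal $CONDITIONS token in the where clause and an explicit --target-dir; --split-by together with -m 4 spreads the slice across four map tasks.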

Query statement: when writing the query that extracts the data in the sqoop command, remember not to write:

-- ×

Select * from table

-- √

Select field1, field2, ... from table

Otherwise, it may cause sqoop extraction to slow down, and may even cause no data to be extracted.

Once we have the points above covered, we can write a script that generates the sqoop statement for each table in batches: from the library name and table name, read the table's metadata in the relational database and assemble the sqoop command. Finally, a task script runs these sqoop statements on a schedule; a sketch follows below.
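A minimal sketch of the batch generation, assuming a plain-text file table_list.txt with one "library table" pair per line; the connection details, delimiter and -m setting are placeholders, and the metadata-driven parts (column lists, split fields) are left out for brevity:

#!/bin/bash
# gen_sqoop.sh: write one background sqoop command per table into sqoop_jobs.sh
mkdir -p logs
while read -r db tbl; do
  printf 'nohup sqoop import --connect jdbc:mysql://<host>:3306/%s --username <user> --password <password> --table %s -m 1 --target-dir /tmp/%s/%s --delete-target-dir --fields-terminated-by "\\001" > logs/%s.log 2>&1 &\n' \
    "$db" "$tbl" "$db" "$tbl" "$tbl"
done < table_list.txt > sqoop_jobs.sh

Running sh sqoop_jobs.sh then launches the extractions in parallel in the background, and a crontab entry pointing at the task script covers the scheduled daily runs.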

Some real timings:

I have tested the difference here. When the data volume is fairly large, the time taken by multiple sqoop tasks versus a single sqoop task is roughly as follows.

Take 600 GB of data as an example:

- extracting in parallel with multiple sqoop tasks takes 3 to 4 hours;

- extracting with a single sqoop task takes more than 12 hours;

- a single sqoop task (-m 1) extracts 50 million rows in about 27 minutes.

2. Build the external table mapping

To put it plainly, this step maps the extracted files to external tables in Hive. It is not difficult; the main thing is to follow the customer's conventions, which may mean a dedicated database for external tables, or a unified naming scheme such as table name_ext. But whatever you do, do not write the create-table statements by hand: with more than a hundred tables, your sanity will not survive. Instead, use the relational database's metadata to generate the Hive DDL. Taking MySQL as an example:

-- run against information_schema; replace '<table name>' with the source table
SELECT CONCAT('create table ', TABLE_NAME, ' (',
              substring(column_info, 1, length(column_info) - 1),
              ') comment "', TABLE_COMMENT, '";') AS create_stmt
FROM (
    SELECT TABLE_NAME,
           TABLE_COMMENT,
           group_concat(CONCAT(COLUMN_NAME, ' ', DATA_TYPE, ' comment "', COLUMN_COMMENT, '",') SEPARATOR '') AS column_info
    FROM (
        SELECT t1.TABLE_NAME,
               CASE WHEN t2.TABLE_COMMENT IS NULL THEN t1.TABLE_NAME ELSE t2.TABLE_COMMENT END AS TABLE_COMMENT,
               t1.COLUMN_NAME,
               CASE WHEN t1.DATA_TYPE = 'varchar'   THEN 'string'
                    WHEN t1.DATA_TYPE = 'int'       THEN 'int'
                    WHEN t1.DATA_TYPE = 'tinyint'   THEN 'tinyint'
                    WHEN t1.DATA_TYPE = 'decimal'   THEN 'double'
                    WHEN t1.DATA_TYPE = 'datetime'  THEN 'string'
                    WHEN t1.DATA_TYPE = 'timestamp' THEN 'string'
                    WHEN t1.DATA_TYPE = 'float'     THEN 'double'
                    WHEN t1.DATA_TYPE = 'double'    THEN 'double'
                    WHEN t1.DATA_TYPE = 'bigint'    THEN 'bigint'
               END AS DATA_TYPE,
               CASE WHEN t1.COLUMN_COMMENT IS NULL THEN t1.COLUMN_NAME ELSE t1.COLUMN_COMMENT END AS COLUMN_COMMENT
        FROM information_schema.COLUMNS t1
        JOIN information_schema.TABLES t2 ON t1.TABLE_NAME = t2.TABLE_NAME
        WHERE t1.TABLE_NAME = '<table name>'
    ) t3
    GROUP BY TABLE_NAME, TABLE_COMMENT
) t4;

There are many examples like this online, so I will not go into more detail. Once the external tables are built, it is best to check the row counts and compare the data in the relational database with what is in Hive, to verify that sqoop did not lose any data.

Pit encountered:

When we extract a large table by partition, we cannot map it to an external table directly. We need to create a partitioned external table and map each extracted partition directory onto a partition, as sketched below.
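A minimal sketch of that mapping, assuming the large table was extracted under /tmp/db1/big_table/<partition> with '\001' as the field delimiter; the table, field and partition names are placeholders:

create external table big_table_ext (
  id     bigint,
  field1 string,
  field2 string
)
partitioned by (dt string)
row format delimited fields terminated by '\001'
location '/tmp/db1/big_table';

-- map each extracted directory onto a partition
alter table big_table_ext add partition (dt = '2020-01')
  location '/tmp/db1/big_table/2020-01';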

3. Build an efficient internal table

This step inserts the external tables' data into optimized internal tables with the same fields. The internal table is generally partitioned, bucketed, indexed, or backed by flash storage; in any case it is the table that makes queries dramatically faster, also known as the business table.

Here I used a flash-based table structure developed inside the company for efficient queries. Next, how to choose the bucketing field. The advantages of bucketing are:

(1) Higher query-processing efficiency. Buckets add an extra structure to the table that Hive can exploit for certain queries. In particular, joining two tables that are both bucketed on the join column can be implemented efficiently as a map-side join: buckets holding the same column values can be joined against each other directly, which greatly reduces the amount of data involved in the JOIN.

(2) More efficient sampling. When working with a large data set, it is very convenient, while developing and refining queries, to be able to run them against a small slice of the data.

As for choosing the bucketing field: if the table has a primary key, use the primary key; if not, look for a few well-distributed fields and compare them with:

select count(distinct field) from table;

Pick the field with the most distinct values as the bucketing field. As for the number of buckets, a prime number is recommended: with a non-prime count the buckets can easily end up uneven. For example, with 9 buckets the field values 18 and 27 land in the same bucket, which can create bucket "hot spots".
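The in-house flash-based format is not public, so as a stand-in here is a minimal sketch of a partitioned, bucketed ORC internal table; the table, field and partition names are placeholders, and 7 buckets is just the prime-number example:

create table big_table (
  id     bigint,
  field1 string,
  field2 string
)
partitioned by (dt string)
clustered by (id) into 7 buckets
stored as orc;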

4. Insert from the external table into the internal table

This is the most time-consuming step after sqoop. Because the internal table is partitioned, this step uses Hive's dynamic partition insert. The insert statement generally looks like:

insert into table table_1 partition (par_field) select field1, field2, ..., par_field from table_2;

It is important to note here that the last field of select must be a partition field.
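Hive also needs its dynamic-partition settings switched on before such an insert; a minimal sketch using the placeholder tables from the earlier examples:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table big_table partition (dt)
select id, field1, field2, dt from big_table_ext;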

After the insert, we need to check whether all the data made it in, that is: row count of the external table = row count of the internal table = row count in the relational database.

5. Handling daily incremental data

Once the full migration is done, the external table really acts as a staging area: it only ferries data into the internal table. As long as the internal table holds a complete copy, the external data can be emptied, which is exactly why we put the external tables' location under /tmp. After clearing the external data, extract each day's incremental data into the external tables' location, then insert all of the external data into the internal table. This way the daily increment reliably lands in Hive; a sketch of the daily routine follows below.
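A minimal sketch of that daily routine, again with the placeholder names and paths from the earlier examples; a real script would loop over all tables and pull the column lists from metadata:

#!/bin/bash
# daily_increment.sh: clear the staging data, pull yesterday's increment, append it to the internal table
dt=$(date -d "1 day ago" +%Y-%m-%d)

# 1. empty the external table's staging directory under /tmp
hdfs dfs -rm -r -f "/tmp/db1/big_table/*"

# 2. extract yesterday's increment with sqoop
sqoop import \
  --connect jdbc:mysql://<host>:3306/db1 \
  --username <user> --password <password> \
  --query "select id, field1, field2, dt from big_table where dt = '${dt}' and \$CONDITIONS" \
  --split-by id -m 4 \
  --target-dir "/tmp/db1/big_table/${dt}" \
  --fields-terminated-by '\001'

# 3. register the new directory as a partition and append it to the internal table
hive -e "
  alter table big_table_ext add if not exists partition (dt='${dt}') location '/tmp/db1/big_table/${dt}';
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  insert into table big_table partition (dt)
  select id, field1, field2, dt from big_table_ext where dt='${dt}';
"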
