
Why copy data from MySQL to Redshift


This article mainly explains why data is copied from MySQL to Redshift. It is quite detailed and should be a useful reference; interested readers are encouraged to read on!

Why copy data from MySQL to Redshift

Many companies that use MySQL to support their Web applications choose Redshift for data analysis. There are several reasons why you should do the same:

Maintain application performance. As already mentioned, running analytical queries on a production MySQL database can seriously degrade its performance, and may even cause it to crash. Analytical queries are very resource-intensive and require dedicated computing power.

Analyze all your data. As an OLTP database, MySQL is designed for transaction data, such as customer records and financial data. However, you want to gain insight from the entire dataset, including non-transaction types. You can use Redshift to capture and analyze all your data in one place.

Faster analysis. Redshift is a massively parallel processing (MPP) data warehouse, which means it can process huge volumes of data in a very short time. MySQL, on the other hand, struggles to scale to the computing power needed for large, modern analytical queries. Even a MySQL replica database has difficulty matching Redshift's speed.

Scalability. MySQL was designed to run on a single-node instance, not on a modern distributed cloud infrastructure. Scaling beyond a single node therefore requires time- and resource-intensive techniques such as sharding or master-replica setups, all of which slow the database down further.

Four ways to copy MySQL to Redshift

Because of the inherent weaknesses of MySQL, many companies copy data to Redshift to meet their analytical needs. There are four ways to do this:

Import and export

Incremental SELECT and COPY

Change data capture (CDC) using binlog

ETL

1. Import and export

The easiest way to copy to Redshift is to export your entire MySQL dataset. However, it is also the least efficient method. There are three steps:

Export

Transform

Import

Export

First, export the data using MySQL's mysqldump command. A typical mysqldump command is as follows:

Shell:

$ mysqldump -h yourmysqlhost -u user mydatabase mytable1 mytable2 --result-file dump.sql

The output of this command is a set of MySQL SQL statements. You cannot run this SQL on Redshift as is; you must convert the statements into a format suitable for Redshift import.
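For illustration, a table definition in the dump file looks roughly like this (a sketch in mysqldump's usual style; the backticks, AUTO_INCREMENT, and the ENGINE/CHARSET clause are MySQL-specific syntax that Redshift will not accept):

SQL:

CREATE TABLE `users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `firstname` varchar(64) DEFAULT NULL,
  `lastname` varchar(64) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;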

Transform

For the best upload performance, convert your SQL statements to TSV (tab-separated values) format, upload the TSV files to Amazon S3, and then bulk-load them into the Redshift table with the COPY command.

For example, a row of data in a MySQL dump looks like this:

SQL:

mysql> INSERT INTO `users` (`id`, `firstname`, `lastname`, `age`) VALUES (1923, 'John', 'Smith', 34), (1925, 'Tommy', 'King', 29);

Converted to TSV, the same rows look like this:

1923	John	Smith	34
1925	Tommy	King	29

Notice that the values are separated by tabs (\t).

You may also have to convert data values to be compatible with Redshift. This is because MySQL and Redshift support different column and data types.

For example, the DATE value "0000-00-00" is valid in MySQL, but an error is thrown in Redshift. You must convert the value to an acceptable Redshift format, such as "0001-01-01".

Import

After converting the MySQL statements and uploading the TSV files to S3, the final step is to import the data into Redshift. To do this, simply run the COPY command:

SQL:

COPY users FROM 's3://your_s3_bucket/users_' CREDENTIALS 'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key';

Shortcomings of import and export

Although import and export is the simplest way to copy to Redshift, it is not suitable for frequent updates.

For example, it takes about 30 minutes to export 18 GB of data from MySQL over a 100 Mbps network (18 GB is roughly 144,000 megabits, or about 24 minutes of raw transfer time at 100 Mbps, before any overhead). It takes another 30 minutes to import that data into Redshift. This assumes zero connection problems during the export or import, any of which would force you to restart the process.

A more efficient way to copy MySQL to Redshift is incremental SELECT and COPY.

2. Incremental SELECT and COPY

If imports and exports are too slow for your needs, incremental SELECT and COPY may be your answer.

The SELECT and COPY method updates only the records that have changed since the last update. This takes much less time and bandwidth than importing and exporting the entire dataset, and it lets you synchronize MySQL and Redshift much more frequently.

To use incremental SELECT and COPY, your MySQL table must meet two conditions (a sample table definition follows the list):

The table must have an updated_at column whose timestamp is updated every time the row is changed.

The table must have one or more unique keys.
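For illustration, a minimal sketch of a MySQL table that meets both conditions could look like this; it reuses the users columns from the earlier example, and the column types are assumptions:

SQL:

CREATE TABLE users (
    id         INT NOT NULL AUTO_INCREMENT,
    firstname  VARCHAR(64),
    lastname   VARCHAR(64),
    age        INT,
    updated_at TIMESTAMP NOT NULL
               DEFAULT CURRENT_TIMESTAMP
               ON UPDATE CURRENT_TIMESTAMP,  -- refreshed automatically whenever the row changes
    PRIMARY KEY (id)                         -- the unique key used later for DELSERT
);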

Like import and export, this method is divided into three steps:

1. Export

Incremental SELECT exports only rows that have changed since the last update. The SELECT query you run on MySQL is as follows:

SQL:

SELECT * FROM users WHERE updated_at >= '2016-08-12 20:00:00'

Save the results to a file for conversion.

2. Transform

This step is the same as in the import and export method: convert the MySQL data into Redshift's TSV format.

3. Import

At this point, your TSV file contains both updated rows and newly inserted rows. You cannot simply run the COPY command directly against the target Redshift table, because that would duplicate the updated rows.

To avoid duplicate rows, use the DELSERT (delete + insert) technique:

Create a temporary table on Redshift with the same definition as the target table.

Run the COPY command to upload the data to the temporary table.

Delete from the target table the rows that also exist in the temporary table. It looks like this:

SQL:

DELETE FROM users USING users_staging s WHERE users.id = s.id

Here, id is the unique key of the table.

Finally, insert the rows from the temporary table into the target table:

SQL:

INSERT INTO users (id, firstname, lastname, updated_at) SELECT id, firstname, lastname, updated_at FROM users_staging s
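Putting the DELSERT steps together, a minimal sketch of the full sequence might look like the following; the S3 path and credentials are placeholders, the column list mirrors the article's example, and DELIMITER '\t' tells COPY to expect the tab-separated files produced in the transform step:

SQL:

BEGIN;

-- 1. Staging table with the same definition as the target table
CREATE TEMP TABLE users_staging (LIKE users);

-- 2. Bulk-load the incremental TSV files from S3 into the staging table
COPY users_staging
FROM 's3://your_s3_bucket/users_'
CREDENTIALS 'aws_access_key_id=your_access_key;aws_secret_access_key=your_secret_key'
DELIMITER '\t';

-- 3. Remove rows from the target table that also exist in the staging table
DELETE FROM users USING users_staging s WHERE users.id = s.id;

-- 4. Insert the updated and newly added rows
INSERT INTO users (id, firstname, lastname, updated_at)
SELECT id, firstname, lastname, updated_at FROM users_staging;

END;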

Shortcomings of incremental SELECT and COPY

Incremental SELECT and COPY is more efficient than import and export, but it has its own limitations.

The main problem is that rows deleted from the MySQL table remain in Redshift indefinitely. That is not a problem if you want to purge old data from MySQL while retaining historical data in Redshift. Otherwise, the leftover deleted rows can cause serious headaches during data analysis.

Another disadvantage of this approach is that it does not copy table schema changes. When columns are added to or removed from the MySQL table, you need to make the corresponding changes to the Redshift table manually.

Finally, the query used to extract the updated rows from the MySQL table affects the performance of the MySQL database.

If any of these shortcomings is a deal breaker for you, the next method may be a better fit.

3. Change data capture using Binlog

Change data capture (CDC) is a technique that captures changes made to data in MySQL and applies them to the target Redshift table. It is similar to incremental SELECT and COPY in that it imports only the changed data, not the entire database.

However, unlike incremental SELECT and COPY, CDC gives you true replication from MySQL to Redshift.

To use the CDC method with a MySQL database, you must use the binary log (binlog). The binlog lets you capture data changes as a stream, enabling near-real-time replication.

Binlog captures not only data changes (insert, update, delete) but also table schema changes, such as adding or dropping columns. It also ensures that rows deleted from MySQL are deleted in Redshift as well.

Getting started with Binlog

When you use CDC with binlog, you are actually writing an application that reads, transforms, and imports streaming data from MySQL to Redshift.

You can do this with an open-source library called mysql-replication-listener. This C++ library provides a streaming API that reads data from the MySQL binlog in real time. Higher-level APIs are also available for other languages: kodama (Ruby) and python-mysql-replication (Python).

1. Set up

First, set the MySQL configuration parameters to enable binlog. The following is a list of parameters related to binlog:

Configuration (my.cnf):

log_bin = /file_path/mysql-bin.log
expire_logs_days = 4
binlog_format = row
replicate-wild-do-table = mydatabase.%

The binlog_format parameter sets how binlog events are stored in the binlog file. Three formats are supported: statement, mixed, and row.

The statement format saves queries in the binlog file as-is (for example, UPDATE users SET firstname='Tom' WHERE id=293;). Although this keeps the binlog file small, it causes problems when used for replication.

To replicate to Redshift, use the row format.

The row format saves the changed values in the binlog file. This increases the binlog file size, but ensures data consistency between MySQL and Amazon Redshift.

log_bin sets the path where binlog files are stored. expire_logs_days determines how many days binlog files are kept.

Specify the tables you want to replicate in the replicate-wild-do-table parameter. Only the specified tables go into the binlog file.

We recommend keeping the binlog file for a few days. This ensures that you have time to resolve any problems that occur during replication.

If you use a MySQL replication slave server as the source, it is important to set log-slave-updates to TRUE. Otherwise, data changes made on the replication master will not be recorded in the binlog.
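Before you start streaming, it is worth confirming from a MySQL client that the binlog is enabled and in row format. A quick check might look like this:

SQL:

SHOW VARIABLES LIKE 'log_bin';        -- should report ON
SHOW VARIABLES LIKE 'binlog_format';  -- should report ROW
SHOW MASTER STATUS;                   -- shows the current binlog file name and position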

In addition, your MySQL account requires the following permissions to perform replication-related tasks:

REPLICATION SLAVE

SELECT

RELOAD

REPLICATION CLIENT

LOCK TABLES
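As a sketch, assuming a hypothetical replication account named repl_user, granting these privileges could look like this (REPLICATION SLAVE, REPLICATION CLIENT, and RELOAD are global privileges, so they are granted ON *.*):

SQL:

CREATE USER 'repl_user'@'%' IDENTIFIED BY 'your_password';
GRANT REPLICATION SLAVE, REPLICATION CLIENT, SELECT, RELOAD, LOCK TABLES
ON *.* TO 'repl_user'@'%';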

2. Export and transform

When you use binlog, "export" is really a real-time stream of data from your MySQL binlog files. How the binlog data is delivered depends on the API you use.

For example, with Kodama, binlog data is delivered as a stream of binlog events.

Kodama lets you register event handlers for different event types (insert, update, delete, alter table, create table, and so on). Your application receives the binlog events and then generates output ready for Redshift import (for data changes) or for schema changes (for table schema changes).

Importing data changes is similar to the transform step of the other replication methods. Unlike the others, however, binlog lets you handle delete events, and you need to handle them specifically to maintain Redshift upload performance.

3. Import

Finally, it's time to import your binlog data stream.

The problem is that Redshift does not offer streaming uploads. Use the DELSERT import technique that we outlined for the incremental SELECT and COPY method.

Shortcomings of Binlog

Binlog is an ideal way to copy from MySQL to Redshift, but it still has its drawbacks. Building your CDC application requires serious development work.

In addition to the data flow we described above, you must also build:

Transaction management. You need to track how far you have read in the binlog stream in case an error forces your application to stop reading binlog data. Transaction management ensures that you can continue from where you last left off.

Data buffering and retry. Similarly, Redshift may become unavailable while your application is sending data. Your application needs to buffer unsent data until the Redshift cluster comes back online. If this step is done incorrectly, it may result in data loss or duplication.

Table schema change support. Table schema change events (such as altering, creating, or dropping a table) arrive as native MySQL SQL statements that do not run on Redshift as-is. To support table schema changes, you must convert such statements into the corresponding Amazon Redshift statements.
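For example, the kind of translation required can be sketched with a hypothetical column addition (the country column is made up for illustration); Redshift's ALTER TABLE ... ADD COLUMN does not accept MySQL's AFTER clause, so the new column is simply appended:

SQL:

-- DDL as it might appear in the MySQL binlog (illustrative column):
ALTER TABLE users ADD COLUMN country VARCHAR(32) AFTER lastname;

-- Rewritten for Amazon Redshift (no AFTER clause; the column is appended at the end):
ALTER TABLE users ADD COLUMN country VARCHAR(32);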

4. Using ETL as a Service

With the ETL tool, you can copy data to Redshift in near real time.

Unlike the CDC method, such tools manage the entire replication process and automatically map MySQL data types to Redshift data types, so you don't have to do it yourself. You can even synchronize multiple MySQL databases (and other types of databases) to Redshift at the same time.

In addition, the setup process is simple and short.

Make full use of MySQL with Amazon Redshift

You rely on MySQL to power your business, but its limitations in data analysis are well known. Redshift provides a simple and powerful solution for your BI requirements. MySQL and Redshift can take your business to new heights.

As you can see, there are several ways to copy data from MySQL to Redshift. The methods range from simple to complex, and from very slow to nearly real-time. Which one you choose depends on several factors:

Replication frequency

Size of the MySQL dataset

Available developer resources

Remember that the fastest method, and the one closest to real-time replication, is change data capture (CDC), which leverages MySQL's binlog. The downside is that it takes real developer hours to build and maintain the application.

The above is the full content of the article "Why copy data from MySQL to Redshift". Thank you for reading! We hope it has been helpful.
