

Several Performance Best Practices for AWS Redshift ETL


Amazon Redshift is a fast, scalable columnar database that supports SQL queries at petabyte scale, making it well suited for enterprise data warehouses. Redshift also supports large-scale concurrent queries and result-set caching, with query response times as low as sub-second, nearly ten times faster than other data warehouses. With Redshift you spend less time waiting and more time gaining insight from your data.

ETL is a common concept in computing: the process of copying data from one or more sources into a target system. It consists of three steps:

1. Extract: select/extract the data to be exported from the data source.

2. Transform: convert the format/representation of the exported data according to business needs.

3. Load: import the transformed data into the target system.

Before you can use Redshift, you need to load data into it; this loading process is Redshift's ETL. Typical examples include database migrations, such as moving data from a legacy database into Redshift.

The purpose of this article is to share some tips and suggestions on Redshift ETL drawn from our daily work at bosicloud:

1. Use the COPY command to load multiple files of the same size into Redshift

Amazon Redshift is an MPP (massively parallel processing) database. Behind Redshift is an EC2 cluster. Each compute node (EC2 instance) is further subdivided into slices, and the slices share the node's processing capacity equally. The number of slices per node depends on the cluster's node type: for example, each DS2.XLARGE compute node has two slices, while each DS2.8XLARGE compute node has 16 slices.

When loading data, the ideal case is that the entire workload is divided equally among the slices of all compute nodes. Loading a single large file, or several files of very different sizes, can leave the workload unevenly distributed across compute nodes, which hurts Redshift's overall load performance. For example, our Redshift cluster has two compute nodes; the per-node workload when loading a single large file looks like this:

As the figure above shows, the entire load falls on compute-0 while compute-1 sits idle. This is the weakest-link (bucket) principle: Redshift's total load time is determined by the compute node that works the longest.

Therefore, it is best to split large files into multiple small files of roughly equal size, with the total number of files an exact multiple of the number of compute nodes (ideally, of the total number of slices), so that each node is assigned the same number of files. We also recommend compressing these small files with gzip, lzop, or bzip2, since Redshift supports these compression formats when loading and storing data; compressed data is smaller, so there is less to load.
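As a minimal sketch of such a load, assuming the data has already been split into gzip-compressed parts under a hypothetical S3 prefix, with a hypothetical target table and IAM role (all names below are placeholders, not from the original article):

-- Load every file whose key starts with the prefix 'orders/part_' in parallel
COPY public.orders
FROM 's3://my-etl-bucket/orders/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
GZIP
DELIMITER '|'
REGION 'us-east-1';

Because COPY treats the FROM value as a key prefix, all the part files are picked up by a single command and distributed across the slices.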

2. Use workload management (WLM) to configure Redshift queues appropriately

Redshift manages multiple queues through workload management (WLM). When a user submits a SQL query, the query is assigned to the queue corresponding to the submitter's user group. Redshift's memory and compute capacity are divided into units called slots; one slot represents one unit of memory and compute and can execute one SQL query at a time. Different queues have different numbers of slots, and the slot count determines how many SQL queries a queue can run concurrently. ETL is usually accompanied by many COMMIT operations, and COMMIT consumes a lot of compute resources. To run ETL without affecting the SQL queries submitted by ordinary users, we recommend routing ETL and ordinary user queries to different queues; otherwise, ordinary user queries may fail to respond in time because they are waiting behind ETL COMMITs.

In addition, to speed up Redshift's COPY-based ETL, we can use the wlm_query_slot_count parameter to let the ETL session claim more slots in its queue, giving its statements more memory and compute.

WLM setting interface of Redshift
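As a minimal sketch (the slot count of 3 is only an illustrative value), an ETL session can temporarily claim extra slots in its queue before running heavy statements and then revert to the default:

-- Claim three slots of the current queue for this session (illustrative value)
SET wlm_query_slot_count TO 3;
-- ... run the heavy COPY / transformation / COMMIT steps here ...
-- Revert to the default of one slot per query
SET wlm_query_slot_count TO 1;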

3. Use "BEGIN ... COMMIT" to reduce the number of COMMITs. We mentioned earlier that ETL is a multi-step task; each step often ends with a COMMIT, and COMMIT is an expensive operation. We at bosicloud therefore suggest wrapping the steps that can be merged inside a single BEGIN ... COMMIT block, so that COMMIT executes only once, for example:

BEGIN;

CREATE TEMPORARY TABLE staging_table (...);

INSERT INTO staging_table SELECT ... FROM source;

DELETE FROM table1 WHERE ...;

INSERT INTO table1 SELECT ... FROM staging_table;

DELETE FROM table2 WHERE ...;

INSERT INTO table2 SELECT ... FROM staging_table;

COMMIT;

4. Use Redshift Spectrum for ad hoc queries

In the past, we would ETL data from S3 into Redshift before running ad hoc queries. If the ETL is done only for an occasional query or two, it is expensive and not cost-effective. Don't forget that AWS has introduced Redshift Spectrum, which lets you query data on S3 directly from Redshift without loading it first. Although Spectrum queries are slower than queries against data stored in Redshift, they are still much faster than performing a full ETL load just to answer a one-off query.
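As a minimal sketch, assuming a hypothetical Glue Data Catalog database, IAM role, and pipe-delimited sales files on S3 (all names are placeholders), querying S3 directly through Spectrum looks roughly like this:

-- Register an external (Spectrum) schema backed by the AWS Glue Data Catalog
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'spectrum_demo_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Describe the files on S3 as an external table; no data is loaded into Redshift
CREATE EXTERNAL TABLE spectrum_demo.sales (
  sale_id   BIGINT,
  sale_date DATE,
  amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://my-etl-bucket/sales/';

-- The ad hoc query runs against the S3 data through Spectrum
SELECT sale_date, SUM(amount) AS total_amount
FROM spectrum_demo.sales
GROUP BY sale_date
ORDER BY sale_date;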

5. SQL utility scripts for ETL health checks:

- Return commit queue statistics for the past day, such as maximum queue length and queue wait time:

select startqueue, node, datediff(ms, startqueue, startwork) as queue_time,
       datediff(ms, startwork, endtime) as commit_time, queuelen
from stl_commit_stats
where startqueue >= dateadd(day, -1, current_date)
order by queuelen desc, queue_time desc;

- Return information about COPY commands executed in the past week, such as the start time (starttime), query ID (query), SQL text (querytxt), number of files loaded (n_files), and data size (size_mb):

select q.starttime, s.query, substring(q.querytxt, 1, 120) as querytxt,
       s.n_files, size_mb, s.time_seconds,
       s.size_mb / decode(s.time_seconds, 0, 1, s.time_seconds) as mb_per_s
from (select query, count(*) as n_files,
             sum(transfer_size / (1024 * 1024)) as size_MB,
             (max(end_time) - min(start_time)) / (1000000) as time_seconds,
             max(end_time) as end_time
      from stl_s3client
      where http_method = 'GET' and query > 0 and transfer_time > 0
      group by query) as s
left join stl_query as q on q.query = s.query
where s.end_time >= dateadd(day, -7, current_date)
order by s.time_seconds desc, size_mb desc, s.end_time desc
limit 50;

- Create a view showing the space used by each table; consider UNLOADing the contents of fast-growing tables to S3:

CREATE OR REPLACE VIEW admin.v_space_used_per_tbl
AS with info_table as (
  SELECT TRIM(pgdb.datname) AS dbase_name
       , TRIM(pgn.nspname) AS schemaname
       , TRIM(pgc.relname) AS tablename
       , id AS tbl_oid
       , b.mbytes AS megabytes
       , CASE WHEN pgc.reldiststyle = 8
              THEN a.rows_all_dist
              ELSE a.rows END AS rowcount
       , CASE WHEN pgc.reldiststyle = 8
              THEN a.unsorted_rows_all_dist
              ELSE a.unsorted_rows END AS unsorted_rowcount
       , CASE WHEN pgc.reldiststyle = 8
              THEN decode(det.n_sortkeys, 0, NULL,
                     decode(a.rows_all_dist, 0, 0,
                       (a.unsorted_rows_all_dist::DECIMAL(32) / a.rows_all_dist) * 100))::DECIMAL(20,2)
              ELSE decode(det.n_sortkeys, 0, NULL,
                     decode(a.rows, 0, 0,
                       (a.unsorted_rows::DECIMAL(32) / a.rows) * 100))::DECIMAL(20,2) END
         AS pct_unsorted
  FROM (SELECT db_id
             , id
             , name
             , MAX(rows) AS rows_all_dist
             , MAX(rows) - MAX(sorted_rows) AS unsorted_rows_all_dist
             , SUM(rows) AS rows
             , SUM(rows) - SUM(sorted_rows) AS unsorted_rows
        FROM stv_tbl_perm
        GROUP BY db_id, id, name) AS a
  INNER JOIN pg_class AS pgc ON pgc.oid = a.id
  INNER JOIN pg_namespace AS pgn ON pgn.oid = pgc.relnamespace
  INNER JOIN pg_database AS pgdb ON pgdb.oid = a.db_id
  INNER JOIN (SELECT attrelid
                   , MIN(CASE attisdistkey WHEN 't' THEN attname ELSE NULL END) AS "distkey"
                   , MIN(CASE attsortkeyord WHEN 1 THEN attname ELSE NULL END) AS head_sort
                   , MAX(attsortkeyord) AS n_sortkeys
                   , MAX(attencodingtype) AS max_enc
                   , SUM(CASE WHEN attencodingtype <> 0 THEN 1 ELSE 0 END)::DECIMAL(20,3)
                     / COUNT(attencodingtype)::DECIMAL(20,3) * 100.00 AS pct_enc
              FROM pg_attribute
              GROUP BY 1) AS det ON det.attrelid = a.id
  LEFT OUTER JOIN (SELECT tbl
                        , COUNT(*) AS mbytes
                   FROM stv_blocklist
                   GROUP BY tbl) AS b ON a.id = b.tbl
  WHERE pgc.relowner > 1)
SELECT info.*
     , CASE WHEN info.rowcount = 0 THEN 'n/a'
            WHEN info.pct_unsorted >= 20 THEN 'VACUUM SORT recommended'
            ELSE 'n/a'
       END AS recommendation
FROM info_table info;

- Find the 50 most time-consuming SQL queries of the past week (runtimes of identical queries are merged):

-- query runtimes
select trim(database) as DB, count(query) as n_qry, max(substring(qrytext, 1, 80)) as qrytext,
       min(run_seconds) as "min", max(run_seconds) as "max", avg(run_seconds) as "avg",
       sum(run_seconds) as total, max(query) as max_query_id,
       max(starttime)::date as last_run, aborted,
       listagg(event, ', ') within group (order by query) as events
from (
  select userid, label, stl_query.query, trim(database) as database, trim(querytxt) as qrytext,
         md5(trim(querytxt)) as qry_md5, starttime, endtime,
         datediff(seconds, starttime, endtime)::numeric(12,2) as run_seconds,
         aborted,
         decode(alrt.event,
                'Very selective query filter', 'Filter',
                'Scanned a large number of deleted rows', 'Deleted',
                'Nested Loop Join in the query plan', 'Nested Loop',
                'Distributed a large number of rows across the network', 'Distributed',
                'Broadcasted a large number of rows across the network', 'Broadcast',
                'Missing query planner statistics', 'Stats',
                alrt.event) as event
  from stl_query
  left outer join (select query, trim(split_part(event, ':', 1)) as event
                   from stl_alert_event_log
                   where event_time >= dateadd(day, -7, current_date)
                   group by query, trim(split_part(event, ':', 1))) as alrt
    on alrt.query = stl_query.query
  where userid <> 1
  -- and (querytxt like 'SELECT%' or querytxt like 'select%')
  -- and database = ''
  and starttime >= dateadd(day, -7, current_date)
)
group by database, label, qry_md5, aborted
order by total desc limit 50;

[About Bosiyun]

As a professional cloud computing services company, Bosiyun provides managed services on AWS, including architecture consulting, migration, cloud security integration, hybrid cloud management, big data, and DevOps services. Bosiyun currently holds vendor certifications in big data, DevOps, architecture, databases, and operating systems, and has branches in Shanghai, Nanjing, Hangzhou, Wuhan, and other cities. To innovate its service model and lead the development of the IT services industry, Bosiyun will continue to invest in the research and development of an intelligent hybrid cloud management platform and a map database.
