Amazon Redshift is a fast, scalable, columnar database that supports SQL queries at the petabyte scale, making it well suited for enterprise data warehouses. Redshift also supports large-scale concurrent queries and result-set caching, with query response times as fast as sub-second, reportedly nearly ten times faster than other data warehouses. With Redshift you spend less time waiting and more time gaining insights from your data.
ETL is a common concept in computing. It refers to the process of copying data from one or more sources into a target system, and it consists of three steps:
1. Extract: select and pull the required data out of the data source.
2. Transform: convert the exported data into the format or representation the business needs.
3. Load: import the converted data into the target system.
Before you can use Redshift, you need to get your data into it; this is Redshift's ETL, for example when migrating a database and moving data from an old database into Redshift. A minimal sketch of the three steps is shown below.
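To make the steps concrete, here is a hedged, hypothetical sketch of a migration into Redshift; the table names, S3 bucket, and IAM role ARN are illustrative and not from the original article:
-- 1) Extract: export the source data to S3 (UNLOAD works when the source is another Redshift cluster).
UNLOAD ('SELECT id, name, amount FROM legacy_orders')
TO 's3://my-etl-bucket/export/orders_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP;
-- 2) Transform: reshape the exported files outside Redshift as needed (for example with scripts or AWS Glue).
-- 3) Load: bulk-import the transformed files into the target table.
COPY orders
FROM 's3://my-etl-bucket/transformed/orders_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP;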
The purpose of this article is to share some tips and suggestions on Redshift ETL drawn from our (bosicloud's) daily work:
1. Use the COPY command to load multiple files of the same size into Redshift
Amazon Redshift is an MPP (massively parallel processing) database. Behind a Redshift cluster is a group of EC2 instances. Each compute node (EC2 instance) is further subdivided into slices, and the slices divide the node's processing capacity equally. The number of slices per node depends on the cluster's node type; for example, each DS2.XLARGE compute node has 2 slices, while each DS2.8XLARGE compute node has 16 slices.
When loading data, the ideal case is that the entire workload is divided evenly across the slices of all compute nodes (EC2 nodes). When only one large file is loaded, or the loaded files differ greatly in size, the workload may be distributed unevenly across the compute nodes, which degrades the loading performance of the whole Redshift cluster. For example, with a Redshift cluster of two compute nodes, the per-node workload while loading a single large file looks like this:
As the figure above shows, the entire load falls on compute-0 while compute-1 sits idle. This is the "weakest stave of the barrel" effect: the load takes as long as the busiest compute node needs.
Therefore, it is best to split a large file into multiple smaller files of roughly equal size, with the total number of files an integral multiple of the number of slices in the cluster, so that every slice processes the same number of files. We also recommend compressing these files with gzip, lzop, or bzip2, since Redshift can load and store these compression formats directly; compressed data is smaller, so there is less to load. A sketch of such a COPY follows.
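In the sketch below, the bucket, key prefix, IAM role, table name, and delimiter are hypothetical; a single COPY from a shared key prefix lets Redshift load all of the compressed parts in parallel across the slices:
COPY sales
FROM 's3://my-etl-bucket/sales/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP
DELIMITER '|';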
2. Use workload management to tune Redshift queues appropriately.
Redshift manages multiple queues through workload management (WLM). When a user submits a SQL query, it is assigned to the queue that corresponds to the submitter's user group. Redshift's memory and compute capacity are divided into units called slots: one slot represents one unit of memory and compute power, and one slot executes one SQL query at a time. Different queues have different numbers of slots, and the slot count determines how many SQL queries a queue can run concurrently. ETL is usually accompanied by many COMMIT operations, and COMMIT consumes a lot of compute resources. To run ETL without affecting the SQL queries submitted by ordinary users, we recommend routing ETL and ordinary user queries to different queues; otherwise an ordinary user's query may fail to respond in time because it is stuck waiting behind an ETL COMMIT.
In addition, to speed up Redshift's COPY-based ETL, we can use the wlm_query_slot_count parameter to let the ETL session claim more slots in its queue, giving its statements more memory and compute power (a sketch follows the screenshot below).
WLM setting interface of Redshift
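For example (a minimal sketch; the sales table is hypothetical), a session can temporarily claim extra slots before a heavy ETL step and release them afterwards:
SET wlm_query_slot_count TO 3; -- this session's statements now use 3 slots of their queue
VACUUM sales;                  -- a heavy ETL step that benefits from the extra memory
SET wlm_query_slot_count TO 1; -- return to the default of one slot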
3. Use "BEGIN … COMMIT" to reduce the number of COMMIT. We mentioned earlier that ELT is a multi-step task, each step often ends with a COMMIT, and COMMIT is an expensive operation. So we bosicloud suggest using BEGIN... END combines multiple steps that can be merged into a single step as much as possible and executes COMMIT only once, for example:
BEGIN;
CREATE TEMPORARY TABLE staging_table (...);
INSERT INTO staging_table SELECT .. FROM source;
DELETE FROM table1 WHERE ...;
INSERT INTO table1 SELECT .. FROM staging_table;
DELETE FROM table2 WHERE ...;
INSERT INTO table2 SELECT .. FROM staging_table;
COMMIT;
4. Use Redshift Spectrum for ad-hoc queries
In the past, we would ETL the data on S3 into Redshift before running ad-hoc queries against it. If the ETL is done only for an occasional query or two, it looks very expensive and is not cost-effective. AWS has since introduced Redshift Spectrum, which lets you query data on S3 with Redshift directly, without loading it into the cluster first. Although Spectrum queries are slower than queries against data already stored in Redshift, they are still much faster than ETL-ing the data in first and then querying it.
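As a hedged sketch (the external schema, Glue database, IAM role, table definition, and S3 location below are hypothetical), you register the S3 data as an external table and query it directly:
CREATE EXTERNAL SCHEMA s3_logs
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
CREATE EXTERNAL TABLE s3_logs.access_log (
    event_time TIMESTAMP,
    user_id    BIGINT,
    url        VARCHAR(2048)
)
STORED AS PARQUET
LOCATION 's3://my-etl-bucket/access-log/';
SELECT COUNT(*)
FROM s3_logs.access_log
WHERE event_time >= DATEADD(day, -1, current_date);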
5. SQL utility scripts for ETL health checks
The following query returns commit-queue statistics for the past day, such as maximum queue length and queueing time:
SELECT startqueue, node,
       DATEDIFF(ms, startqueue, startwork) AS queue_time,
       DATEDIFF(ms, startwork, endtime) AS commit_time,
       queuelen
FROM stl_commit_stats
WHERE startqueue >= DATEADD(day, -1, current_date)
ORDER BY queuelen DESC, queue_time DESC;
The following query returns information about the COPY commands executed within the past week, such as the COPY start time (starttime), query ID (query), SQL text (querytxt), number of files loaded (n_files), and total file size (size_mb):
SELECT q.starttime, s.query, SUBSTRING(q.querytxt, 1, 120) AS querytxt,
       s.n_files, size_mb, s.time_seconds,
       s.size_mb / DECODE(s.time_seconds, 0, 1, s.time_seconds) AS mb_per_s
FROM (SELECT query, COUNT(*) AS n_files,
             SUM(transfer_size / (1024 * 1024)) AS size_mb,
             (MAX(end_time) - MIN(start_time)) / (1000000) AS time_seconds,
             MAX(end_time) AS end_time
      FROM stl_s3client
      WHERE http_method = 'GET' AND query > 0 AND transfer_time > 0
      GROUP BY query) AS s
LEFT JOIN stl_query AS q ON q.query = s.query
WHERE s.end_time >= DATEADD(day, -7, current_date)
ORDER BY s.time_seconds DESC, size_mb DESC, s.end_time DESC
LIMIT 50;
The following creates a view for inspecting the space used by each table; consider UNLOADing tables whose space grows quickly to S3.
CREATE OR REPLACE VIEW admin.v_space_used_per_tbl
AS WITH info_table AS (
SELECT TRIM(pgdb.datname) AS dbase_name
     , TRIM(pgn.nspname) AS schemaname
     , TRIM(pgc.relname) AS tablename
     , id AS tbl_oid
     , b.mbytes AS megabytes
     , CASE WHEN pgc.reldiststyle = 8
            THEN a.rows_all_dist
            ELSE a.rows END AS rowcount
     , CASE WHEN pgc.reldiststyle = 8
            THEN a.unsorted_rows_all_dist
            ELSE a.unsorted_rows END AS unsorted_rowcount
     , CASE WHEN pgc.reldiststyle = 8
            THEN DECODE(det.n_sortkeys, 0, NULL,
                 DECODE(a.rows_all_dist, 0, 0,
                        (a.unsorted_rows_all_dist::DECIMAL(32) / a.rows_all_dist) * 100))::DECIMAL(20,2)
            ELSE DECODE(det.n_sortkeys, 0, NULL,
                 DECODE(a.rows, 0, 0,
                        (a.unsorted_rows::DECIMAL(32) / a.rows) * 100))::DECIMAL(20,2) END
       AS pct_unsorted
FROM (SELECT db_id
           , id
           , name
           , MAX(rows) AS rows_all_dist
           , MAX(rows) - MAX(sorted_rows) AS unsorted_rows_all_dist
           , SUM(rows) AS rows
           , SUM(rows) - SUM(sorted_rows) AS unsorted_rows
      FROM stv_tbl_perm
      GROUP BY db_id, id, name
     ) AS a
INNER JOIN pg_class AS pgc
        ON pgc.oid = a.id
INNER JOIN pg_namespace AS pgn
        ON pgn.oid = pgc.relnamespace
INNER JOIN pg_database AS pgdb
        ON pgdb.oid = a.db_id
INNER JOIN (SELECT attrelid
                 , MIN(CASE attisdistkey WHEN 't' THEN attname ELSE NULL END) AS "distkey"
                 , MIN(CASE attsortkeyord WHEN 1 THEN attname ELSE NULL END) AS head_sort
                 , MAX(attsortkeyord) AS n_sortkeys
                 , MAX(attencodingtype) AS max_enc
                 , SUM(CASE WHEN attencodingtype <> 0 THEN 1 ELSE 0 END)::DECIMAL(20,3)
                   / COUNT(attencodingtype)::DECIMAL(20,3) * 100.00 AS pct_enc
            FROM pg_attribute
            GROUP BY 1) AS det ON det.attrelid = a.id
LEFT OUTER JOIN (SELECT tbl
                      , COUNT(*) AS mbytes
                 FROM stv_blocklist
                 GROUP BY tbl
                ) AS b
        ON a.id = b.tbl
WHERE pgc.relowner > 1)
SELECT info.*
     , CASE WHEN info.rowcount = 0 THEN 'n/a'
            WHEN info.pct_unsorted >= 20 THEN 'VACUUM SORT recommended'
            ELSE 'n/a'
       END AS recommendation
FROM info_table info;
The following query finds the 50 most time-consuming SQL queries of the past week (run times of identical queries are aggregated):
-- query runtimes
SELECT TRIM(database) AS db, COUNT(query) AS n_qry,
       MAX(SUBSTRING(qrytext, 1, 80)) AS qrytext,
       MIN(run_seconds) AS "min", MAX(run_seconds) AS "max", AVG(run_seconds) AS "avg",
       SUM(run_seconds) AS total, MAX(query) AS max_query_id,
       MAX(starttime)::date AS last_run, aborted,
       LISTAGG(event, ', ') WITHIN GROUP (ORDER BY query) AS events
FROM (
      SELECT userid, label, stl_query.query, TRIM(database) AS database,
             TRIM(querytxt) AS qrytext, MD5(TRIM(querytxt)) AS qry_md5,
             starttime, endtime,
             DATEDIFF(seconds, starttime, endtime)::numeric(12,2) AS run_seconds,
             aborted,
             DECODE(alrt.event,
                    'Very selective query filter', 'Filter',
                    'Scanned a large number of deleted rows', 'Deleted',
                    'Nested Loop Join in the query plan', 'Nested Loop',
                    'Distributed a large number of rows across the network', 'Distributed',
                    'Broadcasted a large number of rows across the network', 'Broadcast',
                    'Missing query planner statistics', 'Stats',
                    alrt.event) AS event
      FROM stl_query
      LEFT OUTER JOIN (SELECT query, TRIM(SPLIT_PART(event, ':', 1)) AS event
                       FROM stl_alert_event_log
                       WHERE event_time >= DATEADD(day, -7, current_date)
                       GROUP BY query, TRIM(SPLIT_PART(event, ':', 1))) AS alrt
             ON alrt.query = stl_query.query
      WHERE userid <> 1
      -- AND (querytxt LIKE 'SELECT%' OR querytxt LIKE 'select%')
      -- AND database = ''
      AND starttime >= DATEADD(day, -7, current_date)
     )
GROUP BY database, label, qry_md5, aborted
ORDER BY total DESC
LIMIT 50;
[About Bosiyun]
As a professional cloud computing services company, Bosiyun provides managed services on AWS, including architecture consulting, migration, cloud security integration, hybrid cloud management, big data, and DevOps services. Bosiyun holds vendor certifications in big data, DevOps, architecture, databases, and operating systems, and has branches in Shanghai, Nanjing, Hangzhou, Wuhan, and other cities. To innovate its service model and help lead the development of the IT services industry, Bosiyun will continue to invest in the research and development of an intelligent hybrid cloud management platform and a graph database.