In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-02-20 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Share
Shulou(Shulou.com)05/31 Report--
This article introduces the knowledge of "how to create the distribution key of a table by Greenplum". Many people will encounter such a dilemma in the operation of actual cases, so let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!
Greenplum creates a table-- Distribution key
Greenplum is a distributed system, and you need to specify a distribution key when you create a table (CREATEDBA permission is required to create the table) in order to distribute the data evenly across the segment. Choosing the distribution key is very important, and if you choose the wrong key, the data will not be unique, and more seriously, it will cause a sharp decline in SQL performance.
Greenplum has two distribution strategies:
1. Hash distribution.
Greenplum uses the hash distribution policy by default. This policy can select one or more columns as the distribution key (DK for short). The distribution key does the hash algorithm to confirm that the data is stored on the corresponding segment. The same distribution key values will be hash to the same segment. It is better to have a unique key or primary key on the table to ensure that the data is not evenly distributed on each segment. Grammar, distributed by.
If there is no primary key or unique key, the first column is selected as the distribution key by default. Increase primary key
2. Randomly distribution.
The data will not be randomly assigned to the segment, and the same records may be stored on different segment. Random distribution can ensure that the data is average, but Greenplum does not have unique key constraints across nodes, so there is no guarantee that the data is unique. Based on uniqueness and performance considerations, it is recommended to use hash distribution, which will be described in a separate document in the performance section. Grammar, distributed randomly.
1. Hash distribution key
Create a table without specifying the distribution column and distribution type. Create a hash distribution table by default, and use the first column ID field as the distribution key.
TestDB=# create table t_hash (id int,name varchar (50)) distributed by (id)
CREATE TABLE
TestDB=#
TestDB=#\ d t_hash
Table "public.t_hash"
Column | Type | Modifiers
-+
Id | integer |
Name | character varying (50) |
Distributed by: (id)
After the primary key is added, the primary key is upgraded to a distribution key instead of the id column.
TestDB=# alter table t_hash add primary key (name)
NOTICE: updating distribution policy to match new primary key
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "t_hash_pkey" for table "t_hash"
ALTER TABLE
TestDB=#\ d t_hash
Table "public.t_hash"
Column | Type | Modifiers
-+
Id | integer |
Name | character varying (50) | not null
Indexes:
"t_hash_pkey" PRIMARY KEY, btree (name)
Distributed by: (name)
Verify that the hash distribution table can achieve the uniqueness of primary key or unique key value.
TestDB=# insert into t_hash values (1Magnesia szlsd1')
INSERT 0 1
TestDB=#
TestDB=# insert into t_hash values (2ZLSD1')
ERROR: duplicate key violates unique constraint "t_hash_pkey" (seg2 gp-s3:40000 pid=3855)
In addition, unique keys can still be created on the primary key column
TestDB=# create unique index u_id on t_hash (name)
CREATE INDEX
TestDB=#
TestDB=#
TestDB=#\ d t_hash
Table "public.t_hash"
Column | Type | Modifiers
-+
Id | integer |
Name | character varying (50) | not null
Indexes:
"t_hash_pkey" PRIMARY KEY, btree (name)
"u_id" UNIQUE, btree (name)
Distributed by: (name)
However, non-primary key columns cannot create unique indexes separately, and must contain multiple distribution key columns if you want to create them.
TestDB=# create unique index uk_id on t_hash (id)
ERROR: UNIQUE index must contain all columns in the distribution key of relation "t_hash"
TestDB=# create unique index uk_id on t_hash (id,name)
CREATE INDEX
TestDB=#\ d t_hash
Table "public.t_hash"
Column | Type | Modifiers
-+
Id | integer |
Name | character varying (50) | not null
Indexes:
"t_hash_pkey" PRIMARY KEY, btree (name)
"uk_id" UNIQUE, btree (id, name)
Distributed by: (name)
After the primary key is deleted, the original hash distribution key remains unchanged.
TestDB=# alter table t_hash drop constraint t_hash_pkey
ALTER TABLE
TestDB=#\ d t_hash
Table "public.t_hash"
Column | Type | Modifiers
-+
Id | integer |
Name | character varying (50) | not null
Distributed by: (name)
When the distribution key is not the primary key or the only key, let's verify the conclusion that the same value of the distribution key falls on a segment.
In the following experiment, the name column is the distribution key. When we insert the same name value, we can see that all seven records fall in the No. 2 segment node.
TestDB=# insert into t_hash values (1 Magi ZLSD')
INSERT 0 1
TestDB=# insert into t_hash values (2ZLSD')
INSERT 0 1
TestDB=# insert into t_hash values (3 recordszlsd')
INSERT 0 1
TestDB=# insert into t_hash values (4)
INSERT 0 1
TestDB=# insert into t_hash values (5 Magnesia szlsd')
INSERT 0 1
TestDB=# insert into t_hash values (6)
INSERT 0 1
TestDB=#
TestDB=#
TestDB=# select gp_segment_id,count (*) from t_hash group by gp_segment_id
Gp_segment_id | count
-+-
2 | 7
(1 row)
2. Randomly distributed bonds
To create a random distribution table, you need to add the distributed randomly keyword, and it is not known which column is used as the distribution key.
TestDB=# create table t_random (id int, name varchar (100)) distributed randomly
CREATE TABLE
TestDB=#
TestDB=#
TestDB=#\ d t_random
Table "public.t_random"
Column | Type | Modifiers
-+-
Id | integer |
Name | character varying (100) |
Distributed randomly
Verify the uniqueness of the primary key / unique key, and you can see that the randomly distributed table cannot create primary and unique keys
TestDB=# alter table t_random add primary key (id,name)
ERROR: PRIMARY KEY and DISTRIBUTED RANDOMLY are incompatible
TestDB=#
TestDB=# create unique index uk_r_id on t_random (id)
ERROR: UNIQUE and DISTRIBUTED RANDOMLY are incompatible
TestDB=#
It can be seen from the experiment that the uniqueness of the data can not be realized. Moreover, the data is inserted into the random distribution table, not polling insertion. In the experiment, there are 3 segment, but 3 records are inserted on the 1st, and 2 records are inserted on the 2 segment node before inserting the data in the 0 segment. It is not known how the random distribution table can achieve the average distribution of data. This experiment also verifies the conclusion that the same value of the random distribution table is distributed in different segment.
TestDB=# insert into t_random values (1Magnesia szlsd3')
INSERT 0 1
TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id
Gp_segment_id | count
-+-
| 1 | 1 |
(1 row)
TestDB=#
TestDB=# insert into t_random values (1Magnesia szlsd3')
INSERT 0 1
TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id
Gp_segment_id | count
-+-
2 | 1
| 1 | 1 |
(2 rows)
TestDB=# insert into t_random values (1Magnesia szlsd3')
INSERT 0 1
TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id
Gp_segment_id | count
-+-
2 | 1
| 1 | 2 |
(2 rows)
TestDB=# insert into t_random values (1Magnesia szlsd3')
INSERT 0 1
TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id
Gp_segment_id | count
-+-
2 | 2
| 1 | 2 |
(2 rows)
TestDB=# insert into t_random values (1Magnesia szlsd3')
INSERT 0 1
TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id
Gp_segment_id | count
-+-
2 | 2
| 1 | 3 |
(2 rows)
TestDB=# insert into t_random values (1Magnesia szlsd3')
INSERT 0 1
TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id
Gp_segment_id | count
-+-
2 | 2
| 1 | 3 |
0 | 1
(3 rows)
3. CTAS inherits the original table distribution key
There are two CTAS grammars in Greenplum, and no matter which syntax inherits the distribution key of the original table by default. However, some special properties of the table, such as primary key, unique key, APPENDONLY, COMPRESSTYPE (compression), are not inherited.
TestDB=#\ d t_hash
Table "public.t_hash"
Column | Type | Modifiers
-+
Id | integer |
Name | character varying (50) | not null
Indexes:
"t_hash_pkey" PRIMARY KEY, btree (name)
"uk_id" UNIQUE, btree (id, name)
Distributed by: (name)
TestDB=#
TestDB=#
TestDB=# create table t_hash_1 as select * from t_hash
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause-- Using column (s) named' name' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column (s) chosen are the optimal data distribution key to minimize skew.
SELECT 0
TestDB=#\ d t_hash_1
Table "public.t_hash_1"
Column | Type | Modifiers
-+
Id | integer |
Name | character varying (50) |
Distributed by: (name)
TestDB=#
TestDB=# create table t_hash_2 (like t_hash)
NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE TABLE
TestDB=#\ d t_hash_2
Table "public.t_hash_2"
Column | Type | Modifiers
-+
Id | integer |
Name | character varying (50) | not null
Distributed by: (name)
If CTAS creates a table to change the distribution key, add distributed by.
TestDB=# create table t_hash_3 as select * from t_hash distributed by (id)
SELECT 0
TestDB=#
TestDB=#\ d t_hash_3
Table "public.t_hash_3"
Column | Type | Modifiers
-+
Id | integer |
Name | character varying (50) |
Distributed by: (id)
TestDB=#
TestDB=#
TestDB=# create table t_hash_4 (like t_hash) distributed by (id)
CREATE TABLE
TestDB=#
TestDB=#\ d t_hash5
Did not find any relation named "t_hash5".
TestDB=#\ d t_hash_4
Table "public.t_hash_4"
Column | Type | Modifiers
-+
Id | integer |
Name | character varying (50) | not null
Distributed by: (id)
When CTAS, randomly random distribution key should pay special attention to, must add distributed randomly, otherwise the original table is hash distribution key, CTAS new table is random distribution key.
TestDB=#\ d t_random
Table "public.t_random"
Column | Type | Modifiers
-+-
Id | integer |
Name | character varying (100) |
Distributed randomly
TestDB=#
TestDB=#\ d t_random_1
Table "public.t_random_1"
Column | Type | Modifiers
-+-
Id | integer |
Name | character varying (100) |
Distributed by: (id)
TestDB=# create table t_random_2 as select * from t_random distributed randomly
SELECT 7
TestDB=#
TestDB=#\ d t_random_2
Table "public.t_random_2"
Column | Type | Modifiers
-+-
Id | integer |
Name | character varying (100) |
Distributed randomly
This is the end of the content of "how Greenplum creates the distribution key of a table". Thank you for reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.