How does Greenplum create the distribution key of a table 07/11 Update SLTechnology News&Howtos

How does Greenplum create the distribution key of a table

2025-07-11 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article introduces the knowledge of "how to create the distribution key of a table by Greenplum". Many people will encounter such a dilemma in the operation of actual cases, so let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!

Greenplum creates a table-- Distribution key

Greenplum is a distributed system, and you need to specify a distribution key when you create a table (CREATEDBA permission is required to create the table) in order to distribute the data evenly across the segment. Choosing the distribution key is very important, and if you choose the wrong key, the data will not be unique, and more seriously, it will cause a sharp decline in SQL performance.

Greenplum has two distribution strategies:

1. Hash distribution.

Greenplum uses the hash distribution policy by default. This policy can select one or more columns as the distribution key (DK for short). The distribution key does the hash algorithm to confirm that the data is stored on the corresponding segment. The same distribution key values will be hash to the same segment. It is better to have a unique key or primary key on the table to ensure that the data is not evenly distributed on each segment. Grammar, distributed by.

If there is no primary key or unique key, the first column is selected as the distribution key by default. Increase primary key

2. Randomly distribution.

The data will not be randomly assigned to the segment, and the same records may be stored on different segment. Random distribution can ensure that the data is average, but Greenplum does not have unique key constraints across nodes, so there is no guarantee that the data is unique. Based on uniqueness and performance considerations, it is recommended to use hash distribution, which will be described in a separate document in the performance section. Grammar, distributed randomly.

1. Hash distribution key

Create a table without specifying the distribution column and distribution type. Create a hash distribution table by default, and use the first column ID field as the distribution key.

TestDB=# create table t_hash (id int,name varchar (50)) distributed by (id)

CREATE TABLE

TestDB=#

TestDB=#\ d t_hash

Table "public.t_hash"

Column | Type | Modifiers

Id | integer |

Name | character varying (50) |

Distributed by: (id)

After the primary key is added, the primary key is upgraded to a distribution key instead of the id column.

TestDB=# alter table t_hash add primary key (name)

NOTICE: updating distribution policy to match new primary key

NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "t_hash_pkey" for table "t_hash"

ALTER TABLE

TestDB=#\ d t_hash

Table "public.t_hash"

Column | Type | Modifiers

Id | integer |

Name | character varying (50) | not null

Indexes:

"t_hash_pkey" PRIMARY KEY, btree (name)

Distributed by: (name)

Verify that the hash distribution table can achieve the uniqueness of primary key or unique key value.

TestDB=# insert into t_hash values (1Magnesia szlsd1')

INSERT 0 1

TestDB=#

TestDB=# insert into t_hash values (2ZLSD1')

ERROR: duplicate key violates unique constraint "t_hash_pkey" (seg2 gp-s3:40000 pid=3855)

In addition, unique keys can still be created on the primary key column

TestDB=# create unique index u_id on t_hash (name)

CREATE INDEX

TestDB=#

TestDB=#\ d t_hash

Table "public.t_hash"

Column | Type | Modifiers

Id | integer |

Name | character varying (50) | not null

Indexes:

"t_hash_pkey" PRIMARY KEY, btree (name)

"u_id" UNIQUE, btree (name)

Distributed by: (name)

However, non-primary key columns cannot create unique indexes separately, and must contain multiple distribution key columns if you want to create them.

TestDB=# create unique index uk_id on t_hash (id)

ERROR: UNIQUE index must contain all columns in the distribution key of relation "t_hash"

TestDB=# create unique index uk_id on t_hash (id,name)

CREATE INDEX

TestDB=#\ d t_hash

Table "public.t_hash"

Column | Type | Modifiers

Id | integer |

Name | character varying (50) | not null

Indexes:

"t_hash_pkey" PRIMARY KEY, btree (name)

"uk_id" UNIQUE, btree (id, name)

Distributed by: (name)

After the primary key is deleted, the original hash distribution key remains unchanged.

TestDB=# alter table t_hash drop constraint t_hash_pkey

ALTER TABLE

TestDB=#\ d t_hash

Table "public.t_hash"

Column | Type | Modifiers

Id | integer |

Name | character varying (50) | not null

Distributed by: (name)

When the distribution key is not the primary key or the only key, let's verify the conclusion that the same value of the distribution key falls on a segment.

In the following experiment, the name column is the distribution key. When we insert the same name value, we can see that all seven records fall in the No. 2 segment node.

TestDB=# insert into t_hash values (1 Magi ZLSD')

INSERT 0 1

TestDB=# insert into t_hash values (2ZLSD')

INSERT 0 1

TestDB=# insert into t_hash values (3 recordszlsd')

INSERT 0 1

TestDB=# insert into t_hash values (4)

INSERT 0 1

TestDB=# insert into t_hash values (5 Magnesia szlsd')

INSERT 0 1

TestDB=# insert into t_hash values (6)

INSERT 0 1

TestDB=#

TestDB=# select gp_segment_id,count (*) from t_hash group by gp_segment_id

Gp_segment_id | count

-+-

2 | 7

(1 row)

2. Randomly distributed bonds

To create a random distribution table, you need to add the distributed randomly keyword, and it is not known which column is used as the distribution key.

TestDB=# create table t_random (id int, name varchar (100)) distributed randomly

CREATE TABLE

TestDB=#

TestDB=#\ d t_random

Table "public.t_random"

Column | Type | Modifiers

-+-

Id | integer |

Name | character varying (100) |

Distributed randomly

Verify the uniqueness of the primary key / unique key, and you can see that the randomly distributed table cannot create primary and unique keys

TestDB=# alter table t_random add primary key (id,name)

ERROR: PRIMARY KEY and DISTRIBUTED RANDOMLY are incompatible

TestDB=#

TestDB=# create unique index uk_r_id on t_random (id)

ERROR: UNIQUE and DISTRIBUTED RANDOMLY are incompatible

TestDB=#

It can be seen from the experiment that the uniqueness of the data can not be realized. Moreover, the data is inserted into the random distribution table, not polling insertion. In the experiment, there are 3 segment, but 3 records are inserted on the 1st, and 2 records are inserted on the 2 segment node before inserting the data in the 0 segment. It is not known how the random distribution table can achieve the average distribution of data. This experiment also verifies the conclusion that the same value of the random distribution table is distributed in different segment.

TestDB=# insert into t_random values (1Magnesia szlsd3')

INSERT 0 1

TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id

Gp_segment_id | count

-+-

| 1 | 1 |

(1 row)

TestDB=#

TestDB=# insert into t_random values (1Magnesia szlsd3')

INSERT 0 1

TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id

Gp_segment_id | count

-+-

2 | 1

| 1 | 1 |

(2 rows)

TestDB=# insert into t_random values (1Magnesia szlsd3')

INSERT 0 1

TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id

Gp_segment_id | count

-+-

2 | 1

| 1 | 2 |

(2 rows)

TestDB=# insert into t_random values (1Magnesia szlsd3')

INSERT 0 1

TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id

Gp_segment_id | count

-+-

2 | 2

| 1 | 2 |

(2 rows)

TestDB=# insert into t_random values (1Magnesia szlsd3')

INSERT 0 1

TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id

Gp_segment_id | count

-+-

2 | 2

| 1 | 3 |

(2 rows)

TestDB=# insert into t_random values (1Magnesia szlsd3')

INSERT 0 1

TestDB=# select gp_segment_id,count (*) from t_random group by gp_segment_id

Gp_segment_id | count

-+-

2 | 2

| 1 | 3 |

0 | 1

(3 rows)

3. CTAS inherits the original table distribution key

There are two CTAS grammars in Greenplum, and no matter which syntax inherits the distribution key of the original table by default. However, some special properties of the table, such as primary key, unique key, APPENDONLY, COMPRESSTYPE (compression), are not inherited.

TestDB=#\ d t_hash

Table "public.t_hash"

Column | Type | Modifiers

Id | integer |

Name | character varying (50) | not null

Indexes:

"t_hash_pkey" PRIMARY KEY, btree (name)

"uk_id" UNIQUE, btree (id, name)

Distributed by: (name)

TestDB=#

TestDB=# create table t_hash_1 as select * from t_hash

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause-- Using column (s) named' name' as the Greenplum Database data distribution key for this table.

HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column (s) chosen are the optimal data distribution key to minimize skew.

SELECT 0

TestDB=#\ d t_hash_1

Table "public.t_hash_1"

Column | Type | Modifiers

Id | integer |

Name | character varying (50) |

Distributed by: (name)

TestDB=#

TestDB=# create table t_hash_2 (like t_hash)

NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table

CREATE TABLE

TestDB=#\ d t_hash_2

Table "public.t_hash_2"

Column | Type | Modifiers

Id | integer |

Name | character varying (50) | not null

Distributed by: (name)

If CTAS creates a table to change the distribution key, add distributed by.

TestDB=# create table t_hash_3 as select * from t_hash distributed by (id)

SELECT 0

TestDB=#

TestDB=#\ d t_hash_3

Table "public.t_hash_3"

Column | Type | Modifiers

Id | integer |

Name | character varying (50) |

Distributed by: (id)

TestDB=#

TestDB=# create table t_hash_4 (like t_hash) distributed by (id)

CREATE TABLE

TestDB=#

TestDB=#\ d t_hash5

Did not find any relation named "t_hash5".

TestDB=#\ d t_hash_4

Table "public.t_hash_4"

Column | Type | Modifiers

Id | integer |

Name | character varying (50) | not null

Distributed by: (id)

When CTAS, randomly random distribution key should pay special attention to, must add distributed randomly, otherwise the original table is hash distribution key, CTAS new table is random distribution key.

TestDB=#\ d t_random

Table "public.t_random"

Column | Type | Modifiers

-+-

Id | integer |

Name | character varying (100) |

Distributed randomly

TestDB=#

TestDB=#\ d t_random_1

Table "public.t_random_1"

Column | Type | Modifiers

-+-

Id | integer |

Name | character varying (100) |

Distributed by: (id)

TestDB=# create table t_random_2 as select * from t_random distributed randomly

SELECT 7

TestDB=#

TestDB=#\ d t_random_2

Table "public.t_random_2"

Column | Type | Modifiers

-+-

Id | integer |

Name | character varying (100) |

Distributed randomly

This is the end of the content of "how Greenplum creates the distribution key of a table". Thank you for reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.