How to deal with massive data and use partitions in Postgres

2025-03-04 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article explains in detail how to deal with large amounts of data in Postgres using table partitions. It is shared here as a reference; I hope it gives you a good grasp of the topic after reading.

An interesting pattern we have noticed is that most Postgres database clusters have one or two tables that grow much faster than the rest, often reaching the GB or even TB level.

In general, these tables store event-tracking data or application log data.

Tables of this size are not a problem for data storage itself, but they do cause other problems:

Query performance degrades, and updating indexes becomes slow.

Maintenance takes longer, notably VACUUM (which reclaims table space).

You have to manage how the application uses the data.

Postgres table partitioning lets you maintain high query performance as the amount of data grows over time, without splitting the data into separate data stores.

On our platform, we use pg_partman to maintain table partitions. (The Heroku platform offers Heroku Postgres, Heroku Redis, and Heroku Kafka as managed data services.)

In our control platform, we have a table that records the state changes of everyone's data stores; after a few weeks, we no longer need this information.

So we use table partitioning here: after two weeks, we can quickly drop the old partition tables, and other queries are not slowed down in the meantime.
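The speed difference comes from how the data is removed. A minimal sketch (the partition name events_20160801 follows the daily naming scheme used later in this article):

```sql
-- Dropping a whole daily partition is a fast, metadata-level operation:
DROP TABLE events_20160801;

-- Deleting the same rows from one giant unpartitioned table instead
-- scans the table and leaves dead rows behind for VACUUM to reclaim:
DELETE FROM events
 WHERE created_at >= '2016-08-01' AND created_at < '2016-08-02';
```

This is why partitioned retention cleanup does not affect the speed of concurrent queries the way a bulk DELETE would.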

To understand how Postgres maintains high performance with large amounts of data, we first need to look at how inheritance works inside Postgres, then how to set up table partitions manually, and finally how to use the pg_partman extension, which makes Postgres partition setup and maintenance much easier.

Part I: Inheritance

Postgres has basic support for table partitioning through table inheritance. Table inheritance in Postgres follows the same concept as object-oriented inheritance: a table is said to inherit from another when it maintains the same data definition and interface. Table inheritance has been in Postgres for a long time and is a mature feature.

Take a look at how table inheritance is implemented in a basic example:

CREATE TABLE products (
    id BIGSERIAL,
    price INTEGER,
    created_at TIMESTAMPTZ,
    updated_at TIMESTAMPTZ
);

CREATE TABLE books (
    isbn TEXT,
    author TEXT,
    title TEXT
) INHERITS (products);

CREATE TABLE albums (
    artist TEXT,
    length INTEGER,
    number_of_songs INTEGER
) INHERITS (products);

In this example, books and albums inherit from products. This means that a record inserted into the books table has all the columns of the products table plus those of the books table. A query issued against the products table references the products table plus all of its children: for this example, the query references products, books, and albums.

This is the default behavior in Postgres. However, you can also query each child table individually.
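A quick sketch of both behaviors, using the tables above (the ONLY keyword is standard Postgres syntax for excluding child tables; the inserted row is a hypothetical example):

```sql
-- Inserting into a child table: the row carries the products
-- columns plus the books columns.
INSERT INTO books (id, price, created_at, isbn, author, title)
VALUES (1, 1500, now(), '978-0000000000', 'Jane Doe', 'A Sample Book');

-- Queries against the parent include all children (products, books, albums):
SELECT id, price FROM products;

-- The ONLY keyword restricts the query to the parent table itself:
SELECT id, price FROM ONLY products;

-- Or query a single child table directly:
SELECT isbn, title, price FROM books;
```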

Part II: Setting up partitions manually

Now that we have a grasp of inheritance in Postgres, let's set up partitions manually.

The basic premise of partitioning is that a master table exists from which all of the child tables inherit.

We will use the terms child table and partition interchangeably for the rest of the setup process.

Live data should not be stored on the master table itself. Instead, data written to the master table needs to be redirected to the appropriate child partition table, which is usually done with a Postgres trigger. On top of that, CHECK constraints are placed on each child table so that inserting data directly into a child table succeeds only if the data belongs there; data that does not belong in a partition will not be stored in that partition table.

To do table partitioning, you need to pick a key that determines how to divide the data. Let's partition a very large, active events table in our Postgres database. For an events table, time is the natural key for splitting the data. Let's assume our events table gets 10 million inserts on any given day; here is our original events table schema:

CREATE TABLE events (uuid text, name text, user_id bigint, account_id bigint, created_at timestamptz);

Let's make a few more assumptions to flesh out the example. Aggregate queries against the events table are scoped to a single day: we aggregate by hour for any given day. We only use the data in the events table for a few days; after that, we no longer query it. And, as stated, 10 million events are generated per day.

Given these assumptions, it makes sense to create daily partitions, using the row's creation time (created_at) as the key for partitioning the data:

CREATE TABLE events (
    uuid text,
    name text,
    user_id bigint,
    account_id bigint,
    created_at timestamptz
);

CREATE TABLE events_20160801 (
    CHECK (created_at >= '2016-08-01 00:00:00' AND created_at < '2016-08-02 00:00:00')
) INHERITS (events);

CREATE TABLE events_20160802 (
    CHECK (created_at >= '2016-08-02 00:00:00' AND created_at < '2016-08-03 00:00:00')
) INHERITS (events);

Our master table is defined as the events table, with two child tables for storing the incoming data, events_20160801 and events_20160802. We also place CHECK constraints on them to ensure that only data for the corresponding day ends up in that partition. Now we need to create a trigger to make sure that any data entering the master table is routed to the correct partition:

CREATE OR REPLACE FUNCTION event_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.created_at >= '2016-08-01 00:00:00' AND
         NEW.created_at < '2016-08-02 00:00:00' ) THEN
        INSERT INTO events_20160801 VALUES (NEW.*);
    ELSIF ( NEW.created_at >= '2016-08-02 00:00:00' AND
            NEW.created_at < '2016-08-03 00:00:00' ) THEN
        INSERT INTO events_20160802 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range. Fix the event_insert_trigger() function!';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER insert_event_trigger
    BEFORE INSERT ON events
    FOR EACH ROW EXECUTE PROCEDURE event_insert_trigger();

Great! The partitions are created, the trigger function is defined, and the trigger has been added to the events table. At this point, the application can insert data into the events table and have it directed to the appropriate partition.

The problem is that maintaining table partitions by hand is error-prone: we have to create new partitions and update the trigger manually every time, and we have not even talked about deleting old data from the database yet. This is where pg_partman comes in.

Part III: Using pg_partman

pg_partman makes managing table partitions much easier than creating partitions by hand. Let's walk through an example, starting from scratch. First, let's load the extension and create our events table. If you already have a large table defined, the pg_partman documentation explains how to convert that table to use table partitioning.

$ heroku pg:psql -a sushi
sushi::DATABASE=> CREATE EXTENSION pg_partman;
sushi::DATABASE=> CREATE TABLE events (id bigint, name text, properties jsonb, created_at timestamptz);

Continuing with our earlier assumptions about the event data: we produce 10 million events every day, and our queries aggregate on a daily basis. Given this, we will create partitions by day.

sushi::DATABASE=> SELECT create_parent('public.events', 'created_at', 'time', 'daily');

This command tells pg_partman to create daily partitions, using the created_at column of the table as the key.

Another problem remains: creating partitions this way is a one-off, manually executed action. At this point, nothing performs regular partition maintenance on the database, creating new table partitions and handling old data. For that, pg_partman provides a maintenance command:

sushi::DATABASE=> SELECT run_maintenance();

The run_maintenance() command instructs pg_partman to look over all partitioned tables and determine whether new partitions need to be created and old partitions destroyed. Whether a partition is destroyed is determined by its retention option. Executed like this on the terminal command line, the command runs only once.
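The retention option lives in pg_partman's part_config table. As a hedged sketch (assuming the extension's objects live in the partman schema; adjust the schema and interval to your installation), configuring run_maintenance() to drop events partitions older than two weeks might look like:

```sql
UPDATE partman.part_config
   SET retention = '2 weeks',          -- keep only the last two weeks
       retention_keep_table = false    -- drop old partitions entirely
 WHERE parent_table = 'public.events';
```

With retention_keep_table set to true instead, old partitions are detached from the parent but kept on disk.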

To run it regularly, we need to set up a scheduled task, which can be done with the Heroku Scheduler. It runs the maintenance command every hour, checking the table partitions and creating new ones as needed.

The Heroku Scheduler is a lightweight service, and running the check hourly has no significant impact on database performance.
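Outside of Heroku, the same hourly cadence can be achieved with a plain cron entry (a sketch; $DATABASE_URL is a placeholder for your actual connection string):

```shell
# crontab entry: run pg_partman maintenance at the top of every hour
0 * * * * psql "$DATABASE_URL" -c "SELECT run_maintenance();"
```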

That's it. We have configured table partitioning in Postgres, and it requires only a little ongoing maintenance on the back end.

The pg_partman setup we have done so far only scratches the surface.

For more details about pg_partman, check out the documentation for the extension module.

Part IV: The important question: do I need to use table partitioning?

Table partitioning allows you to break a very large table into many smaller tables, which can yield a significant performance improvement.

As the section on setting up partitions manually showed, creating and maintaining table partitions yourself involves many challenges.

Using pg_partman reduces that operational burden.

However, table partitioning is not the first resort for every problem. Ask yourself the following questions to determine whether table partitioning is a reasonable solution:

Do you have an exceptionally large table, and is it growing significantly over time?

Is the data immutable, i.e. never updated after it is first inserted?

Have you optimized the index?

Is the data still valuable after a period of time?

Do queries touch only a small range of the data?

Can older data be archived to a cheap storage medium, or does it need to be "rolled up" or "aggregated"?

If your answer to all of these questions is yes, table partitioning may be a good fit. In general, table partitioning requires you to evaluate how your data is used, to plan for partitioning in advance as part of your larger architectural design, and to consider your usage patterns. As long as you take these factors into account, table partitioning can greatly help your application's performance.

That concludes this look at how to deal with large amounts of data in Postgres and how to use partitions. I hope the content above has been of some help to you. If you found the article useful, feel free to share it so more people can see it.
