
Analysis of a PostgreSQL Master Hang Under Synchronous Replication

Updated 2025-01-28. From: SLTechnology News&Howtos (shulou)


Shulou (Shulou.com), 05/31

This article analyzes why the master hangs under PostgreSQL synchronous replication when the standby is unavailable, by tracing a hung backend process with gdb.

In a streaming replication environment where the PostgreSQL master node is configured for synchronous replication, a backend process on the master hangs when it commits a write (DML, or DDL as in the example below) if the standby node has not been started or cannot reach the master over the network. This hang is analyzed below.

I. Data structure

Latch

A Latch structure should be treated as opaque and accessed only through the public functions. It is defined in the header so that Latches can be embedded as part of larger structs.

    // Variables of type int can normally be accessed atomically, and sig_atomic_t can be thought of as an int.
    // Because access to such a variable must complete in a single instruction, sig_atomic_t cannot be a struct, only a numeric type.
    typedef int __sig_atomic_t;

    /*
     * Latch structure should be treated as opaque and only accessed through
     * the public functions. It is defined here to allow embedding Latches as
     * part of bigger structs.
     */
    typedef struct Latch
    {
        sig_atomic_t is_set;
        bool         is_shared;
        int          owner_pid;
    #ifdef WIN32
        HANDLE       event;
    #endif
    } Latch;

II. Source code interpretation

N/A

III. Trace analysis

Start the master node without starting the standby node, connect to the database with psql, and run a SQL statement. The session hangs:

    testdb=# drop table t1;

Use gdb to trace the hung process

    [xdb@localhost ~]$ ps -ef | grep postgres
    xdb       1318     1  0 12:14 pts/0    00:00:00 /appdb/xdb/pg11.2/bin/postgres
    xdb       1319  1318  0 12:14 ?        00:00:00 postgres: logger
    xdb       1321  1318  0 12:14 ?        00:00:00 postgres: checkpointer
    xdb       1322  1318  0 12:14 ?        00:00:00 postgres: background writer
    xdb       1323  1318  0 12:14 ?        00:00:00 postgres: walwriter
    xdb       1324  1318  0 12:14 ?        00:00:00 postgres: autovacuum launcher
    xdb       1325  1318  0 12:14 ?        00:00:00 postgres: archiver
    xdb       1326  1318  0 12:14 ?        00:00:00 postgres: stats collector
    xdb       1327  1318  0 12:14 ?        00:00:00 postgres: logical replication launcher
    xdb       1331  1318  0 12:15 ?        00:00:00 postgres: xdb testdb [local] DROP TABLE waiting for 0/5D07B668
    [xdb@localhost ~]$ gdb -p 1331
    GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
    ...

View the call stack

    (gdb) bt
    #0  0x00007f4636d48903 in __epoll_wait_nocancel () from /lib64/libc.so.6
    #1  0x000000000088e668 in WaitEventSetWaitBlock (set=0x21640e8, cur_timeout=-1, occurred_events=0x7ffc96572f40, nevents=1) at latch.c:1048
    #2  0x000000000088e543 in WaitEventSetWait (set=0x21640e8, timeout=-1, occurred_events=0x7ffc96572f40, nevents=1, wait_event_info=134217761) at latch.c:1000
    #3  0x000000000088dcec in WaitLatchOrSocket (latch=0x7f462d5b44d4, wakeEvents=17, sock=-1, timeout=-1, wait_event_info=134217761) at latch.c:385
    #4  0x000000000088dbcd in WaitLatch (latch=0x7f462d5b44d4, wakeEvents=17, timeout=-1, wait_event_info=134217761) at latch.c:339
    #5  0x0000000000863e2d in SyncRepWaitForLSN (lsn=1560786536, commit=true) at syncrep.c:286
    #6  0x0000000000546279 in RecordTransactionCommit () at xact.c:1359
    #7  0x0000000000546da3 in CommitTransaction () at xact.c:2074
    #8  0x0000000000547a3f in CommitTransactionCommand () at xact.c:2817
    #9  0x00000000008be250 in finish_xact_command () at postgres.c:2523
    #10 0x00000000008bbf45 in exec_simple_query (query_string=0x20a1d78 "drop table t1;") at postgres.c:1170
    #11 0x00000000008c0191 in PostgresMain (argc=1, argv=0x20cdcd8, dbname=0x20cdb40 "testdb", username=0x209ea98 "xdb") at postgres.c:4182
    #12 0x000000000081e06c in BackendRun (port=0x20c3b10) at postmaster.c:4361
    #13 0x000000000081d7df in BackendStartup (port=0x20c3b10) at postmaster.c:4033
    #14 0x0000000000819bd9 in ServerLoop () at postmaster.c:1706
    #15 0x000000000081948f in PostmasterMain (argc=1, argv=0x209ca50) at postmaster.c:1379
    #16 0x0000000000742931 in main (argc=1, argv=0x209ca50) at main.c:228
    (gdb)
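The lsn=1560786536 argument of SyncRepWaitForLSN and the "waiting for 0/5D07B668" text in the process title are the same WAL position in two notations. The sketch below shows the conversion, assuming the usual PostgreSQL text form of a 64-bit LSN (high 32 bits, a slash, low 32 bits, both in hexadecimal). The helper names format_lsn and parse_lsn are made up for this illustration and are not PostgreSQL functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Render a 64-bit WAL position as "hi/lo" in upper-case hex.
 * buf must hold at least 18 bytes ("FFFFFFFF/FFFFFFFF" plus NUL).
 */
void
format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen >= 18);
    snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
             (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/*
 * Parse the "hi/lo" text form back into a 64-bit value.
 * Returns 0 on success, -1 if the text does not match.
 */
int
parse_lsn(const char *text, uint64_t *lsn)
{
    uint32_t hi;
    uint32_t lo;

    assert(text != NULL && lsn != NULL);
    if (sscanf(text, "%" SCNx32 "/%" SCNx32, &hi, &lo) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}
```

Calling parse_lsn on "0/5D07B668" yields 1560786536, so frame #5 is waiting for exactly the position shown in the ps output.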

Kill the process, reconnect, and set a breakpoint on WaitLatch to trace the wait

    # [xdb@localhost ~]$ kill -9 1331
    # testdb=# select pg_backend_pid();
     pg_backend_pid
    ----------------
               1377
    (1 row)
    # [xdb@localhost ~]$ gdb -p 1377
    ...
    (gdb) b WaitLatch
    Breakpoint 1 at 0x88dbac: file latch.c, line 339.
    (gdb)
    # testdb=# drop table t1;
    ERROR:  table "t1" does not exist
    testdb=# create table t1 (id int);

The breakpoint is hit

    (gdb) b WaitLatch
    Breakpoint 1 at 0x88dbac: file latch.c, line 339.
    (gdb) c
    Continuing.

    Breakpoint 1, WaitLatch (latch=0x7f462d5b44d4, wakeEvents=17, timeout=-1, wait_event_info=134217761) at latch.c:339
    339         return WaitLatchOrSocket(latch, wakeEvents, PGINVALID_SOCKET, timeout,
    (gdb)

Enter WaitLatchOrSocket

    (gdb) step
    WaitLatchOrSocket (latch=0x7f462d5b44d4, wakeEvents=17, sock=-1, timeout=-1, wait_event_info=134217761) at latch.c:359
    359         int         ret = 0;
    (gdb)
    (gdb) p *latch
    $1 = {is_set = 0, is_shared = true, owner_pid = 1377}

Build the wait event set

    (gdb) n
    362         WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
    (gdb) n
    364         if (wakeEvents & WL_TIMEOUT)
    (gdb)
    367             timeout = -1;
    (gdb)
    369         if (wakeEvents & WL_LATCH_SET)
    (gdb) p *set
    $2 = {nevents = 0, nevents_space = 3, events = 0x2181eb8, latch = 0x0, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
    (gdb) p *set->events
    $3 = {pos = 0, events = 0, fd = 0, user_data = 0x0}
    (gdb) p *set->epoll_ret_events
    $4 = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
    (gdb)
    $5 = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
    (gdb) n
    370             AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
    (gdb)
    373         if (wakeEvents & WL_POSTMASTER_DEATH && IsUnderPostmaster)
    (gdb)
    374             AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
    (gdb)
    377         if (wakeEvents & WL_SOCKET_MASK)
    (gdb)
    385         rc = WaitEventSetWait(set, timeout, &event, 1, wait_event_info);
    (gdb) p *set
    $6 = {nevents = 2, nevents_space = 3, events = 0x2181eb8, latch = 0x7f462d5b44d4, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
    (gdb) p *set->events
    $7 = {pos = 0, events = 1, fd = 11, user_data = 0x0}
    (gdb) p *set->epoll_ret_events
    $8 = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
    (gdb)
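The branches taken in this trace follow from wakeEvents=17, which is a bit mask. The sketch below decodes it, assuming the WL_* flag values from src/include/storage/latch.h in PostgreSQL 11 (reproduced here from memory, so check them against your own source tree). The function describe_wake_events is a hypothetical helper written for this article, not part of PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed PostgreSQL 11 values; later releases add further flags. */
#define WL_LATCH_SET         (1 << 0)
#define WL_SOCKET_READABLE   (1 << 1)
#define WL_SOCKET_WRITEABLE  (1 << 2)
#define WL_TIMEOUT           (1 << 3)
#define WL_POSTMASTER_DEATH  (1 << 4)

/*
 * Write the names of the flags set in wakeEvents into buf, separated by '|'.
 * Returns the number of flags written; stops early if buf is too small.
 */
int
describe_wake_events(int wakeEvents, char *buf, size_t buflen)
{
    static const struct
    {
        int         bit;
        const char *name;
    } flags[] = {
        {WL_LATCH_SET, "WL_LATCH_SET"},
        {WL_SOCKET_READABLE, "WL_SOCKET_READABLE"},
        {WL_SOCKET_WRITEABLE, "WL_SOCKET_WRITEABLE"},
        {WL_TIMEOUT, "WL_TIMEOUT"},
        {WL_POSTMASTER_DEATH, "WL_POSTMASTER_DEATH"},
    };
    size_t      used = 0;
    int         count = 0;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++)
    {
        int         n;

        if ((wakeEvents & flags[i].bit) == 0)
            continue;
        n = snprintf(buf + used, buflen - used, "%s%s",
                     count > 0 ? "|" : "", flags[i].name);
        if (n < 0 || (size_t) n >= buflen - used)
            break;              /* output truncated, stop here */
        used += (size_t) n;
        count++;
    }
    return count;
}
```

Under these assumed values, 17 decodes to WL_LATCH_SET|WL_POSTMASTER_DEATH. That matches the trace: the WL_TIMEOUT branch is skipped and timeout is forced to -1, two events are added (nevents = 2), and the WL_SOCKET_MASK branch is not taken.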

Enter WaitEventSetWait

    (gdb) step
    WaitEventSetWait (set=0x2181e90, timeout=-1, occurred_events=0x7ffc96572f40, nevents=1, wait_event_info=134217761) at latch.c:925
    925         int         returned_events = 0;
    (gdb)

Input parameters

    (gdb) n
    928         long        cur_timeout = -1;
    (gdb) p *set
    $9 = {nevents = 2, nevents_space = 3, events = 0x2181eb8, latch = 0x7f462d5b44d4, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
    (gdb) p *occurred_events
    $10 = {pos = 35135068, events = 0, fd = -1772664741, user_data = 0x7ffc96572fa0}
    (gdb)

Run the initial checks and set up the wait state

    (gdb) n
    930         Assert(nevents > 0);
    (gdb)
    936         if (timeout >= 0)
    (gdb)
    943         pgstat_report_wait_start(wait_event_info);
    (gdb)
    946         waiting = true;
    (gdb)
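The value passed to pgstat_report_wait_start, wait_event_info=134217761, is 0x08000021 in hex. As a rough sketch, assuming PostgreSQL's convention that the top byte carries the wait class and the low 16 bits carry the event within that class, and assuming PG_WAIT_IPC is 0x08000000 as in pgstat.h (both quoted from memory), it splits as follows. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define WAIT_CLASS_MASK 0xFF000000U     /* assumed: top byte is the wait class */
#define WAIT_EVENT_MASK 0x0000FFFFU     /* assumed: low 16 bits are the event */
#define PG_WAIT_IPC     0x08000000U     /* assumed value of the IPC wait class */

/* Extract the wait class from a packed wait_event_info value. */
uint32_t
wait_event_class(uint32_t info)
{
    return info & WAIT_CLASS_MASK;
}

/* Extract the event number within its class. */
uint32_t
wait_event_id(uint32_t info)
{
    return info & WAIT_EVENT_MASK;
}

/* Pack a class and an event number back into one value. */
uint32_t
make_wait_event_info(uint32_t class_id, uint32_t event_id)
{
    assert((class_id & ~WAIT_CLASS_MASK) == 0);
    assert((event_id & ~WAIT_EVENT_MASK) == 0);
    return class_id | event_id;
}
```

With these masks, 134217761 splits into class 0x08000000 and event 33. If the class value assumption holds, this is an IPC-type wait, which agrees with frame #5 of the backtrace (SyncRepWaitForLSN): the backend is waiting on another process rather than on a lock or on I/O.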

Loop for as long as no event has occurred

    951         while (returned_events == 0)
    (gdb)

set->latch->is_set is not true, so the loop continues

    982             if (set->latch && set->latch->is_set)
    (gdb) p *set->latch
    $11 = {is_set = 0, is_shared = true, owner_pid = 1377}
    (gdb)

Enter WaitEventSetWaitBlock

    (gdb) n
    1000            rc = WaitEventSetWaitBlock(set, cur_timeout,
    (gdb) step
    WaitEventSetWaitBlock (set=0x2181e90, cur_timeout=-1, occurred_events=0x7ffc96572f40, nevents=1) at latch.c:1042
    1042        int         returned_events = 0;
    (gdb)

Call epoll_wait; the process blocks here

    (gdb) n
    1048        rc = epoll_wait(set->epoll_fd, set->epoll_ret_events,
    (gdb) p *set
    $12 = {nevents = 2, nevents_space = 3, events = 0x2181eb8, latch = 0x7f462d5b44d4, latch_pos = 0, epoll_fd = 37, epoll_ret_events = 0x2181f00}
    (gdb)
    (gdb) n
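Why the call never returns can be reproduced outside PostgreSQL. The Linux-only sketch below is a simplified stand-in for the wait set, not PostgreSQL's code: it watches the read end of a pipe with epoll and polls it. The function names make_wait_set, wait_once and wake are invented for this example.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Create an epoll set that watches the read end of a fresh pipe.
 * Returns the epoll fd and fills pipefd[0] (read) and pipefd[1] (write),
 * or returns -1 on failure.
 */
int
make_wait_set(int pipefd[2])
{
    struct epoll_event ev = {0};
    int         epfd;

    assert(pipefd != NULL);
    if (pipe(pipefd) != 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
    {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) != 0)
    {
        close(epfd);
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }
    return epfd;
}

/*
 * Wait for at most one event. timeout_ms = 0 polls without blocking;
 * timeout_ms = -1 blocks until some watched fd becomes ready.
 */
int
wait_once(int epfd, int timeout_ms)
{
    struct epoll_event out;

    return epoll_wait(epfd, &out, 1, timeout_ms);
}

/* Make the pipe readable so that a waiter in epoll_wait wakes up. */
int
wake(const int pipefd[2])
{
    return write(pipefd[1], "x", 1) == 1 ? 0 : -1;
}
```

Before wake is called, wait_once(epfd, 0) reports zero ready events, so the same call with a timeout of -1 would block indefinitely. That is the situation in the trace: cur_timeout is -1 and nothing has made a watched descriptor ready.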

Start the standby node

    # [xdb@localhost ~]$ pg_ctl start
    pg_ctl: another server might be running; trying to start server anyway...

The hung backend receives a signal

    Program received signal SIGUSR1, User defined signal 1.
    0x00007f4636d48903 in __epoll_wait_nocancel () from /lib64/libc.so.6
    (gdb)
    (gdb) n
    Single stepping until exit from function __epoll_wait_nocancel,
    which has no line number information.
    procsignal_sigusr1_handler (postgres_signal_arg=-1) at procsignal.c:262
    262     {
    (gdb)

In summary: with synchronous replication enabled and no standby connected, the committing backend blocks in epoll_wait underneath SyncRepWaitForLSN with no timeout (timeout=-1). Once the standby is started, the backend receives SIGUSR1 and leaves epoll_wait.
