How to Use strace on Linux to Locate and Diagnose Faults

This article introduces how to use the strace command on Linux to locate and diagnose faults. These situations come up often in real operations, so the two cases below walk through how to handle them step by step; read carefully and you should come away with something useful.
Locating the cause of a failure with strace
This is an Nginx error log:
connect() failed (110: Connection timed out) while connecting to upstream
connect() failed (111: Connection refused) while connecting to upstream
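A quick way to gauge how often this is happening is to count the failures in the error log (the log path here is an assumption):
shell> grep -c "while connecting to upstream" /var/log/nginx/error.log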
It looks like something is wrong with the upstream, which in this case is PHP (version 5.2.5). Unfortunately, monitoring was not in place, so there was no telling what had gone wrong, and the only stopgap was to keep restarting PHP to ease the failure.
Restarting the service by hand every time is a chore; fortunately, it can be automated through cron, running a check like this every minute:
The code is as follows:
#!/bin/bash
LOAD=$(awk '{print $1}' /proc/loadavg)
if [ $(echo "$LOAD > 100" | bc) = 1 ]; then
    /etc/init.d/php-fpm restart
fi
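For reference, a crontab entry along these lines runs the check every minute (the script path here is an assumption):
* * * * * /usr/local/bin/check-php-load.sh >/dev/null 2>&1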
Unfortunately, this is only a stopgap measure, and if we want to solve it completely, we must find out the real cause of the failure.
Enough preamble; it's time to bring out strace and total up the time spent in each system call:
The code is as follows:
shell> strace -c -p $(pgrep -n php-cgi)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 30.53    0.023554         132       179           brk
 14.71    0.011350         140        81           mlock
 12.70    0.009798          15       658        16 recvfrom
  8.96    0.006910           7       927           read
  6.61    0.005097          43       119           accept
  5.57    0.004294           4       977           poll
  3.13    0.002415           7       359           write
  2.82    0.002177           7       311           sendto
  2.64    0.002033           2      1201         1 stat
  2.27    0.001750           1      2312           gettimeofday
  2.11    0.001626           1      1428           rt_sigaction
  1.55    0.001199           2       730           fstat
  1.29    0.000998          10       100       100 connect
  1.03    0.000792           4       178           shutdown
  1.00    0.000773           2       492           open
  0.93    0.000720           1       711           close
  0.49    0.000381           2       238           chdir
  0.35    0.000271           3        87           select
  0.29    0.000224           1       357           setitimer
  0.21    0.000159           2        81           munlock
  0.17    0.000133           2        88           getsockopt
  0.14    0.000110           1       149           lseek
  0.14    0.000106           1       121           mmap
  0.11    0.000086           1       121           munmap
  0.09    0.000072           0       238           rt_sigprocmask
  0.08    0.000063           4        17           lstat
  0.07    0.000054           0       313           uname
  0.00    0.000000           0        15         1 access
  0.00    0.000000           0       100           socket
  0.00    0.000000           0       101           setsockopt
  0.00    0.000000           0       277           fcntl
------ ----------- ----------- --------- --------- ----------------
100.00    0.077145                 13066       118 total
"brk" looks so suspicious that it takes 30% of its time. Just to be on the safe side, confirm separately:
The code is as follows:
shell> strace -T -e brk -p $(pgrep -n php-cgi)
brk(0x1f18000) = 0x1f18000
brk(0x1f58000) = 0x1f58000
brk(0x1f98000) = 0x1f98000
brk(0x1fd8000) = 0x1fd8000
brk(0x2018000) = 0x2018000
Note: strace has two timing-related options, "-r" and "-T". "-r" prints a relative timestamp, the time elapsed since the previous call, while "-T" prints the time actually spent inside each call. "-r" is fine for rough statistics, but in a multitasking environment the CPU can be switched away at any moment, so the relative timestamp also includes time the process spent off-CPU and can be misleading; "-T" is the better choice in that case. The time appears at the end of each line, and here it confirmed that brk was indeed very slow.
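As a quick illustration of the difference, here is a sketch using sleep as a stand-in workload (which syscall sleep uses, nanosleep or clock_nanosleep, depends on the coreutils version):
shell> strace -r -e trace=nanosleep,clock_nanosleep sleep 0.1    # relative timestamp since the previous call
shell> strace -T -e trace=nanosleep,clock_nanosleep sleep 0.1    # time spent inside each call, shown in <...>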
Before continuing to locate the cause of the failure, let's use "man brk" to find out what it means:
brk() sets the end of the data segment to the value specified by end_data_segment, when that value is reasonable, the system does have enough memory and the process does not exceed its max data size (see setrlimit(2)).
In short, brk is how the process asks for more memory (by growing its data segment) when what it already has is not enough. But why is it being called so heavily here?
The code is as follows:
shell> strace -T -p $(pgrep -n php-cgi) 2>&1 | grep -B 10 brk
stat("/path/to/script.php", {...}) = 0
brk(0x1d9a000) = 0x1d9a000
brk(0x1dda000) = 0x1dda000
brk(0x1e1a000) = 0x1e1a000
brk(0x1e5a000) = 0x1e5a000
brk(0x1e9a000) = 0x1e9a000
Through "grep", we can easily get the relevant context, run it repeatedly several times, and find that every time we request some PHP scripts, there will be several time-consuming "brk", and these PHP scripts have a common feature, that is, very large, even hundreds of kilograms, why is there such a large PHP script? In fact, in order to avoid database operations, programmers persist very large array variables into PHP files through "var_export", and then obtain the corresponding variables through "include" in the program. Because the variables are too large, PHP has to execute "brk" frequently. Unfortunately, in this example environment, this operation is relatively slow, resulting in too long processing time for requests and a limited number of PHP processes. As a result, it causes request congestion on the Nginx, which eventually leads to high load failure.
The next step is to verify this inference. First, find where the problem script is referenced:
The code is as follows:
shell> find /path -name "*.php" | xargs grep "script.php"
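To disable the offending includes temporarily, one blunt but reversible approach (file names and paths here are hypothetical) is simply to move the data files aside:
shell> mv /path/to/script.php /path/to/script.php.disabled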
Disabling them outright may seem crude, but special situations call for decisive action; hesitating helps no one. Sure enough, after a while the server load returned to normal. Then summarize the system-call times again:
The code is as follows:
shell> strace -c -p $(pgrep -n php-cgi)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.50    0.001521          11       138         2 recvfrom
 16.11    0.001000          33        30           accept
  7.86    0.000488           8        59           sendto
  7.35    0.000456           1       360           rt_sigaction
  6.73    0.000418           2       198           poll
  5.72    0.000355           1       285           stat
  4.54    0.000282           0       573           gettimeofday
  4.41    0.000274           7        42           shutdown
  4.40    0.000273           2       137           open
  3.72    0.000231           1       197           fstat
  2.93    0.000182           1       187           close
  2.56    0.000159           2        90           setitimer
  2.13    0.000132           1       244           read
  1.71    0.000106           4        30           munmap
  1.16    0.000072           1        60           chdir
  1.13    0.000070           4        18           setsockopt
  1.05    0.000065           1       100           write
  1.05    0.000065           1        64           lseek
  0.95    0.000059           1        75           uname
  0.00    0.000000           0        30           mmap
  0.00    0.000000           0        60           rt_sigprocmask
  0.00    0.000000           0         3         2 access
  0.00    0.000000           0         9           select
  0.00    0.000000           0        20           socket
  0.00    0.000000           0        20        20 connect
  0.00    0.000000           0        18           getsockopt
  0.00    0.000000           0        54           fcntl
  0.00    0.000000           0         9           mlock
  0.00    0.000000           0         9           munlock
------ ----------- ----------- --------- --------- ----------------
100.00    0.006208                  3119        24 total
Obviously, "brk" has been replaced by "recvfrom" and "accept", but these operations are inherently time-consuming, so being able to locate "brk" is the cause of the failure.
Diagnosing problems with strace
Years ago, just knowing that the strace command existed marked you as formidable; nowadays almost everyone knows it. If you ask for help with a performance problem, nine times out of ten you will be told to attach strace, but when you do and watch the output scroll past, nine times out of ten you still cannot see what is wrong. This section uses a simple case to demonstrate a few tricks for diagnosing problems with strace.
The following is a real case; any resemblance to your own systems is no coincidence! Let's look at the top output of a high-load server:
Tip: while top is running, press "1" to expand the per-CPU list and "Shift+P" to sort processes by CPU usage.
In this example it is easy to see that the CPU is mainly occupied by several PHP processes. The PHP processes also use a fair amount of memory, but the system still has memory to spare and swap usage is not severe, so memory is not the main problem.
However, the CPU list shows the time going to kernel mode ("sy") rather than user mode ("us"), which is not what experience would suggest. Linux has several tools for tracing program behavior: strace traces system calls (the kernel boundary), while ltrace traces user-space library calls. What we need here is strace:
The code is as follows:
shell> strace -p <pid>
However, attaching strace directly to a process usually just produces screens full of scrolling output, and it is hard to spot the crux of the problem that way. Fortunately, strace can also summarize time by operation:
The code is as follows:
shell> strace -c -p <pid>
The "c" option is used to summarize the total time spent on each operation, and the result after running is roughly shown in the following figure:
Obviously, we can see that CPU is mainly consumed by clone operations, and we can track clone separately:
The code is as follows:
shell> strace -T -e clone -p <pid>
The "T" option allows you to get the actual time consumed by the operation, and the "e" option allows you to track an operation:
Obviously, a clone operation takes hundreds of milliseconds. For the meaning of clone, refer to the man documentation:
clone() creates a new process, in a manner similar to fork(2). It is actually a library function layered on top of the underlying clone() system call, hereinafter referred to as sys_clone. A description of sys_clone is given towards the end of this page.
Unlike fork(2), these calls allow the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers. (Note that on this manual page, "calling process" normally corresponds to "parent process". But see the description of CLONE_PARENT below.)
Simply put, clone creates a new process. When would PHP make such a system call? Searching the business code turned up the exec function, and the following command verifies that exec does indeed trigger the clone system call:
The code is as follows:
shell> strace -e clone php -r 'exec("ls");'
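As a follow-up sketch, you can also count how many clone calls a request triggers, following any child processes spawned along the way:
shell> strace -f -c -e trace=clone php -r 'exec("ls");'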
That concludes this look at how to use strace on Linux to locate and diagnose faults. Thanks for reading.