How to use strace for debugging in Software deployment

2025-07-09 Update. From: SLTechnology News&Howtos, Shulou (Shulou.com), 06/01 report

This article explains how to use strace to debug problems in software deployments.

What is strace?

strace is a tool for tracing system calls. It is primarily a Linux tool, but similar tools exist on other systems (such as DTrace and ktrace).

Its basic usage is very simple. Just follow strace with the command you need to run, and it will show all the system calls triggered by that command (you may need to install strace first):

    $ strace echo Hello
    ...Snip lots of stuff...
    write(1, "Hello\n", 6) = 6
    close(1) = 0
    close(2) = 0
    exit_group(0) = ?
    +++ exited with 0 +++

What are these system calls? They are like an API provided by the operating system kernel. A long time ago, software had direct access to the hardware. If a program needed to display something on the screen, it had to work directly with the video hardware's ports and memory-mapped registers. Once multitasking operating systems became popular, this led to chaos: different applications competed for the hardware, and a bug in one application could crash other applications, or even the whole system. So CPUs began to support different privilege modes (or "protection rings"). These let the operating system kernel run in the most privileged mode with full hardware access, while applications run in a less privileged mode and must make system calls to the kernel to interact with the hardware.

At the binary level, making a system call differs somewhat from making an ordinary function call, but most programs use wrapper functions provided by a standard library. For example, the POSIX C standard library contains a write() function that holds all the architecture-specific code needed to make the write system call.
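As a rough illustration (a sketch that is not part of the original article), Python's os.write is a similarly thin wrapper around the write system call. Assuming a Linux or other POSIX system, the snippet below writes to one end of a pipe and gets back a byte count, the same kind of value strace prints after the = sign. The helper name write_hello is made up for this example.

```python
import os


def write_hello(fd: int) -> int:
    """Write the bytes for "Hello" plus a newline to fd via os.write.

    The return value is the number of bytes written, which is what
    strace shows after the "=" on a write(...) line.
    """
    return os.write(fd, b"Hello\n")


# Demo on a pipe so the example does not depend on the console.
read_end, write_end = os.pipe()
written = write_hello(write_end)
os.close(write_end)
received = os.read(read_end, 32)
os.close(read_end)
print(written, received)
```

Running this script under strace should show a matching write call among the interpreter's many other system calls.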

To put it simply, every interaction between an application and its environment (the computer system) goes through system calls. So tracing system calls is a good way to track down errors when software works on one machine but not on another. Specifically, tracing system calls lets you analyze the following typical operations:

Console input and output (IO)

Network IO

File system access and file IO

Process / thread life cycle management

Raw memory management

Access specific device drivers

When can I use strace?

In theory, strace works with any user-space program, because every user-space program has to make system calls. strace is most effective with compiled, low-level programs, but it also works with high-level languages such as Python, as long as you can wade through the extra output from the runtime environment and interpreter.

strace really shines when software works on one machine but not on another, and produces vague error messages about files, permissions, or commands that could not be run. Unfortunately, it is less helpful with high-level problems, such as certificate validation errors. These usually call for a combination of strace (sometimes ltrace) and higher-level tools (such as the openssl command-line tool for debugging certificate errors).

The examples in this article are based on a standalone server, but system call tracing is usually possible on more complex deployment platforms too; you just need to find the right tools.

A simple example

Suppose you are trying to run a server application called foo, but the following occurs:

    $ foo
    Error opening configuration file: No such file or directory

Obviously, it cannot find the configuration file you have already written. This can happen because package managers sometimes customize file paths when building an application, so you should follow the installation guide for your specific distribution. If the error message told you where the configuration file should be, you could solve the problem in seconds, but what if it doesn't? How do you find the right path?

If you have access to the source code, you can read it to find the answer. That is a good fallback plan, but not the fastest solution. You could also use a stepping debugger such as gdb to watch what the program does, but it is more efficient to use strace, a tool designed specifically to show how a program interacts with its environment.

At first, the amount of output strace generates may be overwhelming, but fortunately you can ignore most of it. I often use the -o option to save the trace to a separate file:

    $ strace -o /tmp/trace foo
    Error opening configuration file: No such file or directory
    $ cat /tmp/trace
    execve("foo", ["foo"], 0x7ffce98dc010 /* 16 vars */) = 0
    brk(NULL) = 0x56363b3fb000
    access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
    openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
    fstat(3, {st_mode=S_IFREG|0644, st_size=25186, ...}) = 0
    mmap(NULL, 25186, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f2f12cf1000
    close(3) = 0
    openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
    read(3, "\177ELF\2\1\3\0\0\0\0\0\0\0\0\0\3\0>\0"..., 832) = 832
    fstat(3, {st_mode=S_IFREG|0755, st_size=1824496, ...}) = 0
    mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2f12cef000
    mmap(NULL, 1837056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f2f12b2e000
    mprotect(0x7f2f12b50000, 1658880, PROT_NONE) = 0
    mmap(0x7f2f12b50000, 1343488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f2f12b50000
    mmap(0x7f2f12c98000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16a000) = 0x7f2f12c98000
    mmap(0x7f2f12ce5000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f2f12ce5000
    mmap(0x7f2f12ceb000, 14336, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f2f12ceb000
    close(3) = 0
    arch_prctl(ARCH_SET_FS, 0x7f2f12cf0500) = 0
    mprotect(0x7f2f12ce5000, 16384, PROT_READ) = 0
    mprotect(0x56363b08b000, 4096, PROT_READ) = 0
    mprotect(0x7f2f12d1f000, 4096, PROT_READ) = 0
    munmap(0x7f2f12cf1000, 25186) = 0
    openat(AT_FDCWD, "/etc/foo/config.json", O_RDONLY) = -1 ENOENT (No such file or directory)
    dup(2) = 3
    fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
    brk(NULL) = 0x56363b3fb000
    brk(0x56363b41c000) = 0x56363b41c000
    fstat(3, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x8), ...}) = 0
    write(3, "Error opening configuration file"..., 60) = 60
    close(3) = 0
    exit_group(1) = ?
    +++ exited with 1 +++

The first page or so of strace output is usually low-level process startup. You will see a lot of mmap, mprotect, and brk calls, which allocate raw memory and map dynamic libraries. When looking for an error, it is actually best to read strace output from the bottom up. Near the end you can see the write call that outputs the error message. Working upwards from there, the first failed system call you meet is openat, which fails with ENOENT ("No such file or directory") while trying to open /etc/foo/config.json. Now we know where the configuration file should go.

This is a simple example, but I would say that 90% of the time, debugging with strace requires nothing more complicated. Here are the complete debugging steps:

1. Get a vague error message from the program

2. Run the program under strace

3. Find the error message in the trace output

4. Work backwards to the first failed system call

5. The system call in step 4 is likely to show you what the problem is
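The steps above can be sketched in a few lines of Python. This is only an illustrative helper, not a tool from the original article: the function name last_failure_before and the regular expression are assumptions about how typical strace lines look, and the sample lines are abbreviated from the foo trace shown earlier.

```python
import re

# A failed call ends with "= -1 ERRNO (description)" in strace output.
# The optional leading number allows for the PID column added by -f.
FAILED_CALL = re.compile(r"^(?:\d+\s+)?(\w+)\(.*\)\s+= -1 (E[A-Z0-9]+) \((.*)\)$")


def last_failure_before(trace_lines, message):
    """Return (syscall, errno, line) for the closest failed call above the
    write that emitted `message`, or None if nothing matches."""
    # Step 3: find the write call that carries the error message.
    error_index = None
    for i, line in enumerate(trace_lines):
        if "write(" in line and message in line:
            error_index = i
    if error_index is None:
        return None
    # Step 4: walk backwards to the nearest failed system call.
    for line in reversed(trace_lines[:error_index]):
        match = FAILED_CALL.match(line.strip())
        if match:
            return match.group(1), match.group(2), line.strip()
    return None


# Abbreviated sample lines from the foo trace above.
SAMPLE = '''\
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/foo/config.json", O_RDONLY) = -1 ENOENT (No such file or directory)
dup(2) = 3
write(3, "Error opening configuration file"..., 60) = 60
exit_group(1) = ?
'''.splitlines()

result = last_failure_before(SAMPLE, "Error opening configuration")
print(result)
```

The earlier failure on /etc/ld.so.preload is skipped because the search stops at the failed call closest to the error message, which is usually the relevant one.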

Tips

Before you start more complex debugging, here are some useful debugging tips to help you use strace efficiently:

man is your friend

On many *nix operating systems, you can view a list of system calls with man syscalls. You will see entries like brk(2), which means you can get more information by running man 2 brk.

One small gotcha: man 2 fork shows the man page for the fork() wrapper in GNU libc, which is now actually implemented with the clone system call. The semantics of fork() are unchanged, but if I write a program that uses fork() and trace it with strace, I will not find any fork calls, only clone calls. Gotchas like this can be confusing when you compare source code with strace output.

Use -o to save the output to a file

strace can generate a lot of output, so it helps to save it to a separate file (as in the example above). This also keeps the program's own output from getting mixed up with strace's output in the console.

Use -s to see more of string arguments

You may have noticed that the second part of the error message does not appear in the trace above. That is because strace shows only the first 32 bytes of string arguments by default. If you need to see more, add something like -s 128 to the strace command.

-y makes it easier to track files or sockets

"Everything is a file" means that *nix systems do all IO through file descriptors, whether the target is a real file, a network connection, or an interprocess pipe. This is convenient for programming, but it makes it hard to tell what is really going on when you see generic read and write calls in a trace.

The -y option makes strace annotate each file descriptor in the output with a note about what it points to.

Use -p to attach to a running process

As we will see in a later example, sometimes you want to trace a program that is already running. If you know that its process ID is 1337 (which you can look up with ps), you can do this:

    $ strace -p 1337
    ...system call trace output...

You may need root privileges to do this.

Use -f to trace child processes

strace traces only one process by default. If that process spawns a child process, you will see the system call that creates the child (usually clone), but none of the calls made by the child.

If you think the error is in a child process, you need the -f option to follow child processes. The downside is that the output becomes harder to read. When tracing a single process, strace shows a single stream of call events. When tracing multiple processes, you may see the start of a call marked as unfinished, then a series of calls from other processes or threads, and only then the end of the original call marked as resumed. Alternatively, the -ff option, used together with -o, writes each process's trace to a separate file (see the strace manual for details).
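When a trace saved with -f and -o mixes several processes, each line starts with the PID that made the call, so the streams can be separated again after the fact. The following is a minimal sketch under that assumption; split_by_pid is a made-up helper name, and the sample lines are abbreviated from the bcron trace later in this article.

```python
import re
from collections import defaultdict

# With -f and -o, each trace line starts with the PID that made the call.
PID_PREFIX = re.compile(r"^(\d+)\s+(.*)$")


def split_by_pid(trace_lines):
    """Group trace lines by their leading PID, preserving order per process."""
    per_pid = defaultdict(list)
    for line in trace_lines:
        match = PID_PREFIX.match(line)
        if match:
            per_pid[int(match.group(1))].append(match.group(2))
    return dict(per_pid)


SAMPLE = '''\
20629 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7faa47c44810) = 21470
21470 execve("/usr/sbin/bcron-spool", ["bcron-spool"], 0x55d2460807e0 /* 27 vars */) = 0
21470 chdir("/var/spool/cron") = 0
'''.splitlines()

streams = split_by_pid(SAMPLE)
for pid, calls in streams.items():
    print(pid, len(calls))
```

This gives roughly the same per-process view as -ff, but from a single combined file.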

Use -e for filtering

As you can see, strace traces all system calls by default. You can use the -e option to restrict the trace to the calls you are interested in (see the strace manual). The advantage is that running strace with a filter is faster than tracing everything and filtering afterwards with grep. To be honest, most of the time I don't bother.

Not all errors are bad

A simple but common example is a program that looks for a file in several places, such as a shell searching for the bin/ directory that contains an executable:

    $ strace sh -c uname
    ...
    stat("/home/user/bin/uname", 0x7ffceb817820) = -1 ENOENT (No such file or directory)
    stat("/usr/local/bin/uname", 0x7ffceb817820) = -1 ENOENT (No such file or directory)
    stat("/usr/bin/uname", {st_mode=S_IFREG|0755, st_size=39584, ...}) = 0
    ...

Heuristics like "the last failed call before the error message" are good at finding the relevant error. Either way, searching from the bottom up makes sense.
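To see why such failures are harmless, here is a rough Python sketch of the same search pattern. It is an approximation, not how any particular shell is implemented: find_in_path is a hypothetical helper, and the directory names are invented for the demo. Each miss would show up under strace as a failed stat-family call with ENOENT, even though the search as a whole succeeds.

```python
import os
import tempfile


def find_in_path(name, directories):
    """Probe each directory in order and return (found_path, failed_probes)."""
    failed = []
    for directory in directories:
        candidate = os.path.join(directory, name)
        try:
            os.stat(candidate)      # visible as a stat-family call in a trace
        except FileNotFoundError:   # ENOENT: expected, keep looking
            failed.append(candidate)
            continue
        return candidate, failed
    return None, failed


# Demo with two throwaway directories; only the second one has the file.
with tempfile.TemporaryDirectory() as root:
    first = os.path.join(root, "home-bin")
    second = os.path.join(root, "usr-bin")
    os.mkdir(first)
    os.mkdir(second)
    open(os.path.join(second, "uname"), "w").close()
    found, failed = find_in_path("uname", [first, second])
    found_name = os.path.relpath(found, root)
    failed_names = [os.path.relpath(path, root) for path in failed]

print(found_name, failed_names)
```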

C programming tutorials are a good way to learn about system calls

Calls to standard C library functions are not system calls, but they are only a thin layer on top of them. So if you know even a little C, reading a system call trace is fairly easy. For example, if you are debugging network system calls, try skimming Beej's classic Guide to Network Programming.

A more complex debugging example

As I said, the simple example above shows how I use strace most of the time. Sometimes, though, a little more detective work is needed, so here is a slightly more complex (and real) example.

bcron is a job scheduler, an alternative implementation of the classic *nix cron daemon. It has been installed on a server, but when someone tries to edit a job schedule, this happens:

    # crontab -e -u logs
    bcrontab: Fatal: Could not create temporary file

So bcron tried to write some file, but it failed and does not tell us why. Here is the strace output:

    # strace -o /tmp/trace crontab -e -u logs
    bcrontab: Fatal: Could not create temporary file
    # cat /tmp/trace
    ...
    openat(AT_FDCWD, "bcrontab.14779.1573691864.847933", O_RDONLY) = 3
    mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f82049b4000
    read(3, "#Ansible: logsagg\n20 14 * * * lo"..., 8192) = 150
    read(3, "", 8192) = 0
    munmap(0x7f82049b4000, 8192) = 0
    close(3) = 0
    socket(AF_UNIX, SOCK_STREAM, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path="/var/run/bcron-spool"}, 110) = 0
    mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f82049b4000
    write(3, "156:Slogs\0#Ansible: logsagg\n20 1"..., ...) = ...
    read(3, "32:ZCould not create temporary f"..., 8192) = 36
    munmap(0x7f82049b4000, 8192) = 0
    close(3) = 0
    write(2, "bcrontab: Fatal: Could not creat"..., 49) = 49
    unlink("bcrontab.14779.1573691864.847933") = 0
    exit_group(...) = ?
    +++ exited with ... +++

There is a write of the error message just before the end of the program, but this time something is different. First, there is no relevant failed system call before it. Second, we can see that the error message was read with read from somewhere else. It looks like the real error happened elsewhere, and bcrontab is just relaying the message.

If you look at man 2 read, you will see that the first argument (3) is a file descriptor, the handle *nix systems use for all IO. How do you find out what file descriptor 3 refers to? In this particular case you could run strace with the -y option (described above), which annotates each file descriptor with what it points to, but it is useful to know how to work it out from a trace like the one above.

A file descriptor can come from one of many system calls (depending on whether it is for the console, a network socket, a real file, and so on), but in any case we can search for system calls that return 3 (that is, look for = 3 in the strace output). There are two such calls in this trace: the openat at the top and the socket in the middle. openat opens a file, but the close(3) that follows shows that it is closed again. (Note: file descriptors can be reused after being opened and closed.) So the socket call is the relevant one (it is the last one before the read), which tells us that bcrontab is talking to something over a socket. On the next line, connect shows that file descriptor 3 is a Unix domain socket connected to /var/run/bcron-spool.
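The "search for = 3" step can be sketched as a small Python helper. This is an illustrative assumption, not part of strace: origin_of_fd is a made-up name, and FD_CREATORS is a deliberately incomplete list of calls that return new descriptors. Restricting the search to such calls avoids false matches like a read that happens to return 3 bytes. The sample lines are abbreviated from the trace above.

```python
import re

# Matches "name(args) = N" where N is a non-negative return value.
RETURNS_NUMBER = re.compile(r"^(\w+)\(.*\)\s+= (\d+)$")

# Non-exhaustive set of calls whose return value is a new file descriptor.
FD_CREATORS = {"open", "openat", "creat", "socket", "accept", "accept4",
               "dup", "dup2", "dup3"}


def origin_of_fd(trace_lines, fd, before_index):
    """Return the most recent line above `before_index` that created `fd`."""
    for line in reversed(trace_lines[:before_index]):
        match = RETURNS_NUMBER.match(line.strip())
        if (match and match.group(1) in FD_CREATORS
                and int(match.group(2)) == fd):
            return line.strip()
    return None


SAMPLE = '''\
openat(AT_FDCWD, "bcrontab.14779.1573691864.847933", O_RDONLY) = 3
read(3, "", 8192) = 0
close(3) = 0
socket(AF_UNIX, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/bcron-spool"}, 110) = 0
read(3, "32:ZCould not create temporary f"..., 8192) = 36
'''.splitlines()

# Index 5 is the read that received the error message.
origin = origin_of_fd(SAMPLE, 3, before_index=5)
print(origin)
```

Because the search runs backwards from the read, it lands on the socket call and ignores the earlier openat whose descriptor had already been closed and reused.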

So we need to figure out which process is listening on the other end of the Unix socket. There are two handy tools for this, both useful when debugging server deployments. One is netstat, or its newer replacement ss. Both commands describe the active network sockets on the system. The -l option shows sockets in the listening state, and the -p option shows which program is using each socket. (They have more useful options, but these two are enough for this job.)

    # ss -pl | grep /var/run/bcron-spool
    u_str LISTEN 0 128 /var/run/bcron-spool 1466637 * 0 users:(("unixserver",pid=20629,fd=3))

This tells us that the listener on the /var/run/bcron-spool socket is the command unixserver, running with process ID 20629. (Coincidentally, it also uses file descriptor 3 for the socket.)

The second handy tool for finding the same information is lsof. It lists all files (or file descriptors) open on the system. Alternatively, we can ask for information about one specific file:

    # lsof /var/run/bcron-spool
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    unixserve 20629 cron 3u unix 0x000000005ac4bd83 0t0 1466637 /var/run/bcron-spool type=STREAM

Process 20629 is a long-running server, so we can attach to it with strace -o /tmp/trace -p 20629 and watch its system calls. If we then try to edit the cron job schedule in another terminal, we capture the following trace while the error happens:

    accept(3, NULL, NULL) = 4
    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7faa47c44810) = 21181
    close(4) = 0
    accept(3, NULL, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21181, si_uid=998, si_status=0, si_utime=0, si_stime=0} ---
    wait4(0, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG|WSTOPPED, NULL) = 21181
    wait4(0, 0x7ffe6bc36764, WNOHANG|WSTOPPED, NULL) = -1 ECHILD (No child processes)
    rt_sigaction(SIGCHLD, {sa_handler=0x55d244bdb690, sa_mask=[CHLD], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7faa47ab9840}, {sa_handler=0x55d244bdb690, sa_mask=[CHLD], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7faa47ab9840}, 8) = 0
    rt_sigreturn({mask=[]}) = 43
    accept(3, NULL, NULL) = 4
    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7faa47c44810) = 21200
    close(4) = 0
    accept(3, NULL, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21200, si_uid=998, si_status=111, si_utime=0, si_stime=0} ---
    wait4(0, [{WIFEXITED(s) && WEXITSTATUS(s) == 111}], WNOHANG|WSTOPPED, NULL) = 21200
    wait4(0, 0x7ffe6bc36764, WNOHANG|WSTOPPED, NULL) = -1 ECHILD (No child processes)
    rt_sigaction(SIGCHLD, {sa_handler=0x55d244bdb690, sa_mask=[CHLD], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7faa47ab9840}, {sa_handler=0x55d244bdb690, sa_mask=[CHLD], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7faa47ab9840}, 8) = 0
    rt_sigreturn({mask=[]}) = 43
    accept(3, NULL, NULL

(The last accept call did not complete during the trace.) Unfortunately, this trace does not contain the error message we are looking for. We do not see any of the messages that bcrontab sent to or received from the socket. Instead we see a lot of process management (clone, wait4, SIGCHLD, and so on). The process spawns a child process, and we can guess that the child does the real work. To capture the child's trace as well, we have to add the -f option to strace. Here is what we find when we search for the error message in the new trace, captured with strace -f -o /tmp/trace -p 20629:

    21470 openat(AT_FDCWD, "tmp/spool.21470.1573692319.854640", O_RDWR|O_CREAT|O_EXCL, 0600) = -1 EACCES (Permission denied)
    21470 write(1, "32:ZCould not create temporary f"..., 36) = 36
    21470 write(2, "bcron-spool[21470]: Fatal: logs:"..., 84) = 84
    21470 unlink("tmp/spool.21470.1573692319.854640") = -1 ENOENT (No such file or directory)
    21470 exit_group(111) = ?
    21470 +++ exited with 111 +++

Now we know that process ID 21470 got a permission denied error when trying to create the file tmp/spool.21470.1573692319.854640 (relative to its current working directory). If we knew the current working directory, we would have the full path and could work out why the process cannot create its temporary file there. Unfortunately, the process has already exited, so we cannot use lsof -p 21470 to find its working directory, but we can work backwards in the trace and look for the system call process 21470 used to change its working directory. That system call is chdir (easy to find with a search engine). Here is the result of working backwards through the trace, all the way to the server process with PID 20629:

    20629 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7faa47c44810) = 21470
    ...
    21470 execve("/usr/sbin/bcron-spool", ["bcron-spool"], 0x55d2460807e0 /* 27 vars */) = 0
    ...
    21470 chdir("/var/spool/cron") = 0
    ...
    21470 openat(AT_FDCWD, "tmp/spool.21470.1573692319.854640", O_RDWR|O_CREAT|O_EXCL, 0600) = -1 EACCES (Permission denied)
    21470 write(1, "32:ZCould not create temporary f"..., 36) = 36
    21470 write(2, "bcron-spool[21470]: Fatal: logs:"..., 84) = 84
    21470 unlink("tmp/spool.21470.1573692319.854640") = -1 ENOENT (No such file or directory)
    21470 exit_group(111) = ?
    21470 +++ exited with 111 +++
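Combining the chdir call with the relative path can be sketched as follows. This is a simplified illustration: resolve_for_pid is a hypothetical helper, it scans the whole trace rather than only the lines before the failing call, and it ignores fchdir and the working directory a child inherits from its parent. The sample lines are abbreviated from the trace above.

```python
import posixpath
import re

# Matches a successful chdir line in a trace taken with -f, for example:
#   21470 chdir("/var/spool/cron") = 0
CHDIR = re.compile(r'^(\d+)\s+chdir\("([^"]*)"\)\s+= 0$')


def resolve_for_pid(trace_lines, pid, relative_path):
    """Resolve relative_path against the chdir calls made by `pid`.

    Returns None when the trace shows no successful chdir for that PID.
    """
    cwd = None
    for line in trace_lines:
        match = CHDIR.match(line.strip())
        if match and int(match.group(1)) == pid:
            target = match.group(2)
            # An absolute target replaces cwd; a relative one extends it.
            cwd = target if cwd is None else posixpath.join(cwd, target)
    if cwd is None:
        return None
    return posixpath.normpath(posixpath.join(cwd, relative_path))


SAMPLE = '''\
21470 execve("/usr/sbin/bcron-spool", ["bcron-spool"], 0x55d2460807e0 /* 27 vars */) = 0
21470 chdir("/var/spool/cron") = 0
21470 openat(AT_FDCWD, "tmp/spool.21470.1573692319.854640", O_RDWR|O_CREAT|O_EXCL, 0600) = -1 EACCES (Permission denied)
'''.splitlines()

full_path = resolve_for_pid(SAMPLE, 21470, "tmp/spool.21470.1573692319.854640")
print(full_path)
```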

(If you are lost here, you may want to read my previous article on *nix process management and shells.)

So the server process with PID 20629 does not have permission to create a file at /var/spool/cron/tmp/spool.21470.1573692319.854640. The most likely cause is classic *nix file system permissions. Let's check:

    # ls -ld /var/spool/cron/tmp/
    drwxr-xr-x 2 root root 4096 Nov 6 05:33 /var/spool/cron/tmp/
    # ps u -p 20629
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    cron 20629 0.0 2276 752 ? Ss Nov14 0:00 unixserver -U /var/run/bcron-spool -- bcron-spool

That's the problem! The server runs as the cron user, but only root has permission to write to the /var/spool/cron/tmp/ directory. A simple chown cron /var/spool/cron/tmp/ gets bcron working properly.
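The permission check behind that EACCES can be approximated in Python. This is a sketch of the classic owner/group/other model only; it ignores ACLs, capabilities, and other security mechanisms. The function name may_create_in is invented for this example, UID 998 is taken from si_uid in the trace above and assumed to be the cron user, and the group ID 998 is purely hypothetical.

```python
import stat


def may_create_in(dir_mode, dir_uid, dir_gid, uid, gids):
    """Approximate whether a user may create a file in a directory.

    Creating a file needs write and search (execute) permission on the
    directory, checked against exactly one of owner, group, or other.
    """
    if uid == 0:
        return True  # root bypasses these checks in the classic model
    if uid == dir_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif dir_gid in gids:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (dir_mode & needed) == needed


# drwxr-xr-x root root, as shown by ls -ld above.
before_chown = may_create_in(0o755, dir_uid=0, dir_gid=0, uid=998, gids=[998])
# Same mode after chown cron (assuming cron is UID 998).
after_chown = may_create_in(0o755, dir_uid=998, dir_gid=0, uid=998, gids=[998])
print(before_chown, after_chown)
```

Under these assumptions the cron user falls into the "other" class before the chown, where the write bit is missing, and into the "owner" class afterwards.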
