Troubleshooting and fixing slow SSH login on Kuaijie CVM

2025-02-28 Update From: SLTechnology News&Howtos


Kuaijie CVM is UCloud's new generation of cloud hosts, offering strong performance at a competitive price, with network throughput of up to 10 million PPS and storage of up to 1.2 million IOPS. To improve overall performance, the host kernel, KVM, and guest kernel have all been heavily tuned. The "high kernel Ubuntu18.04" image is one of these optimized CVM images, and it ships the official Linux 5.0.1 mainline kernel.

In July this year, a user reported that on Kuaijie CVMs created from this image, the first SSH login after every boot was very slow, sometimes taking tens of seconds or even minutes to succeed, which hurt the user experience.

Troubleshooting traced the cause to slow initialization of the Linux kernel's random-number entropy pool, which is triggered only when several conditions combine. Deeper investigation showed that, because of a kernel bug, every process that uses libssl 1.1.1 (such as nginx with HTTPS enabled) has a similar problem, with considerable potential impact on system security.

In the end, we fixed the problem quickly by upgrading our self-maintained kernel, preserving both the experience and the security of Kuaijie CVM.

This article walks through the investigation.

Preliminary investigation

The problem was reported by only one user and affects only the first SSH login after boot; once a login succeeds, everything returns to normal. The live state is hard to capture, so we tried to reproduce it.

ssh -v

Logging in to the problem host with the ssh client's verbose mode enabled shows that it always hangs at "debug1: pledge: network". By the time this message appears, sshd has already completed user authentication.

The problem therefore occurs just after authentication completes, which makes the PAM stack defined in /etc/pam.d/sshd the most likely place.

motd

Looking at /etc/pam.d/sshd, and going by the symptoms and intuition, I first commented out several sections of the configuration, including the motd lines. motd (message of the day) is part of the banner shown to the user after logging in to Ubuntu. After rebooting the host, SSH login was fast and no longer hung.

According to the documentation, under the motd mechanism pam_motd.so runs all the scripts in the /etc/update-motd.d/ directory in order, assembles their output into the file /run/motd.dynamic, and finally presents it in the banner.
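The idea of running the motd scripts one by one to see which is slow can be sketched in Python. This is only an approximation of what pam_motd.so does, not its actual implementation: the function name `time_motd_scripts` and the `timeout` parameter are made up for illustration, and the sketch assumes that only executable regular files are run, in name order.

```python
import os
import subprocess
import time


def time_motd_scripts(directory="/etc/update-motd.d", timeout=300):
    """Run every executable file in `directory` in name order and time it.

    Returns a list of (script_name, seconds) tuples, slowest first.
    This is an illustrative helper, not part of pam_motd.
    """
    timings = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and files without an executable bit.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue
        start = time.monotonic()
        try:
            subprocess.run(
                [path],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
                check=False,
            )
        except subprocess.TimeoutExpired:
            pass  # the elapsed time is still recorded below
        timings.append((name, time.monotonic() - start))
    return sorted(timings, key=lambda item: item[1], reverse=True)
```

Calling `time_motd_scripts()` right after boot, before the first SSH login completes, should put the stalling script at the top of the returned list.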

We therefore suspected that the stall occurred while these scripts were running. Reading the scripts and inserting echo statements as breakpoints eventually showed that the stall happens when the "50-landscape-sysinfo" script runs the "/usr/bin/landscape-sysinfo" command.

landscape-sysinfo

This command is only a tool that collects system resource usage and shows it in the banner, so it is hard to believe it is at fault; indeed, running it repeatedly after logging in produces no stall.

To dig further, we used strace to trace the command's execution and wrote the trace to a log.

The log shows that, when run at startup, the command was stuck in the getrandom system call, and that it was unblocked at 23:10:48.
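Finding the stuck call in a long trace can be automated. The sketch below assumes the trace was captured with strace's -T option, which appends the time spent in each call as a trailing `<seconds>`; the article does not say which strace options were actually used. The helper name `slow_syscalls` is hypothetical.

```python
import re

# Syscall name followed by "(", with the "<seconds>" suffix that
# `strace -T` appends at the end of the line.
_CALL_RE = re.compile(r"(?P<name>\w+)\(.*<(?P<secs>\d+\.\d+)>\s*$")


def slow_syscalls(strace_text, threshold=1.0):
    """Return (syscall, seconds) pairs that took at least `threshold` seconds."""
    slow = []
    for line in strace_text.splitlines():
        match = _CALL_RE.search(line)
        if match and float(match.group("secs")) >= threshold:
            slow.append((match.group("name"), float(match.group("secs"))))
    return slow
```

Fed a log produced by something like `strace -tt -T -o trace.log /usr/bin/landscape-sysinfo`, the function should list getrandom with a duration of tens of seconds if the stall described above is present.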

getrandom

Reference: http://man7.org/linux/man-pages/man2/getrandom.2.html

getrandom wraps reads of the /dev/urandom character device and is used to obtain high-quality random numbers. /dev/urandom is seeded from /dev/random, and /dev/random draws its values from the noise of running hardware (so its randomness is of very high quality). This mechanism also means that the random numbers /dev/urandom produces right after the operating system starts are of low quality: just after boot, /dev/random has not yet collected enough noise, so it generates values slowly and with poor, easily predicted randomness, which in turn lowers the quality of the initial seed for /dev/urandom. For this reason /dev/urandom tracks three quality states internally:

0 = not initialized, but /dev/urandom is already usable

1 = fast initialization: initialized quickly from a small amount of entropy so that it can be used as early as possible during boot. The quality is acceptable but still not recommended for cryptographic use. This state is usually reached within a few seconds of the operating system starting.

2 = fully initialized: the random numbers are of the highest quality and are suitable for cryptographic use. This state is reached tens of seconds to a few minutes after the operating system starts.

By default, getrandom checks the quality state of /dev/urandom before reading it. If the device is not yet fully initialized, the call blocks until it is, which guarantees that random numbers obtained through this API are both high quality and fast, and gives security-sensitive code something reliable to depend on.

With the purpose and behavior of the getrandom interface understood, I went through the kernel boot log and found a point that correlates closely in time.

/dev/urandom became fully initialized at 23:10:48, and the blocked getrandom call was released at the same moment. Repeated tests confirmed the correlation. At this stage, the conclusion and suggested fix were:

Cause: the operating system initializes the random-number entropy pool slowly, so the SSH login blocks on a command that needs random numbers.

Recommendation: disable motd or delete landscape-sysinfo to speed up SSH login.

In-depth investigation

The conclusion of the preliminary investigation is somewhat counterintuitive, and disabling or deleting components should be done with caution. I therefore decided to look for more evidence, and I also needed to explain why older Ubuntu versions did not show this behavior.

We ran strace on landscape-sysinfo on a normal host. The log shows that the getrandom calls recorded there differ from those on the problem host: the normal host never makes a call of the form getrandom("xxx", 32, 0). Note the third argument, flags: a value of 0 selects getrandom's default behavior, which is to block while /dev/urandom is not fully initialized. On the normal host, every getrandom call passes the GRND_NONBLOCK flag, which means do not block if initialization is incomplete and return an error immediately instead.
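The difference between the two call styles can be seen from Python, whose standard library exposes the system call as `os.getrandom(size, flags=0)` on Linux. The following is a minimal sketch, with the wrapper name `try_getrandom` chosen for illustration; it is not how landscape-sysinfo or libssl call the interface.

```python
import os


def try_getrandom(size=32):
    """Ask the kernel for random bytes without blocking (Linux only).

    Returns the bytes, or None when the pool is not yet initialized,
    which is the situation where a call with flags=0 would block.
    """
    try:
        return os.getrandom(size, os.GRND_NONBLOCK)
    except BlockingIOError:
        # The kernel reported that the request would block.
        return None
```

By contrast, `os.getrandom(32)` passes flags=0 and simply waits until the kernel considers the pool ready, which is the behavior seen on the problem host.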

At this point, we suspected the landscape-sysinfo version.

landscape-sysinfo

Comparing landscape-sysinfo on the two hosts showed that the versions do differ: the problem host has the newer version and the normal host the older one.

We ran apt-get update && apt-get upgrade on the normal host, and the problem appeared after the upgrade. This led to a tentative conclusion: the new landscape-sysinfo uses getrandom in blocking mode to obtain random numbers, so landscape-sysinfo should not be upgraded.

We then tried to reproduce and verify this on other hosts, but found that on another image with a high kernel version, the older landscape-sysinfo also reproduced the problem. strace showed that its call behavior matched that of the newer landscape-sysinfo. Because the command is actually a python3 script, we suspected an upgrade of one of the libraries it depends on. Checking the packages upgraded by apt-get upgrade turned up several closely related to random numbers. After several rounds of elimination, the problem turned out to be caused by the upgrade of the libssl1.1 library, which is also where the getrandom call comes from.

libssl1.1

See the libssl1.1 release notes: https://www.openssl.org/news/openssl-1.1.1-notes.html

The notes confirm that libssl 1.1.1 rewrote its internal random number generator. Consistent with the behavior observed earlier, it now reads more secure random numbers through getrandom, at the cost of blocking when used immediately after boot.

Continuing to reproduce and verify on other hosts, we found that an Ubuntu host with a low kernel version and libssl 1.1.1 installed did not show the problem. libssl 1.1.1 was upgraded precisely to be more secure; if it can still obtain random numbers the moment the system boots, that contradicts the design intent of the getrandom interface. We therefore began to suspect a bug in the kernel.

Kernel bug

Searching for material on libssl's getrandom calls blocking eventually turned up a closely related discussion.

Related information: https://unix.stackexchange.com/questions/442698/when-i-log-in-it-hangs-until-crng-init-done

Here CRNG stands for cryptographic-strength random number generator.

This material confirms the kernel-bug hypothesis. In 4.16 the kernel fixed the following bug: getrandom stopped blocking as soon as fast initialization completed, which contradicts the interface design of getrandom and can easily cause security problems (CVE-2018-1108).

Kernel bug fix commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=43838a23a05fbd13e47d750d3dfd77001536dd33

The host's kernel version was 4.15.0, which fits this picture: most likely the bug had not yet been fixed there. We then upgraded the kernel on this low-kernel host. If the hypothesis is right, the stall should appear after upgrading to a newer version.

We chose a 5.0 kernel from the apt source, but after the upgrade the problem still did not appear.

Looking through the kernel log revealed something new. Previously, the initialization of /dev/urandom usually produced a "fast init done" log entry, followed a long time later by a "crng init done" entry, corresponding to the two quality states of /dev/urandom.

On this kernel version, the entry "crng done (trusting CPU's manufacturer)" appears almost immediately after boot. The entropy pool is clearly initialized very quickly, so no stall occurs.
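Pulling these entries out of a kernel log makes it easy to compare hosts. The sketch below assumes the default dmesg layout, where each line starts with a bracketed timestamp in seconds since boot; the function name `crng_milestones` is invented for illustration, and the marker strings are the ones quoted above.

```python
import re

# Log messages quoted in the article; matched as substrings.
_MARKERS = (
    "fast init done",
    "crng init done",
    "crng done (trusting CPU's manufacturer)",
)
# Assumed line layout: "[   12.345678] message".
_LINE_RE = re.compile(r"^\[\s*(?P<secs>\d+\.\d+)\]\s*(?P<msg>.*)$")


def crng_milestones(dmesg_text):
    """Map each marker found in the log to its first timestamp (seconds since boot)."""
    found = {}
    for line in dmesg_text.splitlines():
        match = _LINE_RE.match(line)
        if not match:
            continue
        for marker in _MARKERS:
            if marker in match.group("msg") and marker not in found:
                found[marker] = float(match.group("secs"))
    return found
```

A large gap between the "fast init done" and "crng init done" timestamps corresponds to the window in which a blocking getrandom call would stall.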

Searching for information on this behavior led to a kernel build option: CONFIG_RANDOM_TRUST_CPU.

CONFIG_RANDOM_TRUST_CPU

Detailed description of the option: https://lwn.net/Articles/760121/

It first appeared in version 4.19: https://cateee.net/lkddb/web-lkddb/RANDOM_TRUST_CPU.html

Checking the two high-kernel hosts confirmed it: the option is not enabled on the problem host and is explicitly enabled on the normal host.
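One way to make this check repeatable is to read the kernel build configuration. The sketch below assumes the configuration is available as text, for example from a /boot/config-<release> file as Ubuntu usually ships; the helper name `kconfig_value` is hypothetical.

```python
import re


def kconfig_value(config_text, option):
    """Return the value of `option` in a kernel .config, or None if unset.

    A line `CONFIG_FOO=y` yields "y". A line `# CONFIG_FOO is not set`
    and a missing entry both yield None.
    """
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None
```

For example, `kconfig_value(open("/boot/config-" + os.uname().release).read(), "CONFIG_RANDOM_TRUST_CPU")` (with `import os`) would return "y" on a host where the option is enabled, provided the file exists at that path.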

The kernel on the problem host was compiled from Ubuntu's latest mainline branch, in which this optimization option is not enabled by default.

Conclusion

All of the following conditions must hold at the same time for the problem to appear:

1. Linux kernel version 4.17 or later.

2. The kernel build option CONFIG_RANDOM_TRUST_CPU is not set, or the CPU is not an x86 CPU of the IVB (Ivy Bridge) generation or later.

3. The Linux kernel bug fix has not been reverted by the community (almost all of the latest community 4.x versions have reverted it, see https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1602769.html; 5.x has not).

4. libssl version 1.1.1 or later.
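The four conditions can be combined into a single check. The sketch below is a rough triage aid under stated assumptions, not a definitive detector: the function names are invented, whether the fix was reverted cannot be read from the version number and must be supplied by the caller, and "x86 CPU of the IVB generation or later" is approximated by a caller-supplied flag saying whether the CPU offers RDRAND.

```python
import re


def parse_version(text):
    """Turn '4.15.0-112-generic' or '1.1.1' into a tuple of integers."""
    # Drop any package/build suffix after the first "-", keep the numbers.
    return tuple(int(n) for n in re.findall(r"\d+", text.split("-")[0]))


def likely_affected(kernel, libssl, trust_cpu_enabled, cpu_has_rdrand, fix_reverted):
    """Return True when all four conditions from the list above hold."""
    # Conditions 1 and 3: kernel carries the fix and it was not reverted.
    kernel_has_fix = parse_version(kernel) >= (4, 17) and not fix_reverted
    # Condition 2: no fast seeding from the CPU (assumed to need RDRAND).
    fast_seed = trust_cpu_enabled and cpu_has_rdrand
    # Condition 4: libssl new enough to use getrandom.
    uses_getrandom = parse_version(libssl) >= (1, 1, 1)
    return kernel_has_fix and not fast_seed and uses_getrandom
```

With the values described in the article, `likely_affected("5.0.1", "1.1.1", False, True, False)` evaluates to True, while the 4.15.0 host evaluates to False.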

Scope of impact

The slow SSH login is only the visible symptom; the real scope of the problem is much wider. Any process that uses libssl 1.1.1 to generate random numbers is affected and can block on random number generation for the first 3 to 5 minutes after a reboot. Take a server running nginx with HTTPS enabled. One normally expects nginx to serve requests as soon as the machine boots, but with this problem nginx hangs for three to five minutes, so every HTTPS request from users during that window fails.

Fix

Kernel 4.19 and later provide the CONFIG_RANDOM_TRUST_CPU build option, which balances performance and security. We adopted this approach: we enabled the option, made sure the virtual machine can access the RDRAND instruction, and quickly re-released the cloud host image. If you run a custom kernel, avoid versions between 4.17 and 4.19 where possible, or handle CVE-2018-1108 appropriately.
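Whether a guest can actually see RDRAND can be checked from inside the virtual machine. The sketch below assumes an x86 Linux guest, where /proc/cpuinfo lists CPU features on a `flags` line and RDRAND appears as `rdrand`; the helper names are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_rdrand(cpuinfo_text):
    """True when the `rdrand` feature flag is exposed to this machine."""
    return "rdrand" in cpu_flags(cpuinfo_text)
```

Running `has_rdrand(open("/proc/cpuinfo").read())` in the guest gives a quick indication of whether enabling CONFIG_RANDOM_TRUST_CPU can take effect there.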

Summary

When people think of CVMs, compute, storage, and networking come to mind first, and few pay attention to the kernel. Yet building the kernel is core work for a CVM and matters a great deal for performance and stability.

The slow SSH login began as feedback from a single user and was limited to the first login after Ubuntu boots, but by persisting with the investigation and following the trail, we uncovered the wider potential impact and fixed it before it could do harm.

Maintaining the CVM kernel source independently lets the UCloud team keep tuning performance to match the product's development. It also ensures that when problems arise in production, the team can troubleshoot and resolve them quickly and head off larger system security risks in time.
