This article walks through a hands-on exercise in optimizing the performance of a compile server, working down the IO stack from the swap partition and the file system to the page cache and the block layer's request queue.
Background
Our SDK is used across multiple product lines, the number of SDK developers keeps growing, and the number of patches submitted every day keeps increasing, so the automated checks run on every code submission had clearly outgrown the shared general-purpose servers. I therefore applied to the company for a dedicated server for SDK build checks.
$ cat /proc/cpuinfo | grep ^processor | wc -l
48
$ free -h
             total   used   free   shared   buffers   cached
Mem:           47G    45G   1.6G      20M      7.7G      25G
-/+ buffers/cache:    12G    35G
Swap:           0B     0B     0B
$ df -h
Filesystem   Size  Used  Avail  Use%  Mounted on
/dev/sda1     98G   14G    81G   15%  /
/dev/vda1    2.9T  1.8T   986G   65%  /home
This is a KVM virtual server providing 48 CPU threads, 47 GB of memory and about 3 TB of disk space.
top shows that the CPU load is not high, so is IO the bottleneck? I asked IT for root permission and got to work.
My understanding of this area has its limits; if anything below is incomplete or mistaken, I would be glad to discuss it and learn together.
An overall view of the IO stack
A complete picture of the IO stack undoubtedly helps with fine-grained IO optimization. Following the IO stack from top to bottom, let us look layer by layer at what can be optimized.
There are complete diagrams of the Linux IO stack on the Internet, but they are so complete that they are hard to digest. As I understand it, a simplified IO stack consists of the following layers, from top to bottom.
User space: besides the user's own applications, this also includes all the libraries, such as the ubiquitous C library. Our commonly used IO functions fall into two groups: open()/read()/write() are system calls implemented directly by the kernel, while fopen()/fread()/fwrite() are C-library functions that wrap those system calls to provide higher-level functionality such as buffering.
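As a quick illustration (not from the original article, just a minimal sketch assuming strace is installed), tracing an ordinary command shows that whatever C-library wrappers a program uses, what ultimately reaches the kernel is the system-call layer:

# Trace only the IO-related system calls issued by a simple command.
# Depending on the glibc version, the open may show up as open() or openat().
$ strace -e trace=open,openat,read,write cat /etc/hostname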
Virtual file system (VFS): hides the differences between concrete file systems and provides user space with a single entry point. A concrete file system registers its mount hook with the VFS via register_filesystem(), and when the user mounts it, the VFS initializes it by calling back that mount hook. The VFS provides the inode to record a file's metadata and the dentry to record directory entries. Towards user space, the VFS registers the system calls; for example, SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) registers the open() system call.
Concrete file system: a file system manages the storage space; in other words, it decides which blocks hold which file's data, like labelled storage boxes: file A goes into this block, file B into that one. Different management strategies, and the different features they provide, give rise to a wide variety of file systems. Besides common block-device file systems such as vfat, ext4 and btrfs, there are memory-backed file systems such as sysfs, procfs, pstorefs and tmpfs, as well as flash-oriented file systems such as yaffs and ubifs.
Page cache: this can be understood simply as a chunk of memory holding disk data, managed internally in pages, with the common page size being 4 KB. Its size is not fixed; whenever new data arrives, a new memory page is allocated. Because memory is far faster than disk, caching IO data in memory lets us get the data we want from memory without waiting for the disk. Allocating memory to cache data is the easy part; managing all the cached pages and reclaiming them at the right time is where the real work lies.
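A simple way to see the page cache at work is to time the same read twice, once cold and once warm. This is only a rough sketch: it assumes root access, and /home/big.tar is just a placeholder for any large file on the server.

# Flush dirty data and drop the clean caches so the next read really hits the disk (root required).
sync
echo 3 > /proc/sys/vm/drop_caches

# Cold read: the data has to come from the block device.
time cat /home/big.tar > /dev/null   # /home/big.tar is a placeholder file

# Warm read: the same data is now served from the page cache and should be much faster.
time cat /home/big.tar > /dev/null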
Generic block layer: this can be further divided into the bio layer and the request layer. The page cache is managed in pages, while a bio records the mapping between blocks and pages; one block can be associated with several different memory pages. A bio is submitted to the request layer via submit_bio(). A request can be understood as a collection of bios: several bios with contiguous addresses are combined into one request, and multiple requests are then merged and sorted by the IO scheduling algorithm before being submitted to the layer below in an orderly fashion.
Device drivers and block devices: different block devices speak different protocols, and the corresponding device driver implements the protocol a particular device needs in order to drive it correctly. For a block device, the driver parses each request into device operation commands and exchanges data with the device according to the protocol.
Concretely, what happens when an IO read request is issued?
User space issues a read through the unified system call provided by the virtual file system, switching from user mode to kernel mode. The VFS hands the request to the concrete file system by invoking the callbacks that file system registered. The concrete file system then translates the request into concrete disk block addresses according to its own management logic and looks for cached data of those blocks in the page cache. Reads are generally synchronous: if the data is not in the page cache, a disk read is issued to the generic block layer, which merges and sorts the IO requests generated by all processes and reads the real data from the block device through the device driver. Finally everything returns layer by layer: the data is copied into the user-space buffer, and a copy is kept in the page cache for fast access next time.
Writes are similar: a synchronous write is pushed all the way down to the block device, while an asynchronous write only puts the data into the page cache and returns; the kernel writeback process then flushes it to the block device at an appropriate time.
Following this flow, and given that I do not have (and do not want to request) permissions on the KVM host, I can only optimize the IO stack on the guest side, which covers the following areas:
Swap partition (swap)
File system (ext4)
Page cache
Request layer (IO scheduling algorithm)
Because the source files and the temporary files produced by compilation are individually small but extremely numerous, the workload is very demanding on random IO. To improve random IO performance without changing the hardware, we need to cache more data so that more IO requests can be merged.
Checking with the IT folks, I learned that all servers have backup power, so a sudden power cut is not a concern. Under these circumstances we can optimize for speed as aggressively as possible without worrying about data loss caused by a power failure.
Broadly speaking, the core idea of the optimization is to cache as much data in memory as possible and to cut unnecessary overhead, such as the overhead the file system's journal incurs to guarantee data consistency.
Swap partition
The swap partition lets the kernel move rarely used memory pages out to swap when memory pressure is high, freeing up more memory for the system. When physical memory is insufficient and memory-hungry applications are running, the effect of swap is very noticeable.
However, the server being optimized here should not use swap. Why? The server has a total of 47 GB of memory, and apart from the Jenkins slave process nothing on it consumes much memory. Looking at memory usage, most of it is taken by cache/buffer, which is file cache that can be dropped at any time, so memory is sufficient and there is no need to extend virtual memory with a swap partition.
# free -h
             total   used   free   shared   buffers   cached
Mem:           47G    45G   1.6G      21M       18G      16G
-/+ buffers/cache:    10G    36G
A swap partition also lives on disk, and moving data into it consumes IO resources, which runs counter to the goal of this IO optimization, so swap should be disabled on this server.
Checking the system state, it turned out that swap was not enabled on this server in the first place.
# cat /proc/swaps
Filename    Type    Size    Used    Priority
#
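Had swap been enabled, turning it off would look roughly like this (a sketch, run as root; the fstab line is only an example of what would need to be removed):

# Disable all active swap areas immediately.
swapoff -a
# Then delete or comment out the swap entry in /etc/fstab so it stays off after a reboot, e.g.:
#   UUID=...  none  swap  sw  0  0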
File system
A read or write initiated by the user passes through the virtual file system (VFS) and is handed to the actual file system.
First, check how the partitions are mounted:
# mount
...
/dev/sda1 on / type ext4 (rw)
/dev/vda1 on /home type ext4 (rw)
...
This server has two main block devices, sda and vda. sda is the usual SCSI/IDE style device; on a personal PC with a mechanical hard disk you will often see an sda device node. vda is a virtio disk device. Since this server is a KVM virtual machine, both sda and vda are virtual devices; the difference is that the former is a fully virtualized block device while the latter is paravirtualized. According to what I could find online, paravirtualized devices allow more efficient cooperation between host and guest and therefore higher performance. On this machine sda holds the root file system and vda holds user data, and compilation mostly exercises the IO of the vda partition.
vda uses ext4, the stable, mainstream file system on current Linux distributions. Check its superblock information:
# dumpe2fs /dev/vda1
...
Filesystem features:  has_journal ... dir_index ...
...
Inode count:   196608000
Block count:   786431991
Free inodes:   145220571
Block size:    4096
...
I assume IT formatted the partition with the default parameters: a 4 KB block size and roughly 196.6 million inodes.
A 4 KB block size is fine; with source files this small there is no need to shrink the block size for a more compact layout. The problem is the inodes: 145.22 million of them are free, a free ratio of 73.86%. Even with space utilization currently at around 74%, only 26.14% of the inodes are in use. If one inode occupies 256 bytes, then 100 million inodes take up 23.84 GB, so this many spare inodes wastes a lot of space. Unfortunately, the inode count is fixed at format time and cannot be changed afterwards, and simply reformatting the partition is not an option right now.
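To double-check that estimate (a quick back-of-the-envelope sketch; 256 bytes is the usual ext4 inode size for a partition this large and should be confirmed on the actual device):

# Confirm the on-disk inode size.
dumpe2fs -h /dev/vda1 | grep -i 'inode size'

# Space reserved for the inode table: inode count x inode size.
# 196,608,000 x 256 B is roughly 47 GiB in total, of which the ~145 million
# free inodes account for roughly 35 GiB of metadata that will never be used.
echo $((196608000 * 256 / 1024 / 1024 / 1024))   # prints the total in GiB (integer, truncated)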
What can we do? We can optimize the journal and the mount parameters.
The journal exists to keep the file system consistent across a power failure: in ordered mode the data is written out first, the metadata goes into the journal blocks, and only then is the metadata modified in place. If power is lost part-way, the journal lets the file system roll back to the previous consistent state, i.e. it guarantees that metadata and data match. However, as mentioned above, this server has backup power, so power failure is not a concern and the journal could simply be removed.
# tune2fs -O ^has_journal /dev/vda1
tune2fs 1.42.9 (4-Feb-2014)
The has_journal feature may only be cleared when the filesystem is unmounted or mounted read-only.
Unfortunately this failed. Tasks are running on the partition all the time, so it is not practical to umount it or remount it with -o remount,ro, which means the journal cannot be removed while it is mounted. Since it cannot be removed, the next best thing is to make the journal cheaper, which is done through the mount parameters.
Ext4 mount parameter: data
Ext4 has three journaling modes: ordered, writeback and journal. Their differences are well documented online; briefly:
journal: both metadata and data are written to the journal blocks. Performance is roughly halved because everything is written twice, but it is the safest mode.
writeback: metadata is written to the journal, data is not, and there is no guarantee that data reaches the disk first. This is the fastest mode, but because the ordering of metadata and data is not guaranteed it is also the least safe across a power failure.
ordered: like writeback, but it guarantees that data reaches the disk before the corresponding metadata is committed. It offers good performance with adequate safety and is the default, recommended mode on most PCs.
In a server environment where power failure is not a concern, we can use the writeback journaling mode for maximum performance.
# mount -o remount,rw,data=writeback /home
mount: /home not mounted or bad option
# dmesg
[235737.532630] EXT4-fs (vda1): Cannot change data mode on remount
Frustrating: this cannot be changed dynamically either. I simply wrote it into /etc/fstab and will have to wait for the next reboot.
# cat /etc/fstab
UUID=...  /home  ext4  defaults,rw,data=writeback  ...
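After the next reboot, whether writeback mode really took effect can be checked roughly like this (a sketch; the exact dmesg wording varies between kernel versions):

# Non-default mount options show up in /proc/mounts.
grep ' /home ' /proc/mounts
# ext4 also reports the data mode in the kernel log when the filesystem is mounted.
dmesg | grep 'EXT4-fs (vda1)'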
Ext4 mount parameter: noatime
Linux records three timestamps for each file:
timestamp   full name            meaning
atime       access time          time of the last read
mtime       data modified time   time the file's content was last changed
ctime       status change time   time the file's status (metadata such as permissions or owner) was last changed
The make we run during compilation decides whether to recompile based on the modification time, while the access time recorded in atime is redundant in many scenarios. Hence noatime: not recording atime greatly reduces the metadata writes generated by reads, and metadata writes tend to produce a lot of random IO.
# mount -o ...,noatime,... /home
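stat shows the three timestamps directly, which makes it easy to confirm that reads no longer bump atime once noatime is active (a sketch; /home/test.c is just a placeholder file):

# "Access", "Modify" and "Change" in the output are atime, mtime and ctime respectively.
stat /home/test.c
# Read the file, then look again: with noatime the Access timestamp should not move.
cat /home/test.c > /dev/null
stat /home/test.c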
Ext4 mount parameter: nobarrier
This option decides whether the journaling code uses write barriers to order journal commits correctly on disk, which lets volatile disk write caches be used safely at some cost in performance. Functionally it looks very similar to the writeback versus ordered distinction; without having studied that part of the source code, I cannot say whether they really are the same mechanism. In any case, disabling the write barrier will clearly improve write performance.
# mount -o ...,nobarrier,... /home
Ext4 mount parameter: delalloc
delalloc is short for delayed allocation: when enabled, ext4 delays block allocation until writeback time. Why delay it? An inode records the block numbers of the file's data through multi-level indexes; for a large file ext4 uses extents, allocating a run of contiguous blocks so that the inode only needs to record the starting block number and the length instead of indexing every block. Besides reducing the pressure on the inode, contiguous blocks turn random writes into sequential writes, speeding up writes; they also fit the principle of locality, raising the hit rate of readahead and therefore speeding up reads.
# mount -o ...,delalloc,... /home
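Whether a file actually ended up in a few large extents can be inspected with filefrag from e2fsprogs (a sketch; /home/output.bin is just a placeholder file):

# -v lists every extent; fewer, longer extents mean a more sequential on-disk layout.
filefrag -v /home/output.bin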
Ext4 mount parameter: inode_readahead_blks
This is the maximum number of inode blocks ext4 reads ahead from the inode table. Accessing a file always goes through its inode to obtain the file's metadata and block addresses; if the needed inode is already in memory, no disk read is required, which clearly helps read performance. The default value is 32, meaning at most 32 × block_size of inode data (128 KB with the 4 KB blocks here) is read ahead; with plenty of memory we can safely raise it so that far more is read ahead.
# mount -o ...,inode_readahead_blks=4096,... /home
Ext4 mount parameter: journal_async_commit
This allows the commit block to be written to disk without waiting for the descriptor blocks, which speeds up the journal.
# mount -o ...,journal_async_commit,... /home
Ext4 mount parameter: commit
This sets how many seconds of data ext4 caches between journal commits. The default is 5, meaning a power loss at the wrong moment loses at most about 5 seconds of data. A larger value caches more data but risks losing more across a power failure. On a server that does not have to worry about power loss, raising this value can improve performance.
# mount -o ...,commit=1000,... /home
Summary of ext4 mount parameters
Finally, since the partition could not be umounted, the command I ran to adjust the mount parameters was:
mount -o remount,rw,noatime,nobarrier,delalloc,inode_readahead_blks=4096,journal_async_commit,commit=1800 /home
In addition, the same options were written into /etc/fstab so that the optimization is not lost after a reboot.
# cat /etc/fstab
UUID=...  /home  ext4  defaults,rw,noatime,nobarrier,delalloc,inode_readahead_blks=4096,journal_async_commit,commit=1800,data=writeback  0 0
...
Page cache
The page cache sits between the file system and the generic block layer (it can also be counted as part of the generic block layer). To improve IO performance and reduce the number of real disk reads and writes, the Linux kernel adds a layer of memory cache that holds disk data in memory. Since memory is managed in 4 KB pages, disk data is cached in pages as well, hence the name page cache. Each cached page holds a copy of part of the disk's data.
If a read hits the cache, because the data was read or written before or was loaded by readahead, it can be served straight from the cache without going down to the disk. Whether a write is synchronous or asynchronous, the data is first copied into the cache; the difference is that an asynchronous write returns right after the copy, with the page merely marked dirty, while a synchronous write additionally calls something like fsync() to wait for writeback. For details, see the kernel function generic_file_write_iter(). Dirty data produced by asynchronous writes is flushed back by the kernel's writeback worker at the "appropriate" time.
So when is the appropriate time, and how much data can be cached at most? For the server being optimized here, delaying writeback clearly reduces the number of disk writes for files that are frequently created and deleted, and caching more data makes it easier to merge random IO requests, both of which help performance.
The following files under /proc/sys/vm are closely related to flushing dirty data:
configuration file          function                                                                   default
dirty_background_bytes     amount of dirty data (bytes) that wakes the background writeback            0
dirty_background_ratio     dirty data as a percentage of available memory that wakes the writeback     10
dirty_bytes                amount of dirty data (bytes) that forces synchronous writes                 0
dirty_ratio                dirty data as a percentage of available memory that forces synchronous writes  20
dirty_expire_centisecs     how long data may stay dirty before it must be written back (unit: 1/100 s)  3000
dirty_writeback_centisecs  periodic wakeup interval of the writeback process (unit: 1/100 s)            500
A few points to add about these configuration files:
XXX_ratio and XXX_bytes are two ways of expressing the same threshold; when both are set, XXX_bytes takes priority over XXX_ratio.
"Available memory" is not all of the system's memory, but free pages plus reclaimable pages.
The dirty-data timeout means that once data has been marked dirty in memory for longer than this, it must be written back the next time the writeback process runs.
The writeback process wakes up not only periodically but also passively when there is too much dirty data.
The difference between dirty_background_XXX and dirty_XXX is that the former only wakes the writeback process, and applications can continue writing to the cache asynchronously; if the proportion of dirty data keeps growing and the dirty_XXX threshold is hit, applications can no longer simply write asynchronously and must wait for writeback.
For a more complete description, see the kernel document Documentation/sysctl/vm.txt, or the summary blog post I wrote, "Linux dirty data flushing parameters and tuning".
For the current case, my configuration is as follows:
dirty_background_ratio = 60
dirty_ratio = 80
dirty_writeback_centisecs = 6000
dirty_expire_centisecs = 12000
This configuration has the following characteristics:
The writeback process is woken when dirty data reaches 60% of available memory.
When dirty data reaches 80% of available memory, writes from applications are forced to wait synchronously.
The writeback process is woken every 60 s.
Dirty data that has been in memory for more than 120 s is written back the next time the process wakes up.
Of course, to avoid losing these settings after a reboot, they were also written into /etc/sysctl.conf:
# cat /etc/sysctl.conf
...
vm.dirty_background_ratio = 60
vm.dirty_ratio = 80
vm.dirty_expire_centisecs = 12000
vm.dirty_writeback_centisecs = 6000
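These values can be applied to the running system without a reboot (a sketch, run as root):

# Load everything from /etc/sysctl.conf into the running kernel.
sysctl -p
# Spot-check one value and see how much dirty data is currently waiting to be written back.
sysctl vm.dirty_background_ratio
grep -E 'Dirty|Writeback' /proc/meminfo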
Request layer
In asynchronous write scenarios, once dirty pages reach a certain proportion, the data in the page cache has to be flushed back to disk through the generic block layer. The bio layer records the mapping between disk blocks and memory pages; in the request layer, bios for contiguous physical blocks are merged into requests, and the IO requests produced by all processes in the system are then merged and sorted according to the chosen IO scheduling algorithm. So what IO scheduling algorithms are there?
Searching for IO scheduling algorithms online turns up plenty of material describing Deadline, CFQ and NOOP, usually without noting that these only apply to the single-queue architecture. In the latest code (the version I looked at is 5.7.0) the block layer has fully switched to the new multi-queue architecture, where the supported IO schedulers are mq-deadline, BFQ, Kyber and none.
The pros and cons of the different IO schedulers are widely documented online and will not be repeated here.
The description of the block layer in "Linux-storage-stack-diagram_v4.10" illustrates the difference between single queue and multi-queue very well.
In the single-queue architecture a block device has only one global queue, and all requests have to be pushed through it. On a multi-core, highly concurrent machine, especially a server with this many cores, the locking needed for mutual exclusion causes a lot of overhead. Moreover, if the disk supports parallel multi-queue processing, the single-queue model cannot exploit that capability.
The multi-queue architecture creates two levels of queues: software queues and hardware dispatch queues. There is one software queue per CPU core, and IO scheduling happens there; since each CPU has its own queue, there is no lock contention. The number of hardware dispatch queues depends on the hardware: one per disk, or N of them if the disk supports N parallel queues. Submitting IO requests from the software queues into the hardware dispatch queues still requires locking, so in theory the worst case of the multi-queue architecture is no worse than the single-queue architecture.
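Both queue levels are visible in sysfs on a blk-mq device, which gives a quick way to tell whether a given disk is driven by the multi-queue path at all (a sketch; the mq directory only exists for blk-mq devices):

# One numbered sub-directory per hardware dispatch queue (only present for blk-mq devices).
ls /sys/block/vda/mq/
# The number of CPUs, i.e. the number of per-CPU software queues.
nproc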
Back to the server being tuned: which IO scheduler is it currently using?
# cat /sys/block/vda/queue/scheduler
none
# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
The kernel version of this server is
# uname -r
3.13.0-170-generic
Looking at the Linux kernel's git history, no multi-queue IO scheduler had been implemented yet as of kernel 3.13.0, and the switch to the multi-queue architecture was not complete at that point. So the sda device, which still uses a single queue, offers the traditional noop, deadline and cfq schedulers, while the multi-queue vda device (virtio) only offers none. Upgrading the kernel just to get mq-deadline seems too risky, so there is not much room to optimize in the choice of IO scheduler.
But is that all there is to optimize at the request layer? Even if the IO scheduler cannot be changed, we can still tune the queue parameters, for example increasing the length of the request queue and the amount of readahead.
Under /sys/block/vda/queue there are two writable files of interest, nr_requests and read_ahead_kb: the former configures the maximum number of requests the block layer can allocate, the latter the maximum amount of data to read ahead. The defaults are:
nr_requests = 128
read_ahead_kb = 128
I raised them to:
nr_requests = 1024
read_ahead_kb = 512
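The change itself is just a write to sysfs, and it does not survive a reboot, so it also has to go into something like /etc/rc.local or a udev rule (a sketch, run as root; read_ahead_kb can alternatively be set with blockdev --setra, which counts in 512-byte sectors):

# Allow more requests in flight so that more merging and sorting can happen.
echo 1024 > /sys/block/vda/queue/nr_requests
# Read ahead up to 512 KB on sequential access.
echo 512 > /sys/block/vda/queue/read_ahead_kb
# Equivalent readahead setting via blockdev: 1024 sectors x 512 B = 512 KB.
blockdev --setra 1024 /dev/vda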
Optimization effect
After optimization, check the memory usage at full load:
# cat /proc/meminfo
MemTotal:        49459060 kB
MemFree:          1233512 kB
Buffers:         12643752 kB
Cached:          21447280 kB
Active:          19860928 kB
Inactive:        16930904 kB
Active(anon):     2704008 kB
Inactive(anon):     19004 kB
Active(file):    17156920 kB
Inactive(file):  16911900 kB
...
Dirty:            7437540 kB
Writeback:           1456 kB
As you can see, file-backed memory (Active(file) + Inactive(file)) reaches 32.49 GB and dirty data reaches 7.09 GB. The amount of dirty data is smaller than expected, far below the thresholds set by dirty_background_ratio and dirty_ratio, so the only way to cache even more write data would be to lengthen the periodic wakeup interval dirty_writeback_centisecs. Since this server is mainly used to compile the SDK, which reads far more than it writes, caching more dirty data would not gain much anyway.
I also noticed that Buffers reaches 12 GB, presumably because ext4's inodes occupy a large amount of cache. As analyzed above, this ext4 file system has a huge surplus of inodes, and it is unclear what fraction of the cached metadata belongs to inodes that will never be used. Reducing the inode count and raising inode utilization might improve the hit rate of inode readahead.
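One way to get a feel for how much of that memory is inode-related metadata is to look at the slab caches (a sketch, needs root; note that the in-core inode and dentry objects live in slab, while the raw inode-table blocks read from disk are counted under Buffers, so the two views only partially overlap):

# In-core ext4 inode objects and the generic dentry cache.
grep -E 'ext4_inode_cache|dentry' /proc/slabinfo
# slabtop -o prints a one-shot summary of the slab caches.
slabtop -o | head -n 20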
After the optimization, 8 SDKs can be compiled in parallel, and a complete build cycle (updating code, fetching commits, compiling the kernel, compiling the SDK, and so on) takes about 13 minutes when it does not hit the error-handling path.
That is the whole of this compile server performance optimization exercise; I hope it is of some help.