This article explains the principles behind optimizing the compilation time of C++ services. The methods introduced here are simple, fast, and practical — let's take a look.
I. Background
Large C++ projects face the problem of long compilation times. Whether during development and debugging iterations, admission testing, or continuous integration, compilation happens constantly, so reducing compilation time is of great significance for R&D efficiency.
Meituan's Search and NLP Department provides the company's basic search platform services. For performance reasons, the underlying services are implemented in C++. Among them, the Deep Query Understanding service (DeepQueryUnderstanding, hereafter DQU) that we are responsible for also suffered from long compilation times: before optimization, compiling the whole service took about 20 minutes (parallel compilation on a 32-core machine), which affected the team's development iteration efficiency. Against this background, we carried out a dedicated optimization of DQU's compilation. Along the way we also accumulated some optimization knowledge and experience, which we share here.
II. Compilation principles and analysis
2.1 Introduction to compilation principles
To understand the optimization schemes better, let's first briefly review compilation principles. In C++ development, the compilation process normally consists of the following four steps:
Preprocessor: macro replacement, header file expansion, conditional-compilation expansion, and comment removal.
The gcc -E option yields the preprocessed result, with the extension .i or .ii.
Preprocessing performs no syntax checking — not only because the preprocessor lacks that ability, but also because preprocessing directives are not C/C++ statements (which is why you do not end a macro definition with a semicolon). Syntax checking is the compiler's job. (A minimal example appears after this list.)
After preprocessing, what you get is the actual source code.
Compiler: generates assembly code, producing an assembly-language program (translating the high-level language toward machine language) in which each statement describes exactly one low-level machine instruction in a standard text format.
The gcc -S option yields the compiled assembly file, with the extension .s.
Assembly language provides a common output language for the compilers of different high-level languages.
Assembler: generates an object file.
The gcc -c option yields the assembled result, with the extension .o.
A .o file is a binary object file, produced by encoding the assembly instructions into machine code.
Linker: generates an executable or library file.
Static library: at link time, all of the library's code is copied into the executable, so the resulting file is larger, but the library file is no longer needed at run time; static libraries usually carry the ".a" suffix.
Dynamic library: the library's code is not copied into the executable at link time; instead, the library is loaded by the runtime linker when the program executes, so the executable is comparatively small; dynamic libraries usually carry the ".so" suffix.
Executable: the linker combines all binaries — object files and library files alike — into a single executable program.
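Returning to the preprocessing note above, here is a minimal illustrative example (not from the original article) showing that the preprocessor only performs textual substitution and leaves syntax checking to the compiler:

// Minimal example: the preprocessor performs textual substitution only.
#define SQUARE(x) ((x) * (x))   // no trailing semicolon -- a macro is not a C++ statement

int main() {
    int n = SQUARE(3);           // after gcc -E this line reads: int n = ((3) * (3));
    return n;
}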
2.2 characteristics of C++ compilation
(1) Each source file is compiled independently
C++'s compilation model differs greatly from that of other high-level languages. In other high-level languages the compilation unit is the whole module, i.e. all source code under the module, which is processed in the same compilation task. In C++, by contrast, the compilation unit is the file: each .c / .cc / .cxx / .cpp source file is an independent compilation unit, so optimization can only be based on the contents of that one file, and it is difficult to perform code optimization across compilation units.
(2) Each compilation unit needs to parse all the included header files independently
If N source files include the same header file, that header is parsed N times (for headers such as Thrift-generated files or Boost headers, which can run to tens of thousands of lines, this is a nightmare).
If the header contains templates (STL/Boost), the templates are instantiated once in every cpp file that uses them — a std::vector instantiation used in N source files is instantiated N times.
(3) Template instantiation
Under the C++98 language standard, the compiler must instantiate every template instantiation that occurs in the source code, and the linker must then remove the duplicate instantiation code at link time. Clearly, whenever the compiler encounters a template definition it repeats the same instantiation and compilation work. If this repetition could be avoided, compilation efficiency would improve greatly. The external template feature introduced in the C++11 (then C++0x) standard solves this problem.
C++98 already had a language feature called explicit instantiation (Explicit Instantiation), whose purpose is to tell the compiler to perform the template instantiation immediately (i.e. forced instantiation). The external template syntax is derived from the explicit-instantiation syntax by adding the prefix extern in front of the explicit-instantiation directive.
① Explicit instantiation syntax: template class vector<int>;  ② External template syntax: extern template class vector<int>;
Once an external template declaration is used in a compilation unit, the compiler skips template instantiation that matches the external template declaration when compiling the compilation unit.
(4) Virtual functions
The compiler handles virtual functions by adding a hidden pointer to each object that stores the address of a virtual function table (vtable), which holds the addresses of the class's virtual functions (including those inherited from the base class). If a derived class overrides a virtual function with a new definition, the vtable holds the address of the new function; if it does not redefine the virtual function, the vtable holds the address of the original version. If the derived class defines a new virtual function, its address is appended to the vtable.
When a virtual function is called, the program reads the vtable address stored in the object, goes to the corresponding vtable, looks up the function address at the slot determined by the order of declaration in the class, and executes that function.
Changes after using virtual functions:
① Each object gains extra space for storing an address (4 bytes on 32-bit systems, 8 bytes on 64-bit). ② For each class, the compiler creates a virtual function address table. ③ Each virtual call requires an extra lookup of the function address in the table.
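A minimal sketch of the mechanism described above (illustrative names, not code from the article):

class Base {
public:
    virtual void run() {}        // Base's vtable slot for run() holds &Base::run
    virtual ~Base() = default;
};

class Derived : public Base {
public:
    void run() override {}       // Derived's vtable slot for run() is replaced with &Derived::run
    virtual void extra() {}      // a brand-new virtual function appends another slot to Derived's vtable
};

void call(Base* b) {
    b->run();                    // read the hidden vtable pointer in *b, index the table, call through it
}

int main() {
    Derived d;                   // sizeof(Derived) includes one hidden vtable pointer (4 or 8 bytes)
    call(&d);                    // dispatches to Derived::run at run time
}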
(5) Compiler optimization
To meet users' varying optimization needs, GCC provides nearly a hundred optimization options that trade off among compilation time, object file size, and execution efficiency. There are many kinds of optimization; broadly they fall into the following categories:
① simplifying instructions; ② catering to the CPU pipeline as much as possible; ③ reordering code based on predicted program behavior; ④ making full use of registers; ⑤ inlining simple calls; and so on.
Understanding all of these options and tuning them for specific code is still a complex task. Fortunately, GCC offers the optimization levels O0–O3 plus Os to choose from; these levels bundle most of the effective optimization options, and on top of a level individual options can be disabled or added, which greatly reduces the difficulty of use.
O0: no optimization; this is the default compilation option.
O and O1: perform partial optimization; the compiler tries to reduce the size of the generated code and shorten execution time, but does not perform optimizations that require a lot of compilation time.
O2: a more advanced option than O1 that enables more optimization. GCC performs almost all optimizations that do not involve a time-space tradeoff. With O2 the compiler does not perform loop unrolling or function inlining. Compared with O1, O2 increases compilation time but improves the execution efficiency of the generated code.
O3: further optimization on top of O2, such as the use of pseudo-register networks, inlining of ordinary functions, and more loop optimizations.
Os: mainly optimizes for code size. Note that optimization in general tends to disrupt the structure of the program, making debugging difficult, and to change the execution order, so programs that rely on memory-operation order need corresponding handling to remain correct.
Problems that may arise from compilation optimization:
① Debugging problems: as mentioned above, optimization at any level changes the structure of the code. For example, merging and eliminating branches, eliminating common subexpressions, and replacing or moving load/store operations inside loops make the execution order of the generated code hard to recognize, so the debugging information becomes seriously incomplete.
② Memory-operation ordering problems: after O2 optimization the compiler may affect the order of memory operations. For example, -fschedule-insns allows other instructions to be completed before data processing finishes, and -fforce-mem may create dirty-data-like inconsistencies between memory and registers. Logic that depends on the order of memory operations must be handled strictly before it can be optimized — for example, restrict accesses to the variable with the volatile keyword, or use a barrier to force the CPU to follow the instruction order strictly.
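A minimal sketch of the two safeguards just mentioned (illustrative code, not from the article; the signal-handler flag and the producer/consumer pair are assumed examples):

#include <atomic>
#include <csignal>

// 1) volatile: the compiler must re-read the flag on every access instead of
//    caching it in a register, so a change made in the signal handler is seen.
volatile std::sig_atomic_t g_stop = 0;

void on_sigint(int) { g_stop = 1; }

// 2) std::atomic with acquire/release ordering acts as a barrier: neither the
//    compiler nor the CPU may move the payload store past the flag store.
std::atomic<bool> g_ready{false};
int g_payload = 0;

void producer() {
    g_payload = 42;                                   // ordinary store
    g_ready.store(true, std::memory_order_release);   // barrier: payload is published first
}

int consumer() {
    while (!g_ready.load(std::memory_order_acquire)) {}  // barrier: payload is visible afterwards
    return g_payload;
}

int main() {
    std::signal(SIGINT, on_sigint);
    producer();
    return consumer();
}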
(6) Optimization across compilation units can only be left to the linker
When linking, the linker first determines the position of each object file within the final executable. It then visits each object file's address-relocation table and relocates the addresses recorded there (adding an offset, i.e. the starting address of that compilation unit within the executable). Next it traverses the unresolved-symbol tables of all object files, looks up the matching symbols in the exported-symbol tables, and fills the resolved addresses into the locations recorded in the unresolved-symbol tables. Finally, it writes the contents of all object files to their respective positions, producing the executable. The details of linking are complicated, and the link stage is a single process that cannot be parallelized, which is why linking large projects is slow.
III. Analysis of service problems
DQU is the query understanding platform used by Meituan Search. It contains a large number of models and dictionaries and a sizable code base, including more than 20 Thrift files; it uses a large number of Boost utility functions and pulls in the SF framework, the company's third-party component SDK, and three word-segmentation submodules. Each module is compiled and loaded as a dynamic library, and modules exchange data over a message bus — a large Event class that contains the definitions of the data types needed by every module. Consequently every module includes the Event header; because of this unreasonable dependency, whenever that file changes, almost every module has to be recompiled.
Every service's compilation problems have their own characteristics, but the root causes are similar. Combining the compilation process and principles above, we analyzed DQU's compilation problems from three angles: preprocessed expansion, header file dependencies, and per-stage compilation time.
3.1 Compilation expansion analysis
Compilation expansion analysis keeps the .ii files produced by the C++ preprocessing stage and inspects how large the files become after expansion. Specifically, you can add the compile option "-save-temps" in cmake to retain the intermediate compilation files:
set(CMAKE_CXX_FLAGS "-std=c++11 ${CMAKE_CXX_FLAGS} -ggdb -Og -fPIC -w -Wl,--export-dynamic -Wno-deprecated -fpermissive -save-temps")
The most direct cause of long compilation time is that files become large after expansion. Examining the size and content of the expanded files showed that some files exceed 400,000 lines after expansion, largely because of numerous Boost library references and nested header inclusions, and this expansion drives up compilation time. In this way we could find what the slowest-compiling files have in common. The figure below is a screenshot of file sizes after expansion.
3.2 header file dependency analysis
Header file dependency analysis examines whether the code structure is reasonable from the perspective of how many headers are referenced. We implemented a script that counts header-file dependencies and outputs dependency counts, which helps judge whether the header dependencies are reasonable.
(1) Statistics on the total number of header file references
The tool counts, for each compiled source file, the total number of header files it depends on directly and indirectly, which exposes problems from the perspective of header count. A minimal sketch of such a counter is shown below.
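The following is a minimal sketch of a header-dependency counter (not the team's internal tool). It recursively follows #include "..." lines starting from one source file and reports how many distinct project headers get pulled in; the include-path handling and the restriction to quote-style includes are simplifying assumptions:

#include <filesystem>
#include <fstream>
#include <iostream>
#include <regex>
#include <set>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Recursively collect every project header reachable from `file` via #include "...".
void collect(const fs::path& file, const std::vector<fs::path>& includeDirs,
             std::set<fs::path>& seen) {
    static const std::regex incRe{R"re(^\s*#\s*include\s*"([^"]+)")re"};
    std::ifstream in(file);
    std::string line;
    std::smatch m;
    while (std::getline(in, line)) {
        if (!std::regex_search(line, m, incRe)) continue;
        for (const auto& dir : includeDirs) {
            fs::path candidate = dir / m[1].str();
            if (fs::exists(candidate) && seen.insert(fs::canonical(candidate)).second) {
                collect(candidate, includeDirs, seen);   // follow the header's own includes
            }
        }
    }
}

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: depcount <source.cpp> [include_dir...]\n";
        return 1;
    }
    std::vector<fs::path> dirs{fs::path{"."}};
    for (int i = 2; i < argc; ++i) dirs.emplace_back(argv[i]);
    std::set<fs::path> seen;
    collect(argv[1], dirs, seen);
    std::cout << argv[1] << " depends on " << seen.size() << " headers\n";
    return 0;
}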
(2) dependency statistics of a single header file
The tool also analyzes header dependencies and generates a dependency topology graph, from which unreasonable dependencies can be seen directly.
The graph shows the reference hierarchy as well as the number of referenced header files.
3.3 Segmented statistics of compilation time
Segmented time statistics look at the outcome directly: how long each file takes to compile and how much time each compilation stage consumes. This is an intuitive result and, under normal circumstances, is positively correlated with the expanded file size and the number of included headers. cmake can print the time spent in the compile and link phases by setting the properties below, and this data gives an intuitive view of where the time goes.
set_property(GLOBAL PROPERTY RULE_LAUNCH_COMPILE "${CMAKE_COMMAND} -E time")
set_property(GLOBAL PROPERTY RULE_LAUNCH_LINK "${CMAKE_COMMAND} -E time")
Compilation timing output:
3.4 Construction of analytical tools
The analyses above yield several kinds of compilation data:
① header file dependencies and counts; ② size and content of the preprocessed expansion; ③ per-file compilation time; ④ overall link time; ⑤ from which the degree of compilation parallelism can be computed.
Taking these data as input, we decided we could build an automated analysis tool to identify optimization points and present them in an interface. We therefore built a full-pipeline automated analysis tool that automatically analyzes common time-consuming problems and the TopN slowest files. The tool's processing flow is shown in the figure below:
(1) Overall statistical analysis results
Specific field description:
① cost_time: compilation time, in seconds. ② file_compile_size: size of the intermediate compilation file, in MB. ③ file_name: file name. ④ include_h_nums: number of header files included by the compilation unit. ⑤ top_h_files_info: the TopN most-included header files.
(2) Top10 files by compilation time
Shows the TopN files that take the longest to compile; N is configurable.
(3) Top10 files by intermediate file size
Counts and displays the size of the expanded intermediate files, to judge whether they match expectations; this corresponds to the compilation time.
(4) Top10 files that include the most header files
(5) Top10 header files by number of repeated inclusions
This tool currently supports one-click generation of the compilation-time analysis results. Several small utilities, such as the dependency-count tool, have been integrated into the company's online integration-testing pipeline to automatically check the impact of code changes on compilation time. The tool is still being iterated and improved; later it will be integrated into the company's MCD platform so that it can automatically locate long-compilation problems and help other departments solve their compilation-time issues as well.
IV. Optimization scheme and practice
Using the tools above, we can find what the Top10 slowest-compiling files have in common — for example, they all depend on the message-bus file platform_query_analysis_enent.h, which directly and indirectly pulls in more than 2,000 header files — and we focus our optimization on this kind of file. From the expansion reports produced by the tool we identified common problems such as Boost usage, template class expansion, and Thrift header expansion, and optimized them specifically. In addition, we applied some compilation-acceleration schemes commonly used in the industry, with good results. The optimization schemes we adopted are described in detail below.
4.1 General compilation acceleration scheme
There are many general compilation-acceleration tools (schemes) in the industry that can speed up compilation without intruding on the code; all of them are worth trying.
(1) parallel compilation
On Linux, compilation is usually driven by GNU Make. Adding the -j parameter to the make command raises the compilation parallelism; for example, make -j 4 starts four jobs. In practice we do not hard-code this value but use $(nproc) to obtain the number of CPU cores of the build machine as the concurrency, so as to exploit multi-core performance as much as possible.
(2) distributed compilation
Distributed compilation uses tools such as Distcc and Dmucs to build a large-scale, distributed C++ compilation environment on Linux, using a network cluster for the build; network latency and stability must be taken into account. Distributed compilation suits larger projects, for instance ones whose single-machine build takes hours or even days. Given DQU's code size and single-machine build time, it does not need distributed acceleration for now. See the official Distcc documentation for details.
(3) pre-compiled header files
PCH (Precompiled Header) saves the compilation result of commonly used header files in advance, so the compiler can reuse the precompiled result when those headers are later included, speeding up the whole compilation. PCH is a very common acceleration technique in the industry with very good feedback. In our project, however, a lot of shared-library compilation is involved, and shared libraries cannot share a PCH with each other, so it did not achieve the desired effect. A minimal sketch of an aggregate precompiled header follows.
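For reference, here is a minimal sketch of what an aggregate precompiled header might look like; the file name, contents, and the g++ command in the comment are illustrative assumptions rather than the project's actual setup:

// pch.h -- hypothetical aggregate header of stable, widely used includes.
// Compile it once, e.g.:  g++ -x c++-header -std=c++11 pch.h -o pch.h.gch
// Any translation unit whose first include is "pch.h" then reuses the cached
// parse result instead of re-parsing all of these headers.
#pragma once
#include <algorithm>
#include <map>
#include <memory>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>   // heavy third-party headers benefit the most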
(4) CCache
CCache (Compiler Cache) is a compilation caching tool. Its principle is to cache the compilation results of cpp files; if the corresponding files have not changed, later builds fetch the result directly from the cache. Note that Make itself also has a limited caching ability: if a target has already been built (and its dependencies have not changed) and the source file's timestamp is unchanged, it is not recompiled. CCache, however, caches based on file content, and multiple projects on the same machine can share the cache, so it is more broadly applicable.
(5) Module compilation
If your project is developed in C++20 — congratulations — Module compilation is another way to optimize build speed. Earlier versions of C++ treat each cpp file as a compilation unit, so header files get parsed and compiled many times. Modules were introduced to solve exactly this: a module no longer needs a header file (there is only one module file, with no separate declaration and implementation files); the compiler compiles the module unit (e.g. a .cppm file) directly and automatically generates a binary interface file. import differs from include preprocessing in that a module, once compiled, is not compiled again the next time it is imported, which can greatly improve compilation efficiency. A minimal sketch follows.
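A minimal sketch of the idea (file and module names are illustrative, and the compiler flags needed for modules still vary between toolchains):

// math.cppm -- module interface unit; compiled once into a binary module interface
export module math;

export int square(int x) { return x * x; }

// main.cpp -- importing does not re-parse or re-compile the module's contents
import math;

int main() { return square(7); }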
(6) automatic dependency analysis
Google has also open-sourced Include-What-You-Use (IWYU), a Clang-based tool for checking redundant header files. IWYU relies on the Clang compilation suite to scan header-dependency problems in files, and it also provides scripts that fix them. We tried building this analysis tool; it does offer automatic header fixes, but because our code dependencies are rather complex (dynamic libraries, static libraries, sub-repositories, and so on), the fixes it proposes could not be used directly. Teams with simpler code structures may consider using this tool for analysis and optimization; it generates result files like the one below, indicating which header files should be removed.
>>> Fixing #includes in '/opt/meituan/zhoulei/query_analysis/src/common/qa/record/brand_record.h'
@@ -1,9 +1,10 @@
 #ifndef _MTINTENTION_DATA_BRAND_RECORD_H_
 #define _MTINTENTION_DATA_BRAND_RECORD_H_
-#include "qa/data/record.h"
-#include "qa/data/template_map.hpp"
-#include "qa/data/template_vector.hpp"
+#include <boost/serialization/version.hpp>  // for BOOST_CLASS_VERSION
+#include <string>                           // for string
+#include <vector>                           // for vector
+
+#include "qa/data/file_buffer.h"            // for REG_TEMPLATE_FILE_HANDLER

4.2 Code optimization schemes and practice
(1) Forward type declarations
Header-reference statistics show that the bus type Event is the most frequently referenced type in the project, and members needed by each business module are all placed in this type. An example:
# include "a.h" # include "b.h" class Event {/ / Business A, B, C... A1 A1; A2 A2; / /. B1 b1; B2 b2; / /.}
As a result, Event includes a huge number of header files — its expanded size reaches 15 MB — and every business module needs Event, which naturally drags compilation performance down severely.
We solved this with forward type declarations: instead of including the header of each member type, we only forward-declare it, and Event holds only pointers to the corresponding types, as shown below:
class A1;   // forward declarations instead of #include
class A2;
class B1;
class B2;
class Event {
    // Business A, B, C ...
    shared_ptr<A1> a1;
    shared_ptr<A2> a2;
    // ...
    shared_ptr<B1> b1;
    shared_ptr<B2> b2;
    // ...
};
The header of the corresponding type only needs to be included where the member variable is actually used. This makes truly on-demand header inclusion possible.
(2) External templates
Because a template is instantiated only when it is used, the same instantiation can appear in multiple object files. The compiler instantiates the template in every file that uses it, and the linker then removes the duplicate instantiation code. In projects that use templates heavily the compiler generates a great deal of redundant code, which significantly increases compilation time and link time. This can be avoided with the external template feature added in C++11.
// util.h
template <typename T>
void max(T) { ... }

// A.cpp
#include "util.h"
template void max<int>(int);   // explicit instantiation
void test1() { max(1); }
When A.cpp is compiled, a max<int> instantiation of the function is generated.
// B.cpp
#include "util.h"
extern template void max<int>(int);   // external template declaration
void test2() { max(2); }
When B.cpp is compiled, no max<int> instantiation code is generated, which saves the instantiation, compilation, and link time described earlier.
(3) Replacing templates with polymorphism
Our project makes heavy use of dictionary-related operations, such as loading dictionaries, parsing them, and matching them (in various fancy ways), and all of these supported different dictionary types through template expansion. According to our statistics there are more than 150 dictionary types, which also caused the template-expanded code to balloon.
template <class R>   // one expansion per Record type
class Dict {
public:
    // Match key and condition, and fill in the record
    bool match(const string& key, const string& condition, R& record);
private:
    map<string, R> dict;
};
Fortunately, most of our dictionary operations can be abstracted into a few kinds of interfaces, so we only need to implement the operations against a base class:
class Record {   // base class
public:
    virtual bool match(const string& condition);   // derived classes implement this
};

class Dict {
public:
    shared_ptr<Record> match(const string& key, const string& condition);   // the caller works with derived-class pointers
private:
    map<string, shared_ptr<Record>> dict;
};
Through inheritance and polymorphism we effectively avoided massive template expansion. Note that using pointers as map values increases the pressure on memory allocation; it is advisable to use Tcmalloc or Jemalloc instead of the default Ptmalloc to optimize memory allocation.
(4) Replacing the Boost library
Boost is a widely used utility library covering a great deal of commonly needed functionality; it is convenient and easy to use, but it has some drawbacks. One significant drawback is that it is implemented in hpp form, with declaration and implementation both placed in header files, which makes the preprocessed expansion very large.
// String manipulation is a common need; this single Boost header expands to more than 4 MB
#include <boost/algorithm/string.hpp>

// By contrast, the corresponding STL headers together expand to only about 1 MB
#include <string>
#include <algorithm>
Our project mainly used no more than 20 Boost functions. Some could be replaced with STL equivalents and some we reimplemented by hand, which took the project from heavily Boost-dependent to mostly Boost-free and greatly reduced the compilation burden.
(5) pre-compilation
Some code in the project changes rarely but still has a noticeable impact on compilation time, such as the files generated from Thrift, the model library files, and the common utilities under the Common directory. We precompile these into dynamic libraries, which reduces the compilation time of the files that depend on them and resolves part of the compilation dependencies.
(6) Untangling compilation dependencies to improve build parallelism
Our project contains a large number of module-level dynamic libraries that need to be compiled, and the compilation dependencies specified in the cmake files limit to some extent how much of the build can run in parallel.
For example, in the scenario below, compilation parallelism can be improved by setting library dependencies sensibly.
4.3 Optimization effect
We compared compilation times before and after optimization on machines with 32 cores and 64 GB of memory. The statistics are as follows:
4.4 Keeping the optimization results
Compilation optimization is a matter of "sailing against the current": developers always tend to add new features, new libraries, and even new frameworks, while deleting old code, removing old libraries, and retiring old frameworks is always difficult (front-line developers surely know this well). Therefore, how to preserve the earlier optimization results is also very important. Our experience in practice is as follows:
Code review is difficult (changes that increase compilation time are often hard to spot by reviewing the code).
Tools and processes are worth relying on.
The key is to control the increment.
We found that a cpp file's compilation time is, in most cases, positively correlated with the size of its preprocessed expansion file (.ii). For each released version we record the expanded size of every cpp file, forming the version's compilation fingerprint (CF, Compile Fingerprint). By comparing the CFs of two adjacent versions, we can know precisely which changes introduced the new version's extra compilation time, and can further analyze whether the increase is reasonable and whether there is room for optimization. A minimal sketch of such a fingerprint comparison is shown below.
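The following is a minimal sketch of a compilation-fingerprint diff (not the team's actual tool). It assumes each version's fingerprint is stored as a manifest file with one "<path> <size_in_bytes>" line per preprocessed .ii file; the format and names are illustrative:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Load one version's fingerprint: a map from source file to its .ii size.
std::map<std::string, long long> load_manifest(const std::string& path) {
    std::map<std::string, long long> sizes;
    std::ifstream in(path);
    std::string file;
    long long bytes;
    while (in >> file >> bytes) sizes[file] = bytes;
    return sizes;
}

int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: cfdiff <old_manifest> <new_manifest>\n";
        return 1;
    }
    auto oldCF = load_manifest(argv[1]);
    auto newCF = load_manifest(argv[2]);

    std::vector<std::pair<long long, std::string>> growth;   // (size increase, file)
    for (const auto& [file, size] : newCF) {
        auto it = oldCF.find(file);
        long long delta = size - (it == oldCF.end() ? 0 : it->second);
        if (delta > 0) growth.emplace_back(delta, file);
    }
    std::sort(growth.rbegin(), growth.rend());                // largest increase first

    for (const auto& [delta, file] : growth)
        std::cout << file << " grew by " << delta << " bytes\n";
    return 0;
}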
We turned this approach into a scripting tool and added it to the release pipeline, so we can see clearly how each code release affects compilation performance, which effectively helps us preserve the earlier optimization results.
V. Summary
The DQU project is an important part of Meituan's search business. The system needs to interface with more than 20 RPC services and dozens of models, load more than 300 dictionaries, use tens of gigabytes of memory, and respond to more than 2 billion requests a day — a large C++ service. With the business iterating rapidly, the long compilation time caused considerable trouble for developers and constrained development efficiency to a certain extent. In the end, by building compilation-optimization analysis tools and combining general compilation-acceleration schemes with code-level optimization, we reduced DQU's compilation time by 70%; with CCache and other means, local development builds can finish within 100 s, saving the development team a great deal of time.
After achieving these staged results, we summarized the whole problem-solving process and distilled some analysis methods, tools, and process norms. In subsequent development iterations, these tools can quickly and effectively detect the compilation-time changes brought by new code, and they have become part of the checks in our release pipeline. This is very different from our previous one-off, targeted compilation optimizations. After all, code maintenance is an ongoing process; solving this problem systematically requires not only effective methods and convenient tools, but also a normalized, standardized release process to preserve the results.
At this point, you should have a deeper understanding of how the compilation time of a C++ service can be optimized — it is well worth trying out in practice.