
How to design data structures for multithreaded performance in C++


This article explains how to design data structures for multithreaded performance in C++. The approach described here is simple and practical, so let's work through it.

8.3.1 Dividing array elements for complex operations

Suppose you are doing some heavy-duty math and you need to multiply two large matrices. To multiply matrices, you multiply each element in the first row of the first matrix by the corresponding element in the first column of the second matrix and add up the products to give the top-left element of the result matrix. You then repeat this with the second row and the first column to give the second element in the first column of the result, and so on. As shown in figure 8.3, the highlighted parts show that the second row of the first matrix is paired with the third column of the second matrix to give the entry in the second row and third column of the result.

Figure 8.3 Matrix multiplication

To make it worth using multiple threads to optimize this multiplication, assume that these are large matrices with several thousand rows and columns. Typically, a non-sparse matrix is represented in memory by one big array, with all the elements of the first row followed by all the elements of the second row, and so on. To multiply the matrices you therefore have three of these big arrays. To get good performance, you need to pay careful attention to the data access patterns, particularly the writes to the third array.

There are many ways to divide this work between threads. Assuming you have more rows and columns than processors, you could have each thread calculate the values for a number of columns in the result matrix, or have each thread calculate the values for a number of rows, or even have each thread calculate the values for a regular rectangular subset of the result matrix.

Looking back at sections 8.2.3 and 8.2.4, you will recall that accessing adjacent elements of an array is better than accessing values scattered all over the array, because it reduces cache usage and the chance of false sharing. If each thread handles a set of columns, it needs to read every element of the first matrix and the elements of the corresponding columns of the second matrix, but it only writes the column elements of the result. Given that the matrices are stored with the rows contiguous, this means you read N elements from the first row, N elements from the second row, and so on (where N is the number of columns processed by the thread). Other threads read the other elements of each row, so it is clear that you should process adjacent columns; the N elements in each row are then adjacent and false sharing is minimized. Of course, if the space occupied by those N elements is an exact multiple of the cache-line size, there will be no false sharing at all, because each thread works on separate cache lines.

On the other hand, if each thread handles a set of rows, it needs to read every element of the second matrix and the elements of the corresponding rows of the first matrix, but it only writes the row elements of the result. Because the matrix is stored with the rows contiguous, you are now accessing all the elements of N contiguous rows. If you choose adjacent rows, this thread is the only one writing to those N rows; it has a contiguous block of memory that isn't touched by any other thread. This is likely an improvement over having each thread handle a set of columns, because the only places false sharing can occur are between the last few elements of one block and the first few of the next. It is still worth timing the options on the target architecture to be sure.

What about the third option, dividing into rectangular blocks? This can be viewed as dividing into columns and then dividing into rows, so it has the same potential for false sharing as column-based partitioning, and you can choose the number of columns per block to avoid it. Dividing into rectangular blocks also has an advantage on the reading side: you don't need to read the whole of either source matrix; you only need to read the rows and columns corresponding to the target block. To put this in concrete terms, consider multiplying two matrices with 1,000 rows and 1,000 columns; that's a million elements each. If you have 100 processors, each can handle 10 rows of the result, or 10,000 result elements. However, to compute those 10,000 elements, a thread needs to read the whole of the second matrix (a million elements) plus the 10,000 elements of the corresponding rows of the first matrix, for a total of 1,010,000 elements. On the other hand, if each thread handles a 100-row by 100-column block of the result (still 10,000 elements), it needs to read the elements of 100 rows of the first matrix (100 x 1,000 = 100,000 elements) and 100 columns of the second matrix (another 100,000 elements). That's only 200,000 elements, cutting the number of elements read to one fifth. If you read fewer elements, there is less chance of a cache miss and the potential for better performance.

Therefore, dividing the result matrix into small, roughly square blocks can be better than having each thread process the entirety of a few rows. Of course, you can adjust the size of each block at runtime, depending on the size of the matrices and the number of processors available. As ever, if performance matters, it's important to profile the options on the target architecture.
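
To make this concrete, here is a minimal sketch (not one of the book's listings) of block-wise division of a row-major n-by-n multiplication. The function names are placeholders, and spawning one thread per block is a simplification; a real implementation would cap the number of threads at the hardware concurrency and tune the block size as discussed above.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each thread computes one rectangular block of the result, so it writes a
// contiguous run of elements per row and reads only the rows of A and the
// columns of B that its block needs.
void multiply_block(std::vector<double> const& a, std::vector<double> const& b,
                    std::vector<double>& c, std::size_t n,
                    std::size_t row_begin, std::size_t row_end,
                    std::size_t col_begin, std::size_t col_end)
{
    for (std::size_t i = row_begin; i < row_end; ++i)
        for (std::size_t j = col_begin; j < col_end; ++j)
        {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j]; // row of A times column of B
            c[i * n + j] = sum;                     // writes stay inside this block
        }
}

void parallel_multiply(std::vector<double> const& a, std::vector<double> const& b,
                       std::vector<double>& c, std::size_t n, std::size_t block)
{
    // Simplification: one thread per block; in practice you would limit this
    // to roughly std::thread::hardware_concurrency() workers.
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < n; i += block)
        for (std::size_t j = 0; j < n; j += block)
            workers.emplace_back(multiply_block, std::cref(a), std::cref(b),
                                 std::ref(c), n,
                                 i, std::min(i + block, n),
                                 j, std::min(j + block, n));
    for (auto& t : workers)
        t.join();
}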

Of course, you might not be doing matrix multiplication, so does this apply to you? The same principles apply whenever you have large blocks of data to divide between threads. Look carefully at how the data is accessed and identify the potential causes of performance pain. There may be similar circumstances in your own problem where changing the way the work is divided improves performance without requiring any change to the underlying algorithm.

OK, so we've seen how access patterns on arrays can affect performance. What about other types of data structures?

8.3.2 Data access in other data structures

Fundamentally, the same considerations apply when trying to optimize the data access patterns of other data structures as when optimizing access to arrays:

1. Try to adjust the data distribution between threads so that data that is close together is worked on by the same thread.

2. Minimize the data required by any given thread.

3. Make sure that the data accessed by independent threads is far enough apart to avoid false sharing.

Of course, that is not easy to apply to other data structures. For example, binary trees are inherently difficult to subdivide in any useful way; whether it is possible or not depends on how balanced the tree is and how many sections you need to divide it into. Also, the nature of a tree means that nodes are dynamically allocated and end up in different places on the heap.

Now, having data end up in different places on the heap isn't a particular problem in itself, but it does mean that the processor has to keep more things in cache. This can actually be beneficial: if multiple threads need to traverse the tree, they all need to access the tree nodes, but if the nodes only contain pointers to the real data held at each node, then the processor only has to load that data from memory when it is actually needed. If the data is only modified by the threads that need it, this can avoid the performance hit of false sharing between the node data itself and the data that provides the tree structure.

There is a similar issue with data protected by a mutex. Suppose you have a simple class that contains a data item and a mutex used to protect it from access by multiple threads. If the mutex and the data item are close together in memory, this is great for the thread that acquires the mutex: the data it needs may well already be in its cache, because it was loaded in order to modify the mutex. But there is also a downside: while the first thread holds the mutex, other threads that try to lock it need to access that same memory. Mutex locks are typically implemented as a read-modify-write atomic operation on a memory location within the mutex to try to acquire it, followed by a call into the operating system kernel if the mutex is already locked. This read-modify-write operation may well cause the data held in the cache of the thread that owns the mutex to be invalidated. As far as the mutex goes, this isn't a problem; that thread won't touch the mutex again until it unlocks it. However, if the mutex shares a cache line with the data being used by the thread, the performance of the thread that owns the mutex can take a hit just because another thread tried to lock it.

One way to test whether this kind of false sharing is a problem is to add huge blocks of padding between data elements that are accessed concurrently by different threads. For example, you can use a structure like the following:
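
The original snippet is not reproduced in this article; a minimal sketch of such a padded structure might look like the code below, where my_data is a stand-in for whatever type is actually being protected.

#include <mutex>

struct my_data { int value; };   // stand-in for the real data being protected

struct protected_data
{
    std::mutex m;
    char padding[65536];         // padding far larger than a cache line
    my_data data_to_protect;
};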

to test for mutex contention problems, or use something like:
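
Again, the original snippet is not reproduced here; a sketch along the same lines, with data_item1 and data_item2 standing in for the items accessed by different threads, might be:

struct data_item1 { int value; };   // stand-ins for the per-thread data items
struct data_item2 { int value; };

struct padded_data
{
    data_item1 d1;
    data_item2 d2;
    char padding[65536];            // keeps neighbouring array elements on separate cache lines
};

padded_data some_array[256];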

to test for false sharing of array data. If adding the padding improves performance, you know that false sharing was indeed the problem, and you can either keep the padding or eliminate the false sharing more permanently by rearranging the data accesses.

Of course, there is more than just the data access patterns to consider when designing for concurrency, so let's look at some of the other aspects you need to think about.

8.4 Additional considerations for concurrent design

So far in this chapter we have looked at ways of dividing work between threads, factors that affect performance, and how those factors influence which data access patterns and data structures you choose. There is more to designing concurrent code than that, however. You also need to consider things such as exception safety and scalability. Code is said to be scalable if its performance (whether measured as reduced execution time or increased throughput) increases as more processing cores are added to the system. Ideally the performance increase is linear, so a system with 100 processors performs 100 times better than a system with one processor.

Code can work even if it isn't scalable: a single-threaded application, for example, isn't scalable at all. Exception safety, on the other hand, is about correctness. If your code isn't exception safe, you can end up with broken invariants or race conditions, or your application may terminate unexpectedly because an operation threw an exception. With that in mind, let's look at exception safety first.

8.4.1 Exception safety in parallel algorithms

Exception safety is an essential aspect of good C++ code, and code that uses concurrency is no exception. In fact, parallel algorithms often require that you take more care with exceptions than ordinary sequential algorithms do. If an operation in a sequential algorithm throws an exception, the algorithm only has to worry about tidying up after itself to avoid resource leaks and broken invariants; it can then happily allow the exception to propagate to the caller to handle. By contrast, in a parallel algorithm many of the operations run on separate threads. In that case the exception cannot be allowed to propagate, because it is on the wrong call stack. If a function spawned on a new thread exits with an exception, the application is terminated.

As a concrete example, let's revisit the parallel_accumulate function from listing 2.8, reproduced with a few changes as listing 8.2.

Listing 8.2 A parallel version of std::accumulate (from listing 2.8)
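
The listing itself was not reproduced in this article. The following is a reconstruction sketched from the discussion that follows, with numbered comments matching the callouts referenced in the text; treat it as illustrative rather than a verbatim copy of listing 8.2.

#include <algorithm>
#include <functional>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

template<typename Iterator, typename T>
struct accumulate_block
{
    void operator()(Iterator first, Iterator last, T& result)
    {
        result = std::accumulate(first, last, result);        // (1) may throw on a new thread
    }
};

template<typename Iterator, typename T>
T parallel_accumulate(Iterator first, Iterator last, T init)
{
    unsigned long const length = std::distance(first, last);  // (2)
    if (!length)
        return init;

    unsigned long const min_per_thread = 25;
    unsigned long const max_threads =
        (length + min_per_thread - 1) / min_per_thread;
    unsigned long const hardware_threads = std::thread::hardware_concurrency();
    unsigned long const num_threads =
        std::min(hardware_threads != 0 ? hardware_threads : 2, max_threads);
    unsigned long const block_size = length / num_threads;

    std::vector<T> results(num_threads);                      // (3)
    std::vector<std::thread> threads(num_threads - 1);        // (4)

    Iterator block_start = first;                              // (5)
    for (unsigned long i = 0; i < (num_threads - 1); ++i)
    {
        Iterator block_end = block_start;                      // (6)
        std::advance(block_end, block_size);
        threads[i] = std::thread(                              // (7)
            accumulate_block<Iterator, T>(),
            block_start, block_end, std::ref(results[i]));
        block_start = block_end;                               // (8)
    }
    accumulate_block<Iterator, T>()(
        block_start, last, results[num_threads - 1]);          // (9)

    std::for_each(threads.begin(), threads.end(),
                  std::mem_fn(&std::thread::join));

    return std::accumulate(results.begin(), results.end(), init); // (10)
}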

Now let's go through and identify the places where an exception can be thrown: basically, anywhere a function is called, or an operation is performed on a user-defined type, an exception may be thrown.

First, there is the call to distance (2), which performs operations on the user-supplied iterator type. Because you haven't done any work yet, and this is on the calling thread, that's fine. Next, there are the allocations of the results vector (3) and the threads vector (4). Again, these are on the calling thread, and you haven't done any work or spawned any threads, so this is fine. Of course, if the construction of threads throws, the memory allocated for results will have to be cleaned up, and its destructor will do that for you.

Skipping over the initialization of block_start (5), because that is also safe, we come to the operations in the thread-spawning loop (6, 7, 8). Once the first thread has been created at (7), you are in trouble if an exception is thrown: the destructors of your new std::thread objects will call std::terminate and abort the program.

The call to accumulate_block (9) might also throw, with the same consequence: the thread objects will be destroyed and std::terminate will be called. On the other hand, the final call to std::accumulate (10) can throw without causing any hardship, because all the threads will have been joined by that point.

That covers the main thread, but there is more: the calls to accumulate_block on the new threads might throw an exception at (1). There are no catch blocks there, so the exception will be left unhandled and cause the library to call std::terminate() to abort the program.

In case it isn't glaringly obvious: this code is not exception-safe.

1. Adding exception safety

OK, so we have identified all the places where an exception might be thrown and the nasty consequences that would follow. What can you do about it? Let's start by addressing the problem of exceptions thrown on the new threads.

The tools for this were introduced in chapter 4. If you think carefully about what you are trying to achieve on the new threads, it becomes apparent that you are trying to calculate a result to return while allowing for the possibility that the code might throw an exception. This is precisely what the combination of std::packaged_task and std::future is designed for. If you rework the code to use std::packaged_task, you end up with the code in listing 8.3.

Listing 8.3 A parallel version of std::accumulate using std::packaged_task
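
Again, the listing was not reproduced in this article; this sketch reconstructs it from the discussion below, with numbered comments matching the callouts in the text. Treat it as illustrative rather than a verbatim copy of listing 8.3.

#include <algorithm>
#include <functional>
#include <future>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

template<typename Iterator, typename T>
struct accumulate_block
{
    T operator()(Iterator first, Iterator last)                // (1) returns the result directly
    {
        return std::accumulate(first, last, T());               // (2) default-constructed T
    }
};

template<typename Iterator, typename T>
T parallel_accumulate(Iterator first, Iterator last, T init)
{
    unsigned long const length = std::distance(first, last);
    if (!length)
        return init;

    unsigned long const min_per_thread = 25;
    unsigned long const max_threads =
        (length + min_per_thread - 1) / min_per_thread;
    unsigned long const hardware_threads = std::thread::hardware_concurrency();
    unsigned long const num_threads =
        std::min(hardware_threads != 0 ? hardware_threads : 2, max_threads);
    unsigned long const block_size = length / num_threads;

    std::vector<std::future<T>> futures(num_threads - 1);       // (3)
    std::vector<std::thread> threads(num_threads - 1);

    Iterator block_start = first;
    for (unsigned long i = 0; i < (num_threads - 1); ++i)
    {
        Iterator block_end = block_start;
        std::advance(block_end, block_size);
        std::packaged_task<T(Iterator, Iterator)> task{          // (4)
            accumulate_block<Iterator, T>()};
        futures[i] = task.get_future();                          // (5)
        threads[i] = std::thread(std::move(task),                // (6)
                                 block_start, block_end);
        block_start = block_end;
    }
    T last_result = accumulate_block<Iterator, T>()(block_start, last); // (7)

    std::for_each(threads.begin(), threads.end(),
                  std::mem_fn(&std::thread::join));

    T result = init;                                             // (8)
    for (unsigned long i = 0; i < (num_threads - 1); ++i)
        result += futures[i].get();                              // (9) rethrows any captured exception
    result += last_result;                                       // (10)
    return result;
}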

The first change is that the function call operator of accumulate_block now returns the result directly, rather than taking a reference to somewhere to store it (1). You are using std::packaged_task and std::future for the exception safety, so you can use them to transfer the result too. This does require that the call to std::accumulate (2) explicitly uses a default-constructed T rather than reusing the supplied result value, but that is only a minor change.

The next change is that, rather than a vector of results, you have a vector of futures (3) to hold a std::future<T> for each spawned thread. In the thread-spawning loop, you first create a task for accumulate_block (4); std::packaged_task<T(Iterator, Iterator)> declares a task that takes two Iterators and returns a T, which is what this function does. You then get the future for that task (5) and run it on a new thread, passing in the start and end of the block to process (6). When the task runs, the result, or any exception thrown, is captured in the future.

Because you are now using futures, there is no results array, so the result from the last block must be stored in a variable (7) rather than in a slot in that array. Also, because you will be getting the values out of the futures, it is simpler to use a basic for loop rather than std::accumulate, starting with the supplied initial value (8) and adding in the result from each future (9). If the corresponding task threw an exception, it will have been captured in the future and will be rethrown by the call to get(). Finally, you add the result from the last block (10) before returning the overall result to the caller.

So, this removes one of the potential problems: exceptions thrown in the worker threads are rethrown in the main thread. If more than one worker thread throws an exception, only one of them will be propagated, but that is not a big deal. If it really matters, you can use something like std::nested_exception to capture all the exceptions and throw that instead.

The remaining problem is the leaking of threads if an exception is thrown between when you spawn the first thread and when you join all of them. The simplest solution is to catch any exception, join the threads that are still joinable(), and then rethrow the exception:
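
Sketched below is the relevant middle portion of the function from listing 8.3 with that change applied (abridged: the body of the thread-spawning loop is unchanged and only indicated by a comment).

    try
    {
        for (unsigned long i = 0; i < (num_threads - 1); ++i)
        {
            // ... create the packaged_task, store its future and spawn the
            //     thread, exactly as in listing 8.3 ...
        }
        T last_result = accumulate_block<Iterator, T>()(block_start, last);
        std::for_each(threads.begin(), threads.end(),
                      std::mem_fn(&std::thread::join));
    }
    catch (...)
    {
        for (unsigned long i = 0; i < (num_threads - 1); ++i)
        {
            if (threads[i].joinable())
                threads[i].join();
        }
        throw;
    }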

Now this works: all the threads will be joined, no matter how the code leaves the block. However, try-catch blocks are ugly, and you now have duplicated code: the threads are joined both in the "normal" control flow and in the catch block. Duplicated code is rarely a good thing, because it means more places to change. Instead, let's extract the joining into the destructor of an object; that is, after all, the idiomatic way of cleaning up resources in C++. Here is the class:
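
A minimal sketch of such a class, which joins every joinable thread in a vector when it is destroyed:

#include <thread>
#include <vector>

class join_threads
{
    std::vector<std::thread>& threads;
public:
    explicit join_threads(std::vector<std::thread>& threads_)
        : threads(threads_)
    {}
    ~join_threads()                       // joins everything, however the scope is left
    {
        for (unsigned long i = 0; i < threads.size(); ++i)
        {
            if (threads[i].joinable())
                threads[i].join();
        }
    }
};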

This is similar to the thread_guard class from listing 2.3, except that it is extended to cover the whole vector of threads. You can then simplify the code as shown in listing 8.4.

Listing 8.4 An exception-safe parallel version of std::accumulate
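
The listing was not reproduced in this article; the following abridged sketch shows how the function from listing 8.3 changes (the length check and the thread-count/block-size calculation are unchanged and elided), with numbered comments matching the callouts in the text.

template<typename Iterator, typename T>
T parallel_accumulate(Iterator first, Iterator last, T init)
{
    // ... length check, thread-count and block-size calculation as in listing 8.3 ...

    std::vector<std::future<T>> futures(num_threads - 1);
    std::vector<std::thread> threads(num_threads - 1);
    join_threads joiner(threads);                              // (1) joins all threads on exit

    Iterator block_start = first;
    for (unsigned long i = 0; i < (num_threads - 1); ++i)
    {
        Iterator block_end = block_start;
        std::advance(block_end, block_size);
        std::packaged_task<T(Iterator, Iterator)> task{accumulate_block<Iterator, T>()};
        futures[i] = task.get_future();
        threads[i] = std::thread(std::move(task), block_start, block_end);
        block_start = block_end;
    }
    T last_result = accumulate_block<Iterator, T>()(block_start, last);

    T result = init;
    for (unsigned long i = 0; i < (num_threads - 1); ++i)
        result += futures[i].get();                            // (2) blocks until each result is ready
    result += last_result;
    return result;
}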

Once you have created your container of threads, you create an instance of the new class (1) to join with all the threads on exit. You can then remove the explicit join loop, safe in the knowledge that the threads will be joined however the function exits. Note that the calls to futures[i].get() (2) block until the results are ready, so you don't need to have explicitly joined with the threads at that point. This is unlike the original in listing 8.2, where you had to join with the threads to ensure that the results vector was correctly populated. Not only do you get exception-safe code, but your function is shorter, because you have extracted the join code into the new (reusable) class.

2. Exception safety with std::async()

Now that you have seen what is required for exception safety when explicitly managing the threads, let's look at the same thing done with std::async(). As you have already seen, in this case the library manages the threads for you, and any threads spawned are completed when the future is ready. The key thing to note for exception safety is that if the future is destroyed without your having waited on it, its destructor will wait for the thread to complete. This neatly avoids the problem of leaked threads that are still executing and holding references to the data. Listing 8.5 shows an exception-safe implementation using std::async().

Listing 8.5 An exception-safe parallel version of std::accumulate using std::async
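
The listing was not reproduced here; this sketch reconstructs it from the discussion below, with numbered comments matching the callouts in the text. Treat it as illustrative rather than a verbatim copy.

#include <future>
#include <iterator>
#include <numeric>

template<typename Iterator, typename T>
T parallel_accumulate(Iterator first, Iterator last, T init)
{
    unsigned long const length = std::distance(first, last);    // (1)
    unsigned long const max_chunk_size = 25;
    if (length <= max_chunk_size)
    {
        return std::accumulate(first, last, init);               // (2) small enough: do it directly
    }
    else
    {
        Iterator mid_point = first;
        std::advance(mid_point, length / 2);                      // (3)
        std::future<T> first_half_result =
            std::async(parallel_accumulate<Iterator, T>,          // (4) first half runs asynchronously
                       first, mid_point, init);
        T second_half_result = parallel_accumulate(mid_point, last, T()); // (5) direct recursive call
        return first_half_result.get() + second_half_result;      // (6) rethrows any captured exception
    }
}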

This version uses recursive division of the data rather than pre-computing the division into blocks, but it is much simpler than the previous versions and is still exception-safe. As before, you start by finding the length of the sequence (1), and if it is no larger than the maximum chunk size, you call std::accumulate directly (2). If there are more elements than the chunk size, you find the midpoint (3) and then spawn an asynchronous task to handle the first half (4). The second half of the range is handled with a direct recursive call (5), and the results of the two chunks are then added together (6). The library ensures that the std::async calls make use of the available hardware threads without creating an overwhelming number of threads; some of the "asynchronous" calls will actually be executed synchronously in the call to get().

The beauty of this approach is that not only does it take advantage of the hardware concurrency, but it is also trivially exception-safe. If an exception is thrown by the recursive call (5), the future created by the call to std::async (4) will be destroyed as the exception propagates. This will in turn wait for the asynchronous task to finish, thus avoiding a dangling thread. On the other hand, if the asynchronous call throws, the exception is captured by the future, and the call to get() (6) will rethrow it.

What else do you need to consider when designing concurrent code? Let's look at scalability: how much does performance improve if you move your code to a system with more processors?

8.4.2 Scalability and Amdahl's law

Scalability is about ensuring that your application can take advantage of additional processors in the system it runs on. At one extreme you have a single-threaded application that does not scale at all: even if you add 100 processors to the system, its performance is unchanged. At the other extreme you have something like the SETI@Home [3] project, which is designed to take advantage of thousands of additional processors (in the form of individual users adding their personal computers to the network).

For any given multithreaded program, the number of threads doing useful work varies as the program runs. Even if every thread is doing something useful for its whole lifetime, the application may initially have only one thread, which then has the task of spawning all the others. But even that is a highly unlikely scenario: threads often spend time waiting for each other or waiting for I/O operations to complete.

Every time one thread has to wait for something (whatever that something is), unless another thread is ready to take its place on the processor, you have a processor sitting idle that could be doing useful work.

A simplified way of looking at this is to divide the program into "serial" sections, where only one thread is doing useful work, and "parallel" sections, where all the available processors are doing useful work. If you run the application on a system with more processors, the "parallel" sections can theoretically complete more quickly, because the work can be divided between more processors, while the "serial" sections remain serial. Under this simplified set of assumptions you can estimate the performance gain from increasing the number of processors: if the "serial" sections constitute a fraction f_s of the program, then the performance gain P from using N processors can be estimated as

P = 1 / (f_s + (1 - f_s) / N)

This is Amdahl's law, which is often cited when discussing the performance of concurrent code. If everything can be parallelized, the serial fraction is zero and the speedup is N. Alternatively, if the serial fraction is one third, then even with an infinite number of processors you won't get a speedup of more than 3.

Nevertheless, this paints a naively ideal picture. Tasks are rarely infinitely divisible in the way the equation assumes, and it is also rare for everything to be CPU-bound in the way it assumes. As you have seen, threads wait for many things while they execute.

One thing that is clear from Amdahl's law is that when you use concurrency for performance, it is worth looking at the design of the application as a whole to maximize the potential for concurrency and to ensure that the processors always have useful work to do. If you can reduce the size of the "serial" sections or reduce the potential for threads to wait, you can improve the possible performance gain on systems with more processors. Alternatively, if you can provide more data for the system to process, and thus keep the parallel sections primed with work, you can reduce the serial fraction and increase the performance gain P.

Basically, scalability is about reducing the time it takes to perform an action, or increasing the amount of data that can be processed in a given time, as more processors are added. Sometimes these are equivalent (if each element can be processed faster, you can process more data overall), but not always. Before choosing how to divide the work between threads, it is important to identify which aspects of scalability matter to you.

I mentioned at the beginning of this section that threads don't always have useful work to do. Sometimes they have to wait for other threads, or wait for I/O operations to complete, or something else. If you give the system something useful to do during these waits, you can effectively "hide" the waiting.

8.4.3 Hiding latency with multiple threads

In much of the discussion of multithreaded performance so far, we have assumed that threads are "running flat out" and always have useful work to do when they are actually running on a processor. This is of course not true; in application code, threads are frequently blocked while waiting for something. For example, they may be waiting for some I/O to complete, waiting to acquire a mutex, waiting for another thread to complete an operation and notify a condition variable, or just sleeping for a period of time.

Whatever the reason for the waits, if you have only as many threads as there are physical processing units in the system, blocked threads mean you are wasting CPU time: the processor that would otherwise be running a blocked thread is doing nothing. Consequently, if you know that one of your threads is likely to spend a considerable amount of its time waiting, you can make use of that spare CPU time by running one or more additional threads.

Consider a virus-scanner application that divides the work between threads using a pipeline. The first thread searches the file system for files to check and puts them on a queue. Meanwhile, another thread takes file names off the queue, loads the files, and scans them for viruses. You know that the thread searching the file system is definitely going to be I/O-bound, so you can make use of the "spare" CPU time by running an additional scanning thread. You would then have one file-searching thread and as many scanning threads as there are physical cores or processors in the system. Because the scanning threads may also have to read significant portions of the files from disk in order to scan them, it might even make sense to have more scanning threads than that. But at some point there will be too many threads, and the system will slow down again as it spends more and more time task switching, as described in section 8.2.5.

As ever, this is an optimization, so it is important to measure performance before and after any change in the number of threads; the optimal number of threads will depend heavily on the nature of the work and the percentage of time each thread spends waiting.

Depending on the application, it may be possible to use up this spare CPU time without running additional threads. For example, if a thread is blocked waiting for an I/O operation to complete, it might make sense to use asynchronous I/O if it is available; the thread can then do other useful work while the I/O proceeds in the background. In other cases, if a thread is waiting for another thread to perform an operation, the waiting thread might be able to perform that operation itself rather than blocking, as with the lock-free queue in chapter 7. In an extreme case, if a thread is waiting for a task to be completed and that task has not yet been started by any thread, the waiting thread can perform the task in its entirety itself, or perform another task that is still outstanding. You saw an example of this in listing 8.1, where the sort function repeatedly tries to sort outstanding chunks as long as the chunks it needs are not yet sorted.

Rather than adding threads to ensure that all available processors are being used, sometimes threads are added to ensure that external events are handled in a timely manner, in order to increase the responsiveness of the system.

8.4.4 Improving responsiveness with concurrency

Most modern graphical user interface frameworks are event-driven: the user performs actions on the user interface by pressing keys or moving the mouse, which generates a series of events or messages that the application then handles; the system may also generate messages or events on its own. To ensure that all events and messages are handled correctly, the application typically has an event loop that looks something like the one sketched below.
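
The original loop was not reproduced in this article; a sketch of its shape follows, where event_data and quit_event are placeholder names and get_event() and process() are the functions discussed below.

while (true)
{
    event_data event = get_event();   // wait for the next user or system event
    if (event.type == quit_event)
        break;
    process(event);                    // dispatch the event to the appropriate handler
}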

Obviously, the details of the API will vary, but the structure is generally the same: wait for an event, do whatever is necessary to handle it, and then wait for the next one. If you have a single-threaded application, this can make long-running tasks hard to write, as described in section 8.1.3. To ensure that user input is handled in a timely manner, get_event() and process() must be called with reasonable frequency, whatever the application is doing. This means that either the task must periodically suspend itself and return control to the event loop, or the get_event()/process() code must be called from within the task at convenient points. Either option complicates the implementation of the task.

By separating the concerns with concurrency, you can instead run the lengthy task on a whole new thread and leave a dedicated GUI thread to process the events. The threads can then communicate with each other through simple mechanisms rather than having to somehow mix the event-handling code in with the task code. Listing 8.6 shows a simple outline of this separation.

Listing 8.6 Separating the GUI thread from the task thread
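
The listing was not reproduced in this article; the sketch below reconstructs its outline. The event machinery and the helpers task_finished(), do_next_operation(), perform_cleanup(), post_gui_event() and display_results() are placeholder names for application-specific pieces, declared here only so the outline hangs together.

#include <atomic>
#include <thread>

// Placeholder event machinery: a real GUI framework supplies its own.
enum event_type { start_task, stop_task, task_complete, quit_event, other };
struct event_data { event_type type; };
event_data get_event();                 // assumed: blocks until the next GUI/system event
void display_results();                 // assumed: shows the task's results in the GUI
bool task_finished();                   // assumed: true once the long-running work is done
void do_next_operation();               // assumed: performs one small chunk of the work
void perform_cleanup();                 // assumed: tidies up after a cancelled task
void post_gui_event(event_type);        // assumed: queues an event for the GUI thread
void process(event_data const& event);

std::thread task_thread;
std::atomic<bool> task_cancelled(false);

void gui_thread()                       // dedicated thread running the event loop
{
    while (true)
    {
        event_data event = get_event();
        if (event.type == quit_event)
            break;
        process(event);
    }
}

void task()                             // the long-running work, on its own thread
{
    while (!task_finished() && !task_cancelled)
    {
        do_next_operation();
    }
    if (task_cancelled)
        perform_cleanup();
    else
        post_gui_event(task_complete);  // tell the GUI thread the work is done
}

void process(event_data const& event)
{
    switch (event.type)
    {
    case start_task:
        task_cancelled = false;
        task_thread = std::thread(task);
        break;
    case stop_task:
        task_cancelled = true;
        task_thread.join();
        break;
    case task_complete:
        task_thread.join();
        display_results();
        break;
    default:
        break;                           // other GUI messages: resize, redraw, ...
    }
}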

By separating the concerns in this way, the user's thread can respond to events in a timely manner, even if the task takes a long time to execute. This responsiveness is often key to the user experience of an application: an application that locks up completely whenever a particular operation is being performed (whatever that may be) is inconvenient to use. By providing a dedicated event-handling thread, the GUI can handle GUI-specific messages (such as resizing or redrawing the window) without interrupting the time-consuming processing, while still passing on the relevant messages when they do affect the long-running task.

At this point, I believe you have a deeper understanding of how to design data structures for multithreaded performance in C++. You might as well put it into practice, and keep learning.
