Tuning Methods for Java Threads


This article looks at how to extract the maximum performance from Java threads and the synchronization facilities that support them.

Thread Pool and ThreadPoolExecutor

(Thread pool schematic diagram; source: Wikipedia)

In Java, applications can manage threads with their own code, or they can use a thread pool, executing tasks in parallel with a ThreadPoolExecutor.

When using a thread pool, one factor is critical: sizing the pool correctly is essential to achieving the best performance. Thread pool performance varies with that basic choice of pool size, and under some conditions an oversized pool can actively hurt performance.

All thread pools work essentially the same way:

There is a queue to which tasks are submitted, and a certain number of threads take tasks from that queue and execute them.

The result of a task may be sent back to a client (in the case of an application server), saved to a database, stored in an internal data structure, and so on. After executing a task, the thread returns to the queue, retrieves the next task, and executes it; if no tasks remain, the thread waits for the next one.

A thread pool has a minimum and a maximum number of threads. The minimum number of threads sits waiting for tasks to be assigned to them. Because creating a thread is quite expensive, this improves overall performance when a task is submitted: an existing thread picks it up and processes it. On the other hand, threads consume system resources, including native memory for their stacks, and too many idle threads consume resources that could be used by other processes. The maximum number of threads also serves as a necessary throttle, preventing too many threads from executing at once.

ThreadPoolExecutor and related classes refer to the minimum number of threads as the core pool size. If a task arrives while every existing thread is busy executing another, a new thread is started, until the maximum number of threads has been created.
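
As a minimal sketch of that behavior (the sizes, queue, and task here are illustrative, not taken from any particular test in this article):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class PoolBasics {
        public static void main(String[] args) {
            // Core (minimum) of 2 threads, maximum of 4; extra threads may
            // idle for up to 60 seconds; up to 20 tasks wait in the queue.
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    2, 4, 60, TimeUnit.SECONDS,
                    new ArrayBlockingQueue<Runnable>(20));
            for (int i = 0; i < 10; i++) {
                pool.execute(() -> System.out.println(
                        "task on " + Thread.currentThread().getName()));
            }
            pool.shutdown();  // queued tasks still run; no new ones accepted
        }
    }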

Set the maximum number of threads

What is the best maximum number of threads for a given load on given hardware?

The question is not easy to answer; it depends on the characteristics of the load and on the underlying hardware. In particular, the optimal number of threads also depends on how often each individual task blocks.

For the sake of discussion, assume the JVM has four CPUs available. Our goal is to maximize the utilization of those four CPUs.

Clearly, the maximum number of threads must be set to at least four. Granted, some threads in the JVM have things to do other than processing these tasks, but they will almost never need an entire CPU. One exception is when a concurrent garbage collector is in use: its background threads need enough CPU to run so that they don't fall behind in processing the heap.

Does it help to have more than four threads? That depends on the characteristics of the load. Consider the simplest case, where tasks are compute-bound: they make no external network calls (such as database accesses) and don't fiercely contend for internal locks. The stock price history batch program, when used with a mock entity manager, is one such application: the data for the entities can be computed entirely in parallel.

Let's use a thread pool to calculate the histories of 10,000 mock stock entities on a machine with four CPUs, testing with different numbers of threads. The performance data appears in Table 1. With only one thread in the pool, the data set takes 255.6 seconds to compute; with four threads, only 77 seconds. Beyond four threads, the time creeps up slightly as more threads are added.

Table 1: Time required to calculate the price history of 10,000 mock entities

If the tasks in the application were perfectly parallel, the "percentage of baseline" column would show 50% at two threads and 25% at four. Such perfectly linear scaling is impossible for several reasons: if nothing else, the threads must coordinate among themselves to select tasks from the run queue (and in general there is usually far more synchronization than that). By the time four threads are used, the system consumes 100% of the available CPU, and although the machine may not be running other user-level applications, various system-level processes kick in and use some CPU, preventing the JVM from using every available cycle.

Still, the application scales well, and even when the number of threads in the pool is significantly overestimated, the performance penalty is fairly slight.

In other circumstances, however, the penalty can be large. In the servlet version of the stock history calculator, too many threads have a big impact, as Table 2 shows. The application server is configured with different numbers of threads, and a load generator sends 20 simultaneous requests to the server.

Table 2: Operations per second through the servlet

Given that the application server has four CPUs available, maximum throughput is achieved with four threads in the pool.

When investigating a performance issue, it is important to determine where the bottleneck is. In this example the bottleneck is clearly CPU: with four threads, CPU utilization is 100%. Even so, the effect of adding more threads is fairly small, at least until there are about eight times too many, at which point the difference becomes significant.

What if the bottleneck lies elsewhere? This example is somewhat unusual in that its tasks are entirely CPU-bound: there is no I/O. In general, threads may call a database, write output somewhere, or rendezvous with other external resources. In those cases the bottleneck is not necessarily the CPU but may be the external resource.

Adding threads is then quite harmful. Although we often say the database is always the bottleneck, the bottleneck can be any external resource.

Take the stock servlet as an example again, but change the goal: what if the aim is to make maximum use of the load generator machine, which is simply running a multithreaded Java program?

In typical usage, if the servlet application runs on an application server with four CPUs and only a single client requests data, the application server is about 25% busy and the client machine is almost entirely idle. If the load increases to four concurrent clients, the application server becomes 100% busy while the client machine may be only 20% busy.

Looking only at the client, it is easy to conclude that because the client has plenty of spare CPU, adding more client threads should improve its throughput. Table 3 shows how wrong that assumption is: when a few more threads are added on the client, performance suffers greatly.

Table 3: Average response time for calculating mock stock price histories

In this example, once the application server becomes the bottleneck (that is, at four threads), adding load to it is very harmful, even if only a few threads are added on the client.

The example may seem contrived. Who would add more client threads when the server is already CPU-bound? It is used only because it is easy to understand and uses only Java programs, which means readers can run it and see how it behaves without setting up database connections, schemas, and the like.

The point is that the same principle holds for an application server sending database requests to a machine that is CPU- or I/O-bound. You might look only at the application server's CPU, feel good that it is well below 100%, see that there are spare requests to process, and assume that increasing the number of application-server threads is a good idea. That leads to a surprise: increasing the number of threads in that situation actually reduces total throughput (perhaps significantly), just as in the Java-only example.

There is another reason it is important to understand where the system's real bottleneck lies:

If more load is added at the bottleneck, performance degrades significantly.

Conversely, if load at the current bottleneck is reduced, performance may improve.

This is also why self-tuning thread pools are so difficult to design. Thread pools usually have some visibility into how much work is pending, and perhaps even how much CPU is available, but they usually have no visibility into other aspects of the environment in which they operate. Consequently, adding threads when work is pending (a core feature of many self-tuning pools and of certain ThreadPoolExecutor configurations) is often exactly the wrong thing to do.

Unfortunately, this is also why setting the maximum number of threads is more art than science. In practice, a self-tuning pool may get you to 80% to 90% of the possible performance under test conditions, and overestimating the number of threads needed usually costs little. But when the thread count is set badly, the system can go badly wrong. Thorough testing remains critical here.

Set the minimum number of threads

Once the maximum number of threads in the pool has been determined, it is time to determine the minimum. In most cases, developers should simply set the two to the same value.

The rationale for setting the minimum to some other value (say, 1) is to prevent the system from creating too many threads, in order to save system resources; each thread requires a certain amount of memory, particularly for its stack. On the other hand, one general principle is that a system should be sized for the maximum throughput it is expected to handle, and at that throughput all of those threads will need to be created anyway. If the system can't get that far, choosing a smaller minimum doesn't help: if the system reaches the conditions that require the maximum number of threads and cannot support them, it is in trouble. It is better to create all the threads that might eventually be needed and make sure the system can handle the expected maximum load.

Meanwhile, the downside of specifying a minimum number of threads is fairly small. There is a negative effect if the process has many tasks to execute the moment it starts: the pool must then create the threads to handle them. Creating threads is bad for performance, which is why thread pools are needed in the first place, but this one-time cost will likely pass unnoticed in performance testing.

In a batch application, it doesn't matter whether the threads are allocated when the pool is created (which happens when the minimum and maximum are set to the same value) or on demand: the time needed to execute the application is the same. In other applications, the new threads are likely allocated during the warm-up period (the total time to allocate them is still the same), and the impact on performance is negligible. Even if thread creation happens during the measurement period, as long as it is limited, it will likely go undetected.

Another tunable value is the idle time of a thread. Say a pool has a minimum of one thread and a maximum of four. Now suppose one thread is usually executing a task, and then the application enters a cycle where, on average, two tasks arrive every 15 seconds. The first time through the cycle, the pool creates a second thread, and it makes sense to keep that newly created thread around for at least some period of time. We want to avoid the case where the second thread is created, finishes its task within 5 seconds, sits idle for 5 seconds, and then exits, since 5 seconds later a thread must be created again for the next task. In general, once a new thread is created in a pool sized by its minimum, it should stick around at least a few minutes to handle any spike in load. If there is a good model of the task arrival rate, idle time can be based on that model. Otherwise, idle time should be measured in minutes, at least somewhere between 10 and 30.
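
As a sketch of how this is expressed in a ThreadPoolExecutor (the values are illustrative):

    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class IdleTimeDemo {
        public static void main(String[] args) {
            // Minimum (core) of 1 thread, maximum of 4; threads beyond the
            // core exit only after 15 minutes of idleness, so a brief lull
            // in load does not force the pool to re-create them.
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    1, 4, 15, TimeUnit.MINUTES,
                    new SynchronousQueue<Runnable>());
            System.out.println("keep-alive minutes: "
                    + pool.getKeepAliveTime(TimeUnit.MINUTES));
            pool.shutdown();
        }
    }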

Leaving idle threads around usually has little impact on application performance. The thread object itself does not take up much heap space, unless the thread holds a large amount of thread-local storage or the thread's Runnable references large amounts of memory. In those cases, freeing a thread can significantly reduce the live data in the heap (which in turn affects GC efficiency).

For thread pools, though, those situations are unusual. When a thread in the pool is idle, it should no longer reference any Runnable (and if it does, there is a bug somewhere). Depending on the pool implementation, thread-local variables may remain in place; while thread-local variables can be an effective way to reuse objects in some circumstances, the total memory those objects occupy should be limited.

One important exception applies to pools that can grow very large (and hence run on very large machines). Suppose a pool's task queue is expected to average 20 tasks; 20 is then a good minimum size. Now suppose the pool runs on a large machine sized to handle a spike of 2,000 tasks. Keeping 2,000 idle threads around hurts performance during the periods when only 20 tasks are running: the throughput of a pool with 20 busy threads and 1,980 idle ones may be only around 50% of that of a pool with just the 20 core threads. Thread pools rarely run into this problem, but when they do, confirm that the pool's minimum value is appropriate.

Thread pool task size

Tasks awaiting execution sit in some kind of queue or list; when a thread in the pool can execute a task, it pulls one off the queue. An imbalance can develop: the number of queued tasks can grow very large. If the queue is too deep, the tasks in it must wait a long time while earlier tasks complete. Imagine an overloaded web server: if a task is added to the queue but not executed for 3 seconds, the user has likely moved on to another page.

Thread pools therefore usually limit the size of the queue of pending tasks. ThreadPoolExecutor does this in different ways depending on the data structure chosen to hold the waiting tasks (more on that in the next section); application servers usually expose a tuning parameter for it.

As with the maximum number of threads, there is no universal rule for tuning this value. Suppose an application server's task queue holds 30,000 tasks and four CPUs are available. If each task takes only 50 ms to execute, and assuming no new tasks arrive in the meantime, clearing the queue takes about 6 minutes (30,000 × 50 ms of work spread over 4 CPUs). That may be acceptable, but if each task takes 1 second, clearing the queue takes more than 2 hours. Measuring the actual application is the only way to determine which value delivers the performance you need.

In either case, when the queue limit is reached, any further task is rejected. ThreadPoolExecutor handles this through a RejectedExecutionHandler, whose rejectedExecution method is invoked for the task (the default handler throws RejectedExecutionException). The application server then returns an error to the user: either HTTP status 500 (internal error) directly or, better, the web server catches the error and gives the user a reasonable explanation.
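
A sketch of installing a custom handler (the handler body and pool sizes are illustrative):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class RejectionDemo {
        public static void main(String[] args) {
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    1, 1, 0, TimeUnit.SECONDS,
                    new ArrayBlockingQueue<Runnable>(2));
            // Invoked for every task that cannot be queued or executed.
            pool.setRejectedExecutionHandler((task, executor) ->
                    System.err.println("rejected: " + task));
            for (int i = 0; i < 10; i++) {
                final int id = i;
                pool.execute(() -> System.out.println("running task " + id));
            }
            pool.shutdown();
        }
    }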

Set the size of the ThreadPoolExecutor

The general behavior of a thread pool is as follows:

At creation, the pool starts its minimum number of threads. When a task arrives and all current threads are busy, a new thread is started (as long as the maximum has not been reached) so the task can execute immediately.

Otherwise, the task is added to the wait queue; if the queue cannot accept it, the task is rejected.

The behavior of ThreadPoolExecutor, however, can differ somewhat from this standard model.

Depending on the type of task queue selected, ThreadPoolExecutor decides when to start a new thread. There are three possibilities.

1. SynchronousQueue

If a ThreadPoolExecutor is paired with a SynchronousQueue, the pool behaves as expected with respect to the number of threads: if all threads are busy and the pool has fewer than its maximum number of threads, a new task causes a new thread to start. However, this queue has no way to hold waiting tasks: if a task arrives when every thread is busy and the maximum has been reached, the task is always rejected. So this choice is good for managing a small number of tasks but unsuitable otherwise. The documentation for this queue type suggests specifying a very large maximum thread count, which may work if the tasks are entirely CPU-bound but, as we have seen, can be counterproductive otherwise. On the other hand, if you need a thread pool whose thread count is easy to adjust, this is the better choice.
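
For reference, this pairing is exactly what Executors.newCachedThreadPool() constructs:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class CachedPool {
        // No core threads, an effectively unlimited maximum, a 60-second
        // idle timeout, and a SynchronousQueue that hands each task
        // directly to a thread.
        static ExecutorService newCachedPool() {
            return new ThreadPoolExecutor(
                    0, Integer.MAX_VALUE,
                    60L, TimeUnit.SECONDS,
                    new SynchronousQueue<Runnable>());
        }
    }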

2. Unbounded queue

If a ThreadPoolExecutor is paired with an unbounded queue (such as a LinkedBlockingQueue), no task is ever rejected (since the queue size is unlimited). In this case, the pool creates at most the core number of threads: the maximum pool size is ignored. If the maximum and core sizes are the same, this choice comes closest to the traditional fixed-size thread pool.

3. Bounded queue

A ThreadPoolExecutor with a bounded queue (such as an ArrayBlockingQueue) uses a surprisingly intricate algorithm to decide when to start a new thread. Say the pool's core size is 4, its maximum is 8, and its ArrayBlockingQueue holds at most 10 tasks. As tasks arrive and are placed in the queue, the pool runs at most 4 threads (the core size). Even if the queue fills completely, with 10 pending tasks, the executor still uses only 4 threads.

A new thread is started only when the queue is full and yet another task is submitted. Rather than rejecting the task, the executor starts a new thread, which runs the first task on the queue, making room for the new one.

In this example, the only way the pool ends up with 8 threads (its maximum) is when 7 tasks are in progress, 10 tasks are in the queue, and a new task is submitted.
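
A sketch of the configuration just described:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedQueuePool {
        static ThreadPoolExecutor newPool() {
            // Core 4, maximum 8, queue capacity 10. Threads 5 through 8
            // start only when all 10 queue slots are occupied and yet
            // another task is submitted.
            return new ThreadPoolExecutor(
                    4, 8, 30, TimeUnit.SECONDS,
                    new ArrayBlockingQueue<Runnable>(10));
        }
    }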

The idea behind the algorithm is that the pool runs with only its four core threads most of the time, even when a moderate number of tasks sits in the queue waiting to run. The pool then acts as a throttle, which is advantageous. If the backlog of requests grows too large, the pool tries to run more threads to clear it, at which point the second throttle, the maximum thread count, comes into play.

If the system has no external bottlenecks and spare CPU cycles, everything works out: the added threads drain the queue faster, probably returning it to its expected size. Use cases suited to this algorithm are certainly easy to construct.

On the other hand, the algorithm has no way of knowing why the queue has suddenly grown. If it is due to a backlog at an external resource, adding threads is a mistake. If the machine the pool runs on is already CPU-saturated, adding threads is a mistake. Adding threads makes sense only when the backlog is caused by additional load arriving in the system (for example, more clients making HTTP requests). (Yet if that is the case, why wait until the queue has reached some threshold? If resources are available for additional threads, starting them sooner improves the system's overall performance.)

There are many arguments for and against each of these options, but when seeking the best performance, the KISS principle, "keep it simple, stupid," applies. Set the core and maximum thread counts of the ThreadPoolExecutor to the same value and, for holding the waiting tasks, use a LinkedBlockingQueue if an unbounded task list is appropriate, or an ArrayBlockingQueue if a bounded task list is appropriate.
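
In code, that recommendation amounts to a fixed-size pool; a sketch (thread count illustrative):

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class SimplePool {
        static ThreadPoolExecutor newFixedPool(int nThreads) {
            // Core == maximum with an unbounded LinkedBlockingQueue: the
            // same configuration Executors.newFixedThreadPool() returns.
            return new ThreadPoolExecutor(
                    nThreads, nThreads,
                    0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>());
        }
    }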

Quick summary

Object pools are sometimes a good choice, and thread pools are one such case: threads are expensive to initialize, and a pool makes the number of threads in a system easy to control.

The thread pool must be carefully tuned. Blindly adding new threads to the pool can adversely affect performance in some cases.

When using ThreadPoolExecutor, choosing a simpler option usually results in the best and most predictable performance.

ForkJoinPool

Java 7 introduced a new thread pool: the ForkJoinPool class. It looks like any other thread pool; like ThreadPoolExecutor, it implements the Executor and ExecutorService interfaces. In supporting those interfaces, ForkJoinPool uses an internal unbounded list of tasks, run by the number of threads specified in its constructor (or by the number of CPUs on the machine, if the no-argument constructor is used).

The ForkJoinPool class is designed to match the use of divide-and-conquer algorithms: tasks can be recursively decomposed into subsets. These subsets can be processed in parallel, and then the results of each subset are merged into one result. A classic example is the quick sort algorithm.

The point about a divide-and-conquer algorithm is that it creates a great many tasks that are managed by a relatively small number of threads. Say we want to sort an array of 10 million elements. We start by creating a task that performs three operations: sort the subarray containing the first 5 million elements, sort the subarray containing the second 5 million elements, then merge the two.

Sorting each 5-million-element subarray is in turn done by sorting subarrays of 2.5 million elements and merging them. The recursion continues until some point (say, a subarray of 10 elements) where it becomes more efficient to sort the subarray directly with insertion sort. The figure below illustrates how this works.

Tasks in a recursive quicksort

In the end, more than a million tasks sort the leaf arrays (each with fewer than 10 elements, which are sorted directly; 10 is just the example threshold here, and the actual value varies by implementation: current Java library implementations switch to insertion sort when an array has fewer than 47 elements). More than half a million tasks then merge those sorted arrays, another 250,000 or so merge at the next level, and so on, for a total of 2,097,151 tasks.

The larger issue is that none of the tasks can complete before the tasks they spawn have completed. The tasks directly sorting subarrays of fewer than 10 elements must finish first; then the tasks that created those subarrays can merge their results; and so on up the chain, until the entire array is merged into the final, sorted result.

Because a parent task must wait for its child tasks to complete, this algorithm cannot be implemented efficiently with a ThreadPoolExecutor. A thread inside a ThreadPoolExecutor cannot add a task to the queue and then wait for it to finish: once the thread is waiting, it can no longer be used to execute one of the subtasks. A ForkJoinPool, by contrast, lets its threads create new tasks and then suspend the current one; while a task is suspended, the thread can execute other pending tasks.

Take a simple example: an array of doubles, where the goal is to count the number of elements less than 0.5. A sequential scan would be trivial (and may even have an advantage, as we'll see later in this section), but for illustration we'll divide the array into subarrays and scan them in parallel, mimicking quicksort and other, more complex divide-and-conquer algorithms. The code to do this with a ForkJoinPool looks like the following:
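
A minimal sketch of such a task follows; the class and variable names are assumed, and the split threshold of 10 matches the discussion above:

    import java.util.Random;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    public class ForkJoinCount {
        private static double[] d;

        static class CountTask extends RecursiveTask<Integer> {
            private final int first, last;

            CountTask(int first, int last) {
                this.first = first;
                this.last = last;
            }

            @Override
            protected Integer compute() {
                if (last - first < 10) {
                    // Leaf task: the range is small enough to count directly.
                    int count = 0;
                    for (int i = first; i <= last; i++)
                        if (d[i] < 0.5) count++;
                    return count;
                }
                // Split the range in half, fork both subtasks, then join:
                // while this task waits, its thread can run other tasks.
                int mid = (first + last) >>> 1;
                CountTask left = new CountTask(first, mid);
                CountTask right = new CountTask(mid + 1, last);
                left.fork();
                right.fork();
                return left.join() + right.join();
            }
        }

        public static void main(String[] args) {
            d = new Random().doubles(10_000_000).toArray();
            int n = new ForkJoinPool().invoke(new CountTask(0, d.length - 1));
            System.out.println("Found " + n + " values < 0.5");
        }
    }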

The fork and join methods are the key: without them, implementing this kind of recursion would be very painful (and they don't exist for tasks executed by a ThreadPoolExecutor). Internally, these methods use a series of per-thread queues to manage the tasks and to switch a thread from executing one task to another. The details are transparent to the developer, though the code makes interesting reading if you enjoy algorithms. Our focus here is the performance: what trade-offs exist between the ForkJoinPool and ThreadPoolExecutor classes?

First, the suspension implemented by the fork/join paradigm lets all the tasks be executed by a small number of threads. Counting the double values in a 10-million-element array with this example code creates more than 2 million tasks, yet they are easily run by only a few threads (even one, if that makes sense for the machine running the test). Running a similar algorithm with a ThreadPoolExecutor would need more than 2 million threads, because each task must wait for its subtasks to complete, and those subtasks can complete only if additional threads are available in the pool. So fork/join lets us implement algorithms that are impossible with a ThreadPoolExecutor, which is itself a performance advantage.

Still, although divide-and-conquer is a powerful technique, overusing it can yield poor performance. For this counting problem, a single thread could simply scan the array and count, though probably not as fast as the parallel fork/join version. And it is also easy to divide the array into segments and use a ThreadPoolExecutor, with a thread scanning each segment:
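
A sketch of that segmented approach (four segments here, matching the four-CPU test machine; names are illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class SegmentedCount {
        private static double[] d;

        public static void main(String[] args) throws Exception {
            d = new Random().doubles(10_000_000).toArray();
            final int nThreads = 4;               // one segment per CPU
            ExecutorService pool = Executors.newFixedThreadPool(nThreads);
            int size = d.length / nThreads;
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < nThreads; t++) {
                final int first = t * size;
                final int last = (t == nThreads - 1)
                        ? d.length - 1 : (t + 1) * size - 1;
                // Each Callable scans one contiguous segment of the array.
                results.add(pool.submit(() -> {
                    int count = 0;
                    for (int i = first; i <= last; i++)
                        if (d[i] < 0.5) count++;
                    return count;
                }));
            }
            int n = 0;
            for (Future<Integer> f : results)
                n += f.get();                     // wait for every segment
            pool.shutdown();
            System.out.println("Found " + n + " values < 0.5");
        }
    }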

On a machine with four CPUs, this code fully utilizes all the available CPUs, processing the array in parallel while avoiding the creation and queue management of the 2 million tasks in the fork/join example. As Table 4 shows, it is faster.

Table 4: Counting an array of 100 million elements

The machine used in the test has four CPUs and 4 GB of memory. During the test, the ThreadPoolExecutor required no GC at all, while each ForkJoinPool run spent 1.2 seconds in GC. That accounts for a large part of the performance gap, but not all of it: the overhead of creating and managing the task objects also handicaps the ForkJoinPool. When a similar alternative exists, it is likely to be faster, at least in a simple example like this one.

ForkJoinPool has one additional feature: it implements work stealing. That is essentially an implementation detail, but it means that each thread in the pool has its own queue of the tasks it has created. A thread works preferentially on tasks from its own queue, and when that queue is empty, it steals tasks from the queues of other threads. The effect is that even if one of the 2 million tasks takes a very long time to execute, the other threads in the ForkJoinPool can complete all the remaining work. The same is not true of ThreadPoolExecutor: if one of its tasks runs long, no other thread can take over the extra work.

The example code so far simply counts the elements of the array less than 0.5. What if, in addition, the code computes a new value to store into the array? A meaningless (but CPU-intensive) implementation could execute the following:
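
A sketch of such a computation (the loop body is an arbitrary stand-in; the key property is that the work for element i shrinks as i grows):

    public class UnbalancedWork {
        private static volatile double dummy;

        // CPU-intensive but meaningless: the bound of the inner loop
        // shrinks as i grows, so early elements take far longer to
        // compute than later ones.
        static void compute(double[] d, int first, int last) {
            for (int i = first; i <= last; i++) {
                for (int j = 0; j < d.length - i; j++) {
                    dummy = (double) j / (i + 1);  // volatile write keeps the
                    d[i] = dummy;                  // JIT from removing the loop
                }
            }
        }
    }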

Because the bound of the loop indexed by j depends on the position of the element being calculated, the computation time depends on that position as well: computing the value of d[0] takes a very long time, while computing d[d.length-1] takes very little.

Now simply splitting the array into four segments and processing them with a ThreadPoolExecutor has a downside: the thread computing the first segment takes far longer to finish than the thread computing the last segment. Once that fourth thread finishes, it sits idle; everything waits for the first thread to complete its long task.

With the 2-million-task granularity of the ForkJoinPool, although one thread gets bogged down in the very lengthy calculations for the first 10 elements of the array, the remaining threads still have work to do, and the CPUs stay busy through most of the test. The difference is shown in Table 5.

Table 5: Processing time for an array of 10,000 elements

With one thread in the pool, the computation takes essentially the same time either way. That makes sense: the amount of computation is the same regardless of the pool implementation, and since it is never done in parallel, it can be expected to take the same time (apart from the small overhead of creating the 2 million tasks). But with four threads in the pool, the granularity of the ForkJoinPool's tasks gives it the decisive advantage: it keeps the CPUs busy for nearly the entire run.

This situation is called unbalanced, because some tasks take longer than others (and so the tasks in the previous example were balanced). In general, a segmented ThreadPoolExecutor performs better when the tasks are balanced, and a ForkJoinPool performs better when they are not.

There is a subtler performance tip here as well: think carefully about the point at which the fork/join recursion should stop. In this example I arbitrarily chose to stop when the array size drops below 10. Stopping the recursion at a size of 2.5 million instead would mean the fork/join test (for the balanced code processing 10 million elements on a four-CPU machine) creates only four tasks and delivers roughly the same performance as the ThreadPoolExecutor.

On the other hand, for this example, continuing the recursion further gives even better performance in the unbalanced test, even though still more tasks are created. Table 6 shows some representative data points.

Table 6: Processing time for an array of 10,000 elements with different recursion thresholds

Automatic parallelization

Java 8 introduced the ability to automatically parallelize certain kinds of code. This parallelization relies on the ForkJoinPool class, to which Java 8 adds a new feature:

A common pool that is available to any ForkJoinTask that is not explicitly assigned to a particular pool.

This common pool is a static element of the ForkJoinPool class, and its size is set to the number of processors on the target machine by default.

This parallelization appears in several new methods of the Arrays class, including parallel sorting of an array, operations applied to each element of an array, and so on. It is also used by Java 8's Stream feature, which allows an operation to be performed on every element of a collection, either sequentially or in parallel. Rather than discussing Stream's basic performance characteristics here, let's look at how a Stream can be processed in parallel automatically.

Given a collection containing a list of integers, the following code calculates the price history for the stock symbol derived from each integer:
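
A sketch of that code; simulateHistory here is a hypothetical stand-in for the original example's price-history computation:

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    public class ParallelHistory {
        // Hypothetical stand-in: simulate one symbol's price history.
        static double simulateHistory(int symbolId) {
            double price = 100.0;
            for (int day = 0; day < 1_000; day++)
                price += ThreadLocalRandom.current().nextGaussian();
            return price;
        }

        static void calculateAll(List<Integer> symbolIds) {
            // forEach on a parallel stream farms the per-element work out
            // to the common ForkJoinPool (plus the calling thread).
            symbolIds.parallelStream()
                     .forEach(id -> simulateHistory(id));
        }
    }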

This code calculates the mock price histories in parallel: the forEach method creates a task for each element of the collection, and each task is processed by the common ForkJoinPool.

Sizing the common ForkJoinPool is as important as sizing any other thread pool. By default, the common pool has one thread per CPU on the machine. If multiple JVMs run on the same machine, limiting that number is sensible so the JVMs don't compete with one another for CPU. Similarly, if servlet code executes parallel tasks and you want to make sure CPU is available for other requests, consider lowering the size of the common pool. Conversely, if the tasks in the common pool block waiting for I/O or other data, the size might need to be increased.

This value can be specified by setting the system property -Djava.util.concurrent.ForkJoinPool.common.parallelism=N.

Table 1 above showed how the number of threads affects the time to calculate the mock price histories. Table 7 compares that data with the forEach construct using the common ForkJoinPool, with the parallelism system property set to the given value.

Table 7: Time required to calculate the price history of 10,000 mock stocks

By default, the common pool has four threads (on this machine with four CPUs), so the third row of the table shows the typical case. The results for one and two threads are the kind that make performance engineers unhappy: they look wildly out of line, and when that happens in a test, the most common cause is a testing error. Here, though, the cause is that forEach behaves oddly: it uses both the calling thread and the threads of the common pool to process the data in the stream. Even though the common pool is configured with one thread in the first test, two threads are actually doing the computation. (Consequently, a ThreadPoolExecutor with two threads and a ForkJoinPool with one thread take roughly the same time.)

So when using parallel stream constructs or other automatic parallelization features, if the common pool needs to be resized, consider setting the value to one less than the desired number of threads.

Quick summary

The ForkJoinPool class should be used for recursive, divide-and-conquer algorithms.

Spend some effort deciding when the recursion of tasks in the algorithm should stop. Creating too many tasks can hurt performance, but so can too few, when the tasks take unequal amounts of time to execute.

The automatic parallelization feature in Java 8 uses a common ForkJoinPool instance. We may need to adjust the default size of this instance according to the actual situation.

Understanding how threads work can yield a significant performance advantage. With respect to thread performance, however, there isn't much to tune: there are very few relevant JVM flags, and those few have limited effect.

Instead, good thread performance comes from following a set of best practices for managing the number of threads and for limiting the effects of synchronization. With the help of suitable profiling and lock-analysis tools, applications can be examined and modified so that threading and locking issues don't hurt performance.

Adjust thread stack size

When space is at a premium, the memory used by threads can be adjusted. Each thread has a native stack, where the operating system stores the thread's call-stack information (the fact that the main() method called calculate(), which called add(), for example).

The default thread stack size varies by JVM version and platform, as shown below. That said, many applications can in fact run with a 128 KB stack on a 32-bit JVM and a 256 KB stack on a 64-bit JVM. The potential downside of setting this value too small is that a thread with a very deep call stack will throw a StackOverflowError.

Default stack sizes for several JVMs

On a 64-bit JVM, there is usually no reason to change this value unless the machine is quite limited in physical memory and a smaller stack guards against exhausting native memory. On a 32-bit JVM, on the other hand, a smaller stack (such as 128 KB) is often a good choice: it frees memory in the process space, allowing the JVM to use a larger heap.

Running out of native memory

If there is not enough native memory to create the thread, an OutOfMemoryError is thrown. That can mean one of three things:

On a 32-bit JVM, the process has hit its size limit of 4 GB (or less, depending on the operating system).

The system has actually run out of virtual memory.

On Unix-style systems, the user has reached the quota on the number of processes they may create; individual threads count against that quota.

Reducing the stack size can overcome the first two problems, but it does little for the third. Unfortunately, the JVM's error message gives no indication of which case applies; the possibilities have to be investigated when the error occurs.

To change a thread's stack size, use the -XssN flag (for example, -Xss256k).

Summary

On machines where memory is scarce, the thread stack size can be reduced.

On a 32-bit JVM, reducing the thread stack size slightly increases the memory available to the heap within the 4 GB process-size limit.

Monitoring threads and locks

When analyzing the efficiency of threads and synchronization in an application, there are two points to pay attention to: the total number of threads (neither too large nor too small) and the time threads spend waiting for locks or other resources.

1. Viewing threads

Almost every JVM monitoring tool reports information about the number of threads (and what they are doing). Interactive tools such as jconsole show the state of threads inside the JVM. On jconsole's Threads panel, you can watch the number of threads rise and fall in real time as the program executes. The figure below is an example.

At one point, the application (NetBeans) was using a peak of 45 threads. There is a burst near the beginning of the graph where it used up to 38 threads; afterward the count settled between 30 and 31. jconsole can also print the stack of any individual thread; as the figure shows, the Java2D Disposer thread is waiting on a lock belonging to a reference queue.

Active Thread View in JConsole

2. Viewing blocked threads

Real-time thread monitoring gives a useful high-level view of which threads are running in the application, but it provides no real data about what those threads are doing. Determining where a thread's CPU cycles go requires a profiler. Profilers give good visibility into which threads are executing, and they are generally sophisticated enough to point you at the code where better algorithms and better coding choices would speed up overall execution.

Diagnosing blocked threads is harder, even though that information is often more important to an application's overall execution, especially when the code runs on a multi-CPU system yet fails to use all the available CPU. There are generally three approaches to this diagnosis. One is to use a profiler, since most profiling tools provide a timeline of thread execution that shows the points at which a thread was blocked.

Blocked thread and JFR

By far the best way to know when a thread blocks is to use a tool that can peer inside the JVM and determine, at a low level, when the blocking happened. Java Flight Recorder (JFR) is one such tool. We can drill into the events JFR captures and look for those that block a thread (such as waiting to acquire a monitor, or waiting for a socket read or, more rarely, a socket write).

These events can be easily viewed with the histogram panel of JMC, as shown in the following figure.

Threads in JFR that are blocked by a Monitor

In this example, the lock associated with the HashMap in the sun.awt.AppContext.get method was contended 163 times (over 66 seconds), adding an average of 31 milliseconds to the measured request response time. The stack trace indicates that the contention stems from the way a JSP writes a java.util.Date object. To improve the code's scalability, a thread-local date formatter could be used rather than simply calling the date object's toString method.

This process of selecting a blocking event from the histogram and examining the calling code works for any kind of blocking event; it is made possible by the tool's tight integration with the JVM.

Blocked threads and jstack

If a commercial JVM isn't available, one alternative is to take a large number of thread stack dumps from the program and examine them. jstack, jcmd, and other tools report the state of every thread in the virtual machine: whether it is running, waiting for a lock, waiting for I/O, and so on. That can be quite useful for determining what is going on in the application, though the output contains plenty that we don't need.

Two things should be kept in mind when looking at thread stacks. First, the JVM can dump a thread's stack only at certain locations (safepoints). Second, stacks are dumped one thread at a time, so it is possible to see conflicting information: two threads that appear to hold the same lock, or a thread that appears to be waiting on a lock no other thread holds.

Using jstack as a profiler

It is tempting to think that grabbing a series of stack dumps in rapid succession gives you a simple, quick profiler. After all, that is essentially how sampling profilers work: they periodically probe each thread's execution stack and infer how much time is spent in each method from those samples. But between safepoint bias and inconsistent snapshots, this isn't very effective; looking at thread stacks can sometimes give you a rough, high-level idea of which methods are expensive, but a real profiler provides far more accurate information.

What thread stacks do show well is the severity of thread blocking (since a blocked thread is already at a safepoint). If successive dumps show many threads blocked on a lock, you can conclude that the lock is heavily contended. If they show many threads blocked waiting on I/O, you can conclude that the I/O needs tuning (for example, if it is a database call, the SQL being executed should be tuned, or the database itself should be tuned).

The problem with jstack output is that it can change from release to release, making a robust parser difficult to write; there is no guarantee that a given parser will work, unmodified, with the particular JVM you use.

The basic output of the jstack parser looks like this:
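
A rough sketch of the shape of such aggregated output (the counts are taken from the discussion that follows; the exact format depends on the parser):

    Threads in Running state: 8
        (these happen to be taking the stack trace itself)
    Threads Blocked by Locks: 41
       41 blocked in
          com.sun.enterprise.loader.EJBClassLoader.getResourceAsStream
    Threads Waiting for notify: ...
    Threads Waiting for I/O read: ...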

The parser aggregates all the threads, showing how many are in each state. Eight threads are currently running (they happen to be taking the stack trace itself, which is quite expensive and best avoided).

Forty-one threads are blocked by a lock. The method reported is the first non-JDK method in the stack trace, which in this case is GlassFish's EJBClassLoader.getResourceAsStream method. The next step is to search the stack traces for that method and see what resource the threads are blocked on.

In this example, all the threads were blocked waiting to read the same JAR file, and their stack traces showed that the calls all came from instantiating new SAX parser instances. The SAX parser can be defined dynamically by listing a resource in the manifest of the application's JAR files, which means the JDK must search the entire classpath for those entries until it finds the one the application wants to use (or fails to find it and falls back to the system parser). Because reading the JAR file requires a synchronization lock, all the threads trying to create a parser end up contending for the same lock, and that can substantially affect the application's throughput. (This is why setting the -Djavax.xml.parsers.SAXParserFactory property to avoid those lookups is recommended.)

The bigger point is that large numbers of blocked threads signal a performance problem. Whatever the source of the blocking, the configuration or the application should be changed to avoid it.

What about threads waiting for notification? Those threads are waiting for something else to happen. They often sit in a pool waiting for notification that a task is ready (for example, the getTask() method in the output above is waiting for a request). System threads deal with things such as RMI distributed GC or JMX monitoring, and they appear in jstack output as threads whose stacks contain only JDK classes. None of these conditions necessarily indicates a performance problem; waiting for notification is their normal state.

Problems can show up, though, if a thread is in a blocking I/O read (usually in the socketRead0() method). This too hurts throughput: the thread is waiting for a back-end resource to answer its request. That is the time to check the performance of the database or other back-end resource.
