This article mainly introduces how to tune Java performance. Many readers have questions about this topic in their daily work, so the editor has gathered material and organized it into simple, practical methods, in the hope of answering those questions. Please follow along and study it together.
There are many potential bottlenecks in Java application performance, such as system factors like disk, memory, and network I/O, as well as the Java application code itself, JVM GC, the database, caching, and so on. Based on personal experience, the author divides Java performance optimization into four levels: the application layer, the database layer, the framework layer, and the JVM layer.
Hierarchical Model of Java performance Optimization
The optimization difficulty of each layer increases step by step, and the knowledge involved and the problems solved differ. For example, the application layer requires understanding the code logic and locating problematic lines of code through the Java thread stack; the database layer requires analyzing SQL and locating deadlocks; the framework layer requires understanding the source code and the framework's mechanisms; and the JVM layer requires an in-depth understanding of the GC types, how GC works, and the role of the various JVM parameters.
There are two basic analysis methods for Java performance optimization: on-site analysis and post-mortem analysis.
On-site analysis preserves the scene and then uses diagnostic tools to analyze and locate the problem. It has a large impact on the online service and is not appropriate in some scenarios, especially when users' critical online business is involved.
Post-mortem analysis collects as much on-site data as possible and then restores the service immediately, analyzing and reproducing the problem from the collected data afterwards. Below we share some cases and practices, starting with performance diagnosis tools.
Performance diagnosis tools
One kind of performance diagnosis targets systems and code that have already been identified as having performance problems; the other tests a system before it goes online to determine whether its performance meets the launch requirements.
This article focuses on the former, while the latter can be tested with various performance stress testing tools (such as JMeter), which is beyond the scope of this article.
For Java applications, performance diagnosis tools are mainly divided into two layers: OS level and Java application level (including application code diagnosis and GC diagnosis).
OS diagnosis
OS diagnosis is mainly concerned with three aspects: CPU, memory, and I/O.
2. CPU diagnosis
For CPU, the main focus is on the load average (Load Average), CPU utilization, and the number of context switches (Context Switch).
You can view the system load average and CPU utilization with the top command; the figure below shows the status of one system as reported by top.
Example of top command
The load average consists of three numbers, 63.66, 58.39, and 57.18, representing the machine's load over the past 1 minute, 5 minutes, and 15 minutes respectively. As a rule of thumb, if the value is below 0.7 times the number of CPU cores, the system is working normally; beyond that, and especially at four or five times the number of cores, the system load is clearly too high.
In the top example, the 15-minute load is as high as 57.18 and the 1-minute load is 63.66 (the system has 16 cores, so 0.7 * 16 ≈ 11), which shows that the system has a load problem that is still trending upward, and the specific cause needs to be located.
You can view the number of context switches for CPU through the vmstat command, as shown in the following figure:
Example of vmstat command
Context switching occurs mainly in the following scenarios:
1) the time slice runs out and the CPU schedules the next task as normal;
2) the task is preempted by another, higher-priority task;
3) the running task hits an I/O block, so the current task is suspended and the CPU switches to the next task;
4) user code voluntarily suspends the current task and yields the CPU;
5) multiple tasks compete for a shared resource and a task is suspended because it fails to acquire the lock;
6) a hardware interrupt occurs.
Java thread context switching mainly comes from competition for shared resources. In general, locking on a single object rarely becomes a system bottleneck unless the lock granularity is too coarse. However, in a frequently accessed code block that locks multiple objects in succession, a large number of context switches may occur and become the system's bottleneck.
For example, in one of our systems, log4j 1.x printed a large volume of logs under high concurrency, causing frequent context switching and a large number of blocked threads, which dramatically reduced system throughput. The related code is shown in Listing 1; upgrading to log4j 2.x solved the problem.
Listing 1. Related log4j 1.x code

    for (Category c = this; c != null; c = c.parent) {
        // Protected against simultaneous call to addAppender, removeAppender, ...
        synchronized (c) {
            if (c.aai != null) {
                writes += c.aai.appendLoopOnAppenders(event);
            }
            // ...
        }
    }

3. Memory
From the operating system's point of view, what matters for memory is whether the application process has enough of it; you can use the free -m command to see how much memory is in use.
The virtual memory (VIRT) and physical memory (RES) used by a process can be viewed with the top command, and the swap space used by a specific application can be derived from the formula VIRT = SWAP + RES. Using swap affects the performance of Java applications, since disk is far slower than memory, so the swappiness value should be kept as small as possible.
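As a reference, on a typical Linux system the swappiness value can be inspected and lowered with sysctl; the value 10 below is only an illustrative choice, and making it permanent requires an entry in /etc/sysctl.conf.

    sysctl vm.swappiness              # show the current value
    sysctl -w vm.swappiness=10        # lower it for the running system (illustrative value, requires root)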
IV. I/O
I/O includes disk I/O and network I/O. Generally speaking, disk is more prone to becoming an I/O bottleneck. You can check the disk's read and write status with iostat, and you can tell whether disk I/O is normal from the CPU's iowait.
If disk I/O stays in a very high state, it means the disk is too slow or is failing; it has become the performance bottleneck and requires application optimization or disk replacement.
In addition to the commonly used top, ps, vmstat, and iostat commands, other Linux tools can also diagnose system problems, such as mpstat, tcpdump, netstat, pidstat, sar, and so on. Brendan Gregg has summarized the performance diagnostic tools for the various Linux subsystems, as shown in the following figure, for reference.
Linux performance observation tool
5. Java application diagnosis and tools
Application code performance problems are a relatively easy class of performance problems to solve. With application-level monitoring and alerting, if you can determine the problematic function and code, you can locate it directly in the code; alternatively, top combined with jstack can find the problematic thread stack and trace it to the offending code. For more complex code paths with tangled logic, printing performance logs with a stopwatch can also locate performance problems in most application code.
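As a minimal sketch of the stopwatch idea (the threshold, the measured section, and the class name are illustrative, not taken from the original system), a plain System.nanoTime() timer is enough to produce such a performance log:

    import java.util.concurrent.TimeUnit;

    public class StopwatchDemo {
        public static void main(String[] args) throws InterruptedException {
            long start = System.nanoTime();
            TimeUnit.MILLISECONDS.sleep(120);   // stand-in for the code section being measured
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > 100) {              // only log slow executions to keep the log volume down
                System.err.println("suspect section took " + elapsedMs + " ms");
            }
        }
    }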
Commonly used Java application diagnosis includes thread, stack, GC and other aspects of diagnosis.
Jstack command
The jstack command is usually used together with top: locate the Java process and thread with top -H -p pid, then dump the thread stacks with jstack -l pid. Because thread stacks are transient, several dumps are needed, usually 3 dumps at intervals of about 5 seconds. Convert the thread pid found by top to hexadecimal to get the nid in the Java thread stack, and then find the corresponding problem thread's stack.
View long-running Java threads with top -H -p
As shown in the figure above, thread 24985 has a long running time and may be problematic. Converting it to hexadecimal gives 0x6199; searching the Java thread stack for that nid locates the corresponding problem thread's stack, as shown in the following figure.
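The decimal-to-hexadecimal conversion mentioned above can be done with printf or a calculator; as a tiny illustration (the thread id is the one from the figure), in Java it is simply:

    public class NidDemo {
        public static void main(String[] args) {
            int tid = 24985;                               // thread id reported by top -H -p <pid>
            System.out.println(Integer.toHexString(tid));  // prints "6199" -> nid=0x6199 in the jstack output
        }
    }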
Jstack View Thread Stack
JProfiler
JProfiler is a powerful tool that can analyze CPU, heap, and memory, as shown in the following figure. Combined with a load testing tool, it can also sample code execution time and produce statistics.
Memory analysis through JProfiler
VI. GC diagnosis
Java's GC removes the burden and risk of manual memory management from the programmer, but the application pauses caused by GC are another problem that has to be dealt with. The JDK provides a series of tools for locating GC problems, including jstat and jmap, plus third-party tools such as MAT.
Jstat
The jstat command prints GC details, Young GC and Full GC counts, heap information, and so on. The command format is jstat -gcxxx -t pid (for example, jstat -gcutil -t pid), as shown in the following figure.
Example of jstat command
Jmap
jmap -heap pid prints the heap information of a Java process. You can dump the heap to a file with jmap -dump:format=b,file=xxx pid and then analyze heap usage further with other tools.
MAT
MAT is a Java heap analysis tool that provides intuitive diagnostic reports. Its built-in OQL allows SQL-like queries on the heap, which is powerful, and its outgoing and incoming references make it possible to trace where object references come from.
MAT example
The figure above shows an example of using MAT. MAT shows two columns for object size: Shallow size and Retained size. The former is the memory occupied by the object itself, excluding the objects it references; the latter is the object's own Shallow size plus the Shallow sizes of all objects it references directly or indirectly, that is, the amount of memory the GC can reclaim once this object is collected. Generally speaking, it is the latter you should pay attention to.
For Java applications with large heaps (tens of GB), a large amount of memory is needed just to open the dump in MAT, and the local development machine usually does not have enough. It is therefore recommended to install a graphics environment and MAT on an offline server and open the dump remotely; alternatively, run the mat command to generate a heap index and copy the index locally, although the heap information available that way is limited.
To diagnose GC problems, it is recommended to add GC logging parameters such as -XX:+PrintGCDateStamps to the JVM options. Commonly used GC parameters are shown in the following figure.
Common GC parameters
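The figure itself is not reproduced here; as a hedged reference, a typical JDK 8 era combination of GC logging flags looks like the following (the log path and application jar are placeholders):

    java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/gc.log -jar app.jar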
For Java applications, top + jstack + jmap + MAT can locate most application and memory problems, making them essential tools. Sometimes Java application diagnosis needs to be cross-referenced with OS-level information, in which case more comprehensive diagnostic tools such as Zabbix (which integrates OS and JVM monitoring) can be used. In distributed environments, infrastructure such as a distributed tracing system also provides strong support for application performance diagnosis.
VII. Practice of performance optimization
After introducing some commonly used performance diagnosis tools, we will share cases from the JVM layer, the application code layer and the database layer based on our practice in Java application tuning.
JVM tuning: the pain of GC
When a system of the XX commercial platform was rebuilt, RMI was chosen as the internal remote call protocol. After the system went online, the service began to stop responding periodically, with pauses ranging from a few seconds to tens of seconds. Observation of the GC logs showed that a Full GC had occurred every hour since the service started. Because the system's heap was set large, each Full GC paused the application for a long time, which had a great impact on the online real-time service.
Analysis showed that there were no periodic Full GCs in the system before the refactoring, so a problem at the RMI framework level was suspected. Public information revealed that RMI's DGC (Distributed Garbage Collection) starts a daemon thread that periodically triggers a Full GC to collect remote objects; its daemon thread code is shown in Listing 2.
Listing 2. DGC daemon thread source code

    private static class Daemon extends Thread {
        public void run() {
            for (;;) {
                // ...
                long d = maxObjectInspectionAge();
                if (d >= l) {
                    System.gc();
                    d = 0;
                }
                // ...
            }
        }
    }
Once located, the problem is easier to solve. One approach is to disable explicit System.gc() calls entirely by adding the -XX:+DisableExplicitGC parameter; however, for systems that use NIO this carries a risk of off-heap (direct) memory overflow.
Another approach is to lengthen the Full GC interval by increasing the -Dsun.rmi.dgc.server.gcInterval and -Dsun.rmi.dgc.client.gcInterval parameters, and at the same time add -XX:+ExplicitGCInvokesConcurrent, which turns the fully Stop-The-World Full GC into a concurrent GC cycle, reducing the application pause time without affecting NIO.
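A hedged example of how these parameters might be combined on the launch command line (the one-day interval of 86400000 ms and the jar name are purely illustrative; pick values that suit the business):

    java -Dsun.rmi.dgc.client.gcInterval=86400000 \
         -Dsun.rmi.dgc.server.gcInterval=86400000 \
         -XX:+ExplicitGCInvokesConcurrent \
         -jar app.jar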
As can be seen from the figure below, after the adjustment the number of Full GCs decreased significantly from March onward.
Full GC monitoring statistics
For applications with high concurrency and large volumes of data exchange, GC tuning is still necessary; in particular, the default JVM parameters usually do not meet business requirements and need dedicated tuning. There is plenty of public material on interpreting GC logs, so this article will not repeat it.
There are basically three ideas behind GC tuning goals: (1) reduce GC frequency, by enlarging the heap and reducing the creation of unnecessary objects; (2) reduce GC pause time, by shrinking the heap and using the CMS GC algorithm; (3) avoid Full GC, by adjusting the CMS trigger ratio, avoiding Promotion Failure and Concurrent mode failure (allocating more old-generation space, increasing the number of GC threads to speed up collection), reducing the creation of large objects, and so on.
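As a sketch of parameters matching these ideas (all values are illustrative and assume a JDK 8 era CMS setup, not the exact configuration of the systems discussed here):

    java -Xms4g -Xmx4g \
         -XX:+UseConcMarkSweepGC \
         -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
         -XX:ParallelGCThreads=8 -XX:ConcGCThreads=4 \
         -jar app.jar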
Application layer tuning: sniffing out code smells
Starting with application-layer code tuning and analyzing the root causes of inefficient code is undoubtedly one of the good ways to improve Java application performance.
After a routine release of a commercial advertising system (which uses Nginx for load balancing), the load on several machines rose sharply and CPU utilization was quickly saturated. We rolled back online as an emergency measure and preserved the scene on one of the servers with jmap and jstack.
Analyzing the preserved scene with MAT
The result is shown in the figure above. MAT's analysis of the dump shows that the objects occupying the most memory are byte[] and java.util.HashMap$Entry, and that the java.util.HashMap$Entry objects reference each other in a cycle. The preliminary conclusion is that an infinite loop occurred during a HashMap put (in the figure, the next references of the java.util.HashMap$Entry objects 0x2add6d992cb8 and 0x2add6d992ce8 form a loop).
Consulting the relevant documentation confirmed that this is a typical concurrent-use error.
Simply put, HashMap is not designed for multi-threaded concurrency. When multiple threads put at the same time, resizing of the internal array can cause the HashMap's internal linked list to form a circular structure, resulting in an infinite loop.
The biggest change in this release was to improve system performance by caching website data in memory, using a lazy loading mechanism, as shown in Listing 3.
Listing 3. Website data lazy loading code
    private static Map domainMap = new HashMap();

    private boolean isResetDomains() {
        if (CollectionUtils.isEmpty(domainMap)) {
            // get website details from the remote http interface
            List newDomains = unionDomainHttpClient.queryAllUnionDomain();
            if (CollectionUtils.isEmpty(domainMap)) {
                domainMap = new HashMap();
                for (UnionDomain domain : newDomains) {
                    if (domain != null) {
                        domainMap.put(domain.getSubdomainId(), domain);
                    }
                }
                return true;
            }
        }
        return false;
    }
You can see that domainMap here is a static shared resource of type HashMap; under multi-threaded access its internal linked list can form a circular structure, causing an infinite loop.
From the connection and access logs of the front-end Nginx, we could see that after the system restarted, the large number of user requests that had queued up in Nginx poured into the application as soon as the Resin container started; multiple requests triggered the initialization of the website data at the same time, which produced the HashMap concurrency problem. Once the cause of the failure was located, the solutions were relatively simple. The main ones are:
(1) use ConcurrentHashMap or a synchronized block to resolve the concurrency problem above (a sketch of this option follows); (2) load the website cache fully before system startup and remove the lazy loading; (3) replace the local cache with a distributed cache.
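A minimal sketch of option (1), reusing the UnionDomain and unionDomainHttpClient names from Listing 3 and assuming getSubdomainId() returns a Long key (the surrounding class and method names are illustrative):

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class DomainCache {
        // ConcurrentHashMap tolerates concurrent puts, unlike the plain HashMap in Listing 3
        private static final Map<Long, UnionDomain> domainMap = new ConcurrentHashMap<>();
        private final UnionDomainHttpClient unionDomainHttpClient;

        public DomainCache(UnionDomainHttpClient client) {
            this.unionDomainHttpClient = client;
        }

        // synchronized also keeps several threads from rebuilding the cache at the same time
        public synchronized boolean resetDomainsIfEmpty() {
            if (!domainMap.isEmpty()) {
                return false;
            }
            List<UnionDomain> newDomains = unionDomainHttpClient.queryAllUnionDomain();
            for (UnionDomain domain : newDomains) {
                if (domain != null) {
                    domainMap.put(domain.getSubdomainId(), domain);
                }
            }
            return true;
        }
    }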
For locating bad code, besides code review in the ordinary sense, tools such as MAT can to some extent quickly locate system performance bottlenecks. However, in cases tied to particular scenarios or business data, auxiliary means such as code walkthroughs, performance testing tools, data simulation, and even diverting live traffic are needed to finally confirm the source of the performance problem. Here are some possible characteristics of bad code that we have summarized, for reference:
(1) poor code readability and no basic programming conventions; (2) too many objects created, or large objects created, memory leaks, etc.; (3) too many IO stream operations, or streams that are never closed; (4) too many database operations, or transactions that run too long; (5) incorrect use of synchronization; (6) time-consuming operations inside loop iterations.
Database layer tuning: deadlock nightmare
For most Java applications, interacting with a database is a very common scenario, and for OLTP applications with strict data-consistency requirements the database's performance directly affects the performance of the whole application. As the platform on which advertisers publish and deliver ads, the Sogou commercial platform system has high requirements for the real-time behavior and consistency of its materials, and we have accumulated some experience in optimizing relational databases.
For the ad material library, a high operation frequency (especially via batch material tools) can easily cause database deadlocks. One of the more typical scenarios is adjusting ad material prices. Customers often adjust material bids frequently, which indirectly puts heavy load on the database and increases the likelihood of deadlock. The following case study looks at bid adjustment for ad materials in one advertising system on the Sogou commercial platform.
One day, a commercial advertising system saw a sudden surge in traffic, which caused the system load to rise and the database to deadlock frequently, as shown in the following figure.
Deadlock statement
The groupdomain table had three single-column indexes: idx_groupdomain_accountid (accountid), idx_groupdomain_groupid (groupid), and primary (groupdomainid), using the MySQL InnoDB engine.
The scenario occurs when a group bid is updated; the entities involved are groups, group industries (the groupindus table), and group websites (the groupdomain table).
When a group bid is updated, if a group industry bid follows the group bid (marked by isusegroupprice; if it is 1, the group bid is used), the group industry bid must also be updated. Likewise, if a group website bid follows the group industry bid (marked by isuseindusprice; if it is 1, the group industry bid is used), the group website bid must be updated at the same time. Since each group can have up to 3,000 websites, the relevant records stay locked for a long time while a group bid is being updated.
From the deadlock above, you can see that both transaction 1 and transaction 2 chose the idx_groupdomain_accountid single-column index. According to the locking behavior of MySQL's InnoDB engine, only one index is chosen for use in a transaction, and once a secondary index is used for locking, InnoDB then tries to lock the corresponding primary key index entries. Further analysis shows that transaction 1 was waiting for the lock on the idx_groupdomain_accountid secondary index (lock range "space id 5726 page no 8658 n bits 824 index") held by transaction 2, while transaction 2, which had already acquired that secondary-index lock, was waiting for the lock on the PRIMARY key index. Because transaction 2 took too long to execute or did not release its locks, transaction 1 was eventually rolled back.
Tracking the access log for that day showed that a customer had launched a large number of script-driven operations modifying promotion-group bids, causing a large number of transactions to loop while waiting for earlier transactions to release the locked PRIMARY key index. The root of the problem lies in how the MySQL InnoDB engine uses indexes for locking, a limitation that is less prominent in Oracle databases.
The natural way to solve the problem is to have a single transaction lock as few records as possible, which greatly reduces the probability of deadlock. In the end a composite index (accountid, groupid) was used, which reduced the number of records locked by a single transaction and isolated the extended group data records under different plans, greatly reducing the probability of this kind of deadlock.
Generally speaking, the tuning of the database layer basically starts from the following aspects:
(1) optimize at the SQL statement level: slow SQL analysis, index analysis and tuning, transaction splitting, etc.;
(2) optimize at the database configuration level: for example, field design, cache size tuning, database parameter optimization for disk I/O, data defragmentation, etc.;
(3) optimize the database structure: consider vertical and horizontal partitioning of the database;
(4) choose an appropriate database engine or database type for the scenario, for example by considering the introduction of NoSQL.
VIII. Summary and suggestions
Performance tuning also follows the 80/20 principle: 80% of performance problems are caused by 20% of the code, so optimizing the key code pays off disproportionately. At the same time, optimize on demand; excessive optimization may introduce more problems. For Java performance optimization, you not only need to understand the system architecture and application code, you also need to pay attention to the JVM layer and even the operating system underneath. In summary, the work can be approached from the following points:
1) tuning of basic performance
Basic performance here refers to upgrades and optimization at the hardware or operating-system level, such as network tuning, operating system version upgrades, and hardware optimization. For example, introducing an F5 load balancer or SSD disks, or the NIO-related improvements in newer Linux versions, can all markedly improve application performance.
2) performance optimization of database
This includes common measures such as transaction splitting, index tuning, SQL optimization, and the introduction of NoSQL: for example, introducing asynchronous processing during transaction splitting and settling for eventual consistency, or introducing various NoSQL databases for specific scenarios, which can greatly alleviate the weaknesses of traditional databases under high concurrency.
3) Application architecture optimization
Introduce new computing or storage frameworks and use their new features to remove the computing bottlenecks of the original cluster, or introduce distributed strategies to scale computing and storage horizontally, including precomputation and other typical space-for-time techniques; these can reduce the system load to a certain extent.
4) Optimization at the business level
Technology is not the only means of improving system performance. In many scenarios with performance problems, a large part of the cause is a particular business scenario; if the business can be avoided or adjusted, that is often the most effective approach.
This concludes the study of how to tune Java performance. Hopefully it resolves your doubts; combining theory with practice is the best way to learn, so go and try it out. If you want to keep learning more related knowledge, please continue to follow this site, where the editor will keep bringing you more practical articles.