

How Java and Netty Achieve High Performance and High Concurrency


This article shares how Java and Netty achieve high performance and high concurrency. It is very practical, so I share it here for your reference; I hope you get something out of it.

1. Background

1.1. Amazing performance data

Recently, a friend in the circle told me in a private message that by using Netty4 + Thrift compressed binary codec technology, they achieved 10W (100,000) TPS (with 1K complex POJO objects) for cross-node remote service invocation. Compared with a traditional communication framework based on Java serialization + BIO (synchronous blocking IO), performance improved by more than 8 times.

In fact, this figure does not surprise me. Based on my more than 5 years of NIO programming experience, it is entirely possible to reach this performance target by choosing a suitable NIO framework, pairing it with a high-performance compressed binary codec, and carefully designing the Reactor thread model.

Let's take a look at how Netty supports cross-node remote service invocation of 10W TPS. Before we start, let's briefly introduce Netty.

1.2. Introduction to the basics of Netty

Netty is a high-performance, asynchronous, event-driven NIO framework that supports TCP, UDP and file transfer. As an asynchronous NIO framework, all of Netty's IO operations are asynchronous and non-blocking; through the Future-Listener mechanism, users can conveniently obtain the result of an IO operation either actively or through notification.
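As a quick illustration, here is a minimal sketch of the Future-Listener style (Netty 4.x API; the channel and message are assumed to be supplied by the surrounding application):

```java
// Minimal sketch of Netty's Future-Listener mechanism (Netty 4.x).
// The write returns immediately; the listener is called back when the
// asynchronous IO operation completes.
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;

public final class FutureListenerSketch {
    static void writeAsync(Channel channel, Object msg) {
        ChannelFuture future = channel.writeAndFlush(msg);   // non-blocking
        future.addListener((ChannelFutureListener) f -> {    // notified on completion
            if (f.isSuccess()) {
                System.out.println("write completed");
            } else {
                f.cause().printStackTrace();                 // inspect the failure
            }
        });
        // Alternatively, future.sync() blocks until completion (the "active" style).
    }
}
```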

As the most popular NIO framework, Netty has been widely used in the Internet field, big data distributed computing field, game industry, communication industry and so on. Some well-known open source components in the industry are also based on Netty's NIO framework.

2. Netty's way of high performance

2.1. Performance model analysis of RPC calls

2.1.1. Three deadly sins of traditional RPC calls' poor performance

Network transmission problem: traditional RPC frameworks, or remote service (procedure) calls based on RMI, use synchronous blocking IO. When the client's concurrency or network latency increases, synchronous blocking IO causes IO threads to block frequently in waits; since the threads cannot work efficiently, IO processing capacity naturally declines.

Next, let's take a look at the disadvantages of BIO communication through the BIO communication model diagram:

Figure 2-1 BIO communication model diagram

In a server using the BIO communication model, an independent acceptor thread usually listens for client connections. After accepting a connection, it creates a new thread to process the request message for that connection; once processing completes, the reply is returned to the client and the thread is destroyed. This is the typical one-request-one-reply model. The biggest problem of this architecture is that it cannot scale elastically: as concurrent access grows, the number of server threads grows linearly with the number of concurrent connections. Since threads are a very precious system resource of the Java virtual machine, when the thread count balloons, system performance drops sharply; as concurrency keeps rising, problems such as handle overflow and thread stack overflow may occur, eventually bringing the server down.
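For illustration, here is a minimal sketch of the BIO model just described, assuming a simple line-based echo protocol: one acceptor loop, one new thread per accepted connection:

```java
// Minimal BIO server sketch: accept() blocks, and every connection
// consumes a dedicated thread for its whole lifetime.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public final class BioEchoServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket acceptor = new ServerSocket(8080)) {
            while (true) {
                Socket socket = acceptor.accept();          // blocks until a client connects
                new Thread(() -> handle(socket)).start();   // one thread per connection
            }
        }
    }

    private static void handle(Socket socket) {
        try (Socket s = socket;
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null) {        // blocks on every read
                out.println(line);                          // one-request-one-reply
            }
        } catch (Exception ignored) {
            // connection torn down; the thread dies with it
        }
    }
}
```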

Serialization problem: Java serialization has several typical problems:

1) The Java serialization mechanism is an object codec technology internal to Java and cannot be used across languages. For example, when interfacing with heterogeneous systems, the Java-serialized byte stream would need to be deserializable into the original object (a copy) in another language, which is currently difficult to support.

2) Compared with other open source serialization frameworks, Java's serialized byte stream is too large; whether it is transmitted over the network or persisted to disk, it causes extra resource consumption.

3) Poor serialization performance (high CPU consumption).

Thread model problem: because synchronous blocking IO is used, each TCP connection occupies one thread. Since threads are a very precious JVM resource, when IO reads and writes block and threads cannot be released in time, system performance drops sharply, and the JVM may even become unable to create new threads.

2.1.2. Three themes of high performance

1) Transport: whether BIO, NIO or AIO is used as the IO model, i.e., the channel through which data is sent to the other side, largely determines the framework's performance.

2) Protocol: which communication protocol is used, HTTP or an internal private protocol. Different protocol choices give different performance models; an internal private protocol can usually be designed to perform better than a public protocol.

3) Thread: how are datagrams read? In which thread are they encoded and decoded after reading, and how are the decoded messages dispatched? The choice of Reactor thread model also has a great impact on performance.

Figure 2-2 three elements of RPC call performance

2.2. The way of Netty high performance

2.2.1. Asynchronous non-blocking communication

In IO programming, when multiple client access requests need to be handled at the same time, either multithreading or IO multiplexing can be used. IO multiplexing multiplexes the blocking of multiple IO operations onto the blocking of a single select, so that the system can process multiple client requests in a single thread. Compared with the traditional multi-thread/multi-process model, the biggest advantage of I/O multiplexing is its low system overhead: the system does not need to create additional processes or threads, nor maintain their execution, which reduces maintenance work and saves system resources.

JDK 1.4 added support for non-blocking IO (NIO), and JDK 1.5_update10 replaced the traditional select/poll with epoll, greatly improving NIO communication performance.

The JDK NIO communication model is as follows:

Figure 2-3 Multiplexing model diagram of NIO

Corresponding to the Socket and ServerSocket classes, NIO provides the SocketChannel and ServerSocketChannel implementations. Both new channels support blocking and non-blocking modes. Blocking mode is very simple to use, but its performance and reliability are poor; non-blocking mode is just the opposite. Developers can choose the appropriate mode according to their needs: in general, low-load, low-concurrency applications can use synchronous blocking IO to reduce programming complexity, while high-load, high-concurrency network applications should be developed with NIO's non-blocking mode.
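As a sketch of the multiplexing model in Figure 2-3, here is a minimal single-threaded JDK NIO echo server; error handling is trimmed for brevity:

```java
// Minimal JDK NIO multiplexing sketch: one Selector, one thread, many channels.
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public final class NioEchoServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);                       // non-blocking mode
        server.bind(new InetSocketAddress(8080));
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                                 // one blocking point for all channels
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    int n = client.read(buf);                  // returns immediately
                    if (n < 0) { client.close(); continue; }
                    buf.flip();
                    client.write(buf);                         // echo back
                }
            }
        }
    }
}
```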

The Netty architecture is designed and implemented according to the Reactor pattern, and its server communication sequence diagram is as follows:

Figure 2-3 NIO server communication sequence diagram

The client communication sequence diagram is as follows:

Figure 2-4 NIO client communication sequence diagram

Because it aggregates the multiplexer Selector, a single Netty IO thread, NioEventLoop, can concurrently process hundreds of client Channels. Since read and write operations are non-blocking, this fully utilizes the IO thread and avoids the thread suspension caused by frequent IO blocking. In addition, because Netty uses an asynchronous communication mode, one IO thread can concurrently handle N client connections and their read/write operations, which fundamentally solves the one-connection-one-thread model of traditional synchronous blocking IO; the architecture's performance, elastic scalability and reliability are all greatly improved.

2.2.2. Zero copy

Many users have heard that Netty has a "zero copy" feature, but it is less clear where exactly it shows up. This section explains Netty's "zero copy" in detail.

The "zero copy" of Netty is mainly reflected in the following three aspects:

1) Netty's receive and send ByteBuffers use DIRECT BUFFERS, i.e., off-heap direct memory for Socket reads and writes, so no second copy of the byte buffer is needed. If traditional heap memory (HEAP BUFFERS) were used for Socket reads and writes, the JVM would copy the heap Buffer into direct memory before writing to the Socket; compared with off-heap direct memory, the message incurs one extra buffer copy during sending.

2) Netty provides the composite Buffer object, which can aggregate multiple ByteBuf objects. Users can operate on the combined Buffer as conveniently as on a single Buffer, avoiding the traditional approach of merging several small Buffers into one large Buffer via memory copies.

3) Netty's file transfer uses the transferTo method, which sends the data in the file buffer directly to the target Channel, avoiding the memory copies caused by the traditional write loop.

Next, let's walk through these three "zero copies". First, the creation of Netty's receive Buffer:

Figure 2-5 Asynchronous message read "Zero copy"

Each time a message is read in the loop, a ByteBuf object is obtained through the ioBuffer method of ByteBufAllocator. Let's continue with its API definition:

Figure 2-6 ByteBufAllocator allocates out-of-heap memory through ioBuffer

When reading and writing Socket IO, to avoid copying from heap memory to direct memory, Netty's ByteBuf allocator creates off-heap memory directly, avoiding a second copy of the buffer and using "zero copy" to improve read and write performance.
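A minimal sketch of this allocation path through the public allocator API (ioBuffer prefers direct memory where the platform supports it):

```java
// Minimal sketch: ioBuffer() hands out an off-heap (direct) buffer when the
// platform allows it, so Socket reads and writes skip the heap-to-direct copy.
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

public final class IoBufferSketch {
    public static void main(String[] args) {
        ByteBuf buf = ByteBufAllocator.DEFAULT.ioBuffer(1024); // direct memory if supported
        try {
            System.out.println("direct buffer: " + buf.isDirect());
            buf.writeBytes("hello".getBytes());
        } finally {
            buf.release();   // direct buffers must be released explicitly
        }
    }
}
```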

Let's move on to the second "zero copy" implementation, CompositeByteBuf, which encapsulates multiple ByteBufs into a single ByteBuf and exposes a uniform ByteBuf interface. Its class definition is as follows:

Figure 2-7 CompositeByteBuf class inheritance relationship

From the inheritance relationship, we can see that CompositeByteBuf is actually a wrapper for ByteBuf. It combines multiple ByteBuf into a collection, and then provides a unified ByteBuf interface. The relevant definitions are as follows:

Figure 2-8 CompositeByteBuf class definition

Adding a ByteBuf requires no memory copy. The related code is as follows:

Figure 2-9 add "zero copy" of ByteBuf
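Since the original code figures are screenshots, here is a minimal sketch of the same idea through the public API (Netty 4.1 assumed): a header and a body are combined without copying either buffer:

```java
// Minimal sketch of the second "zero copy": the composite references its
// component buffers instead of copying them into one large buffer.
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.CharsetUtil;

public final class CompositeSketch {
    public static void main(String[] args) {
        ByteBuf header = Unpooled.copiedBuffer("HEADER|".getBytes());
        ByteBuf body   = Unpooled.copiedBuffer("payload".getBytes());

        CompositeByteBuf message = Unpooled.compositeBuffer();
        // true = advance the writer index; the components are referenced, not copied
        message.addComponents(true, header, body);

        System.out.println(message.toString(CharsetUtil.UTF_8));
        message.release();   // releasing the composite releases its components
    }
}
```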

Finally, let's look at the "zero copy" of the file transfer:

Figure 2-10 "zero copy" of file transfer

Netty's file-transfer class DefaultFileRegion sends a file to the target Channel through the transferTo method. Let's focus on the transferTo method of FileChannel; its API DOC description is as follows:

Figure 2-11 "zero copy" of file transfer

On many operating systems, transferTo sends the contents of the file buffer directly to the target Channel without an intermediate copy; this is a more efficient way of transmission and is how the "zero copy" of file transfer is realized.
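A minimal sketch of this transfer path using the JDK API directly; the path and target channel are assumed to be supplied by the caller:

```java
// Minimal sketch: transferTo hands the file contents to the target channel
// without pulling them into user space (on operating systems that support it).
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class TransferToSketch {
    static void sendFile(Path path, WritableByteChannel target) throws Exception {
        try (FileChannel file = FileChannel.open(path, StandardOpenOption.READ)) {
            long position = 0, size = file.size();
            while (position < size) {
                // transferTo may send fewer bytes than requested, so loop
                position += file.transferTo(position, size - position, target);
            }
        }
    }
}
```

Inside a Netty handler, the same idea is expressed by writing a DefaultFileRegion to the Channel, which delegates to transferTo under the hood.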

2.2.3. Memory pool

With the evolution of the JVM and JIT just-in-time compilation, object allocation and collection have become very lightweight tasks. For buffers, though, the situation is slightly different; in particular, allocating and reclaiming off-heap direct memory is time-consuming. To reuse buffers as much as possible, Netty provides a buffer reuse mechanism based on memory pools. Let's take a look at the implementation of Netty's ByteBuf:

Figure 2-12 memory Pool ByteBuf

Netty provides a variety of memory management strategies, which can be customized differently by configuring relevant parameters in the startup helper class.

Through the performance test below, let's look at the performance difference between memory-pool-based ByteBuf and ordinary ByteBuf.

Use case one: create a direct memory buffer with the memory-pool allocator:

Figure 2-13 Test case of non-heap memory buffer based on memory pool

Use case two: create a direct memory buffer with a non-pooled allocator:

Figure 2-14 non-heap memory buffer test case based on non-memory pool

Each use case executes 3 million times; the performance comparison results are as follows:

Figure 2-15 comparison of write performance between memory pool and non-memory pool buffers

The performance tests show that pooled ByteBuf performs roughly 23 times better than non-pooled ByteBuf (the exact figure is strongly scenario-dependent).
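The test-case figures above are screenshots, so here is a minimal sketch of an equivalent comparison; absolute numbers depend heavily on the environment, and a serious measurement should use a benchmark harness such as JMH:

```java
// Minimal sketch: allocate and release a direct buffer repeatedly with the
// pooled and the unpooled allocator, and compare wall-clock time.
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.UnpooledByteBufAllocator;

public final class PoolBenchmarkSketch {
    private static final int LOOPS = 3_000_000;

    static long runMillis(ByteBufAllocator allocator) {
        long start = System.nanoTime();
        for (int i = 0; i < LOOPS; i++) {
            ByteBuf buf = allocator.directBuffer(1024);
            buf.writeLong(i);
            buf.release();   // a pooled buffer goes back to the pool here
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("pooled:   " + runMillis(PooledByteBufAllocator.DEFAULT) + " ms");
        System.out.println("unpooled: " + runMillis(new UnpooledByteBufAllocator(true)) + " ms");
    }
}
```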

Let's briefly analyze the memory allocation of the Netty memory pool:

Figure 2-16 buffer allocation for AbstractByteBufAllocator

Moving on to the newDirectBuffer method, we find that it is an abstract method, implemented by subclasses of AbstractByteBufAllocator. The code is as follows:

Figure 2-17 different implementations of newDirectBuffer

The code jumps to the newDirectBuffer method of PooledByteBufAllocator, which gets the memory region PoolArena from the Cache and calls its allocate method to allocate memory:

Figure 2-18 memory allocation for PooledByteBufAllocator

The allocate method of PoolArena is as follows:

Figure 2-18 buffer allocation for PoolArena

We focus on the implementation of newByteBuf, which is also an abstract method; the different buffer types are allocated by the subclasses DirectArena and HeapArena. Because the test case uses off-heap memory:

Figure 2-19 newByteBuf abstraction method for PoolArena

we therefore focus on the DirectArena implementation: if sun.misc.Unsafe is not enabled, then:

Figure 2-20 newByteBuf implementation of DirectArena

it executes the newInstance method of PooledDirectByteBuf, with the following code:

Figure 2-21 newInstance implementation of PooledDirectByteBuf

The ByteBuf object is reused through the get method of the RECYCLER; in a non-pooled implementation, a new ByteBuf object is created directly instead. After the ByteBuf is obtained from the buffer pool, the setRefCnt method of AbstractReferenceCountedByteBuf sets the reference counter used for reference counting and object reclamation (similar in spirit to JVM garbage collection).
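A minimal sketch of the reference-counting behavior described here, using the public API:

```java
// Minimal sketch: a pooled ByteBuf is leased with refCnt() == 1 and is
// recycled into the pool when the count drops to zero.
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public final class RefCountSketch {
    public static void main(String[] args) {
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(256);
        System.out.println(buf.refCnt());   // 1: freshly leased from the pool
        buf.retain();                       // another owner, count -> 2
        buf.release();                      // count -> 1, still usable
        buf.release();                      // count -> 0, recycled into the pool
    }
}
```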

2.2.4. An efficient Reactor threading model

There are three commonly used Reactor threading models, which are as follows:

1) Reactor single thread model

2) Reactor multithreading model

3) Master-slave Reactor multithreading model

The Reactor single-thread model means that all IO operations are performed on the same NIO thread. The responsibilities of the NIO thread are as follows:

1) As the NIO server, accept TCP connections from clients

2) As the NIO client, initiate TCP connections to the server

3) Read request or reply messages from the communication peer

4) Send request or reply messages to the communication peer.

The schematic diagram of the Reactor single-threaded model is as follows:

Figure 2-22 Reactor single-threaded model

Because the Reactor pattern uses asynchronous non-blocking IO, no IO operation causes blocking, so in theory one thread can handle all IO-related operations independently. From an architectural point of view, one NIO thread can indeed fulfill these responsibilities: the Acceptor receives the client's TCP connection request, and after the link is established, the Dispatcher routes the corresponding ByteBuffer to the designated Handler for message decoding; the user Handler can then send messages back to the client through the same NIO thread.

The single-threaded model can be used in some small-capacity scenarios, but it is not suitable for high-load, high-concurrency applications, mainly for the following reasons:

1) One NIO thread handling hundreds or thousands of links at the same time cannot keep up performance-wise. Even if the NIO thread's CPU load reaches 100%, it cannot satisfy the encoding, decoding, reading and sending of massive numbers of messages.

2) When the NIO thread is overloaded, processing slows down, causing large numbers of client connections to time out. Timeouts often trigger retransmissions, which further increase the NIO thread's load, eventually leading to massive message backlogs and processing timeouts; the NIO thread becomes the system's performance bottleneck.

3) Reliability problem: once the NIO thread dies unexpectedly or enters an infinite loop, the entire system's communication module becomes unavailable and cannot receive or process external messages, causing node failure.

To solve these problems, the Reactor multithreading model evolved. Let's look at it next.

The biggest difference between the Reactor multithreaded model and the single-threaded model is that a pool of NIO threads handles IO operations. Its schematic diagram is as follows:

Figure 2-23 Reactor multithreading model

The characteristics of the Reactor multithreading model:

1) There is a dedicated NIO thread, the Acceptor thread, that listens on the server port and receives TCP connection requests from clients

2) Network IO operations (read, write, etc.) are handled by an NIO thread pool, which can be implemented with a standard JDK thread pool containing a task queue and N available threads; these threads are responsible for reading, decoding, encoding and sending messages.

3) One NIO thread can handle N links at the same time, but each link corresponds to only one NIO thread, which prevents concurrent operation problems.

In most scenarios, the Reactor multithreading model meets the performance requirements. However, in very special scenarios, a single NIO thread responsible for listening to and processing all client connections can run into performance problems: for example, millions of clients connecting concurrently, or the server needing to perform security authentication of the clients' handshake messages, which is itself expensive. To solve this performance problem, a third Reactor thread model was produced: the master-slave Reactor multithreading model.

The characteristic of the master-slave Reactor thread model is that the server no longer uses a single NIO thread to accept client connections, but an independent NIO thread pool. After the Acceptor finishes processing the client's TCP connection request (possibly including access authentication), it registers the newly created SocketChannel with an IO thread in the sub-reactor thread pool, which then takes responsibility for that SocketChannel's reading, writing, encoding and decoding. The Acceptor thread pool is used only for client login, handshake and security authentication; once the link is established, it is registered with an IO thread of the back-end sub-reactor pool, which performs all subsequent IO operations.

Its threading model is shown below:

Figure 2-24 Reactor master-slave multithreading model

With the master-slave NIO thread model, the problem that one server-side listening thread cannot effectively handle all client connections is solved. Netty's official demos therefore recommend this threading model.

In fact, Netty's threading model is not fixed: by creating different EventLoopGroup instances in the startup helper class and configuring the appropriate parameters, all three Reactor threading models can be supported. It is precisely this flexible customization of the Reactor threading model that allows Netty to meet the performance demands of different business scenarios.
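A minimal sketch of how the three models map onto EventLoopGroup configuration (Netty 4.x; handlers omitted, and note that with a single listening port only one acceptor thread is actually exercised):

```java
// Minimal sketch: the Reactor model is chosen by how the EventLoopGroups
// are sized and wired into the ServerBootstrap.
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public final class ReactorModelSketch {
    // 1) Single-thread model: the same single-threaded group accepts and does all IO
    static ServerBootstrap singleThreaded() {
        EventLoopGroup group = new NioEventLoopGroup(1);
        return new ServerBootstrap()
                .group(group, group)
                .channel(NioServerSocketChannel.class);
    }

    // 2) Multi-thread model: one acceptor thread, a pool of IO threads
    static ServerBootstrap multiThreaded() {
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup workers = new NioEventLoopGroup();   // defaults to 2 * cores
        return new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class);
    }

    // 3) Master-slave model: a pool of acceptor threads, a pool of IO threads
    static ServerBootstrap masterSlave() {
        EventLoopGroup bosses = new NioEventLoopGroup(4);
        EventLoopGroup workers = new NioEventLoopGroup();
        return new ServerBootstrap()
                .group(bosses, workers)
                .channel(NioServerSocketChannel.class);
    }
}
```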

2.2.5. Lock-free serial design concept

In most scenarios, parallel multithreading improves a system's concurrent performance. However, if concurrent access to shared resources is handled improperly, it leads to severe lock contention, which ultimately degrades performance. To avoid the performance loss of lock contention as much as possible, a serial design can be used: handle a message in the same thread all the way through, with no thread switching, thereby avoiding multithreaded contention and synchronization locks.

To maximize performance, Netty adopts a serial lock-free design: IO operations are performed serially within the IO thread, avoiding the performance degradation caused by multithreaded contention. On the surface, a serial design seems to under-utilize the CPU and lack concurrency; however, by tuning the parameters of the NIO thread pool, multiple serialized threads can run in parallel at the same time. This locally lock-free serial thread design performs better than the one-queue-multiple-worker-threads model.

The working schematic diagram of the serial design of Netty is as follows:

Figure 2-25 schematic diagram of Netty serialization

After Netty's NioEventLoop reads a message, it directly calls fireChannelRead(Object msg) on the ChannelPipeline. As long as the user does not actively switch threads, the NioEventLoop calls all the way through to the user's Handler with no thread switching. This serial approach avoids the lock contention of multithreaded operation and is optimal from a performance point of view.
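A minimal sketch of what this guarantee means for handler code: per-connection state needs no synchronization, because every callback for a given Channel runs on that Channel's own EventLoop thread:

```java
// Minimal sketch: decode, business logic and write all stay on one thread,
// so this per-connection counter needs no locks or atomics.
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public final class SerialHandler extends ChannelInboundHandlerAdapter {
    private long bytesSeen;   // per-connection state, no synchronization needed

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        // Always true here: callbacks for this Channel run on its EventLoop
        assert ctx.executor().inEventLoop();
        bytesSeen += ((ByteBuf) msg).readableBytes();
        ctx.fireChannelRead(msg);   // pass along the pipeline, still the same thread
    }
}
```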

2.2.6. Efficient concurrent programming

The efficient concurrent programming of Netty is mainly reflected in the following points:

1) Extensive and correct use of volatile

2) Widespread use of CAS and atomic classes

3) Use of thread-safe containers

4) Improving concurrency performance through read-write locks.

If you want to know the details of Netty's efficient concurrent programming, you can read "Application Analysis of Multithreaded Concurrent Programming in Netty", which I shared on Weibo earlier; it introduces and analyzes Netty's multithreading techniques and their application in detail.

2.2.7. High performance serialization framework

The key factors that affect serialization performance are summarized as follows:

1) Size of the serialized stream (network bandwidth usage)

2) Performance of serialization and deserialization (CPU consumption)

3) Cross-language support (interfacing with heterogeneous systems and switching development languages).

Netty provides support for Google Protobuf by default. By extending Netty's codec interface, users can implement other high-performance serialization frameworks, such as Thrift's compressed binary codec framework.
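For example, wiring Netty's built-in Protobuf codec into a pipeline might look like the following sketch; MyProto.Request stands in for a generated protobuf message class and is an assumption of this example:

```java
// Minimal sketch of Netty's built-in Protobuf codec in a channel pipeline.
// MyProto.Request is a hypothetical protoc-generated message class.
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.protobuf.ProtobufDecoder;
import io.netty.handler.codec.protobuf.ProtobufEncoder;
import io.netty.handler.codec.protobuf.ProtobufVarint32FrameDecoder;
import io.netty.handler.codec.protobuf.ProtobufVarint32LengthFieldPrepender;

public final class ProtobufInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
          .addLast(new ProtobufVarint32FrameDecoder())          // split the stream into frames
          .addLast(new ProtobufDecoder(MyProto.Request.getDefaultInstance()))
          .addLast(new ProtobufVarint32LengthFieldPrepender())  // prepend length on the way out
          .addLast(new ProtobufEncoder());
          // .addLast(new BusinessHandler());  // application logic goes here
    }
}
```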

Let's take a look at the comparison of byte arrays serialized by different serialization-deserialization frameworks:

Figure 2-26 comparison of serialization stream size among serialization frameworks

As can be seen from the figure above, the Protobuf-serialized stream is only about 1/4 the size of the Java-serialized one. It is precisely the poor performance of Java native serialization that spawned a variety of high-performance open source serialization technologies and frameworks (poor performance is only one reason; cross-language support, IDL definitions and other factors also play a role).
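If you want to check the Java side of that size comparison yourself, a minimal sketch (with a stand-in POJO, not the 1K object from the benchmark above) is:

```java
// Minimal sketch: measure how many bytes plain Java serialization produces.
// The byte stream includes class metadata, field names, serialVersionUID,
// etc. -- the overhead that compact binary codecs such as Protobuf avoid.
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public final class SerializedSizeSketch {
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        String name = "netty";
        int id = 42;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new User());
        }
        System.out.println("java serialized size: " + bytes.size() + " bytes");
    }
}
```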

2.2.8. Flexible configuration of TCP parameters

Reasonably setting TCP parameters can noticeably improve performance in some scenarios, for example SO_RCVBUF and SO_SNDBUF; set improperly, their impact on performance can be very large. Let's summarize several configuration items with a significant performance impact:

1) SO_RCVBUF and SO_SNDBUF: generally recommended values are 128K or 256K

2) TCP_NODELAY: the Nagle algorithm automatically coalesces small packets in the buffer into larger packets to prevent the network from being clogged by large numbers of small packets, thereby improving the efficiency of network applications. For latency-sensitive application scenarios, however, this optimization needs to be turned off.

3) Soft interrupts: if the Linux kernel supports RPS (version 2.6.35 or above), enabling RPS can improve network throughput. RPS computes a hash from the packet's source address, destination address, and source and destination ports, then uses this hash to choose the CPU on which the soft interrupt runs. From the upper layer's point of view, this binds each connection to a CPU, and the hash balances soft interrupts across multiple CPUs, improving the network's parallel processing performance.

Netty can flexibly configure TCP parameters in the startup helper class to meet different user scenarios. The relevant configuration interfaces are defined as follows:

Figure 2-27 TCP parameter configuration definition for Netty
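Since the figure is a screenshot, here is a minimal sketch of the corresponding configuration calls on a Netty 4.x ServerBootstrap, using the 128K buffer values suggested above:

```java
// Minimal sketch: option() configures the listening channel, childOption()
// configures each accepted connection.
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;

public final class TcpTuningSketch {
    static ServerBootstrap tune(ServerBootstrap b) {
        return b
            .option(ChannelOption.SO_BACKLOG, 1024)            // accept queue length
            .childOption(ChannelOption.SO_RCVBUF, 128 * 1024)  // receive buffer
            .childOption(ChannelOption.SO_SNDBUF, 128 * 1024)  // send buffer
            .childOption(ChannelOption.TCP_NODELAY, true);     // disable Nagle for low latency
    }
}
```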

From this analysis of Netty's architecture and performance model, we can see that Netty's high performance is the result of careful design and implementation. Thanks to its high-quality architecture and code, supporting cross-node service invocation at 10W TPS is not especially difficult for Netty.

This is how Java and Netty achieve high performance and high concurrency. These are knowledge points you may well see or use in daily work; I hope you learned more from this article.
