What is OOM Killed in a Flink containerized environment?

This article explains what OOM Killed means in a Flink containerized environment and walks through its common causes.

JVM memory partition

For most Java users, day-to-day development involves the JVM Heap far more often than any other JVM memory partition, so the other partitions are collectively referred to as Off-Heap memory. For Flink, excessive memory usage usually originates in Off-Heap memory, so a deeper understanding of the JVM memory model is necessary.

According to the JVM 8 Spec [1], the memory partitions managed by the JVM are shown below:

JVM 8 memory model

In addition to the standard partitions specified in the Spec, JVM implementations often add extra partitions for advanced functional modules. Taking the HotSpot JVM as an example, following the classification of Oracle's Native Memory Tracking (NMT) [5], JVM memory can be subdivided into the following areas (a command sketch for inspecting them follows the list):

Heap: a memory area shared across threads, mainly storing objects created by the new operator. Its release is managed by GC, and it can be used by user code or by the JVM itself.

Class: metadata of classes, corresponding to the Method Area in the Spec (excluding the Constant Pool), i.e. the Metaspace in Java 8.

Thread: thread-level memory areas, corresponding to the sum of the PC Register, Stack, and Native Stack in the Spec.

Compiler: memory used by the JIT (Just-In-Time) compiler.

Code Cache: a cache used to store code generated by the JIT compiler.

GC: memory used by the garbage collector.

Symbol: memory storing symbols (such as field names, method signatures, and interned Strings), corresponding to the Constant Pool in the Spec.

Arena Chunk: temporary chunks of memory that the JVM requests from the operating system.

NMT: memory used by NMT itself.

Internal: other memory that does not fit the above categories, including Native/Direct memory requested by user code.

Unknown: memory that cannot be classified.
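To see these NMT categories for a running TaskManager, one option is the standard HotSpot workflow sketched below (the JVM must be started with NMT enabled, which itself costs some memory and performance; passing the flag through Flink's env.java.opts is one way to do it):

```
# flink-conf.yaml (or any other way of passing JVM options)
env.java.opts: -XX:NativeMemoryTracking=summary

# then query the running process; use "detail" instead of "summary" for more depth
jcmd <pid> VM.native_memory summary
```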

Ideally, we could strictly cap the memory of every partition to ensure that the process as a whole stays within the container limit. However, overly strict management carries extra cost and reduces flexibility, so in practice the JVM enforces hard upper limits on only a few of the partitions exposed to users, while the remaining partitions can be treated collectively as the JVM's own memory consumption.

The JVM parameters that put hard limits on individual partitions are the familiar ones: -Xmx/-Xms for Heap, -XX:MaxMetaspaceSize for Metaspace, -XX:MaxDirectMemorySize for Direct memory, and -Xss for per-thread stacks. (It is worth noting that the industry has no precise definition of JVM Native memory; in this article, Native memory refers to the non-Direct part of Off-Heap memory and is interchangeable with Native Non-Direct.)

As this suggests, Heap, Metaspace, and Direct memory are safe to bound, but non-Direct Native memory is more complicated: it may be internal usage of the JVM itself (such as the MemberNameTable discussed below), Native memory used by JNI dependencies introduced by user code, or Native memory that user code requests directly through sun.misc.Unsafe. In principle, Native memory requested by user code or third-party libraries must be planned by the user, while the remaining Internal usage can be folded into the JVM's own memory consumption. In fact, Flink's memory model follows a similar principle.
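To make the distinction concrete, here is a minimal JDK 8 sketch (class name and sizes are purely illustrative) contrasting Direct memory, which the JVM can cap, with Native Non-Direct memory requested through sun.misc.Unsafe, which it cannot:

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

public class OffHeapSketch {
    public static void main(String[] args) throws Exception {
        // Direct memory: tracked by the JVM and capped by -XX:MaxDirectMemorySize;
        // exceeding the cap fails fast with "OutOfMemoryError: Direct buffer memory".
        ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024 * 1024);

        // Native Non-Direct memory: Unsafe bypasses the Direct memory accounting,
        // so no JVM flag caps it; this is the part users must plan for themselves.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);
        long addr = unsafe.allocateMemory(64 * 1024 * 1024); // illustrative 64 MB
        unsafe.freeMemory(addr); // never GC-managed; must be released manually
    }
}
```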

Flink TaskManager memory model

First, a review of the TaskManager memory model introduced in Flink 1.10+.

Flink TaskManager memory model

Obviously, the Flink framework itself includes not only Heap memory managed by the JVM, but also Native and Direct Off-Heap memory that Flink manages itself. In the author's view, Flink's management strategies for Off-Heap memory fall into three types (a configuration sketch follows the list):

Hard limit (Hard Limit): a hard-limited memory partition is Self-Contained; Flink guarantees that its usage never exceeds the configured threshold (if memory runs short, an OOM-like exception is thrown).

Soft limit (Soft Limit): usage is expected to stay below the threshold over the long run, but may briefly exceed the configured threshold.

Reservation (Reserved): Flink does not restrict the partition's usage at all; it merely sets aside space for it when planning memory, with no guarantee that actual usage stays within the reservation.
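As an illustration, the three strategies surface directly in the TaskManager configuration; a sketch with Flink 1.10+ option keys (sizes are hypothetical, and the hard/soft/reserved reading in the comments follows the classification above):

```yaml
# flink-conf.yaml (illustrative sizes)
taskmanager.memory.network.min: 128mb         # hard limit: allocating beyond
taskmanager.memory.network.max: 128mb         # this fails fast with an exception
taskmanager.memory.jvm-metaspace.size: 256mb  # hard limit via -XX:MaxMetaspaceSize
taskmanager.memory.managed.size: 512mb        # hard in principle; effectively
                                              # soft for RocksDB (see below)
taskmanager.memory.jvm-overhead.min: 192mb    # reserved: planned for, but actual
taskmanager.memory.jvm-overhead.max: 1gb      # usage is not enforced
```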

Taking the JVM's memory management into account, the consequences of a Flink memory partition overflowing are determined by the following logic (representative log messages follow the list):

1. If Flink enforces a hard limit on the partition, Flink reports that the partition is out of memory; otherwise, move on to the next step.

2. If the partition belongs to a JVM-managed partition, then when its growth exhausts that JVM partition, the JVM reports an OOM for it (such as java.lang.OutOfMemoryError: Java heap space); otherwise, move on to the next step.

3. The partition keeps overflowing until the overall memory of the process exceeds the container memory limit; in an environment with strict resource control, the resource manager (YARN/Kubernetes, etc.) kills the process.
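As a rough guide to telling the three outcomes apart in practice (the messages below are representative rather than exact; wording varies across Flink, JVM, and YARN versions):

```
# Step 1: a Flink hard limit trips, e.g. the network buffer pool
java.io.IOException: Insufficient number of network buffers: ...

# Step 2: a JVM-managed partition is exhausted
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Direct buffer memory

# Step 3: the container exceeds its limit and the resource manager kills it
Container ... is running beyond physical memory limits. ... Killing container.
```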

To visualize the relationship between Flink memory partitions and JVM memory partitions, the author compiled the following mapping table:

Memory restriction relationship between Flink partition and JVM partition

According to the logic above, among all Flink memory partitions, only JVM Overhead is likely to get the process OOM killed: it is not Self-Contained, and the JVM partition it belongs to has no hard-limit parameter. As a catch-all reserved for a variety of different uses, JVM Overhead is indeed prone to problems, but it can also act as an isolation buffer that mitigates memory problems originating elsewhere.

For example, the Flink memory model applies a trick when accounting for Native Non-Direct memory:

Although, native non-direct memory usage can be accounted for as a part of the framework off-heap memory or task off-heap memory, it will result in a higher JVM's direct memory limit in this case.

That is, although the Off-Heap partitions of Task/Framework may contain Native Non-Direct memory, which strictly speaking belongs to JVM Overhead and is not governed by -XX:MaxDirectMemorySize, Flink nevertheless counts it toward MaxDirectMemorySize. This portion of the Direct memory quota is never actually used as Direct memory, so it remains available to JVM Overhead (which has no hard upper limit of its own), achieving the effect of reserving space for Native Non-Direct memory.
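A worked example with hypothetical sizes makes the trick concrete, assuming the Flink 1.10+ rule that -XX:MaxDirectMemorySize is derived as the sum of Framework Off-Heap, Task Off-Heap, and Network memory:

```
Framework Off-Heap = 128 MB
Task Off-Heap      = 512 MB   (suppose 200 MB of this is really Native Non-Direct)
Network            = 256 MB
=> Flink passes -XX:MaxDirectMemorySize=896m

The 200 MB of Native Non-Direct usage never claims its share of the Direct
quota, so that slack stays free for Native allocations, on top of whatever
JVM Overhead reserves.
```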

Common causes of OOM Killed

Consistent with the analysis above, the common causes of OOM Killed in practice are essentially leaks or over-use of Native memory. OOM Killed on virtual memory is easily avoided through resource manager configuration and rarely causes trouble, so the discussion below covers only OOM Killed on physical memory.

Uncertainty of RocksDB Native memory

As is well known, RocksDB requests Native memory directly through JNI, outside Flink's control, so in practice Flink can only influence RocksDB's memory usage indirectly by setting its memory parameters. However, Flink currently only estimates these parameters, and the estimates are not very accurate, for several reasons.

First of all, part of the memory is difficult to calculate accurately. RocksDB's memory footprint has four parts [6]:

Block Cache: a cache layered above the OS PageCache that stores uncompressed data blocks.

Indexes and filter blocks: indexes and Bloom filters that optimize read performance.

Memtable: roughly equivalent to a write cache.

Blocks pinned by Iterator: when a RocksDB traversal is triggered (such as iterating over all keys of a RocksDBMapState), the Iterator keeps the Blocks and Memtables it references from being released for its whole lifetime, causing extra memory usage [10].

The first three areas are configurable, but the resources pinned by Iterators depend on the application's usage pattern and offer no hard limit, so Flink does not take this part into account when calculating RocksDB StateBackend memory.
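The configurable portion is bounded through Flink's managed memory integration; a sketch with the Flink 1.10+ option keys (the values shown are the documented defaults):

```yaml
# flink-conf.yaml: let RocksDB draw its budget from Flink managed memory
state.backend.rocksdb.memory.managed: true
# fraction of that budget spent on Memtables (write buffers)
state.backend.rocksdb.memory.write-buffer-ratio: 0.5
# fraction pinned for index and filter blocks (high-priority pool)
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
```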

The second reason is a bug in the RocksDB Block Cache [8][9]: the cache size is not strictly enforced and may briefly exceed the configured capacity, which effectively turns it into a soft limit.

Since RocksDB's memory over-use is only temporary, handling this usually just requires increasing the JVM Overhead threshold so that Flink reserves more memory.
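For example (values are hypothetical; the keys are standard Flink 1.10+ options), the overhead ceiling can be raised in flink-conf.yaml:

```yaml
# give JVM Overhead more room to absorb RocksDB's transient over-use
taskmanager.memory.jvm-overhead.fraction: 0.2
taskmanager.memory.jvm-overhead.max: 2gb
```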

Glibc Thread Arena problem

Another common problem is glibc's notorious 64 MB problem, which can significantly increase a JVM process's memory usage and eventually get it killed by YARN.

Specifically, the JVM requests memory through glibc, which, to improve allocation efficiency and reduce fragmentation, maintains memory pools called Arenas: a shared Main Arena plus thread-level Thread Arenas. When a thread needs memory while the Main Arena is locked by another thread, glibc allocates a Thread Arena of roughly 64 MB (on 64-bit machines) for that thread. These Thread Arenas are invisible to the JVM but are counted in the process's virtual memory (VIRT) and physical memory (RSS).

By default, the maximum number of Arenas is 8 times the number of CPU cores, so on an ordinary 32-core server they can occupy up to 16 GB, which is hardly negligible. To control total memory consumption, glibc provides the environment variable MALLOC_ARENA_MAX to cap the number of Arenas; Hadoop, for example, sets it to 4 by default. However, this parameter is only a soft limit: when all Arenas are locked, glibc will still create a new Thread Arena to serve the allocation [11], causing unexpected memory usage.

Generally speaking, this problem appears in applications that create threads frequently. For example, the HDFS Client creates a new DataStreamer thread for every file being written, making it especially prone to Thread Arena problems. If you suspect your Flink application is affected, an easy check is to look in the process's pmap output for many consecutive anon segments of 64 MB (or multiples of it); the 65536 KB segments shown in blue in the figure below, for instance, are most likely Arenas.

Pmap 64 MB arena
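A quick way to run that check from a shell (the pid is a placeholder; in pmap -x output the second column is the segment size in KB, so a full Thread Arena shows up as 65536):

```
# list anon segments of exactly 64 MB
pmap -x <pid> | awk '$2 == 65536'

# or sort all segments by size and eyeball repeated ~64 MB regions
pmap -x <pid> | sort -n -k 2 | tail -n 20
```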

The fix is simple: set MALLOC_ARENA_MAX to 1, i.e. disable Thread Arenas and use only the Main Arena. The cost, of course, is lower thread memory-allocation efficiency. It is worth mentioning, however, that overriding the default MALLOC_ARENA_MAX through Flink's process environment parameters (such as containerized.taskmanager.env.MALLOC_ARENA_MAX=1) may not work: when a non-whitelisted variable (see yarn.nodemanager.env-whitelist) conflicts with an existing value, NodeManager merges the original and appended values the way it merges URLs, producing a result like MALLOC_ARENA_MAX="4:1".
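A configuration sketch of the override discussed above (the key is the one named in this article; as noted, it only behaves as intended if MALLOC_ARENA_MAX is also added to the NodeManager's yarn.nodemanager.env-whitelist):

```yaml
# flink-conf.yaml
containerized.taskmanager.env.MALLOC_ARENA_MAX: 1
```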

Finally, a more thorough alternative is to replace glibc with Google's tcmalloc or Facebook's jemalloc [12]. Besides eliminating the Thread Arena problem, they provide better memory-allocation performance and less fragmentation. Indeed, the official Flink 1.12 image switched the default memory allocator from glibc to jemalloc [17].

JDK8 Native memory leak

Oracle JDK versions earlier than 8u152 had a Native memory leak bug [13] that caused the JVM's Internal memory partition to keep growing.

Specifically, the JVM caches mappings from string symbols (Symbol) to methods (Method) and member variables (Field) to speed up lookups. Each mapping pair is called a MemberName, and the whole mapping is called the MemberNameTable, maintained by the class java.lang.invoke.MethodHandles. Before JDK 8u152, the MemberNameTable lived in Native memory, and stale MemberNames were never cleaned up by GC, causing a memory leak.
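For context, MemberNames are created when method handles are resolved; the hedged JDK 8 snippet below (class name is illustrative) shows the kind of code that populates the MemberNameTable:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class MemberNameSketch {
    public static void main(String[] args) throws Throwable {
        // Each lookup resolves a MemberName entry; before JDK 8u152 those
        // entries lived in Native memory and stale ones were never reclaimed.
        MethodHandle length = MethodHandles.lookup().findVirtual(
                String.class, "length", MethodType.methodType(int.class));
        System.out.println((int) length.invokeExact("flink")); // prints 5
    }
}
```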

To confirm this problem, inspect the JVM's memory through NMT. For example, the author once encountered an online TaskManager whose MemberNameTable exceeded 400 MB.

JDK8 MemberNameTable Native memory leak

JDK-8013267 [14] moved the MemberNameTable from Native memory into the Java Heap, fixing this problem. However, it is not the JVM's only Native memory leak; there is, for example, a C2 compiler memory leak [15]. So for users who, like the author, lack a dedicated JVM team, upgrading to the latest JDK version is the best fix.

YARN mmap memory algorithm

It is well known that YARN computes the total memory of a container's entire process tree from the information under /proc/${pid}, but mmap shared memory is a special case. There is no question that mmap memory is counted entirely into a process's VIRT, but there are different standards for computing its RSS. Under YARN's rules and Linux smaps, memory pages (Pages) are classified along two dimensions:

Private Pages: Pages mapped only by the current process

Shared Pages: Pages shared with other processes

Clean Pages: Pages that have not been modified since being mapped

Dirty Pages: Pages that have been modified since being mapped

In the default implementation, YARN computes total memory from /proc/${pid}/status, and all Shared Pages are counted into the process's RSS, even when those Pages are mapped by multiple processes at the same time [16]. This causes a deviation from the physical memory actually used at the operating system level, and may get the Flink process killed by mistake (provided, of course, that user code uses mmap without enough space reserved for it).

To address this, YARN provides the configuration option yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled. When it is set to true, YARN computes memory usage from the more accurate /proc/${pid}/smaps. A key concept there is PSS (Proportional Set Size): when computing memory, each Shared Page is divided evenly among all the processes that map it. For example, a process holding 1000 Private Pages plus 1000 Shared Pages that are shared with one other process has a total of 1000 + 1000/2 = 1500 Pages under PSS.

Returning to YARN's memory calculation: a process's RSS equals the sum of the RSS of all the Pages it maps. By default, the RSS of a Page is computed as

Page RSS = Private_Clean + Private_Dirty + Shared_Clean + Shared_Dirty

and because a Page is either Private or Shared, and either Clean or Dirty, at least three of the four terms on the right are 0. With the smaps option enabled, the formula becomes

Page RSS = Min(Shared_Dirty, PSS) + Private_Clean + Private_Dirty

In short, the new formula removes the double counting of Shared_Clean Pages.

Although smaps-based accounting is more accurate, it introduces the cost of traversing all the Pages to total up memory, which is slower than simply reading the statistics from /proc/${pid}/status. Therefore, if you run into mmap problems, the recommended approach is to increase the capacity of Flink's JVM Overhead partition.
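For completeness, enabling the smaps-based accounting is a single NodeManager setting (the property named above; it is off by default for the performance reason just mentioned):

```xml
<!-- yarn-site.xml: compute container RSS from /proc/<pid>/smaps (PSS-aware) -->
<property>
  <name>yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled</name>
  <value>true</value>
</property>
```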
