

How to deal with large objects in Java

2025-01-17 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31 Report

This article mainly introduces how to deal with large objects in Java. Many people have doubts about this topic in daily work, so the editor has consulted various materials and put together some simple, practical methods. I hope it helps you resolve your doubts about handling large objects in Java. Now, please follow along!

Substring in String

As we all know, String is immutable in Java; changing its contents produces a new string. If we only want to use part of the data in a string, we can use the substring method.

The following is the substring source code from Java 11.

public String substring(int beginIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    int subLen = length() - beginIndex;
    if (subLen < 0) {
        throw new StringIndexOutOfBoundsException(subLen);
    }
    if (beginIndex == 0) {
        return this;
    }
    return isLatin1() ? StringLatin1.newString(value, beginIndex, subLen)
                      : StringUTF16.newString(value, beginIndex, subLen);
}

public static String newString(byte[] val, int index, int len) {
    if (String.COMPACT_STRINGS) {
        byte[] buf = compress(val, index, len);
        if (buf != null) {
            return new String(buf, LATIN1);
        }
    }
    int last = index + len;
    return new String(Arrays.copyOfRange(val, index, last), UTF16);
}

In Java 11, substring copies only the bytes it actually needs. By contrast, the JDK 6 implementation looked like this:

public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    if (endIndex > count) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    if (beginIndex > endIndex) {
        throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
    }
    return ((beginIndex == 0) && (endIndex == count)) ? this
            : new String(offset + beginIndex, endIndex - beginIndex, value);
}

String(int offset, int count, char value[]) {
    this.value = value;
    this.offset = offset;
    this.count = count;
}

As you can see, when the JDK 6 version created a substring, it did not copy only the characters it needed; instead, it kept a reference to the entire value array. If the original string was large, its memory could not be released even when it was no longer used.

For example, an article's content may be several megabytes; even if we only need its summary, we would still be forced to keep the whole large object alive.

Some interviewers with many years of experience still remember substring as it behaved in JDK 6, but Java has long since fixed this bug. If the question comes up in an interview, to be safe, describe this improvement history.

The lesson for us: if you create a large object and derive other information from it, be sure to drop the reference to the large object afterwards.

Expanding large collection objects

Object expansion is commonplace in Java, for example in StringBuilder, StringBuffer, HashMap, and ArrayList. Broadly speaking, the amount of data inside Java collections, including List, Set, Queue, and Map, is unbounded. When capacity runs out, an expansion takes place, and because expansion reorganizes the data, these operations are not thread-safe.

Let's first look at StringBuilder's expansion code:

void expandCapacity(int minimumCapacity) {
    int newCapacity = value.length * 2 + 2;
    if (newCapacity - minimumCapacity < 0)
        newCapacity = minimumCapacity;
    if (newCapacity < 0) {
        if (minimumCapacity < 0) // overflow
            throw new OutOfMemoryError();
        newCapacity = Integer.MAX_VALUE;
    }
    value = Arrays.copyOf(value, newCapacity);
}

When capacity is insufficient, the buffer is doubled and the existing data is copied with Arrays.copyOf.

Below is HashMap's expansion code; its size also doubles on expansion. Its expansion is much more complex: besides the influence of the load factor, it must rehash all the existing data, and since it cannot use the native Arrays.copy method, it is slow.

void addEntry(int hash, K key, V value, int bucketIndex) {
    if ((size >= threshold) && (null != table[bucketIndex])) {
        resize(2 * table.length);
        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);
    }
    createEntry(hash, key, value, bucketIndex);
}

void resize(int newCapacity) {
    Entry[] oldTable = table;
    int oldCapacity = oldTable.length;
    if (oldCapacity == MAXIMUM_CAPACITY) {
        threshold = Integer.MAX_VALUE;
        return;
    }
    Entry[] newTable = new Entry[newCapacity];
    transfer(newTable, initHashSeedAsNeeded(newCapacity));
    table = newTable;
    threshold = (int) Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
}

You can check the List (ArrayList) code yourself; it behaves similarly, and its expansion strategy is 1.5 times the original length.

Because collections are used so frequently in code, if you know the upper limit on the number of items, it is worth setting a reasonable initial size. For example, a HashMap that grows from its default size to hold 1024 elements must resize seven times, which hurts application performance. This question comes up frequently in interviews, so you should understand the performance impact of these expansion operations.

Note, however, that for collections with a load factor, such as HashMap (0.75), the initial size should be set to (number of elements needed / load factor) + 1. If you are not very clear about the underlying structure, you may as well keep the default.
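As a small sketch of the presizing advice above (the helper name capacityFor is illustrative, not a JDK API):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    // Hypothetical helper: compute an initial capacity so that `expected`
    // entries fit without triggering a resize (default load factor 0.75).
    static int capacityFor(int expected) {
        return (int) (expected / 0.75f) + 1;
    }

    public static void main(String[] args) {
        int expected = 1024;
        // Without presizing, new HashMap<>() starts at 16 and resizes
        // repeatedly on the way to 1024 entries.
        Map<String, Integer> sized = new HashMap<>(capacityFor(expected));
        for (int i = 0; i < expected; i++) {
            sized.put("key" + i, i);
        }
        System.out.println(sized.size()); // 1024
    }
}
```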

Next, I will explain application-level optimization along two dimensions of the data: structure and time.

Maintain appropriate object granularity

Let me share a real case: we had a business system with very high concurrency that frequently needed users' basic data.

As shown in the figure below, because the user's basic information is stored in another service, every use of it requires a network round trip. Even more unacceptable: even if you only need the user's gender attribute, you still have to query and pull all of the user's information.

To speed up queries, the data was initially cached in Redis, which greatly improved query performance, but a lot of redundant data still had to be fetched on every query.

The original redis key was designed like this:

Type: string key: user_${userid} value: JSON

There are two problems with such a design:

To query the value of a single field, you have to fetch the entire JSON string and parse it yourself.

To update the value of a single field, you have to rewrite the entire JSON string, which is expensive.

Large-grained JSON information like this can be optimized by breaking it up, so that every update and query has a focused target.

Next, the data in Redis is redesigned as follows, using a hash structure instead of a JSON string:

Type: hash key: user_${userid} value: {sex: f, id: 1223, age: 23}

In this way, we can use the hget or hmget commands to fetch just the data we want and speed up data access.
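A minimal sketch of this hash layout, modeled with an in-memory map rather than a real Redis instance (class and method names are illustrative; hset/hget mirror the semantics of the Redis commands):

```java
import java.util.HashMap;
import java.util.Map;

public class UserHashDemo {
    // key -> field -> value, like a Redis hash
    static Map<String, Map<String, String>> store = new HashMap<>();

    static void hset(String key, String field, String value) {
        store.computeIfAbsent(key, k -> new HashMap<>()).put(field, value);
    }

    static String hget(String key, String field) {
        Map<String, String> hash = store.get(key);
        return hash == null ? null : hash.get(field);
    }

    public static void main(String[] args) {
        hset("user_1223", "sex", "f");
        hset("user_1223", "id", "1223");
        hset("user_1223", "age", "23");
        // Only the requested field is read; the rest of the hash is untouched.
        System.out.println(hget("user_1223", "sex")); // f
    }
}
```

The point of the design: updating "age" no longer requires re-serializing the whole user record, and reading "sex" no longer pulls the whole record over the network.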

Bitmap makes objects smaller

Building on the above, can we optimize further? For example, our system frequently uses users' gender data: to distribute gifts, recommend friends of the opposite sex, or periodically run cleanup jobs over users. It also stores state such as whether a user is online, has checked in, or has sent a message recently, in order to count active users. Yes-or-no values like these can be compressed using a Bitmap structure.

There is another high-frequency interview question here: how much space does Java's Boolean occupy?

In the Java Virtual Machine Specification, boolean is described as being mapped to the int values 1 and 0, occupying the same 32 bits as an int. Even on virtual machine implementations that map boolean to byte, that is still far too much space when a huge number of boolean values must be stored.

As the code shows, by testing each bit of an int we can store 32 boolean values in a single number:

int a = 0b0001_0001_1111_1101_1001_0001_1111_1101;
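A minimal sketch (method names are illustrative) of reading and writing individual bits of an int:

```java
public class BitFlags {
    // Read bit `pos` (0 = lowest) of `flags`.
    static boolean getBit(int flags, int pos) {
        return ((flags >>> pos) & 1) == 1;
    }

    // Return `flags` with bit `pos` set or cleared.
    static int setBit(int flags, int pos, boolean on) {
        return on ? (flags | (1 << pos)) : (flags & ~(1 << pos));
    }

    public static void main(String[] args) {
        int a = 0b0001_0001_1111_1101_1001_0001_1111_1101;
        System.out.println(getBit(a, 0)); // true (lowest bit is 1)
        System.out.println(getBit(a, 1)); // false
    }
}
```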

Bitmap is a data structure that records information with bits; each stored value is either 0 or 1. Remember the cache penetration we mentioned in the earlier "four problems that must be solved in distributed caching systems"? Bitmap can be used to avoid it. The related class in Java is java.util.BitSet; BitSet is backed by a long array, so its minimum capacity is 64.

One billion boolean values need only 128MB of memory. The following gender-judgment logic occupies 256MB (two BitSets) and can cover IDs up to one billion.

static BitSet missSet = new BitSet(1_000_000_000);
static BitSet sexSet = new BitSet(1_000_000_000);

String getSex(int userId) {
    boolean notMiss = missSet.get(userId);
    if (!notMiss) {
        // lazy fetch
        String lazySex = dao.getSex(userId);
        missSet.set(userId, true);
        sexSet.set(userId, "female".equals(lazySex));
    }
    return sexSet.get(userId) ? "female" : "male";
}

Held in heap memory, this data is still quite large. Fortunately, Redis also supports the Bitmap structure; if memory pressure is a concern, we can move this structure into Redis, and the judgment logic stays much the same.

Another interview algorithm question: given a machine with 1GB of memory and 6 billion int values, how do you quickly determine which values are duplicates?

You can reason about it by analogy. Bitmap is a relatively low-level structure; on top of it sits the Bloom filter, which can determine that a value definitely does not exist, or that it may exist.

As shown in the figure, compared with Bitmap it adds a layer of hashing. Since hash functions collide, multiple values may land on the same bit. Unlike HashMap, which resolves collisions with linked lists or red-black trees, the Bloom filter simply reuses the hash slot. From this property we can see that a Bloom filter can state definitively that a value is not in the set, but cannot confirm that a value is in it.

Guava provides a BloomFilter class that makes it easy to implement this functionality.
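Guava's BloomFilter requires a third-party dependency, so here is a minimal self-contained sketch over java.util.BitSet, for illustration only (the class name and hashing scheme are assumptions, not Guava's implementation):

```java
import java.util.BitSet;

public class TinyBloom {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    TinyBloom(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive k bit positions from two base hashes (double-hashing trick).
    private int pos(String value, int i) {
        int h1 = value.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    void put(String value) {
        for (int i = 0; i < hashes; i++) bits.set(pos(value, i));
    }

    // false = definitely absent; true = possibly present
    boolean mightContain(String value) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(pos(value, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        TinyBloom bloom = new TinyBloom(1 << 20, 3);
        bloom.put("user_1223");
        System.out.println(bloom.mightContain("user_1223")); // true
    }
}
```

Note the asymmetry the article describes: a false from mightContain is a guarantee, while a true may be a hash collision.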

In essence, the optimizations above are all ways of turning large objects into small ones, and similar ideas appear throughout software design. For example, for a newly published article, the summary is used frequently, so there is no need to query the full text; and a user's feed only needs to serve the visible information quickly, while the complete data is kept in slower large-scale storage.

Hot and cold separation of data

In addition to the horizontal, structural dimension, data also has a vertical, time dimension. The most effective way to optimize along the time dimension is hot-cold separation.

So-called hot data is data that is close to the user and frequently used, while cold data is data that is very old and accessed very rarely.

Take the same complex SQL running on a table of tens of millions of rows versus one of millions: the former will inevitably perform far worse. So although your system is fast when it first goes live, it will gradually slow down as data accumulates over time.

Hot-cold separation splits the data into two parts, as in the figure below. The cold side generally keeps a full copy of the data and is used for time-consuming statistical operations.

Because hot-cold separation comes up often at work, interviewers frequently ask about it. Here are three common approaches:

Data double writing

Put the insert, update, and delete operations for both hot and cold storage into a single transaction. Since hot storage (such as MySQL) and cold storage (such as HBase) are different systems, this is likely to be a distributed transaction. At the start of a project this approach is feasible, but when retrofitting legacy systems, introducing distributed transactions is basically impractical, so I usually discard this option outright.

Write to MQ distribution

Using MQ's publish-subscribe capability, the application does not write to the database directly; instead it sends the data to MQ. A separate consumer process then writes the MQ data into cold storage and hot storage respectively. Businesses transformed this way have clear logic and an elegant structure. Systems such as order processing, which have a clear structure and low ordering requirements, can be distributed via MQ. But if you have a very large number of database entities, you must weigh the program complexity this approach brings.
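A minimal in-process sketch of this distribution idea, with a queue standing in for a real broker such as Kafka or RocketMQ (all names are illustrative; the lists model the hot and cold stores):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class MqDistributionDemo {
    static Queue<String> mq = new ConcurrentLinkedQueue<>();
    static List<String> hotStore = new ArrayList<>();   // e.g. MySQL
    static List<String> coldStore = new ArrayList<>();  // e.g. HBase

    // Producer: business code publishes the change instead of writing the DB.
    static void publish(String record) {
        mq.offer(record);
    }

    // Consumer: a separate process drains MQ into both stores.
    static void consumeOne() {
        String record = mq.poll();
        if (record != null) {
            hotStore.add(record);
            coldStore.add(record);
        }
    }

    public static void main(String[] args) {
        publish("order-1001");
        consumeOne();
        System.out.println(hotStore);  // [order-1001]
        System.out.println(coldStore); // [order-1001]
    }
}
```

The design benefit: the producer never blocks on the slow cold store, and either store can be rebuilt by replaying the queue.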

Use Binlog synchronization

For MySQL, you can synchronize via Binlog: use the Canal component to continuously pull the latest Binlog data and, combined with MQ, synchronize it to other data sources.

Thinking divergence

For operations on result sets, we can take the thinking further. A simple redundant result set can be transformed into a complex but efficient data structure, which fronts our requests and effectively offloads time-consuming operations.

For example, the familiar database index is itself a reorganization and acceleration of data. A B+ tree can effectively reduce the number of round trips between the database and the disk; through a B+-tree-like structure, it indexes the most commonly used data within limited storage space.

Then there is serialization, which is common in RPC. Some services use SOAP-protocol WebServices, an XML-based protocol whose disadvantages are bulky payloads, slow transmission, and low efficiency. Today most web services exchange JSON data instead, and JSON is more efficient than SOAP.

You have probably also heard of Google's protobuf. Because it is a binary protocol that compresses its data, its performance is excellent. After protobuf compression, the data is only about 1/10 the size of JSON, while performance improves by 5 to 100 times.

The design of protobuf is worth borrowing from. It packs data very compactly with its tag|length|value encoding, and it parses and transmits very quickly.
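As a sketch, a minimal .proto message (the message and field names are illustrative) shows the tag|length|value idea: each field is encoded on the wire as its numbered tag followed by its value, which is what keeps the format so compact.

```proto
syntax = "proto3";

message User {
  int64 id = 1;    // encoded as tag 1 + varint value
  string sex = 2;  // encoded as tag 2 + length + bytes
  int32 age = 3;   // fields absent from a message cost zero bytes
}
```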

That concludes our study of how to deal with large objects in Java. I hope it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow this site; the editor will keep working to bring you more practical articles!
