Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to use BitMap to realize big data sorting and deduplicating

2025-02-25 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/02 Report--

Today, I will talk to you about how to use BitMap to achieve big data sorting to remove weight, many people may not know much about it. In order to make you understand better, the editor has summarized the following content for you. I hope you can get something according to this article.

1. Questions are raised:

M (for example, 1 billion) int integers, of which only N have been repeated, read into memory and delete the duplicate integers.

2. Solution problem analysis:

We must first think of opening M int integer data arrays in computer memory, reading M int type arrays by one bye one, then comparing the logarithmic values one by one, and finally removing the duplicated data. Of course, this is feasible in dealing with small-scale data.

Let's consider the case of big data: for example, in the Java language, 1 billion int type data are ranked.

An int type in java occupies 4byte in memory. Then 1 billion int type data need to open up a total of 10 ^ 9 power * 4byte ≈ 4GB contiguous memory space. Take a computer with a 32-bit operating system as an example, the maximum supporting memory is 4G, and the available memory is less than 4G. Therefore, the above method simply does not work when dealing with big data.

Transformation of thinking:

Since we cannot open up int type arrays for all int type data, we can take smaller data types to read cached int type data. Considering that the data processed inside the computer is the bit of 01 sequence, can we use 1bit to represent an int type data?

The derivation of bit mapping:

Use smaller data types to refer to larger data types. For the problem mentioned above, we can use 1 bit to correspond to an int integer. If data of the corresponding int type exists, its corresponding bit is assigned to 1, otherwise, it is assigned to 0 (boolean type). The int in java ranges from-2 ^ 31 to 2 ^ 31-1. Then the length of all possible values is 2 ^ 32. The corresponding bit length is 2 ^ 32. Then you only need to open up a memory space of 2 ^ 32 bit = 2 ^ 29 byte = 512m after this processing. Obviously, this processing can meet the requirements, although the memory consumption is not too small.

Solution to the problem:

First define the int-byte mapping relationship of the following figure, of course, the mapping relationship can be customized. But the premise is to make sure that the superscript of your array is not out of bounds.

But the bit [] array defined above obviously doesn't exist on the computer, so we need to convert it to a basic data type store in java. Obviously, byte [] is the best choice.

Convert it to the byte [] array scheme:

A custom mapping table where each bit corresponds to an int value, and the maximum and minimum values of the int correspond to the maximum and minimum indexes of the array. As can be seen from the above figure, the difference between the int value and the bit index is 2 ^ 31. Of course, you can also define other mappings, but be careful not to cross the bounds of the array.

Bit [] index: since the maximum value may be 2 ^ 32, receive it with long: long bitIndex = num + (1l)

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report