Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to realize disk Multiplex merging and sorting in LevelDB Database

2025-04-05 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/01 Report--

This article will explain in detail how to realize disk merging and sorting in LevelDB database. The editor thinks it is very practical, so I share it with you as a reference. I hope you can get something after reading this article.

When the high-level data sinks to the lower level in the LevelDB database, it needs to go through a Major Compaction to merge and sort the ordered key-value pairs of the high-level files and the multiple ordered key-value pairs of the low-level files. The input of the disk multiplex sorting algorithm is ordered key-value pairs from multiple disk files, which are sorted in memory and then output to one or more new disk files.

Multi-path merge sorting is also a common algorithm in the field of big data, which is often used for mass data sorting. When the amount of data is very large, the data cannot be contained in the memory of a single machine. It needs to be sorted by different machines in memory (map process), and then sorted by multiple merging algorithm (reduce process). This is streaming multiplex sorting, why is streaming sorting, because the data source comes from network sockets.

The advantage of multi-path merge sorting is that the memory consumption is extremely low, its memory consumption is proportional to the number of input files, and has nothing to do with the total amount of data, which will only linearly proportional to the sorting time.

Let's personally implement the disk multiplex merging algorithm, why it is a disk, because its input comes from a disk file.

Algorithm idea

We need to maintain an ordered array in memory. The current smallest element of each input file is placed in the array as an element. The array remains sorted by the size of the element.

Then we start to enter the loop, where the logic of the loop always starts with the smallest element, fetches the next element in its file, and compares it with the element in the current array. Different processing is carried out according to the comparison results, and here we use the binary search algorithm for fast comparison. Notice that the elements in each input file are in order.

1. If the extracted element is equal to the smallest element in the current array, the element can be output directly. Let's move on to the next cycle. It is not possible to fetch an element smaller than the smallest element of the current array, because the input file itself is ordered.

two。 Otherwise, you need to insert the element into the specified position in the current array and continue to keep the array in order. The current smallest element in the array is then output and removed. And move on to the next cycle.

3. If you encounter the end of the file, you can no longer call the next () method, and you can directly output and remove the smallest element in the array, and the array becomes smaller. And move on to the next cycle. When the array is empty, it means that all the files have been processed and the algorithm can be finished.

It is worth noting that there will never be two elements of the same file in the array, which ensures that the length of the array will not exceed the number of input files, and it will not squeeze out the unending files out of the array, which will lead to the problem of missing sorting.

Binary search

It is important to note that Java has a built-in binary search algorithm that is more sophisticated in use.

Public class Collections {

...

Public static int binarySearch (List list, T key) {

...

If (found) {

Return index

} else {

Return-(insertIndex+1)

}

}

...

}

If the key can be found in the list, go straight back to the appropriate location. If it cannot be found, it returns a negative number, which is not a simple-1, which indicates where to insert the key, and the array will continue to be in order.

For example, if binarySearch returns index=-1, then insertIndex is-(index+1), which is 0, and the insertion point is at the beginning of the array. If index=-size-1 is returned, then insertIndex is size, the end of the array. Other negative numbers are inserted into the middle of the array.

Input file class

A MergeSource object is created for each input file, which provides hasNext () and next () methods for determining and fetching the next element. Note that the input file is ordered, and the next element is the smallest element in the current input file.

The hasNext () method is responsible for reading the next line and caching it in the cachedLine variable, calling the next () method to convert the cachedLine variable to an integer and return it.

Class MergeSource implements Closeable {

Private BufferedReader reader

Private String cachedLine

Private String filename

Public MergeSource (String filename) {

This.filename = filename

Try {

FileReader fr = new FileReader (filename)

This.reader = new BufferedReader (fr)

} catch (FileNotFoundException e) {

}

}

Public boolean hasNext () {

String line

Try {

Line = this.reader.readLine ()

If (line = = null | | line.isEmpty ()) {

Return false

}

This.cachedLine = line.trim ()

Return true

} catch (IOException e) {

}

Return false

}

Public int next () {

If (this.cachedLine = = null) {

If (! hasNext ()) {

Throw new IllegalStateException ("no content")

}

}

Int num = Integer.parseInt (this.cachedLine)

This.cachedLine = null

Return num

}

@ Override

Public void close () throws IOException {

This.reader.close ()

}

}

Memory ordered array element class

Prepare the array before sorting, put the smallest elements of each input file into the array, and sort.

Class Bin implements Comparable {

Int num

MergeSource source

Bin (MergeSource source, int num) {

This.source = source

This.num = num

}

@ Override

Public int compareTo (Bin o) {

Return this.num-o.num

}

}

List prepare () {

List bins = new ArrayList ()

For (MergeSource source: sources) {

Bin newBin = newBin (source, source.next ())

Bins.add (newBin)

}

Collections.sort (bins)

Return bins

}

Output file class

Be careful to flush () when closing the output file to avoid losing the buffered content in the PrintWriter.

Class MergeOut implements Closeable {

Private PrintWriter writer

Public MergeOut (String filename) {

Try {

FileOutputStream out = new FileOutputStream (filename)

This.writer = new PrintWriter (out)

} catch (FileNotFoundException e) {

}

}

Public void write (Bin bin) {

Writer.println (bin.num)

}

@ Override

Public void close () throws IOException {

Writer.flush ()

Writer.close ()

}

}

Prepare the contents of the input file

Let's generate a series of input files, each containing a bunch of random integers. A total of n files are generated, and the number of integers for each file is between minEntries and minEntries. Returns a list of file names for all input files.

List generateFiles (int n, int minEntries, int maxEntries) {

List files = new ArrayList ()

For (int I = 0; I < n; iTunes +) {

String filename = "input-" + I + ".txt"

PrintWriter writer

Try {

Writer = new PrintWriter (new FileOutputStream (filename))

ThreadLocalRandom rand = ThreadLocalRandom.current ()

Int entries = rand.nextInt (minEntries, maxEntries)

List nums = new ArrayList ()

For (int k = 0; k < entries; knot +) {

Int num = rand.nextInt (10000000)

Nums.add (num)

}

Collections.sort (nums)

For (int num: nums) {

Writer.println (num)

}

Writer.flush ()

Writer.close ()

} catch (FileNotFoundException e) {

}

Files.add (filename)

}

Return files

}

Sorting algorithm

Everything is ready except the east wind. After preparing all the above classes, the sorting algorithm is very simple and the amount of code is very small. Compared with the above algorithm ideas to understand the following algorithm is very easy.

Public void sort () {

List bins = prepare ()

While (true) {

/ / fetch the smallest element in the array

MergeSource current = bins.get (0). Source

If (current.hasNext ()) {

/ / extract the next element from the input file

Bin newBin = newBin (current, current.next ())

/ / binary search, that is, comparing with the elements already in the array

Int index = Collections.binarySearch (bins, newBin)

If (index = = 0) {

/ / algorithm thinking situation 1

This.out.write (newBin)

} else {

/ / algorithm ideas 2

If (index < 0) {

Index =-(index+1)

}

Bins.add (index, newBin)

Bin minBin = bins.remove (0)

This.out.write (minBin)

}

} else {

/ / algorithm idea 3: encounter the end of the file

Bin minBin = bins.remove (0)

This.out.write (minBin)

If (bins.isEmpty ()) {

Break

}

}

}

}

All codes

Readers can directly copy and paste the following code into IDE to run.

Package leetcode

Import java.io.BufferedReader

Import java.io.Closeable

Import java.io.FileNotFoundException

Import java.io.FileOutputStream

Import java.io.FileReader

Import java.io.IOException

Import java.io.PrintWriter

Import java.util.ArrayList

Import java.util.Collections

Import java.util.List

Import java.util.concurrent.ThreadLocalRandom

Public class DiskMergeSort implements Closeable {

Public static List generateFiles (int n, int minEntries, int maxEntries) {

List files = new ArrayList ()

For (int I = 0; I < n; iTunes +) {

String filename = "input-" + I + ".txt"

PrintWriter writer

Try {

Writer = new PrintWriter (new FileOutputStream (filename))

Int entries = ThreadLocalRandom.current () .nextInt (minEntries, maxEntries)

List nums = new ArrayList ()

For (int k = 0; k < entries; knot +) {

Int num = ThreadLocalRandom.current () .nextInt (10000000)

Nums.add (num)

}

Collections.sort (nums)

For (int num: nums) {

Writer.println (num)

}

Writer.close ()

} catch (FileNotFoundException e) {

}

Files.add (filename)

}

Return files

}

Private List sources

Private MergeOut out

Public DiskMergeSort (List files, String outFilename) {

This.sources = new ArrayList ()

For (String filename: files) {

This.sources.add (new MergeSource (filename))

}

This.out = new MergeOut (outFilename)

}

Static class MergeOut implements Closeable {

Private PrintWriter writer

Public MergeOut (String filename) {

Try {

This.writer = new PrintWriter (new FileOutputStream (filename))

} catch (FileNotFoundException e) {

}

}

Public void write (Bin bin) {

Writer.println (bin.num)

}

@ Override

Public void close () throws IOException {

Writer.flush ()

Writer.close ()

}

}

Static class MergeSource implements Closeable {

Private BufferedReader reader

Private String cachedLine

Public MergeSource (String filename) {

Try {

FileReader fr = new FileReader (filename)

This.reader = new BufferedReader (fr)

} catch (FileNotFoundException e) {

}

}

Public boolean hasNext () {

String line

Try {

Line = this.reader.readLine ()

If (line = = null | | line.isEmpty ()) {

Return false

}

This.cachedLine = line.trim ()

Return true

} catch (IOException e) {

}

Return false

}

Public int next () {

If (this.cachedLine = = null) {

If (! hasNext ()) {

Throw new IllegalStateException ("no content")

}

}

Int num = Integer.parseInt (this.cachedLine)

This.cachedLine = null

Return num

}

@ Override

Public void close () throws IOException {

This.reader.close ()

}

}

Static class Bin implements Comparable {

Int num

MergeSource source

Bin (MergeSource source, int num) {

This.source = source

This.num = num

}

@ Override

Public int compareTo (Bin o) {

Return this.num-o.num

}

}

Public List prepare () {

List bins = new ArrayList ()

For (MergeSource source: sources) {

Bin newBin = newBin (source, source.next ())

Bins.add (newBin)

}

Collections.sort (bins)

Return bins

}

Public void sort () {

List bins = prepare ()

While (true) {

MergeSource current = bins.get (0). Source

If (current.hasNext ()) {

Bin newBin = newBin (current, current.next ())

Int index = Collections.binarySearch (bins, newBin)

If (index = = 0 | | index = =-1) {

This.out.write (newBin)

If (index =-1) {

Throw new IllegalStateException ("impossible")

}

} else {

If (index < 0) {

Index =-index-1

}

Bins.add (index, newBin)

Bin minBin = bins.remove (0)

This.out.write (minBin)

}

} else {

Bin minBin = bins.remove (0)

This.out.write (minBin)

If (bins.isEmpty ()) {

Break

}

}

}

}

@ Override

Public void close () throws IOException {

For (MergeSource source: sources) {

Source.close ()

}

This.out.close ()

}

Public static void main (String [] args) throws IOException {

List inputs = DiskMergeSort.generateFiles (100,10000, 20000)

/ / it takes time to run the algorithm many times.

For (int I = 0; I < 20; iTunes +) {

DiskMergeSort sorter = new DiskMergeSort (inputs, "output.txt")

Long start = System.currentTimeMillis ()

Sorter.sort ()

Long duration = System.currentTimeMillis ()-start

System.out.printf ("% dms\ n", duration)

Sorter.close ()

}

}

}

There is also a small defect in this algorithm, that is, if the number of input files is very large, then the array in memory will be very large, and the operation of inserting and deleting the array must be very time-consuming, so you can consider using TreeSet instead of the array, and readers can try it on their own.

This is the end of this article on "how to realize disk merging and sorting in LevelDB database". I hope the above content can be of some help to you, so that you can learn more knowledge. if you think the article is good, please share it out for more people to see.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report