Today I will talk about the principles behind Spark MLlib training and how to use it. Many people may not know much about this topic, so the following summary is provided in the hope that you can take something away from this article.
Description
Spark MLlib is the scalable machine learning library provided by Spark. MLlib already contains common learning algorithms and tools such as classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives. The API that MLlib provides falls into the following two categories.
The API in the spark.mllib package operates on RDDs and may be deprecated in the future.
The high-level API in the spark.ml package is used to build machine learning workflows and operates mainly on DataFrames. Many operations (algorithms, feature extraction, feature transformation) can be chained together with a Pipeline so that data flows through them. All ml models expose a unified interface; for example, model training is always fit, whereas mllib has a variety of trainXXX methods for different models. A minimal sketch of such a pipeline follows.
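Here is a minimal sketch of the spark.ml Pipeline workflow described above. The toy data and column names ("label", "text") are illustrative assumptions, not part of the original article.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PipelineSketch").getOrCreate()

    // The ml API operates on DataFrames; toy data with "label" and "text" columns.
    val training = spark.createDataFrame(Seq(
      (1.0, "spark mllib random forest"),
      (0.0, "hadoop map reduce"),
      (1.0, "spark ml pipeline fit"),
      (0.0, "hdfs block replication")
    )).toDF("label", "text")

    // Chain feature extraction, feature transformation and an algorithm with a Pipeline.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Unified interface: every ml estimator is trained with fit,
    // and the fitted model scores data with transform.
    val model = pipeline.fit(training)
    model.transform(training).select("text", "prediction").show()
  }
}
```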
Training principle
Random forest training principle:
Optimization:
Layer-by-layer training: because the data is distributed across machines, repeatedly reading it is inefficient, so the trees are built breadth-first, one level of all trees at a time. For example, to train 10 trees, the first pass builds the root nodes of all trees, the second pass builds all nodes of depth 2, and so on. The number of passes over the data is therefore bounded by the maximum tree depth, communication between machines is greatly reduced, and training efficiency improves.
Sample sampling: when a sample has continuous features, the number of possible values may be effectively unlimited, and storing all of them would take too much space, so Spark draws a sample of the data (at least 10,000 examples) to determine candidate split points.
Feature binning: each discrete feature value (continuous features are discretized first) defines a Split, and the lower and upper bounds [lowSplit, highSplit] form a bin. The default maxBins is 32. For continuous features, discretization produces up to maxBins bins using (approximately) equal-frequency binning; for ordered categorical features, the number of bins is the number of feature values + 1; for unordered categorical features, the number of bins is 2^(M-1) - 1, where M is the number of feature values, as the small sketch below illustrates.
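The following snippet simply restates the bin-counting rules above as code. It is an illustration of the counting rules described in this article, not Spark's internal DecisionTreeMetadata logic.

```scala
object BinCount {
  val maxBins = 32 // Spark's default

  // Continuous feature: discretized into at most maxBins bins.
  def continuousBins: Int = maxBins

  // Ordered categorical feature with m distinct values: m + 1 bins (per the text above).
  def orderedCategoricalBins(m: Int): Int = m + 1

  // Unordered categorical feature with m distinct values: 2^(m-1) - 1 bins.
  def unorderedCategoricalBins(m: Int): Int = (1 << (m - 1)) - 1

  def main(args: Array[String]): Unit = {
    println(s"continuous: $continuousBins")                              // 32
    println(s"ordered categorical, m=4: ${orderedCategoricalBins(4)}")   // 5
    println(s"unordered categorical, m=4: ${unorderedCategoricalBins(4)}") // 7
  }
}
```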
Process:
1. Initialize the model: build numTrees Nodes with the default value emptyNode. These nodes will serve as the root node of each tree during training. Wrap each node with its treeIndex and add it to the queue nodeQueue; all nodes waiting to be split are added to this queue and split in turn, until every node triggers the stopping condition, that is, until the queue is empty in the subsequent while loop.
2. Select nodes to split: loop, taking the nodes to be processed from nodeQueue and placing them into nodesForGroup and treeToNodeToIndexInfo. nodesForGroup is a Map[Int, Array[Node]]; its key is treeIndex and its value is the array of nodes of that tree to be split in this round.
treeToNodeToIndexInfo has type Map[Int, Map[Int, NodeIndexInfo]]; the outer key is treeIndex and the inner key is node.id (this id comes from the first parameter of the Node constructor and is 1 for nodes in the first round); the inner value is a NodeIndexInfo structure.
Compute the best split for each node: statistics are first computed per partition, then the per-partition statistics are accumulated into global statistics, and the best split of each node is found by traversing all of its features.
3. Split the nodes: split each node according to its best split. This includes updating some attributes of the current node and constructing its left and right child nodes, which are then added to nodeQueue (the queue of nodes that still need to be split). This completes the split of the current layer.
4. Training loop: repeat the steps of taking nodes out of the queue and splitting them until every node triggers the stopping condition.
5. A random forest is an ensemble of tree models: the final prediction is obtained by majority vote for classification and by averaging for regression. A minimal training sketch follows.
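Here is a minimal sketch of training a random forest with the knobs discussed above (numTrees, maxDepth, maxBins), using the spark.ml API. The libsvm input path is an illustrative assumption.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.SparkSession

object RandomForestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RandomForestSketch").getOrCreate()

    // Expects the usual "label"/"features" columns, e.g. from a libsvm file.
    val data = spark.read.format("libsvm").load("/path/to/sample_libsvm_data.txt")
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    val rf = new RandomForestClassifier()
      .setNumTrees(10)   // 10 trees, trained layer by layer as described above
      .setMaxDepth(5)    // maximum tree depth, which also bounds the passes over the data
      .setMaxBins(32)    // feature binning, default 32

    // Unified ml interface: training is fit, prediction is transform.
    // For classification the trees vote; a regressor would average instead.
    val model = rf.fit(train)
    model.transform(test).select("label", "prediction").show(5)
  }
}
```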
Gradient descent steps:
1. Broadcast the current model parameters to each data partition (which can be regarded as virtual computing nodes).
2. Each computing node samples its data to obtain a mini batch and computes its gradient; the gradients are then summed across nodes with the treeAggregate operation to obtain the final gradient gradientSum.
3. Update the model weights with gradientSum. A simplified sketch of this broadcast/treeAggregate loop follows.
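The sketch below illustrates the broadcast + sample + treeAggregate pattern described in the three steps above, using squared-loss linear regression on toy data. It is a simplified illustration of the pattern, not MLlib's actual GradientDescent implementation; the step size, data, and object names are assumptions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object GradientDescentSketch {
  // One labeled example: (label, feature vector).
  type Example = (Double, Array[Double])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GradientDescentSketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy data for y = 2*x0 + 1 (purely illustrative).
    val data: RDD[Example] = sc.parallelize(
      (1 to 1000).map(i => (2.0 * i + 1.0, Array(i.toDouble, 1.0))))

    var weights = Array(0.0, 0.0)
    val stepSize = 1e-7
    val miniBatchFraction = 0.1

    for (iter <- 1 to 20) {
      // Step 1: broadcast the current weights to every partition.
      val bcWeights = sc.broadcast(weights)

      // Step 2: each partition samples a mini batch and computes local gradients;
      // treeAggregate then sums them layer by layer into gradientSum.
      val (gradientSum, count) = data
        .sample(withReplacement = false, fraction = miniBatchFraction, seed = iter.toLong)
        .treeAggregate((Array.fill(weights.length)(0.0), 0L))(
          { case ((grad, n), (y, x)) =>
            val pred = x.zip(bcWeights.value).map { case (xi, wi) => xi * wi }.sum
            val err = pred - y
            (grad.zip(x).map { case (g, xi) => g + err * xi }, n + 1)
          },
          { case ((g1, n1), (g2, n2)) =>
            (g1.zip(g2).map { case (a, b) => a + b }, n1 + n2)
          })

      // Step 3: update the weights with the averaged gradient.
      if (count > 0) {
        weights = weights.zip(gradientSum).map { case (w, g) => w - stepSize * g / count }
      }
      bcWeights.destroy()
    }

    println("learned weights: " + weights.mkString("[", ", ", "]"))
  }
}
```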
Disadvantages:
1. All model parameters are broadcast globally before each iteration. Spark's broadcast mechanism consumes a great deal of bandwidth, and when the model has many parameters, broadcasting them and keeping a copy of the weights on every node is extremely resource-intensive. This is why Spark performs poorly on complex models.
2. Gradient descent is blocking, so each round is bounded by the slowest node. From the analysis above, Spark MLlib's mini-batch step waits for all nodes to finish computing their gradients before aggregating them layer by layer into a global gradient. If one node takes too long to compute its gradient, for example because of data skew, all other nodes are prevented from starting new work. This synchronous, blocking style of distributed gradient computation is the main reason for Spark MLlib's low parallel training efficiency.
3. Spark MLlib supports neither complex network structures nor a large number of tunable hyperparameters. Its standard library only supports training a standard multilayer perceptron; it does not support complex structures such as RNNs or LSTMs, and it offers little choice of hyperparameters such as different activation functions. As a result, Spark MLlib's support for deep learning is weak.
After reading the above, do you have a better understanding of the principles and usage of Spark MLlib training? Thank you for reading.