A Performance Comparison of Random Forest Implementations: scikit-learn, Spark MLlib, DolphinDB, and xgboost
Random forest is a commonly used machine learning algorithm that can be applied to both classification and regression problems. This article compares the random forest implementations of four platforms: scikit-learn, Spark MLlib, DolphinDB, and xgboost. The evaluation metrics are memory footprint, running speed, and classification accuracy. Simulated data is used as the input for binary classification training, and the resulting model is then used to predict simulated test data.
1. Testing software
The platform versions used in this test are as follows:
scikit-learn: Python 3.7.1, scikit-learn 0.20.2
Spark MLlib: Spark 2.0.2, Hadoop 2.7.2
DolphinDB: 0.82
xgboost: Python package, 0.81
2. Environment configuration
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (24 cores, 48 threads in total)
RAM: 512GB
Operating system: CentOS Linux release 7.5.1804
On each platform, the data is loaded into memory before computation, so the performance of the random forest algorithm is independent of disk I/O.
3. Data generation
This test uses a DolphinDB script to generate the simulated data and export it to CSV files. The training set is evenly divided into two classes; the feature columns of the two classes follow two independent multivariate normal distributions, N(0, 1) and N(2/sqrt(20), 1), with the same standard deviation but different centers. There are no missing values in the training set.
Suppose the training set has n rows and p columns. In this test, n takes the values 10,000, 100,000, and 1,000,000, and p is 50.
Because the test set and the training set are independently and identically distributed, the size of the test set has no significant effect on the measured accuracy. 1,000 rows of simulated data are used as the test set for all training set sizes.
The DolphinDB script that produces the simulation data is shown in Appendix 1.
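For readers who want to reproduce the scheme without DolphinDB, here is a rough NumPy equivalent. It is an illustrative sketch modeled on Appendix 1, not the script used in the test:

import numpy as np
import pandas as pd

# Two balanced classes, 50 features each, drawn from N(0, 1) and
# N(2/sqrt(20), 1) respectively (same spread, shifted centers).
def gen_norm_data(n_rows, n_cols=50, scale=2 / np.sqrt(20), stdev=1.0, seed=0):
    rng = np.random.default_rng(seed)
    cls = rng.integers(0, 2, size=n_rows)                  # class label 0 or 1
    x = rng.normal(loc=cls[:, None] * scale, scale=stdev,
                   size=(n_rows, n_cols))                  # center shifted by class
    df = pd.DataFrame(x, columns=[f"col{i}" for i in range(n_cols)])
    df.insert(0, "cls", cls)
    return df

gen_norm_data(10_000).to_csv("t10k.csv", index=False)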
4. Model parameters
The following parameters are used to train the random forest model on each platform:
Number of trees: 500
Maximum depth: two cases, with maximum depths of 10 and 30, were tested on all four platforms.
Number of features selected when splitting a node: the square root of the total number of features, i.e. int(sqrt(50)) = 7.
Impurity measure when splitting a node: Gini index. This parameter is only relevant for scikit-learn, Spark MLlib, and DolphinDB.
Number of bins: 32. This parameter is only relevant for Spark MLlib and DolphinDB.
Number of concurrent tasks: the number of CPU threads; 48 for scikit-learn, Spark MLlib, and DolphinDB, and 24 for xgboost (see the discussion below). The sketch after this list expresses these settings as scikit-learn arguments.
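For concreteness, the shared settings above map onto scikit-learn's RandomForestClassifier as follows. Spark MLlib and DolphinDB expose equivalent knobs under different names; this is a sketch of the configuration, not the full test script:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,     # number of trees
    max_depth=10,         # or 30 in the second test case
    max_features="sqrt",  # int(sqrt(50)) = 7 features per split
    criterion="gini",     # impurity measure
    n_jobs=48,            # number of concurrent tasks
)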
When testing xgboost, we tried different values of the nthread parameter, which controls the number of concurrent threads at run time. However, when this parameter was set to the number of threads in the test environment (48), performance was not ideal. Further observation showed that when the thread count was below 10, performance was positively correlated with its value; between 10 and 24 threads, the differences between values were not significant; beyond that, performance decreased as the thread count increased. This phenomenon has also been discussed in the xgboost community. The final thread count used for xgboost in this test is therefore 24.
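A minimal sketch of how such an nthread sweep can be run; the synthetic data and the grid of thread counts here are illustrative assumptions:

import time
import numpy as np
import xgboost as xgb

# Time one boosting round of a 500-tree forest at several thread counts.
x = np.random.rand(100_000, 50)
y = (np.random.rand(100_000) > 0.5).astype(int)
dtrain = xgb.DMatrix(x, label=y)

for nthread in (1, 4, 8, 10, 16, 24, 48):
    param = {'objective': 'binary:logistic', 'max_depth': 10,
             'num_parallel_tree': 500, 'nthread': nthread}
    start = time.time()
    xgb.train(param, dtrain, num_boost_round=1)
    print("nthread=%d: %.3f seconds" % (nthread, time.time() - start))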
5. Test results
The test scripts are listed in Appendices 2 through 4.
When the number of trees is 500 and the maximum depth is 10, the test results are shown in the following table:
When the number of trees is 500 and the maximum depth is 30, the test results are shown in the following table:
In terms of accuracy, scikit-learn, Spark MLlib, and DolphinDB are very close, and all slightly higher than xgboost. In terms of speed, the ranking from fastest to slowest is DolphinDB, scikit-learn, xgboost, and Spark MLlib.
In this test, the scikit-learn implementation used all CPU cores.
The Spark MLlib implementation does not make full use of all CPU cores and has the highest memory consumption: with 10,000 rows of data its peak CPU utilization is about 8%, and with 100,000 rows it is about 25%; with 1,000,000 rows, execution is interrupted due to insufficient memory.
The DolphinDB implementation uses all CPU cores and is the fastest of all implementations, but its memory footprint is 2-7 times that of scikit-learn and 3-9 times that of xgboost. DolphinDB's random forest algorithm provides a numJobs parameter, which can be lowered to reduce parallelism and thereby reduce the memory footprint. Please refer to the DolphinDB user manual for details.
xgboost is mostly used for training boosted trees, but it can also implement the random forest algorithm as a special case in which the number of boosting iterations is 1. xgboost reaches its best performance at around 24 threads; its utilization of CPU threads is not as good as that of scikit-learn or DolphinDB, and it is not faster than either. Its advantage is that it occupies the least memory. In addition, xgboost's implementation differs from the other platforms in some specifics: for example, there is no bootstrap process, so data is sampled without replacement rather than with replacement. This may explain why its accuracy is slightly lower than that of the other platforms.
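The sampling difference is easy to see in a toy sketch (illustrative only; this is not xgboost code):

import numpy as np

rng = np.random.default_rng(0)
n = 10

# Bootstrap (with replacement): indices may repeat, and on average
# about 36.8% of the rows are left out of any given sample.
bootstrap_idx = rng.integers(0, n, size=n)

# Sampling without replacement (xgboost-style subsample): a strict
# subset of the rows, with no duplicates.
subsample_idx = rng.choice(n, size=8, replace=False)

print(sorted(bootstrap_idx), sorted(subsample_idx))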
6. Summary
The random forest implementation in scikit-learn is well balanced in speed, memory overhead, and accuracy, while the Spark MLlib implementation is far slower than all the other platforms. DolphinDB's random forest achieves the best speed, and its implementation is seamlessly integrated with the database: users can train and predict directly on data stored in the database, and the numJobs parameter lets them trade speed for memory. xgboost's random forest is only a special case with a single boosting iteration, and its implementation differs substantially from the other platforms; its best application scenario remains boosted trees.
Appendix
1. DolphinDB script that generates the simulated data
def genNormVec(cls, a, stdev, n) {
	return norm(cls * a, stdev, n)
}

def genNormData(dataSize, colSize, clsNum, scale, stdev) {
	t = table(dataSize:0, `cls join ("col" + string(0..(colSize-1))), INT join take(DOUBLE, colSize))
	classStat = groupby(count, 1..dataSize, rand(clsNum, dataSize))
	for (row in classStat) {
		cls = row.groupingKey
		classSize = row.count
		cols = [take(cls, classSize)]
		for (i in 0:colSize)
			cols.append!(genNormVec(cls, scale, stdev, classSize))
		tmp = table(dataSize:0, `cls join ("col" + string(0..(colSize-1))), INT join take(DOUBLE, colSize))
		insert into t values (cols)
		cols = NULL
		tmp = NULL
	}
	return t
}

colSize = 50
clsNum = 2
t10k = genNormData(10000, colSize, clsNum, 2 / sqrt(20), 1.0)
saveText(t10k, "t10k.csv")
t100k = genNormData(100000, colSize, clsNum, 2 / sqrt(20), 1.0)
saveText(t100k, "t100k.csv")
t1m = genNormData(1000000, colSize, clsNum, 2 / sqrt(20), 1.0)
saveText(t1m, "t1m.csv")
t1000 = genNormData(1000, colSize, clsNum, 2 / sqrt(20), 1.0)
saveText(t1000, "t1000.csv")
2. Training and prediction script for Python scikit-learn
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from time import time

test_df = pd.read_csv("t1000.csv")

def evaluate(path, model_name, num_trees=500, depth=30, num_jobs=1):
    df = pd.read_csv(path)
    y = df.values[:, 0]
    x = df.values[:, 1:]
    test_y = test_df.values[:, 0]
    test_x = test_df.values[:, 1:]
    rf = RandomForestClassifier(n_estimators=num_trees, max_depth=depth, n_jobs=num_jobs)
    start = time()
    rf.fit(x, y)
    elapsed = time() - start
    print("Time to train model %s: %.9f seconds" % (model_name, elapsed))
    acc = np.mean(test_y == rf.predict(test_x))
    print("Model %s accuracy: %.3f" % (model_name, acc))

evaluate("t10k.csv", "10k", 500, 10, 48)  # choose your own parameters
3. Spark MLlib training and prediction code (Scala implementation)
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.FeatureType.Continuous
import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}

object Rf {
  def main(args: Array[String]): Unit = {
    evaluate("/t100k.csv", 500, 10)  // choose your own parameters
  }

  // Parse one CSV row (label in column 0, 50 features after it) into a LabeledPoint.
  def processCsv(row: Row): LabeledPoint = {
    val label = row.getString(0).toDouble
    val featureArray = (for (i <- 1 to 50) yield row.getString(i).toDouble).toArray
    LabeledPoint(label, Vectors.dense(featureArray))
  }

  def evaluate(path: String, numTrees: Int, maxDepth: Int): Unit = {
    val spark = SparkSession.builder.appName("Rf").getOrCreate()
    val train = spark.read.option("header", "true").csv(path).rdd.map(processCsv).cache()
    val test = spark.read.option("header", "true").csv("/t1000.csv").rdd.map(processCsv).cache()

    val start = System.nanoTime()
    val model = RandomForest.trainClassifier(train, numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](), numTrees = numTrees,
      featureSubsetStrategy = "sqrt", impurity = "gini",
      maxDepth = maxDepth, maxBins = 32, seed = 0)
    println("Time to train model: %.9f seconds".format((System.nanoTime() - start) / 1e9))

    // Majority vote over all trees: predict class 1 when more than half of the trees vote for it.
    val scoreAndLabels = test.map { point =>
      val score = model.trees.map(tree => softPredict2(tree, point.features)).sum
      if (score * 2 > model.numTrees) (1.0, point.label) else (0.0, point.label)
    }
    val metrics = new MulticlassMetrics(scoreAndLabels)
    println(metrics.accuracy)
  }

  def softPredict(node: Node, features: Vector): Double = {
    if (node.isLeaf) {
      // if (node.predict.predict == 1.0) node.predict.prob else 1.0 - node.predict.prob
      node.predict.predict
    } else {
      if (node.split.get.featureType == Continuous) {
        if (features(node.split.get.feature) <= node.split.get.threshold)
          softPredict(node.leftNode.get, features)
        else
          softPredict(node.rightNode.get, features)
      } else {
        if (node.split.get.categories.contains(features(node.split.get.feature)))
          softPredict(node.leftNode.get, features)
        else
          softPredict(node.rightNode.get, features)
      }
    }
  }

  def softPredict2(dt: DecisionTreeModel, features: Vector): Double = {
    softPredict(dt.topNode, features)
  }
}

4. Training and prediction script for xgboost (Python)
prediction = model.predict(dtest) > 0.5
print("Accuracy = %.3f" % np.mean(prediction == dtest.get_label()))
evaluate('t10k.csv', 500, 10, 24)  # choose your own parameters
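The script above is only a fragment. A minimal end-to-end sketch consistent with it and with the parameters in Section 4 treats the random forest as a single boosting round of num_parallel_tree trees; the subsample and colsample_bylevel values below are assumptions, not taken from the article:

import numpy as np
import pandas as pd
import xgboost as xgb
from time import time

test_df = pd.read_csv("t1000.csv")

def evaluate(path, num_trees, max_depth, num_jobs):
    df = pd.read_csv(path)
    dtrain = xgb.DMatrix(df.values[:, 1:], label=df.values[:, 0])
    dtest = xgb.DMatrix(test_df.values[:, 1:], label=test_df.values[:, 0])
    param = {
        'objective': 'binary:logistic',
        'eta': 1,                        # single round, so no shrinkage
        'max_depth': max_depth,
        'num_parallel_tree': num_trees,  # grow the whole forest in one round
        'subsample': 0.8,                # assumed; sampled without replacement
        'colsample_bylevel': 7 / 50,     # assumed; about sqrt(50) features per split
        'nthread': num_jobs,
    }
    start = time()
    model = xgb.train(param, dtrain, num_boost_round=1)
    print("Time to train model: %.9f seconds" % (time() - start))
    prediction = model.predict(dtest) > 0.5
    print("Accuracy = %.3f" % np.mean(prediction == dtest.get_label()))

evaluate('t10k.csv', 500, 10, 24)  # choose your own parameters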