Shulou (Shulou.com), 05/31 Report. SLTechnology News & Howtos > Servers. Updated 2025-03-29.
In this article, the editor shares the advantages and disadvantages of decision trees in Spark MLlib. I hope you gain something after reading it.
Advantages of decision trees:
They generate rules that can be understood.
The computational cost is relatively small.
They can handle both continuous and categorical fields.
A decision tree clearly shows which fields are most important.
Disadvantages of decision trees:
Continuous-valued fields are difficult to predict.
Time-series data requires a lot of preprocessing.
When there are too many categories, errors may accumulate faster.
When splitting, the basic algorithm considers only one field at a time.
Example: a "go out to play" record table, with 0/1 fields for temperature, wind, rain, and humidity. The code is as follows:

package spark.DT

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Decision tree use case: going out to play.
 *
 * A decision tree is a form of supervised learning: given a set of samples, each with a group of attributes and a predetermined category, learning produces a classifier that can correctly classify newly appearing objects. The principle is to summarize classification rules that meet the requirements from a set of unordered, irregular factors.
 *
 * Basis of the algorithm: information entropy and ID3.
 * Information entropy is a measure of the uncertainty in an event or attribute: the greater the entropy, the greater the uncertainty. When selecting a test attribute, the attribute with the highest information entropy (gain) in the current event is always chosen.
 * ID3 is a greedy algorithm for constructing a decision tree. The rate of decline of information entropy is used as the criterion for testing attributes; that is, at each node the attribute with the highest information gain is chosen as the split criterion, and the process continues until the generated tree perfectly classifies the training examples.
 *
 * Usage scenario: any classification data that conforms to the key-value pattern can be handled by a decision tree. The object the tree predicts is fixed, and a specific path from the root to a leaf node is one classification rule.
 *
 * Created by eric on 16-7-19.
 */
object DT {
  val conf = new SparkConf()
    .setMaster("local")          // run locally
    .setAppName("ZombieBayes")   // application name
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val data = MLUtils.loadLibSVMFile(sc, "./src/main/spark/DT/DTree.txt")
    val numClasses = 2                            // number of classes: go out / do not go out
    val categoricalFeaturesInfo = Map[Int, Int]() // empty map: treat all features as continuous
    val impurity = "entropy"                      // information-gain criterion
    val maxDepth = 5                              // maximum depth of the tree
    val maxBins = 3                               // maximum number of bins when splitting

    val model = DecisionTree.trainClassifier(
      data,                     // input data set
      numClasses,               // number of classes (here only two: go out, do not go out)
      categoricalFeaturesInfo,  // categorical-feature map (simple key-value pairs)
      impurity,                 // information-gain calculation method
      maxDepth,                 // height of the tree
      maxBins)                  // number of splits of the data set

    println(model.topNode)
    println(model.numNodes)     // 5
    println(model.algo)         // Classification
  }
}

DTree.txt is in LibSVM format (a label followed by index:value feature pairs), for example:

1 1:1 2:0 3:0 4:1
...
1 1:1 2:1 3:0 4:0

The results are as follows:
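To make the DTree.txt format concrete, here is a minimal, hypothetical sketch in plain Scala (independent of Spark; the object and method names are illustrative, not from the article) of how one LibSVM line maps to a label and a feature vector. Indices 1-4 stand for the temperature, wind, rain, and humidity fields of the record table above:

```scala
// Hypothetical parser illustrating the LibSVM line format used in DTree.txt.
// A line is: <label> <index>:<value> <index>:<value> ...
object LibSVMLineDemo {
  def parse(line: String, numFeatures: Int): (Int, Array[Double]) = {
    val parts = line.trim.split("\\s+")
    val label = parts.head.toInt
    val features = Array.fill(numFeatures)(0.0) // absent indices default to 0
    parts.tail.foreach { kv =>
      val Array(i, v) = kv.split(":")
      features(i.toInt - 1) = v.toDouble        // LibSVM indices are 1-based
    }
    (label, features)
  }

  def main(args: Array[String]): Unit = {
    val (label, feats) = parse("1 1:1 2:0 3:0 4:1", 4)
    println(label)                 // 1 -> went out to play
    println(feats.mkString(","))   // 1.0,0.0,0.0,1.0
  }
}
```

Note that indices absent from a line (sparse entries) are read back as 0.0, which is why short lines like "0 3:1 4:1" still yield a full four-element feature vector.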
Id = 1, isLeaf = false, predict = 1.0 (prob = 0.6666666666666666), impurity = 0.9182958340544896, split = Some(Feature = 0, threshold = 0.0, featureType = Continuous, categories = List()), stats = Some(gain = 0.31668908831502096, impurity = 0.9182958340544896, left impurity = 0.0, right impurity = 0.72192809483623)
5
Classification
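The impurity value 0.918... printed above is the Shannon entropy of a label set split 2:1 between the two classes (matching prob = 0.666...). As a self-contained sketch of the entropy and information-gain computation described earlier (plain Scala, not the article's Spark code; the tiny data set here is made up for illustration):

```scala
// Illustrative entropy / information-gain computation (ID3-style criterion).
object EntropyDemo {
  // Shannon entropy of a label sequence, in bits.
  def entropy(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values
      .map(_.size / n)
      .map(p => -p * math.log(p) / math.log(2))
      .sum
  }

  // Information gain of splitting `labels` by the parallel binary `feature` column:
  // parent entropy minus the size-weighted entropy of the child partitions.
  def infoGain(labels: Seq[Int], feature: Seq[Int]): Double = {
    val n = labels.size.toDouble
    val children = labels.zip(feature).groupBy(_._2).values
    entropy(labels) - children.map(c => (c.size / n) * entropy(c.map(_._1))).sum
  }

  def main(args: Array[String]): Unit = {
    // 6 records: label = go out (1) / stay in (0); feature = rain (1/0).
    val labels = Seq(1, 1, 1, 0, 0, 1)
    val rain   = Seq(0, 0, 0, 1, 1, 0)
    println(f"entropy = ${entropy(labels)}%.4f")   // 0.9183 for a 4:2 split
    println(f"gain    = ${infoGain(labels, rain)}%.4f")
  }
}
```

Because the rain feature in this toy data perfectly separates the two classes, both child partitions have zero entropy and the gain equals the full parent entropy; ID3 would therefore pick it as the split attribute.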
After reading this article, I believe you have a basic understanding of the advantages and disadvantages of decision trees in Spark MLlib. If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!