Do you know the difference between ml and mllib in Spark?


2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Spark is an important part of learning big data, but it covers a lot of ground, and many people end up confused. One of the most confusing points is the difference between ml and mllib, so let's look at the difference between the two in detail.


First of all, about Spark ML

1. Definition: Spark ML is Spark's DataFrame-based machine learning API. The object it operates on is the DataFrame.

2. The main operations are on DataFrames. A DataFrame is a special case of Dataset, namely Dataset[Row]. Dataset is an encapsulation of RDD that adds many optimizations for SQL-like operations.

Secondly, about Spark MLlib

1. Definition: MLlib is the machine learning (ML) library of Spark. Its goal is to make practical machine learning scalable and easy. At a high level, it provides the following tools:

a. ML algorithms: commonly used learning algorithms, such as classification, regression, clustering and collaborative filtering
b. Featurization: feature extraction, transformation, dimensionality reduction and selection
c. Pipelines: tools for building, evaluating and tuning ML pipelines
d. Persistence: saving and loading algorithms, models and pipelines
e. Utilities: linear algebra, statistics, data handling, etc.

2. The object it operates on: RDD. As of Spark 2.0, the RDD-based API in the spark.mllib package has entered maintenance mode: bugs are fixed, but no new features are added. Spark's primary machine learning API is now the DataFrame-based API in the spark.ml package.

Finally, the differences between the two are summarized.

1. Programming process

(1) The process of building a machine learning algorithm is different: ml advocates the use of pipelines, treating data as water that flows in at one end of the pipe and out at the other.

(2) General concept:

DataFrame => Pipeline => a new DataFrame
Pipeline: a data processing workflow made up of several Transformers and Estimators chained together
Transformer: in: DataFrame => out: DataFrame
Estimator: in: DataFrame => out: Transformer

2. Algorithm interface

(1) The algorithm interfaces in spark.mllib are based on RDDs;
(2) The algorithm interfaces in spark.ml are based on DataFrames. ml is recommended in practical use.
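The Transformer and Estimator contracts described above can be sketched as a toy model in plain Python. This is not Spark code: a list of dicts stands in for a DataFrame, and WordSplitter, VocabularyIndexer and VocabularyModel are hypothetical classes invented for illustration. Only the fit/transform shape is meant to mirror spark.ml.

```python
# Toy stand-in for a DataFrame: a list of row dicts (an assumption for this sketch).

class Transformer:
    """in: DataFrame => out: DataFrame"""
    def transform(self, frame):
        raise NotImplementedError


class Estimator:
    """in: DataFrame => out: Transformer"""
    def fit(self, frame):
        raise NotImplementedError


class WordSplitter(Transformer):
    """Stateless step (hypothetical): adds a 'words' column split from 'text'."""
    def transform(self, frame):
        return [{**row, "words": row["text"].lower().split()} for row in frame]


class VocabularyModel(Transformer):
    """Fitted step (hypothetical): maps known words to the indices learned in fit."""
    def __init__(self, index):
        self.index = index

    def transform(self, frame):
        return [
            {**row, "ids": [self.index[w] for w in row["words"] if w in self.index]}
            for row in frame
        ]


class VocabularyIndexer(Estimator):
    """Learns a word -> index mapping from the data and returns a Transformer."""
    def fit(self, frame):
        words = sorted({w for row in frame for w in row["words"]})
        return VocabularyModel({w: i for i, w in enumerate(words)})


class PipelineModel(Transformer):
    """A fitted pipeline: every stage is now a Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, frame):
        for stage in self.stages:
            frame = stage.transform(frame)
        return frame


class Pipeline(Estimator):
    """Chains stages; fitting turns each Estimator into a Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, frame):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(frame)    # Estimator => Transformer
            frame = stage.transform(frame)  # data flows on to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [{"text": "spark ml pipeline"}, {"text": "spark mllib rdd"}]
model = Pipeline([WordSplitter(), VocabularyIndexer()]).fit(train)
result = model.transform([{"text": "Spark ML on RDD"}])
print(result[0]["ids"])  # [4, 0, 3]; "on" was never seen in training and is dropped
```

Note that the pipeline itself is an Estimator and the fitted pipeline is a Transformer, which is why a whole workflow can be trained with one fit call and applied with one transform call.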
The DataFrame-based algorithms in ml are better suited to building an ML pipeline that covers a whole series of tasks, from data cleaning to feature engineering to model training. Take naive Bayes as an example: naiveBayes.fit(dataset: Dataset[_]): NaiveBayesModel trains the model and returns a NaiveBayesModel. naiveBayesModel.transform(dataset: Dataset[_]): DataFrame then runs the model on test data, and the result can be evaluated by other means. Using the model works the same way: transform does the prediction, and select extracts the predicted values, with columns written in the form $"label". This is similar to SQL, easy to use, and has a low barrier to entry.

3. Degree of abstraction

(1) mllib is mainly based on RDDs, and its level of abstraction is not high enough;
(2) ml mainly abstracts the data processing pipeline. An algorithm is just one component of the pipeline and can be swapped for another at will, so the algorithm is decoupled from the other data processing steps.

4. From a technical point of view, the type of dataset is different.

(1) The ml API is oriented to Dataset.
(2) mllib is oriented to RDD.

What is the difference between Dataset and RDD? Dataset is built on top of RDD and optimizes it further, for example with SQL-style query optimization. Dataset also supports static type analysis, so errors can be reported at compile time, and combinators such as map and foreach tend to perform better.

The RDD-based mllib API was at one point expected to be removed in Spark 3.0. It is still shipped, in maintenance mode, while new development goes into ml. Because the object that ml operates on is the DataFrame, it is much easier to work with than RDD. Newcomers to Spark are therefore advised to use ml directly.

The knowledge points in big data need to be understood and applied in detail, because a mistake in one place can affect the whole. Learners should therefore lay a solid foundation in order to master big data better.
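To make the naive Bayes workflow from the algorithm interface discussion concrete (fit returns a model, transform adds predictions, select picks columns), here is a toy sketch in plain Python. It is not Spark's implementation: the NaiveBayes and NaiveBayesModel classes below are simplified stand-ins written for this example, the select helper is hypothetical, a list of dicts plays the role of a DataFrame, and the tiny data set is made up.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesModel:
    """Fitted model: transform() returns new rows with a 'prediction' column."""

    def __init__(self, log_prior, log_likelihood, vocabulary):
        self.log_prior = log_prior
        self.log_likelihood = log_likelihood
        self.vocabulary = vocabulary

    def _predict(self, words):
        # Score each label by log prior plus log likelihood of the known words.
        scores = {}
        for label, prior in self.log_prior.items():
            known = [w for w in words if w in self.vocabulary]
            scores[label] = prior + sum(self.log_likelihood[label][w] for w in known)
        return max(scores, key=scores.get)

    def transform(self, rows):
        return [{**row, "prediction": self._predict(row["words"])} for row in rows]


class NaiveBayes:
    """Toy multinomial naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing

    def fit(self, rows):
        label_counts = Counter(row["label"] for row in rows)
        word_counts = defaultdict(Counter)
        for row in rows:
            word_counts[row["label"]].update(row["words"])
        vocabulary = {w for counts in word_counts.values() for w in counts}

        log_prior = {l: math.log(n / len(rows)) for l, n in label_counts.items()}
        log_likelihood = {}
        for label in label_counts:
            total = sum(word_counts[label].values()) + self.smoothing * len(vocabulary)
            log_likelihood[label] = {
                w: math.log((word_counts[label][w] + self.smoothing) / total)
                for w in vocabulary
            }
        return NaiveBayesModel(log_prior, log_likelihood, vocabulary)


def select(rows, *columns):
    """Hypothetical helper: keep only the named columns, like a SQL SELECT."""
    return [{c: row[c] for c in columns} for row in rows]


# Made-up training and held-out data.
train = [
    {"label": "spam", "words": ["win", "money", "now"]},
    {"label": "spam", "words": ["win", "prize"]},
    {"label": "ham", "words": ["meeting", "at", "noon"]},
    {"label": "ham", "words": ["lunch", "at", "noon"]},
]
held_out = [
    {"label": "spam", "words": ["win", "money"]},
    {"label": "ham", "words": ["meeting", "noon"]},
]

model = NaiveBayes().fit(train)                      # Estimator.fit => model
predictions = select(model.transform(held_out), "label", "prediction")
accuracy = sum(r["label"] == r["prediction"] for r in predictions) / len(predictions)
print(predictions, accuracy)
```

Because the model is only reached through fit and transform, a different classifier with the same two methods could be dropped in without touching the surrounding code, which is the low coupling that point 3 describes.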

