Principle analysis of the Spark 2.2 machine learning library MLlib

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article walks through the fundamentals of MLlib, the machine learning library in Spark 2.2: what it provides, how its two APIs relate to each other, what it depends on, and what changed in the 2.2 release.

Machine learning library (MLlib)

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides the following tools:

ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering

Featurization: feature extraction, transformation, dimensionality reduction, and selection

Pipelines: tools for constructing, evaluating, and tuning ML Pipelines

Persistence: saving and loading algorithms, models, and Pipelines

Utilities: linear algebra, statistics, data handling, etc.
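
The tools listed above can be combined in a single program. The following is a minimal sketch, not a definitive implementation: it assumes a local Spark 2.2 installation, and the toy data, the object name PipelineSketch, and the save path are made up for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally; on a cluster the master is normally set by spark-submit.
    val spark = SparkSession.builder()
      .appName("PipelineSketch")
      .master("local[*]")
      .getOrCreate()

    // Made-up training data: (id, text, label).
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Featurization: split text into words, then hash words into a feature vector.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol("words")
      .setOutputCol("features")

    // ML algorithm: a logistic regression classifier.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

    // Pipeline: chain the stages and fit them as one unit.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)

    // Persistence: save the fitted pipeline and load it back.
    val path = "/tmp/spark-lr-pipeline-model" // hypothetical location
    model.write.overwrite().save(path)
    val reloaded = PipelineModel.load(path)

    // Score unseen (made-up) documents with the reloaded model.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "mapreduce spark")
    )).toDF("id", "text")
    reloaded.transform(test).select("id", "text", "prediction").show()

    spark.stop()
  }
}
```

Because the whole pipeline is saved as one unit, the scoring side does not need to repeat the tokenizing and hashing configuration.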

Announcement: the DataFrame-based API is the primary API

The MLlib RDD-based API is now in maintenance mode.

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package.

What's the impact?

MLlib will still support the RDD-based API in spark.mllib with bug fixes.

MLlib will not add new features to the RDD-based API.

In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.

After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.

The RDD-based API is expected to be removed in Spark 3.0.
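
To make the difference between the two APIs concrete, the sketch below trains a logistic regression model twice on the same made-up points, once through spark.mllib and once through spark.ml. It is an illustration under assumptions (local Spark 2.2, toy data, the hypothetical object name TwoApisSketch), not a migration recipe.

```scala
import org.apache.spark.ml.classification.{LogisticRegression => MLLogisticRegression}
import org.apache.spark.ml.linalg.{Vectors => MLVectors}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

object TwoApisSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally.
    val spark = SparkSession.builder()
      .appName("TwoApisSketch")
      .master("local[*]")
      .getOrCreate()

    // Made-up data: (label, feature values).
    val points = Seq(
      (1.0, Array(0.0, 1.1, 0.1)),
      (0.0, Array(2.0, 1.0, -1.0)),
      (0.0, Array(2.0, 1.3, 1.0)),
      (1.0, Array(0.0, 1.2, -0.5))
    )

    // RDD-based API (spark.mllib, maintenance mode): input is an RDD[LabeledPoint].
    val rdd = spark.sparkContext.parallelize(
      points.map { case (label, xs) => LabeledPoint(label, MLlibVectors.dense(xs)) })
    val rddModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(rdd)
    println(s"spark.mllib weights: ${rddModel.weights}")

    // DataFrame-based API (spark.ml, primary): input is a DataFrame
    // with "label" and "features" columns.
    val df = spark.createDataFrame(
      points.map { case (label, xs) => (label, MLVectors.dense(xs)) }
    ).toDF("label", "features")
    val dfModel = new MLLogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
      .fit(df)
    println(s"spark.ml coefficients: ${dfModel.coefficients}")

    spark.stop()
  }
}
```

Note that the two packages have separate vector types (org.apache.spark.mllib.linalg and org.apache.spark.ml.linalg), which is why the imports are renamed here. The two models use different optimizer settings, so their coefficients are not expected to match exactly.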

Why did MLlib switch to DataFrame-based API?

DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.

The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.

DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.

What is "Spark ML"?

"Spark ML" is not an official name, but it is occasionally used to refer to the MLlib DataFrame-based API. This is mainly due to the org.apache.spark.ml Scala package name used by the DataFrame-based API, and the "Spark ML Pipelines" term that was used initially to emphasize the pipeline concept.

Has MLlib been deprecated?

No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode, but neither API is deprecated, nor is MLlib as a whole.

Dependencies

MLlib uses the linear algebra package Breeze, which depends on netlib-java for optimized numerical processing. If native libraries are not available at runtime, you will see a warning message and a pure JVM implementation will be used instead.

Due to licensing issues with runtime proprietary binaries, netlib-java's native proxies are not included by default. To configure netlib-java / Breeze to use system-optimized binaries, include com.github.fommil.netlib:all:1.1.2 (or build Spark with -Pnetlib-lgpl) as a dependency of your project, and read the netlib-java documentation for your platform's additional installation instructions.
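
For a project built with sbt, the dependency above could be declared roughly as follows. This is a build-file fragment rather than a complete build, and it assumes sbt is the build tool; the pomOnly() form follows the netlib-java README. Maven or Gradle users would declare the same coordinates in their own syntax.

```scala
// build.sbt (fragment): pull in netlib-java's native proxies.
// The "all" artifact is a POM-only aggregate, hence pomOnly().
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
```

Adding the dependency is not sufficient on its own: optimized BLAS/LAPACK libraries still have to be installed on every node, as described in the netlib-java documentation.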

To use MLlib in Python, you will need NumPy version 1.4 or newer.

Highlights in 2.2

The following list highlights some of the new features and enhancements added to MLlib in Spark 2.2:

ALS methods for top-k recommendations for all users or items, matching the functionality in mllib (SPARK-19535). Performance was also improved for both ml and mllib (SPARK-11968 and SPARK-20587)

Correlation and ChiSquareTest statistics functions for DataFrames (SPARK-19636 and SPARK-19635)

FPGrowth algorithm for frequent pattern mining (SPARK-14503)

GLM now supports the full Tweedie family (SPARK-18929)

Imputer feature transformer to impute missing values in a dataset (SPARK-13568)

LinearSVC for linear Support Vector Machine classification (SPARK-14709)

Logistic regression now supports constraints on coefficients during training (SPARK-20047)
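
Three of the additions listed above (Imputer, Correlation, and LinearSVC) can be tried together in a few lines. The sketch below assumes a local Spark 2.2 installation; the data, the column names, and the object name Spark22HighlightsSketch are made up for illustration.

```scala
import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.feature.{Imputer, VectorAssembler}
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

object Spark22HighlightsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally.
    val spark = SparkSession.builder()
      .appName("Spark22HighlightsSketch")
      .master("local[*]")
      .getOrCreate()

    // Made-up data: (label, a, b). Double.NaN marks a missing value.
    val raw = spark.createDataFrame(Seq(
      (1.0, 1.0, 5.0),
      (1.0, 2.0, Double.NaN),
      (0.0, Double.NaN, 1.0),
      (0.0, 4.0, 2.0),
      (1.0, 1.5, 6.0),
      (0.0, 5.0, 0.5)
    )).toDF("label", "a", "b")

    // Imputer (SPARK-13568): fill missing values, by default with the column mean.
    val imputer = new Imputer()
      .setInputCols(Array("a", "b"))
      .setOutputCols(Array("a_filled", "b_filled"))
    val filled = imputer.fit(raw).transform(raw)

    // Assemble the filled columns into the single vector column ML algorithms expect.
    val assembled = new VectorAssembler()
      .setInputCols(Array("a_filled", "b_filled"))
      .setOutputCol("features")
      .transform(filled)

    // Correlation (SPARK-19636): Pearson correlation matrix of the vector column.
    val Row(pearson: Matrix) = Correlation.corr(assembled, "features").head
    println(s"Pearson correlation matrix:\n$pearson")

    // LinearSVC (SPARK-14709): linear support vector machine classifier.
    val svcModel = new LinearSVC()
      .setMaxIter(10)
      .setRegParam(0.1)
      .fit(assembled)
    println(s"Coefficients: ${svcModel.coefficients} Intercept: ${svcModel.intercept}")

    spark.stop()
  }
}
```

In a real project the Imputer, VectorAssembler, and LinearSVC stages would normally be placed in a Pipeline so that the same imputation is applied at training and scoring time.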

Migration Guide

MLlib is under active development. APIs marked Experimental/DeveloperApi may change in future releases, and the migration guide below explains all changes between releases.

Breaking changes from 2.1 to 2.2

There are no breaking changes.

Deprecations and changes of behavior

Deprecations

There are no deprecations.

Changes of behavior

SPARK-19787: Default value of regParam changed from 1.0 to 0.1 for ALS.train method (marked DeveloperApi). Note this does not affect the ALS Estimator or Model, nor MLlib's ALS class.

SPARK-14772: Fixed inconsistency between Python and Scala APIs for Param.copy method.

SPARK-11569: StringIndexer now handles NULL values in the same way as unseen values. Previously an exception would always be thrown regardless of the setting of the handleInvalid parameter.
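
The StringIndexer change can be observed with a small experiment. This sketch assumes a local Spark 2.2 installation; the data and the object name StringIndexerNullSketch are made up. The indexer is fitted on clean data and then applied to rows containing a NULL and a label it has never seen.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object StringIndexerNullSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally.
    val spark = SparkSession.builder()
      .appName("StringIndexerNullSketch")
      .master("local[*]")
      .getOrCreate()

    // Made-up training data with three known labels.
    val train = spark.createDataFrame(Seq[(Int, String)](
      (0, "a"), (1, "b"), (2, "a"), (3, "c")
    )).toDF("id", "category")

    // Made-up scoring data: one known label, one NULL, one unseen label.
    val score = spark.createDataFrame(Seq[(Int, String)](
      (10, "a"), (11, null), (12, "d")
    )).toDF("id", "category")

    // With handleInvalid = "skip", rows with invalid values are filtered out.
    // Per SPARK-11569, the NULL row should be treated like the unseen "d" row
    // instead of raising an exception.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .setHandleInvalid("skip")

    indexer.fit(train).transform(score).show()

    spark.stop()
  }
}
```

Changing the setting to "error" (the default) should make the transform fail on such rows, which is a quick way to check how a given Spark version behaves.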

That concludes this overview of the Spark 2.2 machine learning library MLlib: its tools, its two APIs, its dependencies, and the changes introduced in release 2.2.
