
What are the Spark MLlib data types

2025-04-04 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31 Report --

This article mainly discusses the data types in Spark MLlib. The concepts involved are simple and practical, so interested readers may wish to follow along.

MLlib is Spark's machine learning library; its goal is to make machine learning algorithms easier to use and to scale. MLlib includes classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and a higher-level pipeline API. MLlib is divided into two packages: spark.mllib contains the original API built on top of RDDs, while spark.ml provides a higher-level machine learning pipeline API built on top of DataFrames. spark.ml is recommended, because DataFrames make the API more general and flexible.

MLlib data types

MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. The underlying linear algebra operations are provided by the Breeze and jblas libraries. In supervised learning, a training example is called a labeled point.

Local vector: elements are of type double, and vector indices are 0-based integers of type int. Both dense and sparse vectors are supported. A dense vector stores all of its values in a double array; a sparse vector is backed by two parallel arrays holding the indices and the values of its non-zero entries, respectively. Class hierarchy: Vector -> (DenseVector, SparseVector)
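The dense/sparse distinction can be illustrated with a small plain-Python sketch (a hypothetical illustration of the storage layout, not the MLlib `Vectors` API): a sparse vector keeps only its non-zero entries in two parallel arrays, and can be expanded back into the equivalent dense array.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse vector (parallel index/value arrays) into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# A 4-element vector with non-zeros at positions 0 and 2,
# equivalent to the dense form [1.0, 0.0, 3.0, 0.0].
dense = sparse_to_dense(4, [0, 2], [1.0, 3.0])
```

In MLlib itself the two forms are constructed via `Vectors.dense(...)` and `Vectors.sparse(size, indices, values)`; the sparse form pays off when most entries are zero.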

LabeledPoint: consists of a local vector (dense or sparse) and a label for that vector. Labels are stored as double, so LabeledPoint can be used in both classification and regression problems. For binary classification, the label is either 0 or 1; for multi-class problems, the label is an integer starting from 0: 0, 1, 2, and so on. Sparse training data is common in practice, and MLlib supports loading data from LIBSVM files to construct LabeledPoint instances.
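The LIBSVM text format stores one labeled point per line as `label index1:value1 index2:value2 ...`, with 1-based indices. A plain-Python sketch of the parsing step (a hypothetical helper, not MLlib's `MLUtils.loadLibSVMFile`) looks like this:

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-format line into (label, indices, values).
    LIBSVM indices are 1-based; convert them to 0-based as MLlib does."""
    parts = line.strip().split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)
        values.append(float(val))
    return label, indices, values

point = parse_libsvm_line("1 1:0.5 3:2.0")
```

The result pairs a label with a sparse vector in exactly the (indices, values) form described above.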

Local matrix: element values are of type double, and row and column indices are of type int. Both dense and sparse matrices are supported. A dense matrix stores its values in a double array in column-major order; a sparse matrix stores its non-zero entries in CSC (Compressed Sparse Column) format. Class hierarchy: Matrix -> (DenseMatrix, SparseMatrix)
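CSC format stores a sparse matrix as three arrays: the non-zero values (column by column), their row indices, and a column-pointer array marking where each column's entries begin. A plain-Python sketch of an element lookup in this layout (a hypothetical illustration, not the MLlib `SparseMatrix` API):

```python
def csc_get(values, row_indices, col_ptrs, i, j):
    """Look up element (i, j) in a CSC-format sparse matrix:
    column j's non-zeros live in values[col_ptrs[j]:col_ptrs[j + 1]]."""
    for k in range(col_ptrs[j], col_ptrs[j + 1]):
        if row_indices[k] == i:
            return values[k]
    return 0.0

# The 3x2 matrix [[9, 0], [0, 8], [0, 6]] in CSC form:
values = [9.0, 8.0, 6.0]
row_indices = [0, 1, 2]
col_ptrs = [0, 1, 3]  # column 0 has 1 non-zero, column 1 has 2
```

The column-pointer trick is what makes column slicing cheap in CSC, which matches the column-major convention of the dense representation.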

Distributed matrix: row and column indices are of type long, element values are still of type double, and the data is distributed across one or more RDDs. Choosing the right storage format is very important for a distributed matrix, because converting a distributed matrix from one format to another is likely to involve a large amount of shuffle I/O. Four types of distributed matrix are currently supported: RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix. The most basic type, RowMatrix, is a row-oriented matrix in which each row, stored locally as a vector, can be regarded as a feature vector. IndexedRowMatrix is a RowMatrix that additionally stores an index for each row, which can be used to locate specific rows and to perform join operations. CoordinateMatrix stores its entries as a sequence of (row, column, value) coordinates. BlockMatrix is a data structure designed for block matrices: the matrix is partitioned into matrix blocks, each stored locally.
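The coordinate representation can be sketched in plain Python (a hypothetical illustration with a local list standing in for the RDD of entries, not the Spark API): entries are (row, column, value) tuples, and grouping them by row index mirrors the kind of conversion a CoordinateMatrix performs to become an IndexedRowMatrix.

```python
def coordinate_to_rows(entries):
    """Group (row, col, value) coordinate entries into a dict of
    indexed sparse rows, each row a sorted list of (col, value) pairs."""
    rows = {}
    for i, j, v in sorted(entries):
        rows.setdefault(i, []).append((j, v))
    return rows

rows = coordinate_to_rows([(0, 1, 2.0), (2, 0, 5.0), (0, 3, 4.0)])
```

On a cluster this grouping is exactly the kind of operation that triggers a shuffle, which is why the paragraph above warns against casually converting between distributed-matrix formats.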

MLlib data statistics

On an RDD[Vector], the Statistics class provides the column-based summary function colStats, which returns the maximum, minimum, mean, variance, number of non-zero elements, and total count for each column.
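As a plain-Python sketch of what colStats computes (a hypothetical local version, with a list of equal-length rows standing in for the RDD[Vector]; MLlib reports the sample variance, which is what is used here):

```python
def col_stats(rows):
    """Column-wise summary statistics over a list of equal-length rows."""
    n = len(rows)
    stats = []
    for col in zip(*rows):
        mean = sum(col) / n
        # Sample (unbiased) variance, dividing by n - 1.
        var = sum((x - mean) ** 2 for x in col) / (n - 1) if n > 1 else 0.0
        stats.append({
            "max": max(col), "min": min(col), "mean": mean,
            "variance": var,
            "numNonzeros": sum(1 for x in col if x != 0.0),
            "count": n,
        })
    return stats

s = col_stats([[1.0, 0.0], [3.0, 4.0]])
```

The real colStats computes the same quantities in one distributed pass over the data rather than materializing each column.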

Computing the correlation between two series is a common operation. The corr function makes it easy to compute the correlation coefficients between two or more vectors, and it supports both the Pearson and Spearman correlation coefficients.
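The two coefficients are related in a simple way: Spearman is just Pearson applied to the ranks of the values. A plain-Python sketch (a hypothetical local version, not the Statistics.corr API; tie handling by average ranks is omitted for brevity):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.
    Assumes distinct values; ties would need average ranks."""
    def ranks(seq):
        order = sorted(range(len(seq)), key=lambda i: seq[i])
        r = [0.0] * len(seq)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))
```

Spearman only cares about monotonic ordering, so a perfectly monotonic but non-linear relationship still scores 1.0.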

On an RDD[(K, V)], the stratified sampling function sampleByKey can be used; a sampling fraction must be specified for each key.
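A plain-Python sketch of per-key Bernoulli sampling (a hypothetical local stand-in for sampleByKey, with a list of (key, value) pairs in place of the RDD):

```python
import random

def sample_by_key(pairs, fractions, seed=42):
    """Stratified sampling over (key, value) pairs: each record is kept
    independently with the probability assigned to its key."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions.get(k, 0.0)]

data = [("a", i) for i in range(100)] + [("b", i) for i in range(100)]
sample = sample_by_key(data, {"a": 0.1, "b": 0.5})
```

Because each key gets its own fraction, a rare class can be over-sampled relative to a common one, which is the point of stratified sampling.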

Hypothesis testing: Pearson's chi-squared test is supported.
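The test statistic itself is a short formula: the sum over categories of (observed - expected)^2 / expected. A minimal sketch of the statistic for a goodness-of-fit test (the significance lookup against the chi-squared distribution is omitted):

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-squared statistic:
    sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed counts [50, 30, 20] against expected counts [40, 40, 20]:
stat = chi_square_statistic([50, 30, 20], [40, 40, 20])
```

A larger statistic means the observed counts deviate more from the expected distribution; the p-value then comes from the chi-squared distribution with (number of categories - 1) degrees of freedom.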

Random number generation: the uniform distribution, the standard normal distribution, and the Poisson distribution are supported.
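A plain-Python sketch of drawing from these three distributions (a hypothetical local stand-in for MLlib's random data generation; uniform and normal come from the standard library, and Poisson uses Knuth's classic algorithm):

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's algorithm:
    multiply uniforms until the product drops below exp(-lam)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

rng = random.Random(0)
uniform = [rng.random() for _ in range(5)]         # Uniform(0, 1)
normal = [rng.gauss(0.0, 1.0) for _ in range(5)]   # standard normal
poisson = [poisson_sample(3.0, rng) for _ in range(5)]
```

Knuth's method is simple but takes time proportional to lam, so it suits small rates like the one used here.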

Kernel density estimation: allows you to visualize the empirical probability distribution of an observed sample without assuming a particular distribution for it. The distribution of the random variable is estimated from the given sample, under the assumption that its empirical probability density can be expressed as the mean of normal densities centered at each sample point.
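The "mean of normal densities centered at each sample point" can be written out directly. A plain-Python sketch of a Gaussian kernel density estimate at a single evaluation point (a hypothetical local version of the idea, not the MLlib KernelDensity API):

```python
import math

def kde_estimate(samples, bandwidth, x):
    """Gaussian kernel density estimate at x: the average of normal
    densities centered at each sample, with std. dev. `bandwidth`."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(norm * math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                for xi in samples)
    return total / len(samples)

density = kde_estimate([-1.0, 0.0, 1.0], 1.0, 0.0)
```

The bandwidth plays the role of a smoothing parameter: a larger value spreads each sample's contribution wider and produces a smoother estimated density.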

At this point, you should have a deeper understanding of Spark MLlib's data types; you might as well try them out in practice. Follow us and continue to learn!
