Brief introduction
MLlib is Spark's machine learning library, designed to run in parallel on a cluster.
MLlib contains many machine learning algorithms and can be used from all of the programming languages supported by Spark.
MLlib represents data as RDDs and invokes its algorithms on these distributed datasets. In effect, MLlib is a collection of functions that can be called on RDDs.
Data types
MLlib contains some unique data types, located in the org.apache.spark.mllib package (Java/Scala) or pyspark.mllib (Python). The main classes are:
Vector
A local vector (Local Vector) has integer indices starting at 0 and values of type Double, and is stored on a single machine.
MLlib supports two kinds of vectors: dense and sparse. A dense vector stores every element, while a sparse vector stores only the non-zero elements to save space.
Vectors can be created with the mllib.linalg.Vectors class:
Scala:
// create a dense vector
scala> val denseVec1 = Vectors.dense(1.0, 2.0)
denseVec1: org.apache.spark.mllib.linalg.Vector = [1.0,2.0]
scala> val denseVec2 = Vectors.dense(Array(1.0, 2.0))
denseVec2: org.apache.spark.mllib.linalg.Vector = [1.0,2.0]
// create a sparse vector: size 4, non-zero indices 0 and 2, values 1.0 and 2.0
scala> val sparseVec1 = Vectors.sparse(4, Array(0, 2), Array(1.0, 2.0))
sparseVec1: org.apache.spark.mllib.linalg.Vector = (4,[0,2],[1.0,2.0])

Python:
>>> from pyspark.mllib.linalg import Vectors
>>> den = Vectors.dense([1.0, 2.0])
>>> den
DenseVector([1.0, 2.0])
>>> spa = Vectors.sparse(4, [0, 2], [1.0, 2.0])
>>> spa
SparseVector(4, {0: 1.0, 2: 2.0})
LabeledPoint
Used in supervised learning algorithms such as classification and regression.
LabeledPoint represents labeled data points, including a feature vector and a label (represented by a floating point number).
Located in the mllib.regression package
Scala:
// first, import the classes related to labeled points
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// create a labeled point with a positive label and a dense feature vector
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// create a labeled point with a negative label and a sparse feature vector
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

Python:
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.linalg import Vectors
>>> pos = LabeledPoint(1.0, Vectors.dense([1.0, 2.0, 3.0]))
>>> neg = LabeledPoint(0.0, Vectors.dense([1.0, 2.0]))
Matrix
Matrices come in two kinds: dense and sparse.
The values of a dense matrix are stored in a single Double array in column-major order. The non-zero entries of a sparse matrix are stored in Compressed Sparse Column (CSC) format, also in column-major order. For example, the dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)), of size (3, 2), is stored in the one-dimensional array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0].
The base class of local matrices is Matrix, and Spark provides two implementations: DenseMatrix and SparseMatrix. The official documentation recommends using the factory methods in the Matrices class to create local matrices. Note that local matrices in MLlib are stored in column-major order.
Dense matrix
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// create the dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

Sparse matrix
scala> val sparseMatrix = Matrices.sparse(3, 3, Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2), Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
sparseMatrix: org.apache.spark.mllib.linalg.Matrix =
3 x 3 CSCMatrix
(0,0) 1.0
(2,0) 2.0
(1,1) 3.0
(0,2) 4.0
(1,2) 5.0
(2,2) 6.0
Rating
Used for product recommendation.
Represents a user's rating of a product.
Located in the mllib.recommendation package
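A minimal sketch of creating Rating objects in Scala (the user IDs, product IDs, and scores below are made up for illustration):

import org.apache.spark.mllib.recommendation.Rating

// Rating(user, product, rating): user and product are Int IDs, the rating is a Double
val r1 = Rating(1, 101, 4.5)   // user 1 gave product 101 a score of 4.5
val r2 = Rating(2, 101, 3.0)
val ratingsRDD = sc.parallelize(Seq(r1, r2))   // an RDD[Rating], as consumed by ALS.train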
Various Model classes
Each Model is the result of a training algorithm.
Models generally have a predict() method, which can be applied to a new data point or to an RDD of data points.
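For instance, a minimal sketch with KMeans (the points below are made up for illustration) showing predict() on both a single point and an RDD of points:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// toy data: two well-separated groups of points
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations

model.predict(Vectors.dense(0.2, 0.2))    // cluster index for a single point
model.predict(points)                     // RDD of cluster indices for an RDD of points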
Statistics
Whether for ad-hoc exploration or for understanding data before machine learning, basic statistics are an important part of data analysis. MLlib provides several widely used statistical functions through methods in the mllib.stat.Statistics class, which can be used directly on RDDs. Some commonly used functions are listed below.
Statistics.colStats(rdd)
Computes a statistical summary of an RDD of vectors, including the minimum, maximum, mean, and variance of each column. This can be used to obtain a rich set of statistics in a single pass.
Statistics.corr(rdd, method)
Computes the correlation matrix between the columns of an RDD of vectors, using either Pearson correlation or Spearman correlation (method must be one of pearson or spearman).
Statistics.corr(rdd1, rdd2, method)
Computes the correlation between two RDDs of floating-point values, using either Pearson correlation or Spearman correlation (method must be one of pearson or spearman).
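A minimal sketch of both corr variants (the numbers below are made up for illustration):

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg.Vectors

// correlation matrix between the columns of an RDD of vectors
val vecRDD = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0), Vectors.dense(2.0, 21.0), Vectors.dense(3.0, 29.0)))
val corrMatrix = Statistics.corr(vecRDD, "pearson")

// correlation between two RDDs of doubles of the same length
val x = sc.parallelize(Seq(1.0, 2.0, 3.0))
val y = sc.parallelize(Seq(10.0, 21.0, 29.0))
val corrValue = Statistics.corr(x, y, "spearman")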
Statistics.chiSqTest(rdd)
Computes Pearson's independence test between every feature and the label in an RDD of LabeledPoint objects. Returns an array of ChiSqTestResult objects containing the p-value, test statistic, and degrees of freedom for each feature. Labels and feature values must be categorical (that is, discrete values).
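A minimal sketch of chiSqTest (the labels and feature values below are made-up categorical codes):

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// labels and feature values must be categorical (discrete)
val labeled = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 2.0))))
val results = Statistics.chiSqTest(labeled)   // one ChiSqTestResult per feature
results.foreach(r => println(s"p-value = ${r.pValue}, degrees of freedom = ${r.degreesOfFreedom}"))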
Here is an example of colStats: build the required RDD of Vectors from the grades of three students, where each Vector represents one student's scores in four courses:
Python:
>>> from pyspark.mllib.stat import Statistics
>>> from pyspark.mllib.linalg import Vectors
>>> # build the RDD
>>> basicTestRDD = sc.parallelize([Vectors.dense([60, 70, 80, 0]), Vectors.dense([80, 50, 0, 90]), Vectors.dense([60, 70, 80, 0])])
>>> # compute the column summary; this object contains a large number of statistics
>>> summary = Statistics.colStats(basicTestRDD)
>>> print summary.mean()
[ 66.66666667  63.33333333  53.33333333  30.        ]
>>> print summary.variance()
[  133.33333333   133.33333333  2133.33333333  2700.        ]
>>> print summary.numNonzeros()
[ 3.  3.  2.  1.]

Scala:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD

val array1: Array[Double] = Array[Double](60, 70, 80, 0)
val array2: Array[Double] = Array[Double](80, 50, 0, 90)
val array3: Array[Double] = Array[Double](60, 70, 80, 0)
val denseArray1 = Vectors.dense(array1)
val denseArray2 = Vectors.dense(array2)
val denseArray3 = Vectors.dense(array3)
val seqDenseArray: Seq[Vector] = Seq(denseArray1, denseArray2, denseArray3)
val basicTestRDD: RDD[Vector] = sc.parallelize[Vector](seqDenseArray)
val summary: MultivariateStatisticalSummary = Statistics.colStats(basicTestRDD)

Algorithms
Feature extraction
TF-IDF (term frequency-inverse document frequency) is a simple way to generate feature vectors from text documents (such as web pages).
Scaling: most algorithms consider the magnitude of each element in the feature vector and perform best when the features are scaled so that they are treated equally.
Normalization: when preparing input data, normalize vectors to length 1. This can be done with the Normalizer class.
Word2Vec is a neural-network-based text feature algorithm whose output vectors can be fed to many downstream algorithms.
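A minimal sketch of the feature extraction tools mentioned above, in Scala (the two documents are made up for illustration):

import org.apache.spark.mllib.feature.{HashingTF, IDF, Normalizer}

// toy corpus: each document is a sequence of words
val docs = sc.parallelize(Seq(
  "spark mllib makes machine learning easy".split(" ").toSeq,
  "spark runs in parallel on a cluster".split(" ").toSeq))

val tf = new HashingTF(numFeatures = 1000).transform(docs)   // term frequencies
tf.cache()
val idfModel = new IDF().fit(tf)                             // inverse document frequencies
val tfidf = idfModel.transform(tf)                           // TF-IDF feature vectors

val normalized = new Normalizer().transform(tfidf)           // scale every vector to length 1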
Dimensionality reduction
Principal component analysis (PCA)
PCA maps the data to a lower-dimensional space in a way that maximizes the variance of the data's representation in that space, thereby ignoring some uninformative dimensions. To compute this mapping, we construct the normalized correlation matrix of the data and use the singular vectors and singular values of this matrix. The singular vectors corresponding to the largest singular values can be used to reconstruct the principal components of the original data.

Singular value decomposition
MLlib also provides a lower-level singular value decomposition (SVD) primitive.

Classification and regression
Classification and regression are two forms of supervised learning. Supervised learning means the algorithm uses labeled training data to predict an outcome from an object's features. In classification the predicted variable is discrete; in regression it is continuous. MLlib contains many classification and regression algorithms, including simple linear methods as well as decision trees and forests.

Clustering
Clustering is an unsupervised learning task that groups objects into clusters of high similarity. Clustering is mainly used for data exploration (to see what a new dataset looks like) and anomaly detection (to identify points far from any cluster). MLlib includes the popular K-means algorithm, along with a variant called K-means|| that provides a better initialization strategy for parallel environments.

Collaborative filtering and recommendation
Collaborative filtering is a recommender-system technique that suggests new products based on users' interactions with and ratings of products. Alternating least squares (ALS) assigns a feature vector to each user and each product so that the dot product of a user vector and a product vector is close to that user's rating of the product.

Examples
Classifying spam with the logistic regression algorithm

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

def testLogisticRegressionWithSGD = {
  val spam = sc.textFile("src/main/resources/mllib/spam.txt", 1)
  val normal = sc.textFile("src/main/resources/mllib/normal.txt", 1)

  // create a HashingTF instance to map email text to vectors of 10,000 features
  val tf = new HashingTF(numFeatures = 10000)
  // split each email into words; each word is mapped to one feature
  val spamFeatures = spam.map { email => tf.transform(email.split(" ")) }
  val normalFeatures = normal.map { email => tf.transform(email.split(" ")) }

  // create LabeledPoint datasets for positive (spam) and negative (normal mail) examples
  val positiveExamples = spamFeatures.map { features => LabeledPoint(1, features) }
  val negativeExamples = normalFeatures.map { features => LabeledPoint(0, features) }
  val trainingData = positiveExamples.union(negativeExamples)
  trainingData.cache()
  println(trainingData.toDebugString)

  // run logistic regression using the SGD algorithm
  val model = new LogisticRegressionWithSGD().run(trainingData)

  // test on a positive (spam) example and a negative (normal mail) example
  val posTest = tf.transform("O M G get cheap stuff by sending money to".split(" "))
  val negTest = tf.transform("hello, I started studying Spark".split(" "))
  println(s"prediction for positive test example: ${model.predict(posTest)}")
  println(s"prediction for negative test example: ${model.predict(negTest)}")
  Thread.sleep(Int.MaxValue)
}

SVM classification algorithm

# load modules
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import SVMWithSGD

# read the data
dataFile = '/opt/spark-1.6.1-bin-hadoop2.6/data/mllib/sample_libsvm_data.txt'
data = MLUtils.loadLibSVMFile(sc, dataFile)

splits = data.randomSplit([0.8, 0.2], seed = 9L)
training = splits[0].cache()
test = splits[1]

# print the size of each split
print "TrainingCount:[%d]" % training.count()
print "TestingCount:[%d]" % test.count()

model = SVMWithSGD.train(training, 100)

scoreAndLabels = test.map(lambda point: (model.predict(point.features), point.label))

# print the results: the predicted score and the 0/1 label
for score, label in scoreAndLabels.collect():
    print score, label

k-means clustering algorithm

import numpy as np
from pyspark.mllib.clustering import KMeans

# read the data file and create an RDD
dataFile = "/opt/spark-1.6.1-bin-hadoop2.6/data/mllib/kmeans_data.txt"
lines = sc.textFile(dataFile)

# create an RDD of numpy arrays
data = lines.map(lambda line: np.array([float(x) for x in line.split(' ')]))

# 2 is the number of clusters
model = KMeans.train(data, 2)

print("Final centers: " + str(model.clusterCenters))
print("Total Cost: " + str(model.computeCost(data)))
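Finally, a minimal sketch of collaborative filtering with ALS, as described above (the ratings and parameter values are made up for illustration):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// toy ratings: Rating(user, product, rating)
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 5.0), Rating(1, 20, 1.0),
  Rating(2, 10, 4.0), Rating(2, 30, 5.0)))

// rank = 10 latent features, 10 iterations, regularization parameter 0.01
val alsModel = ALS.train(ratings, 10, 10, 0.01)

alsModel.predict(1, 30)            // predicted rating of product 30 by user 1
alsModel.recommendProducts(2, 2)   // top 2 product recommendations for user 2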