Spark machine learning


Main Concepts in Spark Machine Learning Pipelines

Through Pipelines, the API provided by MLlib can combine multiple machine learning algorithms into a single pipeline, or workflow. This abstraction is similar to the one in scikit-learn; according to the official documentation, it was in fact inspired by scikit-learn.

DataFrame: the ML API uses the DataFrame from the Spark SQL component as its dataset. A DataFrame supports many data types; for example, it can organize text, database tables, and other external data sources into columns holding feature vectors, feature values, labels, and so on.

Transformer: a Transformer converts one DataFrame into another DataFrame. For example, a trained machine learning model transforms a DataFrame with a features column into a DataFrame that also contains the model's predictions.

Estimator: an algorithm that fits on a DataFrame to produce a Transformer, i.e., a machine learning model.

Pipeline: chains multiple Transformers and Estimators together into a single machine learning workflow.

Parameter: a shared API through which all Transformers and Estimators specify their parameters.

DataFrame

A DataFrame can hold the data structures most widely used in machine learning, including vectors, text, images, and structured data, and it supports many data sources through Spark SQL, as sketched below.
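As a minimal, hedged sketch of loading a dataset through a Spark SQL data source (the file path is illustrative, and a SparkSession named spark is assumed to be in scope, as in spark-shell):

val df = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
df.printSchema()  // two columns: label (double) and features (vector)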

The workflow is shown in the figure:

Pipeline flow chart in machine learning

As shown in the figure, the Pipeline has three stages, each of which is either a Transformer or an Estimator, executed in order. The raw text, a DataFrame represented by a cylinder, is transformed into a new Words column (still a DataFrame), then into feature vectors, and finally a LogisticRegressionModel is produced. In the figure, Tokenizer and HashingTF are Transformers, while LogisticRegression is an Estimator (the LogisticRegressionModel it fits is itself a Transformer).

In a Transformer stage, computation is performed mainly by calling the transform() method.

In an Estimator stage, computation is performed mainly by calling the fit() method, as sketched below.
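A minimal sketch of the two calls, assuming a spark session is in scope (as in spark-shell); the column names and toy values are illustrative:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.linalg.Vectors

// Transformer: transform() maps one DataFrame to another.
val docs = spark.createDataFrame(Seq((0L, "spark ml pipeline"))).toDF("id", "text")
val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)

// Estimator: fit() trains on a DataFrame and returns a Transformer (the model).
val train = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")
val model = new LogisticRegression().setMaxIter(5).fit(train)
val predictions = model.transform(train)  // the fitted model is itself a Transformer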

DAG Pipelines: multiple stages form a pipeline, and the data flow between stages forms a directed acyclic graph (DAG). A pipeline need not be linear, as long as its stage graph is a DAG and the stages are specified in topological order; see the sketch below.
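A hedged sketch of a non-linear (DAG-shaped) pipeline; the column names and the use of VectorAssembler are my own choices for illustration, not from the original article:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, VectorAssembler}

// Two branches: hashed text features and a raw numeric column ("clicks"),
// merged by VectorAssembler before the final Estimator.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("textFeatures")
val assembler = new VectorAssembler()
  .setInputCols(Array("textFeatures", "clicks"))
  .setOutputCol("features")
val lr = new LogisticRegression()

// Stages are listed in topological order of the column-dependency DAG.
val dagPipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, assembler, lr))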

Run-time checking: a DataFrame can hold many kinds of data, but column types are not checked at compile time; instead, they are validated against the DataFrame's schema at run time, when a stage is fit or applied, as the sketch below illustrates.
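A minimal sketch of this behavior (the column names are illustrative):

import org.apache.spark.ml.feature.Tokenizer

// Constructing a DataFrame without the expected "text" column succeeds,
// but the schema check fails only at run time, when transform() is called.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val badInput = spark.createDataFrame(Seq((0L, 1.0))).toDF("id", "label")
// tokenizer.transform(badInput)  // fails at run time: required input column "text" is missing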

Unique IDs: each stage of a Pipeline is uniquely identified by an ID. Consequently, the same instance, such as one HashingTF object, cannot be inserted into the same Pipeline twice; to apply the same operation twice, create two instances, each with its own unique ID, as shown below.
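A minimal sketch of how stage IDs appear in practice (the printed uid values are illustrative):

import org.apache.spark.ml.feature.HashingTF

val tf1 = new HashingTF()
val tf2 = new HashingTF()
println(tf1.uid)  // e.g. "hashingTF_4a1b..."; every instance gets its own uid
println(tf2.uid)  // a different uid, so tf1 and tf2 may both appear in one Pipeline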

Saving and loading pipelines: both an unfitted Pipeline and a fitted PipelineModel can be saved to disk and loaded back later; see the second code example below.

Code examples:

A comprehensive example covering Estimator, Transformer, and Param

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

// Prepare test data.
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }

A standalone Pipeline example

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Now we can optionally save the fitted pipeline to disk.
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// We can also save this unfit pipeline to disk.
pipeline.write.overwrite().save("/tmp/unfit-lr-model")

// And load it back in during production.
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
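Both examples assume a SparkSession named spark is already in scope, as it is in spark-shell. For a standalone application, a minimal sketch (the app name and local master are illustrative, not from the original article):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkMLPipelineExamples")
  .master("local[*]")  // run locally with all cores; adjust for a real cluster
  .getOrCreate()

// ... run either example here ...

spark.stop()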
