This article shows how to use a Spark machine learning data pipeline for advertisement (spam) detection.
Spark's other machine learning API, called Spark ML, is the recommended solution if you want to use data pipelines when developing big data applications.
The Spark ML (spark.ml) package provides a machine learning API built on top of DataFrames, which have become a core part of the Spark SQL library. This package can be used to develop and manage machine learning pipelines. It also provides feature extractors, transformers, and selectors, and supports machine learning techniques such as classification, regression, and clustering. All of these are critical to developing machine learning solutions.
Here we look at how to use Apache Spark to perform exploratory data analysis, develop machine learning pipelines, and use the APIs and algorithms provided in the Spark ML package.
Because of its support for building machine learning data pipelines, the Apache Spark framework has become a very good choice for building comprehensive use cases that include ETL, metrics analysis, real-time stream analysis, machine learning, graph processing, and visualization.
Machine learning data pipeline
A machine learning pipeline can be used to create, tune, and validate machine learning workflows. It helps us focus on the big data requirements and machine learning tasks of the project, rather than spending time and energy on infrastructure and distributed computing. It also helps us during the exploratory phase of a machine learning problem, when we need to iterate over features and model combinations.
A machine learning workflow usually includes a series of processing and learning stages. A machine learning data pipeline is often described as a sequence of stages, each of which is either a transformer or an estimator. These stages are executed in order, and the input data is processed and transformed as it flows through each stage of the pipeline.
A machine learning development framework should support distributed computing and serve as a tool for assembling pipeline modules. There are other requirements for building data pipelines, including fault tolerance, resource management, scalability, and maintainability.
In real projects, machine learning workflow solutions also include tools for model import and export, cross-validation for parameter selection, aggregation of data from multiple data sources, and so on. They also provide data utilities such as feature extraction, feature selection, and statistics. These frameworks support persisting machine learning pipelines, so that models and pipelines can be saved and reloaded for future use.
The concept of a machine learning workflow built from a composition of workflow processors has become popular in many different systems. Frameworks such as scikit-learn and GraphLab also use the concept of a pipeline to build their systems.
A typical data value chain process includes the following steps:
Discover
Ingest
Process
Persist
Integrate
Analyze
Visualize
The methods used in machine learning data pipelines are similar. The following table shows the different steps involved in machine learning pipeline processing.
Table 1: machine learning pipeline processing steps
These steps can also be shown in figure 1 below.
Figure 1: machine learning data pipeline processing flow chart
Next, let's look at the details of each step.
Data ingestion: the data we collect for a machine learning pipeline application can come from a variety of data sources and can range from a few hundred GB to a few TB in size. Moreover, a characteristic of big data applications is that data arrives in many different formats.
Data cleaning: data cleaning is a very important step in the overall data analysis pipeline; it is also called data cleansing or data wrangling. The main purpose of this step is to give the input data structure, to facilitate subsequent data processing and predictive analysis. Depending on the quality of the data entering the system, up to 60% of the total processing time may be spent on data cleaning, converting the data into a suitable format so that machine learning models can be applied to it.
Data always has a variety of quality problems, such as incomplete data or incorrect and invalid data items. The data cleaning process usually uses a variety of methods, including custom transformers; a data cleaning action can be performed by a custom transformer inside the pipeline.
Sparse or coarse-grained data is another challenge in data analysis. Many edge cases arise here, so we have to use the data cleaning techniques mentioned above to ensure that the data entering the pipeline is of high quality.
Data cleaning is usually an iterative process, as our understanding of the problem deepens with each successive attempt and each update of the model. Data wrangling tools such as Trifacta, OpenRefine, or ActiveClean can be used to accomplish data cleaning tasks.
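As a rough illustration of what such a cleaning step might look like, the sketch below normalizes a text column directly with DataFrame operations. The input DataFrame rawDF and its "text" column are hypothetical; in a real pipeline this logic would typically be wrapped in a custom transformer stage.

import org.apache.spark.sql.functions.{col, lower, regexp_replace, trim}

// Drop rows with missing text, normalize case, strip punctuation and
// collapse repeated whitespace before the data enters the pipeline.
val cleanedDF = rawDF
  .na.drop(Seq("text"))
  .withColumn("text", lower(col("text")))
  .withColumn("text", regexp_replace(col("text"), "[^a-z0-9\\s]", " "))
  .withColumn("text", trim(regexp_replace(col("text"), "\\s+", " ")))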
Feature extraction: in the feature extraction (sometimes called feature engineering) step, we extract specific features from the raw data using techniques such as feature hashing (Hashing Term Frequency) and Word2Vec. The output of this step often includes an assembler component, which is passed to the next step for processing.
Model training: machine learning model training involves providing an algorithm with some training data for the model to learn from. The learning algorithm finds patterns in the training data and produces an output model.
Model validation: this step evaluates and tunes the machine learning model to measure how effective it is at making predictions. As described in this article, a receiver operating characteristic (ROC) curve can be used to evaluate a binary classification model. A ROC curve illustrates the performance of a binary classifier system; it is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
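As a minimal sketch of this kind of evaluation in Spark ML (assuming a predictions DataFrame with the default "label" and "rawPrediction" columns, such as the one produced at the end of the sample program below), the area under the ROC curve can be computed with BinaryClassificationEvaluator:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Computes the area under the ROC curve for a binary classifier's output.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")

val auc = evaluator.evaluate(predictions)
println(s"Area under ROC = $auc")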
Model selection: model selection lets transformers and estimators use data to choose parameters. This is also a key step in the machine learning pipeline process. Classes such as ParamGridBuilder and CrossValidator provide APIs for selecting a machine learning model.
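The sketch below shows how these classes fit together. It assumes the pipeline, hashingTF and lr stages defined later in the sample program, plus the evaluator from the previous sketch; the specific parameter values are only examples.

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Build a grid of candidate parameter combinations to try.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

// Cross-validation fits the pipeline once per parameter combination and fold,
// and keeps the combination with the best evaluator score.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingData)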
Model deployment: once the right model has been selected, we can deploy it, feed in new data, and obtain predictions. We can also deploy machine learning models as web services.
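One simple way to hand a model over to a serving process is to persist the fitted pipeline and reload it elsewhere. The sketch below assumes Spark 1.6 or later (where ML pipeline persistence is available), the lrModel produced by the sample program below, a placeholder directory "MODEL_DIR", and a hypothetical newData DataFrame.

import org.apache.spark.ml.PipelineModel

// Save the fitted pipeline model to a directory.
lrModel.write.overwrite().save("MODEL_DIR")

// In the serving application (e.g. behind a web service endpoint),
// reload the model and score incoming data.
val servedModel = PipelineModel.load("MODEL_DIR")
val scored = servedModel.transform(newData)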
Spark machine learning
The machine learning pipeline API was introduced in version 1.2 of the Apache Spark framework. It provides developers with an API to create and execute complex machine learning workflows. The goal of the pipeline API is to let users quickly and easily assemble and configure practical distributed machine learning pipelines by standardizing the APIs for different machine learning concepts. The pipeline API is included in the org.apache.spark.ml package.
Spark ML also makes it easy to combine multiple machine learning algorithms into a single pipeline.
The Spark machine learning API is divided into two packages, spark.mllib and spark.ml. The spark.mllib package contains the original API built on top of RDDs. The spark.ml package provides a higher-level API built on top of DataFrames for constructing machine learning pipelines.
The RDD-based MLlib library API is now in maintenance mode.
As shown in figure 2 below, Spark ML is a very important big data analysis library in the Apache Spark ecosystem.
Figure 2: Spark ecosystem including Spark ML
Machine learning pipeline modules
The machine learning data pipeline includes a number of modules needed to complete the data analysis task. The key modules of the data pipeline are listed below:
Dataset
Pipeline
Pipeline stage
Transformer
Estimator
Evaluator
Parameter (and ParamMap)
Let's take a brief look at how these modules correspond to the overall steps.
Dataset: DataFrames are used to represent datasets in the machine learning pipeline. A DataFrame allows structured data to be stored in named columns. These columns can be used to hold text, feature vectors, true labels, and predictions.
Pipeline: a machine learning workflow is modeled as a pipeline, which consists of a series of stages. Each stage processes the input data and produces output data for the next stage. A pipeline chains multiple transformers and estimators together to describe a machine learning workflow.
Pipeline stage: two kinds of stages are defined, transformers and estimators.
Transformer: an algorithm that converts one DataFrame into another DataFrame. For example, a machine learning model is a transformer that converts a DataFrame of features into a DataFrame of predictions.
A transformer converts one DataFrame into another, adding new features to it. For example, in the Spark ML package, OneHotEncoder converts a column of label indices into a column of binary vectors. Every transformer has a transform() method, which is called to convert one DataFrame into another.
Estimator: an estimator is a machine learning algorithm that learns from the data you provide. The input to an estimator is a DataFrame and its output is a transformer. An estimator is used to train a model; it produces a transformer. For example, a logistic regression estimator produces a logistic regression model, which is a transformer. Another example is K-Means: as an estimator it takes training data and produces a K-Means model, which is a transformer.
Parameters: machine learning components use a common API to describe their parameters. One example of a parameter is the maximum number of iterations the model should run.
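Here is a small sketch of how these modules relate to one another. The trainingData and testData DataFrames are hypothetical (assumed to contain "features" and "label" columns), and the parameter values are only examples.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// LogisticRegression is an estimator; its parameters are set through a common API.
val lr = new LogisticRegression()
  .setMaxIter(10)      // parameter: maximum number of iterations
  .setRegParam(0.01)   // parameter: regularization strength

// fit() learns from the training DataFrame and returns a model, which is a transformer.
val model = lr.fit(trainingData)

// Parameters can also be supplied at fit time through a ParamMap.
val model2 = lr.fit(trainingData, ParamMap(lr.maxIter -> 20))

// The transformer adds prediction columns to the input DataFrame.
val predicted = model.transform(testData)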
The following figure shows the modules of a data pipeline used for text classification.
Figure 3: data pipeline using Spark ML
Use case
One of the use cases for machine learning pipelines is text classification. Such use cases usually include the following steps:
Clean the text data
Convert the data into feature vectors
Train the classification model
In text classification, preprocessing steps such as n-gram extraction and TF-IDF feature weighting are carried out before a classification model (such as an SVM) is trained.
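For example, a bigram extraction step could be expressed with the NGram transformer, as in the sketch below (documentsDF and its "text" column are hypothetical):

import org.apache.spark.ml.feature.{NGram, Tokenizer}

// Tokenize the raw text and derive bigrams that could feed a TF-IDF stage.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val bigram = new NGram().setN(2).setInputCol("words").setOutputCol("bigrams")

val withWords = tokenizer.transform(documentsDF)
val withBigrams = bigram.transform(withWords)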
Another machine learning pipeline use case is the image classification described in this article.
There are many other machine learning use cases, including fraud detection (using a classification model, which is part of supervised learning) and user segmentation (using a clustering model, which is part of unsupervised learning).
TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical method for evaluating the importance of a word in a given set of samples. It is an information retrieval technique used to rate the importance of a word within a collection of documents.
TF: if a word appears repeatedly in a document, it is more important. The specific calculation method is as follows:
TF = (# of times word X appears in a document) / (total # of words in the document)
IDF: however, if a word appears frequently across many documents (words such as "the", "and", "of"), it carries little real discriminating power, so its weight is scaled down.
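To illustrate how these two weights are combined in Spark ML, here is a small toy sketch. The data and column names are made up, and it assumes the implicits import shown in the sample program below so that toDF is available.

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Two tiny "documents"; requires import sqlContext.implicits._ for toDF.
val docs = Seq((0.0, "spark ml pipelines"), (1.0, "buy cheap ads now")).toDF("label", "text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val words = tokenizer.transform(docs)

// HashingTF computes term frequencies into a fixed-size feature vector.
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1000)
val tf = hashingTF.transform(words)

// IDF is an estimator: fit() computes document frequencies, and the
// resulting model rescales the raw term frequencies.
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
val tfidf = idfModel.transform(tf)
tfidf.select("label", "features").show(false)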
Sample program
Let's look at an example program to see how the Spark ML package can be used in a big data processing system. We will develop a document classification program to identify advertising (spam) content in the program's input data. The test input dataset consists of documents, e-mails, or any other content received from an external system that may contain advertisements.
We will build our sample program on the ad detection example from the "Building Machine Learning Applications with Spark" workshop presented at the Strata Hadoop World Conference.
Use case
This use case analyzes the messages sent to our system. Some messages contain advertising information, while others do not. Our goal is to use the Spark ML API to find the messages that contain ads.
Algorithm
We will use the logistic regression algorithm from machine learning. Logistic regression is a regression analysis model that can predict a yes-or-no outcome based on one or more independent variables.
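As a brief aside (not part of the original example), logistic regression models the probability of the positive outcome by applying the logistic (sigmoid) function to a linear combination of the independent variables:

P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}

A prediction of "yes" corresponds to this probability exceeding a chosen threshold (0.5 by default in Spark ML).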
Detailed solution
Let's take a look at the details of the Spark ML sample program and the steps to run it.
Data ingestion: we import both the data that contains advertisements (text files) and the data that does not.
Data cleaning: in the sample program we do not perform any special data cleaning operations. We simply aggregate all the data into a single DataFrame object.
We randomly split the data into a training set and a test set, producing an array of two objects. In this example, we use 70% of the data for training and 30% for testing.
In the subsequent pipeline operation, we use these two data objects to train the model and make predictions.
Our machine learning data pipeline consists of four steps:
Tokenizer
HashingTF
IDF
LR
We create a pipeline object and set the above stages on the pipeline. Then, following the example, we can build a logistic regression model from the training data.
Now, let's use the test data (the new dataset) to make predictions with the model.
The architecture diagram of the example program is shown in figure 4 below.
Figure 4: data classification program architecture diagram
Technologies
The following technologies and tools are used to implement the machine learning pipeline solution.
Table 2: techniques and tools used in machine learning examples
Spark ML program
The machine learning code, based on the examples from the workshop, is written in the Scala programming language, and we can run the program directly in the Spark Shell console.
Ad detection Scala code snippet:
Step 1: create a custom class to store the details of the advertising content.

case class SpamDocument(file: String, text: String, label: Double)
Step 2: initialize the SQLContext and import the implicit conversions that turn Scala objects into DataFrames. The datasets are then loaded from the specified directories where the input files are stored, producing RDD objects. DataFrame objects are then created from the RDD objects of the two datasets.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

//
// Load the data files with spam
//
val rddSData = sc.wholeTextFiles("SPAM_DATA_FILE_DIR", 1)
val dfSData = rddSData.map(d => SpamDocument(d._1, d._2, 1)).toDF()
dfSData.show()

//
// Load the data files with no spam
//
val rddNSData = sc.wholeTextFiles("NO_SPAM_DATA_FILE_DIR", 1)
val dfNSData = rddNSData.map(d => SpamDocument(d._1, d._2, 0)).toDF()
dfNSData.show()
Step 3: now, aggregate the datasets and split the combined data into training data and test data in a 70% to 30% ratio.
//
// Aggregate both data frames
//
val dfAllData = dfSData.unionAll(dfNSData)
dfAllData.show()

//
// Split the data into 70% training data and 30% test data
//
val Array(trainingData, testData) = dfAllData.randomSplit(Array(0.7, 0.3))
Step 4: you can now configure the machine learning data pipeline, creating the stages we discussed earlier in the article: Tokenizer, HashingTF, and IDF. Then use the training data to build the model, in this case a logistic regression model.
//
// Configure the ML data pipeline
//

//
// Create the Tokenizer step
//
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

//
// Create the TF and IDF steps
//
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("rawFeatures")

val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

//
// Create the Logistic Regression step
//
val lr = new LogisticRegression()
  .setMaxIter(5)
lr.setLabelCol("label")
lr.setFeaturesCol("features")

//
// Create the pipeline
//
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, idf, lr))

val lrModel = pipeline.fit(trainingData)
println(lrModel.toString())
Step 5: finally, we call the transform method on the logistic regression model to make predictions on the test data.
//
// Make predictions
//
val predictions = lrModel.transform(testData)

//
// Display prediction results
//
predictions.select("file", "text", "label", "features", "prediction").show()
The Spark machine learning library is one of the most important libraries in the Apache Spark framework. It is used to implement machine learning data pipelines. In this article, we learned how to use the APIs in the Spark ML package and applied them to a text classification use case.
The graph data model is about the connections and relationships between different entities in the data model. Graph data processing technology has received a lot of attention recently because it can be used to solve many problems, including fraud detection and the development of recommendation engines.