How to Profile Users Based on Spark
This article is about how to profile users based on Spark. It is quite practical, so it is shared here in the hope that you will get something out of it after reading.
Recently, comSysto shared its R&D team's experience of using the Spark platform to solve a Kaggle competition problem, which provides a useful reference for applying Spark and similar platforms in data science.
The organizers provided a data set containing roughly 550,000 anonymized driver routes. The goal of the competition was to develop an algorithmic signature of driving style that characterizes each driver from their routes. For example: does the driver make long trips? Short trips? Drive at high speed? Back up? Accelerate sharply away from stops? Take turns at high speed? The answers to all of these questions together form a unique label that characterizes the driver.
Faced with this challenge, the comSysto team turned to the Spark platform, which covers a variety of processing models: batch processing, streaming data, machine learning, graph processing, SQL queries, and interactive custom analysis. They took the challenge as an opportunity to deepen their experience with Spark. The rest of this article walks through how the comSysto team approached the problem in three parts: data analysis, machine learning, and results.
Data analysis
As the first step in solving the problem, data analysis plays a key role. To the comSysto team's surprise, however, the raw data provided by the competition was very simple. The data set contains only anonymized coordinate pairs for each route, such as (1.3, 2.1) and (4.8, 2.9). As shown in the figure below, the driver sets out from and returns to the origin (0, 0), with each route heading off in a random direction, so a driver's routes form a star of out-and-back paths around the origin.
After getting the data, the comSysto team was a little discouraged: it is hard to characterize a driver just by looking at coordinate pairs.
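Before any feature engineering, each route file has to be parsed into coordinates. Below is a minimal sketch, assuming each route is a small CSV file with one "x,y" pair per line; the file name, the Point record, and the header check are illustrative assumptions, not details from the original write-up. (The team's pipeline was implemented in Java, so the sketches in this article use Java as well.)

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RouteParser {
    /** A single (x, y) coordinate of a route. */
    record Point(double x, double y) {}

    /** Reads a route file where each line is "x,y". */
    static List<Point> readRoute(Path file) throws Exception {
        List<Point> route = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            if (line.startsWith("x")) continue;  // skip a possible header row
            String[] parts = line.split(",");
            route.add(new Point(Double.parseDouble(parts[0].trim()),
                                Double.parseDouble(parts[1].trim())));
        }
        return route;
    }

    public static void main(String[] args) throws Exception {
        List<Point> route = readRoute(Path.of("route.csv"));  // illustrative file name
        System.out.println("Route has " + route.size() + " points");
    }
}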
Defining the information fingerprint
With raw data this simple, one of the problems the team faced was how to turn the coordinate information into useful input for machine learning. After discussion and consideration, they adopted the approach of building an information fingerprint for each driver, collecting the meaningful, distinctive characteristics of that driver's routes. To obtain the fingerprint, the team first defined a series of features (a sketch of the computation follows the list):
Distance: the sum of the Euclidean distances between all pairs of adjacent coordinates.
Absolute distance: the Euclidean distance between the start and the end of the route.
Total pause time on the route: the total time the driver spends stationary.
Total route time: the number of entries for a particular route (if the route's coordinates are recorded once per second, the number of entries is the total number of seconds on the route).
Speed: the speed at a point is defined as the Euclidean distance between that point and the previous point. Assuming the coordinate units are meters and the recording interval between coordinates is one second, this definition gives speed in m/s. In this analysis, however, speed is mainly used to compare points and drivers, so as long as the unit is consistent, its absolute value does not matter. The same applies to acceleration, deceleration, and centripetal acceleration.
Acceleration: while accelerating, the difference between the speed at a point and at the previous point.
Deceleration: while decelerating, the difference between the speed at a point and at the previous point.
Centripetal acceleration: a = v^2 / r, where v is the speed and r is the radius of the circle formed by the curved path. Computing the radius requires the coordinates of the current point and of several points before and after it. Centripetal acceleration reflects how aggressively the driver takes turns: the higher the value, the faster the turn.
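To make these definitions concrete, here is a minimal sketch of the per-route feature computation under the assumptions above (one coordinate per second, units in meters). The helper names and the 0.1 m/s pause threshold are illustrative assumptions:

import java.util.List;

public class RouteFeatures {
    record Point(double x, double y) {}

    static double dist(Point a, Point b) {
        return Math.hypot(b.x() - a.x(), b.y() - a.y());
    }

    /** Distance: sum of Euclidean distances between all adjacent coordinate pairs. */
    static double totalDistance(List<Point> route) {
        double sum = 0;
        for (int i = 1; i < route.size(); i++) {
            sum += dist(route.get(i - 1), route.get(i));
        }
        return sum;
    }

    /** Absolute distance: straight-line distance between start and end of the route. */
    static double absoluteDistance(List<Point> route) {
        return dist(route.get(0), route.get(route.size() - 1));
    }

    /** Speed at point i: distance to point i-1; with one point per second this is m/s. */
    static double[] speeds(List<Point> route) {
        double[] v = new double[route.size() - 1];
        for (int i = 1; i < route.size(); i++) {
            v[i - 1] = dist(route.get(i - 1), route.get(i));
        }
        return v;
    }

    /** Acceleration at point i: speed difference v[i] - v[i-1]; positive while accelerating. */
    static double[] accelerations(double[] speeds) {
        double[] a = new double[speeds.length - 1];
        for (int i = 1; i < speeds.length; i++) {
            a[i - 1] = speeds[i] - speeds[i - 1];
        }
        return a;
    }

    /** Total pause time: seconds with (near-)zero speed; the 0.1 threshold is an assumption. */
    static long pauseTime(double[] speeds) {
        long paused = 0;
        for (double v : speeds) {
            if (v < 0.1) paused++;
        }
        return paused;
    }
}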
These features, computed over all of a driver's routes, make up that driver's resume (information fingerprint). Experience suggests that average speed on urban roads differs from average speed on highways, so a driver's average speed over all routes is not very meaningful. The comSysto team therefore chose the average and maximum speed per route type (urban roads, long-distance highways, rural roads, and so on) as the object of study, as sketched below.
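A minimal Spark sketch of that per-driver, per-route-type aggregation, assuming an RDD keyed by (driverId, routeType) with one mean-speed value per route (all names are illustrative):

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class FingerprintAggregation {
    /**
     * Average speed per (driverId, routeType) key.
     * Input: one entry per route, keyed by driver and route type, valued by that
     * route's mean speed. Output: the mean of those means per key.
     */
    static JavaPairRDD<Tuple2<String, Integer>, Double> averageSpeedByType(
            JavaPairRDD<Tuple2<String, Integer>, Double> routeSpeeds) {
        return routeSpeeds
                .mapValues(v -> new Tuple2<Double, Long>(v, 1L))               // (sum, count)
                .reduceByKey((a, b) -> new Tuple2<>(a._1 + b._1, a._2 + b._2)) // add pairs
                .mapValues(t -> t._1 / t._2);                                  // sum / count
    }
}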
Data statistics: the competition data set contains information on roughly 2,700 drivers and 540,000 routes. All routes together contain about 360 million x/y coordinates, which, at one coordinate per second, amounts to roughly 100,000 hours of route data.
Machine learning
After the preliminary data preparation and feature extraction, the comSysto team began selecting and testing machine learning models to predict driver behavior.
Clustering
The first step of the machine learning stage was to classify routes; the comSysto team chose the k-means algorithm to cluster route types automatically. These categories are derived from all routes of all drivers and are not specific to a single driver. Looking at the clustering results, the team found that the extracted features and the computed clusters correlate with route length, which suggests they can serve as an indicator of route type. In the end, based on cross-validation results, they settled on eight clusters, assigning one cluster ID to each route for further analysis; a sketch of this step follows.
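A minimal sketch of the clustering step with the Spark MLlib RDD API, assuming the per-route feature vectors have already been assembled (variable names and the iteration count are illustrative):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;

public class RouteClustering {
    /** Clusters route feature vectors into the article's eight route types. */
    static KMeansModel clusterRoutes(JavaRDD<Vector> routeFeatures) {
        int numClusters = 8;     // chosen via cross-validation in the article
        int maxIterations = 20;  // assumption; the article does not state this
        return KMeans.train(routeFeatures.rdd(), numClusters, maxIterations);
    }

    /** Assigns a route-type ID (cluster index) to one route's feature vector. */
    static int routeType(KMeansModel model, Vector features) {
        return model.predict(features);
    }
}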
Prediction
For driver behavior prediction, the comSysto team chose the random forest algorithm to train a prediction model, which computes the probability that a given route was driven by a particular driver. First, the team built a training set for each driver: about 200 of that driver's routes (labeled "1", match) plus about 200 routes from other, randomly selected drivers (labeled "0", mismatch). These data sets were then fed into the random forest training algorithm, producing one random forest model per driver. The models were cross-validated and finally used to produce the Kaggle submission. Based on the cross-validation results, the comSysto team chose 10 trees and a maximum depth of 12 as the random forest parameters; a sketch follows. For a comparison of further ensemble learning algorithms available for prediction in the Spark machine learning library (MLlib), see the Databricks blog.
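A minimal sketch of the per-driver training and scoring step with the Spark MLlib RDD API, using the parameters from the article (10 trees, maximum depth 12); the impurity, bin count, seed, and feature-subset strategy are assumptions. Since the MLlib RandomForestModel predicts only a class label, the sketch recovers a match probability by averaging the individual trees' votes:

import java.util.HashMap;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.tree.model.RandomForestModel;

public class DriverModel {
    /** Trains one forest per driver: label 1.0 = the driver's route, 0.0 = another driver's. */
    static RandomForestModel train(JavaRDD<LabeledPoint> training) {
        return RandomForest.trainClassifier(
                training,
                2,                                // numClasses: match / mismatch
                new HashMap<Integer, Integer>(), // no categorical features
                10,                               // numTrees, from the article
                "auto",                           // featureSubsetStrategy (assumption)
                "gini",                           // impurity (assumption)
                12,                               // maxDepth, from the article
                32,                               // maxBins (assumption)
                12345);                           // seed (assumption)
    }

    /** Probability that a route belongs to the driver: fraction of trees voting "match". */
    static double matchProbability(RandomForestModel model, Vector routeFeatures) {
        double votes = 0;
        for (DecisionTreeModel tree : model.trees()) {
            votes += tree.predict(routeFeatures);  // each tree predicts 0.0 or 1.0
        }
        return votes / model.trees().length;
    }
}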
Pipeline
The comSysto team's workflow is divided into several independent steps, implemented as Java applications that can be submitted to Spark for execution with the spark-submit script. The pipeline takes a Hadoop SequenceFile as input and produces CSV files as output. It consists mainly of the following steps:
Convert the original input files: convert the original 550,000 small CSV files into a single Hadoop SequenceFile (see the sketch after this list).
Extract features and compute statistics: compute the feature values defined above, and use the Spark RDD transformation API to compute statistics such as mean and variance, writing them to a CSV file.
Compute the clustering results: classify the routes using the above features and statistics together with the Spark MLlib API.
Random forest training: select configuration parameters such as maxDepth and the cross-validation setup, and train the random forest model on the per-route features. For the actual Kaggle submission, the comSysto team simply loaded the serialized models, predicted the probability of each route belonging to each driver, and saved the results in CSV format.
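A minimal sketch of the first pipeline step, packing the many small CSV files into one SequenceFile keyed by file path (the HDFS paths and application name are illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CsvToSequenceFile {
    public static void main(String[] args) {
        // Packaged as a Java application and run via spark-submit, as in the article.
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("csv-to-seqfile"));

        // wholeTextFiles yields one (path, fileContents) pair per small CSV file.
        sc.wholeTextFiles("hdfs:///drivers/*/*.csv")              // illustrative input path
          .mapToPair(t -> new Tuple2<>(new Text(t._1), new Text(t._2)))
          .saveAsHadoopFile("hdfs:///drivers/routes.seq",         // illustrative output path
                  Text.class, Text.class, SequenceFileOutputFormat.class);

        sc.stop();
    }
}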
In the end, the comSysto team's prediction model ranked 670th on the Kaggle leaderboard with an accuracy of 74%. The team noted that for a model completed in only two days, this accuracy was acceptable, and that given more time it could certainly be improved. More importantly, the exercise demonstrated that a high-performance distributed computing platform can be used to solve practical machine learning problems.
The above is how to profile users based on Spark. Some of these techniques may well come up in everyday work; hopefully you have learned something from this article.