Machine Learning and Its Application Scenarios in the Big Data Era


Machine learning is a branch of artificial intelligence that studies computer algorithms that improve automatically through experience.

Machine learning is an interdisciplinary field that draws on computer science, information science, mathematics, statistics, neuroscience, and other disciplines.

Machine learning is the core technology of big data; in essence, it is processing based on algorithms that learn from experience. Machine learning emphasizes three keywords: algorithm, experience, and performance.

Starting from the data, a model is built with an algorithm and then evaluated. If the evaluated performance meets the requirements, the model is used on other data; if not, the algorithm is adjusted, the model is rebuilt, and it is evaluated again. This cycle repeats until a satisfactory model is obtained that can then be applied to other data.
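
This loop can be sketched in code. Below is a minimal, schematic Scala sketch of the build-evaluate-adjust cycle; `train`, `evaluate`, and `adjust` are hypothetical placeholders standing in for an algorithm, a performance metric, and a tuning step, not any real library API.

```scala
// Schematic build/evaluate/adjust cycle. All three function parameters are
// hypothetical placeholders, not a real library API.
def fitUntilAcceptable[D, M](
    data: D,
    threshold: Double,
    train: D => M,                 // builds a model from data
    evaluate: (M, D) => Double,    // measures the model's performance
    adjust: (D => M) => (D => M)   // tunes the algorithm when performance is low
): M = {
  var trainer = train
  var model   = trainer(data)
  while (evaluate(model, data) < threshold) {
    trainer = adjust(trainer)      // adjust the algorithm...
    model   = trainer(data)        // ...then rebuild and re-evaluate the model
  }
  model                            // a model whose performance is acceptable
}
```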

Machine learning techniques and methods have been successfully applied in many fields, such as Jinri Toutiao's personalized recommendation system, Ant Financial's financial anti-fraud, iFLYTEK's speech recognition, natural language processing and Google machine translation, pattern recognition, intelligent control, spam filtering, and so on.

Categories of machine learning

Supervised learning

Supervised learning learns a model from a given training data set, uses the model to make predictions, compares the predictions against the actual results, and keeps adjusting the model until it reaches the expected accuracy.

Common algorithms include regression analysis and statistical classification. Supervised learning is often used to train neural networks and decision trees, both of which depend heavily on a predetermined classification scheme, as in spam filtering or the classification of news content.

Unsupervised learning

In unsupervised learning, the training set has no human-labeled results; the model must infer some internal structure of the data. Common application scenarios include learning association rules and clustering.

The goal of this kind of learning is not to maximize a utility function but to find similar points in the training data. Clustering often discovers intuitive groupings that match our assumptions fairly well; for example, clustering individuals by demographic attributes may divide a population into a wealthy group and a poor group.

Semi-supervised learning

Semi-supervised learning sits between supervised and unsupervised learning and considers how to use a small number of labeled samples together with a large number of unlabeled samples for training and classification. The learning algorithm first attempts to model the unlabeled data and then uses that structure to predict the labeled data; examples include graph-based inference algorithms and the Laplacian support vector machine.

Common machine learning algorithms

Regression algorithms

Examples include the least squares method, logistic regression, stepwise regression, multivariate adaptive regression splines (MARS), and locally estimated scatterplot smoothing (LOESS).
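
As a concrete illustration, here is a minimal least-squares sketch using Spark ML's DataFrame-based `LinearRegression`; the toy data points and the local SparkSession are assumptions made for the example.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ols-sketch").master("local[*]").getOrCreate()

// Toy data: the label grows roughly linearly with the single feature.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(1.0)),
  (2.1, Vectors.dense(2.0)),
  (2.9, Vectors.dense(3.0)),
  (4.2, Vectors.dense(4.0))
)).toDF("label", "features")

// regParam = 0 gives plain (unregularized) least squares.
val lr = new LinearRegression().setMaxIter(10).setRegParam(0.0)
val lrModel = lr.fit(training)
println(s"coefficients = ${lrModel.coefficients}, intercept = ${lrModel.intercept}")
```

The later sketches in this article reuse this `spark` session.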

Instance-based algorithms

Instance-based learning is often called winner-take-all or memory-based learning and is typically used to build models for decision problems. Such models first select a batch of sample data and then compare new data against the samples using some similarity measure, finding the best match in this way.
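
MLlib does not ship a k-nearest-neighbour classifier, so here is a small plain-Scala sketch of the idea of comparing new data against stored samples; the `Sample` type and the Euclidean distance are illustrative assumptions.

```scala
// Instance-based (k-nearest-neighbour) classification: store the samples,
// measure how close the query is to each, and let the k closest ones vote.
case class Sample(features: Array[Double], label: String)

def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def knnPredict(samples: Seq[Sample], query: Array[Double], k: Int): String =
  samples
    .sortBy(s => euclidean(s.features, query)) // closest samples first
    .take(k)                                   // keep the k best matches
    .groupBy(_.label)
    .maxBy(_._2.size)._1                       // majority vote among the k
```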

Decision tree learning

Based on the attributes of the data, a tree structure is used to build a decision model; decision trees are often used to solve classification and regression problems.
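
A minimal sketch with Spark ML's `DecisionTreeClassifier`, reusing the `spark` session from the regression sketch; the toy spam/ham rows are made up. The `StringIndexer` converts the string labels into indexed doubles and attaches the class-count metadata the tree expects.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.linalg.Vectors

val raw = spark.createDataFrame(Seq(
  ("spam", Vectors.dense(1.0, 0.0)),
  ("ham",  Vectors.dense(0.0, 1.0)),
  ("spam", Vectors.dense(1.0, 1.0)),
  ("ham",  Vectors.dense(0.0, 0.0))
)).toDF("category", "features")

// Index the string labels into doubles, with class-count metadata attached.
val indexed = new StringIndexer()
  .setInputCol("category").setOutputCol("label")
  .fit(raw).transform(raw)

val tree = new DecisionTreeClassifier().setMaxDepth(3)
val treeModel = tree.fit(indexed)
println(treeModel.toDebugString)   // print the learned tree structure
```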

Bayesian learning

Bayesian methods are mainly used for classification and regression problems; the naive Bayes algorithm is the best-known representative.
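
A matching naive Bayes sketch with Spark ML (again reusing `spark`; the non-negative toy counts suit the default multinomial model):

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.linalg.Vectors

val nbData = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(2.0, 0.0, 0.0)),
  (1.0, Vectors.dense(0.0, 1.0, 3.0)),
  (0.0, Vectors.dense(1.0, 1.0, 0.0)),
  (1.0, Vectors.dense(0.0, 2.0, 1.0))
)).toDF("label", "features")

val nb = new NaiveBayes()          // multinomial naive Bayes by default
val nbModel = nb.fit(nbData)
nbModel.transform(nbData).select("label", "prediction").show()
```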

Clustering and classification algorithms

Clustering and classification are two commonly used kinds of algorithms in machine learning. Clustering partitions data into different sets, while classification predicts the category of new data. The two are introduced below.

(1) What is clustering?

Clustering groups data objects into multiple classes or clusters (Cluster). The goal is for objects within the same cluster to be highly similar, while objects in different clusters differ markedly.

In fact, clustering is a common behavior in daily life, captured by the saying "birds of a feather flock together". Its core idea is grouping, and people continually refine their clustering models to learn how to distinguish among things and people.

(2) What is classification?

Data warehouses, databases, and other information repositories hold a great deal of knowledge that can inform decision making in business, scientific research, and other activities. Classification and prediction are two forms of data analysis that can be used to extract important data sets or to predict future data trends.

Classification (Classification) predicts the discrete category (Categorical Label) of a data object, while prediction (Prediction) estimates a continuous value for a data object.

Classification process: new sample → feature selection → classification → evaluation

Training flow: training set → feature selection → training → classifier

Early machine learning classification applications were mostly based on such in-memory methods and algorithms. Today, data mining methods are required to handle large-scale, disk-resident data collections and to scale accordingly.

The machine learning library Spark MLlib

MLlib is Spark's machine learning (Machine Learning) library. It aims to simplify the engineering practice of machine learning and to scale to larger problems. Machine learning requires many iterations; under the Hadoop computing framework, every iteration reads from and writes to disk, incurring very heavy I/O and CPU costs, whereas Spark's in-memory computation has an inherent advantage here. Moreover, Spark's RDDs can seamlessly share data and operations with other sub-frameworks and libraries such as Spark SQL, Spark Streaming, and GraphX; for example, MLlib can directly use data provided by Spark SQL, or join directly with GraphX graph computations.

The position of MLlib in the Spark ecosystem

Spark MLlib architecture

You can see from the architecture diagram that MLlib consists of three main parts:

Underlying foundation: including Spark runtime, matrix library and vector library

Algorithm library: containing algorithms such as generalized linear models, recommendation systems, clustering, decision trees, and evaluation

Utilities: including test data generation, reading external data, and other functions.

The following figure shows the core of the MLlib algorithm library.

MLlib consists of general-purpose learning algorithms and tools, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as low-level optimization primitives and a high-level pipeline API.

Specifically, it mainly includes the following aspects:

1. Algorithm tools: common learning algorithms such as classification, regression, clustering, and collaborative filtering

2. Featurization tools: tools for feature extraction, transformation, dimensionality reduction, and selection

3. Pipelines: tools for constructing, evaluating, and tuning machine learning pipelines (see the sketch after this list)

4. Persistence: saving and loading algorithms, models, and pipelines

5. Utilities: linear algebra, statistics, data handling, and other tools.
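
To make items 1-3 concrete, here is a minimal pipeline sketch in the style of the Spark documentation, reusing the `spark` session from earlier; the two toy documents are made up. A Tokenizer and HashingTF do the featurization, and a logistic regression does the learning, chained into one Pipeline.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val docs = spark.createDataFrame(Seq(
  (0L, "spark mllib makes machine learning scalable", 1.0),
  (1L, "completely unrelated text about cooking", 0.0)
)).toDF("id", "text", "label")

// Featurization: split text into words, then hash the words into vectors.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// The pipeline chains featurization and the estimator into a single unit
// that can be fit, evaluated, tuned, saved, and loaded as a whole.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val pipelineModel = pipeline.fit(docs)
pipelineModel.transform(docs).select("text", "prediction").show()
```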

Spark divides the machine learning algorithm into two modules:

Training module: learns model parameters from training samples

Prediction module: initializes with the learned model parameters, predicts on test samples, and outputs predicted values.

Analysis of classic algorithms in MLlib

Classification

Classification is an important machine learning and data mining technique. Its purpose is to construct, from the characteristics of a data set, a classification function or classification model (often called a classifier) that can map samples of unknown category to one of the given categories.

The specific rules of classification can be described as follows:

Given a training set T (Training set), each record in T contains several attributes (Features) that form a feature vector, written X = (x1, x2, …, xn). Each xi may have a different value range: when an attribute's range is continuous, it is a numerical attribute (Numerical Attribute); otherwise it is a discrete attribute (Discrete Attribute). The class attribute takes values C1, C2, …, Ck, that is, the data set contains k distinct categories. T therefore implies a mapping function from feature vectors X to the class attribute C: F(X) ↦ C. The purpose of classification is to analyze the input data and, from the characteristics of the training data, derive an accurate description or model for each class, then use this model to express the implicit mapping function.

Constructing a classification model generally has two stages: training and testing. Before building the model, the data set is randomly divided into a training set and a test set. The training set is first used to construct the classification model, and the test set is then used to evaluate the model's classification accuracy. If the accuracy is considered acceptable, the model can be used to classify other data tuples. Generally, the cost of the testing stage is far lower than that of the training stage.
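
A minimal sketch of these two stages with Spark ML, assuming a hypothetical labelled DataFrame `labeledData` with `label` and `features` columns:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Randomly divide the data set into training and test sets.
val Array(trainDF, testDF) = labeledData.randomSplit(Array(0.8, 0.2), seed = 42L)

// Training stage: construct the classification model on the training set.
val clf = new LogisticRegression().fit(trainDF)

// Testing stage: evaluate classification accuracy on the held-out test set.
val predictions = clf.transform(testDF)
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"test accuracy = $accuracy")
```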

MLlib provides classification algorithms built on different ideas, such as support vector machines (SVM), decision trees, Bayesian methods, and KNN. The spark.mllib package supports binary classification, multi-class classification, and regression analysis. The following table lists the algorithms supported for each type of problem.
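
For example, the RDD-based spark.mllib package trains a linear SVM as below (a minimal sketch with made-up points, reusing the `spark` session from earlier):

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val trainingRDD = spark.sparkContext.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 3.0)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -2.0)),
  LabeledPoint(0.0, Vectors.dense(-2.0, -1.0))
))

// Train a binary linear SVM with stochastic gradient descent.
val svmModel = SVMWithSGD.train(trainingRDD, 100)   // 100 iterations
println(svmModel.predict(Vectors.dense(1.5, 2.5))) // expect class 1.0
```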

Each algorithm involves too much detail to describe fully here.

Application scenarios for classification algorithms

1. Predicting citizens' public transport travel choices

Based on massive public transport data records, we hope to mine citizens' behavior patterns in public transport. Taking the prediction of citizens' bus routes as the goal, the idea is to analyze historical bus card transaction data from selected bus lines in Guangdong Province, mine the behavior patterns of regular riders, analyze and infer passengers' travel habits and preferences, and build a model that predicts which bus lines people will take in the coming week, thereby providing passengers with an informed, safe, and comfortable travel environment and using data to lead the intelligent urban travel of the future.

2. Personal credit evaluation based on operator data.

As network service providers, telecom operators accumulate large amounts of basic user information and behavioral data, such as device data, plan consumption data, and communication data. The real-name policy ensures that operator data can be matched to users' real identities and truly, objectively reflects user behavior. Extensive network infrastructure provides the conditions for accumulating large amounts of real-time data, which reflect users' information and characteristics across every dimension in real time.

In China, personal credit is evaluated mainly through the central bank's personal credit reports, but for the many users who have never established a credit record, the cost for financial institutions to learn their credit history is high, and traditional credit assessment methods struggle to meet various emerging needs. Financial business differs from other big data businesses in its high demands on the authenticity, credibility, and timeliness of data, and this is precisely where operator data is valuable.

The hope is to use operators' user data to provide a sound personal credit assessment.

3. Classification of product images

JD.com hosts millions of product images, and features such as "photo shopping" and "find the same item" must classify the images users provide. At the same time, the extracted product-image features can be supplied to recommendation, advertising, and other systems to improve their effectiveness.

The hope is to achieve image classification by learning from the image data.

4. Predicting advertising click behavior

While browsing the web, users may be exposed to or click on advertisements. Predicting ad clicks can guide advertisers in targeting and optimizing their campaigns so as to maximize the return on advertising investment.

The hope is to predict, based on six months of ad exposure and click logs (including ad monitoring point data) for 1 million randomly selected users, whether each user will click at each monitoring point within the following 8 days.

5. Spam message recognition based on text content

Spam messages have increasingly become a thorny problem for operators and mobile phone users, seriously affecting people's daily lives, damaging operators' public image, and endangering social stability. Meanwhile, lawbreakers use technical means to constantly change the form of spam messages and spread them through a wide range of channels; traditional filtering based on rules and keywords has limited effect, and many spam messages "escape" the filters and still reach mobile terminals.

The hope is to intelligently identify spam messages and their variants from the text content of short messages, combining machine learning algorithms with big data analysis and mining.

6. Sogou user profile mining for big data precision marketing

The old saying "birds of a feather flock together, people are divided into groups" not only reveals the trend of self-organization between things and people, but also implies the internal relationship between "clustering" and "crowd". In the modern digital advertising delivery system, simulating people with things and seeing people with things is a bigger premise than any big data. In the modern advertising system, the multi-level systematic user profile construction algorithm is one of the basic technologies to achieve accurate advertising. Among them, the advertising targeting technology based on population attribute is the key technology which is generally suitable for brand display advertising and precision bidding advertising. In the search bidding advertising system, users obtain relevant information by entering specific query words in the search engine. Therefore, the historical query words of users are closely related to the basic attributes and potential needs of users.

The hope is to take one month of users' historical query terms, together with their demographic attribute labels (gender, age, education), as training data, and to build a classification algorithm through machine learning and data mining that infers the demographic attributes of new users.

Clustering

Clustering divides similar objects into different groups or subsets (subset) through static classification, so that members of the same subset share similar attributes. Cluster analysis can be regarded as an unsupervised learning technique.

Spark 2.0 (the DataFrame-based API, not the RDD API-based MLlib) provides four clustering methods:

(1) K-means

(2) Latent Dirichlet allocation (LDA)

(3) Bisecting k-means

(4) Gaussian Mixture Model (GMM).

The RDD API-based MLlib provides six clustering methods:

(1) K-means

(2) Gaussian mixture

(3) Power iteration clustering (PIC)

(4) Latent Dirichlet allocation (LDA)

(5) Bisecting k-means

(6) Streaming k-means

Power iteration clustering (PIC) and streaming k-means are the two additions.

Of these, the K-means algorithm is the most commonly used.

The K-means algorithm (K-Means) is a partitioning clustering method. Its idea is to iteratively adjust the cluster centers so as to minimize the sum of squared errors between each sample and the mean of its cluster.

K-means is an iterative clustering algorithm of the partitioning type: it first creates K partitions and then iteratively moves samples from one partition to another to improve the quality of the final clustering.

The K-means algorithm models clustering problems simply, is easy to understand, and can run in parallel in a distributed environment. Studying it also makes it easier to understand the strengths and weaknesses of clustering algorithms in general and how other algorithms behave on specific data.

The K in K-means is the number of clusters, which the algorithm requires the user to supply. If news articles are to be clustered into categories such as politics, economics, and culture, choosing a K between 10 and 20 is reasonable, because the number of such top-level categories is small; for a fine-grained classification of the same news, a K between 50 and 100 is also fine. The K-means algorithm proceeds in three steps.

The first step is to take the points to be clustered and randomly select K samples as the initial cluster centers.

The second step is to compute each point's distance to every cluster center and assign each point to the nearest cluster.

The third step is to compute the coordinate average of all points in each cluster and take this average as the new cluster center.

Repeat from the second step until the cluster centers no longer move significantly or the number of iterations reaches the required limit.
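
These three steps are what Spark ML's `KMeans` runs internally. A minimal sketch, reusing the `spark` session from earlier, with four made-up 2-D points forming two obvious clusters:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

val points = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 0.0)),
  Tuple1(Vectors.dense(0.2, 0.1)),
  Tuple1(Vectors.dense(9.0, 9.0)),
  Tuple1(Vectors.dense(9.1, 8.9))
)).toDF("features")

// K (the number of clusters) must be supplied by the user, as noted above.
val kmeans = new KMeans().setK(2).setMaxIter(20).setSeed(1L)
val kmModel = kmeans.fit(points)
kmModel.clusterCenters.foreach(println)   // the two learned cluster centers
kmModel.transform(points).show()          // each point's cluster assignment
```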

Application scenarios for clustering algorithms

1. Business site selection based on user location information

With the rapid development of information technology, mobile devices and the mobile Internet have reached millions of households. When users use the mobile network, they naturally leave location information behind. With the continuous improvement and popularization of GIS geographic information technology in recent years, combining user locations with GIS information enables innovative applications. For example, Baidu cooperates with Wanda, using users' positions together with Wanda's merchant information to improve merchants' efficiency.

The hope is to recommend new store locations for a catering chain from large volumes of mobile-device user location data.

2. Standardization of Chinese addresses

An address is a variable that carries rich information, but for a long time, owing to the complexity of Chinese text processing and the irregularity of Chinese address naming conventions, the rich information contained in addresses could not be deeply analyzed and mined. Standardizing addresses makes multi-dimensional quantitative mining and analysis based on addresses possible, providing richer methods and means for e-commerce mining applications across different scenarios, and is therefore of significant practical value.

3. Identification of non-human malicious traffic

In the first quarter of 2016, Facebook reported that half a year of traffic-quality testing on its Atlas DSP platform showed that 75 percent of its non-human malicious traffic was generated by means such as bot simulation and blacklisted IPs. In the first half of 2016 alone, AdMaster's anti-cheating solution identified an average of 28% cheating traffic per day. The problem of low-quality fake traffic has always existed; it is the cat-and-mouse game the digital marketing industry has been playing for the past decade. According to AdMaster's massive monitoring data, more than 50% of projects show suspected cheating; across projects, cheating traffic accounts for anywhere from 5% to 95% of ad traffic; vertical and ad-network media have the highest share of cheating traffic; and the share of cheating traffic on PC is significantly higher than on mobile and smart-TV platforms. Ad-monitoring behavioral data is increasingly used for modeling and decision making, such as building user profiles and matching the same user across devices. Behavior produced by cheating, malicious exposure, web crawlers, misleading clicks, and even hijacked visits made without user awareness injects huge noise into this data and severely affects model training.

The hope is to build a model from the given data that identifies and flags cheating traffic, removing noise from the data so that it can be used more effectively and advertisers' returns are maximized.

Collaborative filtering

Collaborative filtering (Collaborative Filtering, CF) is defined (on Wikipedia) roughly as follows: using the preferences of a group with similar interests and shared experience to recommend information of interest to a user. Through a cooperative mechanism, individuals respond to information to a considerable degree (for example, by rating it) and have their responses recorded, which achieves the filtering and in turn helps others filter information. The responses are not necessarily limited to information of particular interest; records of particularly uninteresting information matter as well.

Collaborative filtering is often used in recommendation systems. These techniques aim to fill in the missing entries of a user-item association matrix.

MLlib currently supports model-based collaborative filtering, in which users and items are represented by a small set of latent factors that are then used to predict the missing entries. MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors.

Users' preferences for items or information may, depending on the application, include ratings of items, viewing records, purchase records, and so on. These preferences fall into two categories:

Explicit user feedback: feedback that users provide explicitly, beyond simply browsing or using the site, such as rating or commenting on items.

Implicit user feedback: data generated as the user uses the site, which implicitly reflects the user's preference for items, such as purchasing an item or viewing an item's information.

Explicit feedback accurately reflects users' real preferences for items but requires users to pay an extra cost, while implicit user behavior, after some analysis and processing, can also reflect preferences, though the data are less precise and analysis of some behaviors is quite noisy. However, as long as appropriate behavioral features are chosen, implicit feedback can also yield good results; which features are appropriate may differ greatly across applications. On e-commerce websites, for example, a purchase is an implicit feedback signal that expresses a user's preference very well.

Depending on its recommendation mechanism, a recommendation engine may use only part of the data sources, analyze certain rules from that data, or directly predict and compute users' preferences for other items. The engine can then recommend items likely to interest a user the moment the user arrives.

In MLlib's model-based collaborative filtering, users and products are described by a set of latent factors that can be used to predict missing entries. Specifically, MLlib implements the alternating least squares (ALS) algorithm to learn these latent factors, and the implementation exposes the following parameters:

numBlocks is the number of blocks used for parallel computation (set to -1 for automatic configuration)

rank is the number of latent factors in the model

iterations is the number of iterations to run

lambda is the regularization parameter of ALS

implicitPrefs chooses between the explicit-feedback ALS variant and the variant adapted to implicit-feedback data

alpha is a parameter of the implicit-feedback variant that sets the baseline confidence in observed preference behavior.
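
A minimal sketch with the DataFrame-based ALS in spark.ml, whose setters correspond to the parameters listed above (in spark.ml, lambda is called regParam). `ratings` and `testRatings` are hypothetical DataFrames with user, item, and rating columns.

```scala
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setRank(10)              // rank: number of latent factors
  .setMaxIter(10)           // iterations
  .setRegParam(0.01)        // lambda: the regularization parameter
  .setImplicitPrefs(false)  // explicit-feedback variant
  .setAlpha(1.0)            // only used by the implicit-feedback variant
  .setUserCol("user").setItemCol("item").setRatingCol("rating")

val alsModel = als.fit(ratings)                  // learn user and item factors
val predicted = alsModel.transform(testRatings)  // predict the missing ratings
```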

Application scenarios for collaborative filtering

1. E-commerce features such as "customers who bought XX also bought XX", bundle recommendations, and "take a look" suggestions.

2. Personalized recommendation of Jinri Toutiao.

3. Douban's groups of users with shared interests.

4. Movie recommendation system.

5. Baidu Maps' recommendations of nearby food based on geographic location.

……

