An Example Analysis of the Naive Bayesian Method in Big Data

2025-03-29 Update From: SLTechnology News&Howtos shulou


Shulou(Shulou.com)06/01 Report--

This article shares an example analysis of the naive Bayesian method in big data. The editor finds it very practical and shares it here as a reference; follow along to take a look.

The two most widely used classification models are the decision tree model (Decision Tree Model) and the naive Bayesian model (Naive Bayesian Model, NBM). This case uses the naive Bayesian model. The naive Bayesian method is a classification method based on Bayes' theorem together with the assumption of conditional independence between features. This section focuses on analyzing this algorithm.

1. Spam Recognition Algorithm: Naive Bayes

Compared with the decision tree model, the naive Bayesian classifier (Naive Bayes Classifier, NBC) originates from classical mathematical theory, and so has a solid mathematical foundation and stable classification performance. At the same time, the NBC model has few parameters to estimate, is not sensitive to missing data, and is relatively simple as an algorithm.

In theory, the NBC model has the smallest error rate of any classification method. In practice this is not always the case, because the NBC model assumes that the attributes are independent of one another, which often does not hold in real applications, and this hurts the model's classification accuracy to some degree.

Invented more than 250 years ago, this approach holds an unparalleled position in the information field. Bayesian classification is the general name for a family of classification algorithms, all based on Bayes' theorem, hence the collective name. The naive Bayesian algorithm (Naive Bayesian) is one of the most widely used among them.

1. Realizing the Core of Bayesian Classification in Basic Machine Learning

Classification is the process of assigning an unknown sample to one of several pre-defined classes. Data classification is solved as a two-step process. The first step builds a model describing pre-set data or concepts; the model is constructed by analyzing samples (also called instances or objects) described by attributes. Each sample is assumed to carry a pre-defined class, determined by an attribute called the class label. The data tuples analyzed to build the model form the training data set; because each training sample's class label is known, this is also called supervised learning.
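The two-step process above can be sketched in a few lines of Python. The training data below is entirely hypothetical (a toy weather/play data set, not from the article); the point is only to show step 1 (building a model from labeled tuples) and step 2 (classifying an unknown sample), here using naive-Bayes-style counts with add-one smoothing:

```python
from collections import Counter, defaultdict

# Hypothetical training set: each tuple is (attribute values, class label).
# The class label ("yes"/"no") plays the role of the class tag described above.
training = [
    (("sunny", "hot"),  "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

# Step 1: build the model -- per-class counts of each attribute value.
class_counts = Counter(label for _, label in training)
attr_counts = defaultdict(Counter)          # (class, position) -> value counts
for attrs, label in training:
    for i, v in enumerate(attrs):
        attr_counts[(label, i)][v] += 1

# Step 2: classify an unknown sample by P(c) * prod_i P(attr_i | c),
# with add-one (Laplace) smoothing to avoid zero probabilities.
def classify(attrs):
    best, best_p = None, -1.0
    for c, n in class_counts.items():
        p = n / len(training)
        for i, v in enumerate(attrs):
            p *= (attr_counts[(c, i)][v] + 1) / (n + 2)
        if p > best_p:
            best, best_p = c, p
    return best
```

With this toy data, `classify(("rainy", "cool"))` picks "yes" and `classify(("sunny", "hot"))` picks "no", since the class priors and per-class attribute counts favor those labels.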

Among the many classification models, the two most widely used are the decision tree model and the naive Bayesian model. The decision tree model solves the classification problem by constructing a tree: first, a decision tree is built from the training data set; once the tree is established, it can classify unknown samples.

Using the decision tree model for classification has many advantages: decision trees are easy to use and efficient; rules can easily be derived from the tree, and those rules are usually easy to explain and understand; decision trees extend well to large databases, and the tree's size is independent of the database's size; and a decision tree can be built for data sets with many attributes. The decision tree model also has shortcomings, such as difficulty in handling missing data, a tendency to overfit, and ignoring correlations between attributes in the data set.

The usual solution to this last problem is to build an attribute model and handle the attributes that are not mutually independent separately. For example, in Chinese text classification and recognition, a dictionary can be set up to handle certain phrases; if a particular problem has special pattern attributes, handle them separately.

This is also consistent with the Bayesian probability principle, because we treat a phrase as a separate pattern. English texts, for example, deal with words of different lengths, each treated as a separate pattern. This is where natural language differs from other classification and recognition problems.

When the prior probabilities are actually computed, the result is the same, because the program treats these patterns as probabilities rather than understanding them as natural language the way people do.

When the number of attributes is large or the correlation between attributes is strong, the classification performance of the naive Bayesian model falls short of the decision tree model. This point needs case-by-case verification, though: different problems give different algorithmic results, and for the same problem and the same algorithm, changing the pattern representation changes recognition performance. Many papers acknowledge this; how well an algorithm recognizes the attributes is determined by many factors, such as the ratio of training samples to test samples.

For text classification and recognition, the choice depends on the situation: when attribute correlation is small, the naive Bayesian model performs relatively well; when attribute correlation is large, the decision tree algorithm performs better.

2. Expression Description of Naive Bayesian Classification

Bayesian reasoning was developed by the British mathematician Thomas Bayes (1702-1761) to describe the relationship between two conditional probabilities, such as P(A|B) and P(B|A). From the multiplication rule it follows immediately that P(A ∩ B) = P(A) * P(B|A) = P(B) * P(A|B). Rearranging gives P(B|A) = P(A|B) * P(B) / P(A).
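These identities are easy to verify numerically. The sketch below uses a made-up joint distribution (100 equally likely outcomes; the counts are assumptions chosen for illustration) and checks the multiplication rule and its rearranged Bayes form term by term:

```python
# Made-up counts: 100 equally likely outcomes; A holds in 30, B in 40, both in 12.
n, n_a, n_b, n_ab = 100, 30, 40, 12

p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
p_b_given_a = n_ab / n_a        # P(B|A) = P(A ∩ B) / P(A)
p_a_given_b = n_ab / n_b        # P(A|B) = P(A ∩ B) / P(B)

# Multiplication rule: P(A ∩ B) = P(A) * P(B|A) = P(B) * P(A|B)
assert abs(p_ab - p_a * p_b_given_a) < 1e-12
assert abs(p_ab - p_b * p_a_given_b) < 1e-12

# Rearranged form: P(B|A) = P(A|B) * P(B) / P(A)
assert abs(p_b_given_a - p_a_given_b * p_b / p_a) < 1e-12
```

With these counts P(B|A) = 12/30 = 0.4 and P(A|B) = 12/40 = 0.3, and both routes to P(A ∩ B) give 0.12, as the identities require.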

Usually, the probability of event A given that event B has occurred differs from the probability of event B given that event A has occurred; there is, however, a definite relationship between them, and Bayes' rule is the statement of that relationship. The Bayesian method concerns the conditional and marginal probabilities of random events A and B, where P(A|B) is the probability that A occurs given that B has occurred.

In Bayes' law, every term has a conventional name:

Pr(A) is the prior probability or marginal probability of A. It is called "prior" because it does not take any information about B into account.

Pr(A|B) is the conditional probability of A given that B has occurred; because it is derived from the value of B, it is also called the posterior probability of A.

Pr(B|A) is the conditional probability of B given that A has occurred; because it is derived from the value of A, it is also called the posterior probability of B.

Pr(B) is the prior probability or marginal probability of B, and also serves as the normalizing constant (normalized constant).

With these terms, Bayes' rule can be expressed as:

Posterior probability = (likelihood * prior probability) / normalizing constant; that is, the posterior probability is proportional to the product of the prior probability and the likelihood.

In addition, the ratio Pr(B|A) / Pr(B) is sometimes called the standardised likelihood (standardised likelihood), so Bayes' rule can also be expressed as:

Posterior probability = standardised likelihood * prior probability.

Global counters for distributed Bayesian classification learning

When a machine learning case based on the naive Bayesian classification algorithm runs in a stand-alone environment, it is enough to load the learning data in full and then apply the Bayesian expression to compute the statistical ratio for each word, because all required parameters can be collected directly from the same set of data files. When the business migrates to a MapReduce distributed environment, however, the situation changes fundamentally. Figure 14.9 shows how the Bayesian classification expression is used in spam recognition:

It can be seen that several key ratio parameters must be computed during the learning statistics phase:

The percentage of all messages that contain a particular word

The percentage of messages that are spam

The percentage of spam messages that contain that particular word.

Therefore, all of the learning data must be summarized: at a minimum, the total number of messages in the learning data, the number of spam messages, the number of valid messages, and so on must be determined. The data input of a MapReduce task comes from HDFS, and HDFS automatically splits very large data files into equal-sized blocks stored on different data nodes. At the same time, MapReduce tasks follow the data-locality principle of "start the computing task on the node where the data resides", so the learning data analysis and statistics run in parallel across different Java virtual machines and even different compute nodes. Traditional shared variables therefore cannot solve this summary-statistics problem.
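In a stand-alone setting, the three ratios listed above fall out of a single pass over the corpus. The Python sketch below (a hypothetical five-message corpus, not the article's data) shows the counts the distributed job would have to aggregate globally, and how Bayes' rule combines them:

```python
# Tiny hand-made learning corpus (hypothetical): (set of words, is_spam).
corpus = [
    ({"buy", "cheap", "meds"},   True),
    ({"cheap", "offer", "now"},  True),
    ({"meeting", "today"},       False),
    ({"lunch", "offer"},         False),
    ({"project", "update"},      False),
]
word = "offer"

total          = len(corpus)
spam_total     = sum(1 for _, s in corpus if s)
with_word      = sum(1 for ws, _ in corpus if word in ws)
spam_with_word = sum(1 for ws, s in corpus if s and word in ws)

p_word         = with_word / total            # share of all messages containing the word
p_spam         = spam_total / total           # share of messages that are spam
p_word_in_spam = spam_with_word / spam_total  # share of spam containing the word

# Bayes' rule then gives the quantity the filter actually needs:
p_spam_given_word = p_word_in_spam * p_spam / p_word
```

Note that `total`, `spam_total`, `with_word`, and `spam_with_word` are exactly the global tallies that, under MapReduce, would be scattered across Map tasks on different nodes, which is what motivates the unified counter discussed next.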

To implement the counter function with MapReduce, the following options are available:

(1) Use MapReduce's built-in Counter component. MapReduce's Counter automatically records some general statistics, such as the total number of data splits processed by the job. Developers can also define custom Counter types and set or accumulate counter values inside Map or Reduce tasks. However, the built-in Counter has an obvious defect: a Reduce task cannot directly read the value that a Map task accumulated into a counter.

That is, the first time a Reduce task reads the relevant counter, its value is always 0; MapReduce only merges the values set during the Map and Reduce phases after the entire job ends. In this case, the Reduce task needs the totals of valid and spam messages, which are tallied during the Map phase, in order to compute the ratios, so MapReduce's built-in Counter is not suitable here.

(2) Zookeeper, a dedicated component in the Hadoop ecosystem, provides API support for exactly this kind of cross-node unified counter. But deploying an extra Zookeeper cluster just to maintain a few numeric counters is clearly too expensive, so this solution does not apply to the current case either.

(3) Implement a simple unified counter ourselves. The implementation is relatively straightforward: define numeric variables on a single dedicated node, and access them over the network whenever a counter needs to be set, accumulated, or read. In an ordinary environment such a counter service would be tedious to implement, because it requires a great deal of network data exchange; but once a custom RPC calling component is in place, setting and reading values over the network becomes as simple as calling an ordinary Java method on the local machine. The counter service can then be implemented along the following structure:

Note: since multiple data-processing nodes concurrently issue requests to the counter service, the safety of the counter variables must be considered. In the simplest design, keep the counter service's set-value, accumulate-value, and get-value methods synchronized.
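The article's counter service lives in a Java/RPC environment; as a language-neutral illustration of the synchronization point in the note above, here is a minimal Python stand-in where a lock plays the role of `synchronized`, and several threads stand in for concurrent Map tasks (the class and counter names are invented for this sketch):

```python
import threading

class GlobalCounterService:
    """Local stand-in for the networked counter service described above.

    In the real deployment the three methods would be exposed through the
    custom RPC component; here a lock keeps concurrent set/add/get calls safe,
    mirroring the 'keep the methods synchronized' advice.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def set(self, name, value):
        with self._lock:
            self._counters[name] = value

    def add(self, name, delta=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + delta

    def get(self, name):
        with self._lock:
            return self._counters.get(name, 0)

# Simulate eight "map tasks" each accumulating 1000 spam-message counts.
svc = GlobalCounterService()
threads = [
    threading.Thread(target=lambda: [svc.add("spam_total") for _ in range(1000)])
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After all threads join, `svc.get("spam_total")` is exactly 8000; without the lock, lost updates between concurrent read-modify-write steps could make the total come up short.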

3. Data Cleaning and Analysis Result Storage

MapReduce is a typical non-real-time data-processing engine, which means it cannot serve scenarios that require real-time feedback. A MapReduce task can only process the complex data in the background, producing intermediate results that the terminal then uses for real-time computation. Moreover, the metadata retrieval and network transfer of the HDFS file system carry heavy IO overhead, so if the intermediate result set is small enough not to need distributed file storage, storing it in HDFS anyway would hurt the final service's efficiency. Therefore, after the unified cleaning and analysis of the data, one of the following storage strategies is generally chosen for the intermediate results:

If the result after cleaning is regular data of a small magnitude, it can be directly stored in a Key-Value cache system such as Redis.

If the cleaned result set is larger, the Reduce task can store it in a traditional RDBMS, letting the business system run real-time queries with SQL statements.

If the result after cleaning is still a large amount of data, it can be stored in a distributed database such as HBase to provide efficient big data query.

This project uses Redis to cache data, with a Redis connection pool. As with an RDBMS, Redis can improve data access efficiency and throughput through connection pooling.
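The pooling principle itself is independent of Redis: connections are created once and handed out repeatedly rather than re-opened per request. The sketch below illustrates just that principle with a generic stdlib queue and a fake connection class (it is not the redis-py API; `ConnectionPool` and `FakeConn` here are invented names for this illustration):

```python
import queue

class ConnectionPool:
    """Minimal illustration of the pooling idea (not the redis-py API):
    a fixed set of connections is created up front and reused."""
    def __init__(self, factory, size=4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self):
        return self._pool.get()      # blocks if all connections are in use

    def release(self, conn):
        self._pool.put(conn)

# Stand-in "connection"; a real pool would hold Redis sockets instead.
class FakeConn:
    created = 0
    def __init__(self):
        FakeConn.created += 1

pool = ConnectionPool(FakeConn, size=2)
results = []
for _ in range(10):                  # ten requests, yet only two connections built
    c = pool.acquire()
    results.append(c)
    pool.release(c)
```

Ten requests are served while only two connections are ever constructed, which is where the efficiency and throughput gains of pooling come from: connection setup cost is paid once, not per request.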

Thank you for reading! This concludes "An Example Analysis of the Naive Bayesian Method in Big Data". I hope the content above has been of some help and lets you learn more. If you found the article good, share it so more people can see it!
