This article explains, with worked examples, the naive Bayes algorithm and model selection and tuning in Python machine learning. It is quite practical, so it is shared here for reference; I hope you get something out of it after reading.
First, fundamentals of probability 1. Probability
Probability is the possibility that something will happen.
2. Joint probability
The probability that multiple events all hold at the same time, written P(A, B). When the events are independent, P(A, B) = P(A) * P(B).
3. Conditional probability
The probability that event A occurs given that event B has already occurred, written P(A | B).
Property of conditional probability: P(A1, A2 | B) = P(A1 | B) * P(A2 | B)
Note: this property holds because A1 and A2 are assumed to be mutually independent.
The principle of naive Bayes: for each sample, compute the probability that it belongs to each category, and assign it to the category with the highest probability.
Second, naive Bayes 1. Naive Bayesian calculation method
Substituting a concrete example directly into Bayes' formula, P(C | W) = P(W | C) P(C) / P(W), each part is explained as follows:
P(C) = P(technology): the probability of the technology category (number of technology documents / total number of documents).
P(W | C) = P('intelligence', 'development' | technology): the probability that the feature words 'intelligence' and 'development' appear in documents of the technology category. Note: 'intelligence' and 'development' are words that appear in the document to be predicted; technology documents may contain many more feature words, but not all of them appear in the given document, so only the words contained in the given document are used.
Calculation method:
P(F1 | C) = Ni / N (computed from the training set)
Ni is the number of times the word F1 appears across all documents of category C
N is the total number of word occurrences in all documents of category C
P('intelligence' | technology) = (number of times 'intelligence' appears in all technology documents) / (total number of word occurrences in all technology documents)
Then P(F1, F2, ... | C) = P(F1 | C) * P(F2 | C) * ...
P('intelligence', 'development' | technology) = P('intelligence' | technology) * P('development' | technology)
In this way, the probability that the document to be predicted belongs to the technology category can be calculated from its feature words. The same method gives the probabilities for the other categories, and the category with the highest probability is chosen.
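As a concrete illustration (not from the article), here is a minimal Python sketch of the calculation above; the word counts, document counts and vocabulary are made up purely to show how the formulas are applied.

# Minimal sketch of the naive Bayes calculation; all counts are hypothetical.
n_tech_docs = 30          # number of technology documents in the training set
n_total_docs = 100        # total number of documents
tech_word_total = 5000    # total word occurrences in all technology documents
tech_word_counts = {"intelligence": 150, "development": 200}  # Ni for each word

# P(C): probability of the technology category
p_tech = n_tech_docs / n_total_docs

# P(W | C): product of P(Fi | C) = Ni / N over the words of the predicted document
p_words_given_tech = 1.0
for word in ["intelligence", "development"]:
    p_words_given_tech *= tech_word_counts[word] / tech_word_total

# Score proportional to P(technology | W); P(W) is identical for every category,
# so it can be ignored when comparing categories.
score_tech = p_words_given_tech * p_tech
print(score_tech)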
2. Laplace smoothing
If a feature word never appears in documents of some category in the training set, its estimated probability P(F1 | C) becomes 0 and zeroes out the whole product. Laplace smoothing avoids this by adding a smoothing coefficient alpha to the counts: P(F1 | C) = (Ni + alpha * 1) / (N + alpha * m), where m is the number of distinct feature words in the training set; alpha defaults to 1.
3. Naive Bayes API
sklearn.naive_bayes.MultinomialNB
Third, a naive Bayes case study 1. Case overview
The data for this case comes from the 20newsgroups dataset in sklearn. Feature words are extracted from the articles, and naive Bayes is used to compute, for the article to be predicted, the probability of each category; the article is assigned to the category with the highest probability.
The general steps are as follows: first, split the articles into two parts, one used as the training set and the other as the test set. Then use tf-idf to extract features from the training set and the test set respectively, producing the feature matrices x for both. After that, the naive Bayes algorithm can be called directly, feeding the training data x_train and y_train into the model for training. Finally, the trained model can be evaluated on the test set.
2. Data acquisition
Import the datasets module: import sklearn.datasets as dt
Load the data: news = dt.fetch_20newsgroups(subset='all')
3. Data processing.
The method used to split the data is the same as in kNN. In addition, any dataset imported from sklearn exposes .data to get the samples and .target to get the target values.
Split the data: x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)
Instantiate the feature extractor: tf = TfidfVectorizer()
Extract features from the training set: x_train = tf.fit_transform(x_train)
Extract features from the test set: x_test = tf.transform(x_test)
For feature extraction on the test set, only transform is called, because the test set must use the vocabulary fitted on the training set; fit was already done on the training set in the previous step, so the test set can be transformed directly.
4. Algorithm flow
Instantiate the algorithm: mlt = MultinomialNB(alpha=1.0)
Train the algorithm: mlt.fit(x_train, y_train)
Predict: y_predict = mlt.predict(x_test)
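Putting the steps above together, a minimal end-to-end sketch might look like this (the dataset, test_size and alpha values follow the article; everything else is standard sklearn usage):

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the 20newsgroups data
news = fetch_20newsgroups(subset='all')

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25)

# tf-idf features: fit on the training set, reuse the same vocabulary on the test set
tf = TfidfVectorizer()
x_train = tf.fit_transform(x_train)
x_test = tf.transform(x_test)

# Multinomial naive Bayes; alpha is the Laplace smoothing coefficient
mlt = MultinomialNB(alpha=1.0)
mlt.fit(x_train, y_train)
y_predict = mlt.predict(x_test)
print("accuracy:", mlt.score(x_test, y_test))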
5. Points to note
The accuracy of the naive Bayes algorithm is determined by the training set and requires no parameter tuning. If the training set has large errors, the result will inevitably be poor, because the calculation procedure is fixed and there are no hyperparameters to adjust.
The drawback of naive Bayes: it assumes that the words in a document are independent of one another, which is often not the case. Moreover, the word statistics gathered from the training set directly affect the results: the better the training set, the better the result; the worse the training set, the worse the result.
Fourth, evaluation of classification models 1. Confusion matrix
There are several evaluation criteria. One of them is accuracy: the predicted target values are compared with the provided target values to compute the proportion that match.
There are also more general and more informative criteria, namely precision and recall, both of which are computed from the confusion matrix.
In general, we only focus on the recall rate.
F1 score:
Using precision and recall, the F1-score can be calculated as F1 = 2 * precision * recall / (precision + recall); it reflects the robustness of the model.
2. Model evaluation API
sklearn.metrics.classification_report
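Continuing the 20newsgroups example above, a minimal sketch of how the report is printed (y_test, y_predict and news are assumed from the earlier code):

from sklearn.metrics import classification_report

# Precision, recall and F1-score for each category on the test set
print(classification_report(y_test, y_predict, target_names=news.target_names))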
3. Model selection and tuning ① cross-validation
The purpose of cross-validation is to make the model evaluation more accurate and reliable. The procedure is as follows:
Divide all data into n equal parts
The first part is used as the validation set and the rest as the training set, giving the accuracy of model 1.
The second part is used as the validation set and the rest as the training set, giving the accuracy of model 2.
...
This continues until every part has served once as the validation set, giving the accuracies of n models.
By averaging all the accuracies, a more credible result is obtained.
If the data is divided into four equal parts, this is called 4-fold cross-validation.
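As an aside (not from the article), sklearn can run this procedure automatically with cross_val_score; the estimator and the data names below are assumed from the earlier kNN discussion:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 4-fold cross-validation: each fold serves once as the validation set
knn = KNeighborsClassifier()
scores = cross_val_score(knn, x_train, y_train, cv=4)  # x_train, y_train assumed already prepared
print(scores)          # accuracy of each of the 4 models
print(scores.mean())   # averaged, more credible accuracy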
② grid search
Grid search is mainly used together with cross-validation to tune parameters. For example, the k-nearest neighbors algorithm has a hyperparameter k that must be specified manually, which is tedious, so several candidate hyperparameter combinations are preset for the model, each combination is evaluated with cross-validation, and finally the best combination is used to build the model. (The k-nearest neighbors algorithm has only one hyperparameter k, so there are no real combinations; but if an algorithm has two or more hyperparameters, this amounts to an exhaustive search over the grid.)
Grid search API: sklearn.model_selection.GridSearchCV
Fifth, model tuning, taking kNN as an example
Suppose the data and features have already been processed, giving x_train, x_test, y_train, y_test, and the algorithm has been instantiated: knn = KNeighborsClassifier()
1. Construct hyperparameters
The hyperparameter used by this algorithm is named n_neighbors, so the range of candidate values is specified directly under that name. If there is a second hyperparameter, another key-value pair can be added to the dictionary.
params = {'n_neighbors': [5, 10, 15, 20, 25]}
2. Conduct the grid search
Input parameters: the algorithm (estimator), the grid of parameters, and the number of folds for cross-validation.
gc = GridSearchCV(knn, param_grid=params, cv=5)
Once this basic information is specified, the training set data can be fitted:
gc.fit(x_train, y_train)
3. Viewing the results
After the grid search, several attributes and methods are available to view the accuracy, the best model, the cross-validation results, and the result of each cross-validation run.
gc.score(x_test, y_test) returns the accuracy on the test set
gc.best_score_ returns the best cross-validation accuracy
gc.best_estimator_ returns the best estimator (with the selected hyperparameters already set)
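Putting the kNN tuning steps together, here is a minimal runnable sketch (the breast cancer dataset is used purely as placeholder data, since the article assumes x_train and the other arrays already exist):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data so the sketch runs on its own
data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25)

# Hyperparameter grid and grid search with 5-fold cross-validation
knn = KNeighborsClassifier()
params = {'n_neighbors': [5, 10, 15, 20, 25]}
gc = GridSearchCV(knn, param_grid=params, cv=5)
gc.fit(x_train, y_train)

print(gc.score(x_test, y_test))   # accuracy on the test set
print(gc.best_score_)             # best cross-validation accuracy
print(gc.best_estimator_)         # estimator with the selected hyperparameters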
This concludes the article on the naive Bayes algorithm and model selection and tuning in Python machine learning. I hope the content above is helpful and lets you learn something new. If you found the article useful, please share it so more people can see it.