Case Analysis of Scikit-learn text clustering


2025-07-06 Update. From: SLTechnology News&Howtos (Shulou.com), 06/02 report.

Many newcomers to scikit-learn are unsure how to work with text data. This case study walks through a small example in detail: it trains a Naive Bayes classifier on newsgroup posts and then predicts the category of two new documents.

# -*- coding: utf-8 -*-
"""Text classification"""
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

# Load the training split for the four chosen categories.
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print(len(twenty_train.data))
print(len(twenty_train.filenames))

# Step 1: count how often each word occurs in each document.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)
print(count_vect.vocabulary_.get('algorithm'))

# Step 2a: term frequencies only (idf weighting disabled).
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)

# Step 2b: full tf-idf weighting.
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)

# Step 3: train a multinomial Naive Bayes classifier on the tf-idf features.
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

# Step 4: predict new documents. Reuse the fitted vectorizer and transformer
# with transform(), not fit_transform(), so the features match the training data.
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

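The task is to classify documents from the 20 newsgroups dataset loaded with fetch_20newsgroups; the four selected categories give 2257 training documents.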

Count the number of times each word appears
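To make the counting step concrete, here is a minimal sketch that runs CountVectorizer on a two-sentence corpus instead of the newsgroup data. The tiny corpus and the variable names are assumptions made for illustration; with the default settings, text is lowercased and single-character tokens are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny in-memory corpus (illustrative only, not the newsgroup data).
docs = ["God is love", "OpenGL on the GPU is fast"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each word to its column index; list the words in column order.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)        # ['fast', 'god', 'gpu', 'is', 'love', 'on', 'opengl', 'the']
print(counts.toarray())  # one row per document, one column per word
```

Each row is a document and each column a word, so the word "is" gets a count of 1 in both rows.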

Use tf-idf to weight the word counts. tf (term frequency) is the number of times a word appears in a document divided by the total number of words in that document. idf (inverse document frequency) is the logarithm of the total number of documents divided by the number of documents containing the word. The product tf * idf is the value used here: the higher it is, the more important, or more relevant, the word is to that document.
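The definition above can be worked through by hand. The following is a plain-Python sketch of that textbook formula; the function name tf_idf and the three-sentence corpus are hypothetical. Note that scikit-learn's TfidfTransformer, with its default settings, uses a smoothed idf and L2-normalizes each document vector, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf-idf} dict per document, using the textbook definition."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # tf = count / words in document, idf = log(N / df)
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

docs = ["god is love", "the gpu is fast", "love is patient"]
scores = tf_idf(docs)
for word, value in sorted(scores[0].items()):
    print(word, round(value, 4))
```

In the first document, "is" occurs in every document and therefore scores 0, "love" occurs in two of three documents, and "god" occurs only here, so it receives the highest weight.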

The steps in practice:

The number of occurrences of each word is calculated first.

Then the tf-idf value is calculated.

Then the tf-idf features are fed to the model for training.

Finally, the categories of two new documents are predicted.
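The four steps can also be chained so that new documents automatically pass through the same fitted vectorizer and transformer. The sketch below uses scikit-learn's Pipeline on a four-sentence toy corpus; the Pipeline, the step names, and the training sentences are assumptions added for illustration and are not part of the original listing.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (illustrative only, far smaller than the newsgroup set).
train_docs = [
    "God is love and faith",
    "The church teaches love and prayer",
    "The GPU renders graphics fast",
    "OpenGL draws graphics on the GPU",
]
train_labels = [
    "soc.religion.christian",
    "soc.religion.christian",
    "comp.graphics",
    "comp.graphics",
]

# Count words, weight with tf-idf, then train Naive Bayes, in one object.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(train_docs, train_labels)

# predict() only transforms the new documents; nothing is refitted.
docs_new = ["God is love", "OpenGL on the GPU is fast"]
predicted = text_clf.predict(docs_new)
for doc, category in zip(docs_new, predicted):
    print("%r => %s" % (doc, category))
```

Because the pipeline holds the fitted vocabulary and idf weights, there is no risk of accidentally calling fit_transform on the new documents.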

Results:

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
