Scikit-learn text classification example analysis: many newcomers are not very clear about this topic. To help solve the problem, this article walks through a complete example in detail; readers who need it can follow along, and hopefully you will gain something.
# -*- coding: utf-8 -*-
"""Text classification"""
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)
print(len(twenty_train.data))        # 2257 training documents
print(len(twenty_train.filenames))

# Count how many times each word appears in each document
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)
print(count_vect.vocabulary_.get('algorithm'))

# Term frequencies only (no idf), for comparison
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)

# Full tf-idf weighting
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)

# Train a multinomial naive Bayes classifier on the tf-idf features
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

# Predict the categories of two new documents; note that new documents are
# only transform()-ed with the vectorizer and transformer fitted above
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
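The same three stages (counting, tf-idf weighting, and naive Bayes) can also be chained with scikit-learn's Pipeline, which avoids the fit/transform bookkeeping done by hand above. This is a minimal sketch, not part of the original script:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

# Each step's output feeds the next: raw text -> counts -> tf-idf -> classifier
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(twenty_train.data, twenty_train.target)

docs_new = ['God is love', 'OpenGL on the GPU is fast']
for doc, category in zip(docs_new, text_clf.predict(docs_new)):
    print('%r => %s' % (doc, twenty_train.target_names[category]))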
The example classifies the 2,257 training documents returned by fetch_20newsgroups for the four chosen categories.
First it counts the number of times each word appears in each document.
It then weights these counts with tf-idf: tf (term frequency) is the number of times a word appears in a document divided by the total number of words in that document; idf (inverse document frequency) is the logarithm of the total number of documents divided by the number of documents containing the word. The product tf * idf is the feature value used here; the higher it is, the more important, or more distinctive, the word is for that document.
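To make the formula concrete, here is a toy computation of textbook tf-idf by hand. Note that scikit-learn's TfidfTransformer uses a smoothed variant by default (idf = ln((1 + N) / (1 + df)) + 1, followed by L2 normalization), so its numbers will differ from this sketch:

import math

docs = [['god', 'is', 'love'],
        ['opengl', 'on', 'the', 'gpu', 'is', 'fast'],
        ['god', 'is', 'great']]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)            # occurrences / doc length
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    idf = math.log(len(corpus) / df)           # log(N / df)
    return tf * idf

# 'is' appears in every document, so its idf (and hence tf-idf) is 0
print(tf_idf('is', docs[0], docs))    # 0.0
# 'love' appears in only one document, so it scores higher
print(tf_idf('love', docs[0], docs))  # ~0.366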
The concrete steps in the example:
The number of occurrences of each word is counted first.
Then the tf-idf value of each word is computed.
Then the tf-idf features are fed into the model for training.
Finally, the categories of two new documents are predicted (a test-set evaluation sketch follows below).
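As a sanity check beyond two hand-written sentences, the trained model can also be scored on the held-out test split. A minimal sketch, assuming the categories list and the fitted count_vect, tfidf_transformer, and clf objects from the script above:

import numpy as np
from sklearn.datasets import fetch_20newsgroups

# Load the matching test split and apply the *already fitted* transformers
# (transform only; refitting on test data would change the vocabulary)
twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
X_test_counts = count_vect.transform(twenty_test.data)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_test_tfidf)
print('accuracy: %.3f' % np.mean(predicted == twenty_test.target))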
Results:
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
Was the above content helpful to you? If you would like to learn more or read more related articles, please follow the industry information channel. Thank you for your support.