How to install and use NLTK

2025-03-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Many readers are unfamiliar with how to install and use NLTK, so this article summarizes the topic in detail, with clear steps and practical reference value. I hope you get something out of it. Let's take a look.

Installation

First, you can install NLTK with Anaconda:

```shell
conda install nltk
```

Second, you can install NLTK with pip from inside a Jupyter Notebook cell:

```shell
!pip install --upgrade nltk
```

If the following Python code runs without errors, the installation is successful:

```python
import nltk
```

NLTK comes with a large amount of downloadable data (corpora, grammars, models, and so on), so simply run the following Python command and an interactive download window will appear:

```python
nltk.download()
```

For this module, you also need to install the "stopwords" corpus. After downloading it, create an environment variable called NLTK_DATA containing the path of the download directory (this is not needed if you install the data centrally; see the documentation for a complete guide to installing the data).
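If you would rather fetch just what this tutorial needs instead of using the interactive window, you can download packages by name. This is a setup sketch; the "punkt" models are included here on the assumption that you will use nltk.word_tokenize, which depends on them:

```python
import nltk

# Download only the pieces used in this tutorial instead of opening the GUI.
nltk.download("stopwords")  # stop word lists, including English
nltk.download("punkt")      # tokenizer models used by nltk.word_tokenize
```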

Text categorization

Categorizing text means assigning a label to it. We can classify text in a variety of ways, such as sentiment analysis (positive/negative/neutral), spam classification (spam/not spam), by document topic, and so on.

In this module, we will work through a text classification example using the Large Movie Review Dataset, which provides 25,000 movie reviews (positive and negative) for training and the same number for testing.

NLTK provides a naive Bayes classifier to do the machine-learning work. Our job is mainly to write a function that extracts "features" from the text; the classifier uses these features to make its decisions.
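To see the idea, here is a stripped-down, hypothetical version of a naive Bayes classifier over Boolean word features. It is only a sketch of the principle; the real nltk.NaiveBayesClassifier does proper smoothing and feature handling:

```python
import math
from collections import defaultdict

def train_nb(examples):
    """examples: list of (feature_dict, label) pairs with Boolean feature values."""
    label_counts = defaultdict(int)                      # how often each label occurs
    feat_counts = defaultdict(lambda: defaultdict(int))  # per-label feature counts
    for feats, label in examples:
        label_counts[label] += 1
        for f, v in feats.items():
            if v:
                feat_counts[label][f] += 1
    return label_counts, feat_counts

def classify_nb(model, feats):
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior for this label
        for f, v in feats.items():
            if v:  # add-one smoothed log likelihood of this feature under the label
                score += math.log((feat_counts[label][f] + 1) / (n + 2))
        if score > best_score:
            best, best_score = label, score
    return best

model = train_nb([({"good": True}, "pos"), ({"bad": True}, "neg")])
print(classify_nb(model, {"good": True}))  # → pos
```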

Our function, called a feature extractor, takes a string (the text) as an argument and returns a dictionary called a feature set, which maps feature names to their values.

For movie reviews, our features will be the top N words (excluding stop words). The feature extractor therefore returns a feature set with those N words as keys and Boolean values indicating their presence or absence as values.
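As a toy illustration (with made-up words, not the article's real extractor), a feature set is just such a dictionary:

```python
# Hypothetical mini-example: a feature set maps each tracked word to
# True/False depending on whether it appears in the text.
demo_top_words = ["great", "boring", "plot"]

def toy_feature_extractor(text):
    words = set(text.lower().split())
    return {w: w in words for w in demo_top_words}

print(toy_feature_extractor("A great plot"))  # → {'great': True, 'boring': False, 'plot': True}
```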

The first step is to go through the reviews, store all the words (except stop words), and find the most common words.

First, a helper function takes a text and outputs its non-stop words:

```python
import nltk
import nltk.sentiment.util
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))

def extract_words_from_text(text):
    tokens = nltk.word_tokenize(text)
    tokens_neg_marked = nltk.sentiment.util.mark_negation(tokens)
    return [t for t in tokens_neg_marked
            if t.replace("_NEG", "").isalnum()
            and t.replace("_NEG", "") not in stop]
```

word_tokenize splits the text into a list of tokens (still keeping punctuation).

mark_negation marks the tokens that follow a negation with _NEG. So, for example, "I did not enjoy this." becomes the following after tokenization and negation marking:

["I", "did", "not", "enjoy_NEG", "this_NEG", "."]

The last line of the function removes all stop words (including the negated ones) and punctuation. Plenty of less useful words still get through, but this filtering is enough for our demonstration.
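For intuition, the core of mark_negation's behavior can be sketched in plain Python. This is a simplification under the assumption of a small set of negation cues; the real function recognizes more patterns:

```python
NEGATION_CUES = {"not", "no", "never", "n't"}
CLAUSE_ENDERS = {".", ",", "!", "?", ";"}

def mark_negation_sketch(tokens):
    """Append _NEG to tokens between a negation word and the next punctuation."""
    out, negated = [], False
    for t in tokens:
        if t in NEGATION_CUES:
            out.append(t)
            negated = True
        elif t in CLAUSE_ENDERS:
            out.append(t)
            negated = False
        else:
            out.append(t + "_NEG" if negated else t)
    return out

print(mark_negation_sketch(["I", "did", "not", "enjoy", "this", "."]))
# → ['I', 'did', 'not', 'enjoy_NEG', 'this_NEG', '.']
```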

Next, we build a list of all the words read from the review files. We keep separate lists of positive and negative words, to ensure balance when we pick the top words. (When I tested with the word lists not separated, most positive reviews got classified as negative.) At the same time, we also create lists of all positive reviews and all negative reviews.

```python
import os

positive_files = os.listdir("aclImdb/train/pos")
negative_files = os.listdir("aclImdb/train/neg")
positive_words = []
negative_words = []
positive_reviews = []
negative_reviews = []

for pos_file in positive_files:
    with open("aclImdb/train/pos/" + pos_file, "r") as f:
        txt = f.read().replace("<br />", " ")
        positive_reviews.append(txt)
        positive_words.extend(extract_words_from_text(txt))

for neg_file in negative_files:
    with open("aclImdb/train/neg/" + neg_file, "r") as f:
        txt = f.read().replace("<br />", " ")
        negative_reviews.append(txt)
        negative_words.extend(extract_words_from_text(txt))
```

It may take a while to run this code because there are many files.

Then, we keep only the top N words (2,000 in this example) from each of the positive and negative word lists and combine them.

```python
N = 2000

freq_pos = nltk.FreqDist(positive_words)
top_word_counts_pos = sorted(freq_pos.items(), key=lambda kv: kv[1], reverse=True)[:N]
top_words_pos = [twc[0] for twc in top_word_counts_pos]

freq_neg = nltk.FreqDist(negative_words)
top_word_counts_neg = sorted(freq_neg.items(), key=lambda kv: kv[1], reverse=True)[:N]
top_words_neg = [twc[0] for twc in top_word_counts_neg]

top_words = list(set(top_words_pos + top_words_neg))
```
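Since nltk.FreqDist is a subclass of collections.Counter, the sorting above is equivalent to Counter.most_common(N); a quick stdlib-only check with toy data:

```python
from collections import Counter

# Toy word list (made up for illustration).
words = ["good", "good", "great", "plot", "good", "great"]
top2 = [w for w, _ in Counter(words).most_common(2)]
print(top2)  # → ['good', 'great']
```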

Now we can write the feature extractor. As mentioned earlier, it should return a dictionary with each top word as a key and True or False as its value, depending on whether the word is present in the text.

```python
def extract_features(text):
    text_words = extract_words_from_text(text)
    return {w: w in text_words for w in top_words}
```

Then we create a training set to feed to the naive Bayes classifier. The training set should be a list of tuples, where each tuple's first element is the feature set and second element is the label.

```python
training = [(extract_features(review), "pos") for review in positive_reviews] + \
           [(extract_features(review), "neg") for review in negative_reviews]
```

The lines above take up a lot of RAM and are slow, so you may want to use a subset of the reviews by slicing the review lists.
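One way to do that while keeping the classes balanced (a hypothetical helper, not from the article):

```python
# Hypothetical helper: keep the first k examples of each class so the
# training data stays balanced while using far less RAM and time.
def balanced_subset(pos_items, neg_items, k):
    return pos_items[:k] + neg_items[:k]

print(balanced_subset(["p1", "p2", "p3"], ["n1", "n2", "n3"], 2))
# → ['p1', 'p2', 'n1', 'n2']
```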

Training the classifier is simple:

```python
classifier = nltk.NaiveBayesClassifier.train(training)
```

To classify a new review, call the classify method on a new feature set:

```python
print(classifier.classify(extract_features("Your review goes here.")))
```

If you want to see the probability of each tag, you can use prob_classify instead:

```python
def get_prob_dist(text):
    prob_dist = classifier.prob_classify(extract_features(text))
    return {"pos": prob_dist.prob("pos"), "neg": prob_dist.prob("neg")}

print(get_prob_dist("Your review goes here."))
```

The classifier has a built-in method for measuring the model's accuracy on a test set, which has the same shape as the training set. The movie review dataset has a separate test directory containing reviews that can be used for this purpose.

```python
test_positive = os.listdir("aclImdb/test/pos")[:2500]
test_negative = os.listdir("aclImdb/test/neg")[:2500]
test = []

for pos_file in test_positive:
    with open("aclImdb/test/pos/" + pos_file, "r") as f:
        txt = f.read().replace("<br />", " ")
        test.append((extract_features(txt), "pos"))

for neg_file in test_negative:
    with open("aclImdb/test/neg/" + neg_file, "r") as f:
        txt = f.read().replace("<br />", " ")
        test.append((extract_features(txt), "neg"))

print(nltk.classify.accuracy(classifier, test))
```

With N = 2000 and 5,000 positive plus 5,000 negative reviews in the training set, I got about 85% accuracy with this code.
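For reference, nltk.classify.accuracy simply computes the fraction of test examples whose predicted label matches the gold label, which can be sketched as:

```python
def accuracy_sketch(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

print(accuracy_sketch(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"]))  # → 0.75
```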

That covers "how to install and use NLTK". I hope the content shared here has been helpful. If you want to learn more, follow the industry information channel.
