How to Use Python Natural Language Processing (NLP) to Create a Summary
This article focuses on how to use Python natural language processing (NLP) to create a text summary. The method introduced here is simple, fast, and practical; let's walk through it step by step.
Which summarization method should be used?
I use extractive summarization because it can be applied to many documents without having to perform a large amount of (daunting) machine learning model training.
In addition, extractive summarization tends to give better results than abstractive summarization, because abstractive summarization must generate new sentences from the original text, which is harder than extracting the important sentences with a data-driven method.
How to create your own text summary
We will use a word histogram to rank the importance of sentences and then build a summary. The advantage of this approach is that you don't need to train a model before using it on your documents.
Text Summary Workflow
Here's the workflow we're going to follow.
Import text > Clean text and split into sentences > Remove stop words > Build word histogram > Rank sentences > Select the top N sentences for the summary
(1) Sample text
I used the text of a news article about Apple's $50 million acquisition of an AI startup to advance its apps. You can find the original news article here: https://analyticsindiamag.com/apple-acquires-ai-startup-for-50-million-to-advance-its-apps/
You can also download the text document from GitHub: https://github.com/louisteo9/personal-text-summarizer
(2) Import libraries
# Natural Language Toolkit (NLTK)
import nltk
nltk.download('stopwords')   # stop-word list used in step (5)
nltk.download('punkt')       # tokenizer models needed by sent_tokenize/word_tokenize

# regular expressions for text preprocessing
import re

# heap queue algorithm to find the top sentences
import heapq

# NumPy for numerical calculation
import numpy as np

# pandas for creating data frames
import pandas as pd

# matplotlib for plotting
from matplotlib import pyplot as plt
%matplotlib inline
(3) Import text and perform preprocessing
There are many ways to do this. The goal here is to obtain clean text that we can feed into our model.
# load the text file
with open('Apple_Acquires_AI_Startup.txt', 'r') as f:
    file_data = f.read()
Here, we use regular expressions for text preprocessing. We will:
(a) replace reference numbers such as [1], [10], [20] with a space (if any exist);
(b) replace one or more spaces with a single space.
text = file_data

# replace reference numbers, e.g. [1], [10], [20], with a space (if any)
text = re.sub(r'\[[0-9]*\]', ' ', text)

# replace one or more spaces with a single space
text = re.sub(r'\s+', ' ', text)
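As a quick sanity check (my own example, not from the original article), here is what those two substitutions do to a made-up string:

sample = "Apple confirmed the deal [1] and the price [10] last week."
sample = re.sub(r'\[[0-9]*\]', ' ', sample)  # reference numbers become spaces
sample = re.sub(r'\s+', ' ', sample)         # runs of spaces collapse to one
print(sample)  # Apple confirmed the deal and the price last week.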
Then we form a clean, lowercase version of the text (without special characters, digits, and extra spaces) and split it into individual words, both for the word-score calculation and for building the word histogram.
The reason for forming a clean text is so that the algorithm does not treat "Understand" and "understand" as two different words.
# convert all uppercase characters to lowercase characters
clean_text = text.lower()

# replace characters other than [a-zA-Z0-9] with a space
clean_text = re.sub(r'\W', ' ', clean_text)

# replace digits with a space
clean_text = re.sub(r'\d', ' ', clean_text)

# replace one or more spaces with a single space
clean_text = re.sub(r'\s+', ' ', clean_text)
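To see the effect of the full cleaning pass, here is a small illustration on a made-up sentence (again my own example); note how punctuation and digits disappear, and the possessive "s" is left behind as a stray token:

demo = "Apple's AI team grew by 200 people in 2019!"
demo = demo.lower()
demo = re.sub(r'\W', ' ', demo)   # punctuation and symbols -> spaces
demo = re.sub(r'\d', ' ', demo)   # digits -> spaces
demo = re.sub(r'\s+', ' ', demo)  # collapse whitespace
print(demo)  # "apple s ai team grew by people in "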
(4) Divide the text into sentences
We use the NLTK sent_tokenize method to split the text into sentences. We will assess the importance of each sentence and then decide whether each sentence should be included in the summary.
sentences = nltk.sent_tokenize(text)
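sent_tokenize relies on NLTK's punkt sentence tokenizer, which is why the import step above downloads it. A tiny self-contained illustration (my own example):

demo = "Apple is buying a startup. The deal is worth $50 million."
print(nltk.sent_tokenize(demo))
# ['Apple is buying a startup.', 'The deal is worth $50 million.']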
(5) Delete stop words
Stop words are English words that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. We already downloaded the English stop-word list through NLTK in the import step.
Here, we load the list of stop words and store it in the stop_words variable.
# get the list of stop words
stop_words = nltk.corpus.stopwords.words('english')
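If you are curious what the list contains, you can inspect it directly; the exact contents and count depend on your NLTK version:

print(len(stop_words))   # 179 in recent NLTK releases (version-dependent)
print(stop_words[:8])    # ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']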
(6) Build the word histogram
Let's assess the importance of each word according to the number of times it appears in the entire text.
We will (1) split the clean text into words, (2) remove the stop words, and then (3) count how often each word appears in the text.
# create an empty dictionary to hold the word counts
word_count = {}

# loop through the tokenized words, remove stop words,
# and save the word counts to the dictionary
for word in nltk.word_tokenize(clean_text):
    # remove stop words
    if word not in stop_words:
        # save the word count to the dictionary
        if word not in word_count.keys():
            word_count[word] = 1
        else:
            word_count[word] += 1
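A common refinement, which this article does not use (the rest of the walkthrough keeps the raw counts), is to normalize each count by the largest count so every word gets a weight between 0 and 1; this keeps sentence scores comparable across documents of different lengths. A minimal sketch:

# optional: convert raw counts into weighted frequencies in [0, 1]
max_count = max(word_count.values())
weighted_freq = {word: count / max_count for word, count in word_count.items()}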
Let's draw the word histogram and see the result.
plt.figure(figsize=(16, 10))
plt.xticks(rotation=90)
plt.bar(word_count.keys(), word_count.values())
plt.show()
The full histogram is hard to read, so let's show only the top 20 words as a horizontal bar chart, using the helper function below.
# helper function to plot the top words
def plot_top_words(word_count_dict, show_top_n=20):
    word_count_table = pd.DataFrame.from_dict(word_count_dict, orient='index').rename(columns={0: 'score'})
    word_count_table.sort_values(by='score').tail(show_top_n).plot(kind='barh', figsize=(10, 10))
    plt.show()
Let's show the top 20 words.
plot_top_words(word_count, 20)
From the figure above, we can see that the words "ai" and "apple" appear at the top. This makes sense, because this article is about Apple's acquisition of an artificial intelligence start-up.
(7) Rank sentences by score
Now, we will rank the importance of each sentence according to its score. We will:
skip sentences with more than 30 words, recognizing that long sentences are not always meaningful;
then add up the score of each word that makes up the sentence to form the sentence score.
Sentences with high scores will be at the top of the list, and those top sentences will form our summary.
Note: in my experience, any word-count cutoff between 25 and 30 gives you a good summary.
# create an empty dictionary to store the sentence scores
sentence_score = {}

# loop through the tokenized sentences, take only sentences with fewer than 30 words,
# then add the word scores to form the sentence score
for sentence in sentences:
    # check whether the words in the sentence are in the word_count dictionary
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count.keys():
            # only accept sentences with fewer than 30 words
            if len(sentence.split(' ')) < 30:
                # add the word score to the sentence score
                if sentence not in sentence_score.keys():
                    sentence_score[sentence] = word_count[word]
                else:
                    sentence_score[sentence] += word_count[word]

We convert the sentence-score dictionary into a data frame and display sentence_score.
Note: a dictionary does not let us sort sentences by score, so the data stored in the dictionary needs to be converted into a DataFrame.

df_sentence_score = pd.DataFrame.from_dict(sentence_score, orient='index').rename(columns={0: 'score'})
df_sentence_score.sort_values(by='score', ascending=False)

(8) Choose the top sentences as the summary
We use the heap queue algorithm to select the top three sentences and store them in the best_sentences variable.
Usually 3 to 5 sentences are enough; depending on the length of the document, you can freely change the number of top sentences displayed.
In this case, I chose 3 because our text is relatively short.
# select the best three sentences for the summary
best_sentences = heapq.nlargest(3, sentence_score, key=sentence_score.get)
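For reference, heapq.nlargest iterates over the dictionary's keys (the sentences) and ranks them by key=sentence_score.get, i.e. by their scores. A tiny standalone illustration with made-up scores:

import heapq

scores = {'Sentence A.': 12, 'Sentence B.': 7, 'Sentence C.': 21}
print(heapq.nlargest(2, scores, key=scores.get))
# ['Sentence C.', 'Sentence A.']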
Let's use print and a for loop to display the summary text.
print('SUMMARY')
print('------------------------')

# display the top sentences in the order in which they appear in the original text
for sentence in sentences:
    if sentence in best_sentences:
        print(sentence)

At this point, I believe you have a better understanding of how to use Python natural language processing (NLP) to create a summary. You might as well try it out in practice!
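As a closing aside, here is one way to fold all of the steps above into a single reusable function. This is my own consolidation of the code in this article, not part of the original, so treat it as a sketch (it assumes the NLTK stopwords and punkt data are already downloaded):

import heapq
import re

import nltk


def summarize(text, top_n=3, max_words=30):
    """Extractive summary: return the top_n highest-scoring sentences in original order."""
    stop_words = set(nltk.corpus.stopwords.words('english'))

    # clean the text for scoring
    clean_text = text.lower()
    clean_text = re.sub(r'\W', ' ', clean_text)
    clean_text = re.sub(r'\d', ' ', clean_text)
    clean_text = re.sub(r'\s+', ' ', clean_text)

    # build the word histogram, skipping stop words
    word_count = {}
    for word in nltk.word_tokenize(clean_text):
        if word not in stop_words:
            word_count[word] = word_count.get(word, 0) + 1

    # score sentences shorter than max_words
    sentences = nltk.sent_tokenize(text)
    sentence_score = {}
    for sentence in sentences:
        if len(sentence.split(' ')) < max_words:
            for word in nltk.word_tokenize(sentence.lower()):
                if word in word_count:
                    sentence_score[sentence] = sentence_score.get(sentence, 0) + word_count[word]

    # pick the top sentences and return them in their original order
    best = set(heapq.nlargest(top_n, sentence_score, key=sentence_score.get))
    return ' '.join(s for s in sentences if s in best)


# usage:
# with open('Apple_Acquires_AI_Startup.txt', 'r') as f:
#     print(summarize(f.read()))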