How to combine text abstraction and keyword extraction in Serverless? 04/16 Update SLTechnology News&Howtos

How to combine text abstraction and keyword extraction in Serverless?

2025-04-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)06/01 Report--

This article is about how to combine text abstraction and keyword extraction in Serverless. The editor thinks it is very practical, so I share it with you to learn. I hope you can get something after reading this article. Let's take a look at it with the editor.

Automatic abstract extraction and keyword extraction of text belong to the category of natural language processing. One of the advantages of extracting abstracts is that readers can judge whether the article is meaningful or valuable to them and whether they need to read it in more detail through the least amount of information. the advantage of extracting keywords is that it can make a connection between articles and articles, and at the same time, it also allows readers to quickly locate the content of articles related to that keyword.

Text abstracts and keyword extraction can be combined with traditional CMS. Through the transformation of articles / news and other publishing functions, keywords and abstracts are extracted synchronously and put into HTML pages as Description and Keyworks. This is beneficial to search engines to a certain extent, and belongs to the category of SEO optimization.

Keyword extraction

There are many ways to extract keywords, but the most common one should be tf-idf.

The method of keyword extraction based on tf-idf through jieba:

Jieba.analyse.extract_tags (text, topK=5, withWeight=False, allowPOS= ('nasty,' vn','v') text summary

There are also many methods of text summary, if divided in a broad sense, including extraction and generation. The extraction formula is to find out the key sentences and then assemble the summary through TextRank and other algorithms in the article, this method is relatively simple, but it is difficult to extract the real semantics; another method is the generation, through in-depth learning and other methods, extract the text semantics and generate the abstract.

If you simply understand the abstract generated by the extraction method, all the sentences come from the original text, while the generative method is generated independently.

In order to simplify the difficulty, this paper will use extraction to achieve the text summary function, through the SnowNLP third-party library to achieve the text summary function based on TextRank. We use part of the contents of "20,000 miles under the Sea" as the original text to generate a summary:

Original text:

These events occurred when I had just returned from a scientific research project in a barren area of Nebraska. I was a visiting professor at the Museum of Natural History in Paris, and the French government sent me to take part in this expedition. I spent half a year in Nebraska, collected a lot of valuable information, returned with a full load, and arrived in New York at the end of March. I decided to leave for France at the beginning of May. So I took advantage of this waiting time to sort out the minerals, animals and plants I had collected, but just then something happened to the Scotia. I knew the street talk at that time like the back of my hand, and besides, how could I turn a deaf ear and be indifferent? I read and read all kinds of newspapers and periodicals in the United States and Europe, but I couldn't get to the bottom of the truth. It is mysterious and puzzling. I thought about it, swinging between the two extremes, never forming an opinion. There must be something in it, there is no doubt about it, and if anyone doubts it, ask them to touch the wound of the Scorch. When I arrived in New York, the question was in full swing. Some ignorant people have put forward the idea of floating islands and unpredictable reefs, but all these hypotheses have been overturned. Obviously, unless there is a machine in the belly of the reef, how can it move so quickly? By the same token, it is not true to say that it is a floating hull or a pile of fragments of a large ship, because it is still moving too fast. In that case, there are only two explanations for the problem. People have their own opinions, and they are naturally divided into two factions with very different views: one says it is a powerful monster, and the other says it is a very powerful "diving ship." Oh, the last assumption is certainly acceptable, but after the survey in Europe and the United States, it is difficult to justify itself. What ordinary person would have such a powerful machine? It's impossible. Where and when did he ask who to create such a behemoth, and how could he keep the news of the construction under wraps? It seems that only the government can have such a destructive machine, and in this era of disaster, people do everything possible to increase the power of weapons of war. One country is trying to produce such appalling weapons without telling other countries. After the Shasper rifle, there was a mine, followed by an underwater hammer, and then the magic path climbed, and the situation became more and more serious. At least, that's what I think.

The algorithm provided by SnowNLP:

From snownlp import SnowNLP text = "the original text above, omitted here" s = SnowNLP (text) print ("." .join (s.summary (5)

Output result:

Naturally, it is divided into two groups with very different views: one says that this is a monster of great power. This assumption is also untenable. When I got to New York. It is said that it is a floating hull or a pile of ship fragments. The other side said it was a very powerful "diving ship".

Initially, the effect is not very good. Next, we calculate the sentence weight by ourselves to achieve a simple summary function, which requires jieba:

Import reimport jieba.analyseimport jieba.posseg class TextSummary: def _ _ init__ (self, text): self.text = text def splitSentence (self): sectionNum = 0 self.sentences = [] for eveSection in self.text.split ("\ n"): if eveSection: sentenceNum = 0 for eveSentence in re.split ("! |. |?" EveSection): if eveSentence: mark = [] if sectionNum = = 0: mark.append ("FIRSTSECTION") if sentenceNum = 0: mark.append ("FIRSTSENTENCE") Self.sentences.append ({"text": eveSentence "pos": {"x": sectionNum, "y": sentenceNum "mark": mark}}) sentenceNum = sentenceNum + 1 sectionNum = sectionNum + 1 self.sentences [- 1] ["pos"] ["mark"] .append ("LASTSENTENCE") for i in range (0 Len (self.sentences): if self.sentences [I] ["pos"] ["x"] = = self.sentences [- 1] ["pos"] ["x"]: self.sentences [I] ["pos"] ["mark"] .append ("LASTSECTION") def getKeywords (self): self.keywords = jieba.analyse.extract_tags (self.text, topK=20, withWeight=False, allowPOS= ('nasty,' vn') 'v') def sentenceWeight (self): # calculate the position weight of a sentence for sentence in self.sentences: mark = sentence ["pos"] ["mark"] weightPos = 0 if "FIRSTSECTION" in mark: weightPos = weightPos + 2 if "FIRSTSENTENCE" in mark: weightPos = weightPos + 2 If "LASTSENTENCE" in mark: weightPos = weightPos + 1 if "LASTSECTION" in mark: weightPos = weightPos + 1 sentence ["weightPos"] = weightPos # calculate the weight of the clue word index = ["anyway" "all in all"] for sentence in self.sentences: sentence ["weightCueWords"] = 0 sentence ["weightKeywords"] = 0 for i in index: for sentence in self.sentences: if sentence ["text"] .find (I) > = 0: sentence ["weightCueWords"] = 1 for keyword in self.keywords For sentence in self.sentences: if sentence ["text"] .find (keyword) > = 0: sentence ["weightKeywords"] = sentence ["weightKeywords"] + 1 for sentence in self.sentences: sentence ["weight"] = sentence ["weightPos"] + 2 * sentence ["weightCueWords"] + sentence ["weightKeywords"] def getSummary (self Ratio=0.1): self.keywords = list () self.sentences = list () self.summary = list () # call method Calculate keywords, clauses and weights self.getKeywords () self.splitSentence () self.sentenceWeight () # sort the weight values of sentences self.sentences = sorted (self.sentences, key=lambda k: K ['weight'], reverse=True) # according to the ranking results Take the sentences that rank at the top of the ratio% as the abstract for i in range (len (self.sentences)): if I

< ratio * len(self.sentences): sentence = self.sentences[i] self.summary.append(sentence["text"]) return self.summary 这段代码主要是通过tf-idf实现关键词提取，然后通过关键词提取对句子尽心权重赋予，最后获得到整体的结果，运行： testSummary = TextSummary(text)print("。".join(testSummary.getSummary())) 可以得到结果： Building prefix dict from the default dictionary ...Loading model from cache /var/folders/yb/wvy_7wm91mzd7cjg4444gvdjsglgs8/T/jieba.cacheLoading model cost 0.721 seconds.Prefix dict has been built successfully.看来，只有政府才有可能拥有这种破坏性的机器，在这个灾难深重的时代，人们千方百计要增强战争武器威力，那就有这种可能，一个国家瞒着其他国家在试制这类骇人听闻的武器。于是，我就抓紧这段候船逗留时间，把收集到的矿物和动植物标本进行分类整理，可就在这时，斯科舍号出事了。同样的道理，说它是一块浮动的船体或是一堆大船残片，这种假设也不能成立，理由仍然是移动速度太快我们可以看到，整体效果要比刚才的好一些。发布 API 通过 Serverless 架构，将上面代码进行整理，并发布。代码整理结果： import re, jsonimport jieba.analyseimport jieba.posseg class NLPAttr: def __init__(self, text): self.text = text def splitSentence(self): sectionNum = 0 self.sentences = [] for eveSection in self.text.split("\n"): if eveSection: sentenceNum = 0 for eveSentence in re.split("!|。|？", eveSection): if eveSentence: mark = [] if sectionNum == 0: mark.append("FIRSTSECTION") if sentenceNum == 0: mark.append("FIRSTSENTENCE") self.sentences.append({ "text": eveSentence, "pos": { "x": sectionNum, "y": sentenceNum, "mark": mark } }) sentenceNum = sentenceNum + 1 sectionNum = sectionNum + 1 self.sentences[-1]["pos"]["mark"].append("LASTSENTENCE") for i in range(0, len(self.sentences)): if self.sentences[i]["pos"]["x"] == self.sentences[-1]["pos"]["x"]: self.sentences[i]["pos"]["mark"].append("LASTSECTION") def getKeywords(self): self.keywords = jieba.analyse.extract_tags(self.text, topK=20, withWeight=False, allowPOS=('n', 'vn', 'v')) return self.keywords def sentenceWeight(self): # 计算句子的位置权重 for sentence in self.sentences: mark = sentence["pos"]["mark"] weightPos = 0 if "FIRSTSECTION" in mark: weightPos = weightPos + 2 if "FIRSTSENTENCE" in mark: weightPos = weightPos + 2 if "LASTSENTENCE" in mark: weightPos = weightPos + 1 if "LASTSECTION" in mark: weightPos = weightPos + 1 sentence["weightPos"] = weightPos # 计算句子的线索词权重 index = [" 总之 ", " 总而言之 "] for sentence in self.sentences: sentence["weightCueWords"] = 0 sentence["weightKeywords"] = 0 for i in index: for sentence in self.sentences: if sentence["text"].find(i) >

= 0: sentence ["weightCueWords"] = 1 for keyword in self.keywords: for sentence in self.sentences: if sentence ["text"] .find (keyword) > = 0: sentence ["weightKeywords"] = sentence ["weightKeywords"] + 1 for sentence in self.sentences: sentence ["weight"] = sentence [" WeightPos "] + 2 * sentence [" weightCueWords "] + sentence [" weightKeywords "] def getSummary (self Ratio=0.1): self.keywords = list () self.sentences = list () self.summary = list () # call method Calculate keywords, clauses and weights self.getKeywords () self.splitSentence () self.sentenceWeight () # sort the weight values of sentences self.sentences = sorted (self.sentences, key=lambda k: K ['weight'], reverse=True) # according to the ranking results Abstract for i in range (len (self.sentences)): if I < ratio * len (self.sentences): sentence = self.sentences [I] self.summary.append (sentence ["text"]) return self.summary def main_handler (event) Context): nlp = NLPAttr (json.loads (event ['body']) [' text']) return {"keywords": nlp.getKeywords (), "summary": " .join (nlp.getSummary ())}

Write the project serverless.yaml file:

NlpDemo: component: "@ serverless/tencent-scf" inputs: name: nlpDemo codeUri:. / handler: index.main_handler runtime: Python3.6 region: ap-guangzhou description: text summary / keyword function memorySize: 256 timeout: 10 events:-apigw: name: nlpDemo_apigw_service parameters: protocols:-http ServiceName: serverless description: text summary / keyword function environment: release endpoints:-path: / nlp method: ANY

Because jieba is used in the project, it is recommended to install it under the CentOS system and the corresponding Python version during installation, or you can use a dependent tool I made earlier for convenience:

Deploy via sls-debug:

Deployment is complete, and you can simply test it through PostMan:

As you can see from the picture above, we have output the target results as expected.

Relatively speaking, through the Serveless architecture to do API is very easy and convenient, can achieve API plug and pull lines, components.

We invite you to experience the most convenient way to develop and deploy Serverless. During the trial period, related products and services provide free resources and professional technical support to help your business achieve Serverless quickly and easily!

Above is how to combine text summary and keyword extraction in Serverless? The editor believes that there are some knowledge points that we may see or use in our daily work. I hope you can learn more from this article. For more details, please follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.