How to filter text-sensitive words with Serverless

2025-01-16 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

In this issue, the editor brings you an overview of how to filter text-sensitive words in Serverless. The article is rich in content and analyzed from a professional point of view. I hope you gain something from reading it.

Sensitive word filtering is a technical means that developed alongside Internet communities to prevent cybercrime and online abuse. By screening and blocking keywords that may be associated with crime or abuse, we can in many cases intervene early and nip serious offenses in the bud.

With the growing popularity of social platforms, sensitive word filtering has gradually become an important and noteworthy feature. So how can sensitive word filtering be implemented in Python under the Serverless architecture? Can we implement a sensitive word filtering API in the simplest possible way?

Understanding several methods of sensitive word filtering

The replace method

Sensitive word filtering is essentially text replacement. Taking Python as an example, the first tool that comes to mind for substitution is replace: we prepare a sensitive word list and then replace each sensitive word in turn:

def worldFilter(keywords, text):
    for eve in keywords:
        text = text.replace(eve, "**")
    return text

keywords = ("keyword 1", "keyword 2", "keyword 3")
content = "this is an example of keyword substitution, which involves keyword 1 and keyword 2, and finally keyword 3."
print(worldFilter(keywords, content))

However, a little thought reveals that this approach has serious performance problems when both the text and the sensitive word list are very large. For example, I modified the code to run a basic performance test:

import time

def worldFilter(keywords, text):
    for eve in keywords:
        text = text.replace(eve, "**")
    return text

keywords = ["keyword" + str(i) for i in range(0, 10000)]
content = "this is an example of keyword substitution, which involves keyword 1 and keyword 2, and finally keyword 3." * 1000
startTime = time.time()
worldFilter(keywords, content)
print(time.time() - startTime)

The output at this time is 0.12426114082336426, and you can see that the performance is very poor.

Regular expression method

Instead of using replace, it is faster to use a regular expression via re.sub:

import time
import re

def worldFilter(keywords, text):
    return re.sub("|".join(keywords), "**", text)

keywords = ["keyword" + str(i) for i in range(0, 10000)]
content = "this is an example of keyword substitution, which involves keyword 1 and keyword 2, and finally keyword 3." * 1000
startTime = time.time()
worldFilter(keywords, content)
print(time.time() - startTime)

We add the same performance test as above, and the output is 0.24773502349853516. From this example alone the improvement does not look significant, but in fact, as the amount of text increases, the regular expression approach pulls far ahead in performance.

Filtering sensitive words with a DFA

This method is relatively more efficient. For example, if we treat "bad person", "bad egg", and "bad child" as sensitive words (in the original Chinese each tree node is a single character), their tree relationship can be expressed as:

Expressed as a DFA dictionary, this becomes:

{'bad': {'person': {'\x00': 0}, 'egg': {'\x00': 0}, 'child': {'\x00': 0}}}
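For illustration, a nested dictionary like the one above can be built with a short helper. This is a minimal sketch, not from the original article; the words are English stand-ins for the example, and '\x00' marks the end of a complete word:

```python
def build_trie(words, delimit='\x00'):
    # build a nested-dict trie; delimit marks the end of a complete word
    root = {}
    for word in words:
        level = root
        for char in word:
            level = level.setdefault(char, {})
        level[delimit] = 0
    return root

# stand-in words sharing the prefix "bad", as in the example above
trie = build_trie(["badperson", "badegg", "badkid"])
print(trie["b"]["a"]["d"].keys())  # the three words share one "bad" branch
```

After the shared prefix "bad", the trie branches into 'p', 'e', and 'k', which is exactly what lets the DFA skip re-scanning the common prefix for every word.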

The biggest advantage of this tree representation is that it reduces the number of comparisons and improves retrieval efficiency. A basic implementation:

import time

class DFAFilter(object):
    def __init__(self):
        self.keyword_chains = {}  # keyword trie
        self.delimit = '\x00'     # end-of-word marker

    def parse(self, path):
        with open(path, encoding='utf-8') as f:
            for keyword in f:
                chars = str(keyword).strip().lower()  # normalize keywords to lowercase
                if not chars:  # skip empty lines
                    continue
                level = self.keyword_chains
                for i in range(len(chars)):
                    if chars[i] in level:
                        level = level[chars[i]]
                    else:
                        if not isinstance(level, dict):
                            break
                        for j in range(i, len(chars)):
                            level[chars[j]] = {}
                            last_level, last_char = level, chars[j]
                            level = level[chars[j]]
                        last_level[last_char] = {self.delimit: 0}
                        break
                if i == len(chars) - 1:
                    level[self.delimit] = 0

    def filter(self, message, repl="*"):
        message = message.lower()
        ret = []
        start = 0
        while start < len(message):
            level = self.keyword_chains
            step_ins = 0
            for char in message[start:]:
                if char in level:
                    step_ins += 1
                    if self.delimit not in level[char]:
                        level = level[char]
                    else:  # a complete sensitive word was matched
                        ret.append(repl * step_ins)
                        start += step_ins - 1
                        break
                else:
                    ret.append(message[start])
                    break
            else:
                ret.append(message[start])
            start += 1
        return ''.join(ret)

gfw = DFAFilter()
gfw.parse("./sensitive_words")
content = "this is an example of keyword substitution, which involves keyword 1 and keyword 2, and finally keyword 3." * 1000
startTime = time.time()
result = gfw.filter(content)
print(time.time() - startTime)

Here the dictionary file is generated with:

with open("./sensitive_words", 'w') as f:
    f.write("\n".join(["keyword" + str(i) for i in range(0, 10000)]))

Execution result:

0.06450581550598145

You can see further improvement in performance.

Filtering sensitive words with an AC automaton

Next, let's take a look at the AC automaton algorithm for filtering sensitive words:

A classic AC automaton problem: given n words and an article of m characters, find out how many of the words appear in the article.

To put it simply, an AC automaton is a dictionary tree (trie) plus the KMP idea of mismatch (fail) pointers.

Code implementation:

import time
from collections import deque

class Node(object):
    def __init__(self):
        self.next = {}
        self.fail = None
        self.isWord = False
        self.word = ""

class AcAutomation(object):
    def __init__(self):
        self.root = Node()

    # load the sensitive word library and build the automaton
    def parse(self, path):
        with open(path, encoding='utf-8') as f:
            for keyword in f:
                temp_root = self.root
                for char in str(keyword).strip():
                    if char not in temp_root.next:
                        temp_root.next[char] = Node()
                    temp_root = temp_root.next[char]
                temp_root.isWord = True
                temp_root.word = str(keyword).strip()
        self.build_fail()

    # set the mismatch (fail) pointers breadth-first, so that on a
    # mismatch the search falls back instead of restarting at the root
    def build_fail(self):
        queue = deque()
        for node in self.root.next.values():
            node.fail = self.root
            queue.append(node)
        while queue:
            current = queue.popleft()
            for char, child in current.next.items():
                fail = current.fail
                while fail is not None and char not in fail.next:
                    fail = fail.fail
                child.fail = fail.next[char] if fail is not None else self.root
                queue.append(child)

    # find sensitive words in content
    def search(self, content):
        p = self.root
        result = []
        currentposition = 0
        while currentposition < len(content):
            word = content[currentposition]
            while word not in p.next and p is not self.root:
                p = p.fail
            if word in p.next:
                p = p.next[word]
            else:
                p = self.root
            if p.isWord:
                result.append(p.word)
                p = self.root
            currentposition += 1
        return result

    # replace every matched word with asterisks of the same length
    def wordsFilter(self, text):
        for word in set(self.search(text)):
            text = text.replace(word, '*' * len(word))
        return text

acAutomation = AcAutomation()
acAutomation.parse('./sensitive_words')
startTime = time.time()
print(acAutomation.wordsFilter("this is an example of keyword substitution, which involves keyword 1 and keyword 2, and finally keyword 3." * 1000))
print(time.time() - startTime)

The word list is generated the same way:

with open("./sensitive_words", 'w') as f:
    f.write("\n".join(["keyword" + str(i) for i in range(0, 10000)]))

Using the above method, the test result is 0.017391204833984375.

Summary of filtering methods for sensitive words

We can see that among all the approaches above, the trie-based methods (DFA and AC automaton) clearly outperform replace and regular expressions; in this particular test the AC automaton came out fastest. Between the latter two, however, neither is always better: in some workloads the DFA will win, in others the AC automaton will. In production, it is reasonable to try both and choose according to your specific business needs.
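Because the ranking shifts with the size of the word list and the length of the text, it is worth benchmarking the candidates on your own data. Here is a minimal benchmarking harness for the first two approaches; the word list and corpus are stand-ins, not from the original article:

```python
import re
import time

def replace_filter(keywords, text):
    # naive approach: one str.replace pass per keyword
    for word in keywords:
        text = text.replace(word, "**")
    return text

def regex_filter(keywords, text):
    # single compiled alternation, applied in one pass
    pattern = re.compile("|".join(map(re.escape, keywords)))
    return pattern.sub("**", text)

def bench(fn, *args, repeat=3):
    # best wall-clock time over a few runs, to damp scheduling noise
    best = float("inf")
    for _ in range(repeat):
        start = time.time()
        fn(*args)
        best = min(best, time.time() - start)
    return best

keywords = ["alpha", "beta", "gamma"]          # stand-in word list
content = "alpha said beta to gamma. " * 5000  # stand-in corpus

print("replace:", bench(replace_filter, keywords, content))
print("regex:  ", bench(regex_filter, keywords, content))
```

Swapping in your real word list and a representative sample of real text gives a much more reliable basis for the choice than any single published number.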

Implement sensitive word filtering API

To deploy the code to the Serverless architecture, we can combine an API gateway with a cloud function. Taking the AC automaton version as an example, we only need to add a few lines of code. The complete code is as follows:

# -*- coding: utf-8 -*-
import json
import uuid
from collections import deque

class Node(object):
    def __init__(self):
        self.next = {}
        self.fail = None
        self.isWord = False
        self.word = ""

class AcAutomation(object):
    def __init__(self):
        self.root = Node()

    # load the sensitive word library and build the automaton
    def parse(self, path):
        with open(path, encoding='utf-8') as f:
            for keyword in f:
                temp_root = self.root
                for char in str(keyword).strip():
                    if char not in temp_root.next:
                        temp_root.next[char] = Node()
                    temp_root = temp_root.next[char]
                temp_root.isWord = True
                temp_root.word = str(keyword).strip()
        self.build_fail()

    # set the mismatch (fail) pointers breadth-first
    def build_fail(self):
        queue = deque()
        for node in self.root.next.values():
            node.fail = self.root
            queue.append(node)
        while queue:
            current = queue.popleft()
            for char, child in current.next.items():
                fail = current.fail
                while fail is not None and char not in fail.next:
                    fail = fail.fail
                child.fail = fail.next[char] if fail is not None else self.root
                queue.append(child)

    # find sensitive words in content
    def search(self, content):
        p = self.root
        result = []
        currentposition = 0
        while currentposition < len(content):
            word = content[currentposition]
            while word not in p.next and p is not self.root:
                p = p.fail
            if word in p.next:
                p = p.next[word]
            else:
                p = self.root
            if p.isWord:
                result.append(p.word)
                p = self.root
            currentposition += 1
        return result

    # replace every matched word with asterisks of the same length
    def wordsFilter(self, text):
        for word in set(self.search(text)):
            text = text.replace(word, '*' * len(word))
        return text

def response(msg, error=False):
    return_data = {
        "uuid": str(uuid.uuid1()),
        "error": error,
        "message": msg
    }
    print(return_data)
    return return_data

acAutomation = AcAutomation()
path = './sensitive_words'
acAutomation.parse(path)

def main_handler(event, context):
    try:
        sourceContent = json.loads(event["body"])["content"]
        return response({
            "sourceContent": sourceContent,
            "filtedContent": acAutomation.wordsFilter(sourceContent)
        })
    except Exception as e:
        return response(str(e), True)

Finally, to facilitate local testing, we can add:

def test():
    event = {
        "requestContext": {
            "serviceId": "service-f94sy04v",
            "path": "/test/{path}",
            "httpMethod": "POST",
            "requestId": "c6af9ac6-7b61-11e6-9a41-93e8deadbeef",
            "identity": {
                "secretId": "abdcdxxxxxxxsdfs"
            },
            "sourceIp": "14.17.22.34",
            "stage": "release"
        },
        "headers": {
            "Accept-Language": "en-US,en,cn",
            "Accept": "text/html,application/xml,application/json",
            "Host": "service-3ei3tii4-251000691.ap-guangzhou.apigateway.myqloud.com",
            "User-Agent": "User Agent String"
        },
        "body": "{\"content\": \"this is a test text, I will hehe\"}",
        "pathParameters": {"path": "value"},
        "queryStringParameters": {"foo": "bar"},
        "headerParameters": {"Refer": "10.0.2.14"},
        "stageVariables": {"stage": "release"},
        "path": "/test/value",
        "queryString": {"foo": "bar", "bob": "alice"},
        "httpMethod": "POST"
    }
    print(main_handler(event, None))

if __name__ == "__main__":
    test()

When it is finished, we can test and run it. For example, my dictionary is:

hehe
test

The result after execution:

{'uuid': '9961ae2a-5cfc-11ea-a7c2-acde48001122', 'error': False, 'message': {'sourceContent': 'this is a test text, I will hehe', 'filtedContent': 'this is a **** text, I will ****'}}

Next, we deploy the code to the cloud and create a new serverless.yaml:

sensitive_word_filtering:
  component: "@serverless/tencent-scf"
  inputs:
    name: sensitive_word_filtering
    codeUri: ./
    exclude:
      - .gitignore
      - .git/**
      - .serverless
      - .env
    handler: index.main_handler
    runtime: Python3.6
    region: ap-beijing
    description: sensitive words filtering
    memorySize: 64
    timeout: 2
    events:
      - apigw:
          name: serverless
          parameters:
            environment: release
            endpoints:
              - path: /sensitive_word_filtering
                description: sensitive words filter
                method: POST
                enableCORS: true
                param:
                  - name: content
                    position: BODY
                    required: 'FALSE'
                    type: string
                    desc: sentences to be filtered

Then deploy with sls --debug and check the deployment result.

Finally, test through Postman:

Sensitive word filtering is a very common requirement and technology at present. Through sensitive word filtering, we can reduce the occurrence of malicious or illegal speech to a certain extent. In the practice above, two points are worth noting:

Regarding how to obtain a sensitive word list: there are many on GitHub, which you can search for and download yourself. Because such lists contain many sensitive words, I cannot publish one directly here for everyone to use, so please find one on GitHub.

Regarding the usage scenarios of this API: it can be plugged into a community post system, comment system, or blog publishing system to prevent sensitive words from appearing and reduce unnecessary trouble.
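As a sketch of how such a system might call the deployed API: the helper below posts a comment to the gateway and returns the filtered text. The URL is hypothetical, and the response shape follows the handler shown earlier; the transport argument exists only so the network call can be swapped out:

```python
import json
import urllib.request

def filter_remote(api_url, text, transport=None):
    # POST {"content": ...} to the filtering API and return the filtered text.
    # transport(url, body) -> response string; injectable for testing.
    payload = json.dumps({"content": text}).encode("utf-8")
    if transport is None:
        def transport(url, body):
            req = urllib.request.Request(
                url, data=body,
                headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req) as resp:
                return resp.read().decode("utf-8")
    data = json.loads(transport(api_url, payload))
    if data.get("error"):
        raise RuntimeError(data["message"])
    return data["message"]["filtedContent"]

# hypothetical gateway URL printed by the deployment step:
# clean = filter_remote("https://service-xxx.apigw.example.com/release/sensitive_word_filtering",
#                       "some comment text")
```

A comment system would call filter_remote on each submission before storing it, falling back to rejecting or queuing the comment if the API reports an error.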

The above is how to filter text-sensitive words with Serverless. If you happen to have similar doubts, you may refer to the analysis above. If you want to learn more, you are welcome to follow the industry information channel.
