How to filter sensitive words in Python based on the DFA algorithm

2025-04-18 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 Report

This article explains how to filter sensitive words in content with Python, based on the DFA algorithm. The material is simple and easy to follow, so let's work through it step by step.

The DFA (deterministic finite automaton) algorithm builds a tree-shaped lookup structure in advance and then walks that tree as it reads the input, so matching does not have to compare the input against every sensitive word separately.

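Suppose we have a lexicon of sensitive words. In the original Chinese example every English word below corresponds to a single character, so "I love you ya" is four characters long. The words in the lexicon are:

I love you

I love him

I love her

I love you ya

I love him ya

I love her ya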

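From these words we can build a tree in which words that share a prefix share a path from the root: the empty root leads to 'I', then to 'love', which branches into 'you', 'him' and 'her', each of which may be followed by 'ya'.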
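As a rough illustration of such a tree, the minimal sketch below stores it as nested dictionaries, with an is_end flag on every node where a word ends. The helper name build_tree is hypothetical, and the sketch works on the characters of the English strings, so its branches are letters rather than the whole words of the original example.

```python
def build_tree(words):
    """Build a nested-dict tree; 'is_end' marks nodes where a word ends."""
    root = {}
    for word in words:
        node = root
        for char in word:
            # Reuse the child for this character, or create it.
            node = node.setdefault(char, {"is_end": False})
        if word:  # ignore empty strings so the root is never marked
            node["is_end"] = True
    return root


tree = build_tree(["I love you", "I love him", "I love her"])

# Walk down the shared prefix and list the branches that follow it.
node = tree
for char in "I love ":
    node = node[char]
print(sorted(key for key in node if key != "is_end"))  # ['h', 'y']
```

All three words share the path for "I love " and only split afterwards, which is what keeps the lookup cheap.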

Suppose the player enters the string "white chrysanthemum I love you ya ha ha ha", which is nine characters in the original Chinese, one per English word here.

We iterate over the player's string str with a pointer i that starts at the root node of the tree, that is, the leftmost, empty node:

str[0] = 'white': tree[i] has no child with the value 'white', so the matching condition is not met; continue traversing.

str[1] = 'chrysanthemum': again no match; continue traversing.

str[2] = 'I': tree[i] has a path to the node 'I', so the matching condition is met; i moves to the node 'I', and we continue traversing.

str[3] = 'love': tree[i] has a path to the node 'love', so the matching condition is met; i moves to 'love', and we continue traversing.

str[4] = 'you': there is a path again; i moves to 'you', and we continue traversing.

str[5] = 'ya': there is a path again; i moves to 'ya'.

At this point the pointer i has reached the end of a branch of the tree, so one sensitive word has been fully matched. We can record in two variables the index in the player's string where the match began and the index where it ended, then go over that range again and replace each character with *.

After a match, we point i back at the root node of the tree.

The player's string has not been fully traversed yet, so we continue:

str[6] = 'ha': no match; continue traversing.

str[7] = 'ha': ...

str[8] = 'ha': ...

In this way, a single pass over the player's string finds the sensitive words in it.
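The walkthrough above can be sketched in a few lines. This is a simplified variant and not the class presented later in the article: it retries from every start position, keeps the longest word found there, and replaces that range with *. The names build_tree and mask_text are hypothetical, and the sketch works on the characters of English strings.

```python
def build_tree(words):
    """Nested-dict tree; 'is_end' marks nodes where a word ends."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {"is_end": False})
        if word:
            node["is_end"] = True
    return root


def mask_text(text, root):
    """Replace the longest sensitive word found at each position with '*'."""
    out = list(text)
    start = 0
    while start < len(text):
        node, end = root, None
        pos = start
        # Follow the tree for as long as the input keeps matching.
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]
            pos += 1
            if node["is_end"]:
                end = pos  # longest complete word seen so far
        if end is None:
            start += 1  # nothing matched here, try the next position
        else:
            out[start:end] = "*" * (end - start)
            start = end  # resume from the root after the match
    return "".join(out)


tree = build_tree(["I love you", "I love you ya"])
print(mask_text("Baiju I love you ya ha ha ha", tree))
```

Keeping the longest match means "I love you ya" is masked as a whole rather than only its "I love you" prefix.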

Python implementation of the DFA algorithm

import copy


class DFA:
    """DFA algorithm.

    In a sensitive word, "*" stands for any single character.
    """

    def __init__(self, sensitive_words: list, skip_words: list):
        # The sensitive words (sensitive_words) and the meaningless words
        # (skip_words) can be kept in a database, a file or any other
        # storage medium.
        self.state_event_dict = self._generate_state_event(sensitive_words)
        self.skip_words = skip_words

    def __repr__(self):
        return '{}'.format(self.state_event_dict)

    @staticmethod
    def _generate_state_event(sensitive_words) -> dict:
        # Build the tree as nested dicts; 'is_end' marks the end of a word.
        state_event_dict = {}
        for word in sensitive_words:
            tmp_dict = state_event_dict
            length = len(word)
            for index, char in enumerate(word):
                if char not in tmp_dict:
                    next_dict = {'is_end': False}
                    tmp_dict[char] = next_dict
                    tmp_dict = next_dict
                else:
                    next_dict = tmp_dict[char]
                    tmp_dict = next_dict
                if index == length - 1:
                    tmp_dict['is_end'] = True
        return state_event_dict

    def match(self, content: str):
        match_list = []
        state_list = []       # current tree node of each candidate match
        temp_match_list = []  # start position of each candidate match

        for char_pos, char in enumerate(content):
            if char in self.skip_words:
                continue
            if char in self.state_event_dict:
                # A new candidate may start at this character.
                state_list.append(self.state_event_dict)
                temp_match_list.append({"start": char_pos, "match": ""})
            finished = []  # indexes of candidates to drop after this character
            for index, state in enumerate(state_list):
                is_match = False
                state_char = None
                if '*' in state:
                    # Some sensitive words vary in one character: "big silly X"
                    # may be "big silly B", "big silly ×", "big silly ...".
                    # Use the wildcard *; one * stands for one character.
                    state_list[index] = state['*']
                    state_char = state['*']
                    is_match = True
                if char in state:
                    state_list[index] = state[char]
                    state_char = state[char]
                    is_match = True
                if is_match:
                    if state_char["is_end"]:
                        stop = char_pos + 1
                        temp_match_list[index]['match'] = content[
                            temp_match_list[index]['start']:stop]
                        match_list.append(copy.deepcopy(temp_match_list[index]))
                        if len(state_char.keys()) == 1:
                            # Leaf node: no longer word can follow.
                            finished.append(index)
                else:
                    finished.append(index)
            # Drop finished candidates from the back so the remaining indexes
            # stay valid; popping inside the loop above would skip entries.
            for index in reversed(finished):
                state_list.pop(index)
                temp_match_list.pop(index)
        return match_list

The _generate_state_event method builds the tree of sensitive words and stores it in nested dictionaries. For the example above, the tree is generated and printed as follows:

if __name__ == '__main__':
    dfa = DFA(['I love you', 'I love him', 'I love her',
               'I love you ya', 'I love him ya', 'I love her ya',
               'I love her ah'],
              skip_words=[])  # skip_words is not configured for now
    print(dfa)

Results:

{'I': {'is_end': False, 'love': {'is_end': False, 'you': {'is_end': True, 'ya': {'is_end': True}}, 'him': {'is_end': True, 'ya': {'is_end': True}}, 'her': {'is_end': True, 'ya': {'is_end': True}, 'ah': {'is_end': True}}}}}

Then call the match method with some content to find the sensitive words in it:

if __name__ == '__main__':
    dfa = DFA(['I love you', 'I love him', 'I love her', 'I love you ya'],
              ['\n', '\r\n', '\r'])
    # print(dfa)
    print(dfa.match('white chrysanthemum I love you ya ha ha ha'))

Results:

[{'start': 2, 'match': 'I love you'}, {'start': 2, 'match': 'I love you ya'}]
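The match list reports where each sensitive word starts but does not alter the text. One possible way to turn it into masked output is sketched below. mask_matches is a hypothetical helper, not part of the class above; it assumes each entry carries the 'start' and 'match' keys shown in the result and that the offsets count characters of the string passed in.

```python
def mask_matches(content, matches, mask_char="*"):
    """Return content with every matched range replaced by mask_char."""
    chars = list(content)
    for entry in matches:
        start = entry["start"]
        stop = start + len(entry["match"])
        # Overlapping entries simply mask the same positions again.
        for pos in range(start, min(stop, len(chars))):
            chars[pos] = mask_char
    return "".join(chars)


content = "Hi, I love you ya!"
matches = [
    {"start": 4, "match": "I love you"},
    {"start": 4, "match": "I love you ya"},
]
print(mask_matches(content, matches))
```

Because the shorter and the longer match share a start position, masking both is harmless: the longer one covers the shorter one.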

Some sensitive words vary in a single character: "big silly X" may also be written "big silly B", "big silly ×", "big silly ..." and so on. Can this be handled with a wildcard *?

See the wildcard branch in the match method:

if '*' in state:
    # Some sensitive words vary in one character: "big silly X" may be
    # "big silly B", "big silly ×", "big silly ...". Use the wildcard *;
    # one * stands for one character.
    state_list[index] = state['*']
    state_char = state['*']
    is_match = True

Verify:

if __name__ == '__main__':
    dfa = DFA(['big silly *'], [])
    print(dfa)
    print(dfa.match('big silly X easy fly big silly B'))

{'big': {'is_end': False, 'silly': {'is_end': False, '*': {'is_end': True}}}}

[{'start': 0, 'match': 'big silly X'}, {'start': 6, 'match': 'big silly B'}]
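To make the wildcard idea concrete, here is a minimal standalone sketch. It is a variant under stated assumptions rather than the class above: where the class lets an exact character take precedence over the wildcard, this version follows both edges. The names build_tree and match_ends are hypothetical, and the sketch works on the characters of English strings.

```python
def build_tree(words):
    """Nested-dict tree; 'is_end' marks nodes where a word ends."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {"is_end": False})
        if word:
            node["is_end"] = True
    return root


def match_ends(text, pos, node):
    """Return the end offsets of all words that match text from pos.

    A '*' edge in the tree matches any single character.
    """
    ends = []
    if node.get("is_end"):
        ends.append(pos)
    if pos < len(text):
        # Try the exact character and the wildcard edge (a set avoids
        # visiting '*' twice when the input character is itself '*').
        for key in {text[pos], "*"}:
            if key in node:
                ends.extend(match_ends(text, pos + 1, node[key]))
    return sorted(ends)


tree = build_tree(["big silly *"])
text = "big silly X and big silly B"
print(match_ends(text, 0, tree))   # [11]
print(match_ends(text, 16, tree))  # [27]
```

The wildcard still consumes exactly one character, so "big silly " with nothing after it does not match.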

If the input above, "big silly X easy fly big silly B", is rewritten as "big% silly X easy fly big& silly B", are the sensitive words still recognized? They are not:

if __name__ == '__main__':
    dfa = DFA(['big silly *'], [])
    print(dfa)
    print(dfa.match('big% silly X easy fly big& silly B'))

Results:

{'big': {'is_end': False, 'silly': {'is_end': False, '*': {'is_end': True}}}}

[]

Special symbols such as &, !, @, #, $, ¥, *, ^, % and ? carry no meaning of their own, but they can be inserted into the middle of a sensitive word to break up its structure and evade the check.

Configure these meaningless characters as skip words (skip_words) and run the check again. As the following shows, the broken-up sensitive words are now recognized as well.

if __name__ == '__main__':
    dfa = DFA(['big silly *'], ['%', '&'])
    print(dfa)
    print(dfa.match('big% silly X easy fly big& silly B'))

Results:

{'big': {'is_end': False, 'silly': {'is_end': False, '*': {'is_end': True}}}}

[{'start': 0, 'match': 'big% silly X'}, {'start': 7, 'match': 'big& silly B'}]
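The effect of skip words can also be shown in isolation. The sketch below is a simplified illustration, not the class above: it looks for the longest word starting at one position and ignores any character listed in skip_chars, while still reporting offsets into the original text. The names build_tree and find_with_skips are hypothetical.

```python
def build_tree(words):
    """Nested-dict tree; 'is_end' marks nodes where a word ends."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {"is_end": False})
        if word:
            node["is_end"] = True
    return root


def find_with_skips(text, start, root, skip_chars):
    """Return the end offset of the longest match at start, or None.

    Characters in skip_chars are stepped over without moving in the tree.
    """
    node, end = root, None
    pos = start
    while pos < len(text):
        char = text[pos]
        pos += 1
        if char in skip_chars:
            continue  # noise character: stay on the same tree node
        if char not in node:
            break
        node = node[char]
        if node["is_end"]:
            end = pos
    return end


tree = build_tree(["big silly X"])
text = "big% silly X"
end = find_with_skips(text, 0, tree, {"%", "&"})
print(text[0:end])                            # big% silly X
print(find_with_skips(text, 0, tree, set()))  # None
```

Since the returned offset refers to the original text, the reported match keeps the inserted symbol, just like the 'big% silly X' entry in the result above.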

Thank you for reading. That covers how to filter sensitive words in content with Python based on the DFA algorithm. After working through this article you should have a better understanding of the approach; how well it fits a specific use case needs to be verified in practice.
