How to Implement Simple Tokenization Using the re Module in Python

2025-10-27 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31

This article explains in detail how to implement simple tokenization in Python with the re module. The steps are laid out one at a time, and I hope it helps answer your questions.

A simple tokenizer

Tokenization is a common task in Python string processing. Here we build a simple expression tokenizer out of regular expressions that parses an expression string from left to right into a stream of tokens.

Given the following expression string:

text = 'foo = 12 + 5 * 6'

We want to convert it into the following sequence of (type, value) pairs:

tokens = [('NAME', 'foo'), ('EQ', '='), ('NUM', '12'), ('PLUS', '+'),
          ('NUM', '5'), ('TIMES', '*'), ('NUM', '6')]

To perform this tokenization, we first need to define every possible token pattern. (A pattern is a string that describes or matches a set of strings following some syntactic rule; here we use regular expressions as patterns.) Note that whitespace must be included as a pattern too; otherwise scanning stops at the first character in the string that no pattern matches. Because we also need to label each token with a name such as NAME or EQ, we use named capture groups in the regular expressions.

import re

# (?P<NAME>...) is a named capture group: () is a capture group and ?P<NAME> gives it a name
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
EQ = r'(?P<EQ>=)'
NUM = r'(?P<NUM>\d+)'      # \d matches a digit; + means one or more
PLUS = r'(?P<PLUS>\+)'     # + must be escaped
TIMES = r'(?P<TIMES>\*)'   # * must be escaped
WS = r'(?P<WS>\s+)'        # \s matches whitespace; + means one or more

# | joins the alternatives and means "or"
master_pat = re.compile("|".join([NAME, EQ, NUM, PLUS, TIMES, WS]))
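As a minimal sketch of how named capture groups report which alternative matched, the example below joins just two of the patterns. The name demo_pat is hypothetical and is used only for this illustration.

```python
import re

# Two named alternatives, joined with "|" in the same way as master_pat.
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
demo_pat = re.compile('|'.join([NAME, NUM]))  # demo_pat is a hypothetical name

m1 = demo_pat.match('foo')
m2 = demo_pat.match('42')

# lastgroup holds the name of the named group that matched.
print(m1.lastgroup, m1.group())  # NAME foo
print(m2.lastgroup, m2.group())  # NUM 42

# groupdict() lists every named group; groups that did not take part are None.
print(m2.groupdict())  # {'NAME': None, 'NUM': '42'}
```

This is why the tokenizer can read the token type directly from m.lastgroup without testing each pattern separately.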

Next, we use the scanner() method of the pattern object to perform the tokenization. It creates a scanner object:

scanner = master_pat.scanner(text)

You can then call the scanner's match() method repeatedly to get one match at a time, each corresponding to one pattern:

scanner = master_pat.scanner(text)
m = scanner.match()
print(m.lastgroup, m.group())  # NAME foo
m = scanner.match()
print(m.lastgroup, m.group())  # WS
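The following sketch illustrates why the whitespace pattern has to be part of the combined pattern: match() returns None as soon as no alternative matches at the current position. The names no_ws_pat, with_ws_pat and scan_all are hypothetical helpers introduced only for this comparison.

```python
import re

NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
EQ = r'(?P<EQ>=)'
NUM = r'(?P<NUM>\d+)'
WS = r'(?P<WS>\s+)'

no_ws_pat = re.compile('|'.join([NAME, EQ, NUM]))        # WS deliberately left out
with_ws_pat = re.compile('|'.join([NAME, EQ, NUM, WS]))  # WS included


def scan_all(pat, text):
    """Collect (type, value) pairs until scanner.match() returns None."""
    scanner = pat.scanner(text)
    result = []
    while True:
        m = scanner.match()
        if m is None:
            break
        result.append((m.lastgroup, m.group()))
    return result


print(scan_all(no_ws_pat, 'foo = 12'))
# [('NAME', 'foo')]
print(scan_all(with_ws_pat, 'foo = 12'))
# [('NAME', 'foo'), ('WS', ' '), ('EQ', '='), ('WS', ' '), ('NUM', '12')]
```

Without WS, scanning silently ends at the first space, so the rest of the expression is never tokenized.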

Of course, calling match() by hand each time is tedious. We can use an iterator to make the calls in a loop and store each result in a named tuple.

from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    # iter(callable, sentinel): scanner.match is called on every iteration,
    # and None is the sentinel, so iteration stops once match() returns None
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

for tok in generate_tokens(master_pat, "foo = 42"):
    print(tok)
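The two-argument form of iter() used above may be unfamiliar. As a small sketch that is independent of the tokenizer, the example below reads an in-memory stream in fixed-size chunks. The stream contents and the chunk size are arbitrary assumptions chosen for illustration.

```python
import io
from functools import partial

stream = io.StringIO('abcdefgh')

# iter(callable, sentinel) calls the callable with no arguments on each step
# and stops as soon as the return value equals the sentinel ('' here).
chunks = list(iter(partial(stream.read, 3), ''))
print(chunks)  # ['abc', 'def', 'gh']

# The equivalent explicit loop, for comparison.
stream.seek(0)
chunks_loop = []
while True:
    chunk = stream.read(3)
    if chunk == '':
        break
    chunks_loop.append(chunk)
print(chunks_loop)  # ['abc', 'def', 'gh']
```

In generate_tokens the callable is scanner.match and the sentinel is None, so the for loop replaces the same kind of explicit while loop.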

Running the same loop on the expression string "foo = 12 + 5 * 6" produces the following token stream:

Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='12')
Token(type='WS', value=' ')
Token(type='PLUS', value='+')
Token(type='WS', value=' ')
Token(type='NUM', value='5')
Token(type='WS', value=' ')
Token(type='TIMES', value='*')
Token(type='WS', value=' ')
Token(type='NUM', value='6')

Filtering the token stream

Next, to filter out the whitespace tokens, use a generator expression:

tokens = (tok for tok in generate_tokens(master_pat, "foo = 12 + 5 * 6")
          if tok.type != 'WS')
for tok in tokens:
    print(tok)
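One property of this approach is that a generator expression is lazy and can be consumed only once. The sketch below shows this with a reduced pattern. The pattern pat and the input '1 + 2' are assumptions made to keep the example self-contained.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

# Reduced, hypothetical pattern: numbers, plus signs and whitespace only.
pat = re.compile(r'(?P<NUM>\d+)|(?P<PLUS>\+)|(?P<WS>\s+)')


def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())


tokens = (tok for tok in generate_tokens(pat, '1 + 2') if tok.type != 'WS')

first = next(tokens)   # the first token is produced only when requested
rest = list(tokens)    # consumes the remaining tokens
again = list(tokens)   # the generator is now exhausted

print(first)  # Token(type='NUM', value='1')
print(rest)   # [Token(type='PLUS', value='+'), Token(type='NUM', value='2')]
print(again)  # []
```

If the filtered tokens are needed more than once, wrapping the generator expression in list() is one simple option.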

You can see that the whitespace tokens have been filtered out:

Token(type='NAME', value='foo')
Token(type='EQ', value='=')
Token(type='NUM', value='12')
Token(type='PLUS', value='+')
Token(type='NUM', value='5')
Token(type='TIMES', value='*')
Token(type='NUM', value='6')

Beware of substring matching traps

The order of the tokens in the regular expression (that is, in "|".join([NAME, EQ, NUM, PLUS, TIMES, WS])) also matters, because the re module tries the alternatives in the order given. Therefore, if one pattern happens to be a substring of another, longer pattern, make sure the longer pattern comes first. The correct and incorrect orderings are shown below:

LT = r'(? P
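A minimal sketch of this pitfall, assuming two hypothetical tokens LT (for <) and LE (for <=) that are not defined earlier in this article. The names good_pat, bad_pat and token_types are likewise introduced only for illustration.

```python
import re

LT = r'(?P<LT><)'    # hypothetical "less than" token
LE = r'(?P<LE><=)'   # hypothetical "less than or equal" token
EQ = r'(?P<EQ>=)'

good_pat = re.compile('|'.join([LE, LT, EQ]))  # longer pattern listed first
bad_pat = re.compile('|'.join([LT, LE, EQ]))   # shorter pattern listed first


def token_types(pat, text):
    """Return (type, value) pairs for every token the scanner matches."""
    scanner = pat.scanner(text)
    return [(m.lastgroup, m.group()) for m in iter(scanner.match, None)]


print(token_types(good_pat, '<='))  # [('LE', '<=')]
print(token_types(bad_pat, '<='))   # [('LT', '<'), ('EQ', '=')]
```

With the shorter pattern first, the text <= is split into an LT token followed by an EQ token instead of a single LE token.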
