How to Build a Simple Tokenizer with Python's re Module

This article introduces how to implement a simple tokenizer in Python using the re module. Many readers have questions about this task, so the editor has gathered various materials and organized them into a simple, easy-to-follow walkthrough. I hope it helps answer your doubts. Let's get started!

A simple tokenizer

Tokenization is one of the most common tasks in Python string processing. Here we build a simple expression tokenizer out of regular expressions: it parses an expression string from left to right into a stream of tokens.

Given the following expression string:

text = 'foo = 12 + 5 * 6'

We want to convert it into the following sequence of (token type, value) pairs:

tokens = [('NAME', 'foo'), ('EQ', '='), ('NUM', '12'), ('PLUS', '+'),
          ('NUM', '5'), ('TIMES', '*'), ('NUM', '6')]

To perform this tokenization, we first need to define all possible token patterns (a pattern here is a regular expression that describes the set of strings matching a syntactic rule). Note that the whitespace pattern WS must be included; otherwise scanning stops as soon as it reaches a character that no pattern covers. Because we also need to attach token names such as NAME and EQ, we use named capture groups in the regular expressions.

import re

NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'  # ?P<NAME> names the capture group; () delimits the group
EQ = r'(?P<EQ>=)'
NUM = r'(?P<NUM>\d+)'     # \d matches a digit, + means one or more
PLUS = r'(?P<PLUS>\+)'    # + must be escaped
TIMES = r'(?P<TIMES>\*)'  # * must be escaped
WS = r'(?P<WS>\s+)'       # \s matches a whitespace character, + means one or more

master_pat = re.compile('|'.join([NAME, EQ, NUM, PLUS, TIMES, WS]))  # | combines the patterns as alternatives ("or")
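As a quick illustration of the point about whitespace above (this check is my own addition, not part of the original walkthrough): if WS is left out of the combined pattern, the scan cannot proceed past the first space, because no alternative matches it.

no_ws_pat = re.compile('|'.join([NAME, EQ, NUM, PLUS, TIMES]))  # WS omitted on purpose
print(no_ws_pat.match('foo = 12', 3))  # None -- position 3 is a space, which no pattern covers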

Next, we use the scanner() method of the compiled pattern object to drive the tokenization; it creates a scanner object:

scanner = master_pat.scanner(text)

You can then call the match() method repeatedly to get one match (one token) at a time:

scanner = master_pat.scanner(text)
m = scanner.match()
print(m.lastgroup, m.group())  # NAME foo
m = scanner.match()
print(m.lastgroup, m.group())  # WS
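A side note (my addition): scanner() is an undocumented method of compiled pattern objects. If you would rather stay on the documented API, finditer() yields the same matches for this input, with one behavioral difference: finditer() silently skips characters that no pattern matches, whereas scanner.match() returns None and stops there.

# A sketch using the documented finditer() instead of the undocumented scanner()
for m in master_pat.finditer(text):
    print(m.lastgroup, m.group())  # NAME foo, then WS ' ', then EQ =, and so on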

Of course, one call per token is tedious. We can use an iterator to drive the calls in bulk and store each result in a named tuple:

from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        # scanner.match is the callable invoked on each iteration;
        # None is the sentinel: iteration stops when match() returns None
        yield Token(m.lastgroup, m.group())

for tok in generate_tokens(master_pat, text):
    print(tok)

Finally, the token stream produced for the expression string "foo = 12 + 5 * 6" is:

Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='12')
Token(type='WS', value=' ')
Token(type='PLUS', value='+')
Token(type='WS', value=' ')
Token(type='NUM', value='5')
Token(type='WS', value=' ')
Token(type='TIMES', value='*')
Token(type='WS', value=' ')
Token(type='NUM', value='6')

Filtering the token stream

Next, to filter out the whitespace tokens, we can use a generator expression:

tokens = (tok for tok in generate_tokens(master_pat, "foo = 12 + 5 * 6")
          if tok.type != 'WS')
for tok in tokens:
    print(tok)

You can see that the whitespace tokens have been filtered out successfully:

Token(type='NAME', value='foo')
Token(type='EQ', value='=')
Token(type='NUM', value='12')
Token(type='PLUS', value='+')
Token(type='NUM', value='5')
Token(type='TIMES', value='*')
Token(type='NUM', value='6')
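As an aside, a hardening step that is my own addition (adapted from the tokenizer recipe in the official re module documentation): instead of letting the scan stop silently at an illegal character, you can append a catch-all pattern that turns it into an explicit error.

MISMATCH = r'(?P<MISMATCH>.)'  # catch-all: any single character no other pattern claimed
# MISMATCH must come last, so it only wins when every other alternative fails
strict_pat = re.compile('|'.join([NAME, EQ, NUM, PLUS, TIMES, WS, MISMATCH]))

def tokenize_strict(text):
    for m in strict_pat.finditer(text):
        if m.lastgroup == 'MISMATCH':
            raise SyntaxError(f'unexpected character {m.group()!r}')
        yield Token(m.lastgroup, m.group())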

Beware of substring matching traps

The order of the patterns in the combined regular expression (that is, in '|'.join([NAME, EQ, NUM, PLUS, TIMES, WS])) also matters, because the re module tries the alternatives in the order given. Therefore, if one pattern happens to be a substring of another, longer pattern, you must make sure the longer pattern is listed first. The correct and incorrect orderings are shown below:

LT = r'(?P<LT><)'
LE = r'(?P<LE><=)'
EQ = r'(?P<EQ>=)'

master_pat = re.compile('|'.join([LE, LT, EQ]))   # correct: the longer LE is tried before LT
# master_pat = re.compile('|'.join([LT, LE, EQ]))  # incorrect: LT matches the '<' of '<=' first
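To see the trap in action, here is a quick check (my addition) of both orderings against the input '<=':

bad_pat = re.compile('|'.join([LT, LE, EQ]))
print(bad_pat.match('<=').group())   # '<'  -- LT wins first and the '=' is left behind
good_pat = re.compile('|'.join([LE, LT, EQ]))
print(good_pat.match('<=').group())  # '<=' -- the longer pattern is tried first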
