Can Python regular expressions extract every field that meets the conditions?


2025-04-20 Update From: SLTechnology News&Howtos


This article looks at whether Python regular expressions can extract every field that meets the conditions. It works through one example step by step, so it should be easy to follow.

For a problem like the one in the title, there are essentially three functions for matching fields with regular expressions:

re.match(), re.search(), re.findall()

Briefly, re.match() is very similar to re.search(), except that the former only matches at the beginning of the target string, while the latter can match anywhere in it. re.findall() returns all the matching results. But sometimes what re.findall() returns is not the same kind of result as the other two, so let's look at the following example.
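As a quick illustration of the difference between the three functions, here is a minimal sketch. The sample string is made up for this example and is not the article's data.

```python
import re

# Hypothetical sample text (not the article's data).
text = "no cough, foamy urine, poor sleep, no urine pain."

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match(r"urine", text)

# re.search scans forward and returns the first match anywhere in the string.
first = re.search(r"urine", text)

# re.findall returns every non-overlapping match as a list of strings.
every = re.findall(r"urine", text)

print(at_start)                        # None
print(first.group(), first.start())    # urine 16
print(every)                           # ['urine', 'urine']
```

Note that re.match() and re.search() return a match object (or None), while re.findall() returns plain strings.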

For sentences:

Since the onset of the disease, the patient has had no low back pain, neck pain, no sore throat, oral ulcer, no photosensitivity, hair loss, dry mouth, dry eyes, no paroxysmal cyanosis, no limb weakness, no edema, foamy urine, poor spirits, appetite and sleep, dry stool for nearly 1 month, once every 5-6 days, no abdominal pain, black stool, hematochezia, urination once every 1-2 hours, no urine pain, hematuria. There was no significant change in body weight.

I want to use a rule to match every clause that mentions urination or urine. The goal is to identify and return the clauses "no edema, foamy urine" and "urination once every 1-2 hours, no urine pain, hematuria."

My first attempt used re.findall():

The result is:

[('urine', ''), ('urination', 'urination')]

Here is what the pattern means. Because I want to match a whole clause, and a clause has punctuation before and after it, the pattern starts with "[，；,;]+" and ends with "[，。,;.]+", where "+" means one or more occurrences. Commas, semicolons and periods are listed in both their Chinese and English forms. In between, "[^，。,;.]*" matches a run of characters that are not punctuation; "*" means zero or more. I use the negated class "[^，。,;.]" rather than listing Chinese characters, digits and specific symbols because the text may contain many other symbols, such as the "-" in the example above, and an explicit list could miss some. Since my aim is simply to get the matching clause, the negated class is more general. In the middle is "((urination)|urine)+", which is meant to match a substring containing "urination" or "urine".
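The pieces of such a pattern can be tried one at a time. The sketch below uses a made-up English sentence with ASCII punctuation only, so its character classes are shorter than the ones described above; both the sentence and the simplified pattern are assumptions for illustration.

```python
import re

# Hypothetical sample text with ASCII punctuation only.
text = "no cough, foamy urine, poor sleep, urination every 1-2 hours."

# [,;]+     one or more opening delimiters
# [^,.;]*   any run of non-delimiter characters (letters, digits, "-", spaces)
# urine     the keyword that the clause must contain
# [,.;]+    one or more closing delimiters
urine_clause = re.search(r"[,;]+[^,.;]*urine[^,.;]*[,.;]+", text)
print(urine_clause.group())        # , foamy urine,

# The negated class lets "1-2" through without listing "-" or digits explicitly.
urination_clause = re.search(r"[,;]+[^,.;]*urination[^,.;]*[,.;]+", text)
print(urination_clause.group())    # , urination every 1-2 hours.
```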

But re.findall() did not return what I wanted, so I changed the rule slightly, replacing "((urination)|urine)+" with "[(urination)|urine]+". To check how well the match generalizes, I also added two more samples. The full code is as follows:

import re

lines = [
    "Since the onset of the disease, the patient has had no low back pain, neck pain, no sore throat, oral ulcer, no photosensitivity, hair loss, dry mouth, dry eyes, no paroxysmal cyanosis, no limb weakness, no edema, foamy urine, poor spirits, appetite and sleep, dry stool for nearly 1 month, once every 5-6 days, no abdominal pain, black stool, hematochezia, urination once every 1-2 hours, no urine pain, hematuria. There was no significant change in body weight.",
    "Since the onset of the disease, sleep, appetite and urination have been normal. In the past 4 to 5 years, defecation has been 3 or 4 times a day, mostly yellowish-brown formed soft stool, occasional loose stool, incomplete defecation, bloody stool, black stool, no weight loss.",
    "Short in stature and lighter in weight than their peers.",
]

for line in lines:
    # Character-class version of the rule (the variant under test here)
    pattern = "[，；,;]+[^，。,;.]*[(urination)|urine]+[^，。,;.]*[，。,;.]+"
    matches = re.findall(pattern, line)
    print(matches)

The result is:

[', no edema, foamy urine,', ', dry stool for nearly 1 month,', ', no abdominal pain, black stool, hematochezia,', ', no urine pain, hematuria.']

[', normal urination,',']

[]

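On the one hand, the clause "urination once every 1-2 hours" is missing from the first result. On the other hand, "dry stool for nearly 1 month" and "no abdominal pain, black stool, hematochezia" were matched. So "[(urination)|urine]" evidently does not mean "match a substring containing 'urination' or 'urine'". Does it instead match any substring containing any one of the characters inside the brackets? In the original Chinese text that would explain the stool clauses, because the two-character word for "urination" shares a character with the word for "stool". But the third sample contains the other character of that word ("small") and none of the rest, and it returned nothing, so at the time this idea seemed wrong as well. (The idea is in fact right: square brackets define a character class in which "(", ")" and "|" are ordinary characters, and the class matches any single character it lists. The third sample probably fails for a different reason, such as the leading punctuation mark that the pattern requires.)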
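The difference between a character class and a group can be checked directly. The following is a minimal sketch with made-up strings; it shows that inside square brackets the parentheses and the "|" are just more characters in the set.

```python
import re

# A character class: matches any ONE of the characters ( a b ) | c
char_class = re.compile(r"[(ab)|c]+")
# A group with alternation: matches the sequence "ab" or the character "c"
group = re.compile(r"(?:ab|c)+")

print(bool(char_class.fullmatch("ba")))    # True: "b" and "a" are both in the set
print(bool(group.fullmatch("ba")))         # False: "ba" is neither "ab" nor "c"
print(bool(char_class.fullmatch("(|)")))   # True: "(", "|" and ")" are literal here
print(bool(group.fullmatch("abc")))        # True: "ab" followed by "c"

# A class built from the words "urination" and "urine" happily matches
# letters of an unrelated word such as "stool".
stray = re.findall(r"[(urination)|urine]+", "stool")
print(stray)                               # ['too']
```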

In addition, re.findall() does not return the start and end positions of the matched substrings in the original text, so it is also hard to join the two adjacent clauses into "urination once every 1-2 hours, no urine pain, hematuria."
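When positions are needed, re.finditer() is a standard alternative to re.findall(): it yields match objects, which carry start() and end(). The article itself does not use it; this is a minimal sketch on a made-up string.

```python
import re

# Hypothetical sample text.
text = "no cough, foamy urine, poor sleep."

# Each match object knows where it was found in the original string.
spans = [(m.group(), m.start(), m.end())
         for m in re.finditer(r"urine|sleep", text)]

print(spans)    # [('urine', 16, 21), ('sleep', 28, 33)]
```

Slicing the original string with a reported span, for example text[16:21], gives back the matched text.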

So I switched to another commonly used function, re.search().

The result is:

, no edema, foamy urine

As you can see, re.search() returns only the first substring that meets the criteria.

If I change "((urination)|urine)" in the pattern to "[(urination)|urine]" (or "[(urination)urine]", which means exactly the same thing; I tried that too),

the result is:

, no edema, foamy urine

So the result is the same before and after the change. But if I delete the "urine" from "no edema, foamy urine" in the original text, the result before the change is:

, urination once every 1-2 hours,

and the result after the change is:

, dry stool for nearly 1 month,

That is to say, with

pattern = "[，；,;]+[^，。,;.]*[(urination)|urine]+[^，。,;.]*[，。,;.]+"

both re.findall() and re.search() also match stool-related substrings.

With

pattern = "[，；,;]+[^，。,;.]*((urination)|urine)+[^，。,;.]*[，。,;.]+"

re.findall() and re.search() return different things: the former returns a list of tuples, [('urine', ''), ('urination', 'urination')], while the latter returns the substring I want: ", no edema, foamy urine".

Later, after asking colleagues and reading more about how regular expressions work, I found that parentheses () do more than delimit part of the pattern: they also create capture groups. The text matched inside each pair of parentheses is stored after matching and can be retrieved later. When the pattern contains groups, re.findall() returns the stored group values instead of the whole match.
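How the return value of re.findall() depends on the number of capture groups can be seen in a small sketch. The string below is made up for illustration.

```python
import re

# Hypothetical sample text.
text = ", foamy urine, urination every 2 hours,"

# No groups: findall returns the whole matches.
no_groups = re.findall(r"urination|urine", text)

# One group: findall returns only the text captured by that group.
one_group = re.findall(r"foamy (urine)", text)

# Two or more groups: findall returns one tuple per match, one entry per group.
# A group that did not take part in the match shows up as ''.
two_groups = re.findall(r"((urination)|urine)", text)

print(no_groups)     # ['urine', 'urination']
print(one_group)     # ['urine']
print(two_groups)    # [('urine', ''), ('urination', 'urination')]
```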

To make this clearer, change "((urination)|urine)" in the pattern above to "((urination)|(urine))":

pattern = "[，；,;]+[^，。,;.]*((urination)|(urine))+[^，。,;.]*[，。,;.]+"

The output of re.findall() is then:

[('urine', '', 'urine'), ('urination', 'urination', '')]

As the output shows, "((urination)|(urine))" uses three pairs of parentheses, so three groups are created. The first, outermost group captures either "urination" or "urine"; both occur in the original text, so the first position is filled in both tuples. The second group captures "(urination)", so the second position only ever holds "urination". Similarly, the third group captures "(urine)", so the third position only ever holds "urine".
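Groups are numbered by the position of their opening parenthesis, counting from the left. A minimal sketch, using only the keyword part of the pattern and a made-up string:

```python
import re

# Group 1 is the outer pair, group 2 is (urination), group 3 is (urine).
pattern = re.compile(r"((urination)|(urine))")
text = "foamy urine, urination every 2 hours"   # hypothetical sample text

group_count = pattern.groups          # number of capture groups in the pattern
tuples = pattern.findall(text)        # one 3-tuple per match

print(group_count)    # 3
print(tuples)         # [('urine', '', 'urine'), ('urination', 'urination', '')]
```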

I use re.search() to print each group:

for line in lines:
    pattern = "[，；,;]+[^，。,;.]*((urination)|(urine))+[^，。,;.]*[，。,;.]+"
    m = re.search(pattern, line)
    if m is None:  # no match in this line; calling .group() on None would raise an error
        continue
    print(m.group(0))  # the whole match
    print(m.group(1))  # outer group
    print(m.group(2))  # (urination)
    print(m.group(3))  # (urine)

The result is:

, no edema, foamy urine

urine

None

urine

The values of group(1), group(2) and group(3) match the tuple that re.findall() returned for this clause, ('urine', '', 'urine'), except that a group that did not take part in the match is reported as None instead of an empty string. group(0) (or group(), which is exactly the same) is different: it is not one of the captured groups but the entire substring matched by the whole pattern, regardless of the parentheses inside it.
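The relationship between group(0) and the numbered groups can be checked on a short made-up sentence. This is a sketch with ASCII punctuation only, not the article's data.

```python
import re

text = "no edema, foamy urine, poor sleep."   # hypothetical sample text
m = re.search(r"[,;]+[^,.;]*((urination)|(urine))+[^,.;]*[,.;]+", text)

whole = m.group(0)      # the entire match, delimiters included
same = m.group()        # group() with no argument is group(0)
captured = m.groups()   # only the numbered groups, 1..3

print(repr(whole))      # ', foamy urine,'
print(same == whole)    # True
print(captured)         # ('urine', None, 'urine')
```

Here group 2 is None because the "(urination)" branch never matched.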

To stop

pattern = "[，；,;]+[^，。,;.]*[(urination)|urine]+[^，。,;.]*[，。,;.]+"

from also matching unwanted stool-related strings, use a non-capturing group (?:...):

pattern = "[，；,;]?[^，。,;.]*(?:urination|urine)[^，。,;.]*[，。,;.]"

Now it matches "urination" or "urine" as whole words, and re.findall() returns:

[', no edema, foamy urine,', ', urination once every 1-2 hours,', 'no urine pain, hematuria.']
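The effect of "(?:...)" on re.findall() can be compared side by side with a capturing group. The sentence and the ASCII-only punctuation classes below are assumptions made for this sketch.

```python
import re

# Hypothetical sample text.
text = "no edema, foamy urine, poor sleep, urination every 2 hours, no urine pain."

# Capturing group: findall returns only what the group captured.
capturing = re.findall(r"[,;]?[^,.;]*(urination|urine)[^,.;]*[,.;]", text)

# Non-capturing group: findall returns the whole matched clause.
non_capturing = re.findall(r"[,;]?[^,.;]*(?:urination|urine)[^,.;]*[,.;]", text)

print(capturing)
# ['urine', 'urination', 'urine']
print(non_capturing)
# [', foamy urine,', ', urination every 2 hours,', ' no urine pain.']
```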

Note that in this result ", urination once every 1-2 hours," and "no urine pain, hematuria." are adjacent in the text. The comma between them has already been consumed by the former, so the latter has no leading comma. It is a bit like slicing a string: what has been cut off is gone. That is why the "+" after the first "[，；,;]" in the pattern was replaced with "?", which means the leading character appears 0 or 1 times. The pattern can be simplified further to:

pattern = "[，；,;]?[^，。,;.]*(?:urination|urine).*?[，。,;.]"

Here the second "[^，。,;.]*" in the pattern has been replaced with ".*?", a non-greedy match that stops at the first closing punctuation mark.
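Both points can be checked with match positions: matches never overlap, so a delimiter consumed by one match is not available to the next, and the lazy ".*?" form finds the same clauses as the negated class. A minimal sketch on a made-up sentence with ASCII punctuation:

```python
import re

# Hypothetical sample text.
text = "no edema, foamy urine, poor sleep, urination every 2 hours, no urine pain."

negated = re.compile(r"[,;]?[^,.;]*(?:urination|urine)[^,.;]*[,.;]")
lazy = re.compile(r"[,;]?[^,.;]*(?:urination|urine).*?[,.;]")

spans = [m.span() for m in negated.finditer(text)]

# The second match ends exactly where the third begins: the comma between the
# two clauses belongs to the second match, so the third starts without one.
adjacent = spans[1][1] == spans[2][0]
same_result = negated.findall(text) == lazy.findall(text)

print(adjacent)       # True
print(same_result)    # True
```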

All the clauses mentioned above are now matched and returned, but the two adjacent clauses come back as separate items, which is still not what we want. So the code is improved as follows:

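for line in lines:
    # pattern = "[，；,;]+[^，。,;.]*[(urination)|urine]+[^，。,;.]*[，。,;.]+"
    pattern = "[，；,;]?[^，。,;.]*(?:urination|urine).*?[，。,;.]"
    # pattern = "[，；,;]?[^，。,;.]*(?:urination|urine)[^，。,;.]*[，。,;.]"
    matches = re.findall(pattern, line)
    ls = ['，', '；', ',', ';']  # delimiters that can open a clause
    merged = []
    for text in matches:
        # A match without a leading delimiter directly follows the previous
        # match, so append it to that one. Building a new list avoids
        # removing items from the list while iterating over it.
        if merged and text[0] not in ls:
            merged[-1] += text
        else:
            merged.append(text)
    print(merged)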

The result is:

[', no edema, foamy urine,', ', urination once every 1-2 hours, no urine pain, hematuria.']
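An alternative way to merge adjacent clauses is to compare match positions instead of inspecting the first character. The sketch below is not the article's code: extract_clause_runs is a hypothetical helper, and the sentence and ASCII-only punctuation classes are assumptions.

```python
import re

def extract_clause_runs(text, keywords=("urination", "urine")):
    """Return keyword clauses, joining clauses that directly follow each other.

    Hypothetical helper for illustration; delimiters are ASCII , ; . only.
    """
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(rf"[,;]?[^,.;]*(?:{alternation})[^,.;]*[,.;]")
    runs = []  # list of [start, end] spans in the original text
    for m in pattern.finditer(text):
        if runs and runs[-1][1] == m.start():
            runs[-1][1] = m.end()              # adjacent: extend the previous run
        else:
            runs.append([m.start(), m.end()])  # not adjacent: start a new run
    return [text[start:end] for start, end in runs]

sample = "no edema, foamy urine, poor sleep, urination every 2 hours, no urine pain."
clauses = extract_clause_runs(sample)
print(clauses)
# [', foamy urine,', ', urination every 2 hours, no urine pain.']
```

Working with spans keeps the original text intact, so the joined clause is simply a slice of it.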

The same result can also be achieved with re.search(), as shown in the following code:

for line in lines:
    result = []
    num = -1
    while line:
        # pattern = re.compile(r"[，；,;]+[^，。,;.]*((urination)|urine)+[^，。,;.]*[，。,;.]+")
        # m = pattern.search(line)
        pattern = r"[，；,;]+[^，。,;.]*((urination)|urine)+[^，。,;.]*[，。,;.]+"
        m = re.search(pattern, line)
        if m is None:
            break
        tmp = m.group()
        # start() == 0 means this match begins at the closing delimiter of the
        # previous match, so the two clauses are adjacent and are joined.
        if m.start() == 0 and result:
            result[-1] += tmp[1:]
        else:
            result.append(tmp[1:])
        # print(tmp)
        num = m.end() - 1  # keep the closing delimiter as the next opening one
        # print(num)
        line = line[num:]
    print(result)

The result is:

['no edema, foamy urine,', 'urination once every 1-2 hours, no urine pain, hematuria.']
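The same loop can be written without slicing the line, by passing a start position to the search() method of a compiled pattern. This is a sketch under the same assumptions as before (made-up sentence, ASCII punctuation), not the article's code.

```python
import re

pattern = re.compile(r"[,;]+[^,.;]*(?:urination|urine)[^,.;]*[,.;]+")
# Hypothetical sample text.
text = "no edema, foamy urine, poor sleep, urination every 2 hours, no urine pain."

result = []
pos = 0
prev_end = None
while True:
    m = pattern.search(text, pos)      # search from position pos onwards
    if m is None:
        break
    clause = m.group()[1:]             # drop the leading delimiter
    if prev_end is not None and m.start() == prev_end - 1:
        result[-1] += clause           # starts at the previous closing delimiter: join
    else:
        result.append(clause)
    prev_end = m.end()
    pos = m.end() - 1                  # reuse the closing delimiter as the next opener

print(result)
# [' foamy urine,', ' urination every 2 hours, no urine pain.']
```

Because every match is at least two characters long, pos always moves forward and the loop terminates.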

Thank you for reading. That covers whether Python regular expressions can extract every field that meets the conditions. How well these patterns work in a specific case still needs to be verified in practice.
