Shulou (Shulou.com), updated 2025-04-04
This article explains what regular expressions are and how a regular expression engine processes them. It is meant as a practical reference.
1. What is a regular expression
Basically, a regular expression is a pattern that describes a certain amount of text. Regex is short for regular expression. This article writes a specific regular expression between guillemets, for example «regex».
A literal piece of text is the most basic pattern: it simply matches the same text.
2. Different regular expression engines
A regular expression engine is software that can process regular expressions. Usually the engine is part of a larger application. In the software world, different regular expression engines are not fully compatible with each other. This tutorial focuses on the Perl 5 style of engine because it is the most widely used, and it also mentions some differences from other engines. Many modern engines, such as the .NET regex library and the JDK regex package, are similar but not identical.
3. Literal characters
The most basic regular expression consists of a single literal character. For example, «a» matches the first occurrence of the character "a" in the string. In the string "Jack is a boy", the "a" after "J" is matched. The second "a" is not.
The regular expression can also match the second "a", but only if you tell the engine to continue searching after the first match. In a text editor, you can use "Find Next". In a programming language, there is usually a function that lets you continue searching forward from the position of the previous match.
Similarly, «cat» matches the "cat" in "About cats and dogs". This is the same as telling the engine to find a «c», immediately followed by an «a», immediately followed by a «t».
Note that the regular expression engine is case-sensitive by default. Unless you tell the engine to ignore case, «cat» will not match "Cat".
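The behavior described above can be tried out with a short sketch. It uses Python's standard `re` module, which is an assumption of this example (the article itself targets Perl-style engines in general); the variable names are illustrative only.

```python
import re

text = "Jack is a boy"
pattern = re.compile("a")

# The first search returns the leftmost "a", the one right after "J".
first = pattern.search(text)

# Passing a start position continues the search after the previous match,
# which is the programmatic version of "Find Next".
second = pattern.search(text, first.end())

print(first.start(), second.start())

# A multi-character literal matches the same characters in sequence.
cat = re.search("cat", "About cats and dogs")
print(cat.span())

# Matching is case-sensitive unless the engine is told otherwise.
no_match = re.search("cat", "Cat")
ignore_case = re.search("cat", "Cat", re.IGNORECASE)
print(no_match, ignore_case.group())
```

The second argument of `pattern.search` is the position at which to start scanning, so the second call finds the "a" in "is a boy".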
(1) Special characters
Eleven characters are reserved for special purposes. They are:
[ \ ^ $ . | ? * + ( )
These special characters are also called metacharacters.
If you want to use these characters as literal characters in a regular expression, you need to escape them with a backslash "\". For example, to match "1+1=2", the correct expression is «1\+1=2».
Note that «1+1=2» is also a valid regular expression. However, it does not match "1+1=2". Instead, it matches "111=2" in "123+111=234", because the "+" here has its special meaning (repeat one or more times).
In programming languages, keep in mind that some special characters are processed by the compiler first and only then passed to the regex engine. Therefore, the regular expression «1\+1=2» must be written as "1\\+1=2" in C++ source code. To match "C:\temp", you use the regular expression «C:\\temp». In C++ source code, this regular expression becomes "C:\\\\temp".
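As a rough illustration of the escaping rules, here is a minimal Python sketch. It assumes the example expression is 1+1=2 and the path is C:\temp; Python's raw strings (`r"..."`) play the role that doubled backslashes play in C++ source code.

```python
import re

# Unescaped "+" is a quantifier: one or more "1", then "1=2".
unescaped = re.search(r"1+1=2", "123+111=234")
print(unescaped.group())

# The same pattern does not match the literal text "1+1=2".
unescaped_literal = re.search(r"1+1=2", "1+1=2")
print(unescaped_literal)

# Escaping the plus sign makes it a literal character.
escaped = re.search(r"1\+1=2", "1+1=2")
print(escaped.group())

# re.escape builds an escaped pattern from arbitrary literal text.
auto_escaped = re.search(re.escape("1+1=2"), "1+1=2")
print(auto_escaped.group())

# A regex backslash must itself be escaped. Without a raw string,
# the host language needs a second level of escaping on top.
raw_pattern = r"C:\\temp"
plain_pattern = "C:\\\\temp"
path = re.search(raw_pattern, "C:\\temp")
print(raw_pattern == plain_pattern, path.group())
```

Using raw strings for every pattern avoids having to count backslashes twice.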
(2) Non-printable characters
You can use special character sequences to represent some non-printable characters:
«\t» stands for Tab (0x09)
«\r» stands for carriage return (0x0D)
«\n» stands for newline (0x0A)
Note that text files on Windows use "\r\n" to end a line, while Unix uses "\n".
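A small Python sketch of these escape sequences follows. The sample strings are made up for illustration, and the optional-character operator "?" used in the line-ending pattern is only introduced later in the article.

```python
import re

windows_text = "first\r\nsecond"
unix_text = "first\nsecond"

# "\r?\n" accepts both line-ending conventions: an optional
# carriage return followed by a newline.
windows_lines = re.split(r"\r?\n", windows_text)
unix_lines = re.split(r"\r?\n", unix_text)
print(windows_lines, unix_lines)

# "\t" in the pattern matches a literal Tab character.
tab = re.search(r"\t", "a\tb")
print(tab.start())
```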
4. The internal working mechanism of the regular expression engine
Knowing how the regular expression engine works helps you quickly understand why a regular expression doesn't work as you expect.
There are two types of engines: text-directed engines and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines. This article is about regex-directed engines, because some very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. So it is not surprising that this kind of engine is the most popular one today.
You can easily tell whether the engine you are using is text-directed or regex-directed. If backreferences or lazy quantifiers are implemented, you can be sure that the engine is regex-directed. You can also run the following test: apply the regular expression «regex|regex not» to the string "regex not". If the result of the match is "regex", the engine is regex-directed. If the result is "regex not", it is text-directed. This is because a regex-directed engine is "eager": it is keen to report the first match it finds.
Regex-directed engines always return the leftmost match
This is an important point to understand: even if a "better" match could be found later, a regex-directed engine always returns the leftmost match.
When «cat» is applied to "He captured a catfish for his cat", the engine first compares «c» with "H" and fails. It then compares «c» with "e", which also fails. This continues until the fourth character, where «c» matches "c". «a» matches the fifth character. At the sixth character, «t» fails to match "p". The engine then restarts the match attempt from the fifth character. This goes on until the fifteenth character, where the match starts and «cat» matches the "cat" in "catfish". The regular expression engine eagerly returns this first match instead of looking for other, "better" matches.
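Both observations can be checked with a short sketch. It assumes Python's `re` module, which is a backtracking (regex-directed) engine, and it assumes the test pattern is the alternation regex|regex not.

```python
import re

# A regex-directed engine tries the alternatives from left to right
# and reports the first one that succeeds.
eager = re.search(r"regex|regex not", "regex not")
print(eager.group())

# The engine returns the leftmost match: the "cat" inside "catfish",
# not the standalone word at the end of the sentence.
sentence = "He captured a catfish for his cat"
leftmost = re.search(r"cat", sentence)
print(leftmost.start(), sentence[leftmost.start():leftmost.start() + 7])
```

The zero-based start index 14 corresponds to the fifteenth character of the sentence.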
5. Character set
A character set is a set of characters enclosed in square brackets "[]". With a character set, you tell the regular expression engine to match only one of several characters. If you want to match an "a" or an "e", use «[ae]». You can use «gr[ae]y» to match "gray" or "grey". This is especially useful when you are not sure whether the text you are searching uses American or British spelling. Conversely, «gr[ae]y» will not match "graay" or "graey". The order of the characters in the character set does not matter; the result is the same.
You can use the hyphen "-" to define a character range inside the character set. «[0-9]» matches a single digit between 0 and 9. You can use more than one range: «[0-9a-fA-F]» matches a single hexadecimal digit, case-insensitively. You can also combine ranges with individual characters: «[0-9a-fxA-FX]» matches a hexadecimal digit or the letter X. Again, the order of the characters and ranges has no effect on the result.
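The following Python sketch exercises a character set and two ranges. The patterns gr[ae]y, [0-9a-fA-F] and [0-9a-fxA-FX] are assumed to be the ones the text refers to; the word list is made up.

```python
import re

# A set matches exactly one character out of the listed ones.
gray = re.compile(r"gr[ae]y")
words = ["gray", "grey", "graay", "graey"]
accepted = [w for w in words if gray.fullmatch(w)]
print(accepted)

# Several ranges can be combined in one set.
hex_digit = re.compile(r"[0-9a-fA-F]")
hex_digits = hex_digit.findall("7fG")
print(hex_digits)

# Ranges and single characters can be mixed.
hex_or_x = re.compile(r"[0-9a-fxA-FX]")
with_x = hex_or_x.findall("0xZ1")
print(with_x)
```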
(1) Some applications of character sets
Find a word that may be misspelled, such as «sep[ae]r[ae]te» or «li[cs]en[cs]e».
Find identifiers in a programming language: «[A-Za-z_][A-Za-z_0-9]*». ("*" means repeat zero or more times.)
Find C-style hexadecimal numbers: «0[xX][A-Fa-f0-9]+». ("+" means repeat one or more times.)
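These three applications can be sketched in Python as follows. The patterns are the commonly used forms and are assumptions of this example, as are the sample input strings.

```python
import re

# Catch a word together with a likely misspelling.
spellings = re.findall(r"sep[ae]r[ae]te", "separate seperate saperate")
print(spellings)

# Identifiers: a letter or underscore, then letters, digits or underscores.
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "int _count = x1 + 0;")
print(identifiers)

# C-style hexadecimal numbers: "0x" or "0X" plus at least one hex digit.
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "mask = 0xFF | 0X1a | 0x;")
print(hex_numbers)
```

Because "+" requires at least one hex digit, the incomplete "0x" at the end of the last string is not reported.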
(2) Negated character sets
An opening square bracket "[" immediately followed by a caret "^" negates the character set. The result is that the character set matches any character that is not listed in the square brackets. Unlike ".", a negated character set can match newline characters.
It is important to remember that a negated character set must still match one character. «q[^u]» does not mean "a q that is not followed by a u". It means "a q followed by a character that is not a u". So it does not match the q in "Iraq", but it matches the q and the following space in "Iraq is a country". The space is part of the match because it is "a character that is not a u".
If you only want to match the q, on the condition that it is followed by a character that is not a u, you can use negative lookahead, which is covered later in the tutorial.
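The difference between a negated set and a lookahead can be seen in a few lines of Python. The pattern q[^u] is assumed to be the one under discussion, and q(?!u) is the usual negative-lookahead form; both are written here as a sketch.

```python
import re

# A negated set still has to consume one character.
at_end = re.search(r"q[^u]", "Iraq")
print(at_end)

# Here the space after "q" is consumed and becomes part of the match.
with_space = re.search(r"q[^u]", "Iraq is a country")
print(repr(with_space.group()))

# Negative lookahead checks what follows without consuming it,
# so it also succeeds at the very end of the string.
lookahead = re.search(r"q(?!u)", "Iraq")
print(lookahead.group())
```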
(3) Metacharacters in character sets
Note that only four characters have a special meaning inside a character set: "]", "\", "^" and "-". "]" ends the character set definition, "\" escapes, "^" negates, and "-" defines a range. The other common metacharacters are ordinary characters inside a character set and do not need to be escaped. For example, to search for an asterisk or a plus sign, you can use «[+*]». Escaping the usual metacharacters there still works, but it reduces readability.
To use a backslash "\" as a literal character inside a character set, escape it with another backslash: «[\\x]» matches a backslash or an x. The characters "]", "^" and "-" can either be escaped with a backslash or placed in a position where their special meaning cannot apply. We recommend the latter because it is more readable. A "^" placed anywhere except right after the opening bracket "[" is a literal character, so «[x^]» matches an x or a ^. «[]x]» matches a "]" or an "x". «[-x]» or «[x-]» matches a "-" or an "x".
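A quick Python sketch of these placement rules follows. The specific sets ([+*], [\\x], [x^], []x], [-x]) are assumed from the description; other engines may treat an initial "]" differently, so treat the fourth case as engine-dependent.

```python
import re

# "+" and "*" need no escaping inside a set.
operators = re.findall(r"[+*]", "2*3+4")
print(operators)

# A literal backslash inside a set is written as an escaped backslash.
backslash_or_x = re.findall(r"[\\x]", r"a\x")
print(backslash_or_x)

# "^" is literal when it is not the first character of the set.
caret = re.findall(r"[x^]", "x^y")
print(caret)

# "]" is literal when it comes right after the opening bracket.
bracket = re.findall(r"[]x]", "a]x")
print(bracket)

# "-" is literal at the start (or end) of the set.
hyphen = re.findall(r"[-x]", "a-x")
print(hyphen)
```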
(4) Shorthand character sets
Because some character sets are used very often, there are shorthand forms for them.
«\d» stands for «[0-9]».
«\w» stands for a word character. The exact set depends on the regular expression implementation. In most implementations it includes «[A-Za-z0-9_]».
«\s» stands for a whitespace character. This also depends on the implementation. In most implementations it includes the space and Tab characters as well as carriage return and newline.
Shorthand character sets can be used inside or outside square brackets. «\s\d» matches a whitespace character followed by a digit. «[\s\d]» matches a single whitespace character or digit. «[\da-fA-F]» matches a hexadecimal digit.
Negated shorthand character sets:
«\S» = «[^\s]»
«\W» = «[^\w]»
«\D» = «[^\d]»
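The shorthand sets can be tried in Python as sketched below. One implementation detail worth hedging: for Python `str` patterns, `\d`, `\w` and `\s` are Unicode-aware by default, and the `re.ASCII` flag restricts them to the ASCII ranges described above. The sample strings are made up.

```python
import re

# Shorthand outside brackets: whitespace followed by a digit.
space_digit = re.search(r"\s\d", "room 7")
print(repr(space_digit.group()))

# Shorthand inside brackets: one whitespace character or one digit.
space_or_digit = re.findall(r"[\s\d]", "a 1\tb2")
print(space_or_digit)

# Shorthand combined with ranges: a hexadecimal digit.
hex_digits = re.findall(r"[\da-fA-F]", "0xBEEF g")
print(hex_digits)

# The upper-case forms are the negations of the lower-case ones.
non_digits = re.findall(r"\D", "a1b2")
negated_set = re.findall(r"[^\d]", "a1b2")
print(non_digits, negated_set)

# With re.ASCII, \w is limited to letters, digits and underscore.
ascii_word = re.fullmatch(r"\w+", "snake_case1", re.ASCII)
print(ascii_word.group())
```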
(5) Repeating character sets
If you repeat a character set with the "?", "*" or "+" operator, you repeat the entire character set, not just the character it matched. The regular expression «[0-9]+» matches both "837" and "222".
If you only want to repeat the character that was matched, you can use a backreference. Backreferences are covered later in the tutorial.
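The contrast between repeating the set and repeating the matched character can be sketched in Python. The backreference pattern ([0-9])\1+ is an assumption of this example, since the text only mentions backreferences in general.

```python
import re

text = "837 and 222"

# Repeating the set: every repetition may match a different digit.
any_digits = re.findall(r"[0-9]+", text)
print(any_digits)

# Repeating the matched character: group 1 captures one digit and
# \1+ requires that same digit to occur again one or more times.
same_digit = [m.group() for m in re.finditer(r"([0-9])\1+", text)]
print(same_digit)
```

`re.finditer` is used for the second pattern because `re.findall` would return only the captured group rather than the whole match.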
6. Repetition with "?", "*" or "+"
?: tells the engine to match the preceding character zero times or once. In effect, it makes the preceding character optional.
+: tells the engine to match the preceding character one or more times.
*: tells the engine to match the preceding character zero or more times.
«<[A-Za-z][A-Za-z0-9]*>» matches an HTML tag without attributes. "<" and ">" are literal characters. The first character set matches a letter, and the second character set matches a letter or a digit.
It seems we could also use «<[A-Za-z0-9]+>». But that would also match "<1>". Still, this regular expression is good enough when you know that the string you are searching does not contain such invalid tags.
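The two tag patterns can be compared in a short Python sketch. The exact patterns are assumed from the description (a letter followed by letters or digits, versus any run of letters and digits), and the sample string is made up.

```python
import re

sample = "<b> <h1> <1>"

# Strict form: the first character after "<" must be a letter.
strict = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", sample)
print(strict)

# Loose form: any run of letters or digits, so "<1>" slips through.
loose = re.findall(r"<[A-Za-z0-9]+>", sample)
print(loose)
```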
(1) Limiting repetition
Many modern regular expression implementations let you specify how many times a character is repeated. The syntax is {min,max}, where min and max are non-negative integers. If the comma is present but max is omitted, there is no upper limit. If both the comma and max are omitted, the character is repeated exactly min times.
So {0,} is the same as "*", and {1,} is the same as "+".
You can use «\b[1-9][0-9]{3}\b» to match a number between 1000 and 9999 ("\b" denotes a word boundary). «\b[1-9][0-9]{2,4}\b» matches a number between 100 and 99999.
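A minimal Python sketch of bounded repetition follows, assuming the two number patterns are \b[1-9][0-9]{3}\b and \b[1-9][0-9]{2,4}\b. The input strings are made up to probe the boundaries.

```python
import re

# Exactly three more digits after a non-zero first digit: 1000 to 9999.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000 0123")
print(four_digits)

# Between two and four more digits: 100 to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")
print(three_to_five)

# {0,} behaves like "*" and {1,} behaves like "+".
star_like = re.fullmatch(r"a{0,}", "")
plus_like = re.fullmatch(r"a{1,}", "")
print(star_like is not None, plus_like is not None)
```

The word boundaries keep the patterns from matching part of a longer number such as "10000".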
(2) Watch out for greediness
Suppose you want to match an HTML tag with a regular expression. You know that the input will be a valid HTML file, so the regular expression does not need to exclude invalid tags. Anything between two angle brackets should be an HTML tag.
Many regular expression novices will first think of «<.+>». They will be surprised by the result on the test string "This is a <EM>first</EM> test". You might expect it to return "<EM>" and then, when matching continues, "</EM>".
But that is not what happens. The regular expression matches "<EM>first</EM>". Obviously, this is not what we want. The reason is that "+" is greedy. That is, "+" makes the regular expression engine repeat the preceding character as many times as possible. The engine backtracks only if this repetition causes the overall regular expression to fail. In other words, it gives up the last repetition and then continues with the rest of the regular expression.
Like "+", the "*" and "?" quantifiers are also greedy.
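(3) A closer look inside the regular expression engine
Let's look at how the regex engine matches the previous example. The first token is "<", a literal character, which matches the first "<" in the string. The second token is ".", which matches "E", and the "+" lets the dot keep matching the remaining characters up to the end of the line, where it fails because the dot does not match a newline. The engine then moves on to the next token, «>». So far, «<.+» has matched "<EM>first</EM> test". The engine tries to match «>» at the end of the line and fails. So the engine backtracks. The result is that «<.+» now matches "<EM>first</EM> tes", and «>» is compared with "t". Obviously, this still fails. This process continues until «<.+» matches "<EM>first</EM" and «>» matches ">". So the engine finds the match "<EM>first</EM>". Remember, a regex-directed engine is eager, so it hurries to report the first match it finds instead of backtracking further, even though there might be a better match such as "<EM>". So we can see that, because "+" is greedy, the engine returns the leftmost and, in this case, longest match.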
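(4) Replace greediness with laziness
One way to fix the problem above is to make "+" lazy instead of greedy. You do this by putting a question mark "?" after the "+". The same works for the repetitions expressed by "*", "{}" and "?". So in the example above we can use «<.+?>». Let's look at how the regular expression engine handles it.
Once again, the token «<» matches the first "<" in the string. The next token is the dot, this time repeated by a lazy "+", which tells the engine to repeat the dot as few times as possible. So the engine matches the dot with "E" and then tries «>» against "M", which fails. The engine backtracks, but unlike in the previous example it extends the lazy repetition instead of shrinking it, so «<.+» now matches "<EM". The engine tries «>» again, and this time it succeeds, so it reports "<EM>" as a match.
An alternative is a greedy repetition of a negated character set: «<[^>]+>». The reason why this is a better solution is that with lazy repetition the engine backtracks for each character before it finds a successful match, whereas the negated character set requires no backtracking.
Keep in mind that this tutorial is only about regex-directed engines. Text-directed engines do not backtrack, but they also do not support lazy repetition.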
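The three approaches from this section (greedy, lazy, and negated character set) can be compared side by side. This is a sketch in Python, assuming the test string "This is a <EM>first</EM> test" and the patterns <.+>, <.+?> and <[^>]+>.

```python
import re

text = "This is a <EM>first</EM> test"

# Greedy: ".+" grabs as much as possible and backtracks only until
# the final ">" can match, so both tags end up in a single match.
greedy = re.findall(r"<.+>", text)
print(greedy)

# Lazy: ".+?" grows one character at a time and stops at the first ">".
lazy = re.findall(r"<.+?>", text)
print(lazy)

# Negated set: "[^>]+" can never run past a ">", so the engine
# does not need to backtrack to find the end of the tag.
negated = re.findall(r"<[^>]+>", text)
print(negated)
```

The lazy and negated-set versions return the same matches here; the difference lies in how much work the engine does to find them.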
7. Use "." to match almost any character
In regular expressions, "." is one of the most commonly used symbols. Unfortunately, it is also one of the most misused.
"." matches a single character, regardless of what that character is. The only exception is the newline character. The engines discussed in this tutorial do not match newline characters with "." by default. So by default, "." is shorthand for the character set [^\n\r] (Windows) or [^\n] (Unix).
This exception exists for historical reasons. Early tools that used regular expressions were line-based: they read a file line by line and applied the regular expression to each line separately. In these tools, the string never contained newline characters, so "." never matched one.
Modern tools and languages can apply regular expressions to very large strings or even entire files. All regular expression implementations discussed in this tutorial provide an option that makes "." match all characters, including newlines. In tools such as RegexBuddy, EditPad Pro or PowerGREP, you can simply tick the option that makes the dot match newline characters. In Perl, the mode in which "." matches newline characters is called single-line mode. Unfortunately, this is a very confusing term, because there is also a so-called multi-line mode. Multi-line mode only affects the anchors for the start and end of a line, while single-line mode only affects ".".
Perl's terminology is also used by other languages and regular expression libraries. When using the regular expression classes in the .NET Framework, you can activate single-line mode with a statement similar to the following: Regex.Match("string", "regex", RegexOptions.Singleline)
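For comparison, here is how the same option looks in Python, where single-line mode is called `re.DOTALL` (or `re.S`). The pattern and input are made up for illustration. Note that Python's dot excludes only "\n" by default, so it does match "\r".

```python
import re

text = "a\nb"

# By default the dot does not match a newline.
default_mode = re.search(r"a.b", text)
print(default_mode)

# With DOTALL (Perl's "single-line mode") the dot matches any
# character, including the newline.
dotall_mode = re.search(r"a.b", text, re.DOTALL)
print(repr(dotall_mode.group()))

# A carriage return is matched by the dot even without DOTALL.
carriage_return = re.search(r"a.b", "a\rb")
print(carriage_return is not None)
```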
Use the dot sparingly
The dot is arguably the most powerful metacharacter. It lets you be lazy: with a dot, you can match almost any character. The problem is that it often matches characters it should not.
Here is a simple example. Let's see how to match a date in the format "mm/dd/yy" while letting the user choose the separator. A quick solution is «\d\d.\d\d.\d\d». It looks like it matches the date "02/12/03". The problem is that "02512703" is also considered a valid date.
«\d\d[- /.]\d\d[- /.]\d\d» looks like a better solution. Remember that the dot is not a metacharacter inside a character set. This solution is still far from perfect: it matches "99-99-99". «[0-1]\d[- /.][0-3]\d[- /.]\d\d» goes one step further, although it still matches "19-39-99". How strict your regular expression needs to be depends on what you want to achieve. If you are validating user input, it needs to be as strict as possible. If you are parsing a known source that you know contains no bad data, a reasonably good regular expression that matches the characters you are looking for is enough.
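The progression from a loose to a stricter date pattern can be sketched in Python. The three patterns below are assumptions reconstructed from the description (a dot as separator, then an explicit separator set, then restricted leading digits); none of them is a complete date validator.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")
better = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")
stricter = re.compile(r"[0-1]\d[- /.][0-3]\d[- /.]\d\d")

def accepts(pattern, candidate):
    """Return True if the whole candidate string matches the pattern."""
    return pattern.fullmatch(candidate) is not None

# The dot accepts digits as "separators", so a plain number passes.
print(accepts(loose, "02/12/03"), accepts(loose, "02512703"))

# An explicit separator set rejects the number but not impossible dates.
print(accepts(better, "02512703"), accepts(better, "99-99-99"))

# Restricting the first digit of month and day helps a little more,
# yet month 19 and day 39 are still accepted.
print(accepts(stricter, "99-99-99"), accepts(stricter, "19-39-99"))
```

Inside the square brackets the dot is a literal character, and the hyphen is placed first so that it is not read as a range.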
8. Anchors for the start and end of a string
Unlike ordinary regular expression tokens, anchors do not match any characters. Instead, they match a position before or after characters. "^" matches the position before the first character of the string. «^a» matches the "a" in the string "abc". «^b» does not match anything in "abc".
Similarly, "$" matches the position after the last character of the string. So «c$» matches the "c" in "abc".
(1) Applications of anchors
Using anchors is important when validating user input in a programming language. If you want to verify that the user's input is an integer, use «^\d+$».
User input often contains extra leading or trailing spaces. You can use «^\s*» and «\s*$» to match leading or trailing whitespace.
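Both uses of anchors can be sketched in Python. The helper names `is_integer` and `trim` are hypothetical, and the patterns ^\d+$, ^\s+ and \s+$ are assumptions based on the description.

```python
import re

def is_integer(user_input):
    """Return True if the input consists only of digits."""
    return re.search(r"^\d+$", user_input) is not None

def trim(user_input):
    """Remove leading and trailing whitespace using anchored patterns."""
    without_leading = re.sub(r"^\s+", "", user_input)
    return re.sub(r"\s+$", "", without_leading)

print(is_integer("42"), is_integer("4a2"), is_integer(" 42"))
print(repr(trim("  hi  ")))

# In Python, "$" also matches just before a trailing newline,
# so this input is accepted by the anchored pattern.
print(is_integer("42\n"))
```

When a trailing newline must be rejected as well, `re.fullmatch(r"\d+", user_input)` is a stricter choice in Python.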
(2) Use "^" and "$" as start-of-line and end-of-line anchors
Suppose you have a string that contains multiple lines, for example "first line\nsecond line" (where \n represents a newline character). You often need to process each line separately rather than the whole string. For this reason, almost all regular expression engines provide an option that extends the meaning of these two anchors. "^" then matches at the start of the string (before the "f") and after each newline character (between "\n" and "s"). Similarly, "$" matches at the end of the string (after the last "e") and before each newline character (between "e" and "\n").
In .NET, the following code makes the anchors match before and after each newline character: Regex.Match("string", "regex", RegexOptions.Multiline)
Application: string str = Regex.Replace(Original, "^", ">", RegexOptions.Multiline) inserts ">" at the beginning of each line.
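The same multi-line behavior can be sketched in Python, where the option is called `re.MULTILINE` (or `re.M`). The sample text and the "> " prefix are assumptions of this example.

```python
import re

text = "first line\nsecond line"

# Without the flag, "^" matches only at the start of the string.
single = re.findall(r"^\w+", text)
print(single)

# With MULTILINE, "^" also matches right after each newline.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
print(line_starts)

# With MULTILINE, "$" also matches right before each newline.
line_ends = re.findall(r"\w+$", text, re.MULTILINE)
print(line_ends)

# Inserting a prefix at the start of every line.
quoted = re.sub(r"^", "> ", text, flags=re.MULTILINE)
print(quoted)
```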
(3) Absolute anchors
«\A» matches only at the start of the entire string, and «\Z» matches only at the end of the entire string. Even in multi-line mode, «\A» and «\Z» never match at line breaks.
Although «\Z» and «$» only match at the end of the string, there is one exception. If the string ends with a newline character, «\Z» and «$» match at the position before that newline rather than at the very end of the string. This "improvement" was introduced by Perl and adopted by many regular expression implementations, including Java and .NET. If «^[a-z]+$» is applied to "joe\n", the match is "joe" rather than "joe\n".
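A Python sketch of the absolute anchors follows, with one caveat that differs from the Perl behavior described above: in Python's `re` module, `\Z` matches only at the very end of the string and does not step back before a trailing newline, while `$` does. The pattern ^[a-z]+$ is assumed to be the one applied to "joe\n".

```python
import re

text = "joe\n"

# "$" matches before the trailing newline, so "joe" is found.
dollar = re.search(r"^[a-z]+$", text)
print(repr(dollar.group()))

# Python's \Z is strict: it matches only at the very end of the
# string, so the same test fails with it.
strict_end = re.search(r"^[a-z]+\Z", text)
print(strict_end)

# \A ignores MULTILINE and matches only at the start of the string,
# whereas "^" matches at the start of every line.
lines = "one\ntwo"
absolute_start = re.findall(r"\A\w+", lines, re.MULTILINE)
line_start = re.findall(r"^\w+", lines, re.MULTILINE)
print(absolute_start, line_start)
```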