Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What are the knowledge points of C++ string matching?

2025-01-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)05/31 Report--

Today, the editor will share with you the relevant knowledge points of C++ string matching. The content is detailed and the logic is clear. I believe most people still know too much about this knowledge, so share this article for your reference. I hope you can get something after reading this article, let's take a look at it.

String matching

We look for string B in string A, where string An is the main string and string B is the pattern string. We record the length of the main string as n and the length of the pattern string as m. Because we are looking for the pattern string in the main string, n > m.

Single pattern string matching algorithm: that is, a string is matched with a string.

BF in BF algorithm is the abbreviation of Brute Force, which is called brute force matching algorithm in Chinese, also called naive matching algorithm. In the main string, we check that the starting positions are 0, 1, 2, respectively. N-m+1 substrings with length m of m to see if there are any matches with the pattern strings. The time complexity of BF algorithm is very high, which is O (nimm).

In the actual development, it is a more commonly used string matching algorithm, in most cases, the simple string matching algorithm is enough.

In actual software development, in most cases, the length of the pattern string and the main string will not be too long.

The idea of naive string matching algorithm is simple, and the code implementation is also very simple. Meet KISS (Keep it Simple and Stupid) design principles

The full name of RK algorithm is Rabin-Karp algorithm, which is named after its two inventors, Rabin and Karp. The idea is as follows: we use the hash algorithm to calculate the hash value of the n-m+1 substring in the main string, and then compare the hash value with the hash value of the pattern string one by one. If the hash value of a substring is equal to the pattern string, the corresponding substring and the pattern string match (no hash conflicts). If there is a hash conflict, compare the substring with the pattern string itself.

The efficiency of RK algorithm is higher than that of BF algorithm, and the overall time complexity of RK algorithm is O (n). However, this efficiency depends on the design method of the hash algorithm, and if there is a conflict, the time complexity may degenerate. In extreme cases, the hash algorithm has a large number of conflicts, and the time complexity is reduced to O (nymph).

Requirements: design hash algorithm. Find out the relationship between two substrings s [I-1] and s [I] (actually abcde, bad represents the same substring, the relationship between an and e).

BM (Boyer-Moore) algorithm: its performance is 3 to 4 times that of the famous KMP algorithm. When you encounter mismatched characters, look for rules and slide the pattern string back a few more bits.

The BM algorithm consists of two parts: the bad character rule (bad character rule) and the good suffix rule (good suffix shift).

Bad character rule: match backwards from the end of the pattern string when we find that a character cannot be matched. We call this unmatched character a bad character (a character in the main string).

When a mismatch occurs, we mark the characters under the pattern string corresponding to the bad characters as si. If a bad character exists in the pattern string, we mark the subscript of the bad character in the pattern string (the last one) as xi. If it doesn't exist, we record xi as-1. The number of bits that the mode string moves back is equal to si-xi. Note that the subscript I'm talking about here is the subscript of the character in the pattern string.

An bc array is required to record all character positions of the pattern string.

Cons: it is not enough to simply use bad character rules. Because the number of moving bits calculated according to si-xi may be negative, for example, the main string is aaaaaaaaaaaaaaaa and the mode string is baaa.

Good suffix rules:

When the suffix {u} looks in the pattern string, if we find another substring {u*} that matches {u}, then we slide the pattern string to the position where the substring {u*} is aligned with {u} in the main string.

When the suffix {u} is found in the pattern string, there is no other substring {u*} that matches {u} and cannot be excessively slipped after {u} in the main string. Instead, from the suffix substring of a good suffix, find the longest one that matches the prefix substring of the pattern string, assuming {v}, and then slide the pattern string to the position shown in the figure.

The suffix substring of abc includes c, bc. The prefix substring of abc is amam ab.

Implementation of the algorithm:

Bad character rule implementation

Using hash table to store each character position subscript of the pattern string, the same character, the subscript appears at the end of the record.

The starting subscript I, the pattern string matches from back to front, and bad characters are encountered. Find the bad character corresponding to the subscript in the pattern string is si, find the last occurrence position of the bad character in the pattern string xi (using hash table, key= character, value= subscript, not counted as-1)

Calculate sliding I = I + (j-bc [(int) a [iSj]])

Implementation of good suffix rules (bad characters are not in the last position in the pattern string m)

Preprocess the pattern string and store it in an array. Each array element records the position of another matching suffix substring. The following figure suffix [1] represents a substring with a length of 1, and the final matching substring has an index of 2.

In addition to the suffix array, we need another prefix array of type boolean to record whether the suffix substring of the pattern string matches the prefix substring of the pattern string.

How to calculate and populate the values of these two arrays?

Take the substring from 0 to I (I can be 0 to mmur2) and the whole pattern string, and find the common suffix substring. If the length of the common suffix substring is k, then we record suffix [k] = j (j represents the starting subscript of the common suffix substring). If j equals 0, that is, the common suffix substring is also a prefix substring of the pattern string, we record prefix [k] = true.

Private void generateGS (char [] b, int m, int [] suffix, boolean [] prefix) {

/ / initialize

For (int I = 0; I

< m; ++i) { suffix[i] = -1; prefix[i] = false; } // abcde的好后缀是bcde,所以要减去一 // 前缀串b[0,m-2]与后缀串b[1,m-1]求公共后缀子串,以b[m-1] 为比较 // 比如cab 与b 求后缀子串 for (int i = 0; i < m - 1; i++) { int j = i; int k = 0; while (j >

= 0 & b [j] = b [m-1-k]) {

+ + k

Suffix [k] = j

-- j

}

If (j =-1) {

Prefix [k] = true

}

}

}

The suffix substring of abc includes c, bc. The suffix substring must have the last value. The prefix substring of abc is amam ab. The prefix substring must have the first value

Find the common suffix substring of the prefix substring and the suffix substring, and record whether the prefix substring can match completely.

Through a good suffix length k, it is determined whether there is another matching suffix. Suffix [k]! =-1, slide the j-x+1 bit.

Good suffix {u}, when it does not exist in the pattern string. You need to find out if there is a matching prefix substring in the suffix substring. J for bad character subscript, loop search (rfail (fail stands for failure pointer, is there much like the process of finding next in the KMP algorithm? Continue the search above until Q is root, and if you haven't found a child node with the same character, let the failed pointer of node pc point to root.

How to match the main string on AC automata?

The main string traverses, matching each branch in root. When a branch matches a part such as a-> b-> c, and the next d does not match, use the failure pointer c = c-> fail (the position where the failure pointer points to, the previous match). , and then match.

Is the sensitive word filtering system implemented by AC automaton more efficient than single pattern string matching method?

First, we need to build sensitive words into AC automata, including building Trie trees and building failure pointers. The time complexity of Trie tree construction is O (m*len), where len represents the average length of sensitive words and m represents the number of sensitive words. What is the time complexity of building failed pointers?

Assume that the total number of nodes in the Trie tree is k, and when each node fails to build the pointer, (you can see the code) the most time-consuming link is the qroomq-> fail in the while loop. Each time this statement is run, the depth of Q pointing to the node will be reduced by 1, and the height of the tree will not exceed len, so the time complexity of each node building failure pointer is O (len). The whole process of building failure pointers is O (k*len).

What is the time complexity of matching with AC automata?

Similar to the analysis that just built the failed pointer, the for loop iterates through each character in the main string in turn, and the most time-consuming part of the for loop is the while loop, and the time complexity of this part is also O (len), so the total matching time complexity is O (n*len). Because sensitive words are not very long, and the time complexity is only a very broad upper limit, in fact, it may be similar to O (n), so AC automata filter sensitive words with very high performance.

These are all the contents of this article entitled "what are the knowledge points of C++ string matching". Thank you for reading! I believe you will gain a lot after reading this article. The editor will update different knowledge for you every day. If you want to learn more knowledge, please pay attention to the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report