How to understand the Boyer-Moore algorithm of string matching 07/11 Update SLTechnology News&Howtos

How to understand the Boyer-Moore algorithm of string matching

2025-07-11 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/02 Report--

This article introduces you how to understand the string matching Boyer-Moore algorithm, the content is very detailed, interested friends can refer to, hope to be helpful to you.

I introduced the KMP algorithm earlier.

However, it is not an efficient algorithm, and it is not widely used in practice. The "find" function (Ctrl+F) of various text editors mostly uses the Boyer-Moore algorithm.

Boyer-Moore algorithm is not only efficient, but also ingenious and easy to understand. In 1977, Professor Robert S. Boyer and Professor J Strother Moore of the University of Texas invented this algorithm.

Next, I will explain this algorithm based on Professor Moore's own example.

Suppose the string is "HERE IS A SIMPLE EXAMPLE" and the search word is "EXAMPLE".

two。

First, the "string" is aligned with the head of the "search term", starting at the end.

This is a clever idea, because if the trailing characters do not match, you can know that the first seven characters (as a whole) are definitely not what you are looking for.

We see that "S" does not match "E". At this point, "S" is called a "bad character", that is, mismatched characters. We also found that "S" is not included in the search word "EXAMPLE", which means that the search term can be moved directly to the second place of "S".

Still starting from the tail, it is found that "P" does not match "E", so "P" is a "bad character". However, "P" is included in the search term "EXAMPLE". So, move the search term back two places, and align the two "P".

As a result, we sum up the "bad character rules":

Number of late shifts = position of bad character-position of last occurrence in search term

If the "bad character" is not included in the search term, the last occurrence position is-1.

Take "P" as an example, as a "bad character", it appears in the 6th position of the search term (numbered from 0), and the last occurrence position in the search term is 4, so move back 6-4 = 2. Take the "S" in the second step as an example, it appears in the sixth place, and the last position is-1 (that is, it does not appear), then the entire search term moves back 6-(- 1) = 7 places.

Still compare from the tail, "E" and "E" match.

Compare the previous bit, "LE" and "LE" match.

Compare the previous bit, "PLE" and "PLE" match.

Compare the previous bit, "MPLE" and "MPLE" match. We call this situation a "good suffix" (good suffix), that is, all trailing matching strings. Note that "MPLE", "PLE", "LE" and "E" are all good suffixes.

Compared with the previous bit, it is found that "I" and "A" do not match. So, "I" is a "bad character".

10.

According to the Bad character Rule, the search term should be moved back 2-(- 1) = 3 digits at this time. The question is, is there a better way to move at this time?

11.

We know that there is a "good suffix" at this time. Therefore, you can use the "good suffix rule":

The number of postshifts = the location of the good suffix-the last occurrence position in the search term

For example, if the last "AB" of the string "ABCDAB" is a "good suffix". Then its position is 5 (calculated from 0, take the "B" value of *), and the "last occurrence position in the search term" is 1 (* "B" position), so move back 5-1 = 4 bits, and the previous "AB" moves to the latter "AB" position.

To give another example, if the "EF" of the string "ABCDEF" is a good suffix, the position of "EF" is 5, and the position that last appeared was-1 (that is, it did not appear), so move back 5-(- 1) = 6 bits, that is, the entire string is moved to the next bit of "F".

There are three points to note in this rule:

(1) the position of "good suffix" shall be one character. Assuming that the "EF" of "ABCDEF" is a good suffix, its position is based on "F", that is, 5 (calculated from 0).

(2) if the "good suffix" appears only once in the search term, its last occurrence position is-1. For example, if "EF" appears only once in "ABCDEF", its last occurrence is-1 (that is, it does not appear).

(3) if there are multiple "good suffixes", the last occurrence of the other "good suffixes" must be in the head except the longest one. For example, suppose the "good suffix" of "BABCDAB" is "DAB", "AB" and "B". What is the last location of the "good suffix"? The answer is that the good suffix used at this time is "B", and its last occurrence position is the head, that is, the 0th bit. This rule can also be expressed like this: if the longest "good suffix" appears only once, the search term can be rewritten in the following form for location calculation "(DA) BABCDAB", that is, the virtual addition of the leading "DA".

Go back to the example above. At this point, of all the "good suffixes" (MPLE, PLE, LE, E), only "E" appears in the "EXAMPLE", so move back 6-0 = 6 digits.

twelve。

As you can see, the "bad character rule" can only move 3 bits, and the "good suffix rule" can move 6 bits. Therefore, the basic idea of the Boyer-Moore algorithm is to move the larger of the two rules each time.

More cleverly, the number of bits moved by these two rules is only related to the search term, not to the original string. Therefore, "bad character rule table" and "good suffix rule table" can be generated in advance. When in use, just look up the table and compare it.

13.

Continuing to compare from the tail, "P" does not match "E", so "P" is a "bad character". According to the Bad character Rule, move back 6-4 = 2 digits.

14.

Compare bit by bit from the tail and find all matches, so the search ends. If you want to continue to search (that is, find all matches), according to the "good suffix rule", move back 6-0 = 6 bits, that is, the "E" of the head is moved to the "E" position of the tail.

This is the end of the Boyer-Moore algorithm on how to understand string matching. I hope the above content can be of some help and learn more knowledge. If you think the article is good, you can share it for more people to see.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.