This article explains how to improve the efficiency of large-scale regular expression matching. The explanation is kept simple and clear, so let's study it together.
Background
Regular expressions are widely used in daily work: rules are defined as regular expressions and then matched against data. Let's first look at the regex requirements in two security scenarios.
Scenario 1: data theft after an FTP account is successfully brute-forced
Data source: FTP server logs
Correlation logic: brute-force attempts against a specific account, then a successful login with that account, then a large number of file downloads by that account
Alert content: FTP account ${user_name} was successfully brute-forced and used to steal data
Alert level: high risk
In scenario 1, regular expressions are used to match the account login events in the logs.
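As a minimal illustration of that kind of log matching, here is a sketch using java.util.regex; the log line layout and field names are assumptions invented for the example, not the output of any particular FTP server:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FtpLoginMatch {
    // Hypothetical log layout invented for this example.
    private static final Pattern LOGIN = Pattern.compile(
            "(?<result>OK|FAIL) LOGIN: Client \"(?<ip>[\\d.]+)\", user \"(?<user>\\w+)\"");

    public static void main(String[] args) {
        String line = "Mon May 5 12:00:01 2025 [pid 123] "
                + "FAIL LOGIN: Client \"10.0.0.8\", user \"alice\"";
        Matcher m = LOGIN.matcher(line);
        if (m.find()) {
            // A run of FAIL results followed by an OK for the same account
            // is the brute-force-then-login pattern from the correlation logic.
            System.out.printf("user=%s ip=%s result=%s%n",
                    m.group("user"), m.group("ip"), m.group("result"));
        }
    }
}
```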
Scenario 2: deep packet inspection (DPI), e.g. filtering network threats and traffic that violates security policies
Data source: network packets
Detection condition: the data hits the rule set
In scenario 2, regular expressions are used for security detection across multiple packets over time.
In fact, scenario 1 lists only one way of attacking FTP; there are many other FTP attack methods, so another characteristic of the regex matching scenario for FTP attack detection is that the whole rule set can be very large. In scenario 2, a pattern set is built from known intrusions, and network packets are inspected for behavior that violates the security policy or shows signs of an attack. This requires inspecting the payload of each packet, and matching must be very fast, otherwise the user experience suffers.
On the other hand, the way regexes are used here differs considerably from the traditional usage, in which text is matched against one or a few rules to find the matching data. The problem we now face is, first, the sheer number of rules: sets of tens of thousands, or more than a hundred thousand, of expressions. If we keep the old practice of joining rules with |, or looping over them in an outer layer (see the naive sketch after this list), processing time and resource consumption become unacceptably large. Second, the data to be matched is not a complete whole: network packets, for example, arrive one by one in streaming form, and traditional regex engines do not handle streaming data well; they must buffer a batch of data before matching, so matching is not timely. Finally, there is a well-known pitfall in regex processing: a badly written regular expression can make matching very slow. We therefore need a solution that addresses these challenges:
A large number of rules
Fast matching
Support for streaming data
Modest resource consumption
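For reference, the naive outer-loop approach mentioned above looks like the following sketch: every rule is compiled separately and tried in turn, so the cost per input grows linearly with the number of rules:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class NaiveMultiMatch {
    // Naive multi-pattern matching: try every compiled rule in turn.
    // With 100,000 rules, every input pays the cost of 100,000 separate scans.
    static List<Integer> matchAll(List<Pattern> rules, String input) {
        List<Integer> hitIds = new ArrayList<>();
        for (int id = 0; id < rules.size(); id++) {
            if (rules.get(id).matcher(input).find()) {
                hitIds.add(id); // record which rule hit
            }
        }
        return hitIds;
    }
}
```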
Introduction to the Hyperscan operator
In view of the challenges above, after investigating and comparing the mainstream regex matching engines on the market, we finally chose Hyperscan.
Hyperscan is a high-performance regular expression matching library open-sourced by Intel. It provides a C API and has been used in many commercial and open-source projects.
Hyperscan has these features:
Supports most PCRE regex syntax (all of it when the Chimera library is used)
Supports streaming matching
Supports multi-pattern matching
Uses specific instruction sets to accelerate matching
Easy to extend
Combines multiple matching engines internally
Hyperscan was designed from the start to handle streaming matching and multi-pattern matching well. Stream-mode support is very convenient for users: there is no need to keep track of previously received data or to buffer it. Multi-pattern matching allows many regular expressions to be passed in and matched at the same time.
Because specific instruction sets are used, Hyperscan has CPU requirements, as shown in the following figure:
The CPU must support at least the SSSE3 instruction set; the instruction sets in the bottom row can further accelerate matching.
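As a quick sanity check on Linux, the flags can be read from /proc/cpuinfo before deploying (the Hyperscan C API also provides hs_valid_platform() for a runtime check). A minimal sketch; treating avx2 and avx512bw as the optional acceleration targets is based on Hyperscan's documentation of its specialized code paths:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class CpuFlagCheck {
    public static void main(String[] args) throws Exception {
        // Linux-specific: scan the kernel's CPU flag list for the instruction
        // sets Hyperscan requires (ssse3) or can exploit (avx2, avx512bw).
        String cpuinfo = Files.readString(Path.of("/proc/cpuinfo"));
        for (String flag : new String[] {"ssse3", "avx2", "avx512bw"}) {
            System.out.println(flag + ": " + (cpuinfo.contains(flag) ? "yes" : "no"));
        }
    }
}
```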
Like most regex engines, Hyperscan has a compilation phase and a matching phase. Compilation parses the regular expressions and builds them into a database in Hyperscan's internal format, which can then be reused across many matches. In multi-pattern matching, each regular expression needs a unique id, and that id is reported during matching. The compilation process is shown in the following figure:
By default, Hyperscan reports all hits during matching, unlike some engines where a greedy expression yields the greedy result and a lazy expression the lazy one. When a hit occurs, the user is notified through a callback function of which regular expression id matched and where. The matching process is shown in the following figure:
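To make the two phases concrete, here is a small sketch using the open-source hyperscan-java binding (gliwka/hyperscan-java), which wraps the C API; the class and method names follow that project's documented usage and may differ across versions:

```java
import com.gliwka.hyperscan.wrapper.Database;
import com.gliwka.hyperscan.wrapper.Expression;
import com.gliwka.hyperscan.wrapper.Match;
import com.gliwka.hyperscan.wrapper.Scanner;

import java.util.List;

public class HyperscanTwoPhases {
    public static void main(String[] args) throws Exception {
        // Compilation phase: many expressions become one reusable database.
        List<Expression> expressions = List.of(
                new Expression("FAIL LOGIN"),
                new Expression("OK LOGIN"),
                new Expression("RETR .*\\.sql"));
        try (Database db = Database.compile(expressions);
             Scanner scanner = new Scanner()) {
            scanner.allocScratch(db);

            // Matching phase: a single pass reports every hit. The C API hands
            // each hit's expression id to a callback; this binding instead
            // collects the hits and maps them back to the Expression objects.
            List<Match> matches = scanner.scan(db, "230 OK LOGIN ... RETR dump.sql");
            for (Match m : matches) {
                System.out.println("hit '" + m.getMatchedExpression().getExpression()
                        + "' ending at offset " + m.getEndPosition());
            }
        }
    }
}
```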
Hyperscan's disadvantage is that it runs only on a single machine and has no distributed capability: it solves the latency problem but not the throughput problem. For throughput we can rely on the mainstream real-time computing framework Flink. Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams. An unbounded stream has a beginning but no end, and computing over it is stream processing; a bounded stream has both a beginning and an end, and computing over it is batch processing.
Flink fits many computing scenarios; here are three. Flink can power event-driven applications, and beyond simple events it provides a CEP library for handling complex events. Flink can act as a data pipeline, performing cleaning, filtering, transformation, and similar operations while moving data from one storage system to another. And Flink can do streaming or batch data analysis and metric computation, for example for dashboard display. Flink has become the industry's recognized first choice for stream processing.
Integrating the regex matching engine into Flink lets it benefit from Flink's powerful distributed capabilities; combining the two strengths yields something more powerful still. So the following solution is provided, as shown in the figure below:
The solution implements a custom UDF operator that supports matching only a specified subset of fields in the input data. The operator's output contains the matched field's text and the final match state: hit, miss, error, or timeout. For a hit, it also returns the id of the regular expression that matched. The output also carries the original input record, so any downstream processing is unaffected. To make the operator easier to use, a new DataStream called HyperscanStream is extended to encapsulate the operator: users simply convert a DataStream to a HyperscanStream and then invoke the regex operator through a method call. The whole solution ships to users as a separate jar, which preserves the usual way of writing Flink jobs and stays decoupled from Flink's core framework.
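The operator's code is not shown in the article, but its surface can be sketched from the names mentioned above; everything below is a hypothetical reconstruction for illustration, not the actual internal API:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

import java.io.Serializable;
import java.util.List;
import java.util.Map;

// Hypothetical reconstruction of the operator's public surface, built only
// from the names mentioned in the text; the real internal signatures differ.
public final class HyperscanStream<T> {
    private final DataStream<T> inner;

    private HyperscanStream(DataStream<T> inner) { this.inner = inner; }

    // Wrap an ordinary DataStream.
    public static <T> HyperscanStream<T> of(DataStream<T> stream) {
        return new HyperscanStream<>(stream);
    }

    // Apply the Hyperscan operator; the function selects which fields of
    // each record should be matched.
    public DataStream<Tuple2<T, List<HyperScanRecord>>> hyperscan(HyperscanFunction<T> fn) {
        // Implementation omitted in this sketch: forwards each record's
        // selected fields to the Hyperscan child process and emits
        // (original record, match results).
        throw new UnsupportedOperationException("sketch only");
    }

    // Selects (field name -> field text) pairs to be matched.
    public interface HyperscanFunction<T> extends Serializable {
        Map<String, String> extractFields(T record);
    }

    // One result row: the matched field, the rule id (when the state is HIT),
    // and the final state of the match.
    public static final class HyperScanRecord implements Serializable {
        public String field;
        public Long expressionId; // null unless state == HIT
        public String state;      // HIT, MISS, ERROR or TIMEOUT
    }
}
```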
Data flows like this: the source reads a record and hands it to the downstream Hyperscan operator; the operator passes the data to the Hyperscan child process; after the child process finishes matching, the result is returned to the operator, which then forwards the original record together with the match result to the subsequent operators.
Using the operator in private deployments
For private deployments, usage is as follows: the user first edits a regular expression file, then uses a tool to compile the expressions into a database and serialize it to a local file. If the deployment environment has HDFS, the serialized file can be uploaded to HDFS; if not, it stays local. The user then develops a Flink job that references the serialized file to match the data.
Why compile and serialize the regular expressions with a tool instead of referencing them directly in the Flink job? As mentioned earlier, Hyperscan execution has a compilation phase and a matching phase. If the job referenced only raw regular expressions and its parallelism were set to 5, each task would compile them once, five times in total, which wastes resources; and compilation is a relatively slow operation in Hyperscan, so the compilation step is isolated to speed up Flink job startup. Compiling ahead of time also reveals syntax errors or unsupported expressions before the job starts, rather than after.
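A minimal sketch of what such a compile-and-serialize tool could look like, again in terms of the hyperscan-java binding. The underlying C API calls are hs_serialize_database/hs_deserialize_database; whether and how the Java binding exposes them (the save call below) is an assumption:

```java
import com.gliwka.hyperscan.wrapper.Database;
import com.gliwka.hyperscan.wrapper.Expression;

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CompileTool {
    public static void main(String[] args) throws Exception {
        // args[0]: rule file, one regular expression per line.
        // args[1]: output path for the serialized database.
        List<Expression> expressions = new ArrayList<>();
        for (String line : Files.readAllLines(Path.of(args[0]))) {
            if (!line.isBlank()) {
                expressions.add(new Expression(line.trim()));
            }
        }
        // Compiling here, once, means syntax errors and unsupported
        // expressions surface before the Flink job ever starts.
        try (Database db = Database.compile(expressions);
             OutputStream out = new FileOutputStream(args[1])) {
            db.save(out); // assumed serialization entry point; see lead-in
        }
    }
}
```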
For private deployments, the Hyperscan-dependent program shipped to the user is statically compiled, so no extra dependencies need to be installed, as long as the machine supports the required instruction sets.
Internal use within the company
Usage example
Suppose you want to match the Host and Referer fields of an HTTP message, as shown in the following figure:
An example of the code is as follows:
The logic has four steps. Step one builds the input stream from the data source. Step two converts the input stream to a HyperscanStream. Step three calls the hyperscan method to apply the Hyperscan operator; its first parameter, a HyperscanFunction, specifies that the fields to match are Host and Referer. Step four consumes the returned result, a Tuple2 whose first field, Event, is the original record (here the entire HTTP message) and whose second field is a List of HyperScanRecord; each HyperScanRecord includes the matched field (Host or Referer in this example), the regular expression id (if the match hit), and the final state of the match.
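Since the original listing is an image, here is a hedged reconstruction of the four steps, built on the hypothetical HyperscanStream sketch from the previous section; the Event class and its fields are likewise assumptions:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.List;
import java.util.Map;

public class HyperscanJobSketch {
    // Minimal stand-in for the article's Event type: the raw HTTP message
    // plus the parsed Host and Referer headers (assumed layout).
    public static class Event {
        public String rawMessage, host, referer;
        public Event(String raw, String host, String referer) {
            this.rawMessage = raw; this.host = host; this.referer = referer;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 1: build the input stream from the data source.
        DataStream<Event> input = env.fromElements(
                new Event("GET / HTTP/1.1 ...", "example.com", "http://evil.example/"));

        // Step 2: convert the DataStream to a HyperscanStream.
        HyperscanStream<Event> hs = HyperscanStream.of(input);

        // Step 3: apply the operator, matching only the Host and Referer fields.
        DataStream<Tuple2<Event, List<HyperscanStream.HyperScanRecord>>> matched =
                hs.hyperscan(e -> Map.of("Host", e.host, "Referer", e.referer));

        // Step 4: consume the result; f0 is the original Event, f1 the records.
        matched.print();

        env.execute("hyperscan matching job");
    }
}
```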
Tested with a rule set of 10,000 expressions against samples of various sizes to be matched, the solution achieves the desired performance. The test results are as follows:
Here are some suggestions for using the Hyperscan operator, as shown in the following figure:
As mentioned earlier, without the Chimera library Hyperscan does not support some PCRE syntax, so take note when using it; the unsupported constructs are listed in the following figure (note that using the Chimera library affects matching performance).