Discussion on Sample Collection for Machine Learning in the Free Web Application Firewall hihttps

2025-07-03 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

hihttps is a free web application firewall (WAF). It supports the traditional feature-engineering checks of an OWASP-style WAF (SQL injection, XSS, malicious vulnerability scanning, password cracking, CC attacks, DDoS and so on), and it also supports unsupervised machine learning from collected samples and autonomous confrontation, with the aim of redefining web security. In this article, the author introduces sample collection for machine learning from the perspective of web security.

I. What exactly is a web attack?

Network security experts generally agree that many malicious web requests have distinctive URL characteristics, such as:

Malicious scanning: GET /hihttps?cat=../../../etc/passwd

SQL injection: GET /hihttps?user=123' or 1

XSS: GET /hihttps?user=<script>alert(1);</script>

Requests of this kind do have very typical characteristics and can be blocked by a traditional WAF and its rules.

What about the URLs below?

GET /hihttps?user=ls123

GET /hihttps?user=%0Als

Strictly speaking, even network security experts would regard requests like these as normal, or would be unable to tell whether they are malicious.

The question is: is a request like GET /hihttps?user=ls123 necessarily normal? Not necessarily.

Take the high-severity vulnerability CVE-2019-11043 as an example: on affected Nginx + PHP-FPM servers, sending %0a in the URL can lead to arbitrary remote command execution and control of the entire server. In other words, GET /hihttps?user=%0Als is a successful attack in some environments and executes the Linux ls command; if the server does not run Nginx + PHP, the same request can be considered harmless.

Furthermore, if the site has no hihttps interface at all, the request is malicious scanning and must be detected and blocked. Traditional rule-based methods cannot make this distinction, so in the author's view machine learning is the only way forward.
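The contrast in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three regular expressions are hypothetical, simplified signatures written for this example, not the rules used by hihttps or by any OWASP rule set.

```python
import re
from urllib.parse import unquote

# Hypothetical, deliberately simplified signatures (assumption: real WAF
# rule sets are far larger and more careful than these three patterns).
SIGNATURES = {
    "path traversal": re.compile(r"\.\./"),
    "sql injection": re.compile(r"'\s*or\s+\d", re.IGNORECASE),
    "xss": re.compile(r"<script|alert\s*\(", re.IGNORECASE),
}


def match_signatures(request_line):
    """Return the names of all signatures that match the decoded request line."""
    decoded = unquote(request_line)  # decode %XX escapes before matching
    return [name for name, pattern in SIGNATURES.items() if pattern.search(decoded)]


requests = [
    "GET /hihttps?cat=../../../etc/passwd",
    "GET /hihttps?user=123' or 1",
    "GET /hihttps?user=<script>alert(1);</script>",
    "GET /hihttps?user=ls123",
    "GET /hihttps?user=%0Als",
]
for line in requests:
    print(line, "->", match_signatures(line) or "no signature matched")
```

The first three requests each hit a signature, while the last two pass untouched: a signature can only describe what an attack looks like, not what this particular server normally receives.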

II. Machine learning redefines network security

Compared with machine learning on graphics and images, samples for web security are the cheapest to collect: running software on the server, or even just reading the web log files, yields a large number of samples at almost zero cost.
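As a minimal sketch of the log-reading approach, the snippet below extracts request samples from access log lines. It assumes the widely used "combined" log format of Nginx and Apache; the sample line and the field names are made up for illustration, and a site with a custom log format would need a different pattern.

```python
import re

# Assumption: "combined" access log format, e.g.
# 203.0.113.7 - - [03/Jul/2025:10:15:32 +0800] "GET /x?id=1 HTTP/1.1" 200 512 "-" "UA"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return the request fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


example = (
    '203.0.113.7 - - [03/Jul/2025:10:15:32 +0800] '
    '"GET /hihttps.html?id=123 HTTP/1.1" 200 512 "-" "Mozilla/5.0"'
)
sample = parse_log_line(example)
print(sample["ip"], sample["method"], sample["url"])
```

A log line of this kind carries the request line, status and a few selected headers, but not the full header set or the request body.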

But attack samples are scarce, can never be collected exhaustively, and change from day to day. From this point of view, unsupervised or semi-supervised learning is the future direction of web security. For example:

Suppose the samples collected from the web server for http://www.hihttps.com/hihttps.html?id=123 all have parameters of the form "?id=<number>". Then the following URLs can all be considered attacks:

http://www.hihttps.com/hihttps.html?id=123' or 1 # 1

http://www.hihttps.com/hihttps.html?id=<script>alert(1)</script>

http://www.hihttps.com/hihttps.html?id=1234567890&t=123

http://www.hihttps.com/hihttps.html?id=abc

The latter two URLs cannot be detected by a traditional WAF and can only be detected accurately by machine learning. The core idea of machine learning here is: anything that has not been seen on my server is considered illegal, which makes it possible to defend against unknown vulnerabilities and unknown attacks. This concept of web security is completely different from traditional feature engineering; in this sense, machine learning redefines network security.
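The "anything not seen on my server is illegal" idea can be sketched as a per-path parameter profile learned from normal samples. This is a toy illustration, not the hihttps algorithm: the class name `ParamProfile`, the coarse character classes and the length tolerance `LENGTH_SLACK` are all assumptions made for this example.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def value_class(value):
    """Map a parameter value to a coarse character class."""
    if re.fullmatch(r"[0-9]+", value):
        return "digits"
    if re.fullmatch(r"[A-Za-z]+", value):
        return "letters"
    if re.fullmatch(r"[A-Za-z0-9_.-]+", value):
        return "token"
    return "other"  # empty values, quotes, spaces, brackets, control characters


class ParamProfile:
    LENGTH_SLACK = 2  # hypothetical tolerance: allow up to 2x the longest value seen

    def __init__(self):
        self.paths = {}  # path -> {param name -> {"classes": set, "max_len": int}}

    def learn(self, url):
        """Record the parameters of one normal sample."""
        parts = urlsplit(url)
        params = self.paths.setdefault(parts.path, {})
        for name, value in parse_qsl(parts.query, keep_blank_values=True):
            seen = params.setdefault(name, {"classes": set(), "max_len": 0})
            seen["classes"].add(value_class(value))
            seen["max_len"] = max(seen["max_len"], len(value))

    def check(self, url):
        """Return a list of reasons why the URL deviates from the profile (empty if none)."""
        parts = urlsplit(url)
        if parts.path not in self.paths:
            return ["unknown path " + parts.path]
        params = self.paths[parts.path]
        reasons = []
        for name, value in parse_qsl(parts.query, keep_blank_values=True):
            if name not in params:
                reasons.append("unknown parameter " + name)
                continue
            seen = params[name]
            if value_class(value) not in seen["classes"]:
                reasons.append("unexpected value type for " + name)
            if len(value) > seen["max_len"] * self.LENGTH_SLACK:
                reasons.append("unusually long value for " + name)
        return reasons


base = "http://www.hihttps.com/hihttps.html"
profile = ParamProfile()
for number in (1, 42, 123, 4521):  # normal samples: "?id=<number>"
    profile.learn(base + "?id=" + str(number))

for query in ("?id=77", "?id=abc", "?id=1234567890&t=123", "?id=123' or 1"):
    print(query, "->", profile.check(base + query) or "looks normal")
```

With only numeric `id` samples in the profile, `id=abc` is rejected for its value type and `id=1234567890&t=123` for its length and its unknown parameter `t`, even though neither contains anything a signature would recognize.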

III. Principles of sample collection

1. Sufficient randomization: collect data randomly across different IP addresses.

2. Sufficient samples: to ensure 99.99% accuracy, at least tens of thousands of samples need to be collected.

3. Sufficient time: collect samples over at least 3-7 days, covering different time periods.

4. Normal traffic as far as possible: the samples must not be contaminated by attacks.

5. Complete data: include all HTTP request headers and the body.

For these reasons, reading sample data from web logs is limiting, and it is best to collect samples with a WAF that is actually deployed. For SSL-encrypted traffic, samples are usually collected through a reverse proxy. For reference, see the hihttps source code at https://github.com/qq4108863.
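The principles above can be turned into a rough readiness check on a collected sample set. The sketch below is an assumption-laden illustration: the `Sample` record and the `collection_report` function are hypothetical names, the sample and time thresholds follow the lower bounds stated above, and the minimum number of distinct IP addresses (100) is an arbitrary placeholder. Principle 4 (uncontaminated traffic) cannot be verified by counting and is not covered.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Sample:
    """One complete request sample (principle 5: headers and body included)."""
    client_ip: str
    received_at: datetime
    method: str
    url: str
    headers: dict = field(default_factory=dict)
    body: bytes = b""


def collection_report(samples, min_samples=10_000, min_days=3, min_ips=100):
    """Check principles 1-3: enough source IPs, enough samples, enough time."""
    if not samples:
        return {"enough_ips": False, "enough_samples": False, "enough_time": False}
    times = [s.received_at for s in samples]
    distinct_ips = len({s.client_ip for s in samples})
    return {
        "enough_ips": distinct_ips >= min_ips,
        "enough_samples": len(samples) >= min_samples,
        "enough_time": max(times) - min(times) >= timedelta(days=min_days),
    }


# Tiny synthetic data set: 20 requests from 5 addresses spread over about 4 days.
start = datetime(2025, 1, 1)
samples = [
    Sample("10.0.0.%d" % (i % 5), start + timedelta(hours=5 * i), "GET", "/hihttps.html?id=1")
    for i in range(20)
]
print(collection_report(samples))
```

With the default thresholds this small data set passes only the time check, which is the expected outcome for a collection run that has not yet gathered enough traffic.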

IV. Summary

1. Against today's APT attacks that exploit logic vulnerabilities, traditional WAF rules struggle to deal with unknown vulnerabilities and unknown attacks.

2. Letting machines learn as humans do, with enough intelligence to fight APT attacks automatically, may be the only effective way. But the technology itself is a contest between the best human minds, and web security still has a long way to go.

3. Fortunately, free application firewalls like hihttps have made a good start in machine learning and autonomous confrontation. In the future, web security is likely to be handled by feature engineering and machine learning together. In the next article, the author will introduce how to extract features from samples and automatically generate confrontation rules. The author believes that web security will be dominated by AI in the future.
