What is the research on Webshell detection of HTTPS encrypted traffic

2025-07-19 Update From: SLTechnology News&Howtos

Webshells are a common form of backdoor Trojan in Web attacks. Mainstream detection methods currently rely on content features of HTTP request and response traffic, so under HTTPS many webshell detection mechanisms are powerless. Encrypted webshells such as Ice Scorpion (Behinder) make detection harder still; version 3.0 of Ice Scorpion adopts a pre-shared key mechanism, which makes it difficult to detect even over plain HTTP. By extracting statistical features of network traffic, this paper explores how to classify normal encrypted traffic and webshell encrypted traffic that visit the same HTTPS website, and so identify webshell traffic within HTTPS encrypted traffic. The approach achieves good identification results for both Ice Scorpion 2.0 and the upgraded 3.0.

I. Background

Because Web services are so convenient, they are increasingly used to provide information services. However, the data and user information they hold have also become targets for attackers seeking profit. A webshell is a backdoor program based on a Web service that gives remote access to a range of key functions, such as executing arbitrary commands, traversing file directories, viewing and modifying arbitrary files, and escalating privileges. Effectively identifying webshell files or communications is therefore an urgent problem.

At present there are three established approaches to webshell detection: log detection, file detection and traffic detection. Mainstream methods require manually constructed character features as input. However, attackers can obfuscate webshell code to bypass detection, and the growing popularity of encrypted webshells together with the widespread use of HTTPS makes webshell detection considerably harder.

Using traffic detection based on statistical features, this paper focuses on identifying webshell traffic, including Ice Scorpion traffic, within HTTPS encrypted traffic. It covers data collection, feature extraction, model training and prediction, and feature importance analysis.

II. Ice Scorpion 3.0

Ice Scorpion is a popular encrypted webshell client that establishes an encrypted tunnel during communication to evade detection by security devices. In the recently released version 3.0, the main change affecting communication traffic is the key exchange method; the rest are functional improvements. Apart from some bug fixes, the traffic-related changes are as follows:

1. The dynamic key negotiation mechanism is removed; a pre-shared key is used, with no plaintext interaction.

2. Random redundant parameters are added to the request body so that protective devices cannot identify requests by the size of the request body.

III. Webshell encrypted traffic detection

1. Dataset

To collect traffic data, we set up a website and installed a self-signed certificate so that traffic to the site is encrypted with TLS. As a result, whether the webshell is an ordinary non-encrypted one or an encrypted one such as Ice Scorpion, accessing it through the site generates HTTPS traffic.

So far we have collected six representative types of traffic to the site: normal access traffic, two types of page-type webshell traffic, and three types of client-type webshell traffic. The client types include the most widely used one, China Chopper, and Ice Scorpion, which encrypts its traffic.

Because the purpose of this paper is to classify normal encrypted traffic and webshell encrypted traffic visiting the same HTTPS website, all data is divided into two classes, normal and webshell, labeled 0 and 1 respectively. The label and quantity of each type of data are shown in Table 3.1. The unit of quantity is the number of bidirectional network flows obtained after the packets are parsed by Joy.

Table 3.1 Statistical list of labels and quantities for datasets

2. Feature extraction.

Joy is called to parse the packets and returns the parsing result in JSON format. After processing, five main data elements are extracted, comprising more than 600 feature dimensions:

(1) data flow meta-characteristics

(2) packet length sequence

(3) packet time interval characteristics

(4) packet byte distribution characteristics

(5) TLS characteristics of packets
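As a rough illustration of how the first four data elements could be turned into numeric features, here is a minimal sketch. The input dictionary layout, the key names (`lengths`, `times`, `byte_counts`), the group names and the `MAX_PACKETS` cap are all assumptions made for this example; they are not the parser's real JSON schema or the feature set actually used in the experiments. TLS features (5) are left out because they depend on handshake fields not described here.

```python
from statistics import mean, pstdev

MAX_PACKETS = 50  # hypothetical cap on the packet length sequence


def flow_features(flow):
    """Turn one parsed bidirectional flow into named feature groups.

    `flow` is a hypothetical dict, not the parser's real schema:
      lengths:     payload size of each packet, in bytes
      times:       arrival time of each packet, in seconds
      byte_counts: 256 counters, one per byte value
    """
    lengths = list(flow["lengths"])
    times = list(flow["times"])
    byte_counts = list(flow["byte_counts"])

    # (1) flow metadata: packet count, total bytes, flow duration
    duration = times[-1] - times[0] if len(times) > 1 else 0.0
    meta = [len(lengths), sum(lengths), duration]

    # (2) packet length sequence, truncated or zero-padded to a fixed width
    length_seq = (lengths + [0] * MAX_PACKETS)[:MAX_PACKETS]

    # (3) inter-arrival time statistics: min, max, mean, standard deviation
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    timing = [min(gaps), max(gaps), mean(gaps), pstdev(gaps)]

    # (4) byte distribution, normalised to relative frequencies
    total = sum(byte_counts) or 1
    byte_dist = [count / total for count in byte_counts]

    return {
        "meta": meta,
        "length_seq": length_seq,
        "timing": timing,
        "byte_dist": byte_dist,
    }
```

Keeping the groups separate rather than concatenating them immediately makes it easy to include or exclude a whole group when building the final feature vector.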

Comparing webshell traffic with normal traffic to the same HTTPS website, we observed the characteristics listed in Table 3.2 below.

Table 3.2 Comparison of webshell traffic and normal traffic visiting the same HTTPS website

Regarding entropy: encrypting or encoding data increases its entropy, so a difference between webshell and normal traffic in the entropy features is to be expected.
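To make the entropy remark concrete, the following minimal sketch computes the Shannon entropy of a byte string in bits per byte. The function name `byte_entropy` and the sample inputs are chosen for illustration only; the entropy features used in the experiments come from the packet parser, not from this code.

```python
import math
import os
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Repetitive plaintext uses few distinct byte values, so its entropy is low.
plain = b"GET /index.html HTTP/1.1\r\n" * 100
# Random bytes stand in for encrypted data and approach 8 bits per byte.
random_like = os.urandom(4096)

plain_entropy = byte_entropy(plain)
random_entropy = byte_entropy(random_like)
```

Here `random_entropy` comes out well above `plain_entropy`, which is the effect the paragraph above describes.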

3. Model training and testing

LightGBM is used as the classification model for webshell traffic identification. Some important parameters are set as follows:

learning_rate = 0.1

n_estimators = 200

colsample_bytree = 0.9

num_leaves = 7

subsample = 0.9
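The parameters above map directly onto keyword arguments of LightGBM's scikit-learn style `LGBMClassifier`. The sketch below is one way to wire them up; the names `LGBM_PARAMS` and `build_classifier` are introduced here for illustration, and every setting not listed in the article is left at the library default, which may differ from what the authors used.

```python
# Parameters as listed in the article, spelled as LGBMClassifier keyword arguments.
LGBM_PARAMS = {
    "learning_rate": 0.1,
    "n_estimators": 200,
    "colsample_bytree": 0.9,
    "num_leaves": 7,
    "subsample": 0.9,
}


def build_classifier(**overrides):
    """Create an LGBMClassifier with the listed parameters.

    The import is done lazily so that LGBM_PARAMS can be inspected
    even on a machine where lightgbm is not installed.
    """
    from lightgbm import LGBMClassifier

    return LGBMClassifier(**{**LGBM_PARAMS, **overrides})
```

A typical call would be `model = build_classifier()` followed by `model.fit(X_train, y_train)`. Note that in LightGBM row subsampling only takes effect when the bagging frequency (`subsample_freq` in this wrapper) is greater than zero; the article does not say whether that was set.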

(1) Before the release of Ice Scorpion 3.0, we had collected all the data in Table 3.1 except the Ice Scorpion 3.0 traffic, and carried out the following three small experiments.

Experiment 1: using all the traffic, 20% is randomly selected as the test set; of the remaining data, 20% is randomly selected as the validation set and the other 80% is used as the training set. The experimental results show that only one flow in the test set is misclassified, for an accuracy of 96.9%.
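The split in Experiment 1 can be sketched as below, reading it as 20% of all flows for testing, then 20% of what remains for validation and the rest for training. That reading, the function name `three_way_split` and the fixed seed are assumptions for this example; the article does not give the exact procedure.

```python
import random


def three_way_split(items, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle items, then split into (train, validation, test).

    test_frac is taken from the whole set; val_frac is taken from
    what remains after the test set has been removed.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)

    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


# Example with 100 placeholder flow ids.
train_ids, val_ids, test_ids = three_way_split(range(100))
```

With 100 items this yields 64 training, 16 validation and 20 test items, and every item lands in exactly one of the three sets.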

Experiment 2: because the small amount of data is likely to lead to over-fitting, a cross-validation-style check is carried out: the page-type No. 2 data is held out as the test set, and the rest of the data is used for training and validation. The results show that 15 of the 16 flows are predicted correctly, an accuracy of 93.7%.

Experiment 3: because Ice Scorpion 2.0 was the only encrypted webshell in the dataset at that time, the Ice Scorpion 2.0 data is held out as the test set, and the rest of the data is used for training and validation. The results show that 18 of the 22 flows are predicted correctly, an accuracy of 81.8%.

Experiments 1 and 2 show that, after training, the model is good at distinguishing normal traffic from webshell traffic to the same HTTPS website. Experiment 3 suggests that whether the webshell itself encrypts its traffic has little effect on the model's recognition ability: even with no encrypted webshell in the training set, the model still identifies most of the Ice Scorpion traffic (18 of 22 flows).
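Experiments 2 and 3 both hold out one entire traffic type as the test set, so the model never sees that type during training. A minimal sketch of that kind of split follows; the record layout `(traffic_type, features, label)` and the type names in the sample data are hypothetical, not the labels from Table 3.1.

```python
def hold_out_type(flows, held_out):
    """Split flows so that one traffic type is used only for testing.

    flows: list of (traffic_type, features, label) tuples.
    Returns (train_and_validation, test).
    """
    train = [flow for flow in flows if flow[0] != held_out]
    test = [flow for flow in flows if flow[0] == held_out]
    return train, test


# Toy records with made-up type names and feature vectors.
sample_flows = [
    ("normal", [0.1, 0.2], 0),
    ("page_1", [0.7, 0.9], 1),
    ("encrypted_client", [0.8, 0.8], 1),
    ("encrypted_client", [0.9, 0.7], 1),
]
train_part, test_part = hold_out_type(sample_flows, "encrypted_client")
```

Because the held-out type is entirely absent from training, accuracy on it measures how well the model generalises to an unseen kind of webshell rather than how well it memorised known ones.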

(2) Ice Scorpion recently released version 3.0. Its functional improvements have little impact on traffic identification, but we wanted to see how the change in key exchange affects it, so we continued from the previous experiments. We newly collected Ice Scorpion 3.0 traffic to the self-built HTTPS website. Because the website's TLS certificate had changed, the following experiments use only the first four data elements, comprising more than 400 feature dimensions, to keep TLS-related features from influencing the results.
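Restricting the model to the first four data elements amounts to leaving the TLS group out when the feature vector is assembled. The sketch below shows one way to do that; the group names in `GROUP_ORDER` and the function `build_vector` are hypothetical and only illustrate the idea.

```python
# Hypothetical names for the five data elements, in a fixed order.
GROUP_ORDER = ["meta", "length_seq", "timing", "byte_dist", "tls"]


def build_vector(groups, exclude=("tls",)):
    """Concatenate feature groups in a fixed order, skipping excluded ones."""
    vector = []
    for name in GROUP_ORDER:
        if name in exclude:
            continue
        vector.extend(groups.get(name, []))
    return vector


# Toy feature groups for a single flow.
toy_groups = {
    "meta": [1, 2],
    "length_seq": [3],
    "timing": [4],
    "byte_dist": [5],
    "tls": [9, 9],
}
without_tls = build_vector(toy_groups)
with_tls = build_vector(toy_groups, exclude=())
```

Using a fixed group order keeps each feature at the same column index across flows, which matters when feature importance is later reported by column number.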

Experiment 4: all the previous data is used for training and validation, and the Ice Scorpion 3.0 traffic is fed into the trained model as the test set. The results show that 23 of the 24 flows are predicted correctly, an accuracy of 95.8%.

Experiment 5: the previous data includes traffic generated by Ice Scorpion 2.0. To rule out any effect of similarity between Ice Scorpion versions, the Ice Scorpion 2.0 data is removed from the training set, so that the training set contains no encrypted webshell traffic at all, and the Ice Scorpion 3.0 traffic is again used as the test set. The results show that 23 of the 24 flows are predicted correctly, an accuracy of 95.8%.

Experiment 4 shows that a model able to distinguish normal traffic from webshell traffic to the same HTTPS website can still identify the new version's traffic; that is, the changes in Ice Scorpion 3.0 do not impair the model's recognition ability. Experiment 5 supports the same conclusion as Experiment 3: whether the webshell itself encrypts its traffic has little effect on the model's recognition ability.
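The accuracies quoted for Experiments 2 to 5 follow directly from the flow counts given in the text. The short check below recomputes them; the dictionary `REPORTED_COUNTS` simply restates those counts, and Experiment 1 is omitted because its test set size is not stated.

```python
def accuracy(correct, total):
    """Fraction of flows predicted correctly."""
    return correct / total


# (correct, total) flow counts as stated in the article.
REPORTED_COUNTS = {
    "Experiment 2": (15, 16),
    "Experiment 3": (18, 22),
    "Experiment 4": (23, 24),
    "Experiment 5": (23, 24),
}

percentages = {
    name: 100 * accuracy(correct, total)
    for name, (correct, total) in REPORTED_COUNTS.items()
}
```

With test sets this small, a single flow changes the result by roughly four to six percentage points, which is worth keeping in mind when comparing the experiments.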

4. Analysis of feature importance.

Figure 3.1 shows, for each of the five experiments in turn, the 10 most important features. The "column" field gives the feature's dimension index and the "importance" field gives its importance; a higher value means higher importance, and the features are listed in descending order.

Figure 3.1 Feature importance of the five experiments

The last two features observed in Table 3.2, packet count and total entropy, are the 0th and 8th feature dimensions respectively. Figure 3.1 shows that both play an important role, while the TLS features play no role, which is consistent with the observed characteristics. In addition, although feature importance differs from one training run to another, the set of important features stays largely the same; only their relative importance changes.
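A ranking like the one in Figure 3.1 can be produced from any list of per-feature importance scores. The sketch below sorts column indices by importance and keeps the top entries; the function name `top_features` and the toy scores are illustrative. With LightGBM's scikit-learn style classifier, the scores would come from the fitted model's `feature_importances_` attribute.

```python
def top_features(importances, k=10):
    """Return the k most important features as (column, importance) pairs,
    sorted by importance in descending order."""
    ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy importance scores for four feature columns.
toy_importances = [5, 0, 9, 1]
top_two = top_features(toy_importances, k=2)
```

Comparing the column sets returned for different training runs is a simple way to check the observation that the important features stay largely the same while their order shifts.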

IV. Summary

This paper studies the classification of normal encrypted traffic and webshell encrypted traffic to the same HTTPS website. The experimental results show that the trained model not only distinguishes well between normal encrypted traffic and webshell encrypted traffic, including Ice Scorpion, but can also identify encrypted webshells such as Ice Scorpion even when the training set contains only non-encrypted webshells.

Of course, this is only a preliminary exploration, and much work remains. First, data is very important for training machine learning models; the amount of data used in these experiments is small, and although it illustrates the point to some extent, more data is needed for verification. Second, webshell detection needs to draw on security expertise to extract more discriminative features, since good feature engineering directly determines the upper limit of model performance. Finally, data balancing and model tuning are indispensable parts of a machine learning approach and have a positive effect on prediction results.
