Updated 2025-01-28. Source: SLTechnology News&Howtos, Shulou (Shulou.com), Network Security section, 05/31 report.
Many newcomers are unclear about how machine-learning-based WebShell detection works and how it is implemented. This article explains the approach in detail for readers who need it.
I. Overview
A webshell is a malicious script that an attacker uses to escalate and maintain persistent access to a web application that has already been compromised. A webshell cannot itself attack or exploit remote vulnerabilities, so it is always the second step of an attack.
Attackers can upload malicious scripts by taking advantage of common weaknesses such as SQL injection, remote file inclusion (RFI), and FTP, or even by using cross-site scripting (XSS) as part of the attack. Common webshell functions include, but are not limited to, shell command execution, code execution, database enumeration, and file management.
Why use a webshell?
1. Persistent remote access
A webshell usually contains a backdoor that lets the attacker access and control the server remotely at any time. This saves the attacker from having to exploit the vulnerability again on every visit to the compromised server. Attackers may also patch the vulnerability themselves so that no one else can exploit it. In this way the attacker keeps a low profile and avoids any interaction with the administrator. Some popular webshells use password authentication and other techniques to ensure that only the attacker who uploaded the webshell can access it. These techniques include locking the script to specific custom HTTP headers, specific cookie values, specific IP addresses, or a combination of these.
2. Privilege escalation
Unless the server is misconfigured, a webshell runs with the limited privileges of the web server user. Through the webshell, an attacker can attempt privilege escalation by exploiting a local vulnerability on the system to gain root privileges, root being the "superuser" on Linux and other Unix-based operating systems. With access to the root account, an attacker can do almost anything on the system, including installing software, changing permissions, adding and removing users, stealing passwords, and reading email.
After compromising a website, attackers usually mix ASP or PHP backdoor files in with the normal web page files in the site's web directory. They can then access the backdoor through a browser to obtain a command execution environment and so control the web server. There are generally three approaches to webshell detection: traffic-based, agent-based, and log-analysis-based. This article focuses on traffic-based webshell detection.
The overall process is as follows: first collect webshell data; then, drawing on the knowledge of network security experts, observe and analyze webshell traffic against normal traffic and identify features that separate the two; next select binary classification algorithms and train models; then evaluate the algorithms and tune their parameters; and finally deploy the model in production. Each step is explained in turn.
II. Data acquisition
In real environments webshell samples are scarce: among tens of thousands of HTTP flows it is hard to find even one generated by a webshell. Obtaining a large number of high-quality samples is therefore a challenge for machine learning. To address this, we built a simulated webshell intrusion environment, wrote automation scripts based on webshell types and attack behaviors to generate a large amount of webshell traffic at runtime, and collected that traffic with network sniffing tools (such as Wireshark and tcpdump).
2.1 Building data types
Webshells can be roughly divided into the following three categories:
● One-sentence webshell
This webshell is connected to with the "kitchen knife" (caidao) client tool. Its functions include adding and deleting files, database CRUD, and command execution.
● Large webshell ("big horse")
The webshell file is large and contains a lot of server-side code. It is powerful: in addition to adding and deleting files, database CRUD, and command execution, it also includes functions such as privilege escalation, intranet scanning, and reverse shells.
● Small webshell ("pony")
The webshell file is small and contains little code. Its functionality is narrow, implementing only one or two functions, mainly file upload, having the server download a file, or command execution.
When defining the data collection types, traffic is divided by common webshell type combined with the command executed, as follows:
Login requirement:
● direct: no login is required and the webshell can be visited directly. Many functions are available.
● login: a password or account is required to log in; traffic is divided into pre-login and post-login (before/latter)
After the division by login requirement, traffic is further divided by user behavior, that is, the type of operation.
Type of operation:
● cmd: command execution
● file: file operation
● sql: database operation
2.2 Building a data collection environment
The overall environment built for traffic collection:
The system environment is a Linux virtual machine that exchanges network data with the host server over a bridged network connection. The environments in which the various types of webshell run are as follows:
● PHP: phpstudy
● JSP: jspstudy
● ASP: Ajiu AspWebServer + MySQL
The local host uses Wireshark to collect and save the webshell network traffic.
2.3 Traffic data generation
● direct:
Directly accessible webshells are visited in batches by a script to generate webshell access traffic. Traffic for file operations, command execution, and database operations is generated by manual input.
● login:
For webshells that require login, a script uses the selenium module to simulate a browser logging in. This solves the problem of logins failing because of the cookie authentication mechanism. Traffic for file operations, command execution, and database operations is generated by manual input.
● cmd:
Webshell usage is driven by a script, which executes different system commands and obtains the corresponding results.
● caidao:
A script performs interface access and login for kitchen-knife (caidao) type webshells in batches.
2.4 Traffic data classification
Traffic capture files are named (webshell type)_(operation). Traffic is divided by specific type and operation and collected separately, so that each capture contains a single type of traffic.
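The naming rule can be sketched as a small helper. This is only an illustration: the label sets below are assumptions taken from the categories in sections 2.1 to 2.3, and the function names and the `.pcap` extension are hypothetical, not something the original setup specifies.

```python
# Hypothetical label sets, assumed from the categories described above.
WEBSHELL_TYPES = {"direct", "login", "caidao"}
OPERATIONS = {"cmd", "file", "sql"}


def capture_name(webshell_type: str, operation: str, ext: str = "pcap") -> str:
    """Build a capture file name of the form (webshell type)_(operation)."""
    if webshell_type not in WEBSHELL_TYPES:
        raise ValueError(f"unknown webshell type: {webshell_type}")
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    return f"{webshell_type}_{operation}.{ext}"


def parse_capture_name(filename: str) -> tuple:
    """Recover (webshell type, operation) from a capture file name."""
    stem = filename.rsplit(".", 1)[0]
    webshell_type, _, operation = stem.partition("_")
    return webshell_type, operation
```

Keeping the label in the file name means each capture can later be labeled for training without inspecting its contents.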
2.5 Traffic collection
The local host uses Wireshark for traffic collection.
III. Traffic-based webshell detection
3.1 Feature engineering
When building a traffic detection model with machine learning, an important step is mining and analyzing the features of webshell traffic. Feature engineering should combine the characteristics of webshells with relevant expert knowledge. First, based on webshell behavior, the following characteristics of webshells themselves are summarized:
(1) They call system command execution functions, such as eval, system, cmd_shell, and assert.
(2) They call system file manipulation functions, such as fopen, fwrite, and readdir.
(3) They have database operation functions that call the system's own stored procedures to connect to and operate on the database.
(4) They are well hidden and disguised, and can lurk in web source code for a long time.
(5) They have many derived variants that evade detection through custom encryption and decryption functions, XOR, string reversal, compression, truncation and recombination, and so on.
(6) They are accessed by few IP addresses and visited rarely, their pages are isolated, they are not blocked by traditional firewalls, and they leave no system operation log.
(7) They generate payload traffic, which is recorded in the web log.
Traffic detection must distinguish normal visits from webshell visits, so figure 3 gives a simple view of the differences between webshell pages and normal business web pages:
3.1.1 Feature mining
Based on the expert experience and knowledge gathered during feature engineering, together with statistical analysis of actual historical data, the feature analysis proceeds as follows. (Note: disclosing too many technical details is sensitive, so only some features are listed.)
1. Keyword-based features
A webshell performs operations on system calls, system configuration, databases, and files, and this behavior means that the parameters carried in its data flow have some obvious characteristics. The traffic is decoded before keyword matching. After reviewing the ways various webshells operate and statistically analyzing the data flows they generate, a set of keywords was collected and listed, as shown in figure 4. Statistics show that the proportion of these keywords differs greatly between positive and negative samples, which makes them well suited as features. The comparative bar chart of keyword occurrence counts in positive and negative samples (figure 4) shows the difference in distribution.
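A minimal sketch of decode-then-match is shown below. The keyword list is illustrative only: it reuses the function names mentioned in section 3.1, while the full list from figure 4 is not reproduced here. The bounded repeated URL-decoding and the function names are assumptions made for this example.

```python
from urllib.parse import unquote_plus

# Illustrative keywords only (function names mentioned in section 3.1);
# the real keyword list from figure 4 is not available here.
KEYWORDS = ("eval", "system", "cmd_shell", "assert", "fopen", "fwrite", "readdir")


def decode_payload(raw: str, max_rounds: int = 3) -> str:
    """URL-decode repeatedly (bounded) so double-encoded payloads are normalized."""
    text = raw
    for _ in range(max_rounds):
        decoded = unquote_plus(text)
        if decoded == text:
            break
        text = decoded
    return text


def keyword_counts(raw: str) -> dict:
    """Count case-insensitive occurrences of each keyword in the decoded payload."""
    text = decode_payload(raw).lower()
    return {kw: text.count(kw) for kw in KEYWORDS}
```

Each count (or a 0/1 presence flag) can then serve as one column of the feature vector. Other encodings such as base64 would need their own decoding step, which this sketch does not attempt.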
2. Number of GET/POST parameters in traffic
Observation shows that webshell GET/POST requests generally carry relatively few parameters, so the parameter count can be used as a feature.
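As a rough sketch, the parameter count can be computed with the standard library. The example assumes the POST body is form-encoded (`application/x-www-form-urlencoded`); other body formats would need their own parsing. The function name and example URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def param_count(url: str, body: str = "") -> int:
    """Count GET parameters in the URL plus form-encoded POST parameters."""
    query = urlsplit(url).query
    get_params = parse_qsl(query, keep_blank_values=True)
    post_params = parse_qsl(body, keep_blank_values=True)
    return len(get_params) + len(post_params)
```

For instance, `param_count("/index.php?pid=12673&aut=false&type=low")` counts three parameters, while a request that posts a single `ac=...` field counts one.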
3. Information entropy of GET/POST data in traffic
Ordinary requests submit data to the server, and webshells are no exception. However, if the submitted data is encrypted or encoded, its entropy increases. In a normal web business system, if the entropy of the data submitted to one URI is significantly higher than for other pages, the source file behind that URI is more suspicious. Webshells that encrypt their communication generally submit data with unusually high entropy, so they can be detected. For example:
Normal page: "pid=12673&aut=false&type=low"
Webshell: "ac=ferf234cDV3T234jyrFR3yu4F3rtDW2R354"
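The comparison above can be sketched with Shannon entropy, $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the relative frequency of character $i$. The text does not say whether entropy is taken over the whole submitted string or per parameter; the sketch below assumes per-value entropy and keeps the maximum, because for short strings like these the whole-string figure is strongly influenced by the keys and separators. The function names are hypothetical.

```python
import math
from collections import Counter
from urllib.parse import parse_qsl


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def max_value_entropy(data: str) -> float:
    """Highest entropy among the parameter values of a form-encoded string."""
    values = [value for _, value in parse_qsl(data, keep_blank_values=True)]
    return max((shannon_entropy(v) for v in values), default=0.0)


normal = "pid=12673&aut=false&type=low"
webshell = "ac=ferf234cDV3T234jyrFR3yu4F3rtDW2R354"
print(max_value_entropy(normal), max_value_entropy(webshell))
```

With this definition the webshell example scores higher than the normal one. In practice the value would be fed to the classifier as a feature rather than compared against a fixed threshold.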
4. Cookie-based features
HTTP is a stateless protocol, so in normal HTTP access the server does not automatically maintain the client's context information and uses a session to store it. The session is stored on the server side. To keep server-side storage costs down, the server returns a cookie recording the session ID when an HTTP request arrives; the browser saves it locally and sends it with the next request. A cookie mainly consists of a name, value, expiration time, path, and domain; the path and domain together define the cookie's scope. Observation and analysis show that some cookies generated by webshells are empty and some contain key-value pairs, but the number of pairs is generally very small and the names carry no real meaning. This feature is therefore extracted to distinguish webshell visits from normal website visits.
In addition, the key-value pairs in webshell cookies are often obfuscated, unlike normal traffic where they are regular or have readable meaning. In the webshell cookie below, for example, both the key and the value are obfuscated. The entropy of the key-value pairs is therefore selected as a feature.
Cookie:KCNLMSXUMLVECYYYBRTQ=DFCXBTJMTFLRLRAJHTQLDNOXSKXPZEIXJUFVNNTA
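A minimal sketch of the two cookie features (number of key-value pairs and their entropy) follows. How the entropy is aggregated is not stated in the text; averaging the entropy of each key plus value is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cookie_features(header_value: str) -> dict:
    """Extract the pair count and mean pair entropy from a Cookie header value."""
    pairs = []
    for part in header_value.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        pairs.append((key.strip(), value.strip()))
    # Assumption: entropy is computed over key and value concatenated.
    entropies = [shannon_entropy(key + value) for key, value in pairs]
    mean_entropy = sum(entropies) / len(entropies) if entropies else 0.0
    return {"pair_count": len(pairs), "mean_pair_entropy": mean_entropy}


example = "KCNLMSXUMLVECYYYBRTQ=DFCXBTJMTFLRLRAJHTQLDNOXSKXPZEIXJUFVNNTA"
print(cookie_features(example))
```

An empty cookie yields a pair count of 0, which covers the "empty cookie" case described above.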
5. Structural similarity of returned web pages
When carrying out webshell privilege escalation attacks, attackers usually use existing webshell tools, for example taking a large ("big horse") webshell and using it directly or with minor modifications. Many returned pages are therefore structurally similar, and web page structural similarity can be extracted as a feature. The design idea is to compare the structure of the returned page with the page structures produced by the collected webshells and use the resulting similarity as a feature.
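The article does not say how structural similarity is measured. One simple way to approximate it, shown here purely as an assumption, is to reduce each page to its sequence of opening tags and compare sequences with `difflib.SequenceMatcher`. All names below are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list:
    """Reduce a page to its tag sequence, ignoring text and attributes."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structure_similarity(page_a: str, page_b: str) -> float:
    """Similarity in [0, 1] between the tag sequences of two pages."""
    matcher = SequenceMatcher(None, tag_sequence(page_a), tag_sequence(page_b),
                              autojunk=False)
    return matcher.ratio()


def max_similarity_to_known(page: str, known_pages: list) -> float:
    """Feature value: best match against pages returned by collected webshells."""
    return max((structure_similarity(page, k) for k in known_pages), default=0.0)
```

Because text and attributes are ignored, a lightly modified copy of a known webshell page still scores close to 1.0, which matches the "direct use or minor modifications" case described above.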
6. Depth of the web page path
After successfully compromising a website and planting webshell pages, attackers usually need these backdoors to stay hidden. The webshell's path is therefore relatively deep, so normal visitors are unlikely to find it.
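Path depth is straightforward to compute. The sketch below counts non-empty path segments, including the file name; whether the original feature counts the file name is not stated, so that choice is an assumption, and the example URLs are made up.

```python
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Number of non-empty segments in the URL path (file name included)."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])
```

For example, `path_depth("http://example.com/index.php")` is 1, while a page buried at `/uploads/2019/img/cache/a.php` has depth 5.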
7. Access time period
Webshell access times differ from those of normal business: attackers usually choose to visit when normal traffic is scarce. Time features are therefore extracted as a dimension. The time category can be expanded into several finer features: hour of the day (hour_0-23), day of the week (week_monday …), week of the year, quarter of the year, and weekday or weekend.
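The expansion into finer time features can be sketched as follows. The `hour_N` and `week_<day>` names follow the pattern given in the text; the remaining key names and the use of ISO week numbering are assumptions.

```python
from datetime import datetime

DAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"]


def time_features(ts: datetime) -> dict:
    """Expand a request timestamp into one-hot and categorical time features."""
    # One-hot hour of day: hour_0 ... hour_23
    features = {f"hour_{h}": int(ts.hour == h) for h in range(24)}
    # One-hot day of week: week_monday ... week_sunday
    for index, name in enumerate(DAY_NAMES):
        features[f"week_{name}"] = int(ts.weekday() == index)
    # ISO week number (assumption: ISO 8601 week numbering)
    features["week_of_year"] = ts.isocalendar()[1]
    features["quarter"] = (ts.month - 1) // 3 + 1
    features["is_weekend"] = int(ts.weekday() >= 5)
    return features
```

One-hot encoding the hour and weekday avoids implying an ordering between, say, hour 23 and hour 0, which suits algorithms such as logistic regression and SVM.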
8. Presence of a referer
In traffic, if a page was not reached by a jump from a previous page, the referer parameter is empty. Small ("pony") and one-sentence webshells generally have almost no jump relationships, and the login home page of a large ("big horse") webshell is not linked from a previous page, so this feature is chosen as an auxiliary signal.
3.1.2 Feature extraction
In summary, many features are extracted, such as keywords, depth of the web page path, number of cookie key-value pairs, structural similarity of returned pages, POST/GET entropy, and cookie key-value pair entropy (note: disclosing too many technical details is sensitive, so only some features are described). The data generated in the data acquisition phase serves as the data source for generating the model's features, and the features are then normalized. The data contains 60349 normal flows and 51070 webshell flows.
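The text does not name the normalization method. As one common option, assumed here for illustration only, min-max scaling maps each feature column to the range [0, 1]. The function name and sample values are hypothetical.

```python
def min_max_normalize(rows: list) -> list:
    """Scale each feature column to [0, 1]; constant columns become 0.0."""
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized
```

The minimum and maximum should be computed on the training data and reused at detection time, so that production traffic is scaled the same way as the training set.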
3.2 Model construction and evaluation
Four algorithms, AdaBoost, SVM, random forest, and logistic regression, are selected for training on 60349 normal flows and 51070 webshell flows. The training results of each model are compared in figure 5 below. Because the detection performance of the algorithms is similar, random forest is selected as the production model algorithm, with the aim of minimizing running time and maximizing interpretability.
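The metrics behind figure 5 are not listed in the text. A comparison of binary classifiers typically relies on accuracy, precision, recall, and F1, so the sketch below computes those from true and predicted labels, with 1 meaning webshell and 0 meaning normal. The label convention and function name are assumptions.

```python
def binary_metrics(y_true: list, y_pred: list) -> dict:
    """Accuracy, precision, recall and F1 for labels where 1 = webshell."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Running the same function on each model's predictions for a held-out set gives directly comparable numbers. For webshell detection, precision matters for alert volume and recall for missed backdoors, so both are worth tracking alongside accuracy.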
IV. Specific implementation
4.1 Detection process
The overall business logic of machine-learning-based webshell detection is shown in figure 6 and can be summarized as follows. First, data is imported from various terminal devices and third-party libraries for feature extraction and model training. Then the trained model is deployed to the production environment to detect real data and generate alerts. Finally, the detection results are confirmed manually, and false positives are imported back into the training database so that the model can be retrained regularly.
4.2 Technology selection
When machine-learning-based webshell detection is deployed to production, the effect of data scale on the model's timeliness and throughput must be considered. After weighing several factors, the component combination shown in figure 7 below was chosen as the product's technology stack. The scheme combines the efficiency of Spark for big data processing, the high performance and low latency of Kafka for data streams, and the real-time read access of HBase to large data sets. With the architecture in the following figure, webshells can be detected in high-traffic environments and the machine learning model can be optimized automatically.