How to use Deep Learning to detect malicious PowerShell

Updated: 2025-07-01. Source: SLTechnology News&Howtos, Shulou (Shulou.com), 05/31 report.

This article describes how deep learning can be used to detect malicious PowerShell scripts.

Deep learning is a family of algorithms within the broader framework of machine learning. Deep learning methods significantly outperform traditional methods on tasks such as image and text classification. With these advances, there is great potential for using deep learning to build new threat detection methods.

Machine learning algorithms operate on numerical models, so objects such as images, documents, or emails must be converted into numerical form through feature engineering, a step that requires considerable manual effort in traditional machine learning. With deep learning, algorithms can operate on relatively raw data and extract features without human intervention.

In this article, we present an example of a deep learning technique that was originally developed for natural language processing (NLP) and has been adapted to detect malicious PowerShell scripts.

Word embedding models in natural language processing

Our goal is to classify PowerShell scripts, so we first briefly review how text classification is approached in natural language processing.

An important step is to convert words into vectors (tuples of numbers) that machine learning algorithms can consume. A simple approach, known as one-hot encoding, first assigns a unique integer to each word in the vocabulary and then represents each word as a vector of 0s with a single 1 at the index corresponding to that word. Although useful in many cases, this encoding has clear drawbacks: all words are equidistant from each other, so semantic relationships between words are not reflected in the geometric relationships between the corresponding vectors.
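The encoding described above can be sketched in a few lines of Python. This is only an illustration: the three-word vocabulary and the helper names `build_vocab`, `one_hot`, and `euclidean` are assumptions made for the example, not part of the original work. It shows why the representation carries no semantic information: every pair of distinct words ends up the same distance apart.

```python
import math

def build_vocab(tokens):
    """Assign each distinct token a unique integer index, in first-seen order."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vocabulary chosen only for illustration.
words = ["cat", "kitten", "car"]
vocab = build_vocab(words)
vectors = {w: one_hot(w, vocab) for w in words}

# Any two distinct one-hot vectors differ in exactly two positions, so the
# distance is always sqrt(2): "cat" is no closer to "kitten" than to "car".
d_cat_kitten = euclidean(vectors["cat"], vectors["kitten"])
d_cat_car = euclidean(vectors["cat"], vectors["car"])
print(vectors)
print(d_cat_kitten, d_cat_car)
```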

Contextual embedding is a more recent approach that overcomes these limitations by learning word representations from the contexts in which words appear in the data. Contextual embedding models are typically trained on large text datasets such as Wikipedia. The word2vec algorithm, an implementation of this technique, not only translates semantic similarity between words into geometric similarity between vectors, but also preserves polarity relationships between words. For example, in the word2vec representation:

Embedding PowerShell scripts

Because training a good model requires a large amount of data, we used a large and diverse corpus of 386K distinct unlabeled PowerShell scripts. The word2vec algorithm is typically used with human languages, but it produces similar results when applied to the PowerShell language. We split each PowerShell script into tokens and then used the word2vec algorithm to assign a vector representation to each token.
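To make the two preprocessing steps more concrete, here is a minimal sketch under stated assumptions. The article does not describe its tokenizer or which word2vec variant it used, so the regular expression `TOKEN_RE`, the helper names, and the choice of skip-gram style (center, context) pairs are all hypothetical and chosen only for illustration. The pairs produced at the end are the kind of training examples a skip-gram word2vec model learns from.

```python
import re

# Hypothetical token pattern (an illustrative assumption, not the article's):
# variables, dash-prefixed operators or parameters, names such as cmdlets or
# dotted type names, and integers.
TOKEN_RE = re.compile(r"\$\w+|-\w+|[A-Za-z_][\w.]*(?:-[A-Za-z]\w*)*|\d+")

def tokenize(script):
    """Split a script into lowercase tokens (PowerShell is case-insensitive)."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(script)]

def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs from tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

script = "Get-ChildItem -Path $dir | Where-Object { $_.Length -gt 100 }"
tokens = tokenize(script)
pairs = skipgram_pairs(tokens, window=1)

print(tokens)
print(pairs[:3])
```

A real pipeline would feed token lists like these, one per script, to a word2vec implementation; a production tokenizer would also need to handle strings, comments, and obfuscated input, which this sketch ignores.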

Figure 1 shows a two-dimensional visualization of the vector representations of 5,000 randomly selected tokens, with some tokens of interest highlighted. Note that semantically similar tokens are placed near each other. For example, the vectors representing -eq, -ne, and -gt (the PowerShell operators for "equal", "not equal", and "greater than", respectively) are clustered together. Similarly, the vectors representing the tokens allsigned, remotesigned, bypass, and unrestricted (all valid values for the execution policy setting in PowerShell) are clustered together.

By examining the vector representations of the tokens, we found several additional relationships.

Token similarity: using the word2vec representations of tokens, we can identify PowerShell commands that have aliases. In many cases, the token closest to a given command is its alias. For example, the representations of the token Invoke-Expression and its alias IEX are closest to each other. Two further examples of this phenomenon are Invoke-WebRequest and its alias iwr, and Get-ChildItem and its alias gci.
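The nearest-token lookup described above can be sketched as follows. The three-dimensional vectors are hand-made toy values, not learned embeddings, and the use of cosine similarity is an assumption; the article does not state which similarity measure was used for this comparison. The helper names `cosine` and `nearest` are also hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(token, embeddings):
    """Return the other token whose vector is most similar to `token`'s."""
    query = embeddings[token]
    others = (t for t in embeddings if t != token)
    return max(others, key=lambda t: cosine(query, embeddings[t]))

# Toy vectors invented for illustration; real embeddings would be learned
# from a corpus and have many more dimensions.
toy = {
    "invoke-expression": [0.9, 0.1, 0.0],
    "iex":               [0.8, 0.2, 0.1],
    "invoke-webrequest": [0.1, 0.9, 0.1],
    "iwr":               [0.2, 0.8, 0.0],
    "get-childitem":     [0.0, 0.1, 0.9],
    "gci":               [0.1, 0.0, 0.8],
}

for command in ("invoke-expression", "invoke-webrequest", "get-childitem"):
    print(command, "->", nearest(command, toy))
```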

We also measured distances within sets of several tokens. Consider, for example, the four tokens $i, $j, $k, and $true (see the right side of Figure 2). The first three are typically used to represent numeric variables, and the last is a Boolean constant. As expected, the $true token does not fit with the others: it is farthest (by Euclidean distance) from the center of the group.
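A minimal sketch of this outlier check is shown below. It assumes that "center of the group" means the component-wise mean (centroid) of the vectors, which the article does not spell out, and it uses hand-made two-dimensional toy vectors rather than real embeddings.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors invented for illustration only.
toy = {
    "$i":    [1.0, 0.9],
    "$j":    [0.9, 1.0],
    "$k":    [1.0, 1.1],
    "$true": [3.0, -1.0],
}

center = centroid(list(toy.values()))
distances = {tok: euclidean(vec, center) for tok, vec in toy.items()}

# The token with the largest distance from the centroid is the odd one out.
outlier = max(distances, key=distances.get)
print(outlier)
```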

More specific to the semantics of PowerShell in cybersecurity, we examined the representations of the tokens bypass, normal, minimized, maximized, and hidden (see the left side of Figure 2). While the first token is a legal value for the ExecutionPolicy flag in PowerShell, the rest are legal values for the WindowStyle flag. As expected, the vector representation of bypass was farther from the center of the group than the vectors of the other four tokens.

Linear relationships: because word2vec preserves linear relationships, computing linear combinations of the vector representations yields semantically meaningful results. Here are some of the relationships we found:

In each of the expressions above, the symbol ≈ indicates that, of all the vectors representing tokens in the vocabulary, the vector on the right is the one closest to the vector computed on the left.
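The meaning of ≈ can be illustrated with a short sketch. Because the specific PowerShell expressions are not reproduced here, the example uses the widely cited natural-language analogy king - man + woman ≈ queen with hand-made toy vectors; none of the values come from the article. Using Euclidean distance and excluding the three input tokens from the candidates are assumptions of this sketch (the latter is common practice in analogy tests).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def analogy(a, b, c, embeddings):
    """Return the token whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    # Assumption: the input tokens themselves are not valid answers.
    candidates = (t for t in embeddings if t not in (a, b, c))
    return min(candidates, key=lambda t: euclidean(embeddings[t], target))

# Toy two-dimensional vectors invented for illustration only.
toy = {
    "king":  [0.9, 0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "queen": [0.9, 0.2],
    "apple": [0.5, 0.5],
}

result = analogy("king", "man", "woman", toy)
print("king - man + woman ≈", result)
```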

Using deep learning to detect malicious PowerShell scripts

We used the word2vec embedding of the PowerShell language presented in the previous section to train deep learning models capable of detecting malicious PowerShell scripts.

The classification model is trained and validated on a dataset of PowerShell scripts labeled "clean" or "malicious", while the embedding is trained on unlabeled data. The process is shown in Figure 3.

Using GPU computing in Microsoft Azure, we experimented with a variety of deep learning and traditional ML models. Compared with traditional ML models, the best-performing deep learning model increased coverage by 22%. As shown in Figure 4, this model combines several deep learning building blocks, such as convolutional neural networks (CNN) and long short-term memory recurrent neural networks (LSTM-RNN).

Applying deep learning to detect malicious PowerShell

Since its first deployment, the deep learning model has detected many instances of malicious and red team PowerShell activity with high precision. The signal obtained from PowerShell is combined with a range of ML models and Microsoft Defender ATP signals to detect cyberattacks.

The following is an example of a malicious PowerShell script that the deep learning model can detect but that other detection methods have difficulty catching:
