How to construct CodeDB to explore a new white-box static scanning scheme

Updated 2025-07-06. From: SLTechnology News&Howtos (Network Security)

Shulou (Shulou.com), 05/31 report

Preface

In an earlier article I used a simple example to describe the scanning idea based on .QL, but in practice the only working implementation I have seen in this field is Semmle QL (the predecessor of CodeQL). Here I will talk about it and share some of the new static scanning schemes I am trying to explore.

What is .QL?

QL stands for Query Language, a language for querying data from a database. The familiar SQL is one kind of QL, so the concept is a very common one.

So what is .QL? Wikipedia describes it as an object-oriented query language for retrieving data from a relational database.

And what does .QL have to do with static analysis? To answer that, we need a concept called SCID.

SCID (Source Code in Database) is the practice of parsing the syntax of the code and storing the result in a database. Such a database can simply be called a CodeDB.
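To make the SCID idea concrete, here is a minimal sketch, not the author's implementation. It assumes Python source as input (parsed with the standard `ast` module) rather than PHP, and the table name `code_db` and its columns are hypothetical choices made for illustration.

```python
import ast
import sqlite3

SOURCE = """
def greet(name):
    message = "hello " + name
    print(message)

greet("world")
"""

def build_code_db(source: str) -> sqlite3.Connection:
    """Parse `source` and store one row per positioned AST node."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: line number, node type, and a short detail string.
    conn.execute(
        "CREATE TABLE code_db (id INTEGER PRIMARY KEY, lineno INTEGER, "
        "node_type TEXT, detail TEXT)"
    )
    for node in ast.walk(ast.parse(source)):
        lineno = getattr(node, "lineno", None)
        if lineno is None:
            continue  # skip nodes without position info (e.g. Module)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            detail = node.func.id      # name of the called function
        elif isinstance(node, ast.FunctionDef):
            detail = node.name         # name of the defined function
        elif isinstance(node, ast.Name):
            detail = node.id           # variable name
        else:
            detail = ""
        conn.execute(
            "INSERT INTO code_db (lineno, node_type, detail) VALUES (?, ?, ?)",
            (lineno, type(node).__name__, detail),
        )
    return conn

conn = build_code_db(SOURCE)
# Once the code lives in a table, "find every call" is an ordinary query.
calls = conn.execute(
    "SELECT lineno, detail FROM code_db WHERE node_type = 'Call' ORDER BY lineno"
).fetchall()
print(calls)
```

A real CodeDB would record much more (scope, execution order, arguments), but the shape is the same: parse once, then query.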

Once we have generated a CodeDB under some scheme, we need to construct a QL to work with it. CodeQL is a platform that implements a CodeDB and designs a corresponding QL. The query language designed for Semmle QL is .QL, which combines several features: SQL, Datalog, the Eindhoven Quantifier Notation, and "classes are predicates". Together these cover a variety of ways to express different kinds of logic about code. This article is not intended to discuss CodeQL, so we will not go deeper into the solutions in Semmle QL here.

The concept of .QL was first proposed in 2007. For details, please refer to:

https://help.semmle.com/home/Resources/pdfs/scam07.pdf

Why use .QL?

In "Talking about Automated static Code Audit tools from 0 to 0", I argued that .QL-based scanning is the main trend for the future of white-box development, chiefly because the core technology of modern white-box scanning still has many unsolved problems. That article explained several modern scanning schemes in terms of their technical principles; today I will talk about how the techniques themselves differ.

The two analysis methods I mentioned in that article, AST-based analysis and IR/CFG-based analysis, differ only in their technical basis; the theory behind the analysis is much the same. Both can be roughly classed as data-flow analysis (taint analysis can be regarded as a variant of data-flow analysis).

There are many kinds of data-flow analysis. They are typically flow-sensitive and generally path-insensitive, though this is not absolute, and we can classify them by the kind of sensitivity:

Flow-sensitive analysis: takes into account the order in which statements are executed, and usually relies on a control flow graph (CFG).

Path-sensitive analysis: considers not only the execution order of statements but also the conditions under which a path executes (if conditions and so on), to determine whether a path can actually be taken.

Context-sensitive analysis: a kind of interprocedural analysis that takes the calling context into account when analyzing the target of a function call. The typical case is the same function or method being called from different call sites with different contexts.

Note that this classification applies to data-flow analysis regardless of the underlying technique. If you like, you can build a flow-sensitive analysis tool on top of an AST as well.
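The difference that flow sensitivity makes can be sketched on a toy program. This is an illustration only: the statement encoding (`source`, `clean`, `copy`, `sink`) is a hypothetical mini-language, and the code handles straight-line code without branches, so it says nothing about path or context sensitivity.

```python
from typing import List, Set, Tuple

# Each statement is (target, kind, argument):
#   ("x", "source", "")  x = <user input>
#   ("x", "clean", "")   x = <constant>, overwriting any taint
#   ("x", "copy", "y")   x = y
#   ("", "sink", "x")    sink(x)
Stmt = Tuple[str, str, str]

PROGRAM: List[Stmt] = [
    ("a", "source", ""),   # a = user input
    ("a", "clean", ""),    # a = constant
    ("", "sink", "a"),     # sink(a)
]

def flow_insensitive(program: List[Stmt]) -> List[int]:
    """Ignore statement order: a variable tainted anywhere is tainted everywhere."""
    tainted: Set[str] = set()
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for target, kind, arg in program:
            if kind == "source" or (kind == "copy" and arg in tainted):
                if target not in tainted:
                    tainted.add(target)
                    changed = True
    return [i for i, (_, kind, arg) in enumerate(program)
            if kind == "sink" and arg in tainted]

def flow_sensitive(program: List[Stmt]) -> List[int]:
    """Follow statement order, so overwriting a variable removes its taint."""
    tainted: Set[str] = set()
    hits: List[int] = []
    for i, (target, kind, arg) in enumerate(program):
        if kind == "source":
            tainted.add(target)
        elif kind == "clean":
            tainted.discard(target)
        elif kind == "copy":
            if arg in tainted:
                tainted.add(target)
            else:
                tainted.discard(target)
        elif kind == "sink" and arg in tainted:
            hits.append(i)
    return hits

print("flow-insensitive reports sinks at:", flow_insensitive(PROGRAM))
print("flow-sensitive reports sinks at:", flow_sensitive(PROGRAM))
```

On this program the flow-insensitive version reports the sink even though `a` was overwritten before use, which is exactly the kind of false positive that statement order eliminates.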

In a data-flow-based scanning scheme, if the analysis logic fully supported every piece of syntax, we could trace the relevant data flow for each class of vulnerability and find the bugs. Unfortunately, in reality there are more problems than we expected. Here are a few examples: some may have been solved, some only temporarily, and some nobody can solve.

1. How do we recognize a global filtering scheme?

2. How do we handle a special filter function that does not filter completely?

3. How do we audit a deeply restructured framework?

4. How do we scan for stored XSS?

5. How do we scan for second-order injection?

6. How do we scan the pseudo-code logic inside eval?

As modern scanning schemes have advanced, many of these problems may have been solved to some degree. But this is something of a contest between scanning solutions and developers: we keep working to reduce the false positive rate, yet false positives can never be fully eliminated, so the problems end up looking unsolved again.

The .QL-based scanning approach was not created to solve these problems. Still, from my point of view, it takes static scanning down a new path, where we no longer have to focus on how to handle flow sensitivity, constraint schemes, and so on. Last time I briefly explained the principle behind .QL-based scanning.

The core principle is to turn each operation into a template and store it in the database. For example, the statement

a($b)

is represented as

Function-a FunctionCall ($b)

Such a triple can then be stored as one row of data in the database.

When we want to find the statements in the code that call the function a, we can run

select * from code_db where type = 'FunctionCall' and node_name = 'Function-a'

This query finds every node in the code that calls the function a.
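The triple and the query above can be tried out with an in-memory SQLite database. This is a minimal sketch: the table name `code_db` and the columns `type` and `node_name` come from the query in the text, while the third column name `node_args` and the extra sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# node_name and type follow the query in the text; node_args is a hypothetical name.
conn.execute("CREATE TABLE code_db (node_name TEXT, type TEXT, node_args TEXT)")

rows = [
    ("Function-a", "FunctionCall", "($b)"),        # a($b)
    ("Function-c", "FunctionCall", "($d)"),        # c($d)
    ("Variable-b", "Assignment", "$_GET['x']"),    # $b = $_GET['x']
    ("Function-a", "FunctionCall", "($e)"),        # a($e)
]
conn.executemany("INSERT INTO code_db VALUES (?, ?, ?)", rows)

# Find every node that calls the function a, in insertion order.
hits = conn.execute(
    "SELECT * FROM code_db WHERE type = ? AND node_name = ? ORDER BY rowid",
    ("FunctionCall", "Function-a"),
).fetchall()

for hit in hits:
    print(hit)
```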

Of course, static analysis cannot find vulnerabilities with such a simple statement alone. But when we analyze a CodeDB, we can both preserve a strict code execution order and skip over many obstacles to start the analysis directly from the sink. As the corresponding QL comes to support more high-level queries and custom advanced rules, a query like the following might be possible directly:

select * where {Source: $_GET, Sink: echo, is_filterxss: False}
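One way such a high-level rule could be evaluated is to start at each sink and walk backwards through earlier assignments until the source or a filter is reached. The sketch below is an assumption-heavy illustration, not CodeQL or the author's tool: the `Node` fields, the `RULE` dictionary, and the use of `htmlspecialchars` as the XSS filter are all hypothetical, and it handles only straight-line code with one value per assignment.

```python
from typing import List, NamedTuple, Optional

class Node(NamedTuple):
    order: int            # execution order
    node_type: str        # "Assignment" or "FunctionCall"
    name: str             # assigned variable, or called function
    arg: str              # variable (or superglobal) being read
    via: Optional[str]    # function wrapping the value in an assignment, if any

# Hypothetical rows for:
#   $a = $_GET;  $b = htmlspecialchars($_GET);  echo $a;  echo $b;
CODE_DB: List[Node] = [
    Node(1, "Assignment", "$a", "$_GET", None),
    Node(2, "Assignment", "$b", "$_GET", "htmlspecialchars"),
    Node(3, "FunctionCall", "echo", "$a", None),
    Node(4, "FunctionCall", "echo", "$b", None),
]

# Hypothetical rule: unfiltered flow from $_GET into echo.
RULE = {"source": "$_GET", "sink": "echo", "filters": {"htmlspecialchars"}}

def reaches_source(db: List[Node], var: str, before: int, rule: dict) -> bool:
    """Walk backwards from `var` through earlier assignments to the source."""
    while True:
        if var == rule["source"]:
            return True
        defs = [n for n in db
                if n.node_type == "Assignment" and n.name == var and n.order < before]
        if not defs:
            return False                      # no definition found
        latest = max(defs, key=lambda n: n.order)
        if latest.via in rule["filters"]:
            return False                      # value was filtered on the way
        var, before = latest.arg, latest.order

def run_rule(db: List[Node], rule: dict) -> List[int]:
    """Return the execution order of every sink reached by unfiltered source data."""
    return [n.order for n in db
            if n.node_type == "FunctionCall" and n.name == rule["sink"]
            and reaches_source(db, n.arg, n.order, rule)]

print(run_rule(CODE_DB, RULE))
```

Because the search begins at the sink, only the rows that can actually feed it are visited, which is the "analysis directly from the sink" described above.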

This is why many people see the arrival of CodeQL as epoch-making: static analysis moves from low-level code analysis that has to dig into the compilation process to cleverly designed rule statements running on a platform. The groundwork laid by the CodeQL approach may not show direct results yet, but at least, as a technology, it gives us a new direction.

In the rest of this article, I will draw on some of my recent short-term research to explore how to build a reasonable CodeDB.

How to achieve a reasonable CodeDB?

Back when there was only Semmle QL, I read through some papers, and later learned about LGTM and then CodeQL. After CodeQL came out, I looked at rules that people had written and found them far from what CodeQL is meant to achieve, and I kept wanting to write a similar toy myself. This time, while updating KunLun-M, I ran into many difficulties with AST-based data-flow analysis, and so this plan was born.

To put the idea into practice, I spent several weeks designing a simple version of CodeDB and wrote a simple CodeDB-based tool for finding PHP deserialization chains. The source code of the tool is here:

https://github.com/LoRexxar/Kunlun-M/tree/master/core/plugins/phpunserializechain

Before we talk about the specific implementation, we need to figure out what exactly CodeDB needs to record.

First, the execution order of each line of code and the file it belongs to are basic information. Second, the domain (scope) the code currently sits in, the type of the code, and information related to the code are also required.

On this basis, I store the data as five-tuples along five dimensions: domain location, execution order, source node, node type, and node information. Take a simple example:

Test.php
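As a rough illustration of the five dimensions, independent of the author's actual schema, one row per node could be stored as follows. Everything here is an assumption: the column names, the dotted scope notation, and the PHP snippet in the comment (a hypothetical demo.php) are made up for the sketch.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical PHP input (demo.php):
#   $a = $_GET['a'];
#   function show($x) { echo $x; }
#   show($a);

# (scope, exec_order, source_node, node_type, node_info)
FiveTuple = Tuple[str, int, str, str, str]

ROWS: List[FiveTuple] = [
    ("demo.php", 1, "Variable-$a", "Assignment", "$_GET['a']"),
    ("demo.php", 2, "Function-show", "FunctionDefinition", "($x)"),
    ("demo.php.show", 1, "echo", "FunctionCall", "($x)"),  # body of show()
    ("demo.php", 3, "Function-show", "FunctionCall", "($a)"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE code_db (scope TEXT, exec_order INTEGER, "
    "source_node TEXT, node_type TEXT, node_info TEXT)"
)
conn.executemany("INSERT INTO code_db VALUES (?, ?, ?, ?, ?)", ROWS)

def nodes_in_scope(scope: str) -> List[FiveTuple]:
    """Return the nodes of one scope in execution order."""
    return conn.execute(
        "SELECT scope, exec_order, source_node, node_type, node_info "
        "FROM code_db WHERE scope = ? ORDER BY exec_order",
        (scope,),
    ).fetchall()

for row in nodes_in_scope("demo.php"):
    print(row)
```

Keeping scope and execution order as separate columns lets a query replay a function body in order without re-parsing the file.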
