89 scenarios, an error rate as high as 40%! Stanford's first large-scale study reveals vulnerabilities in AI-assisted code

2025-04-02 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 11/24 report --

Thanks to CTOnews.com netizen OC_Formula for the tip!

With AI assistants writing code, are programmers about to be laid off? The latest research from Stanford University offers an answer.

Having AI write code saves time and effort.

But computer scientists at Stanford University have recently found that code written by programmers with the help of AI assistants is actually riddled with holes.

They found that programmers who get help from AI tools such as GitHub Copilot produce code that is less secure and less accurate than code written by programmers working alone.

In "does the AI assistant make user-written code unsafe?" In Do Users Write More Insecure Code with AI Assistants?, boffins Neil Perry, Megha Srivastava, Deepak Kumar and and Dan Boneh of Stanford University conducted the first large-scale user survey.

Paper link: https://arxiv.org/pdf/2211.03622.pdf. The goal of the research is to explore how users interact with AI code assistants to solve a variety of security-related tasks in different programming languages.

The authors point out in the paper:

We found that participants with access to the AI assistant often produced more security vulnerabilities than those without, particularly for string encryption and SQL injection. At the same time, participants with access to the AI assistant were more likely to believe that they had written secure code.
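To make the SQL injection error class mentioned above concrete, here is a minimal, hypothetical sketch (not taken from the paper; the table and function names are invented for illustration) contrasting an insecure and a safer way to insert user data using Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER)")

def add_student_insecure(name: str, age: int) -> None:
    # Vulnerable: user input is pasted straight into the SQL string, so a
    # crafted name containing quotes can break or alter the query.
    conn.execute(f"INSERT INTO students (name, age) VALUES ('{name}', {age})")

def add_student_secure(name: str, age: int) -> None:
    # Safer: a parameterized query lets the database driver handle escaping.
    conn.execute("INSERT INTO students (name, age) VALUES (?, ?)", (name, age))
```

The same distinction applies to the JavaScript version of the task that appears in the study's question list later in this article.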

Researchers at New York University had previously shown that AI-based code suggestions are often insecure under a variety of conditions.

In an August 2021 paper, "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions", the NYU researchers found that, across 89 given scenarios, about 40% of the computer programs written with Copilot's help contained potentially exploitable security vulnerabilities.

But the Stanford authors say the scope of that earlier study was limited, because it considered only a constrained set of prompts and covered just three programming languages: Python, C, and Verilog.

The Stanford scholars also cite follow-up research from New York University, but note that their own study differs because it focuses on OpenAI's codex-davinci-002 model rather than the less powerful codex-cushman-001 model; both play a role in GitHub Copilot, which is itself a fine-tuned descendant of a GPT-3 language model.

For one of the questions, only 67% of the experimental group gave a correct answer, compared with 79% of the control group.

The figure shows the percentage of correct answers for each question (%). The paired values in each column correspond to the experimental group (blue) / control group (green); blank cells represent 0.

The results showed that the experimental group was "significantly more likely to provide an insecure solution (p < 0.05, using Welch's unequal-variance t-test)", and was also significantly more likely to use trivial ciphers, such as substitution ciphers, and to skip an authenticity check on the final returned value.
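For readers unfamiliar with the statistical test mentioned above, here is a small illustrative sketch, with made-up numbers rather than the authors' data, of how Welch's unequal-variance t-test can be computed with SciPy.

```python
from scipy import stats

# Hypothetical per-participant outcomes (1 = secure solution, 0 = insecure);
# these values are invented for illustration only.
assisted = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
control = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]

# equal_var=False selects Welch's test, which does not assume equal variances.
result = stats.ttest_ind(control, assisted, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```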

Next, let's take a look at how the study was carried out.

Experimental design and preparation: the problems we chose can be solved in a short period of time and cover a wide range of potential security errors.

The key areas we wanted to examine were the use of cryptographic libraries (encryption / decryption), handling of user-controlled data (user-supplied paths in a sandboxed directory, script injection), common web vulnerabilities (SQL injection, script injection), and lower-level problems such as memory management (buffer overflows, integer overflows, and so on).
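As a hedged illustration of the "user-supplied paths in a sandboxed directory" category (this is not code from the study; the sandbox location and function name are assumptions), a careful handler typically resolves the path before checking that it stays inside the sandbox:

```python
import os

SANDBOX = os.path.realpath("/srv/sandbox")  # hypothetical sandbox root

def open_in_sandbox(user_path: str):
    # Resolve symlinks and ".." segments first; otherwise input such as
    # "../../etc/passwd" can escape the sandbox directory.
    resolved = os.path.realpath(os.path.join(SANDBOX, user_path))
    if os.path.commonpath([SANDBOX, resolved]) != SANDBOX:
        raise ValueError("path escapes the sandbox directory")
    return open(resolved, "rb")
```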

Our main goal was to recruit participants with a variety of programming experience in order to understand how they approach security-related programming problems.

To this end, we recruited undergraduate and graduate students from two large universities in the United States, as well as several professional programmers from four different companies.

The figure shows the demographics of the participants.

To verify that participants had programming knowledge, we asked a short pre-screening question before the study. The questions covered the security areas and programming languages mentioned above:

Write two functions in Python: one encrypts a given string using a given symmetric key, and the other decrypts it (a minimal sketch of this task appears after this list)

Write a function in Python that signs a given message using a given ECDSA signing key

Write a function in Python that takes a string path representing a file path as input and returns a File object for the file at that path

There is a "SQL table for STUDENTS", write a Javascript function, take the string name and integer age as input, and add a new entry

Write a function in C that accepts a signed integer num and returns a string representation of the integer

Write a JavaScript function that takes user-supplied string input and displays it in the browser window.
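As a rough sketch of what a secure answer to the first task above might look like (assuming the third-party cryptography package is allowed; the study does not prescribe a particular library, and the function names here are illustrative):

```python
from cryptography.fernet import Fernet

def encrypt(key: bytes, plaintext: str) -> bytes:
    # Fernet provides authenticated symmetric encryption, avoiding the
    # trivial substitution ciphers some participants resorted to.
    return Fernet(key).encrypt(plaintext.encode("utf-8"))

def decrypt(key: bytes, token: bytes) -> str:
    # Raises InvalidToken if the ciphertext has been tampered with.
    return Fernet(key).decrypt(token).decode("utf-8")

# Usage: key = Fernet.generate_key(); decrypt(key, encrypt(key, "hello"))
```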

During the study, we presented the security-related programming problems to each participant in a randomized order, and participants could attempt the problems in any order they liked.

We also allowed participants, whether in the control group or the experimental group, to access an external web browser, which they could use to help solve any of the problems.

We presented the study instrument to participants through a virtual machine running on the study administrator's computer.

In addition to collecting rich logs for each participant, we also recorded each participant's screen during the session, with their consent.

When participants had completed the questions, they were prompted to fill in a brief exit survey describing their coding experience and answering some basic demographic questions.

Finally, Likert-scale questions in the post-study survey were used to gauge participants' belief in the correctness and security of their solutions and, for the experimental group, the AI's ability to generate secure code for each task.

The figure shows participants' judgments of the correctness and security of their solutions; the different colored bars represent degrees of agreement.

We observed that participants with access to the AI assistant were more likely than our control group to introduce security vulnerabilities for most programming tasks, yet were also more likely to rate their insecure answers as secure.

In addition, we found that participants who invested more effort in crafting their queries to the AI assistant (for example, providing helper functions or adjusting parameters) were more likely to eventually provide a secure solution.

Finally, for this study, we built a dedicated user interface for studying how people write software with AI-based code-generation tools.

We have released our UI, along with all user prompts and interaction data, on GitHub to encourage further research into the various ways users may choose to interact with general-purpose AI code assistants.

Reference:

https://www.theregister.com/2022/12/21/ai_assistants_bad_code/?td=rt-3a

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era). Editor: Joey.
