GitHub updated guidance on using its Copilot AI-powered code bot after researchers demonstrated at Black Hat that it often generates vulnerable code.
The Washington Post recently reported on the heroism of 16-year-old Corion Evans, a young man from southern Mississippi who dove into the water to rescue the occupants of a sinking car after watching the driver steer the vehicle down a boat ramp and into the Pascagoula River.
The teen who was driving later told authorities that the GPS had malfunctioned and that she did not realize it was leading her and her passengers into the water. Shocking as that revelation sounds, drivers blindly following algorithms into the ditch (literally and figuratively) is a fairly common occurrence these days.
Researchers at the Black Hat security conference on Wednesday offered a similar lesson for software developers. Hammond Pearce of NYU and Benjamin Tan of the University of Calgary presented findings from their research on Copilot, an AI-based development bot that GitHub introduced in 2021 and made generally available to developers in June 2022.
Here are highlights of what the researchers shared at the Black Hat Briefings.
Don’t let AI drive (software development)
Pearce and Tan said that, like the algorithms behind Waze and other navigation apps, GitHub’s Copilot is a useful assistive technology that nonetheless warrants close and continued attention from the humans who use it, at least if development projects don’t want to find themselves submerged in a river of exploitable vulnerabilities such as SQL injection and buffer overflows.
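To make one of those vulnerability classes concrete, here is a minimal sketch in C using SQLite. It was written for this article rather than taken from the study, and the table, column, and function names are illustrative. It contrasts the injection-prone habit of splicing user input into query text with a parameterized query.

```c
#include <stdio.h>
#include <sqlite3.h>

/* Vulnerable pattern: splicing untrusted input straight into the SQL text.
 * A username like  x' OR '1'='1  changes the meaning of the query. */
int find_user_unsafe(sqlite3 *db, const char *username) {
    char sql[256];
    snprintf(sql, sizeof sql,
             "SELECT id FROM users WHERE name = '%s';", username);
    return sqlite3_exec(db, sql, NULL, NULL, NULL);
}

/* Safer pattern: a prepared statement with a bound parameter, so the
 * input is treated strictly as data, never as SQL. */
int find_user_safe(sqlite3 *db, const char *username) {
    sqlite3_stmt *stmt = NULL;
    int rc = sqlite3_prepare_v2(db,
             "SELECT id FROM users WHERE name = ?;", -1, &stmt, NULL);
    if (rc != SQLITE_OK)
        return rc;
    sqlite3_bind_text(stmt, 1, username, -1, SQLITE_TRANSIENT);
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        /* consume result rows */
    }
    return sqlite3_finalize(stmt);
}
```

The same contrast applies with any database API: in the unsafe version a crafted username can rewrite the query, while the prepared statement keeps it as plain data.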
The researchers found that Copilot’s coding suggestions contained exploitable vulnerabilities about 40% of the time. In roughly the same share of cases, a suggestion with an exploitable flaw was Copilot’s top-ranked choice, making it more likely to be adopted by developers, Pearce and Tan told the audience at Black Hat.
In all, the team generated 1,689 code samples using Copilot in response to 89 different “scenarios,” or proposed coding tasks. For each scenario, the team asked Copilot to generate up to 25 different solutions and noted which of those Copilot ranked most highly. They then analyzed the suggested code for the presence of 18 common software weaknesses, as documented by MITRE on its Common Weakness Enumeration (CWE) list.
Garbage (code) in, garbage (code) out
While Copilot proved good at certain types of tasks, such as addressing issues around permissions, authorization and authentication, it performed less well when presented with other tasks.
For example, a prompt to create “three random floats” (non-integer numbers) resulted in three suggestions that would have led to out-of-bounds errors, which malicious actors could have used to plant and run code on vulnerable systems.
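The following is an illustrative C sketch of that kind of flaw, not the code Copilot actually produced; the buffer size and the oversized values are chosen to make the failure obvious. The risky pattern is a fixed-size buffer that is too small for what %f can emit, producing an out-of-bounds write.

```c
/* Illustrative sketch only, not output from the study. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Three "random" floats, echoing the researchers' prompt. */
    float a = (float)rand() / (float)RAND_MAX * 1e30f;
    float b = (float)rand() / (float)RAND_MAX * 1e30f;
    float c = (float)rand() / (float)RAND_MAX * 1e30f;

    /* Risky pattern: %f on a value near 1e30 emits 30+ characters,
     * so sprintf writes past the end of this 20-byte buffer. */
    char str_a[20];
    sprintf(str_a, "%f", a);

    /* Safer alternative: bound the write and check for truncation. */
    char buf[64];
    int n = snprintf(buf, sizeof buf, "%f", b);
    if (n < 0 || (size_t)n >= sizeof buf) {
        fprintf(stderr, "value did not fit\n");
    }

    printf("%s %s %f\n", str_a, buf, c);
    return 0;
}
```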
Another prompt, asking Copilot to create a password hash, produced a recommendation to use the MD5 hashing algorithm, which is deemed insecure and is no longer recommended for use.
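For contrast, here is a hedged sketch, again not the code from the study, of the weak MD5 pattern next to one widely accepted alternative: salted PBKDF2-HMAC-SHA256 via OpenSSL. The function names, 16-byte salt, and iteration count are illustrative choices.

```c
#include <string.h>
#include <openssl/md5.h>   /* MD5() is deprecated in OpenSSL 3.0; shown only to illustrate the weak pattern */
#include <openssl/evp.h>
#include <openssl/rand.h>

/* Weak: a single unsalted MD5 pass, the kind of pattern a model trained on
 * old example code tends to reproduce. Fast to brute-force, and no salt. */
void hash_password_md5(const char *password, unsigned char out[MD5_DIGEST_LENGTH]) {
    MD5((const unsigned char *)password, strlen(password), out);
}

/* Stronger: random salt plus an iterated PBKDF2-HMAC-SHA256 derivation.
 * Returns 1 on success, 0 on failure. */
int hash_password_pbkdf2(const char *password,
                         unsigned char salt[16], unsigned char out[32]) {
    if (RAND_bytes(salt, 16) != 1)        /* cryptographically random salt */
        return 0;
    return PKCS5_PBKDF2_HMAC(password, (int)strlen(password),
                             salt, 16, 600000, EVP_sha256(), 32, out);
}
```

bcrypt, scrypt, and Argon2 are other commonly recommended options; the point is the random salt and the work factor, both of which the one-line MD5 hash lacks.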
Modeling bad behavior
The problem may lie in how Copilot was trained rather than in how the AI was designed. According to GitHub, Copilot was designed to work as an “editor extension” to help accelerate the work of developers. To do that, however, the AI was trained on the massive trove of code that resides in GitHub’s cloud-based repositories. The company says it “distills the collective knowledge of the world’s developers.”
The problem: a lot of that “collective knowledge” amounts to poorly executed code that doesn’t provide much of a model for code creation.
“Copilot doesn’t know what’s good or bad. It just knows what it has seen before.”
—Hammond Pearce
The recommendation to use MD5 for creating a password hash is a classic example. If Copilot’s study of GitHub code concluded that MD5 was the most commonly used hashing algorithm for passwords, it makes sense that it would recommend MD5 for a new password-hashing function, not understanding that the algorithm, though common, is outdated and has been deprecated.
The kind of probabilistic modeling that Copilot relies on, including the use of large language models, is good at interpreting code but not at grasping context. The result is an AI that simply reproduces patterns it thinks “look right,” even when those patterns, however common, are flawed, the researchers said.
Experiments the research team conducted tended to reinforce that idea. Suggestions modeled on code from reputable developers and well-vetted modules tended to be of higher quality than suggestions modeled on code from little-known developers.
AI bias amplified by ranking
Copilot’s tendency to rank flawed code highly when presenting its suggestions is an equally worrying problem, the researchers said. In about four out of ten recommendations, the top-ranked suggestion contained one of the common, exploitable weaknesses the researchers were searching for.
That top ranking makes it more likely that developers will use the suggested code, just as many of us jump at the top search result. That kind of “automation bias,” in which humans tend to blindly accept whatever algorithms recommend, could be a real problem as development organizations start to lean more heavily on AI bots to help accelerate development efforts.
In the wake of the research presented at Black Hat, GitHub has updated its disclaimer for the AI, urging developers to audit Copilot’s suggestions with tools such as its CodeQL utility to discover vulnerabilities before implementing them.
The researchers summarized:
“Copilot should remain a co-pilot.”