Copying code from Stack Overflow? You might paste security vulnerabilities, too
We know that Stack Overflow is a daily part of a lot of developers’ lives. I’ve heard from multiple people that they come here daily (if not more often) to get answers to their questions. Sometimes the answer to a question about code comes as a chunk of code. And sometimes that code makes it into production applications because it answered the question perfectly.
A group of researchers investigated these code snippets to see how secure they were, and whether the security flaws they contained persisted in the projects that copied them. Ashkan Sami, Associate Professor at Shiraz University, Foutse Khomh, Associate Professor at Polytechnique Montréal, and Gias Uddin, now Senior Data Scientist at the Bank of Canada, researched C++ code snippets on Stack Overflow to answer this exact question. (Ed note: We spoke to Khomh and Uddin previously about their work pulling opinions from Stack Overflow questions and comments.)
The three had each been researching how developers use Stack Overflow when they met at a conference in Sweden in 2018. Khomh had been examining Stack Overflow code for licensing issues, which led the security expert Sami to wonder if the code had flaws that could expose copiers to more than just copyright violations.
Copying code itself isn’t always a bad thing. Code reuse can promote efficiency in software development; why solve a problem that has already been solved well? But when developers use example code without trying to understand its implications, that’s when problems can arise. “Do they really care about scrutinizing it for vulnerabilities, or do they all just use the code off the shelf?” asked Khomh. “And if they do, does this issue spread around?”
Research process
Sami and company weren’t the first researchers to examine vulnerabilities in code posted to Stack Overflow. In reviewing the existing literature, they found that there were no papers addressing Stack Overflow code for the fourth most popular language, C++. “We wanted to focus on C++ to get better knowledge of how vulnerabilities evolve and if the vulnerability migration actually happened from Stack Overflow to GitHub,” said Sami.
They downloaded the SOTorrent data set, which contains ten years’ worth of Stack Overflow history. A first automated pass found 120,000 pieces of text tagged as code snippets. Through dedupe processes and manual examination, they boiled the set down to 2,560 unique snippets of code. Now the hard work began.
Three of the researchers reviewed every single one of those snippets looking for vulnerabilities over multiple rounds of review. After each round, they had to defend to the entire research group why each flagged snippet was vulnerable. “It was exhaustive work,” said Sami. “But the vulnerabilities they found were actually vulnerabilities. After several rounds of review, they boiled down to 69 vulnerabilities that we could with some certainty state they are vulnerable.”
Those 69 vulnerable code snippets fell into one of 29 Common Weakness Enumeration (CWE) categories. While 69 doesn’t seem like a lot, they found that those vulnerable snippets had migrated into over 2,800 projects. We’re not talking about school projects; these are actual live projects: production code in publicly visible GitHub repos.
Before publishing their results, they made sure to contact all of the repo owners and let them know about the flaws in their code. A few responded and fixed the issues, but there were a lot of non-responses or quietly closed issues. And there could be more out there. “The vulnerabilities that we actually flag, I think it’s a subset of what is actually being exchanged around,” said Khomh.
Copying without understanding
Many of the issues that the team found were basic security errors. But that doesn’t mean they weren’t prevalent. One of the more common flaws came from not checking return values. When you don’t check a return value in C++, you run the risk of the dreaded null pointer dereference. Dereferencing a null pointer typically causes a segmentation fault and crashes the process.
This error is pretty easy to guard against: verify that the return value is not NULL before using it. But the fact that it was so common points out the larger problem. “If you borrow things and you don’t understand the content of what you’re borrowing,” said Khomh, “then you fall in this trap of reusing code that has potential vulnerabilities. Then you are just spreading those things around.” If you’re going to reuse code, you need to understand that code.
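As a rough sketch of that guard (the file name and the fopen call are just hypothetical stand-ins for any function that can return NULL on failure):

```cpp
#include <cstdio>

int main() {
    // fopen returns NULL when the file cannot be opened.
    std::FILE* f = std::fopen("config.txt", "r");
    if (f == nullptr) {          // check before using the pointer
        std::perror("fopen");
        return 1;
    }
    // ... safe to read from f here ...
    std::fclose(f);
    return 0;
}
```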
These vulnerabilities can leave software open to malicious actors. Another common flaw was missing input validation. Like the unchecked return value, this flaw occurs when functions process input without making sure it’s something expected. In some cases, this can cause a stack overflow error–our namesake–and possibly cause a program to execute input as arbitrary code. “Even the last year,” said Uddin, “there were some hacking activities that specifically targeted the stack overflow vulnerability in code bases. And that got unauthorized access to millions of user information.”
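As a minimal sketch of the kind of check involved (the buffer size and the helper function are made up for illustration), validating the length of untrusted input before copying it into a fixed-size buffer keeps that buffer from being overrun:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper: copy untrusted input into a fixed-size buffer
// only after validating its length, instead of copying it blindly.
bool copy_input(char* dest, std::size_t dest_size, const char* untrusted) {
    std::size_t len = std::strlen(untrusted);
    if (len >= dest_size) {                 // reject input that would overflow dest
        return false;
    }
    std::memcpy(dest, untrusted, len + 1);  // copy including the terminator
    return true;
}

int main(int argc, char** argv) {
    char name[32];
    if (argc < 2 || !copy_input(name, sizeof name, argv[1])) {
        std::fprintf(stderr, "invalid input\n");
        return 1;
    }
    std::printf("hello, %s\n", name);
    return 0;
}
```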
Both Sami and Khomh are professors, so they run into student work that reuses code all the time. With proper attribution and an understanding of what the code actually does, the copied code can actually help the student learn. More often than not, though, code is copied without any understanding of how it works. The best method is still doing it yourself. “Ideally they should create the solution and get the full mark,” said Khomh. “Then they learn the concept, and they could actually build something out of it.”
But if copied code must be used, attribution and due diligence are a must. “They should credit where they got it,” said Sami. “Again they should check if this component is properly okay to integrate with another component. Just an example, but the problem that happened in Ariane five and blew up the whole spacecraft was because they had an integer problem. The 64 bit number was written into a 16 bit place.” They reused code from the previous mission without checking that it still worked on the new systems.
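The failure mode Sami describes is a narrowing conversion. A minimal, hypothetical C++ illustration (the variable name and value are made up, and the actual Ariane 5 code was Ada, not C++):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // A 64-bit value that is too large to represent in 16 bits.
    std::int64_t horizontal_bias = 70000;

    // Forcing it into a 16-bit variable silently wraps the value.
    std::int16_t converted = static_cast<std::int16_t>(horizontal_bias);

    // Prints a wrapped value (4464 on typical platforms), not 70000.
    std::cout << converted << '\n';
    return 0;
}
```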
What Stack can do about it
The researchers made it clear that this isn’t a problem unique to Stack Overflow; any site sharing code snippets as examples would face this. As part of their research, they created a browser extension that will identify the vulnerable code.
But there are other things that the community can do to help out. First and foremost is to understand that the code snippets posted as examples are not production-ready code. Don’t copy this into a project without understanding the code and testing it. With the extension, you have an awareness of the snippets that the researchers have flagged as risky, but new answers are being added all the time.
They also suggest leveraging the security experts among the community. “Now the Stack Overflow community as a whole, the developers, they’re pretty impressive,” said Uddin. “They’re very interactive with each other. If we can try to motivate the security experts to both raise awareness and educate the user community, we will not only serve to make the code more secure, but also provide more information to the users who will be using the code.”
Because, in the end, these researchers are also educators. They think that Stack Overflow can help educate curious questioners on security as well as programming technique. As Professor Sami said: “Not just providing answers, but providing insights whether the code is secure or is not secured.”
Tags: bulletin, research, stackoverflow
39 Comments
Reminds me of the Meta.SE question “Stack Overflow made the BBC news – Copycat coders create ‘vulnerable’ apps”: https://meta.stackexchange.com/q/334811/295232
It should! We talked to the folks who did the research that the BBC story is based on.
But SO was never a site for providing complete, working, production-quality examples. The code provided on the site is just localized example snippets for the sake of teaching one specific thing. If someone is asking a question about, for example, which parameters to pass to the strstr() function in C, an answer will not necessarily contain code that checks the result of the function against NULL. Because that’s not what the question was about. But also because pedagogical snippets are not production-quality code. Anyone running static analysis on such snippets is just confused. I haven’t read this study, but I would imagine that the researchers would get similar results if they ran static analysis on code examples from well-known programming books. A far greater concern for software safety is “supposedly production quality” code from open source projects, where random GitHub hobbyists publish complete but broken code all over the Internet.
It’s true; I don’t believe your examples _should_ be production-ready, because it makes for more verbosity, often masking what you’re really trying to demonstrate. So, I’ve always been a little frustrated that responders often say “you haven’t validated that input”, because I’ve always thought that it’s the developers’ responsibility to do that outside the provided code-snippet.
However, if SO readers really _are_ using the code without really understanding it, it’s a real problem. I just don’t think anyone is ever going to prove that one way or the other. Certainly _this_ research hasn’t hunted down actual examples of SO snippets being misused this way.
Unfortunately the people who will use this clever browser extension are mostly not the ones who need it.
There are large numbers of users asking questions who clearly do not understand anything about what they are doing, and it is obvious they are just going to copy/paste whatever working snippet they get from an answer and wire it into their program. I have seen a lot of those same users (frighteningly) claim to be working on writing a script or program or some other aspect of a project they have been given at their job. And those are just the ones who end up posting questions because they didn’t find an already existing example to copy.
I am willing to bet the number of people just copy/pasting code they don’t understand from SO is extremely high.
Is this a real problem with the code on StackOverflow? Or are there flawed assumptions on the side of the researchers? For instance, there’s the explicit statement that you *must* check for NULL pointers from functions, as that indicates failure. But just read the documentation for a Standard C function like strcat, and you’ll find that it will never return NULL. Is that unusual? No; strcpy doesn’t return NULL either. In C++, new doesn’t return NULL. From the STL, std::find can return a pointer if it’s passed a pointer range; again, it can’t return NULL. So why check for a NULL return value? If the assumptions are that broken, we have to worry about any conclusions drawn from them.
malloc, strstr, strchr, memchr, strtok, bsearch…
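To make the counterpoint concrete, here is a minimal sketch (with made-up strings) of one call from that list, strstr, which does return NULL when the needle is absent:

```cpp
#include <cstdio>
#include <cstring>

int main() {
    const char* haystack = "user=alice;role=admin";

    // strstr returns NULL when the substring is not found.
    const char* found = std::strstr(haystack, "token=");
    if (found == nullptr) {  // without this check, %s below would receive NULL
        std::puts("no token present");
        return 0;
    }
    std::printf("token starts at: %s\n", found);
    return 0;
}
```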
To be honest, when I saw that popup or extension warning whoever wants to use that code about its risks, my first thought was “this is a virus / blocked-spam advertisement”. An alternative could be to add a clear comment warning about the use of such code.
Example:
“This code covers only the subject asked here; it doesn’t consider security or other high-importance matters”, or something along those lines.
I would love to see the ability to flag a question and/or answer as unsafe. I come across this regularly in the [php] tag. Repeat offenders are unsafe database queries and disabling SSL certificate validation.
If you want people to lift their game, you need another score, for security. For once, the armies of people who don’t answer questions but love nitpicking other people’s efforts can make themselves useful.
I’ve left so many comments on SO pointing out that $sql = "select * from a where b = {$_POST['c']}"; is horrendously vulnerable. I think I’ve seen the poster come back and fix their answer maybe twice.
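The usual fix is a parameterized query, so the untrusted value is bound as data rather than spliced into the SQL string. A rough sketch of the same idea in C++, using SQLite’s C API purely for illustration (the comment above is about PHP):

```cpp
#include <sqlite3.h>

// Count rows matching an untrusted value using a bound parameter
// instead of string interpolation. Table and column names are made up.
int count_matching_rows(sqlite3* db, const char* untrusted_value) {
    sqlite3_stmt* stmt = nullptr;
    const char* sql = "SELECT * FROM a WHERE b = ?;";  // placeholder, no interpolation
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) {
        return -1;
    }
    // The driver treats the bound value purely as data, never as SQL.
    sqlite3_bind_text(stmt, 1, untrusted_value, -1, SQLITE_TRANSIENT);

    int rows = 0;
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        ++rows;
    }
    sqlite3_finalize(stmt);
    return rows;
}
```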
Thank you, thank you, thank you for writing about this! A lot of insecure, improperly-error-checked, or just plain wrong snippets are present on Stack Overflow that people unwittingly copy into their projects. Now at least there’s some way to prevent this for people that care.
Is this going to be uploaded to the Chrome Web Store? If so, when?
I also think this is a great idea and should be implemented asap. I’ve seen accepted/highly upvoted answers for using Java security libraries that use default (unsafe) values.
“A first automated pass found 120,000 pieces of text tagged as code snippets.”
In questions or in answers? Makes a huge difference for me.
And: quite a few insecure answers I have seen on SO include a statement that the code is insecure.
Why do we need a browser extension for this? Why not simply fix the code on SO? There is an [edit] link below every question and answer.
+1
If you don’t check code or handle errors, you deserve everything that happens to you… the best solution is to get burnt once; you don’t put your hand back into the fire again for a very long time… you need to fail to learn not to be stupid… stop wasting time worrying about copied code and focus on TDD or BDD methods.
The problem is that it’s the end user who will suffer most of the consequences of those vulnerabilities.
I do not think it is realistic to expect that everyone writing code in C++ or another system-level language is ever going to remember all the security rules and apply them properly. It seems to me that systems are going to have to be rewritten from the ground up using compilers and linkers that have built-in checks for code that would have security holes. Any system that communicates with the outside world will have to be constructed using a modular approach in which software components are thoroughly protected against interacting with each other. Also, with our present ability to have many processors available, some of them should be used to monitor what is happening in the system. These processors will not be part of any symmetric multiprocessor approach; rather, they will be comparable to “random logic”, working alongside the main architecture and devoted to security.
Another risk is the use of sample code that contains “secrets” which are no longer secret. I found a snippet of code in a project that was used to encrypt, decrypt, and hash strings with standard providers. The actual keys and salt/pepper values in production had been copied from an SE post. A Google search for the key values found the exact same ones used in various projects. They even copied the comment that said “be sure to change these values” but didn’t change the values.
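A minimal sketch of the alternative, assuming the key lives in an environment variable (the variable name here is made up) rather than in the source file:

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Load the encryption key from the environment (or a secrets manager)
// instead of hard-coding a value copied from a forum post.
std::string load_encryption_key() {
    const char* key = std::getenv("APP_ENCRYPTION_KEY");  // hypothetical name
    if (key == nullptr || *key == '\0') {
        throw std::runtime_error("encryption key not configured");
    }
    return std::string(key);
}
```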
@Michiel Salters, you seem to have missed the part where it says “Three of the researchers reviewed every single one of those snippets looking for vulnerabilities over multiple rounds of review. After each round, they had to defend to the entire research group why each flagged snippet was vulnerable.”
I’m pretty sure they did a lot more than just look for null references.
@Thomas W First, the browser extension isn’t specific to SE. Second, it’s intended to identify code samples that have been copied and pasted into MULTIPLE sites. Fixing it on SE does not fix it everywhere it’s been copied to.
To quote Google’s engineering practices documentation (which they’ve open sourced) (https://google.github.io/eng-practices/review/reviewer/looking-for.html#every_line):
> Look at *every* line of code that you have been assigned to review. Some things like data files, generated code, or large data structures you can scan over sometimes, but don’t scan over a human-written class, function, or block of code and assume that what’s inside of it is okay. Obviously some code deserves more careful scrutiny than other code—that’s a judgment call that you have to make—but you should at least be sure that you *understand* what all the code is doing.
If you are in the business of programming and/or coding applications professionally, security should always be your greatest concern before releasing your work as “certified” secure.
As for what is released on answer boards and various “self-help” sites, that is an exercise for the soon-to-be user of that code.
I do not feel the need to take on the additional burden of a browser plugin for this task when a conspicuously posted disclaimer on the website would more than suffice. This should be a common-sense exercise anywhere there is publicly available code posted. WHY get into the “Chicken Little” attitude over something that should be more of a learning experience (and a learning FROM experience) about just plain common sense, like accepting the fact that some posted code snippets are bound to have one or more security flaws? That, and it is up to the user of said code to exercise due caution and common sense.
Those who are not so careful about porting publicly sourced code into their own professional projects are the ones clearly at fault, at the very least through their own negligence.
The snippets published on SO — in questions and answers alike — are bare-bones illustrations of specific issues. Unless the respective issue is safe coding, these snippets *by design* will not follow safe coding guidelines. That is a good thing because the safety checks would obscure the demonstrated issues. (Note that this is often an unfortunate but unavoidable side-effect of real world safe code.) I would think that almost *no code* on SO is safe for consumption as-is.
Anybody who uses *any* code snippets without understanding them and adapting them to their needs, including safety requirements, is unprofessional and doomed.
“120,000 pieces of text tagged as code snippets. Through dedupe processes and manual examination, they boiled the set down to 2,560 unique snippets of code”
So removing duplicates reduces the number of snippets to 2% of the original. Can they do the same for questions? :o)
“Copying code itself isn’t always a bad thing. Code reuse can promote efficiency in software development”
‘Code reuse’ doesn’t mean what you think it does!
It’s an older type of code reuse. Or rather, the term is ambiguous.
“In some cases, this can cause a stack overflow error–our namesake–and possibly cause a program to execute input as arbitrary code.”
Are you sure you don’t mean a “buffer overflow” there? A stack buffer overflow (which this article probably meant to write there) is a type of buffer overflow, not a type of stack overflow, which is a different error and is not typically related to lack of input validation nor is it typically exploitable. On the other hand, a buffer overflow is commonly related to both.
So what are those 69 vulnerabilities, then? At least, can we get links to some SO questions and the security vulnerabilities that answers to them are “guilty” of?
I want to judge for myself!
StackOverflow is about getting answers to questions, not about getting someone to write code for you. If I include code in an answer, it’s because it helps to answer the question, not because it’s intended to be fit for production use. I agree with the concern: an awful lot of people are shipping code with no idea of what it does or how it works. If they were plumbers or electricians, they’d never get past their qualification exams. But what should we do as answerers? One thing I do is to deliberately make the code incomplete, so if the poster doesn’t understand it, they can’t use it.
Nice to see a real meaty blog from SOF. The last one I saw was about that “We’re updating the Ask Question Wizard” change that added nursery-school-like robot and answer balloons. Three months later they still have not reverted that change.
Interesting. Stack Overflow should have a kind of vulnerable-code detection, meaning that when a user is posting an answer, it will prevent them from posting until they correct it. Something like this:
The code you are about to publish contains {amount of vulnerabilities} Vulnerabilities! Click Here to View them.
You may not publish the answer until you patch the vulnerabilities! Vulnerability-Bypassing is against the rules, and can cause you to be unable to answer questions!
– When the user clicks the link, it will go to a blank page with text stating the errors (line, column, etc.).
That’s just my opinion, though!
That’s a very good concept. However, it would be awesome if SO kept it built in, because not every dev will install the plugin.
How about adding a flag for “this code might be exploitable” and a review queue to crowdsource inspections? And, it might make sense to work with one of the static inspection automation packages to help auto-flag this stuff.
I don’t suggest deleting or concealing flagged answers, but rather marking them “possibly vulnerable.”
Donnowan: for the sake of big G, don’t even bring it up. Or just condemn it.
People should be responsible for their code. Period. Period.
Yes, copy, do what you want, but when you are a cheat you are one; no one should try to help out cheats. “Productivity”? What BS. And you have no chance of protecting it anyway. Any library could be potentially harmful; look at that Wirus-10 that MS is shipping. It has killed my machine several times by now, going in when I don’t want it, doing things I don’t want it to do, and making my screen black or blue.
With all due respect, that was some really wasted researcher time
Stack Overflow code samples are not production-quality code. Anyone who expects them to be production-quality code has serious problems.
This is equally true of, for example, code you will see in programming books. The code is provided to show flow. All the null checks, return code checks, validation of input, etc. are valuable in production code. But unless the questioner is asking for an explanation of an error, rather than the far more common “how do I do this?”, adding that code serves little to no point and just makes it harder to understand what is going on.
When I have a question about a code snippet of mine, the first thing I do is strip out all that extra checking, unless it’s directly relevant to the question. I expect most other professional programmers do the same.
It seems like it would be prudent for the blog post to mention that if you copied code from Stack Overflow, you are apparently required to open source your project, since according to the Stack Overflow Terms of Service, code on Stack Overflow is CC-BY-SA. That SA part = Share Alike = virally open source.
At the risk of going against the tide, I disagree with the whole premise of the research and this blog post.
If someone copies and pastes a code sample from SO (or any other forum or Q&A site) into a significant code base, then they are responsible if they introduce insecurities, or any flaws, into their code base. My premise is that developers of a system are responsible if they introduce flawed code into that system. So what if they copied their code from somewhere else? They elected to copy the code, so they are responsible if it causes malfunction.
For SO in particular, the basic premise of the Q&A format is to provide workable answers to questions as asked. It might be nice if answers provided bullet-proof code that can be safely copied and used in some arbitrary, unknown code base. But writing such code takes a lot of effort and cannot anticipate every possible way in which that code might be reused. It requires a specialised skill set and, more importantly, a highly disciplined development approach to produce such samples. And the authors of such samples can’t anticipate every possible use case. They rely on people using their code being as disciplined as they are.
In the C and C++ forums, yes, a lot of code samples have formal security flaws. There are many ways that comes about. But, most importantly, a lot of questions seek answers that address some specific problem. They don’t seek a highly engineered, bullet-proof code sample. They often seek an explanation of their problem (why their program prints a value they don’t expect, or why it crashes) and a general approach to solving it. Demanding that answers only include demonstrably secure code demands a lot of effort to address concerns that were irrelevant to the question as asked.
And, despite all that effort, there is no way to prevent someone from copying that secure code sample, and misusing it in some unanticipated way. The responsibility still needs to rest with the people who copy and misuse the code sample, not with the people who write it.
The assumption of this blog post and of the research – that code samples can be written in a way that prevents them from being copied and misused – is completely bogus. Yes, it is appropriate that code samples be improved where possible – and there are plenty of ways to achieve that. But setting an ideal yardstick – that all code posted in answers will be secure, bullet-proof, and immune to misuse – is unworkable. It will actively discourage people from providing help to questions as asked.
Not even high-integrity development standards and engineering guidelines claim to be such a panacea for problems of code copy-and-misuse.
Are you seriously telling me that, suddenly, now I’m responsible, on behalf of the final users, for **all** the possible contexts my answers may be used in?
Should we also check students’ answers to exams to see if they are secure? 🙂
The infosec world is all about advertisement; you should have the balls to reply with the obvious flaws in the researchers’ methodology.
Really, I never thought that code on Stack Overflow could also be malicious.