Every programmer learns to code in a unique way, which results in distinguishing “fingerprints” in coding style. These fingerprints can be used to compare the source code of known programmers with an anonymous piece of source code and determine which of the known programmers wrote it. This method can aid in identifying malware authors or detecting cases of plagiarism. In a recent paper, we studied this problem, which we call source-code authorship attribution. We introduced a principled method with a robust feature set and achieved a breakthrough in accuracy.
Our results. We used a dataset of 250 programmers with an average of 630 lines of code per programmer. We combined lexical features (e.g., variable name choices), layout features (e.g., spacing), and syntactic features (i.e., the grammatical structure of source code), and reached 95% accuracy at attributing an anonymous piece of code to one of the 250 programmers. This is a significant improvement over prior work, which either handled far fewer candidate programmers or achieved much lower accuracy. The largest dataset used in previous work, in terms of number of programmers, had 46 programmers (the authors don’t state the number of lines of code), and the reported accuracy was 55%. Another study reached 97% accuracy, but on a smaller dataset of 30 programmers with an average of 1,910 lines of code per programmer.
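To make the lexical and layout categories concrete, here is a minimal sketch of how a few such features could be computed from a C++ source string. The specific features, the function name `layout_lexical_features`, and the short keyword list are assumptions chosen for illustration; they are not the feature set from the paper.

```python
import re

# A small, deliberately incomplete keyword list (assumption for this sketch).
CPP_KEYWORDS = {
    "int", "long", "double", "char", "bool", "void", "for", "while",
    "if", "else", "return", "const", "using", "namespace", "include",
}

def layout_lexical_features(source: str) -> dict:
    """Return a few illustrative layout and lexical features of C++ source text."""
    lines = source.splitlines()
    n_lines = max(len(lines), 1)
    n_chars = max(len(source), 1)

    # Crude identifier extraction: any word-like token that is not a keyword.
    # This also picks up words inside comments and string literals.
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    identifiers = [t for t in tokens if t not in CPP_KEYWORDS]
    n_ids = max(len(identifiers), 1)

    return {
        # Layout: how whitespace and braces are arranged.
        "tab_ratio": source.count("\t") / n_chars,
        "space_ratio": source.count(" ") / n_chars,
        "blank_line_ratio": sum(1 for l in lines if not l.strip()) / n_lines,
        "brace_on_own_line_ratio": sum(1 for l in lines if l.strip() == "{") / n_lines,
        # Lexical: naming habits.
        "avg_identifier_length": sum(len(t) for t in identifiers) / n_ids,
        "underscore_identifier_ratio": sum(1 for t in identifiers if "_" in t) / n_ids,
    }

sample = "int main()\n{\n\treturn 0;\n}\n"
print(layout_lexical_features(sample))
```

Each source file becomes one dictionary of numbers; a programmer who indents with tabs and puts opening braces on their own line ends up with a visibly different vector from one who uses spaces and same-line braces.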
Dataset. Google Code Jam is an annual international programming competition. It has thousands of participants from different backgrounds, such as professional programmers, students, and hobbyists. The solution files submitted by the contestants have been published on the website since 2008. We collected the C++ source code of more than 100,000 contestants, along with their usernames, from 2008 to 2014. We wanted to avoid the risk of learning properties specific to a problem’s possible solutions instead of a programmer’s coding style. Fortunately, in Google Code Jam, contestants try to solve the same sequence of problems to advance to more difficult rounds. This allowed us to construct experimental datasets in such a way that the training sets for each of the 250 programmers were solutions to the same tasks. The test set was a source code file not seen in any of the training sets.
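The idea of controlling for the task can be sketched as follows. The data layout (a dictionary keyed by `(author, task)`), the function name `build_split`, and the choice of holding out one whole task for testing are assumptions made for this illustration; the post only states that the training sets cover the same tasks for every programmer and that the test file is unseen.

```python
from collections import defaultdict

def build_split(submissions, train_tasks, test_task):
    """Split submissions so every kept author has the same training tasks.

    submissions: dict mapping (author, task) -> source text (hypothetical layout).
    Returns (train, test), each a list of (author, source) pairs.
    """
    by_author = defaultdict(dict)
    for (author, task), source in submissions.items():
        by_author[author][task] = source

    needed = set(train_tasks) | {test_task}
    train, test = [], []
    for author, solved in sorted(by_author.items()):
        # Drop authors who did not solve every required task, so that
        # differences between authors cannot be explained by different tasks.
        if not needed.issubset(solved):
            continue
        train.extend((author, solved[task]) for task in train_tasks)
        test.append((author, solved[test_task]))
    return train, test
```

Because every retained author contributes solutions to exactly the same problems, a classifier trained on this split has to rely on style rather than on which problem a file solves.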
Abstract syntax trees. Our work is an application of machine learning. Broadly, there are two steps: turning each input file into a vector of numerical features, followed by using a classifier that learns the patterns in each programmer’s feature vectors to classify a new, previously unseen vector. The key advance in our work is the use of a deeper set of structural features to represent coding style. In particular, we used syntactic features extracted from “abstract syntax trees” along with lexical and layout features directly extracted from source code. Abstract syntax trees of source code are analogous to “parse trees” of prose sentences. Prose authorship attribution that utilizes parse trees has been able to identify the author of an anonymous text from among 100,000 candidate authors 20% of the time.
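The second of the two steps, classifying a previously unseen feature vector, can be sketched in a few lines. The nearest-centroid classifier, the function names `train` and `predict`, and the toy two-dimensional vectors are assumptions made for this example; they stand in for whatever classifier and features are actually used and are not taken from the paper.

```python
import math
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train(labelled_vectors):
    """Learn one centroid per author from (author, feature_vector) pairs."""
    by_author = defaultdict(list)
    for author, vector in labelled_vectors:
        by_author[author].append(vector)
    return {author: centroid(vs) for author, vs in by_author.items()}

def predict(model, vector):
    """Attribute a vector to the author whose centroid is closest (Euclidean)."""
    return min(model, key=lambda author: math.dist(model[author], vector))

# Toy feature vectors, e.g. [tab_ratio, brace_on_own_line_ratio] (made up).
training = [
    ("alice", [0.10, 0.90]),
    ("alice", [0.20, 0.80]),
    ("bob",   [0.90, 0.10]),
    ("bob",   [0.80, 0.30]),
]
model = train(training)
print(predict(model, [0.25, 0.75]))
```

In this closed-world setup the classifier always answers with one of the known programmers, which matches the attribution task described above.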
The figures below show a code snippet and the corresponding abstract syntax tree.
![](https://www.cs.drexel.edu/~ac993/files/sourceCode.png)
![](https://www.cs.drexel.edu/~ac993/files/AST.png)
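To give a feel for what syntactic features look like, the sketch below turns an abstract syntax tree into numbers. It uses Python's standard `ast` module on Python snippets purely as a stand-in, since the study itself worked with C++; the function name `ast_features` and the two feature types (node-type frequencies and maximum tree depth) are assumptions for illustration.

```python
import ast
from collections import Counter

def ast_features(source: str) -> dict:
    """Return node-type frequencies and the maximum depth of the AST."""
    tree = ast.parse(source)

    # How often each kind of node (For, If, Call, ...) appears in the tree.
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())

    def depth(node):
        # Depth of a node is 1 plus the depth of its deepest child.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    features = {f"freq_{name}": n / total for name, n in counts.items()}
    features["max_depth"] = depth(tree)
    return features

# Two ways of writing the same computation produce different trees.
loop_style = "total = 0\nfor x in items:\n    total += x\n"
expr_style = "total = sum(x for x in items)\n"

print(sorted(ast_features(loop_style)))
print(sorted(ast_features(expr_style)))
```

The two snippets compute the same value but yield different node types, which is the kind of structural habit that survives changes to variable names or whitespace.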
What’s next. Despite the leap in source code authorship attribution accuracy, we believe that this is only a first step in code stylometry and this line of attack will yield many improvements. Just as linguistic stylometry has seen huge leaps in the last few years, a rigorous machine learning based approach can transform code stylometry. For example, adding control flow graph features could further boost accuracy.
Code stylometry has applications in security, privacy, software forensics, and software engineering. In a follow-up blog post, I’ll discuss how it can be used for various problems in different areas. The results I presented above pertain to the general case of a “closed world setting” with multiple programmers. I will conclude with one practical example of where this can be useful. If we have a set of programmers who we think might be Satoshi, and samples of source code from each of these programmers, we could use the initial versions of Bitcoin’s source code to try to determine Satoshi’s identity. Of course, this assumes that Satoshi didn’t make any attempt to obfuscate his or her coding style.