Program authorship attribution classifies programs by programmer, based on the consistent programmer hypothesis: programmers are unique, and this uniqueness can be observed in the code they write. Authorship attribution identifies characteristics of a programmer's code that remain constant across a significant portion of the programs that programmer writes. It has many applications, including plagiarism detection, malicious code analysis, author tracking, quality control, and ownership disputes.

Although authorship attribution is highly desirable, it is also hard to achieve effectively. The main problem is that issues with the training data can prevent attribution techniques from identifying the true author of a program. One issue is that the amount of code available for training may be small; too little code may not suffice to build a discriminating model of a programmer's coding characteristics. Another issue stems from redundant and irrelevant training data. For example, if a programmer has reused code written by others, the model we build cannot capture that programmer's coding characteristics accurately. Similarly, when programs have been altered by code formatters and pretty printers, fewer features remain with which to characterize an author. In our implementation we address these issues in two ways:

  • We combine stylistic and textual features. Specifically, we employ our style analysis system, which scans programs, detects stylistic features such as blank lines or keywords with whitespace only on the left-hand side, and increments a counter for each stylistic feature detected. Stylistic statistics are then computed from these counters. Furthermore, we extract all the comments from a program file's code and use R's n-gram analysis to tokenize, i.e., split, the text of each comment into a series of n-grams. For example, the comment “drawing the cursor” can be subdivided into three 1-grams (“drawing”, “the”, and “cursor”), two 2-grams (“drawing the” and “the cursor”), and one 3-gram (“drawing the cursor”). We then compute, for each program file in the analysis, the frequency with which every n-gram in the program corpus appears. The resulting frequency values indicate how often a particular n-gram appears in a program file written by a particular programmer. Additionally, as we process the comments, we record the common grammar and spelling issues of each programmer. For example, if we find that a programmer's comments contain many subject-verb agreement errors, we set the subject-verb agreement feature variable to 1 for that programmer. By combining stylistic and textual features, we have more information available for building a discriminating model of a programmer's coding characteristics: even from small samples we can produce many features, which increases the accuracy of the model.
  • We do not use all of the features we collect; instead, we keep a subset of features that minimizes the error rate by removing redundant and irrelevant features. Redundant features provide no more information than the currently selected features, and irrelevant features provide no useful information in any context. We perform feature selection to pick the most relevant features for model construction: we compute the information gain of each feature (the expected reduction in entropy, i.e., in the uncertainty associated with the author label) and rank all features by their information gain. By removing redundant and irrelevant features that do not increase information gain, we minimize the errors introduced by code reuse and code formatting.
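The n-gram extraction and frequency counting described in the first point can be sketched as follows. The paper uses R's n-gram tooling; this is an illustrative Python equivalent under our own assumptions, and the function names are ours:

```python
from collections import Counter

def comment_ngrams(comment, max_n=3):
    """Split a comment into word-level 1- to max_n-grams."""
    words = comment.split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

# Frequency of each n-gram across one (hypothetical) program file's comments.
comments = ["drawing the cursor"]
freq = Counter(g for c in comments for g in comment_ngrams(c))
```

On the example comment from the text, this yields the three 1-grams, two 2-grams, and one 3-gram listed above; in practice, `freq` would be computed per file and compared across a programmer's corpus.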
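The information-gain ranking in the second point can likewise be sketched. This is a minimal illustration, not the paper's implementation; the feature names and toy data are hypothetical, and it assumes features have already been discretized:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of author labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Expected reduction in label entropy after splitting on a feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy example: two authors, two candidate features (names are hypothetical).
labels = ["alice", "alice", "bob", "bob"]
features = {
    "blank_line_ratio_high": [1, 1, 0, 0],  # separates the two authors
    "uses_tabs":             [1, 0, 1, 0],  # tells us nothing about the author
}
ranked = sorted(features,
                key=lambda f: information_gain(features[f], labels),
                reverse=True)
```

Here the irrelevant `uses_tabs` feature contributes no information gain and would be dropped, while the discriminating feature ranks first and is kept.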