The other day I stumbled across a research paper entitled Digging the Development Dust for Refactorings, which addresses software repository data mining. Specifically, the paper identifies four types of data that can be mined (source-code metrics, identifiers, ROI estimates, and design differencing) and examines their use in reconstructing the refactoring history of a software project. Which project, you ask? HtmlUnit!
From the abstract:
Software repositories are rich sources of information about the software development process. Mining the information stored in them has been shown to provide interesting insights into the history of the software development and evolution. Several different types of information have been extracted and analyzed from different points of view. However, these types of information have not been sufficiently cross-examined to understand how they might complement each other. In this paper, we present a systematic analysis of four aspects of the software repository of an open source project — source-code metrics, identifiers, return-on-investment estimates, and design differencing — to collect evidence about refactorings that may have happened during the project development. In the context of this case study, we comparatively examine how informative each piece of information is towards understanding the refactoring history of the project and how costly it is to obtain.
The authors evaluate their proposed refactoring detection methodology by trying it out on the HtmlUnit repository:
To evaluate the effectiveness of our lightweight refactoring method, we examined an open-source system HTMLUnit. HTMLUnit is a realistic representative example of open-source development. There are nine releases in its history from May 22, 2002 to March 17, 2005. It is quite well documented; in fact, examining the log comments in its CVS-repository history, we found many references to refactorings and their rationale, which is critical for our understanding of the system lifecycle.
So we get kudos on our commit logs. Continuing on into the conclusion:
Based on our HTMLUnit case study, we have found that a heuristic combination of source-code metrics and identifiers-movement analysis — using information easily available on any repository platform — can be quite effective in recovering specific refactorings in the software evolutionary lifecycle, albeit not as accurate as structural analysis of the logical system design and less computationally intensive. An even more interesting finding was that the refactorings omitted by the developers in the system’s documentation were found to be “bad investments of development time” according to our ROI estimate, which implies that developers’ documentation is a good description of the developers’ intention if not of their actual work.
Apparently the authors’ analysis identified 11 refactorings, three of which were not documented in the commit logs. These same three undocumented refactorings were also found to have negative ROIs and less than 50% relevance. The assumption made by the authors is that these three refactorings were accidental: code cleanups or bug fixes that got a little too bloated. So basically we’re pretty good about documenting our refactorings, except when we “accidentally refactor”. Interesting stuff!
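To make the "documented vs. undocumented" distinction concrete, here is a rough sketch of how one might cross-check detected refactorings against commit messages. This is not the authors' method: the keyword list, the function names, and the sample data are all invented for illustration, and a real analysis would need something far more careful than a regex.

```python
import re

# Hypothetical keyword list; the paper does not publish one.
KEYWORDS = re.compile(r"\b(refactor|renam|extract|clean(ed)?\s?up)", re.IGNORECASE)


def mentions_refactoring(message):
    """Return True if a commit message appears to mention a refactoring."""
    return KEYWORDS.search(message) is not None


def undocumented_refactorings(detected, log_messages):
    """Return the detected refactorings that no associated commit message mentions.

    detected: dict mapping a refactoring label to the commit ids it was found in.
    log_messages: dict mapping a commit id to its log message.
    """
    missing = []
    for label, commit_ids in detected.items():
        documented = any(
            mentions_refactoring(log_messages.get(cid, "")) for cid in commit_ids
        )
        if not documented:
            missing.append(label)
    return missing


# Invented sample data, not taken from the HtmlUnit repository.
log_messages = {
    "r1": "Refactored page loading into a separate class",
    "r2": "Fixed bug in form submission",
}
detected = {
    "extract class": ["r1"],
    "move method": ["r2"],
}

print(undocumented_refactorings(detected, log_messages))  # ['move method']
```

In this toy data the "move method" change rode along with a bug fix and never got mentioned in the log, which is roughly the "accidental refactoring" scenario described above.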
C. Schofield, B. Tansey, Z. Xing, and E. Stroulia, Digging the Development Dust for Refactorings, Proc. of the 14th International Conference on Program Comprehension, Athens, Greece, June 14-16, 2006.