Abstract
Studies on fault prediction often pay little attention to empirical rigor and presentation. Researchers might not have full command over the statistical method they use, full understanding of the data they have, or tend not to report key details about their work. What does it happen when we want to compare such studies for b.ilding a theory on fault prediction? There are two issues that if not addressed, we b.lieve, prevent b.ilding such theory. The first concerns how to compare and report prediction performance across studies on different data sets. The second regards fitting performance of prediction models. Studies tend not to control and report the performance of predictors on historical data underestimating the risk that good predictors may poorly perform on past data. The degree of b.th fitting and prediction performance determines the risk managers are requested to take when they use such predictors. In this work, we propose a framework to compare studies on categorical fault prediction that aims at addressing the two issues. We propose three algorithms that automate our framework. We finally review baseline studies on fault prediction to discuss the application of the framework.