Performance bugs are parts of program source code that are unnecessarily inefficient and that affect perceived software quality similarly to functional bugs. However, compared to functional bugs, there are (as of 2019) fewer empirical studies on performance bugs, and they cover significantly fewer subjects. As a consequence, while many approaches for detecting and localizing a variety of performance bugs have been developed in recent years, their efficacy has usually been evaluated on a relatively small set of bug instances. We therefore investigated more than 700 commits across 13 C/C++ projects to provide a dataset of real-world performance bugs, grouped by project here and by pattern here. The patterns provide an abstract semantic classification of how performance bugs are fixed. A detailed discussion of this classification can be found in our paper, which will shortly appear in the conference proceedings of ISSRE 2019.
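To give a flavor of what such a fix pattern looks like, here is a hypothetical example (in Python, for brevity; the dataset itself covers C/C++ projects, and this example is illustrative only, not taken from the dataset): a loop-invariant computation is hoisted out of the loop.

```python
# Hypothetical illustration of one common performance-bug fix pattern:
# hoisting a loop-invariant computation out of a loop.
# (Illustrative only; not taken from the dataset.)

def total_discounted_slow(prices, rates):
    # Performance bug: the discount factor is recomputed on every
    # iteration, although it does not depend on the loop variable.
    total = 0.0
    for p in prices:
        factor = 1.0
        for r in rates:
            factor *= (1.0 - r)
        total += p * factor
    return total

def total_discounted_fast(prices, rates):
    # Fixed version: the invariant factor is computed once, before the loop.
    factor = 1.0
    for r in rates:
        factor *= (1.0 - r)
    return sum(p * factor for p in prices)
```

Both functions return the same result; the fix only changes the cost from O(|prices| x |rates|) to O(|prices| + |rates|), which is exactly the kind of semantics-preserving change the patterns classify.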
The dataset on these pages can be used 1) to assess how well the current state of the art in performance bug detection and localization aligns with performance bugs that get fixed in practice, 2) as a larger corpus against which to evaluate performance bug detection and localization approaches, and 3) as the basis for further research, such as the simulation of performance bugs via code mutation.
More details can be found in our paper, which we will link from here as soon as it is published.
The 13 projects investigated for our study are:
1: The investigation was started when the LLVM community had not yet migrated from SVN to GitHub and still used an unofficial mirror on GitHub. The up-to-date official repository is at https://github.com/llvm/llvm-project.
The total number of commits matched by each keyword (as discussed in our paper) is:
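To sketch how such keyword-based matching over commit messages can work, the following minimal Python example counts, per keyword, the commit messages that mention it. Both the keyword list and the messages below are made-up placeholders; the actual keywords are those discussed in the paper.

```python
# Sketch of keyword-based commit matching. The keywords and commit
# messages here are illustrative assumptions, not the study's actual data.

def count_keyword_matches(commit_messages, keywords):
    """Return, for each keyword, how many commit messages mention it
    (case-insensitive substring match)."""
    counts = {kw: 0 for kw in keywords}
    for msg in commit_messages:
        lower = msg.lower()
        for kw in keywords:
            if kw in lower:
                counts[kw] += 1
    return counts

if __name__ == "__main__":
    messages = [
        "Fix slow lookup in symbol table",
        "Optimize memory allocation in parser",
        "Add unit tests for lexer",
        "perf: avoid redundant hash computation",
    ]
    keywords = ["slow", "optimiz", "perf"]
    print(count_keyword_matches(messages, keywords))
```

In practice one would feed in real commit messages, e.g. obtained via `git log --format=%s`, and then manually inspect the matched commits, since keyword matching alone yields many false positives.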
Threats to validity
The categorization of performance bugs into semantic patterns was derived during a continuous work period spanning 3 months, during which the concept of each category was repeatedly refined. While the categorization had stabilized towards the end of that period and we hope it remains stable well beyond it, we cannot rule out errors, both in our system of patterns and in the classification of bugs according to these patterns. If you encounter any issues with the provided dataset, such as categorization or other errors, please feel free to contact us. Similarly, if you use our dataset, please let us know so that we can refer back to your work. We also welcome community contributions to our dataset of any kind. If you would like to add more performance bugs, report on or investigate the reproducibility of our results, or add information on the effects that the bugs in our dataset have, please get in touch with us. Our contact data can be found here.