Performance Bugs

Performance bugs are pieces of program source code that are unnecessarily inefficient and that affect perceived software quality much as functional bugs do. However, compared to functional bugs, there are (as of 2019) fewer empirical studies on performance bugs, and they cover significantly fewer subjects. As a consequence, while many approaches for detecting and localizing a variety of performance bugs have been developed in recent years, their efficacy has usually been evaluated on a relatively small set of bug instances. Therefore, we investigated more than 700 commits across 13 C/C++ projects to provide a dataset of real-world performance bugs, grouped by projects here and by patterns here. The patterns provide an abstract semantic classification of how performance bugs are fixed. A detailed discussion of this classification can be found in our paper, which will shortly appear in the conference proceedings of ISSRE 2019.

The dataset on these pages can be used 1) to assess the alignment of the current state of the art in performance bug detection and localization with performance bugs that get fixed in practice, 2) as a larger corpus to evaluate performance bug detection and localization approaches against, and 3) as the basis for further research, such as the simulation of performance bugs via code mutation.

More details can be found in our paper, which we will link from here as soon as it is published.

Projects

The 13 projects investigated for our study are:

Project         Repository
NetworkManager  https://github.com/NetworkManager/NetworkManager
pulseaudio      https://github.com/pulseaudio/pulseaudio
grep            http://git.savannah.gnu.org/cgit/grep.git/
rsyslog         https://github.com/rsyslog/rsyslog
lvm2            https://github.com/lvmteam/lvm2
llvm            https://github.com/llvm-mirror/llvm [1]
git             https://github.com/git/git
clang           https://github.com/llvm-mirror/clang [1]
gecko-dev [2]   https://github.com/mozilla/gecko-dev
openssl         https://github.com/openssl/openssl
systemd         https://github.com/systemd/systemd
libgcrypt       https://github.com/gpg/libgcrypt
linux           https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

1: Our investigation began before the LLVM community migrated from SVN to GitHub, so the study uses an unofficial mirror on GitHub. The up-to-date official repository is at https://github.com/llvm/llvm-project.

2: Firefox

The total number of commits matched by each keyword (as discussed in our paper) is:

Project         fast  latenc  optimiz  accelerate  efficient  contention  performance  speed up  slow  Total
NetworkManager    38       0      121           0         20           0           15         3    21    209
pulseaudio         6      75       16           0          3           0            6         1     6    106
grep              38       1       31           0          5           0           59        16    20    123
rsyslog           37       0       51           0          2           3           25         3    21    136
lvm2              35       2       19           1         12           3           23         0    39    123
llvm            1017     222     2389           3        399           3          625        39   328   4567
git              433      14      284           2        127           9          287        36   126   1107
clang            195       1      457           2         72           3          148         5    48    860
gecko-dev        946     109     1230          87        261          11         1231        81   633   4329
openssl           24       2       68           0         15           1           68         3    13    169
systemd           69      12      132           0         16           0           31         7    89    327
libgcrypt         70       0       50          17          4           0           24         5    11    145
linux           5936    1782     3746         249       1411         565         4844       351  3392  18975
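
The keyword counts above could be reproduced along the following lines. This is only a minimal sketch under stated assumptions, not the procedure used in the study: it assumes that each keyword stem from the table header is matched as a case-insensitive substring of the commit message, and that the Total column counts every matching commit once, however many keywords it contains (which would be consistent with Total being smaller than the row sum in the table). The function name and the sample messages are hypothetical.

```python
# Keyword stems copied from the table header. Treating them as
# case-insensitive substrings is an assumption made for this sketch.
KEYWORDS = ["fast", "latenc", "optimiz", "accelerate", "efficient",
            "contention", "performance", "speed up", "slow"]


def count_keyword_matches(messages):
    """Return (per-keyword commit counts, number of distinct matching commits)."""
    per_keyword = {kw: 0 for kw in KEYWORDS}
    total = 0
    for message in messages:
        text = message.lower()
        matched = [kw for kw in KEYWORDS if kw in text]
        for kw in matched:
            per_keyword[kw] += 1
        if matched:
            # A commit that matches several keywords still counts once here.
            total += 1
    return per_keyword, total


# Hypothetical commit messages, invented purely for illustration.
sample_messages = [
    "Optimize lookup to speed up startup",
    "Fix typo in README",
    "Reduce latency of slow path",
    "Improve performance of hash table",
]

counts, total = count_keyword_matches(sample_messages)
print(counts)
print(total)
```

Plain substring matching is deliberately coarse: "fast" would also match "breakfast", for example. Keyword hits of this kind are therefore best treated as candidates that still need manual inspection.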

Threats to validity

The categorization of performance bugs by semantic pattern was derived during a continuous work period of three months, and the concept behind each category changed repeatedly during that time. While we hope that the stability our categorization reached towards the end of that period will last well beyond it, we cannot rule out errors, both in our system of patterns and in the classification of bugs according to these patterns. If you encounter any issues with the provided dataset, such as categorization or other errors, please feel free to contact us. Similarly, if you use our dataset, please let us know so that we can refer back to your work. We also welcome community contributions of any kind to our dataset. If you would like to add more performance bugs, report on or investigate the reproducibility of our results, or add information on the effects of the bugs in our dataset, please get in touch with us. Our contact data can be found here.