Where the bugs are

Authors:
Thomas J. Ostrand, Elaine J. Weyuker, Robert M. Bell
AT&T Labs - Research, Florham Park, NJ

ACM SIGSOFT Software Engineering Notes, Volume 29, Issue 4, July 2004, pp. 86–96. https://doi.org/10.1145/1013886.1007524

Published: 01 July 2004

Abstract

The ability to predict which files in a large software system are most likely to contain the largest numbers of faults in the next release can be a very valuable asset. To accomplish this, a negative binomial regression model using information from previous releases has been developed and used to predict the numbers of faults for a large industrial inventory system. The files of each release were sorted in descending order based on the predicted number of faults and then the first 20% of the files were selected. This was done for each of fifteen consecutive releases, representing more than four years of field usage. The predictions were extremely accurate, correctly selecting files that contained between 71% and 92% of the faults, with the overall average being 83%. In addition, the same model was used on data for the same system's releases, but with all fault data prior to integration testing removed. The prediction was again very accurate, ranging from 71% to 93%, with the average being 84%. Predictions were made for a second system, and again the first 20% of files accounted for 83% of the identified faults. Finally, a highly simplified predictor was considered which correctly predicted 73% and 74% of the faults for the two systems.
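
The evaluation procedure described in the abstract (sort files by predicted fault count, take the first 20%, and measure the share of actual faults they contain) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the file names, and the fault counts are all hypothetical, and the predicted values are assumed to come from some already-fitted model.

```python
from typing import Dict, List


def top_fraction_files(predicted: Dict[str, float], fraction: float = 0.20) -> List[str]:
    """Return the files in the top `fraction`, sorted by predicted faults (descending)."""
    ranked = sorted(predicted, key=lambda name: predicted[name], reverse=True)
    # Always select at least one file, even for very small systems.
    cutoff = max(1, round(len(ranked) * fraction))
    return ranked[:cutoff]


def percent_faults_captured(
    predicted: Dict[str, float], actual: Dict[str, int], fraction: float = 0.20
) -> float:
    """Percentage of all actual faults that fall in the selected files."""
    total = sum(actual.values())
    if total == 0:
        return 0.0
    selected = top_fraction_files(predicted, fraction)
    captured = sum(actual.get(name, 0) for name in selected)
    return 100.0 * captured / total


# Hypothetical data for one release: predicted and observed faults per file.
predicted_faults = {
    "a.c": 9.1, "b.c": 6.4, "c.c": 2.0, "d.c": 1.5, "e.c": 1.1,
    "f.c": 0.9, "g.c": 0.5, "h.c": 0.3, "i.c": 0.2, "j.c": 0.1,
}
actual_faults = {
    "a.c": 12, "b.c": 4, "c.c": 3, "d.c": 0, "e.c": 1,
    "f.c": 0, "g.c": 0, "h.c": 0, "i.c": 0, "j.c": 0,
}

print(top_fraction_files(predicted_faults))
print(percent_faults_captured(predicted_faults, actual_faults))
```

Passing file sizes in lines of code instead of model predictions as `predicted` gives a rough analogue of a size-only ranking; whether that matches the paper's simplified predictor in detail is an assumption.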

Index Terms

Where the bugs are
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging

Reviews

Reviewer: Timothy R. Hopkins

This interesting paper reports on the use of a negative binomial regression model to predict which files in a large software system are most likely to contain faults in the next release. The model was developed by studying a large, evolving inventory system over a four-year period, during which there were 17 releases. Fault data was available for all stages of development, from requirements to general release. During this time, the number of files constituting the system grew from 584 to 1,950, and the total number of lines of code (LOC) grew from 145,000 to 538,000. The model characterizes each file by a combination of continuous and categorical predictors, including LOC, the number of faults reported in previous releases, whether the file is new for the current release or has been changed, and programming language. First, data from releases 1 to n-1 was used to predict the number of faults likely to occur in each file at release n, for n=1, ..., 12, and the files were then ordered by predicted fault count. On average, 80 percent of the faults identified in a release occurred in the top 20 percent of the files in this ordering. In the later releases, all of the faults actually occurred in less than ten percent of the files. The coefficients estimated from releases 1 to 12 were then applied to releases 13 to 17 as an independent validation of the model's predictive accuracy. There, the average percentage of faults contained in the first 20 percent of the files selected by the model rose to 89 percent. It should be noted that, in general, the predicted fault counts did not match the actual fault counts on an individual file basis, but the majority of faults did appear in the set of files at the top of the predicted list. Further results showed that the predictive quality of the model was hardly affected when only faults detected during integration testing or later were used.
Results obtained from a simplified model, using only LOC as the predictor, were also given. This model proved to be less accurate, because the full model is able to identify small and medium-sized fault-prone files. This is a well-written paper that is easy to follow and understand. Anyone who wants pointers on where to focus testing effort should read it, as should all managers of large software projects. The only practical obstacle is the level of statistical know-how required to implement the full model, but the payback appears to make the effort worthwhile.
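
For readers unfamiliar with the model class the review refers to, a standard textbook formulation of negative binomial regression is sketched below. The review itself gives no equations, so the notation here is an assumption: $y_i$ is the fault count for file $i$, $x_i$ its vector of predictors (LOC, prior faults, file status, language), $\beta$ the coefficient vector, and $\gamma_i$ a random effect that lets the variance exceed the Poisson variance.

```latex
\begin{align*}
  y_i \mid \lambda_i &\sim \mathrm{Poisson}(\lambda_i), \\
  \lambda_i &= \gamma_i \, e^{\beta^{\top} x_i}, \\
  \gamma_i &\sim \mathrm{Gamma}\ \text{with}\ \mathrm{E}[\gamma_i] = 1,\ \mathrm{Var}[\gamma_i] = \sigma^2 \ge 0, \\
  \mathrm{E}[y_i \mid x_i] &= e^{\beta^{\top} x_i}, \qquad
  \mathrm{Var}[y_i \mid x_i] = e^{\beta^{\top} x_i} + \sigma^2 \, e^{2\beta^{\top} x_i}.
\end{align*}
```

When $\sigma^2 = 0$ this reduces to ordinary Poisson regression. Ranking files by the fitted mean $e^{\hat{\beta}^{\top} x_i}$ gives the ordering from which the top 20 percent of files would be taken.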

Published in
ACM SIGSOFT Software Engineering Notes Volume 29, Issue 4
July 2004
284 pages
ISSN:0163-5948
DOI:10.1145/1013886
ISSTA '04: Proceedings of the 2004 ACM SIGSOFT international symposium on Software testing and analysis
July 2004
294 pages
ISBN:1581138202
DOI:10.1145/1007512
General Chair: George Avrunin, University of Massachusetts, USA
Program Chair: Gregg Rothermel, University of Nebraska -- Lincoln, USA
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
empirical study
fault-prone
prediction
regression model
software faults
software testing
Qualifiers
- article