
Where the bugs are

Published: 01 July 2004
Abstract

The ability to predict which files in a large software system are most likely to contain the largest numbers of faults in the next release can be a very valuable asset. To accomplish this, a negative binomial regression model using information from previous releases has been developed and used to predict the numbers of faults for a large industrial inventory system. The files of each release were sorted in descending order based on the predicted number of faults and then the first 20% of the files were selected. This was done for each of fifteen consecutive releases, representing more than four years of field usage. The predictions were extremely accurate, correctly selecting files that contained between 71% and 92% of the faults, with the overall average being 83%. In addition, the same model was used on data for the same system's releases, but with all fault data prior to integration testing removed. The prediction was again very accurate, ranging from 71% to 93%, with the average being 84%. Predictions were made for a second system, and again the first 20% of files accounted for 83% of the identified faults. Finally, a highly simplified predictor was considered which correctly predicted 73% and 74% of the faults for the two systems.
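The evaluation described in the abstract (rank files by predicted fault count, take the first 20%, and measure what share of the actual faults they contain) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `top_fraction_fault_share`, the rounding rule used to pick the cutoff, and the ten-file data set are all assumptions made for the example.

```python
def top_fraction_fault_share(predicted, actual, fraction=0.20):
    """Return the share of actual faults found in the top `fraction`
    of files when files are ranked by predicted fault count."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    total_faults = sum(actual)
    if total_faults == 0:
        return 0.0
    # Rank file indices by predicted fault count, highest first.
    order = sorted(range(len(predicted)),
                   key=lambda i: predicted[i], reverse=True)
    # Number of files to select; rounding to the nearest whole file
    # is an assumption, the abstract does not state the rule.
    k = max(1, int(round(fraction * len(order))))
    captured = sum(actual[i] for i in order[:k])
    return captured / total_faults


# Hypothetical predictions and observed fault counts for ten files.
predicted = [5.2, 0.1, 3.3, 0.4, 0.2, 0.9, 0.1, 0.3, 1.5, 0.05]
actual = [6, 0, 3, 0, 1, 0, 0, 0, 2, 0]

share = top_fraction_fault_share(predicted, actual)
print(f"Top 20% of files contain {share:.0%} of the faults")
```

Note that the metric depends only on the ordering of the predictions, so a model can score well here even when its per-file fault counts are inaccurate.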

    Reviews

    Timothy R. Hopkins

    This interesting paper reports on the use of a negative binomial regression model to predict which files in a large software system are most likely to contain faults in the next release. The model was developed by studying a large, evolving inventory system over a four-year period, during which there were 17 releases. Fault data was available for all stages of development, from requirements to general release. During this time, the number of files in the system grew from 584 to 1,950, and the total number of lines of code (LOC) grew from 145,000 to 538,000. The model characterizes each file using a combination of continuous and categorical predictors, including LOC, the number of faults reported in previous releases, whether the file is new for the current release or has been changed, and programming language.

    First, data from releases 1 to n-1 was used to predict the number of faults likely to occur in each file at release n, for n=1, ..., 12, and the files were then ordered by predicted fault count. On average, 80 percent of the faults identified in a release occurred in the top 20 percent of the ordered files. In the later releases, all of the faults actually occurred in less than ten percent of the files. The coefficients estimated from releases 1 to 12 were then applied to releases 13 to 17 as an independent validation of the model's predictive accuracy. For these releases, the average percentage of faults contained in the first 20 percent of the files selected by the model rose to 89 percent. It should be noted that, in general, the predicted fault counts did not match the actual fault counts for individual files, but the majority of faults did appear in the files at the top of the predicted list. Further results showed that the predictive quality of the model was hardly affected when only faults detected during the integration testing phase or later were used.

    Results obtained from a simplified model, using only LOC as the predictor, were also given. The simplified model proved less accurate; the full model's advantage came from its ability to identify small and medium-sized fault-prone files.

    This is a well-written paper that is easy to follow and understand. It should be read by anyone who wants pointers on where to focus their testing efforts, and in particular by all managers of large software projects. The only practical problem is the level of statistical know-how required to implement the full model, but the payback appears to make the effort worthwhile.
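The limitation of a LOC-only predictor noted in the review can be illustrated with a small sketch. Everything here is hypothetical: the `FileRecord` type, the helper `loc_only_selection`, and the file sizes and fault counts are invented for illustration and are not data from the paper.

```python
from typing import NamedTuple


class FileRecord(NamedTuple):
    name: str
    loc: int      # lines of code in the file
    faults: int   # faults actually observed in the release


def loc_only_selection(files, fraction=0.20):
    """Rank files by size alone (largest first) and return the top
    `fraction` of them. This stands in for a simplified predictor
    that uses only LOC; the cutoff rounding is an assumption."""
    ranked = sorted(files, key=lambda f: f.loc, reverse=True)
    k = max(1, int(round(fraction * len(ranked))))
    return ranked[:k]


# Hypothetical release: file "e" is small but fault-prone.
files = [
    FileRecord("a", 4000, 5), FileRecord("b", 3500, 0),
    FileRecord("c", 1200, 1), FileRecord("d", 900, 0),
    FileRecord("e", 300, 4), FileRecord("f", 250, 0),
    FileRecord("g", 200, 0), FileRecord("h", 150, 0),
    FileRecord("i", 100, 0), FileRecord("j", 50, 0),
]

selected = loc_only_selection(files)
selected_names = [f.name for f in selected]
total_faults = sum(f.faults for f in files)
loc_share = sum(f.faults for f in selected) / total_faults
print(selected_names, f"{loc_share:.0%} of faults captured")
```

In this invented data set the size-only ranking selects the two largest files and misses the small file with four faults. A model that also uses fault history or change status could, in principle, rank such a file higher.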

    • Published in

      ACM SIGSOFT Software Engineering Notes, Volume 29, Issue 4
      July 2004
      284 pages
      ISSN: 0163-5948
      DOI: 10.1145/1013886

      ISSTA '04: Proceedings of the 2004 ACM SIGSOFT international symposium on Software testing and analysis
      July 2004
      294 pages
      ISBN: 1581138202
      DOI: 10.1145/1007512

      Copyright © 2004 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • article
