DOI: 10.1145/2889160.2889238

research-article

VEnron: a versioned spreadsheet corpus and related evolution analysis

Published: 14 May 2016

ABSTRACT

Like most conventional software, spreadsheets are subject to software evolution. However, spreadsheet evolution is rarely assisted by version management tools. As a result, the version information across evolved spreadsheets is often missing or highly fragmented. This makes it difficult for users to notice the evolution issues arising from their spreadsheets.

In this paper, we propose a semi-automated approach that leverages spreadsheets' contexts (e.g., attached emails) and contents to identify evolved spreadsheets and recover the embedded version information. We apply it to the released email archive of the Enron Corporation and build VEnron, an industrial-scale, versioned spreadsheet corpus. Our approach first clusters spreadsheets that likely evolved from one another into evolution groups, based on various fragments of information such as spreadsheet filenames, spreadsheet contents, and spreadsheet-attached emails. It then recovers the version information of the spreadsheets in each evolution group. VEnron enables us to identify interesting issues that can arise from spreadsheet evolution. For example, versioned spreadsheets are common in the Enron email archive; changes in formulas are frequent; and some evolution groups (16.9%) introduce new errors during evolution.
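The filename-based part of the clustering step can be sketched as a toy example. The version-marker patterns and the grouping logic below are illustrative assumptions, not the paper's actual implementation (which also uses spreadsheet contents and attached emails):

```python
import re
from collections import defaultdict

def normalize(filename):
    """Strip the extension and common version markers (e.g. 'v2', 'final',
    trailing dates) so that likely versions of one spreadsheet share a key.
    The patterns here are illustrative assumptions, not VEnron's own rules."""
    base = re.sub(r"\.xlsx?$", "", filename, flags=re.IGNORECASE)
    base = re.sub(
        r"[\s_-]*(v(er(sion)?)?\s*\d+|final|draft|"
        r"\d{1,2}[-_/]\d{1,2}([-_/]\d{2,4})?)\s*$",
        "", base, flags=re.IGNORECASE)
    return base.strip().lower()

def cluster_by_filename(filenames):
    """Group filenames whose normalized forms match into candidate
    evolution groups; singleton groups are discarded."""
    groups = defaultdict(list)
    for name in filenames:
        groups[normalize(name)].append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}

files = ["budget_v1.xls", "budget_v2.xls", "budget final.xls", "trades.xls"]
print(cluster_by_filename(files))
```

In this sketch, the three "budget" files fall into one candidate evolution group while the lone "trades.xls" is dropped; a real pipeline would then confirm candidates using content similarity and email context before recovering version order.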

To the best of our knowledge, VEnron is the first spreadsheet corpus with version information. It provides a valuable resource for understanding issues arising from spreadsheet evolution.


Published in

ICSE '16: Proceedings of the 38th International Conference on Software Engineering Companion
May 2016, 946 pages
ISBN: 9781450342056
DOI: 10.1145/2889160

Copyright © 2016 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 276 of 1,856 submissions, 15%
