ABSTRACT
Like most conventional software, spreadsheets are subject to software evolution. However, spreadsheet evolution is rarely assisted by version management tools. As a result, the version information across evolved spreadsheets is often missing or highly fragmented. This makes it difficult for users to notice the evolution issues arising from their spreadsheets.
In this paper, we propose a semi-automated approach that leverages spreadsheets' contexts (e.g., the emails they are attached to) and contents to identify evolved spreadsheets and recover their embedded version information. We apply it to the released email archive of the Enron Corporation and build an industrial-scale, versioned spreadsheet corpus, VEnron. Our approach first clusters spreadsheets that likely evolved from one another into evolution groups, based on fragmented information such as spreadsheet filenames, spreadsheet contents, and the emails to which the spreadsheets are attached. Then, it recovers the version information of the spreadsheets in each evolution group. VEnron enables us to identify interesting issues that can arise from spreadsheet evolution. For example, versioned spreadsheets are prevalent in the Enron email archive; changes in formulas are common; and some evolution groups (16.9%) introduce new errors during evolution.
To the best of our knowledge, VEnron is the first spreadsheet corpus with version information. It provides a valuable resource for understanding the issues that arise from spreadsheet evolution.
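As a rough illustration of the filename-based part of the grouping step, the sketch below buckets spreadsheets whose names differ only in version-like tokens. It is a minimal sketch under stated assumptions, not the approach used to build VEnron: the token list, the `normalize_name` and `group_by_name` helpers, and the sample filenames are all hypothetical, and the content-based and email-based signals mentioned in the abstract are not modeled.

```python
import re
from collections import defaultdict

# Hypothetical version markers (assumption): "v2", "version 3", bare numbers
# of up to four digits, and a few common status words.
_VERSION_TOKENS = re.compile(
    r"\bv(?:er(?:sion)?)?\s*\d+\b"
    r"|\b\d{1,4}\b"
    r"|\b(?:final|draft|new|old|revised|copy)\b",
    re.IGNORECASE,
)


def normalize_name(filename: str) -> str:
    """Reduce a spreadsheet filename to a version-independent key."""
    # Drop the .xls / .xlsx extension.
    stem = re.sub(r"\.xlsx?$", "", filename, flags=re.IGNORECASE)
    # Treat common separators as spaces so tokens sit on word boundaries.
    stem = re.sub(r"[_\-.()]+", " ", stem)
    # Remove the version-like tokens.
    stem = _VERSION_TOKENS.sub(" ", stem)
    # Lowercase and collapse whitespace.
    return " ".join(stem.lower().split())


def group_by_name(filenames):
    """Bucket filenames by normalized key; keep only buckets with 2+ files."""
    groups = defaultdict(list)
    for name in filenames:
        groups[normalize_name(name)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    sample = ["budget_v1.xls", "budget_v2.xls", "Budget final.xls", "forecast.xls"]
    for key, names in group_by_name(sample).items():
        print(key, names)
```

Filename similarity alone would likely both over-merge and under-merge, which is consistent with the abstract's description of the approach as semi-automated and as combining several sources of information.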
Index Terms
- VEnron: a versioned spreadsheet corpus and related evolution analysis