research-article

VEnron: a versioned spreadsheet corpus and related evolution analysis

Authors:
Wensheng Dou, Chinese Academy of Sciences, Beijing, China
Liang Xu, Chinese Academy of Sciences, Beijing, China
Shing-Chi Cheung, The Hong Kong University of Science and Technology, Hong Kong, China
Chushu Gao, Chinese Academy of Sciences, Beijing, China
Jun Wei, Chinese Academy of Sciences, Beijing, China
Tao Huang, Chinese Academy of Sciences, Beijing, China

ICSE '16: Proceedings of the 38th International Conference on Software Engineering Companion, May 2016, Pages 162–171
https://doi.org/10.1145/2889160.2889238

Published: 14 May 2016

ABSTRACT

Like most conventional software, spreadsheets are subject to software evolution. However, spreadsheet evolution is rarely assisted by version management tools. As a result, the version information across evolved spreadsheets is often missing or highly fragmented. This makes it difficult for users to notice the evolution issues arising from their spreadsheets.

In this paper, we propose a semi-automated approach that leverages spreadsheets' contexts (e.g., attached emails) and contents to identify evolved spreadsheets and recover the embedded version information. We apply it to the released email archive of the Enron Corporation and build VEnron, an industrial-scale, versioned spreadsheet corpus. Our approach first clusters spreadsheets that likely evolved from one another into evolution groups based on various fragmented information, such as spreadsheet filenames, spreadsheet contents, and the emails to which the spreadsheets are attached. Then, it recovers the version information of the spreadsheets in each evolution group. VEnron enables us to identify interesting issues that can arise from spreadsheet evolution. For example, versioned spreadsheets are prevalent in the Enron email archive; changes in formulas are common; and in some groups (16.9%), new errors are introduced during evolution.
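
The two steps described above (grouping spreadsheets that likely evolved from one another, then recovering an order among them) can be illustrated with a minimal Python sketch. It is not the authors' method: it uses only filename similarity and email sent time, whereas the paper also draws on spreadsheet contents and is semi-automated. The version-token pattern, the function names `normalize_name` and `group_and_order`, and the sample filenames are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical pattern for version-like filename tokens. This is an
# assumption for illustration, not the rule set used to build VEnron.
_VERSION_TOKEN = re.compile(r"(?:v\d+|\d+|final|draft|copy|rev)", re.IGNORECASE)


def normalize_name(filename):
    """Reduce a filename to a lowercase stem with version-like tokens removed."""
    stem = filename.rsplit(".", 1)[0].lower()
    tokens = re.split(r"[\s_\-.()]+", stem)
    kept = [t for t in tokens if t and not _VERSION_TOKEN.fullmatch(t)]
    return " ".join(kept)


def group_and_order(attachments):
    """Group (filename, sent_time) pairs by normalized stem.

    Returns a dict mapping each stem to its filenames ordered by sent time.
    Stems with fewer than two spreadsheets are dropped, since a single file
    carries no evolution information.
    """
    groups = defaultdict(list)
    for filename, sent_time in attachments:
        groups[normalize_name(filename)].append((sent_time, filename))
    return {
        stem: [name for _, name in sorted(members)]
        for stem, members in groups.items()
        if len(members) >= 2
    }


# Usage with made-up attachment records.
sample = [
    ("Gas_Forecast_v2.xls", datetime(2001, 3, 5)),
    ("Gas Forecast.xls", datetime(2001, 3, 1)),
    ("Gas-Forecast-final.xls", datetime(2001, 3, 9)),
    ("Payroll.xls", datetime(2001, 4, 1)),
]
evolution_groups = group_and_order(sample)
```

Ordering by sent time is only a proxy for version order; a spreadsheet can be re-sent long after it was last edited, which is one reason an approach like the one in the paper would need additional evidence beyond filenames and timestamps.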

To the best of our knowledge, VEnron is the first spreadsheet corpus with version information. It provides a valuable resource for understanding issues arising from spreadsheet evolution.

Published in
ICSE '16: Proceedings of the 38th International Conference on Software Engineering Companion
May 2016
946 pages
ISBN:9781450342056
DOI:10.1145/2889160
General Chair: Laura Dillon, Michigan State University
Program Chairs: Willem Visser, Stellenbosch University, South Africa; Laurie Williams, North Carolina State University
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
evolution
spreadsheet
version
Acceptance Rates
Overall Acceptance Rate: 276 of 1,856 submissions, 15%
