ABSTRACT
Target audience: Software practitioners and researchers who want to understand the state of the art in applying data science to software engineering (SE).

Content: In the age of big data, data science (the knowledge of how to derive meaningful outcomes from data) is an essential skill for software engineers. It can be used to predict useful information about new projects from data on completed projects. This tutorial offers core insights into the state of the art in this important field.

What participants will learn:

Before data science: the tasks needed to deploy machine-learning algorithms in organizations (Part 1: Organization Issues).

During data science: from discretization to clustering to dichotomization and statistical analysis.

And the rest:
- When local data is scarce, we show how to adapt data from other organizations to local problems.
- When privacy concerns block access, we show how to privatize data while still being able to mine it.
- When working with data of dubious quality, we show how to prune spurious information.
- When data or models seem too complex, we show how to simplify data mining results.
- When data is too scarce to support intricate models, we show methods for generating predictions.
- When the world changes and old models need to be updated, we show how to handle those updates.
- When the effect is too complex for one model, we show how to reason across ensembles of models.

Pre-requisites: This tutorial makes minimal use of maths or advanced algorithms and should be understandable by developers and technical managers.