ABSTRACT
This tutorial discusses data-management issues that arise in machine learning pipelines deployed in production. Informed by our own experience with such large-scale pipelines, we focus on issues related to understanding, validating, cleaning, and enriching training data. The goal of the tutorial is to bring forth these issues, draw connections to prior work in the database literature, and outline the open research questions that prior art does not address.
Index Terms
- Data Management Challenges in Production Machine Learning