DOI: 10.1145/3035918.3054782
research-article
Open Access

Data Management Challenges in Production Machine Learning

Published: 09 May 2017

ABSTRACT

The tutorial discusses data-management issues that arise in the context of machine learning pipelines deployed in production. Informed by our own experience with such large-scale pipelines, we focus on issues related to understanding, validating, cleaning, and enriching training data. The goal of the tutorial is to bring forth these issues, draw connections to prior work in the database literature, and outline the open research questions that are not addressed by prior art.

      • Published in

        SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
        May 2017
        1810 pages
        ISBN: 9781450341974
        DOI: 10.1145/3035918

        Copyright © 2017 Owner/Author

        This work is licensed under a Creative Commons Attribution International 4.0 License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
