DOI: 10.1145/3299869.3314050
research-article
Open Access

Data Platform for Machine Learning

Published: 25 June 2019

ABSTRACT

In this paper, we present MLdp, a purpose-built data management system for all machine learning (ML) datasets. ML applications pose requirements that differ from those of conventional data processing applications, including but not limited to: data lineage and provenance tracking, rich data semantics and formats, integration with diverse ML frameworks and access patterns, trial-and-error-driven data exploration and evolution, rapid experimentation, reproducibility of model training, and strict compliance and privacy regulations. Current ML systems and services, often named MLaaS, have to date focused on ML algorithms and offer no integrated data management system. Instead, they require users to bring their own data and to manage it on either blob storage or file systems. The burden of data management tasks, such as versioning and access control, falls on the users, and not all compliance features, such as terms of use, privacy measures, and auditing, are available. MLdp offers a minimalist and flexible data model for all varieties of data, strong version management to guarantee reproducibility of ML experiments, and integration with major ML frameworks. MLdp also maintains data provenance to help users track lineage and dependencies among data versions and models in their ML pipelines. In addition to table-stakes features, such as security, availability, and scalability, MLdp's internal design choices are strongly influenced by the goal of supporting rapid ML experiment iterations, which cycle through data discovery, data exploration, feature engineering, model training, model evaluation, and back to data discovery.
The contributions of this paper are: 1) to recognize the needs and to call out the requirements of an ML data platform, 2) to share our experiences in building MLdp by adapting existing database technologies to the new problem as well as by devising new solutions, and 3) to call for action from our communities on future challenges.
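
The abstract does not describe MLdp's interfaces, so the following is only a minimal sketch of the two ideas it names: immutable dataset versions for reproducibility, and provenance links among data versions and models. Every name here (`DatasetVersion`, `Catalog`, `commit`, `record_training`, `model_lineage`) is hypothetical and is not taken from MLdp; the content-hash version id is likewise an assumption chosen for illustration.

```python
from __future__ import annotations

import hashlib
import json
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """An immutable dataset snapshot (hypothetical structure, not MLdp's)."""

    name: str
    records: tuple[str, ...]
    parents: tuple[str, ...] = ()  # ids of the versions this one was derived from

    @property
    def version_id(self) -> str:
        # Assumption: identify a version by a hash of its content and parents,
        # so identical snapshots always get the same id.
        payload = json.dumps(
            {"name": self.name, "records": self.records, "parents": self.parents},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class Catalog:
    """In-memory store of versions plus model-to-data provenance links."""

    def __init__(self) -> None:
        self._versions: dict[str, DatasetVersion] = {}
        self._model_inputs: dict[str, str] = {}

    def commit(self, name: str, records: list[str], parents: list[str] | None = None) -> str:
        """Store a new immutable snapshot and return its version id."""
        version = DatasetVersion(name, tuple(records), tuple(parents or ()))
        for parent in version.parents:
            if parent not in self._versions:
                raise KeyError(f"unknown parent version: {parent}")
        self._versions[version.version_id] = version
        return version.version_id

    def get(self, version_id: str) -> DatasetVersion:
        return self._versions[version_id]

    def record_training(self, model_name: str, version_id: str) -> None:
        """Remember which exact dataset version a model was trained on."""
        if version_id not in self._versions:
            raise KeyError(f"unknown version: {version_id}")
        self._model_inputs[model_name] = version_id

    def lineage(self, version_id: str) -> list[str]:
        """Return the version followed by its ancestors, nearest first."""
        order: list[str] = []
        seen = {version_id}
        queue = deque([version_id])
        while queue:
            current = queue.popleft()
            order.append(current)
            for parent in self._versions[current].parents:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return order

    def model_lineage(self, model_name: str) -> list[str]:
        """Trace a model back through every dataset version it depends on."""
        return self.lineage(self._model_inputs[model_name])


catalog = Catalog()
raw = catalog.commit("images", ["a.jpg", "b.jpg"])
clean = catalog.commit("images", ["a.jpg"], parents=[raw])
catalog.record_training("classifier-v1", clean)
print(catalog.model_lineage("classifier-v1"))  # the cleaned version, then the raw one
```

Because versions are never mutated in this sketch, rerunning training against a recorded version id always reads the same records, and the parent links let a user walk from a model back to the raw data it was derived from.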

