Data Platform for Machine Learning

Authors:
Pulkit Agrawal, Rajat Arya, Aanchal Bindal, Sandeep Bhatia, Anupriya Gagneja, Joseph Godlewski, Yucheng Low, Timothy Muss, Mudit Manu Paliwal, Sethu Raman, Vishrut Shah, Bochao Shen, Laura Sugden, Kaiyu Zhao, Ming-Chuan Wu

Affiliation (all authors): Apple Inc., Cupertino, CA, USA

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data, June 2019, Pages 1803–1816
https://doi.org/10.1145/3299869.3314050

Published: 25 June 2019

ABSTRACT

In this paper, we present MLdp, a purpose-built data management system for all machine learning (ML) datasets. ML applications pose requirements that differ from those of conventional data processing applications, including but not limited to: data lineage and provenance tracking, rich data semantics and formats, integration with diverse ML frameworks and access patterns, trial-and-error driven data exploration and evolution, rapid experimentation, reproducibility of model training, and strict compliance and privacy regulations. Current ML systems and services, often named MLaaS, have to date focused on the ML algorithms and offer no integrated data management system. Instead, they require users to bring their own data and to manage it on either blob storage or file systems. The burden of data management tasks, such as versioning and access control, falls on the users, and not all compliance features, such as terms of use, privacy measures, and auditing, are available. MLdp offers a minimalist and flexible data model for all varieties of data, strong version management to guarantee reproducibility of ML experiments, and integration with major ML frameworks. MLdp also maintains data provenance to help users track lineage and dependencies among data versions and models in their ML pipelines. In addition to table-stakes features, such as security, availability, and scalability, MLdp's internal design choices are strongly influenced by the goal of supporting rapid ML experiment iterations, which cycle through data discovery, data exploration, feature engineering, model training, model evaluation, and back to data discovery.
The contributions of this paper are: 1) to recognize the needs and call out the requirements of an ML data platform, 2) to share our experiences in building MLdp, both by adapting existing database technologies to the new problem and by devising new solutions, and 3) to call for action from our communities on future challenges.

Index Terms

Data Platform for Machine Learning
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        Record and block layout
    2. Database design and models
      1. Data model extensions
        Data provenance
        Data streams
        Semi-structured data
  2. Information storage systems
    1. Storage management
      1. Information lifecycle management
      2. Version management
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Data modeling

Published in
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data
June 2019
2106 pages
ISBN:9781450356435
DOI:10.1145/3299869
General Chairs: Peter Boncz (CWI & Vrije Universiteit Amsterdam, The Netherlands), Stefan Manegold (CWI & Universiteit Leiden, The Netherlands)
Program Chairs: Anastasia Ailamaki (EPFL, Switzerland), Amol Deshpande (University of Maryland, USA), Tim Kraska (MIT, USA)
Copyright © 2019 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data platform
data streaming access
data version control
dataset management for machine learning
physical data layout
Qualifiers
- research-article
Acceptance Rates
SIGMOD '19 Paper Acceptance Rate: 88 of 430 submissions, 20%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%