Abstract
Large-scale machine learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. Hence, it is crucial for performance to fit the data into single-node or distributed main memory to enable fast matrix-vector operations. General-purpose, heavyweight compression techniques struggle to achieve both good compression ratios and fast decompression speed for block-wise uncompressed operations. Therefore, we introduce Compressed Linear Algebra (CLA) for lossless matrix compression. CLA encodes matrices with lightweight, value-based compression techniques and executes linear algebra operations directly on the compressed representations. We contribute effective column compression schemes, cache-conscious operations, and an efficient sampling-based compression algorithm. Our experiments show good compression ratios and operation performance close to the uncompressed case, which enables fitting larger datasets into available memory. We thereby obtain significant end-to-end performance improvements.
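The central idea of executing linear algebra operations directly on compressed data can be illustrated with a small sketch. The Java class below is our own simplified illustration, not the CLA/SystemML implementation; the class and method names are hypothetical. It encodes a single column as a dictionary of distinct values with per-value row-offset lists, and computes that column's contribution to a matrix-vector product with one multiplication per distinct value instead of one per row, which is where repeated values translate into saved work.

```java
import java.util.Arrays;

public class ClaSketch {
    // Hypothetical offset-list encoding of one column:
    // dict[k] is the k-th distinct value, offsets[k] lists the rows containing it.
    static class CompressedColumn {
        final double[] dict;
        final int[][] offsets;
        CompressedColumn(double[] dict, int[][] offsets) {
            this.dict = dict;
            this.offsets = offsets;
        }
        // y += column * xCol: pre-scale each distinct value once,
        // then scatter the result to the rows in its offset list.
        void addMultInto(double[] y, double xCol) {
            for (int k = 0; k < dict.length; k++) {
                double val = dict[k] * xCol;   // one multiply per distinct value
                for (int row : offsets[k])
                    y[row] += val;             // cheap additions per occurrence
            }
        }
    }

    public static void main(String[] args) {
        // Column with repeated values: [7, 7, 0.5, 7, 0.5]
        CompressedColumn c = new CompressedColumn(
            new double[]{7.0, 0.5},
            new int[][]{{0, 1, 3}, {2, 4}});
        double[] y = new double[5];
        c.addMultInto(y, 2.0);                  // multiply by x[col] = 2
        System.out.println(Arrays.toString(y)); // [14.0, 14.0, 1.0, 14.0, 1.0]
    }
}
```

A full compressed matrix-vector multiply would loop this pattern over all column groups and accumulate into the same output vector; the sketch omits column co-coding, cache-conscious blocking, and the sampling-based plan selection described in the abstract.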