ABSTRACT
In this paper we argue that systems for numerical computing are stuck in a local basin of performance and programmability. Systems researchers are doing an excellent job improving the performance of 5-year-old benchmarks, but in doing so are gradually making it harder to explore innovative machine learning research ideas.
We explain how the evolution of hardware accelerators favors compiler back ends that hyper-optimize large monolithic kernels, show how this reliance on high-performance but inflexible kernels reinforces the dominant style of programming model, and argue that these programming abstractions lack expressiveness, maintainability, and modularity, all of which hinder research progress.
We conclude by noting promising directions in the field, and advocate steps to advance progress towards high-performance general purpose numerical computing systems on modern accelerators.
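The pressure toward monolithic kernels that the abstract describes can be illustrated with a toy sketch in plain Python (all function names here are hypothetical, for illustration only): composing small elementwise operations makes one pass over memory per operation, while a hand-fused kernel produces the same result in a single pass. Production systems such as cuDNN and XLA apply this fusion idea at far larger scale, which is precisely why their pre-fused kernels are hard to deviate from.

```python
# Toy illustration of kernel fusion (hypothetical helper names).
# Composing three small ops reads and writes the data three times;
# the fused version makes a single pass.

def scale(xs, a):
    # Pass 1: elementwise multiply.
    return [a * x for x in xs]

def shift(xs, b):
    # Pass 2: elementwise add.
    return [x + b for x in xs]

def relu(xs):
    # Pass 3: elementwise max(0, x).
    return [x if x > 0 else 0.0 for x in xs]

def fused_scale_shift_relu(xs, a, b):
    # One pass: each element is read once, transformed, and written once.
    return [max(a * x + b, 0.0) for x in xs]

data = [-2.0, -1.0, 0.5, 3.0]
composed = relu(shift(scale(data, 2.0), 1.0))
fused = fused_scale_shift_relu(data, 2.0, 1.0)
assert composed == fused  # same result, one third the memory traffic
```

The trade-off the sketch makes visible is the paper's point: the fused kernel is faster but opaque, so any variation on the computation (a new activation, a different data layout) forces either a slow composed path or a new hand-written kernel.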
Machine Learning Systems are Stuck in a Rut