ABSTRACT
Recent studies from several hyperscalers pinpoint embedding layers as the most memory-intensive deep learning (DL) algorithm deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of embedding layers and the associated tensor operations. We present our vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-memory processing cores tailored for DL tensor operations. These custom DIMMs are populated inside a GPU-centric system interconnect as a remote memory pool, allowing GPUs to utilize them for scalable memory bandwidth and capacity expansion. A prototype implementation of our proposal on real DL systems shows an average 6.2-17.6× performance improvement on state-of-the-art DNN-based recommender systems.
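To make the memory-bound nature of these tensor operations concrete, below is a minimal NumPy sketch of the embedding gather-and-reduce pattern that dominates DNN-based recommender systems. The table sizes, function name, and indices are illustrative assumptions, not the paper's implementation; production embedding tables reach hundreds of gigabytes.

```python
import numpy as np

# Illustrative sizes; real recommender tables are orders of magnitude larger.
NUM_ENTRIES, EMBEDDING_DIM = 1000, 64
table = np.random.default_rng(0).standard_normal((NUM_ENTRIES, EMBEDDING_DIM))

def embedding_gather_reduce(table, indices):
    """Gather the rows for a batch of sparse feature IDs and sum-reduce them.

    The operation streams entire rows from memory while doing only one
    add per element, so it is bound by memory bandwidth rather than
    compute -- the access pattern that near-memory processing targets.
    """
    return table[indices].sum(axis=0)

# One user's multi-hot sparse feature: a handful of item IDs.
pooled = embedding_gather_reduce(table, [3, 42, 7])
print(pooled.shape)  # (64,)
```

Because each lookup touches arbitrary rows and the reduction is trivial arithmetic, placing this gather-reduce near the DRAM devices (rather than shipping every row to the GPU) is what yields the bandwidth and capacity scaling described above.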
Index Terms
- TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning