ABSTRACT
Recent studies from several hyperscalers pinpoint embedding layers as the most memory-intensive deep learning (DL) algorithm deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of embedding layers and the associated tensor operations. We present our vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-memory processing cores tailored for DL tensor operations. These custom DIMMs are populated inside a GPU-centric system interconnect as a remote memory pool, allowing GPUs to utilize them for scalable memory bandwidth and capacity expansion. A prototype implementation of our proposal on real DL systems shows an average 6.2-17.6× performance improvement on state-of-the-art DNN-based recommender systems.
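To make the memory-bound nature of these tensor operations concrete, below is a minimal NumPy sketch of the embedding gather-and-reduce pattern that dominates DNN-based recommender systems. The table sizes, function name, and indices are illustrative assumptions, not the paper's implementation; production embedding tables reach hundreds of gigabytes.

```python
import numpy as np

# Illustrative sizes; real recommender tables are orders of magnitude larger.
NUM_ENTRIES, EMBEDDING_DIM = 1000, 64
table = np.random.default_rng(0).standard_normal((NUM_ENTRIES, EMBEDDING_DIM))

def embedding_gather_reduce(table, indices):
    """Gather the rows for a batch of sparse feature IDs and sum-reduce them.

    The operation streams entire rows from memory while doing only one
    add per element, so it is bound by memory bandwidth rather than
    compute -- the access pattern that near-memory processing targets.
    """
    return table[indices].sum(axis=0)

# One user's multi-hot sparse feature: a handful of item IDs.
pooled = embedding_gather_reduce(table, [3, 42, 7])
print(pooled.shape)  # (64,)
```

Because each lookup touches arbitrary rows and the reduction is trivial arithmetic, placing this gather-reduce near the DRAM devices (rather than shipping every row to the GPU) is what yields the bandwidth and capacity scaling described above.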
Index Terms
- TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning