ABSTRACT
Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models---aka "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates.
This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
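The figure of "over 7M operations from a single instruction" follows directly from the matrix dimensions of large RNNs. A minimal back-of-the-envelope sketch, assuming a hypothetical matrix-vector multiply instruction and an illustrative 2700-wide hidden layer (these numbers are chosen only to show the scale; they are not the actual Brainwave ISA encoding):

```python
# Illustrative sketch: how one SIMD matrix-vector multiply instruction
# can fan out to millions of multiply-accumulate (MAC) operations.
# The instruction name and matrix size are hypothetical.

def macs_for_mv_mul(rows: int, cols: int) -> int:
    """A dense matrix-vector multiply performs rows * cols MACs."""
    return rows * cols

# One hidden-layer step of a large RNN with a 2700 x 2700 weight matrix:
n = 2700
total_macs = macs_for_mv_mul(n, n)
print(total_macs)  # 7290000 -- over 7M MACs dispatched by one instruction

# Spread across the paper's 96,000 MAC units, each unit performs
# only ~76 MACs, so the instruction completes in tens of cycles.
per_unit = total_macs / 96_000
print(round(per_unit, 1))  # 75.9
```

This arithmetic is why a single-threaded ISA suffices: each instruction carries enough fine-grain SIMD work to saturate a spatially distributed datapath without multithreading.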
Index Terms: A configurable cloud-scale DNN processor for real-time AI