research-article

GP-SIMD Processing-in-Memory

Published: 09 January 2015

Abstract

GP-SIMD, a novel hybrid general-purpose SIMD computer architecture, resolves the issue of data synchronization through in-memory computing, combining data storage and massively parallel processing. GP-SIMD employs a two-dimensional access memory with modified SRAM storage cells and a bit-serial processing unit for each memory row. An analytic performance model of the GP-SIMD architecture is presented, comparing it to the associative processor and to conventional SIMD architectures. Cycle-accurate simulation of four workloads supports the analytical comparison. Assuming a moderate die area, the GP-SIMD architecture outperforms both the associative processor and conventional SIMD coprocessor architectures by almost an order of magnitude while consuming less power.
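
The abstract does not spell out the microarchitecture, so the following is only a minimal software sketch of the general idea behind a bit-serial processing unit per memory row: every row holds its own operands, and all rows advance through the same bit position in the same step. The function name `bit_serial_add`, its parameters, and the ripple-carry full adder are assumptions made for illustration, not the paper's design.

```python
def bit_serial_add(a_words, b_words, width):
    """Add a_words[r] + b_words[r] for every row r, one bit position per step.

    Hypothetical model: each row has a 1-bit processing unit with a carry
    flip-flop. Results wrap modulo 2**width. Returns (results, steps).
    """
    rows = len(a_words)
    carry = [0] * rows    # one carry bit per row's processing unit
    result = [0] * rows
    steps = 0
    for bit in range(width):          # sequential over bit positions
        for r in range(rows):         # conceptually parallel across all rows
            a = (a_words[r] >> bit) & 1
            b = (b_words[r] >> bit) & 1
            s = a ^ b ^ carry[r]                          # full-adder sum bit
            carry[r] = (a & b) | (carry[r] & (a ^ b))     # full-adder carry out
            result[r] |= s << bit
        steps += 1
    return result, steps


sums, steps = bit_serial_add([3, 5, 255], [4, 10, 1], width=8)
print(sums, steps)
```

In this model the step count depends only on the word width, not on the number of rows, which is the property that makes a row-parallel organization attractive for large data sets.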

      • Published in

        ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4
        January 2015
        797 pages
        ISSN: 1544-3566
        EISSN: 1544-3973
        DOI: 10.1145/2695583

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 9 January 2015
        • Revised: 1 October 2014
        • Accepted: 1 October 2014
        • Received: 1 May 2014

        Qualifiers

        • research-article
        • Research
        • Refereed
