A Media-Enhanced Vector Architecture for Embedded Memory Systems
August 1999
1999 Technical Report
Publisher:
  • University of California at Berkeley
  • Computer Science Division 571 Evans Hall Berkeley, CA
  • United States
Published: 27 August 1999
Abstract

Next-generation portable devices will require processors with both low energy consumption and high performance for media functions. At the same time, modern CMOS technology creates the need for highly scalable VLSI architectures. Conventional processor architectures fail to meet these requirements. This paper presents the architecture of Vector IRAM (VIRAM), a processor that combines vector processing with embedded DRAM technology. Vector processing achieves high multimedia performance with simple hardware, while embedded DRAM provides high memory bandwidth at low energy consumption. VIRAM provides flexible support for media data types, short vectors, and DSP features. The vector pipeline is enhanced to hide DRAM latency without using caches. The peak performance is 3.2 GFLOPS (single precision) and the maximum memory bandwidth is 25.6 GBytes/s. With a target power consumption of 2 Watts for the vector pipeline and the memory system, VIRAM supports 1.6 GFLOPS/Watt. For a set of representative media kernels, VIRAM sustains on average 88% of its peak performance, outperforming conventional SIMD media extensions and DSP processors by factors of 4.5 to 17. Using a clustered implementation approach, the modular design can be scaled without complicating control logic. We demonstrate that scaling the architecture leads to near-linear application speedup. We also evaluate the effect of scaling the capacity and parallelism of the on-chip memory system on die area and sustained performance.
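The efficiency figure in the abstract follows directly from the quoted peak performance and power target. The short Python sketch below reproduces that arithmetic and also derives the ratio of peak memory bandwidth to peak compute. The constant and function names are illustrative only and do not come from the report, and the bytes-per-FLOP ratio is a derived number that the abstract itself does not state.

```python
# Figures quoted in the abstract.
PEAK_GFLOPS = 3.2          # peak single-precision performance
PEAK_BANDWIDTH_GBS = 25.6  # maximum memory bandwidth, GBytes/s
TARGET_POWER_W = 2.0       # vector pipeline plus memory system


def gflops_per_watt(peak_gflops, power_w):
    """Peak compute delivered per Watt of target power."""
    return peak_gflops / power_w


def bytes_per_flop(bandwidth_gbs, peak_gflops):
    """Peak memory bytes available per peak floating-point operation.

    Derived here for illustration; not a figure stated in the abstract.
    """
    return bandwidth_gbs / peak_gflops


efficiency = gflops_per_watt(PEAK_GFLOPS, TARGET_POWER_W)
balance = bytes_per_flop(PEAK_BANDWIDTH_GBS, PEAK_GFLOPS)

print(f"{efficiency:.1f} GFLOPS/W")     # matches the 1.6 GFLOPS/Watt quoted
print(f"{balance:.1f} bytes per FLOP")
```

Both values are peak-to-peak ratios, so they say nothing about sustained behavior on any particular kernel.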

Cited By

  1. Balaprakash P, Buntinas D, Chan A, Guha A, Gupta R, Narayanan S, Chien A, Hovland P and Norris B Exascale workload characterization and architecture implications Proceedings of the High Performance Computing Symposium, (1-8)
  2. Takano S (2012). Design and analysis of adaptive processor, ACM Transactions on Reconfigurable Technology and Systems, 5:1, (1-34), Online publication date: 1-Mar-2012.
  3. Meng J, Tarjan D and Skadron K Dynamic warp subdivision for integrated branch and memory divergence tolerance Proceedings of the 37th annual international symposium on Computer architecture, (235-246)
  4. Meng J, Tarjan D and Skadron K (2010). Dynamic warp subdivision for integrated branch and memory divergence tolerance, ACM SIGARCH Computer Architecture News, 38:3, (235-246), Online publication date: 19-Jun-2010.
  5. Zhang Z, Zhu Z and Zhang X (2004). Design and Optimization of Large Size and Low Overhead Off-Chip Caches, IEEE Transactions on Computers, 53:7, (843-855), Online publication date: 1-Jul-2004.
  6. Gaeke B, Husbands P, Li X, Oliker L, Yelick K and Biswas R Memory-Intensive Benchmarks Proceedings of the 16th International Parallel and Distributed Processing Symposium
  7. Corbal J, Espasa R and Valero M Three-dimensional memory vectorization for high bandwidth media memory systems Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture, (149-160)
  8. Khailany B, Dally W, Kapasi U, Mattson P, Namkoong J, Owens J, Towles B, Chang A and Rixner S (2001). Imagine, IEEE Micro, 21:2, (35-46), Online publication date: 1-Mar-2001.
  9. Catthoor F, Dutt N and Kozyrakis C How to solve the current memory access and data transfer bottlenecks Proceedings of the conference on Design, automation and test in Europe, (426-435)
  10. Owens J, Dally W, Kapasi U, Rixner S, Mattson P and Mowery B Polygon rendering on a stream architecture Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware, (23-32)
Contributors
  • Stanford University
