ABSTRACT
We propose FlashGPU, a new GPU architecture that tightly integrates new flash (Z-NAND) with massive GPU cores. Specifically, we replace global memory with Z-NAND, which exhibits ultra-low latency. We also architect a flash core underneath the L2 cache banks of the GPU cores to manage request dispatches and address translations. While Z-NAND is a hundred times faster than conventional 3D-stacked flash, its latency is still longer than that of DRAM. To address this shortcoming, we propose a dynamic page-placement and buffer manager in the Z-NAND subsystem that is aware of the bulk and parallel memory access characteristics of GPU applications, thereby achieving high throughput and low energy consumption.
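To illustrate the idea of an access-pattern-aware buffer manager, the toy sketch below (not the paper's implementation; all names and parameters are hypothetical) caches Z-NAND pages in a small DRAM buffer with LRU replacement and prefetches ahead when it detects a sequential run, the kind of bulk access pattern GPU kernels commonly exhibit:

```python
# Hypothetical sketch: a toy DRAM buffer in front of Z-NAND that detects
# sequential (bulk) accesses and prefetches upcoming pages. Capacity and
# prefetch depth are illustrative, not values from the paper.
from collections import OrderedDict


class ToyBufferManager:
    def __init__(self, capacity=8, prefetch_depth=4):
        self.capacity = capacity
        self.prefetch_depth = prefetch_depth
        self.buffer = OrderedDict()  # page number -> True, kept in LRU order
        self.last_page = None
        self.hits = 0
        self.misses = 0

    def _insert(self, page):
        """Place a page in the DRAM buffer, evicting the LRU page if full."""
        if page in self.buffer:
            self.buffer.move_to_end(page)
            return
        if len(self.buffer) >= self.capacity:
            self.buffer.popitem(last=False)  # evict least-recently-used page
        self.buffer[page] = True

    def access(self, page):
        """Serve a page access, counting DRAM hits vs. Z-NAND misses."""
        if page in self.buffer:
            self.hits += 1
            self.buffer.move_to_end(page)
        else:
            self.misses += 1
            self._insert(page)
        # A sequential step suggests a bulk access pattern: prefetch ahead
        # so subsequent pages are served from the DRAM buffer.
        if self.last_page is not None and page == self.last_page + 1:
            for p in range(page + 1, page + 1 + self.prefetch_depth):
                self._insert(p)
        self.last_page = page
```

On a sequential scan of 16 pages, only the first two accesses miss; once the sequential pattern is detected, prefetching keeps every later page resident in the DRAM buffer, hiding Z-NAND latency behind bulk transfers.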