ABSTRACT
We propose FlashGPU, a new GPU architecture that tightly integrates new flash (Z-NAND) with massive GPU cores. Specifically, we replace global memory with Z-NAND, which exhibits ultra-low latency. We also architect a flash core underneath the L2 cache banks of the GPU cores to manage request dispatches and address translations. While Z-NAND is a hundred times faster than conventional 3D-stacked flash, its latency is still longer than that of DRAM. To address this shortcoming, we propose a dynamic page-placement and buffer manager in the Z-NAND subsystem that is aware of the bulk and parallel memory access characteristics of GPU applications, thereby achieving high throughput and low energy consumption.
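To illustrate the idea of an access-pattern-aware buffer manager, the toy sketch below (not the paper's implementation; all names and parameters are hypothetical) caches Z-NAND pages in a small DRAM buffer with LRU replacement and prefetches ahead when it detects a sequential run, the kind of bulk access pattern GPU kernels commonly exhibit:

```python
# Hypothetical sketch: a toy DRAM buffer in front of Z-NAND that detects
# sequential (bulk) accesses and prefetches upcoming pages. Capacity and
# prefetch depth are illustrative, not values from the paper.
from collections import OrderedDict


class ToyBufferManager:
    def __init__(self, capacity=8, prefetch_depth=4):
        self.capacity = capacity
        self.prefetch_depth = prefetch_depth
        self.buffer = OrderedDict()  # page number -> True, kept in LRU order
        self.last_page = None
        self.hits = 0
        self.misses = 0

    def _insert(self, page):
        """Place a page in the DRAM buffer, evicting the LRU page if full."""
        if page in self.buffer:
            self.buffer.move_to_end(page)
            return
        if len(self.buffer) >= self.capacity:
            self.buffer.popitem(last=False)  # evict least-recently-used page
        self.buffer[page] = True

    def access(self, page):
        """Serve a page access, counting DRAM hits vs. Z-NAND misses."""
        if page in self.buffer:
            self.hits += 1
            self.buffer.move_to_end(page)
        else:
            self.misses += 1
            self._insert(page)
        # A sequential step suggests a bulk access pattern: prefetch ahead
        # so subsequent pages are served from the DRAM buffer.
        if self.last_page is not None and page == self.last_page + 1:
            for p in range(page + 1, page + 1 + self.prefetch_depth):
                self._insert(p)
        self.last_page = page
```

On a sequential scan of 16 pages, only the first two accesses miss; once the sequential pattern is detected, prefetching keeps every later page resident in the DRAM buffer, hiding Z-NAND latency behind bulk transfers.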