Piranha: a scalable architecture based on single-chip multiprocessing

Authors:
Luiz André Barroso

Western Research Laboratory, Compaq Computer Corporation, Palo Alto, CA

Kourosh Gharachorloo

Western Research Laboratory, Compaq Computer Corporation, Palo Alto, CA

Robert McNamara

Systems Research Center, Compaq Computer Corporation, Palo Alto, CA

Andreas Nowatzyk

Western Research Laboratory, Compaq Computer Corporation, Palo Alto, CA

Shaz Qadeer

Systems Research Center, Compaq Computer Corporation, Palo Alto, CA

Barton Sano

Western Research Laboratory, Compaq Computer Corporation, Palo Alto, CA

Scott Smith

NonStop Hardware Development, Compaq Computer Corporation, Austin, TX

Robert Stets

Western Research Laboratory, Compaq Computer Corporation, Palo Alto, CA

Ben Verghese

Western Research Laboratory, Compaq Computer Corporation, Palo Alto, CA


ISCA '00: Proceedings of the 27th annual international symposium on Computer architecture, June 2000, Pages 282–293, https://doi.org/10.1145/339647.339696

Published: 01 May 2000

ABSTRACT

The microprocessor industry is currently struggling with higher development costs and longer design times that arise from exceedingly complex processors that are pushing the limits of instruction-level parallelism. Meanwhile, such designs are especially ill-suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. The abundance of explicit thread-level parallelism in commercial workloads, along with advances in semiconductor integration density, identifies chip multiprocessing (CMP) as potentially the most promising approach for designing processors targeted at commercial servers.

This paper describes the Piranha system, a research prototype being developed at Compaq that aggressively exploits chip multiprocessing by integrating eight simple Alpha processor cores along with a two-level cache hierarchy onto a single chip. Piranha also integrates further on-chip functionality to allow for scalable multiprocessor configurations to be built in a glueless and modular fashion. The use of simple processor cores combined with an industry-standard ASIC design methodology allows us to complete our prototype within a short time-frame, with a team size and investment that are an order of magnitude smaller than those of a commercial microprocessor. Our detailed simulation results show that while each Piranha processor core is substantially slower than an aggressive next-generation processor, the integration of eight cores onto a single chip allows Piranha to outperform next-generation processors by up to 2.9 times (on a per chip basis) on important workloads such as OLTP. This performance advantage can approach a factor of five by using full-custom instead of ASIC logic. In addition to exploiting chip multiprocessing, the Piranha prototype incorporates several other unique design choices including a shared second-level cache with no inclusion, a highly optimized cache coherence protocol, and a novel I/O architecture.
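
As a rough reading aid, the per-chip figures in the abstract can be turned into an implied per-core figure. The sketch below is a back-of-envelope calculation, not the authors' evaluation method: it assumes, purely for illustration, that chip throughput is the sum of equal per-core contributions, and it ignores shared-cache contention and other effects that the paper's simulations account for. The function name and constants are hypothetical; only the numbers 8, 2.9, and 5 come from the abstract.

```python
def implied_per_core_ratio(chip_speedup: float, cores: int) -> float:
    """Per-core throughput relative to one baseline chip.

    Assumption (not from the paper): throughput adds linearly across cores.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return chip_speedup / cores


CORES = 8                   # cores per Piranha chip (from the abstract)
ASIC_SPEEDUP = 2.9          # "up to 2.9 times" per chip on OLTP (from the abstract)
FULL_CUSTOM_SPEEDUP = 5.0   # "can approach a factor of five" (from the abstract)

asic_ratio = implied_per_core_ratio(ASIC_SPEEDUP, CORES)
custom_ratio = implied_per_core_ratio(FULL_CUSTOM_SPEEDUP, CORES)

print(f"ASIC: each core contributes {asic_ratio:.4f} of a baseline chip")
print(f"Full custom: each core contributes {custom_ratio:.4f} of a baseline chip")
```

Under that linear assumption, each ASIC core would contribute about 0.36 of a next-generation chip's OLTP throughput, and about 0.625 with full-custom logic, which is consistent with the abstract's statement that each core is substantially slower than the baseline while the chip as a whole is faster.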

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
2. Hardware
  1. Emerging technologies
  2. Very large scale integration design

Published in
ISCA '00: Proceedings of the 27th annual international symposium on Computer architecture
June 2000
327 pages
ISBN: 1581132328
DOI: 10.1145/339647
Chairmen: Alan Berenbaum (Lucent Technologies), Joel Emer (Compaq Computer Corp.)

ACM SIGARCH Computer Architecture News, Volume 28, Issue 2
Special Issue: Proceedings of the 27th annual international symposium on Computer architecture (ISCA '00)
May 2000
325 pages
ISSN: 0163-5964
DOI: 10.1145/342001
Chairmen: Alan Berenbaum (Lucent Technologies, Berkeley Heights, NJ), Joel Emer (Compaq Computer Corp., Palo Alto, CA)
Copyright © 2000 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 543 of 3,203 submissions, 17%