ABSTRACT
The microprocessor industry is currently struggling with higher development costs and longer design times that arise from exceedingly complex processors that are pushing the limits of instruction-level parallelism. Meanwhile, such designs are especially ill-suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. The abundance of explicit thread-level parallelism in commercial workloads, along with advances in semiconductor integration density, identifies chip multiprocessing (CMP) as potentially the most promising approach for designing processors targeted at commercial servers.
This paper describes the Piranha system, a research prototype being developed at Compaq that aggressively exploits chip multiprocessing by integrating eight simple Alpha processor cores along with a two-level cache hierarchy onto a single chip. Piranha also integrates further on-chip functionality to allow scalable multiprocessor configurations to be built in a glueless and modular fashion. The use of simple processor cores combined with an industry-standard ASIC design methodology allows us to complete our prototype within a short time-frame, with a team size and investment that are an order of magnitude smaller than those of a commercial microprocessor. Our detailed simulation results show that while each Piranha processor core is substantially slower than an aggressive next-generation processor, the integration of eight cores onto a single chip allows Piranha to outperform next-generation processors by up to 2.9 times (on a per-chip basis) on important workloads such as OLTP. This performance advantage can approach a factor of five by using full-custom instead of ASIC logic. In addition to exploiting chip multiprocessing, the Piranha prototype incorporates several other unique design choices including a shared second-level cache with no inclusion, a highly optimized cache coherence protocol, and a novel I/O architecture.
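To make the per-chip figures concrete, the following minimal sketch works out what they imply for a single core. It assumes, purely for illustration, that throughput scales ideally (linearly) across the eight cores; the abstract does not state this. Under sublinear scaling each core alone would be faster than the value computed here, so the result is best read as a rough lower bound. The function name `implied_per_core_ratio` is hypothetical.

```python
def implied_per_core_ratio(chip_speedup: float, cores: int) -> float:
    """Throughput of one core relative to the baseline chip.

    Assumption (not from the paper): throughput scales linearly with
    the number of cores, so the chip-level speedup divides evenly.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return chip_speedup / cores


CORES = 8  # eight Alpha cores per Piranha chip, as stated in the abstract

# Chip-level advantages quoted in the abstract for OLTP.
asic_ratio = implied_per_core_ratio(2.9, CORES)    # ASIC prototype
custom_ratio = implied_per_core_ratio(5.0, CORES)  # full-custom logic

print(f"ASIC per-core ratio (lower bound):        {asic_ratio:.4f}")
print(f"Full-custom per-core ratio (lower bound): {custom_ratio:.4f}")
```

Both ratios fall below 1, which is consistent with the statement that each core is substantially slower than a next-generation processor while the chip as a whole is faster.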
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA '00)