Abstract
The number of active threads in a multi-core processor varies over time and is often much smaller than the number of supported hardware threads. This requires multi-core chip designs to balance core count and per-core performance. Low active thread counts benefit from a few big, high-performance cores, while high active thread counts benefit more from a sea of small, energy-efficient cores.
This paper comprehensively studies the trade-offs in multi-core design given dynamically varying active thread counts. We find that, under these workload conditions and within the same power budget, a homogeneous multi-core processor consisting of a few high-performance SMT cores typically outperforms heterogeneous multi-cores that mix big and small cores without SMT. We also show that such a homogeneous multi-core performs almost as well as a heterogeneous multi-core that also implements SMT, and almost as well as a dynamic multi-core, while being less complex to design and verify. Further, heterogeneous multi-cores that power-gate idle cores yield only slightly better energy efficiency than homogeneous multi-cores.
The overall conclusion is that the benefit of SMT in the multi-core era is to provide flexibility with respect to the available thread-level parallelism. Consequently, homogeneous multi-cores with big SMT cores are competitive high-performance, energy-efficient design points for workloads with dynamically varying active thread counts.
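The trade-off between core count and per-core performance can be illustrated with a toy model. The sketch below is not the paper's methodology or data. It assumes Pollack's rule (single-thread performance grows with the square root of core size), a hypothetical SMT gain of 30% of single-thread performance per additional thread on a core, and a fixed budget of 16 base units standing in for the power budget. All function names and parameters are made up for illustration.

```python
import math


def core_perf(size):
    """Single-thread performance of a core built from `size` base units.

    Assumption: Pollack's rule, performance ~ sqrt(size).
    """
    return math.sqrt(size)


def core_throughput(size, threads, smt_gain=0.3):
    """Aggregate throughput of one core running `threads` threads.

    Assumption (hypothetical): every SMT thread beyond the first adds
    `smt_gain` times the core's single-thread performance.
    """
    if threads == 0:
        return 0.0
    return core_perf(size) * (1.0 + smt_gain * (threads - 1))


def chip_throughput(n_cores, size, smt_ways, active_threads, smt_gain=0.3):
    """Total throughput when active threads are spread evenly over cores.

    Threads beyond the number of hardware contexts are simply not run.
    """
    runnable = min(active_threads, n_cores * smt_ways)
    base, extra = divmod(runnable, n_cores)
    total = 0.0
    for core in range(n_cores):
        threads_here = base + (1 if core < extra else 0)
        total += core_throughput(size, threads_here, smt_gain)
    return total


# Two designs with the same 16-unit budget (illustrative configurations):
#   big:   4 cores of 4 units each, 4-way SMT
#   small: 16 cores of 1 unit each, no SMT
for n in (1, 4, 8, 16):
    big = chip_throughput(4, 4, 4, n)
    small = chip_throughput(16, 1, 1, n)
    print(f"active threads={n:2d}  big SMT={big:5.1f}  small={small:5.1f}")
```

Under these assumed parameters, the big SMT design leads at low and medium thread counts and trails only slightly when all 16 hardware threads are active. This mirrors the qualitative argument above that SMT provides flexibility across varying degrees of thread-level parallelism. Different values of `smt_gain` or a different performance rule would shift the crossover point.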
The benefit of SMT in the multi-core era: flexibility towards degrees of thread-level parallelism
ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems