Abstract
Current multicore architectures offer high throughput by increasing hardware resource utilization. As the number of cores in a multicore system increases, providing Quality of Service (QoS) to applications in addition to throughput is becoming an important problem.
In this work, we present FlexDCP, a framework that allows the Operating System (OS) to guarantee a QoS for each application running in a chip multiprocessor. FlexDCP directly estimates the performance of applications for different cache configurations instead of using indirect measures of performance like the number of misses. This information allows the OS to convert QoS requirements into resource assignments. Consequently, it offers more flexibility to the OS as it can optimize different QoS metrics like per-application performance or global performance metrics such as fairness, weighted speed up or throughput.
Our results show that FlexDCP is able to force applications in a workload to run at a certain percentage of their maximum performance in 94% of the cases considered, being on average 1:48% under the objective for remaining cases. When optimizing a global QoS metric like fairness, FlexDCP consistently outperforms traditional eviction policies like LRU, pseudo LRU and previous dynamic cache partitioning proposals for two-, four- and eightcore configurations. In an eight-core architecture FlexDCP obtains a fairness improvement of 10:1% over Fair, the best policy in the literature optimizing fairness.
- ARM920T. Technical Reference Manual. http://infocenter.arm.com/help/topic/com.arm.doc.ddi0151c/ARM920T_TRM1_S.pdf.Google Scholar
- UltraSPARC T2. Supplement to the UltraSPARC Architecture 2007. http://opensparc-t2.sunsource.net/specs/UST2-UASuppl-current-draft-HP-EXT.pdf.Google Scholar
- D.P. Bovet and M. Cesati. Understanding Linux kernel. O'Reilly, 3rd edition, 2005. Google ScholarDigital Library
- F.J. Cazorla, P.M.W. Knijnenburg, R. Sakellariou, E. Fernandez, A. Ramirez, and M. Valero. Predictable performance in SMT processors: Synergy between the OS and SMTs. IEEE ToC, 55(7):785--799, 2006. Google ScholarDigital Library
- D. Chiou, P. Jain, S. Devadas, and L. Rudolph. Dynamic cache partitioning via columnization. In Design Automation Conference, 2000.Google Scholar
- F. Guo, Y. Solihin, L. Zhao, and R. Iyer. A framework for providing quality of service in chip multi-processors. In MICRO, 2007. Google ScholarDigital Library
- L. Hammond, B.A. Nayfeh, and K. Olukotun. A single-chip multiprocessor. Computer, 30(9):79--85, 1997. Google ScholarDigital Library
- L.C. Heller and M.S. Farrell. Millicode in an IBM zSeries processor. IBM J. Res. Dev., 48(3-4):425--434, 2004. Google ScholarDigital Library
- J.L. Hennessy and D.A. Patterson. Computer architecture: a quantitative approach. Morgan Kaufmann Publishers Inc., 3rd edition, 2002. Google ScholarDigital Library
- L.R. Hsu, S.K. Reinhardt, R. Iyer, and S. Makineni. Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource. In PACT, 2006. Google ScholarDigital Library
- C.J. Hughes, P. Kaul, S.V. Adve, R. Jain, C. Park, and J. Srinivasan. Variability in the execution of multimedia applications and implications for architecture. In ISCA, 2001. Google ScholarDigital Library
- R.R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L.R. Hsu, and S.K. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. In SIGMETRICS, 2007. Google ScholarDigital Library
- A. Jaleel, W. Hasenplaugh, M.K. Qureshi, J. Sebot, S.C.S. Jr, and J. Emer. Adaptive insertion policies for managing shared caches on cmps. In PACT, 2008. Google ScholarDigital Library
- T.S. Karkhanis and J.E. Smith. A first-order superscalar processor model. In ISCA, 2004. Google ScholarDigital Library
- S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In PACT, 2004. Google ScholarDigital Library
- J.W. Lee and K. Asanovic. METERG: Measurement-based end-toend performance estimation technique in QoS-capable multiprocessors. In RTAS, 2006. Google ScholarDigital Library
- C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V.J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI, 2005. Google ScholarDigital Library
- K. Luo, J. Gummaraju, and M. Franklin. Balancing throughput and fairness in SMT processors. In ISPASS, 2001.Google Scholar
- R.L. Mattson, J. Gecsei, D.R. Slutz, and I.L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78--117, 1970.Google ScholarDigital Library
- M. Moreto, F.J. Cazorla, A. Ramirez, and M. Valero. Explaining dynamic cache partitioning speed ups. IEEE CAL, 2007. Google ScholarDigital Library
- M. Moreto, F.J. Cazorla, A. Ramirez, and M. Valero. Online prediction of applications cache utility. In IC-SAMOS, 2007.Google ScholarCross Ref
- O. Mutlu and T. Moscibroda. Stall-time fair memory access scheduling for chip multiprocessors. In MICRO, 2007. Google ScholarDigital Library
- K.J. Nesbit, N. Aggarwal, J. Laudon, and J.E. Smith. Fair queuing memory systems. In MICRO, 2006. Google ScholarDigital Library
- K.J. Nesbit, J. Laudon, and J.E. Smith. Virtual private caches. In ISCA, 2007. Google ScholarDigital Library
- K.J. Nesbit, M. Moreto, F.J. Cazorla, A. Ramirez, M. Valero, and J.E. Smith. A framework for managing multicore resources. IEEE Micro, special issue on Interaction of Computer Architecture and Operating System in the Many-core Era, 38(3), 2008.Google Scholar
- M.K. Qureshi and Y.N. Patt. Utility-based cache partitioning: A lowoverhead, high-performance, runtime mechanism to partition shared caches. In MICRO, 2006. Google ScholarDigital Library
- N. Rafique, W.-T. Lim, and M. Thottethodi. Architectural support for operating system-driven CMP cache management. In PACT, 2006. Google ScholarDigital Library
- M.J. Serrano, R. Wood, and M. Nemirovsky. A study on multistreamed superscalar processors. Technical Report 93-05, UCSB, 1993.Google Scholar
- A. Settle, D. Connors, E. Gibert, and A. Gonzalez. A dynamically reconfigurable cache for multithreaded processors. Journal of Embedded Computing, 1(3-4), 2005. Google ScholarDigital Library
- T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder. Discovering and exploiting program phases. IEEE Micro, 2003. Google ScholarDigital Library
- J.E. Smith and R. Nair. Virtual machines: versatile platforms for systems and processes. Morgan Kaufmann Publishers Inc., 2005. Google ScholarDigital Library
- G.E. Suh, S. Devadas, and L. Rudolph. A new memory monitoring scheme for memory-aware scheduling and partitioning. In HPCA, 2002. Google ScholarDigital Library
- D.M. Tullsen, S.J. Eggers, and H.M. Levy. Simultaneous multithreading: maximizing on-chip parallelism. In ISCA, 1995. Google ScholarDigital Library
- J. Vera, F.J. Cazorla, A. Pajuelo, O.J. Santana, E. Fernandez, and M. Valero. FAME: Fairly measuring multithreaded architectures. In PACT, 2007. Google ScholarDigital Library
- T.Y. Yeh and G. Reinman. Fast and fair: data-stream quality of service. In CASES, 2005. Google ScholarDigital Library
- P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In ASPLOS, 2004. Google ScholarDigital Library
Index Terms
- FlexDCP: a QoS framework for CMP architectures
Recommendations
Ubik: efficient cache sharing with strict qos for latency-critical workloads
ASPLOS '14Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, ...
Ubik: efficient cache sharing with strict qos for latency-critical workloads
ASPLOS '14Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, ...
Ubik: efficient cache sharing with strict qos for latency-critical workloads
ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systemsChip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, ...
Comments