ABSTRACT
This paper presents a many-core visual computing architecture code-named Larrabee, a new software rendering pipeline, a many-core programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed-function logic blocks. This design provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture compared to standard GPUs. A coherent on-die second-level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling in Larrabee is performed entirely in software rather than in fixed-function logic. The customizable software graphics rendering pipeline for this architecture uses binning to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.
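The binning step mentioned above can be illustrated with a minimal sketch. It is not the paper's renderer: the function name `bin_triangles`, the 64-pixel tile size, and the bounding-box overlap test are all assumptions chosen for illustration. The idea shown is only that each triangle is assigned to the screen tiles it may touch, so that tiles can later be processed independently.

```python
from collections import defaultdict

TILE_SIZE = 64  # hypothetical tile edge in pixels; not taken from the paper


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Assign each triangle index to every tile its bounding box overlaps.

    `triangles` is a list of three (x, y) screen-space vertices per triangle.
    This is a conservative bounding-box test, an assumption of this sketch.
    """
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles that lie entirely outside the render target.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Clamp the bounding box to the screen, then convert to tile coordinates.
        tx0 = max(int(min(xs)), 0) // tile_size
        tx1 = min(int(max(xs)), width - 1) // tile_size
        ty0 = max(int(min(ys)), 0) // tile_size
        ty1 = min(int(max(ys)), height - 1) // tile_size
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)
```

Under these assumptions, each tile's list could be handed to a separate worker that touches only that tile's pixels, which is the intuition behind the reduced lock contention and memory traffic that the abstract attributes to binning.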