research-article

Larrabee: a many-core x86 architecture for visual computing

Authors:
Larry Seiler (Intel® Corporation)
Doug Carmean (Intel® Corporation)
Eric Sprangle (Intel® Corporation)
Tom Forsyth (Intel® Corporation)
Michael Abrash (RAD Game Tools)
Pradeep Dubey (Intel® Corporation)
Stephen Junkins (Intel® Corporation)
Adam Lake (Intel® Corporation)
Jeremy Sugerman (Stanford University)
Robert Cavin (Intel® Corporation)
Roger Espasa (Intel® Corporation)
Ed Grochowski (Intel® Corporation)
Toni Juan (Intel® Corporation)
Pat Hanrahan (Stanford University)

SIGGRAPH '08: ACM SIGGRAPH 2008 papers, August 2008, Article No.: 18, Pages 1–15, https://doi.org/10.1145/1399504.1360617

Published: 01 August 2008

ABSTRACT

This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.
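The binning step mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's renderer: the function name `bin_triangles`, the 64-pixel tile size, and the conservative bounding-box overlap test are all assumptions made for illustration. The idea shown is only that each triangle is assigned to the screen tiles it may touch, so that each tile can later be processed on its own.

```python
import math


def bin_triangles(triangles, width, height, tile=64):
    """Map (tile_x, tile_y) to the indices of triangles that may touch that tile.

    Hypothetical helper: uses a conservative bounding-box test, so a triangle
    can land in a tile it does not actually cover.
    """
    tiles_x = math.ceil(width / tile)
    tiles_y = math.ceil(height / tile)
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles whose bounding box lies entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Clamp the bounding box to the range of valid tile coordinates.
        tx0 = max(math.floor(min(xs) / tile), 0)
        tx1 = min(math.floor(max(xs) / tile), tiles_x - 1)
        ty0 = max(math.floor(min(ys) / tile), 0)
        ty1 = min(math.floor(max(ys) / tile), tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


triangles = [
    [(10, 10), (50, 10), (10, 50)],        # fits inside tile (0, 0)
    [(60, 60), (70, 60), (60, 70)],        # straddles all four tiles
    [(200, 200), (210, 200), (200, 210)],  # entirely off screen
]
bins = bin_triangles(triangles, width=128, height=128)
```

Because every tile ends up with its own triangle list, a tile's pixels can stay in fast local storage while that list is processed, which is one plausible reading of how binning reduces memory bandwidth.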

Supplemental Material

a18-seiler.mov (mov, 22.6 MB)

Index Terms

Larrabee: a many-core x86 architecture for visual computing
1. Computing methodologies
  1. Computer graphics
  2. Parallel computing methodologies

Reviews

Reviewer: Hector Yee

In the early years of computer graphics, software renderers were very popular on the personal computer. These renderers have recently been supplanted by graphics processing units (GPUs), which first took over fixed-function operations such as triangle setup and rasterization, and eventually grew to perform the transformation and lighting of geometry completely in hardware. Recent GPU technology gives the user limited customization of pixel shading and geometry transformation by means of programmable graphics hardware. However, some kinds of operations, such as the creation and manipulation of dynamic data structures (for example, linked lists and other irregular data structures), are still difficult to implement on graphics hardware, yet are important for many rendering problems. The Larrabee architecture, described by the authors, attempts to address such issues by implementing a multi-core, general-purpose processor-based architecture, augmented with several vector units, as an alternative to the classic GPU model.

This paper is written in two parts, the first describing the hardware architecture and the second describing an implementation of a software renderer running on top of the architecture. The hardware is described as many in-order central processing units (CPUs), based on the Intel x86 architecture, connected by an interprocessor ring network for communication, with each having its own L2 cache. The hardware has additional fixed-function units that perform tasks such as texture filtering, which is difficult to implement efficiently in software. Almost everything else, such as shading and geometry transformation, is done in software. The Larrabee software renderer follows a sort-middle architecture, where polygons are binned for rendering and then each block is rendered at once, in order to use the CPU as much as possible while not saturating the bandwidth with too many simultaneous memory requests.

The authors show almost linear scale-up with the number of CPUs for applications such as game fluid simulation; applications such as rigid-body simulations do not scale up as well. It is interesting to see real-time rendering software come full circle, from software to hardware and now back to software. I am eager to see the actual hardware in operation in the future. My only disappointment with the paper is the lack of comparison with existing GPUs in terms of performance on state-of-the-art games. The authors do, however, provide a detailed analysis of how each game uses the CPU and bandwidth of the architecture.
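The review's description of binned rendering, where each block is rendered as a unit, can be sketched as independent per-tile tasks handed to a software-managed pool of workers. This is only an illustration of the structure, not Larrabee's scheduler: `render_tile`, `render_frame`, and the triangle-count "shading" are hypothetical stand-ins, and Python threads are used purely for readability rather than for real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def render_tile(tile_id, triangle_indices):
    """Hypothetical stand-in for shading one tile: report how many triangles it holds."""
    return tile_id, len(triangle_indices)


def render_frame(bins, workers=4):
    """Process every tile as an independent task.

    Each task touches only its own tile's data, so the tasks need no shared locks.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda item: render_tile(*item), sorted(bins.items()))
        return dict(results)


# Assumed input: tile coordinates mapped to the triangles binned into each tile.
example_bins = {(0, 0): [0, 1], (1, 0): [1], (0, 1): [1], (1, 1): [1]}
frame = render_frame(example_bins)
```

Keeping the work unit at tile granularity is the design point being illustrated: the scheduler is ordinary software, and adding workers does not add contention because tiles cover disjoint pixels.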


Published in

SIGGRAPH '08: ACM SIGGRAPH 2008 papers
August 2008
887 pages
ISBN:9781450301121
DOI:10.1145/1399504

Copyright © 2008 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPGPU
SIMD
graphics architecture
many-core computing
parallel processing
realtime graphics
software rendering
throughput computing
visual computing
Acceptance Rates
SIGGRAPH '08 Paper Acceptance Rate: 90 of 518 submissions, 17%
Overall Acceptance Rate: 1,822 of 8,601 submissions, 21%
Article Metrics
- Total Citations: 155
- Total Downloads: 15,082
- Downloads (Last 12 months): 11
- Downloads (Last 6 weeks): 1