research-article

CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster

Authors:
Changkyu Kim

Intel Labs, Santa Clara, CA., USA

Intel Labs, Santa Clara, CA., USA
View Profile

,
Jongsoo Park

Intel Labs, Santa Clara, CA., USA

Intel Labs, Santa Clara, CA., USA
View Profile

,
Nadathur Satish

Intel Labs, Santa Clara, CA., USA

Intel Labs, Santa Clara, CA., USA
View Profile

,
Hongrae Lee

Google Research, Mountain View, USA

Google Research, Mountain View, USA
View Profile

,
Pradeep Dubey

Intel Labs, Santa Clara, CA., USA

Intel Labs, Santa Clara, CA., USA
View Profile

,
Jatin Chhugani

Intel Labs, Santa Clara, CA., USA

Intel Labs, Santa Clara, CA., USA
View Profile

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of DataMay 2012Pages 841–850https://doi.org/10.1145/2213836.2213965

Published:20 May 2012Publication History

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Pages 841–850

ABSTRACT

Sorting is a fundamental kernel used in many database operations. The total memory available across cloud computers is now sufficient to store even hundreds of terabytes of data in-memory. Applications requiring high-speed data analysis typically use in-memory sorting. The two most important factors in designing a high-speed in-memory sorting system are the single-node sorting performance and inter-node communication.

In this paper, we present CloudRAMSort, a fast and efficient system for large-scale distributed sorting on shared-nothing clusters. CloudRAMSort performs multi-node optimizations by carefully overlapping computation with inter-node communication. The system uses a dynamic multi-stage random sampling approach for improved load-balancing between nodes. CloudRAMSort maximizes per-node efficiency by exploiting modern architectural features such as multiple cores and SIMD (Single-Instruction Multiple Data) units. This holistic combination results in the highest performing sorting performance on distributed shared-nothing platforms. CloudRAMSort sorts 1 Terabyte (TB) of data in 4.6 seconds on a 256-node Xeon X5680 cluster called the Intel Endeavor system. CloudRAMSort also performs well on heavily skewed input distributions, sorting 1 TB of data generated using Zipf distribution in less than 5 seconds. We also provide a detailed analytical model that accurately projects (within avg. 7%) the performance of CloudRAMSort with varying tuple sizes and interconnect bandwidths. Our analytical model serves as a useful tool to analyze performance bottlenecks on current systems and project performance with future architectural advances.

With architectural trends of increasing number of cores, bandwidth, SIMD width, cache-sizes, and interconnect bandwidth, we believe CloudRAMSort would be the system of choice for distributed sorting of large-scale in-memory data of current and future systems

References

Apache Hadoop. http://hadoop.apache.org.Google Scholar
bonnie++: a program to test hard drive performance. http://www.coker.com.au/bonnie++Google Scholar
Data Generator For Sorting Benchmarks. www.ordinal.com/gensort.html.Google Scholar
Hadoop Cluster Setup. http://hadoop.apache.org/common/docs/current/cluster_setup.html.Google Scholar
OpenFabrics Alliance. https://www.openfabrics.org/index.php.Google Scholar
SAS In-Memory Analytics. http://www.sas.com/software/high-performance-computing/in-memory-analytics/.Google Scholar
Sort Benchmark Home Page. http://sortbenchmark.org.Google Scholar
The Message Passing Interface (MPI) standard. http://www.mcs.anl.gov/research/projects/mpi/.Google Scholar
B. Abali, F. Özgüner, and A. Bataineh. Balanced Parallel Sort on Hypercube Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 4(5):572--581, 1993. Google ScholarDigital Library
D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu. Energy Proportional Datacenter Networks. In International Symposium on Computer Architecture (ISCA), 2010. Google ScholarDigital Library
E. Anderson and J. Tucek. Efficiency Matters! ACM SIGOPS Operating Systems Review, 44(1), 2010. Google ScholarDigital Library
A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson. High-Performance Sorting on Networks of Workstations. In SIGMOD, pages 243--254. Google ScholarDigital Library
G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha. A Comparison of Sorting Algorithms for the Connection Machine CM-2. In Symposium on Parallel Algorithms and Architectures (SPAA), pages 3--16, 1991. Google ScholarDigital Library
L. Breslau, P. Cue, P. Cao, L. Fan, et al. Web caching and zipf-like distributions: Evidence and implications. In INFOCOM, pages 126--134, 1999.Google ScholarCross Ref
R. E. Bryant. Data-Intensive Supercomputing: The case for DISC. Technical Report Carnegie Mellon University-CS-07-128, 2007.Google Scholar
A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, and S. Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In MICRO, pages 385--395, 2010. Google ScholarDigital Library
J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. PVLDB, 1(2):1313--1324, 2008. Google ScholarDigital Library
H. Cho, P. Kapur, and K. C. Saraswat. Power Comparison Between High-Speed Electrical and Optical Interconnects for Interchip Communication. Journal of Lightwave Technology, 22(9), 2004.Google Scholar
J. Cieslewicz and K. A. Ross. Adaptive aggregation on chip multiprocessors. In VLDB, pages 339--350, 2007. Google ScholarDigital Library
J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson. NV-Heaps: Making Persistent Objects Fast and Safe with Next-Generation, Non-Volatile Memories. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 105--118, 2011. Google ScholarDigital Library
J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee. Better I/O Through Byte-Addressable, Persistent Memory. In Symposium on Operating Systems Principles (SOSP), pages 133--146, 2009. Google ScholarDigital Library
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarDigital Library
D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The gamma database machine project. IEEE Trans. Knowl. Data Eng., 2(1):44--62, 1990. Google ScholarDigital Library
D. J. DeWitt, J. F. Naughton, and D. A. Schneider. Parallel Sorting on a Shared-Nothing Architecture using Probabilistic Splitting. In International Conference on Parallel and Distributed Information Systems, pages 280--291. Google ScholarDigital Library
M. Glick. Optical interconnects in next generation data centers: An end to end view. In Hot Interconnects, pages 178--181, 2008. Google ScholarDigital Library
H. Gonzalez, A. Y. Halevy, C. S. Jensen, A. Langen, J. Madhavan, R. Shapley, W. Shen, and J. Goldberg-Kidon. Google fusion tables: web-centered data management and collaboration. In SIGMOD Conference, pages 1061--1066, 2010. Google ScholarDigital Library
J. Gray and F. Putzolu. The 5 minute Rule for Trading Memory for Disc Accesses and the 10 Byte Rule for Trading Memory for CPU Time. In SIGMOD, pages 395--398, 1987. Google ScholarDigital Library
J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In SIGMOD Conference, pages 243--252, 1994. Google ScholarDigital Library
J. S. Huang and Y. C. Chow. Parallel sorting and data partitioning by sampling. In International Computer Software and Applications Conference, 1983.Google Scholar
Intel Research. Light Peak: Overview. Intel White Paper, 2010.Google Scholar
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, pages 59--72, 2007. Google ScholarDigital Library
B. R. Iyer, G. R. Ricard, and P. J. Varman. Percentile Finding Algorithm for Multiple Sorted Runs. In VLDB, pages 135--144, 1989. Google ScholarDigital Library
L. V. Kalé and S. Krishnan. A Comparison Based Parallel Sorting Algorithm. In International Conference on Paralllel Processing (ICPP), pages 196--200, 1993. Google ScholarDigital Library
C. Kim, J. Chhugani, N. Satish, et al. FAST: Fast Architecture Sensitive Tree search on modern CPUs and GPUs. In SIGMOD, pages 339--350, 2010. Google ScholarDigital Library
C. Kim, E. Sedlar, J. Chhugani, T. Kaldewey, A. D. Nguyen, A. D. Blas, V. W. Lee, N. Satish, and P. Dubey. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. PVLDB, 2(2):1378--1389, 2009. Google ScholarDigital Library
J. Kim, W. J. Dally, and D. Abts. Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks. In International Symposium on Computer Architecture (ISCA), 2007. Google ScholarDigital Library
J. Kim, W. J. Dally, B. Towles, and A. Gupta. Microarchitecture of a High-Radix Router. In International Symposium on Computer Architecture (ISCA), 2005. Google ScholarDigital Library
M. H. Kryder and C. S. Kim. After Hard Drives-What Comes Next? IEEE Transactions on Magnetics, 45(10), 2009.Google ScholarCross Ref
B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Phase change memory architecture and the quest for scalability. Commun. ACM, 53(7):99--106, 2010. Google ScholarDigital Library
C. E. Leiserson. Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Transactions on Computers, 34(10), 1985. Google ScholarDigital Library
D. L. Lewis and H.-H. S. Lee. Architectural evaluation of 3d stacked rram caches. In 3DIC, pages 1--4, 2009.Google Scholar
K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge, and S. Reinhardt. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. In International Symposium on Computer Architecture (ISCA), pages 315--326, 2008. Google ScholarDigital Library
H. Liu, C. F. Lam, and C. Johnson. Scaling optical interconnects in datacenter networks: Opportunities and challenges for wdm. In Hot Interconnects, 2010. Google ScholarDigital Library
M. LLC. In-Memory Database Systems: Myths and Facts. 2010.Google Scholar
O. O'Malley and A. C. Murthy. Winning a 60 Second Dash with a Yellow Elephant. http://sortbenchmark.org/Yahoo2009.pdf, 2009.Google Scholar
J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanana, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman. The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM. ACM SIGOPS Operating Systems Review, 43(4), 2010. Google ScholarDigital Library
D. A. Patterson. Latency Lags Bandwidth. Communications of the ACM, 47(10), 2004. Google ScholarDigital Library
A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In SIGMOD, pages 165--178, 2009. Google ScholarDigital Library
P. Ranganathan. From Microprocessors to Nanostores: Rethinking Data-Centric Systems. IEEE Computer, 44(1):39--48, 2011. Google ScholarDigital Library
A. Rasmussen, G. Porter, M. Conley, H. V. Madhyastha, R. N. Mysore, A. Pucher, and A. Vahdat. TritonSort: A Balanced Large-Scale Sorting System. In USENIX conference on Networked System Design and Implementation, 2011. Google ScholarDigital Library
N. Satish, C. Kim, J. Chhugani, et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In SIGMOD, pages 351--362. ACM, 2010. Google ScholarDigital Library
L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. SIGGRAPH, 27(3), 2008. Google ScholarDigital Library
S. Seshadri and J. F. Naughton. Sampling Issues in Parallel Database Systems. In Advanced in Database Technology - EDBT. Google ScholarDigital Library
H. Shi and J. Schaeffer. Parallel Sorting by Regular Sampling. Journal of Parallel and Distributed Computing, 14(4):361--372, 1992. Google ScholarDigital Library
E. Solomonik and L. V. Kalé. Highly scalable parallel sorting. In International Symposium on Parallel and Distributed Processing (IPDPS), pages 1--12.Google Scholar
M. Stonebraker, D. J. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. Mapreduce and parallel dbmss: friends or foes? Commun. ACM, 53(1):64--71, 2010. Google ScholarDigital Library
T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units. PVLDB, 2(1):385--394, 2009. Google ScholarDigital Library

Index Terms

CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Recommendations

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and ...
Read More
Developmental directions in parallel accelerators
AusPDC '14: Proceedings of the Twelfth Australasian Symposium on Parallel and Distributed Computing - Volume 152

Parallel accelerators such as massively-cored graphical processing units or many-cored co-processors such as the Xeon Phi are becoming widespread and affordable on many systems including blade servers and even desktops. The use of a single such ...
Read More
Larrabee: A Many-Core x86 Architecture for Visual Computing

The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic. This increases the architecture's programmability as compared to standard GPUs. The ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN:9781450312479
DOI:10.1145/2213836
General Chairs:
K. Selçuk Candan
Arizona State University
,
Yi Chen
Arizona State University
,
Richard Snodgrass
University of Arizona
,
Program Chair:
Luis Gravano
Columbia University
,
Publications Chair:
Ariel Fuxman
Microsoft Research
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 May 2012
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
cloud computing
distributed
many core
multi core
ram sort
simd
sorting
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '12 Paper Acceptance Rate48of289submissions,17%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 27
  Total Citations
  View Citations
- 1,203
  Total Downloads
- Downloads (Last 12 months)13
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Developmental directions in parallel accelerators

Larrabee: A Many-Core x86 Architecture for Visual Computing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Developmental directions in parallel accelerators

Larrabee: A Many-Core x86 Architecture for Visual Computing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media