ABSTRACT
Sorting is a fundamental kernel used in many database operations. The total memory available across cloud computers is now sufficient to store even hundreds of terabytes of data in-memory. Applications requiring high-speed data analysis typically use in-memory sorting. The two most important factors in designing a high-speed in-memory sorting system are the single-node sorting performance and inter-node communication.
In this paper, we present CloudRAMSort, a fast and efficient system for large-scale distributed sorting on shared-nothing clusters. CloudRAMSort performs multi-node optimizations by carefully overlapping computation with inter-node communication. The system uses a dynamic multi-stage random sampling approach for improved load-balancing between nodes. CloudRAMSort maximizes per-node efficiency by exploiting modern architectural features such as multiple cores and SIMD (Single-Instruction Multiple Data) units. This holistic combination results in the highest performing sorting performance on distributed shared-nothing platforms. CloudRAMSort sorts 1 Terabyte (TB) of data in 4.6 seconds on a 256-node Xeon X5680 cluster called the Intel Endeavor system. CloudRAMSort also performs well on heavily skewed input distributions, sorting 1 TB of data generated using Zipf distribution in less than 5 seconds. We also provide a detailed analytical model that accurately projects (within avg. 7%) the performance of CloudRAMSort with varying tuple sizes and interconnect bandwidths. Our analytical model serves as a useful tool to analyze performance bottlenecks on current systems and project performance with future architectural advances.
With architectural trends of increasing number of cores, bandwidth, SIMD width, cache-sizes, and interconnect bandwidth, we believe CloudRAMSort would be the system of choice for distributed sorting of large-scale in-memory data of current and future systems
- Apache Hadoop. http://hadoop.apache.org.Google Scholar
- bonnie++: a program to test hard drive performance. http://www.coker.com.au/bonnie++Google Scholar
- Data Generator For Sorting Benchmarks. www.ordinal.com/gensort.html.Google Scholar
- Hadoop Cluster Setup. http://hadoop.apache.org/common/docs/current/cluster_setup.html.Google Scholar
- OpenFabrics Alliance. https://www.openfabrics.org/index.php.Google Scholar
- SAS In-Memory Analytics. http://www.sas.com/software/high-performance-computing/in-memory-analytics/.Google Scholar
- Sort Benchmark Home Page. http://sortbenchmark.org.Google Scholar
- The Message Passing Interface (MPI) standard. http://www.mcs.anl.gov/research/projects/mpi/.Google Scholar
- B. Abali, F. Özgüner, and A. Bataineh. Balanced Parallel Sort on Hypercube Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 4(5):572--581, 1993. Google ScholarDigital Library
- D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu. Energy Proportional Datacenter Networks. In International Symposium on Computer Architecture (ISCA), 2010. Google ScholarDigital Library
- E. Anderson and J. Tucek. Efficiency Matters! ACM SIGOPS Operating Systems Review, 44(1), 2010. Google ScholarDigital Library
- A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson. High-Performance Sorting on Networks of Workstations. In SIGMOD, pages 243--254. Google ScholarDigital Library
- G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha. A Comparison of Sorting Algorithms for the Connection Machine CM-2. In Symposium on Parallel Algorithms and Architectures (SPAA), pages 3--16, 1991. Google ScholarDigital Library
- L. Breslau, P. Cue, P. Cao, L. Fan, et al. Web caching and zipf-like distributions: Evidence and implications. In INFOCOM, pages 126--134, 1999.Google ScholarCross Ref
- R. E. Bryant. Data-Intensive Supercomputing: The case for DISC. Technical Report Carnegie Mellon University-CS-07-128, 2007.Google Scholar
- A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, and S. Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In MICRO, pages 385--395, 2010. Google ScholarDigital Library
- J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. PVLDB, 1(2):1313--1324, 2008. Google ScholarDigital Library
- H. Cho, P. Kapur, and K. C. Saraswat. Power Comparison Between High-Speed Electrical and Optical Interconnects for Interchip Communication. Journal of Lightwave Technology, 22(9), 2004.Google Scholar
- J. Cieslewicz and K. A. Ross. Adaptive aggregation on chip multiprocessors. In VLDB, pages 339--350, 2007. Google ScholarDigital Library
- J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson. NV-Heaps: Making Persistent Objects Fast and Safe with Next-Generation, Non-Volatile Memories. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 105--118, 2011. Google ScholarDigital Library
- J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee. Better I/O Through Byte-Addressable, Persistent Memory. In Symposium on Operating Systems Principles (SOSP), pages 133--146, 2009. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarDigital Library
- D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The gamma database machine project. IEEE Trans. Knowl. Data Eng., 2(1):44--62, 1990. Google ScholarDigital Library
- D. J. DeWitt, J. F. Naughton, and D. A. Schneider. Parallel Sorting on a Shared-Nothing Architecture using Probabilistic Splitting. In International Conference on Parallel and Distributed Information Systems, pages 280--291. Google ScholarDigital Library
- M. Glick. Optical interconnects in next generation data centers: An end to end view. In Hot Interconnects, pages 178--181, 2008. Google ScholarDigital Library
- H. Gonzalez, A. Y. Halevy, C. S. Jensen, A. Langen, J. Madhavan, R. Shapley, W. Shen, and J. Goldberg-Kidon. Google fusion tables: web-centered data management and collaboration. In SIGMOD Conference, pages 1061--1066, 2010. Google ScholarDigital Library
- J. Gray and F. Putzolu. The 5 minute Rule for Trading Memory for Disc Accesses and the 10 Byte Rule for Trading Memory for CPU Time. In SIGMOD, pages 395--398, 1987. Google ScholarDigital Library
- J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In SIGMOD Conference, pages 243--252, 1994. Google ScholarDigital Library
- J. S. Huang and Y. C. Chow. Parallel sorting and data partitioning by sampling. In International Computer Software and Applications Conference, 1983.Google Scholar
- Intel Research. Light Peak: Overview. Intel White Paper, 2010.Google Scholar
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, pages 59--72, 2007. Google ScholarDigital Library
- B. R. Iyer, G. R. Ricard, and P. J. Varman. Percentile Finding Algorithm for Multiple Sorted Runs. In VLDB, pages 135--144, 1989. Google ScholarDigital Library
- L. V. Kalé and S. Krishnan. A Comparison Based Parallel Sorting Algorithm. In International Conference on Paralllel Processing (ICPP), pages 196--200, 1993. Google ScholarDigital Library
- C. Kim, J. Chhugani, N. Satish, et al. FAST: Fast Architecture Sensitive Tree search on modern CPUs and GPUs. In SIGMOD, pages 339--350, 2010. Google ScholarDigital Library
- C. Kim, E. Sedlar, J. Chhugani, T. Kaldewey, A. D. Nguyen, A. D. Blas, V. W. Lee, N. Satish, and P. Dubey. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. PVLDB, 2(2):1378--1389, 2009. Google ScholarDigital Library
- J. Kim, W. J. Dally, and D. Abts. Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks. In International Symposium on Computer Architecture (ISCA), 2007. Google ScholarDigital Library
- J. Kim, W. J. Dally, B. Towles, and A. Gupta. Microarchitecture of a High-Radix Router. In International Symposium on Computer Architecture (ISCA), 2005. Google ScholarDigital Library
- M. H. Kryder and C. S. Kim. After Hard Drives-What Comes Next? IEEE Transactions on Magnetics, 45(10), 2009.Google ScholarCross Ref
- B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Phase change memory architecture and the quest for scalability. Commun. ACM, 53(7):99--106, 2010. Google ScholarDigital Library
- C. E. Leiserson. Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Transactions on Computers, 34(10), 1985. Google ScholarDigital Library
- D. L. Lewis and H.-H. S. Lee. Architectural evaluation of 3d stacked rram caches. In 3DIC, pages 1--4, 2009.Google Scholar
- K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge, and S. Reinhardt. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. In International Symposium on Computer Architecture (ISCA), pages 315--326, 2008. Google ScholarDigital Library
- H. Liu, C. F. Lam, and C. Johnson. Scaling optical interconnects in datacenter networks: Opportunities and challenges for wdm. In Hot Interconnects, 2010. Google ScholarDigital Library
- M. LLC. In-Memory Database Systems: Myths and Facts. 2010.Google Scholar
- O. O'Malley and A. C. Murthy. Winning a 60 Second Dash with a Yellow Elephant. http://sortbenchmark.org/Yahoo2009.pdf, 2009.Google Scholar
- J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanana, G. Parulkar, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman. The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM. ACM SIGOPS Operating Systems Review, 43(4), 2010. Google ScholarDigital Library
- D. A. Patterson. Latency Lags Bandwidth. Communications of the ACM, 47(10), 2004. Google ScholarDigital Library
- A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In SIGMOD, pages 165--178, 2009. Google ScholarDigital Library
- P. Ranganathan. From Microprocessors to Nanostores: Rethinking Data-Centric Systems. IEEE Computer, 44(1):39--48, 2011. Google ScholarDigital Library
- A. Rasmussen, G. Porter, M. Conley, H. V. Madhyastha, R. N. Mysore, A. Pucher, and A. Vahdat. TritonSort: A Balanced Large-Scale Sorting System. In USENIX conference on Networked System Design and Implementation, 2011. Google ScholarDigital Library
- N. Satish, C. Kim, J. Chhugani, et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In SIGMOD, pages 351--362. ACM, 2010. Google ScholarDigital Library
- L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. SIGGRAPH, 27(3), 2008. Google ScholarDigital Library
- S. Seshadri and J. F. Naughton. Sampling Issues in Parallel Database Systems. In Advanced in Database Technology - EDBT. Google ScholarDigital Library
- H. Shi and J. Schaeffer. Parallel Sorting by Regular Sampling. Journal of Parallel and Distributed Computing, 14(4):361--372, 1992. Google ScholarDigital Library
- E. Solomonik and L. V. Kalé. Highly scalable parallel sorting. In International Symposium on Parallel and Distributed Processing (IPDPS), pages 1--12.Google Scholar
- M. Stonebraker, D. J. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. Mapreduce and parallel dbmss: friends or foes? Commun. ACM, 53(1):64--71, 2010. Google ScholarDigital Library
- T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units. PVLDB, 2(1):385--394, 2009. Google ScholarDigital Library
Index Terms
- CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster
Recommendations
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of dataSort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and ...
Developmental directions in parallel accelerators
AusPDC '14: Proceedings of the Twelfth Australasian Symposium on Parallel and Distributed Computing - Volume 152Parallel accelerators such as massively-cored graphical processing units or many-cored co-processors such as the Xeon Phi are becoming widespread and affordable on many systems including blade servers and even desktops. The use of a single such ...
Larrabee: A Many-Core x86 Architecture for Visual Computing
The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic. This increases the architecture's programmability as compared to standard GPUs. The ...
Comments