skip to main content
10.1145/2517349.2522716acmconferencesArticle/Chapter ViewAbstractPublication PagessospConference Proceedingsconference-collections
research-article
Open Access

Sparrow: distributed, low latency scheduling

Published:03 November 2013Publication History

ABSTRACT

Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.

Skip Supplemental Material Section

Supplemental Material

d1-05-kay-ousterhout.mp4

mp4

1.2 GB

References

  1. Apache Thrift. http://thrift.apache.org.Google ScholarGoogle Scholar
  2. G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Why Let Resources Idle? Aggressive Cloning of Jobs with Dolly. In HotCloud, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the Outliers in Map-Reduce Clusters using Mantri. In Proc. OSDI, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. Bradley, T. S. Clair, M. Farrellee, Z. Guo, M. Livny, I. Sfiligoi, and T. Tannenbaum. An Update on the Scalability Limits of the Condor Batch System. Journal of Physics: Conference Series, 331(6), 2011.Google ScholarGoogle ScholarCross RefCross Ref
  5. J. Dean and L. A. Barroso. The Tail at Scale. Communications of the ACM, 56(2), February 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Demers, S. Keshav, and S. Shenker. Analysis and Simulation of a Fair Queueing Algorithm. In Proc. SIGCOMM, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. L. Eager, E. D. Lazowska, and J. Zahorjan. Adaptive Load Sharing in Homogeneous Distributed Systems. IEEE Transactions on Software Engineering, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A Platform For Fine-Grained Resource Sharing in the Data Center. In Proc. NSDI, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. In Proc. SOSP, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. A. Jette, A. B. Yoo, and M. Grondona. SLURM: Simple Linux Utility for Resource Management. In Proc. Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science, pages 44--60. Springer, 2003.Google ScholarGoogle Scholar
  11. M. Kornacker and J. Erickson. Cloudera Impala: Real Time Queries in Apache Hadoop, For Real. http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/, October 2012.Google ScholarGoogle Scholar
  12. S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive Analysis of Web-Scale Datasets. Proc. VLDB Endow., 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. M. Mitzenmacher. How Useful is Old Information? volume 11, pages 6--20, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Mitzenmacher. The Power of Two Choices in Randomized Load Balancing. IEEE Transactions on Parallel and Distributed Computing, 12(10):1094--1104, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. Mitzenmacher. The Power of Two Random Choices: A Survey of Techniques and Results. In S. Rajasekaran, P. Pardalos, J. Reif, and J. Rolim, editors, Handbook of Randomized Computing, volume 1, pages 255--312. Springer, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  16. A. C. Murthy. The Next Generation of Apache MapReduce. http://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html, February 2012.Google ScholarGoogle Scholar
  17. K. Ousterhout, A. Panda, J. Rosen, S. Venkataraman, R. Xin, S. Ratnasamy, S. Shenker, and I. Stoica. The Case for Tiny Tasks in Compute Clusters. In Proc. HotOS, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. G. Park. A Generalization of Multiple Choice Balls-into-Bins. In Proc. PODC, pages 297--298, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A Simple Load Balancing Scheme for Task Allocation in Parallel Machines. In Proc. SPAA, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In Proc. EuroSys, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. B. Sharma, V. Chudnovsky, J. L. Hellerstein, R. Rifaat, and C. R. Das. Modeling and Synthesizing Task Placement Constraints in Google Compute Clusters. In Proc. SOCC, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. D. Shue, M. J. Freedman, and A. Shaikh. Performance Isolation and Fairness for Multi-Tenant Cloud Storage. In Proc. OSDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale. In Proc. SIGMOD, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay Scheduling: A Simple Technique For Achieving Locality and Fairness in Cluster Scheduling. In Proc. EuroSys, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proc. NSDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In Proc. OSDI, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Conferences
    SOSP '13: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
    November 2013
    498 pages
    ISBN:9781450323888
    DOI:10.1145/2517349

    Copyright © 2013 Owner/Author

    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 3 November 2013

    Check for updates

    Qualifiers

    • research-article

    Acceptance Rates

    Overall Acceptance Rate131of716submissions,18%

    Upcoming Conference

    SOSP '24

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader