DOI: 10.1145/2851613.2851762
research-article

Scalable monitoring and dependable job scheduling support for multi-domain grid infrastructures

Published: 04 April 2016

ABSTRACT

Grid management systems commonly lack the information needed to identify failures that may hinder the timely completion of jobs and waste computing resources. Monitoring can certainly help, but novel approaches need to be conceived for such large and geographically distributed systems. We propose a Grid Architecture for scalable Monitoring and Enhanced dependable job ScHeduling (GAMESH). GAMESH is a completely distributed and highly efficient management infrastructure for disseminating monitoring data and troubleshooting job execution failures in large-scale, multi-domain Grid environments. Evaluated in a real deployment and compared with other Grid management systems, GAMESH is shown to (i) provide measurements of both computing resources and task-scheduling conditions at geographically sparse sites while inducing a low overhead on the entire infrastructure, and (ii) enable failure-aware scheduling and improve overall system performance, even in the presence of failures, by coordinating local job schedulers across multiple domains.


        • Published in

          SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing
          April 2016
          2360 pages
          ISBN: 9781450337397
          DOI: 10.1145/2851613

          Copyright © 2016 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          SAC '16 Paper Acceptance Rate: 252 of 1,047 submissions, 24%
          Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
