Scalable monitoring and dependable job scheduling support for multi-domain grid infrastructures

Authors:
Marcello Cinque (Università di Napoli Federico II - Italy)
Antonio Corradi (University of Bologna - Italy)
Luca Foschini (University of Bologna - Italy)
Flavio Frattini (Università di Napoli Federico II - Italy and Consorzio Interuniversitario Nazionale per l'Informatica (CINI) - Italy)
Javier Povedano-Molina (Universidad de Granada - Spain)

SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing, April 2016, Pages 2015–2020. https://doi.org/10.1145/2851613.2851762

Published: 04 April 2016

ABSTRACT

Grid management systems commonly lack the information needed to identify failures that hinder the timely completion of jobs and waste computing resources. Monitoring can help, but novel approaches are needed for such large, geographically distributed systems. We propose a Grid Architecture for scalable Monitoring and Enhanced dependable job ScHeduling (GAMESH). GAMESH is a completely distributed and highly efficient management infrastructure for disseminating monitoring data and troubleshooting job execution failures in large-scale, multi-domain Grid environments. Evaluated in a real deployment and compared with other Grid management systems, GAMESH is shown to (i) provide measurements of both computing resources and task scheduling conditions at geographically sparse sites while imposing low overhead on the entire infrastructure, and (ii) enable failure-aware scheduling and improve overall system performance, even in the presence of failures, by coordinating local job schedulers across multiple domains.

Published in
SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing
April 2016
2360 pages
ISBN: 9781450337397
DOI: 10.1145/2851613
Conference Chair:
Sascha Ossowski
University Rey Juan Carlos, Spain
Copyright © 2016 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
dependability
fault tolerance
grid
monitoring
scheduling
Qualifiers
- research-article
Acceptance Rates
SAC '16 paper acceptance rate: 252 of 1,047 submissions (24%)
Overall acceptance rate: 1,650 of 6,669 submissions (25%)