ABSTRACT
This paper discusses our work on supporting the processing of large numbers of short tasks, carried out as part of our development of a collaborative bioinformatics knowledge environment for structural biologists, environmental microbiologists, and evolutionary biologists. We have designed and implemented a new ensemble-based task dispatching system and deployed it on a Blue Gene/L system in conjunction with the Blue Gene's High Throughput Computing (HTC) capability. Unlike our prior general database-backed HTC task dispatching system, the ensemble-based system can efficiently process and dispatch large numbers of very short tasks to over a thousand cores. We also investigate the scalability of HTC on the IBM Blue Gene/L in general, identifying and eliminating processor-reboot inefficiencies for very short tasks in specific applications, which makes the Blue Gene/L a feasible processing system for this bioinformatics workload.
Index Terms
- Ensemble dispatching on an IBM Blue Gene/L for a bioinformatics knowledge environment