ABSTRACT
The Sequence Read Archive (SRA), the world's largest database of sequences, hosts approximately 10 petabases (10^16 bp) of sequence data and is growing at the alarming rate of 10 TB per day. Yet this rich trove of data is inaccessible to most researchers: searching the SRA requires storage and computing facilities beyond the capacity of most laboratories. Enabling scientists to analyze existing sequence data will provide insight into ecology, medicine, and industrial applications. In this project we focus specifically on metagenomic sequences (whole-community data sets from different environments). We are developing a set of tools that enable biologists to mine the metagenomes in the SRA using the NSF-funded cloud computing resources Jetstream and Wrangler, and we have built a proof-of-principle pipeline to demonstrate the feasibility of the approach. We are leveraging our existing infrastructure to create a stable pipeline with a science gateway portal, so that all researchers can access the SRA metagenomes regardless of their computational ability.
Index Terms
- Searching the Sequence Read Archive using Jetstream and Wrangler