DOI: 10.1145/3219104.3229278 (PEARC conference proceedings)
research-article
Open Access

Searching the Sequence Read Archive using Jetstream and Wrangler

Published: 22 July 2018

ABSTRACT

The Sequence Read Archive (SRA), the world's largest database of sequences, hosts approximately 10 petabases (10^16 bp) of sequence data and is growing at the alarming rate of 10 TB per day. Yet this rich trove of data is inaccessible to most researchers: searching through the SRA requires large storage and computing facilities that are beyond the capacity of most laboratories. Enabling scientists to analyze existing sequence data will provide insight into ecology, medicine, and industrial applications. In this project we specifically focus on metagenomic sequences (whole community data sets from different environments). We are developing a set of tools to enable biologists to mine the metagenomes in the SRA using the NSF-funded cloud computing resources, Jetstream and Wrangler. We have developed a proof-of-principle pipeline to demonstrate the feasibility of the approach. We are leveraging our existing infrastructure to enable all scientists to access the SRA metagenomes regardless of their computational ability and are working to create a stable pipeline with a science gateway portal that is accessible to all researchers.

