ABSTRACT
Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, has become the technology of choice for gene expression profiling. However, reconstructing full-length novel transcripts from RNA-Seq data remains challenging because most existing sequencing technologies deliver short reads. We propose a novel statistical genome-guided method called "Transcriptome Reconstruction using Integer Programming" (TRIP) that incorporates the fragment length distribution into novel transcript reconstruction from paired-end RNA-Seq reads. TRIP creates a splice graph based on aligned RNA-Seq reads and enumerates all maximal paths, which correspond to putative transcripts. The problem of selecting true transcripts is formulated as an integer program (IP) that selects the smallest set of transcripts yielding a good statistical fit between the fragment length distribution (empirically determined during library preparation) and the fragment lengths implied by mapped read pairs. Experimental results on both real and synthetic datasets show that TRIP is more accurate than methods that ignore fragment length distribution information. The software is available at: http://www.cs.gsu.edu/serghei/?q=trip
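The pipeline outlined in the abstract (splice graph, maximal paths, selection of a smallest well-fitting transcript set) can be illustrated with a toy sketch. This is not the TRIP implementation: the function names, the exon coordinates, and the "within k standard deviations of the mean" fit rule are all assumptions made for illustration, and a brute-force subset search stands in for the integer program, which is only workable for a handful of candidate paths.

```python
from itertools import combinations

def maximal_paths(graph):
    """Enumerate all source-to-sink paths of a DAG given as {node: [successors]}."""
    has_incoming = {v for succ in graph.values() for v in succ}
    sources = [u for u in graph if u not in has_incoming]
    paths = []

    def walk(node, path):
        path = path + (node,)
        if not graph.get(node):          # no successors: the path is maximal
            paths.append(path)
            return
        for nxt in graph[node]:
            walk(nxt, path)

    for s in sources:
        walk(s, ())
    return paths

def implied_length(transcript, exons, left, right):
    """Fragment length implied by a read pair spanning genomic [left, right)
    when placed on `transcript`; None if an endpoint lies outside its exons."""
    def inside(pos):
        return any(exons[e][0] <= pos < exons[e][1] for e in transcript)
    if not (inside(left) and inside(right - 1)):
        return None
    # Only exonic bases of this transcript count toward the fragment length.
    return sum(max(0, min(right, exons[e][1]) - max(left, exons[e][0]))
               for e in transcript)

def select_transcripts(paths, exons, pairs, mean, sd, k=1.0):
    """Smallest set of paths such that every explainable read pair has an
    implied length within k*sd of the mean on some selected path.
    Brute force over subsets (assumed stand-in for the IP solver)."""
    explains = []
    for p in paths:
        ok = set()
        for i, (left, right) in enumerate(pairs):
            length = implied_length(p, exons, left, right)
            if length is not None and abs(length - mean) <= k * sd:
                ok.add(i)
        explains.append(ok)
    coverable = set().union(*explains)
    for size in range(len(paths) + 1):
        for combo in combinations(range(len(paths)), size):
            if set().union(*(explains[j] for j in combo)) == coverable:
                return [paths[j] for j in combo]
    return []

# Hypothetical gene: three exons (half-open genomic intervals), exon B optional.
exons = {"A": (0, 100), "B": (200, 300), "C": (400, 500)}
graph = {"A": ["B", "C"], "B": ["C"], "C": []}
paths = maximal_paths(graph)             # candidate transcripts A-B-C and A-C
pairs = [(50, 450), (0, 500)]            # (left read start, right read end)
chosen = select_transcripts(paths, exons, pairs, mean=200, sd=20)
print(chosen)
```

In this toy gene the first read pair implies a 200-base fragment only if exon B is included, while the second implies 200 bases only if exon B is skipped, so both candidate transcripts are needed to explain the data. This is the sense in which fragment lengths can discriminate between isoforms that share the same exon boundaries at the read positions.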
Index Terms
- An integer programming approach to novel transcript reconstruction from paired-end RNA-Seq reads