Algorithms for DNA sequence assembly and motif search

January 2012

Author: Hieu Trung Dinh, University of Connecticut
Adviser: Sanguthevar Rajasekaran, University of Connecticut

Publisher: University of Connecticut, Computer Sci. & Eng. Division, U-157 EECS Dept., Storrs, CT, United States

ISBN: 978-1-267-79554-0

Order Number: AAI3533974

Abstract

In this thesis we present algorithmic results for computational problems arising in two important areas of computational biology: de novo sequence assembly and de novo motif search.

De novo sequence assembly is the problem of reconstructing a genome from its overlapping fragments, called reads. One popular approach is to build an overlap graph from the reads and then analyze it. A challenge in this approach is constructing and storing the overlap graph efficiently when the number of reads is very large. This has been a computational bottleneck for many de novo assembly software tools based on overlap graphs, because any traditional data structure needs Ω(n²) memory and time to store an overlap graph, where n is the number of reads. For next-generation sequencing data, n usually ranges from hundreds of millions to billions. As a result, many overlap-graph-based software tools are unable to handle this amount of data. We propose a data structure for overlap graphs that provably requires only O(n) space and time. Our experimental results show that the data structure is very efficient in practice. In addition, we develop a de novo assembly software tool, named Large-scale Efficient Assembly Program (LEAP), that uses our data structure. LEAP can efficiently process a billion 454 reads on a desktop computer.
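To make the quadratic bottleneck concrete, the following is a minimal sketch of the naive way to build an overlap graph: compare every ordered pair of reads and record the longest suffix-prefix match. It is not the data structure proposed in the thesis; the function names, the `min_len` threshold, and the dictionary representation are assumptions chosen for illustration.

```python
from itertools import permutations


def overlap_length(a, b, min_len=3):
    """Longest k >= min_len such that the last k characters of a equal
    the first k characters of b; returns 0 if there is no such k."""
    max_len = min(len(a), len(b))
    for k in range(max_len, max(min_len, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def naive_overlap_graph(reads, min_len=3):
    """Map each ordered pair (i, j) of read indices to its overlap length.

    Every ordered pair is examined, so the number of comparisons grows
    quadratically with the number of reads (illustrative baseline only).
    """
    graph = {}
    for i, j in permutations(range(len(reads)), 2):
        k = overlap_length(reads[i], reads[j], min_len)
        if k:
            graph[(i, j)] = k
    return graph


reads = ["ACGTAC", "TACGGA", "GGATTT"]
print(naive_overlap_graph(reads))
```

With n reads the loop visits n(n - 1) ordered pairs, which is why this baseline becomes impractical at the scale of hundreds of millions of reads mentioned above.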

De novo motif search is another important research topic in computational biology. In general terms, it is the problem of finding patterns in a set of biological sequences such as DNA or protein sequences. In the literature, motif search has been modeled as various combinatorial problems, such as Simple Motif Search (SMS), Planted Motif Search (PMS), also known as (ℓ, d)-motif search, and Edit-distance-based Motif Search (EMS). Among these, PMS is the best known and most studied. Many algorithms have been proposed for solving PMS, and they usually fall into two categories: approximate algorithms and exact algorithms. The difference is that exact algorithms provably find all of the motifs, whereas approximate algorithms may not output all of them. In this thesis, we focus on exact algorithms. All known exact algorithms for PMS take time exponential in ℓ and d. We propose novel exact algorithms that improve on the best-known exact algorithm in both running time and memory. In particular, our algorithm can solve the challenging PMS instances (21, 8) and (23, 9), which no prior exact algorithm could solve. On a regular desktop, it takes about 4.3 hours to solve the instance (21, 8) and 24 hours to solve the instance (23, 9). In addition, our algorithms have been incorporated into our online software tool for motif search at http://pms.engr.uconn.edu or at http://motifsearch.com.
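As an illustration of the PMS problem statement, the sketch below is a brute-force exact search: it enumerates every string of length ℓ over the DNA alphabet and keeps those that occur in every input sequence within Hamming distance d. This is the textbook baseline, not the algorithm developed in the thesis; the function names and the toy input are assumptions made for the example.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, seq, d):
    """True if some window of seq is within Hamming distance d of candidate."""
    l = len(candidate)
    return any(
        hamming(candidate, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def brute_force_pms(seqs, l, d, alphabet="ACGT"):
    """Return every length-l string that occurs, with at most d mismatches,
    in each sequence of seqs. Tries all len(alphabet)**l candidates, so the
    running time is exponential in l (illustrative baseline only)."""
    motifs = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        if all(occurs_within(candidate, s, d) for s in seqs):
            motifs.append(candidate)
    return motifs


seqs = ["AACGT", "CGTAA", "TTCGT"]
print(brute_force_pms(seqs, l=3, d=0))
```

Because the candidate set has 4^ℓ elements for DNA, this approach is only feasible for very small ℓ, which gives some sense of why instances such as (21, 8) and (23, 9) are considered challenging.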
