skip to main content
Algorithms for dna sequence assembly and motif search
Publisher:
  • University of Connecticut
  • Computer Sci. & Eng. Division U-157 EECS Dept. Storrs, CT
  • United States
ISBN:978-1-267-79554-0
Order Number:AAI3533974
Pages:
95
Bibliometrics
Skip Abstract Section
Abstract

In this thesis we present algorithmic results for computational problems arising in two important areas of computational biology: de novo sequence assembly and de novo motif search.

De novo sequence assembly is generally the problem of reconstructing a genome from its overlapping fragments called reads. One popular approach for solving the problem is to use an overlap graph from the reads and then analyze it. One of the challenges in this approach is to construct and store the overlap graphs efficiently when the number of reads is very large. Constructing and storing the overlap graphs has been a computational bottleneck for many de novo assembly software tools that use the overlap graph approach because it will need ý( n ²) memory and time to store any overlap graphs if we employ any traditional data structure. Here n is the number of reads. For the next generation sequencing data, n is usually ranges from hundreds of millions to billions. As a result, many overlap graph-based software tools are unable to handle this huge amount of data. Fortunately, we propose a data structure for the overlap graphs that provably requires only O ( n ) space and time. Our experimental results show that our data structure is very efficient in practice. In addition, we develop a De novo assembly software tool, named Large-scale Efficient Assembly Program (LEAP), that utilizes our data structure. LEAP can process efficiently a billion of 454 reads on a desktop computer.

De novo motif search is another important research topic in computational biology. In general terms, de novo motif search is the problem of finding patterns in a set of biological sequences such as DNA or protein sequences. In the literature, motif search has been modeled as various combinatorial problems such as Simple Motif Search (SMS), Planted Motif Search (PMS) - also known as (ý, d )- motif search , and Edit-distance-based Motif Search (EMS). Among these, PMS is the most well-known and well-studied problem. There have been many algorithms proposed for solving PMS that usually fall into the two following categories: approximate algorithms and exact algorithms. The difference between the two kinds of algorithms is that exact algorithms will provably find all of the motifs, whereas approximate algorithms may not output all of the motifs. In this thesis, we focus on exact algorithms. All of the known exact algorithms for PMS take exponential time in ý and d. We propose novel exact algorithms that improve the best-known exact algorithm in terms of running time and memory. In particular, our algorithm can solve the challenging instances of PMS (21, 8) and (23, 9) that no prior exact algorithm could solve. On a regular desktop, it takes about 4.3 hours to solve the instance (21, 8) and 24 hours for the instance (23, 9). In addition, our algorithms have been incorporated into our online software tool for motif search at http://pms.engr.uconn.edu or at http://motifsearch.com.

Contributors
  • University of Connecticut
  • University of Connecticut

Recommendations