In this thesis, we develop combinatorial approaches to two important problems in computational biology: signal finding and gene finding in DNA sequences.
Signal finding (pattern discovery in unaligned DNA sequences) is a fundamental problem in both computer science and molecular biology, with important applications in locating regulatory sites and identifying drug targets. Despite many studies, this problem is far from solved: most signals in DNA sequences are so complicated that we do not yet have good models or reliable algorithms for their recognition. We complement existing statistical and machine learning approaches to this problem with combinatorial approaches that have proved successful in identifying very subtle signals. This work appears in “Pevzner P. A. and Sze S.-H. (2000). Combinatorial approaches to finding subtle signals in DNA sequences. In Proc. of the 8th Int. Conf. on Intelligent Systems for Mol. Biol. (ISMB'2000), 269–278.”
Gene finding (determination of splice site locations in genomic DNA sequences) is an important problem in molecular biology. We describe three different approaches that we have developed to aid the gene finding process. This work appears in “Sze S.-H. and Pevzner P. A. (1997). Las Vegas algorithms for gene recognition: suboptimal and error-tolerant spliced alignment. J. Comp. Biol., 4, 297–309”; “Sze S.-H., Roytberg M. A., Gelfand M. S., Mironov A. A., Astakhova T. V. and Pevzner P. A. (1998). Algorithms and software for support of gene identification experiments. Bioinformatics, 14, 14–19”; and “Xu G., Sze S.-H., Liu C.-P., Pevzner P. A. and Arnheim N. (1998). Gene hunting without sequencing genomic clones: finding exon boundaries in cDNAs. Genomics, 47, 171–179.”