ABSTRACT
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem. Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems.
We introduce a novel motif discovery algorithm based on the use of random projections of the input's substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l, d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes.
- 1.R. D. Andersen, S. J. Taplitz, S. Wong, G. Bristol, B. Larkin, and H. R. Herschman. Metal-dependent binding of a factor in vivo to the metal-responsive elements of the metallothionein 1 gene promoter. Molecular and Cellular Biology, 7:3574-81, 1987.Google ScholarCross Ref
- 2.T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21(1-2):51-80, Oct. 1995. Google ScholarDigital Library
- 3.M. Blanchette. Algorithms for phylogenetic footprinting. In RECOMB01: Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, Montreal, Canada, Apr. 2001. Google ScholarDigital Library
- 4.M. Blanchette, B. Schwikowski, and M. Tompa. An exact algorithm to identify motifs in orthologous sequences from multiple species. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 37-45, San Diego, CA, Aug. 2000. AAAI Press. Google ScholarDigital Library
- 5.D. S. W. Boam, A. R. Clark, and K. Docherty. Positive and negative regulation of the human insulin gene by multiple trans-acting factors. Journal of Biological Chemistry, 265:8285-96, 1990.Google Scholar
- 6.A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen. Predicting gene regulatory elements in silico on a genomic scale. Genome Research, 15:1202-1215, 1998.Google ScholarCross Ref
- 7.J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 2001. To appear.Google Scholar
- 8.A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.Google Scholar
- 9.Y. M. Fraenkel, Y. Mandel, D. Friedberg, and H. Margalit. Identification of common motifs in unaligned DNA sequences: application to Escherichia coli Lrp regulon. Computer Applications in the Biosciences, 11(4):379-387, 1995.Google Scholar
- 10.D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences: Analysis of promoter sequences from Escherichia coli. Journal of Molecular Biology, 186(1):117-128, Nov. 1985.Google ScholarCross Ref
- 11.A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Databases, Edinburgh, Scotland, 1999. Google ScholarDigital Library
- 12.W. S. Hayes and M. Borodovsky. Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction. In Pacific Symposium on Biocomputing, pages 279-290, 1998.Google Scholar
- 13.G. Z. Hertz and G. D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7/8):563-577, July/August 1999.Google ScholarCross Ref
- 14.P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604-613, Dallas, TX, May 1998. Google ScholarDigital Library
- 15.M. Kozak. Comparison of initiation of protein synthesis in procaryotes, eucaryotes, and organelles. Microbiological Reviews, 47(1):1-45, 1983.Google Scholar
- 16.C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214, 8 October 1993.Google ScholarCross Ref
- 17.C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function, and Genetics, 7:41-51, 1990.Google Scholar
- 18.B. Lewin. Genes VI. Oxford University Press, 1997.Google Scholar
- 19.M. Linial, N. Linial, N. Tishby, and G. Yona. Global self-organization of all known protein sequences reveals inherent biological signatures. Journal of Molecular Biology, 268:539-556, 1997.Google ScholarCross Ref
- 20.C. J. McInerny, J. F. Partridge, G. E. Mikesell, D. P. Creemer, and L. L. Breeden. A novel Mcm1-dependent element in the SWI4, CLN3, CDC6, and CDC47 promoters activates M/G-specific transcription. Genes & Development, 11:1277-1288, 1997.Google Scholar
- 21.A. L. Means and P. G. Farnham. Transcription initiation from the dihydrofolate reductase promoter is positioned by hip1 binding at the initiation site. Molecular and Cellular Biology, 10:653-61, 1990.Google ScholarCross Ref
- 22.S. Natsan and M. Gilman. Yy1 facilitates the association of serum response factor with the c-fos serum response element. Molecular and Cellular Biology, 15:5975-82, 1995.Google ScholarCross Ref
- 23.P. Pevzner and S.-H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 269-278, San Diego, CA, Aug. 2000. AAAI Press. Google ScholarDigital Library
- 24.I. Rigoutsos and A. Floratos. Motif discovery without alignment or enumeration. In RECOMB98: Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 221-227, New York, NY, Mar. 1998. Google ScholarDigital Library
- 25.E. Rocke and M. Tompa. An algorithm for finding novel gapped motifs in DNA sequences. In RECOMB98: Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 228-233, New York, NY, Mar. 1998. Google ScholarDigital Library
- 26.M.-F. Sagot. Spelling approximate repeated or common motifs using a suffix tree. In C. L. Lucchesi and A. V. Moura, editors, Latin '98: Theoretical Informatics, volume 1380 of Lecture Notes in Computer Science, pages 111-127. Springer, 1998. Google ScholarDigital Library
- 27.S. Sinha and M. Tompa. A statistical method for finding transcription factor binding sites. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 344-354, San Diego, CA, Aug. 2000. AAAI Press. Google ScholarDigital Library
- 28.R. Staden. Methods for discovering novel motifs in nucleic acid sequences. Computer Applications in the Biosciences, 5(4):293-298, 1989.Google Scholar
- 29.M. Tompa. An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 262-271, Heidelberg, Germany, Aug. 1999. AAAI Press. Google ScholarDigital Library
- 30.J. van Helden, B. Andr~, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281(5):827-842, Sept. 4 1998.Google ScholarCross Ref
- 31.E. Wingender, P. Dietze, H. Karas, and R. Kn~ppel. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Research, 24(1):238-241, 1996. http://transfac.gbfbraunschweig. de/TRANSFAC/.Google ScholarCross Ref
Index Terms
- Finding motifs using random projections
Recommendations
Finding functional promoter motifs by computational methods: a word of caution
The standard practice in the analysis of promoters is to select promoter regions of convenient length. This may lead to false results when searching for Transcription Factor Binding Sites (TFBSs), since the sequences may contain coding segments. In such ...
Comments