skip to main content
10.1145/369133.369172acmconferencesArticle/Chapter ViewAbstractPublication PagesrecombConference Proceedingsconference-collections
Article

Finding motifs using random projections

Authors Info & Claims
Published:22 April 2001Publication History

ABSTRACT

Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem. Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems.

We introduce a novel motif discovery algorithm based on the use of random projections of the input's substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l, d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes.

References

  1. 1.R. D. Andersen, S. J. Taplitz, S. Wong, G. Bristol, B. Larkin, and H. R. Herschman. Metal-dependent binding of a factor in vivo to the metal-responsive elements of the metallothionein 1 gene promoter. Molecular and Cellular Biology, 7:3574-81, 1987.Google ScholarGoogle ScholarCross RefCross Ref
  2. 2.T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21(1-2):51-80, Oct. 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. 3.M. Blanchette. Algorithms for phylogenetic footprinting. In RECOMB01: Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, Montreal, Canada, Apr. 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. 4.M. Blanchette, B. Schwikowski, and M. Tompa. An exact algorithm to identify motifs in orthologous sequences from multiple species. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 37-45, San Diego, CA, Aug. 2000. AAAI Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. 5.D. S. W. Boam, A. R. Clark, and K. Docherty. Positive and negative regulation of the human insulin gene by multiple trans-acting factors. Journal of Biological Chemistry, 265:8285-96, 1990.Google ScholarGoogle Scholar
  6. 6.A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen. Predicting gene regulatory elements in silico on a genomic scale. Genome Research, 15:1202-1215, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  7. 7.J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 2001. To appear.Google ScholarGoogle Scholar
  8. 8.A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.Google ScholarGoogle Scholar
  9. 9.Y. M. Fraenkel, Y. Mandel, D. Friedberg, and H. Margalit. Identification of common motifs in unaligned DNA sequences: application to Escherichia coli Lrp regulon. Computer Applications in the Biosciences, 11(4):379-387, 1995.Google ScholarGoogle Scholar
  10. 10.D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences: Analysis of promoter sequences from Escherichia coli. Journal of Molecular Biology, 186(1):117-128, Nov. 1985.Google ScholarGoogle ScholarCross RefCross Ref
  11. 11.A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Databases, Edinburgh, Scotland, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. 12.W. S. Hayes and M. Borodovsky. Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction. In Pacific Symposium on Biocomputing, pages 279-290, 1998.Google ScholarGoogle Scholar
  13. 13.G. Z. Hertz and G. D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7/8):563-577, July/August 1999.Google ScholarGoogle ScholarCross RefCross Ref
  14. 14.P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604-613, Dallas, TX, May 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. 15.M. Kozak. Comparison of initiation of protein synthesis in procaryotes, eucaryotes, and organelles. Microbiological Reviews, 47(1):1-45, 1983.Google ScholarGoogle Scholar
  16. 16.C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214, 8 October 1993.Google ScholarGoogle ScholarCross RefCross Ref
  17. 17.C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function, and Genetics, 7:41-51, 1990.Google ScholarGoogle Scholar
  18. 18.B. Lewin. Genes VI. Oxford University Press, 1997.Google ScholarGoogle Scholar
  19. 19.M. Linial, N. Linial, N. Tishby, and G. Yona. Global self-organization of all known protein sequences reveals inherent biological signatures. Journal of Molecular Biology, 268:539-556, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  20. 20.C. J. McInerny, J. F. Partridge, G. E. Mikesell, D. P. Creemer, and L. L. Breeden. A novel Mcm1-dependent element in the SWI4, CLN3, CDC6, and CDC47 promoters activates M/G-specific transcription. Genes & Development, 11:1277-1288, 1997.Google ScholarGoogle Scholar
  21. 21.A. L. Means and P. G. Farnham. Transcription initiation from the dihydrofolate reductase promoter is positioned by hip1 binding at the initiation site. Molecular and Cellular Biology, 10:653-61, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  22. 22.S. Natsan and M. Gilman. Yy1 facilitates the association of serum response factor with the c-fos serum response element. Molecular and Cellular Biology, 15:5975-82, 1995.Google ScholarGoogle ScholarCross RefCross Ref
  23. 23.P. Pevzner and S.-H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 269-278, San Diego, CA, Aug. 2000. AAAI Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. 24.I. Rigoutsos and A. Floratos. Motif discovery without alignment or enumeration. In RECOMB98: Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 221-227, New York, NY, Mar. 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. 25.E. Rocke and M. Tompa. An algorithm for finding novel gapped motifs in DNA sequences. In RECOMB98: Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 228-233, New York, NY, Mar. 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. 26.M.-F. Sagot. Spelling approximate repeated or common motifs using a suffix tree. In C. L. Lucchesi and A. V. Moura, editors, Latin '98: Theoretical Informatics, volume 1380 of Lecture Notes in Computer Science, pages 111-127. Springer, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. 27.S. Sinha and M. Tompa. A statistical method for finding transcription factor binding sites. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 344-354, San Diego, CA, Aug. 2000. AAAI Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. 28.R. Staden. Methods for discovering novel motifs in nucleic acid sequences. Computer Applications in the Biosciences, 5(4):293-298, 1989.Google ScholarGoogle Scholar
  29. 29.M. Tompa. An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 262-271, Heidelberg, Germany, Aug. 1999. AAAI Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. 30.J. van Helden, B. Andr~, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281(5):827-842, Sept. 4 1998.Google ScholarGoogle ScholarCross RefCross Ref
  31. 31.E. Wingender, P. Dietze, H. Karas, and R. Kn~ppel. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Research, 24(1):238-241, 1996. http://transfac.gbfbraunschweig. de/TRANSFAC/.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Finding motifs using random projections

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            RECOMB '01: Proceedings of the fifth annual international conference on Computational biology
            April 2001
            316 pages
            ISBN:1581133537
            DOI:10.1145/369133
            • Chairman:
            • Thomas Lengauer

            Copyright © 2001 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 22 April 2001

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            RECOMB '01 Paper Acceptance Rate35of128submissions,27%Overall Acceptance Rate148of538submissions,28%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader