Article

Finding motifs using random projections

Authors:
Jeremy Buhler

Department of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA

Department of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA
View Profile

,
Martin Tompa

Department of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA

Department of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA
View Profile

RECOMB '01: Proceedings of the fifth annual international conference on Computational biologyApril 2001Pages 69–76https://doi.org/10.1145/369133.369172

Published:22 April 2001Publication History

RECOMB '01: Proceedings of the fifth annual international conference on Computational biology

Pages 69–76

ABSTRACT

Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem. Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems.

We introduce a novel motif discovery algorithm based on the use of random projections of the input's substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l, d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes.

References

1.R. D. Andersen, S. J. Taplitz, S. Wong, G. Bristol, B. Larkin, and H. R. Herschman. Metal-dependent binding of a factor in vivo to the metal-responsive elements of the metallothionein 1 gene promoter. Molecular and Cellular Biology, 7:3574-81, 1987.Google ScholarCross Ref
2.T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21(1-2):51-80, Oct. 1995. Google ScholarDigital Library
3.M. Blanchette. Algorithms for phylogenetic footprinting. In RECOMB01: Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, Montreal, Canada, Apr. 2001. Google ScholarDigital Library
4.M. Blanchette, B. Schwikowski, and M. Tompa. An exact algorithm to identify motifs in orthologous sequences from multiple species. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 37-45, San Diego, CA, Aug. 2000. AAAI Press. Google ScholarDigital Library
5.D. S. W. Boam, A. R. Clark, and K. Docherty. Positive and negative regulation of the human insulin gene by multiple trans-acting factors. Journal of Biological Chemistry, 265:8285-96, 1990.Google Scholar
6.A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen. Predicting gene regulatory elements in silico on a genomic scale. Genome Research, 15:1202-1215, 1998.Google ScholarCross Ref
7.J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 2001. To appear.Google Scholar
8.A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.Google Scholar
9.Y. M. Fraenkel, Y. Mandel, D. Friedberg, and H. Margalit. Identification of common motifs in unaligned DNA sequences: application to Escherichia coli Lrp regulon. Computer Applications in the Biosciences, 11(4):379-387, 1995.Google Scholar
10.D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences: Analysis of promoter sequences from Escherichia coli. Journal of Molecular Biology, 186(1):117-128, Nov. 1985.Google ScholarCross Ref
11.A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Databases, Edinburgh, Scotland, 1999. Google ScholarDigital Library
12.W. S. Hayes and M. Borodovsky. Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction. In Pacific Symposium on Biocomputing, pages 279-290, 1998.Google Scholar
13.G. Z. Hertz and G. D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7/8):563-577, July/August 1999.Google ScholarCross Ref
14.P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604-613, Dallas, TX, May 1998. Google ScholarDigital Library
15.M. Kozak. Comparison of initiation of protein synthesis in procaryotes, eucaryotes, and organelles. Microbiological Reviews, 47(1):1-45, 1983.Google Scholar
16.C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214, 8 October 1993.Google ScholarCross Ref
17.C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function, and Genetics, 7:41-51, 1990.Google Scholar
18.B. Lewin. Genes VI. Oxford University Press, 1997.Google Scholar
19.M. Linial, N. Linial, N. Tishby, and G. Yona. Global self-organization of all known protein sequences reveals inherent biological signatures. Journal of Molecular Biology, 268:539-556, 1997.Google ScholarCross Ref
20.C. J. McInerny, J. F. Partridge, G. E. Mikesell, D. P. Creemer, and L. L. Breeden. A novel Mcm1-dependent element in the SWI4, CLN3, CDC6, and CDC47 promoters activates M/G-specific transcription. Genes & Development, 11:1277-1288, 1997.Google Scholar
21.A. L. Means and P. G. Farnham. Transcription initiation from the dihydrofolate reductase promoter is positioned by hip1 binding at the initiation site. Molecular and Cellular Biology, 10:653-61, 1990.Google ScholarCross Ref
22.S. Natsan and M. Gilman. Yy1 facilitates the association of serum response factor with the c-fos serum response element. Molecular and Cellular Biology, 15:5975-82, 1995.Google ScholarCross Ref
23.P. Pevzner and S.-H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 269-278, San Diego, CA, Aug. 2000. AAAI Press. Google ScholarDigital Library
24.I. Rigoutsos and A. Floratos. Motif discovery without alignment or enumeration. In RECOMB98: Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 221-227, New York, NY, Mar. 1998. Google ScholarDigital Library
25.E. Rocke and M. Tompa. An algorithm for finding novel gapped motifs in DNA sequences. In RECOMB98: Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 228-233, New York, NY, Mar. 1998. Google ScholarDigital Library
26.M.-F. Sagot. Spelling approximate repeated or common motifs using a suffix tree. In C. L. Lucchesi and A. V. Moura, editors, Latin '98: Theoretical Informatics, volume 1380 of Lecture Notes in Computer Science, pages 111-127. Springer, 1998. Google ScholarDigital Library
27.S. Sinha and M. Tompa. A statistical method for finding transcription factor binding sites. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 344-354, San Diego, CA, Aug. 2000. AAAI Press. Google ScholarDigital Library
28.R. Staden. Methods for discovering novel motifs in nucleic acid sequences. Computer Applications in the Biosciences, 5(4):293-298, 1989.Google Scholar
29.M. Tompa. An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 262-271, Heidelberg, Germany, Aug. 1999. AAAI Press. Google ScholarDigital Library
30.J. van Helden, B. Andr~, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281(5):827-842, Sept. 4 1998.Google ScholarCross Ref
31.E. Wingender, P. Dietze, H. Karas, and R. Kn~ppel. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Research, 24(1):238-241, 1996. http://transfac.gbfbraunschweig. de/TRANSFAC/.Google ScholarCross Ref

Index Terms

Finding motifs using random projections
1. Applied computing
  1. Life and medical sciences
2. Mathematics of computing
  1. Probability and statistics

Recommendations

Finding functional promoter motifs by computational methods: a word of caution

The standard practice in the analysis of promoters is to select promoter regions of convenient length. This may lead to false results when searching for Transcription Factor Binding Sites (TFBSs), since the sequences may contain coding segments. In such ...
Read More
Computational methods for finding regulatory dna motifs using sequence characteristics and positional preferences
Read More
Comparative genomics reveals unusually long motifs in mammalian genomes

Between short regulatory motifs and long ‘ultraconserved’ regions lies a whole spectrum of functional elements that remains uncharted.

– Manolis Kellis, RECOMB Regulatory Genomics satellite workshop, December 2005

Motivation: The recent discovery of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
RECOMB '01: Proceedings of the fifth annual international conference on Computational biology
April 2001
316 pages
ISBN:1581133537
DOI:10.1145/369133
Chairman:
Thomas Lengauer
German National Research Center for Information Technology, Germany
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 22 April 2001
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
RECOMB '01 Paper Acceptance Rate35of128submissions,27%Overall Acceptance Rate148of538submissions,28%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 112
  Total Citations
  View Citations
- 1,145
  Total Downloads
- Downloads (Last 12 months)28
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Finding motifs using random projections

RECOMB '01: Proceedings of the fifth annual international conference on Computational biology

ABSTRACT

References

Cited By

Index Terms

Recommendations

Finding functional promoter motifs by computational methods: a word of caution

Computational methods for finding regulatory dna motifs using sequence characteristics and positional preferences

Comparative genomics reveals unusually long motifs in mammalian genomes