short-paper

Community detection-based features for sequence classification

Authors:
Karthik Tangirala

Kansas State University, Manhattan, Kansas

Kansas State University, Manhattan, Kansas
View Profile

,
Doina Caragea

Kansas State University, Manhattan, Kansas

Kansas State University, Manhattan, Kansas
View Profile

BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health InformaticsSeptember 2014Pages 559–568https://doi.org/10.1145/2649387.2649446

Published:20 September 2014Publication History

BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

Pages 559–568

ABSTRACT

Machine learning algorithms are widely used to annotate biological sequences. Most learning algorithms require data to be represented using a vector of features. In the absence of biologically informative features, k-mers generated using the sliding window-based approach are commonly used to represent biological sequences. However, the number of features generated using the sliding window-based approach increases exponentially with the length of the k-mer, k, and as a consequence, the computational time of the learning algorithms also increases. As an alternative to k-mers, we propose to use a community detection approach to identify informative features for sequence classification. The community detection approach is run on a network constructed based on k-mer similarity, and it has the advantage that class labels are not needed when constructing the network. It works by identifying sets of nodes that are densely connected when compared to other nodes of the network. Community detection approaches are computationally expensive, especially for large or densely connected networks. To address this challenge, we propose an efficient strategy that involves running the community detection approach on a set of samples, where each sample is a small subset of the total set of sequences. A set of features is obtained from each sample, and the union of all feature sets is used to represent sequences for classification purposes. Experimental results in both supervised and semi-supervised settings show that the features generated with the community detection approach are more informative than k-mers. Furthermore, we observe that running the community detection approach on a set of small samples of sequences produces better features as compared to running the community detection approach on a set of large samples of sequences.

References

T. L. Bailey, M. Boden, F. A. Buske, M. Frith, C. E. Grant, L. Clementi, J. Ren, W. W. Li, and W. S. Noble. Meme suite: tools for motif discovery and searching. Nucleic Acids Research, 37(suppl 2):W202--W208, 2009.Google ScholarCross Ref
M. Banko and E. Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ACL '01, pages 26--33, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics. Google ScholarDigital Library
R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5:537--550, 1994. Google ScholarDigital Library
A. Bazinet and M. Cummings. A comparative evaluation of sequence classification programs. BMC Bioinformatics, 13(1):1--13, 2012.Google ScholarCross Ref
V. Blondel, J. Guillaume, R. Lambiotte, and E. Mech. Fast unfolding of communities in large networks. J. Stat. Mech, page P10008, 2008.Google Scholar
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT' 98, pages 92--100, New York, NY, USA, 1998. ACM. Google ScholarDigital Library
C. Caragea, A. Silvescu, and P. Mitra. Protein sequence classification using feature hashing. Proteome Science, 10(1):1--8, 2012.Google Scholar
A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, pages 1--6, 2004.Google ScholarCross Ref
L. Donetti and M. A. Muñoz. Improved spectral algorithm for the detection of network communities. In Proceedings of the 8th Granada Seminar - Computational and Statistical Physics, pages 1--2, 2005.Google ScholarCross Ref
S. Fortunato. Community detection in graphs. Physics Reports, 486(3):75--174, 2010.Google ScholarCross Ref
S. Fortunato and A. Lancichinetti. Community detection algorithms: A comparative analysis: Invited presentation, extended abstract. In Proceedings of the Fourth International ICST Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS '09, pages 27:1--27:2, ICST, Brussels, Belgium, Belgium, 2009. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering). Google ScholarDigital Library
M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821--7826, 2002.Google ScholarCross Ref
M. Griffith, M. J. Tang, O. L. Griffith, R. D. Morin, S. Y. Chan, J. K. Asano, T. Zeng, S. Flibotte, A. Ally, A. Baross, M. Hirst, S. J. M. Jones, G. B. Morin, I. T. Tai, and M. A. Marra. ALEXA: a microarray design platform for alternative expression analysis. Nature Methods, 5(2):118, 2008.Google ScholarCross Ref
R. Guimera and L. A. N. Amaral. Functional cartography of complex metabolic networks. Nature, 433(7028):895--900, 2005.Google ScholarCross Ref
S.-P. M. A. L. Guimera, R. Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E, 70:art. no. 025101, AUG 2004.Google Scholar
C. Jia, M. Carson, and J. Yu. A fast weak motif-finding algorithm based on community detection in graphs. BMC Bioinformatics, 14(1):1--14, 2013.Google ScholarCross Ref
C. Largeron, C. Moulin, and M. Gèry. Entropy based feature selection for text categorization. In Proc. of the 2011 ACM Symp. on Applied Computing, SAC '11, pages 924--928, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
C. Leslie, E. Eskin, and W. S. S. Noble. The spectrum kernel: a string kernel for SVM protein classification. pages 564--575, Department of Computer Science, Columbia University, New York, NY 10027, USA. [email protected], 2002.Google Scholar
C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467--476, 2004. Google ScholarDigital Library
C. P. Massen and J. P. K. Doye. Identifying communities within energy landscapes. Phys. Rev. E, 71:046101, Apr 2005.Google ScholarCross Ref
A. Medus, G. Acuna, and C. Dorso. Detection of community structures in networks via global optimization. Physica A: Statistical Mechanics and its Applications, 358(2):593--604, 2005.Google ScholarCross Ref
P. Melsted and J. Pritchard. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics, 12(1):333+, 2011.Google Scholar
M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review, E 69(026113), 2004.Google Scholar
H. T. Ng, W. B. Goh, and K. L. Low. Feature selection, perceptron learning, and a usability case study for text categorization. SIGIR Forum, 31(SI):67--73, July 1997. Google ScholarDigital Library
V. Ng and C. Cardie. Weakly supervised natural language learning without redundant views. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 94--101, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. Google ScholarDigital Library
K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management, CIKM '00, pages 86--93, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
G. Pavesi, G. Mauri, and G. Pesole. An algorithm for finding signals of unknown length in dna sequences. Bioinformatics, 17(suppl 1):S207--S214, 2001.Google ScholarCross Ref
G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole. Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research, 32(Web-Server-Issue):199--203, 2004.Google Scholar
R. Pudimat, R. Backofen, and E. G. Schukat-Talamazzini. Fast feature subset selection in biological sequence analysis. International Journal of Pattern Recognition and Artificial Intelligence, 23(02):191--207, 2009.Google ScholarCross Ref
F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America, 101(9):2658--2663, 2004.Google ScholarCross Ref
U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106+, Sept. 2007.Google ScholarCross Ref
G. Rätsch, S. Sonnenburg, and B. Schölkopf. Rase: recognition of alternatively spliced exons in c.elegans. Bioinformatics, 21(suppl 1):i369--i377, 2005. Google ScholarDigital Library
M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118--1123, 2008.Google ScholarCross Ref
N. Spolaôr, E. Cherman, M. Monard, and H. Lee. Filter approach feature selection methods to support multi-label learning based on relieff and information gain. In L. Barros, M. Finger, A. Pozo, G. GimenÃl'nez-Lugo, and M. Castilho, editors, Advances in Artificial Intelligence - SBIA 2012, Lecture Notes in Computer Science, pages 72--81. Springer Berlin Heidelberg, 2012. Google ScholarDigital Library
L. Sun, H. Luo, D. Bu, G. Zhao, K. Yu, C. Zhang, Y. Liu, R. Chen, and Y. Zhao. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Research, 2013.Google ScholarCross Ref
E. D. Wiener, J. O. Pedersen, and A. S. Weigend. A neural network approach to topic spotting. In Proc. of SDAIR-95, pages 317--332, Las Vegas, US, 1995.Google Scholar
J. Xia, D. Caragea, and S. Brown. Exploring alternative splicing features using support vector machines. In Proc. of IEEE BIBM 2008, pages 231--238, Washington, DC, USA, 2008. IEEE Computer Society. Google ScholarDigital Library

Index Terms

Community detection-based features for sequence classification
1. Computing methodologies
  1. Machine learning
    1. Learning settings
    2. Machine learning algorithms
      1. Feature selection

Recommendations

Generating Features using Burrows Wheeler Transformation for Biological Sequence Classification
BIOSTEC 2014: Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3

Recent advancements in biological sciences have resulted in the availability of large amounts of sequence data

(both DNA and protein sequences). The annotation of biological sequence data can be approached using machine

learning techniques. Such ...
Read More
Genomic Sequence Fragment Identification using Quasi-Alignment
BCB'13: Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics

Identification of organisms using their genetic sequences is a popular problem in molecular biology and is used in fields such as metagenomics, molecular phylogenetics and DNA Barcoding. These applications depend on searching large sequence databases ...
Read More
Deep Learning Approach for Pathogen Detection Through Shotgun Metagenomics Sequence Classification
Artificial Intelligence in Medicine
Abstract
Studies have shown that shotgun metagenomics sequencing facilitates the evaluation of diverse viruses, bacteria, and eukaryotic microbes and assists in exploring their abundances in complex samples. Due to the challenges of processing a ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
September 2014
851 pages
ISBN:9781450328944
DOI:10.1145/2649387
General Chairs:
Pierre Baldi
University of California, Irvine
,
Wei Wang
University of California, Los Angeles
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 September 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
dimensionality reduction
feature generation
machine learning
motifs
semi-supervised learning
sequence classification
supervised learning
Qualifiers
- short-paper
Conference

Acceptance Rates
Overall Acceptance Rate254of885submissions,29%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 4
  Total Citations
  View Citations
- 104
  Total Downloads
- Downloads (Last 12 months)4
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Community detection-based features for sequence classification

BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

ABSTRACT

References

Cited By

Index Terms

Recommendations

Generating Features using Burrows Wheeler Transformation for Biological Sequence Classification

Genomic Sequence Fragment Identification using Quasi-Alignment

Deep Learning Approach for Pathogen Detection Through Shotgun Metagenomics Sequence Classification