ABSTRACT
Machine learning algorithms are widely used to annotate biological sequences. Most learning algorithms require data to be represented using a vector of features. In the absence of biologically informative features, k-mers generated using the sliding window-based approach are commonly used to represent biological sequences. However, the number of features generated using the sliding window-based approach increases exponentially with the length of the k-mer, k, and as a consequence, the computational time of the learning algorithms also increases. As an alternative to k-mers, we propose to use a community detection approach to identify informative features for sequence classification. The community detection approach is run on a network constructed based on k-mer similarity, and it has the advantage that class labels are not needed when constructing the network. It works by identifying sets of nodes that are densely connected when compared to other nodes of the network. Community detection approaches are computationally expensive, especially for large or densely connected networks. To address this challenge, we propose an efficient strategy that involves running the community detection approach on a set of samples, where each sample is a small subset of the total set of sequences. A set of features is obtained from each sample, and the union of all feature sets is used to represent sequences for classification purposes. Experimental results in both supervised and semi-supervised settings show that the features generated with the community detection approach are more informative than k-mers. Furthermore, we observe that running the community detection approach on a set of small samples of sequences produces better features as compared to running the community detection approach on a set of large samples of sequences.
- T. L. Bailey, M. Boden, F. A. Buske, M. Frith, C. E. Grant, L. Clementi, J. Ren, W. W. Li, and W. S. Noble. Meme suite: tools for motif discovery and searching. Nucleic Acids Research, 37(suppl 2):W202--W208, 2009.Google ScholarCross Ref
- M. Banko and E. Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ACL '01, pages 26--33, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics. Google ScholarDigital Library
- R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5:537--550, 1994. Google ScholarDigital Library
- A. Bazinet and M. Cummings. A comparative evaluation of sequence classification programs. BMC Bioinformatics, 13(1):1--13, 2012.Google ScholarCross Ref
- V. Blondel, J. Guillaume, R. Lambiotte, and E. Mech. Fast unfolding of communities in large networks. J. Stat. Mech, page P10008, 2008.Google Scholar
- A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT' 98, pages 92--100, New York, NY, USA, 1998. ACM. Google ScholarDigital Library
- C. Caragea, A. Silvescu, and P. Mitra. Protein sequence classification using feature hashing. Proteome Science, 10(1):1--8, 2012.Google Scholar
- A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, pages 1--6, 2004.Google ScholarCross Ref
- L. Donetti and M. A. Muñoz. Improved spectral algorithm for the detection of network communities. In Proceedings of the 8th Granada Seminar - Computational and Statistical Physics, pages 1--2, 2005.Google ScholarCross Ref
- S. Fortunato. Community detection in graphs. Physics Reports, 486(3):75--174, 2010.Google ScholarCross Ref
- S. Fortunato and A. Lancichinetti. Community detection algorithms: A comparative analysis: Invited presentation, extended abstract. In Proceedings of the Fourth International ICST Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS '09, pages 27:1--27:2, ICST, Brussels, Belgium, Belgium, 2009. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering). Google ScholarDigital Library
- M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821--7826, 2002.Google ScholarCross Ref
- M. Griffith, M. J. Tang, O. L. Griffith, R. D. Morin, S. Y. Chan, J. K. Asano, T. Zeng, S. Flibotte, A. Ally, A. Baross, M. Hirst, S. J. M. Jones, G. B. Morin, I. T. Tai, and M. A. Marra. ALEXA: a microarray design platform for alternative expression analysis. Nature Methods, 5(2):118, 2008.Google ScholarCross Ref
- R. Guimera and L. A. N. Amaral. Functional cartography of complex metabolic networks. Nature, 433(7028):895--900, 2005.Google ScholarCross Ref
- S.-P. M. A. L. Guimera, R. Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E, 70:art. no. 025101, AUG 2004.Google Scholar
- C. Jia, M. Carson, and J. Yu. A fast weak motif-finding algorithm based on community detection in graphs. BMC Bioinformatics, 14(1):1--14, 2013.Google ScholarCross Ref
- C. Largeron, C. Moulin, and M. Gèry. Entropy based feature selection for text categorization. In Proc. of the 2011 ACM Symp. on Applied Computing, SAC '11, pages 924--928, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- C. Leslie, E. Eskin, and W. S. S. Noble. The spectrum kernel: a string kernel for SVM protein classification. pages 564--575, Department of Computer Science, Columbia University, New York, NY 10027, USA. [email protected], 2002.Google Scholar
- C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467--476, 2004. Google ScholarDigital Library
- C. P. Massen and J. P. K. Doye. Identifying communities within energy landscapes. Phys. Rev. E, 71:046101, Apr 2005.Google ScholarCross Ref
- A. Medus, G. Acuna, and C. Dorso. Detection of community structures in networks via global optimization. Physica A: Statistical Mechanics and its Applications, 358(2):593--604, 2005.Google ScholarCross Ref
- P. Melsted and J. Pritchard. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics, 12(1):333+, 2011.Google Scholar
- M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review, E 69(026113), 2004.Google Scholar
- H. T. Ng, W. B. Goh, and K. L. Low. Feature selection, perceptron learning, and a usability case study for text categorization. SIGIR Forum, 31(SI):67--73, July 1997. Google ScholarDigital Library
- V. Ng and C. Cardie. Weakly supervised natural language learning without redundant views. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 94--101, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. Google ScholarDigital Library
- K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management, CIKM '00, pages 86--93, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
- G. Pavesi, G. Mauri, and G. Pesole. An algorithm for finding signals of unknown length in dna sequences. Bioinformatics, 17(suppl 1):S207--S214, 2001.Google ScholarCross Ref
- G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole. Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research, 32(Web-Server-Issue):199--203, 2004.Google Scholar
- R. Pudimat, R. Backofen, and E. G. Schukat-Talamazzini. Fast feature subset selection in biological sequence analysis. International Journal of Pattern Recognition and Artificial Intelligence, 23(02):191--207, 2009.Google ScholarCross Ref
- F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America, 101(9):2658--2663, 2004.Google ScholarCross Ref
- U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106+, Sept. 2007.Google ScholarCross Ref
- G. Rätsch, S. Sonnenburg, and B. Schölkopf. Rase: recognition of alternatively spliced exons in c.elegans. Bioinformatics, 21(suppl 1):i369--i377, 2005. Google ScholarDigital Library
- M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118--1123, 2008.Google ScholarCross Ref
- N. Spolaôr, E. Cherman, M. Monard, and H. Lee. Filter approach feature selection methods to support multi-label learning based on relieff and information gain. In L. Barros, M. Finger, A. Pozo, G. GimenÃl'nez-Lugo, and M. Castilho, editors, Advances in Artificial Intelligence - SBIA 2012, Lecture Notes in Computer Science, pages 72--81. Springer Berlin Heidelberg, 2012. Google ScholarDigital Library
- L. Sun, H. Luo, D. Bu, G. Zhao, K. Yu, C. Zhang, Y. Liu, R. Chen, and Y. Zhao. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Research, 2013.Google ScholarCross Ref
- E. D. Wiener, J. O. Pedersen, and A. S. Weigend. A neural network approach to topic spotting. In Proc. of SDAIR-95, pages 317--332, Las Vegas, US, 1995.Google Scholar
- J. Xia, D. Caragea, and S. Brown. Exploring alternative splicing features using support vector machines. In Proc. of IEEE BIBM 2008, pages 231--238, Washington, DC, USA, 2008. IEEE Computer Society. Google ScholarDigital Library
Index Terms
- Community detection-based features for sequence classification
Recommendations
Generating Features using Burrows Wheeler Transformation for Biological Sequence Classification
BIOSTEC 2014: Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3Recent advancements in biological sciences have resulted in the availability of large amounts of sequence data
(both DNA and protein sequences). The annotation of biological sequence data can be approached using machine
learning techniques. Such ...
Genomic Sequence Fragment Identification using Quasi-Alignment
BCB'13: Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical InformaticsIdentification of organisms using their genetic sequences is a popular problem in molecular biology and is used in fields such as metagenomics, molecular phylogenetics and DNA Barcoding. These applications depend on searching large sequence databases ...
Deep Learning Approach for Pathogen Detection Through Shotgun Metagenomics Sequence Classification
Artificial Intelligence in MedicineAbstractStudies have shown that shotgun metagenomics sequencing facilitates the evaluation of diverse viruses, bacteria, and eukaryotic microbes and assists in exploring their abundances in complex samples. Due to the challenges of processing a ...
Comments