skip to main content
10.1145/2649387.2649446acmconferencesArticle/Chapter ViewAbstractPublication PagesbcbConference Proceedingsconference-collections
short-paper

Community detection-based features for sequence classification

Published:20 September 2014Publication History

ABSTRACT

Machine learning algorithms are widely used to annotate biological sequences. Most learning algorithms require data to be represented using a vector of features. In the absence of biologically informative features, k-mers generated using the sliding window-based approach are commonly used to represent biological sequences. However, the number of features generated using the sliding window-based approach increases exponentially with the length of the k-mer, k, and as a consequence, the computational time of the learning algorithms also increases. As an alternative to k-mers, we propose to use a community detection approach to identify informative features for sequence classification. The community detection approach is run on a network constructed based on k-mer similarity, and it has the advantage that class labels are not needed when constructing the network. It works by identifying sets of nodes that are densely connected when compared to other nodes of the network. Community detection approaches are computationally expensive, especially for large or densely connected networks. To address this challenge, we propose an efficient strategy that involves running the community detection approach on a set of samples, where each sample is a small subset of the total set of sequences. A set of features is obtained from each sample, and the union of all feature sets is used to represent sequences for classification purposes. Experimental results in both supervised and semi-supervised settings show that the features generated with the community detection approach are more informative than k-mers. Furthermore, we observe that running the community detection approach on a set of small samples of sequences produces better features as compared to running the community detection approach on a set of large samples of sequences.

References

  1. T. L. Bailey, M. Boden, F. A. Buske, M. Frith, C. E. Grant, L. Clementi, J. Ren, W. W. Li, and W. S. Noble. Meme suite: tools for motif discovery and searching. Nucleic Acids Research, 37(suppl 2):W202--W208, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  2. M. Banko and E. Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ACL '01, pages 26--33, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5:537--550, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Bazinet and M. Cummings. A comparative evaluation of sequence classification programs. BMC Bioinformatics, 13(1):1--13, 2012.Google ScholarGoogle ScholarCross RefCross Ref
  5. V. Blondel, J. Guillaume, R. Lambiotte, and E. Mech. Fast unfolding of communities in large networks. J. Stat. Mech, page P10008, 2008.Google ScholarGoogle Scholar
  6. A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT' 98, pages 92--100, New York, NY, USA, 1998. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. C. Caragea, A. Silvescu, and P. Mitra. Protein sequence classification using feature hashing. Proteome Science, 10(1):1--8, 2012.Google ScholarGoogle Scholar
  8. A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, pages 1--6, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  9. L. Donetti and M. A. Muñoz. Improved spectral algorithm for the detection of network communities. In Proceedings of the 8th Granada Seminar - Computational and Statistical Physics, pages 1--2, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  10. S. Fortunato. Community detection in graphs. Physics Reports, 486(3):75--174, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  11. S. Fortunato and A. Lancichinetti. Community detection algorithms: A comparative analysis: Invited presentation, extended abstract. In Proceedings of the Fourth International ICST Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS '09, pages 27:1--27:2, ICST, Brussels, Belgium, Belgium, 2009. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering). Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821--7826, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  13. M. Griffith, M. J. Tang, O. L. Griffith, R. D. Morin, S. Y. Chan, J. K. Asano, T. Zeng, S. Flibotte, A. Ally, A. Baross, M. Hirst, S. J. M. Jones, G. B. Morin, I. T. Tai, and M. A. Marra. ALEXA: a microarray design platform for alternative expression analysis. Nature Methods, 5(2):118, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  14. R. Guimera and L. A. N. Amaral. Functional cartography of complex metabolic networks. Nature, 433(7028):895--900, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  15. S.-P. M. A. L. Guimera, R. Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E, 70:art. no. 025101, AUG 2004.Google ScholarGoogle Scholar
  16. C. Jia, M. Carson, and J. Yu. A fast weak motif-finding algorithm based on community detection in graphs. BMC Bioinformatics, 14(1):1--14, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  17. C. Largeron, C. Moulin, and M. Gèry. Entropy based feature selection for text categorization. In Proc. of the 2011 ACM Symp. on Applied Computing, SAC '11, pages 924--928, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. C. Leslie, E. Eskin, and W. S. S. Noble. The spectrum kernel: a string kernel for SVM protein classification. pages 564--575, Department of Computer Science, Columbia University, New York, NY 10027, USA. [email protected], 2002.Google ScholarGoogle Scholar
  19. C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467--476, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. C. P. Massen and J. P. K. Doye. Identifying communities within energy landscapes. Phys. Rev. E, 71:046101, Apr 2005.Google ScholarGoogle ScholarCross RefCross Ref
  21. A. Medus, G. Acuna, and C. Dorso. Detection of community structures in networks via global optimization. Physica A: Statistical Mechanics and its Applications, 358(2):593--604, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  22. P. Melsted and J. Pritchard. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics, 12(1):333+, 2011.Google ScholarGoogle Scholar
  23. M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review, E 69(026113), 2004.Google ScholarGoogle Scholar
  24. H. T. Ng, W. B. Goh, and K. L. Low. Feature selection, perceptron learning, and a usability case study for text categorization. SIGIR Forum, 31(SI):67--73, July 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. V. Ng and C. Cardie. Weakly supervised natural language learning without redundant views. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 94--101, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management, CIKM '00, pages 86--93, New York, NY, USA, 2000. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. G. Pavesi, G. Mauri, and G. Pesole. An algorithm for finding signals of unknown length in dna sequences. Bioinformatics, 17(suppl 1):S207--S214, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  28. G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole. Weeder web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research, 32(Web-Server-Issue):199--203, 2004.Google ScholarGoogle Scholar
  29. R. Pudimat, R. Backofen, and E. G. Schukat-Talamazzini. Fast feature subset selection in biological sequence analysis. International Journal of Pattern Recognition and Artificial Intelligence, 23(02):191--207, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  30. F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America, 101(9):2658--2663, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  31. U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106+, Sept. 2007.Google ScholarGoogle ScholarCross RefCross Ref
  32. G. Rätsch, S. Sonnenburg, and B. Schölkopf. Rase: recognition of alternatively spliced exons in c.elegans. Bioinformatics, 21(suppl 1):i369--i377, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118--1123, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  34. N. Spolaôr, E. Cherman, M. Monard, and H. Lee. Filter approach feature selection methods to support multi-label learning based on relieff and information gain. In L. Barros, M. Finger, A. Pozo, G. GimenÃl'nez-Lugo, and M. Castilho, editors, Advances in Artificial Intelligence - SBIA 2012, Lecture Notes in Computer Science, pages 72--81. Springer Berlin Heidelberg, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. L. Sun, H. Luo, D. Bu, G. Zhao, K. Yu, C. Zhang, Y. Liu, R. Chen, and Y. Zhao. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Research, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  36. E. D. Wiener, J. O. Pedersen, and A. S. Weigend. A neural network approach to topic spotting. In Proc. of SDAIR-95, pages 317--332, Las Vegas, US, 1995.Google ScholarGoogle Scholar
  37. J. Xia, D. Caragea, and S. Brown. Exploring alternative splicing features using support vector machines. In Proc. of IEEE BIBM 2008, pages 231--238, Washington, DC, USA, 2008. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Community detection-based features for sequence classification

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
        September 2014
        851 pages
        ISBN:9781450328944
        DOI:10.1145/2649387
        • General Chairs:
        • Pierre Baldi,
        • Wei Wang

        Copyright © 2014 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 20 September 2014

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • short-paper

        Acceptance Rates

        Overall Acceptance Rate254of885submissions,29%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader