Empirical Evidence of the Applicability of Functional Clustering through Gene Expression Classification

Published: 01 May 2012

Abstract

The availability of a great range of prior biological knowledge about the roles and functions of genes and gene-gene interactions allows us to simplify the analysis of gene expression data to make it more robust, compact, and interpretable. Here, we objectively analyze the applicability of functional clustering for the identification of groups of functionally related genes. The analysis is performed in terms of gene expression classification and uses predictive accuracy as an unbiased performance measure. Features of biological samples that originally corresponded to genes are replaced by features that correspond to the centroids of the gene clusters and are then used for classifier learning. Using 10 benchmark data sets, we demonstrate that functional clustering significantly outperforms random clustering without biological relevance. We also show that functional clustering performs comparably to gene expression clustering, which groups genes according to the similarity of their expression profiles. Finally, the suitability of functional clustering as a feature extraction technique is evaluated and discussed.
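The feature replacement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a cluster centroid is the mean expression of the cluster's genes within a sample, that each gene belongs to exactly one cluster, and that the random baseline keeps the original cluster sizes. The function names, gene ids, and cluster labels are hypothetical.

```python
import random
from statistics import mean


def centroid_features(samples, gene_clusters):
    """Replace per-gene features with per-cluster centroids.

    samples: list of dicts mapping gene id -> expression value
    gene_clusters: dict mapping cluster id -> list of gene ids
    Returns one feature vector per sample, ordered by sorted cluster id.
    Assumption: the centroid is the mean expression of the cluster's genes.
    """
    cluster_ids = sorted(gene_clusters)
    return [
        [mean(sample[g] for g in gene_clusters[c]) for c in cluster_ids]
        for sample in samples
    ]


def random_clusters(gene_clusters, seed=0):
    """Baseline without biological relevance: shuffle genes across clusters.

    Assumption: cluster sizes are preserved so that only the gene-to-cluster
    assignment differs from the functional clustering.
    """
    rng = random.Random(seed)
    cluster_ids = sorted(gene_clusters)
    genes = [g for c in cluster_ids for g in gene_clusters[c]]
    rng.shuffle(genes)
    shuffled, start = {}, 0
    for c in cluster_ids:
        size = len(gene_clusters[c])
        shuffled[c] = genes[start:start + size]
        start += size
    return shuffled


# Toy data: two samples, four genes, two hypothetical functional clusters.
samples = [
    {"g1": 1.0, "g2": 3.0, "g3": 10.0, "g4": 12.0},
    {"g1": 2.0, "g2": 4.0, "g3": 0.0, "g4": 2.0},
]
functional = {"repair": ["g1", "g2"], "cycle": ["g3", "g4"]}

# Columns follow sorted cluster ids: "cycle", then "repair".
features = centroid_features(samples, functional)
baseline_features = centroid_features(samples, random_clusters(functional))
```

Either feature matrix can then be passed to any classifier; comparing the cross-validated accuracy obtained with `features` against that obtained with `baseline_features` mirrors the functional-versus-random comparison the abstract reports.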

