Abstract
The availability of a great range of prior biological knowledge about the roles and functions of genes and gene-gene interactions allows us to simplify the analysis of gene expression data to make it more robust, compact, and interpretable. Here, we objectively analyze the applicability of functional clustering for the identification of groups of functionally related genes. The analysis is performed in terms of gene expression classification and uses predictive accuracy as an unbiased performance measure. Features of biological samples that originally corresponded to genes are replaced by features that correspond to the centroids of the gene clusters and are then used for classifier learning. Using 10 benchmark data sets, we demonstrate that functional clustering significantly outperforms random clustering without biological relevance. We also show that functional clustering performs comparably to gene expression clustering, which groups genes according to the similarity of their expression profiles. Finally, the suitability of functional clustering as a feature extraction technique is evaluated and discussed.
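The feature replacement described in the abstract can be illustrated with a minimal sketch. It assumes an expression matrix with samples in rows and genes in columns, and a cluster label for each gene. It also assumes that a cluster centroid is the per-sample mean expression of the cluster's genes. The function names `centroid_features` and `random_clustering` are hypothetical, and shuffling labels is only one possible way to build a random clustering without biological relevance; the abstract does not specify the authors' exact procedure.

```python
import numpy as np


def centroid_features(X, gene_clusters):
    """Replace gene-level features with cluster-centroid features.

    X             : array-like, shape (n_samples, n_genes), expression values.
    gene_clusters : array-like, length n_genes, one cluster label per gene.

    Returns (features, labels), where features has shape
    (n_samples, n_clusters) and column j is the mean expression of the
    genes assigned to labels[j] (assumed centroid definition).
    """
    X = np.asarray(X, dtype=float)
    gene_clusters = np.asarray(gene_clusters)
    labels = np.unique(gene_clusters)
    features = np.column_stack(
        [X[:, gene_clusters == c].mean(axis=1) for c in labels]
    )
    return features, labels


def random_clustering(gene_clusters, seed=0):
    """Shuffle cluster labels across genes (assumed baseline).

    Cluster sizes are preserved, but the assignment of genes to clusters
    no longer reflects any functional annotation.
    """
    rng = np.random.default_rng(seed)
    return rng.permutation(np.asarray(gene_clusters))


if __name__ == "__main__":
    # Toy data: 2 samples, 4 genes, 2 hypothetical functional clusters.
    X = [[1, 3, 10, 20],
         [2, 4, 30, 50]]
    clusters = ["a", "a", "b", "b"]
    features, labels = centroid_features(X, clusters)
    print(labels)    # cluster labels, in column order
    print(features)  # one centroid feature per cluster, per sample
```

The resulting matrix can be passed to any classifier in place of the original gene-level matrix. Running the same pipeline on the output of `random_clustering` gives a baseline whose predictive accuracy can be compared against that of the functional clusters.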
Index Terms
- Empirical Evidence of the Applicability of Functional Clustering through Gene Expression Classification