ABSTRACT
Mining frequent patterns is a general and important issue in data mining. Complex and unstructured (or semi-structured) datasets have appeared in major data mining applications, including text mining, web mining and bioinformatics. Mining patterns from these datasets is the focus of many of the current data mining approaches. We focus on labeled ordered trees, typical datasets of semi-structured data in data mining, and propose a new probabilistic model and its efficient learning scheme for mining labeled ordered trees. The proposed approach significantly improves the time and space complexity of an existing probabilistic modeling for labeled ordered trees, while maintaining its expressive power. We evaluated the performance of the proposed model, comparing it with that of the existing model, using synthetic as well as real datasets from the field of glycobiology. Experimental results showed that the proposed model drastically reduced the computation time of the competing model, keeping the predictive power and avoiding overfitting to the training data. Finally, we assessed our results using the proposed model on real data from a variety of biological viewpoints, verifying known facts in glycobiology.
- S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: from relations to semistructured data and XML. Morgan Kaufmann, 2000. Google ScholarDigital Library
- K. F. Aoki, N. Ueda, A. Yamaguchi, T. Akutsu, M. Kanehisa, and H. Mamitsuka. Managing and analyzing carbohydrate data. ACM SIGMOD Record, 33(2), 2004. Google ScholarDigital Library
- E. Baum and T. Petrie. Statistical inference for probabilistic functions of infinite state Markov chains. Ann. Math. Stat., 37:1554--1563, 1966.Google ScholarCross Ref
- S. Chakrabarti. Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann, 2002.Google Scholar
- A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc., B, 39:1--38, 1977.Google ScholarCross Ref
- M. Diligenti, P. Frasconi, and M. Gori. Hidden tree Markov models for document image classification. IEEE Trans. PAMI, 25(4):519--523, 2003. Google ScholarDigital Library
- D. J. Hand and R. J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn., 45:171--186, 2001. Google ScholarDigital Library
- J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29--36, 1982.Google ScholarCross Ref
- K. Hashimoto, S. Goto, S. Kawano, K. Aoki-Kinoshita, N. Ueda, M. Hamajima, T. Kawasaki, and M. Kanehisa. KEGG as a glycome informatics resource. Glycobiology, 16(5):63R--70R, 2005.Google ScholarCross Ref
- M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa. From genomics to chemical genomics: New developments in KEGG. Nucl. Acids Res., 34:D354--D357, 2006.Google ScholarCross Ref
- H. Kashima and T. Koyanagi. Kernels for semi-structured data. In Proc. 19th ICML, pages 291--298, 2002. Google ScholarDigital Library
- K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35--56, 1990.Google ScholarCross Ref
- S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J. Royal Stat. Soc. B, 50(2):157--224, 1988.Google Scholar
- C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999. Google ScholarDigital Library
- G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, 1996.Google Scholar
- J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. Google ScholarDigital Library
- L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4--16, 1986.Google ScholarCross Ref
- N. Ueda, K. F. Aoki, and H. Mamitsuka. A general probabilistic framework for mining labeled ordered trees. In Proc. Fourth SDM, pages 357--368, 2004.Google ScholarCross Ref
- N. Ueda, K. F. Aoki-Kinoshita, A. Yamaguchi, T. Akutsu, and H. Mamitsuka. A probabilistic model for mining labeled ordered trees: Capturing patterns in carbohydrate sugar chains. IEEE Trans. Know. Data Eng., 17(8):1051--1064, 2005. Google ScholarDigital Library
- A. Varki, R. Cummings, J. Esko, H. Freeze, G. Hart, and J. Marth, editors. Essentials of Glycobiology. Cold Spring Harbor Laboratory Press, New York, 1999.Google Scholar
- S. M. Weiss, N. Indurkhya, T. Zhang, and F. J. Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer, 2005. Google ScholarDigital Library
- M. Zaki and C. Aggarwal. Xrules: An effective structural classifier for XML data. In Proc. Ninth ACM KDD, pages 316--325, 2003. Google ScholarDigital Library
- M. J. Zaki. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Trans. Know. Data Eng., 17(8):1021--1035, 2005. Google ScholarDigital Library
Index Terms
- A new efficient probabilistic model for mining labeled ordered trees
Recommendations
A new efficient probabilistic model for mining labeled ordered trees applied to glycobiology
Mining frequent patterns from large datasets is an important issue in data mining. Recently, complex and unstructured (or semi-structured) datasets have appeared as targets for major data mining applications, including text mining, web mining and ...
Canonical forms for labelled trees and their applications in frequent subtree mining
Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. In this paper, we first present two canonical forms for labelled rooted unordered trees---the breadth-first ...
Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees
Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. One important problem in mining databases of trees is to find frequently occurring subtrees. Because of the ...
Comments