skip to main content
10.1145/1150402.1150425acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

A new efficient probabilistic model for mining labeled ordered trees

Published:20 August 2006Publication History

ABSTRACT

Mining frequent patterns is a general and important issue in data mining. Complex and unstructured (or semi-structured) datasets have appeared in major data mining applications, including text mining, web mining and bioinformatics. Mining patterns from these datasets is the focus of many of the current data mining approaches. We focus on labeled ordered trees, typical datasets of semi-structured data in data mining, and propose a new probabilistic model and its efficient learning scheme for mining labeled ordered trees. The proposed approach significantly improves the time and space complexity of an existing probabilistic modeling for labeled ordered trees, while maintaining its expressive power. We evaluated the performance of the proposed model, comparing it with that of the existing model, using synthetic as well as real datasets from the field of glycobiology. Experimental results showed that the proposed model drastically reduced the computation time of the competing model, keeping the predictive power and avoiding overfitting to the training data. Finally, we assessed our results using the proposed model on real data from a variety of biological viewpoints, verifying known facts in glycobiology.

References

  1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: from relations to semistructured data and XML. Morgan Kaufmann, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. K. F. Aoki, N. Ueda, A. Yamaguchi, T. Akutsu, M. Kanehisa, and H. Mamitsuka. Managing and analyzing carbohydrate data. ACM SIGMOD Record, 33(2), 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. E. Baum and T. Petrie. Statistical inference for probabilistic functions of infinite state Markov chains. Ann. Math. Stat., 37:1554--1563, 1966.Google ScholarGoogle ScholarCross RefCross Ref
  4. S. Chakrabarti. Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann, 2002.Google ScholarGoogle Scholar
  5. A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc., B, 39:1--38, 1977.Google ScholarGoogle ScholarCross RefCross Ref
  6. M. Diligenti, P. Frasconi, and M. Gori. Hidden tree Markov models for document image classification. IEEE Trans. PAMI, 25(4):519--523, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. J. Hand and R. J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn., 45:171--186, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29--36, 1982.Google ScholarGoogle ScholarCross RefCross Ref
  9. K. Hashimoto, S. Goto, S. Kawano, K. Aoki-Kinoshita, N. Ueda, M. Hamajima, T. Kawasaki, and M. Kanehisa. KEGG as a glycome informatics resource. Glycobiology, 16(5):63R--70R, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  10. M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa. From genomics to chemical genomics: New developments in KEGG. Nucl. Acids Res., 34:D354--D357, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  11. H. Kashima and T. Koyanagi. Kernels for semi-structured data. In Proc. 19th ICML, pages 291--298, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35--56, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  13. S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J. Royal Stat. Soc. B, 50(2):157--224, 1988.Google ScholarGoogle Scholar
  14. C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, 1996.Google ScholarGoogle Scholar
  16. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4--16, 1986.Google ScholarGoogle ScholarCross RefCross Ref
  18. N. Ueda, K. F. Aoki, and H. Mamitsuka. A general probabilistic framework for mining labeled ordered trees. In Proc. Fourth SDM, pages 357--368, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  19. N. Ueda, K. F. Aoki-Kinoshita, A. Yamaguchi, T. Akutsu, and H. Mamitsuka. A probabilistic model for mining labeled ordered trees: Capturing patterns in carbohydrate sugar chains. IEEE Trans. Know. Data Eng., 17(8):1051--1064, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Varki, R. Cummings, J. Esko, H. Freeze, G. Hart, and J. Marth, editors. Essentials of Glycobiology. Cold Spring Harbor Laboratory Press, New York, 1999.Google ScholarGoogle Scholar
  21. S. M. Weiss, N. Indurkhya, T. Zhang, and F. J. Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. M. Zaki and C. Aggarwal. Xrules: An effective structural classifier for XML data. In Proc. Ninth ACM KDD, pages 316--325, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. J. Zaki. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Trans. Know. Data Eng., 17(8):1021--1035, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. A new efficient probabilistic model for mining labeled ordered trees

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
            August 2006
            986 pages
            ISBN:1595933395
            DOI:10.1145/1150402

            Copyright © 2006 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 20 August 2006

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            Overall Acceptance Rate1,133of8,635submissions,13%

            Upcoming Conference

            KDD '24

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader