Article

A new efficient probabilistic model for mining labeled ordered trees

Authors:
Kosuke Hashimoto

Kyoto University, Gokasho Uji, Japan

Kyoto University, Gokasho Uji, Japan
View Profile

,
Kiyoko F. Aoki-Kinoshita

Kyoto University, Gokasho Uji, Japan

Kyoto University, Gokasho Uji, Japan
View Profile

,
Nobuhisa Ueda

Kyoto University, Gokasho Uji, Japan

Kyoto University, Gokasho Uji, Japan
View Profile

,
Minoru Kanehisa

Kyoto University, Gokasho Uji, Japan

Kyoto University, Gokasho Uji, Japan
View Profile

,
Hiroshi Mamitsuka

Kyoto University, Gokasho Uji, Japan

Kyoto University, Gokasho Uji, Japan
View Profile

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data miningAugust 2006Pages 177–186https://doi.org/10.1145/1150402.1150425

Published:20 August 2006Publication History

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 177–186

ABSTRACT

Mining frequent patterns is a general and important issue in data mining. Complex and unstructured (or semi-structured) datasets have appeared in major data mining applications, including text mining, web mining and bioinformatics. Mining patterns from these datasets is the focus of many of the current data mining approaches. We focus on labeled ordered trees, typical datasets of semi-structured data in data mining, and propose a new probabilistic model and its efficient learning scheme for mining labeled ordered trees. The proposed approach significantly improves the time and space complexity of an existing probabilistic modeling for labeled ordered trees, while maintaining its expressive power. We evaluated the performance of the proposed model, comparing it with that of the existing model, using synthetic as well as real datasets from the field of glycobiology. Experimental results showed that the proposed model drastically reduced the computation time of the competing model, keeping the predictive power and avoiding overfitting to the training data. Finally, we assessed our results using the proposed model on real data from a variety of biological viewpoints, verifying known facts in glycobiology.

References

S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: from relations to semistructured data and XML. Morgan Kaufmann, 2000. Google ScholarDigital Library
K. F. Aoki, N. Ueda, A. Yamaguchi, T. Akutsu, M. Kanehisa, and H. Mamitsuka. Managing and analyzing carbohydrate data. ACM SIGMOD Record, 33(2), 2004. Google ScholarDigital Library
E. Baum and T. Petrie. Statistical inference for probabilistic functions of infinite state Markov chains. Ann. Math. Stat., 37:1554--1563, 1966.Google ScholarCross Ref
S. Chakrabarti. Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann, 2002.Google Scholar
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc., B, 39:1--38, 1977.Google ScholarCross Ref
M. Diligenti, P. Frasconi, and M. Gori. Hidden tree Markov models for document image classification. IEEE Trans. PAMI, 25(4):519--523, 2003. Google ScholarDigital Library
D. J. Hand and R. J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn., 45:171--186, 2001. Google ScholarDigital Library
J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29--36, 1982.Google ScholarCross Ref
K. Hashimoto, S. Goto, S. Kawano, K. Aoki-Kinoshita, N. Ueda, M. Hamajima, T. Kawasaki, and M. Kanehisa. KEGG as a glycome informatics resource. Glycobiology, 16(5):63R--70R, 2005.Google ScholarCross Ref
M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa. From genomics to chemical genomics: New developments in KEGG. Nucl. Acids Res., 34:D354--D357, 2006.Google ScholarCross Ref
H. Kashima and T. Koyanagi. Kernels for semi-structured data. In Proc. 19th ICML, pages 291--298, 2002. Google ScholarDigital Library
K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35--56, 1990.Google ScholarCross Ref
S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J. Royal Stat. Soc. B, 50(2):157--224, 1988.Google Scholar
C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999. Google ScholarDigital Library
G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, 1996.Google Scholar
J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988. Google ScholarDigital Library
L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4--16, 1986.Google ScholarCross Ref
N. Ueda, K. F. Aoki, and H. Mamitsuka. A general probabilistic framework for mining labeled ordered trees. In Proc. Fourth SDM, pages 357--368, 2004.Google ScholarCross Ref
N. Ueda, K. F. Aoki-Kinoshita, A. Yamaguchi, T. Akutsu, and H. Mamitsuka. A probabilistic model for mining labeled ordered trees: Capturing patterns in carbohydrate sugar chains. IEEE Trans. Know. Data Eng., 17(8):1051--1064, 2005. Google ScholarDigital Library
A. Varki, R. Cummings, J. Esko, H. Freeze, G. Hart, and J. Marth, editors. Essentials of Glycobiology. Cold Spring Harbor Laboratory Press, New York, 1999.Google Scholar
S. M. Weiss, N. Indurkhya, T. Zhang, and F. J. Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer, 2005. Google ScholarDigital Library
M. Zaki and C. Aggarwal. Xrules: An effective structural classifier for XML data. In Proc. Ninth ACM KDD, pages 316--325, 2003. Google ScholarDigital Library
M. J. Zaki. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Trans. Know. Data Eng., 17(8):1021--1035, 2005. Google ScholarDigital Library

Index Terms

A new efficient probabilistic model for mining labeled ordered trees
1. Computing methodologies
  1. Machine learning
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Markov decision processes

Recommendations

A new efficient probabilistic model for mining labeled ordered trees applied to glycobiology

Mining frequent patterns from large datasets is an important issue in data mining. Recently, complex and unstructured (or semi-structured) datasets have appeared as targets for major data mining applications, including text mining, web mining and ...
Read More
Canonical forms for labelled trees and their applications in frequent subtree mining

Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. In this paper, we first present two canonical forms for labelled rooted unordered trees---the breadth-first ...
Read More
Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees

Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. One important problem in mining databases of trees is to find frequently occurring subtrees. Because of the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2006
986 pages
ISBN:1595933395
DOI:10.1145/1150402
Conference Chair:
Tina Eliassi-Rad
LLNL
,
General Chair:
Lyle Ungar
University of Pennsylvania
,
Program Chairs:
Mark Craven
University of Wisconsin
,
Dimitrios Gunopulos
University of California, Riverside
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 August 2006
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
expectation - maximization
labeled ordered trees
maximum likelihood
probabilistic models
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,133of8,635submissions,13%
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 9
  Total Citations
  View Citations
- 771
  Total Downloads
- Downloads (Last 12 months)0
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A new efficient probabilistic model for mining labeled ordered trees

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

A new efficient probabilistic model for mining labeled ordered trees applied to glycobiology

Canonical forms for labelled trees and their applications in frequent subtree mining

Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

A new efficient probabilistic model for mining labeled ordered trees

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

A new efficient probabilistic model for mining labeled ordered trees applied to glycobiology

Canonical forms for labelled trees and their applications in frequent subtree mining

Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media