ABSTRACT
Mining frequent trees is very useful in domains like bioinformatics, web mining, mining semistructured data, and so on. We formulate the problem of mining (embedded) subtrees in a forest of rooted, labeled, and ordered trees. We present TREEMINER, a novel algorithm to discover all frequent subtrees in a forest, using a new data structure called scope-list. We contrast TREEMINER with a pattern matching tree mining algorithm (PATTERNMATCHER). We conduct detailed experiments to test the performance and scalability of these methods. We find that TREEMINER outperforms the pattern matching approach by a factor of 4 to 20, and has good scaleup properties. We also present an application of tree mining to analyze real web logs for usage patterns.
- S. Abiteboul, H. Kaplan, and T. Milo. Compact labeling schemes for ancestor queries. In ACM Symp. on Discrete Algorithms, January 2001.]] Google ScholarDigital Library
- S. Abiteboul and V. Vianu. Regular path expressions with constraints. In ACM Int'l Conf. on Principles of Database Systems, May 1997.]] Google ScholarDigital Library
- R. Agrawal et al. Fast discovery of association rules. In U. Fayyad and et al, editors, Advances in Knowledge Discovery and Data Mining, pages 307--328. AAAI Press, Menlo Park, CA, 1996.]] Google ScholarDigital Library
- R. Agrawal and R. Srikant. Mining sequential patterns. In 11th Intl. Conf. on Data Engg., 1995.]] Google ScholarDigital Library
- T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Satamoto, and S. Arikawa. Efficient substructure discovery from large semi-structured data. In 2nd SIAM Int'l Conference on Data Mining, April 2002.]] Google ScholarDigital Library
- M.S. Chen, J.S. Park, and P.S. Yu. Data mining for path traversal patterns in a web environment. In International Conference on Distributed Computing Systems, 1996.]]Google Scholar
- Z. Chen et al. Counting twig matches in a tree. In 17th Intl. Conf. on Data Engineering, 2001.]] Google ScholarDigital Library
- R. Cole, R. Hariharan, and P. Indyk. Tree pattern matching and subset matching in deterministic o(n log3n)-time. In l0th Symposium on Discrete Algorithms, 1999.]] Google ScholarDigital Library
- D. Cook and L. Holder. Substructure discovery using minimal description length and background knowledge. Journal of Artificial Intelligence Research, 1:231--255, 1994.]]Google ScholarCross Ref
- L. Dehaspe, H. Toivonen, and R. King. Finding frequent substructures in chemical compounds. In 4th Intl. Conf. Knowledge Discovery and Data Mining, August 1998.]]Google Scholar
- M. Fernandez and D. Suciu. Optimizing regular path expressions using graph schemas. In IEEE Int'l Conf. on Data Engineering, February 1998.]] Google ScholarDigital Library
- A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In 4th European Conference on Principles of Knowledge Discovery and Data Mining, September 2000.]] Google ScholarDigital Library
- P. Kilpelainen and H. Mannila. Ordered and unordered tree inclusion. SIAM J. of Computing, 24(2):340--356, 1995.]] Google ScholarDigital Library
- M. Kuramochi and G. Karypis. Frequent subgraph discovery. In 1st IEEE Int'l Conf. on Data Mining, November 2001.]] Google ScholarDigital Library
- Q. Li and B. Moon. Indexing and querying XML data for regular path expressions. In 27th Int'l Conf. on Very Large Data Bases, 2001.]] Google ScholarDigital Library
- J. Punin, M. Krishnamoorthy, and M. Zaki. LOGML: Log markup language for web usage mining. In ACM SIGKDD Workshop on Mining Log Data Across All Customer TouchPoints, August 2001.]] Google ScholarDigital Library
- R. Cooley, B. Mobasher, and J. Srivastava. Web Mining: Information and Pattern Discovery on the World Wide Web. In 8th IEEE Intl. Conf. on Tools with AI, 1997.]] Google ScholarDigital Library
- R. Shamir and D. Tsur. Faster subtree isomorphism. Journal of Algorithms, 33:267--280, 1999.]] Google ScholarDigital Library
- B. Shapiro and K. Zhang. Comparing multiple rna secondary strutures using tree comparisons. Computer Applications in Biosciences, 6(4):309--318, 1990.]]Google Scholar
- K. Wang and H. Liu. Discovering typical structures of documents: A road map approach. In ACM SIGIR Conference on Information Retrieval, 1998.]] Google ScholarDigital Library
- K. Yoshida and H. Motoda. CLIP: Concept learning from inference patterns. Artificial Intelligence, 75(1):63--92, 1995.]] Google ScholarDigital Library
- M. J. Zaki. Efficiently mining trees in a forest. Tech. Report 01--7, CS Dept., RPI, July 2001.]]Google Scholar
- C. Zhang et al. On supporting containment queries in relational database managment systems. In ACM Int'l Conf. on Management of Data, May 2001.]] Google ScholarDigital Library
Index Terms
- Efficiently mining frequent trees in a forest
Recommendations
Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications
Mining frequent trees is very useful in domains like bioinformatics, Web mining, mining semistructured data, etc. We formulate the problem of mining (embedded) subtrees in a forest of rooted, labeled, and ordered trees. We present TreeMiner, a novel ...
Efficiently Mining Frequent Embedded Unordered Trees
Advances in Mining Graphs, Trees and SequencesMining frequent trees is very useful in domains like bioinformatics, web mining, mining semi-structured data, and so on. In this paper we introduce SLEUTH, an efficient algorithm for mining frequent, unordered, embedded subtrees in a database of labeled ...
Efficiently Mining Frequent Embedded Unordered Trees
Advances in Mining Graphs, Trees and SequencesMining frequent trees is very useful in domains like bioinformatics, web mining, mining semi-structured data, and so on. In this paper we introduce SLEUTH, an efficient algorithm for mining frequent, unordered, embedded subtrees in a database of labeled ...
Comments