ABSTRACT
In a vertical representation of a market-basket database, each item is associated with a column of values representing the transactions in which it is present. The association-rule mining algorithms that have been recently proposed for this representation show performance improvements over their classical horizontal counterparts, but are either efficient only for certain database sizes, or assume particular characteristics of the database contents, or are applicable only to specific kinds of database schemas. We present here a new vertical mining algorithm called VIPER, which is general-purpose, making no special requirements of the underlying database. VIPER stores data in compressed bit-vectors called “snakes” and integrates a number of novel optimizations for efficient snake generation, intersection, counting and storage. We analyze the performance of VIPER for a range of synthetic database workloads. Our experimental results indicate significant performance gains, especially for large databases, over previously proposed vertical and horizontal mining algorithms. In fact, there are even workload regions where VIPER outperforms an optimal, but practically infeasible, horizontal mining algorithm.
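To make the vertical representation concrete, the following is a minimal sketch rather than VIPER itself. It assumes uncompressed bit-vectors stored as plain Python integers, so it does not model the compressed "snakes" or any of VIPER's optimizations. The function names (`to_vertical`, `support`, `frequent_pairs`) and the sample baskets are hypothetical, chosen only for illustration. It shows how an itemset's support count can be obtained by intersecting item columns and counting the set bits.

```python
from itertools import combinations


def to_vertical(transactions):
    """Build one bit-vector per item: bit t is set if transaction t contains the item.

    Assumption: a plain Python int stands in for the bit-vector (no compression).
    """
    columns = {}
    for tid, basket in enumerate(transactions):
        for item in basket:
            columns[item] = columns.get(item, 0) | (1 << tid)
    return columns


def support(columns, itemset):
    """Count the transactions that contain every item in a non-empty itemset."""
    items = list(itemset)
    if not items:
        raise ValueError("itemset must contain at least one item")
    bits = columns.get(items[0], 0)
    for item in items[1:]:
        bits &= columns.get(item, 0)  # intersect the item columns
    return bin(bits).count("1")  # count the surviving transactions


def frequent_pairs(columns, min_support):
    """Return every item pair whose support count is at least min_support."""
    result = {}
    for a, b in combinations(sorted(columns), 2):
        count = support(columns, (a, b))
        if count >= min_support:
            result[(a, b)] = count
    return result


# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "butter"},
]
columns = to_vertical(transactions)
pairs = frequent_pairs(columns, min_support=2)
```

In this toy database, `pairs` holds the two item pairs that occur together in at least two transactions. A horizontal algorithm would instead rescan the transactions to count each candidate itemset, which is the contrast the abstract draws.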