ABSTRACT
The growth of bioinformatics has resulted in datasets with new characteristics. These datasets typically contain a large number of columns and a small number of rows. For example, many gene expression datasets may contain 10,000-100,000 columns but only 100-1000 rows.Such datasets pose a great challenge for existing (closed) frequent pattern discovery algorithms, since they have an exponential dependence on the average row length. In this paper, we describe a new algorithm called CARPENTER that is specially designed to handle datasets having a large number of attributes and relatively small number of rows. Several experiments on real bioinformatics datasets show that CARPENTER is orders of magnitude better than previous closed pattern mining algorithms like CLOSET and CHARM.
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), pages 487--499, Santiago, Chile, Sept. 1994.]] Google ScholarDigital Library
- Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent closed itemsets with counting inference. In SIGKDD Explorations, 2(2), Dec. 2000.]] Google ScholarDigital Library
- K. Beyer and R. Ramakrishnan. Botton-up computation of sparse and iceberg cubes. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 359--370, Philadelphia, PA, June 1999.]] Google ScholarDigital Library
- J. Han, J. Pei, and Y. Yin. Mining partial periodicity using frequent pattern trees. In Computing Science Techniqcal Report TR-99-10, Simon Fraser University, July 1999.]]Google Scholar
- H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In Proc. AAAI'94 Workshop Knowledge Discovery in Databases (KDD'94), pages 181--192, Seattle, WA, July 1994.]]Google Scholar
- N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. 7th Int. Conf. Database Theory (ICDT'99), pages 398--416, Jerusalem, Israel, Jan. 1999.]] Google ScholarDigital Library
- J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proc. 2000 ACM-SIGMOD Int. Workshop Data Mining and Knowledge Discovery (DMKD'00), pages 11--20, Dallas, TX, May 2000.]]Google Scholar
- P. Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging vertical mining of large databases. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), pages 22--23, Dallas, TX, May 2000.]] Google ScholarDigital Library
- J. Wang, J. Han, and J. Pei. Closet+: Searching for the best strategies for mining frequent closed itemsets. In Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), Washington, D. C., Aug 2003.]] Google ScholarDigital Library
- M. Zaki and C. Hsiao. Charm: An efficient algorithm for closed association rule mining. In Proc. of SDM 2002, 2002.]]Google Scholar
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In Proc. 1997 Int. Conf. Knowledge Discovery and Data Mining (KDD'97), pages 283--286, Newport Beach, CA, Aug. 1997.]]Google Scholar
Index Terms
- Carpenter: finding closed patterns in long biological datasets
Recommendations
TSP: Mining top-k closed sequential patterns
Sequential pattern mining has been studied extensively in the data mining community. Most previous studies require the specification of a min_support threshold for mining a complete set of sequential patterns satisfying the threshold. However, in ...
TSP: Mining top-k closed sequential patterns
Sequential pattern mining has been studied extensively in the data mining community. Most previous studies require the specification of a min_support threshold for mining a complete set of sequential patterns satisfying the threshold. However, in ...
Mining frequent closed patterns in pointset databases
In this paper, we proposed an efficient algorithm, called PCP-Miner (Pointset Closed Pattern Miner), for mining frequent closed patterns from a pointset database, where a pointset contains a set of points. Our proposed algorithm consists of two phases. ...
Comments