Abstract
Matrix factorizations, where a given data matrix is approximated by a product of two or more factor matrices, are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the “model order selection problem” of determining the proper rank of the factorization, that is, deciding where fine-grained structure stops and noise starts.
Boolean Matrix Factorization (BMF)—where data, factors, and matrix product are Boolean—has in recent years received increased attention from the data mining community. The technique has desirable properties, such as high interpretability and natural sparsity. Yet, so far no method for selecting the correct model order for BMF has been available. In this article, we propose the use of the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits; for example, it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate.
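To make the Boolean matrix product concrete, here is a minimal Python sketch. It is an illustration only: the function name `boolean_product` and the small example matrices are assumptions, not taken from the article. The point to notice is that addition is replaced by logical OR, so 1 + 1 = 1.

```python
import numpy as np

def boolean_product(B, C):
    """Boolean matrix product: entry (i, j) is the OR over l of (B[i, l] AND C[l, j])."""
    B = np.asarray(B, dtype=int)
    C = np.asarray(C, dtype=int)
    # The integer product counts the factors covering each cell;
    # any positive count means the OR is true.
    return (B @ C) > 0

# Hypothetical 3-by-2 and 2-by-3 factors (illustrative values only).
B = np.array([[1, 0],
              [1, 1],
              [0, 1]])
C = np.array([[1, 1, 0],
              [0, 1, 1]])

A = boolean_product(B, C)
print(A.astype(int))
```

The cell in the middle of `A` is covered by both factors, yet it is still 1 rather than 2, which is what distinguishes the Boolean product from the ordinary one.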
We formulate the description length function for BMF in general, making it applicable to any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model-based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
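The idea of selecting the model order by description length can be sketched as follows. This is a deliberately naive illustration under stated assumptions, not the encoding developed in the article: each binary matrix is encoded by its number of ones followed by an index into all placements of that many ones, and the total cost is the sum over both factors and the error matrix. All function names (`matrix_bits`, `description_length`, `select_rank`) and the toy data are hypothetical.

```python
import numpy as np
from math import comb, log2

def boolean_product(B, C):
    """Boolean matrix product (OR of ANDs)."""
    return (np.asarray(B, dtype=int) @ np.asarray(C, dtype=int)) > 0

def matrix_bits(M):
    """Assumed naive code: state the number of ones, then which cells hold them."""
    M = np.asarray(M, dtype=bool)
    cells, ones = M.size, int(M.sum())
    return log2(cells + 1) + log2(comb(cells, ones))

def description_length(A, B, C):
    """Total bits for the model (B, C) plus the error matrix E = A XOR (B o C)."""
    E = np.asarray(A, dtype=bool) ^ boolean_product(B, C)
    return matrix_bits(B) + matrix_bits(C) + matrix_bits(E)

def select_rank(A, candidates):
    """candidates maps rank -> (B, C); return the rank with the fewest total bits."""
    return min(candidates, key=lambda k: description_length(A, *candidates[k]))

# Toy data: a 6-by-6 matrix made of two disjoint 3-by-3 blocks of ones.
A = np.zeros((6, 6), dtype=bool)
A[:3, :3] = True
A[3:, 3:] = True

# Hand-built candidate factorizations of rank 0, 1, and 2.
B2 = np.zeros((6, 2), dtype=int)
B2[:3, 0] = 1
B2[3:, 1] = 1
C2 = B2.T.copy()
candidates = {
    0: (np.zeros((6, 0), dtype=int), np.zeros((0, 6), dtype=int)),
    1: (B2[:, :1], C2[:1, :]),
    2: (B2, C2),
}

best = select_rank(A, candidates)
print("selected rank:", best)
```

In this toy setting the rank-2 candidate reconstructs the data exactly, so its empty error matrix more than pays for the extra factor. In practice the candidate factorizations would come from a BMF algorithm rather than being built by hand.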
Index Terms
- MDL4BMF: Minimum Description Length for Boolean Matrix Factorization