research-article

MDL4BMF: Minimum Description Length for Boolean Matrix Factorization

Authors:
Pauli Miettinen, Max-Planck Institute for Informatics, Saarbrücken, Germany
Jilles Vreeken, Max-Planck Institute for Informatics, Saarland University, University of Antwerp, Antwerp, Belgium

ACM Transactions on Knowledge Discovery from Data, Volume 8, Issue 4, Article No. 18, pp. 1–31. https://doi.org/10.1145/2601437

Published: 07 October 2014

Abstract

Matrix factorizations, in which a given data matrix is approximated by a product of two or more factor matrices, are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the “model order selection problem” of determining the proper rank of the factorization, that is, deciding where fine-grained structure stops and where noise starts.

Boolean Matrix Factorization (BMF)—where data, factors, and matrix product are Boolean—has in recent years received increased attention from the data mining community. The technique has desirable properties, such as high interpretability and natural sparsity. Yet, so far no method for selecting the correct model order for BMF has been available. In this article, we propose the use of the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits; for example, it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate.
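To make the Boolean setting concrete, the following is a minimal sketch (not taken from the article) of the Boolean matrix product, in which 1 + 1 = 1, and of the reconstruction error counted as the number of mismatching cells. The function names and the toy matrices are illustrative assumptions.

```python
def boolean_product(B, C):
    """Boolean product of an n-by-k matrix B and a k-by-m matrix C.

    Matrices are lists of rows holding 0/1 values. Entry (i, j) of the
    result is 1 if some factor l has B[i][l] = 1 and C[l][j] = 1.
    """
    k, m = len(C), len(C[0])
    return [
        [int(any(row[l] and C[l][j] for l in range(k))) for j in range(m)]
        for row in B
    ]


def error_count(A, B, C):
    """Number of cells where A differs from the Boolean product of B and C."""
    P = boolean_product(B, C)
    return sum(a != p for row_a, row_p in zip(A, P) for a, p in zip(row_a, row_p))


# Hypothetical toy factors of rank 2.
B = [[1, 0],
     [1, 1],
     [0, 1]]
C = [[1, 1, 0],
     [0, 1, 1]]

product = boolean_product(B, C)

# A copy of the product with one flipped cell, standing in for noisy data.
A = [row[:] for row in product]
A[0][2] = 1
```

In row 1 both factors cover the middle column, yet the entry stays 1 rather than becoming 2; this is what separates the Boolean product from the ordinary one.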

We formulate the description length function for BMF in general, making it applicable to any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model-based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
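The idea of choosing the rank by total description length can be sketched as follows. This is a simplified illustration under stated assumptions, not the encoding developed in the article: every binary vector (each factor column, each factor row, and the flattened error matrix) is charged the bits needed to state its number of ones plus its index among all vectors with that many ones. All function names, the toy data, and the hand-made candidate factorizations are hypothetical.

```python
from math import comb, log2


def bits_for_binary(n_cells, n_ones):
    """Bits to identify one binary vector: its count of ones, then its
    index among all vectors of that length with that count."""
    return log2(n_cells + 1) + log2(comb(n_cells, n_ones))


def boolean_product(B, C, n, m):
    """Boolean product of n-by-k B and k-by-m C; all zeros when k = 0."""
    k = len(C)
    return [[int(any(B[i][l] and C[l][j] for l in range(k))) for j in range(m)]
            for i in range(n)]


def description_length(A, B, C):
    """Two-part cost: bits for the factors plus bits for the error matrix."""
    n, m, k = len(A), len(A[0]), len(C)
    P = boolean_product(B, C, n, m)
    errors = sum(a != p for ra, rp in zip(A, P) for a, p in zip(ra, rp))
    model_bits = sum(
        bits_for_binary(n, sum(B[i][l] for i in range(n)))  # column l of B
        + bits_for_binary(m, sum(C[l]))                     # row l of C
        for l in range(k)
    )
    return model_bits + bits_for_binary(n * m, errors)


def select_order(A, candidates):
    """Return the rank k whose factorization gives the shortest description."""
    return min(candidates, key=lambda k: description_length(A, *candidates[k]))


# Toy data: two 6-by-6 blocks of ones on the diagonal plus one noisy cell.
n = 12
A = [[int((i < 6) == (j < 6)) for j in range(n)] for i in range(n)]
A[0][11] = 1

# Hand-made factors: two describe the blocks, the third only the noisy cell.
cols = [[int(i < 6) for i in range(n)],
        [int(i >= 6) for i in range(n)],
        [int(i == 0) for i in range(n)]]
rows = [[int(j < 6) for j in range(n)],
        [int(j >= 6) for j in range(n)],
        [int(j == 11) for j in range(n)]]
candidates = {
    k: ([[cols[l][i] for l in range(k)] for i in range(n)], rows[:k])
    for k in range(4)
}

best_k = select_order(A, candidates)
```

On this toy input the sketch selects rank 2: the third factor reproduces the data exactly, but describing it costs more bits than leaving the single noisy cell in the error matrix.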

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published in
ACM Transactions on Knowledge Discovery from Data Volume 8, Issue 4
October 2014
219 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/2663597
Editor: Philip S. Yu, University of Illinois at Chicago, USA
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 7 October 2014
- Accepted: 1 December 2013
- Revised: 1 May 2013
- Received: 1 June 2012
Author Tags
Boolean matrix factorization
Boolean rank
MDL
minimum description length principle
model order selection
model selection
parameter free
pattern sets
summarization
Qualifiers
- research-article
- Research
- Refereed