
MDL4BMF: Minimum Description Length for Boolean Matrix Factorization

Published: 07 October 2014

Abstract

Matrix factorizations, in which a given data matrix is approximated by a product of two or more factor matrices, are powerful data mining tools. Among other tasks, they are often used to separate global structure from noise. This, however, requires solving the “model order selection problem” of determining the proper rank of the factorization, that is, deciding where fine-grained structure stops and where noise starts.

Boolean Matrix Factorization (BMF)—where data, factors, and matrix product are Boolean—has in recent years received increased attention from the data mining community. The technique has desirable properties, such as high interpretability and natural sparsity. Yet, so far no method for selecting the correct model order for BMF has been available. In this article, we propose the use of the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits; for example, it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate.
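
For reference, the two notions named above can be written in standard notation. The symbols used here ($A$ for the data, $B$ and $C$ for the factors, $k$ for the rank, $L(\cdot)$ for encoded length in bits) are chosen for illustration and do not come from the abstract.

```latex
% Boolean product of B (n x k) and C (k x m): ordinary product with 1 + 1 = 1
(B \circ C)_{ij} = \bigvee_{l=1}^{k} \left( b_{il} \wedge c_{lj} \right)

% Two-part MDL: pick the model that minimizes the total encoded length
M^{*} = \arg\min_{M \in \mathcal{M}} \; L(M) + L(D \mid M)
```

For BMF, a model $M$ of order $k$ would be a factor pair $(B, C)$, and the data given the model would be the cells where $A$ and $B \circ C$ disagree.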

We formulate the description length function for BMF in general, making it applicable to any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model-based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
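
To make the idea of selecting the model order by description length concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the article's method: the encoding (send the number of ones, then the index of their arrangement) is a simple enumerative code picked for this example and may differ from the encodings the article develops, the candidate factorizations are built by hand rather than produced by a BMF algorithm, and all function names are hypothetical.

```python
import math
import numpy as np


def boolean_product(B, C):
    """Boolean product: entry (i, j) is True iff B[i, l] and C[l, j] are both 1 for some l."""
    return (B.astype(int) @ C.astype(int)) > 0


def matrix_bits(M):
    """Bits to send a binary matrix: its number of ones, then which cells hold them."""
    cells = M.size
    ones = int(np.count_nonzero(M))
    return math.log2(cells + 1) + math.log2(math.comb(cells, ones))


def description_length(A, B, C):
    """Two-part length: the factors (model) plus the error matrix (data given model)."""
    error = np.logical_xor(A.astype(bool), boolean_product(B, C))
    return matrix_bits(B) + matrix_bits(C) + matrix_bits(error)


def select_order(A, candidates):
    """Return (best_k, lengths), where candidates[k] is a rank-k pair (B, C)."""
    lengths = [description_length(A, B, C) for B, C in candidates]
    return int(np.argmin(lengths)), lengths


# Toy data (assumed for illustration): two planted 10x10 blocks plus two stray ones.
n = m = 20
B_true = np.zeros((n, 2), dtype=int)
C_true = np.zeros((2, m), dtype=int)
B_true[:10, 0] = 1
C_true[0, :10] = 1
B_true[10:, 1] = 1
C_true[1, 10:] = 1
A = boolean_product(B_true, C_true).astype(int)
A[0, 15] = 1
A[5, 19] = 1

# Hand-built candidates of rank 0..3; the third factor only fits one noise cell.
B_noise = np.zeros((n, 1), dtype=int)
C_noise = np.zeros((1, m), dtype=int)
B_noise[0, 0] = 1
C_noise[0, 15] = 1
B3 = np.hstack([B_true, B_noise])
C3 = np.vstack([C_true, C_noise])
candidates = [(B3[:, :k], C3[:k, :]) for k in range(4)]

best_k, lengths = select_order(A, candidates)
print(best_k, [round(x, 1) for x in lengths])
```

On this toy input the shortest total length is reached at rank 2: the third factor costs more bits to describe than it saves in the error matrix, which is the trade-off the MDL principle is meant to capture.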



    • Published in

      ACM Transactions on Knowledge Discovery from Data, Volume 8, Issue 4
      October 2014
      219 pages
      ISSN:1556-4681
      EISSN:1556-472X
      DOI:10.1145/2663597

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 7 October 2014
      • Accepted: 1 December 2013
      • Revised: 1 May 2013
      • Received: 1 June 2012


      Qualifiers

      • research-article
      • Research
      • Refereed
