DOI: 10.1145/1390156.1390182
ICML Conference Proceedings · research-article

Self-taught clustering

Authors: Wenyuan Dai, Qiang Yang, Gui-Rong Xue, Yong Yu

Published: 05 July 2008

ABSTRACT

This paper focuses on a new clustering task called self-taught clustering. Self-taught clustering is an instance of unsupervised transfer learning that aims to cluster a small collection of unlabeled target data with the help of a large amount of unlabeled auxiliary data, where the target and auxiliary data may differ in topic distribution. We show that even when the target data alone are insufficient to learn a high-quality feature representation, useful features can still be learned with the help of the auxiliary data, and the target data can then be clustered effectively under those features. We propose a co-clustering-based self-taught clustering algorithm that clusters the target and auxiliary data simultaneously, allowing the feature representation learned from the auxiliary data to influence the target data through a common set of features. Under this new data representation, clustering of the target data improves. Our experiments on image clustering show that the algorithm can greatly outperform several state-of-the-art clustering methods when utilizing irrelevant unlabeled auxiliary data.
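The core idea in the abstract can be illustrated with a deliberately simplified sketch: learn a shared feature grouping from the pooled target and auxiliary data, re-represent the target instances under that grouping, and then cluster them. This is not the paper's actual information-theoretic co-clustering objective; the function names, the plain k-means subroutine, and the sum-pooling step are all simplifying assumptions made here for illustration.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means used as a stand-in clustering subroutine
    (illustration only, not the paper's objective)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each row to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Recompute centers, skipping empty clusters.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def self_taught_cluster(target, auxiliary, n_clusters, n_feature_clusters):
    """Toy sketch of auxiliary-informed clustering: the plentiful auxiliary
    data shapes a shared feature grouping, under which the sparse target
    data are then clustered."""
    # Step 1: cluster features (columns) of the pooled data, so the
    # auxiliary data influence the shared feature representation.
    pooled = np.vstack([target, auxiliary])
    feat_labels = kmeans(pooled.T, n_feature_clusters)
    # Step 2: re-represent each target instance by pooling its features
    # within each feature cluster.
    reduced = np.stack(
        [target[:, feat_labels == j].sum(axis=1)
         for j in range(n_feature_clusters)],
        axis=1,
    )
    # Step 3: cluster the target data in the reduced representation.
    return kmeans(reduced, n_clusters)
```

Note that the paper iterates the instance and feature clusterings jointly under a mutual-information criterion, whereas this sketch performs a single feature-grouping pass; the sketch only conveys how auxiliary data can shape the representation in which the target data are clustered.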


Index Terms: Self-taught clustering

Published in

ICML '08: Proceedings of the 25th International Conference on Machine Learning
July 2008, 1310 pages
ISBN: 9781605582054
DOI: 10.1145/1390156

              Copyright © 2008 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 140 of 548 submissions (26%)
