ABSTRACT
This paper focuses on a new clustering task, called self-taught clustering. Self-taught clustering is an instance of unsupervised transfer learning: it aims to cluster a small collection of unlabeled target data with the help of a large amount of unlabeled auxiliary data. The target and auxiliary data may differ in their topic distributions. We show that even when the target data are insufficient for learning a high-quality feature representation on their own, useful features can be learned with the help of the auxiliary data, and the target data can then be clustered effectively under those features. We propose a co-clustering-based self-taught clustering algorithm that tackles this problem by clustering the target and auxiliary data simultaneously, allowing the feature representation learned from the auxiliary data to influence the target data through a common set of features. Under the new data representation, clustering of the target data improves. Our experiments on image clustering show that our algorithm can greatly outperform several state-of-the-art clustering methods when utilizing irrelevant unlabeled auxiliary data.
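The core idea described above can be sketched in code. The sketch below is a deliberate simplification, not the paper's information-theoretic co-clustering algorithm: it stands in for the joint step with a single feature clustering learned from the pooled target and auxiliary data (using k-means as an assumed stand-in), re-represents the target instances in the reduced feature-cluster space, and clusters them there. All variable names and the toy data are illustrative assumptions.

```python
# Sketch: auxiliary data supply the feature statistics that the small
# target set lacks; a shared feature clustering bridges the two.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data: few target instances, many auxiliary instances, 50 raw features.
X_target = rng.random((20, 50))
X_aux = rng.random((500, 50))

# 1) Cluster the feature columns using ALL data (target + auxiliary),
#    so the abundant auxiliary data shape the feature groups.
X_all = np.vstack([X_target, X_aux])
n_feature_clusters = 10
feat_labels = KMeans(n_clusters=n_feature_clusters, n_init=10,
                     random_state=0).fit_predict(X_all.T)

# 2) Re-represent each target instance by aggregating its raw features
#    within each shared feature cluster.
Z_target = np.zeros((X_target.shape[0], n_feature_clusters))
for j, c in enumerate(feat_labels):
    Z_target[:, c] += X_target[:, j]

# 3) Cluster the target data under the new, lower-dimensional representation.
target_clusters = KMeans(n_clusters=3, n_init=10,
                         random_state=0).fit_predict(Z_target)
print(Z_target.shape)        # (20, 10)
print(len(target_clusters))  # 20
```

In the paper's actual algorithm, the instance and feature clusterings are optimized jointly by minimizing a mutual-information loss over both the target and auxiliary co-occurrence matrices; the one-shot feature clustering here only illustrates the data flow.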