DOI: 10.1145/1014052.1014062

Article

A probabilistic framework for semi-supervised clustering

Published: 22 August 2004

ABSTRACT

Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to the same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. These methods use the constraints either to modify the objective function or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
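To make the idea concrete, the following is a minimal illustrative sketch of pairwise-constrained clustering in the spirit of the objective described above: a k-means-style loop whose assignment step minimizes distortion to the centroids plus penalties for violated must-link and cannot-link constraints (an ICM-style greedy update). This is an assumption-laden toy, not the paper's HMRF-KMeans algorithm; the function name, the fixed penalty weight `w`, and the use of squared Euclidean distortion are all illustrative choices.

```python
import numpy as np

def constrained_kmeans(X, k, must_link, cannot_link, w=1.0, n_iter=20, seed=0):
    """Toy pairwise-constrained k-means: squared-Euclidean distortion to
    centroids plus a penalty w for each violated pairwise constraint.
    Illustrative sketch only, not the paper's HMRF-KMeans algorithm."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Greedy ICM-style assignment: each point picks the cluster that
        # minimizes distortion plus constraint-violation penalties, using
        # the current labels of its constrained neighbors.
        for i in range(len(X)):
            costs = np.sum((centroids - X[i]) ** 2, axis=1)
            for (a, b) in must_link:
                other = b if i == a else a if i == b else None
                if other is not None:  # penalize separating a must-link pair
                    costs += w * (np.arange(k) != labels[other])
            for (a, b) in cannot_link:
                other = b if i == a else a if i == b else None
                if other is not None:  # penalize merging a cannot-link pair
                    costs += w * (np.arange(k) == labels[other])
            labels[i] = int(np.argmin(costs))
        # M-step: recompute centroids of non-empty clusters.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids
```

The paper's framework generalizes this pattern: the distortion term can be any Bregman divergence or a directional similarity measure, and the constraint penalties arise from the posterior energy of the HMRF rather than a fixed weight.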


Published in: KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 2004, 874 pages. ISBN: 1581138881. DOI: 10.1145/1014052. Copyright © 2004 ACM.
Publisher: Association for Computing Machinery, New York, NY, United States.
