DOI: 10.1145/1401890.1401920

Learning classifiers from only positive and unlabeled data

Authors: Charles Elkan, Keith Noto
Published: 24 August 2008

ABSTRACT

The input to an algorithm that learns a binary classifier normally consists of two sets of examples, where one set consists of positive examples of the concept to be learned, and the other set consists of negative examples. However, it is often the case that the available training data are an incomplete set of positive examples, and a set of unlabeled examples, some of which are positive and some of which are negative. The problem solved in this paper is how to learn a standard binary classifier given a nontraditional training set of this nature.
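To make this setting concrete, one convenient formalization (an illustrative sketch, with notation chosen here rather than quoted from the paper) views each training example as a triple (x, y, s): features x, true class y, and a label indicator s that is 1 exactly when the example appears in the labeled positive set.

```latex
% Illustrative notation, not verbatim from the paper:
% y is the true class, s indicates membership in the labeled set.
\[
  y \in \{0,1\}, \qquad s \in \{0,1\}, \qquad
  p(s = 1 \mid x,\, y = 0) = 0 ,
\]
% since only positive examples are ever labeled; the unlabeled set
% (s = 0) is a mixture of positives and negatives.
```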

Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive. We show how to use this result in two different ways to learn a classifier from a nontraditional training set. We then apply these two new methods to solve a real-world problem: identifying protein records that should be included in an incomplete specialized molecular biology database. Our experiments in this domain show that models trained using the new methods perform better than the current state-of-the-art biased SVM method for learning from positive and unlabeled examples.
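Written in the notation above, the "selected randomly from the positive examples" assumption says p(s=1 | x, y=1) = p(s=1 | y=1) = c for some constant c, which gives p(s=1 | x) = c · p(y=1 | x): a classifier g(x) trained to predict s recovers p(y=1 | x) up to the constant factor c. The sketch below illustrates the first of the two ways of using this result that the abstract mentions, dividing g(x) by an estimate of c computed as the average score on held-out labeled positives. It assumes scikit-learn's LogisticRegression as the base probabilistic classifier; the function name, the hold-out split, and the other details are simplifying assumptions, not code from the paper.

```python
# A minimal sketch, assuming scikit-learn's LogisticRegression as the base
# probabilistic classifier. Names and details are illustrative, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_pu_classifier(X, s, holdout_frac=0.2, random_state=0):
    """Estimate p(y=1|x) from positive (s=1) and unlabeled (s=0) examples,
    assuming the labeled positives were selected completely at random.
    X and s are NumPy arrays."""
    rng = np.random.RandomState(random_state)
    idx = rng.permutation(len(s))
    n_hold = int(holdout_frac * len(s))
    hold, train = idx[:n_hold], idx[n_hold:]

    # Step 1: train a "nontraditional" classifier g(x) approximating p(s=1|x),
    # i.e. labeled vs. unlabeled rather than positive vs. negative.
    g = LogisticRegression(max_iter=1000).fit(X[train], s[train])

    # Step 2: estimate the constant c = p(s=1|y=1) as the average value of g(x)
    # over held-out labeled positives.
    pos_hold = hold[s[hold] == 1]
    c = g.predict_proba(X[pos_hold])[:, 1].mean()

    # Step 3: correct by the constant factor: p(y=1|x) = p(s=1|x) / c.
    def predict_proba_y(X_new):
        return np.clip(g.predict_proba(X_new)[:, 1] / c, 0.0, 1.0)

    return predict_proba_y
```

The clip guards against estimates slightly above 1 when c is underestimated on a small hold-out set; the abstract's second method, which works directly with the unlabeled examples rather than rescaling scores, is not shown here.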


Published in

KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
• General Chair: Ying Li
• Program Chairs: Bing Liu, Sunita Sarawagi

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 August 2008


      Qualifiers

      • research-article

      Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions, 20%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
