ABSTRACT
The input to an algorithm that learns a binary classifier normally consists of two sets of examples: one set of positive examples of the concept to be learned, and one set of negative examples. However, it is often the case that the available training data are an incomplete set of positive examples and a set of unlabeled examples, some of which are positive and some of which are negative. The problem solved in this paper is how to learn a standard binary classifier given a nontraditional training set of this nature.
Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive. We show how to use this result in two different ways to learn a classifier from a nontraditional training set. We then apply these two new methods to solve a real-world problem: identifying protein records that should be included in an incomplete specialized molecular biology database. Our experiments in this domain show that models trained using the new methods perform better than the current state-of-the-art biased SVM method for learning from positive and unlabeled examples.
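The constant-factor result described above can be illustrated with a short sketch. Under the assumption that labeled examples are selected completely at random from the positives, a classifier g trained to predict whether an example is labeled satisfies g(x) ≈ c · p(y = 1 | x), where c = p(labeled | positive). Dividing by an estimate of c therefore recovers calibrated positive-class probabilities. The synthetic data, the choice of logistic regression, and the particular estimate of c (averaging g over the labeled positives) are illustrative assumptions here, not a verbatim reproduction of the paper's experiments:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 1-D problem: positives ~ N(2, 1), negatives ~ N(-2, 1).
rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, n)  # true labels, hidden from the learner
x = rng.normal(loc=np.where(y == 1, 2.0, -2.0), scale=1.0, size=n).reshape(-1, 1)

# Observed labels: each positive is labeled independently with probability c_true
# (the "selected completely at random" assumption); all other examples are unlabeled.
c_true = 0.3
s = (y == 1) & (rng.random(n) < c_true)

# g(x) estimates p(s = 1 | x); under the assumption above, p(s = 1 | x) = c * p(y = 1 | x).
g = LogisticRegression().fit(x, s)

# Estimate c by averaging g(x) over the labeled positives, where p(y = 1 | x) is near 1.
c_hat = g.predict_proba(x[s])[:, 1].mean()

# Recover calibrated probabilities of being positive by dividing out the constant factor.
p_y_given_x = np.clip(g.predict_proba(x)[:, 1] / c_hat, 0.0, 1.0)
```

On well-separated data like this, c_hat lands close to the true labeling frequency, and thresholding the corrected probabilities at 0.5 recovers the true classes far better than thresholding the raw scores g(x), which are uniformly shrunk by the factor c.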
Index Terms
- Learning classifiers from only positive and unlabeled data