ABSTRACT
The input to an algorithm that learns a binary classifier normally consists of two sets of examples: one set of positive examples of the concept to be learned, and one set of negative examples. However, it is often the case that the available training data are an incomplete set of positive examples and a set of unlabeled examples, some of which are positive and some of which are negative. The problem solved in this paper is how to learn a standard binary classifier given a nontraditional training set of this nature.
Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive. We show how to use this result in two different ways to learn a classifier from a nontraditional training set. We then apply these two new methods to solve a real-world problem: identifying protein records that should be included in an incomplete specialized molecular biology database. Our experiments in this domain show that models trained using the new methods perform better than the current state-of-the-art biased SVM method for learning from positive and unlabeled examples.
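The constant-factor result described above can be illustrated with a short sketch. Under the assumption that labeled examples are selected completely at random from the positives, a classifier g trained to predict whether an example is labeled satisfies g(x) ≈ c · p(y = 1 | x), where c = p(labeled | positive). Dividing by an estimate of c therefore recovers calibrated positive-class probabilities. The synthetic data, the choice of logistic regression, and the particular estimate of c (averaging g over the labeled positives) are illustrative assumptions here, not a verbatim reproduction of the paper's experiments:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 1-D problem: positives ~ N(2, 1), negatives ~ N(-2, 1).
rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, n)  # true labels, hidden from the learner
x = rng.normal(loc=np.where(y == 1, 2.0, -2.0), scale=1.0, size=n).reshape(-1, 1)

# Observed labels: each positive is labeled independently with probability c_true
# (the "selected completely at random" assumption); all other examples are unlabeled.
c_true = 0.3
s = (y == 1) & (rng.random(n) < c_true)

# g(x) estimates p(s = 1 | x); under the assumption above, p(s = 1 | x) = c * p(y = 1 | x).
g = LogisticRegression().fit(x, s)

# Estimate c by averaging g(x) over the labeled positives, where p(y = 1 | x) is near 1.
c_hat = g.predict_proba(x[s])[:, 1].mean()

# Recover calibrated probabilities of being positive by dividing out the constant factor.
p_y_given_x = np.clip(g.predict_proba(x)[:, 1] / c_hat, 0.0, 1.0)
```

On well-separated data like this, c_hat lands close to the true labeling frequency, and thresholding the corrected probabilities at 0.5 recovers the true classes far better than thresholding the raw scores g(x), which are uniformly shrunk by the factor c.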
Index Terms
- Learning classifiers from only positive and unlabeled data