ABSTRACT
The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or fewer. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
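To illustrate why labeled data is free for confusion set disambiguation, here is a minimal sketch, not the authors' implementation. It assumes the source text is well edited, so each occurrence of a confusion-set word can be treated as its own correct label. The confusion set `{"then", "than"}`, the function name `labeled_examples`, and the two-word context window are all assumptions made for illustration; the abstract specifies none of them.

```python
import re

# Hypothetical confusion set chosen for illustration; the abstract names none.
CONFUSION_SET = {"then", "than"}


def labeled_examples(text, window=2):
    """Yield (context_words, label) pairs from presumed-correct text.

    Every occurrence of a confusion-set word is treated as a correctly
    labeled example: the word itself is the label, and the surrounding
    words (up to `window` on each side) are the features.
    """
    # Crude lowercase tokenizer; a real system would use something better.
    tokens = re.findall(r"[a-z']+", text.lower())
    for i, tok in enumerate(tokens):
        if tok in CONFUSION_SET:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            yield (left + right, tok)


examples = list(labeled_examples("She is taller than him. Then we left."))
for context, label in examples:
    print(label, context)
```

Because no human annotation is involved, the number of training examples under this assumption is limited only by the amount of raw text available, which is what makes the task a convenient testbed for very large corpora.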
- Scaling to very very large corpora for natural language disambiguation