Scaling to very very large corpora for natural language disambiguation

Authors:
Michele Banko, Microsoft Research, Redmond, WA
Eric Brill, Microsoft Research, Redmond, WA

ACL '01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, July 2001, Pages 26–33
https://doi.org/10.3115/1073012.1073017

Published: 6 July 2001

ABSTRACT

The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
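
To illustrate why correctly labeled training data is free for confusion set disambiguation, here is a minimal sketch, assuming well-edited raw text and a fixed context window. Every occurrence of a confusion set member in the text is already labeled by the word the writer chose, so examples can be harvested without annotation. The confusion sets, the `labeled_examples` helper, and the window size are illustrative assumptions, not the feature set or learners evaluated in the paper.

```python
import re
from typing import Iterator, List, Tuple

# Hypothetical confusion sets chosen for illustration; not taken from the paper.
CONFUSION_SETS = [{"to", "too", "two"}, {"then", "than"}]


def labeled_examples(text: str, window: int = 2) -> Iterator[Tuple[List[str], str]]:
    """Yield (context, label) pairs from raw text assumed to be well edited."""
    tokens = re.findall(r"[a-z']+", text.lower())
    members = set().union(*CONFUSION_SETS)
    for i, tok in enumerate(tokens):
        if tok in members:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            # The word the writer actually used serves as the label;
            # "_" marks the position a classifier would have to fill in.
            yield left + ["_"] + right, tok


examples = list(labeled_examples("She ate two apples, then went to bed too."))
for context, label in examples:
    print(label, context)
```

Each yielded pair could be passed to any supervised learner. Because no human labeling is involved, the number of examples grows with the size of the raw corpus, which is the property that makes training on very large corpora practical for this task.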



Published in
ACL '01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
July 2001, 562 pages
General Chair: Bonnie Lynn Webber
Publisher: Association for Computational Linguistics, United States
Overall Acceptance Rate: 85 of 443 submissions (19%)