DOI: 10.3115/1073012.1073017

Scaling to very very large corpora for natural language disambiguation

Published: 06 July 2001

ABSTRACT

The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
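To make the "labeled training data is free" point concrete, here is a minimal sketch in Python. It assumes a confusion set such as {then, than}: every occurrence of a member word in ordinary edited text can be treated as a correctly labeled example, with the surrounding words as context. The confusion set, the window size, the function names, and the simple context-vote classifier are illustrative assumptions, not the learners or features evaluated in the paper.

```python
import re
from collections import Counter, defaultdict

# Hypothetical confusion set chosen for illustration only.
CONFUSION_SET = {"then", "than"}


def extract_examples(text, confusion_set=CONFUSION_SET, window=2):
    """Harvest (left_context, right_context, label) triples from raw text.

    Each occurrence of a confusion-set word is its own label, so no
    manual annotation is needed.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    examples = []
    for i, tok in enumerate(tokens):
        if tok in confusion_set:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            examples.append((left, right, tok))
    return examples


def train(examples):
    """Count how often each context word co-occurs with each label."""
    counts = defaultdict(Counter)
    for left, right, label in examples:
        for word in left + right:
            counts[word][label] += 1
    return counts


def predict(counts, left, right, confusion_set=CONFUSION_SET):
    """Pick the label with the most context-word votes (toy classifier)."""
    votes = Counter()
    for word in left + right:
        votes.update(counts.get(word, {}))
    if not votes:
        # No evidence at all: fall back to a fixed, arbitrary choice.
        return sorted(confusion_set)[0]
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    corpus = ("She is taller than him. We ate and then we left. "
              "Bigger than before.")
    model = train(extract_examples(corpus))
    print(predict(model, ["much", "taller"], ["me"]))
```

Because the labels come from the text itself, growing the training set only requires feeding more raw text to `extract_examples`. That is what makes this task a convenient test bed for studying very large corpora.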

Published in

ACL '01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
July 2001, 562 pages

Publisher: Association for Computational Linguistics, United States

Overall acceptance rate: 85 of 443 submissions, 19%