A corpus-based approach to language learning
Publisher:
  • University of Pennsylvania
  • Computer and Information Science Dept. 2000 South 33rd St. Philadelphia, PA
  • United States
Order Number: UMI Order No. GAX93-31757
Abstract

One goal of computational linguistics is to discover a method for assigning a rich structural annotation to sentences that are presented as simple linear strings of words; meaning can be extracted much more readily from a structurally annotated sentence than from a sentence with no structural information. Structure also allows a more in-depth check of a sentence's well-formedness. Assigning these structural annotations involves two phases: first, a knowledge base is created; second, an algorithm generates a structural annotation for a sentence based on the facts provided in the knowledge base. Until recently, most knowledge bases were created manually by language experts. Such knowledge bases are expensive to create and have not been used effectively to structurally parse sentences outside highly restricted domains. The goal of this dissertation is to make significant progress toward designing automata that are able to learn some structural aspects of human language with little human guidance. In particular, we describe a learning algorithm that takes a small structurally annotated corpus of text and a larger unannotated corpus as input, and automatically learns how to assign accurate structural descriptions to sentences not in the training corpus. The main tool we use to automatically discover structural information about language from corpora is transformation-based error-driven learning. The distribution of errors produced by an imperfect annotator is examined to learn an ordered list of transformations that can be applied to provide an accurate structural annotation. We demonstrate the application of this learning algorithm to part-of-speech tagging and parsing. Successfully applying this technique to create systems that learn could lead to robust, trainable, and accurate natural language processing systems.
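
The error-driven loop described in the abstract can be illustrated with a toy part-of-speech example. This is a minimal sketch, not the dissertation's implementation: the abstract does not specify the transformation templates or the initial annotator, so the single template used here ("change tag A to tag B when the previous tag is C"), the lexicon-lookup initial annotator, the tiny corpus, and every function name are assumptions made for illustration.

```python
from itertools import product

def initial_tags(words, lexicon, default="NN"):
    """Imperfect initial annotator (assumed): one fixed tag per word."""
    return [lexicon.get(w, default) for w in words]

def apply_rule(tags, rule):
    """Apply 'change from_tag to to_tag when the previous tag is prev_tag'."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked against the tags as they stood before this pass.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def errors(tags, gold):
    """Number of positions where the current annotation disagrees with the gold one."""
    return sum(t != g for t, g in zip(tags, gold))

def learn(current, gold, max_rules=10):
    """Greedily learn an ordered list of transformations from the remaining errors."""
    tagset = sorted(set(gold) | set(current))
    rules = []
    for _ in range(max_rules):
        best_rule, best_err = None, errors(current, gold)
        # Try every instantiation of the single (hypothetical) template.
        for rule in product(tagset, repeat=3):
            if rule[0] == rule[1]:
                continue  # changing a tag to itself is a no-op
            err = errors(apply_rule(current, rule), gold)
            if err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None:
            break  # no transformation reduces the error count any further
        rules.append(best_rule)
        current = apply_rule(current, best_rule)
    return rules, current

# Toy annotated corpus (hypothetical data).
words = "the can rusted . we can run .".split()
gold = ["DT", "NN", "VBD", ".", "PRP", "MD", "VB", "."]
lexicon = {"the": "DT", "can": "MD", "rusted": "VBD",
           "we": "PRP", "run": "VB", ".": "."}

start = initial_tags(words, lexicon)
rules, tagged = learn(start, gold)
print(rules)
```

On this toy corpus the initial annotator mislabels the noun "can", and the learner recovers one transformation that repairs it without disturbing the modal "can". To annotate unseen text, the learned rules would be applied in order with `apply_rule` to the output of the same initial annotator.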

Cited By

  1. Elayeb B (2019). Arabic word sense disambiguation: a review, Artificial Intelligence Review, 52:4, (2475-2532), Online publication date: 1-Dec-2019.
  2. Michael L Simultaneous learning and prediction Proceedings of the Fourteenth International Conference on Principles of Knowledge Representation and Reasoning, (348-357)
  3. Uneson M (2014). When Errors Become the Rule, ACM Computing Surveys, 46:4, (1-51), Online publication date: 1-Apr-2014.
  4. Israel R, Tetreault J and Chodorow M Correcting comma errors in learner essays, and restoring commas in newswire text Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (284-294)
  5. Albared M, Omar N and Aziz M Developing a competitive HMM arabic POS tagger using small training corpora Proceedings of the Third international conference on Intelligent information and database systems - Volume Part I, (288-296)
  6. Zhao Z and Zhu Y Prediction of prosodic phrase boundaries in Chinese TTS based on conditional random fields and transformation based learning Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 2, (599-602)
  7. Martins A, Das D, Smith N and Xing E Stacking dependency parsers Proceedings of the Conference on Empirical Methods in Natural Language Processing, (157-166)
  8. Siebert A and Schlangen D A simple method for resolution of definite reference in a shared visual context Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, (84-87)
  9. Hamon T and Grabar N Acquisition of elementary synonym relations from biological structured terminology Proceedings of the 9th international conference on Computational linguistics and intelligent text processing, (40-51)
  10. Michael L and Valiant L A first experimental demonstration of massive knowledge infusion Proceedings of the Eleventh International Conference on Principles of Knowledge Representation and Reasoning, (378-388)
  11. Rojc M, Rotovnik T, Brus M, Jan D and Kačič Z Embodied conversational agents in Wizard-of-Oz and multimodal interaction applications Proceedings of the 2007 COST action 2102 international conference on Verbal and nonverbal communication behaviours, (294-309)
  12. Gupta K, Aha D and Moore P Rough set feature selection algorithms for textual case-based classification Proceedings of the 8th European conference on Advances in Case-Based Reasoning, (166-181)
  13. Kuntraruk J, Pottenger W and Ross A (2005). Application Resource Requirement Estimation in a Parallel-Pipeline Model of Execution, IEEE Transactions on Parallel and Distributed Systems, 16:12, (1154-1165), Online publication date: 1-Dec-2005.
  14. Arranz V, Atserias J and Castillo M Multiwords and word sense disambiguation Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing, (250-262)
  15. Marasek K and Gubrynowicz R Multi-level annotation in speecon polish speech database Proceedings of the Second international conference on Intelligent Media Technology for Communicative Intelligence, (58-67)
  16. Dien D and Kiem H POS-tagger for English-Vietnamese bilingual corpus Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3, (88-95)
  17. Curran J Blueprint for a high performance NLP infrastructure Proceedings of the HLT-NAACL 2003 workshop on Software engineering and architecture of language technology systems - Volume 8, (39-44)
  18. Kit C, Xu Z and Webster J Integrating ngram model and case-based learning for Chinese word segmentation Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17, (160-163)
  19. Diab M and Resnik P An unsupervised method for word sense tagging using parallel corpora Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, (255-262)
  20. van den Bosch A and Buchholz S Shallow parsing on the basis of words only Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, (433-440)
  21. Dinh D Building a training corpus for word sense disambiguation in English-to-Vietnamese machine translation Proceedings of the 2002 COLING workshop on Machine translation in Asia - Volume 16, (1-7)
  22. Potipiti T, Sornlertlamvanich V and Thanadkran K Towards an intelligent multilingual keyboard system Proceedings of the first international conference on Human language technology research, (1-4)
  23. Petasis G, Vichot F, Wolinski F, Paliouras G, Karkaletsis V and Spyropoulos C Using machine learning to maintain rule-based named-entity recognition and classification systems Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, (426-433)
  24. Hepple M Independence and commitment Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, (278-277)
  25. van den Bosch A Using induced rules as complex features in memory-based language learning Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7, (73-78)
  26. Déjean H ALLiS Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7, (95-98)
  27. Vilain M and Day D Phrase parsing with rule sequence processors Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7, (160-162)
  28. Zhou G and Su J Error-driven HMM-based chunk tagger with context-dependent lexicon Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13, (71-79)
  29. Wiemer-Hastings K and Wiemer-Hastings P DP Proceedings of the sixth conference on Applied natural language processing, (90-96)
  30. Brants T TnT Proceedings of the sixth conference on Applied natural language processing, (224-231)
  31. Déjean H Theory refinement and Natural Language Learning Proceedings of the 18th conference on Computational linguistics - Volume 1, (229-235)
  32. Ruland T A context-sensitive model for probabilistic LR parsing of spoken language with transformation-based postprocessing Proceedings of the 18th conference on Computational linguistics - Volume 2, (677-683)
  33. Teahan W Text classification and segmentation using minimum cross-entropy Content-Based Multimedia Information Access - Volume 2, (943-961)
  34. Chen L and Tokuda N A new LSI and TM based cross-language information retrieval system providing text summaries Content-Based Multimedia Information Access - Volume 2, (1099-1106)
  35. Vilain M, Hyland R and Holland R Exploiting semantic extraction for spatiotemporal indexing in GeoNODE Content-Based Multimedia Information Access - Volume 2, (1440-1149)
  36. Houston A, Chen H, Hubbard S, Schatz B, Ng T, Sewell R and Tolle K (1999). Medical Data Mining on the Internet, Artificial Intelligence Review, 13:5-6, (437-466), Online publication date: 1-Dec-1999.
  37. Karkaletsis V, Paliouras G, Petasis G, Manousopoulou N and Spyropoulos C (1999). Named-Entity Recognition from Greek and English Texts, Journal of Intelligent and Robotic Systems, 26:2, (123-135), Online publication date: 1-Oct-1999.
  38. Chung Y, He Q, Powell K and Schatz B Semantic indexing for a complete subject discipline Proceedings of the fourth ACM conference on Digital libraries, (39-48)
  39. Lin C Training a selection function for extraction Proceedings of the eighth international conference on Information and knowledge management, (55-62)
  40. Hovy E and Lin C Automated text summarization and the SUMMARIST system Proceedings of a workshop on held at Baltimore, Maryland: October 13-15, 1998, (197-214)
  41. Park J, Kang J, Hur W and Choi K Machine aided error-correction environment for Korean morphological analysis and part-of-speech tagging Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, (1015-1019)
  42. Yeh A and Vilain M Some properties of preposition and subordinate conjunction attachments Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, (1436-1442)
  43. Guo J One tokenization per source Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, (457-463)
  44. Hajič J and Hladká B Tagging inflective languages Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, (483-490)
  45. Kübler S Learning a lexicalized grammar for German Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning, (11-18)
  46. Hajič J and Hladká B Probabilistic and rule-based tagger of an inflective language Proceedings of the fifth conference on Applied natural language processing, (111-118)
  47. Day D, Aberdeen J, Hirschman L, Kozierok R, Robinson P and Vilain M Mixed-initiative development of language processing systems Proceedings of the fifth conference on Applied natural language processing, (348-355)
  48. Palmer D A trainable rule-based algorithm for word segmentation Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, (321-328)
  49. Mitra M, Buckley C, Singhal A and Cardie C An analysis of statistical and syntactic phrases Computer-Assisted Information Searching on Internet, (200-214)
  50. Aberdeen J, Burger J, Day D, Hirschman L, Palmer D, Robinson P and Vilain M MITRE Proceedings of a workshop on held at Vienna, Virginia: May 6-8, 1996, (461-462)
  51. Goodman J Parsing algorithms and metrics Proceedings of the 34th annual meeting on Association for Computational Linguistics, (177-183)
  52. Vilain M and Day D Finite-state phrase parsing by rule sequences Proceedings of the 16th conference on Computational linguistics - Volume 1, (274-279)
  53. Wilms G Using a hybrid system of corpus and knowledge-based techniques to automate the induction of a lexical sublanguage grammar Proceedings of the 16th conference on Computational linguistics - Volume 2, (1163-1166)
  54. Aberdeen J, Burger J, Day D, Hirschman L, Robinson P and Vilain M MITRE Proceedings of the 6th conference on Message understanding, (141-155)
  55. Lauer M Corpus statistics meet the noun compound Proceedings of the 33rd annual meeting on Association for Computational Linguistics, (47-54)
  56. Yarowsky D Unsupervised word sense disambiguation rivaling supervised methods Proceedings of the 33rd annual meeting on Association for Computational Linguistics, (189-196)
  57. Magerman D (1995). Review of "Statistical language learning" by Eugene Charniak. The MIT Press 1993., Computational Linguistics, 21:1, (103-111), Online publication date: 1-Mar-1995.
  58. Brill E (1995). Transformation-based error-driven learning and natural language processing, Computational Linguistics, 21:4, (543-565), Online publication date: 1-Dec-1995.
  59. Rennison E Galaxy of news Proceedings of the 7th annual ACM symposium on User interface software and technology, (3-12)
  60. Brill E A report of recent progress in transformation-based error-driven learning Proceedings of the workshop on Human Language Technology, (256-261)
  61. Yarowsky D Decision lists for lexical ambiguity resolution Proceedings of the 32nd annual meeting on Association for Computational Linguistics, (88-95)
  62. Brill E and Resnik P A rule-based approach to prepositional phrase attachment disambiguation Proceedings of the 15th conference on Computational linguistics - Volume 2, (1198-1204)
  63. Brill E Some advances in transformation-based part of speech tagging Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence, (722-727)
  64. Brill E Automatic grammar induction and parsing free text Proceedings of the 31st annual meeting on Association for Computational Linguistics, (259-265)
Contributors
  • Microsoft Research
