A corpus-based approach to language learning

January 1993

Author:
Eric David Brill

Publisher:
University of Pennsylvania
Computer and Information Science Dept., 2000 South 33rd St., Philadelphia, PA
United States

Order Number: UMI Order No. GAX93-31757

Abstract

One goal of computational linguistics is to discover a method for assigning a rich structural annotation to sentences that are presented as simple linear strings of words; meaning can be much more readily extracted from a structurally annotated sentence than from a sentence with no structural information. Structure also allows for a more in-depth check of the well-formedness of a sentence.

There are two phases to assigning these structural annotations: first, a knowledge base is created; second, an algorithm is used to generate a structural annotation for a sentence based upon the facts provided in the knowledge base. Until recently, most knowledge bases were created manually by language experts. These knowledge bases are expensive to create and have not been used effectively for structurally parsing sentences outside highly restricted domains.

The goal of this dissertation is to make significant progress toward designing automata that are able to learn some structural aspects of human language with little human guidance. In particular, we describe a learning algorithm that takes a small structurally annotated corpus of text and a larger unannotated corpus as input, and automatically learns how to assign accurate structural descriptions to sentences not in the training corpus. The main tool we use to automatically discover structural information about language from corpora is transformation-based error-driven learning: the distribution of errors produced by an imperfect annotator is examined to learn an ordered list of transformations that can be applied to provide an accurate structural annotation. We demonstrate the application of this learning algorithm to part-of-speech tagging and parsing. Successfully applying this technique to create systems that learn could lead to robust, trainable and accurate natural language processing systems.
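
The error-driven loop described in the abstract can be illustrated with a minimal sketch for part-of-speech tagging. This is not the dissertation's implementation. It assumes an initial annotator that assigns each word its most frequent training tag (with a hypothetical default of "NN" for unknown words) and a single hypothetical rule template, "change tag A to B when the previous tag is C". All function names and the toy corpus are invented for illustration, and the sketch uses only the annotated corpus, not the larger unannotated one the abstract mentions.

```python
from collections import Counter, defaultdict

def baseline_tagger(train):
    """Imperfect initial annotator: most frequent tag per word (an assumption)."""
    counts = defaultdict(Counter)
    for sent in train:
        for word, tag in sent:
            counts[word][tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # "NN" as the fallback for unseen words is an illustrative choice.
    return lambda words: [lexicon.get(w, "NN") for w in words]

def apply_rule(tags, rule):
    """Change from_tag to to_tag wherever the previous tag is prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # The trigger is checked against the tags before this rule was applied.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def net_gain(rule, current, gold):
    """Errors fixed minus correct tags broken if the rule were applied."""
    from_tag, to_tag, prev_tag = rule
    gain = 0
    for cur, ref in zip(current, gold):
        for i in range(1, len(cur)):
            if cur[i] == from_tag and cur[i - 1] == prev_tag:
                if ref[i] == to_tag:
                    gain += 1
                elif ref[i] == from_tag:
                    gain -= 1
    return gain

def learn_transformations(train, max_rules=10):
    """Greedily learn an ordered list of transformations from current errors."""
    tagger = baseline_tagger(train)
    gold = [[t for _, t in sent] for sent in train]
    current = [tagger([w for w, _ in sent]) for sent in train]
    rules = []
    for _ in range(max_rules):
        # Candidate rules are proposed only where the current annotation is wrong.
        candidates = {
            (cur[i], ref[i], cur[i - 1])
            for cur, ref in zip(current, gold)
            for i in range(1, len(cur))
            if cur[i] != ref[i]
        }
        if not candidates:
            break
        best = max(sorted(candidates), key=lambda r: net_gain(r, current, gold))
        if net_gain(best, current, gold) <= 0:
            break
        rules.append(best)
        current = [apply_rule(tags, best) for tags in current]
    return tagger, rules

def tag(words, tagger, rules):
    """Annotate a new sentence: initial tags, then each rule in learned order."""
    tags = tagger(words)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags

# Toy annotated corpus, invented for illustration.
train = [
    [("the", "DT"), ("can", "NN"), ("rusted", "VBD")],
    [("we", "PRP"), ("can", "MD"), ("run", "VB")],
    [("they", "PRP"), ("can", "MD"), ("swim", "VB")],
]
tagger, rules = learn_transformations(train)
print(rules)
print(tag(["the", "can", "run"], tagger, rules))
```

In this toy run the initial annotator mislabels "can" after a determiner, and the learner recovers a single transformation that repairs it. A fuller system would draw on many rule templates, and the same loop is described in the abstract as applying to parsing as well.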

Index Terms

A corpus-based approach to language learning
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
