Language independent, minimally supervised methods in natural language ambiguity resolution
Publisher:
  • The Johns Hopkins University
Order Number: AAI3130658
Pages: 264
Abstract

This dissertation presents a comprehensive study of minimally supervised learning methods for natural language processing. Numerous original approaches are explored for a diversity of tasks, including lexicon and lexical probability induction, part-of-speech tagging, gender induction, named-entity recognition, and word sense disambiguation. Empirical results are presented for a diverse space of languages (Basque, Cebuano, Dutch, English, French, Greek, Hindi, Kurdish, Romanian, Slovene, Spanish, Swedish, Turkish).

Because of their cost, annotated training resources for machine learning tend to be very limited across the range of linguistic phenomena, especially for languages other than English, which limits the current applicability of supervised learning methods to text analysis in these languages. On the other hand, recent TIDES exercises on “surprise” languages such as Cebuano and Hindi showed that it is possible to collect relatively large unannotated corpora from the web and to acquire electronic bilingual dictionaries with English from on-line sources or through OCR. In this context, one important aim of this work is to investigate how such resources can be used effectively, and what the marginal cost is of distilling them to achieve a desired functionality through minimally supervised approaches. An example of minimal supervision in this framework is the proposed method for bootstrapping a fine-grained, broad-coverage POS tagger in a new language, starting from resources available for more than 100 world languages and using only one person-day of data acquisition effort. Also investigated in this framework is a novel paradigmatic word similarity measure based on statistics obtained from large unannotated corpora, which is shown to be effective when evaluated on lexical probability induction and part-of-speech tagging.
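The dissertation's exact similarity measure is not specified in this abstract, but the general idea of paradigmatic similarity from unannotated text can be sketched as follows: two words are paradigmatically similar when they occur in the same positional contexts (i.e., they are substitutable in the same slots). The function names and the windowed context representation below are illustrative assumptions, not the author's implementation.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(corpus, window=2):
    """Collect positional context counts for each word from raw,
    unannotated sentences. A context is a pair (offset, neighbor);
    words filling the same slots accumulate similar vectors."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        for i, word in enumerate(sentence):
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(sentence):
                    vectors[word][(off, sentence[j])] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    num = sum(u[k] * v[k] for k in set(u) & set(v))
    den = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vecs = context_vectors(corpus)
# "cat" and "dog" share all their contexts, so their similarity is 1.0
print(cosine(vecs["cat"], vecs["dog"]))
```

Such corpus-derived similarities can then back off estimates for rare words, which is what makes the measure useful for lexical probability induction and POS tagging without annotated data.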

Efficient hierarchically-smoothed trie structures are used successfully in three different tasks. In named-entity recognition, they provide a language-independent framework for an iterative learning algorithm that treats entity-internal and contextual information as relatively independent evidence sources, based on the co-occurrence of entities and contexts. A similar bootstrapping algorithm, used in conjunction with multilingual projection and hierarchically-smoothed tries, is shown to be successful for grammatical gender induction.
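As an illustration of the hierarchical smoothing idea (a generic sketch, not the dissertation's actual data structure), a character-prefix trie can store class counts at every node and estimate P(class | prefix) by interpolating each node's local estimate with its parent's, so that rare long prefixes back off to the more reliable statistics of shorter ones. The class labels and interpolation weight below are assumptions for the example.

```python
from collections import Counter

class TrieNode:
    def __init__(self):
        self.children = {}
        self.counts = Counter()  # class label -> count at this prefix

class SmoothedTrie:
    """Character trie with hierarchical (parent-backoff) smoothing:
    the estimate at each node is interpolated with its parent's,
    walking from the root down the word's prefix path."""

    def __init__(self, lam=0.5):
        self.root = TrieNode()
        self.lam = lam  # weight on the local (deeper) estimate

    def add(self, word, label):
        node = self.root
        node.counts[label] += 1
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.counts[label] += 1

    def prob(self, word, label):
        node = self.root
        total = sum(node.counts.values())
        p = node.counts[label] / total if total else 0.0
        for ch in word:
            if ch not in node.children:
                break  # unseen suffix: keep the last smoothed estimate
            node = node.children[ch]
            total = sum(node.counts.values())
            local = node.counts[label] / total if total else 0.0
            p = self.lam * local + (1 - self.lam) * p
        return p
```

In a bootstrapping loop, one trie over entity-internal strings and another over surrounding contexts can score each other's candidates iteratively, which matches the two relatively independent evidence sources described above.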

Finally, augmented mixture models are presented in conjunction with a novel classification correction technique that successfully addresses the problem of under-estimation of low-frequency classes, and are analyzed in the context of word sense disambiguation and context-sensitive spelling correction.
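The abstract does not detail the correction technique, but one standard way to counter under-estimation of low-frequency classes is prior re-weighting: divide each posterior by the skewed training prior, multiply by the target prior, and renormalize. The sketch below shows that generic correction, offered only as an illustration of the problem being addressed.

```python
def correct_posteriors(posteriors, train_priors, target_priors):
    """Rescale classifier posteriors so low-frequency classes are not
    systematically under-estimated: divide out the training prior,
    re-weight by the target prior, and renormalize to sum to 1."""
    adjusted = {c: p * target_priors[c] / train_priors[c]
                for c, p in posteriors.items()}
    z = sum(adjusted.values())
    return {c: p / z for c, p in adjusted.items()}

# A rare sense "b" (10% of training data) gets its posterior boosted
# once the skewed training prior is divided out.
raw = {"a": 0.8, "b": 0.2}
corrected = correct_posteriors(raw, {"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5})
print(corrected)
```

In word sense disambiguation this matters because sense distributions are heavily skewed, so an uncorrected model rarely predicts minority senses even when the context supports them.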

Contributors
  • Whiting School of Engineering
  • Microsoft Research
