Language independent, minimally supervised methods in natural language ambiguity resolution
Publisher:
  • The Johns Hopkins University
Order Number: AAI3130658
Pages: 264
Abstract

This dissertation presents a comprehensive study of minimally supervised learning methods for natural language processing. Numerous original approaches are explored for a diversity of tasks, including lexicon and lexical probability induction, part-of-speech tagging, gender induction, named-entity recognition, and word sense disambiguation. Empirical results are presented for a diverse space of languages (Basque, Cebuano, Dutch, English, French, Greek, Hindi, Kurdish, Romanian, Slovene, Spanish, Swedish, Turkish).

Because of their cost, annotated training resources for machine learning tend to be very limited across the range of linguistic phenomena, especially for languages other than English, which limits the current applicability of supervised learning methods to text analysis in these languages. On the other hand, recent TIDES exercises on “surprise” languages such as Cebuano and Hindi showed that it is possible to collect relatively large unannotated corpora from the web and to acquire electronic bilingual dictionaries with English from on-line sources or through OCR. In this context, one important aim of this work is to investigate how such resources can be used effectively, and what the marginal cost is of distilling them to achieve a desired functionality through minimally supervised approaches. An example of minimal supervision in this framework is the proposed method for bootstrapping a fine-grained, broad-coverage POS tagger in a new language, starting from resources available for more than 100 world languages and using only one person-day of data acquisition effort. Also investigated in this framework is a novel paradigmatic word similarity measure based on statistics obtained from large unannotated corpora, which is shown to be effective when evaluated on lexical probability induction and part-of-speech tagging.
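The dissertation's exact similarity measure is not specified in this abstract, but the general idea of paradigmatic similarity from unannotated text can be sketched as follows: two words are paradigmatically similar when they occur in the same positional contexts (i.e., they are substitutable in the same slots). The function names and the windowed context representation below are illustrative assumptions, not the author's implementation.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(corpus, window=2):
    """Collect positional context counts for each word from raw,
    unannotated sentences. A context is a pair (offset, neighbor);
    words filling the same slots accumulate similar vectors."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        for i, word in enumerate(sentence):
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(sentence):
                    vectors[word][(off, sentence[j])] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    num = sum(u[k] * v[k] for k in set(u) & set(v))
    den = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vecs = context_vectors(corpus)
# "cat" and "dog" share all their contexts, so their similarity is 1.0
print(cosine(vecs["cat"], vecs["dog"]))
```

Such corpus-derived similarities can then back off estimates for rare words, which is what makes the measure useful for lexical probability induction and POS tagging without annotated data.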

Efficient hierarchically-smoothed trie structures are used successfully in three different tasks. In named-entity recognition, they provide a language-independent framework for an iterative learning algorithm that treats entity-internal and contextual information as relatively independent evidence sources, based on the co-occurrence of entities and contexts. A similar bootstrapping algorithm, used in conjunction with multilingual projection and hierarchically-smoothed tries, is shown to be successful for grammatical gender induction.
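As an illustration of the hierarchical smoothing idea (a generic sketch, not the dissertation's actual data structure), a character-prefix trie can store class counts at every node and estimate P(class | prefix) by interpolating each node's local estimate with its parent's, so that rare long prefixes back off to the more reliable statistics of shorter ones. The class labels and interpolation weight below are assumptions for the example.

```python
from collections import Counter

class TrieNode:
    def __init__(self):
        self.children = {}
        self.counts = Counter()  # class label -> count at this prefix

class SmoothedTrie:
    """Character trie with hierarchical (parent-backoff) smoothing:
    the estimate at each node is interpolated with its parent's,
    walking from the root down the word's prefix path."""

    def __init__(self, lam=0.5):
        self.root = TrieNode()
        self.lam = lam  # weight on the local (deeper) estimate

    def add(self, word, label):
        node = self.root
        node.counts[label] += 1
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.counts[label] += 1

    def prob(self, word, label):
        node = self.root
        total = sum(node.counts.values())
        p = node.counts[label] / total if total else 0.0
        for ch in word:
            if ch not in node.children:
                break  # unseen suffix: keep the last smoothed estimate
            node = node.children[ch]
            total = sum(node.counts.values())
            local = node.counts[label] / total if total else 0.0
            p = self.lam * local + (1 - self.lam) * p
        return p
```

In a bootstrapping loop, one trie over entity-internal strings and another over surrounding contexts can score each other's candidates iteratively, which matches the two relatively independent evidence sources described above.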

Finally, augmented mixture models are presented in conjunction with a novel classification correction technique that successfully addresses the problem of under-estimation of low-frequency classes, and are analyzed in the context of word sense disambiguation and context-sensitive spelling correction.
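The abstract does not detail the correction technique, but one standard way to counter under-estimation of low-frequency classes is prior re-weighting: divide each posterior by the skewed training prior, multiply by the target prior, and renormalize. The sketch below shows that generic correction, offered only as an illustration of the problem being addressed.

```python
def correct_posteriors(posteriors, train_priors, target_priors):
    """Rescale classifier posteriors so low-frequency classes are not
    systematically under-estimated: divide out the training prior,
    re-weight by the target prior, and renormalize to sum to 1."""
    adjusted = {c: p * target_priors[c] / train_priors[c]
                for c, p in posteriors.items()}
    z = sum(adjusted.values())
    return {c: p / z for c, p in adjusted.items()}

# A rare sense "b" (10% of training data) gets its posterior boosted
# once the skewed training prior is divided out.
raw = {"a": 0.8, "b": 0.2}
corrected = correct_posteriors(raw, {"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5})
print(corrected)
```

In word sense disambiguation this matters because sense distributions are heavily skewed, so an uncorrected model rarely predicts minority senses even when the context supports them.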

Contributors
  • Whiting School of Engineering
  • Microsoft Research
