The rhetorical parsing, summarization, and generation of natural language texts

January 1998

Author: Daniel C. Marcu
Adviser: Graeme Hirst

Publisher: University of Toronto, Computer Center, Toronto, Ont. M5S 1A1, Canada

ISBN: 978-0-612-35238-4

Order Number: AAINQ35238

Pages: 374

Abstract

This thesis is an inquiry into the nature of the high-level, rhetorical structure of unrestricted natural language texts, computational means to enable its derivation, and two applications (in automatic summarization and natural language generation) that follow from the ability to build such structures automatically.

The thesis proposes a first-order formalization of the high-level, rhetorical structure of text. The formalization assumes that text can be sequenced into elementary units; that discourse relations hold between textual units of various sizes; that some textual units are more important to the writer's purpose than others; and that trees are a good approximation of the abstract structure of text. The formalization also introduces a linguistically motivated compositionality criterion, which is shown to hold for the text structures that are valid.

The thesis proposes, analyzes theoretically, and compares empirically four algorithms for determining the valid text structures of a sequence of units among which some rhetorical relations hold. Two algorithms apply model-theoretic techniques; the other two apply proof-theoretic techniques.

The formalization and the algorithms mentioned so far correspond to the theoretical facet of the thesis. An exploratory corpus analysis of cue phrases provides the means for applying the formalization to unrestricted natural language texts. A set of empirically motivated algorithms was designed in order to determine the elementary textual units of a text, to hypothesize rhetorical relations that hold among these units, and eventually, to derive the discourse structure of that text. The process that finds the discourse structure of unrestricted natural language texts is called rhetorical parsing.

The thesis explores two possible applications of the text theory that it proposes. The first application concerns a discourse-based summarization system, which is shown to significantly outperform both a baseline algorithm and a commercial system. An empirical psycholinguistic experiment not only provides an objective evaluation of the summarization system, but also confirms the adequacy of using the text theory proposed here in order to determine the most important units in a text. The second application concerns a set of text planning algorithms that can be used by natural language generation systems in order to construct text plans in the cases in which the high-level communicative goal is to map an entire knowledge pool into text.
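
The idea of using a rhetorical structure tree to determine the most important units in a text can be illustrated with a small sketch. The abstract does not spell out the selection procedure, so everything below is an assumption made for illustration: the `Node` class, the `promotion`, `rank_units`, and `summarize` functions, the nucleus/satellite labels "N" and "S", and the four-unit example tree are all hypothetical. The sketch promotes the units of nuclear children upward through the tree and ranks each unit by the shallowest depth at which it is promoted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A node of a hypothetical rhetorical structure tree."""
    role: str = "N"                  # "N" = nucleus, "S" = satellite (assumed labels)
    relation: Optional[str] = None   # relation name, internal nodes only
    unit: Optional[int] = None       # elementary unit index, leaves only
    children: List["Node"] = field(default_factory=list)


def promotion(node: Node) -> List[int]:
    """Units promoted to this node: the leaf itself, or the
    promoted units of its nuclear children."""
    if node.unit is not None:
        return [node.unit]
    promoted: List[int] = []
    for child in node.children:
        if child.role == "N":
            promoted.extend(promotion(child))
    return promoted


def rank_units(root: Node) -> Dict[int, int]:
    """Map each unit to the shallowest depth at which it is promoted."""
    depth_of: Dict[int, int] = {}

    def visit(node: Node, depth: int) -> None:
        # Pre-order traversal: ancestors are seen before descendants,
        # so the first depth recorded for a unit is the shallowest one.
        for u in promotion(node):
            depth_of.setdefault(u, depth)
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    return depth_of


def summarize(root: Node, k: int) -> List[int]:
    """Return the k most important units, in text order."""
    depth_of = rank_units(root)
    ranked = sorted(depth_of, key=lambda u: (depth_of[u], u))
    return sorted(ranked[:k])


# Hypothetical four-unit text: unit 2 is the nucleus of the nuclear span.
tree = Node(relation="elaboration", children=[
    Node(role="N", relation="justification", children=[
        Node(role="S", unit=1),
        Node(role="N", unit=2),
    ]),
    Node(role="S", relation="elaboration", children=[
        Node(role="N", unit=3),
        Node(role="S", unit=4),
    ]),
])

print(summarize(tree, 2))
```

In this toy tree, unit 2 is promoted all the way to the root and unit 3 to the top of the satellite span, so a two-unit summary keeps those two units. Ties between units at the same depth are broken by text order here, which is a simplification chosen for the sketch.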
