Information fusion for multidocument summarization: paraphrasing and generation

January 2003

Author: Regina Barzilay
Adviser: Kathleen R. McKeown

Publisher:
Columbia University
2960 Broadway, New York, NY
United States

Order Number: AAI3088294

Pages: 204

Abstract

The number and variety of online news sources make it difficult for people to track the news concerning even a single event. Redundancy makes such tracking extremely time-consuming: multiple news feeds on the same event tend to contain similar information. A summary of such news feeds can present important information in one short text, dramatically reducing reading time.

The focus of this thesis is information fusion, a technique which, given multiple documents, identifies redundant information and synthesizes a coherent summary. This technique is embodied in MultiGen, a system that I have designed, implemented and evaluated over the course of my Ph.D. Unlike previous work in the area, MultiGen is a domain-independent system: it generates news summaries on a variety of topics in any domain. Another contribution to the state of the art is that the system generates the summary by reusing and altering phrases from the input articles, creating a more fluent and cohesive text. This is in contrast with other existing systems, which simply extract sentences from input articles and concatenate them, leading to fluency problems. Currently MultiGen operates as part of Columbia's Newsblaster system. Every day, Newsblaster downloads all news articles from a variety of sources, clusters articles by topic, and generates a cohesive, readable automatic summary of each document cluster.

One key challenge in multidocument summarization is eliminating redundant information in the produced summaries. Articles about the same event often contain descriptions of the same fact using different wording. To address this issue, we need a method to identify paraphrases: fragments of text that convey similar meaning even if they are not identical in wording. Automatic identification of paraphrases was not addressed in previous research, although it is necessary for many applications, including question-answering, information extraction and natural language generation.

This thesis presents unsupervised learning techniques to identify paraphrases given a corpus of multiple parallel texts. This type of corpus provides many instances of paraphrasing, because these texts preserve the meaning of the original source, but may use different words to convey the meaning. Both the data and the method are departures from past approaches to corpus-based techniques. Our evaluation experiments show that the algorithm extracts paraphrases with high accuracy and significantly outperforms a state-of-the-art algorithm developed for related tasks in machine translation.
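
To make the redundancy problem described in the abstract concrete, the following is a minimal sketch of flagging sentences from two articles that repeat the same information. It is not the MultiGen or paraphrase-learning method from the thesis: it swaps in plain word-overlap (Jaccard) similarity, and the function names and the 0.5 threshold are assumptions chosen for illustration. Word overlap only catches near-identical wording, which is exactly the limitation that motivates learning paraphrases.

```python
import re


def tokens(sentence):
    """Lowercase a sentence and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def redundant_pairs(doc_a, doc_b, threshold=0.5):
    """Return (i, j, score) for sentence pairs whose overlap meets the threshold.

    doc_a and doc_b are lists of sentences; the threshold is an
    illustrative assumption, not a value taken from the thesis.
    """
    pairs = []
    for i, sent_a in enumerate(doc_a):
        for j, sent_b in enumerate(doc_b):
            score = jaccard(sent_a, sent_b)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    doc_a = ["The storm hit the coast on Monday.",
             "Officials ordered an evacuation."]
    doc_b = ["On Monday the storm hit the coast.",
             "Schools will reopen next week."]
    print(redundant_pairs(doc_a, doc_b))
```

Here the first sentences of the two articles share all their words and are flagged as redundant, while a reworded version of the same fact (for example "A hurricane struck the shoreline Monday") would score low and be missed by this baseline.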

Contributors

Kathleen Rose McKeown
Columbia University
- Publication Years: 1979 - 2020
- Publication counts: 144
- Citation count: 2,914
- Available for Download: 95
- Downloads (cumulative): 55,844
- Downloads (12 months): 2,928
- Downloads (6 weeks): 380
- Average Downloads per Article: 588
- Average Citation per Article: 20

Regina Barzilay
Massachusetts Institute of Technology
- Publication Years: 1999 - 2024
- Publication counts: 76
- Citation count: 1,578
- Available for Download: 51
- Downloads (cumulative): 27,432
- Downloads (12 months): 1,617
- Downloads (6 weeks): 261
- Average Downloads per Article: 538
- Average Citation per Article: 21

Index Terms

Information fusion for multidocument summarization: paraphrasing and generation
1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document analysis
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Recommendations

Multidocument summarization: An added value to clustering in interactive retrieval

A more and more generalized problem in effective information access is the presence in the same corpus of multiple documents that contain similar information. Generally, users may be interested in locating, for a topic addressed by a group of similar ...
Experiments in multidocument summarization
HLT '02: Proceedings of the second international conference on Human Language Technology Research

This paper describes a multidocument summarizer built upon research into the detection of new information. The summarizer uses several new strategies to select interesting and informative sentences, including an innovative measure of importance derived ...
Towards multidocument summarization by reformulation: progress and prospects
AAAI '99/IAAI '99: Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference

By synthesizing information common to retrieved documents, multi-document summarization can help users of information retrieval systems to find relevant documents with a minimal amount of reading. We are developing a multidocument summarization system ...
