The number and variety of online news sources make it difficult for people to track the news concerning even a single event. Redundancy makes such tracking extremely time-consuming: multiple news feeds on the same event tend to contain similar information. A summary of such news feeds can present the important information in one short text, dramatically reducing reading time.
The focus of this thesis is information fusion, a technique that, given multiple documents, identifies redundant information and synthesizes a coherent summary. This technique is embodied in MultiGen, a system that I have designed, implemented and evaluated over the course of my Ph.D. Unlike previous work in the area, MultiGen is domain-independent: it generates news summaries on a variety of topics in any domain. Another contribution to the state of the art is that the system generates the summary by reusing and altering phrases from the input articles, creating a more fluent and cohesive text. This is in contrast with other existing systems, which simply extract sentences from the input articles and concatenate them, leading to fluency problems. Currently MultiGen operates as part of Columbia's Newsblaster system. Every day, Newsblaster downloads news articles from a variety of sources, clusters the articles by topic, and generates a cohesive, readable automatic summary of each document cluster.
One key challenge in multidocument summarization is eliminating redundant information in the produced summaries. Articles about the same event often describe the same fact using different wording. To address this issue, we need a method to identify paraphrases: fragments of text that convey similar meaning even if they are not identical in wording. Automatic identification of paraphrases was not addressed in previous research, although it is necessary for many applications, including question answering, information extraction and natural language generation.
This thesis presents unsupervised learning techniques to identify paraphrases given a corpus of multiple parallel texts. This type of corpus provides many instances of paraphrasing, because the texts preserve the meaning of the original source but may use different words to convey it. Both the data and the method are departures from past approaches to corpus-based techniques. Our evaluation experiments show that the algorithm extracts paraphrases with high accuracy and significantly outperforms a state-of-the-art algorithm developed for related tasks in machine translation.
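To make the idea of mining paraphrases from parallel texts concrete, the following is a minimal sketch and not the learning algorithm of the thesis, whose details the abstract does not give. It assumes two sentences that are already aligned, for example two renderings of the same source sentence, and treats stretches that differ between identical surrounding words as paraphrase candidates. The function name `paraphrase_candidates` and the example sentences are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def paraphrase_candidates(sent_a, sent_b):
    """Return (span_a, span_b) pairs where two aligned sentences differ.

    Illustrative stand-in only: identical words act as shared context,
    and the differing spans between them are proposed as candidates.
    """
    a = sent_a.lower().split()
    b = sent_b.lower().split()
    # Compare the two token lists; autojunk is disabled so no token is
    # ever discarded as "junk" by the matcher's heuristic.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a span of a that is substituted by a span of b.
        if tag == "replace":
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


if __name__ == "__main__":
    # Hypothetical aligned sentence pair, not taken from the thesis data.
    first = "the minister began to speak after the meeting"
    second = "the minister started to talk after the meeting"
    for left, right in paraphrase_candidates(first, second):
        print(left, "<->", right)
```

A sketch like this proposes every differing span, including pairs that are not paraphrases at all, so a real system would still need a way to score or filter the candidates.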