Information fusion for multidocument summarization: paraphrasing and generation
Publisher:
  • Columbia University
  • 2960 Broadway New York, NY
  • United States
Order Number: AAI3088294
Pages: 204
Abstract

The number and variety of online news sources make it difficult for people to track the news concerning even a single event. Redundancy causes such tracking to be extremely time-consuming: multiple news feeds on the same event tend to contain similar information. A summary of such news feeds can present important information in one short text, dramatically reducing reading time. The focus of this thesis is information fusion, a technique which, given multiple documents, identifies redundant information and synthesizes a coherent summary.
This technique is embodied in MultiGen, a system that I have designed, implemented and evaluated over the course of my Ph.D. Unlike previous work in the area, MultiGen is a domain-independent system: it generates news summaries on a variety of topics in any domain. Another contribution to the state of the art is that the system generates the summary by reusing and altering phrases from the input articles, creating a more fluent and cohesive text. This is in contrast with other existing systems, which simply extract sentences from input articles and concatenate them, leading to fluency problems. Currently MultiGen operates as part of Columbia's Newsblaster system. Every day, Newsblaster downloads all news articles from a variety of sources, clusters articles by topic, and generates a cohesive, readable automatic summary of each document cluster.
One key challenge in multidocument summarization is eliminating redundant information in the produced summaries. Articles about the same event often contain descriptions of the same fact using different wording. To address this issue, we need a method to identify paraphrases: fragments of text that convey similar meaning even if they are not identical in wording. Automatic identification of paraphrases was not addressed in previous research, although it is necessary for many applications, including question answering, information extraction and natural language generation.
This thesis presents unsupervised learning techniques to identify paraphrases given a corpus of multiple parallel texts. This type of corpus provides many instances of paraphrasing, because these texts preserve the meaning of the original source but may use different words to convey that meaning. Both the data and the method are departures from past approaches to corpus-based techniques. Our evaluation experiments show that the algorithm extracts paraphrases with high accuracy and significantly outperforms a state-of-the-art algorithm developed for related tasks in machine translation.
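
As a rough illustration of the redundancy problem described in the abstract, the following minimal sketch flags sentence pairs from different articles that share most of their words. It is not MultiGen's method or the thesis's paraphrase algorithm: the word-overlap (Jaccard) measure, the function names, the 0.5 threshold and the sample sentences are all assumptions chosen for illustration.

```python
import re
from itertools import combinations


def tokens(sentence):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity: |A & B| / |A | B|, or 0.0 if either is empty."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def redundant_pairs(articles, threshold=0.5):
    """Find similar sentence pairs that come from different sources.

    articles: dict mapping a source name to a list of sentences.
    threshold: hypothetical cutoff; pairs scoring at or above it are reported.
    """
    tagged = [(src, s) for src, sents in articles.items() for s in sents]
    pairs = []
    for (src_a, a), (src_b, b) in combinations(tagged, 2):
        if src_a == src_b:
            continue  # only compare sentences across sources
        score = jaccard(a, b)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Made-up example input, not taken from the thesis.
articles = {
    "feed_a": ["The storm hit the coast on Monday."],
    "feed_b": [
        "The storm hit the coast early Monday.",
        "Officials opened three shelters.",
    ],
}
pairs = redundant_pairs(articles)
```

A purely lexical measure like this only catches restatements that reuse the same words. Sentences that report the same fact in different wording would score low, which is the gap that paraphrase identification is meant to close.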

Cited By

  1. Karmaker Santu S, Geigle C, Ferguson D, Cope W, Kalantzis M, Searsmith D and Zhai C (2018). SOFSAT, ACM SIGKDD Explorations Newsletter, 20:2, (21-30), Online publication date: 11-Dec-2018.
  2. Preoţiuc-Pietro D, Xu W and Ungar L Discovering user attribute stylistic differences via paraphrasing Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, (3030-3037)
  3. Miranda-Jiménez S, Gelbukh A and Sidorov G (2014). Conceptual Graphs as Framework for Summarizing Short Texts, International Journal of Conceptual Structures and Smart Applications, 2:2, (55-75), Online publication date: 1-Jul-2014.
  4. Cohn T and Lapata M (2013). An abstractive approach to sentence compression, ACM Transactions on Intelligent Systems and Technology, 4:3, (1-35), Online publication date: 1-Jun-2013.
  5. Ganitkevitch J, Van Durme B and Callison-Burch C Monolingual distributional similarity for text-to-text generation Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, (256-264)
  6. Ganitkevitch J, Callison-Burch C, Napoles C and Van Durme B Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation Proceedings of the Conference on Empirical Methods in Natural Language Processing, (1168-1179)
  7. Woodsend K and Lapata M Learning to simplify sentences with quasi-synchronous grammar and integer programming Proceedings of the Conference on Empirical Methods in Natural Language Processing, (409-420)
  8. Deléger L and Zweigenbaum P Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora, (2-10)
  9. Nahnsen T Domain-independent shallow sentence ordering Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium, (78-83)
  10. Hsueh P and Moore J Improving meeting summarization by focusing on user needs Proceedings of the 14th international conference on Intelligent user interfaces, (17-26)
  11. Cohn T and Lapata M Sentence compression beyond word deletion Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, (137-144)
  12. Callison-Burch C, Cohn T and Lapata M ParaMetric Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, (97-104)
  13. Siddharthan A and Copestake A Generating research websites using summarisation techniques Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Demo Session, (5-8)
  14. Seno E and Nunes M Automatic alignment of common information in comparable sentences of Portuguese Companion Proceedings of the XIV Brazilian Symposium on Multimedia and the Web, (331-335)
  15. Soricut R and Marcu D (2007). Abstractive headline generation using WIDL-expressions, Information Processing and Management: an International Journal, 43:6, (1536-1548), Online publication date: 1-Nov-2007.
  16. Barzilay R and McKeown K (2005). Sentence fusion for multidocument news summarization, Computational Linguistics, 31:3, (297-328), Online publication date: 1-Sep-2005.
  17. Marsi E and Krahmer E Classification of semantic relations by humans and machines Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, (1-6)
  18. Nenkova A, Siddharthan A and McKeown K Automatically learning cognitive status for multi-document summarization of newswire Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, (241-248)
  19. Siddharthan A and McKeown K Improving multilingual summarization Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, (33-40)
  20. Soricut R and Marcu D Towards developing generation algorithms for text-to-text applications Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, (66-74)
  21. Siddharthan A, Nenkova A and McKeown K Syntactic simplification for improving content selection in multi-document summarization Proceedings of the 20th international conference on Computational Linguistics, (896-es)
  22. Chieu H and Lee Y Query based event extraction along a timeline Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, (425-432)
  23. Lapata M Probabilistic text structuring Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, (545-552)
Contributors
  • Columbia University
  • Massachusetts Institute of Technology
