The rhetorical parsing, summarization, and generation of natural language texts
Publisher:
  • University of Toronto
  • Computer Center Toronto, Ont. M5S 1A1
  • Canada
ISBN: 978-0-612-35238-4
Order Number: AAINQ35238
Pages: 374
Abstract

This thesis is an inquiry into the nature of the high-level, rhetorical structure of unrestricted natural language texts, computational means to enable its derivation, and two applications (in automatic summarization and natural language generation) that follow from the ability to build such structures automatically.

The thesis proposes a first-order formalization of the high-level, rhetorical structure of text. The formalization assumes that text can be sequenced into elementary units; that discourse relations hold between textual units of various sizes; that some textual units are more important to the writer's purpose than others; and that trees are a good approximation of the abstract structure of text. The formalization also introduces a linguistically motivated compositionality criterion, which is shown to hold for the text structures that are valid.

The thesis proposes, analyzes theoretically, and compares empirically four algorithms for determining the valid text structures of a sequence of units among which some rhetorical relations hold. Two algorithms apply model-theoretic techniques; the other two apply proof-theoretic techniques.

The formalization and the algorithms mentioned so far correspond to the theoretical facet of the thesis. An exploratory corpus analysis of cue phrases provides the means for applying the formalization to unrestricted natural language texts. A set of empirically motivated algorithms was designed in order to determine the elementary textual units of a text, to hypothesize rhetorical relations that hold among these units, and, eventually, to derive the discourse structure of that text. The process that finds the discourse structure of unrestricted natural language texts is called rhetorical parsing.

The thesis explores two possible applications of the text theory that it proposes. The first application concerns a discourse-based summarization system, which is shown to significantly outperform both a baseline algorithm and a commercial system. An empirical psycholinguistic experiment not only provides an objective evaluation of the summarization system, but also confirms the adequacy of using the text theory proposed here in order to determine the most important units in a text. The second application concerns a set of text planning algorithms that can be used by natural language generation systems in order to construct text plans in the cases in which the high-level communicative goal is to map an entire knowledge pool into text.
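
As a rough illustration of how a tree-shaped discourse structure with more and less important units could drive an extractive summary, the sketch below ranks elementary units by how close to the root they are "promoted" through nuclear children. It is not the thesis's implementation: the `Node` class, the functions `promotion`, `rank_units`, and `summarize`, the depth-based score, and the example tree are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A node in a hypothetical discourse tree (illustrative only).

    A leaf carries the index of an elementary unit. An internal node
    carries a relation name and children; `nuclear` records whether the
    node is a nucleus (True) or satellite (False) relative to its parent.
    """
    unit: Optional[int] = None
    relation: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    nuclear: bool = True


def promotion(node: Node) -> List[int]:
    """Units promoted to this node: the leaf's own unit, or the units
    promoted by its nuclear children."""
    if node.unit is not None:
        return [node.unit]
    promoted: List[int] = []
    for child in node.children:
        if child.nuclear:
            promoted.extend(promotion(child))
    return promoted


def rank_units(root: Node) -> Dict[int, int]:
    """Map each unit to the shallowest depth at which it is promoted.
    A smaller depth is treated here as higher importance."""
    scores: Dict[int, int] = {}

    def visit(node: Node, depth: int) -> None:
        for unit in promotion(node):
            scores.setdefault(unit, depth)  # keep the shallowest depth
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    return scores


def summarize(root: Node, units: List[str], k: int) -> List[str]:
    """Return the k best-ranked units, restored to original text order."""
    scores = rank_units(root)
    best = sorted(scores, key=lambda u: (scores[u], u))[:k]
    return [units[u] for u in sorted(best)]


# Hypothetical four-unit text and a hand-built tree over it.
example_units = ["unit 0", "unit 1", "unit 2", "unit 3"]
example_tree = Node(relation="elaboration", children=[
    Node(relation="evidence", nuclear=True, children=[
        Node(unit=0, nuclear=True),
        Node(unit=1, nuclear=False),
    ]),
    Node(relation="joint", nuclear=False, children=[
        Node(unit=2, nuclear=True),
        Node(unit=3, nuclear=True),
    ]),
])
```

With this example tree, `summarize(example_tree, example_units, 2)` selects units 0 and 2: unit 0 is promoted all the way to the root, while units 2 and 3 are promoted only one level below it and the tie is broken by text order. A real system would first have to derive the tree from the text, which is the rhetorical parsing step described above.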

Cited By

  1. Nguyen C and Nguyen D Towards Building Vietnamese Discourse Treebank Proceedings of the 8th International Symposium on Information and Communication Technology, (63-69)
  2. Liu W, Luo X, Gong Z, Xuan J, Kou N and Xu Z (2016). Discovering the core semantics of event from social media, Future Generation Computer Systems, 64:C, (175-185), Online publication date: 1-Nov-2016.
  3. Elhoseiny M and Elgammal A (2016). Text to multi-level MindMaps, Multimedia Tools and Applications, 75:8, (4217-4244), Online publication date: 1-Apr-2016.
  4. Greenbacker C Towards a framework for abstractive summarization of multimodal documents Proceedings of the ACL 2011 Student Session, (75-80)
  5. Greenbacker C, Wu P, Carberry S, McCoy K, Elzer S, McDonald D, Chester D and Demir S Improving the accessibility of line graphs in multimodal documents Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, (52-62)
  6. Ledeneva Y, Hernández R, Soto R, Reyes R and Gelbukh A EM clustering algorithm for automatic text summarization Proceedings of the 10th Mexican international conference on Advances in Artificial Intelligence - Volume Part I, (305-315)
  7. Uzêda V, Pardo T and Nunes M (2010). A comprehensive comparative evaluation of RST-based summarization methods, ACM Transactions on Speech and Language Processing, 6:4, (1-20), Online publication date: 1-May-2010.
  8. Groza T, Handschuh S and Bordea G Towards automatic extraction of epistemic items from scientific publications Proceedings of the 2010 ACM Symposium on Applied Computing, (1341-1348)
  9. Demir S, Carberry S and McCoy K A discourse-aware graph-based content-selection framework Proceedings of the 6th International Natural Language Generation Conference, (17-25)
  10. Saggion H A classification algorithm for predicting the structure of summaries Proceedings of the 2009 Workshop on Language Generation and Summarisation, (31-38)
  11. Stent A and Molina M Evaluating automatic extraction of rules for sentence plan construction Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, (290-297)
  12. Curteanu N, Trandabăţ D and Moruz M Expanding Topic-Focus Articulation with Boundary and Accent Assignment Rules for Romanian Sentence Proceedings of the 12th International Conference on Text, Speech and Dialogue, (226-233)
  13. Atkinson J, Ferreira A and Aravena E (2009). Discovering implicit intention-level knowledge from natural-language texts, Knowledge-Based Systems, 22:7, (502-508), Online publication date: 1-Oct-2009.
  14. Barcellini F, Détienne F, Burkhardt J and Sack W (2008). A socio-cognitive analysis of online design discussions in an Open Source Software community, Interacting with Computers, 20:1, (141-165), Online publication date: 1-Jan-2008.
  15. Demir S, Carberry S and McCoy K Generating textual summaries of bar charts Proceedings of the Fifth International Natural Language Generation Conference, (7-15)
  16. Da Cunha I, Fernández S, Morales P, Vivaldi J, SanJuan E and Torres-Moreno J A new hybrid summarizer based on vector space model, statistical physics and linguistics Proceedings of the artificial intelligence 6th Mexican international conference on Advances in artificial intelligence, (872-882)
  17. Ming Y Discursive usage of six Chinese punctuation marks Proceedings of the 21st International Conference on computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, (43-48)
  18. Kazantseva A An approach to summarizing short stories Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, (55-62)
  19. Pardo T and Nunes M Review and evaluation of dizer – an automatic discourse analyzer for brazilian portuguese Proceedings of the 7th international conference on Computational Processing of the Portuguese Language, (180-189)
  20. Inui T and Okumura M Investigating the characteristics of causal relations in Japanese text Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, (37-44)
  21. Barcellini F, Détienne F, Burkhardt J and Sack W Thematic coherence and quotation practices in OSS design-oriented online discussions Proceedings of the 2005 ACM International Conference on Supporting Group Work, (177-186)
  22. Zhang Z and Radev D Combining labeled and unlabeled data for learning cross-document structural relationships Proceedings of the First international joint conference on Natural Language Processing, (32-41)
  23. Bunt H, Carroll J and Satta G Development in parsing technology New developments in parsing technology, (1-18)
  24. Hoffmann A and Pham S Towards topic-based summarization for interactive document viewing Proceedings of the 2nd international conference on Knowledge capture, (28-35)
  25. Zhang Z, Otterbacher J and Radev D Learning cross-document structural relationships using boosting Proceedings of the twelfth international conference on Information and knowledge management, (124-130)
  26. Alonso i Alemany L and Fuentes Fort M Integrating cohesion and coherence for automatic summarization Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 2, (1-8)
  27. Mercer R and Di Marco C The importance of fine-grained cue phrases in scientific citations Proceedings of the 16th Canadian society for computational studies of intelligence conference on Advances in artificial intelligence, (550-556)
  28. Le H and Abeysinghe G A study to improve the efficiency of a discourse parsing system Proceedings of the 4th international conference on Computational linguistics and intelligent text processing, (101-114)
  29. Zhang Z, Blair-Goldensohn S and Radev D Towards CST-enhanced summarization Eighteenth national conference on Artificial intelligence, (439-445)
  30. Delannoy J What are the points? Proceedings of the workshop on Human Language Technology and Knowledge Management - Volume 2001, (1-8)
  31. Sack W Conversation map Proceedings of the 5th international conference on Intelligent user interfaces, (233-240)
  32. Chuang W and Yang J Extracting sentence segments for text summarization Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, (152-159)
  33. Marcu D (2000). The rhetorical parsing of unrestricted texts, Computational Linguistics, 26:3, (395-448), Online publication date: 1-Sep-2000.
  34. Marcu D The automatic construction of large-scale corpora for summarization research Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, (137-144)
  35. Lin C Training a selection function for extraction Proceedings of the eighth international conference on Information and knowledge management, (55-62)
  36. Hovy E and Lin C Automated text summarization and the SUMMARIST system Proceedings of a workshop on held at Baltimore, Maryland: October 13-15, 1998, (197-214)
  37. Cristea D, Ide N and Romary L Veins Theory Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, (281-285)
  38. Marcu D The rhetorical parsing of natural language texts Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, (96-103)
Contributors
  • University of Toronto
  • Amazon.com, Inc.
