ABSTRACT
Scoring document sentences against human-written abstract summaries is a central task in extractive multi-document summarization. In this paper, we formulate extractive summarization as a two-step learning problem: we build a generative model for pattern discovery and a regression model for inference. First, we score the sentences in each document cluster according to their latent characteristics, using a hierarchical topic model. Then, using these scores as training targets, we fit a regression model on the lexical and structural characteristics of the sentences, and apply it to score the sentences of new documents and form a summary. Our system advances the current state of the art, improving ROUGE scores by roughly 7%. Manual quality evaluations indicate that the generated summaries are less redundant and more coherent.
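The two-step pipeline described above can be illustrated with a minimal sketch. This is not the paper's implementation: it substitutes a flat LDA model (via scikit-learn) for the hierarchical topic model, uses a maximum topic weight as a stand-in salience score, and invents two toy lexical features (sentence length and mean word length); all sentence data and feature choices here are hypothetical.

```python
# Hedged sketch of the abstract's two-step pipeline:
# (1) score training sentences with a topic model (plain LDA here, as a
#     stand-in for the hierarchical model), (2) fit a regression model on
#     lexical/structural features, (3) rank unseen sentences by prediction.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVR

train_sents = [
    "the storm damaged coastal towns",
    "officials reported power outages",
    "rescue teams searched the area",
    "the weather improved by friday",
]
new_sents = [
    "the hurricane caused widespread damage",
    "markets were calm on monday",
]

# Step 1: latent-topic scores for training sentences. The max topic
# weight is a crude salience proxy, standing in for the paper's
# hierarchical-topic-model scores.
vec = CountVectorizer()
X_train = vec.fit_transform(train_sents)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X_train)          # per-sentence topic mixture
topic_scores = theta.max(axis=1)            # strongest topic weight

# Step 2: regress simple surface features onto those scores.
def features(sents):
    # Toy features: word count and mean word length (assumptions).
    return np.array(
        [[len(s.split()), sum(map(len, s.split())) / len(s.split())]
         for s in sents]
    )

reg = SVR(kernel="rbf").fit(features(train_sents), topic_scores)

# Step 3: score unseen sentences and rank them for the summary.
pred = reg.predict(features(new_sents))
ranking = sorted(zip(new_sents, pred), key=lambda p: -p[1])
for sent, score in ranking:
    print(f"{score:.3f}  {sent}")
```

In the paper itself, the generative model supplies training targets only once; at inference time only the cheap regression over surface features is needed, which is the point of the two-step design.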