ABSTRACT
A news article generally contains a high-level overview of the facts early on, followed by paragraphs of more detailed information. This structure allows copy editors to truncate the later paragraphs of an article to satisfy space limitations without losing critical information. Existing approaches to the problem of automatic multi-article layout focus exclusively on maximizing content and aesthetics. However, no algorithm can determine how "good" a truncation point is based on semantic content or article readability. Disregarding the semantic information within the article can lead to overly aggressive cutting, which eliminates key content and potentially confuses the reader; conversely, it can set too generous a truncation point, which leaves in superfluous content and makes automatic layout more difficult. This is one of the remaining challenges on the path from manual layouts to fully automated processes with high-quality output. In this work, we present a new semantic-focused approach to rating the quality of a truncation point. We built models based on results from an extensive user study on over 700 news articles. Further results show that existing techniques over-cut content. We demonstrate the layout impact through a second evaluation that implements our models in the first layout approach that integrates both layout and semantic quality. The primary contribution of this work is the demonstration that semantic-based modeling is critical for high-quality automated document synthesis within a real-world context.
Index Terms
- Truncation: all the news that fits we'll print