Abstract
This paper presents new approaches to headline generation for English newspaper texts, with an eye toward the production of document surrogates for document selection in cross-language information retrieval. This task is difficult because the user must make decisions about relevance based on (often poor) translations of retrieved documents. To facilitate the decision-making process, we need translations that can be assessed rapidly and accurately; our approach is to provide an English headline for the non-English document. We describe two approaches to headline generation and their application to the recent DARPA TIDES-2003 Surprise Language Exercise for Hindi. For comparison, we also implemented an alternative method for surrogate generation: a system that produces topic lists for (Hindi) articles. We present the results of a series of experiments comparing each of these approaches. We demonstrate in both automatic and human evaluations that our linguistically motivated approach outperforms two other surrogate-generation methods: a statistical system and a topic discovery system.
Index Terms
- Cross-language headline generation for Hindi