Abstract
Most research on temporal tagging so far has focused on English text documents, and hardly any multilingual temporal taggers support more than two languages. Recently, the temporal tagger HeidelTime was made publicly available. It supports the integration of new languages through language-dependent resources, without any modification of the source code.
In this article, we describe our work on developing such resources for two Asian and two Romance languages: Arabic, Vietnamese, Spanish, and Italian. While temporal tagging of the two Romance languages has been addressed before, there has been almost no research on Arabic and Vietnamese temporal tagging so far. Furthermore, we analyze language-dependent challenges for temporal tagging and explain the strategies we followed to address them. Our evaluation results on publicly available and newly annotated corpora demonstrate the high quality of our new resources for the four languages, which we make publicly available to the research community.
Index Terms
- Time for More Languages: Temporal Tagging of Arabic, Italian, Spanish, and Vietnamese