research-article

Time for More Languages: Temporal Tagging of Arabic, Italian, Spanish, and Vietnamese

Authors: Jannik Strötgen, Ayser Armiti, Tran Van Canh, Julian Zell, and Michael Gertz (all Heidelberg University)

ACM Transactions on Asian Language Information Processing, Volume 13, Issue 1, Article No. 1, pp. 1–21. https://doi.org/10.1145/2540989

Published: 01 February 2014

Abstract

Most research on temporal tagging to date has focused on English text documents, and hardly any multilingual temporal taggers support more than two languages. Recently, the temporal tagger HeidelTime has been made publicly available; it supports the integration of new languages through language-dependent resources, without modifications to the source code.

In this article, we describe our work on developing such resources for two Asian and two Romance languages: Arabic, Vietnamese, Spanish, and Italian. While temporal tagging of the two Romance languages has been addressed before, there has been almost no research on Arabic and Vietnamese temporal tagging so far. Furthermore, we analyze language-dependent challenges for temporal tagging and explain the strategies we followed to address them. Our evaluation results on publicly available and newly annotated corpora demonstrate the high quality of our new resources for the four languages, which we make publicly available to the research community.
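
To illustrate the general idea of keeping language-dependent resources separate from language-independent tagging code, the following is a minimal sketch in Python. It is not HeidelTime's implementation or resource format: the `RESOURCES` dictionary, the `%MONTH%` placeholder, and the `tag_dates` function are hypothetical names chosen for this example, and the patterns cover only simple full dates in Spanish and Italian.

```python
import re

# Hypothetical language resources (not HeidelTime's actual format):
# each language supplies month names and one date pattern.
RESOURCES = {
    "spanish": {
        "months": {
            "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
            "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
            "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
        },
        # %MONTH% is replaced by an alternation of the month names.
        "date_pattern": r"\b(?P<day>\d{1,2}) de (?P<month>%MONTH%) de (?P<year>\d{4})\b",
    },
    "italian": {
        "months": {
            "gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
            "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
            "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
        },
        "date_pattern": r"\b(?P<day>\d{1,2}) (?P<month>%MONTH%) (?P<year>\d{4})\b",
    },
}


def tag_dates(text, language):
    """Return (expression, normalized value) pairs for full dates in text.

    The function itself is language-independent; everything specific to a
    language comes from the RESOURCES entry selected by `language`.
    """
    resources = RESOURCES[language]
    months = resources["months"]
    alternation = "|".join(re.escape(name) for name in months)
    pattern = re.compile(
        resources["date_pattern"].replace("%MONTH%", alternation),
        re.IGNORECASE,
    )
    results = []
    for match in pattern.finditer(text):
        # Normalize to an ISO-style YYYY-MM-DD value string.
        value = "{:04d}-{:02d}-{:02d}".format(
            int(match.group("year")),
            months[match.group("month").lower()],
            int(match.group("day")),
        )
        results.append((match.group(0), value))
    return results


if __name__ == "__main__":
    print(tag_dates("La reunión fue el 3 de mayo de 2013.", "spanish"))
    print(tag_dates("Il 15 ottobre 2012 a Roma.", "italian"))
```

In this sketch, supporting a further language would mean adding another entry to `RESOURCES` while leaving `tag_dates` unchanged, which mirrors, in a very simplified way, the separation described in the abstract.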


Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection

Published in
ACM Transactions on Asian Language Information Processing Volume 13, Issue 1
February 2014
93 pages
ISSN: 1530-0226
EISSN: 1558-3430
DOI: 10.1145/2590408
Editor: Richard Sproat, Google, Inc., USA
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 February 2014
- Accepted: 1 October 2013
- Revised: 1 September 2013
- Received: 1 May 2013
Author Tags
Arabic NLP
HeidelTime
TIMEX3
Temporal tagging
Vietnamese NLP
Qualifiers
- research-article
- Research
- Refereed