research-article

Open Access

A Sense Annotated Corpus for All-Words Urdu Word Sense Disambiguation

Authors:
Ali Saeed, COMSATS University Islamabad, Lahore Campus, Lahore, Punjab, Pakistan (ORCID: 0000-0002-3779-2633)
Rao Muhammad Adeel Nawab, COMSATS University Islamabad, Lahore Campus, Lahore, Punjab, Pakistan
Mark Stevenson, University of Sheffield, Sheffield, United Kingdom
Paul Rayson, Lancaster University, Bailrigg, Lancaster, United Kingdom

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 4, Article No.: 40, pp 1–14, https://doi.org/10.1145/3314940

Published: 07 May 2019


Abstract

Word Sense Disambiguation (WSD) aims to automatically predict the correct sense of a word used in a given context. All human languages exhibit word sense ambiguity, and resolving this ambiguity can be difficult. Standard benchmark resources are required to develop, compare, and evaluate WSD techniques. These are available for many languages, but not for Urdu, even though Urdu has more than 300 million speakers and large volumes of text available digitally. To fill this gap, this study proposes a novel benchmark corpus for the Urdu All-Words WSD task. The corpus contains 5,042 words of Urdu running text in which all ambiguous words (856 instances) are manually tagged with senses from the Urdu Lughat dictionary. A range of n-gram-based baseline WSD models is applied to the corpus, and the best performance (an accuracy of 57.71%) is achieved using word 4-grams. The corpus is freely available to the research community to encourage further WSD research in Urdu.
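The abstract does not say how the n-gram baselines score candidate senses, so the following is only a rough illustration of what a word n-gram baseline for all-words WSD could look like, not the authors' method. It assumes each candidate sense comes with a short piece of example text and picks the sense whose text shares the most word n-grams with the context of the ambiguous word. All names (`word_ngrams`, `overlap`, `disambiguate`, `accuracy`) and the English toy data are hypothetical and are not taken from the corpus or from the Urdu Lughat dictionary.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the multiset of word n-grams found in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(a, b):
    """Number of n-grams shared by two multisets (size of their intersection)."""
    return sum((a & b).values())


def disambiguate(context_tokens, sense_examples, n=1):
    """Pick the sense whose example text shares the most n-grams with the context.

    Ties go to the sense listed first in `sense_examples`.
    """
    context = word_ngrams(context_tokens, n)
    best_sense, best_score = None, -1
    for sense, example_tokens in sense_examples.items():
        score = overlap(context, word_ngrams(example_tokens, n))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


def accuracy(predictions, gold):
    """Fraction of instances whose predicted sense matches the gold sense."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical toy data in English, used only to exercise the functions.
senses = {
    "bank_river": "the sloping land beside a river".split(),
    "bank_money": "an institution that keeps money for customers".split(),
}
contexts = [
    "he sat on the land beside the river".split(),
    "she keeps her money in the bank".split(),
]
gold = ["bank_river", "bank_money"]

predictions = [disambiguate(c, senses, n=1) for c in contexts]
print(predictions, accuracy(predictions, gold))
```

Raising `n` makes matches stricter: longer n-grams are less likely to match by chance, but they also occur less often, so more instances end up with a score of zero and fall back to the tie-breaking rule.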


Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources


Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 18, Issue 4
December 2019
305 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3327969
Editor:
Nianwen Xue
Brandeis University, Waltham, USA
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 7 May 2019
- Accepted: 1 February 2019
- Revised: 1 November 2018
- Received: 1 August 2018
Author Tags
Word sense disambiguation
all-words task
sense tagged Urdu corpus
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 9
- Total Downloads: 1,455
- Downloads (Last 12 months): 153
- Downloads (Last 6 weeks): 21