
A Sense Annotated Corpus for All-Words Urdu Word Sense Disambiguation

Published: 07 May 2019

Abstract

Word Sense Disambiguation (WSD) aims to automatically predict the correct sense of a word used in a given context. All human languages exhibit word sense ambiguity, and resolving this ambiguity can be difficult. Standard benchmark resources are required to develop, compare, and evaluate WSD techniques. These are available for many languages, but not for Urdu, even though Urdu has more than 300 million speakers and large volumes of text available digitally. To fill this gap, this study proposes a novel benchmark corpus for the Urdu All-Words WSD task. The corpus contains 5,042 words of Urdu running text in which all ambiguous words (856 instances) are manually tagged with senses from the Urdu Lughat dictionary. A range of baseline WSD models based on n-grams are applied to the corpus, and the best performance (accuracy of 57.71%) is achieved using word 4-grams. The corpus is freely available to the research community to encourage further WSD research in Urdu.


    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 4
      December 2019
      305 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3327969

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 7 May 2019
      • Accepted: 1 February 2019
      • Revised: 1 November 2018
      • Received: 1 August 2018

      Qualifiers

      • research-article
      • Research
      • Refereed
