Abstract
Word Sense Disambiguation (WSD) aims to automatically predict the correct sense of a word used in a given context. All human languages exhibit word sense ambiguity, and resolving this ambiguity can be difficult. Standard benchmark resources are required to develop, compare, and evaluate WSD techniques. Such resources are available for many languages but not for Urdu, even though Urdu has more than 300 million speakers and large volumes of digitally available text. To fill this gap, this study presents a novel benchmark corpus for the Urdu All-Words WSD task. The corpus contains 5,042 words of Urdu running text in which all ambiguous words (856 instances) are manually tagged with senses from the Urdu Lughat dictionary. A range of baseline WSD models based on n-grams is applied to the corpus, and the best performance (accuracy of 57.71%) is achieved using word 4-grams. The corpus is freely available to the research community to encourage further WSD research in Urdu.
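The abstract does not describe how the n-gram baselines are built, so the following is only a minimal sketch of one plausible word n-gram baseline, not the authors' method. It assumes that a sense is chosen by comparing the word n-grams of the test context with those of sense-tagged example contexts using Jaccard overlap. The function names, the toy English data, and the sense labels are all hypothetical and are used purely for illustration.

```python
def word_ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def predict_sense(context_tokens, tagged_examples, n=1):
    """Return the sense whose example context overlaps most with the test context.

    tagged_examples is a list of (sense_label, example_tokens) pairs.
    Ties are resolved in favour of the earliest example.
    """
    target = word_ngrams(context_tokens, n)
    best_sense, best_score = None, -1.0
    for sense, example_tokens in tagged_examples:
        score = jaccard(target, word_ngrams(example_tokens, n))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold sense labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical toy data (English, for readability only).
toy_examples = [
    ("bank/finance", "he deposited money in the bank".split()),
    ("bank/river", "they sat on the bank of the river".split()),
]
toy_context = "she withdrew money from the bank".split()

print(predict_sense(toy_context, toy_examples, n=1))
```

In this sketch the parameter `n` controls the n-gram length, so the same function can be run with `n=1` through `n=4` to compare configurations. Accuracy is the share of ambiguous instances that receive the gold sense, which is the measure the abstract reports.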
Index Terms
- A Sense Annotated Corpus for All-Words Urdu Word Sense Disambiguation