
Arabic Authorship Attribution: An Extensive Study on Twitter Posts

Published: 12 November 2018

Abstract

Law enforcement faces problems in tracing the true identity of offenders in cybercrime investigations. Most offenders mask their true identity, impersonate people of high authority, or use identity deception and obfuscation tactics to avoid detection and traceability. To address the problem of anonymity, authorship analysis is used to identify individuals by their writing styles without knowing their actual identities. Most authorship studies are dedicated to English due to its widespread use over the Internet, but recent cyber-attacks such as the distribution of Stuxnet indicate that Internet crimes are not limited to a certain community, language, culture, ideology, or ethnicity. To effectively investigate cybercrime and to address the problem of anonymity in online communication, there is a pressing need to study authorship analysis of other languages, such as Arabic, Chinese, and Turkish. Arabic, the focus of this study, is the fourth most widely used language on the Internet. This study investigates the authorship of Arabic text, especially very short texts such as Twitter posts. We benchmark the performance of a profile-based approach that uses n-grams as features and compare it with state-of-the-art instance-based classification techniques. We then adapt an event-visualization tool that was developed for English to accommodate both Arabic and English and use it to visualize the attribution evidence. In addition, we investigate the relative effect of the training set, the length of tweets, and the number of authors on authorship classification accuracy. Finally, we show that diacritics have an insignificant effect on the attribution process and that part-of-speech tags are less effective than character-level and word-level n-grams.
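To make the idea of a profile-based approach with n-gram features more concrete, the following is a minimal sketch and not the method evaluated in the study. It assumes a simplified scheme in which each author's profile is the set of the most frequent character n-grams over all of that author's texts, and an unknown text is assigned to the author whose profile shares the most n-grams with it. The function names, the parameters `n` and `size`, and the toy texts are all hypothetical; the same code works on any Unicode string, including Arabic.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(texts, n=3, size=50):
    """Build a profile: the `size` most frequent n-grams across all texts.

    All texts of one author are pooled into a single profile, which is
    what distinguishes a profile-based approach from an instance-based
    one (where each text is a separate training example).
    """
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text, n))
    return {gram for gram, _ in counts.most_common(size)}


def attribute(unknown, profiles, n=3, size=50):
    """Return the author whose profile overlaps most with the unknown text."""
    query = build_profile([unknown], n, size)
    return max(profiles, key=lambda author: len(profiles[author] & query))


# Hypothetical toy data standing in for two authors' tweets.
training = {
    "author_a": ["hello hello world", "hello there world"],
    "author_b": ["zzz quick brown fox", "quick brown dogs"],
}
profiles = {author: build_profile(texts) for author, texts in training.items()}

print(attribute("hello world again", profiles))
```

The set-intersection similarity is only one simple choice. A real system would tune the n-gram order and profile size and would likely use a more refined distance between profiles.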


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 1
March 2019
196 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3292011

                        Copyright © 2018 ACM

                        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                        Publisher

                        Association for Computing Machinery

                        New York, NY, United States

                        Publication History

                        • Published: 12 November 2018
                        • Accepted: 1 June 2018
                        • Revised: 1 March 2018
                        • Received: 1 October 2015

                        Qualifiers

                        • research-article
                        • Research
                        • Refereed
