research-article

Arabic Authorship Attribution: An Extensive Study on Twitter Posts

Authors:
Malik H. Altakrori, School of Computer Science, McGill University, QC, Canada
Farkhund Iqbal, College of Technological Innovation, Zayed University, Abu Dhabi, United Arab Emirates
Benjamin C. M. Fung, School of Information Studies, McGill University, QC, Canada (ORCID: 0000-0001-8423-2906)
Steven H. H. Ding, School of Information Studies, McGill University, QC, Canada
Abdallah Tubaishat, College of Technological Innovation, Zayed University, Abu Dhabi, United Arab Emirates

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 1, Article No. 5, pp. 1–51. https://doi.org/10.1145/3236391

Published: 12 November 2018

Abstract

Law enforcement faces problems in tracing the true identity of offenders in cybercrime investigations. Most offenders mask their true identity, impersonate people of high authority, or use identity deception and obfuscation tactics to avoid detection and traceability. To address the problem of anonymity, authorship analysis is used to identify individuals by their writing styles without knowing their actual identities. Most authorship studies are dedicated to English because of its widespread use on the Internet, but recent cyber-attacks such as the distribution of Stuxnet indicate that Internet crimes are not limited to a certain community, language, culture, ideology, or ethnicity. To investigate cybercrime effectively and to address the problem of anonymity in online communication, there is a pressing need to study authorship analysis of languages such as Arabic, Chinese, and Turkish. Arabic, the focus of this study, is the fourth most widely used language on the Internet. This study investigates authorship attribution of Arabic text, especially very short text such as Twitter posts. We benchmark the performance of a profile-based approach that uses n-grams as features and compare it with state-of-the-art instance-based classification techniques. We then adapt an event-visualization tool, originally developed for English, to accommodate both Arabic and English and to visualize the attribution evidence. In addition, we investigate the relative effect of the training set, the length of tweets, and the number of authors on authorship classification accuracy. Finally, we show that diacritics have an insignificant effect on the attribution process and that part-of-speech tags are less effective than character-level and word-level n-grams.
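To make the idea of a profile-based n-gram approach concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes character trigrams, a profile of the 500 most frequent n-grams per author, and the relative-difference dissimilarity commonly used with n-gram author profiles (in the manner of Kešelj et al., 2003); the abstract does not state which of these choices the study made. All function names (`strip_diacritics`, `build_profile`, `dissimilarity`, `attribute`) and the toy tweets are hypothetical. Diacritics are stripped before counting, an optional step that reflects the reported finding that they have little effect.

```python
import re
from collections import Counter

# Arabic diacritic marks (tashkeel), U+064B (fathatan) through U+0652 (sukun).
_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove Arabic short-vowel marks so vocalized and bare spellings match."""
    return _DIACRITICS.sub("", text)


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(texts, n=3, size=500):
    """Pool an author's texts and keep the `size` most frequent n-grams,
    each mapped to its relative frequency."""
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(strip_diacritics(t), n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(size)}


def dissimilarity(p, q):
    """Relative-difference distance between two profiles (0.0 means identical)."""
    score = 0.0
    for g in set(p) | set(q):
        a, b = p.get(g, 0.0), q.get(g, 0.0)
        score += ((a - b) / ((a + b) / 2)) ** 2
    return score


def attribute(text, profiles, n=3, size=500):
    """Return the candidate author whose profile is closest to the text."""
    query = build_profile([text], n, size)
    return min(profiles, key=lambda author: dissimilarity(profiles[author], query))


# Toy usage with made-up tweets (illustrative data only).
profiles = {
    "author_a": build_profile(["صباح الخير يا أصدقاء", "صباح الخير للجميع"]),
    "author_b": build_profile(["مباراة اليوم كانت رائعة", "مباراة الأمس كانت مملة"]),
}
print(attribute("صباح الخير", profiles))
```

In a profile-based setup like this, all of an author's tweets are pooled into one profile, whereas an instance-based classifier would treat each tweet as a separate training example.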


Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 18, Issue 1
March 2019
196 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3292011
Editor: Nianwen Xue, Brandeis University, Waltham, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 November 2018
- Accepted: 1 June 2018
- Revised: 1 March 2018
- Received: 1 October 2015
Author Tags
Authorship attribution
short text
social media
twitter
visualization
Qualifiers
- research-article
- Research
- Refereed