Abstract
Law enforcement agencies have difficulty tracing the true identity of offenders in cybercrime investigations. Most offenders mask their true identity, impersonate people of high authority, or use identity deception and obfuscation tactics to avoid detection and tracing. To address the problem of anonymity, authorship analysis is used to identify individuals by their writing styles without knowing their actual identities. Most authorship studies focus on English because of its widespread use on the Internet, but recent cyber-attacks such as the distribution of Stuxnet indicate that Internet crimes are not limited to a particular community, language, culture, ideology, or ethnicity. To investigate cybercrime effectively and to address the problem of anonymity in online communication, there is a pressing need to study authorship analysis in other languages such as Arabic, Chinese, and Turkish. Arabic, the focus of this study, is the fourth most widely used language on the Internet. This study investigates the authorship of Arabic text, especially very short texts such as Twitter posts. We benchmark the performance of a profile-based approach that uses n-grams as features and compare it with state-of-the-art instance-based classification techniques. We then adapt an event-visualization tool that was developed for English so that it accommodates both Arabic and English, and we use it to visualize the attribution evidence. In addition, we investigate the relative effect of the training set, the length of tweets, and the number of authors on authorship classification accuracy. Finally, we show that diacritics have an insignificant effect on the attribution process and that part-of-speech tags are less effective than character-level and word-level n-grams.
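To make the profile-based idea concrete, the following is a minimal sketch rather than the study's implementation. The abstract does not state which similarity measure or parameters are used, so the sketch assumes a simple profile-intersection score over character n-grams. The function names, the n-gram length of 3, and the profile size of 500 are illustrative assumptions. The optional `strip_diacritics` helper shows one way diacritics could be removed before profiling, using Unicode character categories.

```python
from collections import Counter
import unicodedata


def strip_diacritics(text):
    """Drop combining marks (Unicode category Mn), which include Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def char_ngrams(text, n):
    """Return the list of overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(texts, n=3, size=500):
    """Profile: the set of the `size` most frequent n-grams across an author's texts."""
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t, n))
    return {gram for gram, _ in counts.most_common(size)}


def attribute(unknown, profiles, n=3, size=500):
    """Return the candidate whose profile shares the most n-grams with the unknown text.

    Assumed scoring rule: size of the set intersection between the two profiles.
    """
    target = build_profile([unknown], n, size)
    return max(profiles, key=lambda author: len(profiles[author] & target))


if __name__ == "__main__":
    # Toy data only; real use would build one profile per candidate from many tweets.
    profiles = {
        "author_a": build_profile(["hello world hello"]),
        "author_b": build_profile(["zzz yyy xxx"]),
    }
    print(attribute("hello there", profiles))
```

In a profile-based method, all training texts of a candidate are merged into one representation, whereas instance-based classifiers treat each text as a separate training example. Because Python strings are Unicode, the same functions apply unchanged to Arabic tweets.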
Index Terms
- Arabic Authorship Attribution: An Extensive Study on Twitter Posts