ABSTRACT
We present a method for computer-assisted authorship attribution based on character-level n-gram language models. Our approach rests on simple information-theoretic principles and achieves improved performance across a variety of languages without requiring extensive pre-processing or feature selection. To demonstrate the effectiveness and language independence of our approach, we present experimental results on Greek, English, and Chinese data. We show that our approach achieves state-of-the-art performance in each of these cases. In particular, we obtain an 18% accuracy improvement over the best published results for a Greek data set, while using a far simpler technique than previous investigations.