Language independent authorship attribution using character level language models

Authors:
Fuchun Peng, University of Waterloo, Canada
Dale Schuurmans, University of Waterloo, Canada
Shaojun Wang, University of Waterloo, Canada
Vlado Keselj, Dalhousie University, Canada

EACL '03: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1, April 2003, Pages 267–274, https://doi.org/10.3115/1067807.1067843

Published: 12 April 2003

ABSTRACT

We present a method for computer-assisted authorship attribution based on character-level n-gram language models. Our approach rests on simple information-theoretic principles and achieves improved performance across a variety of languages without requiring extensive pre-processing or feature selection. To demonstrate its effectiveness and language independence, we present experimental results on Greek, English, and Chinese data. We show that our approach achieves state-of-the-art performance in each of these cases. In particular, we obtain an 18% accuracy improvement over the best published results for a Greek data set, while using a far simpler technique than previous investigations.
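
The general idea of attributing a text with per-author character-level n-gram models can be illustrated with a minimal sketch. The abstract does not state the n-gram order, the smoothing scheme, or the decision rule, so everything below is an assumption made for illustration: trigram order, add-one smoothing over an alphabet shared by all authors, a hypothetical padding symbol, and selection of the author whose model gives the lowest cross-entropy (bits per character). The names `CharNGramModel`, `train_models`, and `attribute` are likewise hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

PAD = "\x02"  # hypothetical padding symbol, assumed not to occur in real text


class CharNGramModel:
    """Character n-gram model with add-one smoothing (an assumption of this sketch)."""

    def __init__(self, n, vocab_size):
        self.n = n
        self.vocab_size = vocab_size      # shared across authors so scores are comparable
        self.ngram_counts = Counter()     # (context, next_char) -> count
        self.context_counts = Counter()   # context -> count

    def _pad(self, text):
        # Pad the start so the first characters also have a full-length context.
        return PAD * (self.n - 1) + text

    def train(self, text):
        padded = self._pad(text)
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            self.ngram_counts[(context, padded[i])] += 1
            self.context_counts[context] += 1

    def cross_entropy(self, text):
        """Average negative log2 probability per character (bits per character)."""
        padded = self._pad(text)
        total_bits = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            count = self.ngram_counts[(context, padded[i])]
            prob = (count + 1) / (self.context_counts[context] + self.vocab_size)
            total_bits -= math.log2(prob)
        return total_bits / max(len(text), 1)


def train_models(samples, n=3):
    """Train one model per author; `samples` maps author name -> training text."""
    alphabet = set().union(*(set(text) for text in samples.values()))
    vocab_size = len(alphabet) + 1  # +1 reserves probability mass for unseen characters
    models = {}
    for author, text in samples.items():
        model = CharNGramModel(n, vocab_size)
        model.train(text)
        models[author] = model
    return models


def attribute(models, text):
    """Return the author whose model assigns the text the lowest cross-entropy."""
    return min(models, key=lambda author: models[author].cross_entropy(text))


# Toy usage with made-up training texts.
models = train_models({
    "alice": "the cat sat on the mat. the cat ate the rat.",
    "bob": "zzz qqq xxx zzz qqq xxx zzz",
})
print(attribute(models, "the cat sat"))
```

Because the model operates on raw characters, the same code applies unchanged to text in any script, which is the property the abstract refers to as language independence. The smoothing alphabet is shared across authors in this sketch so that an author with a small training alphabet is not favored on unseen text.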

Published in

EACL '03: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
April 2003, 394 pages
ISBN: 1333567890
Program Chairs: Ann Copestake (United Kingdom), Jan Hajic (Czech Republic)
Publisher: Association for Computational Linguistics, United States
Overall Acceptance Rate: 100 of 360 submissions, 28%