DOI: 10.1145/1871985.1871993
research-article

Classifying latent user attributes in twitter

Published: 30 October 2010

ABSTRACT

Social media outlets such as Twitter have become an important forum for peer interaction. Thus the ability to classify latent user attributes, including gender, age, regional origin, and political orientation solely from Twitter user language or similar highly informal content has important applications in advertising, personalization, and recommendation. This paper includes a novel investigation of stacked-SVM-based classification algorithms over a rich set of original features, applied to classifying these four user attributes. It also includes extensive analysis of features and approaches that are effective and not effective in classifying user attributes in Twitter-style informal written genres as distinct from the other primarily spoken genres previously studied in the user-property classification literature. Our models, singly and in ensemble, significantly outperform baseline models in all cases. A detailed analysis of model components and features provides an often entertaining insight into distinctive language-usage variation across gender, age, regional origin and political orientation in modern informal communication.
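The abstract names stacked SVM classification but gives no implementation details, so the following is only a minimal sketch of the general stacking idea, not the authors' system. It assumes two invented feature views (`word_view` and `punct_view`), a toy vocabulary, made-up example texts with neutral labels (+1 and -1), and a simple linear SVM trained by subgradient descent on the hinge loss. Each view gets its own base SVM, and a second-level SVM is trained on the base models' held-out decision scores.

```python
import re

# Toy vocabulary: an assumption for illustration, not a feature set from the paper.
VOCAB = ["omg", "lol", "report", "meeting"]

def word_view(text):
    """Hypothetical view 1: counts of a few vocabulary words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [float(tokens.count(w)) for w in VOCAB]

def punct_view(text):
    """Hypothetical view 2: counts of '!' and '.' characters."""
    return [float(text.count("!")), float(text.count("."))]

def train_linear_svm(X, y, epochs=100, lr=0.1, lam=0.01):
    """Linear SVM via subgradient descent on L2-regularized hinge loss; y in {-1, +1}."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]   # regularization shrinkage
            if margin < 1.0:                           # hinge loss is active
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def decision(model, x):
    """Signed score of a linear model (w, b) on feature vector x."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def fit_stacked(texts, y, views, folds=2):
    """Train one base SVM per view, then a meta SVM on held-out base scores."""
    n = len(texts)
    feats = [[view(t) for t in texts] for view in views]
    meta_X = [[0.0] * len(views) for _ in range(n)]
    for v, X in enumerate(feats):
        for f in range(folds):
            train = [i for i in range(n) if i % folds != f]
            model = train_linear_svm([X[i] for i in train], [y[i] for i in train])
            for i in range(f, n, folds):               # score only held-out items
                meta_X[i][v] = decision(model, X[i])
    base = [train_linear_svm(X, y) for X in feats]     # refit base models on all data
    meta = train_linear_svm(meta_X, y)
    return base, meta

def predict_stacked(stack, text, views):
    """Return +1 or -1 by feeding base-model scores to the meta SVM."""
    base, meta = stack
    scores = [decision(m, view(text)) for m, view in zip(base, views)]
    return 1 if decision(meta, scores) >= 0 else -1

# Made-up training texts; +1 and -1 are arbitrary style labels, not user attributes.
TEXTS = [
    "omg so cool!!", "omg lol!!",
    "quarterly report attached.", "meeting report today.",
    "lol so cool!!", "omg cool lol!!",
    "see quarterly meeting notes.", "report for meeting.",
]
LABELS = [1, 1, -1, -1, 1, 1, -1, -1]
VIEWS = [word_view, punct_view]

stack = fit_stacked(TEXTS, LABELS, VIEWS)
print(predict_stacked(stack, "omg lol!!", VIEWS))
print(predict_stacked(stack, "meeting report.", VIEWS))
```

Training the meta SVM on held-out scores rather than on scores from models that saw the same examples is the usual way to keep the second level from simply trusting overfit base models. A real system would use a tested SVM library and far more data.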


    • Published in

      SMUC '10: Proceedings of the 2nd international workshop on Search and mining user-generated contents
      October 2010
      136 pages
      ISBN: 9781450303866
      DOI: 10.1145/1871985

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      SMUC '10 Paper Acceptance Rate: 15 of 25 submissions, 60%
      Overall Acceptance Rate: 15 of 25 submissions, 60%
