ABSTRACT
Most sentiment analysis approaches use a support vector machine (SVM) classifier with binary unigram weights as their baseline. In this paper, we explore whether more sophisticated feature weighting schemes from Information Retrieval can enhance classification accuracy. We show that variants of the classic tf.idf scheme adapted to sentiment analysis provide significant increases in accuracy, especially when they combine a sublinear function for term frequency weights with document frequency smoothing. The techniques are tested on a wide selection of data sets and produce, to our knowledge, the best accuracy reported on them.
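To make the two ingredients named above concrete, the following is a minimal sketch of a tf.idf weighting with a sublinear term frequency and a smoothed document frequency. It is not the paper's exact scheme: the abstract does not give the formulas, so the logarithmic damping `1 + ln(tf)`, the add-one smoothing, and the function names `sublinear_tf`, `smoothed_idf`, and `tfidf_vectors` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def sublinear_tf(count):
    """Dampen a raw term count: 1 + ln(count) for count > 0, else 0 (assumed form)."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def smoothed_idf(num_docs, doc_freq):
    """Inverse document frequency with add-one smoothing on both counts (assumed form)."""
    return math.log((num_docs + 1) / (doc_freq + 1)) + 1.0


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenised document."""
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    num_docs = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: sublinear_tf(count) * smoothed_idf(num_docs, doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus, invented for illustration only.
docs = [["good", "good", "film"], ["bad", "film"]]
vectors = tfidf_vectors(docs)
```

In this toy corpus, "film" occurs in every document and so receives the lowest weight, while the repeated "good" is boosted by less than a factor of two because of the logarithmic damping. The resulting dictionaries could then be passed to a linear SVM in place of binary unigram indicators.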
Index Terms
- A study of information retrieval weighting schemes for sentiment analysis