ABSTRACT
People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-ner system doubles F1 score compared with the Stanford NER system. T-ner leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types.
Our NLP tools are available at: http://github.com/aritter/twitter_nlp
- Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In The 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, USA. To appear. Google ScholarDigital Library
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res. Google ScholarDigital Library
- Avrim Blum and Tom M. Mitchell. 1998. Combining labeled and unlabeled sata with co-training. In COLT, pages 92--100. Google ScholarDigital Library
- Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Comput. Linguist. Google ScholarDigital Library
- Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka, Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of the third ACM international conference on Web search and data mining, WSDM '10. Google ScholarDigital Library
- Eugene Charniak, Curtis Hendrickson, Neil Jacobson, and Mike Perkowitz. 1993. Equations for part-of-speech tagging. In AAAI, pages 784--789. Google ScholarDigital Library
- Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Empirical Methods in Natural Language Processing.Google Scholar
- Doug Downey, Matthew Broadhead, and Oren Etzioni. 2007. Locating complex named entities in web text. In Proceedings of the 20th international joint conference on Artifical intelligence. Google ScholarDigital Library
- Doug Downey, Oren Etzioni, and Stephen Soderland. 2010. Analysis of a probabilistic model of redundancy in unsupervised information extraction. Artif. Intell., 174(11):726--748. Google ScholarDigital Library
- Micha Elsner, Eugene Charniak, and Mark Johnson. 2009. Structured generative models for unsupervised named-entity clustering. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09. Google ScholarDigital Library
- Oren Etzioni, Michael Cafarella, Doug Downey, AnaMaria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. Google ScholarDigital Library
- Tim Finin, Will Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating named entities in Twitter data with crowd-sourcing. In Proceedings of the NAACL Workshop on Creating Speech and Text Language Data With Amazon's Mechanical Turk. Association for Computational Linguistics, June. Google ScholarDigital Library
- Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05. Google ScholarDigital Library
- Radu Florian. 2002. Named entity recognition as a house of cards: classifier stacking. In Proceedings of the 6th conference on Natural language learning - Volume 20, COLING-02. Google ScholarDigital Library
- Eric N. Forsythand and Craig H. Martell. 2007. Lexical and discourse analysis of online chat dialog. In Proceedings of the International Conference on Semantic Computing. Google ScholarDigital Library
- Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for twitter: Annotation, features, and experiments. In ACL. Google ScholarDigital Library
- Joshua T. Goodman. 2001. A bit of progress in language modeling. Technical report, Microsoft Research.Google Scholar
- Stephan Gouws, Donald Metzler, Congxing Cai, and Eduard Hovy. 2011. Contextual bearing on linguistic variation in social media. In ACL Workshop on Language in Social Media, Portland, Oregon, USA. To appear. Google ScholarDigital Library
- T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, April.Google Scholar
- Mark Hachman. 2011. Humanity's tweets: Just 20 terabytes. In PCMAG.COM.Google Scholar
- Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In The 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, USA. To appear. Google ScholarDigital Library
- Catherine Kobus, François Yvon, and Géraldine Damnati. 2008. Normalizing sms: are two metaphors better than one? In COLING, pages 441--448. Google ScholarDigital Library
- Zornitsa Kozareva and Eduard H. Hovy. 2010. Not all seeds are equal: Measuring the quality of text mining seeds. In HLT-NAACL. Google ScholarDigital Library
- John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282--289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. 2011. Recognizing named entities in tweets. In ACL. Google ScholarDigital Library
- Brian Locke and James Martin. 2009. Named entity recognition: Adapting to microblogging. In Senior Thesis, University of Colorado.Google Scholar
- Mitchell P. Marcus, Beatrice Santorini, and Mary A. Marcinkiewicz. 1994. Building a large annotated corpus of english: The penn treebank. Computational Linguistics. Google ScholarDigital Library
- Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. In http://mallet.cs.umass.edu.Google Scholar
- Tara McIntosh. 2010. Unsupervised discovery of negative categories in lexicon bootstrapping. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10. Google ScholarDigital Library
- Einat Minkov, Richard C. Wang, and William W. Cohen. 2005. Extracting personal names from email: applying named entity recognition to informal text. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 443--450, Morristown, NJ, USA. Association for Computational Linguistics. Google ScholarDigital Library
- Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP 2009. Google ScholarDigital Library
- Christoph Müller and Michael Strube. 2006. Multi-level annotation of linguistic data with MMAX2. In Sabine Braun, Kurt Kohn, and Joybrato Mukherjee, editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pages 197--214. Peter Lang, Frankfurt A. M., Germany.Google Scholar
- Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled lda: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, EMNLP '09, pages 248--256, Morristown, NJ, USA. Association for Computational Linguistics. Google ScholarDigital Library
- Christina Sauper, Aria Haghighi, and Regina Barzilay. 2010. Incorporating content structure into text analysis applications. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 377--387, Morristown, NJ, USA. Association for Computational Linguistics. Google ScholarDigital Library
- Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03. Google ScholarDigital Library
- Sameer Singh, Dustin Hillard, and Chris Leggetter. 2010. Minimally-supervised extraction of entities from text advertisements. In Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). Google ScholarDigital Library
- Charles Sutton. 2004. Collective segmentation and labeling of distant entities in information extraction.Google Scholar
- Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1473--1481. Association for Computational Linguistics. Google ScholarDigital Library
- Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the conll-2000 shared task: chunking. In Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7, ConLL '00. Google ScholarDigital Library
- Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03. Google ScholarDigital Library
- Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10. Google ScholarDigital Library
- Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. Google ScholarDigital Library
- David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, ACL '95. Google ScholarDigital Library
- Named entity recognition in tweets: an experimental study
Recommendations
Named entity recognition for tweets
Special section on twitter and microblogging services, social recommender systems, and CAMRa2010: Movie recommendation in contextTwo main challenges of Named Entity Recognition (NER) for tweets are the insufficient information in a tweet and the lack of training data. We propose a novel method consisting of three core elements: (1) normalization of tweets; (2) combination of a K-...
Joint inference of named entity recognition and normalization for tweets
ACL '12: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1Tweets represent a critical source of fresh information, in which named entities occur frequently with rich variations. We study the problem of named entity normalization (NEN) for tweets. Two main challenges are the errors propagated from named entity ...
Named Entity System for Tweets in Hindi Language
Due to the growing need of smart-health applications in Hindi language, there is a rapid demand for health-related Named Entity Recognition NER system for Hindi. For the purpose of the same, this research considers Twitter social network to extract ...
Comments