Named entity recognition in tweets: an experimental study

Published: 27 July 2011

ABSTRACT

People tweet more than 100 million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by rebuilding the NLP pipeline, beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-ner system doubles F1 score compared with the Stanford NER system. T-ner leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types.

Our NLP tools are available at: http://github.com/aritter/twitter_nlp

Published in

EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing
July 2011
1647 pages
ISBN: 9781937284114

Publisher: Association for Computational Linguistics, United States

Overall acceptance rate: 73 of 234 submissions (31%)
