DOI: 10.1145/3219819.3219899
research-article

Pangloss: Fast Entity Linking in Noisy Text Environments

Published: 19 July 2018

ABSTRACT

Entity linking is the task of mapping potentially ambiguous terms in text to their corresponding entities in a knowledge base such as Wikipedia. It is useful for organizing content, extracting structured data from textual documents, and in machine learning relevance applications such as semantic search, knowledge graph construction, and question answering. Traditionally, this work has focused on well-formed text such as news articles, but in common real-world datasets such as messaging, resumes, or short-form social media, non-grammatical, loosely structured text adds a new dimension to the problem. This paper presents Pangloss, a production system for entity disambiguation on noisy text. Pangloss combines a probabilistic linear-time key phrase identification algorithm with a semantic similarity engine based on context-dependent document embeddings to achieve better than state-of-the-art results (an improvement of more than 5% in F1) compared to other research or commercially available systems. In addition, Pangloss uses a local embedded database with a tiered architecture to house its statistics and metadata, which allows rapid disambiguation in streaming contexts and on-device disambiguation in low-memory environments such as mobile phones.
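To make the disambiguation step described in the abstract concrete, here is a minimal toy sketch in Python. It is not the Pangloss implementation. The knowledge base `TOY_KB`, the function names, and the 0.3/0.7 weighting are all hypothetical, and a plain bag-of-words cosine similarity stands in for the context-dependent document embeddings mentioned above. It only illustrates the general idea of ranking candidate entities for a mention by combining a prior with similarity to the surrounding context.

```python
import math
from collections import Counter

# Hypothetical toy knowledge base (an assumption for illustration only):
# surface form -> candidate entities, each with a prior probability and
# a short textual description.
TOY_KB = {
    "python": [
        {"entity": "Python_(programming_language)", "prior": 0.6,
         "description": "programming language code software developer"},
        {"entity": "Python_(snake)", "prior": 0.4,
         "description": "snake reptile animal species"},
    ],
}


def bag_of_words(text):
    """Lowercased whitespace tokens as a term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def link(mention, context, kb=TOY_KB):
    """Return the best-scoring entity for `mention`, or None if unknown.

    The score mixes the candidate's prior with its similarity to the
    context; the 0.3 / 0.7 weights are arbitrary assumptions.
    """
    candidates = kb.get(mention.lower(), [])
    if not candidates:
        return None
    ctx = bag_of_words(context)

    def score(candidate):
        sim = cosine(ctx, bag_of_words(candidate["description"]))
        return 0.3 * candidate["prior"] + 0.7 * sim

    return max(candidates, key=score)["entity"]


if __name__ == "__main__":
    print(link("python", "looking for a python developer to write code"))
    print(link("python", "saw a huge python snake reptile at the zoo"))
```

When the context shares no words with any candidate description, the sketch falls back to the candidate with the highest prior. A real system would replace the bag-of-words vectors with learned embeddings and the in-memory dictionary with a persistent store.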

      • Published in

        KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
        July 2018
        2925 pages
        ISBN: 9781450355520
        DOI: 10.1145/3219819

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%
        Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
