ABSTRACT
Entity linking is the task of mapping potentially ambiguous terms in text to their corresponding entities in a knowledge base such as Wikipedia. This is useful for organizing content, extracting structured data from textual documents, and powering machine learning relevance applications such as semantic search, knowledge graph construction, and question answering. Traditionally, this work has focused on well-formed text such as news articles, but common real-world datasets such as messaging, resumes, and short-form social media contain non-grammatical, loosely structured text that adds a new dimension to the problem. This paper presents Pangloss, a production system for entity disambiguation on noisy text. Pangloss combines a probabilistic linear-time key phrase identification algorithm with a semantic similarity engine based on context-dependent document embeddings to achieve results that exceed the state of the art by more than 5% in F1 compared with other research and commercially available systems. In addition, Pangloss uses a local embedded database with a tiered architecture to house its statistics and metadata, which allows rapid disambiguation in streaming contexts and on-device disambiguation in low-memory environments such as mobile phones.
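To make the idea of embedding-based disambiguation concrete, the following is a minimal sketch, not the Pangloss implementation. It assumes each candidate entity and each context word already has a vector, and it stands in a simple average of word vectors for the context-dependent document embedding the abstract mentions. All names (`disambiguate`, `WORD_VECS`, `ENTITY_VECS`) and the toy two-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def disambiguate(context_tokens, candidates, word_vecs, entity_vecs):
    """Return the candidate entity whose vector is closest to the context.

    Assumption: the context embedding is the average of known word vectors,
    a stand-in for a learned document embedding.
    """
    known = [word_vecs[t] for t in context_tokens if t in word_vecs]
    if not known:
        return None  # no usable context, so no decision is made
    ctx = mean_vector(known)
    return max(candidates, key=lambda e: cosine(ctx, entity_vecs[e]))


# Hypothetical toy vectors: axis 0 is "food", axis 1 is "technology".
WORD_VECS = {
    "fruit": [1.0, 0.0],
    "pie": [0.9, 0.1],
    "iphone": [0.0, 1.0],
    "stock": [0.1, 0.9],
}
ENTITY_VECS = {
    "Apple_(fruit)": [1.0, 0.1],
    "Apple_Inc.": [0.1, 1.0],
}

if __name__ == "__main__":
    candidates = ["Apple_(fruit)", "Apple_Inc."]
    print(disambiguate(["apple", "pie", "fruit"], candidates, WORD_VECS, ENTITY_VECS))
    print(disambiguate(["apple", "iphone", "stock"], candidates, WORD_VECS, ENTITY_VECS))
```

In this sketch the ambiguous mention "apple" has no vector of its own, so the decision rests entirely on the surrounding words, which is the situation noisy short-form text tends to create.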
Index Terms
- Pangloss: Fast Entity Linking in Noisy Text Environments