ABSTRACT
Ambiguity of entity mentions and concept references is a challenge to mining text beyond surface-level keywords. We describe an effective method of disambiguating surface forms and resolving them to Wikipedia entities and concepts. Our method employs an extensive set of features mined from Wikipedia and other large data sources, and combines the features using a machine learning approach with automatically generated training data. Based on a manually labeled evaluation set containing over 1000 news articles, our resolution model has 85% precision and 87.8% recall. The performance is significantly better than three baselines based on traditional context similarities or sense commonness measurements. Our method can be applied to other languages and scales well to new entities and concepts.
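As an illustration of the kind of sense commonness baseline mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes commonness means the fraction of times a surface form links to a given Wikipedia article, and it resolves each surface form to its most frequent target while ignoring context. The link pairs and the names `ANCHOR_LINKS`, `build_commonness`, and `resolve_by_commonness` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy (surface form, linked article) pairs standing in for
# anchor-text links harvested from Wikipedia. These are not data from the paper.
ANCHOR_LINKS = [
    ("jaguar", "Jaguar Cars"),
    ("jaguar", "Jaguar Cars"),
    ("jaguar", "Jaguar (animal)"),
    ("apple", "Apple Inc."),
]


def build_commonness(links):
    """Count how often each surface form links to each target article."""
    counts = defaultdict(Counter)
    for surface, target in links:
        counts[surface.lower()][target] += 1
    return counts


def resolve_by_commonness(surface, counts):
    """Return (most common target, its share of links), or None if unseen."""
    senses = counts.get(surface.lower())
    if not senses:
        return None
    target, n = senses.most_common(1)[0]
    return target, n / sum(senses.values())


if __name__ == "__main__":
    counts = build_commonness(ANCHOR_LINKS)
    print(resolve_by_commonness("Jaguar", counts))
```

A baseline of this kind always picks the same article for a given surface form, which is why a method that also weighs context features can be expected to improve on it.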
- Resolving surface forms to Wikipedia topics