ABSTRACT
A lightweight method distinguishes articles within Wikipedia that are classes (Novel, Book) from other articles (Three Men in a Boat, Diary of a Pilgrimage). It exploits clues available within the article text and within the categories associated with articles in Wikipedia, and requires no linguistic preprocessing tools such as part-of-speech taggers, named entity recognizers or syntactic parsers. Experimental results show that classes can be identified among Wikipedia articles in multiple languages, with aggregate precision and recall generally above 0.9 and 0.6, respectively.
Index Terms
- Finding Needles in an Encyclopedic Haystack: Detecting Classes Among Wikipedia Articles