DOI: 10.1145/3178876.3186025
research-article

Finding Needles in an Encyclopedic Haystack: Detecting Classes Among Wikipedia Articles

Published: 10 April 2018

ABSTRACT

A lightweight method distinguishes Wikipedia articles that are classes (Novel, Book) from other articles (Three Men in a Boat, Diary of a Pilgrimage). It exploits clues available within the article text and within the categories associated with articles in Wikipedia, and does not require linguistic preprocessing tools such as part-of-speech taggers, named entity recognizers, or syntactic parsers. Experimental results show that classes can be identified among Wikipedia articles in multiple languages, at aggregate precision and recall generally above 0.9 and 0.6, respectively.

Published in

WWW '18: Proceedings of the 2018 World Wide Web Conference
April 2018
2000 pages
ISBN: 9781450356398

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

International World Wide Web Conferences Steering Committee
Republic and Canton of Geneva, Switzerland

Acceptance Rates

WWW '18 Paper Acceptance Rate: 170 of 1,155 submissions, 15%
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
