DOI: 10.1145/1281192.1281299
Article

Corroborate and learn facts from the web

Published: 12 August 2007

ABSTRACT

The web contains a great deal of interesting factual information about entities such as celebrities, movies, or products. This paper describes a robust bootstrapping approach that corroborates known facts and learns new facts simultaneously. The approach starts by retrieving relevant pages from a crawl repository for each entity in the seed set. In each learning cycle, the known facts of an entity are first corroborated in a relevant page to find fact mentions. When fact mentions are found, they are taken as examples for learning new facts from the page via HTML pattern discovery. Newly extracted facts are added to the known fact set for the next learning cycle. The bootstrapping process continues until no new facts can be learned. The approach is language-independent. It performed well in an experiment on country facts. Results of a large-scale experiment, with initial facts imported from Wikipedia, are also shown.
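The learning cycle described in the abstract can be sketched roughly as follows. This is a minimal illustration under strong assumptions, not the paper's implementation: each page is assumed to be already parsed into `Mention` records, the `pattern` field is a hypothetical stand-in for whatever HTML pattern discovery produces (here just a tag path string), and the names `Mention`, `bootstrap`, `seed`, and `pages` as well as the toy data are invented for the example.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    pattern: str    # hypothetical stand-in for an HTML pattern, e.g. a tag path
    attribute: str
    value: str


Fact = tuple[str, str]  # (attribute, value) for a single entity


def bootstrap(seed_facts: set[Fact], pages: list[list[Mention]]) -> set[Fact]:
    """Repeat corroborate-then-learn cycles until no new facts appear."""
    known = set(seed_facts)
    while True:
        learned: set[Fact] = set()
        for page in pages:
            # Corroborate: find mentions of already-known facts on this page
            # and remember the patterns they occur in.
            patterns = {m.pattern for m in page
                        if (m.attribute, m.value) in known}
            # Learn: other mentions sharing a corroborated pattern are taken
            # as new facts.
            for m in page:
                fact = (m.attribute, m.value)
                if m.pattern in patterns and fact not in known:
                    learned.add(fact)
        if not learned:
            return known    # fixed point: nothing new can be learned
        known |= learned    # feed new facts into the next cycle


# Toy data for one entity (made up for illustration only).
seed = {("capital", "Paris")}
pages = [
    [Mention("table/tr/td", "capital", "Paris"),
     Mention("table/tr/td", "currency", "Euro"),
     Mention("div/p", "slogan", "Visit now")],
    [Mention("ul/li", "currency", "Euro"),
     Mention("ul/li", "continent", "Europe")],
]
result = bootstrap(seed, pages)
```

In this toy run the second page contributes nothing in the first cycle, because none of its mentions match the seed fact. Once `("currency", "Euro")` has been learned from the first page, it corroborates the `ul/li` pattern on the second page in the next cycle, which is the bootstrapping effect the abstract describes. The `slogan` mention is never extracted because its pattern is never corroborated.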


        Published in

        KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
        August 2007
        1080 pages
        ISBN: 9781595936097
        DOI: 10.1145/1281192

        Copyright © 2007 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Acceptance Rates

        KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%
        Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
