
Corroborate and learn facts from the web

Authors:
Shubin Zhao, Google Inc., New York, NY
Jonathan Betz, Google Inc., New York, NY

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2007, Pages 995–1003. https://doi.org/10.1145/1281192.1281299

Published: 12 August 2007

ABSTRACT

The web contains a wealth of factual information about entities such as celebrities, movies, and products. This paper describes a robust bootstrapping approach that corroborates known facts and learns new facts simultaneously. The approach starts by retrieving relevant pages from a crawl repository for each entity in the seed set. In each learning cycle, the known facts of an entity are first corroborated in a relevant page to find fact mentions. When fact mentions are found, they serve as examples for learning new facts from the page via HTML pattern discovery. The newly extracted facts are added to the known fact set for the next learning cycle. The bootstrapping process continues until no new facts can be learned. The approach is language-independent. It demonstrated good performance in an experiment on country facts. Results of a large-scale experiment, with initial facts imported from Wikipedia, are also presented.
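
The learning cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes facts are simple attribute/value pairs, that a "fact mention" is an attribute followed by markup and then its value, and that an "HTML pattern" is just the literal tags around that mention. The function names (`learn_patterns`, `apply_pattern`, `bootstrap`) and the fictional "Freedonia" pages are hypothetical and exist only for illustration.

```python
import re

TAG = r"<[^<>]+>"  # one HTML tag, e.g. <td> or </td>


def learn_patterns(page, known):
    """Corroborate known facts in a page and return the markup around each mention.

    Assumption: a mention looks like  <tag>attribute<tags...>value<tag>.
    """
    patterns = set()
    for attr, value in known.items():
        mention = re.search(
            "(" + TAG + ")" + re.escape(attr)
            + r"((?:\s*" + TAG + r"\s*)+)" + re.escape(value)
            + "(" + TAG + ")",
            page,
        )
        if mention:
            patterns.add(mention.groups())  # (prefix, middle, suffix)
    return patterns


def apply_pattern(page, pattern):
    """Extract every attribute/value pair that sits in the same markup."""
    prefix, middle, suffix = pattern
    regex = (re.escape(prefix) + r"([^<>]+)" + re.escape(middle)
             + r"([^<>]+)" + re.escape(suffix))
    return {a.strip(): v.strip() for a, v in re.findall(regex, page)}


def bootstrap(pages, seed):
    """Repeat learning cycles until a full pass yields no new facts."""
    known = dict(seed)
    while True:
        new = {}
        for page in pages:
            for pattern in learn_patterns(page, known):
                for attr, value in apply_pattern(page, pattern).items():
                    if attr not in known:
                        new[attr] = value
        if not new:
            return known
        known.update(new)  # new facts become examples in the next cycle


# Illustrative pages about a fictional country (made-up data).
page1 = ("<table><tr><td>Capital</td><td>Fredville</td></tr>"
         "<tr><td>Currency</td><td>Fredmark</td></tr></table>")
page2 = ("<dl><dt>Currency</dt><dd>Fredmark</dd>"
         "<dt>Anthem</dt><dd>Hail Freedonia</dd></dl>")

facts = bootstrap([page1, page2], {"Capital": "Fredville"})
print(facts)
```

In this toy run, the seed fact is corroborated only in the first page, which yields the currency; the currency is then corroborated in the second page in the next cycle, which yields the anthem. A realistic system would also need to score patterns and filter noisy extractions, which this sketch omits.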

Index Terms

Corroborate and learn facts from the web
1. Computing methodologies
  1. Machine learning
    1. Learning settings
2. Information systems
  1. Information retrieval

Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN:9781595936097
DOI:10.1145/1281192
General Chair: Pavel Berkhin (Yahoo!, USA)
Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)
Copyright © 2007 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
bootstrapping
information extraction
web mining
Acceptance Rates
KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%