ABSTRACT
The web contains a wealth of factual information about entities such as celebrities, movies, and products. This paper describes a robust bootstrapping approach that corroborates known facts and learns new facts simultaneously. The approach starts by retrieving relevant pages from a crawl repository for each entity in the seed set. In each learning cycle, the known facts of an entity are first corroborated against a relevant page to find fact mentions. The fact mentions found are then used as examples for learning new facts from the page via HTML pattern discovery. Newly extracted facts are added to the known fact set for the next learning cycle. The bootstrapping process continues until no new facts can be learned. The approach is language-independent. It performed well in an experiment on country facts. Results of a large-scale experiment, with initial facts imported from Wikipedia, are also presented.
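The learning cycle described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single entity whose relevant pages have already been retrieved, represents a fact as an (attribute, value) pair, and substitutes a deliberately simple form of HTML pattern discovery (the literal markup surrounding one corroborated fact mention). The names `learn_pattern`, `apply_pattern`, and `bootstrap` are hypothetical.

```python
import re

TAG = r"<[^<>]+>"  # a single HTML tag


def learn_pattern(page, attr, value):
    """Corroborate a known fact in a page.

    Returns the (prefix, middle, suffix) markup around the fact mention,
    or None if the page does not mention the fact in this simple layout.
    """
    m = re.search(
        "(" + TAG + ")" + re.escape(attr)
        + r"((?:\s*" + TAG + r"\s*)+)" + re.escape(value)
        + "(" + TAG + ")",
        page,
    )
    return m.groups() if m else None


def apply_pattern(page, pattern):
    """Extract every (attribute, value) pair laid out with the same markup."""
    prefix, middle, suffix = pattern
    regex = (re.escape(prefix) + r"([^<>]+)" + re.escape(middle)
             + r"([^<>]+)" + re.escape(suffix))
    return {(a.strip(), v.strip()) for a, v in re.findall(regex, page)}


def bootstrap(pages, seed_facts):
    """Repeat learning cycles until a cycle yields no new facts."""
    known = set(seed_facts)
    while True:
        new = set()
        for page in pages:
            for attr, value in sorted(known):
                pattern = learn_pattern(page, attr, value)
                if pattern:  # fact mention found: use it as an example
                    new |= apply_pattern(page, pattern) - known
        if not new:
            return known
        known |= new  # new facts become known facts for the next cycle


# Made-up pages for illustration only (assumption, not data from the paper).
example_pages = [
    "<table><tr><td>Capital</td><td>Paris</td></tr>"
    "<tr><td>Currency</td><td>Euro</td></tr></table>",
    "<dl><dt>Currency</dt><dd>Euro</dd>"
    "<dt>Population</dt><dd>68 million</dd></dl>",
]
print(sorted(bootstrap(example_pages, {("Capital", "Paris")})))
```

In this toy run, the seed fact is corroborated only in the first page, which yields the currency fact. In the next cycle that fact is corroborated in the second page, whose different markup then yields the population fact. This shows how facts learned in one cycle unlock pages that the seed set alone could not use.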