In the past few years, the World Wide Web has emerged as an important source of data, much of it in the form of unstructured text. This thesis describes an extensible model for information extraction that takes advantage of the unique characteristics of Web text and leverages existing search engine technology to ensure the quality of the extracted information. The key features of our approach are the use of lexico-syntactic patterns, Web-scale statistics, and unsupervised or semi-supervised learning methods. Our information extraction model has been instantiated and extended to solve a diverse set of information extraction tasks: subclass and related class extraction, relation property learning, the acquisition of salient product features and corresponding user opinions from customer reviews, and, finally, the mining of commonsense information from the Web for the benefit of integrated AI systems.
Index Terms
- Information extraction from unstructured web text