Information extraction from unstructured document
Publisher:
  • University of New South Wales
  • P.O. Box 1 Kensington, NSW 2033
  • Australia
Order Number: AAI0807038
Pages:
1
Abstract

Online information resources that are based on a structured data collection provide a rich set of operators for accessing their content. However, the vast majority of online information sources are based on collections of documents in unstructured form, and are not amenable to searching or navigation other than by the relatively unsophisticated methods of keyword-based search and document-at-a-time retrieval. Manually creating large structured collections from large sets of unstructured documents is not feasible. Thus, there is a need to develop tools that can automate (as much as possible) the process of extracting information from a wide variety of unstructured documents into structured form.

Over the past decade, there has been intense research toward achieving the goal of effective information extraction. However, most research to date has traded off the level of automation against the level of structuredness in the documents. Some systems have focused on achieving a high level of automation but require well-structured input texts. Other systems require manual interaction as part of the extraction process, but work with relatively unstructured input texts. Still other systems have taken a middle road, with medium levels of automation on a reasonable range of documents.

The main contribution of this dissertation is to propose a novel approach to the problem of information extraction that fills a gap in the space of solutions to this problem: we make minimal assumptions about the structure or format of input documents, and we require minimal manual effort from users. The key idea behind our approach is that, instead of designing extraction rules manually, we incorporate machine learning algorithms into our system, using multiple different learners to handle the different tasks involved in information extraction: feature selection, region identification, text classification, synopsis extraction, pattern discovery and pattern matching.

In this dissertation, we describe complete solutions, including architectures, algorithms and implementations, that address three of the most important problems in information extraction today: document decomposition, text classification and data extraction. Our solutions achieve information extraction effectiveness that is as good as or better than that of other related systems.

Contributors
  • UNSW Sydney
