Information extraction from unstructured document

January 2004

Author:
Liping Ma

Publisher:

University of New South Wales
P.O. Box 1 Kensington, NSW 2033
Australia

Order Number: AAI0807038

Abstract

Online information resources that are based on a structured data collection provide a rich set of operators for accessing their content. However, the vast majority of online information sources are based on collections of documents in unstructured form, and are not amenable to searching or navigation other than by the relatively unsophisticated methods of keyword-based search and document-at-a-time retrieval. Manually creating large structured collections from large sets of unstructured documents is not feasible. Thus, there is a need to develop tools which can automate (as much as possible) the process of extracting the information from a wide variety of unstructured documents into structured form.

Over the past decade, there has been intense research toward achieving the goal of effective information extraction. However, most research to date has traded off the level of automation against the level of structure in the documents. Some systems have focused on achieving a high level of automation but require well-structured input texts. Other systems require manual interaction as part of the extraction process, but work with relatively unstructured input texts. Still other systems have taken a middle road, with medium levels of automation on a reasonable range of documents.

The main contribution of this dissertation is to propose a novel approach to the problem of information extraction that fills a gap in the space of solutions to this problem: we make minimal assumptions about the structure or format of input documents, and we require minimal manual effort from users. The key idea behind our approach is that, instead of designing extraction rules manually, we incorporate machine learning algorithms into our system, using multiple different learners to handle the different tasks involved in information extraction: feature selection, region identification, text classification, synopsis extraction, pattern discovery and pattern matching.

In this dissertation, we describe complete solutions, including architectures, algorithms and implementations, which address three of the most important problems in information extraction today: document decomposition, text classification and data extraction. Our solutions achieve information extraction effectiveness that is as good as or better than that of other related systems.

Cited By

Ma L and Shepherd J. Information extraction using two-phase pattern discovery. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 534-535.

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information retrieval
