Information Extraction: Distilling structured data from unstructured text

Author: Andrew McCallum, University of Massachusetts, Amherst

Queue, Volume 3, Issue 9, November 2005, pp. 48–57. https://doi.org/10.1145/1105664.1105679

Published: 01 November 2005

Abstract

In 2001 the U.S. Department of Labor was tasked with building a Web site that would help people find continuing education opportunities at community colleges, universities, and organizations across the country. The department wanted its Web site to support fielded Boolean searches over locations, dates, times, prerequisites, instructors, topic areas, and course descriptions. Ultimately it was also interested in mining its new database for patterns and educational trends. This was a major data-integration project, aiming to automatically gather detailed, structured information from tens of thousands of individual institutions every three months.

Published in

Queue Volume 3, Issue 9
Social Computing
November 2005
48 pages
ISSN:1542-7730
EISSN:1542-7749
DOI:10.1145/1105664
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States