Snowball: extracting relations from large plain-text collections

Authors:
Eugene Agichtein, Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY
Luis Gravano, Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY

DL '00: Proceedings of the fifth ACM conference on Digital libraries, June 2000, Pages 85–94
https://doi.org/10.1145/336597.336644

Published: 01 June 2000

ABSTRACT

Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, which in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
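The loop described in the abstract (seed tuples yield patterns, patterns yield new tuples, and only reliable patterns survive each iteration) can be illustrated with a toy sketch. This is not the Snowball implementation: the abstract does not specify how patterns are represented or scored, so everything below is an assumption made for illustration. A pattern here is simply the text between an organization and a location, entities are approximated as single capitalized words, confidence is the fraction of a pattern's extractions that agree with already-known tuples, and the company names, sentences, function names, and the `min_conf` threshold are all made up.

```python
import re


def find_patterns(sentences, known):
    """Collect the text that appears between the two elements of known tuples."""
    patterns = set()
    for org, loc in known.items():
        for sentence in sentences:
            m = re.search(re.escape(org) + r"(.+?)" + re.escape(loc), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_pattern(sentences, middle):
    """Extract (org, loc) pairs: two capitalized words joined by `middle`."""
    rx = re.compile(r"([A-Z]\w+)" + re.escape(middle) + r"([A-Z]\w+)")
    return {m.groups() for s in sentences for m in rx.finditer(s)}


def pattern_confidence(extracted, known):
    """Assumed score: share of extractions for known orgs that match the known location."""
    checked = [(org, loc) for org, loc in extracted if org in known]
    if not checked:
        return 0.0
    agree = sum(1 for org, loc in checked if known[org] == loc)
    return agree / len(checked)


def bootstrap(sentences, seeds, iterations=2, min_conf=0.6):
    """Grow an org -> location table from a few seed tuples."""
    known = dict(seeds)
    for _ in range(iterations):
        accepted = {}
        for middle in sorted(find_patterns(sentences, known)):
            extracted = apply_pattern(sentences, middle)
            # Keep only tuples produced by patterns that look reliable.
            if pattern_confidence(extracted, known) >= min_conf:
                for org, loc in sorted(extracted):
                    accepted.setdefault(org, loc)
        for org, loc in accepted.items():
            known.setdefault(org, loc)
    return known


# Made-up example data with fictional organizations.
SENTENCES = [
    "Acme is based in Springfield and employs thousands.",
    "Globex is based in Austin according to filings.",
    "Initech is based in Denver.",
    "Globex, headquartered in Austin, reported earnings.",
    "Hooli, headquartered in Boston, released a product.",
    "Initech left Denver last year.",
    "Globex left Chicago years ago.",
]
SEEDS = [("Acme", "Springfield")]

if __name__ == "__main__":
    for org, loc in sorted(bootstrap(SENTENCES, SEEDS).items()):
        print(org, "->", loc)
```

In this toy run a single seed is enough to learn the "is based in" pattern in the first iteration, which then leads to the "headquartered in" pattern in the second. The "left" pattern is also discovered, but it is discarded because one of its extractions contradicts a tuple that is already known.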


Index Terms

1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
  2. Information systems applications
    1. Digital libraries and archives


Published in
DL '00: Proceedings of the fifth ACM conference on Digital libraries
June 2000
294 pages
ISBN:158113231X
DOI:10.1145/336597
Chairmen:
Peter J. Nürnberg, Aalborg Univ., Esbjerg, Denmark
David L. Hicks, Aalborg Univ., Esbjerg, Denmark
Richard Furuta, Texas A & M Univ., College Station
Copyright © 2000 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
DL '00 Paper Acceptance Rate: 44 of 132 submissions, 33%
Overall Acceptance Rate: 95 of 346 submissions, 27%
Article Metrics
- Total Citations: 720
- Total Downloads: 3,662
- Downloads (Last 12 months): 306
- Downloads (Last 6 weeks): 26