ABSTRACT
Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, which in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
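The loop described in the abstract (seed examples generate patterns, patterns extract new tuples, and only reliable patterns survive each iteration) can be illustrated with a toy sketch. This is not the Snowball implementation. It assumes, purely for illustration, that entities are already tagged, that a pattern is just the literal text between an organization and a location, that each organization has a single location, and that a pattern's reliability is the fraction of its matches that agree with already-known tuples. The names `OCCURRENCES`, `SEEDS`, `pattern_confidence`, and `bootstrap`, the sample data, and the 0.8 threshold are all hypothetical. The sketch scores patterns only, whereas the abstract says Snowball also evaluates tuples.

```python
# Toy, pre-tagged corpus: (organization, text between the entities, location).
# All data here is made up for illustration.
OCCURRENCES = [
    ("Microsoft", "is based in", "Redmond"),
    ("Exxon", "is based in", "Irving"),
    ("Boeing", "is based in", "Seattle"),
    ("Exxon", "opened an office in", "Irving"),
    ("Microsoft", "opened an office in", "London"),
    ("Boeing", "opened an office in", "Tokyo"),
    ("Intel", "opened an office in", "Haifa"),
]

# A handful of user-supplied example tuples.
SEEDS = {("Microsoft", "Redmond"), ("Exxon", "Irving")}


def pattern_confidence(pattern, occurrences, known):
    """Fraction of a pattern's matches that agree with known tuples.

    Assumed scoring rule: a match counts as positive if it reproduces a
    known tuple, negative if it pairs a known organization with a
    different location, and is ignored if the organization is unknown.
    """
    location_of = dict(known)  # assumes one location per organization
    positive = negative = 0
    for org, middle, loc in occurrences:
        if middle != pattern or org not in location_of:
            continue
        if location_of[org] == loc:
            positive += 1
        else:
            negative += 1
    total = positive + negative
    return positive / total if total else 0.0


def bootstrap(occurrences, seeds, threshold=0.8, max_iterations=5):
    """Iteratively grow the tuple set, keeping only reliable patterns."""
    known = set(seeds)
    for _ in range(max_iterations):
        # Generate candidate patterns from contexts of known tuples.
        candidates = {m for o, m, l in occurrences if (o, l) in known}
        # Keep only patterns whose confidence clears the threshold.
        reliable = {
            p for p in candidates
            if pattern_confidence(p, occurrences, known) >= threshold
        }
        # Extract tuples with the surviving patterns.
        extracted = {(o, l) for o, m, l in occurrences if m in reliable}
        if extracted <= known:  # nothing new was found; stop
            break
        known |= extracted
    return known


if __name__ == "__main__":
    for org, loc in sorted(bootstrap(OCCURRENCES, SEEDS)):
        print(org, loc)
```

On this toy data the pattern "opened an office in" is generated from a seed context but contradicts a known tuple, so it falls below the threshold and is discarded, while "is based in" survives and contributes the new tuple (Boeing, Seattle).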
Index Terms
- Snowball: extracting relations from large plain-text collections