DOI: 10.1145/336597.336644

Snowball: extracting relations from large plain-text collections

Published: 01 June 2000

ABSTRACT

Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, which in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
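The iterative loop described in the abstract (seed tuples yield patterns, patterns yield new tuples, and only reliable patterns survive each round) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the corpus, the seed tuples, the regular expression that stands in for entity recognition, the use of the exact text between two entities as a "pattern", the rule that each organization has a single location, and the confidence score (the fraction of a pattern's matches on already-known organizations that agree with the known location). The actual Snowball system is more elaborate than this.

```python
import re

# Toy data, invented for this sketch (not from the paper's collection).
CORPUS = [
    "Microsoft is headquartered in Redmond.",
    "Exxon is headquartered in Irving.",
    "Intel is headquartered in Santa Clara.",
    "Exxon met investors in Irving.",
    "Microsoft met investors in Boston.",
    "Boeing met investors in London.",
    "Intel is based in Santa Clara.",
    "Boeing is based in Seattle.",
]
SEEDS = {"Microsoft": "Redmond", "Exxon": "Irving"}

# Crude stand-in for entity tagging: runs of capitalized words.
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
# Two entities separated by lowercase words; the middle text is the "pattern".
PAIR = re.compile(rf"({ENTITY})( (?:[a-z]+ )+)({ENTITY})")


def occurrences(corpus):
    """Yield (organization, middle_context, location) candidates."""
    for sentence in corpus:
        for m in PAIR.finditer(sentence):
            yield m.group(1), m.group(2), m.group(3)


def bootstrap(corpus, seeds, min_conf=0.8, iterations=3):
    """Grow a table of (organization -> location) from a few seeds."""
    known = dict(seeds)  # assumes one location per organization
    occs = list(occurrences(corpus))
    for _ in range(iterations):
        # 1. Generate patterns from contexts in which known tuples occur.
        patterns = {mid for org, mid, loc in occs if known.get(org) == loc}
        # 2. Keep only patterns that mostly agree with what is already known.
        reliable = set()
        for p in patterns:
            hits = [(o, l) for o, mid, l in occs if mid == p and o in known]
            positives = sum(1 for o, l in hits if known[o] == l)
            if positives / len(hits) >= min_conf:
                reliable.add(p)
        # 3. Extract new tuples with the reliable patterns.
        new = {o: l for o, mid, l in occs if mid in reliable and o not in known}
        if not new:
            break
        known.update(new)
    return known


table = bootstrap(CORPUS, SEEDS)
```

In this toy run the context " met investors in " is generated from a seed occurrence but contradicts a known tuple (Microsoft, Redmond), so it falls below the threshold and never contributes tuples, while " is based in " only becomes available after (Intel, Santa Clara) is extracted in the first round. That second effect is the snowballing behaviour the abstract alludes to.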


Published in

DL '00: Proceedings of the fifth ACM conference on Digital libraries
June 2000
294 pages
ISBN: 158113231X
DOI: 10.1145/336597

Copyright © 2000 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

DL '00 Paper Acceptance Rate: 44 of 132 submissions, 33%. Overall Acceptance Rate: 95 of 346 submissions, 27%.
