skip to main content
research-article

Document Spanners: A Formal Approach to Information Extraction

Published:06 May 2015Publication History
Skip Abstract Section

Abstract

An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this article, we develop a foundational framework where the central construct is what we call a document spanner (or just spanner for short). A spanner maps an input string into a relation over the spans (intervals specified by bounding indices) of the string. The focus of this article is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a primitive representation extract relations directly from the input string; those defined in an algebra apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables.

We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the regular spanners—the closure of the first kind under standard relational operators. The core spanners extend the regular ones by string-equality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature.

References

  1. Alfred V. Aho. 1990. Algorithms for finding patterns in strings. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A). North Holland, 255--300. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. James F. Allen. 1983. Maintaining knowledge about temporal intervals. Commun. ACM 26, 11, 832--843. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Douglas E. Appelt and Boyan Onyshkevych. 1998. The common pattern specification language. In Proceedings of the TIPSTER Text Program: Phase III. Association for Computational Linguistics, 23--30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. 1999. Consistent query answers in inconsistent databases. In Proceedings of PODS. ACM, 68--79. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Pablo Barceló, Diego Figueira, and Leonid Libkin. 2012a. Graph logics with rational relations and the generalized intersection problem. In Proceedings of LICS. IEEE, 115--124. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Pablo Barceló, Leonid Libkin, Anthony Widjaja Lin, and Peter T. Wood. 2012b. Expressive languages for path queries over graph-structured data. ACM Trans. Datab. Syst. 37, 4, 31. DOI: http://dx.doi.org/10.1145/2389241.2389250 Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Pablo Barceló, Juan L. Reutter, and Leonid Libkin. 2013. Parameterized regular expressions and their languages. Theoret. Comput. Sci. 474, 21--45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Michael Benedikt, Leonid Libkin, Thomas Schwentick, and Luc Segoufin. 2003. Definable relations and first-order query languages over strings. J. ACM 50, 5, 694--751. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Jean Berstel. 1979. Transductions and Context-Free Languages. Teubner Studienbücher, Stuttgart.Google ScholarGoogle Scholar
  10. Anthony J. Bonner and Giansalvatore Mecca. 1998. Sequences, datalog, and transducers. J. Comput. Syst. Sci. 57, 3, 234--259. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. 2000a. Containment of conjunctive regular path queries with inverse. In Proceedings of KR 2000. 176--185.Google ScholarGoogle Scholar
  12. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. 2000b. View-based query processing and constraint satisfaction. In Proceedings of LICS. 361--371. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Cezar Câmpeanu, Kai Salomaa, and Sheng Yu. 2003. A formal study of practical regular expressions. Int. J. Found. Comput. Sci. 14, 6, 1007--1018.Google ScholarGoogle ScholarCross RefCross Ref
  14. Cezar Câmpeanu and Nicolae Santean. 2009. On the intersection of regex languages with regular languages. Theoret. Comput. Sci. 410, 24--25, 2336--2344. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Benjamin Carle and Paliath Narendran. 2009. On extended regular expressions. In Proceedings of LATA 2009. Lecture Notes in Computer Science, vol. 5457. 279--289. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, and Shivakumar Vaithyanathan. 2010. SystemT: An algebraic approach to declarative information extraction. In Proceedings of the 48th Annual/Meeting of the Association for Computer Linguisties (ACL'10). 128--137. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Mariano P. Consens and Alberto O. Mendelzon. 1990. GraphLog: A visual formalism for real life recursion. In Proceedings of PODS. ACM, 404--416. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Isabel F. Cruz, Alberto O. Mendelzon, and Peter T. Wood. 1987. A graphical query language supporting recursion. In Proceedings of SIGMOD Conference. ACM, 323--330. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Hamish Cunningham. 2002. GATE, A General Architecture for text engineering. Comput. Human. 36, 2, 223--254.Google ScholarGoogle ScholarCross RefCross Ref
  20. Alin Deutsch and Val Tannen. 2001. Optimization properties for classes of conjunctive regular path queries. In Proceedings of DBPL. 21--39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Calvin C. Elgot and J. E. Mezei. 1965. On relations defined by generalized finite automata. IBM J. Res. Devel. 9, 47--68. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. 2013. Spanners: A formal framework for information extraction. In Proceedings of PODS. 37--48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. 2014. Cleaning inconsistencies in information extraction via prioritized repairs. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'14). ACM, 164--175. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Daniela Florescu, Alon Y. Levy, and Dan Suciu. 1998. Query containment for conjunctive queries with regular expressions. In Proceedings of PODS. 139--148. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Dayne Freitag. 1998. Toward general-purpose learning for information extraction. In Proceedings of COLING-ACL. 404--408. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Dominik D. Freydenberger. 2011. Extended regular expressions: Succinctness and decidability. In Proceedings of STACS (LIPIcs). Vol. 9, Schloss Dagstuhl -- Leibniz-Zentrum fuer Informatik, 507--518.Google ScholarGoogle Scholar
  27. Jeffrey Friedl. 2006. Mastering Regular Expressions. O'Reilly Media. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Seymour Ginsburg and Xiaoyang Sean Wang. 1998. Regular sequence operations and their use in database queries. J. Comput. Syst. Sci. 56, 1, 1--26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Gösta Grahne, Matti Nykänen, and Esko Ukkonen. 1999. Reasoning about strings in databases. J. Comput. Syst. Sci. 59, 1, 116--162. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Ralph Grishman and Beth Sundheim. 1996. Message understanding conference- 6: A brief history. In Proceedings of COLING. 466--471. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Orna Grumberg, Orna Kupferman, and Sarai Sheinvald. 2010. Variable automata over infinite alphabets. In Proceedings of LATA. 561--572. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Donald E. Knuth. 1968. Semantics of context-free languages. Math. Syst. Theory 2, 2, 127--145.Google ScholarGoogle ScholarCross RefCross Ref
  33. Donald E. Knuth. 1971. Correction: Semantics of context-free languages. Math. Syst. Theory 5, 1, 95--96.Google ScholarGoogle ScholarCross RefCross Ref
  34. Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu. 2008. SystemT: A system for declarative information extraction. SIGMOD Record 37, 4, 7--13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML. Morgan Kaufmann, 282--289. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. T. R. Leek. 1997. Information extraction using hidden Markov models. Master's thesis University of California, San Diego.Google ScholarGoogle Scholar
  37. Peter Linz. 2001. An Introduction to Formal Languages and Automata 3rd Ed. Jones and Bartlett Publishers, Inc., Sudbury, M. A. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. B. Liu, L. Chiticariu, V. Chu, H. V. Jagadish, and F. R. Reiss. 2010. Automatic rule refinement for information extraction. Proc. VLDB Endow. 3, 1--2, 588--597. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML. Morgan Kaufmann, 591--598. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Frank Neven and Thomas Schwentick. 2002. Query automata over finite trees. Theoret. Comput. Sci. 275, 2, 633--674. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Frank Neven and Jan Van den Bussche. 2002. Expressiveness of structured document query languages based on attribute grammars. J. ACM 49, 1, 56--100. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Maurice Nivat. 1968. Transduction des langages de Chomsky. Ann. Inst. Fourier 18, 339--455.Google ScholarGoogle ScholarCross RefCross Ref
  43. Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. 2008. An algebraic approach to rule-based information extraction. In Proceedings of ICDE. IEEE, 933--942. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Ellen Riloff. 1993. Automatically constructing a dictionary for information extraction tasks. In Proceedings of AAAI. AAAI Press/The MIT Press, 811--816. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Stephen Soderland, David Fisher, Jonathan Aseltine, and Wendy G. Lehnert. 1995. CRYSTAL: Inducing a conceptual dictionary. In Proceedings of IJCAI. Morgan Kaufmann, 1314--1321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Slawek Staworko, Jan Chomicki, and Jerzy Marcinkowski. 2012. Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64, 2--3, 209--246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Sheng Yu. 1997. Regular Languages. In Handbook of Formal Languages, Grzegorz Rozenberg and Arto Salomaa (Eds.), vol. 1, Springer, Chapter 2.Google ScholarGoogle Scholar

Index Terms

  1. Document Spanners: A Formal Approach to Information Extraction

                    Recommendations

                    Comments

                    Login options

                    Check if you have access through your login credentials or your institution to get full access on this article.

                    Sign in

                    Full Access

                    • Published in

                      cover image Journal of the ACM
                      Journal of the ACM  Volume 62, Issue 2
                      May 2015
                      304 pages
                      ISSN:0004-5411
                      EISSN:1557-735X
                      DOI:10.1145/2772377
                      Issue’s Table of Contents

                      Copyright © 2015 ACM

                      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

                      Publisher

                      Association for Computing Machinery

                      New York, NY, United States

                      Publication History

                      • Published: 6 May 2015
                      • Revised: 1 October 2014
                      • Accepted: 1 October 2014
                      • Received: 1 August 2013
                      Published in jacm Volume 62, Issue 2

                      Permissions

                      Request permissions about this article.

                      Request Permissions

                      Check for updates

                      Qualifiers

                      • research-article
                      • Research
                      • Refereed

                    PDF Format

                    View or Download as a PDF file.

                    PDF

                    eReader

                    View online with eReader.

                    eReader