Abstract
An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this article, we develop a foundational framework where the central construct is what we call a document spanner (or just spanner for short). A spanner maps an input string into a relation over the spans (intervals specified by bounding indices) of the string. The focus of this article is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a primitive representation extract relations directly from the input string; those defined in an algebra apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables.
We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the regular spanners—the closure of the first kind under standard relational operators. The core spanners extend the regular ones by string-equality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature.
- Alfred V. Aho. 1990. Algorithms for finding patterns in strings. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A). North Holland, 255--300. Google ScholarDigital Library
- James F. Allen. 1983. Maintaining knowledge about temporal intervals. Commun. ACM 26, 11, 832--843. Google ScholarDigital Library
- Douglas E. Appelt and Boyan Onyshkevych. 1998. The common pattern specification language. In Proceedings of the TIPSTER Text Program: Phase III. Association for Computational Linguistics, 23--30. Google ScholarDigital Library
- Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. 1999. Consistent query answers in inconsistent databases. In Proceedings of PODS. ACM, 68--79. Google ScholarDigital Library
- Pablo Barceló, Diego Figueira, and Leonid Libkin. 2012a. Graph logics with rational relations and the generalized intersection problem. In Proceedings of LICS. IEEE, 115--124. Google ScholarDigital Library
- Pablo Barceló, Leonid Libkin, Anthony Widjaja Lin, and Peter T. Wood. 2012b. Expressive languages for path queries over graph-structured data. ACM Trans. Datab. Syst. 37, 4, 31. DOI: http://dx.doi.org/10.1145/2389241.2389250 Google ScholarDigital Library
- Pablo Barceló, Juan L. Reutter, and Leonid Libkin. 2013. Parameterized regular expressions and their languages. Theoret. Comput. Sci. 474, 21--45. Google ScholarDigital Library
- Michael Benedikt, Leonid Libkin, Thomas Schwentick, and Luc Segoufin. 2003. Definable relations and first-order query languages over strings. J. ACM 50, 5, 694--751. Google ScholarDigital Library
- Jean Berstel. 1979. Transductions and Context-Free Languages. Teubner Studienbücher, Stuttgart.Google Scholar
- Anthony J. Bonner and Giansalvatore Mecca. 1998. Sequences, datalog, and transducers. J. Comput. Syst. Sci. 57, 3, 234--259. Google ScholarDigital Library
- Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. 2000a. Containment of conjunctive regular path queries with inverse. In Proceedings of KR 2000. 176--185.Google Scholar
- Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. 2000b. View-based query processing and constraint satisfaction. In Proceedings of LICS. 361--371. Google ScholarDigital Library
- Cezar Câmpeanu, Kai Salomaa, and Sheng Yu. 2003. A formal study of practical regular expressions. Int. J. Found. Comput. Sci. 14, 6, 1007--1018.Google ScholarCross Ref
- Cezar Câmpeanu and Nicolae Santean. 2009. On the intersection of regex languages with regular languages. Theoret. Comput. Sci. 410, 24--25, 2336--2344. Google ScholarDigital Library
- Benjamin Carle and Paliath Narendran. 2009. On extended regular expressions. In Proceedings of LATA 2009. Lecture Notes in Computer Science, vol. 5457. 279--289. Google ScholarDigital Library
- Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, and Shivakumar Vaithyanathan. 2010. SystemT: An algebraic approach to declarative information extraction. In Proceedings of the 48th Annual/Meeting of the Association for Computer Linguisties (ACL'10). 128--137. Google ScholarDigital Library
- Mariano P. Consens and Alberto O. Mendelzon. 1990. GraphLog: A visual formalism for real life recursion. In Proceedings of PODS. ACM, 404--416. Google ScholarDigital Library
- Isabel F. Cruz, Alberto O. Mendelzon, and Peter T. Wood. 1987. A graphical query language supporting recursion. In Proceedings of SIGMOD Conference. ACM, 323--330. Google ScholarDigital Library
- Hamish Cunningham. 2002. GATE, A General Architecture for text engineering. Comput. Human. 36, 2, 223--254.Google ScholarCross Ref
- Alin Deutsch and Val Tannen. 2001. Optimization properties for classes of conjunctive regular path queries. In Proceedings of DBPL. 21--39. Google ScholarDigital Library
- Calvin C. Elgot and J. E. Mezei. 1965. On relations defined by generalized finite automata. IBM J. Res. Devel. 9, 47--68. Google ScholarDigital Library
- Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. 2013. Spanners: A formal framework for information extraction. In Proceedings of PODS. 37--48. Google ScholarDigital Library
- Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. 2014. Cleaning inconsistencies in information extraction via prioritized repairs. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'14). ACM, 164--175. Google ScholarDigital Library
- Daniela Florescu, Alon Y. Levy, and Dan Suciu. 1998. Query containment for conjunctive queries with regular expressions. In Proceedings of PODS. 139--148. Google ScholarDigital Library
- Dayne Freitag. 1998. Toward general-purpose learning for information extraction. In Proceedings of COLING-ACL. 404--408. Google ScholarDigital Library
- Dominik D. Freydenberger. 2011. Extended regular expressions: Succinctness and decidability. In Proceedings of STACS (LIPIcs). Vol. 9, Schloss Dagstuhl -- Leibniz-Zentrum fuer Informatik, 507--518.Google Scholar
- Jeffrey Friedl. 2006. Mastering Regular Expressions. O'Reilly Media. Google ScholarDigital Library
- Seymour Ginsburg and Xiaoyang Sean Wang. 1998. Regular sequence operations and their use in database queries. J. Comput. Syst. Sci. 56, 1, 1--26. Google ScholarDigital Library
- Gösta Grahne, Matti Nykänen, and Esko Ukkonen. 1999. Reasoning about strings in databases. J. Comput. Syst. Sci. 59, 1, 116--162. Google ScholarDigital Library
- Ralph Grishman and Beth Sundheim. 1996. Message understanding conference- 6: A brief history. In Proceedings of COLING. 466--471. Google ScholarDigital Library
- Orna Grumberg, Orna Kupferman, and Sarai Sheinvald. 2010. Variable automata over infinite alphabets. In Proceedings of LATA. 561--572. Google ScholarDigital Library
- Donald E. Knuth. 1968. Semantics of context-free languages. Math. Syst. Theory 2, 2, 127--145.Google ScholarCross Ref
- Donald E. Knuth. 1971. Correction: Semantics of context-free languages. Math. Syst. Theory 5, 1, 95--96.Google ScholarCross Ref
- Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu. 2008. SystemT: A system for declarative information extraction. SIGMOD Record 37, 4, 7--13. Google ScholarDigital Library
- John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML. Morgan Kaufmann, 282--289. Google ScholarDigital Library
- T. R. Leek. 1997. Information extraction using hidden Markov models. Master's thesis University of California, San Diego.Google Scholar
- Peter Linz. 2001. An Introduction to Formal Languages and Automata 3rd Ed. Jones and Bartlett Publishers, Inc., Sudbury, M. A. Google ScholarDigital Library
- B. Liu, L. Chiticariu, V. Chu, H. V. Jagadish, and F. R. Reiss. 2010. Automatic rule refinement for information extraction. Proc. VLDB Endow. 3, 1--2, 588--597. Google ScholarDigital Library
- Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML. Morgan Kaufmann, 591--598. Google ScholarDigital Library
- Frank Neven and Thomas Schwentick. 2002. Query automata over finite trees. Theoret. Comput. Sci. 275, 2, 633--674. Google ScholarDigital Library
- Frank Neven and Jan Van den Bussche. 2002. Expressiveness of structured document query languages based on attribute grammars. J. ACM 49, 1, 56--100. Google ScholarDigital Library
- Maurice Nivat. 1968. Transduction des langages de Chomsky. Ann. Inst. Fourier 18, 339--455.Google ScholarCross Ref
- Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. 2008. An algebraic approach to rule-based information extraction. In Proceedings of ICDE. IEEE, 933--942. Google ScholarDigital Library
- Ellen Riloff. 1993. Automatically constructing a dictionary for information extraction tasks. In Proceedings of AAAI. AAAI Press/The MIT Press, 811--816. Google ScholarDigital Library
- Stephen Soderland, David Fisher, Jonathan Aseltine, and Wendy G. Lehnert. 1995. CRYSTAL: Inducing a conceptual dictionary. In Proceedings of IJCAI. Morgan Kaufmann, 1314--1321. Google ScholarDigital Library
- Slawek Staworko, Jan Chomicki, and Jerzy Marcinkowski. 2012. Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64, 2--3, 209--246. Google ScholarDigital Library
- Sheng Yu. 1997. Regular Languages. In Handbook of Formal Languages, Grzegorz Rozenberg and Arto Salomaa (Eds.), vol. 1, Springer, Chapter 2.Google Scholar
Index Terms
- Document Spanners: A Formal Approach to Information Extraction
Recommendations
Spanners: a formal framework for information extraction
PODS '13: Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI symposium on Principles of database systemsAn intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this paper, we develop a foundational framework where the central construct is what we call a spanner. A spanner maps an input string into ...
Document Spanners: From Expressive Power to Decision Problems
We examine document spanners, a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). A document spanner is a function that maps an input string to a relation over spans (...
Complexity Bounds for Relational Algebra over Document Spanners
PODS '19: Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsWe investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document ...
Comments