ABSTRACT
Deterministic regular expressions are widely used in XML processing. For instance, all regular expressions in DTDs and XML Schemas are required to be deterministic. In this paper we show that determinism of a regular expression e can be tested in linear time. The best known algorithms, based on the Glushkov automaton, require O(σ|e|) time, where σ is the number of distinct symbols in e. We further show that matching a word w against an expression e can be achieved in combined linear time O(|e|+|w|), for a wide range of deterministic regular expressions: (i) star-free (for multiple input words), (ii) bounded-occurrence, i.e., expressions in which each symbol appears a bounded number of times, and (iii) bounded plus-depth, i.e., expressions in which the nesting depth of alternating plus (union) and concatenation symbols is bounded. Our algorithms use a new structural decomposition of the parse tree of e. For matching arbitrary deterministic regular expressions we present an O(|e| + |w|log log|e|) time algorithm.
- M. A. Bender, M. Farach-Colton, G. Pemmasani, S. Skiena, and P. Sumazin. Lowest common ancestors in trees and directed acyclic graphs. J. Algorithms, 57(2):75--94, 2005. Google ScholarDigital Library
- G. Berry and R. Sethi. From regular expressions to deterministic automata. Theor. Comput. Sci., 48(3):117--126, 1986. Google ScholarDigital Library
- G. J. Bex, W. Gelade, F. Neven, and S. Vansummeren. Learning deterministic regular expressions for the inference of schemas from XML data. TWEB, 4(4), 2010. Google ScholarDigital Library
- G. J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML Schema: A practical study. In WEBDB, pages 79--84, 2004. Google ScholarDigital Library
- G. J. Bex, F. Neven, T. Schwentick, and S. Vansummeren. Inference of concise regular expressions and DTDs. TODS, 35(2), 2010. Google ScholarDigital Library
- P. Bille and M. Thorup. Faster regular expression matching. In ICALP, pages 171--182, 2009. Google ScholarDigital Library
- M. Bojanczyk and P. Parys. XPath evaluation in linear time. J. ACM, 58(4):17, 2011. Google ScholarDigital Library
- A. Brüggemann-Klein. Regular expressions into finite automata. TCS, 120(2):197--213, 1993. Google ScholarDigital Library
- C.-H. Chang and R. Paige. From regular expressions to DFA's using compressed NFA's. TCS, 178(1--2):1--36, 1997. Google ScholarDigital Library
- M. Farach and S. Muthukrishnan. Optimal parallel dictionary matching and compression. In SPAA, pages 244--253, 1995. Google ScholarDigital Library
- Apache Software Foundation. Xerces-C++ ver.\ 3.1.1. http://xerces.apache.org/xerces-c/. File MixedContentModel.cpp.Google Scholar
- V. M. Glushkov. The abstract theory of automata. Russ. Math. Surveys, 16(5):1--53, 1961.Google ScholarCross Ref
- S. Grijzenhout. Quality of the XML web. Master's thesis, University of Amsterdam, July 2010. Draft, see also http://data.politicalmashup.nl/xmlweb/.Google Scholar
- C. Hagenah and A. Muscholl. Computing epsilon-free NFA from regular expressions in O(n łog 2(n)) time. In MFCS, pages 277--285, 1998. Google ScholarDigital Library
- D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338--355, 1984. Google ScholarDigital Library
- J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages and Computation, Second Edition. Addison-Wesley, 2000. Google ScholarDigital Library
- R. Kecher. Reset an array in constant time. Blog entry cplusplus.co.il, May 2009.Google Scholar
- P. Kilpeläinen. Checking determinism of XML schema content models in optimal time. Inf. Syst., 36(3):596--617, 2011. Google ScholarDigital Library
- P. Kilpeläinen and R. Tuhkanen. One-unambiguity of regular expressions with numeric occurrence indicators. Inf. Comput., 205(6):890--916, 2007. Google ScholarDigital Library
- C. Konrad and F. Magniez. Validating XML documents in the streaming model with external memory. In ICDT, 2012. Google ScholarDigital Library
- W. Martens, F. Neven, T. Schwentick, and G. J. Bex. Expressiveness and complexity of XML Schema. TODS, 31(3):770--813, 2006. Google ScholarDigital Library
- B. Moret and H. Shapiro. Algorithms from P to NP. Benjamin/Cummings, 1990.Google Scholar
- S. Muthukrishnan and Müller M. Time and space efficient method-lookup for object-oriented programs. In SODA, pages 42--51, 1996. Google ScholarDigital Library
- E. W. Myers. A four russians algorithm for regular expression pattern matching. J. ACM, 39(2):430--448, 1992. Google ScholarDigital Library
- J.-L. Ponty, D. Ziadi, and J.-M. Champarnaud. A new quadratic algorithm to convert a regular expression into an automaton. In WIA, pages 109--119, 1996. Google ScholarDigital Library
- G. Schnitger. Regular expressions and NFAs without epsilon-transitions. In STACS, pages 432--443, 2006. Google ScholarDigital Library
- L. Segoufin and C. Sirangelo. Constant-memory validation of streaming XML documents against DTDs. In ICDT, pages 299--313, 2007. Google ScholarDigital Library
- L. Segoufin and V. Vianu. Validating streaming XML documents. In PODS, pages 53--64, 2002. Google ScholarDigital Library
Index Terms
- Deterministic regular expressions in linear time
Recommendations
Deterministic regular expressions with back-references
AbstractMost modern libraries for regular expression matching allow back-references (i.e., repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between ...
Checking determinism of regular expressions with counting
In this paper, we first present characterizations of weak determinism for regular expressions with counting, based on which we develop an O ( | Σ E | | E | ) time algorithm to check whether an expression E with counting is weakly deterministic, where Σ ...
Closure properties and descriptional complexity of deterministic regular expressions
We study the descriptional complexity of regular languages that are definable by deterministic regular expressions, i.e., we examine worst-case blow-ups in size when translating between different representations for such languages. As representations of ...
Comments