skip to main content
10.1145/2213556.2213566acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

Deterministic regular expressions in linear time

Published:21 May 2012Publication History

ABSTRACT

Deterministic regular expressions are widely used in XML processing. For instance, all regular expressions in DTDs and XML Schemas are required to be deterministic. In this paper we show that determinism of a regular expression e can be tested in linear time. The best known algorithms, based on the Glushkov automaton, require O(σ|e|) time, where σ is the number of distinct symbols in e. We further show that matching a word w against an expression e can be achieved in combined linear time O(|e|+|w|), for a wide range of deterministic regular expressions: (i) star-free (for multiple input words), (ii) bounded-occurrence, i.e., expressions in which each symbol appears a bounded number of times, and (iii) bounded plus-depth, i.e., expressions in which the nesting depth of alternating plus (union) and concatenation symbols is bounded. Our algorithms use a new structural decomposition of the parse tree of e. For matching arbitrary deterministic regular expressions we present an O(|e| + |w|log log|e|) time algorithm.

References

  1. M. A. Bender, M. Farach-Colton, G. Pemmasani, S. Skiena, and P. Sumazin. Lowest common ancestors in trees and directed acyclic graphs. J. Algorithms, 57(2):75--94, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. G. Berry and R. Sethi. From regular expressions to deterministic automata. Theor. Comput. Sci., 48(3):117--126, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. G. J. Bex, W. Gelade, F. Neven, and S. Vansummeren. Learning deterministic regular expressions for the inference of schemas from XML data. TWEB, 4(4), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. G. J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML Schema: A practical study. In WEBDB, pages 79--84, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. G. J. Bex, F. Neven, T. Schwentick, and S. Vansummeren. Inference of concise regular expressions and DTDs. TODS, 35(2), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Bille and M. Thorup. Faster regular expression matching. In ICALP, pages 171--182, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. M. Bojanczyk and P. Parys. XPath evaluation in linear time. J. ACM, 58(4):17, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Brüggemann-Klein. Regular expressions into finite automata. TCS, 120(2):197--213, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. C.-H. Chang and R. Paige. From regular expressions to DFA's using compressed NFA's. TCS, 178(1--2):1--36, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. Farach and S. Muthukrishnan. Optimal parallel dictionary matching and compression. In SPAA, pages 244--253, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Apache Software Foundation. Xerces-C++ ver.\ 3.1.1. http://xerces.apache.org/xerces-c/. File MixedContentModel.cpp.Google ScholarGoogle Scholar
  12. V. M. Glushkov. The abstract theory of automata. Russ. Math. Surveys, 16(5):1--53, 1961.Google ScholarGoogle ScholarCross RefCross Ref
  13. S. Grijzenhout. Quality of the XML web. Master's thesis, University of Amsterdam, July 2010. Draft, see also http://data.politicalmashup.nl/xmlweb/.Google ScholarGoogle Scholar
  14. C. Hagenah and A. Muscholl. Computing epsilon-free NFA from regular expressions in O(n łog 2(n)) time. In MFCS, pages 277--285, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338--355, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages and Computation, Second Edition. Addison-Wesley, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. Kecher. Reset an array in constant time. Blog entry cplusplus.co.il, May 2009.Google ScholarGoogle Scholar
  18. P. Kilpeläinen. Checking determinism of XML schema content models in optimal time. Inf. Syst., 36(3):596--617, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. P. Kilpeläinen and R. Tuhkanen. One-unambiguity of regular expressions with numeric occurrence indicators. Inf. Comput., 205(6):890--916, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. C. Konrad and F. Magniez. Validating XML documents in the streaming model with external memory. In ICDT, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. W. Martens, F. Neven, T. Schwentick, and G. J. Bex. Expressiveness and complexity of XML Schema. TODS, 31(3):770--813, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. B. Moret and H. Shapiro. Algorithms from P to NP. Benjamin/Cummings, 1990.Google ScholarGoogle Scholar
  23. S. Muthukrishnan and Müller M. Time and space efficient method-lookup for object-oriented programs. In SODA, pages 42--51, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. E. W. Myers. A four russians algorithm for regular expression pattern matching. J. ACM, 39(2):430--448, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J.-L. Ponty, D. Ziadi, and J.-M. Champarnaud. A new quadratic algorithm to convert a regular expression into an automaton. In WIA, pages 109--119, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. G. Schnitger. Regular expressions and NFAs without epsilon-transitions. In STACS, pages 432--443, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. L. Segoufin and C. Sirangelo. Constant-memory validation of streaming XML documents against DTDs. In ICDT, pages 299--313, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. L. Segoufin and V. Vianu. Validating streaming XML documents. In PODS, pages 53--64, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Deterministic regular expressions in linear time

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        PODS '12: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems
        May 2012
        332 pages
        ISBN:9781450312486
        DOI:10.1145/2213556

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 21 May 2012

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        PODS '12 Paper Acceptance Rate26of101submissions,26%Overall Acceptance Rate642of2,707submissions,24%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader