skip to main content
article

Path sharing and predicate evaluation for high-performance XML filtering

Authors Info & Claims
Published:01 December 2003Publication History
Skip Abstract Section

Abstract

XML filtering systems aim to provide fast, on-the-fly matching of XML-encoded data to large numbers of query specifications containing constraints on both structure and content. It is now well accepted that approaches using event-based parsing and Finite State Machines (FSMs) can provide the basis for highly scalable structure-oriented XML filtering systems. The XFilter system [Altinel and Franklin 2000] was the first published FSM-based XML filtering approach. XFilter used a separate FSM per path query and a novel indexing mechanism to allow all of the FSMs to be executed simultaneously during the processing of a document. Building on the insights of the XFilter work, we describe a new method, called "YFilter" that combines all of the path queries into a single Nondeterministic Finite Automaton (NFA). YFilter exploits commonality among queries by merging common prefixes of the query paths such that they are processed at most once. The resulting shared processing provides tremendous improvements in structure matching performance but complicates the handling of value-based predicates.In this article, we first describe the XFilter and YFilter approaches and present results of a detailed performance comparison of structure matching for these algorithms as well as a hybrid approach. The results show that the path sharing employed by YFilter can provide order-of-magnitude performance benefits. We then propose two alternative techniques for extending YFilter's shared structure matching with support for value-based predicates, and compare the performance of these two techniques. The results of this latter study demonstrate some key differences between shared XML filtering and traditional database query processing. Finally, we describe how the YFilter approach is extended to handle more complicated queries containing nested path expressions.

References

  1. Aksoy, D., Altinel, M., Bose, R., Cetintemel, U., Franklin, M. J., Wang, J., and Zdonik, S. B. 1998. Research in data broadcast and dissemination. In Proceedings of the of the 1st International Conference on Advanced Multimedia Content Processing. Springer, Berlin, Germany, 194--207.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Altinel, M., Aksoy, D., Baby, T., Franklin, M. J., Shapiro, W., and Zdonik, S. B. 1999. DBIS-Toolkit: Adaptable middleware for large scale data delivery. In Proceedings of the of SIGMOD 1999. ACM, New York, 544--546.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Altinel, M., and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of VLDB 2000. Morgan-Kaufmann, San Francisco, Calif., 53--64.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Apache XML Project. 1999. Xerces Java parser 1.2.3 Release. http://xml.apache.org/xerces-j/index.html.]]Google ScholarGoogle Scholar
  5. Belkin, N. J. and Croft, B. W. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM, 35, 12, 29--38.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bruno, N., Gravano, L., Koudas, N., and Srivastava, D. 2003. Navigation- vs. index-based XML multi-query processing. In Proceedings of ICDE 2003. IEEE Computer Society Press, Los Alamitos, Calif.]]Google ScholarGoogle Scholar
  7. Bruno, N., Srivastava, D., and Koudas, N. 2002. Holistic twig joins: Optimal XML pattern matching. In Proceedings of SIGMOD 2002. ACM, New York, 310--321.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Busse, R., Carey, M., Florescu, D., Kersten, M., Manolescu, I., Schmidt, A., and Waas, F. 2001. Xmark: An XML benchmark project. http://monetdb.cwi.nl/xml/index.html.]]Google ScholarGoogle Scholar
  9. Cetintemel, U., Franklin, M. J., and Giles, C. L. 2000. Self-adaptive user profiles for large scale data delivery. In Proceedings of ICDE 2000. IEEE Computer Society, Los Alamitos, Calif., 622--633.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chamberlin, D., Fankhauser, P., Florescu, D., Marchiori, M., and Robie, J. 2002. XML query use cases. W3C Working Draft. http://www.w3.org/TR/2002/WD-xmlquery-use-cases-20020430.]]Google ScholarGoogle Scholar
  11. Chan, C., Felber, P., Garofalakis, M., and Rastogi, R. 2002. Efficient filtering of XML documents with XPath expressions. In Proceedings of ICDE 2002. IEEE Computer Society, Los Alamitos, Calif., 235--244.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Chen, J., DeWitt, D. J., and Naughton, J. F. 2002. Design and evaluation of alternative selection placement strategies in optimizing continuous queries. In Proceedings of ICDE 2002. IEEE Computer Society Press, Los Alamitos, Calif., 345--356.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Chen, J., Dewitt, D. J., Tian, F., and Wang, Y. 2000. NiagaraCQ: A scalable continuous query system for Internet databases. In Proceedings of SIGMOD 2000. ACM, New York, 379--390.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Clark, J. 1999. XSL transformations (XSLT)---version 1.0. http://www.w3.org/TR/xslt.]]Google ScholarGoogle Scholar
  15. Clark, J., and DeRose, S. 1999. XML path language (XPath)---version 1.0. http://www.w3.org/TR/xpath.]]Google ScholarGoogle Scholar
  16. Cover, R. 1999. The SGML/XML Web Page. http://www.w3.org/TR/xslt.]]Google ScholarGoogle Scholar
  17. DeRose, S., Daniel Jr., R., and Maler, E. 1999. XML pointer language (XPointer). http://www.w3.org/TR/WD-xptr.]]Google ScholarGoogle Scholar
  18. Diaz, A. L., and Lovell, D. 1999. XML generator. http://www.alphaworks.ibm.com/tech/xmlgenerator.]]Google ScholarGoogle Scholar
  19. Fabret, F., Jacobsen, H-A., Llirbat, F., Pereira, J., Ross, K. A., and Shasha, D. 2001. Filtering algorithms and implementation for very fast publish/subscribe systems. In Proceedings of SIGMOD 2001. ACM, New York, 115--126.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Foltz, P. W., and Dumais, S. T. 1992. Personalized information delivery: An analysis of information filtering methods. Commun. ACM 35, 12, 51--60.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of VLDB 1997. Morgan-Kaufmann, San Francisco, Calif., 436--445.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Green, T. J., Miklau, G., Onizuka, M., and Suciu, D. 2003. Processing XML streams with deterministic automata. In Proceedings of ICDT 2003. Springer, Berlin, Germany, 173--189.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Hanson, E. N., Carnes, C., Huang, L., Konyala, M., Noronha, L., Parthasarathy, S., Park, J. B., and Vernon, A. 1999. Scalable trigger processing. In Proceedings of ICDE 1999. IEEE Computer Society Press, Los Alamitos, Calif., 266--275.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Hopcroft, J. E., and Ullman, J. D. 1979. Introduction to Automata Theory, Languages and Computation. Addition-Wesley, Boston, Mass.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Ives, Z., Levy, A., and Weld, D. 2000. Efficient evaluation of regular path expressions on streaming XML data. Technical Report, University of Washington, Seattle, Wash.]]Google ScholarGoogle Scholar
  26. Kay, M. 2001. Saxon: the XSLT processor. http://users.iclway.co.uk/mhkay/saxon/.]]Google ScholarGoogle Scholar
  27. Lakshmanan, L. V. S., and Parthasarathy, S. 2002. On efficient matching of streaming XML documents and queries. In Proceedings of EDBT 2002. Springer, Berlin, Germany, 142--160.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Ley, M. 2001. DBLP DTD. http://www.acm.org/sigmod/dblp/db/about/dblp.dtd.]]Google ScholarGoogle Scholar
  29. Liu, L., Pu, C., and Tang, W. 1999. Continual queries for Internet scale event-driven information delivery. Special Issue on Web Technologies, IEEE TKDE, 11, 4, 610--628.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Madden, S., Shah, M., Hellerstein, J. M., and Raman, V. 2002. Continuously adaptive continuous queries over streams. In Proceedings of SIGMOD 2002. ACM, New York, 49--60.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Nestorov, S., Ullman, J. D., Wiener, J. L., and Chawathe, S. S. 1997. Representative objects: Concise representations of semistrctured hierarchical data. In Proceedings of ICDE 1997. IEEE Computer Society Press, Los Alamitos, Calif., 79--90.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Nguyen, B., Abiteboul, S., Cobena, G., and Preda, M. 2001. Monitoring XML data on the Web. In Proceedings of SIGMOD 2001. ACM, New York, 437--448.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Ozen, B., Kilic, O., Altinel, M., and Dogac, A. 2001. Highly personalized information delivery to mobile clients. In Proceedings of the 2nd ACM International Workshop on Data Engineering for Wireless and Mobile Access. ACM, New York, 35--41.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Pereira, J., Fabret, F., Llirbat, F., and Jacobsen, H.-A. 2001. WebFilter: A high-throughput XML-based publish and subscribe System. In Proceedings of VLDB 2001. Morgan-Kaufmann, San Francisco, Calif., 723--724.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Salton, G. 1989. Automatic Text Processing. Addison-Wesley, Boston, Mass.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Sax Project Organization. 2001. SAX: Simple API for XML. http://www. saxproject.org.]]Google ScholarGoogle Scholar
  37. Schreier, U., Pirahesh, H., Agrawal, R., and Mohan, C. 1991. Alert: An architecture for transforming a passive DBMS into an active DBMS. In Proceedings of VLDB 1991. Morgan-Kaufmann, San Francisco, Calif., 469--478.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Stonebraker, M., Jhingran, A., Goh, J., and Potamianos, S. 1990. On rules, procedures, caching and views in data base systems. In Proceedings of SIGMOD 1990. ACM, New York, 281--290.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Sun Microsystems, Inc. 2001. Java XML pack. Winter 01 update release. http://java.sun.com/xml/downloads/javaxmlpack.html.]]Google ScholarGoogle Scholar
  40. Terry, D. B., Goldberg, D., Nichols, D. A., and Oki, B. M. 1992. Continuous queries over append-only databases. In Proceedings of SIGMOD 1992. ACM, New York, 321--330.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Watson, B. W. 1997. Practical optimization for automata. In Proceedings of the 2nd International Workshop on Implementing Automata. Springer, Berlin, Germany, 232--240.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Widom, J. and Finklestein, S. J. 1990. Set-oriented production rules in relational database systems. In Proceedings of SIGMOD 1990. ACM, New York, 259--270.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Wutka. 2000. DTD parser. http://www.wutka.com/dtdparser.html.]]Google ScholarGoogle Scholar
  44. Yan, T. W. and Garcia-Molina, H. 1994. Index structures for selective dissemination of information under Boolean model. ACM Trans. Datab. Syst. 19, 2, 332--364.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Yan, T. W. and Garcia-Molina, H. 1999. The SIFT information dissemination system. ACM Trans. Datab. Syst. 24, 4, 529--565.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Path sharing and predicate evaluation for high-performance XML filtering

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader