Abstract
XML filtering systems aim to provide fast, on-the-fly matching of XML-encoded data to large numbers of query specifications containing constraints on both structure and content. It is now well accepted that approaches using event-based parsing and Finite State Machines (FSMs) can provide the basis for highly scalable structure-oriented XML filtering systems. The XFilter system [Altinel and Franklin 2000] was the first published FSM-based XML filtering approach. XFilter used a separate FSM per path query and a novel indexing mechanism to allow all of the FSMs to be executed simultaneously during the processing of a document. Building on the insights of the XFilter work, we describe a new method, called "YFilter" that combines all of the path queries into a single Nondeterministic Finite Automaton (NFA). YFilter exploits commonality among queries by merging common prefixes of the query paths such that they are processed at most once. The resulting shared processing provides tremendous improvements in structure matching performance but complicates the handling of value-based predicates.In this article, we first describe the XFilter and YFilter approaches and present results of a detailed performance comparison of structure matching for these algorithms as well as a hybrid approach. The results show that the path sharing employed by YFilter can provide order-of-magnitude performance benefits. We then propose two alternative techniques for extending YFilter's shared structure matching with support for value-based predicates, and compare the performance of these two techniques. The results of this latter study demonstrate some key differences between shared XML filtering and traditional database query processing. Finally, we describe how the YFilter approach is extended to handle more complicated queries containing nested path expressions.
- Aksoy, D., Altinel, M., Bose, R., Cetintemel, U., Franklin, M. J., Wang, J., and Zdonik, S. B. 1998. Research in data broadcast and dissemination. In Proceedings of the of the 1st International Conference on Advanced Multimedia Content Processing. Springer, Berlin, Germany, 194--207.]] Google ScholarDigital Library
- Altinel, M., Aksoy, D., Baby, T., Franklin, M. J., Shapiro, W., and Zdonik, S. B. 1999. DBIS-Toolkit: Adaptable middleware for large scale data delivery. In Proceedings of the of SIGMOD 1999. ACM, New York, 544--546.]] Google ScholarDigital Library
- Altinel, M., and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of VLDB 2000. Morgan-Kaufmann, San Francisco, Calif., 53--64.]] Google ScholarDigital Library
- Apache XML Project. 1999. Xerces Java parser 1.2.3 Release. http://xml.apache.org/xerces-j/index.html.]]Google Scholar
- Belkin, N. J. and Croft, B. W. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM, 35, 12, 29--38.]] Google ScholarDigital Library
- Bruno, N., Gravano, L., Koudas, N., and Srivastava, D. 2003. Navigation- vs. index-based XML multi-query processing. In Proceedings of ICDE 2003. IEEE Computer Society Press, Los Alamitos, Calif.]]Google Scholar
- Bruno, N., Srivastava, D., and Koudas, N. 2002. Holistic twig joins: Optimal XML pattern matching. In Proceedings of SIGMOD 2002. ACM, New York, 310--321.]] Google ScholarDigital Library
- Busse, R., Carey, M., Florescu, D., Kersten, M., Manolescu, I., Schmidt, A., and Waas, F. 2001. Xmark: An XML benchmark project. http://monetdb.cwi.nl/xml/index.html.]]Google Scholar
- Cetintemel, U., Franklin, M. J., and Giles, C. L. 2000. Self-adaptive user profiles for large scale data delivery. In Proceedings of ICDE 2000. IEEE Computer Society, Los Alamitos, Calif., 622--633.]] Google ScholarDigital Library
- Chamberlin, D., Fankhauser, P., Florescu, D., Marchiori, M., and Robie, J. 2002. XML query use cases. W3C Working Draft. http://www.w3.org/TR/2002/WD-xmlquery-use-cases-20020430.]]Google Scholar
- Chan, C., Felber, P., Garofalakis, M., and Rastogi, R. 2002. Efficient filtering of XML documents with XPath expressions. In Proceedings of ICDE 2002. IEEE Computer Society, Los Alamitos, Calif., 235--244.]] Google ScholarDigital Library
- Chen, J., DeWitt, D. J., and Naughton, J. F. 2002. Design and evaluation of alternative selection placement strategies in optimizing continuous queries. In Proceedings of ICDE 2002. IEEE Computer Society Press, Los Alamitos, Calif., 345--356.]] Google ScholarDigital Library
- Chen, J., Dewitt, D. J., Tian, F., and Wang, Y. 2000. NiagaraCQ: A scalable continuous query system for Internet databases. In Proceedings of SIGMOD 2000. ACM, New York, 379--390.]] Google ScholarDigital Library
- Clark, J. 1999. XSL transformations (XSLT)---version 1.0. http://www.w3.org/TR/xslt.]]Google Scholar
- Clark, J., and DeRose, S. 1999. XML path language (XPath)---version 1.0. http://www.w3.org/TR/xpath.]]Google Scholar
- Cover, R. 1999. The SGML/XML Web Page. http://www.w3.org/TR/xslt.]]Google Scholar
- DeRose, S., Daniel Jr., R., and Maler, E. 1999. XML pointer language (XPointer). http://www.w3.org/TR/WD-xptr.]]Google Scholar
- Diaz, A. L., and Lovell, D. 1999. XML generator. http://www.alphaworks.ibm.com/tech/xmlgenerator.]]Google Scholar
- Fabret, F., Jacobsen, H-A., Llirbat, F., Pereira, J., Ross, K. A., and Shasha, D. 2001. Filtering algorithms and implementation for very fast publish/subscribe systems. In Proceedings of SIGMOD 2001. ACM, New York, 115--126.]] Google ScholarDigital Library
- Foltz, P. W., and Dumais, S. T. 1992. Personalized information delivery: An analysis of information filtering methods. Commun. ACM 35, 12, 51--60.]] Google ScholarDigital Library
- Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of VLDB 1997. Morgan-Kaufmann, San Francisco, Calif., 436--445.]] Google ScholarDigital Library
- Green, T. J., Miklau, G., Onizuka, M., and Suciu, D. 2003. Processing XML streams with deterministic automata. In Proceedings of ICDT 2003. Springer, Berlin, Germany, 173--189.]] Google ScholarDigital Library
- Hanson, E. N., Carnes, C., Huang, L., Konyala, M., Noronha, L., Parthasarathy, S., Park, J. B., and Vernon, A. 1999. Scalable trigger processing. In Proceedings of ICDE 1999. IEEE Computer Society Press, Los Alamitos, Calif., 266--275.]] Google ScholarDigital Library
- Hopcroft, J. E., and Ullman, J. D. 1979. Introduction to Automata Theory, Languages and Computation. Addition-Wesley, Boston, Mass.]] Google ScholarDigital Library
- Ives, Z., Levy, A., and Weld, D. 2000. Efficient evaluation of regular path expressions on streaming XML data. Technical Report, University of Washington, Seattle, Wash.]]Google Scholar
- Kay, M. 2001. Saxon: the XSLT processor. http://users.iclway.co.uk/mhkay/saxon/.]]Google Scholar
- Lakshmanan, L. V. S., and Parthasarathy, S. 2002. On efficient matching of streaming XML documents and queries. In Proceedings of EDBT 2002. Springer, Berlin, Germany, 142--160.]] Google ScholarDigital Library
- Ley, M. 2001. DBLP DTD. http://www.acm.org/sigmod/dblp/db/about/dblp.dtd.]]Google Scholar
- Liu, L., Pu, C., and Tang, W. 1999. Continual queries for Internet scale event-driven information delivery. Special Issue on Web Technologies, IEEE TKDE, 11, 4, 610--628.]] Google ScholarDigital Library
- Madden, S., Shah, M., Hellerstein, J. M., and Raman, V. 2002. Continuously adaptive continuous queries over streams. In Proceedings of SIGMOD 2002. ACM, New York, 49--60.]] Google ScholarDigital Library
- Nestorov, S., Ullman, J. D., Wiener, J. L., and Chawathe, S. S. 1997. Representative objects: Concise representations of semistrctured hierarchical data. In Proceedings of ICDE 1997. IEEE Computer Society Press, Los Alamitos, Calif., 79--90.]] Google ScholarDigital Library
- Nguyen, B., Abiteboul, S., Cobena, G., and Preda, M. 2001. Monitoring XML data on the Web. In Proceedings of SIGMOD 2001. ACM, New York, 437--448.]] Google ScholarDigital Library
- Ozen, B., Kilic, O., Altinel, M., and Dogac, A. 2001. Highly personalized information delivery to mobile clients. In Proceedings of the 2nd ACM International Workshop on Data Engineering for Wireless and Mobile Access. ACM, New York, 35--41.]] Google ScholarDigital Library
- Pereira, J., Fabret, F., Llirbat, F., and Jacobsen, H.-A. 2001. WebFilter: A high-throughput XML-based publish and subscribe System. In Proceedings of VLDB 2001. Morgan-Kaufmann, San Francisco, Calif., 723--724.]] Google ScholarDigital Library
- Salton, G. 1989. Automatic Text Processing. Addison-Wesley, Boston, Mass.]] Google ScholarDigital Library
- Sax Project Organization. 2001. SAX: Simple API for XML. http://www. saxproject.org.]]Google Scholar
- Schreier, U., Pirahesh, H., Agrawal, R., and Mohan, C. 1991. Alert: An architecture for transforming a passive DBMS into an active DBMS. In Proceedings of VLDB 1991. Morgan-Kaufmann, San Francisco, Calif., 469--478.]] Google ScholarDigital Library
- Stonebraker, M., Jhingran, A., Goh, J., and Potamianos, S. 1990. On rules, procedures, caching and views in data base systems. In Proceedings of SIGMOD 1990. ACM, New York, 281--290.]] Google ScholarDigital Library
- Sun Microsystems, Inc. 2001. Java XML pack. Winter 01 update release. http://java.sun.com/xml/downloads/javaxmlpack.html.]]Google Scholar
- Terry, D. B., Goldberg, D., Nichols, D. A., and Oki, B. M. 1992. Continuous queries over append-only databases. In Proceedings of SIGMOD 1992. ACM, New York, 321--330.]] Google ScholarDigital Library
- Watson, B. W. 1997. Practical optimization for automata. In Proceedings of the 2nd International Workshop on Implementing Automata. Springer, Berlin, Germany, 232--240.]] Google ScholarDigital Library
- Widom, J. and Finklestein, S. J. 1990. Set-oriented production rules in relational database systems. In Proceedings of SIGMOD 1990. ACM, New York, 259--270.]] Google ScholarDigital Library
- Wutka. 2000. DTD parser. http://www.wutka.com/dtdparser.html.]]Google Scholar
- Yan, T. W. and Garcia-Molina, H. 1994. Index structures for selective dissemination of information under Boolean model. ACM Trans. Datab. Syst. 19, 2, 332--364.]] Google ScholarDigital Library
- Yan, T. W. and Garcia-Molina, H. 1999. The SIFT information dissemination system. ACM Trans. Datab. Syst. 24, 4, 529--565.]] Google ScholarDigital Library
Index Terms
- Path sharing and predicate evaluation for high-performance XML filtering
Recommendations
Value-based predicate filtering of XML documents
In recent years, publish-subscribe systems based on XML filtering have received much attention in ubiquitous computing environments and Internet applications. The main challenge is to process a large number of content against millions of user ...
XML filtering system based on ontology
A2CWiC '10: Proceedings of the 1st Amrita ACM-W Celebration on Women in Computing in IndiaIn recent years, publish subscribe systems based on XML document filtering has been on tremendous rise. Many filtering mechanisms exist such as XFilter, YFilter, AFilter etc. But most of these mechanisms do not filter the document for semantic matched ...
Efficient string-based XML stream prefiltering
ADC '12: Proceedings of the Twenty-Third Australasian Database Conference - Volume 124Whenever huge XML documents have to be evaluated according to a given XPath or XQuery query, parsing the whole document in form of e. g. SAX events is the baseline that is common to all evaluators. But typically only few parts of the document are really ...
Comments