skip to main content
article

XML stream processing using tree-edit distance embeddings

Published:01 March 2005Publication History
Skip Abstract Section

Abstract

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case) upper bound of O(log2n log*n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.

References

  1. Acharya, S., Gibbons, P. B., Poosala, V., and Ramaswamy, S. 1999. Join synopses for approximate query answering. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, PA). 275--286. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alon, N., Gibbons, P. B., Matias, Y., and Szegedy, M. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Philadeplphia, PA). Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing (Philadelphia, PA). 20--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Altinel, M. and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 53--64. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Apostolico, A. and Galil, Z., Eds. 1997. Pattern Matching Algorithms. Oxford University Press, Oxford, U.K. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Arasu, A., Babcock, B., Babu, S., McAlister, J., and Widom, J. 2002. Characterizing memory requirements for queries over continuous data streams. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Madison, WI). 221--232. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the Twenty-First ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Madison, WI). 1--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D., and Trevisan, L. 2002. Counting distinct elements in a data stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM'02), (Cambridge, MA). Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Chakrabarti, K., Garofalakis, M., Rastogi, R., and Shim, K. 2000. Approximate query processing using wavelets. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 111--122. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chan, C.-Y., Felber, P., Garofalakis, M., and Rastogi, R. 2002. Efficient filtering of XML documents with XPath expressions. In Proceedings of the Eighteenth International Conference on Data Engineering (San Jose, CA). Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Charikar, M., Chen, K., and Farach-Colton, M. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming (Malaga, Spain). Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Charikar, M. and Sahai, A. 2002. Dimension reduction in the l1 norm. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science (Vancouver, B.C., Canada). Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Cormode, G., Datar, M., Indyk, P., and Muthukrishnan, S. 2002a. Comparing data streams using hamming norms (how to zero in). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 335--345. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Cormode, G., Indyk, P., Koudas, N., and Muthukrishnan, S. 2002b. Fast mining of massive tabular data via approximate distance computations. In Proceedings of the Eighteenth International Conference on Data Engineering (San Jose, CA). Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Cormode, G. and Muthukrishnan, S. 2002. The string edit distance matching problem with moves. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, CA). Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Dasu, T. and Johnson, T. 2003. Exploratory Data Mining and Data Cleaning. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Diao, Y., Altinel, M., Franklin, M. J., Zhang, H., and Fischer, P. 2003. Path sharing and predicate evaluation for high-performance XML filtering. ACM Trans. Database Syst. 28, 4 (Dec.), 467--516. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Diao, Y. and Franklin, M. 2003. Query processing for high-volume XML message brokering. In Proceedings of the 29th International Conference on Very Large Data Bases (Berlin, Germany). Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2002. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). 61--72. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2004. Sketch-based multi-query processing over data streams. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'2004, Heraklion-Crete, Greece).Google ScholarGoogle Scholar
  21. Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 1999. An approximate L1-difference algorithm for massive data streams. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science (New York City, NY). Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Florescu, D., Koller, D., and Levy, A. 1997. Using probabilistic information in data integration. In Proceedings of the 23rd International Conference on Very Large Data Bases (Athens, Greece). Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ganguly, S., Garofalakis, M., and Rastogi, R. 2004. Processing data-stream join aggregates using skimmed sketches. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'2004, Heraklion-Crete, Greece).Google ScholarGoogle Scholar
  24. Garofalakis, M., Gehrke, J., and Rastogi, R. 2002. Querying and mining data streams: you only get one look (Tutorial). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China).Google ScholarGoogle Scholar
  25. Garofalakis, M. and Gibbons, P. B. 2001. Approximate query processing: Taming the terabytes (Tutorial). In Proceedings of the 27th International Conference on Very Large Data Bases (Roma, Italy). Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Garofalakis, M. and Kumar, A. 2003. Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (San Diego, CA). 143--154. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Gilbert, A. C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2002a. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th Annual ACM Symposium on the Theory of Computing (Montreal, P.Q., Canada). Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2001. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy). Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2002b. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 454--465. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Gravano, L., Ipeirotis, P. G., Jagadish, H., Koudas, N., Muthuksrishnan, S., and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy). Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Greenwald, M. and Khanna, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (Santa Barbara, CA). Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Guha, S., Jagadish, H., Koudas, N., Srivastava, D., and Yu, T. 2002. Approximate XML joins. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Gupta, A. and Suciu, D. 2003. Stream processing of XPath queries with predicates. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (San Diego, CA). Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Indyk, P. 2000. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (Redondo Beach, CA). 189--197. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Indyk, P. 2001. Algorithmic aspects of geometric embeddings. In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science (Las Vegas, NV). Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Indyk, P., Koudas, N., and Muthukrishnan, S. 2000. Identifying representative trends in massive time series data sets using sketches. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 363--372. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Ioannidis, Y. E. and Poosala, V. 1999. Histogram-based approximation of set-valued query answers. In Proceedings of the 25th International Conference on Very Large Data Bases (Edinburgh, Scotland). Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Johnson, W. B. and Lindenstrauss, J. 1984. Extensions of lipschitz mappings into Hilbert space. Contemp. Math. 26, 189--206.Google ScholarGoogle ScholarCross RefCross Ref
  39. Karp, R. M. and Rabin, M. O. 1987. Efficient randomized pattern-matching algorithms. IBM J. Res. Devel. 31, 2 (Mar.), 249--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Knuth, D. E. 1973. The Art of Computer Programming (Vol. 1/Fundamental Algorithms). Addison-Wesley, Reading, MA.Google ScholarGoogle Scholar
  41. Lakshmanan, L. V. S. and Parthasarathy, S. 2002. On efficient matching of streaming XML documents and queries. In Proceedings of the 8th International Conference on Extending Database Technology (EDBT'2002, Prague, Czech Republic). Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Manku, G. S. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 346--357. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press, Cambridge, U.K. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Nolan, J. P. 2004. Stable distributions: Models for heavy-tailed data. Available online at http://academic2.american.edu/~jpnolan/stable/stable.html.Google ScholarGoogle Scholar
  45. Polyzotis, N. and Garofalakis, M. 2002. Statistical synopses for graph-structured XML databases. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Polyzotis, N., Garofalakis, M., and Ioannidis, Y. 2004. Approximate XML query answers. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (Paris, France). Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Schmidt, A., Waas, F., Kersten, M., Carey, M. J., Manolescu, I., and Busse, R. 2002. XMark: A benchmark for XML data management. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Shapira, D. and Storer, J. A. 2002. Edit distance with move operations. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM'2002), Fukuoka, Japan). 85--98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Thaper, N., Guha, S., Indyk, P., and Koudas, N. 2002. Dynamic multidimensional histograms. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). 428--439. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Uchaikin, V. V. and Zolotarev, V. M. 1999. Chance and Stability : Stable Distributions and their Applications. VSP, Utrecht, The Netherland.Google ScholarGoogle Scholar
  51. Ukkonen, E. 1992. Approximate string matching with q-grams and maximal matches. Theoret. Comput. Sci. 92, 191--211. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Vitter, J. S. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, PA). Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Zhang, K. and Shasha, D. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18, 6 (Dec.), 1245--1262. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. XML stream processing using tree-edit distance embeddings

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader