article

XML stream processing using tree-edit distance embeddings

Authors:
Minos Garofalakis

Bell Labs, Lucent Technologies, Murray Hill, NJ

Bell Labs, Lucent Technologies, Murray Hill, NJ
View Profile

,
Amit Kumar

Indian Institute of Technology, New Delhi, India

Indian Institute of Technology, New Delhi, India
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 30 Issue 1pp 279–332https://doi.org/10.1145/1061318.1061326

Published:01 March 2005Publication History

ACM Transactions on Database Systems

Abstract

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L₁ vector space while guaranteeing a (worst-case) upper bound of O(log²n log*n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.

References

Acharya, S., Gibbons, P. B., Poosala, V., and Ramaswamy, S. 1999. Join synopses for approximate query answering. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, PA). 275--286. Google ScholarDigital Library
Alon, N., Gibbons, P. B., Matias, Y., and Szegedy, M. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Philadeplphia, PA). Google ScholarDigital Library
Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing (Philadelphia, PA). 20--29. Google ScholarDigital Library
Altinel, M. and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 53--64. Google ScholarDigital Library
Apostolico, A. and Galil, Z., Eds. 1997. Pattern Matching Algorithms. Oxford University Press, Oxford, U.K. Google ScholarDigital Library
Arasu, A., Babcock, B., Babu, S., McAlister, J., and Widom, J. 2002. Characterizing memory requirements for queries over continuous data streams. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Madison, WI). 221--232. Google ScholarDigital Library
Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the Twenty-First ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Madison, WI). 1--16. Google ScholarDigital Library
Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D., and Trevisan, L. 2002. Counting distinct elements in a data stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM'02), (Cambridge, MA). Google ScholarDigital Library
Chakrabarti, K., Garofalakis, M., Rastogi, R., and Shim, K. 2000. Approximate query processing using wavelets. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 111--122. Google ScholarDigital Library
Chan, C.-Y., Felber, P., Garofalakis, M., and Rastogi, R. 2002. Efficient filtering of XML documents with XPath expressions. In Proceedings of the Eighteenth International Conference on Data Engineering (San Jose, CA). Google ScholarDigital Library
Charikar, M., Chen, K., and Farach-Colton, M. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming (Malaga, Spain). Google ScholarDigital Library
Charikar, M. and Sahai, A. 2002. Dimension reduction in the l1 norm. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science (Vancouver, B.C., Canada). Google ScholarDigital Library
Cormode, G., Datar, M., Indyk, P., and Muthukrishnan, S. 2002a. Comparing data streams using hamming norms (how to zero in). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 335--345. Google ScholarDigital Library
Cormode, G., Indyk, P., Koudas, N., and Muthukrishnan, S. 2002b. Fast mining of massive tabular data via approximate distance computations. In Proceedings of the Eighteenth International Conference on Data Engineering (San Jose, CA). Google ScholarDigital Library
Cormode, G. and Muthukrishnan, S. 2002. The string edit distance matching problem with moves. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, CA). Google ScholarDigital Library
Dasu, T. and Johnson, T. 2003. Exploratory Data Mining and Data Cleaning. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, NY. Google ScholarDigital Library
Diao, Y., Altinel, M., Franklin, M. J., Zhang, H., and Fischer, P. 2003. Path sharing and predicate evaluation for high-performance XML filtering. ACM Trans. Database Syst. 28, 4 (Dec.), 467--516. Google ScholarDigital Library
Diao, Y. and Franklin, M. 2003. Query processing for high-volume XML message brokering. In Proceedings of the 29th International Conference on Very Large Data Bases (Berlin, Germany). Google ScholarDigital Library
Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2002. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). 61--72. Google ScholarDigital Library
Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2004. Sketch-based multi-query processing over data streams. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'2004, Heraklion-Crete, Greece).Google Scholar
Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 1999. An approximate L1-difference algorithm for massive data streams. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science (New York City, NY). Google ScholarDigital Library
Florescu, D., Koller, D., and Levy, A. 1997. Using probabilistic information in data integration. In Proceedings of the 23rd International Conference on Very Large Data Bases (Athens, Greece). Google ScholarDigital Library
Ganguly, S., Garofalakis, M., and Rastogi, R. 2004. Processing data-stream join aggregates using skimmed sketches. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'2004, Heraklion-Crete, Greece).Google Scholar
Garofalakis, M., Gehrke, J., and Rastogi, R. 2002. Querying and mining data streams: you only get one look (Tutorial). In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China).Google Scholar
Garofalakis, M. and Gibbons, P. B. 2001. Approximate query processing: Taming the terabytes (Tutorial). In Proceedings of the 27th International Conference on Very Large Data Bases (Roma, Italy). Google ScholarDigital Library
Garofalakis, M. and Kumar, A. 2003. Correlating XML data streams using tree-edit distance embeddings. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (San Diego, CA). 143--154. Google ScholarDigital Library
Gilbert, A. C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2002a. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th Annual ACM Symposium on the Theory of Computing (Montreal, P.Q., Canada). Google ScholarDigital Library
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2001. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy). Google ScholarDigital Library
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. 2002b. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 454--465. Google ScholarDigital Library
Gravano, L., Ipeirotis, P. G., Jagadish, H., Koudas, N., Muthuksrishnan, S., and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy). Google ScholarDigital Library
Greenwald, M. and Khanna, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (Santa Barbara, CA). Google ScholarDigital Library
Guha, S., Jagadish, H., Koudas, N., Srivastava, D., and Yu, T. 2002. Approximate XML joins. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). Google ScholarDigital Library
Gupta, A. and Suciu, D. 2003. Stream processing of XPath queries with predicates. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (San Diego, CA). Google ScholarDigital Library
Indyk, P. 2000. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (Redondo Beach, CA). 189--197. Google ScholarDigital Library
Indyk, P. 2001. Algorithmic aspects of geometric embeddings. In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science (Las Vegas, NV). Google ScholarDigital Library
Indyk, P., Koudas, N., and Muthukrishnan, S. 2000. Identifying representative trends in massive time series data sets using sketches. In Proceedings of the 26th International Conference on Very Large Data Bases (Cairo, Egypt). 363--372. Google ScholarDigital Library
Ioannidis, Y. E. and Poosala, V. 1999. Histogram-based approximation of set-valued query answers. In Proceedings of the 25th International Conference on Very Large Data Bases (Edinburgh, Scotland). Google ScholarDigital Library
Johnson, W. B. and Lindenstrauss, J. 1984. Extensions of lipschitz mappings into Hilbert space. Contemp. Math. 26, 189--206.Google ScholarCross Ref
Karp, R. M. and Rabin, M. O. 1987. Efficient randomized pattern-matching algorithms. IBM J. Res. Devel. 31, 2 (Mar.), 249--260. Google ScholarDigital Library
Knuth, D. E. 1973. The Art of Computer Programming (Vol. 1/Fundamental Algorithms). Addison-Wesley, Reading, MA.Google Scholar
Lakshmanan, L. V. S. and Parthasarathy, S. 2002. On efficient matching of streaming XML documents and queries. In Proceedings of the 8th International Conference on Extending Database Technology (EDBT'2002, Prague, Czech Republic). Google ScholarDigital Library
Manku, G. S. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). 346--357. Google ScholarDigital Library
Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge University Press, Cambridge, U.K. Google ScholarDigital Library
Nolan, J. P. 2004. Stable distributions: Models for heavy-tailed data. Available online at http://academic2.american.edu/~jpnolan/stable/stable.html.Google Scholar
Polyzotis, N. and Garofalakis, M. 2002. Statistical synopses for graph-structured XML databases. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). Google ScholarDigital Library
Polyzotis, N., Garofalakis, M., and Ioannidis, Y. 2004. Approximate XML query answers. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (Paris, France). Google ScholarDigital Library
Schmidt, A., Waas, F., Kersten, M., Carey, M. J., Manolescu, I., and Busse, R. 2002. XMark: A benchmark for XML data management. In Proceedings of the 28th International Conference on Very Large Data Bases (Hong Kong, China). Google ScholarDigital Library
Shapira, D. and Storer, J. A. 2002. Edit distance with move operations. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM'2002), Fukuoka, Japan). 85--98. Google ScholarDigital Library
Thaper, N., Guha, S., Indyk, P., and Koudas, N. 2002. Dynamic multidimensional histograms. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, WI). 428--439. Google ScholarDigital Library
Uchaikin, V. V. and Zolotarev, V. M. 1999. Chance and Stability : Stable Distributions and their Applications. VSP, Utrecht, The Netherland.Google Scholar
Ukkonen, E. 1992. Approximate string matching with q-grams and maximal matches. Theoret. Comput. Sci. 92, 191--211. Google ScholarDigital Library
Vitter, J. S. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (Philadelphia, PA). Google ScholarDigital Library
Zhang, K. and Shasha, D. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18, 6 (Dec.), 1245--1262. Google ScholarDigital Library

Index Terms

XML stream processing using tree-edit distance embeddings

Recommendations

Correlating XML data streams using tree-edit distance embeddings
PODS '03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is ...
Read More
XSKETCH synopses for XML data graphs

Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language ...
Read More
Analysis of tree edit distance on XML data
CIIT '07: The Sixth IASTED International Conference on Communications, Internet, and Information Technology

The problem of comparing tree structures occurs in various areas in computer science and engineering, including the application to XML data processing. To solve this problem, tree edit distance is a common and significant measurement defining the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 30, Issue 1
Special Issue: SIGMOD/PODS 2003
March 2005
332 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1061318
Issue’s Table of Contents

Copyright © 2005 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 2005
Published in tods Volume 30, Issue 1

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
XML
approximate query processing
data streams
data synopses
metric-space embeddings
tree-edit distance
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 71
  Total Citations
  View Citations
- 1,418
  Total Downloads
- Downloads (Last 12 months)7
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

XML stream processing using tree-edit distance embeddings

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Correlating XML data streams using tree-edit distance embeddings

XSKETCH synopses for XML data graphs

Analysis of tree edit distance on XML data

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

XML stream processing using tree-edit distance embeddings

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Correlating XML data streams using tree-edit distance embeddings

XSKETCH synopses for XML data graphs

Analysis of tree edit distance on XML data

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media