skip to main content
10.5555/1182635.1164163acmconferencesArticle/Chapter ViewAbstractPublication PagesvldbConference Proceedingsconference-collections
Article

Composite subset measures

Authors Info & Claims
Published:01 September 2006Publication History

ABSTRACT

Measures are numeric summaries of a collection of data records produced by applying aggregation functions. Summarizing a collection of subsets of a large dataset, by computing a measure for each subset in the (typically, user-specified) collection is a fundamental problem. The multidimensional data model, which treats records as points in a space defined by dimension attributes, offers a natural space of data subsets to be considered as summarization candidates, and traditional SQL and OLAP constructs, such as GROUP BY and CUBE, allow us to compute measures for subsets drawn from this space. However, GROUP BY only allows us to summarize a limited collection of subsets, and CUBE summarizes all subsets in this space. Further, they restrict the measure used to summarize a data subset to be a one-step aggregation, using functions such as SUM, of field-values in the data records.In this paper, we introduce composite subset measures, computed by aggregating not only data records but also the measures of other related subsets. We allow summarization of naturally related regions in the multidimensional space, offering more flexibility than either GROUP BY or CUBE in the choice of what data subsets to summarize. Thus, our framework allows more meaningful summaries to be computed for a targeted collection of data subsets.We propose an algebra called AW-RA and an equivalent pictorial language called aggregation workflows. Aggregation workflows allow for intuitive expression of composite measure queries, and the underlying algebra is designed to facilitate efficient multiscan execution. We describe an evaluation framework based on multiple passes of sorting and scanning over the original dataset. In each pass, several measures are evaluated simultaneously, and dependencies between these measures and containment relationships between the underlying subsets of data are orchestrated to reduce the memory footprint of the computation. We present a performance evaluation that demonstrates the benefits of our approach.

References

  1. {1} S. Agarwal, R. Agrawal, et. al., On the Computation of Multi-dimensional Aggregates, in VLDB'96, 1996, 506-521. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. {2} M.O. Akinde and M.H. Böhlen, Efficient Computation of Subqueries in Complex OLAP. in ICDE, 2003, 163.Google ScholarGoogle ScholarCross RefCross Ref
  3. {3} D. Chatziantoniou, Evaluation of Ad Hoc OLAP: In-Place Computation. in SSDBM, 1999, 34-43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. {4} D. Chatziantoniou, M.O. Akinde, et. al. The MD-join: An Operator for Complex OLAP, in ICDE'01, 2001, 524-533. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. {5} D. Chatziantoniou and K.A. Ross, Querying Multiple Features of Groups in Relational Databases, in VLDB '96, 1996, 295-306. 5 Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. {6} S. Chaudhuri and K. Shim, Optimizing Queries with Aggregate Views, in EDBT, 1996, 167-182. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. {7} B. Chen, V. Yegneswaran, P. Barford and R. Ramakrishnan: Toward a Query Language for Network Attack Data. ICDE NetDB Workshops 2006: 28 Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. {8} L. Chen, R. Ramakrishnan et. al. Composite Subset Meausres, Technical Report 1557, University of Wisconsin - Madison. http://www.cs.wisc.edu/techreports/Google ScholarGoogle Scholar
  9. {9} Z. Chen and V. Narasayya, Efficient computation of multiple group by queries, in SIGMOD '05, 2005, 263-274. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. {10} J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, In OSDI'04, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. {11} DShield Project, http://www.dshield.orgGoogle ScholarGoogle Scholar
  12. {12} S. Ghemawat, H. Gobioff and S. T. Leung, The Google File System, in SOSP'03, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. {13} G. Graefe, Query evaluation techniques for large databases, ACM Comput. Surv., vol. 25, pp. 73-169, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. {14} J. Gray, S. Chaudhuri, et. al., Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals. Data Min. Knowl. Discov., vol. 1, pp. 29-53, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. {15} A. Gupta, V. Harinarayan, et. al., Aggregate-Query Processing in Data Warehousing Environments, in VLDB '95, 1995, 358-369. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. {16} H. Gupta, V. Harinarayan, et. al., Index Selection for OLAP, in ICDE'97, 1997, 208-219. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. {17} Hadoop Project, http://lucene.apache.org/hadoop/Google ScholarGoogle Scholar
  18. {18} Z. Huang, L. Chen, J. Cai, D. S. Gross, D. R. Musicant, R. Ramakrishnan, J. J. Schauer, S. J. Wright: Mass Spectrum Labeling: Theory and Practice. ICDM 2004: 122-129. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. {19} T. Johnson and D. Chatziantoniou, Extending complex ad-hoc OLAP, in CIKM, 1999, 170-179. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. {20} Microsoft MDX Specification http://msdn.microsoft.com/Google ScholarGoogle Scholar
  21. {21} R. Pang, V. Yegneswaran, et. al. Characteristics of Internet Background Radiation, In IMC'04, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. {22} R. Pike, S. Dorward, R. Griesemer and S. Quinlan, Interpreting the Data: Parallel Analysis with Sawzall, SOSP'03Google ScholarGoogle Scholar
  23. {23} K.A. Ross, D. Srivastava, et. al., Complex Aggregation at Multiple Granularities, in EDBT '98, 1998, 263-277. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. {24} D.B. Shmoys, Tardos, An approximation algorithm for the generalized assignment problem, Math. Program., vol. 62. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. {25} A. Shukla, P. Deshpande, et. al., Materialized View Selection for Multidimensional Datasets, in VLDB '98, 488-499. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. {26} A. Witkowski, S. Bellamkonda, et. al., Spreadsheets in RDBMS for OLAP, In SIGMOD '03, 2003, 52-63. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. {27} W.P. Yan and P.A. Larson, Eager Aggregation and Lazy Aggregation, in VLDB, 1995, 345-357. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Composite subset measures

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader