Article

Composite subset measures

Authors:
Lei Chen
Computer Sciences Department, University of Wisconsin, Madison, WI

Raghu Ramakrishnan
Computer Sciences Department, University of Wisconsin, Madison, WI and Yahoo! Research, Santa Clara, CA

Paul Barford
Computer Sciences Department, University of Wisconsin, Madison, WI

Bee-Chung Chen
Computer Sciences Department, University of Wisconsin, Madison, WI

Vinod Yegneswaran
Computer Sciences Department, University of Wisconsin, Madison, WI

VLDB '06: Proceedings of the 32nd international conference on Very large data bases, September 2006, Pages 403–414

Published: 01 September 2006

ABSTRACT

Measures are numeric summaries of a collection of data records produced by applying aggregation functions. Summarizing a collection of subsets of a large dataset by computing a measure for each subset in the (typically user-specified) collection is a fundamental problem. The multidimensional data model, which treats records as points in a space defined by dimension attributes, offers a natural space of data subsets to be considered as summarization candidates, and traditional SQL and OLAP constructs, such as GROUP BY and CUBE, allow us to compute measures for subsets drawn from this space. However, GROUP BY only allows us to summarize a limited collection of subsets, and CUBE summarizes all subsets in this space. Further, they restrict the measure used to summarize a data subset to be a one-step aggregation, using functions such as SUM, of field-values in the data records.

In this paper, we introduce composite subset measures, computed by aggregating not only data records but also the measures of other related subsets. We allow summarization of naturally related regions in the multidimensional space, offering more flexibility than either GROUP BY or CUBE in the choice of what data subsets to summarize. Thus, our framework allows more meaningful summaries to be computed for a targeted collection of data subsets.

We propose an algebra called AW-RA and an equivalent pictorial language called aggregation workflows. Aggregation workflows allow for intuitive expression of composite measure queries, and the underlying algebra is designed to facilitate efficient multiscan execution. We describe an evaluation framework based on multiple passes of sorting and scanning over the original dataset. In each pass, several measures are evaluated simultaneously, and dependencies between these measures and containment relationships between the underlying subsets of data are orchestrated to reduce the memory footprint of the computation. We present a performance evaluation that demonstrates the benefits of our approach.
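
The notion of a composite subset measure can be illustrated with a minimal Python sketch. This is not the paper's AW-RA algebra or its sort/scan evaluation framework; the record layout (`day`, `hour`, `source`), the sample data, and the function names `hourly_counts` and `daily_avg_hourly_count` are all assumptions made for illustration. The first function computes a basic measure from raw records, and the second computes a measure for a larger region by aggregating the measures of the regions it contains.

```python
from collections import defaultdict

# Hypothetical records as (day, hour, source) tuples; not data from the paper.
records = [
    ("mon", 0, "a"),
    ("mon", 0, "b"),
    ("mon", 1, "a"),
    ("tue", 0, "c"),
    ("tue", 3, "c"),
    ("tue", 3, "d"),
    ("tue", 3, "e"),
]


def hourly_counts(rows):
    """Basic measure: number of records in each (day, hour) region."""
    counts = defaultdict(int)
    for day, hour, _source in rows:
        counts[(day, hour)] += 1
    return dict(counts)


def daily_avg_hourly_count(hourly):
    """Composite measure: for each day, average the measures of its
    (day, hour) child regions instead of aggregating raw records.
    Assumption: hours with no records are simply absent, so they do
    not contribute a zero to the average."""
    per_day = defaultdict(list)
    for (day, _hour), count in hourly.items():
        per_day[day].append(count)
    return {day: sum(values) / len(values) for day, values in per_day.items()}


hourly = hourly_counts(records)
daily = daily_avg_hourly_count(hourly)
```

Here `hourly` holds one count per (day, hour) region and `daily` holds the per-day average of those counts. The second measure takes the output of another aggregation as its input, which is the kind of dependency that a one-step aggregate such as SUM over field values does not capture on its own.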

Index Terms

Composite subset measures
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
    2. Query languages
  2. Information retrieval
    1. Evaluation of retrieval results


Published in
VLDB '06: Proceedings of the 32nd international conference on Very large data bases
September 2006
1269 pages
Editors:
Umeshwar Dayal,
Kyu-Young Whang,
David Lomet,
Gustavo Alonso,
Guy Lohman,
Martin Kersten,
Sang K. Cha,
Young-Kuk Kim
Publisher
VLDB Endowment