research-article

NoDB: efficient query execution on raw data files

Authors:
Ioannis Alagiannis, EPFL, Lausanne, Switzerland
Renata Borovica, EPFL, Lausanne, Switzerland
Miguel Branco, EPFL, Lausanne, Switzerland
Stratos Idreos, CWI, Amsterdam, The Netherlands
Anastasia Ailamaki, EPFL, Lausanne, Switzerland

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, May 2012, Pages 241–252
https://doi.org/10.1145/2213836.2213864

Published: 20 May 2012

ABSTRACT

As data collections grow ever larger, data loading is becoming a major bottleneck. Many applications, e.g., scientific data analysis and social networks, already avoid database systems because of their complexity and the increased data-to-query time. For such applications, data collections keep growing fast, even on a daily basis, and we are already in the era of the data deluge, where we have far more data than we can move or store, let alone analyze.

Our contribution in this paper is the design and roadmap of a new paradigm in database systems, called NoDB, in which systems do not require data loading while still maintaining the whole feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and the lessons learned from implementing the NoDB philosophy over a modern DBMS, we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion costs. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure.
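To make the idea of positional information and caching concrete, here is a minimal sketch in Python. It is not the PostgresRaw implementation: the class `RawCsvTable`, its fields, and the counters are hypothetical names chosen for illustration, and it assumes a well-formed, comma-delimited file of integer fields held in memory. The sketch tokenizes a row only as far as a query needs, remembers the byte offsets it found, and caches converted values so later queries skip both tokenizing and type conversion.

```python
class RawCsvTable:
    """Illustrative only: queries a raw CSV byte buffer without loading it."""

    def __init__(self, data: bytes, delimiter: bytes = b","):
        self.data = data
        self.delimiter = delimiter
        self.positions = {}        # (row, col) -> (start, end) byte offsets
        self.cache = {}            # (row, col) -> value already converted to int
        self.fields_tokenized = 0  # counts tokenizing work done so far
        self.values_converted = 0  # counts type conversions done so far
        # Record where each row starts and ends (one pass over newlines).
        self.rows = []
        start = 0
        while start < len(data):
            end = data.find(b"\n", start)
            if end == -1:
                end = len(data)
            self.rows.append((start, end))
            start = end + 1

    def _locate(self, row: int, col: int):
        """Return the byte range of a field, tokenizing only what is unknown."""
        if (row, col) in self.positions:
            return self.positions[(row, col)]
        row_start, row_end = self.rows[row]
        # Resume from the nearest already-mapped field to the left, if any.
        c = col
        while c > 0 and (row, c) not in self.positions:
            c -= 1
        if (row, c) in self.positions:
            pos = self.positions[(row, c)][1] + len(self.delimiter)
            c += 1
        else:
            pos = row_start
        while True:
            end = self.data.find(self.delimiter, pos, row_end)
            if end == -1:
                end = row_end
            self.positions[(row, c)] = (pos, end)
            self.fields_tokenized += 1
            if c == col:
                return pos, end
            pos = end + len(self.delimiter)
            c += 1

    def column(self, col: int):
        """Return one column as integers, reusing the map and the cache."""
        out = []
        for row in range(len(self.rows)):
            key = (row, col)
            if key not in self.cache:
                start, end = self._locate(row, col)
                self.cache[key] = int(self.data[start:end])
                self.values_converted += 1
            out.append(self.cache[key])
        return out
```

In this sketch, a first call to `column(2)` on a three-column file tokenizes every field up to the third one in each row, while a later call to `column(1)` finds the offsets already recorded and only pays for type conversion. Repeating either call does no further tokenizing or conversion.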

Our implementation over PostgreSQL, called PostgresRaw, avoids the loading cost completely while matching the query performance of plain PostgreSQL and even outperforming it in many cases. We conclude that NoDB systems are feasible to design and implement over modern database architectures, with an unprecedented positive effect on usability and performance.

Published in
SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN:9781450312479
DOI:10.1145/2213836
General Chairs: K. Selçuk Candan (Arizona State University), Yi Chen (Arizona State University), Richard Snodgrass (University of Arizona)
Program Chair: Luis Gravano (Columbia University)
Publications Chair: Ariel Fuxman (Microsoft Research)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
adaptive loading
in situ querying
positional map
Qualifiers
- research-article
Acceptance Rates
SIGMOD '12 Paper Acceptance Rate: 48 of 289 submissions, 17%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
Article Metrics
- Total Citations: 132
- Total Downloads: 1,896
- Downloads (Last 12 months): 67
- Downloads (Last 6 weeks): 11