DOI: 10.1145/2213836.2213864
research-article

NoDB: efficient query execution on raw data files

Published: 20 May 2012

ABSTRACT

As data collections grow ever larger, data loading becomes a major bottleneck. Many applications, e.g., scientific data analysis and social networks, already avoid database systems because of their complexity and the increased data-to-query time. For such applications, data collections keep growing fast, even on a daily basis, and we are already in the era of data deluge, where we have much more data than we can move or store, let alone analyze.

Our contribution in this paper is the design and roadmap of a new paradigm in database systems, called NoDB, which does not require data loading while still maintaining the whole feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and the lessons learned by implementing the NoDB philosophy over a modern DBMS, we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion costs. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure.
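The two bottlenecks named above (repeated tokenizing and repeated type conversion) and the two remedies (positional information and a cache) can be illustrated with a toy Python sketch. This is not PostgresRaw's implementation: the class `RawTable`, its methods, and its counters are hypothetical names chosen for illustration, and the sketch assumes well-formed, comma-delimited rows of integers with no quoting or escaping.

```python
class RawTable:
    """Toy in situ reader over delimited text (hypothetical, for illustration)."""

    def __init__(self, raw, delimiter=","):
        self.lines = raw.splitlines()
        self.delimiter = delimiter
        self.positions = {}   # (row, col) -> (start, end) offsets within the line
        self.cache = {}       # (row, col) -> already-converted integer value
        self.tokenized = 0    # number of fields located by scanning
        self.conversions = 0  # number of string-to-int conversions

    def _locate(self, row, col):
        """Return (start, end) of a field, scanning only as far as needed."""
        if (row, col) in self.positions:
            return self.positions[(row, col)]
        line = self.lines[row]
        step = len(self.delimiter)
        # Resume from the nearest known field to the left, if one exists.
        start_col, start = 0, 0
        for c in range(col - 1, -1, -1):
            if (row, c) in self.positions:
                start_col = c + 1
                start = self.positions[(row, c)][1] + step
                break
        for c in range(start_col, col + 1):
            end = line.find(self.delimiter, start)
            if end == -1:
                end = len(line)
            self.positions[(row, c)] = (start, end)
            self.tokenized += 1
            start = end + step
        return self.positions[(row, col)]

    def get_int(self, row, col):
        """Return the field as an int, converting it at most once."""
        key = (row, col)
        if key not in self.cache:
            start, end = self._locate(row, col)
            self.cache[key] = int(self.lines[row][start:end])
            self.conversions += 1
        return self.cache[key]

    def column(self, col):
        """Read one column across all rows."""
        return [self.get_int(r, col) for r in range(len(self.lines))]


table = RawTable("1,10,100,1000\n2,20,200,2000\n3,30,300,3000")
first = table.column(2)    # scans three fields per row, converts one per row
second = table.column(2)   # answered from the cache, no scanning or conversion
third = table.column(3)    # resumes after column 2, scans one field per row
```

In this sketch the structures fill in only as queries touch the data, which is the sense in which the map is adaptive: nothing is indexed or converted before a query asks for it.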

Our implementation over PostgreSQL, called PostgresRaw, is able to avoid the loading cost completely, while matching the query performance of plain PostgreSQL and even outperforming it in many cases. We conclude that NoDB systems are feasible to design and implement over modern database architectures, bringing an unprecedented positive effect in usability and performance.

Published in

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN: 9781450312479
DOI: 10.1145/2213836

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

SIGMOD '12 paper acceptance rate: 48 of 289 submissions, 17%
Overall acceptance rate: 785 of 4,003 submissions, 20%
