skip to main content
10.5555/1182635.1164201acmconferencesArticle/Chapter ViewAbstractPublication PagesvldbConference Proceedingsconference-collections
Article

How to wring a table dry: entropy compression of relations and querying of compressed relations

Published:01 September 2006Publication History

ABSTRACT

We present a method to compress relations close to their entropy while still allowing efficient queries. Column values are encoded into variable length codes to exploit skew in their frequencies. The codes in each tuple are concatenated and the resulting tuplecodes are sorted and delta-coded to exploit the lack of ordering in a relation. Correlation is exploited either by co-coding correlated columns, or by using a sort order that leverages the correlation. We prove that this method leads to near-optimal compression (within 4.3 bits/tuple of entropy), and in practice, we obtain up to a 40 fold compression ratio on vertical partitions tuned for TPC-H queries.We also describe initial investigations into efficient querying over compressed data. We present a novel Huffman coding scheme, called segregated coding, that allows range and equality predicates on compressed data, without accessing the full dictionary. We also exploit the delta coding to speed up scans, by reusing computations performed on nearly identical records. Initial results from a prototype suggest that with these optimizations, we can efficiently scan, tokenize and apply predicates on compressed relations.

References

  1. {1} Bala Iyer, David Wilhite. Data Compression Support in Data Bases. In VLDB 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. {2} G. Antoshenkov, D. Lomet, J. Murray. Order Preserving String Compression. In ICDE 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. {3} T. Cover, J. Thomas. Elements of Information Theory. John Wiley, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. {4} Eric W. Weisstein et al. "Catalan Number." In MathWorld. http://mathworld.wolfram.com/CatalanNumber.html.Google ScholarGoogle Scholar
  5. {5} A. Ailamaki, D. J. DeWitt, M. D. Hill, M. Skounakis Weaving relations for cache performance. In VLDB 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. {6} M. Stonebraker et al. C-Store: A Column Oriented DBMS. In VLDB 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. {7} S. Babu et al. SPARTAN: A model-based semantic compression system for massive tables. SIGMOD 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. {8} J. Goldstein, R. Ramakrishnan, U. Shaft. Compressing Relations and Indexes. In ICDE 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. {9} G. V. Cormack and R. N. Horspool. Data Compression using Dynamic Markov Modelling, Computer Journal 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. {10} G. V. Cormack, Data Compression In a Database System, Comm. of the ACM 28(12), 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. {11} G. Copeland and S. Khoshafian. A decomposition storage model. In SIGMOD 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. {12} Sybase IQ. www.sybase.com/products/informationmanagementGoogle ScholarGoogle Scholar
  13. {13} D. Knuth. The Art of Computer Programming. Addison Wesley, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. {14} S. Barnard and J. M. Child. Higher Algebra, Macmillan India Ltd., 1994.Google ScholarGoogle Scholar
  15. {15} T. C. Hu, A. C. Tucker, Optimal computer search trees and variable-length alphabetic cods, SIAM J. Appl. Math, 1971.Google ScholarGoogle Scholar
  16. {16} Huffman, D. A method for the construction of minimum redundancy codes. Proc. I.R.E. 40. 9. 1952.Google ScholarGoogle Scholar
  17. {17} Jim Gray. Commentary on 2005 Datamation benchmark. http://research.microsoft.com/barc/SortBenchmarkGoogle ScholarGoogle Scholar
  18. {18} A. Zandi et al. Sort Order Preserving Data Compresion for Extended Alphabets. Data Compression Conference 1993.Google ScholarGoogle Scholar
  19. {19} K. Bharat et al. The Connectivity server: Fast access to Linkage information on the web. In WWW 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. {20} G. Graefe and L. Shapiro. Data Compression and Database Performance. In Symp on Applied Computing, 1991.Google ScholarGoogle Scholar
  21. {21} M. Poess et al. Data compression in Oracle. In VLDB 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. How to wring a table dry: entropy compression of relations and querying of compressed relations

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader