skip to main content
article

An approach to phrase selection for offline data compression

Authors Info & Claims
Published:01 January 2002Publication History
Skip Abstract Section

Abstract

Recently several offline data compression schemes have been published that expend large amounts of computing resources when encoding a file, but decode the file quickly. These compressors work by identifying phrases in the input data, and storing the data as a series of pointer to these phrases. This paper explores the application of an algorithm for computing all repeating substrings within a string for phrase selection in an offline data compressor. Using our approach, we obtain compression similar to that of the best known offline compressors on genetic data, but poor results on general text. It seems, however, that an alternate approach based on selecting repeating substrings is feasible.

References

  1. {Apostolico & Lonardi, 2000} Apostolico, A. & Lonardi, S. (2000). Off-line compression by greedy textual substitution. Proc. IEEE, 88(11):1733-1744.Google ScholarGoogle ScholarCross RefCross Ref
  2. {Baghdadi et al, 2001} Baghdadi, L., Franek, F., Smyth, W. F. & Xiao, X. (2001). A fast space-efficient approach to substring refinement. Preprint.Google ScholarGoogle Scholar
  3. {Bentley & McIlroy, 1999} Bentley, J. & McIlroy, D. (1999). Data compression using long common strings. In Storer, J. A. & Cohn, M., editors, Proc. IEEE Data Compression Conference, pages 287-295, Snowbird, Utah. IEEE Computer Society Press, Los Alamitos, California. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. {Calgary, 2001} The Calgary Corpus (2001). http://links.uwaterloo.ca/calgary.corpus.html.Google ScholarGoogle Scholar
  5. {Cannane & Williams, 2001} Cannane, A. & Williams, H. (2001). General-purpose compression for efficient retrieval. Journal of the American Society for Information Science, 52(5):430-437. Google ScholarGoogle ScholarCross RefCross Ref
  6. {Canterbury, 2001} The Canterbury Corpus (2001). http://corpus.canterbury.ac.nz/.Google ScholarGoogle Scholar
  7. {Crochemore, 1981} Crochemore, M. (1981). An optimal algorithm for computing the repetitions in a word. Information Processing Letters, 12(5):244-250.Google ScholarGoogle ScholarCross RefCross Ref
  8. {Franek & Smyth, 2001} Franek, F. & Smyth, W. F. (2001). Crochemore's algorithm revisited --- a fast space-efficient approach. Preprint.Google ScholarGoogle Scholar
  9. {Larsson & Moffat, 2000} Larsson, N. & Moffat, A. (2000). Offline dictionary-based compression. Proc. IEEE, 88(11):1722-1732.Google ScholarGoogle ScholarCross RefCross Ref
  10. {Nevill-Manning & Witten, 1994} Nevill-Manning, C. G. & Witten, D. L. M. (1994). Compression by induction of hierarchical grammars. In Storer, J. A. & Cohn, M., editors, Proc. IEEE Data Compression Conference, pages 244-253, Snowbird, Utah. IEEE Computer Society Press, Los Alamitos, California.Google ScholarGoogle ScholarCross RefCross Ref
  11. {Purdue, 2001} The Purdue Corpus (2001). http://www.cs.purdue.edu/ homes/stelo/Off-line/.Google ScholarGoogle Scholar
  12. {Rubin, 1976} Rubin, F. (1976). Experiments in text file compression. Communications of the ACM, 19(11):617-623. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. {Shannon, 1948} Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal, 27:379-423, 623-656. Reproduced in {Shannon & Weaver, 1971}.Google ScholarGoogle ScholarCross RefCross Ref
  14. {Shannon & Weaver, 1971} Shannon, C. E. & Weaver, W. (1971). The Mathematical Theory of Communication. The University of Illinois Press, Urbana, Illinois. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. {Smyth & Tang, 2001} Smyth, W. F. & Tang, Y. (2001). Computing all repeats using O(n) space and O(n log n) time. Preprint.Google ScholarGoogle Scholar
  16. {Storer & Szymanski, 1982} Storer, J. A. & Szymanski, T. G. (1982). Data compression via textual substitution. Journal of the ACM, 29:928-951. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. {Turpin & Moffat, 2000} Turpin, A. & Moffat, A. (2000). Housekeeping for prefix coding. IEEE Transactions on Communications, 48(4):622-628.Google ScholarGoogle ScholarCross RefCross Ref
  18. {Ukkonen, 1995} Ukkonen, E. (1995). Online construction of suffix trees. Algorithmica, 14(3):249-260.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. {Yang, 2000} Yang, Lu (2000). Computing a k-cover of a string. M.Sc. thesis, Department of Computing & Software, McMaster University, Hamilton, Ontario, Canada.Google ScholarGoogle Scholar
  20. {Ziv & Lempel, 1977} Ziv, J. & Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337-343.Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. An approach to phrase selection for offline data compression

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Article Metrics

        • Downloads (Last 12 months)0
        • Downloads (Last 6 weeks)0

        Other Metrics