Abstract
Recently several offline data compression schemes have been published that expend large amounts of computing resources when encoding a file, but decode the file quickly. These compressors work by identifying phrases in the input data, and storing the data as a series of pointer to these phrases. This paper explores the application of an algorithm for computing all repeating substrings within a string for phrase selection in an offline data compressor. Using our approach, we obtain compression similar to that of the best known offline compressors on genetic data, but poor results on general text. It seems, however, that an alternate approach based on selecting repeating substrings is feasible.
- {Apostolico & Lonardi, 2000} Apostolico, A. & Lonardi, S. (2000). Off-line compression by greedy textual substitution. Proc. IEEE, 88(11):1733-1744.Google ScholarCross Ref
- {Baghdadi et al, 2001} Baghdadi, L., Franek, F., Smyth, W. F. & Xiao, X. (2001). A fast space-efficient approach to substring refinement. Preprint.Google Scholar
- {Bentley & McIlroy, 1999} Bentley, J. & McIlroy, D. (1999). Data compression using long common strings. In Storer, J. A. & Cohn, M., editors, Proc. IEEE Data Compression Conference, pages 287-295, Snowbird, Utah. IEEE Computer Society Press, Los Alamitos, California. Google ScholarDigital Library
- {Calgary, 2001} The Calgary Corpus (2001). http://links.uwaterloo.ca/calgary.corpus.html.Google Scholar
- {Cannane & Williams, 2001} Cannane, A. & Williams, H. (2001). General-purpose compression for efficient retrieval. Journal of the American Society for Information Science, 52(5):430-437. Google ScholarCross Ref
- {Canterbury, 2001} The Canterbury Corpus (2001). http://corpus.canterbury.ac.nz/.Google Scholar
- {Crochemore, 1981} Crochemore, M. (1981). An optimal algorithm for computing the repetitions in a word. Information Processing Letters, 12(5):244-250.Google ScholarCross Ref
- {Franek & Smyth, 2001} Franek, F. & Smyth, W. F. (2001). Crochemore's algorithm revisited --- a fast space-efficient approach. Preprint.Google Scholar
- {Larsson & Moffat, 2000} Larsson, N. & Moffat, A. (2000). Offline dictionary-based compression. Proc. IEEE, 88(11):1722-1732.Google ScholarCross Ref
- {Nevill-Manning & Witten, 1994} Nevill-Manning, C. G. & Witten, D. L. M. (1994). Compression by induction of hierarchical grammars. In Storer, J. A. & Cohn, M., editors, Proc. IEEE Data Compression Conference, pages 244-253, Snowbird, Utah. IEEE Computer Society Press, Los Alamitos, California.Google ScholarCross Ref
- {Purdue, 2001} The Purdue Corpus (2001). http://www.cs.purdue.edu/ homes/stelo/Off-line/.Google Scholar
- {Rubin, 1976} Rubin, F. (1976). Experiments in text file compression. Communications of the ACM, 19(11):617-623. Google ScholarDigital Library
- {Shannon, 1948} Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal, 27:379-423, 623-656. Reproduced in {Shannon & Weaver, 1971}.Google ScholarCross Ref
- {Shannon & Weaver, 1971} Shannon, C. E. & Weaver, W. (1971). The Mathematical Theory of Communication. The University of Illinois Press, Urbana, Illinois. Google ScholarDigital Library
- {Smyth & Tang, 2001} Smyth, W. F. & Tang, Y. (2001). Computing all repeats using O(n) space and O(n log n) time. Preprint.Google Scholar
- {Storer & Szymanski, 1982} Storer, J. A. & Szymanski, T. G. (1982). Data compression via textual substitution. Journal of the ACM, 29:928-951. Google ScholarDigital Library
- {Turpin & Moffat, 2000} Turpin, A. & Moffat, A. (2000). Housekeeping for prefix coding. IEEE Transactions on Communications, 48(4):622-628.Google ScholarCross Ref
- {Ukkonen, 1995} Ukkonen, E. (1995). Online construction of suffix trees. Algorithmica, 14(3):249-260.Google ScholarDigital Library
- {Yang, 2000} Yang, Lu (2000). Computing a k-cover of a string. M.Sc. thesis, Department of Computing & Software, McMaster University, Hamilton, Ontario, Canada.Google Scholar
- {Ziv & Lempel, 1977} Ziv, J. & Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337-343.Google ScholarDigital Library
Index Terms
- An approach to phrase selection for offline data compression
Recommendations
An approach to phrase selection for offline data compression
ACSC '02: Proceedings of the twenty-fifth Australasian conference on Computer science - Volume 4Recently several offline data compression schemes have been published that expend large amounts of computing resources when encoding a file, but decode the file quickly. These compressors work by identifying phrases in the input data, and storing the ...
Comments