The sequence memoizer

Published: 01 February 2011

Abstract

Probabilistic models of sequences play a central role in most machine translation, automated speech recognition, lossless compression, spell-checking, and gene identification applications, to name but a few. Unfortunately, real-world sequence data often exhibit long-range dependencies that can only be captured by computationally challenging, complex models. Sequence data arising from natural processes also often exhibit power-law properties, yet common sequence models do not capture such properties. The sequence memoizer is a new hierarchical Bayesian model for discrete sequence data that captures long-range dependencies and power-law characteristics while remaining computationally attractive. Its utility as a language model and general-purpose lossless compressor is demonstrated.
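The mechanism behind these properties can be sketched briefly: the sequence memoizer places hierarchical Pitman-Yor priors over the conditional distribution of the next symbol in every context, so a prediction in a long context recursively interpolates discounted counts with the prediction of the next-shorter suffix context, and the discounts give rise to power-law behavior. The Python below is a minimal, hypothetical sketch of that back-off recursion, not the authors' implementation: it truncates contexts to a fixed order (the memoizer's contexts are unbounded and stored in a structure akin to a suffix tree), fixes a single discount and zero concentration, and uses a one-table-per-distinct-symbol shortcut under which the update resembles interpolated Kneser-Ney. The class name `PYBackoffLM` and all parameter values are illustrative.

```python
from collections import defaultdict

class PYBackoffLM:
    """Hypothetical sketch of Pitman-Yor back-off smoothing over suffix
    contexts. Simplifying assumptions: contexts truncated to a fixed
    order, one constant discount, zero concentration, and one Chinese-
    restaurant 'table' per distinct (context, symbol) pair."""

    def __init__(self, order=4, discount=0.8, vocab_size=256):
        self.order = order               # max context length (the memoizer is unbounded)
        self.d = discount                # shared discount parameter
        self.V = vocab_size              # uniform base distribution over symbols
        self.counts = defaultdict(dict)  # context tuple -> {symbol: count}

    def prob(self, context, symbol):
        """P(symbol | context), recursively backing off to the suffix context."""
        context = tuple(context)[-self.order:] if context else ()
        # shorter-suffix prediction; the empty context backs off to uniform
        backoff = 1.0 / self.V if not context else self.prob(context[1:], symbol)
        c = self.counts.get(context)
        if not c:
            return backoff
        total = sum(c.values())
        t = len(c)                       # distinct symbols seen after this context
        # Pitman-Yor predictive rule: discounted counts, with mass d*t
        # interpolated from the parent (suffix) distribution
        return (max(c.get(symbol, 0) - self.d, 0.0) + self.d * t * backoff) / total

    def observe(self, seq):
        """Add each symbol of seq in its (truncated) preceding context."""
        seq = tuple(seq)
        for i, s in enumerate(seq):
            self._add(seq[max(0, i - self.order):i], s)

    def _add(self, context, symbol):
        # A new (context, symbol) pair opens a 'table' and sends one
        # pseudo-observation to the parent, as in interpolated Kneser-Ney.
        c = self.counts[context]
        if symbol not in c:
            c[symbol] = 1
            if context:
                self._add(context[1:], symbol)
        else:
            c[symbol] += 1

# Illustrative use: train on bytes and query conditional probabilities.
model = PYBackoffLM(order=4)
model.observe(b"abracadabra abracadabra")
print(model.prob(b"abra", ord("c")))   # high: 'c' follows 'abra' in training
print(model.prob(b"abra", ord("z")))   # small but nonzero via back-off
```

Because every probability interpolates down to the uniform base, the sketch never assigns zero probability to a symbol, which is the property that makes this family of models usable for lossless compression as well as language modeling.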

References

  1. Bartlett, N., Pfau, D., Wood, F. Forgetting counts: Constant memory inference for a dependent hierarchical Pitman--Yor process. In 27th International Conference on Machine Learning, to appear (2010).Google ScholarGoogle Scholar
  2. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 3 (2003), 1137--1155. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Chen, S.F., Goodman, J. An empirical study of smoothing techniques for language modeling. Comput. Speech Lang. 13, 4 (1999), 359--394.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Cleary, J.G., Teahan, W.J. Unbounded length contexts for PPM. Comput. J. 40 (1997), 67--75.Google ScholarGoogle ScholarCross RefCross Ref
  5. Doucet, A., de Freitas, N., Gordon, N.J. Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. Springer-Verlag, New York, May 2001.Google ScholarGoogle Scholar
  6. Gasthaus, J., Teh, Y.W. Improvements to the sequence memoizer. In Advances in Neural Information Processing Systems 23, to appear (2010).Google ScholarGoogle Scholar
  7. Gasthaus, J., Wood, F., Teh, Y.W. Lossless compression based on the sequence memoizer. Data Compression Conference 2010. J.A. Storer, M.W. Marcellin, eds. Los Alamitos, CA, USA, 2010, 337--345. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B. Bayesian data analysis. Chapman & Hall, CRC, 2nd edn, 2004.Google ScholarGoogle Scholar
  9. Giegerich, R., Kurtz, S. From Ukkonen to McCreight and Weiner: A unifying view of linear-lime suffix tree construction. Algorithmica 19, 3 (1997), 331--353.Google ScholarGoogle ScholarCross RefCross Ref
  10. Goldwater, S., Griffiths, T.L., Johnson, M. Interpolating between types and tokens by estimating power law generators. In Advances in Neural Information Processing Systems 18 (2006), MIT Press, 459--466.Google ScholarGoogle Scholar
  11. MacKay, D.J.C., Peto, L.B. A hierarchical Dirichlet language model. Nat. Lang. Eng. 1, 2 (1995), 289--307.Google ScholarGoogle ScholarCross RefCross Ref
  12. Mahoney, M. Large text compression benchmark. URL: http://www.mattmahoney.net/text/text.html (2009).Google ScholarGoogle Scholar
  13. Mnih, A., Yuecheng, Z., Hinton, G. Improving a statistical language model through non-linear prediction. Neurocomputing 72, 7--9 (2009), 1414--1418. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Pitman, J. Coalescents with multiple collisions. Ann. Probab. 27 (1999), 1870--1902.Google ScholarGoogle ScholarCross RefCross Ref
  15. Pitman, J., Yor, M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab. 25 (1997), 855--900.Google ScholarGoogle ScholarCross RefCross Ref
  16. Robert, C.P., Casella, G. Monte Carlo Statistical Methods. Springer Verlag, 2004. Google ScholarGoogle ScholarCross RefCross Ref
  17. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. (reprinted in ACM SIGMOBILE Mobile Computing and Communications Review 2001) (1948). Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Teh, Y.W. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics (2006), 985--992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Willems, F.M.J. The context-tree weighting method: Extensions. IEEE Trans. Inform. Theory 44, 2 (1998), 792--798. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Willems, F.M.J. CTW website. URL: http://www.ele.tue.nl/ctw/ (2009).Google ScholarGoogle Scholar
  21. Wood, F., Archambeau, C., Gasthaus, J., James, L., Teh, Y.W. A stochastic memoizer for sequence data. In 26th International Conference on Machine Learning (2009), 1129--1136. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Wu, P., Teahan, W.J. A new PPM variant for Chinese text compression. Nat. Lang. Eng. 14, 3 (2007), 417--430. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Zipf, G. Selective Studies and the Principle of Relative Frequency in Language. Harvard University Press, Cambridge, MA, 1932.Google ScholarGoogle ScholarCross RefCross Ref

Published in

Communications of the ACM, Volume 54, Issue 2 (February 2011), 115 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/1897816
Copyright © 2011 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
