ICASSP 2005 Philadelphia

2005 IEEE International Conference on Acoustics, Speech, and Signal Processing

March 18-23, 2005 • Pennsylvania Convention Center/Marriott Hotel • Philadelphia, PA, USA

Tutorial TUT-3: Data compression, Good-Turing estimation, and language modeling

Instructors

A. Orlitsky; University of California, San Diego

Time & Location

Friday, March 18, 14:00 - 17:00, Location: CC: Room 113-A

Abstract

Compression, estimation, and modeling address essentially the same technical problem. Yet they have been studied in their respective disciplines---information theory, statistics, and linguistics/signal processing---using vastly divergent techniques. Taking a unified universal-compression viewpoint, we relate these approaches and use results derived in each to explain phenomena observed in the others.

We begin with classical data-compression definitions and results, present several popular universal-compression algorithms, and explain why their performance degrades as the alphabet size increases to that encountered in language modeling. We then provide some of the intuition behind the cryptic Good-Turing estimator of low-probability events, describe recent developments on the optimality of some of its variations, and relate these results to those derived in language modeling for speech recognition and natural language processing. We conclude by comparing the performance of these and related algorithms on benchmark corpora.
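For readers unfamiliar with the Good-Turing estimator mentioned above, here is a minimal sketch of the basic (unsmoothed) rule; this is an illustration by the editor, not material from the tutorial. Writing N_r for the number of distinct symbols observed exactly r times in a sample of size n, Good-Turing assigns total probability N_1/n to unseen symbols and adjusts an observed count r to r* = (r+1) N_{r+1}/N_r:

```python
from collections import Counter

def good_turing(sample):
    """Basic (unsmoothed) Good-Turing estimate.

    Returns the estimated total probability of unseen symbols,
    and adjusted counts r* = (r+1) * N_{r+1} / N_r per seen symbol.
    """
    counts = Counter(sample)
    n = len(sample)
    # N_r = number of distinct symbols seen exactly r times
    freq_of_freq = Counter(counts.values())
    p_unseen = freq_of_freq[1] / n  # total mass given to unseen symbols
    adjusted = {}
    for sym, r in counts.items():
        if freq_of_freq[r + 1] > 0:
            adjusted[sym] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            adjusted[sym] = r  # fall back to the raw count when N_{r+1} = 0
    return p_unseen, adjusted
```

For the sample "abracadabra" (counts a:5, b:2, r:2, c:1, d:1), two symbols occur once, so the unseen mass is 2/11, and the singletons c and d get adjusted count (1+1)·N_2/N_1 = 2. Practical variants smooth the N_r sequence before applying the rule, which is part of what the tutorial covers.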

Audience: The tutorial is self-contained and aimed at researchers interested in statistical and information-theoretic views of language modeling and compression.

Materials: Slides, summary, and references will be provided.

Outline:

  • Compression
    • Theory: Optimal compression, entropy
    • Algorithms: Shannon, Huffman, arithmetic coding
  • Universal compression
    • Theory: Redundancy, bounds, effect of alphabet size
    • Algorithms: Lempel-Ziv, context-trees, Burrows-Wheeler
  • Probability estimation
    • Theory: Consistency, attenuation
    • Algorithms: Laplace, Krichevsky-Trofimov, Good-Turing
  • Language modeling
    • Theory: Perplexity, smoothing
    • Algorithms: Jelinek-Mercer, Kneser-Ney, prediction by partial matching
  • Evaluation
    • Performance of the above and related algorithms on the Wall Street Journal corpus; interpretation of the results
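The Laplace and Krichevsky-Trofimov estimators listed in the outline can both be viewed as "add-alpha" sequential rules: after t symbols, the next symbol a is predicted with probability (n_a + alpha)/(t + k·alpha), where n_a is a's count so far and k the alphabet size (alpha = 1 gives Laplace, alpha = 1/2 gives Krichevsky-Trofimov). A minimal illustrative sketch, not tutorial material:

```python
def sequential_prob(sequence, alphabet, alpha):
    """Probability an add-alpha estimator assigns to a whole sequence.

    alpha = 1   -> Laplace estimator
    alpha = 1/2 -> Krichevsky-Trofimov estimator
    """
    counts = {a: 0 for a in alphabet}
    k = len(alphabet)
    p = 1.0
    for t, x in enumerate(sequence):
        # Predict x from the counts of the first t symbols, then update.
        p *= (counts[x] + alpha) / (t + k * alpha)
        counts[x] += 1
    return p
```

For the binary string "0011", Laplace assigns probability (1/2)(2/3)(1/4)(2/5) = 1/30, while Krichevsky-Trofimov assigns (1/2)(3/4)(1/6)(3/8) = 3/128; the code-length -log p of such sequential estimates is what links probability estimation to universal compression.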

Presenter Information

Alon Orlitsky received B.Sc. degrees in Mathematics and Electrical Engineering from Ben Gurion University in 1980 and 1981, and M.Sc. and Ph.D. degrees in Electrical Engineering from Stanford University in 1982 and 1986.

From 1986 to 1996 he was with the Communications Analysis Research Department of Bell Laboratories. He spent the following year as a quantitative analyst at D.E. Shaw and Company, an investment firm in New York City. In 1997 he joined the University of California, San Diego, where he is currently a professor of Electrical and Computer Engineering and of Computer Science and Engineering.

Alon's research concerns information theory, learning, and speech recognition. He is a recipient of the 1981 ITT International Fellowship and of the 1992 IEEE W.R.G. Baker Paper Award.
