The minimum description length (MDL) principle is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well suited to model selection, prediction, and estimation problems in which the models under consideration can be arbitrarily complex and overfitting the data is a serious concern.

This extensive, step-by-step introduction to the MDL principle provides a comprehensive reference (with an emphasis on conceptual issues) that is accessible to graduate students and researchers in statistics, pattern classification, machine learning, and data mining; to philosophers interested in the foundations of statistics; and to researchers in other applied sciences that involve model selection, including biology, econometrics, and experimental psychology. Part I provides a basic introduction to MDL and an overview of the concepts in statistics and information theory needed to understand MDL. Part II treats universal coding, the information-theoretic notion on which MDL is built, and Part III gives a formal treatment of MDL theory as a theory of inductive inference based on universal coding. Part IV provides a comprehensive overview of the statistical theory of exponential families, with an emphasis on their information-theoretic properties. The text includes a number of summaries, paragraphs offering the reader a "fast track" through the material, and boxes highlighting the most important concepts.
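To make the "best explanation compresses best" idea concrete, here is a minimal sketch (not taken from the book) of crude two-part MDL as the book treats it for Markov chain order selection: the total description length is the cost of describing the hypothesis plus the cost of describing the data given the hypothesis. The function names and the simple (number of parameters / 2) · log₂ n parameter cost are illustrative assumptions, not the book's exact codes.

```python
import math
from collections import Counter

def data_codelength(xs, order):
    """Bits to encode a binary sequence with the maximum-likelihood
    Markov chain of the given order: -log2 P(xs | ML parameters)."""
    counts, ctx_counts = Counter(), Counter()
    for i in range(order, len(xs)):
        ctx = tuple(xs[i - order:i])      # conditioning context
        counts[(ctx, xs[i])] += 1
        ctx_counts[ctx] += 1
    return -sum(c * math.log2(c / ctx_counts[ctx])
                for (ctx, _), c in counts.items())

def two_part_mdl_order(xs, max_order=3):
    """Crude two-part MDL: choose the order k minimizing
    L(H) + L(D|H), with L(H) ~ (num_params / 2) * log2(n)."""
    n = len(xs)
    def total_bits(k):
        num_params = 2 ** k              # free parameters of an order-k binary chain
        return (num_params / 2) * math.log2(n) + data_codelength(xs, k)
    return min(range(max_order + 1), key=total_bits)
```

On a strictly alternating sequence, an order-1 chain encodes the data in essentially zero bits, so its small parameter cost wins; on a constant sequence, the cheaper order-0 model already compresses perfectly and is selected instead, illustrating how the parameter cost guards against overfitting.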
List of Figures  xix
Series Foreword  xxi
Foreword  xxiii
Preface  xxv

Part I: Introductory Material  1

Learning, Regularity, and Compression  3
  Regularity and Learning  4
  Regularity and Compression  4
  Solomonoff's Breakthrough - Kolmogorov Complexity  8
  Making the Idea Applicable  10
  Crude MDL, Refined MDL and Universal Coding  12
  From Crude to Refined MDL  14
  Universal Coding and Refined MDL  17
  Refined MDL for Model Selection  18
  Refined MDL for Prediction and Hypothesis Selection  20
  Some Remarks on Model Selection  23
  Model Selection among Non-Nested Models  23
  Goals of Model vs. Point Hypothesis Selection  25
  The MDL Philosophy  26
  MDL, Occam's Razor, and the "True Model"  29
  Answer to Criticism No. 1  30
  Answer to Criticism No. 2  32
  History and Forms of MDL  36
  What Is MDL?  37
  MDL Literature  38
  Summary and Outlook  40

Probabilistic and Statistical Preliminaries  41
  General Mathematical Preliminaries  41
  Probabilistic Preliminaries  46
  Definitions; Notational Conventions  46
  Probabilistic Sources  53
  Limit Theorems and Statements  55
  Probabilistic Models  57
  Probabilistic Model Classes  60
  Kinds of Probabilistic Models  62
  Terminological Preliminaries  69
  Modeling Preliminaries: Goals and Methods for Inductive Inference  71
  Consistency  71
  Basic Concepts of Bayesian Statistics  74
  Summary and Outlook  78

Information-Theoretic Preliminaries  79
  Coding Preliminaries  79
  Restriction to Prefix Coding Systems; Descriptions as Messages  83
  Different Kinds of Codes  86
  Assessing the Efficiency of Description Methods  90
  The Most Important Section of This Book: Probabilities and Code Lengths  90
  The Kraft Inequality  91
  Code Lengths "Are" Probabilities  95
  Immediate Insights and Consequences  99
  Probabilities and Code Lengths, Part II  101
  (Relative) Entropy and the Information Inequality  103
  Uniform Codes, Maximum Entropy, and Minimax Codelength  106
  Summary, Outlook, Further Reading  106

Information-Theoretic Properties of Statistical Models  109
  Introduction  109
  Likelihood and Observed Fisher Information  111
  KL Divergence and Expected Fisher Information  117
  Maximum Likelihood: Data vs. Parameters  124
  Summary and Outlook  130

Crude Two-Part Code MDL  131
  Introduction: Making Two-Part MDL Precise  132
  Two-Part Code MDL for Markov Chain Selection  133
  The Code C_2  135
  The Code C_1  137
  Crude Two-Part Code MDL for Markov Chains  138
  Simplistic Two-Part Code MDL Hypothesis Selection  139
  Two-Part MDL for Tasks Other Than Hypothesis Selection  141
  Behavior of Two-Part Code MDL  142
  Two-Part Code MDL and Maximum Likelihood  144
  The Maximum Likelihood Principle  144
  MDL vs. ML  147
  MDL as a Maximum Probability Principle  148
  Computing and Approximating Two-Part MDL in Practice  150
  Justifying Crude MDL: Consistency and Code Design  152
  A General Consistency Result  153
  Code Design for Two-Part Code MDL  157
  Summary and Outlook  163
  Appendix: Proof of Theorem 5.1  163

Part II: Universal Coding  165

Universal Coding with Countable Models  171
  Universal Coding: The Basic Idea  172
  Two-Part Codes as Simple Universal Codes  174
  From Universal Codes to Universal Models  175
  Formal Definition of Universality  177
  The Finite Case  178
  Minimax Regret and Normalized ML  179
  NML vs. Two-Part vs. Bayes  182
  The Countably Infinite Case  184
  The Two-Part and Bayesian Codes  184
  The NML Code  187
  Prequential Universal Models  190
  Distributions as Prediction Strategies  190
  Bayes Is Prequential; NML and Two-Part Are Not  193
  The Prequential Plug-In Model  197
  Individual vs. Stochastic Universality  199
  Stochastic Redundancy  199
  Uniformly Universal Models  201
  Summary, Outlook and Further Reading  204

Parametric Models: Normalized Maximum Likelihood  207
  Introduction  207
  Preliminaries  208
  Asymptotic Expansion of Parametric Complexity  211
  The Meaning of [Characters not reproducible]  216
  Complexity and Functional Form  217
  KL Divergence and Distinguishability  219
  Complexity and Volume  222
  Complexity and the Number of Distinguishable Distributions  224
  Explicit and Simplified Computations  226

Parametric Models: Bayes  231
  The Bayesian Regret  231
  Basic Interpretation of Theorem 8.1  233
  Bayes Meets Minimax - Jeffreys' Prior  234
  Jeffreys' Prior and the Boundary  237
  How to Prove the Bayesian and NML Regret Theorems  239
  Proof Sketch of Theorem 8.1  239
  Beyond Exponential Families  241
  Proof Sketch of Theorem 7.1  243
  Stochastic Universality  244
  Appendix: Proofs of Theorem 8.1 and Theorem 8.2  248

Parametric Models: Prequential Plug-in  257
  Prequential Plug-in for Exponential Families  257
  The Plug-in vs. the Bayes Universal Model  262
  More Precise Asymptotics  265
  Summary  269

Parametric Models: Two-Part  271
  The Ordinary Two-Part Universal Model  271
  Derivation of the Two-Part Code Regret  274
  Proof Sketch of Theorem 10.1  277
  Discussion  282
  The Conditional Two-Part Universal Code  284
  Conditional Two-Part Codes for Discrete Exponential Families  286
  Distinguishability and the Phase Transition  290
  Summary and Outlook  293

NML with Infinite Complexity  295
  Introduction  295
  Examples of Undefined NML Distribution  298
  Examples of Undefined Jeffreys' Prior  299
  Metauniversal Codes  301
  Constrained Parametric Complexity  302
  Meta-Two-Part Coding  303
  Renormalized Maximum Likelihood  306
  NML with Luckiness  308
  Asymptotic Expansion of LNML  312
  Conditional Universal Models  316
  Bayesian Approach with Jeffreys' Prior  317
  Conditional NML  320
  Liang and Barron's Approach  325
  Summary and Remarks  329
  Appendix: Proof of Theorem 11.4  329

Linear Regression  335
  Introduction  336
  Prelude: The Normal Location Family  338
  Least-Squares Estimation  340
  The Normal Equations  342
  Composition of Experiments  345
  Penalized Least-Squares  346
  The Linear Model  348
  Bayesian Linear Model M^X with Gaussian Prior  354
  Bayesian Linear Models M^X and S^X with Noninformative Priors  359
  Universal Models for Linear Regression  363
  NML  363
  Bayes and LNML  364
  Bayes-Jeffreys and CNML  365

Beyond Parametrics  369
  Introduction  370
  CUP: Unions of Parametric Models  372
  CUP vs. Parametric Models  375
  Universal Codes Based on Histograms  376
  Redundancy of Universal CUP Histogram Codes  380
  Nonparametric Redundancy  383
  Standard CUP Universal Codes  384
  Minimax Nonparametric Redundancy  387
  Gaussian Process Regression  390
  Kernelization of Bayesian Linear Regression  390
  Gaussian Processes  394
  Gaussian Processes as Universal Models  396
  Conclusion and Further Reading  402

Part III: Refined MDL  403

MDL Model Selection  409
  Introduction  409
  Simple Refined MDL Model Selection  411
  Compression Interpretation  415
  Counting Interpretation  416
  Bayesian Interpretation  418
  Prequential Interpretation  419
  General Parametric Model Selection  420
  Models with Infinite Complexities  420
  Comparing Many or Infinitely Many Models  422
  The General Picture  425
  Practical Issues in MDL Model Selection  428
  Calculating Universal Codelengths  428
  Computational Efficiency and Practical Quality of Non-NML Universal Codes  429
  Model Selection with Conditional NML and Plug-in Codes  431
  General Warnings about Model Selection  435
  MDL Model Selection for Linear Regression  438
  Rissanen's RNML Approach  439
  Hansen and Yu's gMDL Approach  443
  Liang and Barron's Approach  446
  Discussion  448
  Worst Case vs. Average Case  451

MDL Prediction and Estimation  459
  Introduction  459
  MDL for Prediction and Predictive Estimation  460
  Prequential MDL Estimators  461
  Prequential MDL Estimators Are Consistent  465
  Parametric and Nonparametric Examples  469
  Cesàro KL Consistency vs. KL Consistency  472
  Two-Part Code MDL for Point Hypothesis Selection  476
  Discussion of Two-Part Consistency Theorem  478
  MDL Parameter Estimation  483
  MDL Estimators vs. Luckiness ML Estimators  487
  What Estimator to Use?  491
  Comparison to Bayesian Estimators  493
  Summary and Outlook  498
  Appendix: Proof of Theorem 15.3  499

MDL Consistency and Convergence  501
  Introduction  501
  The Scenarios Considered  501
  Consistency: Prequential and Two-Part MDL Estimators  502
  Consistency: MDL Model Selection  505
  Selection between a Union of Parametric Models  505
  Nonparametric Model Selection Based on CUP Model Class  508
  MDL Consistency Peculiarities  511
  Risks and Rates  515
  Relations between Divergences and Risk Measures  517
  Minimax Rates  519
  MDL Rates of Convergence  520
  Prequential and Two-Part MDL Estimators  520
  MDL Model Selection  522

MDL in Context  523
  MDL and Frequentist Paradigms  524
  Sanity Check or Design Principle?  525
  The Weak Prequential Principle  528
  MDL vs. Frequentist Principles: Remaining Issues  529
  MDL and Bayesian Inference  531
  Luckiness Functions vs. Prior Distributions  534
  MDL, Bayes, and Occam  539
  MDL and Brands of Bayesian Statistics  544
  Conclusion: A Common Future after All?  548
  MDL, AIC and BIC  549
  BIC  549
  AIC  550
  Combining the Best of AIC and BIC  552
  MDL and MML  555
  Strict Minimum Message Length  556
  Comparison to MDL  558
  The Wallace-Freeman Estimator  560
  MDL and Prequential Analysis  562
  MDL and Cross-Validation  565
  MDL and Maximum Entropy  567
  Kolmogorov Complexity and Structure Function  570
  MDL and Individual Sequence Prediction  573
  MDL and Statistical Learning Theory  579
  Structural Risk Minimization  581
  PAC-Bayesian Approaches  585
  PAC-Bayes and MDL  588
  The Road Ahead  592

Part IV: Additional Background  597

The Exponential or "Maximum Entropy" Families  599
  Introduction  600
  Definition and Overview  601
  Basic Properties  605
  Mean-Value, Canonical, and Other Parameterizations  609
  The Mean Value Parameterization  609
  Other Parameterizations  611
  Relating Mean-Value and Canonical Parameters  613
  Exponential Families of General Probabilistic Sources  617
  Fisher Information Definitions and Characterizations  619

Information-Theoretic Properties of Exponential Families  623
  Introduction  624
  Robustness of Exponential Family Codes  624
  If [Characters not reproducible]_mean Does Not Contain the Mean  627
  Behavior at the ML Estimate β  629
  Behavior of the ML Estimate β  632
  Central Limit Theorem  633
  Large Deviations  634
  Maximum Entropy and Minimax Codelength  637
  Exponential Families and Maximum Entropy  638
  Exponential Families and Minimax Codelength  641
  The Compression Game  643
  Likelihood Ratio Families and Renyi Divergences  645
  The Likelihood Ratio Family  647
  Summary  650

References  651
List of Symbols  675
Subject Index  679