The Minimum Description Length Principle

Hardcover

Author: Peter D. Grünwald

ISBN-10: 0262072815

ISBN-13: 9780262072816

Category: Machine Learning

The minimum description length (MDL) principle is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well suited for dealing with model selection, prediction, and estimation problems in situations where the models under consideration can be arbitrarily complex, and overfitting the data is a serious concern.

This extensive, step-by-step introduction to the MDL Principle provides a comprehensive reference (with an emphasis on conceptual issues) that is accessible to graduate students and researchers in statistics, pattern classification, machine learning, and data mining, to philosophers interested in the foundations of statistics, and to researchers in other applied sciences that involve model selection, including biology, econometrics, and experimental psychology. Part I provides a basic introduction to MDL and an overview of the concepts in statistics and information theory needed to understand MDL. Part II treats universal coding, the information-theoretic notion on which MDL is built, and Part III gives a formal treatment of MDL theory as a theory of inductive inference based on universal coding. Part IV provides a comprehensive overview of the statistical theory of exponential families with an emphasis on their information-theoretic properties. The text includes a number of summaries, paragraphs offering the reader a "fast track" through the material, and boxes highlighting the most important concepts.
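To make the compression idea concrete, here is a minimal Python sketch of two-part-code MDL model selection; it is an illustrative assumption, not the book's exact construction from Chapter 5. The order k of a binary Markov chain is chosen by minimizing the total codelength L(H) + L(D|H), where L(D|H) is the negative log-likelihood of the data under the maximum-likelihood parameters of the order-k model, and L(H) charges roughly (1/2) log2(n) bits per free parameter (one common way of encoding parameters to precision 1/sqrt(n)). The function name two_part_codelength and the toy data are hypothetical.

    # Minimal sketch of two-part-code MDL model selection (illustrative only):
    # pick the Markov order k that minimizes L(H) + L(D|H).
    import math
    from collections import Counter

    def two_part_codelength(data: str, order: int) -> float:
        """Total description length, in bits, of a binary string under an order-k Markov chain."""
        n = len(data)
        context_counts = Counter()      # how often each length-k context occurs
        transition_counts = Counter()   # how often each (context, next symbol) pair occurs
        for i in range(order, n):
            ctx = data[i - order:i]
            context_counts[ctx] += 1
            transition_counts[(ctx, data[i])] += 1
        # L(D|H): negative log-likelihood under maximum-likelihood transition probabilities.
        data_bits = sum(-c * math.log2(c / context_counts[ctx])
                        for (ctx, _), c in transition_counts.items())
        # L(H): one free parameter per context, each encoded to precision 1/sqrt(n),
        # costing about 0.5 * log2(n) bits per parameter.
        hypothesis_bits = 0.5 * (2 ** order) * math.log2(n)
        return hypothesis_bits + data_bits

    data = "0101010101010101010101010111"   # toy binary sequence (assumed example)
    best_order = min(range(4), key=lambda k: two_part_codelength(data, k))
    print("MDL-selected Markov order:", best_order)

Richer models always lower L(D|H), but their parameter cost L(H) grows, so the minimizer balances goodness of fit against model complexity; this trade-off is the overfitting safeguard described above.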

Table of Contents

List of Figures     xix
Series Foreword     xxi
Foreword     xxiii
Preface     xxv

Part I: Introductory Material     1

Learning, Regularity, and Compression     3
    Regularity and Learning     4
    Regularity and Compression     4
    Solomonoff's Breakthrough - Kolmogorov Complexity     8
    Making the Idea Applicable     10
    Crude MDL, Refined MDL and Universal Coding     12
    From Crude to Refined MDL     14
    Universal Coding and Refined MDL     17
    Refined MDL for Model Selection     18
    Refined MDL for Prediction and Hypothesis Selection     20
    Some Remarks on Model Selection     23
    Model Selection among Non-Nested Models     23
    Goals of Model vs. Point Hypothesis Selection     25
    The MDL Philosophy     26
    MDL, Occam's Razor, and the "True Model"     29
    Answer to Criticism No. 1     30
    Answer to Criticism No. 2     32
    History and Forms of MDL     36
    What Is MDL?     37
    MDL Literature     38
    Summary and Outlook     40

Probabilistic and Statistical Preliminaries     41
    General Mathematical Preliminaries     41
    Probabilistic Preliminaries     46
    Definitions; Notational Conventions     46
    Probabilistic Sources     53
    Limit Theorems and Statements     55
    Probabilistic Models     57
    Probabilistic Model Classes     60
    Kinds of Probabilistic Models     62
    Terminological Preliminaries     69
    Modeling Preliminaries: Goals and Methods for Inductive Inference     71
    Consistency     71
    Basic Concepts of Bayesian Statistics     74
    Summary and Outlook     78

Information-Theoretic Preliminaries     79
    Coding Preliminaries     79
    Restriction to Prefix Coding Systems; Descriptions as Messages     83
    Different Kinds of Codes     86
    Assessing the Efficiency of Description Methods     90
    The Most Important Section of This Book: Probabilities and Code Lengths     90
    The Kraft Inequality     91
    Code Lengths "Are" Probabilities     95
    Immediate Insights and Consequences     99
    Probabilities and Code Lengths, Part II     101
    (Relative) Entropy and the Information Inequality     103
    Uniform Codes, Maximum Entropy, and Minimax Codelength     106
    Summary, Outlook, Further Reading     106

Information-Theoretic Properties of Statistical Models     109
    Introduction     109
    Likelihood and Observed Fisher Information     111
    KL Divergence and Expected Fisher Information     117
    Maximum Likelihood: Data vs. Parameters     124
    Summary and Outlook     130

Crude Two-Part Code MDL     131
    Introduction: Making Two-Part MDL Precise     132
    Two-Part Code MDL for Markov Chain Selection     133
    The Code C₂     135
    The Code C₁     137
    Crude Two-Part Code MDL for Markov Chains     138
    Simplistic Two-Part Code MDL Hypothesis Selection     139
    Two-Part MDL for Tasks Other Than Hypothesis Selection     141
    Behavior of Two-Part Code MDL     142
    Two-Part Code MDL and Maximum Likelihood     144
    The Maximum Likelihood Principle     144
    MDL vs. ML     147
    MDL as a Maximum Probability Principle     148
    Computing and Approximating Two-Part MDL in Practice     150
    Justifying Crude MDL: Consistency and Code Design     152
    A General Consistency Result     153
    Code Design for Two-Part Code MDL     157
    Summary and Outlook     163
    Appendix: Proof of Theorem 5.1     163

Part II: Universal Coding     165

Universal Coding with Countable Models     171
    Universal Coding: The Basic Idea     172
    Two-part Codes as Simple Universal Codes     174
    From Universal Codes to Universal Models     175
    Formal Definition of Universality     177
    The Finite Case     178
    Minimax Regret and Normalized ML     179
    NML vs. Two-Part vs. Bayes     182
    The Countably Infinite Case     184
    The Two-Part and Bayesian Codes     184
    The NML Code     187
    Prequential Universal Models     190
    Distributions as Prediction Strategies     190
    Bayes Is Prequential; NML and Two-part Are Not     193
    The Prequential Plug-In Model     197
    Individual vs. Stochastic Universality     199
    Stochastic Redundancy     199
    Uniformly Universal Models     201
    Summary, Outlook and Further Reading     204

Parametric Models: Normalized Maximum Likelihood     207
    Introduction     207
    Preliminaries     208
    Asymptotic Expansion of Parametric Complexity     211
    The Meaning of [Characters not reproducible]     216
    Complexity and Functional Form     217
    KL Divergence and Distinguishability     219
    Complexity and Volume     222
    Complexity and the Number of Distinguishable Distributions     224
    Explicit and Simplified Computations     226

Parametric Models: Bayes     231
    The Bayesian Regret     231
    Basic Interpretation of Theorem 8.1     233
    Bayes Meets Minimax - Jeffreys' Prior     234
    Jeffreys' Prior and the Boundary     237
    How to Prove the Bayesian and NML Regret Theorems     239
    Proof Sketch of Theorem 8.1     239
    Beyond Exponential Families     241
    Proof Sketch of Theorem 7.1     243
    Stochastic Universality     244
    Appendix: Proofs of Theorem 8.1 and Theorem 8.2     248

Parametric Models: Prequential Plug-in     257
    Prequential Plug-in for Exponential Families     257
    The Plug-in vs. the Bayes Universal Model     262
    More Precise Asymptotics     265
    Summary     269

Parametric Models: Two-Part     271
    The Ordinary Two-Part Universal Model     271
    Derivation of the Two-Part Code Regret     274
    Proof Sketch of Theorem 10.1     277
    Discussion     282
    The Conditional Two-Part Universal Code     284
    Conditional Two-Part Codes for Discrete Exponential Families     286
    Distinguishability and the Phase Transition     290
    Summary and Outlook     293

NML With Infinite Complexity     295
    Introduction     295
    Examples of Undefined NML Distribution     298
    Examples of Undefined Jeffreys' Prior     299
    Metauniversal Codes     301
    Constrained Parametric Complexity     302
    Meta-Two-Part Coding     303
    Renormalized Maximum Likelihood     306
    NML with Luckiness     308
    Asymptotic Expansion of LNML     312
    Conditional Universal Models     316
    Bayesian Approach with Jeffreys' Prior     317
    Conditional NML     320
    Liang and Barron's Approach     325
    Summary and Remarks     329
    Appendix: Proof of Theorem 11.4     329

Linear Regression     335
    Introduction     336
    Prelude: The Normal Location Family     338
    Least-Squares Estimation     340
    The Normal Equations     342
    Composition of Experiments     345
    Penalized Least-Squares     346
    The Linear Model     348
    Bayesian Linear Model M^X with Gaussian Prior     354
    Bayesian Linear Models M^X and S^X with Noninformative Priors     359
    Universal Models for Linear Regression     363
    NML     363
    Bayes and LNML     364
    Bayes-Jeffreys and CNML     365

Beyond Parametrics     369
    Introduction     370
    CUP: Unions of Parametric Models     372
    CUP vs. Parametric Models     375
    Universal Codes Based on Histograms     376
    Redundancy of Universal CUP Histogram Codes     380
    Nonparametric Redundancy     383
    Standard CUP Universal Codes     384
    Minimax Nonparametric Redundancy     387
    Gaussian Process Regression     390
    Kernelization of Bayesian Linear Regression     390
    Gaussian Processes     394
    Gaussian Processes as Universal Models     396
    Conclusion and Further Reading     402

Part III: Refined MDL     403

MDL Model Selection     409
    Introduction     409
    Simple Refined MDL Model Selection     411
    Compression Interpretation     415
    Counting Interpretation     416
    Bayesian Interpretation     418
    Prequential Interpretation     419
    General Parametric Model Selection     420
    Models with Infinite Complexities     420
    Comparing Many or Infinitely Many Models     422
    The General Picture     425
    Practical Issues in MDL Model Selection     428
    Calculating Universal Codelengths     428
    Computational Efficiency and Practical Quality of Non-NML Universal Codes     429
    Model Selection with Conditional NML and Plug-in Codes     431
    General Warnings about Model Selection     435
    MDL Model Selection for Linear Regression     438
    Rissanen's RNML Approach     439
    Hansen and Yu's gMDL Approach     443
    Liang and Barron's Approach     446
    Discussion     448
    Worst Case vs. Average Case     451

MDL Prediction and Estimation     459
    Introduction     459
    MDL for Prediction and Predictive Estimation     460
    Prequential MDL Estimators     461
    Prequential MDL Estimators Are Consistent     465
    Parametric and Nonparametric Examples     469
    Cesàro KL Consistency vs. KL Consistency     472
    Two-Part Code MDL for Point Hypothesis Selection     476
    Discussion of Two-Part Consistency Theorem     478
    MDL Parameter Estimation     483
    MDL Estimators vs. Luckiness ML Estimators     487
    What Estimator To Use?     491
    Comparison to Bayesian Estimators     493
    Summary and Outlook     498
    Appendix: Proof of Theorem 15.3     499

MDL Consistency and Convergence     501
    Introduction     501
    The Scenarios Considered     501
    Consistency: Prequential and Two-Part MDL Estimators     502
    Consistency: MDL Model Selection     505
    Selection between a Union of Parametric Models     505
    Nonparametric Model Selection Based on CUP Model Class     508
    MDL Consistency Peculiarities     511
    Risks and Rates     515
    Relations between Divergences and Risk Measures     517
    Minimax Rates     519
    MDL Rates of Convergence     520
    Prequential and Two-Part MDL Estimators     520
    MDL Model Selection     522

MDL in Context     523
    MDL and Frequentist Paradigms     524
    Sanity Check or Design Principle?     525
    The Weak Prequential Principle     528
    MDL vs. Frequentist Principles: Remaining Issues     529
    MDL and Bayesian Inference     531
    Luckiness Functions vs. Prior Distributions     534
    MDL, Bayes, and Occam     539
    MDL and Brands of Bayesian Statistics     544
    Conclusion: a Common Future after All?     548
    MDL, AIC and BIC     549
    BIC     549
    AIC     550
    Combining the Best of AIC and BIC     552
    MDL and MML     555
    Strict Minimum Message Length     556
    Comparison to MDL     558
    The Wallace-Freeman Estimator     560
    MDL and Prequential Analysis     562
    MDL and Cross-Validation     565
    MDL and Maximum Entropy     567
    Kolmogorov Complexity and Structure Function     570
    MDL and Individual Sequence Prediction     573
    MDL and Statistical Learning Theory     579
    Structural Risk Minimization     581
    PAC-Bayesian Approaches     585
    PAC-Bayes and MDL     588
    The Road Ahead     592

Part IV: Additional Background     597

The Exponential or "Maximum Entropy" Families     599
    Introduction     600
    Definition and Overview     601
    Basic Properties     605
    Mean-Value, Canonical, and Other Parameterizations     609
    The Mean Value Parameterization     609
    Other Parameterizations     611
    Relating Mean-Value and Canonical Parameters     613
    Exponential Families of General Probabilistic Sources     617
    Fisher Information Definitions and Characterizations     619

Information-Theoretic Properties of Exponential Families     623
    Introduction     624
    Robustness of Exponential Family Codes     624
    If [Characters not reproducible]_mean Does Not Contain the Mean     627
    Behavior at the ML Estimate β̂     629
    Behavior of the ML Estimate β̂     632
    Central Limit Theorem     633
    Large Deviations     634
    Maximum Entropy and Minimax Codelength     637
    Exponential Families and Maximum Entropy     638
    Exponential Families and Minimax Codelength     641
    The Compression Game     643
    Likelihood Ratio Families and Rényi Divergences     645
    The Likelihood Ratio Family     647
    Summary     650

References     651
List of Symbols     675
Subject Index     679