The rapid advancement in the theoretical understanding of statistical and machine learning methods for semisupervised learning has made it difficult for nonspecialists to keep up to date in the field. Providing a broad, accessible treatment of the theory as well as linguistic applications, Semisupervised Learning for Computational Linguistics offers self-contained coverage of semisupervised methods that includes background material on supervised and unsupervised learning.\ The book presents a...
The rapid advancement in the theoretical understanding of statistical and machine learning methods for semisupervised learning has made it difficult for nonspecialists to keep up to date in the field. Providing a broad, accessible treatment of the theory as well as linguistic applications, Semisupervised Learning for Computational Linguistics offers self-contained coverage of semisupervised methods that includes background material on supervised and unsupervised learning.The book presents a brief history of semisupervised learning and its place in the spectrum of learning methods before moving on to discuss well-known natural language processing methods, such as self-training and co-training. It then centers on machine learning techniques, including the boundary-oriented methods of perceptrons, boosting, support vector machines (SVMs), and the null-category noise model. In addition, the book covers clustering, the expectation-maximization (EM) algorithm, related generative methods, and agreement methods. It concludes with the graph-based method of label propagation as well as a detailed discussion of spectral methods.Taking an intuitive approach to the material, this lucid book facilitates the application of semisupervised learning methods to natural language processing and provides the framework and motivation for a more systematic study of machine learning.
Introduction 1A brief history 1Probabilistic methods in computational linguistics 1Supervised and unsupervised training 2Semisupervised learning 3Semisupervised learning 4Major varieties of learning problem 4Motivation 6Evaluation 7Active learning 8Organization and assumptions 8Leading ideas 8Mathematical background 10Notation 11Self-training and Co-training 13Classification 13The standard setting 13Features and rules 14Decision lists 16Self-training 18The algorithm 19Parameters and variants 20Evaluation 23Symmetry of features and instances 25Related algorithms 27Co-Training 28Applications of Self-Training and Co-Training 31Part-of-speech tagging 31Information extraction 33Parsing 35Word senses 36WordNet 36Word-sense disambiguation 38Taxonomic inference 40Classification 43Two simple classifiers 43Naive Bayes 43k-nearest-neighbor classifier 45Abstract setting 48Function approximation 48Defining succes 50Fit and simplicity 52Evaluating detectors and classifiers that abstain 53Confidence-rated classifiers 53Measures for detection 54Idealized performance curves 57The multiclass case 59Binary classifiers and ECOC 62Mathematics for Boundary-Oriented Methods 67Linear separators 67Representing a hyperplane 67Eliminating the threshold 69The point-normal form 70Naive Bayes decision boundary 72The gradient 74Graphs and domains 74Convexity 76Differentiation of vector and matrix expressions 79An example: linear regression 81Constrained optimization 83Optimization 83Equality constraints 84Inequality constraints 87The Wolfe dual 91Boundary-Oriented Methods 95The perceptron 97The algorithm 97An example 99Convergence 100The perceptron algorithm as gradient descent 101Game self-teaching 103Boosting 105Abstention 110Semisupervised boosting 111Co-boosting 113Support Vector Machines (SVMs) 114The margin 114Maximizing the margin 116The nonseparable case 119Slack in the separable case 121Multiple slack points 123Transductive SVMs 125Training a transductive SVM 127Null-category noise model 129Clustering 131Cluster and label 131Clustering concepts 132Objective 132Distance and similarity 133Graphs 136Hierarchical clustering 137Self-training revisited 139k-means clustering 139Pseudo relevance feedback 140Graph mincut 143Label propagation 146Clustering by propagation 146Self-training as propagation 147Co-training as propagation 150Bibliographic notes 152Generative Models 153Gaussian mixtures 153Definition and geometric interpretation 153The linear discriminant decision boundary 156Decision-directed approximation 159McLachlan's algorithm 162The EM algorithm 163Maximizing likelihood 163Relative frequency estimation 164Divergence 166The EM algorithm 169Agreement Constraints 175Co-training 175The conditional independence assumption 176The power of conditional independence 178Agreement-based self-teaching 182Random fields 184Applied to self-training and co-training 184Gibbs sampling 186Markov chains and random walks 187Bibliographic notes 192Propagation Methods 193Label propagation 194Random walks 196Harmonic functions 198Fluids 203Flow 203Pressure 205Conservation of energy 209Thomson's principle 210Computing the solution 213Graph mincuts revisited 215Bibliographic notes 220Mathematics for Spectral Methods 221Some basic concepts 221The norm of a vector 221Matrices as linear operators 222The column space 222Eigenvalues and eigenvectors 224Definition of eigenvalues and eigenvectors 224Diagonalization 225Orthogonal diagonalization 226Eigenvalues and the scaling effects of a matrix 227Matrix norms 227The Rayleigh quotient 228The 2 x 2 case 230The general case 232The Courant-Fischer minimax theorem 234Bibliographic notes 236Spectral Methods 237Simple harmonic motion 237Harmonics 237Mixtures of harmonics 239An oscillating particle 241A vibrating string 243Spectra of matrices and graphs 251The spectrum of a matrix 252Relating matrices and graphs 253The Laplacian matrix and graph spectrum 256Spectral clustering 257The second smallest eigenvector of the Laplacian 257The cut size and the Laplacian 259Approximating cut size 260Minimizing cut size 262Ratiocut 263Spectral methods for semisupervised learning 265Harmonics and harmonic functions 265Eigenvalues and energy 267The Laplacian and random fields 268Harmonic functions and the Laplacian 270Using the Laplacian for regularization 272Transduction to induction 274Bibliographic notes 275Bibliography 277Index 301