Statistical Learning and Modeling: Supervised Learning
Fei Wu
College of Computer Science Zhejiang University
http://person.zju.edu.cn/wufei/
Outline
Linear model for classification
AdaBoost
Linear Model for Classification
Learning the parameters of Linear Discriminant Functions
• Three approaches:
– Least-squares approach: making the model predictions as close as possible to a set of target values
– Fisher's linear discriminant: maximizing class separation in the projected output space
– The perceptron algorithm of Rosenblatt: a generalized linear model
Linear Basis Function Models
Parameter optimization via Maximum likelihood
• Assume the target variable is given by a deterministic function with additive Gaussian noise of precision $\beta$:
$t = y(x, w) + \epsilon$, with $p(\epsilon) = \mathcal{N}(\epsilon \mid 0, \beta^{-1})$
• Thus:
$p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})$
• For data set X = {x1, . . . , xN} and target vector t = (t1, . . . , tN)^T, the likelihood function:
$p(\mathbf{t} \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})$
so the log-likelihood is $\ln p(\mathbf{t} \mid w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)$, where
$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \big(t_n - w^T \phi(x_n)\big)^2$
is the SSE: sum-of-squares error function.
Parameter optimization via Maximum likelihood
• Solving w by Maximum likelihood:
? ? ? = (Φ ? Φ) − 1 Φ ? ?
N × M design matrix
Moore-Penrose pseudo-inverse
Φ † = (Φ ? Φ) − 1 Φ ?
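A minimal NumPy sketch of this closed-form solution (the polynomial basis, synthetic data, and noise level are illustrative assumptions, not taken from the slides):

```python
import numpy as np

# Illustrative setup: 1-D inputs, polynomial basis (the slides do not fix a basis)
rng = np.random.default_rng(0)
N, M = 50, 4                                          # N data points, M basis functions
x = rng.uniform(-1.0, 1.0, size=N)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.1, size=N)  # noisy targets

# N x M design matrix: Phi[n, j] = phi_j(x_n), here phi_j(x) = x**j
Phi = np.vander(x, M, increasing=True)

# w_ml = (Phi^T Phi)^{-1} Phi^T t, computed via the Moore-Penrose pseudo-inverse
# (np.linalg.pinv is numerically safer than inverting Phi^T Phi directly)
w_ml = np.linalg.pinv(Phi) @ t
print(w_ml)
```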
Parameter optimization via Maximum likelihood
About bias parameter w0:
• Making the bias explicit and setting the derivative of $E_D$ with respect to $w_0$ to zero gives
$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j$, where $\bar{t} = \frac{1}{N}\sum_{n=1}^{N} t_n$ and $\bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N} \phi_j(x_n)$.
Thus the bias w0 compensates for the difference between the average (over the training set) of the target values and the weighted sum of the averages of the basis function values.
• Solving for the noise precision parameter β by ML (see the sketch after this list):
$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \big(t_n - w_{ML}^T \phi(x_n)\big)^2$
i.e. the inverse precision is the residual variance of the targets around the fitted regression function.
• Problem:
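Continuing the NumPy sketch above, $\beta_{ML}$ is simply the inverse of the mean squared residual:

```python
# 1/beta_ml = mean squared residual of the targets around the ML fit
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)
print(beta_ml)  # estimated noise precision for the synthetic data above
```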
Parameter optimization via Least Square
• Each class $C_k$, $k = 1, \ldots, K$, is described by its own linear model:
$y_k(x) = w_k^T x + w_{k0}$
• Group together using vector notation:
$y(x) = \widetilde{W}^T \widetilde{x}$, where the $k$-th column of $\widetilde{W}$ is $\widetilde{w}_k = (w_{k0}, w_k^T)^T$ and $\widetilde{x} = (1, x^T)^T$.
• Learning with a training data set $\{x_n, \mathbf{t}_n\}$, $n = 1, \ldots, N$, by minimizing a sum-of-squares error function:
$E_D(\widetilde{W}) = \frac{1}{2} \operatorname{Tr}\{(\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T)\}$, with solution $\widetilde{W} = (\widetilde{X}^T \widetilde{X})^{-1} \widetilde{X}^T T = \widetilde{X}^{\dagger} T$.
• Discriminant function:
$y(x) = \widetilde{W}^T \widetilde{x} = T^T (\widetilde{X}^{\dagger})^T \widetilde{x}$; assign x to the class with the largest output $y_k(x)$.
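A NumPy sketch of least-squares classification with 1-of-K target coding (the two-class synthetic data are an illustrative assumption):

```python
import numpy as np

# Two synthetic Gaussian classes in 2-D (illustrative, not from the slides)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, size=(30, 2)),
               rng.normal([2, 2], 0.5, size=(30, 2))])

# 1-of-K target matrix T: row n is the one-hot code of x_n's class
T = np.zeros((60, 2))
T[:30, 0] = 1.0
T[30:, 1] = 1.0

# Augmented design matrix: each row is x_tilde = (1, x^T)
X_tilde = np.hstack([np.ones((60, 1)), X])

# W_tilde = pseudo-inverse(X_tilde) @ T minimizes the sum-of-squares error
W_tilde = np.linalg.pinv(X_tilde) @ T

# Discriminant: pick the class with the largest output y_k(x)
pred = np.argmax(X_tilde @ W_tilde, axis=1)
print((pred == np.r_[np.zeros(30), np.ones(30)]).mean())  # training accuracy
```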
Maximum likelihood and least squares for linear regression and classification
Maximum likelihood estimation method (MLE)
The likelihood function indicates how likely the observed sample is as a function of the possible parameter values; maximizing it therefore selects the parameter values that are most likely to have produced the observed data. From a statistical point of view, MLE is usually recommended for large samples because it is versatile, applicable to most models and different types of data, and produces the most precise estimates.
Least squares estimation method (LSE)
Least squares estimates are obtained by fitting a regression line that minimizes the sum of squared deviations from the data points (the least-squares error). In reliability analysis, the line and the data are plotted on a probability plot.
In a linear model, if the errors follow a normal distribution, the least-squares estimators are also the maximum likelihood estimators.
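This equivalence is easy to check numerically; the sketch below (data, noise level, and model are assumptions) fits the same Gaussian-noise linear model by direct least squares and by minimizing the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic linear data with Gaussian noise (illustrative assumption)
rng = np.random.default_rng(2)
X = np.hstack([np.ones((40, 1)), rng.normal(size=(40, 2))])  # bias column + 2 features
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + rng.normal(0.0, 0.3, size=40)

# Least squares estimate
w_lse, *_ = np.linalg.lstsq(X, t, rcond=None)

# ML estimate: for Gaussian noise the negative log-likelihood is, up to
# constants, proportional to the sum of squared residuals
nll = lambda w: np.sum((t - X @ w) ** 2)
w_mle = minimize(nll, x0=np.zeros(3)).x

print(np.allclose(w_lse, w_mle, atol=1e-4))  # True: the two estimators coincide
```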
Fisher's linear discriminant
• From the view of dimensionality reduction: project x down to one dimension, $y = w^T x$, and classify by thresholding y.
• The simplest measure of the separation of the classes is the separation of the projected class means:
$m_2 - m_1 = w^T(\mathbf{m}_2 - \mathbf{m}_1)$, where $\mathbf{m}_k = \frac{1}{N_k} \sum_{n \in C_k} x_n$ and $m_k = w^T \mathbf{m}_k$.
• Problem: we can increase the magnitude of w to make $m_2 - m_1$ arbitrarily large! The Fisher criterion below removes this scale dependence.
Fisher's linear discriminant
• The Fisher criterion: maximize the ratio of the separation between the projected class means to the total within-class variance of the projected data:
$J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{w^T S_B w}{w^T S_W w}$ (a generalized Rayleigh quotient), where $s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2$.
• Between-class covariance matrix: $S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$
• Within-class covariance matrix: $S_W = \sum_{n \in C_1} (x_n - \mathbf{m}_1)(x_n - \mathbf{m}_1)^T + \sum_{n \in C_2} (x_n - \mathbf{m}_2)(x_n - \mathbf{m}_2)^T$
• Maximizing $J(w)$ yields the Fisher direction $w \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$.
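A short NumPy sketch of the Fisher direction (the two Gaussian classes are an illustrative assumption):

```python
import numpy as np

# Two elongated Gaussian classes in 2-D (illustrative, not from the slides)
rng = np.random.default_rng(3)
X1 = rng.normal([0, 0], [1.0, 0.4], size=(100, 2))
X2 = rng.normal([2, 1], [1.0, 0.4], size=(100, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)            # class means

# Within-class covariance S_W: summed scatter of each class about its mean
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w proportional to S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# Projected means separate well relative to the within-class spread
y1, y2 = X1 @ w, X2 @ w
print(y1.mean(), y2.mean(), y1.std(), y2.std())
```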