We define a criterion \(\small{ C(m) }\) that measures the quality of a model \(\small{ m }\) . In general, we seek the model \(\small{ m^{*} }\) that minimises the selection criterion \(\small{ C(m) }\) over all candidate models \(\small{ m }\) .
To begin, let us recall the mean squared prediction error ( \(\small{ MSPE }\) ) of our model \(\small{ m }\) . The MSPE is the expected value of the squared difference between the fitted values (predictions) \(\small{ {\hat{y}}_{i}\left(m\right) }\) and the values \(\small{ y_i^\ast }\) of the unobservable function.
\[MSPE(m)=\frac{1}{N} \sum_{i=1}^N E\left[\left(\hat{y}_i (m)-y_i^\ast\right)^2\right]\]
The MSPE contains information about the model bias, the estimation error and the variance. With an increasing number of included features \(\small{ \left|m\right| }\) , the model bias decreases while the estimation error increases. An estimate of the \(\small{ MSPE }\) is given by
\[\widehat{MSPE}\left(m\right)=\frac{1}{N}RSS\left(m\right)+2\frac{\left|m\right|}{N}{\hat{\sigma}}^2\]
where \(\small{ RSS\left(m\right)=N\cdot MSE=\sum\nolimits_{i=1}^{N}\left({\hat{y}}_i\left(m\right)-y_i\right)^2 }\) . The number of included variables \(\small{ \left|m\right| }\) acts as a “punishment” or penalty: the more features a model includes, the larger the value of the selection criterion, so overly complex models are not preferred. This prevents overfitting. Colin Mallows later proposed a normalised version of this criterion:
\[C_p\left(m\right)=\frac{RSS\left(m\right)}{{\hat{\sigma}}^2}-N+2\left|m\right|\]
Both criteria lead to the same optimal model \(\small{ m }\) , since \(\small{ C_p\left(m\right)=N\,\widehat{MSPE}\left(m\right)/{\hat{\sigma}}^2-N }\) is an increasing transformation of \(\small{ \widehat{MSPE}\left(m\right) }\) .
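As a small sketch (the function names are our own), both quantities can be computed directly from \(\small{ RSS\left(m\right) }\) , the variance estimate \(\small{ {\hat{\sigma}}^2 }\) , \(\small{ N }\) and \(\small{ \left|m\right| }\) :

```python
def mspe_hat(rss, sigma2_hat, n, k):
    """Estimated MSPE: RSS(m)/N + 2 (|m|/N) sigma_hat^2."""
    return rss / n + 2 * k / n * sigma2_hat

def mallows_cp(rss, sigma2_hat, n, k):
    """Mallows' C_p: RSS(m)/sigma_hat^2 - N + 2 |m|."""
    return rss / sigma2_hat - n + 2 * k
```

For example, with \(\small{ RSS=10 }\) , \(\small{ {\hat{\sigma}}^2=1 }\) , \(\small{ N=20 }\) and \(\small{ \left|m\right|=3 }\) , `mallows_cp` returns \(\small{ 10-20+6=-4 }\) .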
Cross Validation Criteria
Alternatives to the MSPE are cross-validation criteria:
\[CV(m)=\frac{1}{N}\sum_{i=1}^{N}\left({\hat{y}}_{-i}(m)-y_i\right)^2\]
This criterion is the mean squared difference between the observed value \(\small{ y_i }\) and the prediction \(\small{ {\hat{y}}_{-i}\left(m\right) }\) of the model trained on the data set with the \(\small{ i }\) -th point removed. Hence it is called a leave-one-out estimator.
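A minimal sketch of this computation for OLS (the function names and the use of NumPy are our own choices): the first function refits the model \(\small{ N }\) times, exactly as in the definition; the second uses the known OLS shortcut \(\small{ CV(m)=\frac{1}{N}\sum_{i=1}^{N}\left(e_i/\left(1-h_{ii}\right)\right)^2 }\) , where \(\small{ h_{ii} }\) are the diagonal entries of the hat matrix, so no refitting is needed.

```python
import numpy as np

def loo_cv(X, y):
    """Leave-one-out CV: refit OLS N times, each time predicting the held-out point."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs[i] = (X[i] @ beta - y[i]) ** 2
    return errs.mean()

def loo_cv_hat(X, y):
    """Closed form for OLS: CV(m) = (1/N) sum_i (e_i / (1 - h_ii))^2."""
    H = X @ np.linalg.pinv(X.T @ X) @ X.T  # hat matrix
    e = y - H @ y                          # ordinary residuals
    return np.mean((e / (1 - np.diag(H))) ** 2)
```

Both functions return the same value; the closed form is just much cheaper.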
Generalised Information Criterion
An important measure of model selection is the generalised information criterion ( \(\small{ GIC }\) ).
\[GIC\left(m\right)=\ln{\left({\widetilde{\sigma}}^2\left(m\right)\right)}+a_N\frac{\left|m\right|}{N}\]
The \(\small{ GIC }\) is based on the logarithm of the maximum-likelihood estimator \(\small{ {\widetilde{\sigma}}^2\left(m\right) }\) of the error variance, penalised by the term \(\small{ a_N\left|m\right|/N }\) .
We can derive two special cases of the \(\small{ GIC }\) that are very popular among statisticians.
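As a sketch, the \(\small{ GIC }\) is straightforward to compute once \(\small{ {\widetilde{\sigma}}^2\left(m\right) }\) is known (the function name is illustrative):

```python
import math

def gic(sigma2_tilde, n, k, a_n):
    """GIC(m) = ln(sigma_tilde^2(m)) + a_N * |m| / N.

    sigma2_tilde: ML estimate of the error variance for model m
    n: sample size N; k: number of included features |m|
    a_n: penalty weight (a_N = 2 gives AIC, a_N = ln N gives BIC)
    """
    return math.log(sigma2_tilde) + a_n * k / n
```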
1. Akaike Information Criterion (AIC)
The Akaike Information Criterion sets \(\small{ a_N=2 }\) in the \(\small{ GIC }\) formula. Thus, we have
\[AIC\left(m\right)=\ln{\left({\widetilde{\sigma}}^2\left(m\right)\right)}+2\frac{\left|m\right|}{N}\]
When using OLS regression, the maximum-likelihood estimator for the variance of the model’s residuals is \(\small{ {\widetilde{\sigma}}^2\left(m\right)=RSS/N }\) , where \(\small{ RSS }\) is the residual sum of squares. This yields
\[AIC_{OLS}\left(m\right)=\ln{\left(\frac{RSS}{N}\right)}+2\frac{\left|m\right|}{N}\]
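A small illustration on synthetic data (the data-generating process, seed and names are assumptions of this sketch): \(\small{ AIC_{OLS} }\) is evaluated over a sequence of nested OLS models, and models that omit a relevant feature score clearly worse.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 5))
# Synthetic truth (an assumption for this demo): only the first
# two features carry signal.
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=N)

def aic_ols(X_m, y):
    """AIC_OLS(m) = ln(RSS/N) + 2|m|/N for design matrix X_m."""
    beta, *_ = np.linalg.lstsq(X_m, y, rcond=None)
    rss = float(((y - X_m @ beta) ** 2).sum())
    return np.log(rss / len(y)) + 2 * X_m.shape[1] / len(y)

# Nested models: intercept plus the first k features.
ones = np.ones((N, 1))
aics = [aic_ols(np.hstack([ones, X[:, :k]]), y) for k in range(6)]
for k, a in enumerate(aics):
    print(f"k = {k}: AIC = {a:.3f}")  # smaller is better
```

On this data the models with \(\small{ k<2 }\) have a much larger \(\small{ RSS }\) , so their AIC is far higher than that of the model containing both relevant features.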
The \(\small{ AIC }\) criterion is asymptotically optimal under certain conditions, as are \(\small{ CV }\) , \(\small{ C_p }\) and \(\small{ MSPE }\) . However, \(\small{ AIC }\) is not consistent. Under the assumption that the “true model” is not in the candidate set, we are interested in the unknown “oracle model” \(\small{ m^\ast }\) ; asymptotic optimality means that, as \(\small{ N\rightarrow\infty }\) ,
\[\frac{\left|\left|g-\hat{y}\left(\hat{m}\right)\right|\right|^2}{\left|\left|g-\hat{y}\left(m^\ast\right)\right|\right|^2} \rightarrow 1\]
Moreover, the \(\small{ AIC }\) is asymptotically equivalent to the leave-one-out \(\small{ CV }\) criterion.
2. Bayesian-Schwarz Information Criterion (BIC)
The Bayesian Information Criterion sets \(\small{ a_N=\ln(N) }\) in the \(\small{ GIC }\) formula. Thus, we have
\[BIC\left(m\right)=\ln{\left({\widetilde{\sigma}}^2\left(m\right)\right)}+\ln(N)\frac{\left|m\right|}{N}\]
Again, when using OLS regression, the maximum-likelihood estimator for the variance of the model’s residuals is \(\small{ {\widetilde{\sigma}}^2\left(m\right)=RSS/N }\) , where \(\small{ RSS }\) is the residual sum of squares.
\[BIC_{OLS}\left(m\right)=\ln{\left(\frac{RSS}{N}\right)}+\ln(N)\frac{\left|m\right|}{N}\]
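Comparing the two penalty terms in a short sketch (the function names are our own): since \(\small{ \ln(N)>2 }\) once \(\small{ N\geq8 }\) , BIC punishes each additional feature more heavily than AIC in all but very small samples.

```python
import math

def penalty_aic(n, k):
    """Penalty term of AIC: 2|m|/N."""
    return 2 * k / n

def penalty_bic(n, k):
    """Penalty term of BIC: ln(N)|m|/N."""
    return math.log(n) * k / n

for n in (5, 8, 100, 10_000):
    # BIC's per-feature penalty overtakes AIC's once ln(n) > 2, i.e. n >= 8.
    print(n, penalty_aic(n, 1), round(penalty_bic(n, 1), 5))
```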
The \(\small{ BIC }\) criterion is consistent; however, it is not asymptotically optimal. If the true model is contained in the candidate set, then as \(\small{ N\rightarrow\infty }\) ,
\[\hat{m} \rightarrow m_{true}\]