Linear Regression #3: Variable Selection

In Econometrics and Data Science, validation is a necessary step in building a prediction model. I have already presented the most common techniques of model validation in this article.
Let us now take a deeper look at these methods from a statistical perspective. For robust prediction models, it is vital to split the data with \(\small{ N }\) observations into two different sets: the training set with \(\small{ n_1<N }\) rows and the test/validation set with \(\small{ n_2=N-n_1 }\) rows.
We then build the regression model on the training data and evaluate it on the untouched test data using the mean squared error (MSE):
\[MSE=\frac{1}{n_2}\sum_{i=n_1+1}^{N}\left({\hat{y}}_i-y_i\right)^2\]
We choose the model that minimises the MSE for our validation set.
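To make this concrete, here is a minimal sketch of the split-and-evaluate procedure in Python with scikit-learn. The simulated data, the 80/20 split ratio and the variable names are illustrative assumptions, not part of the discussion above.

```python
# Minimal sketch: split the data, fit on the training set, score on the test set.
# The simulated data and the 80/20 split ratio are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # N = 200 observations, 5 regressors
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 3.0]) + rng.normal(size=200)

# Split into a training set (n1 rows) and a test/validation set (n2 = N - n1 rows)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training data only ...
model = LinearRegression().fit(X_train, y_train)

# ... and evaluate on the untouched test data with the MSE defined above
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Validation MSE: {mse:.3f}")
```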

I have also described the procedure of feature selection from a data scientist’s perspective in this article.

Sequential Variable Selection

There are essentially four different procedures for selecting which variables to include in the model.

Forward Selection

Starting from a minimal set of variables based on economic theory, the algorithm determines which regressor not yet included reduces the residual sum of squares (RSS) of the regression model the most. The algorithm iteratively adds variables until a stopping criterion is met, namely when \(\small{\Delta RSS < \tau }\), where \(\small{ \tau }\) is a threshold that prevents adding variables that bring little or no reduction in RSS.
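Below is a minimal sketch of forward selection driven by the RSS criterion described above. The helper rss() and the default value of the threshold \(\small{ \tau }\) are my own illustrative assumptions.

```python
# Minimal sketch of forward selection based on the RSS criterion;
# the rss() helper and the default tau are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit using the given columns."""
    if not cols:
        return float(np.sum((y - y.mean()) ** 2))    # intercept-only model
    pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
    return float(np.sum((y - pred) ** 2))

def forward_selection(X, y, tau=1e-2):
    selected, remaining = [], list(range(X.shape[1]))
    current_rss = rss(X, y, selected)
    while remaining:
        # Find the regressor not yet included that reduces the RSS the most
        scores = {j: rss(X, y, selected + [j]) for j in remaining}
        best = min(scores, key=scores.get)
        if current_rss - scores[best] < tau:          # stopping criterion: delta RSS < tau
            break
        selected.append(best)
        remaining.remove(best)
        current_rss = scores[best]
    return selected
```

Calling forward_selection(X_train, y_train) on the training data from the earlier sketch returns the column indices of the selected regressors.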

Backward Elimination

Working the other way around, the algorithm starts with the full set of variables and iteratively removes the features that contribute little or no reduction in RSS to the model, again until a stopping criterion is met.
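A matching sketch of backward elimination is shown below; it reuses the rss() helper and the assumed threshold \(\small{ \tau }\) from the forward-selection sketch.

```python
# Sketch of backward elimination, reusing rss() from the forward-selection example;
# tau is again an assumed threshold.
def backward_elimination(X, y, tau=1e-2):
    selected = list(range(X.shape[1]))                # start with the full set of variables
    current_rss = rss(X, y, selected)
    while selected:
        # Find the regressor whose removal increases the RSS the least
        scores = {j: rss(X, y, [k for k in selected if k != j]) for j in selected}
        best = min(scores, key=scores.get)
        if scores[best] - current_rss > tau:          # every remaining variable still matters
            break
        selected.remove(best)
        current_rss = scores[best]
    return selected
```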

Stepwise Selection

The combination of forward selection and backward elimination: the algorithm can both add and remove variables at each iteration.
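The sketch below combines the two previous routines into a rough stepwise procedure: each iteration first tries a forward step and then checks whether any included variable can be dropped again. It reuses rss() from above, and the thresholding logic is an illustrative assumption.

```python
# Rough sketch of stepwise selection: alternate forward and backward steps.
# Reuses rss() from the forward-selection sketch; tau is an assumed threshold.
def stepwise_selection(X, y, tau=1e-2):
    selected, remaining = [], list(range(X.shape[1]))
    current_rss = rss(X, y, selected)
    improved = True
    while improved:
        improved = False
        # Forward step: add the variable with the largest RSS reduction, if it exceeds tau
        if remaining:
            scores = {j: rss(X, y, selected + [j]) for j in remaining}
            best = min(scores, key=scores.get)
            if current_rss - scores[best] >= tau:
                selected.append(best)
                remaining.remove(best)
                current_rss = scores[best]
                improved = True
        # Backward step: drop any variable whose removal barely changes the RSS
        for j in list(selected):
            new_rss = rss(X, y, [k for k in selected if k != j])
            if new_rss - current_rss < tau:
                selected.remove(j)
                remaining.append(j)
                current_rss = new_rss
                improved = True
    return selected
```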

Random Variable Selection

Often neglected in traditional statistical theory, but nowadays very popular in well-known machine learning algorithms, a random selection of variables can also yield a proper prediction model. However, this method is only used in combination with many individual random models in a composite ensemble model, as in random forests.
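As a toy illustration, the sketch below builds a composite ensemble of OLS models, each fitted on a random subset of the regressors, in the spirit of random-subspace / random-forest style feature subsampling. The number of models and the subset size are illustrative assumptions.

```python
# Toy sketch of random variable selection inside an ensemble:
# each individual model sees only a random subset of the regressors,
# and the composite prediction averages the individual models.
# n_models and n_features are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

def random_subspace_ensemble(X_train, y_train, X_test,
                             n_models=25, n_features=3, seed=0):
    rng = np.random.default_rng(seed)
    predictions = []
    for _ in range(n_models):
        # Draw a random subset of regressor columns for this individual model
        cols = rng.choice(X_train.shape[1], size=n_features, replace=False)
        model = LinearRegression().fit(X_train[:, cols], y_train)
        predictions.append(model.predict(X_test[:, cols]))
    # Composite ensemble prediction: average over the individual random models
    return np.mean(predictions, axis=0)
```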