Data Science #4: Data Preprocessing

The first thing when obtaining new data is getting familiar with the structure. That is for instance, the meaning, scaling and distribution of variables, the number of observations and potential relationship or dependencies between columns or rows. General measures of statistics (mean, median, interquartile range) can be useful.

Data Selection

The selection of data depends on the business question or the objective of the model. If possible, the raw data should be as clean as possible. In general, the more observations you can get, the more possibilies you will have in the modelling process. You can always select a sub-sample. However, if you have only few observations to train your model on, then you might accept certain limitations of the resulting model.

Cleaning Data

Before training any model, it is vital to reduce noise in the data set. This noisiness is caused by errors, missing values or outliers. Hence, the detection and a proper treatment influences the outcome of your prediction significantly. Concerning errors in your data, so far we can say: Correct the error values if feasible or else treat them as missing values as follows.

Missing Values

One of the first things during data preprocessing is identifying missing values. We can simply achieve this by finding empty cells in our tabular data df:

sapply(df, function(x) sum(

Sometimes it might not be as easy as finding empty cells depending on the data source. Often missing values are decoded by the researcher with some specific value (e.g. “9” or “999”). Useful commands in R to identify those values are




While dealing with missing values the data scientist can follow different approaches. If there is a large data set with only few missing values, it might be comfortable to just delete all incomplete observations.

df <- na.omit(df$columnX)

However in most cases this cleaning method is unsuitable because too much information is lost. Sometimes even missing values have a special meaning. In this case we could add a dummy variable containing the information whether a value was missing or not.

df$newDummyCol = ifelse($columnX),1,0)

After storing this information, the column with the missing values can be edited. The data scientist makes use of various imputation techniques. That is estimating the missing values and filling in a proper estimate for every single observation. A very naïve strategy would be replacement by the mean or median of the complete observations.

df$columnX[$columnX)] = mean(df$columnX[!$columnX)])

More sophisticated methods are linear regressions or decision trees based on the other features/variables

regr =lm(columnX~colY+colZ, data=df[!$columnX),])
df$columnX[$columnX)]=predict(regr, newdata = df[$columnX),])

There are also some packages that can do this process very conveniently.

Outlier Detection

Outlying values are a big issue in data preprocessing. Here, we speak of values that are located far too distant from most other observations. Thus, they might have a misleading effect on the predictions and cause biases or high variance.
The detection of outliers can be obtained by visualisation using histograms or boxplots and by numerical approaches. A simple z-transformation helps to standardise/normalise the data in order to get a better overview on the distribution


where \(\small{ \mu }\) is the mean and \(\small{ \sigma }\) the standard deviation. In R we apply this by

df$columnX = standardize(df$columnX)

We assume that outlying values have z-values larger than 3 or smaller than -3.
Another numerical method to detect outliers is the squared Mahalanobis distance. In addition to the previous z-transformation, the squared Mahalanobis distance does also consider the depending or surrounding variables in context. Hence, this is an even better technique.

\[D^2\ =\ \left(x_j\ -\ \mu\right)^\prime\mathrm{\Sigma}^{-1}\ \left(x_j\ -\ \mu\right)\]

with the covariance matrix \(\small{ \mathrm{\Sigma} }\).

The treatment of outliers resembles the missing value treatment. If these outliers are no errors, it might be possible to keep them in the data set. Otherwise imputation techniques like mean replacement or linear regression can be applied. In some cases, it might even be sensible to exclude the outlying observations.

Data Transformation

Now having a nice clean data set, you should still not start building your prediction model.  Depending on the model type you are using, the transformation of certain variables is necessary in order to increase the performance and accuracy levels of predictions.

Feature Scaling

When it comes to different algorithms, you as a data scientist must consider different requirements. Many models use the Euclidean distance measure between two data points. Hence, they struggle with different scaling of the variables. Especially machine learning algorithms (e.g. k-nearest-neighbors) can reach much better performances when the features are normalised/standardised.

Models that use linear or logistic regression follow the Gauss-Markov-theorem and will (by definition) always give the best linear unbiased estimator (BLUE). That is giving weights to the features automatically. In other words, we don’t have to scale our variables when using linear or logistic regression techniques, but it helps the data scientist to interpret the resulting model. With normalised features, the interpretation of the intercept will be the estimate of \(\small{ y }\) when all regressors \(\small{ x_i }\) are set to their mean value. The coefficients are now in units of the standard deviation. Furthermore, scaling can reduce the computational expense of your model training.

Another set of algorithms, that are completely unaffected by scaling of features are all tree-based learners. But even when you use a decision tree, there is a good reason for normalising your data: You leave the door open to add other learning algorithms for comparison or ensemble learning.

Unsupervised Binning/Distrecisation

The process of aggregating data into groups of higher level of abstraction is called binning (for two groups) or discretisation (for more than two groups). Based on their value, the data is put into classes representing hierarchies. For instance, a person’s height is a continuous variable and can be assigned to dicrete groups “small”, “medium” and “large”. Although some information is lost in this process, you achieve a much cleaner data set with less noise. In most cases, you achieve a more robust prediction model. The number of groups can be chosen by the data scientist.

d = discretize(df$columnX, cuts=3, min_unique=0, na.rm=TRUE)
df$columnXdiscr = predict(d, new_data = df$columnX)

Supervised Discretisation

The term “supervised” is always the hint for labelled data (including the outcome or target value). As an initial step, the target variable can be used to discretise a feature. This method can be used in order to maximise the information gain of discretisation for your prediction.

Feature Engineering

The topic of feature engineering is essential for modern machine learning. It is both difficult and expensive, since the data scientist needs to perform most part of it manually. Our goal is to change or create new features with a higher explanatory character than the raw data could offer.

In most cases, lots of experience and theoretical background are needed in order to understand the relationship between different variables.

a. Aggregations

The simplest way of feature engineering is a basic aggregation of data. That can be obtained by discretisation techniques (already mentioned above).

b. Transformation of Distribution

There are different famous transformation techniques in order to change the distribution of the data. Assume a variable contains both small numeric values and some large values. Here, we can apply a simple log-transformation to lower the larger values and hence tighten the distribution. Another example is the Box-Cox-transformation.
Often the goal is to achieve a more normal distribution, since many learning algorithms (e.g. Regression) need that normal assumption.

c. Trend variables

When it comes to time series, an important feature can be the marginal difference between two data points. This can give us an absolute trend \(\small{ \frac{x_t-x_{t-i}}{i} }\) or a relative trend \(\small{ \frac{x_t-x_{t-i}}{x_{t-i}} }\).

d. Weight of Evidence (WOE)

The weight of evidence (WOE) is a powerful transformation of a discrete (categorical) feature into a numerical measure describing the good or bad influence of this category on the target variable (information value).



\[p\left(GOOD\right)_i=\frac{absolute\ number\ of\ GOODs\ in\ category\ i}{total\ number\ of\ GOODs\ in\ all\ categories\ 1..i}\]

On the training data we can calculate the WOE in R using the library InformationValue:

#load library

#calculate and add column of WOEs on labelled data set (train data)
df$columnXwoe=WOE(df$columnXdiscr, df$target, valueOfGood = 1)

#calculate and store WOE table to merge with unlabelled new data
woe.table.colX=WOETable(df$columnXdiscr, df$target, valueOfGood = 1)

Remember to store the WOETable for later merging with unlabelled new data sets:$columnXdiscr)
tmp2=(merge(tmp, woe.table.colX))
df$columnXwoe =tmp2[order(tmp2$row),]

Feature Selection

It is often not to difficult to create a large variety of features and transformations of the data. However, in the end the best transformation techniques might add only little explanatory power to the machine learning model. Therefore, a proper selection of features needs to be performed before building the model with too many (maybe unimportant) features. The curse of dimensionality is often mentioned in this context. The more different features we add to our model, the more dimensions are used and hence the model becomes more complex. The problem of a more complex model can be a high variance when applying it. Another difficulty of many features is the cost of time and computing power for training the model.

A proper set of features can be obtained by different approaches:

a. Filter Approach

A statistical indicator for variable importance is given by a high (or at least moderate) correlation between a feature and the target variable. Depending on the scaling of the variables, a suitable correlation measures is taken into account (e.g. Pearson, Fisher,…).

The filter approach assesses only one feature at a time and does not consider the interaction between different explanatory features. Therefore, the filter approach is less effective than the following method.

b. Wrapper Approach

A more considerate way to select features is the wrapper approach, that also considers the relationship between the all selected features. This is done iteratively by comparing all possible \(\small{ 2^{n}-1 }\) models containing up to a maximum of \(\small{ n }\) explanatory variables.

The selection can be achieved by forward, backward or stepwise selection. For instance, the forward selection starts with only one feature and iteratively adds one more feature which adds the most explanatory power to the model. This process is stopped as soon as a predefined threshold of marginal accuracy added to the model is not passed anymore.

Using the wrapper approach usually gives a better performance, but also takes more computational cost.