Feature Scaling
A Little About¶
- Feature scaling should be done after splitting the data into training and test sets, to prevent information leakage from the test set.
- Some ML models are sensitive to one feature dominating the others because of its larger scale, while others are not, so we may not always need to do this.
- Say one of the columns has values like [1, 55, 112, 28, 87], i.e. all the values are roughly in the range 0-100 and the difference between any two values is not far from their average difference. This data would do fine in model preparation.
- But if instead we have data like [5, 20, 12, 500, 6000, 22, 24], where a few values are orders of magnitude larger than the rest, it may not be good for our model, as the sketch below illustrates.
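For instance, distance-based models such as KNN work on Euclidean distances, which an unscaled, large-valued feature will dominate. Here is a minimal sketch with made-up salary and age values (the column names and the min/max values used for scaling are assumptions, not from the data above):

import numpy as np

# Two samples: (salary in dollars, age in years) -- made-up values
a = np.array([52000.0, 25.0])
b = np.array([60000.0, 55.0])

# Unscaled distance: almost entirely determined by the salary column
print(np.linalg.norm(a - b))  # ~8000.06, the age difference barely contributes

# After min-max scaling both columns to [0, 1] (assumed mins/maxes for the sketch)
a_scaled = np.array([(52000 - 40000) / 40000, (25 - 18) / 52])
b_scaled = np.array([(60000 - 40000) / 40000, (55 - 18) / 52])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.61, both features now contribute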
Scaler Methods¶
- So we scale the features to around the same level. There are two main methods to do this (a short sketch after the formulas shows both):
- Standardisation: the resulting values mostly fall in the range -3 to +3, and it works well in all cases.
\[ X_{stand} = \frac{X - \mathrm{mean}(X)}{\mathrm{std}(X)} \]
- Normalisation (min-max scaling): the resulting values fall in the range 0 to 1; recommended when the data is normally distributed.
\[ X_{norm} = \frac{X - \min(X)}{\max(X) - \min(X)} \]
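As a quick illustration of the two formulas, here is a minimal sketch that applies scikit-learn's StandardScaler and MinMaxScaler to the badly scaled column from above and prints the results:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A single column with a couple of very large values, reshaped to (n_samples, 1)
X = np.array([5, 20, 12, 500, 6000, 22, 24], dtype=float).reshape(-1, 1)

# Standardisation: (X - mean) / std -> values centred around 0
print(StandardScaler().fit_transform(X).ravel())

# Normalisation: (X - min) / (max - min) -> values in [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())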
We should never apply scaling to dummy variables, because they are already scaled (values are either 0 or 1). Applying standardisation would replace the 0/1 encoding with values in the (-3, +3) range, so the dummies would lose their meaning and end up affecting the model.
The Code¶
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# The first three columns are assumed to be dummy variables; scale only columns 3 onwards
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# Reuse the mean and SD learned from the training set on the test set
X_test[:, 3:] = sc.transform(X_test[:, 3:])
- Here, we do not need to apply fit again on the test data, because we want to use the same scaler fitted on the training data.
- I.e. we are using the same mean and SD obtained from the training data (by fit()) on both the training and test sets (by transform()). A small end-to-end sketch follows.
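Putting it together, here is a minimal end-to-end sketch with made-up data (a single 0/1 dummy column in front instead of the three assumed above), showing the split-then-scale order and that the test set is only transformed:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: column 0 is a dummy variable, columns 1-2 are numeric
X = np.array([[0, 44, 72000],
              [1, 27, 48000],
              [0, 30, 54000],
              [1, 38, 61000],
              [0, 40, 63000],
              [1, 35, 58000]], dtype=float)

# Split first, then scale, so the test set never leaks into the scaler
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

sc = StandardScaler()
X_train[:, 1:] = sc.fit_transform(X_train[:, 1:])  # fit on training data only
X_test[:, 1:] = sc.transform(X_test[:, 1:])        # reuse the training mean and SD
print(X_train)
print(X_test)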