
Feature Scaling

A Little About It

  • Feature scaling should be done after splitting the data into a training set and a test set, to prevent information leakage.
  • Some ML models are sensitive to features that dominate in magnitude and others are not, so scaling is not always necessary.
  • Say one of the columns has values like [1, 55, 112, 28, 87], i.e. all the values lie in the range 0-100 and are on a comparable scale. This data would work fine during model training.
  • But if we have data like [5, 20, 12, 500, 6000, 22, 24], the large values can dominate and hurt the model (see the sketch after this list).
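
For intuition, here is a small sketch (not from the notes above; the two features and their values are made up) of how an unscaled feature with large values dominates a Euclidean distance:

    import numpy as np

    # Two hypothetical samples: (age in years, salary in dollars)
    a = np.array([25.0, 50000.0])
    b = np.array([45.0, 52000.0])

    # Without scaling, the salary difference (2000) dominates the distance,
    # so the age difference (20 years) barely matters.
    print(np.linalg.norm(a - b))   # ~2000.1

    # After rescaling salary to thousands, both features contribute
    # on a comparable scale.
    a_scaled = np.array([25.0, 50.0])
    b_scaled = np.array([45.0, 52.0])
    print(np.linalg.norm(a_scaled - b_scaled))   # ~20.1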

Scaling Methods

  • So we scale the data to around the same level. There are two main methods to do this:
  • Standardization: works well in all cases; most of the resulting values fall roughly between -3 and +3
\[ X_{stand} = \frac{X - \text{mean}(X)}{\text{std}(X)} \]
  • Normalization: recommended when the data is normally distributed; the resulting values fall in the range 0 to 1
\[ X_{norm} = \frac{X - \min(X)}{\max(X) - \min(X)} \]
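
As a quick sketch, both methods can be applied to the skewed column from earlier using scikit-learn's StandardScaler and MinMaxScaler (this snippet is only for illustration and is not part of the original pipeline):

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    # The skewed column from the example above, shaped as a single-feature matrix
    X = np.array([[5], [20], [12], [500], [6000], [22], [24]], dtype=float)

    # Standardization: subtract the mean, divide by the standard deviation
    print(StandardScaler().fit_transform(X).ravel())

    # Normalization: squash the values into the range [0, 1]
    print(MinMaxScaler().fit_transform(X).ravel())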

We should never apply scaling to dummy variables, because they are already on a suitable scale (the values are either 0 or 1). Applying standardization would move them onto the standardized scale, so they would no longer be 0/1 indicators, which would ultimately affect the model.
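
A tiny sketch of why this matters, using a made-up 0/1 dummy column:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # A made-up dummy column: 0/1 membership of some category
    dummy = np.array([[0], [1], [1], [0], [1]], dtype=float)

    # Standardizing replaces the clean 0/1 encoding with values around
    # -1.22 and 0.82, which no longer read as simple category flags.
    print(StandardScaler().fit_transform(dummy).ravel())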

The Code

    from sklearn.preprocessing import StandardScaler

    sc = StandardScaler()
    # Fit the scaler on the training set only, then transform it
    X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
    # Reuse the scaler fitted on the training set to transform the test set
    X_test[:, 3:] = sc.transform(X_test[:, 3:])
  • Here, we do not fit the scaler again on the test data, because we want to apply the same scaler that was fitted on the training data to the test data.
  • I.e. we use the same mean and standard deviation obtained from the training data (by fit()) to transform both the training set and the test set (by transform()).
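
To see that the same training statistics are being reused, the fitted scaler exposes them as mean_ and scale_. This is just a sketch that continues from the code above and assumes X_train and X_test are numeric NumPy arrays:

    # After sc.fit_transform(X_train[:, 3:]) the scaler remembers the
    # training statistics; sc.transform(X_test[:, 3:]) reuses them.
    print(sc.mean_)    # per-column means learned from the training set
    print(sc.scale_)   # per-column standard deviations from the training set

    # transform() on the test set is equivalent to:
    # (X_test[:, 3:] - sc.mean_) / sc.scale_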