Feature Scaling
A Little About¶
- Feature scaling should be done after splitting the data into training and test sets, to prevent information leakage from the test set.
- Some ML models are sensitive to one feature dominating the others because of its larger scale, while others are not, so we may not always need to do this.
- Say one of the columns has values like [1, 55, 112, 28, 87], i.e. all the values are roughly in the range 0-100 and the difference between any two values is not far from their average difference. This data would do fine in model preparation.
- But if instead we have data like [5, 20, 12, 500, 6000, 22, 24], where a few values are orders of magnitude larger than the rest, it may not be good for our model, as the sketch below illustrates.
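For instance, distance-based models such as KNN work on Euclidean distances, which an unscaled, large-valued feature will dominate. Here is a minimal sketch with made-up salary and age values (the column names and the min/max values used for scaling are assumptions, not from the data above):

import numpy as np

# Two samples: (salary in dollars, age in years) -- made-up values
a = np.array([52000.0, 25.0])
b = np.array([60000.0, 55.0])

# Unscaled distance: almost entirely determined by the salary column
print(np.linalg.norm(a - b))  # ~8000.06, the age difference barely contributes

# After min-max scaling both columns to [0, 1] (assumed mins/maxes for the sketch)
a_scaled = np.array([(52000 - 40000) / 40000, (25 - 18) / 52])
b_scaled = np.array([(60000 - 40000) / 40000, (55 - 18) / 52])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.61, both features now contribute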
Scaler Methods¶
- So we scale the features to around the same level. There are two main methods to do this (a short sketch after the formulas shows both):
- Standardisation: the resulting values mostly fall in the range -3 to +3, and it works well in all cases.
\[ X_{stand} = \frac{X - \mathrm{mean}(X)}{\mathrm{std}(X)} \]
- Normalisation (min-max scaling): the resulting values fall in the range 0 to 1; recommended when the data is normally distributed.
\[ X_{norm} = \frac{X - \min(X)}{\max(X) - \min(X)} \]
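As a quick illustration of the two formulas, here is a minimal sketch that applies scikit-learn's StandardScaler and MinMaxScaler to the badly scaled column from above and prints the results:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A single column with a couple of very large values, reshaped to (n_samples, 1)
X = np.array([5, 20, 12, 500, 6000, 22, 24], dtype=float).reshape(-1, 1)

# Standardisation: (X - mean) / std -> values centred around 0
print(StandardScaler().fit_transform(X).ravel())

# Normalisation: (X - min) / (max - min) -> values in [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())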
We should never apply scaling to dummy variables, because they are already scaled (values are either 0 or 1). Applying standardisation would replace the 0/1 encoding with values in the (-3, +3) range, so the dummies would lose their meaning and end up affecting the model.
The Code¶
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# The first three columns are assumed to be dummy variables; scale only columns 3 onwards
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# Reuse the mean and SD learned from the training set on the test set
X_test[:, 3:] = sc.transform(X_test[:, 3:])
- Here, we do not need to apply fit again on the test data, because we want to use the same scaler fitted on the training data.
- I.e. we are using the same mean and SD obtained from the training data (by fit()) on both the training and test sets (by transform()). A small end-to-end sketch follows.
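Putting it together, here is a minimal end-to-end sketch with made-up data (a single 0/1 dummy column in front instead of the three assumed above), showing the split-then-scale order and that the test set is only transformed:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: column 0 is a dummy variable, columns 1-2 are numeric
X = np.array([[0, 44, 72000],
              [1, 27, 48000],
              [0, 30, 54000],
              [1, 38, 61000],
              [0, 40, 63000],
              [1, 35, 58000]], dtype=float)

# Split first, then scale, so the test set never leaks into the scaler
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

sc = StandardScaler()
X_train[:, 1:] = sc.fit_transform(X_train[:, 1:])  # fit on training data only
X_test[:, 1:] = sc.transform(X_test[:, 1:])        # reuse the training mean and SD
print(X_train)
print(X_test)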