Encoding categorical data

One Hot Encoding

  • Let's say a country column has 3 unique values: [France, Germany, Spain].
  • If we encode them as integers, e.g. {France: 0, Germany: 1, Spain: 2},
  • an ML model can treat the numbers as magnitudes, i.e. as if Spain had a larger effect on the dependent variable than France, even though the categories have no such order.
  • So we prefer to convert the data as follows:
France Germany Spain
1 0 0
0 1 0
0 0 1
  • The process of converting categorical data into this form (one 0/1 column per category) is known as "One Hot Encoding".
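The idea above can be sketched in plain Python, without any library. This is only an illustration: the helper name `one_hot` is made up for these notes, and sorting the categories alphabetically is an assumption chosen to match the column order shown above.

```python
# Minimal sketch of one-hot encoding in plain Python.
# `one_hot` is a hypothetical helper written for this note, not a library function.

def one_hot(values):
    """Return (categories, rows): sorted unique categories and one 0/1 row per value."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


countries = ["France", "Germany", "Spain"]
categories, rows = one_hot(countries)

print(categories)  # ['France', 'Germany', 'Spain']
for row in rows:
    print(row)     # [1, 0, 0], then [0, 1, 0], then [0, 0, 1]
```

Each row contains exactly one 1, so no category looks "larger" than another to the model.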
Country Age Salary
France 44 72000
Spain 27 48000
Germany 30 54000
Spain 38 61000
Germany 40 63777
France 35 58000
Spain 38 52000
France 48 79000
Germany 50 83000
France 37 67000
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# X is assumed to already hold the table above as an array with columns (Country, Age, Salary)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# [0] is the list of column indices that contain categorical data;
# in our case only the first column, 'Country', is categorical.
# remainder='passthrough' keeps the other columns unchanged instead of dropping them.
X = np.array(ct.fit_transform(X))
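  • The first three columns of the following table are the one-hot encoded countries, in the order ['France', 'Germany', 'Spain']; exactly one of them is 1.0 in each row. Age and Salary are passed through unchanged.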
France Germany Spain Age Salary
1.0 0.0 0.0 44.0 72000.0
0.0 0.0 1.0 27.0 48000.0
0.0 1.0 0.0 30.0 54000.0
0.0 0.0 1.0 38.0 61000.0
0.0 1.0 0.0 40.0 63777.0
1.0 0.0 0.0 35.0 58000.0
0.0 0.0 1.0 38.0 52000.0
1.0 0.0 0.0 48.0 79000.0
0.0 1.0 0.0 50.0 83000.0
1.0 0.0 0.0 37.0 67000.0
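To check which encoded column belongs to which country, the fitted encoder can be inspected. The following is a small self-contained sketch that uses only the first three rows of the table above; the variable names `X_encoded` and `categories` are chosen for this note, and it assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the example table: Country, Age, Salary
X = np.array(
    [["France", 44, 72000],
     ["Spain", 27, 48000],
     ["Germany", 30, 54000]],
    dtype=object,
)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X))

# The fitted OneHotEncoder stores the categories it found, in column order
categories = ct.named_transformers_["encoder"].categories_[0]
print(categories.tolist())  # ['France', 'Germany', 'Spain']
print(X_encoded)
```

The encoded columns come first and follow the sorted category order, which is why Spain ends up in the third column even though it appears second in the data.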

Label Encoding

  • To convert a column with binary values (yes/no, true/false, etc.) into 0 and 1, we use a label encoder. This is typically applied to the dependent variable y.

y = ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Classes are sorted, so 'No' becomes 0 and 'Yes' becomes 1
y = le.fit_transform(y)

# y is now [0 1 0 0 1 1 0 1 0 1]
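As a quick check of the mapping, the fitted encoder exposes the classes it learned and can translate the integers back to the original labels. A minimal sketch, assuming scikit-learn is installed; the names `y_encoded` and `y_decoded` are chosen for this note:

```python
from sklearn.preprocessing import LabelEncoder

y = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ lists the labels in sorted order; the position is the encoded value
print(le.classes_.tolist())  # ['No', 'Yes']
print(y_encoded.tolist())    # [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# inverse_transform maps the integers back to the original labels
y_decoded = le.inverse_transform(y_encoded)
print(y_decoded.tolist() == y)  # True
```

Keeping the fitted `le` around is useful later, for example to turn model predictions (0/1) back into 'No'/'Yes'.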