Encoding categorical data
One Hot Encoding
- Let's say a country column has 3 unique values: [France, Germany, Spain].
- If we convert this data in such a way that {France: 0, Germany: 1, Spain: 2}, the ML model can treat the values as weights, i.e. as if Spain has more effect on the dependent variable, even though that is not the case.
- So, we prefer converting this data as follows:
France | Germany | Spain |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
- The process of converting categorical data into this form is known as "One Hot Encoding" (a minimal sketch follows below).
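As a quick check of the idea, here is a minimal sketch (not part of the original notebook's pipeline) that one-hot encodes just the three country values with scikit-learn's OneHotEncoder; the small countries array is made up purely for illustration:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([['France'], ['Germany'], ['Spain']])  # one column, three categories

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(countries).toarray()  # default output is sparse, so densify for display

print(encoder.categories_)  # [array(['France', 'Germany', 'Spain'], ...)] -> column order
print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]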
Country | Age | Salary |
---|---|---|
France | 44 | 72000 |
Spain | 27 | 48000 |
Germany | 30 | 54000 |
Spain | 38 | 61000 |
Germany | 40 | 63777 |
France | 35 | 58000 |
Spain | 38 | 52000 |
France | 48 | 79000 |
Germany | 50 | 83000 |
France | 37 | 67000 |
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# [0] is the list of column indices that contain categorical data
# In our case the first column, 'Country', is the categorical column
# remainder='passthrough' keeps the remaining columns (Age, Salary) instead of dropping them
X = np.array(ct.fit_transform(X))
- The first three columns of the following table are the one-hot encoding of the country: one column each for 'France', 'Germany', and 'Spain'.
France | Germany | Spain | Age | Salary |
---|---|---|---|---|
1.0 | 0.0 | 0.0 | 44.0 | 72000.0 |
0.0 | 0.0 | 1.0 | 27.0 | 48000.0 |
0.0 | 1.0 | 0.0 | 30.0 | 54000.0 |
0.0 | 0.0 | 1.0 | 38.0 | 61000.0 |
0.0 | 1.0 | 0.0 | 40.0 | 63777.0 |
1.0 | 0.0 | 0.0 | 35.0 | 58000.0 |
0.0 | 0.0 | 1.0 | 38.0 | 52000.0 |
1.0 | 0.0 | 0.0 | 48.0 | 79000.0 |
0.0 | 1.0 | 0.0 | 50.0 | 83000.0 |
1.0 | 0.0 | 0.0 | 37.0 | 67000.0 |
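For reference, here is a self-contained sketch of the same transformation. The DataFrame is built inline from the table shown above (the original notebook presumably loads it from a CSV earlier, which is an assumption here), so this snippet can be run on its own:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Inline stand-in for the dataset: same rows as the Country/Age/Salary table above
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age':     [44, 27, 30, 38, 40, 35, 38, 48, 50, 37],
    'Salary':  [72000, 48000, 54000, 61000, 63777, 58000, 52000, 79000, 83000, 67000],
})
X = df.values  # feature matrix with Country in column 0

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))

print(X[0])  # first row: France encoded as [1, 0, 0], followed by Age and Salary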
Label Encoding
- To convert a column with binary values (yes/no, true/false, etc.), we use a label encoder.
y = ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y = [0 1 0 0 1 1 0 1 0 1]
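A self-contained sketch of this step, with the y values copied from the output above; the inverse_transform call at the end simply shows how to map the integers back to the original labels:

import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(le.classes_)                          # ['No' 'Yes'] -> 'No' maps to 0, 'Yes' maps to 1
print(y_encoded)                            # [0 1 0 0 1 1 0 1 0 1]
print(le.inverse_transform(y_encoded[:3]))  # ['No' 'Yes' 'No'] -> recover the original labels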