Splitting the dataset into training and test sets
- The next preprocessing step is to split the dataset into a training set and a test set.
- A common rule of thumb is to use about 80% of the rows for training and the remaining 20% for testing.
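To make the 80/20 ratio concrete, here is a minimal sketch on placeholder data (the arrays are invented for illustration and are not the dataset used below). It only shows how `test_size=0.2` translates into row counts for 10 samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (hypothetical): 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows; the rest go to the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))  # 8 2
```

Every row ends up in exactly one of the two sets, so the model is evaluated only on rows it never saw during training.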
Suppose the data looks like this (Data.csv):
Country | Age | Salary | Purchased |
---|---|---|---|
France | 44 | 72000 | No |
Spain | 27 | 48000 | Yes |
Germany | 30 | 54000 | No |
Spain | 38 | 61000 | No |
Germany | 40 | 63777 | Yes |
France | 35 | 58000 | Yes |
Spain | 38 | 52000 | No |
France | 48 | 79000 | Yes |
Germany | 50 | 83000 | No |
France | 37 | 67000 | Yes |
The dataset can be split as follows:
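import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column (Purchased)

# Taking care of missing data (earlier preprocessing step, omitted here)
# Encoding categorical variables (earlier preprocessing step, omitted here)

# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)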
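With 10 rows and test_size = 0.2, the training set gets 8 rows and the test set gets 2. The first three columns are the one-hot encoding of Country, and Purchased is encoded as 0 (No) and 1 (Yes). The fractional values 38.78 and 63777.78 equal the means of the other nine Age and Salary values, which is consistent with the missing-data step filling gaps with the column mean.

X_train:
France | Germany | Spain | Age | Salary |
---|---|---|---|---|
0.0 | 0.0 | 1.0 | 38.77777777777778 | 52000.0 |
0.0 | 1.0 | 0.0 | 40.0 | 63777.77777777778 |
1.0 | 0.0 | 0.0 | 44.0 | 72000.0 |
0.0 | 0.0 | 1.0 | 38.0 | 61000.0 |
0.0 | 0.0 | 1.0 | 27.0 | 48000.0 |
1.0 | 0.0 | 0.0 | 48.0 | 79000.0 |
0.0 | 1.0 | 0.0 | 50.0 | 83000.0 |
1.0 | 0.0 | 0.0 | 35.0 | 58000.0 |

X_test:
France | Germany | Spain | Age | Salary |
---|---|---|---|---|
0.0 | 1.0 | 0.0 | 30.0 | 54000.0 |
1.0 | 0.0 | 0.0 | 37.0 | 67000.0 |

y_train:
[0, 1, 0, 0, 1, 1, 0, 1]

y_test:
[0, 1]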
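The snippet above leaves the missing-data and encoding steps as placeholder comments. The following is one possible end-to-end sketch, not necessarily the exact code behind the output shown. It rests on several assumptions: the data is built inline instead of read from `Data.csv`, the two `np.nan` entries mark values assumed to be missing in the raw file (one Age for Spain, one Salary for Germany), missing values are filled with the column mean via `SimpleImputer`, and Country is one-hot encoded with `ColumnTransformer` and `OneHotEncoder`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Inline copy of the table. ASSUMPTION: the two np.nan entries were
# missing in the raw file; this is inferred, not stated by the source.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Taking care of missing data: fill NaN in Age and Salary with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[:, 1:3] = imputer.fit_transform(X[:, 1:3].astype(float))

# Encoding categorical variables: one-hot encode Country (column 0),
# pass Age and Salary through unchanged
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X = np.array(ct.fit_transform(X))

# Encode the target: "No" becomes 0 and "Yes" becomes 1
y = LabelEncoder().fit_transform(y)

# Hold out 20% of the rows (2 of 10) for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)  # (8, 5) (2, 5)
```

In this sketch the imputer and encoder are fitted on all rows before the split, mirroring the order of the placeholder comments. Fitting them on the training rows only is a common alternative when leakage from the test set is a concern.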