Splitting the dataset into training and test sets

  • The next preprocessing step is to split the dataset into a training set and a test set
  • A common choice is about 80% of the rows for training and the remaining 20% for testing
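
To make the 80/20 proportion concrete, here is a minimal sketch using scikit-learn's train_test_split. The arrays are hypothetical placeholders (10 made-up samples), not the dataset discussed in these notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, labels 0..9
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(len(X_train), len(X_test))  # 8 2
```

Every row lands in exactly one of the two sets, so the model is never evaluated on a row it was trained on.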

Suppose the data is as follows (Data.csv):

Country   Age   Salary   Purchased
France    44    72000    No
Spain     27    48000    Yes
Germany   30    54000    No
Spain     38    61000    No
Germany   40    63777    Yes
France    35    58000    Yes
Spain     38    52000    No
France    48    79000    Yes
Germany   50    83000    No
France    37    67000    Yes

The dataset is split as follows:

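import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data: every column except the last is a feature, the last is the label
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Taking care of missing data (not shown)
# Encoding categorical variables (not shown)

# Hold out 20% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)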
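
Result after the missing-data and encoding steps: 8 rows go to the training set and 2 to the test set. The first three columns are the one-hot encoded Country (France, Germany, Spain), followed by Age and Salary. In y, 0 means No and 1 means Yes.

X_train:
0.0 0.0 1.0 38.77777777777778 52000.0
0.0 1.0 0.0 40.0 63777.77777777778
1.0 0.0 0.0 44.0 72000.0
0.0 0.0 1.0 38.0 61000.0
0.0 0.0 1.0 27.0 48000.0
1.0 0.0 0.0 48.0 79000.0
0.0 1.0 0.0 50.0 83000.0
1.0 0.0 0.0 35.0 58000.0

X_test:
0.0 1.0 0.0 30.0 54000.0
1.0 0.0 0.0 37.0 67000.0

y_train:
[0, 1, 0, 0, 1, 1, 0, 1]

y_test:
[0, 1]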
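
The two steps left as comments in the code above (missing data, categorical encoding) are not spelled out in these notes. The following is one possible sketch of the whole pipeline, not necessarily the method originally used. It rests on these assumptions: the raw file has one empty Age and one empty Salary field (suggested by the fractional values 38.78 and 63777.78 in the output), missing values are replaced by the column mean, and Country is one-hot encoded. The data is inlined with io.StringIO so that the example runs without Data.csv.

```python
import io

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Inline copy of the table. ASSUMPTION: the two empty fields below are
# missing in the raw file; the notes do not show the raw file itself.
raw = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""
dataset = pd.read_csv(io.StringIO(raw))
features = dataset.iloc[:, :-1]  # Country, Age, Salary
labels = dataset.iloc[:, -1]     # Purchased

# One-hot encode Country and fill missing Age/Salary with the column mean.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    transformers=[
        ("country", OneHotEncoder(), ["Country"]),
        ("numeric", SimpleImputer(strategy="mean"), ["Age", "Salary"]),
    ],
    sparse_threshold=0,
)
X = preprocess.fit_transform(features)
y = LabelEncoder().fit_transform(labels)  # No -> 0, Yes -> 1

# Same split call as in the notes: 20% of the rows held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)
print(y_train, y_test)
```

The column order produced here (three Country columns, then Age and Salary) matches the layout of the output shown above.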