Creating training and testing sets from a single dataset
When solving any machine learning problem, we build a model (or algorithm) and train it before using it in practice.
To train the model we need data. Generally, we have one or more datasets available for training a particular model.
After training the model, we need to test it to check its accuracy.
There is one thing we must take care of: we should never train and test our model on the same data.
Why can't we train and test our model on the same dataset?
This is because, if we train our model on a particular dataset and also test it on that same dataset, we cannot confirm that the model will work well when it is shown new data instances.
Since we test the model on the same data it learned from, it will obviously show good results. It is like taking an exam for which you already know the questions and answers: you will score high marks because you already know everything. But what about a surprise test with questions you have never been taught?
In machine learning terms: what about new data instances that our model has never seen before? Can we confirm high accuracy on them? No, we cannot, because we did not test our model on new data instances.
What should we do when we have just a single dataset?
When we have just a single dataset, we can split it into two parts, one for training and one for testing our model. But before we do that, we should shuffle the dataset properly, so that neither part is biased towards particular data instances or classes.
Finally, splitting the dataset
There are various methods for splitting a dataset proportionally into two parts. We will discuss two of them: the first uses NumPy's permutation() method, and the second uses scikit-learn's train_test_split() function.
Importing the libraries
import numpy as np
from sklearn import datasets
Loading the dataset
Here, we are using the iris dataset, to understand the concept of training and testing datasets.
iris = datasets.load_iris()
x = iris.data
y = iris.target
Using NumPy’s permutation method
i = np.random.permutation(len(iris.data))
Each value in "i" is a row index of the dataset, arranged in a random order; indexing the data with "i" therefore shuffles the rows.
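As a quick sanity check (the exact values differ on every run, since the permutation is random), we can verify that "i" contains every row index of the iris dataset exactly once:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
i = np.random.permutation(len(iris.data))

print(i[:5])  # first five shuffled row indices (random each run)

# Sorting the permutation recovers 0..149, so no row is lost or repeated.
print(np.array_equal(np.sort(i), np.arange(len(iris.data))))
```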
Splitting the dataset
Here, we split the dataset so that the last 10 shuffled rows form the testing set and the remaining rows form the training set.
x_train = x[i[:-10]]
y_train = y[i[:-10]]
x_test = x[i[-10:]]
y_test = y[i[-10:]]
Note that in this method we first shuffle the dataset and then split it into two parts.
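Putting the steps above together, we can check the resulting sizes. The iris dataset has 150 rows, so we expect 140 rows for training and 10 for testing:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
x, y = iris.data, iris.target

# Shuffle the row indices, then split off the last 10 rows for testing.
i = np.random.permutation(len(x))
x_train, y_train = x[i[:-10]], y[i[:-10]]
x_test, y_test = x[i[-10:]], y[i[-10:]]

print(x_train.shape, x_test.shape)  # (140, 4) (10, 4)
```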
Using Scikit-Learn’s train_test_split() method
In this method, we do not need to shuffle the dataset ourselves, because train_test_split() shuffles it for us by default.
In the example below, we split the dataset in a 6:4 ratio: 60% of the data is used for training and the remaining 40% is used for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.6, test_size=0.4)
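Note that train_test_split() returns the arrays in the order train/test for each input, i.e. X_train, X_test, y_train, y_test. With 150 rows in the iris dataset, a 60/40 split yields 90 training rows and 60 testing rows (random_state is added here only to make the example reproducible):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.6, test_size=0.4, random_state=0
)

print(X_train.shape, X_test.shape)  # (90, 4) (60, 4)
```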
We can also use cross-validation, which is an even better method of testing our model.
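As a brief sketch of the idea, scikit-learn's cross_val_score() handles the splitting for us (the choice of KNeighborsClassifier here is only an illustrative assumption, not part of the text above):

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
model = KNeighborsClassifier()  # illustrative model choice

# 5-fold cross-validation: the data is split into 5 parts; the model is
# trained on 4 parts and tested on the remaining part, 5 times over.
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(scores.mean())  # average accuracy across the 5 folds
```

This way every row is used for testing exactly once, giving a more reliable accuracy estimate than a single train/test split.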