Prerequisite: What are Skewed or Imbalanced datasets?

Random Undersampling in Data Science is a technique for creating a balanced dataset by removing some of the samples that belong to the dominant (majority) class.
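As a quick illustration (on a small synthetic dataset, not the credit card data introduced below), random undersampling keeps all minority-class rows and randomly drops majority-class rows until the counts match:

```python
import numpy as np
import pandas as pd

# Synthetic imbalanced dataset: 95 majority (0) rows and 5 minority (1) rows
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=100),
    "Class": [0] * 95 + [1] * 5,
})

minority = df[df["Class"] == 1]
# Randomly sample as many majority rows as there are minority rows
majority_down = df[df["Class"] == 0].sample(n=len(minority), random_state=42)

balanced = pd.concat([minority, majority_down])
print(balanced["Class"].value_counts())
# Both classes now have 5 rows each
```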

Let us take an example to understand this. We will be working with the Credit Card dataset.

Importing the libraries

Code

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Importing the dataset and creating a data frame

Code

```
data = pd.read_csv("creditcard.csv")
```

Let us check what columns are there in the dataset

Code

```
data.columns
```

Output

Let us now check how the samples are distributed between the two classes.

Code

```
data['Class'].value_counts()
```

Output

Here "Class" is a categorical variable with values 0 and 1, where 0 means "No Fraud" and 1 means "Fraud". We can clearly see that the "No Fraud" data dominates the "Fraud" data by a very large margin. Hence this dataset is highly skewed and should not be used as-is for training the model; otherwise, the model will be biased toward the dominant class.

To make it usable for training purposes, we need to first balance the dataset, so that both classes are present in equal proportion. To do this, we can use the Random Undersampling method.
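One way to quantify the skew before resampling (sketched here on synthetic labels, since the actual output is not reproduced above) is to look at the normalized class proportions:

```python
import pandas as pd

# Synthetic stand-in for the 'Class' column: a heavy majority of zeros
labels = pd.Series([0] * 990 + [1] * 10, name="Class")

# normalize=True returns fractions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)
# 0    0.99
# 1    0.01
```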

### Random Undersampling

We will first shuffle the entire dataset.

Code

```
data = data.sample(frac=1)
```

As we saw above, there are only 492 samples of "Fraud" data. So we need to extract 492 samples of "No Fraud" data from the dataset.

Code

```
fraud_data = data.loc[data['Class'] == 1]
non_fraud_data = data.loc[data['Class'] == 0][:492]
```

Because we now have equal samples of Fraud and Non-Fraud data, the skewness in the dataset is gone: the two classes are equally represented in the new data.
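The same filter-and-slice pattern can be sketched on synthetic data; note that `[:492]` simply takes the first 492 rows of the majority class, which is why shuffling the dataset first matters:

```python
import numpy as np
import pandas as pd

# Synthetic data: 45 majority (0) rows and 5 minority (1) rows, pre-shuffled
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "amount": rng.normal(size=50),
    "Class": [0] * 45 + [1] * 5,
}).sample(frac=1, random_state=7)

fraud = df.loc[df["Class"] == 1]
non_fraud = df.loc[df["Class"] == 0][:len(fraud)]  # take as many as the minority

print(len(fraud), len(non_fraud))  # 5 5
```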

Let us concatenate both the datasets — fraud_data & non_fraud_data

Code

```
normal_distributed_data = pd.concat([fraud_data, non_fraud_data])
```

Because we simply concatenated the two datasets one after the other, all the Fraud rows come first, followed by all the Non-Fraud rows. So we need to shuffle the newly merged data.

Code

```
new_data = normal_distributed_data.sample(frac=1, random_state=42)
```

So we have finally performed Random Undersampling. Let us now examine our new dataset.

Code

```
new_data['Class'].value_counts()
```

Output

Both classes now have 492 values each, which means the dataset is now equally distributed.

#### Visualization of the new dataset

Code

```
sns.countplot(x='Class', data=new_data)
```

Output 