Random Undersampling to Handle a Skewed Dataset

Posted On Aug. 10, 2020

Prerequisite: What are Skewed or Imbalanced datasets?

Random Undersampling in Data Science is about creating a balanced dataset by randomly removing some of the data that belongs to the dominant (majority) class.

Let us take an example to understand this. We will be working with the Credit Card dataset.

Importing the libraries

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Importing the dataset and creating a data frame

Code

data = pd.read_csv("creditcard.csv")

Let us check what columns are there in the dataset

Code

data.columns

Output

Code

data['Class'].value_counts()

Output

Here “Class” is a categorical variable with values 0 and 1, where 0 means “No Fraud” and 1 means “Fraud”. We can clearly see that the “No Fraud” data dominates the “Fraud” data by a very wide margin. Hence this dataset is highly skewed. A model trained on it as-is tends to be biased toward the dominant class: it can reach high accuracy simply by predicting “No Fraud” almost every time, while missing most of the fraud cases.
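To put a number on the skew, it can help to look at class proportions rather than raw counts. The following is a minimal sketch on a small toy data frame; the values in `toy` are made up for illustration and are not taken from the Credit Card dataset.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values): 95 "No Fraud", 5 "Fraud"
toy = pd.DataFrame({"Class": [0] * 95 + [1] * 5})

# Absolute number of samples per class
counts = toy["Class"].value_counts()

# Share of each class in the dataset (sums to 1.0)
shares = toy["Class"].value_counts(normalize=True)

# How many majority samples there are for every minority sample
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print("Imbalance ratio:", imbalance_ratio)
```

The same three lines can be run on `data['Class']` to see how strongly the real dataset is skewed.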

To make it usable for training purposes, we first need to balance the dataset so that both classes are represented in equal proportion. To do this, we can use the Random Undersampling method.

Random Undersampling

We will first shuffle the entire dataset.

Code

data = data.sample(frac=1)

As we saw above, there are only 492 samples of “Fraud” data. So we need to extract 492 samples of “No Fraud” data from the shuffled dataset.

Code

fraud_data = data.loc[data['Class'] == 1]
# The data was shuffled above, so the first 492 rows are a random sample
non_fraud_data = data.loc[data['Class'] == 0][:492]
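As an alternative to shuffling and slicing with a hardcoded 492, the majority class can be sampled directly with `DataFrame.sample`, using the size of the minority class as `n`. This is only a sketch on made-up data; `toy`, `fraud_toy`, `non_fraud_toy` and the `Amount` values are hypothetical names and numbers chosen for the example.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values): 16 "No Fraud", 4 "Fraud"
toy = pd.DataFrame({
    "Amount": range(20),
    "Class": [0] * 16 + [1] * 4,
})

# Keep every minority ("Fraud") sample
fraud_toy = toy.loc[toy["Class"] == 1]

# Randomly draw as many majority samples as there are minority samples;
# random_state makes the draw reproducible
non_fraud_toy = toy.loc[toy["Class"] == 0].sample(n=len(fraud_toy), random_state=42)

print(len(fraud_toy), len(non_fraud_toy))
```

Deriving `n` from `len(fraud_toy)` means the code keeps working if the number of minority samples changes.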

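Because we now have an equal number of Fraud and Non-Fraud samples, the class imbalance in the dataset is gone. The new data is balanced across the two classes. (Despite the variable name used below, this means equal class counts, not a normal distribution in the statistical sense.)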

Let us concatenate both datasets, fraud_data and non_fraud_data

Code

normal_distributed_data = pd.concat([fraud_data, non_fraud_data])

Because we simply concatenated both the datasets one after the other, we need to shuffle the newly merged data.

Code

new_data = normal_distributed_data.sample(frac=1, random_state=42)

So we have finally performed Random Undersampling. Let us now examine our new dataset.

Code

new_data['Class'].value_counts()

Output

So, both classes now have 492 samples each, which means that the dataset is balanced.
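The steps above (select each class, cut the larger one down, concatenate, shuffle) can be wrapped into one reusable helper. The function below is a sketch rather than a definitive implementation: `random_undersample` and its parameters are names invented for this example, and it is shown on a small made-up data frame instead of the Credit Card dataset.

```python
import pandas as pd

def random_undersample(df, label_col="Class", random_state=42):
    """Randomly downsample every class to the size of the smallest class."""
    # Size of the smallest (minority) class
    n_min = df[label_col].value_counts().min()

    # Draw n_min rows from each class; the minority class is kept in full
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]

    # Concatenate, then shuffle so the classes are interleaved
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy stand-in for the real data (hypothetical values): 16 "No Fraud", 4 "Fraud"
toy = pd.DataFrame({
    "Amount": range(20),
    "Class": [0] * 16 + [1] * 4,
})

balanced = random_undersample(toy)
print(balanced["Class"].value_counts())
```

On the real data, the call would be `random_undersample(data)`, which should give the same class counts as the manual steps, although the particular “No Fraud” rows selected will generally differ.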

Visualization of the new dataset

Code

sns.countplot(x='Class', data=new_data)
plt.show()

Output
