Remove duplicate tuples (rows) from the dataset | Python Programming
Duplicate tuples, or rows, in a dataset are those in which all the attribute values are exactly the same. They increase the size of the data without adding any information, so they should be removed.
Let us take a real-life example to understand duplication and how to remove it from a dataset.
We will use the iris dataset and check whether it contains any duplicate rows.
The iris dataset has 4 features and 1 target column.
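As a rough illustration of that layout, the sketch below builds a small hand-typed sample with four feature columns and one target column. The column names and the choice of five rows are assumptions made for this example; the full iris dataset has 150 rows and ships with libraries such as scikit-learn.

```python
import pandas as pd

# Hand-typed sample in the iris layout (NOT the full 150-row dataset).
# Column names are assumed for illustration.
iris_sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 5.8, 6.3, 5.8],
        "sepal_width": [3.5, 3.2, 2.7, 3.3, 2.7],
        "petal_length": [1.4, 4.7, 5.1, 6.0, 5.1],
        "petal_width": [0.2, 1.4, 1.9, 2.5, 1.9],
        "species": ["setosa", "versicolor", "virginica", "virginica", "virginica"],
    }
)

print(iris_sample.shape)   # (number of rows, number of columns)
print(iris_sample.head())  # first few rows
```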
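How to check which tuples (rows) are duplicates?
The pandas DataFrame.duplicated() function tells us which tuples or rows in the dataset are duplicates. It returns one True/False value per row and, by default, marks every repeat of a row except its first occurrence.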
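A minimal sketch of this check, assuming a small hand-typed DataFrame named df (a stand-in for the real dataset) in which the rows at positions 2 and 4 are identical:

```python
import pandas as pd

# Hypothetical sample: rows 2 and 4 hold exactly the same values.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 5.8, 6.3, 5.8],
        "sepal_width": [3.5, 3.2, 2.7, 3.3, 2.7],
        "petal_length": [1.4, 4.7, 5.1, 6.0, 5.1],
        "petal_width": [0.2, 1.4, 1.9, 2.5, 1.9],
        "species": ["setosa", "versicolor", "virginica", "virginica", "virginica"],
    }
)

# Default keep="first": only the later copy (row 4) is flagged.
dup_mask = df.duplicated()

# keep=False: every row that has an identical twin is flagged (rows 2 and 4).
all_copies = df.duplicated(keep=False)

print(df[dup_mask])    # the rows that would be dropped
print(df[all_copies])  # both members of the duplicate pair
```

Passing keep=False is handy when you want to see both members of a duplicate pair side by side before deciding what to drop.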
The check shows that rows 101 and 142 are duplicates of each other. We will remove one of them from the data and keep the other.
How to remove duplicate tuples (rows) from the dataset?
The pandas DataFrame.drop_duplicates() function removes the duplicate tuples (rows) from the dataset. By default it returns a new DataFrame that keeps the first occurrence of each row.
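The removal step might look like the sketch below. The DataFrame df is again a hypothetical hand-typed sample with one repeated row, not the real dataset.

```python
import pandas as pd

# Hypothetical sample: the last row repeats the third one.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 5.8, 6.3, 5.8],
        "sepal_width": [3.5, 3.2, 2.7, 3.3, 2.7],
        "petal_length": [1.4, 4.7, 5.1, 6.0, 5.1],
        "petal_width": [0.2, 1.4, 1.9, 2.5, 1.9],
        "species": ["setosa", "versicolor", "virginica", "virginica", "virginica"],
    }
)

# Returns a new DataFrame; df itself is left unchanged.
deduped = df.drop_duplicates()

print(len(df), "->", len(deduped))  # row count before and after
print(deduped.index.tolist())       # original row labels are kept
```

The surviving rows keep their original index labels. If a clean 0..n-1 index is preferred, deduped.reset_index(drop=True) can be applied afterwards.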
Let us now check if the duplicate tuples have been removed.
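One way to run this check is to call duplicated() again on the cleaned data and count the flagged rows. The sketch below assumes a small hypothetical DataFrame rather than the real dataset.

```python
import pandas as pd

# Hypothetical sample with one repeated row (the last one).
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 5.8, 6.3, 5.8],
        "petal_length": [1.4, 5.1, 6.0, 5.1],
        "species": ["setosa", "virginica", "virginica", "virginica"],
    }
)

deduped = df.drop_duplicates()

# Count rows still flagged as duplicates, before and after the cleanup.
before = int(df.duplicated().sum())
after = int(deduped.duplicated().sum())

print("duplicates before:", before)
print("duplicates after:", after)
```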
So yes, our dataset is now free of duplicate tuples (rows).
Larger or more complex datasets may contain more than two copies of the same tuple. The same process applies in such cases: by default, drop_duplicates() keeps the first occurrence of each tuple and removes every later copy.
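To illustrate the case of more than two copies, here is a small sketch with made-up data (the columns a and b are hypothetical) in which one row appears three times and another twice:

```python
import pandas as pd

# Made-up data: (1, "x") appears three times, (2, "y") twice, (3, "z") once.
df = pd.DataFrame(
    {
        "a": [1, 1, 1, 2, 2, 3],
        "b": ["x", "x", "x", "y", "y", "z"],
    }
)

# Default: keep the first occurrence of each distinct row.
keep_first = df.drop_duplicates()

# keep="last": keep the last occurrence instead.
keep_last = df.drop_duplicates(keep="last")

print(keep_first.index.tolist())
print(keep_last.index.tolist())
```

Either way exactly one copy of each distinct row survives; the keep parameter only decides which copy that is.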