Removing Duplicates

Duplicate entries in a dataset can lead to inaccurate analysis and biased results. Pandas provides methods to identify and remove duplicate rows efficiently, ensuring your dataset is clean and unique. This tutorial demonstrates how to work with duplicates in a DataFrame.

Identifying Duplicates

You can use the duplicated() method to identify duplicate rows in your DataFrame. It returns a Boolean Series indicating, for each row, whether it repeats an earlier row. Here’s an example:

import pandas as pd

# Create a sample DataFrame with duplicate rows
data = {
    "Name": ["Karthick", "Durai", "Karthick", "Praveen"],
    "Age": [25, 30, 25, 22],
    "City": ["Chennai", "Coimbatore", "Chennai", "Madurai"]
}

df = pd.DataFrame(data)

# Identify duplicate rows
duplicates = df.duplicated()
print(duplicates)

Output

0    False
1    False
2     True
3    False
dtype: bool

Explanation: The duplicated() method flags rows that repeat an earlier row. By default (keep='first'), only later occurrences are marked, so the row at index 2 is flagged as a duplicate of the row at index 0, while the first occurrence itself is not.
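The keep and subset parameters change which rows get flagged. A quick sketch, using the same sample data as above:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Karthick", "Praveen"],
    "Age": [25, 30, 25, 22],
    "City": ["Chennai", "Coimbatore", "Chennai", "Madurai"]
})

# keep="last" marks the earlier occurrence instead of the later one
print(df.duplicated(keep="last").tolist())      # [True, False, False, False]

# keep=False marks every occurrence of a duplicated row
print(df.duplicated(keep=False).tolist())       # [True, False, True, False]

# subset= compares only the listed columns when checking for duplicates
print(df.duplicated(subset=["Name"]).tolist())  # [False, False, True, False]
```

keep=False is handy when you want to inspect all copies of the duplicated data before deciding which to drop.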

Removing Duplicates

To remove duplicate rows, use the drop_duplicates() method. It returns a new DataFrame in which every row is unique; the original DataFrame is left unchanged. Here’s an example:

# Remove duplicate rows
cleaned_df = df.drop_duplicates()
print(cleaned_df)

Output

       Name  Age        City
0  Karthick   25     Chennai
1     Durai   30  Coimbatore
3   Praveen   22     Madurai

Explanation: The drop_duplicates() method keeps the first occurrence of each row and drops later repeats, so the row at index 2 is removed. Note that the original index labels are preserved (the index jumps from 1 to 3); call reset_index(drop=True) if you want a clean sequential index afterward.
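drop_duplicates() accepts the same keep and subset parameters as duplicated(), which lets you deduplicate on selected columns or renumber the index afterward. A brief sketch, reusing the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Karthick", "Praveen"],
    "Age": [25, 30, 25, 22],
    "City": ["Chennai", "Coimbatore", "Chennai", "Madurai"]
})

# Deduplicate on the "Name" column only, keeping the last occurrence
by_name = df.drop_duplicates(subset=["Name"], keep="last")
print(by_name["Name"].tolist())  # ['Durai', 'Karthick', 'Praveen']

# Drop full-row duplicates, then renumber the surviving rows 0, 1, 2, ...
cleaned = df.drop_duplicates().reset_index(drop=True)
print(cleaned.index.tolist())    # [0, 1, 2]
```

Deduplicating on a subset is useful when rows differ in incidental columns (timestamps, IDs) but represent the same underlying record.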

Key Takeaways

  • Identifying Duplicates: Use duplicated() to locate duplicate rows in your DataFrame.
  • Removing Duplicates: Use drop_duplicates() to eliminate duplicate rows and ensure data uniqueness.
  • Improving Data Quality: Removing duplicates helps maintain the integrity of your dataset.