Removing Duplicates

Duplicate entries in a dataset can lead to inaccurate analysis and biased results. Pandas provides methods to identify and remove duplicate rows efficiently, ensuring your dataset is clean and unique. This tutorial demonstrates how to work with duplicates in a DataFrame.

Identifying Duplicates

You can use the duplicated() method to identify duplicate rows in your DataFrame. It returns a Boolean Series indicating, for each row, whether it repeats an earlier row. Here’s an example:

import pandas as pd

# Create a sample DataFrame with duplicate rows
data = {
    "Name": ["Karthick", "Durai", "Karthick", "Praveen"],
    "Age": [25, 30, 25, 22],
    "City": ["Chennai", "Coimbatore", "Chennai", "Madurai"]
}

df = pd.DataFrame(data)

# Identify duplicate rows
duplicates = df.duplicated()
print(duplicates)

Output

0    False
1    False
2     True
3    False
dtype: bool

Explanation: The duplicated() method flags rows that repeat an earlier row. By default (keep='first'), only later occurrences are marked, so the row at index 2 is flagged as a duplicate of the row at index 0, while the first occurrence itself is not.
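The keep and subset parameters change which rows get flagged. A quick sketch, using the same sample data as above:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Karthick", "Praveen"],
    "Age": [25, 30, 25, 22],
    "City": ["Chennai", "Coimbatore", "Chennai", "Madurai"]
})

# keep="last" marks the earlier occurrence instead of the later one
print(df.duplicated(keep="last").tolist())      # [True, False, False, False]

# keep=False marks every occurrence of a duplicated row
print(df.duplicated(keep=False).tolist())       # [True, False, True, False]

# subset= compares only the listed columns when checking for duplicates
print(df.duplicated(subset=["Name"]).tolist())  # [False, False, True, False]
```

keep=False is handy when you want to inspect all copies of the duplicated data before deciding which to drop.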

Removing Duplicates

To remove duplicate rows, use the drop_duplicates() method. It returns a new DataFrame in which every row is unique; the original DataFrame is left unchanged. Here’s an example:

# Remove duplicate rows
cleaned_df = df.drop_duplicates()
print(cleaned_df)

Output

       Name  Age        City
0  Karthick   25     Chennai
1     Durai   30  Coimbatore
3   Praveen   22     Madurai

Explanation: The drop_duplicates() method keeps the first occurrence of each row and drops later repeats, so the row at index 2 is removed. Note that the original index labels are preserved (the index jumps from 1 to 3); call reset_index(drop=True) if you want a clean sequential index afterward.
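drop_duplicates() accepts the same keep and subset parameters as duplicated(), which lets you deduplicate on selected columns or renumber the index afterward. A brief sketch, reusing the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Karthick", "Praveen"],
    "Age": [25, 30, 25, 22],
    "City": ["Chennai", "Coimbatore", "Chennai", "Madurai"]
})

# Deduplicate on the "Name" column only, keeping the last occurrence
by_name = df.drop_duplicates(subset=["Name"], keep="last")
print(by_name["Name"].tolist())  # ['Durai', 'Karthick', 'Praveen']

# Drop full-row duplicates, then renumber the surviving rows 0, 1, 2, ...
cleaned = df.drop_duplicates().reset_index(drop=True)
print(cleaned.index.tolist())    # [0, 1, 2]
```

Deduplicating on a subset is useful when rows differ in incidental columns (timestamps, IDs) but represent the same underlying record.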

Key Takeaways

  • Identifying Duplicates: Use duplicated() to locate duplicate rows in your DataFrame.
  • Removing Duplicates: Use drop_duplicates() to eliminate duplicate rows and ensure data uniqueness.
  • Improving Data Quality: Removing duplicates helps maintain the integrity of your dataset.