Removing Duplicates
Duplicate entries in a dataset can lead to inaccurate analysis and biased results. Pandas provides methods to identify and remove duplicate rows efficiently, ensuring your dataset is clean and unique. This tutorial demonstrates how to work with duplicates in a DataFrame.
Identifying Duplicates
You can use the duplicated() function to identify duplicate rows in your DataFrame. It returns a Boolean Series indicating, for each row, whether an identical row has already appeared earlier. Here’s an example:
import pandas as pd
# Create a sample DataFrame with duplicate rows
data = {
"Name": ["Karthick", "Durai", "Karthick", "Praveen"],
"Age": [25, 30, 25, 22],
"City": ["Chennai", "Coimbatore", "Chennai", "Madurai"]
}
df = pd.DataFrame(data)
# Identify duplicate rows
duplicates = df.duplicated()
print(duplicates)
Output
0    False
1    False
2     True
3    False
dtype: bool
Explanation: The duplicated() function flags a row as a duplicate when an identical row has already appeared. In this example, the third row (index 2) is flagged because it repeats the data from the first row; by default the first occurrence is kept unflagged.
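The default behavior can be adjusted with the `keep` and `subset` parameters of duplicated(). A short sketch using the same sample data (the expected values in the comments follow from the four rows above):

```python
import pandas as pd

data = {
    "Name": ["Karthick", "Durai", "Karthick", "Praveen"],
    "Age": [25, 30, 25, 22],
    "City": ["Chennai", "Coimbatore", "Chennai", "Madurai"],
}
df = pd.DataFrame(data)

# Default keep="first": the first occurrence is not flagged
n_dupes = df.duplicated().sum()
print(n_dupes)  # 1

# keep="last" flags earlier occurrences instead of later ones
flags_last = df.duplicated(keep="last")
print(flags_last.tolist())  # [True, False, False, False]

# Restrict the comparison to a subset of columns
flags_name = df.duplicated(subset=["Name"])
print(flags_name.tolist())  # [False, False, True, False]
```

Using `subset` is handy when rows count as duplicates based on a key column (such as Name) even if other columns differ.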
Removing Duplicates
To remove duplicate rows, use the drop_duplicates() function. It returns a new DataFrame in which each row is unique, leaving the original DataFrame unchanged. Here’s an example:
# Remove duplicate rows
cleaned_df = df.drop_duplicates()
print(cleaned_df)
Output
       Name  Age        City
0  Karthick   25     Chennai
1     Durai   30  Coimbatore
3   Praveen   22     Madurai
Explanation: The drop_duplicates() function removes the rows that duplicated() flags, keeping the first occurrence of each. The resulting DataFrame contains only unique rows; note that the surviving rows keep their original index labels (row 3 retains the label 3).
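Like duplicated(), drop_duplicates() accepts `keep` and `subset`, and it also accepts `ignore_index` (available in recent pandas versions) to renumber the index after dropping rows. A sketch on the same sample data:

```python
import pandas as pd

data = {
    "Name": ["Karthick", "Durai", "Karthick", "Praveen"],
    "Age": [25, 30, 25, 22],
    "City": ["Chennai", "Coimbatore", "Chennai", "Madurai"],
}
df = pd.DataFrame(data)

# Keep the last occurrence of each duplicate group instead of the first
last_kept = df.drop_duplicates(keep="last")
print(last_kept["Name"].tolist())  # ['Durai', 'Karthick', 'Praveen']

# Deduplicate by a key column only, and renumber the index from 0
by_name = df.drop_duplicates(subset=["Name"], ignore_index=True)
print(by_name)
```

Renumbering with `ignore_index=True` (or a follow-up reset_index(drop=True)) avoids gaps in the index like the missing label 2 in the output above.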
Key Takeaways
- Identifying Duplicates: Use duplicated() to locate duplicate rows in your DataFrame.
- Removing Duplicates: Use drop_duplicates() to eliminate duplicate rows and ensure data uniqueness.
- Improving Data Quality: Removing duplicates helps maintain the integrity of your dataset.