Cleaning Wrong Data
Incorrect data entries, such as outliers, invalid values, or mismatched categories, can affect the accuracy of your analysis. Cleaning wrong data involves identifying and fixing these anomalies to ensure consistency and reliability. Pandas provides tools to locate and correct such data effectively.
Identifying Wrong Data
You can identify incorrect data by applying conditions on your DataFrame. For instance, numeric values outside a valid range can indicate errors. Here’s an example:
import pandas as pd
# Create a sample DataFrame with incorrect data
data = {
    "Name": ["Karthick", "Durai", "Praveen"],
    "Age": [25, -5, 30],
    "City": ["Chennai", "Coimbatore", "Madurai"]
}
df = pd.DataFrame(data)
# Identify rows with incorrect ages
wrong_ages = df[df["Age"] < 0]
print(wrong_ages)
Output
Name | Age | City |
---|---|---|
Durai | -5 | Coimbatore |
Explanation: The condition df["Age"] < 0 identifies rows where the age value is negative. These rows are flagged as containing incorrect data for further cleaning.
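The same idea extends to range checks and category checks. The following is a minimal sketch, assuming an upper age limit of 120 and a list of known cities; both rules are made up for illustration and are not part of the original example. It uses between() for the range test and isin() for the category test.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen"],
    "Age": [25, -5, 30],
    "City": ["Chennai", "Coimbatore", "Madurai"],
})

# Assumed rules, for illustration only:
# ages must fall within 0..120 and cities must come from a known list.
valid_cities = ["Chennai", "Coimbatore", "Madurai"]

age_ok = df["Age"].between(0, 120)       # inclusive on both ends by default
city_ok = df["City"].isin(valid_cities)  # True where the city is in the list

# Keep the rows that break at least one rule
suspect_rows = df[~(age_ok & city_ok)]
print(suspect_rows)
```

With this sample data every city passes the check, so only the row with the negative age is flagged. Combining the boolean masks with & and ~ keeps each rule readable and easy to adjust on its own.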
Fixing Wrong Data
Once identified, incorrect data can be replaced with valid values using loc[] or dropped entirely using drop(). Here’s an example of the replacement approach:
# Replace incorrect ages with a default value
default_age = 25
df.loc[df["Age"] < 0, "Age"] = default_age
print(df)
Output
Name | Age | City |
---|---|---|
Karthick | 25 | Chennai |
Durai | 25 | Coimbatore |
Praveen | 30 | Madurai |
Explanation: The loc[] indexer selects the rows where the age is negative and overwrites the Age column in those rows with the default age of 25. After this step, no negative ages remain in the column.
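The other two options mentioned in this section, dropping rows with drop() and replacing with a calculated value, could look like the sketch below. It rebuilds the sample DataFrame so it runs on its own. Using the median of the valid ages as the replacement is an assumption made for illustration; the right statistic depends on your data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen"],
    "Age": [25, -5, 30],
    "City": ["Chennai", "Coimbatore", "Madurai"],
})

invalid = df["Age"] < 0  # boolean mask of rows with a negative age

# Option 1: drop the offending rows by their index labels
dropped = df.drop(df[invalid].index)
print(dropped)

# Option 2: replace with a calculated value
# (assumption: the median of the valid ages is a reasonable stand-in)
median_age = df.loc[~invalid, "Age"].median()

filled = df.copy()
# Cast to float first so a non-integer median fits the column's dtype
filled["Age"] = filled["Age"].astype(float)
filled.loc[invalid, "Age"] = median_age
print(filled)
```

Both options work on a new object and leave df untouched, which makes it easy to compare the results before committing to one approach.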
Key Takeaways
- Identifying Anomalies: Use conditions to locate rows with incorrect data.
- Replacing Invalid Values: Replace incorrect entries with default or calculated values using loc[].
- Ensuring Data Consistency: Cleaning wrong data improves the quality and accuracy of your analysis.