Cleaning Wrong Data

Incorrect data entries, such as outliers, invalid values, or mismatched categories, can distort the results of your analysis. Cleaning wrong data means identifying these anomalies and then correcting or removing them so the dataset is consistent and reliable. Pandas lets you locate such entries with boolean conditions and fix them in place.

Identifying Wrong Data

You can identify incorrect data by applying conditions on your DataFrame. For instance, numeric values outside a valid range can indicate errors. Here’s an example:

import pandas as pd

# Create a sample DataFrame with incorrect data
data = {
    "Name": ["Karthick", "Durai", "Praveen"],
    "Age": [25, -5, 30],
    "City": ["Chennai", "Coimbatore", "Madurai"]
}

df = pd.DataFrame(data)

# Identify rows with incorrect ages
wrong_ages = df[df["Age"] < 0]
print(wrong_ages)

Output

    Name  Age        City
1  Durai   -5  Coimbatore

Explanation: The condition df["Age"] < 0 identifies rows where the age value is negative. These rows are flagged as containing incorrect data for further cleaning.
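
The same approach extends to the other kinds of wrong data mentioned earlier, such as values above a plausible maximum or categories that do not match an expected list. The sketch below is one possible way to do this using between() and isin(). The fourth row ("Anitha", with the misspelled city "Chenai"), the 0 to 120 age range, and the list of valid cities are all assumptions made up for illustration.

```python
import pandas as pd

# Sample data with hypothetical problems: out-of-range ages and a misspelled city
df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen", "Anitha"],
    "Age": [25, -5, 30, 150],
    "City": ["Chennai", "Coimbatore", "Madurai", "Chenai"],
})

# Assumed valid range and category list; adjust these to your own data
MIN_AGE, MAX_AGE = 0, 120
VALID_CITIES = ["Chennai", "Coimbatore", "Madurai"]

# between() includes both endpoints by default, so ~ selects ages outside the range
bad_age = ~df["Age"].between(MIN_AGE, MAX_AGE)

# isin() is True for known categories, so ~ selects unexpected values
bad_city = ~df["City"].isin(VALID_CITIES)

# Combine both checks with | (element-wise OR) to flag any row that fails either one
flagged = df[bad_age | bad_city]
print(flagged)
```

Keeping each check in its own named mask makes it easy to see which rule a row failed before deciding how to fix it.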

Fixing Wrong Data

Once identified, incorrect data can be replaced with valid values using loc[] or dropped entirely using drop(). The example below continues with the same DataFrame and replaces the negative age with a default value:

# Replace incorrect ages with a default value
default_age = 25
df.loc[df["Age"] < 0, "Age"] = default_age
print(df)

Output

       Name  Age        City
0  Karthick   25     Chennai
1     Durai   25  Coimbatore
2   Praveen   30     Madurai

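Explanation: The loc[] indexer selects the rows where the age is negative and assigns the default age of 25 to their Age column, modifying df in place. This removes the negative values, but it only covers the condition you tested for; values that are wrong in other ways, such as an implausibly high age, need their own check.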
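
When no sensible replacement exists, the other option mentioned above is to remove the offending rows with drop(). The following is a minimal sketch that rebuilds the original sample DataFrame so it can run on its own; the variable names bad_rows and cleaned are illustrative.

```python
import pandas as pd

# Recreate the sample DataFrame with one incorrect age
df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen"],
    "Age": [25, -5, 30],
    "City": ["Chennai", "Coimbatore", "Madurai"],
})

# Collect the index labels of the rows with negative ages
bad_rows = df[df["Age"] < 0].index

# drop() returns a new DataFrame; the original df is left unchanged
cleaned = df.drop(index=bad_rows)

# Optionally renumber the remaining rows so the index is contiguous again
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Dropping is simple but discards the whole row, including its valid Name and City values, so it suits cases where the row cannot be trusted at all.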

Key Takeaways

  • Identifying Anomalies: Use conditions to locate rows with incorrect data.
  • Replacing Invalid Values: Replace incorrect entries with default or calculated values using loc[].
  • Ensuring Data Consistency: Cleaning wrong data improves the quality and accuracy of your analysis.
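
As an illustration of the "calculated values" option in the takeaways, the sketch below replaces the negative age with the median of the valid ages instead of a fixed default. Using the median, and rounding it to a whole number, are assumptions chosen for this example; a mean or a group-specific value could be used the same way.

```python
import pandas as pd

# Recreate the sample DataFrame with one incorrect age
df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen"],
    "Age": [25, -5, 30],
    "City": ["Chennai", "Coimbatore", "Madurai"],
})

# Boolean mask marking the rows with invalid (negative) ages
invalid = df["Age"] < 0

# Compute the median from the valid ages only, so the bad value does not skew it
median_age = df.loc[~invalid, "Age"].median()

# Round and cast to int so the replacement fits the integer Age column
df.loc[invalid, "Age"] = int(round(median_age))
print(df)
```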