
Cleaning Wrong Data

Incorrect data entries, such as outliers, invalid values, or mismatched categories, can affect the accuracy of your analysis. Cleaning wrong data involves identifying and fixing these anomalies to ensure consistency and reliability. Pandas provides tools to locate and correct such data effectively.

Identifying Wrong Data

You can identify incorrect data by applying conditions on your DataFrame. For instance, numeric values outside a valid range can indicate errors. Here’s an example:

import pandas as pd

# Create a sample DataFrame with incorrect data
data = {
    "Name": ["Karthick", "Durai", "Praveen"],
    "Age": [25, -5, 30],
    "City": ["Chennai", "Coimbatore", "Madurai"]
}

df = pd.DataFrame(data)

# Identify rows with incorrect ages
wrong_ages = df[df["Age"] < 0]
print(wrong_ages)

Output

    Name  Age        City
1  Durai   -5  Coimbatore

Explanation: The condition df["Age"] < 0 produces a Boolean mask that is True wherever the age is negative. Indexing df with that mask returns only the matching rows, which are flagged as containing incorrect data for further cleaning.
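The same idea extends to the other kinds of wrong data mentioned earlier, such as out-of-range values and mismatched categories. The sketch below is only an illustration: the extra row ("Anitha"), the 0 to 120 age range, and the list of valid cities are assumptions made for this example, not rules from your own dataset.

```python
import pandas as pd

# Hypothetical sample data: two impossible ages and one misspelled city
df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen", "Anitha"],
    "Age": [25, -5, 30, 150],
    "City": ["Chennai", "Coimbatore", "Madurai", "Chenai"],
})

# Assumed rules for this sketch: ages must fall between 0 and 120 (inclusive),
# and cities must come from a known list
valid_cities = ["Chennai", "Coimbatore", "Madurai"]

bad_age = ~df["Age"].between(0, 120)        # True where the age is out of range
bad_city = ~df["City"].isin(valid_cities)   # True where the city is not recognised

# Rows that break at least one rule
wrong_rows = df[bad_age | bad_city]
print(wrong_rows)
```

Combining the masks with | keeps any row that fails at least one check, so a single pass can surface several kinds of problems at once.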

Fixing Wrong Data

Once identified, incorrect data can be replaced with valid values using loc[] or dropped entirely using drop(). Here’s an example of the replacement approach, continuing with the df created above:

# Replace incorrect ages with a default value
default_age = 25
df.loc[df["Age"] < 0, "Age"] = default_age
print(df)

Output

       Name  Age        City
0  Karthick   25     Chennai
1     Durai   25  Coimbatore
2   Praveen   30     Madurai

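Explanation: The loc[] indexer selects the rows where the age is negative and assigns the default age of 25 to their Age column. After this step no negative ages remain. Note that this check only catches negative values, so other invalid ages (for example, unrealistically large ones) would need their own condition.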
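The two alternatives mentioned above, dropping the rows with drop() and replacing them with a calculated value instead of a fixed default, can be sketched as follows. This is a minimal illustration on the same sample data; using the median of the valid ages is an assumption made for the example, and the names invalid, dropped, and filled are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen"],
    "Age": [25, -5, 30],
    "City": ["Chennai", "Coimbatore", "Madurai"],
})

# Boolean mask marking the rows with an invalid (negative) age
invalid = df["Age"] < 0

# Option 1: drop the offending rows by index label.
# drop() returns a new DataFrame and leaves df unchanged.
dropped = df.drop(df[invalid].index)
print(dropped)

# Option 2: replace invalid ages with a calculated value,
# here the median of the valid ages (an assumption for this sketch).
median_age = df.loc[~invalid, "Age"].median()

filled = df.copy()
# mask() swaps in median_age wherever invalid is True.
# The column becomes float because the median is not a whole number.
filled["Age"] = filled["Age"].mask(invalid, median_age)
print(filled)
```

Dropping is the simpler choice when only a few rows are affected. Replacing keeps the rest of each row, at the cost of introducing an estimated value.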

Key Takeaways

Identifying Anomalies: Use conditions to locate rows with incorrect data.
Replacing Invalid Values: Replace incorrect entries with default or calculated values using loc[].
Ensuring Data Consistency: Cleaning wrong data improves the quality and accuracy of your analysis.

