Cleaning Wrong Format

Incorrect formatting in data can lead to errors during analysis. Examples include invalid date formats, mixed data types, or inconsistent string patterns. Pandas provides tools to detect and clean data with incorrect formats. This tutorial covers how to identify and fix formatting issues.
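The rest of this tutorial focuses on dates, but the other two problems mentioned above (mixed data types and inconsistent string patterns) can be handled in a similar way. The following is a minimal sketch, assuming a hypothetical DataFrame with an Age column that mixes numbers and text, and a City column with inconsistent spacing and capitalization. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data (not from the tutorial's dataset)
df = pd.DataFrame({
    "Age": ["25", "thirty", 41],
    "City": [" Chennai", "chennai ", "CHENNAI"],
})

# Entries that cannot be read as numbers become NaN instead of raising an error
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# Trim surrounding whitespace and normalize capitalization
df["City"] = df["City"].str.strip().str.title()

print(df)
```

After this step the Age column is numeric, with NaN marking the entry that could not be converted, and every City value is spelled the same way.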

Detecting Wrong Format

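You can use the pd.to_datetime() function to detect invalid date formats. This function attempts to convert a column into a proper datetime format. By default it raises an error when it encounters an entry it cannot parse; passing errors="coerce" converts such entries to NaT instead, which makes them easy to find. Here’s an example: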

import pandas as pd

# Create a sample DataFrame with date column
data = {
    "Name": ["Karthick", "Durai", "Praveen"],
    "Birthdate": ["1995-10-15", "invalid_date", "2000-05-25"]
}

df = pd.DataFrame(data)

# Attempt to convert Birthdate to datetime
df["Birthdate"] = pd.to_datetime(df["Birthdate"], errors="coerce")
print(df)

Output

       Name  Birthdate
0  Karthick 1995-10-15
1     Durai        NaT
2   Praveen 2000-05-25

Explanation: The pd.to_datetime() function attempts to convert the Birthdate column to datetime format. Because errors="coerce" is set, invalid entries such as "invalid_date" are replaced with NaT (Not a Time), which marks missing or invalid datetime values.
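To see which rows failed to convert, one option is to compare the parsed result with the original strings. This is a minimal sketch, assuming the same sample data as above; the names parsed, invalid_mask, and invalid_rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen"],
    "Birthdate": ["1995-10-15", "invalid_date", "2000-05-25"],
})

# Parse into a separate Series so the original strings stay available
parsed = pd.to_datetime(df["Birthdate"], errors="coerce")

# A row is invalid if it had a value originally but could not be parsed
invalid_mask = parsed.isna() & df["Birthdate"].notna()

# Show the offending rows with their original text
invalid_rows = df[invalid_mask]
print(invalid_rows)
```

Keeping the parsed values in a separate Series means the original text is still there to inspect before you decide how to fix it.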

Fixing Wrong Format

Once invalid entries are identified, you can replace or remove them using fillna() or dropna(). Alternatively, you can correct specific entries. Here’s an example:

# Replace invalid dates with a default date
default_date = pd.Timestamp("2000-01-01")
df["Birthdate"] = df["Birthdate"].fillna(default_date)
print(df)

Output

       Name  Birthdate
0  Karthick 1995-10-15
1     Durai 2000-01-01
2   Praveen 2000-05-25

Explanation: The fillna() function replaces invalid dates (marked as NaT) with a default value, here 2000-01-01. Every entry in the Birthdate column is then a valid datetime value, although the default is only a placeholder and not the person's actual birthdate.
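The other two options mentioned above, removing invalid rows and correcting a specific entry, could look like the following. This is a sketch under assumptions: the corrected date for "Durai" is invented for illustration, and the names dropped and corrected are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen"],
    "Birthdate": pd.to_datetime(
        ["1995-10-15", "invalid_date", "2000-05-25"], errors="coerce"
    ),
})

# Option 1: drop rows whose Birthdate could not be parsed
dropped = df.dropna(subset=["Birthdate"])

# Option 2: correct one specific entry (the date is a made-up correction)
corrected = df.copy()
corrected.loc[corrected["Name"] == "Durai", "Birthdate"] = pd.Timestamp("1998-03-12")

print(dropped)
print(corrected)
```

Dropping rows is the safer choice when no reliable replacement value exists. Correcting an entry makes sense only when the true value is known from another source.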

Key Takeaways

  • Detecting Invalid Formats: Use pd.to_datetime() with errors="coerce" to mark incorrect datetime values as NaT.
  • Fixing Invalid Entries: Replace invalid values with default or calculated values using fillna(), or remove them with dropna().
  • Maintaining Consistency: Cleaning incorrect formats keeps the dataset consistent and helps prevent errors during analysis.