Cleaning Wrong Format
Incorrect formatting in data can lead to errors during analysis. Examples include invalid date formats, mixed data types, or inconsistent string patterns. Pandas provides tools to detect and clean data with incorrect formats. This tutorial covers how to identify and fix formatting issues.
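The other two problems mentioned above, mixed data types and inconsistent string patterns, can be handled in a similar spirit. The following is a minimal sketch, not part of the original tutorial: the Age and City columns and their values are made-up sample data, and title-casing is just one possible normalisation rule.

```python
import pandas as pd

# Hypothetical sample data: Age mixes numeric strings with text,
# City has inconsistent casing and stray whitespace
df = pd.DataFrame({
    "Age": ["25", "thirty", "41"],
    "City": [" chennai", "Chennai ", "CHENNAI"],
})

# Entries that cannot be read as numbers become NaN instead of raising an error
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# Trim surrounding whitespace and normalise casing
df["City"] = df["City"].str.strip().str.title()

print(df)
```

After this step, the non-numeric Age entry is marked as missing and all three City values share one spelling.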
Detecting Wrong Format
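You can use the pd.to_datetime() function to detect invalid date formats. This function attempts to convert a column to datetime. By default (errors="raise") it raises an error when it encounters an invalid entry; with errors="coerce" it converts invalid entries to NaT instead, so you can locate them afterwards. Here’s an example: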
import pandas as pd
# Create a sample DataFrame with date column
data = {
"Name": ["Karthick", "Durai", "Praveen"],
"Birthdate": ["1995-10-15", "invalid_date", "2000-05-25"]
}
df = pd.DataFrame(data)
# Attempt to convert Birthdate to datetime
df["Birthdate"] = pd.to_datetime(df["Birthdate"], errors="coerce")
print(df)
Output
Name | Birthdate |
---|---|
Karthick | 1995-10-15 |
Durai | NaT |
Praveen | 2000-05-25 |
Explanation: The pd.to_datetime() function attempts to convert the Birthdate column to datetime format. Because errors="coerce" is set, invalid entries such as "invalid_date" are replaced with NaT (Not a Time), which marks missing or invalid datetime values.
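To see which rows actually failed to parse, it can help to keep the original strings and compare them with the parsed result. The sketch below is one possible way to do this; the names parsed, invalid_mask, and invalid_rows are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen"],
    "Birthdate": ["1995-10-15", "invalid_date", "2000-05-25"],
})

# Parse into a separate Series so the original strings stay available
parsed = pd.to_datetime(df["Birthdate"], errors="coerce")

# A row is invalid if parsing produced NaT although the original value was present
invalid_mask = parsed.isna() & df["Birthdate"].notna()
invalid_rows = df[invalid_mask]

print(invalid_rows)
```

Checking notna() on the original column separates values that were genuinely missing from values that were present but malformed.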
Fixing Wrong Format
Once invalid entries are identified, you can replace them using fillna() or remove them using dropna(). Alternatively, you can correct specific entries. Here’s an example:
# Replace invalid dates with a default date
default_date = pd.Timestamp("2000-01-01")
df["Birthdate"] = df["Birthdate"].fillna(default_date)
print(df)
Output
Name | Birthdate |
---|---|
Karthick | 1995-10-15 |
Durai | 2000-01-01 |
Praveen | 2000-05-25 |
Explanation: The fillna() method replaces invalid dates (marked as NaT) with a default value, such as "2000-01-01". This ensures that every entry in the Birthdate column is a valid datetime value, although the filled-in dates are placeholders rather than real birthdates.
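The two alternatives mentioned earlier, removing invalid rows with dropna() and correcting a specific entry, can be sketched as follows. This is an illustration only: the corrected date for "Durai" is a made-up value, and the names dropped and corrected are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen"],
    "Birthdate": ["1995-10-15", "invalid_date", "2000-05-25"],
})
df["Birthdate"] = pd.to_datetime(df["Birthdate"], errors="coerce")

# Option 1: drop rows whose Birthdate could not be parsed
dropped = df.dropna(subset=["Birthdate"])

# Option 2: correct one specific entry
# (the date below is an invented value, used only for illustration)
corrected = df.copy()
corrected.loc[corrected["Name"] == "Durai", "Birthdate"] = pd.Timestamp("1998-03-12")

print(dropped)
print(corrected)
```

Dropping is the safer choice when no reliable replacement value exists; a targeted correction fits cases where the true value is known from another source.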
Key Takeaways
- Detecting Invalid Formats: Use pd.to_datetime() to identify incorrect datetime values.
- Fixing Invalid Entries: Replace invalid values with default or calculated values using fillna().
- Maintaining Consistency: Cleaning incorrect formats ensures a consistent and error-free dataset.