Cleaning Data
Data cleaning is an essential step in any data analysis process. It involves handling missing values, correcting errors, and preparing the dataset for analysis. Pandas provides powerful functions to clean and preprocess data efficiently. This tutorial covers some of the most common data cleaning techniques.
Handling Missing Values
Missing data is a common issue in datasets. You can use Pandas methods such as dropna() to remove rows or columns with missing values, or fillna() to replace them with a specific value. Here’s an example:
import pandas as pd
# Create a sample DataFrame with missing values
data = {
    "Name": ["Karthick", "Durai", "Praveen", None],
    "Age": [25, 30, None, 22],
    "City": ["Chennai", None, "Coimbatore", "Madurai"]
}
df = pd.DataFrame(data)
# Remove rows with missing values
cleaned_df = df.dropna()
print(cleaned_df)
Output
Name | Age | City |
---|---|---|
Karthick | 25.0 | Chennai |
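Explanation: By default, the dropna() method removes every row that contains at least one missing value. In this example, only the first row has complete data, so it is the only row left in the cleaned DataFrame. Age is shown as 25.0 because the missing entry makes Pandas store the column as floating-point numbers.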
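Dropping every incomplete row can discard more data than intended. The sketch below, which reuses the sample data from the example above, shows a few other ways to call dropna(). The variable names are illustrative only, and which option fits depends on which columns your analysis actually needs.

```python
import pandas as pd

# Same sample data as the example above
df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen", None],
    "Age": [25, 30, None, 22],
    "City": ["Chennai", None, "Coimbatore", "Madurai"],
})

# Count missing values per column before deciding what to drop
missing_counts = df.isna().sum()

# Drop rows only when "Age" is missing; gaps in other columns are tolerated
rows_with_age = df.dropna(subset=["Age"])

# Drop columns (axis=1) that contain any missing value
complete_columns = df.dropna(axis=1)

print(missing_counts)
print(rows_with_age)
print(complete_columns)
```

Every column in this sample has one missing value, so dropping by column leaves an empty DataFrame. Checking the counts first helps avoid that kind of surprise.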
Replacing Values
Instead of removing rows or columns with missing data, you can replace the missing values with a specific value using fillna(). This is useful for filling missing numeric values with the mean or replacing missing text values with a placeholder. Here’s an example:
# Replace missing values in the "Age" column with the mean
mean_age = df["Age"].mean()
df["Age"] = df["Age"].fillna(mean_age)
# Replace missing values in "City" with "Unknown"
df["City"] = df["City"].fillna("Unknown")
print(df)
Output
Name | Age | City |
---|---|---|
Karthick | 25.0 | Chennai |
Durai | 30.0 | Unknown |
Praveen | 25.666666666666668 | Coimbatore |
None | 22.0 | Madurai |
Explanation: In this example, the fillna() method replaces missing numeric values in the Age column with the mean, and missing values in the City column are replaced with "Unknown". This approach retains all rows while handling missing data effectively.
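Note that fillna() only touches true missing values (None or NaN) and leaves empty strings alone. The following is a minimal sketch, using hypothetical data that is not part of the example above, of converting empty strings to missing values first and then filling several columns in one call by passing a dictionary.

```python
import pandas as pd

# Hypothetical data: "City" holds both an empty string and a true missing value
df = pd.DataFrame({
    "Name": ["Karthick", "Durai", "Praveen"],
    "Age": [25, None, 22],
    "City": ["Chennai", "", None],
})

# fillna() ignores empty strings, so convert them to missing values first
df["City"] = df["City"].replace("", pd.NA)

# Fill each column with its own replacement value in a single call
filled = df.fillna({"Age": df["Age"].mean(), "City": "Unknown"})

print(filled)
```

Passing a dictionary keeps the per-column rules in one place, which can be easier to review than a series of separate assignments.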
Key Takeaways
- Handling Missing Values: Use dropna() to remove rows or columns with missing data or fillna() to replace them with specific values.
- Custom Replacements: Replace missing numeric values with statistical metrics like the mean, median, or mode.
- Data Cleaning: Cleaning data helps ensure that your dataset is complete and ready for analysis.
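The examples in this tutorial fill with the mean. As a rough sketch of the median and mode alternatives mentioned in the takeaways, the snippet below uses hypothetical data with an outlier in Age and a repeated value in City. Whether the median or the mode is appropriate depends on the data, so treat this as one possible approach.

```python
import pandas as pd

# Hypothetical data with an outlier in "Age" and a repeated "City"
df = pd.DataFrame({
    "Age": [25, 30, None, 22, 95],
    "City": ["Chennai", "Chennai", None, "Madurai", "Chennai"],
})

# The median is less sensitive to outliers than the mean
median_age = df["Age"].median()

# mode() returns a Series because ties are possible; take the first entry
most_common_city = df["City"].mode().iloc[0]

df["Age"] = df["Age"].fillna(median_age)
df["City"] = df["City"].fillna(most_common_city)

print(df)
```

Here the mean age would be pulled up to 43 by the single value 95, while the median stays at 27.5, which is closer to the typical row.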