Handling Categorical Data
Categorical data represents variables that take a fixed, limited set of values, often used for classification and grouping. Converting such columns to the category data type in Pandas reduces memory usage and can speed up operations such as grouping, particularly when the same values repeat many times. This tutorial demonstrates how to create, encode, and manipulate categorical data.
Creating Categorical Columns
Convert a column to the category data type using astype('category'). Here’s an example:
import pandas as pd
# Create a sample DataFrame
data = {
"Name": ["Karthick", "Durai", "Praveen"],
"City": ["Chennai", "Coimbatore", "Madurai"]
}
df = pd.DataFrame(data)
# Convert the City column to categorical
df["City"] = df["City"].astype("category")
print(df)
print(df.dtypes)
Output:
Name | City |
---|---|
Karthick | Chennai |
Durai | Coimbatore |
Praveen | Madurai |
Data types:
Name: object
City: category
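Explanation: The astype('category') method converts the City column to a categorical data type. Pandas then stores each distinct city once and represents the rows as small integer codes, which saves memory when values repeat and enables category-specific operations through the .cat accessor.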
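The memory saving is easiest to see on a column with many repeated values. The following is a minimal sketch, assuming a made-up Series (cities, not part of the example above) in which three city names are repeated 10,000 times. It compares the footprint of the same data stored as plain Python strings and as a categorical. Exact byte counts depend on the Pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical data: three city names repeated many times
cities = pd.Series(["Chennai", "Coimbatore", "Madurai"] * 10_000)

# Footprint when every row holds its own Python string object
object_bytes = cities.astype("object").memory_usage(deep=True)

# Footprint when rows hold small integer codes pointing at three categories
category_bytes = cities.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

With only a handful of rows, or with a column where nearly every value is unique, the saving can shrink or disappear, so it is worth measuring on your own data.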
Encoding Categorical Data
Encode categorical data into numerical labels using the cat.codes attribute. Here’s an example:
# Encode the City column
df["City_Code"] = df["City"].cat.codes
print(df)
Output:
Name | City | City_Code |
---|---|---|
Karthick | Chennai | 0 |
Durai | Coimbatore | 1 |
Praveen | Madurai | 2 |
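Explanation: The cat.codes attribute encodes each category in the City column as a unique integer, facilitating numerical analysis or machine learning tasks. Codes follow the order of the categories, which is alphabetical here because the categories were inferred from strings. Missing values are encoded as -1.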
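Two details of cat.codes are easy to miss: codes follow the category order rather than the row order, and missing values get the code -1. The sketch below illustrates both on a small made-up Series (cities and code_to_city are illustrative names, not part of the example above), and builds a lookup to turn codes back into labels.

```python
import pandas as pd

# Hypothetical data with one missing value
cities = pd.Series(["Madurai", "Chennai", None, "Chennai"], dtype="category")

# Categories inferred from strings are sorted alphabetically
print(list(cities.cat.categories))  # ['Chennai', 'Madurai']

# Codes index into the categories; the missing value becomes -1
print(cities.cat.codes.tolist())    # [1, 0, -1, 0]

# Lookup from code back to label (-1 has no entry, since it means "missing")
code_to_city = dict(enumerate(cities.cat.categories))
print(code_to_city)                 # {0: 'Chennai', 1: 'Madurai'}
```

Because the codes are arbitrary labels, treating them as ordered numbers only makes sense when the categories themselves have a meaningful order.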
Manipulating Categories
Add, remove, or rename categories using cat.add_categories(), cat.remove_categories(), or cat.rename_categories(). Here’s an example:
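# The cat methods return a new Series, so assign the result back.
# (The inplace argument was removed in pandas 2.0.)
# Add a new category
df["City"] = df["City"].cat.add_categories(["Trichy"])
# Rename a category
df["City"] = df["City"].cat.rename_categories({"Chennai": "Chennai Metro"})
# Remove a category; rows holding it become NaN
df["City"] = df["City"].cat.remove_categories(["Coimbatore"])
print(df)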
Output:
Name | City | City_Code |
---|---|---|
Karthick | Chennai Metro | 0 |
Durai | NaN | 1 |
Praveen | Madurai | 2 |
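Explanation: Categories in the City column are dynamically manipulated. A new category Trichy is added, the category Chennai is renamed to Chennai Metro, and Coimbatore is removed. Removing a category sets every row that held it to NaN, which is why Durai's city is now missing. The City_Code column still shows the codes computed earlier; it is not updated automatically when the categories change.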
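To see what these operations do to the category list itself, it helps to inspect cat.categories and recompute cat.codes afterwards. The following sketch repeats the same three steps on a standalone Series (city is an illustrative name) so that it runs on its own.

```python
import pandas as pd

city = pd.Series(["Chennai", "Coimbatore", "Madurai"], dtype="category")

# Each cat method returns a new Series, so the result is assigned back
city = city.cat.add_categories(["Trichy"])
city = city.cat.rename_categories({"Chennai": "Chennai Metro"})
city = city.cat.remove_categories(["Coimbatore"])

# Trichy is a valid category even though no row uses it yet
print(list(city.cat.categories))  # ['Chennai Metro', 'Madurai', 'Trichy']

# Recomputed codes: the removed category's row is now missing (-1)
print(city.cat.codes.tolist())    # [0, -1, 1]
```

If a numeric code column has already been derived from the categorical, recompute it after changing the categories so the two stay in sync.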
Key Takeaways
- Efficient Data: Use category data types for memory efficiency and faster operations.
- Encoding: Encode categories numerically using cat.codes.
- Flexibility: Add, remove, or rename categories dynamically for flexible data management.
- Scalability: Categorical data handling is well-suited for large datasets with repetitive values.