Handling Categorical Data

Categorical data represents variables that take a fixed, usually small, set of values, and is often used for classification and grouping. Converting such columns to the category data type in Pandas can reduce memory usage and speed up operations such as grouping, particularly when the same values repeat many times. This tutorial demonstrates how to create, encode, and manipulate categorical data.

Creating Categorical Columns

Convert a column to the category data type using astype('category'). Here’s an example:

import pandas as pd

# Create a sample DataFrame
data = {
    "Name": ["Karthick", "Durai", "Praveen"],
    "City": ["Chennai", "Coimbatore", "Madurai"]
}

df = pd.DataFrame(data)

# Convert the City column to categorical
df["City"] = df["City"].astype("category")
print(df)
print(df.dtypes)

Output:

       Name        City
0  Karthick     Chennai
1     Durai  Coimbatore
2   Praveen     Madurai

Data types:

Name      object
City    category
dtype: object

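Explanation: The astype('category') method converts the City column to a categorical data type and enables category-specific operations through the .cat accessor. Memory savings appear when a column holds many repeated values; in a three-row DataFrame like this one, the difference is negligible.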
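To make the memory claim concrete, here is a minimal sketch that compares the same values stored as object and as category. The data (three city names repeated 100,000 times) and the variable names are assumptions made for illustration; exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Hypothetical data: three city names repeated many times
cities = ["Chennai", "Coimbatore", "Madurai"] * 100_000
as_object = pd.Series(cities, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The categorical version stores each distinct string once and keeps a small integer code per row, which is why it should come out much smaller here.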

Encoding Categorical Data

Encode categorical data into numerical labels using the cat.codes attribute. Here’s an example:

# Encode the City column
df["City_Code"] = df["City"].cat.codes
print(df)

Output:

       Name        City  City_Code
0  Karthick     Chennai          0
1     Durai  Coimbatore          1
2   Praveen     Madurai          2

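Explanation: The cat.codes attribute returns, for each row, the integer position of its value in the column's list of categories. Categories inferred by astype('category') are sorted, so Chennai, Coimbatore, and Madurai map to 0, 1, and 2; a missing value would be encoded as -1. These codes are convenient for numerical analysis or machine learning tasks, although they imply an ordering that the categories may not actually have.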
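The sketch below shows how the codes relate to the sorted categories, how a missing value is encoded, and one way to map codes back to labels. The series contents and the lookup variable are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical series with one missing value
cities = pd.Series(["Madurai", "Chennai", None, "Chennai"], dtype="category")

# Inferred categories are stored in sorted order, and codes index into them
categories = list(cities.cat.categories)
codes = cities.cat.codes.tolist()

# Build a code -> label lookup to decode the integers later
lookup = dict(enumerate(cities.cat.categories))

print(categories)
print(codes)
print(lookup)
```

Because Chennai sorts before Madurai, Chennai receives code 0 even though Madurai appears first in the data, and the missing value receives -1.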

Manipulating Categories

Add, remove, or rename categories using cat.add_categories(), cat.remove_categories(), or cat.rename_categories(). Each method returns a new Series rather than modifying the column in place (the inplace argument is not available in current pandas), so assign the result back. Here’s an example:

# Add a new category
df["City"] = df["City"].cat.add_categories(["Trichy"])

# Rename a category
df["City"] = df["City"].cat.rename_categories({"Chennai": "Chennai Metro"})

# Remove a category; rows that held it become NaN
df["City"] = df["City"].cat.remove_categories(["Coimbatore"])
print(df)

Output:

       Name           City  City_Code
0  Karthick  Chennai Metro          0
1     Durai            NaN          1
2   Praveen        Madurai          2

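Explanation: The categories of the City column are changed in three steps: a new category Trichy is added (no row uses it yet), Chennai is renamed to Chennai Metro, and Coimbatore is removed. Removing a category sets every row that held it to NaN. City_Code keeps its earlier values because it was computed before these changes.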
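To see what each step does to the category list itself, the following sketch records cat.categories after every operation. It assumes a pandas version in which these methods return a new Series; the standalone cities series and the after_* names are introduced here only for illustration.

```python
import pandas as pd

cities = pd.Series(["Chennai", "Coimbatore", "Madurai"], dtype="category")

# Each method returns a new Series, so the result is assigned back
cities = cities.cat.add_categories(["Trichy"])
after_add = list(cities.cat.categories)

cities = cities.cat.rename_categories({"Chennai": "Chennai Metro"})
after_rename = list(cities.cat.categories)

cities = cities.cat.remove_categories(["Coimbatore"])
after_remove = list(cities.cat.categories)

print(after_add)
print(after_rename)
print(after_remove)
print(cities.isna().tolist())  # True marks the row whose category was removed
```

Trichy stays in the category list even though no row uses it, while the row that held Coimbatore is now missing.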

Key Takeaways

  • Efficient Data: Use the category data type to cut memory use and speed up grouping on columns with many repeated values.
  • Encoding: Encode categories numerically using cat.codes.
  • Flexibility: Add, remove, or rename categories dynamically for flexible data management.
  • Scalability: Categorical data handling is well-suited for large datasets with repetitive values.