Merging and Joining

Combining datasets is a common task in data analysis. Pandas provides powerful functions such as merge(), join(), and concat() to merge and join DataFrames based on keys or indices. This tutorial explains how to use these methods effectively.

Merging DataFrames

The merge() method combines DataFrames based on common columns or indices. It supports different types of joins, such as inner, left, right, and outer. Here’s an example:

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Karthick", "Durai", "Praveen"]
})

df2 = pd.DataFrame({
    "ID": [2, 3, 4],
    "City": ["Chennai", "Madurai", "Trichy"]
})

# Merge DataFrames on 'ID'
merged_df = pd.merge(df1, df2, on="ID", how="inner")
print(merged_df)

Output

ID Name City
2 Durai Chennai
3 Praveen Madurai

Explanation: The merge() method performs an inner join on the ID column. Only rows with matching ID values in both DataFrames are included in the result.

Types of Joins

You can specify the type of join using the how parameter in merge(). The options include:

  • Inner Join: Includes rows with matching keys in both DataFrames (default).
  • Left Join: Includes all rows from the left DataFrame and matching rows from the right.
  • Right Join: Includes all rows from the right DataFrame and matching rows from the left.
  • Outer Join: Includes all rows from both DataFrames, filling missing values with NaN.
# Perform an outer join
outer_joined_df = pd.merge(df1, df2, on="ID", how="outer")
print(outer_joined_df)

Output

ID Name City
1 Karthick NaN
2 Durai Chennai
3 Praveen Madurai
4 NaN Trichy

Explanation: The outer join includes all rows from both DataFrames. Missing values are filled with NaN.

Joining DataFrames

The join() method is used for joining DataFrames on their index. Here’s an example:

# Create DataFrames with indices
df1 = pd.DataFrame({"Name": ["Karthick", "Durai", "Praveen"]}, index=[1, 2, 3])
df2 = pd.DataFrame({"City": ["Chennai", "Madurai", "Trichy"]}, index=[2, 3, 4])

# Join DataFrames on index
joined_df = df1.join(df2, how="outer")
print(joined_df)

Output

Index Name City
1 Karthick NaN
2 Durai Chennai
3 Praveen Madurai
4 NaN Trichy

Explanation: The join() method joins DataFrames on their index. The outer join includes all rows from both DataFrames, with missing values filled as NaN.

Key Takeaways

  • Merge: Use merge() to combine DataFrames based on columns or indices.
  • Join: Use join() for index-based merging of DataFrames.
  • Join Types: Inner, left, right, and outer joins allow flexibility in combining datasets.
  • Handling Missing Values: Joins often introduce NaN values, which can be handled using fillna().