Merging and Joining
Combining datasets is a common task in data analysis. Pandas provides powerful functions such as merge()
, join()
, and concat()
to merge and join DataFrames based on keys or indices. This tutorial explains how to use these methods effectively.
Merging DataFrames
The merge()
method combines DataFrames based on common columns or indices. It supports different types of joins, such as inner, left, right, and outer. Here’s an example:
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({
"ID": [1, 2, 3],
"Name": ["Karthick", "Durai", "Praveen"]
})
df2 = pd.DataFrame({
"ID": [2, 3, 4],
"City": ["Chennai", "Madurai", "Trichy"]
})
# Merge DataFrames on 'ID'
merged_df = pd.merge(df1, df2, on="ID", how="inner")
print(merged_df)
Output
ID | Name | City |
---|---|---|
2 | Durai | Chennai |
3 | Praveen | Madurai |
Explanation: The merge()
method performs an inner join on the ID
column. Only rows with matching ID
values in both DataFrames are included in the result.
Types of Joins
You can specify the type of join using the how
parameter in merge()
. The options include:
- Inner Join: Includes rows with matching keys in both DataFrames (default).
- Left Join: Includes all rows from the left DataFrame and matching rows from the right.
- Right Join: Includes all rows from the right DataFrame and matching rows from the left.
- Outer Join: Includes all rows from both DataFrames, filling missing values with
NaN
.
# Perform an outer join
outer_joined_df = pd.merge(df1, df2, on="ID", how="outer")
print(outer_joined_df)
Output
ID | Name | City |
---|---|---|
1 | Karthick | NaN |
2 | Durai | Chennai |
3 | Praveen | Madurai |
4 | NaN | Trichy |
Explanation: The outer
join includes all rows from both DataFrames. Missing values are filled with NaN
.
Joining DataFrames
The join()
method is used for joining DataFrames on their index. Here’s an example:
# Create DataFrames with indices
df1 = pd.DataFrame({"Name": ["Karthick", "Durai", "Praveen"]}, index=[1, 2, 3])
df2 = pd.DataFrame({"City": ["Chennai", "Madurai", "Trichy"]}, index=[2, 3, 4])
# Join DataFrames on index
joined_df = df1.join(df2, how="outer")
print(joined_df)
Output
Index | Name | City |
---|---|---|
1 | Karthick | NaN |
2 | Durai | Chennai |
3 | Praveen | Madurai |
4 | NaN | Trichy |
Explanation: The join()
method joins DataFrames on their index. The outer
join includes all rows from both DataFrames, with missing values filled as NaN
.
Key Takeaways
- Merge: Use
merge()
to combine DataFrames based on columns or indices. - Join: Use
join()
for index-based merging of DataFrames. - Join Types: Inner, left, right, and outer joins allow flexibility in combining datasets.
- Handling Missing Values: Joins often introduce
NaN
values, which can be handled usingfillna()
.