Cross-join two pandas DataFrames

Joining DataFrames is a common operation in data manipulation and analysis. One specific type of join is the cross-join (also called a Cartesian product), which pairs every row of one DataFrame with every row of the other. This is particularly useful for hierarchical data, such as groups and subgroups, when you want to enumerate every possible combination. Let’s see how to perform a cross-join using pandas.

Consider two DataFrames, one representing groups and the other representing subgroups:

# Import libraries
import pandas as pd
from IPython.display import display

# Create DataFrames
df_groups = pd.DataFrame({'group': ['A', 'B', 'C']})
display(df_groups)

df_subgroups = pd.DataFrame({'subgroup': list(range(5))})
display(df_subgroups)
  group
0     A
1     B
2     C

   subgroup
0         0
1         1
2         2
3         3
4         4

Here, df_groups contains three groups 'A', 'B', and 'C', and df_subgroups contains five subgroups ranging from 0 to 4.

Cross-Join with merge()

A cross-join can be performed with the merge() method in pandas by passing how='cross' (available in pandas 1.2 and later). This creates a new DataFrame containing every combination of groups and subgroups:

# Combine DataFrames with a cross-join
df_groups.merge(df_subgroups, how='cross')

The resulting DataFrame will have 15 rows, representing all combinations of the 3 groups and 5 subgroups:

   group  subgroup
0      A         0
1      A         1
2      A         2
3      A         3
4      A         4
5      B         0
6      B         1
7      B         2
8      B         3
9      B         4
10     C         0
11     C         1
12     C         2
13     C         3
14     C         4
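
As a quick sanity check, the row count of a cross-join should equal the product of the two input lengths (3 × 5 = 15 here). The sketch below verifies that, and also shows one common workaround for pandas versions older than 1.2, where how='cross' is not available: merging on a temporary constant key. The column name _key is a hypothetical placeholder chosen for this illustration, and the approach assumes neither DataFrame already has a column with that name.

```python
import pandas as pd

# Same example DataFrames as above
df_groups = pd.DataFrame({'group': ['A', 'B', 'C']})
df_subgroups = pd.DataFrame({'subgroup': list(range(5))})

# Cross-join with how='cross' (requires pandas 1.2 or later)
cross = df_groups.merge(df_subgroups, how='cross')

# The number of rows is the product of the input lengths
assert len(cross) == len(df_groups) * len(df_subgroups)

# Possible fallback for older pandas: give every row the same temporary key,
# merge on it, then drop it. '_key' is a hypothetical name for illustration.
fallback = (
    df_groups.assign(_key=1)
    .merge(df_subgroups.assign(_key=1), on='_key')
    .drop(columns='_key')
)

print(cross.shape)
print(fallback.shape)
```

Keep in mind that the result grows multiplicatively, so cross-joining two large DataFrames can use a lot of memory.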