Get differences between two pandas DataFrames

When working with Pandas DataFrames, it's common to need to compare two datasets to identify differences, similarities, and changes. This comparison can be essential for data cleaning, validation, and analysis.

Setup

We'll start by creating two sample DataFrames, where one DataFrame has a modified value:

# Import libraries
import pandas as pd
import numpy as np 

# Create sample dataframes
df1 = pd.DataFrame({
		'city': ['Paris', 'Lyon',  'Marseille', 'Lille', 'Strasbourg'],
		'population': [2148000, 513300, 861635, 232741, 277270]
})
df2 = df1.copy()
df2.iloc[1,1] = 0

df2
city
population
0
Paris
2148000
1
Lyon
0
2
Marseille
861635
3
Lille
232741
4
Strasbourg
277270

Check if DataFrames are identical

The equals() function checks if two Pandas DataFrames have the same shape and contain the same elements, and returns a boolean value indicating whether they are identical or not.

# Check if dataframes are identical
df1.equals(df2)
False

This will return False since there is a difference in the population value for Lyon.

Get delta rows between DataFrames

Using drop_duplicates()

A first option is to concatenate the two dataframes, and drop all duplicate rows, leaving only the non-identical rows:

# Get rows that are different between dataframes, with `drop_duplicates()`
pd.concat([df1,df2]).drop_duplicates(keep=False)
city
population
1
Lyon
513300
1
Lyon
0

This will return the rows with different population values for Lyon.

Using merge()

Another option is to merge the dataframes with outer join, adding an indicator of which side each rows comes from (left or right), and then drops all rows that come from both:

# Get delta rows with `merge()`
df1.merge(df2, how='outer', indicator=True).loc[lambda x: x['_merge'] != "both"]
city
population
_merge
1
Lyon
513300
left_only
5
Lyon
0
right_only

This will return the rows with different population values for Lyon, along with an indicator column showing whether the row is from the left or right DataFrame.

Get common rows between DataFrames

To get common rows, an easy way is to merge the dataframes with inner join:

# Get common rows between dataframes
df1.merge(df2, how='inner')
city
population
0
Paris
2148000
1
Marseille
861635
2
Lille
232741
3
Strasbourg
277270

This will return the rows that are identical in both DataFrames.

Comparing and analyzing differences between two Pandas DataFrames is a common task in data analysis. Whether you need to validate data consistency, identify changes over time, or analyze differences between datasets, Pandas provides flexible and efficient methods to achieve these goals.

By leveraging functions like equals(), drop_duplicates(), and merge(), you can tailor the comparison to your specific needs, extracting differences, similarities, and specific changes.