When working with Pandas DataFrames, it's common to need to compare two datasets to identify differences, similarities, and changes. This comparison can be essential for data cleaning, validation, and analysis.
Setup
We'll start by creating two sample DataFrames, where one DataFrame has a modified value:
# Import libraries
import pandas as pd
import numpy as np
# Create sample dataframes
df1 = pd.DataFrame({
'city': ['Paris', 'Lyon', 'Marseille', 'Lille', 'Strasbourg'],
'population': [2148000, 513300, 861635, 232741, 277270]
})
df2 = df1.copy()
df2.iloc[1,1] = 0
df2
city | population | |
0 | Paris | 2148000 |
1 | Lyon | 0 |
2 | Marseille | 861635 |
3 | Lille | 232741 |
4 | Strasbourg | 277270 |
Check if DataFrames are identical
The equals()
function checks if two Pandas DataFrames have the same shape and contain the same elements, and returns a boolean value indicating whether they are identical or not.
# Check if dataframes are identical
df1.equals(df2)
False
This will return False
since there is a difference in the population value for Lyon.
Get delta rows between DataFrames
Using drop_duplicates()
A first option is to concatenate the two dataframes, and drop all duplicate rows, leaving only the non-identical rows:
# Get rows that are different between dataframes, with `drop_duplicates()`
pd.concat([df1,df2]).drop_duplicates(keep=False)
city | population | |
1 | Lyon | 513300 |
1 | Lyon | 0 |
This will return the rows with different population values for Lyon.
Using merge()
Another option is to merge the dataframes with outer join, adding an indicator of which side each rows comes from (left or right), and then drops all rows that come from both:
# Get delta rows with `merge()`
df1.merge(df2, how='outer', indicator=True).loc[lambda x: x['_merge'] != "both"]
city | population | _merge | |
1 | Lyon | 513300 | left_only |
5 | Lyon | 0 | right_only |
This will return the rows with different population values for Lyon, along with an indicator column showing whether the row is from the left or right DataFrame.
Get common rows between DataFrames
To get common rows, an easy way is to merge the dataframes with inner join:
# Get common rows between dataframes
df1.merge(df2, how='inner')
city | population | |
0 | Paris | 2148000 |
1 | Marseille | 861635 |
2 | Lille | 232741 |
3 | Strasbourg | 277270 |
This will return the rows that are identical in both DataFrames.
Comparing and analyzing differences between two Pandas DataFrames is a common task in data analysis. Whether you need to validate data consistency, identify changes over time, or analyze differences between datasets, Pandas provides flexible and efficient methods to achieve these goals.
By leveraging functions like equals()
, drop_duplicates()
, and merge()
, you can tailor the comparison to your specific needs, extracting differences, similarities, and specific changes.