Apply computation within groups in a pandas DataFrame
One useful capability of pandas is to execute computations within groups of a DataFrame. This is achieved using groupby()
followed by transform()
.
# Import libraries
import pandas as pd
import seaborn as sns
# Load sample data in a DataFrame
df = (
sns.load_dataset('iris')
.sample(n=12, random_state=24)
.sort_values('species')
.reset_index(drop=True)
[['species', 'sepal_width']]
)
df
species | sepal_width | |
---|---|---|
0 | setosa | 3.4 |
1 | setosa | 3.7 |
2 | setosa | 3.2 |
3 | setosa | 3.1 |
4 | setosa | 3.8 |
5 | versicolor | 2.4 |
6 | versicolor | 2.7 |
7 | versicolor | 2.5 |
8 | virginica | 2.9 |
9 | virginica | 2.8 |
10 | virginica | 3.0 |
11 | virginica | 3.4 |
Compute mean at group-level
To get group-level statistics like mean, sum or count, use transform('function')
.
# Get group mean
df.assign(mean=
df.groupby('species').transform('mean')
)
species | sepal_width | mean | |
---|---|---|---|
0 | setosa | 3.4 | 3.440000 |
1 | setosa | 3.7 | 3.440000 |
2 | setosa | 3.2 | 3.440000 |
3 | setosa | 3.1 | 3.440000 |
4 | setosa | 3.8 | 3.440000 |
5 | versicolor | 2.4 | 2.533333 |
6 | versicolor | 2.7 | 2.533333 |
7 | versicolor | 2.5 | 2.533333 |
8 | virginica | 2.9 | 3.025000 |
9 | virginica | 2.8 | 3.025000 |
10 | virginica | 3.0 | 3.025000 |
11 | virginica | 3.4 | 3.025000 |
Standardize values
You can center values inside a group by substracting the group mean to each row.
# Standardize values
df.assign(standardized=
df.groupby('species').transform(lambda x: x - x.mean())
)
species | sepal_width | standardized | |
---|---|---|---|
0 | setosa | 3.4 | -0.040000 |
1 | setosa | 3.7 | 0.260000 |
2 | setosa | 3.2 | -0.240000 |
3 | setosa | 3.1 | -0.340000 |
4 | setosa | 3.8 | 0.360000 |
5 | versicolor | 2.4 | -0.133333 |
6 | versicolor | 2.7 | 0.166667 |
7 | versicolor | 2.5 | -0.033333 |
8 | virginica | 2.9 | -0.125000 |
9 | virginica | 2.8 | -0.225000 |
10 | virginica | 3.0 | -0.025000 |
11 | virginica | 3.4 | 0.375000 |
Rank values inside groups
Besides computing group-wise values, you can also rank values within each group.
# Rank values
df.assign(rank=
df.groupby('species').transform(lambda x: x.rank())
)
species | sepal_width | rank | |
---|---|---|---|
0 | setosa | 3.4 | 3.0 |
1 | setosa | 3.7 | 4.0 |
2 | setosa | 3.2 | 2.0 |
3 | setosa | 3.1 | 1.0 |
4 | setosa | 3.8 | 5.0 |
5 | versicolor | 2.4 | 1.0 |
6 | versicolor | 2.7 | 3.0 |
7 | versicolor | 2.5 | 2.0 |
8 | virginica | 2.9 | 2.0 |
9 | virginica | 2.8 | 1.0 |
10 | virginica | 3.0 | 3.0 |
11 | virginica | 3.4 | 4.0 |
Error when output has multiple columns
If the result of the transform generates more than 1 column, and you try to assign it to a column of an existing DataFrame, you will encounter an error like ValueError: Wrong number of items passed X, placement implies 1
. To avoid this, pass only one column in the result:
df.assign(rank=
df.groupby('species').transform(lambda x: x.rank()).iloc[:, 0]
)
species | sepal_width | rank | |
---|---|---|---|
0 | setosa | 3.4 | 3.0 |
1 | setosa | 3.7 | 4.0 |
2 | setosa | 3.2 | 2.0 |
3 | setosa | 3.1 | 1.0 |
4 | setosa | 3.8 | 5.0 |
5 | versicolor | 2.4 | 1.0 |
6 | versicolor | 2.7 | 3.0 |
7 | versicolor | 2.5 | 2.0 |
8 | virginica | 2.9 | 2.0 |
9 | virginica | 2.8 | 1.0 |
10 | virginica | 3.0 | 3.0 |
11 | virginica | 3.4 | 4.0 |