One useful capability of pandas is computing values within groups of a DataFrame while keeping one result per original row. This is done by calling groupby() followed by transform().
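The original data-loading step is not shown, so the following is a minimal sketch that builds the sample table by hand. The variable name `df` and the literal construction are assumptions; the values are copied from the table below.

```python
import pandas as pd

# Sample data: 12 sepal-width measurements across 3 species
# (values copied from the table in this article).
df = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 3 + ["virginica"] * 4,
    "sepal_width": [3.4, 3.7, 3.2, 3.1, 3.8,
                    2.4, 2.7, 2.5,
                    2.9, 2.8, 3.0, 3.4],
})

print(df)
```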
| species | sepal_width |
0 | setosa | 3.4 |
1 | setosa | 3.7 |
2 | setosa | 3.2 |
3 | setosa | 3.1 |
4 | setosa | 3.8 |
5 | versicolor | 2.4 |
6 | versicolor | 2.7 |
7 | versicolor | 2.5 |
8 | virginica | 2.9 |
9 | virginica | 2.8 |
10 | virginica | 3.0 |
11 | virginica | 3.4 |
Compute mean at group-level
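To attach a group-level statistic such as the mean, sum, or count to every row, pass the function name as a string to transform(), for example transform('mean'). Each row receives the statistic of the group it belongs to.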
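A minimal sketch of the group-level mean, assuming the sample data is held in a DataFrame named `df` (a name chosen here for illustration):

```python
import pandas as pd

# Rebuild the sample data so this snippet runs on its own.
df = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 3 + ["virginica"] * 4,
    "sepal_width": [3.4, 3.7, 3.2, 3.1, 3.8,
                    2.4, 2.7, 2.5,
                    2.9, 2.8, 3.0, 3.4],
})

# transform("mean") returns a Series aligned with df's index:
# every row gets the mean of its own species.
df["mean"] = df.groupby("species")["sepal_width"].transform("mean")

print(df)
```

Unlike `agg("mean")`, which returns one row per group, `transform` keeps the original number of rows, so the result can be assigned straight back as a new column.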
| species | sepal_width | mean |
0 | setosa | 3.4 | 3.440000 |
1 | setosa | 3.7 | 3.440000 |
2 | setosa | 3.2 | 3.440000 |
3 | setosa | 3.1 | 3.440000 |
4 | setosa | 3.8 | 3.440000 |
5 | versicolor | 2.4 | 2.533333 |
6 | versicolor | 2.7 | 2.533333 |
7 | versicolor | 2.5 | 2.533333 |
8 | virginica | 2.9 | 3.025000 |
9 | virginica | 2.8 | 3.025000 |
10 | virginica | 3.0 | 3.025000 |
11 | virginica | 3.4 | 3.025000 |
Standardize values
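You can center values within each group by subtracting the group mean from each row. Note that this only centers the data; a full standardization would also divide by the group standard deviation.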
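One way to sketch this, again assuming a DataFrame named `df` (an illustrative name), is to subtract the transformed group mean from the original column:

```python
import pandas as pd

# Rebuild the sample data so this snippet runs on its own.
df = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 3 + ["virginica"] * 4,
    "sepal_width": [3.4, 3.7, 3.2, 3.1, 3.8,
                    2.4, 2.7, 2.5,
                    2.9, 2.8, 3.0, 3.4],
})

# Group mean broadcast to every row, then subtracted from the raw value.
group_mean = df.groupby("species")["sepal_width"].transform("mean")
df["standardized"] = df["sepal_width"] - group_mean

print(df)
```

The same result could be obtained with `transform(lambda x: x - x.mean())`; the string form plus a subtraction is usually faster because it avoids calling a Python function once per group.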
| species | sepal_width | standardized |
0 | setosa | 3.4 | -0.040000 |
1 | setosa | 3.7 | 0.260000 |
2 | setosa | 3.2 | -0.240000 |
3 | setosa | 3.1 | -0.340000 |
4 | setosa | 3.8 | 0.360000 |
5 | versicolor | 2.4 | -0.133333 |
6 | versicolor | 2.7 | 0.166667 |
7 | versicolor | 2.5 | -0.033333 |
8 | virginica | 2.9 | -0.125000 |
9 | virginica | 2.8 | -0.225000 |
10 | virginica | 3.0 | -0.025000 |
11 | virginica | 3.4 | 0.375000 |
Rank values inside groups
Besides computing group-wise values, you can also rank values within each group.
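A minimal sketch of within-group ranking, assuming the same illustrative DataFrame `df`. It uses `transform("rank")` with pandas' default settings (ascending order, ties receive the average rank):

```python
import pandas as pd

# Rebuild the sample data so this snippet runs on its own.
df = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 3 + ["virginica"] * 4,
    "sepal_width": [3.4, 3.7, 3.2, 3.1, 3.8,
                    2.4, 2.7, 2.5,
                    2.9, 2.8, 3.0, 3.4],
})

# Rank each value among the rows of its own species (1.0 = smallest).
df["rank"] = df.groupby("species")["sepal_width"].transform("rank")

print(df)
```

The ranks come back as floats because the default tie-breaking method averages tied ranks, which can produce values such as 1.5.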
| species | sepal_width | rank |
0 | setosa | 3.4 | 3.0 |
1 | setosa | 3.7 | 4.0 |
2 | setosa | 3.2 | 2.0 |
3 | setosa | 3.1 | 1.0 |
4 | setosa | 3.8 | 5.0 |
5 | versicolor | 2.4 | 1.0 |
6 | versicolor | 2.7 | 3.0 |
7 | versicolor | 2.5 | 2.0 |
8 | virginica | 2.9 | 2.0 |
9 | virginica | 2.8 | 1.0 |
10 | virginica | 3.0 | 3.0 |
11 | virginica | 3.4 | 4.0 |
Error when output has multiple columns
If transform() returns more than one column and you try to assign the result to a single column of an existing DataFrame, pandas raises an error such as ValueError: Wrong number of items passed X, placement implies 1 (the exact wording depends on the pandas version). To avoid this, select a single column from the grouped object before calling transform(), so that the result is a Series:
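The sketch below reproduces the problem and the fix. The sample table has only one numeric column, so a second one, `sepal_width_doubled`, is added here purely for illustration; it is a hypothetical column, not part of the original data. The error message text varies across pandas versions, so the example only catches the exception type.

```python
import pandas as pd

# Rebuild the sample data so this snippet runs on its own.
df = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 3 + ["virginica"] * 4,
    "sepal_width": [3.4, 3.7, 3.2, 3.1, 3.8,
                    2.4, 2.7, 2.5,
                    2.9, 2.8, 3.0, 3.4],
})
# Hypothetical second numeric column, added only to trigger the error.
df["sepal_width_doubled"] = df["sepal_width"] * 2

# Without selecting a column, transform works on every non-grouping
# column and returns a DataFrame with two columns.
multi = df.groupby("species").transform("rank")
print(multi.shape)

try:
    # Assigning a two-column DataFrame to a single column fails.
    df["rank"] = multi
except ValueError as err:
    print("ValueError:", err)

# Fix: select one column first so transform returns a Series.
df["rank"] = df.groupby("species")["sepal_width"].transform("rank")
print(df[["species", "sepal_width", "rank"]])
```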
| species | sepal_width | rank |
0 | setosa | 3.4 | 3.0 |
1 | setosa | 3.7 | 4.0 |
2 | setosa | 3.2 | 2.0 |
3 | setosa | 3.1 | 1.0 |
4 | setosa | 3.8 | 5.0 |
5 | versicolor | 2.4 | 1.0 |
6 | versicolor | 2.7 | 3.0 |
7 | versicolor | 2.5 | 2.0 |
8 | virginica | 2.9 | 2.0 |
9 | virginica | 2.8 | 1.0 |
10 | virginica | 3.0 | 3.0 |
11 | virginica | 3.4 | 4.0 |