Go up a hierarchical tree with pandas

A common way of storing hierarchical data is the tree:

tree structure

In SQL, the methods for going up a tree are very different between SQL dialects. In pandas however, it is relatively simple.

For this example, we’ll use the list of Google Ads geotargets, that stores locations in a hierarchical structure, and get the list of all parents for each location.

Setup

# Import libraries
import pandas as pd
import numpy as np
# Import Google Ads location data
df = (
    pd.read_csv('https://developers.google.com/adwords/api/docs/appendix/geo/geotargets-2021-02-24.csv')
    .rename(columns=lambda x: x.lower().replace(' ', '_'))
    .loc[lambda x: x['country_code'] == 'FR']
    .reset_index(drop=True)
    .assign(parent_id=lambda x: x['parent_id'].fillna(0).astype('int'))
    [['criteria_id', 'canonical_name', 'parent_id']]
)
df
criteria_id canonical_name parent_id
0 1005781 Algolsheim,Grand Est,France 9068896
1 1005782 Cernay,Grand Est,France 9068896
2 1005783 Colmar,Grand Est,France 9068896
3 1005784 Ensisheim,Grand Est,France 9068896
4 1005785 Erstein,Grand Est,France 9068896
... ... ... ...
3755 9060754 Petite-France,Grand Est,France 9068896
3756 9072487 Les Halles,Ile-de-France,France 20321
3757 9072488 Place Vendome,Ile-de-France,France 20321
3758 9072489 Sorbonne,Ile-de-France,France 20321
3759 9072490 Saint-Germain-des-Pres,Ile-de-France,France 20321

3760 rows × 3 columns

Recursive function

# Recursive algorithm: list the next parent if any, otherwise exit
def recursive(row, parents):
    if row['parent_id'] == 0:
        return parents
    else:
        parents.append(row['parent_id'])
        return recursive(df.loc[df['criteria_id'] == row['parent_id'], :].squeeze(), parents)

# Call the recursive function for each row
def get_parents(row):
    parents = []
    return recursive(row, parents)

# Apply the function and create a column with the parents list for each location
df['parents_list'] = df.apply(get_parents, axis=1)
# Display resulting DataFrame
df.sort_values('parents_list')
criteria_id canonical_name parent_id parents_list
683 2250 France 0 []
669 20320 Franche-Comte,France 2250 [2250]
675 20326 Nord-Pas-de-Calais,France 2250 [2250]
674 20325 Midi-Pyrenees,France 2250 [2250]
673 20324 Lorraine,France 2250 [2250]
... ... ... ... ...
3088 9056091 73370,Auvergne-Rhone-Alpes,France 9069525 [9069525, 2250]
3089 9056092 73400,Auvergne-Rhone-Alpes,France 9069525 [9069525, 2250]
3090 9056093 73600,Auvergne-Rhone-Alpes,France 9069525 [9069525, 2250]
638 1006423 Saint-Agreve,Auvergne-Rhone-Alpes,France 9069525 [9069525, 2250]
3112 9056115 74500,Auvergne-Rhone-Alpes,France 9069525 [9069525, 2250]

3760 rows × 4 columns