Convert categorical variables to numeric with pandas

When preparing datasets for machine learning models, you often need to encode categorical features as numbers. There are two common approaches: convert each label to a corresponding integer, or dummy-encode the labels, i.e. create one indicator column for each possible value.

# Import libraries
import pandas as pd

# Create sample data
df = pd.DataFrame({'fruit': ['apple', 'banana', 'banana', 'orange', 'apple', 'orange']})

0 apple
1 banana
2 banana
3 orange
4 apple
5 orange

Convert labels to integers with pd.factorize()

To convert labels (either strings or categorical values) into integers, an easy solution is the pandas factorize() function. It assigns each unique value in the column an integer code based on its order of appearance: the first unique value encountered gets 0, the second gets 1, and so on.

The function returns a tuple: a NumPy array of integer codes, and an Index with the corresponding labels.

# Factorize
df['fruit'].factorize()

(array([0, 1, 1, 2, 0, 2]),
 Index(['apple', 'banana', 'orange'], dtype='object'))

To keep only the integers and assign them to a column, select the first element of the returned tuple:

# Assign the values to a column
df['fruit_int'] = df['fruit'].factorize()[0]

    fruit  fruit_int
0   apple          0
1  banana          1
2  banana          1
3  orange          2
4   apple          0
5  orange          2
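
As a small sketch of how the two return values fit together, the example below recovers the original labels from the codes, and shows two behaviours worth knowing: sort=True numbers the labels in sorted order rather than order of appearance, and missing values receive the code -1. The variable names (fruits, shuffled, restored and so on) are illustrative only and not part of the original example.

```python
import pandas as pd

fruits = pd.Series(['apple', 'banana', 'banana', 'orange', 'apple', 'orange'])

# factorize() returns (codes, uniques): a NumPy integer array and an Index of labels
codes, uniques = fruits.factorize()

# Index the uniques with the codes to recover the original labels
restored = uniques.take(codes)
print(list(restored))

# sort=True numbers the labels in sorted order instead of order of appearance
shuffled = pd.Series(['orange', 'apple', 'banana', 'apple'])
codes_appearance, _ = shuffled.factorize()
codes_sorted, uniques_sorted = shuffled.factorize(sort=True)
print(codes_appearance, codes_sorted)

# Missing values get the code -1 and are left out of the uniques
codes_na, uniques_na = pd.Series(['apple', None, 'banana']).factorize()
print(codes_na, list(uniques_na))
```

Keeping the uniques around is useful if you later need to translate model outputs back into readable labels.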

Dummy-encode categories with pd.get_dummies()

To create a dummy column for each possible value, use get_dummies().

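Pass drop_first=True to get k-1 columns instead of k (the column for the first category is dropped). By default, missing values (NaN) get no column of their own, so their rows contain only zeros; pass dummy_na=True to add an indicator column for them.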

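# Dummy-encode the fruits (dtype=int gives 0/1; pandas 2.0+ returns booleans by default)
pd.get_dummies(df['fruit'], dtype=int)

   apple  banana  orange
0      1       0       0
1      0       1       0
2      0       1       0
3      0       0       1
4      1       0       0
5      0       0       1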
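
To illustrate the drop_first and dummy_na arguments, here is a minimal sketch on a small series that contains one missing value. The series and the variable names (full, reduced, with_na) are made up for this illustration; dtype=int is passed only to get 0/1 output on recent pandas versions.

```python
import pandas as pd

fruits = pd.Series(['apple', 'banana', None, 'orange'], name='fruit')

# Default: one column per label; the missing value gets no column (all-zero row)
full = pd.get_dummies(fruits, dtype=int)

# drop_first=True removes the first level ('apple'), leaving k-1 columns
reduced = pd.get_dummies(fruits, drop_first=True, dtype=int)

# dummy_na=True adds an extra indicator column for missing values
with_na = pd.get_dummies(fruits, dummy_na=True, dtype=int)

print(full)
print(reduced)
print(with_na)
```

Note that the row with the missing value is kept in every case; only its encoding changes. With drop_first=True, an "apple" row is represented by zeros in all remaining columns.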