Convert categorical variables to numeric with pandas

When preparing datasets for machine learning models, you often need to encode categorical features as numbers. You can either convert the labels into corresponding integers, or dummy-encode them, i.e. create one column for each possible value.

Categorical variables represent types or categories and typically appear in datasets as text labels, such as colors, product categories, or geographical locations. Many machine learning algorithms expect numeric input, so these labels have to be converted to a numerical format first.

Consider the following sample data representing different types of fruits:

# Import libraries
import pandas as pd

# Create sample data
df = pd.DataFrame({'fruit': ['apple', 'banana', 'banana', 'orange', 'apple', 'orange']})

df
    fruit
0   apple
1  banana
2  banana
3  orange
4   apple
5  orange

Convert labels to integers with pd.factorize()

Label encoding is a method to convert categorical labels into a set of integers. Each unique label is assigned a unique integer.

To convert labels (either strings or categorical values) into integers, an easy solution is the pandas factorize() function. It assigns a code to each unique value in the column based on its order of appearance: the first unique value encountered gets code 0, the second gets code 1, and so on.

The function returns a NumPy array of integer codes and an Index with the corresponding labels.

# Factorize
df['fruit'].factorize()
(array([0, 1, 1, 2, 0, 2]),
 Index(['apple', 'banana', 'orange'], dtype='object'))
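The order-of-appearance rule described above can be checked on a small example. The sketch below uses a made-up Series (`fruits` is not part of the original data) that contains a missing value, and compares the default behaviour with `sort=True`, which numbers the labels in sorted order instead. With default settings, missing values receive the code -1 and do not appear among the labels.

```python
import pandas as pd

# Hypothetical sample with a missing value (not the article's df)
fruits = pd.Series(['orange', 'apple', None, 'banana', 'apple'])

# Default: codes follow the order of first appearance
codes, uniques = fruits.factorize()
print(codes)          # [ 0  1 -1  2  1]
print(list(uniques))  # ['orange', 'apple', 'banana']

# sort=True: codes follow the sorted order of the labels
codes_sorted, uniques_sorted = fruits.factorize(sort=True)
print(codes_sorted)          # [ 2  0 -1  1  0]
print(list(uniques_sorted))  # ['apple', 'banana', 'orange']
```

Sorting can be convenient when the same labels should map to the same integers regardless of row order.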

To keep only the integers and assign them to a column, select the first element of the returned tuple:

# Assign the values to a column
df['fruit_int'] = df['fruit'].factorize()[0]
df
    fruit  fruit_int
0   apple          0
1  banana          1
2  banana          1
3  orange          2
4   apple          0
5  orange          2
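Keeping the second element returned by factorize() is useful as well, since it holds the mapping from integers back to labels. The following is a minimal sketch, not the only way to do it: it decodes the integers with `Index.take()` and reuses the same mapping on new values with `Index.get_indexer()`. The list `new_fruits` is a made-up example; labels that were not seen before get the code -1.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'banana', 'orange', 'apple', 'orange']})

# Keep both parts of the result: integer codes and unique labels
codes, uniques = df['fruit'].factorize()

# Decode: look up the label for each integer code
decoded = uniques.take(codes)
print(list(decoded))  # ['apple', 'banana', 'banana', 'orange', 'apple', 'orange']

# Encode new (hypothetical) data with the same mapping; unseen labels get -1
new_fruits = ['orange', 'kiwi', 'apple']
new_codes = uniques.get_indexer(new_fruits)
print(new_codes)  # [ 2 -1  0]
```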

Dummy-encode categories with pd.get_dummies()

To create a dummy column for each possible value, use pd.get_dummies().

Pass drop_first=True to get k-1 columns for k categories. By default, NaN values are ignored (the row gets 0 in every column); pass dummy_na=True to add a separate column for them.

# Dummy-encode the fruits
pd.get_dummies(df['fruit'], dtype=int)
   apple  banana  orange
0      1       0       0
1      0       1       0
2      0       1       0
3      0       0       1
4      1       0       0
5      0       0       1
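The drop_first and dummy_na arguments are easiest to understand on data that contains a missing value. The sketch below uses a small made-up DataFrame (`df_na` is not the article's data) and passes `dtype=int` so the dummies are shown as 0/1 rather than booleans, which is an assumption about the desired output format. The last lines show one possible way to attach the dummy columns to the original frame, using `prefix` to keep the column names readable.

```python
import pandas as pd

# Hypothetical sample with a missing value in row 2
df_na = pd.DataFrame({'fruit': ['apple', 'banana', None, 'orange']})

# Default: one column per label, the missing row is all zeros
full = pd.get_dummies(df_na['fruit'], dtype=int)
print(list(full.columns))  # ['apple', 'banana', 'orange']

# drop_first=True: the first label ('apple') is dropped, leaving k-1 columns
k_minus_1 = pd.get_dummies(df_na['fruit'], drop_first=True, dtype=int)
print(list(k_minus_1.columns))  # ['banana', 'orange']

# dummy_na=True: an extra column flags the missing values
with_na = pd.get_dummies(df_na['fruit'], dummy_na=True, dtype=int)
print(with_na.iloc[2].tolist())  # [0, 0, 0, 1]

# Attach prefixed dummy columns to the original DataFrame
encoded = df_na.join(pd.get_dummies(df_na['fruit'], prefix='fruit', dtype=int))
print(list(encoded.columns))  # ['fruit', 'fruit_apple', 'fruit_banana', 'fruit_orange']
```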