Introduction
When working on a fraud prediction case, I wanted to identify distinctive behaviours among fraudsters. After a number of data visualisation trials, here are a few lessons I now keep in mind when representing clusters visually in order to draw insights:
- Variables that were useful for classification are not necessarily the most appropriate for clustering the predictions within one class. You may have to pick other features (see the sketch after this list).
- Limiting the number of variables, usually to a maximum of three, gives clearer and more interpretable results. Beyond that, the clusters become hard to interpret.
- Plotting empirical cumulative distribution function (eCDF) graphs for each feature, with one distinct line per cluster, can help identify patterns in the data.
- Creating heatmaps for pairs of features, with one heatmap per cluster, can further reveal insights into the relationships between the features.
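To make the first point concrete, here is a minimal sketch (the alternative column names feature_3 and feature_4 are hypothetical placeholders) that compares candidate feature subsets for clustering using the silhouette score, rather than assuming the classifier's top features will also separate clusters well:
# Sketch: compare candidate feature subsets for clustering
# (alternative feature names below are hypothetical placeholders)
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

candidate_subsets = [
    ['feature_1', 'feature_2'],   # behavioural features
    ['feature_3', 'feature_4'],   # e.g. the classifier's top features
]
for cols in candidate_subsets:
    X_subset = df[cols]
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_subset)
    print(cols, round(silhouette_score(X_subset, labels), 3))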
Compute the clusters
In this example, we will consider data that has been predicted as likely fraudulent behaviour. Specifically, we will apply the K-means algorithm to two features and compute 3 clusters among the observations classified as fraudulent.
import pandas as pd
from sklearn.cluster import KMeans

# Prepare data with the relevant features
X = df[['feature_1', 'feature_2']].copy()

# Run the algorithm (random_state fixed for reproducibility)
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

# Prepare results
df_clusters = pd.concat([
    df['session_id'],
    X,
    pd.Series(kmeans.labels_, name='cluster'),
], axis=1)
# Show population by cluster, and average probability from the model
(
    df_clusters
    .merge(
        df[['session_id', 'proba_1']].drop_duplicates(subset='session_id'),
        on='session_id', how='left'
    )
    .groupby('cluster')
    .agg(n=('session_id', 'nunique'), avg_proba=('proba_1', 'mean'))
    .assign(pct=lambda x: (x['n'] / x['n'].sum()).round(2))
    [['n', 'pct', 'avg_proba']]
)
| cluster | n   | pct  | avg_proba |
|---------|-----|------|-----------|
| 0       | 134 | 0.40 | 0.6720    |
| 1       | 48  | 0.14 | 0.5998    |
| 2       | 150 | 0.45 | 0.5634    |
The distribution of data among the clusters is not uniform, as cluster 1 contains only 14% of the observations, but this is not a problem. Also, by calculating the average prediction probability (avg_proba) for each cluster, we can observe that certain clusters are more "assertive" than others, meaning they group observations that were classified with higher probabilities. This information can be useful for understanding the confidence levels of the clustering results and for identifying potential outliers or misclassifications.
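One caveat on the clustering step itself: K-means is sensitive to feature scale, so if the two features live on very different ranges, it is usually worth standardising them before fitting. A minimal sketch, kept separate so as not to change the results shown above:
# Sketch: standardise features before K-means when scales differ
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
kmeans_scaled = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_scaled)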
Plot eCDF of each feature
We can use empirical cumulative distribution function (eCDF) plots to visualize the distribution of clusters across each feature. Here, we clearly see that for feature_1, the distribution of cluster 0 is very distinct from the other clusters, while for feature_2, cluster 1 has a very different distribution.
import matplotlib.pyplot as plt
import seaborn as sns

# Plot distributions, one eCDF line per cluster
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.ecdfplot(data=df_clusters, x='feature_1', hue='cluster', ax=ax[0])
sns.ecdfplot(data=df_clusters, x='feature_2', hue='cluster', ax=ax[1])
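If you want to quantify what the eye sees, a two-sample Kolmogorov-Smirnov test between one cluster and the rest gives a rough measure of how different the distributions are. A sketch for cluster 0 on feature_1:
# Sketch: quantify the separation visible in the eCDF plot
from scipy.stats import ks_2samp

in_cluster = df_clusters.loc[df_clusters['cluster'] == 0, 'feature_1']
others = df_clusters.loc[df_clusters['cluster'] != 0, 'feature_1']
stat, p_value = ks_2samp(in_cluster, others)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3g}")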
Plot heatmaps of pairs of features
It is also possible to plot pairs of features as heatmaps to see how they interact, with one heatmap per cluster. The distinct distributions between clusters are even more obvious in these graphs. This is a very good starting point for identifying different patterns within fraudulent behaviours.
# Bin features into bins 0 to 9 (assumes feature values in [0, 100))
df_clusters['feature_1_bins'] = (df_clusters['feature_1'] // 10).astype(int)
df_clusters['feature_2_bins'] = (df_clusters['feature_2'] // 10).astype(int)

# Plot one heatmap per cluster
fig, ax = plt.subplots(1, n_clusters, figsize=(16, 4))
for i in range(n_clusters):
    # Count observations of cluster i in each (bin, bin) cell,
    # on a full 10x10 grid so all heatmaps share the same axes
    df_graph = (
        df_clusters
        .loc[lambda x: x['cluster'] == i]
        .groupby(['feature_1_bins', 'feature_2_bins'])
        .size()
        .unstack(fill_value=0)
        .reindex(list(range(10)), axis=0)
        .reindex(list(range(10)), axis=1)
        .fillna(0)
    )
    sns.heatmap(df_graph, square=False, cbar=False, vmin=0, ax=ax[i])
    ax[i].invert_yaxis()
    ax[i].set_title("Cluster " + str(i))
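Note that the integer division above assumes both features take values roughly in [0, 100). If that is not the case, a more general approach (a sketch) is to let pandas compute ten equal-width bins per feature; the resulting integer bin codes plug into the same heatmap loop:
# Sketch: equal-width binning that works for any feature range
df_clusters['feature_1_bins'] = pd.cut(df_clusters['feature_1'], bins=10, labels=False)
df_clusters['feature_2_bins'] = pd.cut(df_clusters['feature_2'], bins=10, labels=False)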