Consumer Segmentation (Quantitative Market Analysis)

K-Modes Clustering for Coffee Customer Survey Data

Author: Luke Catalano

Affiliation: University of California, Santa Cruz, M.S. Quantitative Economics & Finance

Overview

This project develops an unsupervised consumer segmentation pipeline using K-Modes clustering to analyze mixed categorical survey data for a coffee retailer. The primary objective is to identify distinct customer segments based on purchasing preferences, price sensitivity, and loyalty behaviors, and to translate these segments into actionable pricing and retention strategies aimed at improving customer lifetime value (CLV).

Traditional clustering methods such as K-Means are not well-suited for categorical data due to data type mismatches and the non-numerical natrue of the majority of responses. This project instead applies K-Modes, which replaces Euclidean distance with a mode-based dissimilarity measure (hamming distance), allowing for meaningful segmentation of survey-based customer attributes.

Economic Motivation

Retailers often collect rich survey data through mobile app or web-based surveys, but struggle to convert qualitative responses into quantitative insights. Without segmentation, pricing and marketing strategies are typically applied uniformly, leading to: • Over-discounting low-sensitivity customers • Under-serving high-value loyalists • Inefficient allocation of marketing spend

This project demonstrates how unsupervised learning can uncover latent structure in customer preferences and spending behavior, enabling segment-specific decision-making. In economic terms, such segmentation supports third-degree price discrimination, where consumers are grouped by quantifiable characteristics and firms tailor prices or offerings across segments to more efficiently capture surplus.

Tools & Models

• Python

• K-Modes clustering

• Pandas, NumPy

• Seaborn

Methodologies

Optimal Cluster Selection

To determine the proper cluster count, the within-cluster dissimilarity (cost function) and Silhouette Score were both analyzed to balance statistical fit and interpretability. While lower cluster counts produced marginally higher Silhouette Scores, they failed to capture meaningful behavioral heterogeneity (near-identical clusters). A four-cluster solution represented the point at which additional clusters yielded diminishing reductions in cost while preserving interpretable segment structure.

Cluster Stability

To assess robustness, the K-Modes algorithm was run 20 independent times using different random seeds and Huang initialization (Convergance Speed Optimization). Across all runs, the model converged to nearly identical solutions with minimal variation in total cost, indicating that the identified segments are structurally stable rather than by-products of random initialization.

Results (Segment Profiles)

The final model identified four distinct customer segments, each exhibiting consistent and interpretable behavioral patterns:

• Premium Customers (Cluster 0)

Price-insensitive, high-spending consumers with strong engagement across product categories. This segment represents the highest revenue potential and is well-suited for premium offerings, subscriptions, and high-margin upsells.

• Core Regulars (Cluster 1)

Moderate-to-high spenders with stable purchasing behavior. These customers form the revenue backbone of the business and benefit most from retention-focused strategies such as loyalty programs and consistency-based rewards.

• Budget-Conscious Consumers (Cluster 3)

Highly price-sensitive customers concentrated in lower spending brackets. Engagement can be increased through targeted promotions, value bundles, and limited-time discounts designed to raise basket size without eroding margins elsewhere.

• Minimalists (Cluster 2)

Low-frequency, low-spend customers with limited engagement. This segment exhibits the highest churn risk, and marketing investment here yields the lowest marginal returns.

Note: spending bounds are averaged and normalized for analysis

Key Takeaway

The results demonstrate that unsupervised learning can reliably uncover latent preference and spending structure in categorical survey data. The identified segments are reproducible, interpretable, and directly actionable, supporting differentiated pricing, retention, and marketing strategies aligned with heterogeneous consumer behavior.

Name		Name	Last commit message	Last commit date
Latest commit History 137 Commits
data		data
methods		methods
Consumer_Segmentation.pdf		Consumer_Segmentation.pdf
README.md		README.md
eda.ipynb		eda.ipynb
evaluation.ipynb		evaluation.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Consumer Segmentation (Quantitative Market Analysis)

K-Modes Clustering for Coffee Customer Survey Data

Overview

Economic Motivation

Tools & Models

Methodologies

Optimal Cluster Selection

Cluster Stability

Results (Segment Profiles)

Key Takeaway

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Consumer Segmentation (Quantitative Market Analysis)

K-Modes Clustering for Coffee Customer Survey Data

Overview

Economic Motivation

Tools & Models

Methodologies

Optimal Cluster Selection

Cluster Stability

Results (Segment Profiles)

Key Takeaway

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages