This project documents my journey of building a production-ready recommender system, starting from a naive prototype and evolving into a scalable, cloud-optimized ML pipeline.
What began as a simple "statistical model running on fake data" quickly turned into a deep engineering challenge once real production data arrived. This repository contains the code, pipelines, and tooling that came out of that process.
The first version of the model was built in a few days:
- Local CPU-only training
- Fake synthetic data
- Deployed on Azure as a proof-of-concept
It worked… until real data arrived. Suddenly there were millions of interactions, messy schemas, mixed types, and inconsistent column names. From that point on, the real project began.
Real production data was far from clean. I built a full preprocessing pipeline that:
- Normalizes and aligns inconsistent schemas
- Fixes mismatched data types
- Removes corrupted or incomplete rows
- Validates field formats
Data cleaning turned out to be more demanding than the model itself.
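To give a flavor of what that pipeline does, here is a minimal sketch of those steps, assuming pandas and hypothetical column names (`user_id`, `item_id`, `timestamp`) standing in for the real schema:

```python
import pandas as pd

# Hypothetical aliases; the real sources disagreed on column naming.
COLUMN_ALIASES = {"userId": "user_id", "UserID": "user_id",
                  "itemId": "item_id", "ItemID": "item_id"}

def clean_interactions(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize inconsistent column names to one schema.
    df = df.rename(columns=COLUMN_ALIASES)

    # Fix mismatched data types: ids as strings, timestamps parsed.
    df["user_id"] = df["user_id"].astype(str).str.strip()
    df["item_id"] = df["item_id"].astype(str).str.strip()
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

    # Remove corrupted or incomplete rows.
    df = df.dropna(subset=["user_id", "item_id", "timestamp"])

    # Validate field formats, e.g. ids must be numeric strings.
    df = df[df["user_id"].str.fullmatch(r"\d+")]
    return df
```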
The first implementation was too slow for large datasets. To fix this, I:
- Vectorized heavy operations
- Profiled performance hotspots
- Removed Python loops and rewrote them using NumPy logic
Locally, this cut training time from effectively never finishing to a few minutes.
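To illustrate the kind of rewrite involved (a toy example, not the project's actual code): the same per-user aggregation expressed as a Python loop and then pushed into NumPy:

```python
import numpy as np

# Toy data: one user index and one weight per interaction.
user_idx = np.random.randint(0, 10_000, size=1_000_000)
weights = np.random.rand(1_000_000)

# Before: a Python loop, fine on toy data, hopeless at millions of rows.
totals_slow = np.zeros(10_000)
for u, w in zip(user_idx, weights):
    totals_slow[u] += w

# After: the identical aggregation as a single vectorized call.
totals_fast = np.bincount(user_idx, weights=weights, minlength=10_000)

assert np.allclose(totals_slow, totals_fast)
```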
Once the data was finally clean, the results were still bad. That's when I realized I could no longer treat the model as a "black box"; I had to understand what was actually happening under the hood.
I used the excellent benfred/implicit library, specifically its Alternating Least Squares (ALS) implementation. The first version "worked", but I couldn't explain why it worked on synthetic data, or why it performed poorly on the real data.
To fix that, I had to learn:
- How ALS factorization actually works
- Why implicit feedback requires confidence weights
- The role of matrix multiplication
- What each hyperparameter actually does and their trade-offs
- How to run systematic parameter searches
If I had to single out one paper that helped me most with the general understanding, it's Hu, Koren, and Volinsky's *Collaborative Filtering for Implicit Feedback Datasets*.
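For reference, the core objective from that paper: user factors $x_u$ and item factors $y_i$ are fit against binarized preferences, weighted by a confidence that grows with the raw interaction count $r_{ui}$:

$$
\min_{x_*,\,y_*} \;\sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
\;+\; \lambda\Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr)
$$

where $p_{ui} = 1$ if $r_{ui} > 0$ (else $0$), and the confidence is $c_{ui} = 1 + \alpha\, r_{ui}$.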
This shifted the model from a "black box" to something I could reason about.
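A minimal sketch of how this looks with implicit (assuming a version >= 0.5, where `fit` takes a user-item matrix and `recommend` returns id/score arrays). The indices and the alpha value are placeholders, and pre-scaling the confidences yourself is just one convention; how the library interprets matrix values has varied across versions, so check its docs:

```python
import numpy as np
import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares

# Toy interactions: (user index, item index, raw count r_ui).
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 3, 2, 0])
raw  = np.array([3.0, 1.0, 5.0, 2.0])

# Pre-scale confidences c_ui = 1 + alpha * r_ui; alpha=40 is the value the
# paper reports working well in its experiments, not a universal setting.
alpha = 40.0
user_items = sp.csr_matrix((1.0 + alpha * raw, (rows, cols)), shape=(3, 4))

model = AlternatingLeastSquares(factors=64, regularization=0.01, iterations=15)
model.fit(user_items)

# Top-N recommendations for one user (N=20 in this project's flow).
ids, scores = model.recommend(0, user_items[0], N=2)
```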
Fetching the entire dataset for every training run was inefficient. I redesigned the process (sketched below) to:
- Keep the previous year of training data in Azure Blob Storage
- Fetch only the last 24 hours of new interactions
This drastically reduced:
- Database load
- Production server load
- Training time
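A sketch of the daily fetch, assuming pandas, the azure-storage-blob SDK, hypothetical container and blob names, and a hypothetical `fetch_interactions()` helper standing in for the real database query:

```python
import io
from datetime import datetime, timedelta, timezone

import pandas as pd
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="training-data", blob="history.parquet")

# Load the rolling year of history kept in Blob Storage.
history = pd.read_parquet(io.BytesIO(blob.download_blob().readall()))

# Fetch only the last 24 hours of new interactions from the production DB.
since = datetime.now(timezone.utc) - timedelta(hours=24)
fresh = fetch_interactions(since=since)  # hypothetical DB query helper

# Merge, then trim back to a one-year window.
dataset = pd.concat([history, fresh], ignore_index=True)
cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=365)
dataset = dataset[dataset["timestamp"] >= cutoff]

# Write the merged window back so tomorrow's run starts from it.
buf = io.BytesIO()
dataset.to_parquet(buf)
blob.upload_blob(buf.getvalue(), overwrite=True)
```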
I originally picked Azure because of my .NET background, but I quickly realized it wasn't the right fit. Even though CPU training wasn't that slow on my local machine, Azure CPU runs were extremely slow, mostly because I was using the cheaper App Service tier, while everything faster was unreasonably expensive. Instead of scaling up on Azure, I migrated to Google Cloud with GPU support, where:
- Training went from multi-hour → 2 minutes
- GPU instances were much more affordable
- The environment felt easier to work with
Azure now only handles the daily scheduling, while all heavy lifting happens on GCP.
Originally everything was stored in in-memory hash maps, which didn't scale. I moved to Redis, and the production recommendation flow became:
API → userId → Redis → Top 20 recommendations → API
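A minimal sketch of that flow with redis-py; the key layout (`recs:<userId>`) and the TTL are assumptions:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# After each training run, write every user's top-20 list.
def store_recommendations(user_id: str, item_ids: list[str]) -> None:
    # Expire after 48h so stale entries vanish if a daily run is missed.
    r.set(f"recs:{user_id}", json.dumps(item_ids[:20]), ex=48 * 3600)

# The API handler then reduces to a single key lookup.
def get_recommendations(user_id: str) -> list[str]:
    raw = r.get(f"recs:{user_id}")
    return json.loads(raw) if raw else []
```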
A real-time updating model turned out to be unrealistic after weighing:
- Compute cost
- Complexity
- Required freshness
I switched to offline training with scheduled model updates: simpler and more reliable.
Even with low loss, results looked off. Digging deeper uncovered:
- Thousands of bot interactions (Google/Facebook crawlers)
- Internal admin activity polluting the dataset
Cleaning this noise improved results more than any hyperparameter tweak.
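A sketch of the kind of filter that removed this noise, assuming hypothetical `user_agent` and `user_id` columns and an assumed list of internal accounts:

```python
import pandas as pd

# Assumed schema: each event logs a user agent string and an account id.
BOT_UA_PATTERN = r"googlebot|bingbot|facebookexternalhit|crawler|spider"
INTERNAL_IDS = {"admin", "staging-test"}  # hypothetical internal accounts

def drop_noise(df: pd.DataFrame) -> pd.DataFrame:
    is_bot = df["user_agent"].str.contains(BOT_UA_PATTERN, case=False, na=False)
    is_internal = df["user_id"].isin(INTERNAL_IDS)
    return df[~(is_bot | is_internal)]
```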
To make the pipeline more reliable, I added a simple global error-handling layer using Python and Flask.
- All unhandled errors are caught by a global Flask error handler.
- The handler formats the exception and stack trace and sends it directly to my email.
- This lets me immediately know when the training or serving pipeline fails.
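A minimal sketch of that handler, with hypothetical SMTP settings and addresses standing in for the real configuration:

```python
import smtplib
import traceback
from email.message import EmailMessage

from flask import Flask

app = Flask(__name__)

def send_email(subject: str, body: str) -> None:
    # Hypothetical SMTP host and addresses; the real ones live in config.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "pipeline@example.com"
    msg["To"] = "me@example.com"
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com", 587) as smtp:
        smtp.starttls()
        smtp.send_message(msg)

@app.errorhandler(Exception)
def handle_any_error(exc):
    # Mail the formatted exception and stack trace, then return a 500.
    send_email(subject=f"Pipeline failure: {exc!r}",
               body=traceback.format_exc())
    return {"error": "internal server error"}, 500
```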
I also added a small reporting script so that after each daily training run:
- the training summary (metrics, counts, durations, etc.) is automatically sent by email to the business owner.
This way both errors and daily results are delivered without needing to check logs or dashboards.
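And a sketch of the daily report, assuming a hypothetical summary dict and reusing the `send_email` helper from the error-handling sketch above:

```python
# Hypothetical summary produced at the end of the daily training run;
# the keys and numbers here are placeholders, not real metrics.
summary = {
    "interactions": 1_234_567,
    "users": 45_210,
    "items": 8_904,
    "training_seconds": 118,
}
body = "\n".join(f"{key}: {value}" for key, value in summary.items())
send_email(subject="Daily training report", body=body)
```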