Objective: Develop a recommender system to suggest short videos to users based on preferences, interaction histories, and video content using the KuaiRec dataset. The goal is to create a personalized and scalable recommendation engine.
This project implements a multi-stage hybrid recommender system for short videos, leveraging the KuaiRec dataset. The system provides personalized video suggestions by first generating a broad set of candidate videos for each user, then re-ranking these candidates using a model that incorporates rich user, item, and interaction features.
The KuaiRec dataset's small_matrix is described as "fully observed," meaning nearly every user has a watch_ratio for every item. Thus, the task is primarily one of ranking items based on predicted engagement rather than inferring missing interactions.
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
bash scripts/download_data.shThe system uses a two-stage architecture—common in large-scale recommender systems—for optimal trade-offs between speed and quality.
Justification for a Two-Stage Approach:
- Scalability: Ranking all items for every user is infeasible at scale due to high latency and cost.
- Efficiency: Stage 1 (candidate generation) narrows millions of items to a few hundred; Stage 2 (ranking) applies a more accurate, resource-intensive model on this shortlist.
- Signal Diversity: Each stage can exploit different strengths—broad patterns in Stage 1 (collaborative filtering, etc...), and detailed, fine-grained features in Stage 2 (user-item interaction features, item metadata, etc...).
Detailed in notebooks/1_Data_Preparation.ipynb, this phase includes:
-
Loading Raw Data: Ingests
big_matrix.csv,small_matrix.csv,user_features.csv,item_categories.csv, anditem_daily_features.csv. -
Initial Cleaning: Handles missing values, data types, and parses complex fields (item tags, etc...).
-
Feature Creation:
- Temporal Features: Extract
interaction_hourandinteraction_day_of_week. - User Features: Includes
user_active_degree,fans_user_num,register_days,is_lowactive_period,is_live_streamer,is_video_author, andonehot_feat*. - Static Item Features: Extracts
num_item_tagsfromitem_categories.csv. - Dynamic Item Features: Derives
video_age_days, engagement ratios likedaily_play_per_show_ratio, and raw stats likevideo_duration_daily. - Duplicate Handling: Removes duplicate entries in
item_daily_featuresbased on(video_id, date).
- Temporal Features: Extract
-
Data Splitting & Saving:
big_matrixis split chronologically: 80% train, 20% test.- Feature-engineered
small_matrixsaved separately. - All outputs saved in Parquet format.
-
Model: ALS from the
implicitlibrary (matrix factorization for implicit feedback). -
Training Data:
big_matrixwith clippedwatch_ratio > 0used as confidence. -
Output: List of candidate item IDs per user.
-
Reference:
notebooks/3_Model_Training.ipynb
-
Model: LightGBM (gradient boosting) for regression.
-
Task: Predict
watch_ratiofor(user, item)pairs using engineered features. -
Training Data: 80% training split of
big_matrix. -
Loss Function: L1 loss (MAE), robust to outliers.
-
Output: Predicted
watch_ratioused to re-rank ALS candidates. -
Reference:
notebooks/3_Model_Training.ipynb
Detailed in notebooks/4_Model_Evaluation.ipynb.
-
Purpose: Evaluate LightGBM on known users/items from training data.
-
Methodology:
- Predict
watch_ratiofor all items per user insmall_matrix. - Rank items by predicted scores.
- Predict
-
Metrics:
- Pointwise: RMSE, MAE
- Ranking: Precision@k, Recall@k, nDCG@k (k = 5, 10, 20, 50, 100, 250)
-
Purpose: Evaluate on unseen interactions from test split.
-
Methodology:
- Use 20% test split.
- Predict and rank only items the user interacted with.
-
Metrics: Ranking metrics as above.
-
Pointwise Metrics:
- RMSE:
1.3214 - MAE:
0.3473
- RMSE:
-
Ranking Metrics (Relevance:
watch_ratio > 1.0):
| Metric | @5 | @10 | @20 | @50 | @100 | @250 |
|---|---|---|---|---|---|---|
| Avg Precision | 0.9137 | 0.9117 | 0.9019 | 0.8571 | 0.8234 | 0.7725 |
| Avg Recall | 0.0048 | 0.0095 | 0.0188 | 0.0443 | 0.0848 | 0.1973 |
| Avg NDCG | 0.9146 | 0.9128 | 0.9057 | 0.8706 | 0.8396 | 0.7904 |
- Ranking Metrics (Relevance:
watch_ratio > 1.0):
| Metric | @5 | @10 | @20 | @50 | @100 | @250 |
|---|---|---|---|---|---|---|
| Avg Precision | 0.6859 | 0.6534 | 0.6143 | 0.5339 | 0.4491 | 0.3122 |
| Avg Recall | 0.0669 | 0.1220 | 0.2139 | 0.4128 | 0.6123 | 0.8831 |
| Avg NDCG | 0.7668 | 0.7508 | 0.7326 | 0.7291 | 0.7781 | 0.9156 |
-
Strong Performance on Known Data (
small_matrix):-
The LightGBM ranker demonstrates excellent precision and nDCG on the small_matrix. This indicates it has effectively learned to identify and rank highly engaging items for users and items it was exposed to during training (as this cohort is part of big_matrix).
-
The low recall is an artifact of the dense evaluation setup, where each user has many "relevant" items, making it hard to capture a large fraction in a short top-K list. The MAE of ~0.35 suggests the watch_ratio predictions are, on average, reasonably close to the actuals.
-
-
Generalization to Unseen Data (
big_matrixtest):-
As expected, precision and nDCG are lower on the big_matrix test set compared to the small_matrix. This reflects the increased difficulty of predicting for chronologically newer interactions, which may involve less familiar items or evolving user preferences.
-
Recall is significantly higher on the big_matrix test set. This is because users in the test set have a smaller number of actual positive interactions (compared to the ~3300 items per user in the small_matrix scope), making it easier to recall a larger fraction of these true positives.
-
-
Overall: The model shows strong learning capabilities. The difference in metrics between the two evaluation sets highlights the importance of evaluating on data that mirrors the deployment scenario (he big_matrix holdout).
- Careful feature engineering—especially optimized daily item features—was critical for memory efficiency and model performance.
- LightGBM ranked well on known data and showed solid generalization on
big_matrix, offering a reliable baseline. small_matrixproved valuable for in-depth validation on dense user cohorts.- L1 loss suited the skewed nature of the
watch_ratiotarget.
Memory Optimization Highlights:
- Selective Feature Loading: Only essential columns were read from raw files.
- Slim Merge Tables: Merged pre-aggregated ratios, not raw counts.
- Duplicate Removal: Cleaning
(video_id, date)duplicates prevented merge-time row explosion. - Manual Memory Management: Frequent
delandgc.collect()kept memory use low. - Optimized Dtypes: Used compact numerical types where possible.
These steps enabled processing the 12.5M-row big_matrix on a single machine.
-
End-to-End Evaluation: Implement full ALS → LightGBM pipeline on test set.
-
Advanced Feature Engineering:
- Embeddings: For users, items, tags, authors, etc.
- Interaction History: E.g., avg user engagement with similar items.
- Social Network: Integrate
social_network.csv(e.g., friends' interactions).
-
Alternative Candidate Generators: Try item-based CF or two-tower models.
-
Alternative Ranking Models: Explore deep learning-based ranking models (e.g., neural nets, transformers).