steamLensAI Architecture Documentation

 ╔═══════════════════════════════════════════════════════════════════════════════════════════╗
 ║ ███████╗████████╗███████╗ █████╗  ███╗ ███╗ ██╗     ███████╗███╗   ██╗███████╗ █████╗ ██╗ ║
 ║ ██╔════╝╚══██╔══╝██╔════╝██╔══██╗████╗ ████║██║     ██╔════╝████╗  ██║██╔════╝██╔══██╗██║ ║
 ║ ███████╗   ██║   █████╗  ███████║██╔████╔██║██║     █████╗  ██╔██╗ ██║███████╗███████║██║ ║
 ║ ╚════██║   ██║   ██╔══╝  ██╔══██║██║╚██╔╝██║██║     ██╔══╝  ██║╚██╗██║╚════██║██╔══██║██║ ║
 ║ ███████║   ██║   ███████╗██║  ██║██║ ╚═╝ ██║███████╗███████╗██║ ╚████║███████║██║  ██║██║ ║
 ║ ╚══════╝   ╚═╝   ╚══════╝╚═╝ ╚═╝ ╚═╝    ╚═╝╚══════╝ ╚══════╝╚═╝  ╚═══╝╚══════╝╚═╝  ╚═╝╚═╝ ║
 ╚═══════════════════════════════════════════════════════════════════════════════════════════╝

All data has been uploaded on kaggle: https://www.kaggle.com/datasets/rishikeshgharat/steam-games-data-40-gb

steamLensAI Architecture Documentation

This document explains the technical architecture, distributed processing design, and engineering decisions behind steamLensAI's high-performance Steam review analysis system.

Architecture Overview

steamLensAI is built as a distributed processing pipeline that leverages parallel computing and GPU acceleration to analyze large volumes of Steam reviews efficiently. Topic Assignment using seed-values (Theme based categorization) and sentence-transformers

1.2M reviews in 2 minutes, 30 seconds

Summarization (Heirarchical Topic Based Summarization) based on the now categorized data from the previous step:

1.2M reviews in 8 minutes

The following are the Execution Time Metrics for the above mentioned data:

The core innovation lies in its distributed computing approach where multiple worker processes share a single GPU through intelligent model distribution, achieving maximum hardware utilization while maintaining processing efficiency.

┌──────────────────┐    ┌─────────────────────┐
│  Distributed     │───▶│   Summarization     │
│  Processing      │    │   Pipeline          │
│  (process_files) │    │   (summarization)   │
└──────────────────┘    └─────────────────────┘
         │                         │
         ▼                         ▼
┌─────────────────────────────────────────────┐
│           Dask LocalCluster                 │
│                                             │
│  Multiple Workers → Single GPU Sharing      │
│  • Model Distribution via publish_dataset   │
│  • Coordinated GPU Memory Management        │
│  • Parallel Processing with Shared Models   │
└─────────────────────────────────────────────┘

Core Components

0. Models Used

Sentence Transformer Model

Model: all-MiniLM-L6-v2
Purpose: Converting review text to numerical embeddings for semantic similarity matching
Task: Topic assignment and theme categorization
Provider: Sentence Transformers library

Summarization Model

Model: sshleifer/distilbart-cnn-12-6
Purpose: Generating concise summaries of positive and negative reviews
Task: Hierarchical text summarization with sentiment separation
Provider: Hugging Face Transformers (DistilBART variant)

Both models support GPU acceleration and are distributed across multiple workers using Dask's publish_dataset() mechanism for efficient parallel processing.

1. Distributed Processing Engine (`processing/process_files.py`)

Role: Multi-worker data processing coordination
Key Technologies: Dask + LocalCluster + Sentence Transformers
Distributed Computing Features:
- Creates LocalCluster with multiple worker processes (Lighter than using mini-kubes)
- Distributes the transformer model () across workers using publish_dataset()
- Coordinates parallel processing of review chunks
- Manages shared GPU resources across workers
- Handles worker-to-worker communication and synchronization

2. Topic Assignment Module (`processing/topic_assignment.py`)

Role: ML-powered review categorization on distributed workers
Key Technologies: Sentence Transformers + Semantic Similarity
Worker-Level Responsibilities:
- Retrieves published models from worker dataset storage
- Converts review text to numerical embeddings using shared GPU
- Performs semantic similarity matching against game themes
- Processes data chunks independently across multiple workers

3. Summarization Pipeline (`processing/summarization.py` + `summarize_processor.py`)

Role: Distributed text summarization across workers
Key Technologies: Transformers (DistilBART) + Multi-Worker GPU Processing
Distributed Features:
- Sets up dedicated Dask cluster for summarization tasks
- Distributes summarization models to all workers via dataset publishing
- Coordinates hierarchical summarization across worker processes
- Manages GPU memory sharing for multiple model instances
- Aggregates results from parallel summarization workers

Data Flow Architecture

Phase 1: File Processing & Validation

Uploaded Files → Temporary Storage → App ID Extraction → Theme Validation
     │                │                    │                  │
     ▼                ▼                    ▼                  ▼
 [file1.parquet]  [/tmp/uuid/]      [extract_appid()]   [themes.json]
 [file2.parquet]      ...              [12345, 67890]      [lookup]
 [file3.parquet]                           ...              [✓/✗]

Phase 2: Distributed Processing Architecture

┌─────────────────┐
│   Main Process  │
│                 │
│ 1. Load Files   │──┐
│ 2. Combine Data │  │
│ 3. Create Chunks│  │
│ 4. Setup Dask   │  │
└─────────────────┘  │
                     │
                     ▼
┌───────────────────────────────────────────────────────────────────┐
│                 Dask LocalCluster                                 │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐  │
│  │  Worker 1   │ │  Worker 2   │ │  Worker 3   │ │  Worker 4   │  │
│  │             │ │             │ │             │ │             │  │
│  │ • Get Model │ │ • Get Model │ │ • Get Model │ │ • Get Model │  │
│  │ • Process   │ │ • Process   │ │ • Process   │ │ • Process   │  │
│  │   Chunk A   │ │   Chunk B   │ │   Chunk C   │ │   Chunk D   │  │
│  │ • Save      │ │ • Save      │ │ • Save      │ │ • Save      │  │
│  │   Results   │ │   Results   │ │   Results   │ │   Results   │  │
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘  │
│                                  │                                │
└──────────────────────────────────┼────────────────────────────────┘
                                   │
                                   ▼
                            ┌─────────────┐
                            │  RTX 4080   │
                            │   16GB GPU  │
                            │             │
                            │ • 4 Model   │
                            │   Copies    │
                            │ • Shared    │
                            │   Compute   │
                            └─────────────┘

Phase 3: Result Aggregation

Temporary Files → Aggregation → Final Report → CSV Export
     │               │             │            │
     ▼               ▼             ▼            ▼
┌─────────────┐ ┌─────────────┐ ┌───────────┐ ┌──────────┐
│ /tmp/       │ │ Combine     │ │ Structured│ │ Download │
│ • pos_revs/ │→│ • Count     │→│ DataFrame │→│ CSV File │
│ • neg_revs/ │ │ • Merge     │ │ • Metrics │ │          │
│ • agg_data/ │ │ • Calculate │ │ • Reviews │ │          │
└─────────────┘ └─────────────┘ └───────────┘ └──────────┘

Performance Optimizations

1. GPU Model Distribution Strategy

Challenge: Distribute ML models to multiple workers on single GPU

The Serialization Problem: Imagine you have a recipe book (ML model) and you try to tear out individual pages to give different pages to different chefs (workers). The problem is that recipes are interconnected - Chef A gets page 5 (which says "add the mixture from step 3"), but Chef B has page 3 (which explains what "the mixture" is). Neither chef can cook properly because they only have fragments of the complete recipe.

Technical Details:

Dask.scatter() Fragmentation: Traditional scatter tries to break models into pieces and distribute fragments
Model Interconnectedness: ML model layers reference each other, weights depend on other weights
Incomplete Models Fail: Workers receiving fragmented models cannot perform inference
Models Need Integrity: Each worker requires the complete, intact model to function

# This FAILS - Cannot serialize CUDA tensors
model = SentenceTransformer('model', device='cuda')  # Model on GPU
client.scatter(model)  # ERROR: Cannot serialize CUDA tensors!

Solution: CPU-Serialize-GPU pattern

# Step 1: "Translate" the recipe book to common language (move to CPU)
model.to('cpu')                              # Move model to CPU memory
for param in model.parameters():
    param.data = param.data.cpu()           # Ensure ALL components are on CPU

# Step 2: "Photocopy and distribute" (serialize and send to workers)
client.publish_dataset(model, name='model') # Successfully distribute CPU model

# Step 3: Each kitchen "translates back" (workers move to GPU)
# In each worker:
worker_model = get_dataset('model')          # Get CPU copy
worker_model.to('cuda')                      # Move to shared GPU
# Result: 4 model instances on 1 GPU (efficient memory sharing)

Why This Works:

CPU models are just arrays of numbers (easily serializable)
Each worker gets its own independent copy of the model
Workers can move their copies to GPU without conflicts
Multiple model instances can coexist on the same GPU efficiently

2. Optimal Chunk Sizing

Previously handled inefficienty (pre commit 90) the whole dataset was given to the workers causing memory issues, now handled with chunking instead. Hardware-Aware Configuration:

# System: RTX 4080 Super (16GB VRAM) + Ryzen 7700X + 32GB RAM
PROCESSING_CONFIG = {
    'chunk_size': 700,        # Optimized for GPU batch processing
    'n_workers': 8,             # Leave 2 cores for system
    'memory_per_worker': '4GB', # 24GB total (8GB system reserve)
    'gpu_batch_size': 512,      # Maximum GPU utilization
}

3. Memory Management Strategy

# Aggressive cleanup to prevent memory leaks
del large_dataframe
gc.collect()                    # Force garbage collection
torch.cuda.empty_cache()       # Clear GPU memory

Design Patterns

1. Producer-Consumer with Dask Delayed

# Producer: Create work items (lazy evaluation)
tasks = [delayed(process_chunk)(chunk) for chunk in chunks]

# Consumer: Execute and collect results
results = client.compute(tasks)

2. Temporary Storage Pattern

# Intermediate results stored in structured temp directories
temp_structure = {
    'base': '/tmp/steamlens_temp_123/',
    'aggregations': 'aggregations/',     # Count data
    'positive_reviews': 'positive_reviews/', # Positive review text
    'negative_reviews': 'negative_reviews/', # Negative review text
}

3. Progress Monitoring with Futures

# Non-blocking progress tracking
for future in as_completed(futures):
    result = future.result()
    update_progress_display(completed, total)
    log_performance_metrics(result)

GPU Resource Management

Multi-Worker Single-GPU Architecture

This system uses an unconventional but efficient approach for GPU utilization:

# Standard Industry Approach:
1 GPU = 1 Worker = 1 Model Instance

# steamLensAI Approach:
1 GPU = 4 Workers = 4 Model Instances (shared GPU memory)

Benefits:

100% GPU utilization through intelligent resource sharing
Memory efficient (shared model weights across workers)
Maximum performance extraction from single GPU hardware

Why This Works:

PyTorch handles concurrent GPU access efficiently
Modern GPUs (RTX 4080) have sufficient memory bandwidth for multiple models
Batch processing workloads tolerate occasional synchronization overhead

System Requirements

Hardware Requirements

GPU: CUDA-capable GPU
Other: All other specifications configurable via app_config.py

Performance Scaling

# Observed throughput:
1. 1.2M reviews in 2:19 = ~8,633 reviews/second
1. 1.2M reviews in 8:02 = ~2,489 reviews/second
# Scaling characteristics:
- CPU-bound: Linear scaling with cores (up to ~8 cores for my computer)
- GPU-bound: Linear scaling with cores (I can hold at most 12 workers at a time in my GPU but the GPU hits 100% at 8 workers)
- I/O-bound: Depends on storage speed (biggest challenge as chunking requires multiple IO reads which can't be parallelized)

The followin are terminal images showing multiple models being loaded in the GPU at the same time: Terminal for topic assignmnet Terminal for summarization

Configuration Management

Hardware-Adaptive Configuration

The system automatically detects hardware and adjusts processing parameters:

# config/app_config.py
if SYSTEM_MEMORY_GB >= 32 and GPU_MEMORY_GB >= 16:
    PROCESSING_CONFIG = {
        'n_workers': 6,              # More workers for high-end system
        'memory_per_worker': '4GB',  # Higher memory per worker  
        'gpu_batch_size': 512,       # Larger batches for 16GB VRAM
        'chunk_size': 25000,         # Larger chunks for efficiency
    }
elif SYSTEM_MEMORY_GB >= 16:
    # Mid-range configuration
    PROCESSING_CONFIG = { ... }
else:
    # Low-memory configuration  
    PROCESSING_CONFIG = { ... }

Monitoring & Observability

Built-in Monitoring

Dask Dashboard: Real-time worker and task monitoring
Progress Tracking: Chunk-level completion monitoring
Performance Metrics: Throughput and timing statistics
Resource Usage: Memory and GPU utilization tracking

Debugging Features

Unique Chunk IDs: Granular task tracking and error isolation
Timing Metrics: Phase-by-phase performance analysis
Error Handling: Graceful failure recovery and reporting

Outputs:

topic assignment output summarization output

Results:

The following is the results page:

Key Engineering Decisions

Decision	Reasoning	Trade-offs
Dask over Ray	Simpler setup(idk ray, will try it out tho), better Pandas integration	Less GPU-native features
Single-GPU Multi-Worker	Maximum hardware utilization	Complex memory management
Temporary File Storage	Fault tolerance, progress tracking	Disk space usage
CPU Model Distribution	Avoid GPU serialization issues	Extra CPU↔GPU transfers

This architecture prioritizes maximum performance from limited hardware over traditional scalability patterns, making it ideal for research, prototyping, and resource-constrained environments.

Name		Name	Last commit message	Last commit date
Latest commit History 143 Commits
archive/data_gathering_and_trials		archive/data_gathering_and_trials
artifacts/outputs		artifacts/outputs
demo		demo
docs		docs
images		images
notebooks		notebooks
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
RESTRUCTURE_COMPLETE.md		RESTRUCTURE_COMPLETE.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py
uv.lock		uv.lock

License

Matrix030/SteamLensAI

Folders and files

Latest commit

History

Repository files navigation

steamLensAI Architecture Documentation

Architecture Overview

Core Components

0. Models Used

Sentence Transformer Model

Summarization Model

1. Distributed Processing Engine (processing/process_files.py)

2. Topic Assignment Module (processing/topic_assignment.py)

3. Summarization Pipeline (processing/summarization.py + summarize_processor.py)

Data Flow Architecture

Phase 1: File Processing & Validation

Phase 2: Distributed Processing Architecture

Phase 3: Result Aggregation

Performance Optimizations

1. GPU Model Distribution Strategy

2. Optimal Chunk Sizing

3. Memory Management Strategy

Design Patterns

1. Producer-Consumer with Dask Delayed

2. Temporary Storage Pattern

3. Progress Monitoring with Futures

GPU Resource Management

Multi-Worker Single-GPU Architecture

System Requirements

Hardware Requirements

Performance Scaling

Configuration Management

Hardware-Adaptive Configuration

Monitoring & Observability

Built-in Monitoring

Debugging Features

Outputs:

Results:

Key Engineering Decisions

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

1. Distributed Processing Engine (`processing/process_files.py`)

2. Topic Assignment Module (`processing/topic_assignment.py`)

3. Summarization Pipeline (`processing/summarization.py` + `summarize_processor.py`)

Packages