
Add Optional Model Quantization for Backend Performance Optimization #556

Open

piyush06singhal wants to merge 2 commits into AOSSIE-Org:main from piyush06singhal:perf/model-quantization-optimization

Conversation


piyush06singhal commented Mar 11, 2026

Addressed Issues:

Fixes #554

Additional Notes:

This PR introduces optional model quantization support to reduce memory usage and improve inference performance for the EduAid backend models. The feature is disabled by default and can be enabled through environment variables.

Key points:

  • Adds optional quantization for transformer models
  • Supports INT8 dynamic quantization for CPU
  • Supports FP16 half precision for GPU
  • Fully backward compatible
  • No changes to existing APIs or workflows

Description

The EduAid backend loads several large transformer models (T5-large, T5-base, DistilBERT, BERT) using FP32 precision, which leads to high memory consumption and increased computational load.

This PR introduces an optional quantization mechanism to reduce memory usage and improve inference performance. Quantization is applied immediately after model loading and device placement without modifying the existing architecture.
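For illustration, the hook is meant to sit right after the existing load-and-place step. A minimal sketch, assuming a T5 model and the helper described further below (the actual call sites in backend/Generator/main.py may differ):

import torch
from transformers import T5ForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load and place the model exactly as the backend already does ...
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)

# ... then optionally quantize it. apply_quantization is a no-op unless enabled
# via environment variables (see the sketch under "Supported Quantization Modes").
model = apply_quantization(model, device)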

Changes Made

Modified Files

backend/Generator/main.py

  • Added configuration import with fallback handling (see the sketch after this list)
  • Implemented apply_quantization() helper function
  • Applied quantization to transformer models after loading
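
A plausible shape for the import fallback mentioned above, assuming the defaults mirror backend/config.py (the committed code may differ):

try:
    from config import ENABLE_MODEL_QUANTIZATION, MODEL_PRECISION
except ImportError:
    # Safe defaults when the config module cannot be imported: quantization stays off.
    ENABLE_MODEL_QUANTIZATION = False
    MODEL_PRECISION = "int8"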

New File

backend/config.py

  • Configuration module for quantization settings
  • Reads configuration values from environment variables
  • Provides default safe values
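
A minimal sketch of what this module could look like, based on the environment variables documented under Usage and the defaults noted in the review walkthrough (the committed file may differ):

import os

# Master switch: quantization is off unless explicitly enabled.
ENABLE_MODEL_QUANTIZATION = os.getenv("ENABLE_MODEL_QUANTIZATION", "false").lower() == "true"

# Requested precision: "int8" (CPU dynamic quantization) or "fp16" (GPU half precision).
MODEL_PRECISION = os.getenv("MODEL_PRECISION", "int8").lower()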

Supported Quantization Modes

INT8 (CPU)

  • Uses PyTorch dynamic quantization

FP16 (GPU)

  • Uses half precision inference

Quantization is only applied when explicitly enabled.
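
Putting the two modes together, a minimal sketch of apply_quantization(model, device), assuming the graceful FP32 fallback described under Testing (the actual helper in backend/Generator/main.py may differ in detail):

import logging
import torch

from config import ENABLE_MODEL_QUANTIZATION, MODEL_PRECISION

logger = logging.getLogger(__name__)


def apply_quantization(model, device):
    """Optionally quantize a loaded model; return it unchanged if anything fails."""
    if not ENABLE_MODEL_QUANTIZATION:
        return model
    try:
        if MODEL_PRECISION == "fp16" and device.type == "cuda":
            # Half-precision inference on GPU.
            logger.info("Applying FP16 half precision for GPU inference")
            return model.half()
        if MODEL_PRECISION == "int8" and device.type == "cpu":
            # PyTorch dynamic quantization of Linear layers on CPU.
            logger.info("Applying INT8 dynamic quantization for CPU inference")
            return torch.quantization.quantize_dynamic(
                model, {torch.nn.Linear}, dtype=torch.qint8
            )
        logger.info(f"Quantization skipped: precision={MODEL_PRECISION}, device={device.type}")
    except Exception as e:
        logger.warning(f"Quantization failed, falling back to FP32: {e}")
    return model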

Usage

Enable INT8 (CPU)

export ENABLE_MODEL_QUANTIZATION=true
export MODEL_PRECISION=int8
python server.py

Enable FP16 (GPU)

export ENABLE_MODEL_QUANTIZATION=true
export MODEL_PRECISION=fp16
python server.py

Default behavior (quantization disabled)

python server.py

Testing

Verified that:

  • Backend runs normally when quantization is disabled
  • INT8 quantization works correctly on CPU
  • FP16 quantization works correctly on GPU
  • All endpoints function as expected
  • Model outputs remain valid
  • Quantization only activates when enabled
  • System gracefully falls back to FP32 if quantization fails

Breaking Changes

None.
This change is fully backward compatible.

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Check one of the checkboxes below:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: TODO

AI Usage Disclosure:

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.

Summary by CodeRabbit

  • New Features

    • Configurable, device-aware model quantization (precision control and safe fallbacks)
    • Outputs now include timing metadata for question-generation flows
  • Bug Fixes

    • Ensures model inputs run on the correct device and cleans CUDA buffers after quantized inference
  • Chores

    • Added lightweight runtime logging for quantization actions and decisions


coderabbitai Bot commented Mar 11, 2026

📝 Walkthrough

Adds an optional, config-driven model quantization workflow and a centralized apply_quantization(model, device) function; applies quantization (FP16 on CUDA, INT8 on CPU) to multiple generators and nested sub-models, adjusts device handling and final_output initialization, and adds CUDA buffer cleanup and logging.

Changes

Cohort / File(s) / Summary

Configuration: backend/config.py
New config flags: ENABLE_MODEL_QUANTIZATION (env-driven bool) and MODEL_PRECISION (env-driven string, default "int8").

Quantization & Generators: backend/Generator/main.py
Adds apply_quantization(model, device) and imports config; applies quantization to many generators and nested sub-models (MCQGenerator, ShortQGenerator, ParaphraseGenerator, BoolQGenerator, AnswerPredictor, QAEvaluator, QG/NLI/QA submodels); moves inputs/tensors to device for quantized paths; adds CUDA buffer cleanup after quantized inference; initializes/extends final_output structures with time_taken; adds logging and safe fallbacks on quantization errors.
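
For reference, the "CUDA buffer cleanup after quantized inference" mentioned above typically amounts to something like the following (an assumption about the implementation, not a quote from the diff):

import torch

# After a quantized/FP16 forward pass on GPU, release cached allocations back to the driver.
if torch.cuda.is_available():
    torch.cuda.empty_cache()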

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Config as Config\n(backend/config.py)
participant Generator as Generator\n(backend/Generator/main.py)
participant Model as ModelLoader
participant Quant as apply_quantization()
participant Device as Device\n(CUDA / CPU)
Config->>Generator: ENABLE_MODEL_QUANTIZATION, MODEL_PRECISION
Generator->>Model: load model instance
Model->>Quant: pass model + device
Quant->>Device: decide FP16 (CUDA) or INT8 (CPU)
Quant-->>Model: return quantized or original model
Generator->>Device: move inputs/tensors for inference
Generator->>Model: run inference
Model->>Generator: outputs
Generator->>Device: cleanup CUDA buffers (if applicable)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped through configs, tiny and spry,

Quantized my models to leap to the sky,
FP16 for CUDA, INT8 for the rest,
Lighter, quicker—my code does its best!

🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 14.29%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately reflects the main objective of the PR: adding optional model quantization for backend performance optimization.
  • Linked Issues check: ✅ Passed. The PR implements all coding objectives from issue #554: optional quantization during model loading, with INT8 for CPU and FP16 for GPU via configuration flags, while maintaining backward compatibility and graceful fallback.
  • Out of Scope Changes check: ✅ Passed. All changes align with issue #554: quantization implementation in main.py, configuration file creation, and application to transformer models, with no unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 65-66: Replace the print call in the except block that catches
"except Exception as e:" with a logger warning call to match the module's
logging usage; update the handler in main.py (the quantization exception block)
to call logger.warning with a clear message and include the exception object (e)
so the warning contains the error details instead of using print.
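
A minimal sketch of the change being requested, assuming the module already configures a logger and that the except block wraps the quantization call (exact wording is illustrative):

import logging

logger = logging.getLogger(__name__)

try:
    model = apply_quantization(model, device)
except Exception as e:
    # Route the failure through the module logger (including the exception details) instead of print().
    logger.warning(f"Quantization failed, falling back to FP32: {e}")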

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 748cccfc-4ec4-42cf-bf7a-fe374d7ffe82

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and 8e82729.

📒 Files selected for processing (2)
  • backend/Generator/main.py
  • backend/config.py

Comment thread on backend/Generator/main.py (Outdated)

@piyush06singhal (Author)

@coderabbitai review


coderabbitai Bot commented Mar 11, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai Bot left a comment


🧹 Nitpick comments (2)
backend/Generator/main.py (2)

50-64: Consider logging a clearer message when precision/device mismatch occurs.

When MODEL_PRECISION='fp16' but device is CPU, or MODEL_PRECISION='int8' but device is CUDA, quantization is silently skipped with only an info-level log. Users enabling quantization might not realize it's not being applied due to a configuration mismatch.

Consider making this more explicit:

Proposed enhancement for clarity
         elif MODEL_PRECISION == 'int8' and device.type == 'cpu':
             # Apply INT8 dynamic quantization for CPU
             logger.info("Applying INT8 dynamic quantization for CPU inference")
             model = torch.quantization.quantize_dynamic(
                 model,
                 {torch.nn.Linear},
                 dtype=torch.qint8
             )
+        elif MODEL_PRECISION == 'fp16' and device.type == 'cpu':
+            logger.warning("FP16 quantization requested but device is CPU; skipping (FP16 requires CUDA)")
+        elif MODEL_PRECISION == 'int8' and device.type == 'cuda':
+            logger.warning("INT8 quantization requested but device is CUDA; skipping (INT8 dynamic quantization is CPU-only)")
         else:
             logger.info(f"Quantization skipped: precision={MODEL_PRECISION}, device={device.type}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 50 - 64, The quantization branch
silently skips when MODEL_PRECISION and device.type mismatch; update the logic
around MODEL_PRECISION/device.type checks (the block using MODEL_PRECISION,
device.type, logger.info, model.half(), and torch.quantization.quantize_dynamic)
to emit a clearer, higher-severity log (e.g., logger.warning or logger.error)
when a requested precision cannot be applied due to device mismatch, including
both the requested precision and the actual device (e.g., "Requested fp16 but
device is cpu — skipping quantization"), while keeping existing
successful-application logs as-is.

169-171: API inconsistency: ShortQGenerator omits time_taken field.

MCQGenerator.generate_mcq() returns {"questions": [], "time_taken": 0} but ShortQGenerator.generate_shortq() returns {"questions": []}. This inconsistency could confuse API consumers expecting a uniform response structure.

Consider adding timing to ShortQGenerator for consistency, or document the difference if intentional.
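
A minimal sketch of that suggestion, assuming ShortQGenerator.generate_shortq already produces its questions list via some existing helper (the helper name below is hypothetical):

import time


def generate_shortq(self, payload):
    start = time.time()
    questions = self._generate_short_questions(payload)  # hypothetical existing helper
    # Mirror MCQGenerator.generate_mcq's response shape by reporting elapsed seconds.
    final_output = {"questions": questions, "time_taken": time.time() - start}
    return final_output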

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 169 - 171,
ShortQGenerator.generate_shortq currently builds final_output = {"questions":
[]} which omits the time_taken key that MCQGenerator.generate_mcq includes;
update generate_shortq to compute and include a time_taken value (e.g., measure
start/end or elapsed seconds) so final_output matches the same shape as
MCQGenerator.generate_mcq, ensuring the returned object contains both
"questions" and "time_taken" keys.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6da7ffe2-14f9-473c-8a1a-e5da6f2196d2

📥 Commits

Reviewing files that changed from the base of the PR and between 8e82729 and 9f7922e.

📒 Files selected for processing (1)
  • backend/Generator/main.py


Development

Successfully merging this pull request may close these issues.

[Enhancement] Add Optional Model Quantization to Reduce Memory Usage and Improve Inference Performance

1 participant