
Add Optional Model Quantization for Backend Performance Optimization #556

Open

piyush06singhal wants to merge 2 commits into AOSSIE-Org:main from piyush06singhal:perf/model-quantization-optimization

Conversation


piyush06singhal commented Mar 11, 2026

Addressed Issues:

Fixes #554

Additional Notes:

This PR introduces optional model quantization support to reduce memory usage and improve inference performance for the EduAid backend models. The feature is disabled by default and can be enabled through environment variables.

Key points:

  • Adds optional quantization for transformer models
  • Supports INT8 dynamic quantization for CPU
  • Supports FP16 half precision for GPU
  • Fully backward compatible
  • No changes to existing APIs or workflows

Description

The EduAid backend loads several large transformer models (T5-large, T5-base, DistilBERT, BERT) using FP32 precision, which leads to high memory consumption and increased computational load.

This PR introduces an optional quantization mechanism to reduce memory usage and improve inference performance. Quantization is applied immediately after model loading and device placement without modifying the existing architecture.
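For illustration, the hook is meant to sit right after the existing load-and-place step. A minimal sketch, assuming a T5 model and the helper described further below (the actual call sites in backend/Generator/main.py may differ):

import torch
from transformers import T5ForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load and place the model exactly as the backend already does ...
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)

# ... then optionally quantize it. apply_quantization is a no-op unless enabled
# via environment variables (see the sketch under "Supported Quantization Modes").
model = apply_quantization(model, device)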

Changes Made

Modified Files

backend/Generator/main.py

  • Added configuration import with fallback handling (see the sketch after this list)
  • Implemented apply_quantization() helper function
  • Applied quantization to transformer models after loading
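
A plausible shape for the import fallback mentioned above, assuming the defaults mirror backend/config.py (the committed code may differ):

try:
    from config import ENABLE_MODEL_QUANTIZATION, MODEL_PRECISION
except ImportError:
    # Safe defaults when the config module cannot be imported: quantization stays off.
    ENABLE_MODEL_QUANTIZATION = False
    MODEL_PRECISION = "int8"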

New File

backend/config.py

  • Configuration module for quantization settings
  • Reads configuration values from environment variables
  • Provides default safe values
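
A minimal sketch of what this module could look like, based on the environment variables documented under Usage and the defaults noted in the review walkthrough (the committed file may differ):

import os

# Master switch: quantization is off unless explicitly enabled.
ENABLE_MODEL_QUANTIZATION = os.getenv("ENABLE_MODEL_QUANTIZATION", "false").lower() == "true"

# Requested precision: "int8" (CPU dynamic quantization) or "fp16" (GPU half precision).
MODEL_PRECISION = os.getenv("MODEL_PRECISION", "int8").lower()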

Supported Quantization Modes

INT8 (CPU)

  • Uses PyTorch dynamic quantization

FP16 (GPU)

  • Uses half precision inference

Quantization is only applied when explicitly enabled.
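
Putting the two modes together, a minimal sketch of apply_quantization(model, device), assuming the graceful FP32 fallback described under Testing (the actual helper in backend/Generator/main.py may differ in detail):

import logging
import torch

from config import ENABLE_MODEL_QUANTIZATION, MODEL_PRECISION

logger = logging.getLogger(__name__)


def apply_quantization(model, device):
    """Optionally quantize a loaded model; return it unchanged if anything fails."""
    if not ENABLE_MODEL_QUANTIZATION:
        return model
    try:
        if MODEL_PRECISION == "fp16" and device.type == "cuda":
            # Half-precision inference on GPU.
            logger.info("Applying FP16 half precision for GPU inference")
            return model.half()
        if MODEL_PRECISION == "int8" and device.type == "cpu":
            # PyTorch dynamic quantization of Linear layers on CPU.
            logger.info("Applying INT8 dynamic quantization for CPU inference")
            return torch.quantization.quantize_dynamic(
                model, {torch.nn.Linear}, dtype=torch.qint8
            )
        logger.info(f"Quantization skipped: precision={MODEL_PRECISION}, device={device.type}")
    except Exception as e:
        logger.warning(f"Quantization failed, falling back to FP32: {e}")
    return model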

Usage

Enable INT8 (CPU)

export ENABLE_MODEL_QUANTIZATION=true
export MODEL_PRECISION=int8
python server.py

Enable FP16 (GPU)

export ENABLE_MODEL_QUANTIZATION=true
export MODEL_PRECISION=fp16
python server.py

Default behavior (quantization disabled)

python server.py

Testing

Verified that:

  • Backend runs normally when quantization is disabled
  • INT8 quantization works correctly on CPU
  • FP16 quantization works correctly on GPU
  • All endpoints function as expected
  • Model outputs remain valid
  • Quantization only activates when enabled
  • System gracefully falls back to FP32 if quantization fails

Breaking Changes

None.
This change is fully backward compatible.

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Check one of the checkboxes below:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: TODO

AI Usage Disclosure:

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.

Summary by CodeRabbit

  • New Features

    • Configurable, device-aware model quantization (precision control and safe fallbacks)
    • Outputs now include timing metadata for question-generation flows
  • Bug Fixes

    • Ensures model inputs run on the correct device and cleans CUDA buffers after quantized inference
  • Chores

    • Added lightweight runtime logging for quantization actions and decisions


coderabbitai Bot commented Mar 11, 2026

📝 Walkthrough

Adds an optional, config-driven model quantization workflow and a centralized apply_quantization(model, device) function; applies quantization (FP16 on CUDA, INT8 on CPU) to multiple generators and nested sub-models, adjusts device handling and final_output initialization, and adds CUDA buffer cleanup and logging.

Changes

Cohort / File(s) / Summary

Configuration: backend/config.py
New config flags: ENABLE_MODEL_QUANTIZATION (env-driven bool) and MODEL_PRECISION (env-driven string, default "int8").

Quantization & Generators: backend/Generator/main.py
Adds apply_quantization(model, device) and imports config; applies quantization to many generators and nested sub-models (MCQGenerator, ShortQGenerator, ParaphraseGenerator, BoolQGenerator, AnswerPredictor, QAEvaluator, QG/NLI/QA submodels); moves inputs/tensors to device for quantized paths; adds CUDA buffer cleanup after quantized inference; initializes/extends final_output structures with time_taken; adds logging and safe fallbacks on quantization errors.
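
For reference, the "CUDA buffer cleanup after quantized inference" mentioned above typically amounts to something like the following (an assumption about the implementation, not a quote from the diff):

import torch

# After a quantized/FP16 forward pass on GPU, release cached allocations back to the driver.
if torch.cuda.is_available():
    torch.cuda.empty_cache()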

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Config as Config\n(backend/config.py)
participant Generator as Generator\n(backend/Generator/main.py)
participant Model as ModelLoader
participant Quant as apply_quantization()
participant Device as Device\n(CUDA / CPU)
Config->>Generator: ENABLE_MODEL_QUANTIZATION, MODEL_PRECISION
Generator->>Model: load model instance
Model->>Quant: pass model + device
Quant->>Device: decide FP16 (CUDA) or INT8 (CPU)
Quant-->>Model: return quantized or original model
Generator->>Device: move inputs/tensors for inference
Generator->>Model: run inference
Model->>Generator: outputs
Generator->>Device: cleanup CUDA buffers (if applicable)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped through configs, tiny and spry,

Quantized my models to leap to the sky,
FP16 for CUDA, INT8 for the rest,
Lighter, quicker—my code does its best!

🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 14.29%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately reflects the main objective of the PR: adding optional model quantization for backend performance optimization.
  • Linked Issues check: ✅ Passed. The PR implements all coding objectives from issue #554: optional quantization during model loading, with INT8 for CPU and FP16 for GPU via configuration flags, while maintaining backward compatibility and graceful fallback.
  • Out of Scope Changes check: ✅ Passed. All changes align with issue #554: quantization implementation in main.py, configuration file creation, and application to transformer models, with no unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 65-66: Replace the print call in the except block that catches
"except Exception as e:" with a logger warning call to match the module's
logging usage; update the handler in main.py (the quantization exception block)
to call logger.warning with a clear message and include the exception object (e)
so the warning contains the error details instead of using print.
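
A minimal sketch of the change being requested, assuming the module already configures a logger and that the except block wraps the quantization call (exact wording is illustrative):

import logging

logger = logging.getLogger(__name__)

try:
    model = apply_quantization(model, device)
except Exception as e:
    # Route the failure through the module logger (including the exception details) instead of print().
    logger.warning(f"Quantization failed, falling back to FP32: {e}")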

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 748cccfc-4ec4-42cf-bf7a-fe374d7ffe82

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and 8e82729.

📒 Files selected for processing (2)
  • backend/Generator/main.py
  • backend/config.py

Comment thread on backend/Generator/main.py (Outdated)

@piyush06singhal (Author)

@coderabbitai review


coderabbitai Bot commented Mar 11, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai Bot left a comment


🧹 Nitpick comments (2)
backend/Generator/main.py (2)

50-64: Consider logging a clearer message when precision/device mismatch occurs.

When MODEL_PRECISION='fp16' but device is CPU, or MODEL_PRECISION='int8' but device is CUDA, quantization is silently skipped with only an info-level log. Users enabling quantization might not realize it's not being applied due to a configuration mismatch.

Consider making this more explicit:

Proposed enhancement for clarity
         elif MODEL_PRECISION == 'int8' and device.type == 'cpu':
             # Apply INT8 dynamic quantization for CPU
             logger.info("Applying INT8 dynamic quantization for CPU inference")
             model = torch.quantization.quantize_dynamic(
                 model,
                 {torch.nn.Linear},
                 dtype=torch.qint8
             )
+        elif MODEL_PRECISION == 'fp16' and device.type == 'cpu':
+            logger.warning("FP16 quantization requested but device is CPU; skipping (FP16 requires CUDA)")
+        elif MODEL_PRECISION == 'int8' and device.type == 'cuda':
+            logger.warning("INT8 quantization requested but device is CUDA; skipping (INT8 dynamic quantization is CPU-only)")
         else:
             logger.info(f"Quantization skipped: precision={MODEL_PRECISION}, device={device.type}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 50 - 64, The quantization branch
silently skips when MODEL_PRECISION and device.type mismatch; update the logic
around MODEL_PRECISION/device.type checks (the block using MODEL_PRECISION,
device.type, logger.info, model.half(), and torch.quantization.quantize_dynamic)
to emit a clearer, higher-severity log (e.g., logger.warning or logger.error)
when a requested precision cannot be applied due to device mismatch, including
both the requested precision and the actual device (e.g., "Requested fp16 but
device is cpu — skipping quantization"), while keeping existing
successful-application logs as-is.

169-171: API inconsistency: ShortQGenerator omits time_taken field.

MCQGenerator.generate_mcq() returns {"questions": [], "time_taken": 0} but ShortQGenerator.generate_shortq() returns {"questions": []}. This inconsistency could confuse API consumers expecting a uniform response structure.

Consider adding timing to ShortQGenerator for consistency, or document the difference if intentional.
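
A minimal sketch of that suggestion, assuming ShortQGenerator.generate_shortq already produces its questions list via some existing helper (the helper name below is hypothetical):

import time


def generate_shortq(self, payload):
    start = time.time()
    questions = self._generate_short_questions(payload)  # hypothetical existing helper
    # Mirror MCQGenerator.generate_mcq's response shape by reporting elapsed seconds.
    final_output = {"questions": questions, "time_taken": time.time() - start}
    return final_output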

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 169 - 171,
ShortQGenerator.generate_shortq currently builds final_output = {"questions":
[]} which omits the time_taken key that MCQGenerator.generate_mcq includes;
update generate_shortq to compute and include a time_taken value (e.g., measure
start/end or elapsed seconds) so final_output matches the same shape as
MCQGenerator.generate_mcq, ensuring the returned object contains both
"questions" and "time_taken" keys.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6da7ffe2-14f9-473c-8a1a-e5da6f2196d2

📥 Commits

Reviewing files that changed from the base of the PR and between 8e82729 and 9f7922e.

📒 Files selected for processing (1)
  • backend/Generator/main.py


Development

Successfully merging this pull request may close these issues.

[Enhancement] Add Optional Model Quantization to Reduce Memory Usage and Improve Inference Performance

1 participant