Add Optional Model Quantization for Backend Performance Optimization #556
piyush06singhal wants to merge 2 commits into AOSSIE-Org:main from …
Conversation
📝 Walkthrough
Adds an optional, config-driven model quantization workflow and a centralized apply_quantization(model, device) function; applies quantization (FP16 on CUDA, INT8 on CPU) to multiple generators and nested sub-models, adjusts device handling and final_output initialization, and adds CUDA buffer cleanup and logging.
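The CUDA buffer cleanup mentioned here is presumably along these lines, a minimal sketch rather than the PR's exact code:

```python
import torch

def cleanup_cuda_buffers():
    """Release cached GPU memory after a generation pass (sketch)."""
    if torch.cuda.is_available():
        # Frees unused blocks held by PyTorch's caching allocator
        torch.cuda.empty_cache()
```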
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/main.py`:
- Around lines 65-66: Replace the print call in the except block that catches `except Exception as e:` with a logger warning, matching the module's logging usage; update the quantization exception handler in main.py to call logger.warning with a clear message that includes the exception object (e), so the warning carries the error details instead of going to stdout via print.
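A minimal sketch of the suggested fix, assuming the handler wraps the quantization call (the message text, surrounding names, and the simulated failure are assumptions for illustration):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def apply_quantization(model, device):
    # Stand-in for the PR's helper in backend/Generator/main.py
    raise RuntimeError("simulated quantization failure")

model, device = object(), None  # placeholders for the loaded model and torch device

try:
    model = apply_quantization(model, device)
except Exception as e:
    # was: print(...); route through the module logger so the error details land in logs
    logger.warning("Quantization failed, continuing with unquantized model: %s", e)
```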
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 748cccfc-4ec4-42cf-bf7a-fe374d7ffe82
📒 Files selected for processing (2)
- backend/Generator/main.py
- backend/config.py
@coderabbitai review
✅ Actions performed: Review triggered.
🧹 Nitpick comments (2)
backend/Generator/main.py (2)
50-64: Consider logging a clearer message when a precision/device mismatch occurs.
When MODEL_PRECISION='fp16' but the device is CPU, or MODEL_PRECISION='int8' but the device is CUDA, quantization is silently skipped with only an info-level log. Users enabling quantization might not realize it's not being applied due to a configuration mismatch. Consider making this more explicit:
Proposed enhancement for clarity
```diff
     elif MODEL_PRECISION == 'int8' and device.type == 'cpu':
         # Apply INT8 dynamic quantization for CPU
         logger.info("Applying INT8 dynamic quantization for CPU inference")
         model = torch.quantization.quantize_dynamic(
             model, {torch.nn.Linear}, dtype=torch.qint8
         )
+    elif MODEL_PRECISION == 'fp16' and device.type == 'cpu':
+        logger.warning("FP16 quantization requested but device is CPU; skipping (FP16 requires CUDA)")
+    elif MODEL_PRECISION == 'int8' and device.type == 'cuda':
+        logger.warning("INT8 quantization requested but device is CUDA; skipping (INT8 dynamic quantization is CPU-only)")
     else:
         logger.info(f"Quantization skipped: precision={MODEL_PRECISION}, device={device.type}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 50 - 64, The quantization branch silently skips when MODEL_PRECISION and device.type mismatch; update the logic around MODEL_PRECISION/device.type checks (the block using MODEL_PRECISION, device.type, logger.info, model.half(), and torch.quantization.quantize_dynamic) to emit a clearer, higher-severity log (e.g., logger.warning or logger.error) when a requested precision cannot be applied due to device mismatch, including both the requested precision and the actual device (e.g., "Requested fp16 but device is cpu — skipping quantization"), while keeping existing successful-application logs as-is.
169-171: API inconsistency: ShortQGenerator omits the time_taken field.
MCQGenerator.generate_mcq() returns {"questions": [], "time_taken": 0} but ShortQGenerator.generate_shortq() returns {"questions": []}. This inconsistency could confuse API consumers expecting a uniform response structure. Consider adding timing to ShortQGenerator for consistency, or document the difference if intentional.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 169 - 171, ShortQGenerator.generate_shortq currently builds final_output = {"questions": []} which omits the time_taken key that MCQGenerator.generate_mcq includes; update generate_shortq to compute and include a time_taken value (e.g., measure start/end or elapsed seconds) so final_output matches the same shape as MCQGenerator.generate_mcq, ensuring the returned object contains both "questions" and "time_taken" keys.
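A minimal sketch of one way to align the two response shapes; the question-generation body is elided and the payload shape is assumed:

```python
import time

class ShortQGenerator:
    def generate_shortq(self, payload):
        start = time.time()
        questions = []  # the real method populates this with generated short-answer questions
        # Mirror MCQGenerator.generate_mcq's response shape by always reporting time_taken
        final_output = {"questions": questions, "time_taken": time.time() - start}
        return final_output

# Usage sketch:
print(ShortQGenerator().generate_shortq({"text": "..."}))
```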
Addressed Issues:
Fixes #554
Additional Notes:
This PR introduces optional model quantization support to reduce memory usage and improve inference performance for the EduAid backend models. The feature is disabled by default and can be enabled through environment variables.
Description
The EduAid backend loads several large transformer models (T5-large, T5-base, DistilBERT, BERT) using FP32 precision, which leads to high memory consumption and increased computational load.
This PR introduces an optional quantization mechanism to reduce memory usage and improve inference performance. Quantization is applied immediately after model loading and device placement without modifying the existing architecture.
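For rough scale (ballpark figures, not from the PR): T5-large has roughly 770M parameters, so its weights alone occupy about 770M × 4 bytes ≈ 3.1 GB in FP32 versus ≈ 1.5 GB in FP16, while INT8 dynamic quantization stores Linear-layer weights at 1 byte each, so actual savings depend on how much of each model sits in torch.nn.Linear layers.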
Changes Made
Modified Files
- backend/Generator/main.py: added the apply_quantization() helper function

New File
- backend/config.py

Supported Quantization Modes
- INT8 (CPU)
- FP16 (GPU)

Quantization is only applied when explicitly enabled; a sketch of the helper follows below.
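A minimal sketch of the helper, reconstructed from the walkthrough and the review diff above; the ENABLE_QUANTIZATION flag name and the config wiring are assumptions:

```python
import logging
import torch

logger = logging.getLogger(__name__)

# Assumed config values; the PR loads its settings from backend/config.py.
# Only MODEL_PRECISION and its 'fp16'/'int8' values appear in the review diff;
# ENABLE_QUANTIZATION is a hypothetical name for the opt-in switch.
ENABLE_QUANTIZATION = True
MODEL_PRECISION = "int8"

def apply_quantization(model, device):
    """Optionally quantize a model after loading and device placement (sketch)."""
    if not ENABLE_QUANTIZATION:
        return model
    if MODEL_PRECISION == "fp16" and device.type == "cuda":
        logger.info("Applying FP16 (half precision) for CUDA inference")
        model = model.half()
    elif MODEL_PRECISION == "int8" and device.type == "cpu":
        logger.info("Applying INT8 dynamic quantization for CPU inference")
        model = torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8
        )
    else:
        logger.info(f"Quantization skipped: precision={MODEL_PRECISION}, device={device.type}")
    return model
```

Applying it is then a one-liner after each model is loaded and moved to its device, e.g. model = apply_quantization(model, torch.device("cpu")).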
Usage
Enable INT8 (CPU)
Enable FP16 (GPU)
Default behavior (quantization disabled)
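A plausible way to drive the three scenarios above from the environment; the variable names here are assumptions, since only MODEL_PRECISION and its 'int8'/'fp16' values are confirmed by the review diff:

```python
import os

# Hypothetical environment variables for the usage scenarios above.
os.environ["ENABLE_QUANTIZATION"] = "true"
os.environ["MODEL_PRECISION"] = "int8"   # Enable INT8 on CPU; use "fp16" on a CUDA machine

# Default behavior: leave ENABLE_QUANTIZATION unset and models stay in FP32.
```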
Testing
Verified that:
Breaking Changes
None.
This change is fully backward compatible.
Checklist
Check one of the checkboxes below:
I have used the following AI models and tools: TODO
AI Usage Disclosure:
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.
Summary by CodeRabbit
New Features
- Optional, config-driven model quantization (FP16 on CUDA, INT8 on CPU) across the backend generators.

Bug Fixes
- Adjusted device handling and final_output initialization.

Chores
- Added CUDA buffer cleanup and logging.