Skip to content

[Enhancement] Add Optional Model Quantization to Reduce Memory Usage and Improve Inference Performance#555

Open
mahek2016 wants to merge 7 commits intoAOSSIE-Org:mainfrom
mahek2016:enhancement/model-quantization
Open

[Enhancement] Add Optional Model Quantization to Reduce Memory Usage and Improve Inference Performance#555
mahek2016 wants to merge 7 commits intoAOSSIE-Org:mainfrom
mahek2016:enhancement/model-quantization

Conversation

@mahek2016
Copy link
Copy Markdown

@mahek2016 mahek2016 commented Mar 11, 2026

Addressed Issues

Fixes #554

Screenshots/Recordings

Not applicable. This change is related to backend model optimization and does not introduce any UI changes.

Additional Notes

This PR adds optional model quantization to reduce memory usage and improve inference performance for transformer models used in the EduAid backend.

Main changes:

  • Added dynamic INT8 quantization when running on CPU
  • Enabled FP16 precision when CUDA is available
  • Introduced a configuration flag ENABLE_MODEL_QUANTIZATION to optionally enable or disable the optimization
  • Applied the optimization during model loading across generator classes

These improvements help reduce memory consumption and improve inference efficiency without changing the existing functionality of the quiz generation pipeline.

AI Usage Disclosure

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-assisted code. I have reviewed and tested the implementation locally and take responsibility for the changes.

AI tools used: ChatGPT (for assistance and suggestions)

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

  • New Features
    • Added optional model optimization with FP16 support on CUDA GPUs and int8 quantization on CPU devices.
    • Introduced per-request tracing with unique request IDs and enhanced logging capabilities.
    • Added test application for local development.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c827f277-f636-45e8-a15a-99881b6595f8

📥 Commits

Reviewing files that changed from the base of the PR and between e956f55 and de49b7c.

📒 Files selected for processing (2)
  • backend/Generator/main.py
  • backend/test_app.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/test_app.py
  • backend/Generator/main.py

📝 Walkthrough

Walkthrough

The pull request adds optional model quantization to the backend's ML pipeline for reduced memory consumption, introduces per-request tracing infrastructure for logging, and adds a new test Flask application for development purposes.

Changes

Cohort / File(s) Summary
Model Quantization & Optimization
backend/Generator/main.py
Introduces optimize_model() function applying FP16 precision on CUDA or INT8 dynamic quantization on CPU. Adds ENABLE_MODEL_QUANTIZATION flag from environment. Applies optimization to MCQGenerator, ShortQGenerator, ParaphraseGenerator, BoolQGenerator, AnswerPredictor, QuestionGenerator, and NLI components before device transfer. Updates imports to group transformer model classes.
Request Tracing & Logging Infrastructure
backend/server.py
Adds Flask before_request handler that generates unique request IDs (UUID) and logs incoming requests. Introduces global logging configuration and new logger format. Adds imports for g, logging, and uuid4. Replaces get_content endpoint to return disabled Google Docs API message for local development.
Test Application
backend/test_app.py
New Flask application file defining app instance, home route ("/") handler returning "Flask is working", and main guard that starts server with debug mode derived from FLASK_DEBUG environment variable.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰✨ Hop hop, the models now run lean,
Quantized bytes, a memory dream!
With tracing logs and test apps bright,
This backend bounds through data's night! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (3 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR implements core quantization requirements [#554]: ENABLE_MODEL_QUANTIZATION flag, optimize_model() function with INT8 on CPU and FP16 on CUDA, and consistent application across model loading. However, changes to backend/server.py and addition of backend/test_app.py appear unrelated to issue #554's quantization scope. Remove unrelated changes to backend/server.py (request tracing, logging, Google Docs API modifications) and backend/test_app.py (new Flask test app) that are outside the quantization feature scope defined in issue #554.
Out of Scope Changes check ⚠️ Warning Changes to backend/server.py (request tracing, logging, Google Docs API disabling) and addition of backend/test_app.py (new Flask test application) are out of scope relative to issue #554's quantization objectives. Remove backend/server.py modifications and backend/test_app.py entirely as they are not part of the model quantization feature defined in issue #554.
Docstring Coverage ⚠️ Warning Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly matches the main objective: adding optional model quantization to reduce memory usage and improve inference performance, which is the primary change in the backend/Generator/main.py file.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 15-24: ENABLE_MODEL_QUANTIZATION is hard-coded True making
optimize_model (the quantization/FP16 path) mandatory; change it to read a
configurable boolean (e.g., from an environment variable like
MODEL_QUANTIZATION_ENABLED or the existing config loader) and default to False
so behavior stays backward compatible. Update the top-level constant and any
initialization logic used by optimize_model to parse the env/config value
reliably (handle "1"/"true"/"false" forms) so teams can opt-in to quantization
per-deployment without editing code; keep the optimize_model function signature
and internal logic unchanged except for referencing the new config flag.
- Around line 289-291: self.nli_model is optimized but never moved to the target
device and predict_boolean_answer() passes tokenizer outputs on CPU, causing
device mismatches; after calling optimize_model(self.nli_model) move the model
to self.device (e.g., self.nli_model.to(self.device)) and update
predict_boolean_answer() to send tokenized inputs from self.nli_tokenizer to the
same device (convert the dict from encode_plus/encode to tensors and
.to(self.device) for each value) so the model and inputs are co-located and
FP16/CUDA optimization works correctly.

In `@backend/Generator/test_rag.py`:
- Around line 3-19: The file contains demo code that runs at import time
(creating RAGService(), calling RAGService.index_text and RAGService.query and
printing answer), which breaks test discovery; wrap the demo flow including
sample_text, rag = RAGService(), rag.index_text(...), question, answer and the
print statements inside an if __name__ == "__main__": guard OR convert it into a
proper test function named test_* that instantiates RAGService, calls index_text
and query, and asserts the returned answer meets expectations (use sample_text
and question variables and assert on answer) so no side-effectful code runs on
import.
- Line 1: The test imports a missing symbol RAGService (in test_rag.py), causing
ModuleNotFoundError; fix by either landing the RAGService implementation/package
in this PR and exporting it under the rag module (so from rag import RAGService
resolves), or remove/disable test_rag.py and any references to RAGService until
the service exists; if you intend to test an existing API, update the import to
the correct exported class name instead of RAGService.

In `@backend/server.py`:
- Around line 203-208: The /get_content handler (get_content) currently returns
a static success payload which breaks the frontend flow; revert to the original
Google Docs retrieval logic but gate it behind a config flag (e.g.,
ENABLE_GOOGLE_DOCS) so local dev can disable it, and when the flag is false
return a non-200, machine-readable error (e.g., 503 or 410) with an explicit
error field like {"error":"google_docs_disabled","message":"Google Docs service
disabled in local development","retryable":false} so the client can handle it;
also add server-side logging of the disabled state and any real fetch errors so
failures are actionable.

In `@backend/test_app.py`:
- Around line 9-10: Replace the hardcoded debug=True in the main guard (if
__name__ == "__main__") by reading an environment variable and converting it to
a boolean; import os at top, then call app.run(debug=...) where the debug value
comes from something like os.getenv('FLASK_DEBUG', '0') or 'False' and converted
to a boolean (e.g., check for '1', 'true', 'yes' case-insensitively) so the
interactive debugger is enabled only when the FLASK_DEBUG env var is set.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cdbc21b4-d65d-4220-b205-9c697302c2aa

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and e956f55.

📒 Files selected for processing (4)
  • backend/Generator/main.py
  • backend/Generator/test_rag.py
  • backend/server.py
  • backend/test_app.py

Comment thread backend/Generator/main.py Outdated
Comment thread backend/Generator/main.py Outdated
Comment thread backend/Generator/test_rag.py Outdated
Comment thread backend/Generator/test_rag.py Outdated
Comment thread backend/server.py
Comment on lines 203 to +208
@app.route('/get_content', methods=['POST'])
def get_content():
try:
data = request.get_json()
document_url = data.get('document_url')
if not document_url:
return jsonify({'error': 'Document URL is required'}), 400

text = docs_service.get_document_content(document_url)
return jsonify(text)
except ValueError as e:
return jsonify({'error': str(e)}), 400
except Exception as e:
return jsonify({'error': str(e)}), 500
# Google Docs API disabled for local development
return jsonify({
"content": "Google Docs service temporarily disabled in local development."
}), 200
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

/get_content now silently breaks the Google Docs flow.

The frontend consumes this endpoint as document text, so returning a placeholder payload here swaps real content for a static message instead of surfacing an actionable failure. Keep the old implementation behind config, or return an explicit error/feature-flag response the client can handle.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 203 - 208, The /get_content handler
(get_content) currently returns a static success payload which breaks the
frontend flow; revert to the original Google Docs retrieval logic but gate it
behind a config flag (e.g., ENABLE_GOOGLE_DOCS) so local dev can disable it, and
when the flag is false return a non-200, machine-readable error (e.g., 503 or
410) with an explicit error field like
{"error":"google_docs_disabled","message":"Google Docs service disabled in local
development","retryable":false} so the client can handle it; also add
server-side logging of the disabled state and any real fetch errors so failures
are actionable.

Comment thread backend/test_app.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement] Add Optional Model Quantization to Reduce Memory Usage and Improve Inference Performance

1 participant