[Enhancement] Add Optional Model Quantization to Reduce Memory Usage and Improve Inference Performance by mahek2016 · Pull Request #555 · AOSSIE-Org/EduAid

mahek2016 · 2026-03-11T14:32:21Z

Addressed Issues

Fixes #554

Screenshots/Recordings

Not applicable. This change is related to backend model optimization and does not introduce any UI changes.

Additional Notes

This PR adds optional model quantization to reduce memory usage and improve inference performance for transformer models used in the EduAid backend.

Main changes:

Added dynamic INT8 quantization when running on CPU
Enabled FP16 precision when CUDA is available
Introduced a configuration flag ENABLE_MODEL_QUANTIZATION to optionally enable or disable the optimization
Applied the optimization during model loading across generator classes

These improvements help reduce memory consumption and improve inference efficiency without changing the existing functionality of the quiz generation pipeline.

AI Usage Disclosure

This PR does not contain AI-generated code at all.
This PR contains AI-assisted code. I have reviewed and tested the implementation locally and take responsibility for the changes.

AI tools used: ChatGPT (for assistance and suggestions)

Checklist

My PR addresses a single issue, fixes a single bug or makes a single improvement.
My code follows the project's code style and conventions
If applicable, I have made corresponding changes or additions to the documentation
If applicable, I have made corresponding changes or additions to tests
My changes generate no new warnings or errors
I have joined the Discord server and I will share a link to this PR with the project maintainers there
I have read the Contribution Guidelines
Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

New Features
- Added optional model optimization with FP16 support on CUDA GPUs and int8 quantization on CPU devices.
- Introduced per-request tracing with unique request IDs and enhanced logging capabilities.
- Added test application for local development.

…ove inference performance

coderabbitai · 2026-03-11T14:32:40Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c827f277-f636-45e8-a15a-99881b6595f8

📥 Commits

Reviewing files that changed from the base of the PR and between e956f55 and de49b7c.

📒 Files selected for processing (2)

backend/Generator/main.py
backend/test_app.py

🚧 Files skipped from review as they are similar to previous changes (2)

backend/test_app.py
backend/Generator/main.py

📝 Walkthrough

Walkthrough

The pull request adds optional model quantization to the backend's ML pipeline for reduced memory consumption, introduces per-request tracing infrastructure for logging, and adds a new test Flask application for development purposes.

Changes

Cohort / File(s)	Summary
Model Quantization & Optimization `backend/Generator/main.py`	Introduces `optimize_model()` function applying FP16 precision on CUDA or INT8 dynamic quantization on CPU. Adds `ENABLE_MODEL_QUANTIZATION` flag from environment. Applies optimization to MCQGenerator, ShortQGenerator, ParaphraseGenerator, BoolQGenerator, AnswerPredictor, QuestionGenerator, and NLI components before device transfer. Updates imports to group transformer model classes.
Request Tracing & Logging Infrastructure `backend/server.py`	Adds Flask `before_request` handler that generates unique request IDs (UUID) and logs incoming requests. Introduces global logging configuration and new logger format. Adds imports for `g`, `logging`, and `uuid4`. Replaces `get_content` endpoint to return disabled Google Docs API message for local development.
Test Application `backend/test_app.py`	New Flask application file defining app instance, home route ("/") handler returning "Flask is working", and main guard that starts server with debug mode derived from `FLASK_DEBUG` environment variable.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰✨ Hop hop, the models now run lean,
Quantized bytes, a memory dream!
With tracing logs and test apps bright,
This backend bounds through data's night! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (3 warnings)

Check name	Status	Explanation	Resolution
Linked Issues check	⚠️ Warning	The PR implements core quantization requirements [`#554`]: ENABLE_MODEL_QUANTIZATION flag, optimize_model() function with INT8 on CPU and FP16 on CUDA, and consistent application across model loading. However, changes to backend/server.py and addition of backend/test_app.py appear unrelated to issue `#554`'s quantization scope.	Remove unrelated changes to backend/server.py (request tracing, logging, Google Docs API modifications) and backend/test_app.py (new Flask test app) that are outside the quantization feature scope defined in issue `#554`.
Out of Scope Changes check	⚠️ Warning	Changes to backend/server.py (request tracing, logging, Google Docs API disabling) and addition of backend/test_app.py (new Flask test application) are out of scope relative to issue `#554`'s quantization objectives.	Remove backend/server.py modifications and backend/test_app.py entirely as they are not part of the model quantization feature defined in issue `#554`.
Docstring Coverage	⚠️ Warning	Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title directly matches the main objective: adding optional model quantization to reduce memory usage and improve inference performance, which is the primary change in the backend/Generator/main.py file.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 15-24: ENABLE_MODEL_QUANTIZATION is hard-coded True making
optimize_model (the quantization/FP16 path) mandatory; change it to read a
configurable boolean (e.g., from an environment variable like
MODEL_QUANTIZATION_ENABLED or the existing config loader) and default to False
so behavior stays backward compatible. Update the top-level constant and any
initialization logic used by optimize_model to parse the env/config value
reliably (handle "1"/"true"/"false" forms) so teams can opt-in to quantization
per-deployment without editing code; keep the optimize_model function signature
and internal logic unchanged except for referencing the new config flag.
- Around line 289-291: self.nli_model is optimized but never moved to the target
device and predict_boolean_answer() passes tokenizer outputs on CPU, causing
device mismatches; after calling optimize_model(self.nli_model) move the model
to self.device (e.g., self.nli_model.to(self.device)) and update
predict_boolean_answer() to send tokenized inputs from self.nli_tokenizer to the
same device (convert the dict from encode_plus/encode to tensors and
.to(self.device) for each value) so the model and inputs are co-located and
FP16/CUDA optimization works correctly.

In `@backend/Generator/test_rag.py`:
- Around line 3-19: The file contains demo code that runs at import time
(creating RAGService(), calling RAGService.index_text and RAGService.query and
printing answer), which breaks test discovery; wrap the demo flow including
sample_text, rag = RAGService(), rag.index_text(...), question, answer and the
print statements inside an if __name__ == "__main__": guard OR convert it into a
proper test function named test_* that instantiates RAGService, calls index_text
and query, and asserts the returned answer meets expectations (use sample_text
and question variables and assert on answer) so no side-effectful code runs on
import.
- Line 1: The test imports a missing symbol RAGService (in test_rag.py), causing
ModuleNotFoundError; fix by either landing the RAGService implementation/package
in this PR and exporting it under the rag module (so from rag import RAGService
resolves), or remove/disable test_rag.py and any references to RAGService until
the service exists; if you intend to test an existing API, update the import to
the correct exported class name instead of RAGService.

In `@backend/server.py`:
- Around line 203-208: The /get_content handler (get_content) currently returns
a static success payload which breaks the frontend flow; revert to the original
Google Docs retrieval logic but gate it behind a config flag (e.g.,
ENABLE_GOOGLE_DOCS) so local dev can disable it, and when the flag is false
return a non-200, machine-readable error (e.g., 503 or 410) with an explicit
error field like {"error":"google_docs_disabled","message":"Google Docs service
disabled in local development","retryable":false} so the client can handle it;
also add server-side logging of the disabled state and any real fetch errors so
failures are actionable.

In `@backend/test_app.py`:
- Around line 9-10: Replace the hardcoded debug=True in the main guard (if
__name__ == "__main__") by reading an environment variable and converting it to
a boolean; import os at top, then call app.run(debug=...) where the debug value
comes from something like os.getenv('FLASK_DEBUG', '0') or 'False' and converted
to a boolean (e.g., check for '1', 'true', 'yes' case-insensitively) so the
interactive debugger is enabled only when the FLASK_DEBUG env var is set.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cdbc21b4-d65d-4220-b205-9c697302c2aa

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and e956f55.

📒 Files selected for processing (4)

backend/Generator/main.py
backend/Generator/test_rag.py
backend/server.py
backend/test_app.py

coderabbitai · 2026-03-11T14:40:53Z

 @app.route('/get_content', methods=['POST'])
 def get_content():
-    try:
-        data = request.get_json()
-        document_url = data.get('document_url')
-        if not document_url:
-            return jsonify({'error': 'Document URL is required'}), 400
-
-        text = docs_service.get_document_content(document_url)
-        return jsonify(text)
-    except ValueError as e:
-        return jsonify({'error': str(e)}), 400
-    except Exception as e:
-        return jsonify({'error': str(e)}), 500
+    # Google Docs API disabled for local development
+    return jsonify({
+        "content": "Google Docs service temporarily disabled in local development."
+    }), 200


⚠️ Potential issue | 🟠 Major

/get_content now silently breaks the Google Docs flow.

The frontend consumes this endpoint as document text, so returning a placeholder payload here swaps real content for a static message instead of surfacing an actionable failure. Keep the old implementation behind config, or return an explicit error/feature-flag response the client can handle.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 203 - 208, The /get_content handler (get_content) currently returns a static success payload which breaks the frontend flow; revert to the original Google Docs retrieval logic but gate it behind a config flag (e.g., ENABLE_GOOGLE_DOCS) so local dev can disable it, and when the flag is false return a non-200, machine-readable error (e.g., 503 or 410) with an explicit error field like {"error":"google_docs_disabled","message":"Google Docs service disabled in local development","retryable":false} so the client can handle it; also add server-side logging of the disabled state and any real fetch errors so failures are actionable.

mahek2016 added 5 commits March 10, 2026 16:45

Add request ID tracing middleware for backend API requests

6097c8b

Fix logging configuration and simplify get_content endpoint

3e3d894

Fix duplicate logging import and address CodeRabbit review comments

708fc24

feat: add optional model quantization to reduce memory usage and impr…

39ef3bb

…ove inference performance

chore: remove accidental files

e956f55

coderabbitai Bot reviewed Mar 11, 2026

View reviewed changes

mahek2016 added 2 commits March 11, 2026 20:20

chore: remove invalid RAG test file causing ModuleNotFoundError

1d2b3b7

fix: address CodeRabbit review comments and remove invalid RAG test

de49b7c

mahek2016 mentioned this pull request Mar 11, 2026

[Enhancement] Add Optional Model Quantization to Reduce Memory Usage and Improve Inference Performance #554

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Enhancement] Add Optional Model Quantization to Reduce Memory Usage and Improve Inference Performance#555

[Enhancement] Add Optional Model Quantization to Reduce Memory Usage and Improve Inference Performance#555
mahek2016 wants to merge 7 commits intoAOSSIE-Org:mainfrom
mahek2016:enhancement/model-quantization

mahek2016 commented Mar 11, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Mar 11, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (3 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Mar 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mahek2016 commented Mar 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Addressed Issues

Screenshots/Recordings

Additional Notes

AI Usage Disclosure

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (3 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mahek2016 commented Mar 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 11, 2026 •

edited

Loading