[Enhancement] Add Optional Model Quantization to Reduce Memory Usage and Improve Inference Performance#555
Conversation
…ove inference performance
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThe pull request adds optional model quantization to the backend's ML pipeline for reduced memory consumption, introduces per-request tracing infrastructure for logging, and adds a new test Flask application for development purposes. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/main.py`:
- Around line 15-24: ENABLE_MODEL_QUANTIZATION is hard-coded True making
optimize_model (the quantization/FP16 path) mandatory; change it to read a
configurable boolean (e.g., from an environment variable like
MODEL_QUANTIZATION_ENABLED or the existing config loader) and default to False
so behavior stays backward compatible. Update the top-level constant and any
initialization logic used by optimize_model to parse the env/config value
reliably (handle "1"/"true"/"false" forms) so teams can opt-in to quantization
per-deployment without editing code; keep the optimize_model function signature
and internal logic unchanged except for referencing the new config flag.
- Around line 289-291: self.nli_model is optimized but never moved to the target
device and predict_boolean_answer() passes tokenizer outputs on CPU, causing
device mismatches; after calling optimize_model(self.nli_model) move the model
to self.device (e.g., self.nli_model.to(self.device)) and update
predict_boolean_answer() to send tokenized inputs from self.nli_tokenizer to the
same device (convert the dict from encode_plus/encode to tensors and
.to(self.device) for each value) so the model and inputs are co-located and
FP16/CUDA optimization works correctly.
In `@backend/Generator/test_rag.py`:
- Around line 3-19: The file contains demo code that runs at import time
(creating RAGService(), calling RAGService.index_text and RAGService.query and
printing answer), which breaks test discovery; wrap the demo flow including
sample_text, rag = RAGService(), rag.index_text(...), question, answer and the
print statements inside an if __name__ == "__main__": guard OR convert it into a
proper test function named test_* that instantiates RAGService, calls index_text
and query, and asserts the returned answer meets expectations (use sample_text
and question variables and assert on answer) so no side-effectful code runs on
import.
- Line 1: The test imports a missing symbol RAGService (in test_rag.py), causing
ModuleNotFoundError; fix by either landing the RAGService implementation/package
in this PR and exporting it under the rag module (so from rag import RAGService
resolves), or remove/disable test_rag.py and any references to RAGService until
the service exists; if you intend to test an existing API, update the import to
the correct exported class name instead of RAGService.
In `@backend/server.py`:
- Around line 203-208: The /get_content handler (get_content) currently returns
a static success payload which breaks the frontend flow; revert to the original
Google Docs retrieval logic but gate it behind a config flag (e.g.,
ENABLE_GOOGLE_DOCS) so local dev can disable it, and when the flag is false
return a non-200, machine-readable error (e.g., 503 or 410) with an explicit
error field like {"error":"google_docs_disabled","message":"Google Docs service
disabled in local development","retryable":false} so the client can handle it;
also add server-side logging of the disabled state and any real fetch errors so
failures are actionable.
In `@backend/test_app.py`:
- Around line 9-10: Replace the hardcoded debug=True in the main guard (if
__name__ == "__main__") by reading an environment variable and converting it to
a boolean; import os at top, then call app.run(debug=...) where the debug value
comes from something like os.getenv('FLASK_DEBUG', '0') or 'False' and converted
to a boolean (e.g., check for '1', 'true', 'yes' case-insensitively) so the
interactive debugger is enabled only when the FLASK_DEBUG env var is set.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cdbc21b4-d65d-4220-b205-9c697302c2aa
📒 Files selected for processing (4)
backend/Generator/main.pybackend/Generator/test_rag.pybackend/server.pybackend/test_app.py
| @app.route('/get_content', methods=['POST']) | ||
| def get_content(): | ||
| try: | ||
| data = request.get_json() | ||
| document_url = data.get('document_url') | ||
| if not document_url: | ||
| return jsonify({'error': 'Document URL is required'}), 400 | ||
|
|
||
| text = docs_service.get_document_content(document_url) | ||
| return jsonify(text) | ||
| except ValueError as e: | ||
| return jsonify({'error': str(e)}), 400 | ||
| except Exception as e: | ||
| return jsonify({'error': str(e)}), 500 | ||
| # Google Docs API disabled for local development | ||
| return jsonify({ | ||
| "content": "Google Docs service temporarily disabled in local development." | ||
| }), 200 |
There was a problem hiding this comment.
/get_content now silently breaks the Google Docs flow.
The frontend consumes this endpoint as document text, so returning a placeholder payload here swaps real content for a static message instead of surfacing an actionable failure. Keep the old implementation behind config, or return an explicit error/feature-flag response the client can handle.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 203 - 208, The /get_content handler
(get_content) currently returns a static success payload which breaks the
frontend flow; revert to the original Google Docs retrieval logic but gate it
behind a config flag (e.g., ENABLE_GOOGLE_DOCS) so local dev can disable it, and
when the flag is false return a non-200, machine-readable error (e.g., 503 or
410) with an explicit error field like
{"error":"google_docs_disabled","message":"Google Docs service disabled in local
development","retryable":false} so the client can handle it; also add
server-side logging of the disabled state and any real fetch errors so failures
are actionable.
Addressed Issues
Fixes #554
Screenshots/Recordings
Not applicable. This change is related to backend model optimization and does not introduce any UI changes.
Additional Notes
This PR adds optional model quantization to reduce memory usage and improve inference performance for transformer models used in the EduAid backend.
Main changes:
ENABLE_MODEL_QUANTIZATIONto optionally enable or disable the optimizationThese improvements help reduce memory consumption and improve inference efficiency without changing the existing functionality of the quiz generation pipeline.
AI Usage Disclosure
AI tools used: ChatGPT (for assistance and suggestions)
Checklist
Summary by CodeRabbit