[FEATURE] Add Rate Limiting and Abuse Protection for API Endpoints #563
mahek2016 wants to merge 6 commits into AOSSIE-Org:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the repository's CodeRabbit settings.
📝 Walkthrough

Adds Flask-Limiter to backend/server.py with a global default limit ("100 per hour"), per-route limits on expensive endpoints, and a RateLimitExceeded handler that returns a 429 JSON error.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client
    participant Flask as Flask App
    participant Limiter as Limiter (Flask-Limiter)
    participant Handler as RateLimit Handler
    participant Endpoint as Endpoint Logic

    Client->>Flask: HTTP request to endpoint
    Flask->>Limiter: check request vs route limit
    alt within limit
        Limiter->>Endpoint: allow request
        Endpoint->>Client: 200 + payload
    else limit exceeded
        Limiter->>Handler: trigger RateLimitExceeded
        Handler->>Client: 429 + {"error":"Rate limit exceeded","code":"rate_limit_exceeded"}
    end
```
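A minimal sketch of this wiring, assembled from details quoted in the review comments below (the `/get_mcq` route and its "20 per minute" limit come from the reviewer's remarks; handler bodies are elided), is shown here as an illustration rather than the PR's exact code:

```python
from flask import Flask, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
from flask_limiter.errors import RateLimitExceeded

app = Flask(__name__)

# Global default limit; individual routes can override it.
limiter = Limiter(
    key_func=get_remote_address,
    app=app,
    default_limits=["100 per hour"],
)

@app.errorhandler(RateLimitExceeded)
def rate_limit_handler(_e):
    # JSON body matches the 429 response shown in the sequence diagram
    return jsonify({
        "error": "Rate limit exceeded",
        "code": "rate_limit_exceeded"
    }), 429

@app.route("/get_mcq", methods=["POST"])
@limiter.limit("20 per minute")  # per-route limit mentioned in the review
def get_mcq():
    ...  # NLP inference logic elided
```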
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 3
❌ Failed checks (3 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/server.py (1)
140-170: ⚠️ Potential issue | 🟠 Major. Per-request work is still unbounded on the answer endpoints.
These handlers accept arbitrary `input_question` arrays and then run model logic in a loop. A single oversized request can still force hundreds of evaluations while only consuming one rate-limit hit, which undercuts the abuse protection this PR is adding. Add a hard batch-size cap before the loops, and cap `input_options` on `/get_mcq_answer` as well.

Possible guardrail:
```diff
+MAX_ANSWER_BATCH = 20
+
 def get_mcq_answer():
     data = request.get_json()
     input_text = data.get("input_text", "")
     input_questions = data.get("input_question", [])
     input_options = data.get("input_options", [])
+    if len(input_questions) > MAX_ANSWER_BATCH or len(input_options) > MAX_ANSWER_BATCH:
+        return jsonify({
+            "error": "Too many questions in one request",
+            "code": "batch_too_large",
+        }), 400
     outputs = []

 def get_answer():
     data = request.get_json()
     input_text = data.get("input_text", "")
     input_questions = data.get("input_question", [])
+    if len(input_questions) > MAX_ANSWER_BATCH:
+        return jsonify({
+            "error": "Too many questions in one request",
+            "code": "batch_too_large",
+        }), 400

 def get_boolean_answer():
     data = request.get_json()
     input_text = data.get("input_text", "")
     input_questions = data.get("input_question", [])
+    if len(input_questions) > MAX_ANSWER_BATCH:
+        return jsonify({
+            "error": "Too many questions in one request",
+            "code": "batch_too_large",
+        }), 400
```

Also applies to: 174-184, 188-204
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 140 - 170, The get_mcq_answer handler performs unbounded per-request work by looping over input_question and input_options and running qa_model, TF-IDF and cosine_similarity for each item; add a hard batch-size cap (e.g., MAX_BATCH) at the start of get_mcq_answer to truncate or reject input_question and input_options so you never process more than the cap, and also enforce a MAX_OPTIONS per question to truncate input_options entries before vectorizing; ensure you keep the lengths matched (truncate input_questions and corresponding input_options together) and return a clear short-circuit response if nothing remains. Also apply the same pattern to the other QA handlers that loop per-request (the other endpoints referenced in the review) so all loops are bounded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/server.py`:
- Around line 73-74: The hard-generation endpoints get_shortq_hard,
get_mcq_hard, and get_boolq_hard are missing rate limits and allow bypassing the
new protection; add the same limiter decorator used on get_mcq (e.g.,
`@limiter.limit`("20 per minute")) above each of those function definitions and
any other unprotected generator endpoints referenced in the comment (lines
calling qg.generate), ensure the limiter import/instance is in scope, and run
the app/tests to confirm the decorators are applied correctly.
- Around line 209-211: The /get_content handler currently returns a 200 JSON
placeholder ("Google Docs integration disabled for local testing") which breaks
callers expecting document text; change the handler (the function that serves
the /get_content endpoint in backend/server.py) to gate Docs-disabled behavior
behind a config flag (e.g., is_docs_enabled or an environment var) and when Docs
is disabled return a non-2xx error response (HTTP 503 or 400) with a clear error
message instead of a 200 payload so callers in Text_Input.jsx and TextInput.jsx
follow their error path; ensure the response uses Flask's abort or
jsonify+status_code so the HTTP status is correct.
In `@requirements.txt`:
- Line 35: Remove the unused dependency slowapi from requirements by deleting
the "slowapi" entry in requirements.txt; ensure the project uses the existing
Flask-Limiter implementation (no code references to slowapi should remain), then
update your dependency lockfile or regenerate requirements (pip-compile / pip
freeze) and run the test suite/CI to confirm nothing breaks.
---
Outside diff comments:
In `@backend/server.py`:
- Around line 140-170: The get_mcq_answer handler performs unbounded per-request
work by looping over input_question and input_options and running qa_model,
TF-IDF and cosine_similarity for each item; add a hard batch-size cap (e.g.,
MAX_BATCH) at the start of get_mcq_answer to truncate or reject input_question
and input_options so you never process more than the cap, and also enforce a
MAX_OPTIONS per question to truncate input_options entries before vectorizing;
ensure you keep the lengths matched (truncate input_questions and corresponding
input_options together) and return a clear short-circuit response if nothing
remains. Also apply the same pattern to the other QA handlers that loop
per-request (the other endpoints referenced in the review) so all loops are
bounded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: eeb345f5-f52a-4f76-bdae-e2f04f5bef03
📒 Files selected for processing (2)
backend/server.py
requirements.txt
```python
return jsonify({
    "message": "Google Docs integration disabled for local testing"
})
```
Don’t return a 200 placeholder from /get_content.
Both eduaid_web/src/pages/Text_Input.jsx at Lines 54-64 and extension/src/pages/text_input/TextInput.jsx at Lines 67-84 treat a successful response as extracted document text and assign it directly to UI state. This 200 OK object changes the contract and will surface a broken value instead of driving their existing error path. If local testing needs Docs disabled, gate it with config and return a non-2xx error.
Safer fallback if Docs must stay disabled locally
```diff
 def get_content():
     return jsonify({
-        "message": "Google Docs integration disabled for local testing"
-    })
+        "error": "Google Docs integration unavailable",
+        "code": "google_docs_disabled",
+    }), 503
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-return jsonify({
-    "message": "Google Docs integration disabled for local testing"
-})
+return jsonify({
+    "error": "Google Docs integration unavailable",
+    "code": "google_docs_disabled",
+}), 503
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 209 - 211, The /get_content handler currently
returns a 200 JSON placeholder ("Google Docs integration disabled for local
testing") which breaks callers expecting document text; change the handler (the
function that serves the /get_content endpoint in backend/server.py) to gate
Docs-disabled behavior behind a config flag (e.g., is_docs_enabled or an
environment var) and when Docs is disabled return a non-2xx error response (HTTP
503 or 400) with a clear error message instead of a 200 payload so callers in
Text_Input.jsx and TextInput.jsx follow their error path; ensure the response
uses Flask's abort or jsonify+status_code so the HTTP status is correct.
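A sketch of the config-gated behavior this prompt describes follows; the `GOOGLE_DOCS_ENABLED` variable name and the HTTP method are assumptions for illustration, not from the PR:

```python
import os

from flask import jsonify

# Assumed flag name; any config mechanism the project already uses works
DOCS_ENABLED = os.environ.get("GOOGLE_DOCS_ENABLED", "false").lower() == "true"

@app.route("/get_content", methods=["POST"])  # method assumed
def get_content():
    if not DOCS_ENABLED:
        # Non-2xx status so Text_Input.jsx / TextInput.jsx take their error path
        return jsonify({
            "error": "Google Docs integration unavailable",
            "code": "google_docs_disabled",
        }), 503
    # ...real Google Docs fetch would go here...
```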
🧹 Nitpick comments (5)
backend/server.py (4)
46-51: Prefix unused parameter to suppress linter warning.
The `e` parameter is required by Flask's error handler signature but unused here. Prefix it with an underscore to indicate it's intentionally ignored.

Proposed fix:

```diff
 @app.errorhandler(RateLimitExceeded)
-def rate_limit_handler(e):
+def rate_limit_handler(_e):
     return jsonify({
         "error": "Rate limit exceeded",
         "code": "rate_limit_exceeded"
     }), 429
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 46 - 51, The error handler function rate_limit_handler currently declares an unused parameter named e; to suppress linter warnings, rename that parameter to _e (or _ ) in the rate_limit_handler signature so Flask still receives the exception argument but linters recognize it as intentionally unused, leaving the function body and return value unchanged.
489-510: Consider rate-limiting the `/getTranscript` endpoint.
This endpoint spawns a subprocess (`yt-dlp`) and performs file I/O, making it susceptible to abuse. While the PR scope focuses on NLP endpoints, this could also exhaust resources under heavy load.

Proposed fix:

```diff
 @app.route('/getTranscript', methods=['GET'])
+@limiter.limit("10 per minute")
 def get_transcript():
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 489 - 510, The /getTranscript handler (function get_transcript) can be abused because it spawns subprocess.run and does file I/O; add rate-limiting to prevent resource exhaustion by either integrating Flask-Limiter and decorating get_transcript with a sensible per-IP limit (e.g., "10 per minute" with a smaller burst) or implementing a simple in-memory token-bucket/Leaky-Bucket decorator that returns HTTP 429 when limits are exceeded; ensure the limiter applies before calling subprocess.run so abusive requests are rejected early and include a clear 429 JSON response, and document/configure the limit values centrally so they can be tuned.
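The in-memory token-bucket alternative the prompt mentions could look roughly like the sketch below. It is illustrative only, not part of the PR: per-IP buckets with no shared storage, and the rate/capacity values are placeholders.

```python
import time
import threading
from functools import wraps

from flask import jsonify, request

_buckets = {}  # (endpoint, ip) -> (tokens, last_refill)
_buckets_lock = threading.Lock()

def token_bucket(rate_per_minute, capacity=None):
    capacity = capacity if capacity is not None else rate_per_minute
    refill_per_second = rate_per_minute / 60.0

    def decorator(view):
        @wraps(view)
        def wrapper(*args, **kwargs):
            key = (view.__name__, request.remote_addr or "unknown")
            now = time.monotonic()
            with _buckets_lock:
                tokens, last = _buckets.get(key, (float(capacity), now))
                # Refill proportionally to elapsed time, capped at capacity
                tokens = min(capacity, tokens + (now - last) * refill_per_second)
                if tokens < 1.0:
                    _buckets[key] = (tokens, now)
                    return jsonify({"error": "Rate limit exceeded",
                                    "code": "rate_limit_exceeded"}), 429
                _buckets[key] = (tokens - 1.0, now)
            return view(*args, **kwargs)
        return wrapper
    return decorator

# Usage (hypothetical): @token_bucket(10) on get_transcript would allow
# roughly 10 requests per minute per client IP.
```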
441-456: Consider rate-limiting the `/upload` endpoint.
File processing can be resource-intensive. A rate limit would provide consistent protection against abuse.

Proposed fix:

```diff
 @app.route('/upload', methods=['POST'])
+@limiter.limit("20 per minute")
 def upload_file():
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 441 - 456, The /upload endpoint (function upload_file) lacks rate limiting which can allow abusive or expensive file_processor.process_file calls; add a rate limiter (e.g., Flask-Limiter or a lightweight token-bucket/middleware) and apply it to the upload_file route (decorate or wrap the route handler) with sensible limits (e.g., requests per minute per IP or authenticated user), ensure limiter returns 429 on exceeded quota and include a clear Retry-After header/message, and add tests to verify limit enforcement and that file_processor.process_file is still invoked when under the limit.
38-42: Configure a shared storage backend for production deployments.
Flask-Limiter defaults to in-memory storage when no `storage_uri` is specified. With in-memory storage:
- Rate limits aren't shared across Gunicorn workers—each process maintains independent counters, allowing clients to exceed limits by being routed to different workers
- Limits reset on application restart
- Won't work in distributed environments with multiple server instances
For production deployments with multiple workers or instances, configure a shared backend like Redis:
Example with Redis
```python
limiter = Limiter(
    key_func=get_remote_address,
    app=app,
    default_limits=["100 per hour"],
    storage_uri="redis://localhost:6379"
)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 38 - 42, The Limiter is currently created without a persistent storage backend (Limiter(... key_func=get_remote_address, app=app, default_limits=["100 per hour"])), which uses in-memory storage and breaks rate limiting across Gunicorn workers or multiple instances; update the Limiter instantiation to include a shared storage backend by adding an appropriate storage_uri (e.g., a Redis URI like "redis://host:port") so that the Limiter (class) shares counters across processes and survives restarts; modify the Limiter call that references get_remote_address, app, and default_limits to pass storage_uri pointing to your production Redis (or other supported backend).

requirements.txt (1)
36-36: Consider pinning the Flask-Limiter version.
Unpinned dependencies can lead to unexpected behavior if a breaking release is published. Consider pinning to a specific version (e.g., `Flask-Limiter==3.5.0`) consistent with other pinned dependencies in this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@requirements.txt` at line 36, The requirements entry for Flask-Limiter is unpinned; update the dependency line for "Flask-Limiter" to a pinned version (for example "Flask-Limiter==3.5.0") to match the project's pinned-dependency practices and prevent accidental breaking upgrades; ensure the formatting matches other entries in requirements.txt.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/server.py`:
- Around line 46-51: The error handler function rate_limit_handler currently
declares an unused parameter named e; to suppress linter warnings, rename that
parameter to _e (or _ ) in the rate_limit_handler signature so Flask still
receives the exception argument but linters recognize it as intentionally
unused, leaving the function body and return value unchanged.
- Around line 489-510: The /getTranscript handler (function get_transcript) can
be abused because it spawns subprocess.run and does file I/O; add rate-limiting
to prevent resource exhaustion by either integrating Flask-Limiter and
decorating get_transcript with a sensible per-IP limit (e.g., "10 per minute"
with a smaller burst) or implementing a simple in-memory
token-bucket/Leaky-Bucket decorator that returns HTTP 429 when limits are
exceeded; ensure the limiter applies before calling subprocess.run so abusive
requests are rejected early and include a clear 429 JSON response, and
document/configure the limit values centrally so they can be tuned.
- Around line 441-456: The /upload endpoint (function upload_file) lacks rate
limiting which can allow abusive or expensive file_processor.process_file calls;
add a rate limiter (e.g., Flask-Limiter or a lightweight
token-bucket/middleware) and apply it to the upload_file route (decorate or wrap
the route handler) with sensible limits (e.g., requests per minute per IP or
authenticated user), ensure limiter returns 429 on exceeded quota and include a
clear Retry-After header/message, and add tests to verify limit enforcement and
that file_processor.process_file is still invoked when under the limit.
- Around line 38-42: The Limiter is currently created without a persistent
storage backend (Limiter(... key_func=get_remote_address, app=app,
default_limits=["100 per hour"])), which uses in-memory storage and breaks rate
limiting across Gunicorn workers or multiple instances; update the Limiter
instantiation to include a shared storage backend by adding an appropriate
storage_uri (e.g., a Redis URI like "redis://host:port") so that the Limiter
(class) shares counters across processes and survives restarts—modify the
Limiter call that references get_remote_address, app, and default_limits to pass
storage_uri pointing to your production Redis (or other supported backend).
In `@requirements.txt`:
- Line 36: The requirements entry for Flask-Limiter is unpinned; update the
dependency line for "Flask-Limiter" to a pinned version (for example
"Flask-Limiter==3.5.0") to match the project's pinned-dependency practices and
prevent accidental breaking upgrades; ensure the formatting matches other
entries in requirements.txt.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6b4d84e6-b8c1-46ac-a030-91afb37e109a
📒 Files selected for processing (2)
backend/server.py
requirements.txt
🧹 Nitpick comments (1)
backend/server.py (1)
214-215: Consider rate-limiting `/generate_gform`.
This endpoint makes external Google Forms API calls and triggers OAuth flows. While not an NLP inference endpoint, unthrottled access could exhaust API quotas, spam form creation, or trigger Google rate limits upstream.
♻️ Suggested addition
`@app.route`("/generate_gform", methods=["POST"]) +@limiter.limit("5 per minute") def generate_gform():🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 214 - 215, Add rate-limiting to the /generate_gform endpoint to prevent quota exhaustion and abuse: install/configure a limiter (e.g., Flask-Limiter) and apply it to the generate_gform view (or wrap with a custom rate_limit decorator) so requests to generate_gform are throttled per client/IP (for example N requests per minute) and return a 429 on exceed; make sure to import and initialize the limiter at module scope, attach the limiter to the app, and document any admin/exemptions if needed so the symbol generate_gform (and its `@app.route`) is protected by the limiter.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/server.py`:
- Around line 214-215: Add rate-limiting to the /generate_gform endpoint to
prevent quota exhaustion and abuse: install/configure a limiter (e.g.,
Flask-Limiter) and apply it to the generate_gform view (or wrap with a custom
rate_limit decorator) so requests to generate_gform are throttled per client/IP
(for example N requests per minute) and return a 429 on exceed; make sure to
import and initialize the limiter at module scope, attach the limiter to the
app, and document any admin/exemptions if needed so the symbol generate_gform
(and its `@app.route`) is protected by the limiter.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b20ea0b0-75e9-45f9-bf2b-097aba2dfc9b
📒 Files selected for processing (2)
backend/server.py
requirements.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- requirements.txt
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
backend/server.py (3)
503-511: ⚠️ Potential issue | 🟠 Major. Race condition in subtitle file handling.
Using `glob("subtitles/*.vtt")` and selecting by `max(ctime)` can return a different request's file when concurrent requests are processed. One request may also delete another's file before it's read.

Use the expected filename directly instead of globbing:
🔧 Proposed fix
```diff
-    # Find the latest .vtt file in the "subtitles" folder
-    subtitle_files = glob.glob("subtitles/*.vtt")
-    if not subtitle_files:
-        return jsonify({"error": "No subtitles found"}), 404
-
-    latest_subtitle = max(subtitle_files, key=os.path.getctime)
-    transcript_text = clean_transcript(latest_subtitle)
-
-    # Optional: Clean up the file after reading
-    os.remove(latest_subtitle)
+    # yt-dlp may append language code to filename
+    expected_file = f"subtitles/{video_id}.en.vtt"
+    if not os.path.exists(expected_file):
+        # Fallback to check without language suffix
+        expected_file = f"subtitles/{video_id}.vtt"
+
+    if not os.path.exists(expected_file):
+        return jsonify({"error": "No subtitles found"}), 404
+
+    transcript_text = clean_transcript(expected_file)
+    os.remove(expected_file)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 503 - 511, The current logic uses glob("subtitles/*.vtt") and max(os.path.getctime) then os.remove(latest_subtitle) which causes race conditions across concurrent requests; change to use the concrete expected filename (pass the filename or unique temp path into the handler instead of scanning) and operate only on that path when calling clean_transcript(filename) and os.remove(filename); alternatively generate a per-request unique filename when saving uploads and reference that unique name in the request flow so functions like clean_transcript, the variable latest_subtitle, and the deletion call operate on a single known file rather than the glob result.
491-513: ⚠️ Potential issue | 🔴 Critical. Command injection and path traversal vulnerability via unsanitized `video_id`.
The `video_id` parameter is taken directly from user input and interpolated into both the subprocess command and file path without validation. While using a list with `subprocess.run` mitigates shell injection, the unsanitized input still poses risks:
- Path traversal: A `video_id` like `../../etc/passwd` could write/read outside the intended directory.
- Unexpected behavior: Special characters could cause yt-dlp to misinterpret arguments.

Validate that `video_id` matches an expected YouTube ID format (11 alphanumeric characters plus `-` and `_`):

🛡️ Proposed fix
```diff
 @app.route('/getTranscript', methods=['GET'])
 @limiter.limit("10 per minute")
 def get_transcript():
     video_id = request.args.get('videoId')
     if not video_id:
         return jsonify({"error": "No video ID provided"}), 400
+
+    # Validate video_id format (YouTube IDs are 11 chars: alphanumeric, -, _)
+    if not re.match(r'^[a-zA-Z0-9_-]{11}$', video_id):
+        return jsonify({"error": "Invalid video ID format"}), 400

     subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 491 - 513, The get_transcript handler currently uses unsanitized video_id in subprocess.run and file paths; validate video_id first (e.g., require a regex ^[A-Za-z0-9_-]{11}$) and return 400 on mismatch, then use the validated id only. After validation, construct the yt-dlp call using the validated id for the URL and avoid interpolating user input into arbitrary paths by using os.path.join("subtitles", f"{video_id}.vtt") (or use a safe temporary filename via tempfile with the validated id as a suffix). Also ensure any file removal/read uses that same safe path variable (latest_subtitle should be constrained to the subtitles directory or resolved via os.path.abspath and checked to start with the subtitles directory) before calling clean_transcript and os.remove.
498-500: ⚠️ Potential issue | 🟠 Major. Unhandled `CalledProcessError` will cause 500 errors.
With `check=True`, a failed `yt-dlp` command (e.g., video unavailable, no subtitles, network error) raises `CalledProcessError`, resulting in an unhandled 500 response to clients.

🔧 Proposed fix
```diff
+    try:
+        subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
+                        "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
+                       check=True, capture_output=True, text=True)
+    except subprocess.CalledProcessError as e:
+        return jsonify({"error": "Failed to fetch transcript", "details": e.stderr}), 502
-    subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
-                    "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
-                   check=True, capture_output=True, text=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 498 - 500, Wrap the subprocess.run(["yt-dlp", ...], check=True, capture_output=True, text=True) call that downloads subtitles for video_id in a try/except; catch subprocess.CalledProcessError as e (and optionally FileNotFoundError) and log e.returncode and e.stderr along with the video_id, then return/raise an appropriate HTTP error or domain-specific exception instead of letting the CalledProcessError propagate (e.g., use Flask's abort or return a 404/400 with a clear message) so failures like missing subtitles, network errors or unavailable videos are handled gracefully.
🧹 Nitpick comments (1)
backend/server.py (1)
38-42: In-memory storage won't persist limits across workers.
Without specifying a `storage_uri`, Flask-Limiter defaults to in-memory storage. This means rate limits won't be shared across multiple Gunicorn/uWSGI workers or server instances, allowing clients to effectively multiply their allowed requests.

For production with multiple workers, configure a shared backend:
```python
limiter = Limiter(
    key_func=get_remote_address,
    app=app,
    default_limits=["100 per hour"],
    storage_uri="redis://localhost:6379"  # or memcached://...
)
```

If single-worker local development is the only target, this is acceptable but should be documented.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 38 - 42, The Limiter instantiation (Limiter, key_func=get_remote_address, app=app) currently uses default in-memory storage which won't persist limits across workers; update the Limiter call to supply a shared backend via the storage_uri parameter (e.g., "redis://..." or "memcached://...") so rate limits are shared between Gunicorn/uWSGI workers in production, or alternatively add a clear comment near the Limiter creation explaining that in-memory is intentionally used only for single-worker local dev.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@backend/server.py`:
- Around line 503-511: The current logic uses glob("subtitles/*.vtt") and
max(os.path.getctime) then os.remove(latest_subtitle) which causes race
conditions across concurrent requests; change to use the concrete expected
filename (pass the filename or unique temp path into the handler instead of
scanning) and operate only on that path when calling clean_transcript(filename)
and os.remove(filename); alternatively generate a per-request unique filename
when saving uploads and reference that unique name in the request flow so
functions like clean_transcript, the variable latest_subtitle, and the deletion
call operate on a single known file rather than the glob result.
- Around line 491-513: The get_transcript handler currently uses unsanitized
video_id in subprocess.run and file paths; validate video_id first (e.g.,
require a regex ^[A-Za-z0-9_-]{11}$) and return 400 on mismatch, then use the
validated id only. After validation, construct the yt-dlp call using the
validated id for the URL and avoid interpolating user input into arbitrary paths
by using os.path.join("subtitles", f"{video_id}.vtt") (or use a safe temporary
filename via tempfile with the validated id as a suffix). Also ensure any file
removal/read uses that same safe path variable (latest_subtitle should be
constrained to the subtitles directory or resolved via os.path.abspath and
checked to start with the subtitles directory) before calling clean_transcript
and os.remove.
- Around line 498-500: Wrap the subprocess.run(["yt-dlp", ...], check=True,
capture_output=True, text=True) call that downloads subtitles for video_id in a
try/except; catch subprocess.CalledProcessError as e (and optionally
FileNotFoundError) and log e.returncode and e.stderr along with the video_id,
then return/raise an appropriate HTTP error or domain-specific exception instead
of letting the CalledProcessError propagate (e.g., use Flask's abort or return a
404/400 with a clear message) so failures like missing subtitles, network errors
or unavailable videos are handled gracefully.
---
Nitpick comments:
In `@backend/server.py`:
- Around line 38-42: The Limiter instantiation (Limiter,
key_func=get_remote_address, app=app) currently uses default in-memory storage
which won't persist limits across workers; update the Limiter call to supply a
shared backend via the storage_uri parameter (e.g., "redis://..." or
"memcached://...") so rate limits are shared between Gunicorn/uWSGI workers in
production, or alternatively add a clear comment near the Limiter creation
explaining that in-memory is intentionally used only for single-worker local
dev.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/server.py (1)
431-440: ⚠️ Potential issue | 🔴 Critical. Preserve the boolean question objects when hardening.
Line 438 sends the whole generated item into `make_question_harder` and returns those transformed values directly. That is inconsistent with the other hard endpoints and will either fail on dict input or collapse the response from `{"question": ...}` objects into plain strings.

Possible fix:
```diff
-    harder_questions = [make_question_harder(q) for q in generated]
-
-    return jsonify({"output": harder_questions})
+    for item in generated:
+        item["question"] = make_question_harder(item["question"])
+
+    return jsonify({"output": generated})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 431 - 440, The code passes entire generated items into make_question_harder and returns those results, which collapses question objects into plain strings or breaks dict handling; instead iterate over each item in generated, call make_question_harder only on the question text (e.g., item["question"] or the appropriate field) and build a new list of question objects by copying each original dict and replacing its question field with the hardened question (preserving boolean/answer fields for "true_false" style). Update the mapping that produces harder_questions so jsonify returns the list of intact objects rather than raw strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/server.py`:
- Around line 503-521: The subprocess.run call that invokes yt-dlp (in
backend/server.py within the try block that builds the args list including
"yt-dlp" and f"https://www.youtube.com/watch?v={video_id}") must be bounded by a
timeout: add the timeout parameter (e.g. timeout=30 or a configurable value) to
the subprocess.run invocation and catch subprocess.TimeoutExpired in the
surrounding try/except; in the TimeoutExpired handler return a suitable error
response (e.g. jsonify({"error":"yt-dlp timed out"}), 504) and optionally
log/attach stderr for diagnostics, keeping the existing CalledProcessError and
FileNotFoundError handlers intact.
- Around line 38-42: The global Limiter instantiation currently sets
default_limits=["100 per hour"], which unintentionally rate-limits all
undecorated routes (e.g., the root handler "/" and the get_content endpoint
handled by the function for "/get_content"); remove the default_limits argument
from the Limiter(...) call or keep it but explicitly mark
non-inference/monitoring handlers with `@limiter.exempt` (apply the decorator to
the route functions that serve health checks/monitoring such as the "/" handler
and the function serving "/get_content") so that only intended inference
endpoints remain rate-limited (see the exemption sketch after these prompts).
- Around line 500-529: Replace the fixed subtitle_path strategy with a
per-request temporary directory: create a tempfile.TemporaryDirectory() for the
request, run subprocess.run (the existing call in the block around subtitle_path
and yt-dlp) with the output directory set to that temp dir (so yt-dlp writes
into an isolated folder), then use glob.glob to find the generated VTT file for
video_id (match f"{video_id}*.vtt" to handle language suffixes) and pass that
found path to clean_transcript; if no match return the same 404 error, and rely
on the TemporaryDirectory context manager to clean up instead of os.remove; keep
existing exception handling for subprocess.CalledProcessError and
FileNotFoundError but update messages as needed and reference variables
subtitle_path, subprocess.run, clean_transcript, and os.remove when making the
change.
---
Outside diff comments:
In `@backend/server.py`:
- Around line 431-440: The code passes entire generated items into
make_question_harder and returns those results, which collapses question objects
into plain strings or breaks dict handling; instead iterate over each item in
generated, call make_question_harder only on the question text (e.g.,
item["question"] or the appropriate field) and build a new list of question
objects by copying each original dict and replacing its question field with the
hardened question (preserving boolean/answer fields for "true_false" style).
Update the mapping that produces harder_questions so jsonify returns the list of
intact objects rather than raw strings.
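For the `@limiter.exempt` suggestion in the inline comments above, the change is a one-line decorator. A minimal sketch follows; the "/" handler body is assumed, and only the decorator usage is the point:

```python
# Exempting a health-check/monitoring route from the global default limit
@app.route("/")
@limiter.exempt
def index():
    return jsonify({"status": "ok"})
```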
```diff
+    # Safe subtitle file path
+    subtitle_path = os.path.join("subtitles", f"{video_id}.vtt")
+
+    try:
+        subprocess.run(
+            [
+                "yt-dlp",
+                "--write-auto-sub",
+                "--sub-lang", "en",
+                "--skip-download",
+                "--sub-format", "vtt",
+                "-o", subtitle_path,
+                f"https://www.youtube.com/watch?v={video_id}"
+            ],
+            check=True,
+            capture_output=True,
+            text=True
+        )
+    except subprocess.CalledProcessError:
+        return jsonify({"error": "Failed to fetch subtitles"}), 400
+    except FileNotFoundError:
+        return jsonify({"error": "yt-dlp is not installed on the server"}), 500
+
+    if not os.path.exists(subtitle_path):
+        return jsonify({"error": "No subtitles found"}), 404
+
-    latest_subtitle = max(subtitle_files, key=os.path.getctime)
-    transcript_text = clean_transcript(latest_subtitle)
+    transcript_text = clean_transcript(subtitle_path)
+
-    # Optional: Clean up the file after reading
-    os.remove(latest_subtitle)
+    # Clean up subtitle file
+    os.remove(subtitle_path)
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, locate the backend/server.py file
find . -name "server.py" -o -name "*.py" | grep -E "(backend|server)" | head -20
```

Repository: AOSSIE-Org/EduAid
Length of output: 264
🏁 Script executed:
```bash
# Check if backend/server.py exists and examine lines around 500-529
if [ -f "backend/server.py" ]; then
  wc -l backend/server.py
  echo "=== Lines 500-529 ==="
  sed -n '500,529p' backend/server.py
fi
```

Repository: AOSSIE-Org/EduAid
Length of output: 1060
🏁 Script executed:
```bash
# Search for subtitles directory creation in the codebase
rg "subtitles" --type py -B 3 -A 3
```

Repository: AOSSIE-Org/EduAid
Length of output: 1237
🏁 Script executed:
```bash
# Look for __main__ and directory creation patterns
rg "if __name__ == \"__main__\"" --type py -A 20
```

Repository: AOSSIE-Org/EduAid
Length of output: 199
🌐 Web query:
yt-dlp subtitle output filename format --write-auto-sub language code
💡 Result:
Pick the auto-sub language (language code)
Use --sub-langs (regex supported) together with --write-auto-subs (aka --write-auto-sub). Example (any English variant like en, en-US, etc.):
```bash
yt-dlp --write-auto-subs --sub-langs "en.*" --skip-download URL
```

--list-subs shows the exact language tags you can match/select. [1]
Control the subtitle output filename template
Set a subtitle-specific output template using the subtitle: output type:
```bash
yt-dlp --write-auto-subs --sub-langs "en.*" \
  -o "%(title)s [%(id)s].%(ext)s" \
  -o "subtitle:%(title)s [%(id)s].%(ext)s" \
  URL
```

yt-dlp supports per-file-type templates (including subtitle:). [2]
Important: yt-dlp will still append the subtitle language to the filename
Even if your subtitle: template doesn’t include a language field, yt-dlp appends the subtitle language/id before the subtitle extension (e.g., My Title [id].en.vtt). This is expected behavior and isn’t removable via the output template alone (people typically rename via --exec afterward if they really need to). [3]
Sources
[1] yt-dlp README / help: subtitle options (--write-auto-subs, --list-subs, --sub-langs with regex) (github.com)
[2] yt-dlp README: per-type output templates (-o "subtitle:...") (github.com)
[3] Discussion showing subtitle language/id is always added to subtitle filenames (reddit.com)
Citations:
- 1: https://github.com/yt-dlp/yt-dlp
- 2: https://github.com/yt-dlp/yt-dlp
- 3: https://www.reddit.com/r/youtubedl/comments/xder5v?utm_source=openai
Use a request-scoped subtitle path instead of a shared fixed filename.
subtitles/{video_id}.vtt fails in three ways: it races when two requests fetch the same video concurrently, the subtitles directory is only created in __main__ so WSGI deployments lack the directory entirely, and the existence check at line 523 always fails because yt-dlp appends the language code to the filename (e.g., {video_id}.en.vtt instead of {video_id}.vtt). A per-request temp directory with glob to find the generated file avoids all three failure modes.
Possible fix
```diff
+import glob
+import tempfile
 ...
-    subtitle_path = os.path.join("subtitles", f"{video_id}.vtt")
-
-    try:
-        subprocess.run(
-            [
-                "yt-dlp",
-                "--write-auto-sub",
-                "--sub-lang", "en",
-                "--skip-download",
-                "--sub-format", "vtt",
-                "-o", subtitle_path,
-                f"https://www.youtube.com/watch?v={video_id}"
-            ],
-            check=True,
-            capture_output=True,
-            text=True
-        )
+    with tempfile.TemporaryDirectory(prefix="subtitles_") as subtitle_dir:
+        output_template = os.path.join(subtitle_dir, video_id)
+        try:
+            subprocess.run(
+                [
+                    "yt-dlp",
+                    "--write-auto-sub",
+                    "--sub-lang", "en",
+                    "--skip-download",
+                    "--sub-format", "vtt",
+                    "-o", output_template,
+                    f"https://www.youtube.com/watch?v={video_id}"
+                ],
+                check=True,
+                capture_output=True,
+                text=True,
+            )
 ...
-    if not os.path.exists(subtitle_path):
-        return jsonify({"error": "No subtitles found"}), 404
-
-    transcript_text = clean_transcript(subtitle_path)
-
-    # Clean up subtitle file
-    os.remove(subtitle_path)
+        matches = glob.glob(os.path.join(subtitle_dir, "*.vtt"))
+        if not matches:
+            return jsonify({"error": "No subtitles found"}), 404
+
+        transcript_text = clean_transcript(matches[0])
```

🧰 Tools
🪛 Ruff (0.15.5)
[error] 504-504: subprocess call: check for execution of untrusted input (S603)
[error] 505-513: Starting a process with a partial executable path (S607)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 500 - 529, Replace the fixed subtitle_path
strategy with a per-request temporary directory: create a
tempfile.TemporaryDirectory() for the request, run subprocess.run (the existing
call in the block around subtitle_path and yt-dlp) with the output directory set
to that temp dir (so yt-dlp writes into an isolated folder), then use glob.glob
to find the generated VTT file for video_id (match f"{video_id}*.vtt" to handle
language suffixes) and pass that found path to clean_transcript; if no match
return the same 404 error, and rely on the TemporaryDirectory context manager to
clean up instead of os.remove; keep existing exception handling for
subprocess.CalledProcessError and FileNotFoundError but update messages as
needed and reference variables subtitle_path, subprocess.run, clean_transcript,
and os.remove when making the change.
Addressed Issues:
Fixes #561
Screenshots/Recordings:
Not applicable. This change introduces backend rate limiting and does not affect the UI.
Additional Notes:
This PR introduces rate limiting for backend API endpoints that trigger computationally expensive NLP model inference operations.
Key changes:
- Added Flask-Limiter with a global default limit of 100 requests per hour
- Applied per-route limits to computationally expensive NLP endpoints
- Added a custom handler that returns HTTP 429 with a JSON error body
Example response when the rate limit is exceeded:
```json
{
  "error": "Rate limit exceeded",
  "code": "rate_limit_exceeded"
}
```
This helps protect the backend from automated abuse and excessive requests, improving system stability and resource management.
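A quick manual check of the 429 behavior, assuming the backend runs at http://localhost:5000 and `/get_mcq` is limited to 20 per minute; the payload fields here are placeholders, not the endpoint's real schema:

```python
import json
import urllib.error
import urllib.request

payload = json.dumps({"input_text": "hello world"}).encode()

for i in range(25):
    req = urllib.request.Request(
        "http://localhost:5000/get_mcq",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            print(i, resp.status)
    except urllib.error.HTTPError as err:
        # Requests beyond the limit should print 429 and the JSON body above
        print(i, err.code, err.read().decode())
```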
AI Usage Disclosure:
I have used the following AI models and tools: ChatGPT (for guidance and explanations)
Checklist
Summary by CodeRabbit

- New Features
- Changes
- Bug Fixes
- Chores