Skip to content

[FEATURE] Add Rate Limiting and Abuse Protection for API Endpoints#563

Open
mahek2016 wants to merge 6 commits intoAOSSIE-Org:mainfrom
mahek2016:feature/api-rate-limiting
Open

[FEATURE] Add Rate Limiting and Abuse Protection for API Endpoints#563
mahek2016 wants to merge 6 commits intoAOSSIE-Org:mainfrom
mahek2016:feature/api-rate-limiting

Conversation

@mahek2016
Copy link
Copy Markdown

@mahek2016 mahek2016 commented Mar 12, 2026

Addressed Issues:

Fixes #561

Screenshots/Recordings:

Not applicable. This change introduces backend rate limiting and does not affect the UI.

Additional Notes:

This PR introduces rate limiting for backend API endpoints that trigger computationally expensive NLP model inference operations.

Key changes:

  • Integrated Flask-Limiter with the Flask backend.
  • Implemented per-IP request limits (20 requests per minute).
  • Protected endpoints such as:
    • /get_mcq
    • /get_boolq
    • /get_shortq
    • /get_mcq_answer
    • /get_shortq_answer
    • /get_boolean_answer
  • Added a structured JSON response when the rate limit is exceeded.

Example response when the rate limit is exceeded:

{
"error": "Rate limit exceeded",
"code": "rate_limit_exceeded"
}

This helps protect the backend from automated abuse and excessive requests, improving system stability and resource management.

AI Usage Disclosure:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: ChatGPT (for guidance and explanations)

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

  • New Features

    • Broad API rate limiting applied across many endpoints; exceeding limits returns a consistent 429 JSON error.
  • Changes

    • External content integration disabled and now returns a standardized 503 service-unavailable response.
    • Rate limits extended to hard/problem and transcript-related endpoints.
  • Bug Fixes

    • Improved transcript handling with validation, better error reporting, and cleanup of temporary files.
  • Chores

    • Dependency added to support rate-limiting behavior.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds Flask-Limiter to backend/server.py with a global limiter, a rate_limit_handler returning JSON 429 on exceed, applies per-route limits across many endpoints (MCQ/short/bool/problem/answer variants, hard variants, upload/transcript), replaces /get_content with a 503 response, and improves transcript fetching/cleanup and error reporting.

Changes

Cohort / File(s) Summary
Rate limiting & handlers
backend/server.py
Introduce global limiter (Flask-Limiter), add rate_limit_handler for RateLimitExceeded, apply per-route rate limits across numerous endpoints, and adjust decorator placements.
Transcript & endpoint logic
backend/server.py
Add validation and safe path handling for /getTranscript, improved subtitle fetch error handling, temp file cleanup, and replace /get_content behavior with 503.
Dependencies
requirements.txt
Add Flask-Limiter==3.5.0 and adjust EOF/newline formatting.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant Flask as Flask App
    participant Limiter as Limiter (Flask-Limiter)
    participant Handler as RateLimit Handler
    participant Endpoint as Endpoint Logic

    Client->>Flask: HTTP request to endpoint
    Flask->>Limiter: check request vs route limit
    alt within limit
        Limiter->>Endpoint: allow request
        Endpoint->>Client: 200 + payload
    else limit exceeded
        Limiter->>Handler: trigger RateLimitExceeded
        Handler->>Client: 429 + {"error":"Rate limit exceeded","code":"rate_limit_exceeded"}
    end
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped into the server with bright, tiny paws,
Counting each request and guarding the cause.
A nibble of limits, a gentle, firm cue,
Now models can breathe — and I nibble on code too. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (3 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR implements rate limiting for NLP endpoints, but applies a 100/hour global limit and additional per-route limits instead of the specified 20 requests per minute per IP, and diverges by modifying unrelated endpoints like /get_content and /getTranscript. Align rate limits to 20 requests per minute per IP for NLP inference endpoints (#561), remove unrelated changes like /get_content 503 response and /getTranscript modifications, and verify the implementation matches the exact specifications.
Out of Scope Changes check ⚠️ Warning The PR includes out-of-scope modifications: replacing /get_content with a 503 response and adding robust validation/cleanup to /getTranscript endpoint, neither of which are mentioned in the linked issues. Remove the /get_content replacement (503 response) and the /getTranscript security enhancements, keeping only rate limiting for the specified NLP inference endpoints (#561).
Docstring Coverage ⚠️ Warning Docstring coverage is 6.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly describes the main feature added: rate limiting and abuse protection for API endpoints, which matches the primary objective of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/server.py (1)

140-170: ⚠️ Potential issue | 🟠 Major

Per-request work is still unbounded on the answer endpoints.

These handlers accept arbitrary input_question arrays and then run model logic in a loop. A single oversized request can still force hundreds of evaluations while only consuming one rate-limit hit, which undercuts the abuse protection this PR is adding. Add a hard batch-size cap before the loops, and cap input_options on /get_mcq_answer as well.

Possible guardrail
+MAX_ANSWER_BATCH = 20
+
 def get_mcq_answer():
     data = request.get_json()
     input_text = data.get("input_text", "")
     input_questions = data.get("input_question", [])
     input_options = data.get("input_options", [])
+    if len(input_questions) > MAX_ANSWER_BATCH or len(input_options) > MAX_ANSWER_BATCH:
+        return jsonify({
+            "error": "Too many questions in one request",
+            "code": "batch_too_large",
+        }), 400
     outputs = []

 def get_answer():
     data = request.get_json()
     input_text = data.get("input_text", "")
     input_questions = data.get("input_question", [])
+    if len(input_questions) > MAX_ANSWER_BATCH:
+        return jsonify({
+            "error": "Too many questions in one request",
+            "code": "batch_too_large",
+        }), 400

 def get_boolean_answer():
     data = request.get_json()
     input_text = data.get("input_text", "")
     input_questions = data.get("input_question", [])
+    if len(input_questions) > MAX_ANSWER_BATCH:
+        return jsonify({
+            "error": "Too many questions in one request",
+            "code": "batch_too_large",
+        }), 400

Also applies to: 174-184, 188-204

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 140 - 170, The get_mcq_answer handler
performs unbounded per-request work by looping over input_question and
input_options and running qa_model, TF-IDF and cosine_similarity for each item;
add a hard batch-size cap (e.g., MAX_BATCH) at the start of get_mcq_answer to
truncate or reject input_question and input_options so you never process more
than the cap, and also enforce a MAX_OPTIONS per question to truncate
input_options entries before vectorizing; ensure you keep the lengths matched
(truncate input_questions and corresponding input_options together) and return a
clear short-circuit response if nothing remains. Also apply the same pattern to
the other QA handlers that loop per-request (the other endpoints referenced in
the review) so all loops are bounded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/server.py`:
- Around line 73-74: The hard-generation endpoints get_shortq_hard,
get_mcq_hard, and get_boolq_hard are missing rate limits and allow bypassing the
new protection; add the same limiter decorator used on get_mcq (e.g.,
`@limiter.limit`("20 per minute")) above each of those function definitions and
any other unprotected generator endpoints referenced in the comment (lines
calling qg.generate), ensure the limiter import/instance is in scope, and run
the app/tests to confirm the decorators are applied correctly.
- Around line 209-211: The /get_content handler currently returns a 200 JSON
placeholder ("Google Docs integration disabled for local testing") which breaks
callers expecting document text; change the handler (the function that serves
the /get_content endpoint in backend/server.py) to gate Docs-disabled behavior
behind a config flag (e.g., is_docs_enabled or an environment var) and when Docs
is disabled return a non-2xx error response (HTTP 503 or 400) with a clear error
message instead of a 200 payload so callers in Text_Input.jsx and TextInput.jsx
follow their error path; ensure the response uses Flask's abort or
jsonify+status_code so the HTTP status is correct.

In `@requirements.txt`:
- Line 35: Remove the unused dependency slowapi from requirements by deleting
the "slowapi" entry in requirements.txt; ensure the project uses the existing
Flask-Limiter implementation (no code references to slowapi should remain), then
update your dependency lockfile or regenerate requirements (pip-compile / pip
freeze) and run the test suite/CI to confirm nothing breaks.

---

Outside diff comments:
In `@backend/server.py`:
- Around line 140-170: The get_mcq_answer handler performs unbounded per-request
work by looping over input_question and input_options and running qa_model,
TF-IDF and cosine_similarity for each item; add a hard batch-size cap (e.g.,
MAX_BATCH) at the start of get_mcq_answer to truncate or reject input_question
and input_options so you never process more than the cap, and also enforce a
MAX_OPTIONS per question to truncate input_options entries before vectorizing;
ensure you keep the lengths matched (truncate input_questions and corresponding
input_options together) and return a clear short-circuit response if nothing
remains. Also apply the same pattern to the other QA handlers that loop
per-request (the other endpoints referenced in the review) so all loops are
bounded.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eeb345f5-f52a-4f76-bdae-e2f04f5bef03

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and 512c301.

📒 Files selected for processing (2)
  • backend/server.py
  • requirements.txt

Comment thread backend/server.py
Comment thread backend/server.py Outdated
Comment on lines +209 to +211
return jsonify({
"message": "Google Docs integration disabled for local testing"
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t return a 200 placeholder from /get_content.

Both eduaid_web/src/pages/Text_Input.jsx at Lines 54-64 and extension/src/pages/text_input/TextInput.jsx at Lines 67-84 treat a successful response as extracted document text and assign it directly to UI state. This 200 OK object changes the contract and will surface a broken value instead of driving their existing error path. If local testing needs Docs disabled, gate it with config and return a non-2xx error.

Safer fallback if Docs must stay disabled locally
 def get_content():
     return jsonify({
-        "message": "Google Docs integration disabled for local testing"
-    })
+        "error": "Google Docs integration unavailable",
+        "code": "google_docs_disabled",
+    }), 503
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return jsonify({
"message": "Google Docs integration disabled for local testing"
})
return jsonify({
"error": "Google Docs integration unavailable",
"code": "google_docs_disabled",
}), 503
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 209 - 211, The /get_content handler currently
returns a 200 JSON placeholder ("Google Docs integration disabled for local
testing") which breaks callers expecting document text; change the handler (the
function that serves the /get_content endpoint in backend/server.py) to gate
Docs-disabled behavior behind a config flag (e.g., is_docs_enabled or an
environment var) and when Docs is disabled return a non-2xx error response (HTTP
503 or 400) with a clear error message instead of a 200 payload so callers in
Text_Input.jsx and TextInput.jsx follow their error path; ensure the response
uses Flask's abort or jsonify+status_code so the HTTP status is correct.

Comment thread requirements.txt Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (5)
backend/server.py (4)

46-51: Prefix unused parameter to suppress linter warning.

The e parameter is required by Flask's error handler signature but unused here. Prefix it with underscore to indicate it's intentionally ignored.

Proposed fix
 `@app.errorhandler`(RateLimitExceeded)
-def rate_limit_handler(e):
+def rate_limit_handler(_e):
     return jsonify({
         "error": "Rate limit exceeded",
         "code": "rate_limit_exceeded"
     }), 429
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 46 - 51, The error handler function
rate_limit_handler currently declares an unused parameter named e; to suppress
linter warnings, rename that parameter to _e (or _ ) in the rate_limit_handler
signature so Flask still receives the exception argument but linters recognize
it as intentionally unused, leaving the function body and return value
unchanged.

489-510: Consider rate-limiting /getTranscript endpoint.

This endpoint spawns a subprocess (yt-dlp) and performs file I/O, making it susceptible to abuse. While the PR scope focuses on NLP endpoints, this could also exhaust resources under heavy load.

Proposed fix
 `@app.route`('/getTranscript', methods=['GET'])
+@limiter.limit("10 per minute")
 def get_transcript():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 489 - 510, The /getTranscript handler
(function get_transcript) can be abused because it spawns subprocess.run and
does file I/O; add rate-limiting to prevent resource exhaustion by either
integrating Flask-Limiter and decorating get_transcript with a sensible per-IP
limit (e.g., "10 per minute" with a smaller burst) or implementing a simple
in-memory token-bucket/Leaky-Bucket decorator that returns HTTP 429 when limits
are exceeded; ensure the limiter applies before calling subprocess.run so
abusive requests are rejected early and include a clear 429 JSON response, and
document/configure the limit values centrally so they can be tuned.

441-456: Consider rate-limiting the /upload endpoint.

File processing can be resource-intensive. A rate limit would provide consistent protection against abuse.

Proposed fix
 `@app.route`('/upload', methods=['POST'])
+@limiter.limit("20 per minute")
 def upload_file():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 441 - 456, The /upload endpoint (function
upload_file) lacks rate limiting which can allow abusive or expensive
file_processor.process_file calls; add a rate limiter (e.g., Flask-Limiter or a
lightweight token-bucket/middleware) and apply it to the upload_file route
(decorate or wrap the route handler) with sensible limits (e.g., requests per
minute per IP or authenticated user), ensure limiter returns 429 on exceeded
quota and include a clear Retry-After header/message, and add tests to verify
limit enforcement and that file_processor.process_file is still invoked when
under the limit.

38-42: Configure a shared storage backend for production deployments.

Flask-Limiter defaults to in-memory storage when no storage_uri is specified. With in-memory storage:

  • Rate limits aren't shared across Gunicorn workers—each process maintains independent counters, allowing clients to exceed limits by being routed to different workers
  • Limits reset on application restart
  • Won't work in distributed environments with multiple server instances

For production deployments with multiple workers or instances, configure a shared backend like Redis:

Example with Redis
limiter = Limiter(
    key_func=get_remote_address,
    app=app,
    default_limits=["100 per hour"],
    storage_uri="redis://localhost:6379"
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 38 - 42, The Limiter is currently created
without a persistent storage backend (Limiter(... key_func=get_remote_address,
app=app, default_limits=["100 per hour"])), which uses in-memory storage and
breaks rate limiting across Gunicorn workers or multiple instances; update the
Limiter instantiation to include a shared storage backend by adding an
appropriate storage_uri (e.g., a Redis URI like "redis://host:port") so that the
Limiter (class) shares counters across processes and survives restarts—modify
the Limiter call that references get_remote_address, app, and default_limits to
pass storage_uri pointing to your production Redis (or other supported backend).
requirements.txt (1)

36-36: Consider pinning the Flask-Limiter version.

Unpinned dependencies can lead to unexpected behavior if a breaking release is published. Consider pinning to a specific version (e.g., Flask-Limiter==3.5.0) consistent with other pinned dependencies in this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` at line 36, The requirements entry for Flask-Limiter is
unpinned; update the dependency line for "Flask-Limiter" to a pinned version
(for example "Flask-Limiter==3.5.0") to match the project's pinned-dependency
practices and prevent accidental breaking upgrades; ensure the formatting
matches other entries in requirements.txt.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/server.py`:
- Around line 46-51: The error handler function rate_limit_handler currently
declares an unused parameter named e; to suppress linter warnings, rename that
parameter to _e (or _ ) in the rate_limit_handler signature so Flask still
receives the exception argument but linters recognize it as intentionally
unused, leaving the function body and return value unchanged.
- Around line 489-510: The /getTranscript handler (function get_transcript) can
be abused because it spawns subprocess.run and does file I/O; add rate-limiting
to prevent resource exhaustion by either integrating Flask-Limiter and
decorating get_transcript with a sensible per-IP limit (e.g., "10 per minute"
with a smaller burst) or implementing a simple in-memory
token-bucket/Leaky-Bucket decorator that returns HTTP 429 when limits are
exceeded; ensure the limiter applies before calling subprocess.run so abusive
requests are rejected early and include a clear 429 JSON response, and
document/configure the limit values centrally so they can be tuned.
- Around line 441-456: The /upload endpoint (function upload_file) lacks rate
limiting which can allow abusive or expensive file_processor.process_file calls;
add a rate limiter (e.g., Flask-Limiter or a lightweight
token-bucket/middleware) and apply it to the upload_file route (decorate or wrap
the route handler) with sensible limits (e.g., requests per minute per IP or
authenticated user), ensure limiter returns 429 on exceeded quota and include a
clear Retry-After header/message, and add tests to verify limit enforcement and
that file_processor.process_file is still invoked when under the limit.
- Around line 38-42: The Limiter is currently created without a persistent
storage backend (Limiter(... key_func=get_remote_address, app=app,
default_limits=["100 per hour"])), which uses in-memory storage and breaks rate
limiting across Gunicorn workers or multiple instances; update the Limiter
instantiation to include a shared storage backend by adding an appropriate
storage_uri (e.g., a Redis URI like "redis://host:port") so that the Limiter
(class) shares counters across processes and survives restarts—modify the
Limiter call that references get_remote_address, app, and default_limits to pass
storage_uri pointing to your production Redis (or other supported backend).

In `@requirements.txt`:
- Line 36: The requirements entry for Flask-Limiter is unpinned; update the
dependency line for "Flask-Limiter" to a pinned version (for example
"Flask-Limiter==3.5.0") to match the project's pinned-dependency practices and
prevent accidental breaking upgrades; ensure the formatting matches other
entries in requirements.txt.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6b4d84e6-b8c1-46ac-a030-91afb37e109a

📥 Commits

Reviewing files that changed from the base of the PR and between 512c301 and 44578dd.

📒 Files selected for processing (2)
  • backend/server.py
  • requirements.txt

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
backend/server.py (1)

214-215: Consider rate-limiting /generate_gform.

This endpoint makes external Google Forms API calls and triggers OAuth flows. While not an NLP inference endpoint, unthrottled access could exhaust API quotas, spam form creation, or trigger Google rate limits upstream.

♻️ Suggested addition
 `@app.route`("/generate_gform", methods=["POST"])
+@limiter.limit("5 per minute")
 def generate_gform():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 214 - 215, Add rate-limiting to the
/generate_gform endpoint to prevent quota exhaustion and abuse:
install/configure a limiter (e.g., Flask-Limiter) and apply it to the
generate_gform view (or wrap with a custom rate_limit decorator) so requests to
generate_gform are throttled per client/IP (for example N requests per minute)
and return a 429 on exceed; make sure to import and initialize the limiter at
module scope, attach the limiter to the app, and document any admin/exemptions
if needed so the symbol generate_gform (and its `@app.route`) is protected by the
limiter.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/server.py`:
- Around line 214-215: Add rate-limiting to the /generate_gform endpoint to
prevent quota exhaustion and abuse: install/configure a limiter (e.g.,
Flask-Limiter) and apply it to the generate_gform view (or wrap with a custom
rate_limit decorator) so requests to generate_gform are throttled per client/IP
(for example N requests per minute) and return a 429 on exceed; make sure to
import and initialize the limiter at module scope, attach the limiter to the
app, and document any admin/exemptions if needed so the symbol generate_gform
(and its `@app.route`) is protected by the limiter.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b20ea0b0-75e9-45f9-bf2b-097aba2dfc9b

📥 Commits

Reviewing files that changed from the base of the PR and between 44578dd and 33c8a42.

📒 Files selected for processing (2)
  • backend/server.py
  • requirements.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • requirements.txt

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/server.py (3)

503-511: ⚠️ Potential issue | 🟠 Major

Race condition in subtitle file handling.

Using glob("subtitles/*.vtt") and selecting by max(ctime) can return a different request's file when concurrent requests are processed. One request may also delete another's file before it's read.

Use the expected filename directly instead of globbing:

🔧 Proposed fix
-    # Find the latest .vtt file in the "subtitles" folder
-    subtitle_files = glob.glob("subtitles/*.vtt")
-    if not subtitle_files:
-        return jsonify({"error": "No subtitles found"}), 404
-
-    latest_subtitle = max(subtitle_files, key=os.path.getctime)
-    transcript_text = clean_transcript(latest_subtitle)
-
-    # Optional: Clean up the file after reading
-    os.remove(latest_subtitle)
+    # yt-dlp may append language code to filename
+    expected_file = f"subtitles/{video_id}.en.vtt"
+    if not os.path.exists(expected_file):
+        # Fallback to check without language suffix
+        expected_file = f"subtitles/{video_id}.vtt"
+    
+    if not os.path.exists(expected_file):
+        return jsonify({"error": "No subtitles found"}), 404
+
+    transcript_text = clean_transcript(expected_file)
+    os.remove(expected_file)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 503 - 511, The current logic uses
glob("subtitles/*.vtt") and max(os.path.getctime) then
os.remove(latest_subtitle) which causes race conditions across concurrent
requests; change to use the concrete expected filename (pass the filename or
unique temp path into the handler instead of scanning) and operate only on that
path when calling clean_transcript(filename) and os.remove(filename);
alternatively generate a per-request unique filename when saving uploads and
reference that unique name in the request flow so functions like
clean_transcript, the variable latest_subtitle, and the deletion call operate on
a single known file rather than the glob result.

491-513: ⚠️ Potential issue | 🔴 Critical

Command injection and path traversal vulnerability via unsanitized video_id.

The video_id parameter is taken directly from user input and interpolated into both the subprocess command and file path without validation. While using a list with subprocess.run mitigates shell injection, the unsanitized input still poses risks:

  1. Path traversal: A video_id like ../../etc/passwd could write/read outside the intended directory.
  2. Unexpected behavior: Special characters could cause yt-dlp to misinterpret arguments.

Validate that video_id matches an expected YouTube ID format (11 alphanumeric characters plus - and _):

🛡️ Proposed fix
 `@app.route`('/getTranscript', methods=['GET'])
 `@limiter.limit`("10 per minute")
 def get_transcript():
     video_id = request.args.get('videoId')
     if not video_id:
         return jsonify({"error": "No video ID provided"}), 400
+    
+    # Validate video_id format (YouTube IDs are 11 chars: alphanumeric, -, _)
+    if not re.match(r'^[a-zA-Z0-9_-]{11}$', video_id):
+        return jsonify({"error": "Invalid video ID format"}), 400

     subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 491 - 513, The get_transcript handler
currently uses unsanitized video_id in subprocess.run and file paths; validate
video_id first (e.g., require a regex ^[A-Za-z0-9_-]{11}$) and return 400 on
mismatch, then use the validated id only. After validation, construct the yt-dlp
call using the validated id for the URL and avoid interpolating user input into
arbitrary paths by using os.path.join("subtitles", f"{video_id}.vtt") (or use a
safe temporary filename via tempfile with the validated id as a suffix). Also
ensure any file removal/read uses that same safe path variable (latest_subtitle
should be constrained to the subtitles directory or resolved via os.path.abspath
and checked to start with the subtitles directory) before calling
clean_transcript and os.remove.

498-500: ⚠️ Potential issue | 🟠 Major

Unhandled CalledProcessError will cause 500 errors.

With check=True, a failed yt-dlp command (e.g., video unavailable, no subtitles, network error) raises CalledProcessError, resulting in an unhandled 500 response to clients.

🔧 Proposed fix
+    try:
+        subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
+                    "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
+                   check=True, capture_output=True, text=True)
+    except subprocess.CalledProcessError as e:
+        return jsonify({"error": "Failed to fetch transcript", "details": e.stderr}), 502
-    subprocess.run(["yt-dlp", "--write-auto-sub", "--sub-lang", "en", "--skip-download",
-                "--sub-format", "vtt", "-o", f"subtitles/{video_id}.vtt", f"https://www.youtube.com/watch?v={video_id}"],
-               check=True, capture_output=True, text=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 498 - 500, Wrap the subprocess.run(["yt-dlp",
...], check=True, capture_output=True, text=True) call that downloads subtitles
for video_id in a try/except; catch subprocess.CalledProcessError as e (and
optionally FileNotFoundError) and log e.returncode and e.stderr along with the
video_id, then return/raise an appropriate HTTP error or domain-specific
exception instead of letting the CalledProcessError propagate (e.g., use Flask's
abort or return a 404/400 with a clear message) so failures like missing
subtitles, network errors or unavailable videos are handled gracefully.
🧹 Nitpick comments (1)
backend/server.py (1)

38-42: In-memory storage won't persist limits across workers.

Without specifying a storage_uri, Flask-Limiter defaults to in-memory storage. This means rate limits won't be shared across multiple Gunicorn/uWSGI workers or server instances, allowing clients to effectively multiply their allowed requests.

For production with multiple workers, configure a shared backend:

limiter = Limiter(
    key_func=get_remote_address,
    app=app,
    default_limits=["100 per hour"],
    storage_uri="redis://localhost:6379"  # or memcached://...
)

If single-worker local development is the only target, this is acceptable but should be documented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 38 - 42, The Limiter instantiation (Limiter,
key_func=get_remote_address, app=app) currently uses default in-memory storage
which won't persist limits across workers; update the Limiter call to supply a
shared backend via the storage_uri parameter (e.g., "redis://..." or
"memcached://...") so rate limits are shared between Gunicorn/uWSGI workers in
production, or alternatively add a clear comment near the Limiter creation
explaining that in-memory is intentionally used only for single-worker local
dev.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@backend/server.py`:
- Around line 503-511: The current logic uses glob("subtitles/*.vtt") and
max(os.path.getctime) then os.remove(latest_subtitle) which causes race
conditions across concurrent requests; change to use the concrete expected
filename (pass the filename or unique temp path into the handler instead of
scanning) and operate only on that path when calling clean_transcript(filename)
and os.remove(filename); alternatively generate a per-request unique filename
when saving uploads and reference that unique name in the request flow so
functions like clean_transcript, the variable latest_subtitle, and the deletion
call operate on a single known file rather than the glob result.
- Around line 491-513: The get_transcript handler currently uses unsanitized
video_id in subprocess.run and file paths; validate video_id first (e.g.,
require a regex ^[A-Za-z0-9_-]{11}$) and return 400 on mismatch, then use the
validated id only. After validation, construct the yt-dlp call using the
validated id for the URL and avoid interpolating user input into arbitrary paths
by using os.path.join("subtitles", f"{video_id}.vtt") (or use a safe temporary
filename via tempfile with the validated id as a suffix). Also ensure any file
removal/read uses that same safe path variable (latest_subtitle should be
constrained to the subtitles directory or resolved via os.path.abspath and
checked to start with the subtitles directory) before calling clean_transcript
and os.remove.
- Around line 498-500: Wrap the subprocess.run(["yt-dlp", ...], check=True,
capture_output=True, text=True) call that downloads subtitles for video_id in a
try/except; catch subprocess.CalledProcessError as e (and optionally
FileNotFoundError) and log e.returncode and e.stderr along with the video_id,
then return/raise an appropriate HTTP error or domain-specific exception instead
of letting the CalledProcessError propagate (e.g., use Flask's abort or return a
404/400 with a clear message) so failures like missing subtitles, network errors
or unavailable videos are handled gracefully.

---

Nitpick comments:
In `@backend/server.py`:
- Around line 38-42: The Limiter instantiation (Limiter,
key_func=get_remote_address, app=app) currently uses default in-memory storage
which won't persist limits across workers; update the Limiter call to supply a
shared backend via the storage_uri parameter (e.g., "redis://..." or
"memcached://...") so rate limits are shared between Gunicorn/uWSGI workers in
production, or alternatively add a clear comment near the Limiter creation
explaining that in-memory is intentionally used only for single-worker local
dev.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d81a68de-5c15-4c0b-8f24-1624abb121e3

📥 Commits

Reviewing files that changed from the base of the PR and between 33c8a42 and dc6b0c7.

📒 Files selected for processing (1)
  • backend/server.py

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/server.py (1)

431-440: ⚠️ Potential issue | 🔴 Critical

Preserve the boolean question objects when hardening.

Here Line 438 sends the whole generated item into make_question_harder and returns those transformed values directly. That is inconsistent with the other hard endpoints and will either fail on dict input or collapse the response from {"question": ...} objects into plain strings.

Possible fix
-    harder_questions = [make_question_harder(q) for q in generated]
-
-    return jsonify({"output": harder_questions})
+    for item in generated:
+        item["question"] = make_question_harder(item["question"])
+
+    return jsonify({"output": generated})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 431 - 440, The code passes entire generated
items into make_question_harder and returns those results, which collapses
question objects into plain strings or breaks dict handling; instead iterate
over each item in generated, call make_question_harder only on the question text
(e.g., item["question"] or the appropriate field) and build a new list of
question objects by copying each original dict and replacing its question field
with the hardened question (preserving boolean/answer fields for "true_false"
style). Update the mapping that produces harder_questions so jsonify returns the
list of intact objects rather than raw strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/server.py`:
- Around line 503-521: The subprocess.run call that invokes yt-dlp (in
backend/server.py within the try block that builds the args list including
"yt-dlp" and f"https://www.youtube.com/watch?v={video_id}") must be bounded by a
timeout: add the timeout parameter (e.g. timeout=30 or a configurable value) to
the subprocess.run invocation and catch subprocess.TimeoutExpired in the
surrounding try/except; in the TimeoutExpired handler return a suitable error
response (e.g. jsonify({"error":"yt-dlp timed out"}), 504) and optionally
log/attach stderr for diagnostics, keeping the existing CalledProcessError and
FileNotFoundError handlers intact.
- Around line 38-42: The global Limiter instantiation currently sets
default_limits=["100 per hour"], which unintentionally rate-limits all
undecorated routes (e.g., the root handler "/" and the get_content endpoint
handled by the function for "/get_content"); remove the default_limits argument
from the Limiter(...) call or keep it but explicitly mark
non-inference/monitoring handlers with `@limiter.exempt` (apply the decorator to
the route functions that serve health checks/monitoring such as the "/" handler
and the function serving "/get_content") so that only intended inference
endpoints remain rate-limited.
- Around line 500-529: Replace the fixed subtitle_path strategy with a
per-request temporary directory: create a tempfile.TemporaryDirectory() for the
request, run subprocess.run (the existing call in the block around subtitle_path
and yt-dlp) with the output directory set to that temp dir (so yt-dlp writes
into an isolated folder), then use glob.glob to find the generated VTT file for
video_id (match f"{video_id}*.vtt" to handle language suffixes) and pass that
found path to clean_transcript; if no match return the same 404 error, and rely
on the TemporaryDirectory context manager to clean up instead of os.remove; keep
existing exception handling for subprocess.CalledProcessError and
FileNotFoundError but update messages as needed and reference variables
subtitle_path, subprocess.run, clean_transcript, and os.remove when making the
change.

---

Outside diff comments:
In `@backend/server.py`:
- Around line 431-440: The code passes entire generated items into
make_question_harder and returns those results, which collapses question objects
into plain strings or breaks dict handling; instead iterate over each item in
generated, call make_question_harder only on the question text (e.g.,
item["question"] or the appropriate field) and build a new list of question
objects by copying each original dict and replacing its question field with the
hardened question (preserving boolean/answer fields for "true_false" style).
Update the mapping that produces harder_questions so jsonify returns the list of
intact objects rather than raw strings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 42bc4bcf-ecd0-49c2-be08-76d421d3e33c

📥 Commits

Reviewing files that changed from the base of the PR and between dc6b0c7 and 30a433f.

📒 Files selected for processing (1)
  • backend/server.py

Comment thread backend/server.py
Comment thread backend/server.py Outdated
Comment on lines +500 to +529
# Safe subtitle file path
subtitle_path = os.path.join("subtitles", f"{video_id}.vtt")

try:
subprocess.run(
[
"yt-dlp",
"--write-auto-sub",
"--sub-lang", "en",
"--skip-download",
"--sub-format", "vtt",
"-o", subtitle_path,
f"https://www.youtube.com/watch?v={video_id}"
],
check=True,
capture_output=True,
text=True
)
except subprocess.CalledProcessError:
return jsonify({"error": "Failed to fetch subtitles"}), 400
except FileNotFoundError:
return jsonify({"error": "yt-dlp is not installed on the server"}), 500

if not os.path.exists(subtitle_path):
return jsonify({"error": "No subtitles found"}), 404

latest_subtitle = max(subtitle_files, key=os.path.getctime)
transcript_text = clean_transcript(latest_subtitle)
transcript_text = clean_transcript(subtitle_path)

# Optional: Clean up the file after reading
os.remove(latest_subtitle)
# Clean up subtitle file
os.remove(subtitle_path)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate the backend/server.py file
find . -name "server.py" -o -name "*.py" | grep -E "(backend|server)" | head -20

Repository: AOSSIE-Org/EduAid

Length of output: 264


🏁 Script executed:

# Check if backend/server.py exists and examine lines around 500-529
if [ -f "backend/server.py" ]; then
  wc -l backend/server.py
  echo "=== Lines 500-529 ===" 
  sed -n '500,529p' backend/server.py
fi

Repository: AOSSIE-Org/EduAid

Length of output: 1060


🏁 Script executed:

# Search for subtitles directory creation in the codebase
rg "subtitles" --type py -B 3 -A 3

Repository: AOSSIE-Org/EduAid

Length of output: 1237


🏁 Script executed:

# Look for __main__ and directory creation patterns
rg "if __name__ == \"__main__\"" --type py -A 20

Repository: AOSSIE-Org/EduAid

Length of output: 199


🌐 Web query:

yt-dlp subtitle output filename format --write-auto-sub language code

💡 Result:

Pick the auto-sub language (language code)

Use --sub-langs (regex supported) together with --write-auto-subs (aka --write-auto-sub). Example (any English variant like en, en-US, etc.):

yt-dlp --write-auto-subs --sub-langs "en.*" --skip-download URL

--list-subs shows the exact language tags you can match/select. [1]

Control the subtitle output filename template

Set a subtitle-specific output template using the subtitle: output type:

yt-dlp --write-auto-subs --sub-langs "en.*" \
  -o "%(title)s [%(id)s].%(ext)s" \
  -o "subtitle:%(title)s [%(id)s].%(ext)s" \
  URL

yt-dlp supports per-file-type templates (including subtitle:). [2]

Important: yt-dlp will still append the subtitle language to the filename

Even if your subtitle: template doesn’t include a language field, yt-dlp appends the subtitle language/id before the subtitle extension (e.g., My Title [id].en.vtt). This is expected behavior and isn’t removable via the output template alone (people typically rename via --exec afterward if they really need to). [3]


Sources
[1] yt-dlp README / help: subtitle options (--write-auto-subs, --list-subs, --sub-langs with regex) (github.com)
[2] yt-dlp README: per-type output templates (-o "subtitle:...") (github.com)
[3] Discussion showing subtitle language/id is always added to subtitle filenames (reddit.com)

Citations:


Use a request-scoped subtitle path instead of a shared fixed filename.

subtitles/{video_id}.vtt fails in three ways: it races when two requests fetch the same video concurrently, the subtitles directory is only created in __main__ so WSGI deployments lack the directory entirely, and the existence check at line 523 always fails because yt-dlp appends the language code to the filename (e.g., {video_id}.en.vtt instead of {video_id}.vtt). A per-request temp directory with glob to find the generated file avoids all three failure modes.

Possible fix
+import glob
+import tempfile
...
-    subtitle_path = os.path.join("subtitles", f"{video_id}.vtt")
-
-    try:
-        subprocess.run(
-            [
-                "yt-dlp",
-                "--write-auto-sub",
-                "--sub-lang", "en",
-                "--skip-download",
-                "--sub-format", "vtt",
-                "-o", subtitle_path,
-                f"https://www.youtube.com/watch?v={video_id}"
-            ],
-            check=True,
-            capture_output=True,
-            text=True
-        )
+    with tempfile.TemporaryDirectory(prefix="subtitles_") as subtitle_dir:
+        output_template = os.path.join(subtitle_dir, video_id)
+        try:
+            subprocess.run(
+                [
+                    "yt-dlp",
+                    "--write-auto-sub",
+                    "--sub-lang", "en",
+                    "--skip-download",
+                    "--sub-format", "vtt",
+                    "-o", output_template,
+                    f"https://www.youtube.com/watch?v={video_id}"
+                ],
+                check=True,
+                capture_output=True,
+                text=True,
+            )
         ...
-    if not os.path.exists(subtitle_path):
-        return jsonify({"error": "No subtitles found"}), 404
-
-    transcript_text = clean_transcript(subtitle_path)
-
-    # Clean up subtitle file
-    os.remove(subtitle_path)
+        matches = glob.glob(os.path.join(subtitle_dir, "*.vtt"))
+        if not matches:
+            return jsonify({"error": "No subtitles found"}), 404
+
+        transcript_text = clean_transcript(matches[0])
🧰 Tools
🪛 Ruff (0.15.5)

[error] 504-504: subprocess call: check for execution of untrusted input

(S603)


[error] 505-513: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 500 - 529, Replace the fixed subtitle_path
strategy with a per-request temporary directory: create a
tempfile.TemporaryDirectory() for the request, run subprocess.run (the existing
call in the block around subtitle_path and yt-dlp) with the output directory set
to that temp dir (so yt-dlp writes into an isolated folder), then use glob.glob
to find the generated VTT file for video_id (match f"{video_id}*.vtt" to handle
language suffixes) and pass that found path to clean_transcript; if no match
return the same 404 error, and rely on the TemporaryDirectory context manager to
clean up instead of os.remove; keep existing exception handling for
subprocess.CalledProcessError and FileNotFoundError but update messages as
needed and reference variables subtitle_path, subprocess.run, clean_transcript,
and os.remove when making the change.

Comment thread backend/server.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Add Rate Limiting and Abuse Protection for API Endpoints

1 participant