feat: Add question quality scoring and improve input validation #598
piyush06singhal wants to merge 8 commits into AOSSIE-Org:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds an optional question-quality scoring and ranking layer to the MCQ/BoolQ/ShortQ endpoints, plus logging, a thread-safe lazy QAEvaluator singleton, input validation helpers, scoring/ranking logic with safe fallbacks, and tests covering scoring behavior and edge cases.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Endpoint as /get_mcq|/get_boolq|/get_shortq
    participant Generator as QuestionGenerator
    participant Scorer as score_and_rank_questions
    participant QAEval as QAEvaluator
    participant Response

    Client->>Endpoint: POST (input_text, max_questions, use_scoring:true)
    Endpoint->>Endpoint: validate input_text, parse_max_questions
    alt invalid input_text or max_questions
        Endpoint->>Response: 400 Error
    else valid
        Endpoint->>Generator: generate questions (limit = max_questions × 2)
        Generator->>Endpoint: generated questions
        alt use_scoring == true and >= 3 questions
            Endpoint->>Scorer: score_and_rank_questions(questions)
            Scorer->>QAEval: encode_qa_pairs
            QAEval->>Scorer: encoded vectors
            Scorer->>QAEval: get_scores
            QAEval->>Scorer: scores / indices
            Scorer->>Scorer: sort by score, handle exceptions/fallback
            Scorer->>Endpoint: ranked questions
        end
        Endpoint->>Response: return top-N questions (JSON)
    end
    Response->>Client: JSON payload
```
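For readers who prefer code to diagrams, a minimal Flask sketch of the same flow follows. It is illustrative only: the route name matches the PR, but `generate_questions`, the inline validation, and the pass-through `score_and_rank_questions` are stand-ins for the project's QuestionGenerator, helpers, and QAEvaluator-backed scorer.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_questions(text, limit):
    # placeholder for the project's question generators
    return [{"question": f"Question {i} about the passage?"} for i in range(limit)]

def score_and_rank_questions(questions, _context=None):
    # placeholder scorer; the real one ranks with QAEvaluator, best first
    return questions

@app.route("/get_mcq", methods=["POST"])
def get_mcq():
    data = request.get_json(silent=True) or {}

    input_text = data.get("input_text", "")
    if not isinstance(input_text, str) or not input_text.strip():
        return jsonify({"error": "input_text must be a non-empty string"}), 400

    max_questions = data.get("max_questions", 4)
    if isinstance(max_questions, bool) or not isinstance(max_questions, int) or max_questions <= 0:
        return jsonify({"error": "max_questions must be a positive integer"}), 400

    use_scoring = bool(data.get("use_scoring", False))
    # over-generate when scoring so ranking has something to discard
    limit = max_questions * 2 if use_scoring else max_questions
    questions = generate_questions(input_text, limit)

    if use_scoring and len(questions) >= 3:
        questions = score_and_rank_questions(questions, input_text)

    # slice outside the scoring branch so the cap always holds
    return jsonify({"output": questions[:max_questions]})
```

Note how the final slice sits outside the scoring branch; that detail is the one the review comments below keep returning to.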
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 5
🧹 Nitpick comments (2)
backend/server.py (1)
89-91: Simplify answer extraction using `dict.get`. This removes redundant membership checks and matches Ruff's RUF019 hint.
Proposed fix
```diff
-            if 'answer' in q and q['answer']:
-                answer_texts.append(q['answer'])
-            elif 'options' in q and q['options']:
+            if q.get("answer"):
+                answer_texts.append(q.get("answer"))
+            elif q.get("options"):  # fallback only if answer missing
-                answer_texts.append(q['options'][0])
+                answer_texts.append(q.get("options")[0])
             else:
                 answer_texts.append('')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 89 - 91, Replace the membership checks on q with dict.get to simplify extraction: in the block that appends to answer_texts (referencing q and answer_texts) use q.get('answer') to fetch the answer and fall back to q.get('options') for options instead of testing 'answer' in q / 'options' in q; ensure you keep the same truthiness checks (i.e., only append when the retrieved value is truthy) and preserve the existing control flow that handles answers vs options.

backend/test_server.py (1)
36-46: Strengthen scoring tests with behavioral assertions. These tests currently only check presence of `output`, so ranking/slicing regressions can pass unnoticed. Assert response status, output type, and `len(output) <= max_questions`.

Proposed test enhancement
```diff
 def test_get_mcq_with_scoring():
 @@
     response = make_post_request(endpoint, data)
     print(f'/get_mcq with scoring Response: {response}')
     assert 'output' in response
+    assert isinstance(response['output'], list)
+    assert len(response['output']) <= data['max_questions']

 def test_get_boolq_with_scoring():
 @@
     response = make_post_request(endpoint, data)
     print(f'/get_boolq with scoring Response: {response}')
     assert 'output' in response
+    assert isinstance(response['output'], list)
+    assert len(response['output']) <= data['max_questions']

 def test_get_shortq_with_scoring():
 @@
     response = make_post_request(endpoint, data)
     print(f'/get_shortq with scoring Response: {response}')
     assert 'output' in response
+    assert isinstance(response['output'], list)
+    assert len(response['output']) <= data['max_questions']
```

Also applies to: 57-67, 78-88
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/test_server.py` around lines 36 - 46, The test_get_mcq_with_scoring currently only checks that 'output' exists; update it (and the similar tests at the other ranges) to assert the HTTP success status, output type, and length constraints: after calling make_post_request(endpoint, data) assert the response indicates success (e.g., response.get('status_code') == 200 or response.get('status') == 'ok'), assert isinstance(response['output'], list), and assert len(response['output']) <= data['max_questions']; keep references to test_get_mcq_with_scoring, endpoint '/get_mcq', make_post_request and the response['output'] field so the assertions are added in the same test and mirrored in the other tests mentioned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/server.py`:
- Around line 139-144: The current handlers only slice questions to
max_questions inside the scoring branch, so when use_scoring=True but
len(questions)<3 the result can exceed max_questions; update the logic in the
affected endpoints (the /get_boolq and /get_shortq handlers and the block using
use_scoring, questions, max_questions, and score_and_rank_questions) to always
enforce questions = questions[:max_questions] after any ranking/processing step
(i.e., perform the slicing outside or after the conditional that calls
score_and_rank_questions) so the returned list never exceeds max_questions even
when scoring is skipped.
- Around line 125-132: Validate and normalize max_questions before using it in
arithmetic: parse the incoming data["max_questions"] into an integer (or reject
if it cannot be parsed), ensure it is a positive int (>0), and if not return a
400 response with a clear error message; then use the validated variable (e.g.,
validated_max_questions) when computing generation_limit =
validated_max_questions * 2 if use_scoring else validated_max_questions. Apply
this change wherever generation_limit is computed (the blocks that reference
max_questions, use_scoring, and generation_limit) and ensure
process_input_text(input_text, use_mediawiki) remains unchanged.
- Around line 86-88: The loop that builds scoring inputs uses q.get('question',
'') which misses generator outputs that populate 'question_statement'; update
the logic in the for q in questions loop that appends to question_texts to
prefer q.get('question_statement') and fall back to q.get('question', '') (or
empty string) so scoring uses the actual question text produced by the
MCQ/ShortQ generators; locate the append to question_texts in backend/server.py
and replace the key access accordingly.
- Around line 170-174: The request handler for /get_boolq currently normalizes
boolean_questions into dicts unconditionally (the boolean_questions list
comprehension), which changes the response shape even when scoring is disabled;
change the logic so the original boolean_questions variable is preserved for
non-scoring responses and only create a normalized scoring_payload (e.g.,
scoring_payload = [{"question": q} if isinstance(q, str) else q for q in
boolean_questions]) inside the scoring branch where scoring is performed (the
code that calls the scorer), leaving the top-level boolean_questions list
unchanged and returning the original raw list when scoring is not requested.
- Line 53: The file currently creates qa_evaluator = main.QAEvaluator() at
import time which forces heavy model load even when use_scoring is false; change
this to lazy initialization by setting qa_evaluator = None at module scope and
instantiating main.QAEvaluator() on first scoring request (inside the scoring
handler function that checks use_scoring) while wrapping the construction in a
try/except to catch model load failures, set qa_evaluator back to None on error,
and return a clean fallback response or error message when use_scoring is
disabled or the model failed to load.
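Taken together, the validation and lazy-loading comments above point at two small helpers. The sketch below uses the names from the review (`parse_max_questions`, `get_qa_evaluator`, `main.QAEvaluator`); the bodies are assumptions, not the PR's actual code.

```python
import threading

qa_evaluator = None
_qa_evaluator_lock = threading.Lock()

def parse_max_questions(data, default=4):
    """Return a validated positive integer or raise ValueError (mapped to a 400)."""
    value = data.get("max_questions", default)
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError("max_questions must be a positive integer")
    return value

def get_qa_evaluator():
    """Lazily construct the heavy QAEvaluator once; return None if loading fails."""
    global qa_evaluator
    if qa_evaluator is None:
        with _qa_evaluator_lock:
            if qa_evaluator is None:
                try:
                    import main  # assumed module that exposes QAEvaluator
                    qa_evaluator = main.QAEvaluator()
                except Exception:
                    qa_evaluator = None  # caller falls back to unscored questions
    return qa_evaluator
```

Handlers could then catch the ValueError and return a 400, and skip scoring whenever get_qa_evaluator() returns None.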
---
Nitpick comments:
In `@backend/server.py`:
- Around line 89-91: Replace the membership checks on q with dict.get to
simplify extraction: in the block that appends to answer_texts (referencing q
and answer_texts) use q.get('answer') to fetch the answer and fall back to
q.get('options') for options instead of testing 'answer' in q / 'options' in q;
ensure you keep the same truthiness checks (i.e., only append when the retrieved
value is truthy) and preserve the existing control flow that handles answers vs
options.
In `@backend/test_server.py`:
- Around line 36-46: The test_get_mcq_with_scoring currently only checks that
'output' exists; update it (and the similar tests at the other ranges) to assert
the HTTP success status, output type, and length constraints: after calling
make_post_request(endpoint, data) assert the response indicates success (e.g.,
response.get('status_code') == 200 or response.get('status') == 'ok'), assert
isinstance(response['output'], list), and assert len(response['output']) <=
data['max_questions']; keep references to test_get_mcq_with_scoring, endpoint
'/get_mcq', make_post_request and the response['output'] field so the assertions
are added in the same test and mirrored in the other tests mentioned.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1e354f48-9e0a-4e02-acdc-ef7ecf1c992f
📒 Files selected for processing (2)
- backend/server.py
- backend/test_server.py
Actionable comments posted: 2
🧹 Nitpick comments (2)
backend/server.py (2)
88-99: Unused parameter: `context` is documented but never used. The `context` parameter is defined in the signature and documented in the docstring, but it's not referenced anywhere in the function body.

♻️ Either remove the parameter or use it for enhanced scoring
If the parameter is intended for future use, consider removing it for now and adding it when needed:
```diff
-def score_and_rank_questions(questions, context):
+def score_and_rank_questions(questions):
     """
     Use QAEvaluator to score and rank questions based on quality.
     Returns sorted list with best questions first.

     Args:
         questions: List of question dictionaries
-        context: The input text context

     Returns:
         List of questions sorted by quality score (best first)
     """
```

Then update call sites to remove the `input_text` argument.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 88 - 99, The function score_and_rank_questions currently declares and documents the parameter context but never uses it; either remove the unused parameter from the signature and docstring in score_and_rank_questions and update all call sites that pass input_text, or incorporate context into the scoring logic (e.g., pass context into QAEvaluator calls inside score_and_rank_questions and update the docstring to reflect its use); ensure the function signature, docstring, and any callers are kept in sync to avoid lint/type errors.
55-59: Race condition in lazy initialization. Two concurrent requests could both see `qa_evaluator is None` and instantiate `QAEvaluator()` simultaneously. Given the heavy model loading (BERT from HuggingFace), this wastes resources and could cause memory issues.

♻️ Thread-safe lazy initialization
```diff
+import threading
+
 qa_evaluator = None
+_qa_evaluator_lock = threading.Lock()

 def get_qa_evaluator():
     global qa_evaluator
-    if qa_evaluator is None:
-        qa_evaluator = main.QAEvaluator()
+    if qa_evaluator is None:
+        with _qa_evaluator_lock:
+            if qa_evaluator is None:
+                qa_evaluator = main.QAEvaluator()
     return qa_evaluator
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 55 - 59, The lazy initializer get_qa_evaluator has a race where multiple threads can instantiate main.QAEvaluator simultaneously; fix by introducing a module-level threading.Lock (e.g., qa_evaluator_lock) and use double-checked locking inside get_qa_evaluator: first check qa_evaluator is None, acquire the lock, check again, and only then set qa_evaluator = main.QAEvaluator(); release the lock and return qa_evaluator so only one heavy model load occurs. Ensure the lock is initialized at module import alongside the qa_evaluator variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/server.py`:
- Around line 112-122: The for-loop starting with "for q in questions:" is
mis-indented and causing a Python IndentationError; adjust the indentation so
the "for q in questions:" line aligns with its surrounding block (the enclosing
try/except or function scope) and not indented one extra space, ensuring
subsequent lines that append to question_texts and answer_texts remain
consistently indented under that for loop; locate the block by searching for the
symbols "for q in questions:", "question_texts", and "answer_texts" and fix
indentation to match the surrounding code structure.
- Around line 209-217: boolean_questions is a list of plain strings (from
beam_search_decoding) but score_and_rank_questions expects dicts with keys like
"question" or "question_statement", so convert each string in boolean_questions
into a dict (e.g., {"question": <string>}) before calling
score_and_rank_questions; update the block that calls score_and_rank_questions
to map/transform boolean_questions into the expected dict shape (refer to how
MCQ formatting is done) and then pass the transformed list into
score_and_rank_questions(boolean_questions, input_text) while keeping the final
slicing boolean_questions = boolean_questions[:max_questions].
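A minimal sketch of the string-to-dict normalization this comment asks for, with a pass-through function standing in for the real `score_and_rank_questions`:

```python
def score_and_rank_questions(questions, _context=None):
    # stand-in: the real scorer in backend/server.py ranks with QAEvaluator
    return questions

boolean_questions = ["Is water a liquid?", "Do plants need light?", "Is iron a gas?"]
input_text = "Some source passage..."
max_questions = 2

# beam_search_decoding yields plain strings, so wrap each one as a dict for scoring
payload = [{"question": q} if isinstance(q, str) else q for q in boolean_questions]
ranked = score_and_rank_questions(payload, input_text)
boolean_questions = [item["question"] for item in ranked][:max_questions]
print(boolean_questions)
```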
---
Nitpick comments:
In `@backend/server.py`:
- Around line 88-99: The function score_and_rank_questions currently declares
and documents the parameter context but never uses it; either remove the unused
parameter from the signature and docstring in score_and_rank_questions and
update all call sites that pass input_text, or incorporate context into the
scoring logic (e.g., pass context into QAEvaluator calls inside
score_and_rank_questions and update the docstring to reflect its use); ensure
the function signature, docstring, and any callers are kept in sync to avoid
lint/type errors.
- Around line 55-59: The lazy initializer get_qa_evaluator has a race where
multiple threads can instantiate main.QAEvaluator simultaneously; fix by
introducing a module-level threading.Lock (e.g., qa_evaluator_lock) and use
double-checked locking inside get_qa_evaluator: first check qa_evaluator is
None, acquire the lock, check again, and only then set qa_evaluator =
main.QAEvaluator(); release the lock and return qa_evaluator so only one heavy
model load occurs. Ensure the lock is initialized at module import alongside the
qa_evaluator variable.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/server.py`:
- Around line 218-225: The current code mutates boolean_questions into dicts for
scoring which changes the /get_boolq response shape; instead, create a separate
variable (e.g., boolean_questions_for_scoring) that normalizes strings to
{"question": q} only for passing into
score_and_rank_questions(boolean_questions_for_scoring, input_text), then merge
the returned scores/ranks back onto the original boolean_questions items
(preserving strings as strings and original dict shapes) before returning;
update the return path that currently uses boolean_questions to use the merged
result so the endpoint output schema remains unchanged when scoring is enabled.
- Line 92: The function score_and_rank_questions currently declared as
score_and_rank_questions(questions) has a call-signature mismatch with its
callers (they pass two arguments); fix by updating the function signature to
accept the second parameter (e.g., def score_and_rank_questions(questions,
context=None)) and adjust the implementation to use the new context parameter
(or ignore if None), ensuring all internal calls to the scorer (and any use of
the second arg) match the updated signature; specifically update the function
definition and any internal references inside score_and_rank_questions to accept
and handle the extra argument so the calls from the locations that pass two args
succeed without TypeError.
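One way to satisfy both comments, sketched with placeholder names (`rank_boolq_output` is hypothetical, and the pass-through scorer stands in for the QAEvaluator-based one): normalize a copy for scoring, keep the raw strings for the response, and give `score_and_rank_questions` an optional `context` so existing two-argument calls keep working.

```python
def score_and_rank_questions(questions, context=None):
    # optional context keeps existing two-argument call sites working;
    # placeholder ranking stands in for the QAEvaluator-based scorer
    return questions

def rank_boolq_output(boolean_questions, input_text, max_questions, use_scoring):
    # hypothetical helper: /get_boolq returns plain strings, so only a copy is
    # normalized to dicts for scoring and the result is mapped back to strings
    if use_scoring and len(boolean_questions) >= 3:
        scoring_payload = [{"question": q} for q in boolean_questions]
        ranked = score_and_rank_questions(scoring_payload, input_text)
        boolean_questions = [item["question"] for item in ranked]
    return boolean_questions[:max_questions]

print(rank_boolq_output(["Q1?", "Q2?", "Q3?"], "some context", 2, True))
```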
Needs test!!!!
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/test_scoring_feature.py`:
- Line 65: Tests are calling score_and_rank_questions with three positional args
but the function defined in server.py (score_and_rank_questions(questions,
_input_text=None)) only accepts two; update each call in
backend/test_scoring_feature.py to pass at most the questions and the optional
_input_text (or omit the second arg) and remove the extra third argument, and
also correct any lowercase false to Python's False so calls use the proper
boolean literal.
- Line 93: The test calls score_and_rank_questions with three positional
arguments and uses the undefined lowercase false; update the assertion to call
score_and_rank_questions with only the supported parameters and use Python
booleans if needed (e.g., replace the three-argument call with
score_and_rank_questions([], 0) or, if a boolean flag was intended, pass
True/False using the proper capitalized name). Ensure the change targets the
test assertion in test_scoring_feature.py referencing score_and_rank_questions
so it no longer triggers NameError or TypeError.
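Corrected calls could look like the sketch below, assuming `score_and_rank_questions` is importable from `server.py` and returns an empty list unchanged for empty input (the fallback behavior described in the walkthrough):

```python
from server import score_and_rank_questions  # assumed import path used by the tests

def test_empty_question_list_returns_empty():
    # at most two positional arguments: the questions and an optional context
    assert score_and_rank_questions([], "some context") == []

def test_scoring_flag_is_python_boolean():
    use_scoring = False  # lowercase `false` would raise NameError
    assert use_scoring is False
```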
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 084faca6-6833-4cc6-a24c-02298ff66074
📒 Files selected for processing (3)
- backend/server.py
- backend/test_scoring_feature.py
- backend/test_server.py
🚧 Files skipped from review as they are similar to previous changes (2)
- backend/test_server.py
- backend/server.py
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/test_scoring_feature.py (1)
212-231: Add non-JSON request coverage for safe JSON parsing. The PR objective includes `request.get_json(silent=True) or {}` hardening; current API tests only send JSON payloads.

Suggested test case
```diff
 class TestAPIValidation:
 @@
     def test_invalid_input_text(self, client, payload):
         """Invalid input_text returns 400."""
         response = client.post('/get_mcq', json=payload)
         assert response.status_code == 400
+
+    def test_non_json_body_returns_400(self, client):
+        """Non-JSON body should be handled safely and return 400."""
+        response = client.post('/get_mcq', data="plain text", content_type="text/plain")
+        assert response.status_code == 400
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/test_scoring_feature.py` around lines 212 - 231, Tests only cover JSON requests but the code now uses request.get_json(silent=True) or {} so we must add a non-JSON request test to ensure the endpoint handles empty/invalid JSON safely; add a new test (e.g., test_non_json_body_returns_400) that posts plain text (Content-Type: text/plain or no JSON) to the /get_mcq endpoint and asserts a 400 response, referencing the existing test patterns (test_invalid_input_text and test_invalid_max_questions) to match style and reuse the client fixture; ensure the request content is non-JSON (e.g., data="not json", headers={"Content-Type":"text/plain"}) so the silent JSON parsing branch is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/test_scoring_feature.py`:
- Around line 175-187: Extend the invalid-types tests to assert that booleans
are rejected: update the test function test_invalid_types_rejected (and the
similar assertions around the other block mentioned) to include cases like
parse_max_questions({"max_questions": True}) and
parse_max_questions({"max_questions": False}) wrapped in
pytest.raises(ValueError); this targets the parse_max_questions function which
currently relies on isinstance(value, int) and ensures True/False do not pass as
valid integers.
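A sketch of the requested assertions, assuming `parse_max_questions` is importable from `server.py` and raises ValueError on bad input:

```python
import pytest
from server import parse_max_questions  # assumed import path

def test_booleans_rejected():
    # bool is a subclass of int, so True/False must be rejected explicitly
    for value in (True, False):
        with pytest.raises(ValueError):
            parse_max_questions({"max_questions": value})
```

On the implementation side, an explicit `isinstance(value, bool)` check before the integer check is enough, since `True` and `False` would otherwise pass `isinstance(value, int)`.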
---
Nitpick comments:
In `@backend/test_scoring_feature.py`:
- Around line 212-231: Tests only cover JSON requests but the code now uses
request.get_json(silent=True) or {} so we must add a non-JSON request test to
ensure the endpoint handles empty/invalid JSON safely; add a new test (e.g.,
test_non_json_body_returns_400) that posts plain text (Content-Type: text/plain
or no JSON) to the /get_mcq endpoint and asserts a 400 response, referencing the
existing test patterns (test_invalid_input_text and test_invalid_max_questions)
to match style and reuse the client fixture; ensure the request content is
non-JSON (e.g., data="not json", headers={"Content-Type":"text/plain"}) so the
silent JSON parsing branch is exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a1326a0d-38ea-4c82-bd10-1d0410935064
📒 Files selected for processing (1)
- backend/test_scoring_feature.py
Hi @Aditya30ag I’ve added comprehensive pytest test coverage for the feature as suggested. This includes:
All tests are properly mocked and designed to ensure stability without external dependencies. Please let me know if you'd like any additional test cases or improvements. Happy to iterate further!
It looks like too much AI-generated code!
Addressed Issues:
Fixes #597
Summary
This PR introduces a Question Quality Scoring & Ranking system using the existing QAEvaluator, along with input validation improvements to enhance API robustness and output quality.
Key Changes
1. Question Quality Scoring & Ranking
Generate → Score → Rank → Slice → Return
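As a toy illustration of this flow (the generator and scorer below are stand-ins, not the project's QuestionGenerator or QAEvaluator):

```python
def generate(text, limit):
    # stand-in for the project's question generators
    return [{"question": f"Question {i}?", "answer": f"Answer {i}"} for i in range(limit)]

def score(question):
    # stand-in for QAEvaluator scoring; longer answers score higher here
    return len(question["answer"])

def pipeline(text, max_questions, use_scoring=True):
    limit = max_questions * 2 if use_scoring else max_questions
    questions = generate(text, limit)                            # Generate
    if use_scoring and len(questions) >= 3:
        questions = sorted(questions, key=score, reverse=True)   # Score + Rank
    return questions[:max_questions]                             # Slice -> Return

print(pipeline("Some study passage", max_questions=3))
```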
2. Input Validation & Safety
3. Code Quality Improvements
Benefits
Backward Compatibility
Testing
Additional Notes
This feature leverages the already available QAEvaluator and integrates it into the main generation pipeline without introducing new dependencies or altering existing behaviour.
AI Usage Disclosure
I have used the following AI models and tools: ChatGPT
Checklist
AI Usage Disclosure:
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.
Summary by CodeRabbit
New Features
Bug Fixes
Tests