Feature: Add semantic duplicate issue detection for GitHub issues #551

Open

vaibhavbaghel wants to merge 4 commits into AOSSIE-Org:main from vaibhavbaghel:backend

Conversation


@vaibhavbaghel vaibhavbaghel commented Mar 10, 2026

Summary

This PR introduces a semantic duplicate issue detection tool for GitHub issues.
The script analyzes newly created issues and compares them against existing issues using sentence embeddings to identify potential duplicates.

Related Issue: #396

Key Features

  • Uses sentence-transformers to generate semantic embeddings
  • Calculates cosine similarity between issues
  • Flags potential duplicates above a configurable similarity threshold
  • Posts a comment suggesting related issues for maintainers to review
  • Does not automatically close issues (maintainer decision required)

Configuration

The behavior can be configured using environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| `SIMILARITY_THRESHOLD` | Similarity score threshold for duplicate detection | `0.75` |
| `LIMIT_RESULTS` | Maximum number of suggested issues | `5` |
| `EXCLUDE_CLOSED` | Whether to exclude closed issues | `false` |
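A sketch of how configuration like this might be read; the environment variable names and defaults come from the table above, but `load_config()` is an illustrative helper, not code from the PR:

```python
import os

# Read the detector's settings from the environment, falling back to the
# documented defaults when a variable is unset.
def load_config():
    return {
        "threshold": float(os.environ.get("SIMILARITY_THRESHOLD", "0.75")),
        "limit": int(os.environ.get("LIMIT_RESULTS", "5")),
        "exclude_closed": os.environ.get("EXCLUDE_CLOSED", "false").lower() == "true",
    }
```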

Workflow

  1. Fetch existing issues from the repository
  2. Generate embeddings for the current issue
  3. Compute semantic similarity with existing issues
  4. Identify potential duplicates
  5. Post a comment suggesting related issues
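Steps 3 and 4 above can be illustrated with a toy cosine-similarity ranking. In the real script the vectors come from sentence-transformers embeddings; here `find_duplicates` is an illustrative helper operating on plain lists of floats, not the PR's actual code:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def find_duplicates(new_vec, existing, threshold=0.75, limit=5):
    """Rank existing issues by similarity to the new one, keep scores >= threshold."""
    scored = [(num, cosine_similarity(new_vec, vec)) for num, vec in existing]
    matches = [(num, s) for num, s in scored if s >= threshold]
    return sorted(matches, key=lambda t: t[1], reverse=True)[:limit]
```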

Benefits

  • Helps reduce duplicate issues
  • Improves maintainers' triage efficiency
  • Maintains a cleaner issue tracker
  • Provides suggestions without automatic enforcement

Future Improvements

  • Embedding caching for faster comparisons
  • Support for issue labels filtering
  • Incremental similarity search

Summary by CodeRabbit

  • New Features

    • Automated duplicate-issue detection: analyzes new/edited issues, ranks similar existing issues by semantic similarity, and posts a formatted comment listing matches with status indicators and links.
  • Chores

    • Added a background workflow to run detection on issue events and a dependency manifest to support the automation.
  • Style

    • Improved visual spacing on quantity adjustment buttons.


coderabbitai Bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 456da1a5-afd6-4cde-b930-91e45bb7c773

📥 Commits

Reviewing files that changed from the base of the PR and between a2704d1 and c24b12a.

📒 Files selected for processing (1)
  • .github/workflows/duplicate-detection.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/duplicate-detection.yml

📝 Walkthrough

Walkthrough

This PR adds an automated duplicate-issue detection pipeline: a new Python checker script, a requirements file, and a GitHub Actions workflow that runs the checker on issue opened/edited events. It also includes a small UI padding tweak in a React component.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Duplicate detection script**<br>`.github/scripts/check-duplicates.py` | New script that fetches up to 500 repo issues (paginated, with optional exclusion of closed issues), prepares each issue's title plus the first 500 characters of its body, computes embeddings with SentenceTransformer, calculates cosine similarities, ranks and filters results above a configurable threshold, formats a markdown comment listing similar issues, and posts it to the current issue via the GitHub API. |
| **Dependencies**<br>`.github/scripts/requirements.txt` | New requirements file listing: sentence-transformers>=2.2.2, scikit-learn>=1.3.0, requests>=2.31.0, numpy>=1.24.0, torch>=2.0.0. |
| **CI workflow**<br>`.github/workflows/duplicate-detection.yml` | New GitHub Actions workflow triggered on issue opened/edited events; checks out the repo, sets up Python 3.11, installs dependencies from the new requirements file, and runs the duplicate-check script with ISSUE_NUMBER, ISSUE_TITLE, ISSUE_BODY, REPO, and GITHUB_TOKEN provided via env. |
| **UI styling**<br>`eduaid_web/src/pages/Text_Input.jsx` | Minor visual change: adds `pb-1` bottom padding to the +/- quantity buttons (styling only). |

Sequence Diagram

sequenceDiagram
    actor GitHub
    participant Workflow as rgba(0,128,255,0.5) GitHub Actions
    participant Script as rgba(0,200,100,0.5) check-duplicates.py
    participant GitHubAPI as rgba(255,165,0,0.5) GitHub API
    participant Model as rgba(128,0,128,0.5) SentenceTransformer

    GitHub->>Workflow: issue opened/edited event
    Workflow->>Workflow: checkout, setup Python, install deps
    Workflow->>Script: run with ISSUE_NUMBER, TITLE, BODY, REPO, GITHUB_TOKEN
    Script->>GitHubAPI: GET /issues (paginated, exclude closed optional, max 500)
    GitHubAPI-->>Script: issues list
    Script->>Model: prepare texts (title + 500 chars), compute embeddings
    Model-->>Script: embeddings
    Script->>Script: compute cosine similarities, filter & sort by threshold
    Script->>Script: format markdown comment
    Script->>GitHubAPI: POST comment to current issue (if matches)
    GitHubAPI-->>Script: comment response
    Script-->>Workflow: exit with status

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Poem

🐰 I hopped through titles, sniffed the thread,
Found echoes where new and old were led,
Embeddings whispered "these are near",
I left a note so folks can steer,
Now issue burrows are tidy and fed.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main feature being added: semantic duplicate issue detection for GitHub issues. It aligns with the primary objective of the PR and is specific enough for teammates to understand the change. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |


vaibhavbaghel changed the title from "Feature: AI-powered duplicate issue detection using sentence embeddings" to "Feature: Add semantic duplicate issue detection for GitHub issues" on Mar 10, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
.github/workflows/duplicate-detection.yml (1)

24-36: Cache the sentence-transformer model between runs.

The workflow caches pip dependencies but not the Hugging Face model that check-duplicates.py loads at line 103 (SentenceTransformer("all-MiniLM-L6-v2")). On cold runner starts, this model will be redownloaded for every issue event, adding unnecessary latency and flakiness to triage. Add a caching step for the Hugging Face model directory using actions/cache and HF_HOME environment variable to persist the model across workflow runs.
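A sketch of the suggested caching step; the path and cache key are illustrative (`~/.cache/huggingface` is the default Hugging Face cache location, which `actions/cache` supports directly):

```yaml
- name: Cache Hugging Face model
  uses: actions/cache@v4
  with:
    path: ~/.cache/huggingface   # default HF cache dir; alternatively set HF_HOME and cache that path
    key: hf-model-all-MiniLM-L6-v2
```

Placed before the step that runs the script, this restores the all-MiniLM-L6-v2 model on warm runs instead of redownloading it for every issue event.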

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/duplicate-detection.yml around lines 24 - 36, The workflow
currently installs pip deps but doesn't persist the Hugging Face model used in
check-duplicates.py (SentenceTransformer("all-MiniLM-L6-v2")), so add an
actions/cache step that caches the model directory and set HF_HOME to that cache
path in the job environment; specifically, define HF_HOME (e.g., an env like
HF_HOME: ${{ runner.temp }}/hf or similar), add an actions/cache entry that keys
and restores the HF_HOME directory (or the Hugging Face cache path) before
running pip and the python script, and ensure check-duplicates.py continues to
load the model from SentenceTransformer using the HF_HOME-backed cache.
.github/scripts/requirements.txt (1)

1-5: Use pinned dependency versions for the duplicate-detection workflow.

The workflow installs .github/scripts/requirements.txt with floating >= specs on every issue event, making duplicate detection behavior non-reproducible across runs and over time. The root requirements.txt already demonstrates pinned versions are maintained for this project (e.g., torch==2.5.1, scikit-learn==1.5.2). Update .github/scripts/requirements.txt to use pinned versions (==) from the root file or generate a proper lock file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/scripts/requirements.txt around lines 1 - 5, Update the
.github/scripts/requirements.txt to use pinned versions instead of floating >=
specs so duplicate-detection runs are reproducible: replace each dependency
(sentence-transformers, scikit-learn, requests, numpy, torch) with exact
versions (==) matching the project's root requirements.txt or freeze them into a
lock file and reference that; ensure the file uses the same pinned versions as
root (e.g., the root's torch and scikit-learn pins) or generate a pinned
requirements.lock and update the workflow to install that lock file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/scripts/check-duplicates.py:
- Around line 141-145: The try/except around calculate_similarity is swallowing
errors and converting failures into an empty duplicate list; instead propagate
the failure so the job fails or returns an explicit error state—remove the
blanket except or re-raise the caught exception (or return a distinct error
value) from where calculate_similarity is called, and update main() to treat
that propagated exception/value as a fatal error (exit non‑zero or surface an
error message) rather than treating it as “no duplicates found.” Ensure you
reference calculate_similarity and main() when making the change so callers
handle real failures instead of returning [].
- Around line 189-205: The post_comment function currently always posts a new
comment; change it to upsert a single detector comment by first listing existing
comments for the issue (GET to
f"{GITHUB_API_URL}/repos/{REPO}/issues/{ISSUE_NUMBER}/comments" with HEADERS),
search for a prior bot comment using a unique marker string you add to the
comment body (e.g., "<!-- duplicate-detector -->") or by matching the bot's
user, and if found use PATCH to update that comment (PATCH to
f"{GITHUB_API_URL}/repos/{REPO}/issues/comments/{comment_id}") otherwise POST as
before; keep using HEADERS, timeout and response.raise_for_status() and return
True/False as currently implemented in post_comment to preserve behavior.
- Around line 21-27: The dynamic runtime pip-install fallback around the
SentenceTransformer import must be removed so missing dependencies fail fast;
delete the try/except block that catches ImportError and runs
subprocess.check_call, and replace it with a single top-level import from
sentence_transformers import SentenceTransformer so an ImportError surfaces
immediately (this touches the import of SentenceTransformer in the script).
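The upsert suggestion above boils down to marking the bot's comment and searching for the marker before posting. A minimal sketch of the marker search, with the HTTP PATCH/POST calls omitted (`find_marked_comment` is an illustrative helper, not code from the PR):

```python
# Unique marker embedded in the detector's comment body, as the review suggests.
MARKER = "<!-- duplicate-detector -->"

def find_marked_comment(comments):
    """Return the id of an earlier detector comment, or None on the first run."""
    for comment in comments:
        if MARKER in comment.get("body", ""):
            return comment["id"]
    return None
```

When an id is found, the script would PATCH that existing comment; otherwise it would POST a new one as before.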

In @.github/workflows/duplicate-detection.yml:
- Around line 18-22: Update the GitHub Actions step that currently uses
actions/setup-python@v4 to a supported major version (actions/setup-python@v5 or
`@v6`) to avoid Node16 runner removal; specifically modify the step referencing
"actions/setup-python@v4" to use at least "actions/setup-python@v5" (or "@v6"),
keep the existing inputs (python-version: '3.11' and cache: 'pip') unchanged,
then run the workflow to verify the runner no longer errors due to the
deprecated Node runtime.
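The version bump described above amounts to a one-line change:

```yaml
- uses: actions/setup-python@v5   # was @v4; avoids the deprecated Node16 runtime
  with:
    python-version: '3.11'
    cache: 'pip'
```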

---

Nitpick comments:
In @.github/scripts/requirements.txt:
- Around line 1-5: Update the .github/scripts/requirements.txt to use pinned
versions instead of floating >= specs so duplicate-detection runs are
reproducible: replace each dependency (sentence-transformers, scikit-learn,
requests, numpy, torch) with exact versions (==) matching the project's root
requirements.txt or freeze them into a lock file and reference that; ensure the
file uses the same pinned versions as root (e.g., the root's torch and
scikit-learn pins) or generate a pinned requirements.lock and update the
workflow to install that lock file.

In @.github/workflows/duplicate-detection.yml:
- Around line 24-36: The workflow currently installs pip deps but doesn't
persist the Hugging Face model used in check-duplicates.py
(SentenceTransformer("all-MiniLM-L6-v2")), so add an actions/cache step that
caches the model directory and set HF_HOME to that cache path in the job
environment; specifically, define HF_HOME (e.g., an env like HF_HOME: ${{
runner.temp }}/hf or similar), add an actions/cache entry that keys and restores
the HF_HOME directory (or the Hugging Face cache path) before running pip and
the python script, and ensure check-duplicates.py continues to load the model
from SentenceTransformer using the HF_HOME-backed cache.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 376fea32-ba12-447c-bb31-636e7976cfd8

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and ad77624.

📒 Files selected for processing (4)
  • .github/scripts/check-duplicates.py
  • .github/scripts/requirements.txt
  • .github/workflows/duplicate-detection.yml
  • eduaid_web/src/pages/Text_Input.jsx

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/scripts/check-duplicates.py:
- Around line 62-80: The current pagination loop in check-duplicates.py swallows
RequestException and breaks, returning a partial or empty issues list; change
the except requests.exceptions.RequestException block to propagate the failure
instead of breaking (e.g., re-raise the caught exception or call sys.exit(1)) so
that callers of the function (the workflow) fail fast on API/auth/network
errors; update the handler around the requests.get/response.raise_for_status
section (the try/except surrounding page_issues, issues.extend and pagination
logic) to surface the original exception and include the error context in the
raised error/logging.
- Around line 240-242: The current main() ignores post_comment()'s return value
so comment-post failures don't fail the job; update the block that calls
post_comment(comment) to check its boolean result and exit non‑zero on failure
(e.g., if not post_comment(comment): log an error and sys.exit(1) or raise
SystemExit) so permission/API failures cause the script to fail; reference the
post_comment function and the main() flow where the comment is printed and
posted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3eb61906-b169-4909-b104-8d3abe5eb2c7

📥 Commits

Reviewing files that changed from the base of the PR and between ad77624 and a2704d1.

📒 Files selected for processing (1)
  • .github/scripts/check-duplicates.py

