Multi-signal code similarity analysis platform using FastAPI, CodeBERT, FAISS, SQLite, and a React frontend.
This system evaluates similarity using layered signals instead of plain text matching:
- Semantic similarity from transformer embeddings (CodeBERT)
- Token overlap via Jaccard similarity
- Structural similarity via AST (Python) and heuristic extraction (non-Python)
- Exact normalized corpus matching for known samples
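As a concrete illustration of the token-overlap signal, a minimal Jaccard similarity over token sets; the regex tokenizer here is a simplified assumption, not the production extractor:

```python
import re

def jaccard_similarity(code_a: str, code_b: str) -> float:
    """Jaccard similarity between the token sets of two code snippets."""
    # Crude tokenizer: identifiers, numbers, then any single non-space char.
    tokens_a = set(re.findall(r"[A-Za-z_]\w*|\d+|\S", code_a))
    tokens_b = set(re.findall(r"[A-Za-z_]\w*|\d+|\S", code_b))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty snippets are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```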
The output includes:
- plagiarism_percentage
- ai_probability
- confidence
- explanation payload (signal values, reasoning, highlights, metrics)
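An illustrative response shape built from the fields listed above; the keys inside the explanation payload are guesses for the example, not the exact schema:

```python
# Hypothetical /analyze/ response; nested explanation keys are illustrative.
example_response = {
    "plagiarism_percentage": 82.4,   # overlap-oriented similarity estimate
    "ai_probability": 0.31,          # AI-style pattern estimate
    "confidence": "high",            # low / medium / high band
    "explanation": {
        "signals": {"semantic": 0.91, "token_jaccard": 0.78, "structural": 0.85},
        "reasoning": "Strong agreement across semantic and token signals.",
        "highlights": [],            # matched regions; shape assumed
        "metrics": {},               # auxiliary metrics; shape assumed
    },
}
```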
Core components:
- Frontend: file/code submission, result visualization, export
- Backend API: validation, orchestration, response shaping
- Similarity Engine: normalization + feature extraction + vector search + scoring
- Persistence: SQLite for records, FAISS for runtime nearest-neighbor search
- Runtime: local mode and Docker mode with persistent mounts
Frontend (React + Vite)
|
v
FastAPI API Layer
|
v
Validation + Language Normalization
|
v
AnalysisPipeline
|- Code Normalizer
|- Dataset Matcher (exact normalized match)
|- Embedding Generator (CodeBERT)
|- Token Similarity (Jaccard)
|- AST/Heuristic Structure Features
|- FAISS Semantic Search
`- Score Aggregator
|
+--> SQLite (analysis records + embedding bytes)
`--> FAISS Index (runtime vector retrieval)
Startup Sync:
SQLite embedding count <-> FAISS cache metadata
cache hit => load index
cache miss => rebuild from DB embeddings
ai-code-plagiarism-detector/
|-- src/
| |-- api/
| |-- models/
| |-- pipeline/
| |-- storage/
| `-- utils/
|-- frontend/
| `-- src/
|-- scripts/
|-- configs/
|-- data/
| |-- raw/
| |-- embeddings/
| |-- processed/
| `-- results/
|-- docker/
|-- tests/
|-- docker-compose.yml
`-- requirements.txt
- FastAPI starts and builds shared dependencies.
- Analysis pipeline is initialized.
- FAISS/DB synchronization runs:
- If cached FAISS metadata matches DB embedding count, cache is loaded.
- Otherwise FAISS is rebuilt from DB embeddings and cache is refreshed.
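The cache-hit/cache-miss logic above can be sketched as follows; the metadata filename, the `embedding_count` key, and the two callables are illustrative assumptions rather than the real module API:

```python
import json
from pathlib import Path

def sync_faiss_with_db(db_embedding_count: int,
                       meta_path: Path,
                       load_cached_index,
                       rebuild_index_from_db):
    """Load the cached FAISS index if its metadata matches the DB, else rebuild."""
    if meta_path.exists():
        meta = json.loads(meta_path.read_text())
        if meta.get("embedding_count") == db_embedding_count:
            return load_cached_index()      # cache hit: reuse persisted index
    index = rebuild_index_from_db()         # cache miss: rebuild from SQLite rows
    meta_path.write_text(json.dumps({"embedding_count": db_embedding_count}))
    return index
```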
- Validate request and normalize language.
- Normalize code for canonical comparison.
- Try exact normalized corpus match.
- If no exact match:
- create embedding
- run FAISS semantic retrieval
- compute token and structure similarity against stored records
- Aggregate scores and confidence.
- Return explanation payload.
- Persist result and incrementally update FAISS cache.
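The request flow above, as a pseudocode sketch; every helper name (`normalize_code`, `token_similarity`, `aggregate_scores`, and so on) is a placeholder for the corresponding pipeline component, not the real API:

```
def analyze(code, language, store, embedder, index):
    normalized = normalize_code(code, language)

    # 1. Exact normalized corpus match short-circuits the expensive path.
    exact = store.find_exact(normalized)
    if exact is not None:
        return build_result(plagiarism=100.0, matches=[exact])

    # 2. Semantic retrieval plus per-record token/structure comparison.
    embedding = embedder.embed(normalized)
    neighbors = index.search(embedding, k=10)
    scored = [(rec,
               token_similarity(normalized, rec),
               structure_similarity(code, rec))
              for rec in neighbors]

    # 3. Aggregate signals, persist, and update the index incrementally.
    result = aggregate_scores(scored)
    store.save(code, embedding, result)   # DB first...
    index.add(embedding)                  # ...then FAISS
    return result
```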
- POST /analyze/: analyze a JSON payload containing code and an optional language.
- POST /analyze/file: analyze a single uploaded file.
- POST /analyze/files: analyze a batch of uploaded files.
- GET /health: service health check.
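A minimal stdlib-only client call against the JSON endpoint; the payload field names follow the endpoint description above, and the exact request schema is an assumption:

```python
import json
import urllib.request

# Assumed request body: "code" plus optional "language".
payload = {"code": "def add(a, b):\n    return a + b\n", "language": "python"}
req = urllib.request.Request(
    "http://127.0.0.1:8000/analyze/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Requires a running backend:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```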
Supported extensions:
- .py, .java, .js, .jsx, .ts, .tsx, .cpp, .c, .go, .rs
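Language normalization from file extensions could look like this; the mapping values are assumptions about the internal language labels, not the shipped table:

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from supported extensions to internal language labels.
EXTENSION_LANGUAGES = {
    ".py": "python", ".java": "java",
    ".js": "javascript", ".jsx": "javascript",
    ".ts": "typescript", ".tsx": "typescript",
    ".cpp": "cpp", ".c": "c", ".go": "go", ".rs": "rust",
}

def detect_language(filename: str) -> Optional[str]:
    """Return the normalized language label, or None for unsupported files."""
    return EXTENSION_LANGUAGES.get(Path(filename).suffix.lower())
```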
Final outputs:
- plagiarism_percentage: overlap-oriented similarity estimate
- ai_probability: AI-style pattern estimate
- confidence: low/medium/high confidence band
Interpretation notes:
- Exact corpus matches can drive high plagiarism scores.
- ai_probability is conservative for short/low-agreement code.
- Size penalties and damping are intentionally applied to reduce overconfident false positives.
- ai_probability is normalized against the configured AI signal weight totals so scores occupy a comparable 0-1 range.
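One way to read the normalization and confidence notes above, as a sketch: the raw weighted AI score is divided by the sum of configured signal weights, and inter-signal agreement maps to a band. The weights and thresholds here are illustrative, not the shipped configuration:

```python
def normalized_ai_probability(signal_scores: dict, weights: dict) -> float:
    """Weighted AI score divided by the total configured AI weight."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    raw = sum(weights[name] * signal_scores.get(name, 0.0) for name in weights)
    return raw / total_weight

def confidence_band(signal_agreement: float) -> str:
    """Map inter-signal agreement to a confidence band (thresholds assumed)."""
    if signal_agreement >= 0.75:
        return "high"
    if signal_agreement >= 0.45:
        return "medium"
    return "low"
```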
The system keeps persistent storage and vector index state consistent:
- SQLite stores canonical analysis records and serialized embeddings.
- FAISS stores normalized embedding vectors for fast nearest-neighbor retrieval.
- Startup sync checks DB embedding count against cached FAISS metadata.
- Request-time insert path updates DB first, then FAISS, then persists FAISS cache metadata.
This design ensures runtime speed while preserving reproducibility across restarts.
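The write-path ordering (DB first, then index, then cache metadata) keeps SQLite authoritative: a crash between steps leaves the index stale, and the startup sync repairs that by rebuilding from the DB. A sketch with illustrative stand-ins for the real stores:

```python
def persist_analysis(record, embedding, db, faiss_index, write_meta):
    """Insert path: SQLite is the source of truth, FAISS follows, metadata last."""
    db.insert(record, embedding)        # 1. durable write to SQLite
    faiss_index.add(embedding)          # 2. incremental index update
    write_meta(db.embedding_count())    # 3. refresh FAISS cache metadata
```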
python -m venv venv
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate
pip install -r requirements.txt
python scripts/init_db.py
uvicorn src.api.main:app --host 127.0.0.1 --port 8000 --reload
cd frontend
npm install
Create frontend/.env.local:
VITE_API_BASE_URL=http://127.0.0.1:8000
Run frontend:
npm run dev
Default endpoints:
- API: http://127.0.0.1:8000
- API Docs: http://127.0.0.1:8000/docs
- Frontend: http://127.0.0.1:3000
Build and run:
docker compose build
docker compose up -d
Service endpoints:
- Backend: http://localhost:8000
- Frontend: http://localhost:3000
Persistent runtime mounts:
- runtime/db -> SQLite DB
- runtime/data -> raw corpus and FAISS cache files
- runtime/hf_cache -> Hugging Face cache
python scripts/load_datasets.py --source auto
Variants:
python scripts/load_datasets.py --source filesystem
python scripts/load_datasets.py --source csv --csv-path data/results/evaluation_results.csv
python scripts/load_datasets.py --source filesystem --rebuild-faiss
python scripts/sanity_check.py
python scripts/build_faiss_index.py
python scripts/evaluate_dataset.py
python scripts/plot_results.py
Outputs are generated under data/results and assets.
- Shows score spread and median across evaluated samples.
- Useful for checking score stability after threshold changes.
- Shows how strongly samples trend toward AI-style patterns.
- Useful for validating AI probability calibration behavior.
- Summarizes cross-label preference behavior from evaluation output.
- Useful for checking cluster separation trends.
- Visual sanity check for semantic separation behavior.
- Useful when tuning thresholds or weighting strategy.
- Single and batch analysis
- Result cards and signal metrics panel
- Source-code highlight overlays
- Export support (JSON, CSV, print)
Suitable for:
- Similarity triage
- Pattern exploration
- Reviewer-assist workflows
Not designed as:
- a standalone legal decision engine
- direct disciplinary automation
Remove-Item .\runtime\db\plagiarism.db -Force -ErrorAction SilentlyContinue
Remove-Item .\runtime\data\embeddings\faiss.index -Force -ErrorAction SilentlyContinue
Remove-Item .\runtime\data\embeddings\faiss.meta.json -Force -ErrorAction SilentlyContinue
python scripts/init_db.py
uvicorn src.api.main:app --host 127.0.0.1 --port 8000 --reload
- Backend API, similarity pipeline, persistence, and FAISS sync are integrated.
- Frontend analysis and export flow is integrated.
- Incremental dataset ingestion and evaluation tooling are available.
Last updated: April 1, 2026